OpenAI just dropped a surprisingly practical tool: a 1.5 billion parameter model that does one thing well. Privacy Filter detects and masks personally identifiable information in text, and it runs locally on your machine. No API calls and no data leaving your infrastructure, which puts it closer in spirit to the fully on-device approach seen in Ghost Pepper than to Atlassian's recent shift toward training on customer data. The model handles up to 128,000 tokens of context in a single pass and hits 96-97% F1 on standard benchmarks. It's open-weight under Apache 2.0, so you can fine-tune it for your own use cases.
Under the hood, it's a pretrained language model with the generation head swapped for a token-classification head. One forward pass labels everything. It catches eight types of sensitive data, from names and addresses to API keys, using BIOES tagging to keep boundaries clean.
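To make the tagging scheme concrete, here is a minimal sketch of how BIOES output decodes into clean entity spans. The tag names and entity labels (`NAME`, `EMAIL`) are illustrative assumptions; Privacy Filter's actual label set and tokenizer aren't shown here.

```python
def decode_bioes(tokens, tags):
    """Turn parallel token/tag lists into (entity_type, start, end) spans.

    BIOES: B=begin, I=inside, O=outside, E=end, S=single-token entity.
    `end` is exclusive, matching Python slice conventions.
    """
    spans = []
    start = None
    current = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "S":                        # one-token entity
            spans.append((label, i, i + 1))
            start, current = None, None
        elif prefix == "B":                      # entity opens here
            start, current = i, label
        elif prefix == "E" and current == label and start is not None:
            spans.append((label, start, i + 1))  # entity closes here
            start, current = None, None
        # I- tags simply continue an open span; malformed sequences are dropped
    return spans


tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
tags = ["O", "B-NAME", "E-NAME", "O", "S-EMAIL"]
print(decode_bioes(tokens, tags))
# → [('NAME', 1, 3), ('EMAIL', 4, 5)]
```

The explicit E and S tags are what keep boundaries clean: a span only counts once it is properly closed, so a stray I- tag can't silently bleed one entity into the next.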
If you've used Microsoft Presidio, you know it takes a different approach: regex, heuristics, and pluggable NER models combined into custom pipelines. Privacy Filter is a single, opinionated model optimized for span detection. Presidio gives you flexibility; Privacy Filter gives you speed and simplicity. The trade-off is real. Presidio pipelines typically cap out at around 512 tokens per pass with standard BERT-based models. Privacy Filter handles 128K tokens with no extra components.
OpenAI says they use a fine-tuned version internally for their own privacy workflows. That's a meaningful signal. When a company eats its own cooking, the tool usually holds up better in production. For developers building training pipelines, logging systems, or document indexing workflows where PII scrubbing matters, this is worth a close look.