OpenMed Privacy Filter (Nemotron) — MLX BF16
A native MLX port of OpenMed/privacy-filter-nemotron for fast, on-device PII
detection on Apple Silicon. This BF16 artifact preserves the full source
precision; for a smaller, faster sibling, see
OpenMed/privacy-filter-nemotron-mlx-8bit.
Family at a glance. Same architecture and training data, three runtimes:
- PyTorch: OpenMed/privacy-filter-nemotron, CPU + CUDA.
- MLX BF16 (this repo): Apple Silicon, full precision (~2.6 GB).
- MLX 8-bit: OpenMed/privacy-filter-nemotron-mlx-8bit, Apple Silicon, ~1.4 GB, ~1.7× faster.
What it does
The model is a token classifier built on OpenAI's open Privacy Filter
architecture (the same openai_privacy_filter model type used by
openai/privacy-filter). It tags each token with a BIOES label (Begin, Inside,
Outside, End, Single) across 55 PII span classes; a Viterbi pass over the
BIOES grammar then yields clean entity spans. Detected categories include:
- Personal identifiers: first_name, last_name, user_name, gender, age, date_of_birth
- Contact: email, phone_number, fax_number, street_address, city, state, country, county, postcode, coordinate
- Government / legal IDs: ssn, national_id, tax_id, certificate_license_number
- Financial: account_number, bank_routing_number, credit_debit_card, cvv, pin, swift_bic
- Medical: medical_record_number, health_plan_beneficiary_number, blood_type
- Workplace: company_name, occupation, employee_id, customer_id, employment_status, education_level
- Online: url, ipv4, ipv6, mac_address, http_cookie, api_key, password, device_identifier
- Demographic: race_ethnicity, religious_belief, political_view, sexuality, language
- Vehicles: license_plate, vehicle_identifier
- Time: date, date_time, time
- Misc: biometric_identifier, unique_id
For per-label accuracy, training recipe, and dataset details, see the base PyTorch checkpoint.
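To make the decoding step concrete, here is a minimal pure-Python sketch of constrained Viterbi decoding over a BIOES grammar. It illustrates the general technique and is not the OpenMed implementation: the label string format (`B-email`, `O`, and so on), the helper names `allowed` and `viterbi_bioes`, and the use of raw per-token scores with hard transition constraints are all assumptions.

```python
import math

def allowed(prev, cur):
    """Return True if BIOES label `cur` may directly follow `prev`."""
    p_tag, _, p_cls = prev.partition("-")
    c_tag, _, c_cls = cur.partition("-")
    if p_tag in ("B", "I"):
        # An open span must continue or close with the same class.
        return c_tag in ("I", "E") and c_cls == p_cls
    # After O, E, or S no span is open, so only O, B, or S may follow.
    return c_tag in ("O", "B", "S")

def viterbi_bioes(scores, labels):
    """Find the best-scoring label path that obeys the BIOES grammar.

    scores: list of per-token score lists (one score per label).
    labels: label strings, index-aligned with each score list.
    """
    n = len(labels)
    start_ok = [lab[0] in "OBS" for lab in labels]  # cannot start inside a span
    end_ok = [lab[0] in "OES" for lab in labels]    # cannot end with an open span
    best = [s if ok else -math.inf for s, ok in zip(scores[0], start_ok)]
    back = []
    for row in scores[1:]:
        new_best, ptr = [], []
        for j in range(n):
            # Best legal predecessor for label j.
            score, i = max((best[i], i) for i in range(n)
                           if allowed(labels[i], labels[j]))
            new_best.append(score + row[j])
            ptr.append(i)
        best = new_best
        back.append(ptr)
    # Choose the best final label that leaves no span open, then backtrack.
    last = max((b, j) for j, b in enumerate(best) if end_ok[j])[1]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [labels[j] for j in reversed(path)]

labels = ["O", "B-email", "I-email", "E-email", "S-email"]
scores = [
    [0.1, 2.0, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.9, 0.2, 0.0],  # greedy argmax would pick O here, which is illegal after B
    [0.3, 0.0, 0.0, 2.0, 0.0],
]
print(viterbi_bioes(scores, labels))
# ['B-email', 'I-email', 'E-email']
```

Per-token argmax on these scores would give the illegal sequence B, O, E; the grammar constraint is what recovers a single well-formed span.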
Architecture
| Field | Value |
|---|---|
| Source model type | openai_privacy_filter |
| Source architecture | OpenAIPrivacyFilterForTokenClassification |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts — 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (rope_theta=150_000, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | o200k_base (tiktoken) — vocab 200,064 |
| Output head | Linear(640 → 221) with bias |
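Two figures in the table can be cross-checked with simple arithmetic. The sketch below assumes the standard BIOES layout (four positional tags per span class plus one shared O label) and the usual grouped-query convention that query heads are split evenly across KV heads; the helper names are ours.

```python
def bioes_label_count(num_span_classes: int) -> int:
    """Number of BIOES labels: B/I/E/S per class plus one shared O."""
    return 4 * num_span_classes + 1

def query_heads_per_kv_head(num_query_heads: int, num_kv_heads: int) -> int:
    """Group size in grouped-query attention (must divide evenly)."""
    if num_query_heads % num_kv_heads:
        raise ValueError("query heads must be a multiple of KV heads")
    return num_query_heads // num_kv_heads

print(bioes_label_count(55))           # 221, the output head width
print(query_heads_per_kv_head(14, 2))  # 7 query heads share each KV head
```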
File set
| File | Size | Purpose |
|---|---|---|
| weights.safetensors | 2.6 GB | BF16 model weights in OpenMed-MLX layout |
| config.json | 19 KB | Model + MLX runtime config |
| id2label.json | 5.4 KB | Numeric ID → BIOES label string |
| openmed-mlx.json | 0.7 KB | OpenMed MLX manifest (task, family, runtime hints) |
| tokenizer.json, tokenizer_config.json | 27 MB | Source tokenizer files (kept for reference) |
The MLX runtime uses tiktoken o200k_base directly for tokenization;
the tokenizer.json is kept so consumers can inspect or re-tokenize via
transformers if desired.
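As a small illustration of how id2label.json can be consumed, the sketch below loads such a mapping and splits each label into its BIOES tag and span class. The label string format (assumed here to be `B-email`, `O`, and so on) and the helper name `load_label_map` are assumptions; inspect the file in the repo to confirm the actual format.

```python
import json
from pathlib import Path

def load_label_map(path):
    """Load an {id: label} JSON file and return {int_id: (tag, span_class)}."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    parsed = {}
    for key, label in raw.items():
        # Split on the first hyphen only, so classes like date_of_birth stay intact.
        tag, _, span_class = label.partition("-")
        parsed[int(key)] = (tag, span_class or None)  # "O" has no span class
    return parsed
```

A call such as `load_label_map("/path/to/privacy-filter-nemotron-mlx/id2label.json")` would then give integer keys that line up with the last axis of the model's logits.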
Quick start
With OpenMed — recommended
OpenMed gives you a single extract_pii() / deidentify() API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere — same code on
every host.
pip install -U "openmed[mlx]"
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-nemotron-mlx")
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron-mlx",
    consistent=True,
    seed=42,  # deterministic locale-aware Faker surrogates
)
When MLX isn't available (Linux, Windows, Intel Mac, or a missing mlx package),
the same call automatically falls back to the PyTorch checkpoint
OpenMed/privacy-filter-nemotron with a one-time warning. The fallback is
family-aware: a Nemotron MLX request never substitutes the unrelated
openai/privacy-filter baseline.
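The fallback decision can be approximated with the standard library alone. This is a hypothetical sketch of the idea, not OpenMed's actual selection logic: `pick_backend` and its return strings are illustrative names, and the check covers only the conditions listed above.

```python
import importlib.util
import platform

def pick_backend(system=None, machine=None, has_mlx=None):
    """Return "mlx" on Apple Silicon with mlx installed, else "pytorch".

    Arguments default to values detected from the current host; passing
    them explicitly makes the decision easy to test.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if has_mlx is None:
        has_mlx = importlib.util.find_spec("mlx") is not None
    apple_silicon = system == "Darwin" and machine == "arm64"
    return "mlx" if apple_silicon and has_mlx else "pytorch"

print(pick_backend())
```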
Direct MLX usage (lower-level)
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline
model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx")
pipe = PrivacyFilterMLXPipeline(model_path)
print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'email',
# 'score': 0.92,
# 'word': 'alice.smith@example.com',
# 'start': 12,
# 'end': 35}]
The pipeline returns a list of dicts with entity_group, score, word,
start, and end (character offsets into the input string).
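Because start and end are character offsets, the pipeline output can drive a simple redaction pass. A minimal sketch, assuming non-overlapping spans in the dict format shown above; `mask_spans` is a hypothetical helper, not part of openmed.

```python
def mask_spans(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text

entities = [{"entity_group": "email", "score": 0.92,
             "word": "alice.smith@example.com", "start": 12, "end": 35}]
print(mask_spans("Email me at alice.smith@example.com after 5pm.", entities))
# Email me at [EMAIL] after 5pm.
```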
Loading from a local snapshot
from openmed.mlx.models import load_model
import mlx.core as mx
model = load_model("/path/to/privacy-filter-nemotron-mlx")
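# Placeholder token IDs for illustration, shape (batch=1, seq_len=4)
ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)  # attend to every position
logits = model(ids, attention_mask=mask)  # shape (1, 4, 221): one score per BIOES label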
Hardware notes
- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with mlx>=0.18. The MLX runtime in this repo is independent of mlx_lm (token classification, not causal LM).
- Forward pass on a typical PII sentence (~10 tokens) takes ~14 ms on an M-series GPU after warmup. For lower latency or a smaller memory footprint, use the -mlx-8bit sibling instead.
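Latency figures like the one above depend on measuring after warmup. A generic timing sketch, assuming `fn` is any zero-argument callable that runs one forward pass to completion (with MLX's lazy evaluation this means forcing the result, for example via `mx.eval`); `time_call` is a hypothetical helper.

```python
import statistics
import time

def time_call(fn, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # discard warmup runs (compilation, cache fills)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)
```

With the objects from the local-snapshot example, a call might look like `time_call(lambda: mx.eval(model(ids, attention_mask=mask)))`. Results will vary with chip, sequence length, and system load.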
Credits & Acknowledgements
This model wouldn't exist without two open-source releases — sincere thanks to both teams:
- OpenAI for open-sourcing the Privacy Filter (architecture, modeling code, and opf training/eval CLI). The MLX port in this repo runs that same architecture under Apple's MLX framework.
- NVIDIA for releasing the Nemotron-PII dataset used to fine-tune the source PyTorch checkpoint.
Additional thanks to Apple for MLX and the Hugging Face team for the model-distribution ecosystem.
License
Apache 2.0 (matches the source checkpoint).
