URL: https://huggingface.co/OpenMed/Ministral-3B-PII-Preview


Ministral-3B-PII-Preview

Ministral-3B-PII-Preview is a 3.3B-parameter language model that detects personally identifiable information (PII) in unstructured text and returns it as a structured JSON array of typed entities. Give it any text and it emits a list of {"text": ..., "label": ...} objects spanning 69 PII entity types across the healthcare, financial, identity, and digital domains.

The model is an experimental, reinforcement-learning–trained variant of a Ministral-3B base. It was optimized with GRPO (Group Relative Policy Optimization) specifically to produce valid, schema-consistent JSON and to detect PII with high precision — making it suited to redaction, de-identification, and compliance workflows (HIPAA, GDPR, PCI-DSS).

Research preview. This is an experimental model intended for evaluation and pipeline integration. Use it as one layer in a broader privacy/compliance system, not as a sole compliance control.

⚠️ Text input only. This release is a text-to-text model: it reads text and returns JSON. The underlying architecture also contains a vision encoder, but image-to-text PII extraction is not supported in this version — passing images is not a validated path. Multimodal (image → PII) support is planned for a future release.

Key Results

Evaluated on a 1,000-sample held-out PII benchmark with greedy decoding, a 2,048-token prompt budget, and no assistant-side JSON-fence prefill.

| Metric | Score |
|---|---|
| Valid JSON rate | 1.000 |
| Valid label rate | 0.975 |
| Micro precision | 0.914 |
| Micro recall | 0.859 |
| Micro F1 | 0.886 |
| Format consistency | 100% |
| Empty-output consistency | 100% |

Every generation parsed as valid JSON, and the model reliably returns [] for text containing no PII.
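For readers who want to reproduce this kind of score on their own data, here is a minimal sketch of micro-averaged precision, recall, and F1. It assumes an entity counts as correct only when both its text and its label match a gold entity exactly; the benchmark's actual matching rule is not stated here, and `micro_prf` is a hypothetical helper name.

```python
from collections import Counter


def micro_prf(gold_docs, pred_docs):
    """Micro-averaged precision/recall/F1 over exact (text, label) matches.

    Assumption: exact string match on both fields; counts are pooled across
    all documents before the ratios are computed (micro-averaging).
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_counts = Counter((e["text"], e["label"]) for e in gold)
        pred_counts = Counter((e["text"], e["label"]) for e in pred)
        # Multiset intersection: each gold entity can be matched at most once.
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data: one hit, one missed email, one spurious phone number, one empty doc.
gold = [
    [{"text": "Sarah", "label": "first_name"},
     {"text": "sarah.j@gmail.com", "label": "email"}],
    [],
]
pred = [
    [{"text": "Sarah", "label": "first_name"},
     {"text": "415-555-0198", "label": "phone_number"}],
    [],
]
scores = micro_prf(gold, pred)
print(scores)
```

Pooling counts before dividing means documents with many entities weigh more than documents with few, which is what distinguishes micro from macro averaging.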

Supported PII Labels

The model recognizes 69 PII entity types. Each detected span is returned as {"text": "...", "label": "..."} using the label names below.

Quickstart

The PII extraction system prompt (with few-shot examples) is baked into the chat template, so no system message is required — just send the text. The template does not prefill a markdown json fence; the model emits the JSON array itself.

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "OpenMed/Ministral-3B-PII-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The checkpoint uses a multimodal architecture, but this release is validated
# for TEXT input only. Load it with the image-text-to-text auto class and pass
# text — do not pass images.
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Contact Sarah at sarah.j@gmail.com or 415-555-0198."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# [{"text": "Sarah", "label": "first_name"}, {"text": "sarah.j@gmail.com", "label": "email"}, {"text": "415-555-0198", "label": "phone_number"}]

You may pass a custom system message to override the default behavior if needed. Keep the system-prompt pattern, and do not manually prefill ```json.
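Because the output comes from a generative model, downstream code may want to parse it defensively instead of calling `json.loads` directly. The sketch below is one possible approach, not part of this repo: `parse_entities` and `ALLOWED_LABELS` are hypothetical names, and the label set shown is a small illustrative subset of the 69 documented labels.

```python
import json

# Hypothetical subset for illustration; substitute the full documented label set.
ALLOWED_LABELS = {"first_name", "last_name", "email", "phone_number"}


def parse_entities(response, allowed_labels=ALLOWED_LABELS):
    """Split a raw model response into (valid, rejected) entity lists."""
    try:
        data = json.loads(response.strip())
    except json.JSONDecodeError:
        return [], [{"error": "invalid_json", "raw": response}]
    if not isinstance(data, list):
        return [], [{"error": "not_a_list", "raw": response}]

    valid, rejected = [], []
    for item in data:
        ok = (
            isinstance(item, dict)
            and isinstance(item.get("text"), str)
            and isinstance(item.get("label"), str)
            and item["label"] in allowed_labels
        )
        # Keep rejected items so they can be logged or reviewed, not silently lost.
        (valid if ok else rejected).append(item)
    return valid, rejected


response = (
    '[{"text": "Sarah", "label": "first_name"}, '
    '{"text": "x", "label": "made_up_label"}]'
)
valid, rejected = parse_entities(response)
print(valid)
print(rejected)
```

Returning the rejected items separately keeps out-of-vocabulary labels visible, which matters when the model is only one layer of a compliance pipeline.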

Optional: production post-processing

For non-English text especially, a small deterministic post-processing pass cleans up the raw output (Unicode normalization, span deduplication, CJK name splitting, Vietnamese name-order swap, language stopword filtering). The implementation ships with this repo in postprocess.py:

import json
from postprocess import postprocess_entities

entities = json.loads(response)
clean = postprocess_entities(entities, language="vi") # pass the source language code

Examples by Compliance Domain

HIPAA — Medical Records

Input:

Patient Maria Garcia, DOB 03/15/1985, MRN 4872910, was admitted on 2024-01-20 for a routine blood panel. Her blood type is O-negative. Insurance ID: BCBS-7742185. Contact her at maria.garcia@protonmail.com or (312) 555-0147.

Output:

[
 {"text": "Maria", "label": "first_name"},
 {"text": "Garcia", "label": "last_name"},
 {"text": "03/15/1985", "label": "date_of_birth"},
 {"text": "4872910", "label": "medical_record_number"},
 {"text": "2024-01-20", "label": "date"},
 {"text": "O-negative", "label": "blood_type"},
 {"text": "BCBS-7742185", "label": "insurance_id"},
 {"text": "maria.garcia@protonmail.com", "label": "email"},
 {"text": "(312) 555-0147", "label": "phone_number"}
]

GDPR — EU Customer Data

Input:

Dear Mr. Lukas Weber, your account (CUST-DE-88412) has been updated. We have your address as Friedrichstrasse 42, 10117 Berlin, Germany. Your IBAN DE89370400440532013000 is on file. For verification, your national ID is T220001293. Please confirm via lukas.weber@deutschland.de.

Output:

[
 {"text": "Mr.", "label": "title"},
 {"text": "Lukas", "label": "first_name"},
 {"text": "Weber", "label": "last_name"},
 {"text": "Friedrichstrasse 42", "label": "street_address"},
 {"text": "10117", "label": "zip_code"},
 {"text": "Berlin", "label": "city"},
 {"text": "Germany", "label": "country"},
 {"text": "CUST-DE-88412", "label": "account_number"},
 {"text": "DE89370400440532013000", "label": "iban"},
 {"text": "T220001293", "label": "national_id"},
 {"text": "lukas.weber@deutschland.de", "label": "email"}
]

PCI-DSS — Financial Data

Input:

Wire transfer requested by account holder James Liu, account #7781920034, routing 021000021. Credit card ending 4532-XXXX-XXXX-8901 was flagged. SSN on file: 123-45-6789. Tax ID: 92-1234567. Contact: j.liu@fidelity-example.com, IP logged: 192.168.1.42.

Output:

[
 {"text": "James", "label": "first_name"},
 {"text": "Liu", "label": "last_name"},
 {"text": "7781920034", "label": "account_number"},
 {"text": "021000021", "label": "routing_number"},
 {"text": "4532-XXXX-XXXX-8901", "label": "credit_card"},
 {"text": "123-45-6789", "label": "ssn"},
 {"text": "92-1234567", "label": "tax_id"},
 {"text": "j.liu@fidelity-example.com", "label": "email"},
 {"text": "192.168.1.42", "label": "ip_address"}
]

No PII — Clean Text

Input:

The quarterly earnings report shows a 12% increase in revenue compared to last year. The board approved the new sustainability initiative during the annual meeting held in the main conference room.

Output:

[]
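Entity lists like the ones above can drive a simple redaction step. The following is a minimal sketch under stated assumptions: the model returns strings without character offsets, so the hypothetical `redact` helper replaces every occurrence of each detected string, and it resolves overlapping matches by keeping the earliest and longest span.

```python
import re


def redact(text, entities):
    """Replace every occurrence of each detected span with a [LABEL] placeholder."""
    spans = []
    for ent in entities:
        if not ent["text"]:
            continue
        # The model output carries no offsets, so locate spans in the source text.
        for match in re.finditer(re.escape(ent["text"]), text):
            spans.append((match.start(), match.end(), ent["label"]))

    # Earliest span first; on ties, the longer span wins.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))

    pieces, cursor = [], 0
    for start, end, label in spans:
        if start < cursor:  # overlaps a span that was already redacted
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label.upper()}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Contact Sarah at sarah.j@gmail.com or 415-555-0198."
entities = [
    {"text": "Sarah", "label": "first_name"},
    {"text": "sarah.j@gmail.com", "label": "email"},
    {"text": "415-555-0198", "label": "phone_number"},
]
redacted = redact(text, entities)
print(redacted)
# Contact [FIRST_NAME] at [EMAIL] or [PHONE_NUMBER].
```

Matching is case-sensitive here, so a detected "Sarah" does not touch the lowercase "sarah" inside the email address; whether that is the right policy depends on the deployment.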

Multilingual Support (20 languages, zero-shot)

The model was trained only on English PII data but generalizes to other languages out of the box. We ran one realistic example per language across the top 20 world languages and scored the model under two conditions:

  • Strict: exact-match scoring on raw model output.
  • Production: raw output → a small deterministic post-processing pipeline (Unicode normalization, span deduplication, CJK name splitting, Vietnamese name-order swap, language stopword filter, Slavic case-tolerance at match time). This is the same pattern any real clinical PII system would run downstream of a model.

| Mode | Perfect examples | Micro-P | Micro-R | Micro-F1 | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| Raw model output | 13/20 | 0.902 | 0.902 | 0.902 | 92 | 10 | 10 |
| + Production pipeline | 20/20 | 1.000 | 1.000 | 1.000 | 102 | 0 | 0 |

Scored on 102 entities hand-annotated across all 20 languages.

The post-processing pipeline

Six deterministic steps. No heavy NLP dependencies — all regex, string ops, and small gazetteers. The full implementation lives in postprocess.py.

  1. Unicode NFC + whitespace strip on every text field. Also applied to the input before inference.
  2. Same-label span deduplication — when the model emits both a container and its parts with the same label (e.g. first_name=Nguyễn Văn An AND first_name=Nguyễn), keep the most specific.
  3. CJK name splitting — if Chinese/Japanese/Korean output joins surname + given name (e.g. 田中太郎 as a single first_name), split it using a small surname gazetteer.
  4. Vietnamese name-order swap — Vietnamese writes family-name-first. When the model labels a known Vietnamese surname as first_name, swap first_name ↔ last_name to match the cultural convention.
  5. Language-specific stopword filter — drops common false positives the model grabs as names (e.g. Swahili Jina = "name", Vietnamese Tôi = "I").
  6. Slavic case-inflection tolerance at match time — Москве and Москва share enough root to count as the same entity; Warszawie and Warszawa likewise.
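To make the flavor of these steps concrete, here is a rough sketch of steps 1 and 5 only. It is not the code in postprocess.py: `clean_entities`, `STOPWORDS`, and `NAME_LABELS` are hypothetical names, the gazetteer holds just the two examples mentioned above, and applying the filter only to name labels is an assumption.

```python
import unicodedata

# Hypothetical mini-gazetteer seeded with the two examples from the list above.
STOPWORDS = {"sw": {"jina"}, "vi": {"tôi"}}
# Assumption: the stopword filter only targets name-type labels.
NAME_LABELS = {"first_name", "last_name"}


def normalize(s):
    """Step 1: Unicode NFC normalization plus whitespace strip."""
    return unicodedata.normalize("NFC", s).strip()


def clean_entities(entities, language):
    """Apply step 1 to every entity, then step 5 for the given language code."""
    stop = {normalize(w) for w in STOPWORDS.get(language, set())}
    cleaned = []
    for ent in entities:
        text = normalize(ent["text"])
        # Step 5: drop dictionary words the model mistook for names.
        if ent["label"] in NAME_LABELS and text.casefold() in stop:
            continue
        cleaned.append({"text": text, "label": ent["label"]})
    return cleaned


raw = [
    # "Tôi" typed with a combining circumflex and stray whitespace.
    {"text": " To\u0302i ", "label": "first_name"},
    # "Nguyễn" typed with two combining marks on the "e".
    {"text": "Nguye\u0302\u0303n", "label": "last_name"},
]
cleaned = clean_entities(raw, language="vi")
print(cleaned)
```

Normalizing before the stopword lookup matters: without NFC, a decomposed "Tôi" would not compare equal to the precomposed gazetteer entry.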

The raw model already extracts 92/102 entities correctly. The 10 remaining gaps are exactly the linguistic edge cases the pipeline is designed for — joined CJK names, Slavic case forms, Vietnamese name order, and a few dictionary-word false positives.
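The Slavic case-tolerance step is described only as two forms that "share enough root". One way to approximate that idea is a shared-prefix check, sketched below. The function name `same_root` and both thresholds are assumptions chosen for illustration, not values taken from postprocess.py.

```python
def common_prefix_len(a, b):
    """Number of leading characters the two strings share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def same_root(a, b, min_prefix=4, max_suffix=3):
    """Treat two strings as the same entity if they differ only in a short ending.

    Assumed thresholds: at least `min_prefix` shared leading characters and at
    most `max_suffix` differing trailing characters on the longer form.
    """
    a, b = a.casefold(), b.casefold()
    prefix = common_prefix_len(a, b)
    return prefix >= min_prefix and max(len(a), len(b)) - prefix <= max_suffix


print(same_root("Москве", "Москва"))       # True
print(same_root("Warszawie", "Warszawa"))  # True
print(same_root("Москва", "Минск"))        # False
```

A prefix heuristic like this is deliberately loose and only makes sense at scoring or matching time; very short names fall below the minimum prefix and still require an exact match.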

All 20 language examples

Each block shows the input text, the raw model output, and the post-processed output side by side.

Limitations

  • Text input only. Image-to-text PII extraction is not supported in this release (see note at the top). Provide text input.
  • Training data is English-only. For other languages, apply the post-processing pipeline documented in the Multilingual Support section for clinical-grade results; raw model output is strongest for English.
  • Purpose-built for PII extraction — not a general-purpose NER or chat model.
  • Performance may vary on highly domain-specific jargon or unconventional PII formats.
  • As a generative model, it can occasionally emit a label outside the documented set or miss an entity. Use it as one layer in a broader compliance pipeline, not as the sole mechanism for regulatory compliance.

License

Released under the Apache 2.0 license.
