FineMed Subdomain Classifier (FR)
Blog | Paper | Code | FineMed | DoctoBERT
Introduction
This is the medical-subdomain classifier used to annotate FineMed-fr. Given a French medical document, it predicts one of 15 medical subdomains (e.g. Clinical guidelines & pathways, Patient education & lifestyle, Biomedical & mechanistic science).
It is a ModernCamemBERT-base classifier distilled from LLM teachers, and it is one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, medical-term density).
How to Use
The classifier reads the document text with its URL prepended (url + "\n\n" + text), truncated to at most 8192 tokens.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
repo = "doctolib-lab/finemed-subdomain-classifier-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()
url = "https://www.example.fr/article"
text = "Le diabète de type 2 est une maladie chronique ..."
inputs = tok(url + "\n\n" + text, return_tensors="pt", truncation=True, max_length=8192)
with torch.inference_mode():
    probs = model(**inputs).logits.softmax(-1)[0]
idx = probs.argmax().item()
print(model.config.id2label[idx], round(probs[idx].item(), 3))
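When filtering at corpus scale it can help to look at the runner-up classes and not only the argmax. The sketch below is a minimal, dependency-free illustration: `build_input` reproduces the `url + "\n\n" + text` format described above, and `top_k_labels` is a hypothetical helper (not part of the released code) that turns one row of raw logits and an `id2label` mapping into the k most probable labels. The toy logits and the three-label mapping are made up for the example.

```python
import math


def build_input(url: str, text: str) -> str:
    """Prepend the URL to the document text, separated by a blank line."""
    return url + "\n\n" + text


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs.

    `logits` is a plain list of floats, e.g. model(**inputs).logits[0].tolist().
    `id2label` maps class index to label name, e.g. model.config.id2label.
    """
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy example: made-up logits and a 3-label subset of the taxonomy.
toy_id2label = {
    0: "Patient education & lifestyle",
    1: "Commercial & promotional",
    2: "Others",
}
toy_logits = [2.0, 0.5, -1.0]
top2 = top_k_labels(toy_logits, toy_id2label, k=2)
print(top2)
```

With the real model, `logits` would come from the forward pass shown above and `id2label` from `model.config.id2label`.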
Subdomain Taxonomy
best_class is one of these 15 values:
| subdomain | description |
|---|---|
| Clinical cases & vignettes | Single-patient narratives: presentation, evaluation, management, outcomes; case-based teaching. |
| Clinical guidelines & pathways | Non-patient-specific recommendations, algorithms, and standards; named guidelines or consensus statements. |
| Patient education & lifestyle | Consumer-facing explanations and how-to advice on prevention, self-care, symptoms, diet, fitness, mental well-being. |
| Wellness, supplements & CAM | Botanicals, vitamins, supplements, complementary or alternative therapies outside mainstream clinical guidance. |
| Public health, policy & programs | Population surveillance, epidemiology, screening, laws and regulation, financing and insurance, community guidance. |
| Commercial & promotional | Marketing or sales content: pricing, booking, calls-to-action, affiliate/SEO, comparative ads, testimonials. |
| Drugs, trials & regulation | Drug development and evaluation: clinical trials, approvals and labels, PK/PD, safety monitoring, pharmacovigilance. |
| Biomedical & mechanistic science | Experimental or preclinical research: labs, omics, pathways, cell/animal models, assays, mechanisms. |
| Medical devices, diagnostics & imaging | Device or modality descriptions and clinical use; diagnostics, wearables, sensors, imaging. |
| Health IT, telemedicine & operations | EHR/EMR, data standards, interoperability, analytics, telemedicine, workflow, staffing, procurement, logistics. |
| Occupational health & safety | Workplace hazards, exposures, PPE, training, and compliance with occupational regulations. |
| Health workforce education & training | Professional curricula, CME, certification, simulation, residency/fellowship information. |
| Health services & facilities | Neutral descriptions of care-delivery models, service lines, facility capabilities, long-term/residential care. |
| Other health | Health-related content that is unclear or insufficient to classify under the other subdomains. |
| Others | Not clearly health-related, too brief, or lacking detail (e.g. navigation/boilerplate). |
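As a sketch of how these labels might be used downstream, the snippet below keeps only the records whose predicted subdomain falls in a chosen subset. The record layout (a dict with a `best_class` key and a hypothetical `text` key) is an assumption for illustration, not the documented FineMed-fr schema, and the label strings are copied from the table above; the exact strings in `model.config.id2label` may differ.

```python
# The 15 subdomain names, copied from the taxonomy table.
SUBDOMAINS = (
    "Clinical cases & vignettes",
    "Clinical guidelines & pathways",
    "Patient education & lifestyle",
    "Wellness, supplements & CAM",
    "Public health, policy & programs",
    "Commercial & promotional",
    "Drugs, trials & regulation",
    "Biomedical & mechanistic science",
    "Medical devices, diagnostics & imaging",
    "Health IT, telemedicine & operations",
    "Occupational health & safety",
    "Health workforce education & training",
    "Health services & facilities",
    "Other health",
    "Others",
)


def filter_by_subdomain(records, keep):
    """Return the records whose `best_class` is in `keep`.

    Raises ValueError if `keep` contains a name outside the taxonomy,
    which catches typos before a long filtering job starts.
    """
    keep = set(keep)
    unknown = keep - set(SUBDOMAINS)
    if unknown:
        raise ValueError(f"Unknown subdomain(s): {sorted(unknown)}")
    return [r for r in records if r["best_class"] in keep]


# Made-up records for illustration.
records = [
    {"text": "doc A", "best_class": "Clinical guidelines & pathways"},
    {"text": "doc B", "best_class": "Commercial & promotional"},
    {"text": "doc C", "best_class": "Clinical cases & vignettes"},
]
clinical = filter_by_subdomain(
    records, {"Clinical guidelines & pathways", "Clinical cases & vignettes"}
)
print([r["text"] for r in clinical])
```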
Training
The classifier is distilled from LLM teachers under a two-stage schedule, fine-tuning ModernCamemBERT-base at 8192-token input (document content + URL):
- Stage 1: Qwen3-30B-A3B-Instruct labels 1M documents (high-volume supervision).
- Stage 2: Qwen3-235B-A22B-Instruct labels 490k documents (high-quality supervision).
The 15-class taxonomy was built through three rounds of LLM-driven iteration; class order is shuffled during annotation to mitigate position bias. The full annotation prompt is in subdomain_annotation_prompt.txt.
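The class-order shuffling mentioned above can be sketched as follows. This is only an illustration of the general idea, assuming a numbered list of class names and one random seed per document; the actual prompt layout is the one in subdomain_annotation_prompt.txt, and `build_class_list` is a hypothetical helper.

```python
import random


def build_class_list(classes, seed):
    """Return (shuffled_names, numbered_block) for one annotation prompt.

    A per-document seed makes the order reproducible while still varying
    it across documents, so no class always sits in the first position.
    """
    rng = random.Random(seed)
    shuffled = list(classes)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    lines = [f"{i}. {name}" for i, name in enumerate(shuffled, start=1)]
    return shuffled, "\n".join(lines)


# A 4-class subset of the taxonomy, for brevity.
classes = [
    "Clinical cases & vignettes",
    "Clinical guidelines & pathways",
    "Patient education & lifestyle",
    "Others",
]
order_a, block_a = build_class_list(classes, seed=0)
order_b, block_b = build_class_list(classes, seed=1)
print(block_a)
print()
print(block_b)
```

Because the teacher answers with a class name and not a position, the label can be mapped back to a fixed class index regardless of the order shown.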
Intended Use & Limitations
Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. Predictions are noisier on short or boilerplate documents, which the Others / Other health classes are meant to absorb.
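One way to act on this limitation is to flag predictions that are likely to be unreliable before using them. The sketch below is a hypothetical heuristic, not something shipped with the model: the `min_chars` and `min_prob` thresholds are arbitrary placeholders that would need tuning on real data.

```python
# The two catch-all classes named in the taxonomy.
CATCH_ALL = {"Others", "Other health"}


def needs_review(text, label, prob, min_chars=200, min_prob=0.5):
    """Flag a prediction as unreliable.

    A prediction is flagged when the document is very short, when it
    landed in a catch-all class, or when the top probability is low.
    """
    too_short = len(text.strip()) < min_chars
    catch_all = label in CATCH_ALL
    low_confidence = prob < min_prob
    return too_short or catch_all or low_confidence


# Made-up examples.
print(needs_review("Accueil | Contact", "Others", 0.91))
print(needs_review("x" * 500, "Drugs, trials & regulation", 0.88))
```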
License
MIT, inherited from the ModernCamemBERT base model.
Acknowledgments
This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
Base model: almanach/moderncamembert-base