URL: https://huggingface.co/doctolib-lab/finemed-subdomain-classifier-fr

FineMed Subdomain Classifier (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

This is the medical-subdomain classifier used to annotate FineMed-fr. Given a French medical document, it predicts one of 15 medical subdomains (e.g. Clinical guidelines & pathways, Patient education & lifestyle, Biomedical & mechanistic science).

It is a ModernCamemBERT-base classifier distilled from LLM teachers, and one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, and medical-term density).

🚀 How to Use

The classifier reads the document text with its URL prepended (url + "\n\n" + text), up to 8192 tokens.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "doctolib-lab/finemed-subdomain-classifier-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

url = "https://www.example.fr/article"
text = "Le diabète de type 2 est une maladie chronique ..."
# Prepend the URL to the text and truncate to the 8192-token limit.
inputs = tok(url + "\n\n" + text, return_tensors="pt", truncation=True, max_length=8192)

with torch.inference_mode():
    # Softmax over the subdomain logits of the single input document.
    probs = model(**inputs).logits.softmax(-1)[0]
idx = probs.argmax().item()
print(model.config.id2label[idx], round(probs[idx].item(), 3))
```
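
When the top label alone is not enough, the same logits can be ranked to inspect the runners-up. The following is a minimal sketch in plain Python rather than part of the released code: `build_input` and `top_k` are hypothetical helper names, and the three-label mapping and logits are toy stand-ins, not real model output.

```python
import math


def build_input(url: str, text: str) -> str:
    """Join URL and document text as described above: url + "\\n\\n" + text."""
    return url + "\n\n" + text


def top_k(logits: list[float], id2label: dict[int, str], k: int = 3) -> list[tuple[str, float]]:
    """Apply a softmax to raw logits and return the k most probable labels."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy stand-ins (NOT real model output): a 3-label mapping and made-up logits.
toy_id2label = {
    0: "Patient education & lifestyle",
    1: "Clinical guidelines & pathways",
    2: "Others",
}
toy_logits = [2.0, 1.0, -1.0]
ranking = top_k(toy_logits, toy_id2label, k=2)
```

With the real model, `logits` would be `model(**inputs).logits[0].tolist()` and `id2label` would be `model.config.id2label`.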

🏷️ Subdomain Taxonomy

best_class is one of these 15 values:

| Subdomain | Description |
| --- | --- |
| Clinical cases & vignettes | Single-patient narratives: presentation, evaluation, management, outcomes; case-based teaching. |
| Clinical guidelines & pathways | Non-patient-specific recommendations, algorithms, and standards; named guidelines or consensus statements. |
| Patient education & lifestyle | Consumer-facing explanations and how-to advice on prevention, self-care, symptoms, diet, fitness, mental well-being. |
| Wellness, supplements & CAM | Botanicals, vitamins, supplements, complementary or alternative therapies outside mainstream clinical guidance. |
| Public health, policy & programs | Population surveillance, epidemiology, screening, laws and regulation, financing and insurance, community guidance. |
| Commercial & promotional | Marketing or sales content: pricing, booking, calls-to-action, affiliate/SEO, comparative ads, testimonials. |
| Drugs, trials & regulation | Drug development and evaluation: clinical trials, approvals and labels, PK/PD, safety monitoring, pharmacovigilance. |
| Biomedical & mechanistic science | Experimental or preclinical research: labs, omics, pathways, cell/animal models, assays, mechanisms. |
| Medical devices, diagnostics & imaging | Device or modality descriptions and clinical use; diagnostics, wearables, sensors, imaging. |
| Health IT, telemedicine & operations | EHR/EMR, data standards, interoperability, analytics, telemedicine, workflow, staffing, procurement, logistics. |
| Occupational health & safety | Workplace hazards, exposures, PPE, training, and compliance with occupational regulations. |
| Health workforce education & training | Professional curricula, CME, certification, simulation, residency/fellowship information. |
| Health services & facilities | Neutral descriptions of care-delivery models, service lines, facility capabilities, long-term/residential care. |
| Other health | Health-related content that is unclear or insufficient to classify under the other subdomains. |
| Others | Not clearly health-related, too brief, or lacking detail (e.g. navigation/boilerplate). |
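
For downstream use, the taxonomy can be held as a constant and used to validate or filter annotated records. The sketch below makes several assumptions: the label strings are copied from the table above and may not match `model.config.id2label` character for character, the record layout (a dict with a `best_class` key) is hypothetical, and excluding `Others` and `Commercial & promotional` is only an illustrative filtering choice.

```python
from collections import Counter

# Labels copied from the taxonomy table. ASSUMPTION: they match the strings
# returned by the model; verify against model.config.id2label before relying on it.
SUBDOMAINS = [
    "Clinical cases & vignettes",
    "Clinical guidelines & pathways",
    "Patient education & lifestyle",
    "Wellness, supplements & CAM",
    "Public health, policy & programs",
    "Commercial & promotional",
    "Drugs, trials & regulation",
    "Biomedical & mechanistic science",
    "Medical devices, diagnostics & imaging",
    "Health IT, telemedicine & operations",
    "Occupational health & safety",
    "Health workforce education & training",
    "Health services & facilities",
    "Other health",
    "Others",
]

# Illustrative exclusion set, not a recommendation from the model card.
EXCLUDED = {"Others", "Commercial & promotional"}


def subdomain_counts(records: list[dict]) -> Counter:
    """Count records per subdomain, rejecting labels outside the taxonomy."""
    known = set(SUBDOMAINS)
    for record in records:
        if record["best_class"] not in known:
            raise ValueError(f"unknown subdomain: {record['best_class']!r}")
    return Counter(record["best_class"] for record in records)


def keep_records(records: list[dict], excluded: set = EXCLUDED) -> list[dict]:
    """Drop records whose subdomain is in the excluded set."""
    return [r for r in records if r["best_class"] not in excluded]


# Hypothetical annotated records, for illustration only.
sample = [
    {"id": "a", "best_class": "Patient education & lifestyle"},
    {"id": "b", "best_class": "Others"},
    {"id": "c", "best_class": "Patient education & lifestyle"},
]
counts = subdomain_counts(sample)
kept = keep_records(sample)
```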

🔧 Training

The classifier is distilled from LLM teachers under a two-stage schedule, fine-tuning ModernCamemBERT-base at 8192-token input (document content + URL).

The 15-class taxonomy was built through three rounds of LLM-driven iteration; class order is shuffled during annotation to mitigate position bias. The full annotation prompt is in subdomain_annotation_prompt.txt.
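
Shuffling the class order per document can be sketched as follows. This is an illustration of the idea only: the function name, the numbered-list layout, and the per-document integer seed are assumptions, and the actual prompt wording lives in subdomain_annotation_prompt.txt.

```python
import random


def shuffled_class_listing(classes: list[str], seed: int) -> tuple[list[str], str]:
    """Return the classes in a seeded random order plus a numbered listing.

    Using a per-document seed keeps the order reproducible while still
    varying it across documents, so no class always appears first.
    """
    rng = random.Random(seed)
    order = list(classes)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    listing = "\n".join(f"{i}. {name}" for i, name in enumerate(order, start=1))
    return order, listing


# A hypothetical subset of the taxonomy, for brevity.
classes = [
    "Clinical cases & vignettes",
    "Clinical guidelines & pathways",
    "Patient education & lifestyle",
    "Others",
]
order_a, listing_a = shuffled_class_listing(classes, seed=1)
order_b, listing_b = shuffled_class_listing(classes, seed=1)
```

Because the teacher answers with a class name rather than a position, the shuffled order does not need to be undone afterwards in this sketch.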

⚠️ Intended Use & Limitations

This classifier was built to annotate French medical web text at corpus scale (to build FineMed-fr); it is not intended for clinical decision-making. Predictions are noisier on short or boilerplate documents, which the Others and Other health classes are meant to absorb.
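
One way to act on this limitation downstream is to flag predictions that deserve a second look. The sketch below is hypothetical: `needs_review` is not part of the released code, and the `min_chars` and `min_prob` thresholds are arbitrary placeholders that would need tuning on real data.

```python
FALLBACK_CLASSES = {"Others", "Other health"}


def needs_review(
    text: str,
    top_label: str,
    top_prob: float,
    min_chars: int = 200,   # placeholder threshold, not from the model card
    min_prob: float = 0.5,  # placeholder threshold, not from the model card
) -> list[str]:
    """Return the reasons a prediction may be unreliable (empty if none)."""
    reasons = []
    if len(text.strip()) < min_chars:
        reasons.append("short_document")
    if top_prob < min_prob:
        reasons.append("low_confidence")
    if top_label in FALLBACK_CLASSES:
        reasons.append("fallback_class")
    return reasons


# Toy inputs (NOT real model output).
flags_short = needs_review("Accueil | Contact", "Others", 0.41)
flags_ok = needs_review("x" * 500, "Drugs, trials & regulation", 0.93)
```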

βš–οΈ License

MIT, inherited from the ModernCamemBERT base model.

πŸ›οΈ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
