FineMed Subdomain Classifier (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

This is the medical-subdomain classifier used to annotate FineMed-fr. Given a French medical document, it predicts one of 15 medical subdomains (e.g. Clinical guidelines & pathways, Patient education & lifestyle, Biomedical & mechanistic science).

It is a ModernCamemBERT-base classifier distilled from LLM teachers, and one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, medical-term density).

🚀 How to Use

The classifier reads the document text with its URL prepended (url + "\n\n" + text). Inputs longer than 8192 tokens are truncated.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and classifier from the Hub; eval() disables dropout.
repo = "doctolib-lab/finemed-subdomain-classifier-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

# Prepend the URL to the text and truncate to the 8192-token limit.
url = "https://www.example.fr/article"
text = "Le diabète de type 2 est une maladie chronique ..."
inputs = tok(url + "\n\n" + text, return_tensors="pt", truncation=True, max_length=8192)

# Softmax over the logits gives one probability per subdomain.
with torch.inference_mode():
    probs = model(**inputs).logits.softmax(-1)[0]
idx = probs.argmax().item()
print(model.config.id2label[idx], round(probs[idx].item(), 3))
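
For corpus-scale annotation, documents are usually scored in batches rather than one at a time. The following is a minimal sketch, not the pipeline used to build FineMed-fr: the helper names (build_input, top_k, classify_batch) and the batch_size default are illustrative assumptions, and only the url + "\n\n" + text format and the 8192-token limit come from this card.

```python
from typing import Dict, List, Sequence, Tuple


def build_input(url: str, text: str) -> str:
    """Prepend the URL to the document text, in the format the classifier expects."""
    return url + "\n\n" + text


def top_k(probs: Sequence[float], id2label: Dict[int, str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], round(float(probs[i]), 3)) for i in ranked[:k]]


def classify_batch(docs: Sequence[Tuple[str, str]], k: int = 3, batch_size: int = 8):
    """Classify (url, text) pairs in mini-batches (hypothetical helper)."""
    # Heavy imports are deferred so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    repo = "doctolib-lab/finemed-subdomain-classifier-fr"
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

    results = []
    for start in range(0, len(docs), batch_size):
        chunk = [build_input(url, text) for url, text in docs[start:start + batch_size]]
        # padding=True pads each mini-batch to its longest member.
        inputs = tok(chunk, return_tensors="pt", padding=True,
                     truncation=True, max_length=8192)
        with torch.inference_mode():
            probs = model(**inputs).logits.softmax(-1)
        results.extend(top_k(row.tolist(), model.config.id2label, k) for row in probs)
    return results
```

Calling classify_batch([(url, text)]) returns one list of (label, probability) pairs per document. Keeping the top few candidates instead of only the argmax makes it easier to spot documents that sit between two subdomains.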

🏷️ Subdomain Taxonomy

The predicted subdomain (best_class) takes one of these 15 values:

subdomain	description
Clinical cases & vignettes	Single-patient narratives: presentation, evaluation, management, outcomes; case-based teaching.
Clinical guidelines & pathways	Non-patient-specific recommendations, algorithms, and standards; named guidelines or consensus statements.
Patient education & lifestyle	Consumer-facing explanations and how-to advice on prevention, self-care, symptoms, diet, fitness, mental well-being.
Wellness, supplements & CAM	Botanicals, vitamins, supplements, complementary or alternative therapies outside mainstream clinical guidance.
Public health, policy & programs	Population surveillance, epidemiology, screening, laws and regulation, financing and insurance, community guidance.
Commercial & promotional	Marketing or sales content: pricing, booking, calls-to-action, affiliate/SEO, comparative ads, testimonials.
Drugs, trials & regulation	Drug development and evaluation: clinical trials, approvals and labels, PK/PD, safety monitoring, pharmacovigilance.
Biomedical & mechanistic science	Experimental or preclinical research: labs, omics, pathways, cell/animal models, assays, mechanisms.
Medical devices, diagnostics & imaging	Device or modality descriptions and clinical use; diagnostics, wearables, sensors, imaging.
Health IT, telemedicine & operations	EHR/EMR, data standards, interoperability, analytics, telemedicine, workflow, staffing, procurement, logistics.
Occupational health & safety	Workplace hazards, exposures, PPE, training, and compliance with occupational regulations.
Health workforce education & training	Professional curricula, CME, certification, simulation, residency/fellowship information.
Health services & facilities	Neutral descriptions of care-delivery models, service lines, facility capabilities, long-term/residential care.
Other health	Health-related content that is unclear or insufficient to classify under the other subdomains.
Others	Not clearly health-related, too brief, or lacking detail (e.g. navigation/boilerplate).
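
Once documents are labelled, a common first check is the share of each subdomain in the corpus. The sketch below assumes each annotated record is a mapping with a best_class field holding one of the labels above; that record layout and the subdomain_shares helper are illustrative assumptions, not a documented schema.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# The 15 subdomain labels, copied from the taxonomy table.
SUBDOMAINS = (
    "Clinical cases & vignettes",
    "Clinical guidelines & pathways",
    "Patient education & lifestyle",
    "Wellness, supplements & CAM",
    "Public health, policy & programs",
    "Commercial & promotional",
    "Drugs, trials & regulation",
    "Biomedical & mechanistic science",
    "Medical devices, diagnostics & imaging",
    "Health IT, telemedicine & operations",
    "Occupational health & safety",
    "Health workforce education & training",
    "Health services & facilities",
    "Other health",
    "Others",
)


def subdomain_shares(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the fraction of records per subdomain, skipping empty classes."""
    counts = Counter()
    for rec in records:
        label = rec["best_class"]  # assumed field name
        if label not in SUBDOMAINS:
            raise ValueError(f"unknown subdomain: {label!r}")
        counts[label] += 1
    total = sum(counts.values())
    # Classes with zero records are omitted, so an empty input yields {}.
    return {label: counts[label] / total for label in SUBDOMAINS if counts[label]}
```

Rejecting unknown labels early catches typos or taxonomy drift before they skew the distribution.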

🔧 Training

The classifier is distilled from LLM teachers under a two-stage schedule, fine-tuning ModernCamemBERT-base on inputs of up to 8192 tokens (URL + document content):

Stage 1: Qwen3-30B-A3B-Instruct labels 1M documents (high-volume supervision).
Stage 2: Qwen3-235B-A22B-Instruct labels 490k documents (high-quality supervision).

The 15-class taxonomy was built through three rounds of LLM-driven iteration; class order is shuffled during annotation to mitigate position bias. The full annotation prompt is in subdomain_annotation_prompt.txt.
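
Shuffling the class order can be sketched as follows. This is an illustration of the general technique only: the prompt wording, the numbered-answer format, and the helper names are hypothetical, and the list is shortened to four classes for brevity. The prompt actually used is the one in subdomain_annotation_prompt.txt.

```python
import random
from typing import List, Tuple

# Shortened class list for illustration; the real taxonomy has 15 entries.
CLASSES = [
    "Clinical cases & vignettes",
    "Clinical guidelines & pathways",
    "Patient education & lifestyle",
    "Others",
]


def build_prompt(document: str, classes: List[str], rng: random.Random) -> Tuple[str, List[str]]:
    """Build a prompt with the classes in a fresh random order (hypothetical wording)."""
    order = classes[:]  # copy, so the canonical list is never mutated
    rng.shuffle(order)
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(order))
    prompt = (
        "Classify the document into one subdomain.\n"
        f"{options}\n\n"
        f"Document:\n{document}\n\n"
        "Answer with the option number."
    )
    return prompt, order


def decode_answer(answer: str, order: List[str]) -> str:
    """Map the teacher's numeric answer back to the class name for this prompt."""
    index = int(answer.strip()) - 1
    if not 0 <= index < len(order):
        raise ValueError(f"answer out of range: {answer!r}")
    return order[index]
```

Because each document sees a different order, the per-prompt order must be kept alongside the answer; decoding against the canonical list would assign the wrong label.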

⚠️ Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. Predictions are noisier on short or boilerplate documents, which the Others / Other health classes are meant to absorb.
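
Downstream users may want to set aside predictions that are likely to be noisy. The sketch below is one possible filter, not part of the released pipeline: the is_reliable helper and both thresholds (min_prob, min_chars) are hypothetical values chosen for illustration and would need tuning on real data.

```python
from typing import NamedTuple

# The two catch-all classes named in this card.
FALLBACK_CLASSES = {"Others", "Other health"}


class Prediction(NamedTuple):
    label: str   # predicted subdomain
    prob: float  # softmax probability of that subdomain


def is_reliable(pred: Prediction, n_chars: int,
                min_prob: float = 0.5, min_chars: int = 200) -> bool:
    """Return True if a prediction passes all three illustrative checks."""
    if pred.label in FALLBACK_CLASSES:
        return False  # catch-all classes carry no specific subdomain signal
    if pred.prob < min_prob:
        return False  # the classifier itself is unsure
    if n_chars < min_chars:
        return False  # short documents are where predictions are noisier
    return True
```

For example, is_reliable(Prediction("Patient education & lifestyle", 0.9), n_chars=1000) returns True, while the same prediction on a 50-character document returns False.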

⚖️ License

MIT, inherited from the ModernCamemBERT base model.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.

Model size: 0.1B params (Safetensors, tensor type F32)

Base model: almanach/moderncamembert-base

Paper: 2606.22079

URL: https://huggingface.co/doctolib-lab/finemed-subdomain-classifier-fr