FineMed Educational-Quality Scorer (FR)
Blog | Paper | Code | FineMed | DoctoBERT
Introduction
This is the educational-quality scorer used to annotate FineMed-fr. Given a French medical document, it outputs a score from 0 to 5 for how instructive the document is for medical education (medical students, residents, practicing clinicians), using a rubric adapted from FineWeb-Edu.
It is a ModernCamemBERT-base regression scorer distilled from LLM teachers, one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, medical-term density).
How to Use
The model has a regression head: take the raw score, clip it to the 0-5 range, and round it to an integer. It reads the document text, up to 8192 tokens.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "doctolib-lab/finemed-edu-scorer-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

text = "Le diabète de type 2 est une maladie chronique ..."
# Truncate to the 8192-token input limit.
inputs = tok(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.inference_mode():
    # Single regression output: the raw educational-quality score.
    score = model(**inputs).logits.squeeze(-1).item()
normalized = round(max(0, min(score, 5)))  # clip to [0, 5], then round to an integer
print(round(score, 2), normalized)
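When annotating many documents, the clip-and-round step can be factored into a small helper. The following is a minimal sketch that mirrors the snippet above; the names `normalize_score` and `normalize_batch` are hypothetical and are not part of the released code.

```python
from typing import Iterable, List


def normalize_score(raw: float, low: int = 0, high: int = 5) -> int:
    """Clip a raw regression output to [low, high], then round to an integer.

    Uses Python's built-in round(), as in the snippet above. Note that
    round() resolves exact .5 ties to the nearest even integer.
    """
    clipped = max(low, min(raw, high))
    return round(clipped)


def normalize_batch(raw_scores: Iterable[float]) -> List[int]:
    """Apply normalize_score to every raw score in a batch."""
    return [normalize_score(s) for s in raw_scores]


if __name__ == "__main__":
    # Illustrative raw values, not real model outputs.
    print(normalize_batch([-0.4, 1.2, 3.6, 6.2]))
```

Exact ties such as 2.5 are unlikely with floating-point model outputs, but if tie behavior matters for a downstream threshold, it is worth choosing the rounding rule explicitly.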
Scoring Rubric
The score is additive and ranges from 0 to 5. It adapts FineWeb-Edu's general-education rubric to a medical-education target, awarding one point per successive criterion. The full scoring prompt is in edu_quality_annotation_prompt.txt.
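To make the additive structure concrete, here is a minimal sketch of how such a score adds up. It assumes five yes/no criteria worth one point each, which is what a 0-5 range with one point per criterion implies. The criteria themselves are defined in the scoring prompt and are not reproduced here, and `additive_score` is a hypothetical name. The sketch simply counts satisfied criteria and does not enforce any ordering between them.

```python
from typing import Sequence

NUM_CRITERIA = 5  # assumption: five criteria, one point each, giving a 0-5 range


def additive_score(criteria_met: Sequence[bool]) -> int:
    """Return the number of satisfied criteria as the 0-5 score."""
    if len(criteria_met) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria_met)}")
    return sum(bool(c) for c in criteria_met)


if __name__ == "__main__":
    # A document meeting the first three (placeholder) criteria.
    print(additive_score([True, True, True, False, False]))
```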
Training
The scorer is ModernCamemBERT-base (regression head, round-up rounding) fine-tuned on document content at an 8192-token input length. It is distilled from LLM teachers under a two-stage schedule:
- Stage 1: Qwen3-30B-A3B-Instruct labels 1M documents (high-volume supervision).
- Stage 2: Qwen3-235B-A22B-Instruct labels 90k documents (high-quality supervision).
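The two stages above can be written down as a small schedule description. This is only a sketch: the teacher names and document counts come from the list above, while `Stage`, `run_schedule`, and the `train_fn` callback are hypothetical, and the sketch assumes that stage 2 continues from the state produced by stage 1. No hyperparameters are shown because none are stated here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

State = TypeVar("State")


@dataclass(frozen=True)
class Stage:
    name: str
    teacher: str        # LLM that produced the labels for this stage
    num_documents: int  # number of teacher-labeled documents


STAGES: List[Stage] = [
    Stage("stage1", "Qwen3-30B-A3B-Instruct", 1_000_000),
    Stage("stage2", "Qwen3-235B-A22B-Instruct", 90_000),
]


def run_schedule(
    stages: Sequence[Stage],
    train_fn: Callable[[State, Stage], State],
    state: State,
) -> State:
    """Run the stages in order, passing each stage's result to the next."""
    for stage in stages:
        state = train_fn(state, stage)
    return state


if __name__ == "__main__":
    # Placeholder train_fn that only records which stages ran.
    history = run_schedule(STAGES, lambda s, st: s + [st.name], [])
    print(history, sum(st.num_documents for st in STAGES))
```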
Intended Use & Limitations
Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. The score reflects educational value for medical training, not factual correctness or clinical safety.
License
MIT, inherited from the ModernCamemBERT base model.
Acknowledgments
This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
Base model: almanach/moderncamembert-base