FineMed Educational-Quality Scorer (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

This is the educational-quality scorer used to annotate FineMed-fr. Given a French medical document, it outputs a 0–5 score for how instructive the document is for medical education (medical students, residents, practicing clinicians), on a rubric adapted from FineWeb-Edu.

It is a ModernCamemBERT-base regression model distilled from LLM teachers, and one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, and medical-term density).

🚀 How to Use

The model has a regression head, so its output is a single raw score: clip that score to the [0, 5] range, then round it to the nearest integer to get the 0–5 label. The input is the document text, truncated to 8192 tokens.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "doctolib-lab/finemed-edu-scorer-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

text = "Le diabète de type 2 est une maladie chronique ..."
# Truncate to the 8192-token input limit.
inputs = tok(text, return_tensors="pt", truncation=True, max_length=8192)

with torch.inference_mode():
    # The regression head returns one logit: the raw educational-quality score.
    score = model(**inputs).logits.squeeze(-1).item()

# Clip to [0, 5], then round to the integer 0–5 label.
normalized = round(max(0, min(score, 5)))
print(round(score, 2), normalized)
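
For more than one document, the same steps can be batched. The following is a minimal sketch, not the pipeline used to build FineMed-fr: the helper names `to_label`, `batched`, and `score_texts`, and the default `batch_size=8`, are assumptions made for illustration. It expects `tok` and `model` to be loaded as in the snippet above.

```python
from typing import Iterator, List, Sequence


def to_label(raw: float) -> int:
    """Clip a raw regression output to [0, 5] and round it to an integer label."""
    return round(max(0.0, min(raw, 5.0)))


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_texts(texts: Sequence[str], tok, model, batch_size: int = 8) -> List[float]:
    """Return one raw score per text (hypothetical helper, not part of the repo)."""
    import torch  # imported here so the helpers above work without PyTorch

    raw_scores: List[float] = []
    for chunk in batched(texts, batch_size):
        # Pad to the longest document in the batch, truncate at 8192 tokens.
        inputs = tok(
            list(chunk),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=8192,
        ).to(model.device)
        with torch.inference_mode():
            logits = model(**inputs).logits  # shape: (batch, 1)
        raw_scores.extend(logits.squeeze(-1).float().tolist())
    return raw_scores
```

Integer labels then follow from `[to_label(s) for s in score_texts(docs, tok, model)]`, where `docs` is a list of French medical documents.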

🏷️ Scoring Rubric

The score is additive on a 0–5 scale. The rubric adapts FineWeb-Edu's general-education rubric to a medical-education target and awards one point per successive criterion. The full scoring prompt is in edu_quality_annotation_prompt.txt.

🔧 Training

The scorer is distilled from LLM teachers under a two-stage schedule. ModernCamemBERT-base is fine-tuned with a regression head (round-up rounding) on document content, with inputs of up to 8192 tokens:

Stage 1: Qwen3-30B-A3B-Instruct labels 1M documents (high-volume supervision).
Stage 2: Qwen3-235B-A22B-Instruct labels 90k documents (high-quality supervision).
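
The two stages can be written down as a small schedule. This is only a sketch of the ordering: it assumes the student trained in Stage 1 is carried into Stage 2, which the description suggests but does not spell out. `Stage`, `run_schedule`, and `fake_train` are hypothetical names, and `fake_train` is a stand-in that records the stage instead of training anything.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

S = TypeVar("S")  # whatever object represents the student model's state


@dataclass(frozen=True)
class Stage:
    teacher: str   # LLM that labeled the documents for this stage
    num_docs: int  # number of labeled documents


# Teachers and label counts as listed above.
SCHEDULE = [
    Stage("Qwen3-30B-A3B-Instruct", 1_000_000),   # high-volume supervision
    Stage("Qwen3-235B-A22B-Instruct", 90_000),    # high-quality supervision
]


def run_schedule(student: S, stages: List[Stage],
                 train_stage: Callable[[S, Stage], S]) -> S:
    """Run the stages in order, passing the student from one stage to the next."""
    for stage in stages:
        student = train_stage(student, stage)
    return student


def fake_train(log: List[str], stage: Stage) -> List[str]:
    # Hypothetical stand-in for a real fine-tuning step: it only logs the stage.
    return log + [f"{stage.teacher}:{stage.num_docs}"]


history = run_schedule([], SCHEDULE, fake_train)
print(history)
```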

⚠️ Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. The score reflects educational value for medical training, not factual correctness or clinical safety.
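
At corpus scale, a typical use of such a score is threshold filtering. The sketch below assumes documents arrive as `(text, raw_score)` pairs. The function name `keep_educational` and the default cutoff `min_score=3` are illustrative assumptions, not the threshold used for FineMed-fr.

```python
from typing import Iterable, Iterator, Tuple


def keep_educational(scored_docs: Iterable[Tuple[str, float]],
                     min_score: int = 3) -> Iterator[str]:
    """Yield texts whose clipped, rounded score is at least `min_score`.

    `min_score=3` is a hypothetical cutoff chosen for this example only.
    """
    for text, raw in scored_docs:
        label = round(max(0.0, min(raw, 5.0)))  # same clip-then-round as above
        if label >= min_score:
            yield text


docs = [("a", 0.4), ("b", 2.6), ("c", 4.9), ("d", 7.2)]
kept = list(keep_educational(docs))
print(kept)
```

A document that passes the cutoff is rated as educationally useful only. As stated above, the score says nothing about factual correctness or clinical safety.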

⚖️ License

MIT, inherited from the ModernCamemBERT base model.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.


Model size: 0.1B params (Safetensors, F32)
Base model: almanach/moderncamembert-base


Paper: 2606.22079

URL: https://huggingface.co/doctolib-lab/finemed-edu-scorer-fr
