URL: https://huggingface.co/doctolib-lab/finemed-edu-scorer-fr

FineMed Educational-Quality Scorer (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

This is the educational-quality scorer used to annotate FineMed-fr. Given a French medical document, it outputs a 0–5 score for how instructive the document is for medical education (medical students, residents, practicing clinicians), on a rubric adapted from FineWeb-Edu.

It is a ModernCamemBERT-base regression scorer distilled from LLM teachers, one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, medical-term density).

🚀 How to Use

The model has a regression head, so it returns a raw real-valued score: clip it to the 0–5 range and round it to get the integer score. It reads the document text, up to 8192 tokens.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "doctolib-lab/finemed-edu-scorer-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

text = "Le diabète de type 2 est une maladie chronique ..."
# Truncate the input to the 8192-token limit.
inputs = tok(text, return_tensors="pt", truncation=True, max_length=8192)

with torch.inference_mode():
    # The regression head returns a single logit: the raw score.
    score = model(**inputs).logits.squeeze(-1).item()
normalized = round(max(0, min(score, 5)))  # clip to [0, 5], then round to an integer
print(round(score, 2), normalized)
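
The clip-and-round step can be factored into a small helper for post-processing many raw scores at once. This is a minimal sketch: `to_int_score` and its `half_up` flag are hypothetical names, not part of the released code. The default path mirrors the snippet above, which relies on Python's built-in `round` (halves go to the nearest even integer). The `half_up` variant is an assumed alternative for callers who want 2.5 to map to 3; the model card does not say which convention the authors intend.

```python
import math


def to_int_score(raw: float, half_up: bool = False) -> int:
    """Clip a raw regression output to [0, 5] and round it to an integer."""
    clipped = max(0.0, min(raw, 5.0))
    if half_up:
        # Assumed variant (not confirmed by the model card): 2.5 -> 3.
        return int(math.floor(clipped + 0.5))
    # Python's built-in round() rounds halves to the nearest even integer: 2.5 -> 2.
    return round(clipped)


# Hypothetical raw scores, including values outside the 0-5 range.
raw_scores = [-0.3, 2.4, 2.5, 3.6, 7.1]
print([to_int_score(s) for s in raw_scores])                # [0, 2, 2, 4, 5]
print([to_int_score(s, half_up=True) for s in raw_scores])  # [0, 2, 3, 4, 5]
```

The two conventions differ only on exact halves, which are rare for real-valued model outputs, so the choice mostly matters for reproducibility when comparing against stored labels.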

🏷️ Scoring Rubric

An additive 0–5 score adapted from FineWeb-Edu's general-education rubric to a medical-education target, awarding one point per successive criterion. The full scoring prompt is in edu_quality_annotation_prompt.txt.
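
To make the additive structure concrete, here is a minimal sketch of how a one-point-per-criterion rubric yields a 0–5 score. It assumes five yes/no criteria whose points simply accumulate. The actual criteria are defined in the scoring prompt and are not reproduced here, so the flags below are placeholders and `additive_score` is a hypothetical helper.

```python
from typing import Sequence

NUM_CRITERIA = 5  # assumption: five criteria, one point each, giving a 0-5 range


def additive_score(criteria_met: Sequence[bool]) -> int:
    """Return one point per satisfied criterion."""
    if len(criteria_met) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria_met)}")
    return sum(1 for met in criteria_met if met)


# Placeholder flags: a document that satisfies the first three criteria only.
print(additive_score([True, True, True, False, False]))  # 3
```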

🔧 Training

The scorer is distilled from LLM teachers under a two-stage schedule. It fine-tunes ModernCamemBERT-base with a regression head (round-up rounding) on the document content, at an input length of 8192 tokens.

⚠️ Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. The score reflects educational value for medical training, not factual correctness or clinical safety.

βš–οΈ License

MIT, inherited from the ModernCamemBERT base model.

πŸ›οΈ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.

Model size: 0.1B params (Safetensors, tensor type F32)
