URL: https://huggingface.co/vishal98m/qwen2.5-vl-3b-finetuned-medical-vqa-rad


Qwen2.5-VL-3B — Medical VQA (VQA-RAD)

This is a fine-tuned version of Qwen/Qwen2.5-VL-3B-Instruct trained on the VQA-RAD radiology visual question answering dataset using QLoRA. The LoRA adapter weights have been merged into the base model, so no PEFT dependency is required at inference time.

Model Details

Property | Value
Base Model | Qwen/Qwen2.5-VL-3B-Instruct
Fine-tuning Method | QLoRA (4-bit quantized base, LoRA adapters)
LoRA Rank | 16
LoRA Alpha | 32
LoRA Dropout | 0.05
Vision Encoder | Frozen (only language layers fine-tuned)
Merged | Yes (merge_and_unload() applied post-training)
Language | English
License | Apache 2.0
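
To make the "Merged" row concrete, here is a minimal pure-Python sketch of what merging a LoRA adapter into a base weight amounts to: the low-rank update B·A, scaled by alpha / rank, is added to the frozen weight once, so inference needs no adapter code path. The rank and alpha constants come from the table above; the toy matrices and the helper names (matmul, merge_lora) are illustrative assumptions, not the PEFT or Unsloth implementation.

```python
# Values from the Model Details table.
LORA_RANK = 16
LORA_ALPHA = 32
LORA_SCALE = LORA_ALPHA / LORA_RANK  # standard LoRA scaling: alpha / r


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, scale):
    """Return W + scale * (B @ A), where B is d_out x r and A is r x d_in."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example (hypothetical numbers): a 2x2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r
A = [[0.5, -1.0]]    # r x d_in
merged = merge_lora(W, A, B, scale=LORA_SCALE)

# The merged weight gives the same output as base path + scaled adapter path.
x = [[3.0], [4.0]]
base_plus_adapter = [
    [wx[0] + LORA_SCALE * bax[0]]
    for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))
]
merged_output = matmul(merged, x)
```

With these toy numbers the two outputs agree, which is why the merged checkpoint can be loaded with plain transformers and no PEFT dependency.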

Training Details

Hyperparameter | Value
Epochs | 3
Learning Rate | 2e-4
LR Scheduler | Cosine
Warmup Ratio | 0.03
Batch Size (per device) | 1
Gradient Accumulation Steps | 4
Effective Batch Size | 4
Max Sequence Length | 512
Optimizer | Paged AdamW 8-bit
Precision | bfloat16
Gradient Checkpointing | Yes (Unsloth patched)
Framework | Unsloth + TRL SFTTrainer
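
The batch-size and scheduler rows can be sketched as follows. This is an approximation of the usual linear-warmup plus cosine-decay schedule under the hyperparameters listed above, not the exact trainer code. The function names, the single-device assumption, and the total of 1,000 optimizer steps in the example are all hypothetical; the card does not state the actual step count.

```python
import math

# Values from the Training Details table.
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 4


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Examples consumed per optimizer step (single device assumed by default)."""
    return per_device * grad_accum * num_devices


def warmup_steps(total_steps, warmup_ratio=WARMUP_RATIO):
    """Number of linear-warmup steps implied by a warmup ratio."""
    return math.ceil(total_steps * warmup_ratio)


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    warmup = warmup_steps(total_steps, warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, for illustration only.
TOTAL_STEPS = 1000
schedule = [lr_at(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

Under these assumptions the learning rate climbs linearly over roughly the first 3% of steps, peaks at 2e-4, and then decays along a half cosine to zero at the final step.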

LoRA Target Modules

Language attention and MLP layers. Vision encoder weights were frozen during training — only the language decoder was adapted.

Dataset

VQA-RAD (flaviagiammarino/vqa-rad) is a clinician-generated radiology VQA dataset with 2,248 question-answer pairs across 315 radiology images covering chest X-rays, head CTs, and abdominal scans. Questions are split into two types:

  • Closed-ended: Yes/No and constrained categorical answers
  • Open-ended: Free-form short phrase answers

Evaluation is on the official test split (451 samples).

Evaluation Results

Overall

Metric | Base Model | Fine-tuned | Improvement
Exact Match (EM) | 48.12% | 53.88% | +5.76 points (+11.97% relative)
Token F1 | not reported | 58.43% | not reported

By Question Type (Fine-tuned Model)

Question Type | Exact Match | Token F1
Closed-ended | 77.29% | 77.29%
Open-ended | 24.50% | 34.75%

Note on metrics: Closed-ended EM and F1 are identical because yes/no answers are single tokens, so a prediction either matches completely or not at all. The open-ended gap between EM and F1 (24.50% vs. 34.75%) suggests that some EM failures are partial matches or phrasing differences rather than outright misses: the prediction shares tokens with the ground-truth string without matching it exactly.
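
The reported figures can be cross-checked with a little arithmetic. The sketch below recomputes the relative EM improvement, then infers what share of the 451 test questions must be closed-ended for the per-type EM scores to average to the overall EM, and finally checks that the same share reproduces the overall Token F1. The variable names are illustrative, and the inferred split is a derived estimate rather than a number stated in this card.

```python
# Figures reported in the Evaluation Results section (percentages).
BASE_EM = 48.12
FINETUNED_EM = 53.88
OVERALL_F1 = 58.43
CLOSED_EM, CLOSED_F1 = 77.29, 77.29
OPEN_EM, OPEN_F1 = 24.50, 34.75
TEST_SAMPLES = 451

# Absolute gain in percentage points and gain relative to the base score.
absolute_gain = FINETUNED_EM - BASE_EM
relative_gain = 100.0 * absolute_gain / BASE_EM

# Solve share * CLOSED_EM + (1 - share) * OPEN_EM = FINETUNED_EM for share.
closed_share = (FINETUNED_EM - OPEN_EM) / (CLOSED_EM - OPEN_EM)
implied_closed_count = round(closed_share * TEST_SAMPLES)
implied_open_count = TEST_SAMPLES - implied_closed_count

# The same share should reproduce the overall Token F1 if the tables agree.
implied_overall_f1 = closed_share * CLOSED_F1 + (1.0 - closed_share) * OPEN_F1

print(f"absolute gain: {absolute_gain:.2f} points, relative: {relative_gain:.2f}%")
print(f"implied split: {implied_closed_count} closed / {implied_open_count} open")
print(f"implied overall F1: {implied_overall_f1:.2f}% (reported {OVERALL_F1}%)")
```

The implied overall F1 lands within rounding of the reported 58.43%, which indicates that the overall Token F1 figure belongs to the fine-tuned model and that the two tables are mutually consistent.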


Usage

import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the merged fine-tuned weights; no PEFT adapter is needed.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "vishal98m/qwen2.5-vl-3b-finetuned-medical-vqa-rad",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# The processor is taken from the base model.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    use_fast=False,
)


def ask(image: Image.Image, question: str) -> str:
    """Answer a question about a radiology image with a short phrase."""
    prompt = (
        f"{question}\n"
        "Provide only the short direct answer in 1-5 words. "
        "Do not explain or add any extra text."
    )
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    # Move inputs to wherever the model was placed instead of hard-coding
    # "cuda", so the same code also works for CPU inference.
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=32)

    # Drop the prompt tokens so only the newly generated answer is decoded.
    trimmed = [
        out[len(inp):]
        for inp, out in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0].strip()


# Example
image = Image.open("chest_xray.jpg")
print(ask(image, "Is there any abnormality in this chest X-ray?"))

Inference Tips

  • Prompt format matters: The model was trained to produce short answers. Prompting with "Provide only the short direct answer in 1-5 words" significantly improves exact match performance by suppressing verbose outputs.
  • Normalization: For evaluation, lowercase and strip punctuation before comparing answers.
  • Hardware: Runs on a single 8GB+ VRAM GPU in bfloat16. For CPU inference, switch to torch_dtype=torch.float32.
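
The normalization tip above can be sketched as a small scoring helper. This is a minimal sketch, assuming SQuAD-style token F1 over whitespace-separated tokens after lowercasing and punctuation stripping; the card does not publish its exact evaluation script, so the function names and details here are assumptions rather than the author's implementation.

```python
import string
from collections import Counter

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A single-token answer scores the same under both metrics.
closed_em = exact_match("Yes.", "yes")
closed_f1 = token_f1("Yes.", "yes")

# A partial open-ended match fails EM but still earns F1 credit.
open_em = exact_match("right lung", "the right lung")
open_f1 = token_f1("right lung", "the right lung")
```

For single-token yes/no answers the two metrics coincide, while a partially overlapping open-ended answer scores zero on EM and above zero on F1, matching the pattern in the per-type results.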

Intended Use

✅ Suitable For

  • Medical radiology question answering (academic and educational contexts)
  • Study aid for medical and life science students
  • Generating explanations of radiological findings and imaging concepts
  • Prototype development for medical education tools

❌ Not Suitable For

  • Clinical decision-making or patient diagnosis
  • Replacing licensed medical professionals
  • Providing personalised medical advice
  • Any safety-critical medical application

Limitations

  • Domain: Fine-tuned exclusively on VQA-RAD. Generalization to other medical VQA benchmarks (PathVQA, SLAKE) has not been evaluated.
  • Open-ended answers: Token F1 (34.75%) is a more faithful measure of open-ended performance than EM (24.50%) — strict string matching penalizes valid paraphrases.
  • Not for clinical use: This model is a research artifact. It should not be used for clinical decision-making or patient care.
  • Model size: At 3B parameters, absolute performance is below larger medical VLMs. Larger variants would likely yield higher scores.
  • Vision encoder frozen: Only the language decoder was fine-tuned. Fine-tuning the vision encoder may yield additional gains.

Citation

If you use this model, please cite the original VQA-RAD dataset:

@article{lau2018dataset,
 title={A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images},
 author={Lau, Jason J and Gayen, Soumya and Ben Abacha, Asma and Demner-Fushman, Dina},
 journal={Scientific data},
 volume={5},
 number={1},
 pages={1--10},
 year={2018},
 publisher={Nature Publishing Group}
}

And the relevant methods:

@article{hu2022lora,
 title={LoRA: Low-Rank Adaptation of Large Language Models},
 author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
 journal={ICLR},
 year={2022}
}
 
@article{dettmers2023qlora,
 title={QLoRA: Efficient Finetuning of Quantized LLMs},
 author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
 journal={NeurIPS},
 year={2023}
}

Uploaded model

  • Developed by: vishal98m
  • License: apache-2.0
  • Finetuned from model: unsloth/Qwen2.5-VL-3B-Instruct

This qwen2_5_vl model was trained 2x faster with Unsloth
