qwen2-vl-2b-screenshots-distill

A compact screenshot-understanding model: a Qwen2-VL-2B-Instruct student fine-tuned with LoRA to reproduce the screenshot descriptions of a larger Qwen2-VL-7B-Instruct teacher (sequence-level knowledge distillation).

Given a UI screenshot, it produces a one-sentence summary followed by a list of the key interface elements.

Teacher: Qwen/Qwen2-VL-7B-Instruct (4-bit)
Student / base: Qwen/Qwen2-VL-2B-Instruct (this repo ships a LoRA adapter)
Method: response-based / sequence-level KD — the student is trained on the teacher's generated targets
Data: Screen2Words (RICO Android UI screenshots, CC-BY-4.0)
Code: https://github.com/P0rt/vlm-distill-screenshots
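
In sequence-level distillation the teacher's generated text is the supervised target: the student is trained with ordinary next-token cross-entropy on those tokens, and the prompt tokens are usually left out of the loss. The sketch below shows how one such training example might be assembled. The token IDs are made up and `build_kd_example` is a hypothetical helper, not code taken from the repo; the only fixed convention is the -100 label value, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def build_kd_example(prompt_ids, teacher_ids):
    """Concatenate prompt and teacher tokens; supervise only the teacher part.

    prompt_ids:  tokens of the chat prompt (image placeholders + instruction)
    teacher_ids: tokens of the description generated by the teacher
    """
    input_ids = list(prompt_ids) + list(teacher_ids)
    # Prompt positions are masked so the loss covers only the teacher's output.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(teacher_ids)
    return {"input_ids": input_ids, "labels": labels}


# Illustrative (made-up) token IDs.
example = build_kd_example(prompt_ids=[101, 7, 8, 9], teacher_ids=[42, 43, 2])
print(example["labels"])
```

Because the targets are plain text, the teacher only has to be run once to produce them; training the student afterwards does not need the teacher's logits.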

Status: proof of concept. Validated end-to-end on an Apple M4 Pro (MPS): train loss 0.80 → 0.39, and the reloaded adapter generates in the trained format. The numbers below are from a small PoC run; a full-scale run is tracked in the repo.

Results (proof-of-concept)

Quality — Screen2Words test split (16 screens), ROUGE-L / BLEU vs human refs:

model	ROUGE-L	BLEU
distilled student	0.178	0.019
untrained baseline	0.153	0.018

The distilled student beats the untrained 2B baseline on ROUGE-L (0.178 vs 0.153, +16% relative) after only a short PoC training run; BLEU is essentially unchanged (0.019 vs 0.018).

Efficiency — teacher (7B) vs student (2B); Apple M4 Pro, MLX, 4-bit, 128 tokens:

model	params (B)	latency p50 (ms)	throughput (img/s)	peak mem (GB)
teacher (Qwen2-VL-7B)	8.29	1538	0.63	5.8
student (Qwen2-VL-2B)	2.21	651	1.52	2.4

→ The student is ~2.4× faster (2.36× lower p50 latency, 2.41× higher throughput), uses ~2.4× less peak memory, and has 3.75× fewer parameters.
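
The ratios quoted for the efficiency table can be reproduced directly from its rows. The short check below uses only the numbers from the table; the dictionary keys are illustrative names.

```python
# Values copied from the efficiency table above.
teacher = {"params_b": 8.29, "latency_ms": 1538, "img_per_s": 0.63, "peak_mem_gb": 5.8}
student = {"params_b": 2.21, "latency_ms": 651, "img_per_s": 1.52, "peak_mem_gb": 2.4}

latency_speedup = teacher["latency_ms"] / student["latency_ms"]
throughput_gain = student["img_per_s"] / teacher["img_per_s"]
memory_ratio = teacher["peak_mem_gb"] / student["peak_mem_gb"]
param_ratio = teacher["params_b"] / student["params_b"]

print(f"latency speedup : {latency_speedup:.2f}x")
print(f"throughput gain : {throughput_gain:.2f}x")
print(f"memory ratio    : {memory_ratio:.2f}x")
print(f"parameter ratio : {param_ratio:.2f}x")
```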

[Figure: quality vs speed]

[Figure: quality vs memory]

Note: against the short human references (median 7 words), ROUGE-L/BLEU undersell the verbose teacher — the 7B teacher actually scores lower on ROUGE-L (0.164) than the distilled student (0.178). LLM-as-judge / CIDEr would reward content over brevity-matching; the unambiguous win here is efficiency.
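
To see why a verbose description is penalised against a short reference, the sketch below implements a simplified ROUGE-L: an F1 score over the longest common subsequence, using whitespace tokens. This is an illustration only, not the evaluation script from the repo, and the example strings are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up strings: a short human-style reference and two candidates.
reference = "login screen of a shopping app"
brief = "login screen of a shopping app"
verbose = ("this is the login screen of a shopping app with email and "
           "password fields and a sign in button")

print(rouge_l_f1(brief, reference))
print(rouge_l_f1(verbose, reference))
```

The verbose candidate contains every reference word in order (recall 1.0), yet its precision is 6/19, so its F1 drops to 0.48. The extra detail may be correct, but the metric counts it as noise.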

Usage

import torch
from peft import PeftModel
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

BASE = "Qwen/Qwen2-VL-2B-Instruct"
ADAPTER = "p00rt/qwen2-vl-2b-screenshots-distill"

# Bound the image resolution (in pixels) the processor resizes screenshots to.
processor = AutoProcessor.from_pretrained(BASE, min_pixels=200704, max_pixels=401408)
# Load the base model, then attach the LoRA adapter from this repo.
model = Qwen2VLForConditionalGeneration.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, ADAPTER).eval()

image = Image.open("screenshot.png")
# Same instruction format the student was trained on.
prompt = ("Describe this UI screenshot in one sentence, then list the key "
          "interface elements as a comma-separated list.")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
# Render the chat template to text; the image itself is passed to the processor.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens and decode only the newly generated description.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
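
The `min_pixels` and `max_pixels` values passed to the processor above are multiples of 28·28. In Qwen2-VL each visual token covers a 28×28 pixel area (14-pixel patches merged 2×2, per the Qwen2-VL documentation), so the two bounds can be read as a per-image visual-token budget. The helper name below is hypothetical.

```python
PIXELS_PER_VISUAL_TOKEN = 28 * 28  # one visual token per 28x28 pixel area


def visual_token_budget(pixels):
    """Approximate number of visual tokens for an image of `pixels` pixels."""
    return pixels // PIXELS_PER_VISUAL_TOKEN


min_tokens = visual_token_budget(200704)  # min_pixels from the usage snippet
max_tokens = visual_token_budget(401408)  # max_pixels from the usage snippet
print(min_tokens, max_tokens)
```

Under this reading, every screenshot is resized so that it costs between 256 and 512 visual tokens, which matches the 512·28·28 cap stated in the Training section.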

Training

LoRA (r=16, α=32, dropout=0.05) on the language-model attention projections.
Visual tokens capped (max_pixels = 512·28·28) so large screenshots fit the context window.
Backend: transformers + peft (hf), on CUDA or Apple MPS.
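
As a rough sketch, the LoRA hyperparameters above could be expressed as keyword arguments for peft's `LoraConfig`. The `r`, `lora_alpha` and `lora_dropout` values come from this card; the `target_modules` names are an assumption about which module names correspond to the language-model attention projections, and may differ from what the repo actually uses.

```python
# Hyperparameters stated in this card.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # ASSUMPTION: typical names of the attention projections in the
    # language model; check the training code in the repo for the real list.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# With standard LoRA scaling, the low-rank update is multiplied by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)

# With peft installed, the config would be built as:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)
```

With α = 32 and r = 16, the adapter update is scaled by a factor of 2 before being added to the frozen base weights.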

Limitations

Narrow domain: Android UI screenshots (RICO). Not a general VLM.
Perception only — no grounding (bounding boxes) and no action/agent layer.
Inherits teacher biases and any teacher hallucinations in the distilled targets.

License & attribution

Code & adapter: Apache-2.0.
Base model Qwen/Qwen2-VL-2B-Instruct: see the base model's license.
Screen2Words / RICO data: CC-BY-4.0.

URL: https://huggingface.co/p00rt/qwen2-vl-2b-screenshots-distill