Kodeseer-9B

A LoRA fine-tune of Qwen3.5-9B for GUI element grounding: given a screenshot and a natural-language instruction, the model predicts the (x, y) coordinates of the target UI element.

Results

Benchmark	Score	Rank
ScreenSpot-V2	94.7%	#7 overall
ScreenSpot-Pro	65.0%	#9 overall
ScreenSpot Original	92.1%	#1 overall
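
This card does not state how a prediction is scored. ScreenSpot-style grounding benchmarks commonly count a prediction as correct when the predicted point falls inside the target element's bounding box, and the sketch below assumes that convention. The function names and the (left, top, right, bottom) box layout are hypothetical, chosen for illustration only; this is not the evaluation script behind the numbers above.

```python
def point_in_box(point, box):
    """Return True if point (x, y) lies inside box (left, top, right, bottom).

    Assumption: point and box use the same coordinate space (for example pixels).
    """
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, boxes):
    """Fraction of samples whose predicted point hits its target box.

    A prediction of None (for example an unparseable model response) is a miss.
    """
    if not boxes:
        return 0.0
    hits = sum(
        1
        for point, box in zip(predictions, boxes)
        if point is not None and point_in_box(point, box)
    )
    return hits / len(boxes)
```

Counting unparseable responses as misses keeps the denominator equal to the number of samples, so a model cannot gain accuracy by failing to answer.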

ScreenSpot-V2 Breakdown

Split	Accuracy
Mobile	95.2%
Desktop	94.6%
Web	92.9%
Overall	94.7%

ScreenSpot-Pro Full Breakdown (1581 samples)

Category	Accuracy	Category	Accuracy
eviews	90.0%	word	88.1%
powerpoint	82.9%	unreal_engine	80.0%
vmware	78.0%	matlab	77.4%
davinci	75.0%	solidworks	72.7%
linux_common	70.0%	photoshop	68.6%
android_studio	66.2%	pycharm	66.7%
quartus	64.4%	inventor	64.3%
vivado	63.7%	vscode	61.8%
blender	60.6%	windows_common	59.3%
illustrator	58.1%	macos_common	53.8%
excel	51.6%	premiere	48.1%
stata	46.9%	autocad	41.2%
fruitloops	40.4%	origin	38.7%
Overall	65.0%

Comparison with State-of-the-Art

ScreenSpot-V2

Rank	Model	Size	Score
1	MAI-UI	32B	96.5%
2	OmegaUse	30B-A3B MoE	96.3%
3	UI-Venus-1.5	30B-A3B MoE	96.2%
4	UI-Venus-1.5	8B	95.9%
5	UI-Venus-1.0	72B	95.3%
6	MAI-UI / GTA1	8B / 32B	95.2%
7	Kodeseer-9B	9B	94.7%
8	UI-TARS 1.5	7B	94.2%
9	UI-Venus-1.0	7B	94.1%
10	Step-GUI	4B	93.6%

ScreenSpot-Pro

Rank	Model	Size	Score
1	Holo2 (3-step)	235B-A22B MoE	78.5%
2	MAI-UI + zoom-in	32B	73.5%
3	Holo2 (1-step)	235B-A22B MoE	70.6%
4	UI-Venus-1.5	30B-A3B MoE	69.6%
5	UI-Venus-1.5	8B	68.4%
6	MAI-UI	32B	67.9%
7	Holo2	30B-A3B MoE	66.1%
8	MAI-UI	8B	65.8%
9	Kodeseer-9B	9B	65.0%
10	Qwen3-VL + MVP	8B	65.3%*
11	GTA1	32B	63.6%
12	UI-TARS 1.5	7B	61.6%

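*MVP is a training-free inference-time technique applied on top of Qwen3-VL; this entry scores slightly above Kodeseer-9B but is listed after it.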

ScreenSpot Original

Rank	Model	Size	Score
1	Kodeseer-9B	9B	92.1%
2	GUI-G2	7B	92.0%
3	GUI-Actor-7B + Verifier	7B	89.7%
4	UI-TARS-7B	7B	89.5%
5	UGround-V1	72B	89.4%

Usage

import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model and processor, then attach the LoRA adapter checkpoint.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

# The system prompt fixes the output format the adapter was trained to produce.
messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

# Build the prompt with thinking disabled so the model answers with coordinates directly.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Greedy decoding; the answer is short, so 64 new tokens is ample.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and keep special tokens so the box markers survive decoding.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>

Coordinate Format

The model predicts a click point in the format <|box_start|>(x,y)<|box_end|>, where x and y are normalized to the range [0, 1000] regardless of the screenshot's resolution. To convert to pixel coordinates:

pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
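
The conversion above presupposes that x and y have already been pulled out of the response string. A minimal sketch of that step follows, assuming the response contains exactly the format shown in this card. The helper names `parse_point` and `to_pixels` are hypothetical, and clamping to the last valid pixel index (so that a value of 1000 does not land one pixel outside the image) is a choice made here, not something the card specifies.

```python
import re

# Matches "(x,y)" between the box tokens used in this card's output format.
_POINT_RE = re.compile(
    r"<\|box_start\|>\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)<\|box_end\|>"
)


def parse_point(response):
    """Return (x, y) in the model's [0, 1000] space, or None if no point is found."""
    match = _POINT_RE.search(response)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def to_pixels(point, image_width, image_height):
    """Scale a [0, 1000] point to pixel indices, clamped to the image bounds."""
    x, y = point
    pixel_x = min(int(x / 1000 * image_width), image_width - 1)
    pixel_y = min(int(y / 1000 * image_height), image_height - 1)
    return pixel_x, pixel_y


point = parse_point("<|box_start|>(512,340)<|box_end|>")
if point is not None:
    print(to_pixels(point, image_width=1920, image_height=1080))
```

Returning None for an unparseable response lets the caller decide whether to retry or to treat the sample as a miss.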

Limitations

Trained on English instructions only
Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
SFT only: no RL (e.g. GRPO) has been applied yet, so further gains are expected
No training data from professional software domains (all training data is general desktop/mobile/web)

License

Apache 2.0 (same as base model)


URL: https://huggingface.co/mdabis/Kodeseer-9B
