URL: https://huggingface.co/mdabis/Kodeseer-9B

Kodeseer-9B

A LoRA fine-tuned Qwen3.5-9B model for GUI element grounding: predicting the (x, y) coordinates of UI elements from screenshots given natural-language instructions.

Results

| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |

ScreenSpot-V2 Breakdown

| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |

ScreenSpot-Pro Full Breakdown (1581 samples)

| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| Overall | 65.0% | | |
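
This card does not describe its evaluation script. ScreenSpot-style benchmarks usually count a prediction as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes that criterion. `is_hit` and `grounding_accuracy` are hypothetical helper names, and points and boxes are assumed to share one coordinate system.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def is_hit(point: Point, box: Box) -> bool:
    """True when the predicted point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


# One hit and one miss against the same hypothetical target box.
print(grounding_accuracy([(10, 10), (50, 50)], [(0, 0, 20, 20), (0, 0, 20, 20)]))
```

Under this assumption a per-category score is simply this function applied to that category's samples.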

Comparison with State-of-the-Art

ScreenSpot-V2

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |

ScreenSpot-Pro

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |

*MVP is a training-free inference-time technique; this entry is listed below Kodeseer-9B despite its higher score.

ScreenSpot Original

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |

Usage

import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model in bfloat16, then attach the LoRA adapter checkpoint.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

# The system prompt fixes the output format the adapter was trained to produce.
messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

# Render the chat prompt with thinking mode disabled, then tokenize text and image together.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Greedy decoding; the answer is short, so 64 new tokens is ample.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and keep special tokens so the box markers survive decoding.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
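
Because the response is decoded with special tokens kept, the coordinates have to be extracted from the string before use. The following is a minimal sketch of one way to do that, assuming the output follows the <|box_start|>(x,y)<|box_end|> format shown above. `parse_click` is a hypothetical helper name, not part of the model's tooling.

```python
import re
from typing import Optional, Tuple

# Matches the documented output format, tolerating optional whitespace.
_CLICK_RE = re.compile(r"<\|box_start\|>\(\s*(\d+)\s*,\s*(\d+)\s*\)<\|box_end\|>")


def parse_click(response: str) -> Optional[Tuple[int, int]]:
    """Return (x, y) in the model's [0, 1000] space, or None if missing or out of range."""
    match = _CLICK_RE.search(response)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    # The regex only admits non-negative integers, so only the upper bound needs checking.
    if x > 1000 or y > 1000:
        return None
    return x, y


print(parse_click("<|box_start|>(512,340)<|box_end|>"))  # (512, 340)
```

Returning None instead of raising lets a caller retry or skip a sample when the model produces malformed output.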

Coordinate Format

The model outputs click coordinates as <|box_start|>(x,y)<|box_end|>, where x and y are normalized to the range [0, 1000] independently of the screenshot's resolution. To convert them to pixel coordinates:

pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
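
One edge case in the conversion: a coordinate of exactly 1000 maps to image_width (or image_height), which is one past the last valid pixel index. The sketch below wraps the same formula and clamps the result. `to_pixels` is a hypothetical helper name, and the clamping is an added assumption rather than something the model card specifies.

```python
from typing import Tuple


def to_pixels(x: int, y: int, image_width: int, image_height: int) -> Tuple[int, int]:
    """Map model coordinates in [0, 1000] to integer pixel indices inside the image."""
    pixel_x = int(x / 1000 * image_width)
    pixel_y = int(y / 1000 * image_height)
    # x == 1000 would map to image_width, one past the last valid column; clamp it.
    pixel_x = min(max(pixel_x, 0), image_width - 1)
    pixel_y = min(max(pixel_y, 0), image_height - 1)
    return pixel_x, pixel_y


print(to_pixels(512, 340, 1920, 1080))    # (983, 367)
print(to_pixels(1000, 1000, 1920, 1080))  # (1919, 1079)
```

Use the dimensions of the original screenshot (for example `image.size` from PIL), not those of any resized copy.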

Limitations

  • Trained on English instructions only
  • Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
  • SFT only so far: no RL/GRPO has been applied yet (further gains expected)
  • No training data from professional software domains (all training data is general desktop/mobile/web)

License

Apache 2.0 (same as base model)
