Kodeseer-9B
A LoRA fine-tuned Qwen3.5-9B model for GUI element grounding: predicting the (x, y) coordinates of UI elements from screenshots given natural-language instructions.
Results
| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |
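This card does not include its evaluation code. ScreenSpot-style benchmarks are commonly scored by counting a prediction as correct when the predicted click point falls inside the target element's bounding box. The sketch below illustrates that metric under that assumption; `point_in_box` and `grounding_accuracy` are hypothetical helper names, and the `(left, top, right, bottom)` box layout is an assumption, not something taken from this model's repository.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
# Assumed box layout: (left, top, right, bottom), in the same units as the point.
Box = Tuple[float, float, float, float]


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(
    predictions: Sequence[Optional[Point]], boxes: Sequence[Box]
) -> float:
    """Fraction of samples whose predicted point hits the target box.

    A prediction of None (for example, an unparseable model response)
    counts as a miss.
    """
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(
        1
        for point, box in zip(predictions, boxes)
        if point is not None and point_in_box(point, box)
    )
    return hits / len(boxes)
```

Predictions and boxes must be in the same coordinate space before comparison, so a point in the model's [0, 1000] range needs converting to pixels first if the boxes are given in pixels.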
ScreenSpot-V2 Breakdown
| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |
ScreenSpot-Pro Full Breakdown (1581 samples)
| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| Overall | 65.0% | | |
Comparison with State-of-the-Art
ScreenSpot-V2
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |
ScreenSpot-Pro
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |
*MVP is a training-free inference trick
ScreenSpot Original
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |
Usage
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model and processor, then attach the LoRA adapter checkpoint.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

# Build the prompt text with thinking disabled, then tokenize text and image together.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Greedy decoding; the answer is short, so 64 new tokens is enough.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and keep special tokens so the box markers survive decoding.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
Coordinate Format
The model predicts click coordinates in <|box_start|>(x,y)<|box_end|> format where x and y are in [0, 1000] range. To convert to pixel coordinates:
pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
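To apply this conversion to a raw model response, the coordinates first have to be extracted from the `<|box_start|>(x,y)<|box_end|>` string. The following is a minimal sketch of one way to do that; `parse_click` is a hypothetical helper rather than part of the model's repository, and the tolerance for whitespace and decimals, as well as the final clamp to the image bounds, are assumptions added for robustness.

```python
import re
from typing import Optional, Tuple

# Matches the documented format; whitespace and decimal values are
# tolerated as an assumption about possible output variations.
_POINT_RE = re.compile(
    r"<\|box_start\|>\s*\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)\s*<\|box_end\|>"
)


def parse_click(
    response: str, image_width: int, image_height: int
) -> Optional[Tuple[int, int]]:
    """Return the first predicted point as pixel (x, y), or None if no point is found."""
    match = _POINT_RE.search(response)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    # Same conversion as above: normalized [0, 1000] range to pixels.
    pixel_x = int(x / 1000 * image_width)
    pixel_y = int(y / 1000 * image_height)
    # Clamp so that a value of exactly 1000 stays inside the image (added safeguard).
    return min(pixel_x, image_width - 1), min(pixel_y, image_height - 1)


print(parse_click("<|box_start|>(512,340)<|box_end|>", 1920, 1080))
# (983, 367)
```

Returning `None` for an unparseable response lets the caller decide whether to retry or count the sample as a miss.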
Limitations
- Trained on English instructions only
- Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
- SFT-only: no RL/GRPO applied yet (further gains expected)
- No training data from professional software domains (all training data is general desktop/mobile/web)
License
Apache 2.0 (same as base model)
