URL: https://huggingface.co/microsoft/GUI-Actor-Verifier-2B

GUI-Actor-Verifier-2B

This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on UI-TARS-2B-SFT and predicts whether a proposed action position is correct for a given language instruction. The verifier pairs well with GUI-Actor, whose attention map yields diverse candidate positions for verification from a single inference pass.
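The verification step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `pick_best_candidate` and `score_fn` are hypothetical names, and keeping the highest-scoring candidate is an assumed selection rule that may differ from what the paper and repo do.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) in pixels


def pick_best_candidate(
    candidates: Sequence[Point],
    score_fn: Callable[[Point], float],
) -> Tuple[Point, float]:
    """Return the candidate with the highest verifier score, plus that score.

    `score_fn` is a hypothetical callback. In practice it would mark the
    point on the screenshot, run the verifier, and return a score.
    """
    if not candidates:
        raise ValueError("need at least one candidate point")
    scored = [(point, score_fn(point)) for point in candidates]
    return max(scored, key=lambda item: item[1])


# Toy stand-in scorer (an assumption for illustration only):
# it prefers points close to (100, 50).
def toy_score(point: Point) -> float:
    return -abs(point[0] - 100) - abs(point[1] - 50)


best_point, best_score = pick_best_candidate(
    [(10, 10), (98, 52), (300, 40)], toy_score
)
print(best_point, best_score)
```

In a real setup, `score_fn` would wrap a call to the verifier for each candidate produced by GUI-Actor.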

For more details on model design and evaluation, please check: 🏠 Project Page | 💻 GitHub Repo | 📑 Paper.

| Model List | Hugging Face Link |
|---|---|
| GUI-Actor-7B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-2B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-7B-Qwen2.5-VL | 🤗 Hugging Face |
| GUI-Actor-3B-Qwen2.5-VL | 🤗 Hugging Face |
| GUI-Actor-Verifier-2B | 🤗 Hugging Face |

📊 Performance Comparison on GUI Grounding Benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| 72B models: | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| 7B models: | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| 2B models: | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| 7B models: | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| 3B models: | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |

🚀 Usage

The verifier takes two inputs: a language instruction and a screenshot with a hollow red circle marking the candidate position. It outputs either ‘True’ or ‘False’; the probability of each label can also be used to score the sample. An example is shown below.

For more detailed usage, please refer to our GitHub repo.

πŸ‘ image
import os
import re

import numpy as np
import torch
from PIL import Image, ImageDraw
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration

# Load the verifier model (this configuration needs a CUDA GPU and flash-attn)
model_name_or_path = "microsoft/GUI-Actor-Verifier-2B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name_or_path,
    device_map="cuda:0",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).eval()
output_len = 1  # generate one new token, which this example expects to be True or False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name_or_path)

def draw_annotations(img, point_in_pixel, bbox, output_path='test.png', color='red', size=1):
    # Note: draws on `img` in place; `output_path` is accepted but not used here
    draw = ImageDraw.Draw(img)

    # Draw the optional ground-truth bounding box in yellow
    if bbox:
        # Assuming bbox format is [x1, y1, x2, y2]
        draw.rectangle(bbox, outline="yellow", width=4)

    # Draw a hollow circle around the predicted point (red by default)
    if point_in_pixel:
        # Bounding box of the circle; the radius is 8 pixels scaled by `size`
        radius = int(np.ceil(8 * size))
        circle_bbox = [
            point_in_pixel[0] - radius,  # x1
            point_in_pixel[1] - radius,  # y1
            point_in_pixel[0] + radius,  # x2
            point_in_pixel[1] + radius,  # y2
        ]
        draw.ellipse(circle_bbox, outline=color, width=int(np.ceil(4 * size)))

    return img

def ground_only_positive(model, tokenizer, processor, instruction, image, point):
    # Accept either a file path or an already loaded PIL image
    if isinstance(image, str):
        assert os.path.isfile(image), "Invalid input image path."
        image = Image.open(image)

    # Mark the candidate point; the circle size scales with the image height
    width, height = image.size
    image = draw_annotations(image, point, None, output_path=None, size=height / 1000 * 1.2)

    # The prompt is kept verbatim (typos included), since it presumably matches the wording used in training
    prompt_origin = "Please observe the screenshot and exame whether the hollow red circle accurately placed on the intended position in the image: '{}'. Answer True or False."
    full_prompt = prompt_origin.format(instruction)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": full_prompt},
            ],
        }
    ]
    # Preparation for inference
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text_input],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    # Greedy decoding; temperature is omitted because it only applies when sampling
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=output_len,
        do_sample=False,
    )

    # Strip the prompt tokens so that only the newly generated tokens remain
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    print(response)
    # Take the last True/False found in the response, if any
    matches = re.findall(r'\b(?:True|False)\b', response)
    if not matches:
        answer = 'Error Format'
    else:
        answer = matches[-1]
    return answer

# Given the image, the instruction, and the candidate coordinate
instruction = 'close this window'
image = Image.open('test.png')
width, height = image.size
point = [int(0.9709 * width), int(0.1548 * height)]  # the point must be in pixels
answer = ground_only_positive(model, tokenizer, processor, instruction, image, point)  # 'True' or 'False'
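
The usage note mentions scoring a sample by label probability. One way to do that, sketched here under assumptions, is to take the logits of the ‘True’ and ‘False’ tokens at the first generated position and normalize them with a two-way softmax. The helper `true_probability` is a hypothetical name, not part of the repo. Extracting the two logits is left out (for example, `model.generate` can return per-step scores with `output_scores=True` and `return_dict_in_generate=True`), and the sketch assumes each label maps to a single token.

```python
import math


def true_probability(logit_true: float, logit_false: float) -> float:
    """Two-way softmax over the label logits, returning P('True').

    Both arguments are assumed to be the raw logits of the 'True' and
    'False' tokens at the first generated position.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_true, logit_false)
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


# Made-up logits for illustration only.
score = true_probability(2.0, 0.5)
print(f"P(True) = {score:.3f}")
```

A score like this can rank several candidate points for the same instruction instead of relying on the hard True/False answer.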

πŸ“ Citation

@article{wu2025gui,
 title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
 author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
 journal={arXiv preprint arXiv:2506.03143},
 year={2025}
}