WebWorld ๐
๐ License
๐ GitHub
๐ Dataset
๐ MS Dataset
๐ 8B
๐ MS 8B
๐ 14B
๐ MS 14B
๐ 32B
๐ MS 32B
๐ Introduction
WebWorld is a large-scale open-web world model series for training and evaluating web agents. It is trained on 1M+ real-world web interaction trajectories via a scalable hierarchical data pipeline, supporting:
- Long-horizon simulation (30+ steps)
- Multi-format state representations: A11y Tree, HTML, XML, Markdown, and natural language
- CoT-activated reasoning for transition prediction
- Cross-domain generalization to code, GUI, and game environments
Agents trained on WebWorld-synthesized trajectories achieve +9.9% on MiniWob++ and +10.9% on WebArena. When used for inference-time lookahead search, WebWorld outperforms GPT-5 as a world model.
๐ฏ Model Series
| Model | Base Model | HuggingFace Link | ModelScope Link |
|---|---|---|---|
| WebWorld-8B | Qwen3-8B | ๐ค HuggingFace | ๐ค ModelScope |
| WebWorld-14B | Qwen3-14B | ๐ค HuggingFace | ๐ค ModelScope |
| WebWorld-32B | Qwen3-32B | ๐ค HuggingFace | ๐ค ModelScope |
WebWorldData: Huggingface: Qwen/WebWorldData, ModelScope: Qwen/WebWorldData
๐ก Recommendation: Use 8B for fast simulation and data synthesis; use 14B/32B for higher-fidelity simulation and better long-horizon robustness. For best results in a specific environment, we recommend task-specific fine-tuning on in-domain trajectories.
๐ ๏ธ Requirements
transformers(recommended: latest version)torch- Optional:
accelerate,vllmfor efficient serving
๐ Quick Start
Key Notes:
- WebWorld predicts the next page state given the current state and an action.
- It strictly preserves the input/output format (A11y / HTML / XML / Markdown / NL).
- Supports multi-turn trajectory simulation up to 30+ steps.
Single-Step Prediction
Multi-Turn Simulation
The first turn provides the initial state and first action. Each subsequent turn uses a fixed continuation prompt:
๐ฎ Action Space
WebWorld supports a unified action space as Python-style function calls:
| Category | Action | Description |
|---|---|---|
| Element | click(bid, button, modifiers) |
Click a DOM element by its ID |
fill(bid, text, press_enter) |
Type text into an input field | |
select_option(bid, options) |
Select from a dropdown / combobox | |
hover(bid) |
Hover over an element | |
| Mouse | mouse_move(x, y) |
Move cursor to coordinates |
mouse_click(x, y, button) |
Click at coordinates | |
mouse_down(x, y) / mouse_up(x, y) |
Press / release (drag-and-drop) | |
| Keyboard | keyboard_press(key) |
Press a key (e.g., Enter, Tab) |
keyboard_type(text) |
Type a string sequentially | |
| Browser | scroll(dx, dy) |
Scroll the viewport |
goto(url) |
Navigate to a URL | |
go_back() / go_forward() |
Browser history navigation | |
tab_new() / tab_close() / tab_focus(index) |
Manage browser tabs | |
| Meta | send_msg_to_user(text) |
Send a message to the user |
noop(wait_ms) |
Wait for a duration | |
infeasible(reason) |
Declare the task impossible |
๐ Performance
Intrinsic Evaluation (WebWorld-Bench)
WebWorld-Bench evaluates models using Factuality Score (functional correctness) and Web Turing Score (perceptual realism) across nine dimensions:
| Model | Avg Factuality | Avg Turing |
|---|---|---|
| GPT-4o | 59.5 | 35.4 |
| Claude-Opus-4.1 | 71.3 | 47.4 |
| Gemini-3-Pro | 70.3 | 43.2 |
| Qwen3-8B (base) | 26.9 | 17.4 |
| WebWorld-8B | 70.1 | 42.2 |
| WebWorld-14B | 70.7 | 44.7 |
| WebWorld-32B | 71.0 | 45.6 |
Extrinsic Evaluation (Agent Training)
| Model | MiniWob++ SR | WebArena SR |
|---|---|---|
| GPT-4o | 64.3% | 26.6% |
| Qwen3-8B (base) | 49.4% | 9.8% |
| Qwen3-8B + WebWorld | 59.3% (+9.9%) | 20.7% (+10.9%) |
| Qwen3-14B (base) | 54.9% | 15.1% |
| Qwen3-14B + WebWorld | 63.2% (+8.3%) | 24.3% (+9.2%) |
Cross-Domain Generalization
| Environment | Qwen3-8B | WebWorld-8B | Gain |
|---|---|---|---|
| API Services | 0.088 | 0.299 | +0.211 |
| Code | 0.147 | 0.396 | +0.249 |
| Game | 0.253 | 0.473 | +0.220 |
| GUI Desktop | 0.322 | 0.705 | +0.383 |
โ ๏ธ Limitations
- Sycophancy / optimism bias: the model may generate outcomes that are overly favorable to the agent's intended action.
- Content generation fidelity: long-form, high-precision content (e.g., scientific articles) is not the primary target.
- Text-only: WebWorld does not simulate visual / pixel-level rendering.
๐ Citation
@misc{xiao2026webworldlargescaleworldmodel,
title={WebWorld: A Large-Scale World Model for Web Agent Training},
author={Zikai Xiao and Jianhong Tu and Chuhang Zou and Yuxin Zuo and Zhi Li and Peng Wang and Bowen Yu and Fei Huang and Junyang Lin and Zuozhu Liu},
year={2026},
eprint={2602.14721},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.14721},
}
- Downloads last month
- 1,742
Model tree for Qwen/WebWorld-8B
Base model
Qwen/Qwen3-8B-Base