Claw-Eval
End-to-end transparent benchmark for AI agents acting in the real world.
Dataset Structure
Splits
| Split | Examples | Description |
|---|---|---|
| general | 161 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| multimodal | 101 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |
| multi_turn | 38 | Multi-turn conversational tasks where the agent interacts with a simulated user persona to clarify needs and provide advice |
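As a quick sanity check after downloading, the split sizes listed above can be compared against what was actually loaded. The sketch below is only illustrative: `EXPECTED_SPLIT_SIZES` and `check_split_sizes` are hypothetical names, not part of Claw-Eval, and the numbers are copied from the table.

```python
# Split sizes as listed in the Splits table.
EXPECTED_SPLIT_SIZES = {"general": 161, "multimodal": 101, "multi_turn": 38}

# Total number of tasks across all three splits: 161 + 101 + 38.
TOTAL_TASKS = sum(EXPECTED_SPLIT_SIZES.values())


def check_split_sizes(actual_sizes):
    """Return {split: (expected, actual)} for every split whose size differs.

    `actual_sizes` maps split name to row count; a missing split is
    reported with an actual value of None.
    """
    mismatches = {}
    for split, expected in EXPECTED_SPLIT_SIZES.items():
        actual = actual_sizes.get(split)
        if actual != expected:
            mismatches[split] = (expected, actual)
    return mismatches
```

With the `dataset` object from the Usage section, a call such as `check_split_sizes({name: len(ds) for name, ds in dataset.items()})` should return an empty dict if the downloaded data matches the table.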
Fields
| Field | Type | Description |
|---|---|---|
| task_id | string | Unique task identifier |
| query | string | Task instruction / description |
| fixture | list[string] | Fixture files required for the task (available in data/fixtures.tar.gz) |
| language | string | Task language (en or zh) |
| category | string | Task domain |
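The field table can be mirrored as a typed record, which may help when writing tooling around the dataset. This is a minimal sketch under the assumption that each row arrives as a plain dict with exactly these keys; `ClawEvalTask`, `ALLOWED_LANGUAGES`, and `validate_task` are hypothetical names introduced here for illustration.

```python
from typing import List, TypedDict


class ClawEvalTask(TypedDict):
    """One task row, following the Fields table (hypothetical type name)."""
    task_id: str        # unique task identifier
    query: str          # task instruction / description
    fixture: List[str]  # fixture files required for the task
    language: str       # "en" or "zh"
    category: str       # task domain


ALLOWED_LANGUAGES = {"en", "zh"}


def validate_task(row: dict) -> List[str]:
    """Return a list of human-readable problems; empty if the row looks valid."""
    problems = []
    for field in ("task_id", "query", "language", "category"):
        if not isinstance(row.get(field), str):
            problems.append(f"{field} should be a string")
    fixture = row.get("fixture")
    if not isinstance(fixture, list) or not all(isinstance(f, str) for f in fixture):
        problems.append("fixture should be a list of strings")
    if row.get("language") not in ALLOWED_LANGUAGES:
        problems.append("language should be 'en' or 'zh'")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a row at once.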
Usage
from datasets import load_dataset
# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")
# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")
multi_turn = load_dataset("claw-eval/Claw-Eval", split="multi_turn")
# Inspect a sample
print(general[0])
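Once a split is loaded, a common next step is to narrow it by language or category, or to see how tasks are distributed. The helpers below are a sketch, not part of Claw-Eval: they assume only that each row is a dict with the `language` and `category` fields described above, and the sample rows (including their category strings) are made up for illustration.

```python
from collections import Counter


def filter_tasks(rows, language=None, category=None):
    """Keep rows matching the given language and/or category (None = no filter)."""
    return [
        row for row in rows
        if (language is None or row["language"] == language)
        and (category is None or row["category"] == category)
    ]


def count_by_category(rows):
    """Count how many rows fall into each category."""
    return Counter(row["category"] for row in rows)


# Hypothetical sample rows standing in for a loaded split.
sample_rows = [
    {"task_id": "a", "language": "en", "category": "finance"},
    {"task_id": "b", "language": "zh", "category": "finance"},
    {"task_id": "c", "language": "en", "category": "ops"},
]

english_finance = filter_tasks(sample_rows, language="en", category="finance")
per_category = count_by_category(sample_rows)
```

Because the helpers only iterate over rows, they should also accept a loaded split such as `general` directly, for example `count_by_category(general)`.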
Acknowledgements
Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.
Citation
If you use Claw-Eval in your research, please cite:
@article{ye2026claw,
title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
journal={arXiv preprint arXiv:2604.06132},
year={2026}
}
Core Contributors
Bowen Ye (PKU), Rang Li (PKU), Qibin Yang (PKU), Zhihui Xie (HKU), Yuanxin Liu (PKU), Linli Yao (PKU), Hanglong Lyu (PKU), Lei Li (HKU, project lead)
Advisors:
Tong Yang (PKU), Zhifang Sui (PKU), Lingpeng Kong (HKU), Qi Liu (HKU)
Contribution
We welcome any kind of contribution. Let us know if you have any suggestions!
License
This dataset is released under the MIT License.
