
URL: https://huggingface.co/datasets/claw-eval/Claw-Eval

Leaderboard (Official Benchmark)

| # | Model | Score |
|---|-------|-------|
| 6 | deepseek-ai/DeepSeek-V4-Pro | 58.4 * |
| 8 | deepseek-ai/DeepSeek-V4-Flash | 57.8 * |

Claw-Eval

End-to-end transparent benchmark for AI agents acting in the real world.

Paper | Leaderboard | Code


Dataset Structure

Splits

| Split | Examples | Description |
|-------|----------|-------------|
| general | 161 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| multimodal | 101 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |
| multi_turn | 38 | Multi-turn conversational tasks where the agent interacts with a simulated user persona to clarify needs and provide advice |

Fields

| Field | Type | Description |
|-------|------|-------------|
| task_id | string | Unique task identifier |
| query | string | Task instruction / description |
| fixture | list[string] | Fixture files required for the task (available in data/fixtures.tar.gz) |
| language | string | Task language (en or zh) |
| category | string | Task domain |
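
As a rough illustration of the schema above, the sketch below checks that a record carries the five listed fields with the expected types and tallies records by category. The helper names (`validate_record`, `count_by_category`) and the sample rows are hypothetical and are not part of the dataset or its tooling; real task IDs, categories, and fixture paths may look different.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping

# Field names and Python types corresponding to the Fields table.
EXPECTED_FIELDS = {
    "task_id": str,
    "query": str,
    "fixture": list,
    "language": str,
    "category": str,
}
ALLOWED_LANGUAGES = {"en", "zh"}


def validate_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one record; empty means none found."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    fixture = record.get("fixture")
    if isinstance(fixture, list) and not all(isinstance(f, str) for f in fixture):
        problems.append("fixture should contain only strings")
    language = record.get("language")
    if isinstance(language, str) and language not in ALLOWED_LANGUAGES:
        problems.append(f"unexpected language: {language}")
    return problems


def count_by_category(records: Iterable[Mapping[str, Any]]) -> Counter:
    """Count how many records fall into each category."""
    return Counter(record["category"] for record in records)


# Hypothetical rows, invented for illustration only.
SAMPLE_ROWS = [
    {"task_id": "demo-001", "query": "Summarize the invoice.",
     "fixture": ["demo/invoice.pdf"], "language": "en", "category": "finance"},
    {"task_id": "demo-002", "query": "Reconcile the ledger.",
     "fixture": [], "language": "zh", "category": "finance"},
    {"task_id": "demo-003", "query": "Draft a reply email.",
     "fixture": [], "language": "en", "category": "communication"},
]

if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        print(row["task_id"], validate_record(row))
    print(count_by_category(SAMPLE_ROWS))
```

Both helpers only index rows by field name, so they should also accept rows from a loaded split, assuming those rows behave like dictionaries.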

Usage

from datasets import load_dataset

# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")

# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")
multi_turn = load_dataset("claw-eval/Claw-Eval", split="multi_turn")

# Inspect a sample
print(general[0])
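
The `fixture` field lists files that ship in data/fixtures.tar.gz. The sketch below shows one possible way to pull a task's fixtures out of a local copy of that archive. The helper `extract_fixtures` is hypothetical, and the internal layout of the archive is an assumption: members are matched by path suffix, which may need adjusting once the real layout is known.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def extract_fixtures(archive_path, fixture_names: Iterable[str], dest_dir) -> List[Path]:
    """Copy the archive members matching `fixture_names` into `dest_dir`.

    Assumption: a member matches when its path equals a fixture name or
    ends with "/<fixture name>". Returns the paths that were written.
    """
    dest = Path(dest_dir)
    names = list(fixture_names)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            for name in names:
                relative = Path(name)
                # Skip names that could escape the destination directory.
                if relative.is_absolute() or ".." in relative.parts:
                    continue
                if member.name == name or member.name.endswith("/" + name):
                    source = archive.extractfile(member)
                    if source is None:
                        continue
                    target = dest / relative
                    target.parent.mkdir(parents=True, exist_ok=True)
                    with source:
                        target.write_bytes(source.read())
                    extracted.append(target)
                    break
    return extracted
```

A local copy of the archive can likely be obtained with `hf_hub_download("claw-eval/Claw-Eval", "data/fixtures.tar.gz", repo_type="dataset")` from the `huggingface_hub` package, after which something like `extract_fixtures(path, general[0]["fixture"], "fixtures_out")` would write that task's files under `fixtures_out`.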

Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.

Citation

If you use Claw-Eval in your research, please cite:

@article{ye2026claw,
 title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
 author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
 journal={arXiv preprint arXiv:2604.06132},
 year={2026}
}

Core Contributors

Bowen Ye (PKU), Rang Li (PKU), Qibin Yang (PKU), Zhihui Xie (HKU), Yuanxin Liu (PKU), Linli Yao (PKU), Hanglong Lyu (PKU), Lei Li (HKU, project lead)

Advisors:

Tong Yang (PKU), Zhifang Sui (PKU), Lingpeng Kong (HKU), Qi Liu (HKU)

Contribution

We welcome any kind of contribution. Let us know if you have any suggestions!

License

This dataset is released under the MIT License.
