Split	Examples	Description
`general`	161	Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.)
`multimodal`	101	Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.)
`multi_turn`	38	Multi-turn conversational tasks where the agent interacts with a simulated user persona to clarify needs and provide advice
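
The split sizes above sum to 300 examples. As a minimal sketch, a loaded copy can be checked against the table like this. The helper name `check_split_sizes` is hypothetical and not part of the dataset's tooling. It accepts any mapping from split name to a sized collection, which includes the `DatasetDict` returned by `load_dataset`.

```python
# Sizes copied from the Splits table above.
EXPECTED_SIZES = {"general": 161, "multimodal": 101, "multi_turn": 38}


def check_split_sizes(splits, expected=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs.

    `splits` is any mapping from split name to a sized collection.
    A split that is missing entirely is reported with an actual size of None.
    (Hypothetical helper, written for illustration only.)
    """
    mismatches = {}
    for name, want in expected.items():
        got = len(splits[name]) if name in splits else None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches
```

An empty result means every split matches the table. A non-empty result may simply mean the dataset has been updated since this card was written.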

Fields

Field	Type	Description
`task_id`	string	Unique task identifier
`query`	string	Task instruction / description
`fixture`	list[string]	Fixture files required for the task (available in `data/fixtures.tar.gz`)
`language`	string	Task language (`en` or `zh`)
`category`	string	Task domain
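
The fields above make it easy to slice the benchmark by language or domain. The sketch below assumes each record is a plain dict with the listed keys, which is what iterating over a loaded split yields. The helper names and the sample values (`demo-001`, `finance`, and so on) are hypothetical and only illustrate the schema; the real category strings may differ.

```python
from collections import Counter


def select_tasks(records, language=None, category=None):
    """Keep records that match the given language and/or category.

    A filter left as None is not applied.
    (Hypothetical helper, written for illustration only.)
    """
    return [
        r for r in records
        if (language is None or r["language"] == language)
        and (category is None or r["category"] == category)
    ]


def count_by_category(records):
    """Count how many records fall into each `category` value."""
    return Counter(r["category"] for r in records)


# Hand-made records that mirror the schema; not real dataset rows.
sample = [
    {"task_id": "demo-001", "query": "...", "fixture": [],
     "language": "en", "category": "finance"},
    {"task_id": "demo-002", "query": "...", "fixture": ["report.pdf"],
     "language": "zh", "category": "finance"},
    {"task_id": "demo-003", "query": "...", "fixture": [],
     "language": "en", "category": "ops"},
]

english_finance = select_tasks(sample, language="en", category="finance")
```

For large splits, the `filter` method of a `datasets.Dataset` does the same job without materializing a Python list.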

Usage

from datasets import load_dataset

# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")

# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")
multi_turn = load_dataset("claw-eval/Claw-Eval", split="multi_turn")

# Inspect a sample
print(general[0])
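
Tasks that list entries in `fixture` also need the files from `data/fixtures.tar.gz`. The following is a rough sketch of unpacking that archive and checking that a task's fixtures are present. It assumes the archive has already been downloaded to a local path, and that each `fixture` entry is a path relative to the extraction root. Neither the archive layout nor that path convention is confirmed by this card, and both helper names are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_fixtures(archive_path, dest_dir):
    """Unpack a .tar.gz fixture archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


def missing_fixtures(record, root):
    """List the fixture entries of a record that do not exist under root.

    ASSUMPTION: fixture entries are paths relative to the extraction root.
    """
    root = Path(root)
    return [name for name in record["fixture"] if not (root / name).exists()]
```

One way to obtain the archive is `hf_hub_download(repo_id="claw-eval/Claw-Eval", filename="data/fixtures.tar.gz", repo_type="dataset")` from `huggingface_hub`, which returns the local path of the downloaded file.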

Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.

Citation

If you use Claw-Eval in your research, please cite:

@article{ye2026claw,
 title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
 author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
 journal={arXiv preprint arXiv:2604.06132},
 year={2026}
}

Core Contributors

Bowen Ye (PKU), Rang Li (PKU), Qibin Yang (PKU), Zhihui Xie (HKU), Yuanxin Liu (PKU), Linli Yao (PKU), Hanglong Lyu (PKU), Lei Li (HKU, project lead)

Advisors

Tong Yang (PKU), Zhifang Sui (PKU), Lingpeng Kong (HKU), Qi Liu (HKU)

Contribution

We welcome any kind of contribution. Let us know if you have any suggestions!

License

This dataset is released under the MIT License.

Paper: arXiv:2604.06132 (published Apr 7)

URL: https://huggingface.co/datasets/claw-eval/Claw-Eval

Claw-Eval

Dataset Structure

Splits