
URL: https://huggingface.co/datasets/benchflow/skillsbench


Warning: The leaderboard above is generated by Hugging Face eval-results and may be incomplete until evaluation_framework: benchflow is accepted and deployed. The audited SkillsBench v1.1 result archive is https://huggingface.co/datasets/benchflow/skillsbench-leaderboard, with all retained submissions normalized under submissions/skillsbench/v1.1/ and compact official exports under leaderboard/skillsbench/v1.1/.

Warning: The dataset is a read-only mirror. The primary source for this benchmark is on GitHub: https://github.com/benchflow-ai/skillsbench. Open issues and pull requests there.

SkillsBench v1.1

The first benchmark for evaluating how well AI agents use skills.

What is SkillsBench?

SkillsBench measures how effectively agents leverage Skills — modular folders of instructions, scripts, and resources — to perform specialized workflows. We evaluate both skill effectiveness and agent behavior through gym-style benchmarking.

The headline metric is the paired skill lift: the score of an agent/model configuration run with task Skills minus the score of the same configuration run without them.

Goals

  • Build the broadest, highest-quality benchmark for agent skills.
  • Design tasks requiring skill composition (2+ skills) with SOTA performance below 50%.
  • Track frontier models: GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, GLM 5.1, Kimi K2.6, MiniMax M3.

This v1.1 release contains 87 default runnable task packages. A further 14 credential-dependent or integration-incompatible packages live under tasks-extra/ and are excluded from the default benchmark set.

Quick Start

git clone https://github.com/benchflow-ai/skillsbench.git
cd skillsbench

# Install the BenchFlow CLI line supported by SkillsBench v1.1.
uv tool install "benchflow>=0.6.2,<0.7"

# Install repository tooling from the committed lockfile.
uv sync --locked

# Create and validate a native task.md task. The oracle must pass before agent runs.
bench tasks init <task-id>
bench tasks check tasks/<task-id>
bench eval create --tasks-dir tasks/<task-id> --agent oracle --sandbox docker

Run an agent with and without Skills to produce the paired comparison:

bench eval create --tasks-dir tasks/<task-id> --agent <agent> --model <model> \
 --skill-mode with-skill --skills-dir tasks/<task-id>/environment/skills

bench eval create --tasks-dir tasks/<task-id> --agent <agent> --model <model> \
 --skill-mode no-skill

Running agents requires API keys (export ANTHROPIC_API_KEY=..., export OPENAI_API_KEY=..., etc.). For a step-by-step workflow, open experiments/run_experiment.ipynb.
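When scripting many paired runs, it can help to build both command lines from one place so the two arms differ only in the skill flags. The sketch below is a minimal illustration that reuses only the flags shown above; the function name `paired_commands` and the placeholder values `my-task`, `my-agent`, and `my-model` are hypothetical, and the commands are printed rather than executed.

```python
import shlex


def paired_commands(task_id, agent, model):
    """Return the (with-skill, no-skill) argv lists for one task.

    Only flags that appear in the Quick Start commands are used here.
    """
    base = [
        "bench", "eval", "create",
        "--tasks-dir", f"tasks/{task_id}",
        "--agent", agent,
        "--model", model,
    ]
    with_skill = base + [
        "--skill-mode", "with-skill",
        "--skills-dir", f"tasks/{task_id}/environment/skills",
    ]
    no_skill = base + ["--skill-mode", "no-skill"]
    return with_skill, no_skill


# Placeholder values for illustration only (assumptions, not real IDs).
with_cmd, without_cmd = paired_commands("my-task", "my-agent", "my-model")
print(shlex.join(with_cmd))
print(shlex.join(without_cmd))
```

Each list can be passed to `subprocess.run(..., check=True)` once the required API keys are exported.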

Task Structure

SkillsBench tasks are native BenchFlow task.md packages:

tasks/<task-id>/
  task.md               # YAML frontmatter + human-written prompt body
  environment/
    Dockerfile          # container setup
    skills/             # generalizable, reusable Skills (not task-specific)
  oracle/
    solve.sh            # human-written; derives answers through computation
  verifier/
    test.sh             # pytest runner, writes reward.txt
    test_outputs.py     # outcome-based assertions

Each task's metadata validates against the SkillsBench taxonomy — category, subcategory, task_type, modality, interface, and skill_type. See CONTRIBUTING.md and taxonomy.md for the full structure, controlled vocabulary, and review checklist.
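As a quick local sanity check before running `bench tasks check`, the package layout can be verified with a few lines of Python. This is only a sketch: the nesting of files under `environment/`, `oracle/`, and `verifier/` is an assumption read from the listing above, the helper name `missing_files` is hypothetical, and `skills/` is left out because it is a directory rather than a file. It does not replace the official validator.

```python
from pathlib import Path

# Assumed layout, inferred from the task package listing (not an official schema).
EXPECTED_FILES = (
    "task.md",
    "environment/Dockerfile",
    "oracle/solve.sh",
    "verifier/test.sh",
    "verifier/test_outputs.py",
)


def missing_files(task_dir):
    """Return the expected relative paths that are not regular files under task_dir."""
    root = Path(task_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("tasks/<task-id>")` returns an empty list when every expected file is present.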

On this Hugging Face mirror, each active task package is flattened from tasks/<task-id>/ to <task-id>/ at the repository root.
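The flattening rule amounts to dropping the leading `tasks/` component of each path. A minimal sketch, assuming POSIX-style repository paths; the helper name `mirror_path` is hypothetical, and paths outside `tasks/` (for example under `tasks-extra/`) are rejected here because the source only describes active task packages as mirrored.

```python
from pathlib import PurePosixPath


def mirror_path(repo_path):
    """Map a GitHub path like 'tasks/<task-id>/...' to its mirror path '<task-id>/...'."""
    parts = PurePosixPath(repo_path).parts
    if len(parts) < 2 or parts[0] != "tasks":
        raise ValueError(f"not an active task path: {repo_path!r}")
    # Drop the leading 'tasks' component; everything else is unchanged.
    return str(PurePosixPath(*parts[1:]))


print(mirror_path("tasks/my-task/environment/Dockerfile"))
# my-task/environment/Dockerfile
```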

Leaderboard and Results

Audited trajectories and selected result artifacts live in benchflow/skillsbench-leaderboard. The current v1.1 board spans 21 models across 8 families (24 model×harness configurations), recomputed 2026-06-16 on the 87-task v1.1 roster; the full interactive board is at skillsbench.ai.

Current public v1.1 exports:

Official scores are computed from reviewed non-timeout trials:

  • with_skills_score = 100 * mean(reward | skill_mode = with-skill)
  • without_skills_score = 100 * mean(reward | skill_mode = no-skill)
  • skill_lift = with_skills_score - without_skills_score
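The three formulas above can be sketched in a few lines of Python. The record fields (`reward`, `skill_mode`, `reviewed`, `timed_out`) are hypothetical names chosen for illustration, not the schema of the published exports, and the sketch assumes rewards are pooled across all eligible trials in each arm rather than averaged per task first.

```python
from statistics import mean


def paired_scores(trials):
    """Return (with_skills_score, without_skills_score, skill_lift) on a 0-100 scale."""
    # Keep only reviewed, non-timeout trials, as the official scoring describes.
    eligible = [t for t in trials if t["reviewed"] and not t["timed_out"]]

    def score(mode):
        rewards = [t["reward"] for t in eligible if t["skill_mode"] == mode]
        if not rewards:
            raise ValueError(f"no eligible trials for skill_mode={mode!r}")
        return 100 * mean(rewards)

    with_skills = score("with-skill")
    without_skills = score("no-skill")
    return with_skills, without_skills, with_skills - without_skills


# Made-up trial records for illustration only.
example_trials = [
    {"reward": 1.0, "skill_mode": "with-skill", "reviewed": True, "timed_out": False},
    {"reward": 0.0, "skill_mode": "with-skill", "reviewed": True, "timed_out": False},
    {"reward": 1.0, "skill_mode": "with-skill", "reviewed": True, "timed_out": True},  # excluded: timeout
    {"reward": 0.0, "skill_mode": "no-skill", "reviewed": True, "timed_out": False},
    {"reward": 1.0, "skill_mode": "no-skill", "reviewed": True, "timed_out": False},
    {"reward": 0.0, "skill_mode": "no-skill", "reviewed": True, "timed_out": False},
    {"reward": 0.0, "skill_mode": "no-skill", "reviewed": True, "timed_out": False},
]
print(paired_scores(example_trials))  # (50.0, 25.0, 25.0)
```

In this toy data the timed-out with-skill trial is dropped before averaging, so the with-skill arm scores 50 rather than about 66.7.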

Benchmark Registration

This dataset ships a root eval.yaml for Hugging Face benchmark registration. The associated framework-key PR is huggingface/huggingface.js#2139; the leaderboard widget activates once evaluation_framework: benchflow is accepted and deployed.

Get Involved

Citing SkillsBench

@misc{skillsbench_2026,
 title={SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks},
 author={Xiangyi Li and Wenbo Chen and Yimin Liu and Shenghan Zheng and Xiaokun Chen and Yifeng He and Yubo Li and Bingran You and Haotian Shen and Han-chung Lee},
 year={2026},
 eprint={2602.12670},
 archivePrefix={arXiv},
 primaryClass={cs.AI},
 url={https://arxiv.org/abs/2602.12670}
}