
URL: https://www.skillsbench.ai/

SkillsBench: Benchmarking How Well Agent Skills Work


New release · SkillsBench v1.1

SkillsBench: Benchmarking How Well Skills Work Across Diverse Tasks

The first evaluation framework that measures how well skills work, and the first dataset that measures how well models use skills on expert-curated tasks across diverse, high-GDP-value domains.


How SkillsBench Works

SkillsBench evaluates AI agents across three abstraction layers, mirroring how traditional computing systems are structured.

Skills: analogous to Applications
Agent Harness: analogous to Operating Systems
Models: analogous to CPUs

Skills Layer

Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.

Agent Harness Layer

The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.

Models Layer

The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.

Agent Performance

Resolution rate vs. mean agent wall-clock time per task (log scale, faster to the right). The no-Skills counterparts are ghosted for context, and dashed lines mark fleet means. OpenHands is the default baseline harness; configs run under another harness are labelled with it.

Capability Over Time

SkillsBench resolution rate vs. model release date, one dot per model–harness config. Newer models trend up and to the right. Release months are approximate editorial estimates (paper-reported where available). The dashed line marks the fleet mean; the emerald line is a least-squares fit. OpenHands is the default baseline harness; configs run under another harness are labelled with it.
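
A least-squares trend line of the kind this chart describes can be sketched as follows. This is a minimal illustration assuming ordinary least squares on (release-month index, resolution rate) pairs; the page does not say how its fit is computed, the function name `fit_line` is hypothetical, and the sample points are made up for a sanity check rather than taken from the benchmark.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x; zero means every x is the same.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Sum of cross-deviations between x and y.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1 (NOT benchmark data).
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

In a real plot, `xs` would be months since some reference date and `ys` the with-Skills resolution rates, one pair per model–harness config.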

Agent Leaderboard

Resolution rates across 24 agent–model configurations on SkillsBench (87 tasks, up to 3 trials per task).

dataset: skillsbench@1.1 (v1.1, 87 tasks, registry.json) · recomputed 2026-06-16

| # | Model | Harness | Without | With Skills | Δ | Gain (g) |
|---|-------|---------|---------|-------------|---|----------|
| 1 | GPT-5.5 | OpenHands | 51.5% | 67.3% | +15.8 | 32.6% |
| 2 | GPT-5.5 | Codex | 46.8% | 66.5% | +19.7 | 37.0% |
| 3 | Opus 4.7 | Claude Code | 43.0% | 61.2% | +18.2 | 31.9% |
| 4 | Gemini 3.1 Pro | Gemini CLI | 36.0% | 60.8% | +24.8 | 38.7% |
| 5 | GLM 5.1 | OpenHands | 32.7% | 58.4% | +25.7 | 38.1% |
| 6 | Gemini 3 Flash | Gemini CLI | 34.2% | 54.6% | +20.4 | 31.0% |
| 7 | Opus 4.8 | OpenHands | 45.7% | 54.1% | +8.4 | 15.5% |
| 8 | Kimi K2.6 | OpenHands | 33.4% | 54.0% | +20.6 | 31.0% |
| 9 | Opus 4.7 | OpenHands | 42.1% | 53.1% | +11.1 | 19.1% |
| 10 | MiniMax M3 | OpenHands | 29.7% | 53.0% | +23.3 | 33.2% |
| 11 | Gemini 3.1 Pro | OpenHands | 33.8% | 52.8% | +19.0 | 28.7% |
| 12 | GPT-5.2 | Codex | 29.7% | 51.7% | +22.0 | 31.3% |
| 13 | Opus 4.6 | Claude Code | 33.7% | 50.2% | +16.5 | 25.0% |
| 14 | DeepSeek V4 Pro | OpenHands | 26.9% | 50.1% | +23.2 | 31.8% |
| 15 | Opus 4.5 | Claude Code | 23.8% | 49.0% | +25.2 | 33.1% |
| 16 | Gemini 3.5 Flash | OpenHands | 41.1% | 48.2% | +7.1 | 12.1% |
| 17 | Sonnet 4.6 | OpenHands | 33.5% | 47.2% | +13.6 | 20.5% |
| 18 | DeepSeek V4 Flash | OpenHands | 27.5% | 44.7% | +17.2 | 23.7% |
| 19 | Grok 4.3 | OpenHands | 22.8% | 41.7% | +18.8 | 24.4% |
| 20 | GPT-5.4 Mini | OpenHands | 29.9% | 41.4% | +11.5 | 16.4% |
| 21 | Sonnet 4.5 | Claude Code | 16.7% | 36.2% | +19.5 | 23.4% |
| 22 | MiniMax M2.7 | OpenHands | 18.1% | 34.9% | +16.8 | 20.5% |
| 23 | Haiku 4.5 | Claude Code | 8.8% | 30.1% | +21.3 | 23.4% |
| 24 | Gemini 3.1 Flash Lite | OpenHands | 16.0% | 20.1% | +4.1 | 4.9% |
skillsbench@1.1 · 87 tasks · up to 3 trials per task · 95% CIs
Legend: Without, With Skills. Providers: OpenAI, Anthropic, Google, Z.ai, Moonshot, MiniMax, DeepSeek, xAI.
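
The Δ and Gain (g) columns can be reproduced from the two resolution rates. The sketch below assumes Δ is the absolute lift in percentage points and g is a normalized gain, that is, Δ divided by the remaining headroom (100 minus the without-Skills rate). This formula is inferred from the listed values rather than stated on the page, the function names are illustrative, and a mismatch in the last digit is possible on some rows because the page presumably computes from unrounded rates.

```python
def skill_delta(without: float, with_skills: float) -> float:
    """Absolute lift from Skills, in percentage points."""
    return with_skills - without


def normalized_gain(without: float, with_skills: float) -> float:
    """Lift as a percentage of the remaining headroom (assumed formula)."""
    headroom = 100.0 - without
    if headroom <= 0:
        raise ValueError("without-Skills rate must be below 100%")
    return 100.0 * (with_skills - without) / headroom


# Row 1 of the leaderboard: GPT-5.5 on OpenHands, 51.5% -> 67.3%.
delta = skill_delta(51.5, 67.3)
gain = normalized_gain(51.5, 67.3)
print(f"{delta:+.1f} pts, g = {gain:.1f}%")  # prints "+15.8 pts, g = 32.6%"
```

Normalizing by headroom makes lifts comparable across configs with different baselines: a config starting at 8.8% has far more room to improve than one starting at 51.5%.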

Professional-Domain Profile

Resolution rate across the eight professional domains of the 87-task taxonomy, shown as a radar chart (rings at 20–100%; solid = without Skills, pale = Skill lift).

Software Engineering (16 tasks), without → with Skills:
GPT-5.5 (OpenHands): 52.6 → 63.4%
Opus 4.8 (OpenHands): 55.1 → 65.9%
Gemini 3.5 Flash (OpenHands): 46.4 → 50.9%

Task Registry

industrial-physical-systems
hard

You need to calculate the mass of a 3D printed part. The input (`/root/scan_data.stl`) is a binary STL, but the 2-byte "Attribute Byte Count" at the end of each triangle record is being used to store the **Material ID** of the object.

industrial-physical-systems
hard

# Task

industrial-physical-systems
medium

You need to implement an Adaptive Cruise Control (ACC) simulation that maintains the set speed (30m/s) when no vehicles are detected ahead, and automatically adjusts speed to maintain a safe following distance when a vehicle is detected ahead. The targets are: speed rise time <10s, speed overshoot <...

software-engineering
medium

Detect BGP route oscillation and leaks in an Azure Virtual WAN network topology and evaluate which solutions can fix them.

mathematics-or-formal-reasoning
medium

As an optimization and mathematical modeling expert in a shared bike service company, every night, you need to rebalance the bikes among the stations in your responsible region. So the basic information is that you can control several rebalancing vehicles (`vehicle_count`). The vehicles can start dr...

office-white-collar
medium

You are helping a research team verify the integrity of their bibliography before submitting a paper. The team suspects that some citations in their BibTeX file may be fake or hallucinated.

mathematics-or-formal-reasoning
hard

# Sid Meier's Civilization 6 District Adjacency Bonus Optimizer

office-white-collar
easy

Fill the California Small Claims Court form at `/root/sc100-blank.pdf` based on the case description below, and save the filled one to `/root/sc100-filled.pdf`. Only fill in the necessary fields and leave the court-filled, optional fields or fields not mentioned in the case description below empty. ...

You are a materials science researcher. You want to analyze the crystal structure of a few samples. You are given a few CIF files generated by SHELX through X-ray Diffraction.

cybersecurity
hard

You’re given `packets.pcap` (subset of DAPT2020 traffic). Compute the stats and fill in only the `value` column in `/root/network_stats.csv`. Lines starting with `#` are comments—leave them.

software-engineering
medium

Please use D3.js (v6) to visualize input data stored at `/root/data/stock-descriptions.csv` and `/root/data/indiv-stock/` for company stock details and individual stock price histories. Please return the output as a single-page web app at `/root/output/index.html`. I should be able to open it in a w...

software-engineering
hard

# debug-trl-grpo

software-engineering
easy

You will implement a dialogue parser that converts a given text file into a structured JSON graph. You will be given a text file `/app/script.txt`, and output a validated JSON graph `/app/dialogue.json` and visualization `/app/dialogue.dot`. You should implement a function `def parse_script(text: st...

industrial-physical-systems
medium

Given the system parameters of a drone, generate time-parameterized piecewise continuous trajectories and feedback control design to enable the drone to fly along a pre-defined path in simulation according to a natural language command.

industrial-physical-systems
medium

Given a video file /root/input.mp4, help me to analyze the camera motion (egomotion) and detect dynamic objects in this video. You need to give me 2 files: 1. A json file named /root/pred_instructions.json, it should record the mapping frame intervals to motion labels (valid labels: Stay, Dolly In,...

natural-science
hard

You have some earthquake traces stored at `/root/data/wave.mseed`, and station information stored at `/root/data/stations.csv`.

natural-science
medium

You are a geospatial analyst with expertise in plate tectonics and earthquake analysis. I want to find the earthquake that occurred furthest from the Pacific plate boundary within the Pacific plate itself. Use GeoPandas projections. Output the result to `/root/answer.json` as a JSON file with the fo...

finance-economics
medium

In macroeconomics, understanding business cycle dynamics requires separating the trend component (long-term growth) from the cyclical component (short-term fluctuations) of economic time series. This task involves applying standard detrending techniques to analyze how consumption and investment move...

office-white-collar
medium

You are given a `PDF` file at `/root/input/input.pdf` and a text file at `/root/input/input.txt` containing natural language instructions and updated information. Your task is to update this PDF using the latest information provided in the text file and follow all the instructions in it.

industrial-physical-systems
medium

As an operator-planner in an Independent System Operator, you create a base case for tomorrow's peak hour each morning in order to establish the voltage profile and verify that the day-ahead market schedule is AC-feasible. In order to do this, you should use the model described in `math-model.md` to...