Voozh

I just finished reading the SkillBench paper: https://arxiv.org/pdf/2602.12670

And the results are definitely not what most people expect.

What researchers did

👁 SkillBench research setup screenshot

They evaluated 86 real-work tasks across 11 domains, for a total of 7,308 runs.

Each task was tested in three modes:

Baseline (no skills)
Curated skills (human-written)
Self-generated skills by the model
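To make the three-mode setup concrete, here is a minimal sketch of what such a comparison loop could look like. This is not the paper's harness: `run_task`, the condition names, and the `trials` parameter are all hypothetical stand-ins I made up for illustration.

```python
from collections import defaultdict

# Hypothetical labels for the three modes described above.
CONDITIONS = ("baseline", "curated", "self_generated")


def pass_rates(tasks, run_task, trials=3):
    """Run every task under every condition and return the pass rate per condition.

    `run_task(task, condition)` is an assumed callback that returns True
    when the agent solves the task under that condition.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for task in tasks:
        for condition in CONDITIONS:
            for _ in range(trials):
                passed[condition] += bool(run_task(task, condition))
                total[condition] += 1
    # Avoid dividing by zero when no tasks were supplied.
    return {c: (passed[c] / total[c] if total[c] else 0.0) for c in CONDITIONS}
```

The point of the design is that every task sees all three conditions, so any difference in pass rate comes from the skills and not from the task mix.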

👁 haiku with good skills is better than vanilla opus

Without further ado, below are some conclusions that I found interesting in the paper.

Self-generated skills don't help

One of the most hyped ideas in agent research is:

"Let the model write its own tools / skills."

But it is mostly wasted effort. In this research, self-generated skills produced no meaningful improvement over baseline.

In some cases, they made performance worse.

Today's models simply cannot reliably create useful reusable procedural abstractions.

This matters because a huge part of current agent research assumes models can recursively improve by generating better skills/tools. This benchmark suggests that assumption is premature.

👁 SkillBench chart showing self-generated skills did not meaningfully improve performance

Human-made skills work A LOT better

When Skills were carefully written by humans, performance jumped +16.2 percentage points on average.

But here's what's even more surprising:

Domain variance was extreme

Some domains saw small gains (~4-5 pp)
Others saw enormous gains (~50+ pp)

👁 SkillBench chart showing high domain variance for human-made skills

Skills don't help equally across fields. They disproportionately help in structured, procedural domains.
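A quick note on units: the gains above are in percentage points (pp), which is an absolute difference in pass rate and not a relative change. The sketch below shows the difference. The domain names and pass rates are invented for illustration and are not figures from the paper.

```python
def gain_pp(baseline_rate, skilled_rate):
    """Absolute gain in percentage points between two pass rates (0.0 to 1.0)."""
    return round((skilled_rate - baseline_rate) * 100, 1)


def relative_gain_pct(baseline_rate, skilled_rate):
    """Relative gain in percent, measured against the baseline pass rate."""
    return round((skilled_rate - baseline_rate) / baseline_rate * 100, 1)


# Made-up (baseline, with-skills) pass rates, NOT data from the paper.
hypothetical_domains = {
    "domain_a": (0.40, 0.45),
    "domain_b": (0.20, 0.70),
}

for name, (base, skilled) in hypothetical_domains.items():
    print(name, gain_pp(base, skilled), "pp,", relative_gain_pct(base, skilled), "% relative")
```

In this toy example the second domain gains 50 pp, which is a 250% relative improvement because its baseline was so low. An average pp figure can hide that kind of spread, which is why the per-domain breakdown matters.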

Smaller models + skills ≈ bigger models without skills

A smaller model with curated Skills matched or exceeded a larger model without Skills.

This is huge for cost optimization:

Local agents
Edge deployment
Open-source models

Too many skills can hurt

Overly broad or verbose skill libraries degraded performance. Focused, minimal skill modules performed better.

👁 SkillBench result showing too many skills can degrade performance

Pick your skills carefully. 2-3 skills work better than 4+ skills.
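One simple way to act on this is to cap how many skills get loaded per task. Below is a rough sketch that assumes skills are stored as a name-to-description mapping and uses naive keyword overlap as the relevance score. Both of those are my own assumptions for illustration; the paper does not prescribe a selection method.

```python
def select_skills(task_description, skills, max_skills=3):
    """Pick at most `max_skills` skills whose descriptions overlap the task.

    `skills` is an assumed dict of {skill_name: short_description}.
    Relevance is a naive word-overlap count, a placeholder for whatever
    retrieval method you actually use.
    """
    task_words = set(task_description.lower().split())
    scored = []
    for name, description in skills.items():
        overlap = len(task_words & set(description.lower().split()))
        if overlap > 0:  # skip skills with nothing in common with the task
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:max_skills]]


# Hypothetical skill library.
library = {
    "excel": "edit excel spreadsheet formulas",
    "pdf": "extract text from pdf files",
    "git": "resolve git merge conflicts",
    "csv": "clean csv spreadsheet data",
}
print(select_skills("clean the spreadsheet and fix formulas", library))
```

The default cap of 3 just mirrors the 2-3 skill range mentioned above. Irrelevant skills are dropped entirely instead of being padded in to fill the quota.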

Here is my takeaway

If this paper is right (and I think it is, mostly because of my personal experience with skill files):

Scaling alone isn't enough
Autonomy narratives are premature
Skill architecture design is now a first-class research problem

Read the full paper: https://arxiv.org/pdf/2602.12670

URL: https://dev.to/port/skills-dont-work-the-way-we-think-they-do-494j
