URL: https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills

Improving skill-creator: Test, measure, and refine Agent Skills

Skill authors can now verify that their skills work, catch regressions, and improve descriptions.

Category: Claude Code, Product announcements
Product: Claude Code
Date: March 3, 2026
Reading time: 5 min

Skill-creator now helps you write evals, run benchmarks, and keep your skills working as models evolve. These updates are available now in Claude.ai and Cowork, as a plugin for Claude Code, and within our repo.

Since launching Agent Skills last October, we've noticed that most authors are subject matter experts, not engineers. They know their workflows but don't have the tools to tell whether a skill still works with a new model, whether it triggers when it should, or whether it actually improved after an edit.

Today we're announcing skill-creator enhancements that help authors build with more confidence. We are bringing some of the rigor of software development (testing, benchmarking, iterative improvement) to skill authoring without requiring anyone to write code.

Two kinds of skills

Skills generally fall into two categories:

Capability uplift skills help Claude do something the base model either can't do or can't do consistently. Our document creation skills are good examples. They encode techniques and patterns that produce better output than prompting alone.

Encoded preference skills document workflows where Claude can already do each piece, but the skill sequences them according to your team's process. Examples: a skill that walks through NDA review against set criteria, or one that drafts weekly updates with data from various MCPs.

This distinction matters because these two types of skills may need testing for different reasons:

Capability uplift skills may become less necessary as models improve. Evals tell you when that's happened.
Encoded preference skills are more durable, but only as valuable as their fidelity to your actual workflow. Evals verify that fidelity.

Either way, testing turns a skill that seems to work into one you know works.

Using evals to test and improve skills

Skill-creator now helps you write evals, which are tests that check that Claude does what you expect for a given prompt. If you've written software tests, this will feel familiar: define some test prompts (plus files if needed), describe what good looks like, and skill-creator tells you whether the skill holds up.

Our PDF skill, for instance, previously struggled with non-fillable forms. Claude had to place text at exact coordinates with no defined fields to guide it. Evals isolated the failure, and we shipped a fix that anchors positioning to extracted text coordinates.
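
As a rough illustration of the eval shape described above (test prompts, optional files, and a description of what good looks like), here is a minimal Python sketch. The post does not specify skill-creator's eval format, so `EvalCase`, `run_eval`, and `fake_run_skill` are hypothetical names, and expressing "what good looks like" as Python check functions is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalCase:
    """One eval: a prompt, optional input files, and checks describing a good result."""
    name: str
    prompt: str
    files: List[str] = field(default_factory=list)
    # Each check takes the output text and returns True when the expectation is met.
    checks: List[Callable[[str], bool]] = field(default_factory=list)


def run_eval(case: EvalCase, run_skill: Callable[[str, List[str]], str]) -> bool:
    """Run the case through `run_skill` and report whether every check passes."""
    output = run_skill(case.prompt, case.files)
    return all(check(output) for check in case.checks)


def fake_run_skill(prompt: str, files: List[str]) -> str:
    """Hypothetical stand-in for a real Claude call with the skill loaded."""
    return "Name: Ada Lovelace\nDate: 2026-03-03"


case = EvalCase(
    name="fill-non-fillable-form",
    prompt="Fill in the name and date on the attached form.",
    files=["form.pdf"],
    checks=[
        lambda out: "Ada Lovelace" in out,
        lambda out: "2026-03-03" in out,
    ],
)

print(run_eval(case, fake_run_skill))  # True
```

In practice the "what good looks like" part is often written in natural language and judged by a model; deterministic checks like these are simply the easiest version to show in a few lines.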

Evals help in many ways, but two important uses are to catch quality regressions and understand model progress.

First, catching regressions in quality. As models and the infrastructure around them evolve, a skill that worked well last month might behave differently today. Running evals against a new model gives you an early signal when something shifts, before it impacts your team’s work.

Second, knowing when general model capabilities have outgrown your skill. This applies mainly to capability uplift skills. If the base model starts passing your evals without the skill loaded, that's a signal the skill's techniques may have been incorporated into the model's default behavior. The skill isn't broken; it's just no longer necessary.
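
One way to picture this signal is to run the same evals with and without the skill loaded and compare pass rates. The sketch below assumes each run yields a mapping from eval name to pass/fail; `pass_rate`, `skill_adds_value`, and the `min_uplift` threshold are hypothetical and not part of skill-creator.

```python
from typing import Dict


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of evals that passed (0.0 for an empty result set)."""
    return sum(results.values()) / len(results) if results else 0.0


def skill_adds_value(
    with_skill: Dict[str, bool],
    without_skill: Dict[str, bool],
    min_uplift: float = 0.1,  # hypothetical threshold, chosen for illustration
) -> bool:
    """True when loading the skill raises the pass rate by at least `min_uplift`."""
    return pass_rate(with_skill) - pass_rate(without_skill) >= min_uplift


# Older model: the base model fails most evals without the skill.
old_with = {"form": True, "merge": True, "ocr": True, "split": True}
old_without = {"form": False, "merge": True, "ocr": False, "split": False}

# Newer model: the base model passes everything on its own.
new_with = {"form": True, "merge": True, "ocr": True, "split": True}
new_without = {"form": True, "merge": True, "ocr": True, "split": True}

print(skill_adds_value(old_with, old_without))  # True
print(skill_adds_value(new_with, new_without))  # False
```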

We've also added a benchmark mode that runs a standardized assessment using your evals. This is something you can run after model updates or as you iterate on the skill itself. It tracks eval pass rate, elapsed time, and token usage.
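
To make the three tracked metrics concrete, here is a small sketch that aggregates per-eval results into a pass rate, mean elapsed time, and total token usage. The `EvalResult` record and `summarize` function are assumptions for illustration; the post does not describe the actual benchmark output format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class EvalResult:
    """Hypothetical per-eval record: outcome, wall-clock time, and tokens used."""
    name: str
    passed: bool
    elapsed_s: float
    tokens: int


def summarize(results: List[EvalResult]) -> Dict[str, float]:
    """Aggregate per-eval results into the three benchmark metrics."""
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "mean_elapsed_s": mean(r.elapsed_s for r in results),
        "total_tokens": sum(r.tokens for r in results),
    }


results = [
    EvalResult("form", True, 2.0, 1000),
    EvalResult("merge", True, 4.0, 1500),
    EvalResult("ocr", False, 3.0, 1200),
    EvalResult("split", True, 3.0, 1300),
]

print(summarize(results))
# {'pass_rate': 0.75, 'mean_elapsed_s': 3.0, 'total_tokens': 5000}
```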

Your evals and results stay with you. Store them locally, integrate them with a dashboard, or plug them into a CI system.
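
If you do plug results into a CI system, one simple pattern is a regression gate that compares current pass rates with a stored baseline. This is a sketch under assumptions: the JSON layout (skill name mapped to pass rate), the `find_regressions` helper, and the `tolerance` parameter are all hypothetical.

```python
import json
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """List skills whose pass rate dropped by more than `tolerance` versus baseline."""
    return [
        name
        for name, base_rate in sorted(baseline.items())
        # A skill missing from the current run is treated as a pass rate of 0.0.
        if current.get(name, 0.0) < base_rate - tolerance
    ]


# In a real pipeline these would be read from stored result files.
baseline = json.loads('{"pdf": 0.9, "docx": 1.0}')
current = json.loads('{"pdf": 0.7, "docx": 1.0}')

regressions = find_regressions(baseline, current, tolerance=0.05)
print(regressions)  # ['pdf']
```

A CI job could then exit with a non-zero status whenever the returned list is non-empty.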

Faster, more consistent evaluation with multi-agent support

Running evals sequentially can be slow, and accumulated context can bleed from one test run into the next. Skill-creator now spins up independent agents to run evals in parallel, each in a clean context with its own token and timing metrics. The result is faster runs with no cross-contamination.
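
The idea of isolated, parallel runs can be sketched with a thread pool in which each run builds its own fresh context and records its own metrics. This is only an analogy for skill-creator's agents: `run_one` and `run_parallel` are hypothetical, the model call is replaced by a placeholder string, and counting whitespace-separated words stands in for real token accounting.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_one(prompt: str) -> Dict[str, object]:
    """Run a single eval in its own context and record its own metrics."""
    context: List[str] = []  # created per run, so nothing leaks between evals
    start = time.perf_counter()
    context.append(prompt)
    output = f"echo: {prompt}"  # placeholder for a real model call
    context.append(output)
    return {
        "prompt": prompt,
        "output": output,
        "context_len": len(context),
        # Word count is a crude stand-in for token usage.
        "tokens": sum(len(message.split()) for message in context),
        "elapsed_s": time.perf_counter() - start,
    }


def run_parallel(prompts: List[str], max_workers: int = 4) -> List[Dict[str, object]]:
    """Run all evals concurrently; results come back in the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, prompts))


for result in run_parallel(["fill this form", "merge two files", "extract tables"]):
    print(result["prompt"], result["context_len"], result["tokens"])
```

Because each call to `run_one` starts from an empty `context`, every run sees exactly two messages regardless of how many evals execute alongside it.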

We've also added comparator agents for A/B comparisons: two skill versions, or skill vs. no skill. They judge outputs without knowing which is which, so you can tell whether a change actually helped.
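
A blind comparison of this kind can be sketched as follows: shuffle the two outputs, let a judge pick one by position only, then map the pick back to the original version. The `blind_compare` function and the toy `longer_is_better` judge are hypothetical; a real comparator would be a model judging quality rather than length.

```python
import random
from typing import Callable


def blind_compare(
    output_a: str,
    output_b: str,
    judge: Callable[[str, str], int],
    rng: random.Random,
) -> str:
    """Return 'A' or 'B' for the preferred output.

    The judge sees the two outputs in random order and returns 0 or 1 for the
    position it prefers, so it cannot tell which version produced which output.
    """
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    pick = judge(first, second)
    if swapped:
        return "B" if pick == 0 else "A"
    return "A" if pick == 0 else "B"


def longer_is_better(first: str, second: str) -> int:
    """Toy judge for illustration only: prefers the longer output."""
    return 0 if len(first) >= len(second) else 1


winner = blind_compare(
    "Short answer.",
    "A longer, more complete answer.",
    longer_is_better,
    random.Random(0),
)
print(winner)  # B
```

Randomizing the presentation order also guards against a judge that systematically favors whichever output it sees first.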

Getting skills to trigger at the right time

Evals measure output quality, but that only matters if your skill triggers when it should. As your skill count grows, description precision becomes critical: too broad and you get false triggers, too narrow and it never fires. Skill-creator now helps you tune descriptions for more reliable triggering. It analyzes your current description against sample prompts and suggests edits that cut both false positives and false negatives.

We ran it across our document-creation skills and saw improved triggering on 5 out of 6 public skills.
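
The false-positive and false-negative framing can be illustrated with labeled sample prompts and a trigger function. Real triggering is decided by the model reading the skill description; the keyword matcher below is a deliberately simple toy, and `trigger_errors`, `keyword_trigger`, and the sample prompts are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def trigger_errors(
    triggers: Callable[[str], bool],
    samples: List[Tuple[str, bool]],
) -> Tuple[List[str], List[str]]:
    """Return (false_positives, false_negatives) for labeled sample prompts."""
    false_pos = [p for p, expected in samples if triggers(p) and not expected]
    false_neg = [p for p, expected in samples if not triggers(p) and expected]
    return false_pos, false_neg


def keyword_trigger(keywords: Set[str]) -> Callable[[str], bool]:
    """Toy trigger: fires when any keyword appears in the prompt."""
    return lambda prompt: any(k in prompt.lower() for k in keywords)


# Each sample pairs a prompt with whether a PDF skill should trigger.
samples = [
    ("Fill out this PDF form", True),
    ("Merge these two PDF files", True),
    ("Summarize this Word document", False),
    ("What is the weather today?", False),
]

too_broad = keyword_trigger({"pdf", "document"})
too_narrow = keyword_trigger({"pdf form"})
tuned = keyword_trigger({"pdf"})

print(trigger_errors(too_broad, samples))   # (['Summarize this Word document'], [])
print(trigger_errors(too_narrow, samples))  # ([], ['Merge these two PDF files'])
print(trigger_errors(tuned, samples))       # ([], [])
```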

Looking ahead

As models improve, the line between "skill" and "specification" may blur. Today, a SKILL.md file is essentially an implementation plan, providing detailed instructions telling Claude how to do something. Over time, a natural-language description of what the skill should do may be enough, with the model figuring out the rest.

The eval framework we're releasing today is a step in that direction. Evals already describe the "what." Eventually, that description may be the skill itself.

Getting started

All skill-creator updates are available now on Claude.ai and Cowork. Ask Claude to use the skill-creator to get started.

Claude Code users can install the plugin or download from our repo.
