URL: https://thenewstack.io/agents-last-exam-benchmark/

We’ve been measuring AI wrong; why economically valuable work is the new benchmark - The New Stack


AI Agents / AI Strategy / Large Language Models

Agent's Last Exam benchmarks AI agents across 1,500+ real-world tasks in 55 occupations to measure economically valuable work — not abstract skills.
Jun 15th, 2026 9:42am by Adrian Bridgwater

As the AI industry gradually builds standardization guidelines and systems, such as those overseen by the Tokenomics Foundation, it still needs a wider set of validated yardsticks for measuring the worth of any given model.

Nvidia recently pointed to AgentPerf from Artificial Analysis as a hardware benchmark for developers to compare systems for agentic AI. Models typically also list an MMLU benchmark score, also from Artificial Analysis.

But while pure performance is nice (if not essential) to have, software engineers and their business counterparts will ultimately want benchmarking tools that are calibrated to effectiveness in real-world business use cases.

Can agents perform economically valuable work?

A new benchmark surfaced on Thursday last week: Agent’s Last Exam (ALE), an agentic AI scoring measure based on an evaluation of Fable 5, GPT-5.5, Composer 2.5, and a selection of other frontier agent systems. The analysis measures whether AI agents can perform useful and effective work across 55 real-world occupations and 1,500+ real-world tasks.

Leading the research group behind the project is Dawn Song, professor and doctor of philosophy in computer science at the University of California, Berkeley.

Song tells The New Stack that her group grounded Agent’s Last Exam in “economically valuable work” from the real labor market, rather than in some abstract benchmark design.

We’ve been measuring AI wrong

“Everyone wants to know when AI agents will become job-ready,” Song says. “The problem is that we have not been measuring what’s needed to answer this question. Every task in ALE originates from work that a domain expert actually performed in a business, production, or research setting.”

In many cases, the ALE team asked how long the task took and what level of expertise it required, allowing the benchmark to estimate the labor value associated with completing it. In that sense, she says, the benchmark is not inventing some arbitrary notion of value; it is evaluating work that organizations already pay people to do.
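The article does not describe how ALE turns task duration and expertise level into a labor value, so the following is only a minimal sketch of one way such an estimate could work. The hourly rates, the expertise labels, and the names `Task`, `estimated_labor_value`, and `value_weighted_success` are all hypothetical and are not taken from ALE.

```python
from dataclasses import dataclass

# Hypothetical hourly rates in USD per expertise level (assumed, not from ALE).
HOURLY_RATE_USD = {"junior": 40.0, "mid": 75.0, "senior": 120.0}


@dataclass(frozen=True)
class Task:
    name: str
    hours: float      # how long a human expert reported the task took
    expertise: str    # key into HOURLY_RATE_USD


def estimated_labor_value(task: Task) -> float:
    """Approximate the labor value of a task as hours times an hourly rate."""
    return task.hours * HOURLY_RATE_USD[task.expertise]


def value_weighted_success(results: list[tuple[Task, bool]]) -> float:
    """Share of total estimated labor value covered by the tasks an agent solved."""
    total = sum(estimated_labor_value(task) for task, _ in results)
    solved = sum(estimated_labor_value(task) for task, ok in results if ok)
    return solved / total if total else 0.0


if __name__ == "__main__":
    results = [
        (Task("quarterly report", hours=2.0, expertise="senior"), True),
        (Task("data cleanup", hours=4.0, expertise="junior"), False),
    ]
    print(f"{value_weighted_success(results):.0%} of estimated labor value completed")
```

Weighting by estimated value rather than counting tasks means that an agent solving one long, expert-level task scores higher than one solving a short, routine task, which is the intuition behind measuring economically valuable work.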

“Most benchmarks evaluate isolated skills: answering questions, solving math problems, writing code snippets, or navigating toy environments,” explains Song. “Businesses, however, do not hire people to solve such benchmark questions. They hire people to perform real-world work. As agents become increasingly capable, evaluating real-world work is no longer optional – it’s tablestakes.”

Song’s group found the results of initial ALE analyses both “impressive and sobering”, largely because today’s agents can solve a “meaningful fraction” of professional tasks, but clearly have limitations.

When the team looked at the hardest work tasks, those that require sustained reasoning, deep domain expertise, and reliable execution over long horizons, it found that agents are still far from human-level performance. On ALE’s hardest tier, every frontier agent tested, including Fable 5, achieved a 0% success rate.

“Evaluating agents based on economically valuable work provides a common language for comparing progress across systems and understanding where AI can genuinely augment or automate human labor,” says Song. 

Economics is only one dimension of value

That said, she underlines that economic value is “only one dimension of value.” For many operational and professional tasks, labor time and expertise provide a “reasonable proxy” because compensation is closely tied to the work being performed. But, she explains, there are important domains where this breaks down.

Research is a good example: some projects may consume years of effort and produce little impact, while a single breakthrough can create enormous value. In those cases, hours worked and wages paid are poor measures of ultimate contribution.

The ALE team has concluded that “there is no universally best agent” and that every frontier model, including Fable 5, has domains where it shines and domains where it struggles. Song says the real signal lies in where agents succeed, where they fail, and how those patterns differ across domains.

Mix of models is a mindful maxim

On identical tasks, different models often fail for very different reasons, so does she advocate a mix of models as the most prudent approach to adopt?

“In the near term, yes. If a software engineering team is deploying agents in production today, using a mix of models is often the most practical approach. Different frontier models have different strengths and cost-performance characteristics, and routing tasks to the model that performs best at a certain cost for a given domain is simply good engineering,” Song clarifies.

She further notes that one lesson from ALE is that performance varies significantly not just across models, but across occupations and task types. That makes model diversity particularly valuable today. The question is not which model is best overall, but which model is best for a given class of economically valuable work.
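The routing idea Song describes can be sketched in a few lines. This is an illustrative sketch only: the model names, domains, success rates, and costs below are made up for the example and are not ALE results, and `ModelStats` and `pick_model` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelStats:
    model: str
    domain: str
    success_rate: float       # fraction of tasks solved in this domain
    cost_per_task_usd: float  # average spend per task in this domain


# Made-up numbers for illustration; a real table would come from your own evaluations.
STATS = [
    ModelStats("model_a", "legal", 0.42, 0.80),
    ModelStats("model_b", "legal", 0.55, 1.50),
    ModelStats("model_a", "finance", 0.61, 0.90),
    ModelStats("model_b", "finance", 0.48, 1.40),
]


def pick_model(stats: list[ModelStats], domain: str, max_cost_usd: float) -> Optional[str]:
    """Return the model with the highest success rate in a domain within a cost cap."""
    candidates = [
        s for s in stats
        if s.domain == domain and s.cost_per_task_usd <= max_cost_usd
    ]
    if not candidates:
        return None  # no model fits the budget for this domain
    # Prefer higher success rate; break ties by lower cost.
    best = max(candidates, key=lambda s: (s.success_rate, -s.cost_per_task_usd))
    return best.model


if __name__ == "__main__":
    for domain, budget in [("legal", 2.00), ("legal", 1.00), ("finance", 2.00)]:
        print(domain, budget, pick_model(STATS, domain, budget))
```

In this toy table no single model wins everywhere: the choice changes with both the domain and the budget, which mirrors the point that the useful question is which model is best for a given class of work.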

“The age of useful agents is here. The age of truly job-ready agents is not.”

For scenarios where an agent only operates in the terminal, the group has also released ALE-CLI, a CLI-only subset of the benchmark. The research group behind ALE is a mix of PhD students and postdoctoral researchers. Song is also director of the campus-wide Berkeley Center for Responsible Decentralized Intelligence (RDI), which has led the effort behind Agent’s Last Exam.

“The age of useful agents is here. The age of truly job-ready agents is not,” stated Song.

The hope is that ALE will serve as a “new guidepost and north star” for developing agents capable of reliably performing economically valuable work across a broad range of domains.

Adrian Bridgwater is a technology journalist with three decades of press experience. He has an extensive background in communications, starting in print media, newspapers and also television. Primarily working as an analysis writer dedicated to a software application development ‘beat’,...