
URL: https://artificialanalysis.ai/evaluations/gdpval-aa

GDPval-AA v2 Leaderboard | Artificial Analysis



GDPval-AA v2 Leaderboard

GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.
GDPval-AA v2 uses 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.
The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.
All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication


GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.
We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

GDPval-AA v2

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) scores highest on GDPval-AA v2 with an Elo of 1783, followed by Claude Opus 4.8 (Adaptive Reasoning, Max Effort) at 1615 and GLM-5.2 (max) at 1524.

GDPval-AA v2 Elo

GDPval-AA v2 Leaderboard

Elo rating for performance on real-world work tasks · Anchored to a human baseline of 1,000 · Higher is better
Reasoning models are indicated by a lightbulb icon
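
An Elo gap can be read as an expected head-to-head win rate. The sketch below assumes the conventional logistic Elo curve (base 10, 400-point scale); this page does not state which scale GDPval-AA v2 uses, so the function name, the `scale` default, and any resulting numbers are illustrative assumptions rather than the published methodology.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Probability that A's output is preferred over B's.

    Assumes the standard logistic Elo curve; the scale of 400 is an
    assumption, not something stated by the leaderboard.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


if __name__ == "__main__":
    # Top two entries on the leaderboard (1783 vs. 1615).
    print(round(expected_win_rate(1783, 1615), 3))
    # A model sitting exactly on the human baseline of 1,000.
    print(expected_win_rate(1000, 1000))
```

Under these assumptions, the 168-point gap between the top two models corresponds to a preference rate of roughly 0.72, and a model rated exactly 1,000 would be preferred over the human baseline half of the time.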

Cost

GDPval-AA v2 Leaderboard: Cost per Task

Average cost per task (USD), broken down by input, cache hit, cache write, reasoning, and answer tokens

Average cost per task in the evaluation. Costs are split by input, cache hit, cache write, reasoning, and answer token pricing where canonical token counts are available.
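
The cost split described above can be sketched as a small calculation. Everything here is hypothetical: the `TokenUsage` and `Pricing` types, the field names, and the prices are placeholders, and billing reasoning tokens at the same rate as answer tokens is an assumption that may not hold for every provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Average token counts for one task (hypothetical field names)."""
    input: int
    cache_hit: int
    cache_write: int
    reasoning: int
    answer: int


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (hypothetical price sheet)."""
    input: float
    cache_hit: float
    cache_write: float
    output: float  # assumed to apply to both reasoning and answer tokens


def cost_per_task(usage: TokenUsage, pricing: Pricing) -> dict[str, float]:
    """Return the USD cost of each token category plus the total."""
    million = 1_000_000
    parts = {
        "input": usage.input * pricing.input / million,
        "cache_hit": usage.cache_hit * pricing.cache_hit / million,
        "cache_write": usage.cache_write * pricing.cache_write / million,
        "reasoning": usage.reasoning * pricing.output / million,
        "answer": usage.answer * pricing.output / million,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    usage = TokenUsage(input=200_000, cache_hit=1_000_000,
                       cache_write=100_000, reasoning=50_000, answer=10_000)
    pricing = Pricing(input=3.0, cache_hit=0.3, cache_write=3.75, output=15.0)
    for category, usd in cost_per_task(usage, pricing).items():
        print(f"{category:12s} ${usd:.3f}")
```

In long agentic runs, cache hits often account for most of the tokens read, which is one reason to report them separately from fresh input tokens.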

Elo Comparisons

GDPval-AA v2: Elo vs. Artificial Analysis Intelligence Index

GDPval-AA v2 Elo · Artificial Analysis Intelligence Index
Most attractive quadrant

Artificial Analysis Intelligence Index v4.1 includes: GDPval-AA v2, τ³-Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, AA-LCR. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Token Usage

GDPval-AA v2 Leaderboard: Output Tokens per Task

Output tokens used to run one task, broken down by reasoning and answer tokens

The average number of answer and reasoning tokens produced per benchmark task in this evaluation.

Average Turns

GDPval-AA v2: Average Turns per Task

Average number of turns per task

Elo vs. Release Date

GDPval-AA v2: Elo vs. Release Date

Most attractive region

GDPval-AA v2 Leaderboard

Rank | Name | Elo | CI | Release Date
1 | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | 1783 | -24 / +24 | Jun 2026
2 | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | 1615 | -23 / +23 | May 2026
3 | GLM-5.2 (max) | 1524 | -26 / +26 | Jun 2026
4 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1519 | -23 / +23 | Apr 2026
5 | GPT-5.5 (xhigh) | 1509 | -22 / +22 | Apr 2026
6 | GPT-5.5 (high) | 1486 | -22 / +22 | Apr 2026
7 | MiniMax-M3 | 1414 | -22 / +22 | Jun 2026
8 | GPT-5.4 (xhigh) | 1409 | -22 / +22 | Mar 2026
9 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1400 | -22 / +22 | Feb 2026
10 | Gemini 3.5 Flash (high) | 1362 | -22 / +23 | May 2026
11 | DeepSeek V4 Pro (Reasoning, Max Effort) | 1326 | -22 / +22 | Apr 2026
12 | Qwen3.7 Max | 1299 | -22 / +22 | May 2026
13 | MiMo-V2.5-Pro | 1282 | -22 / +22 | Apr 2026
14 | GLM-5.1 (Reasoning) | 1278 | -22 / +22 | Apr 2026
15 | Kimi K2.6 | 1204 | -21 / +21 | Apr 2026
16 | Kimi K2.7 Code | 1203 | -22 / +22 | Jun 2026
17 | DeepSeek V4 Flash (Reasoning, Max Effort) | 1196 | -23 / +23 | Apr 2026
18 | GPT-5.4 mini (xhigh) | 1187 | -21 / +21 | Mar 2026
19 | Nemotron 3 Ultra 550B A55B (Reasoning) | 1180 | -21 / +21 | Jun 2026
20 | MiniMax-M2.7 | 1177 | -21 / +21 | Mar 2026
21 | Muse Spark | 1164 | -21 / +21 | Apr 2026
22 | Qwen3.6 27B (Reasoning) | 1162 | -21 / +21 | Apr 2026
23 | Qwen3.6 Plus | 1161 | -21 / +21 | Apr 2026
24 | GPT-5.5 (Non-reasoning) | 1134 | -21 / +21 | Apr 2026
25 | GPT-5.4 nano (xhigh) | 1114 | -20 / +20 | Mar 2026
26 | Grok 4.3 (Non-reasoning) | 1107 | -21 / +21 | Apr 2026
27 | Grok 4.3 (high) | 1100 | -21 / +21 | Apr 2026
28 | Qwen3.6 35B A3B (Reasoning) | 1055 | -21 / +21 | Apr 2026
29 | Step 3.7 Flash | 1031 | -20 / +20 | May 2026
30 | Qwen3.5 122B A10B (Reasoning) | 982 | -21 / +21 | Feb 2026
31 | Gemini 3.1 Pro Preview | 974 | -21 / +21 | Feb 2026
32 | Qwen3.5 397B A17B (Reasoning) | 961 | -21 / +21 | Feb 2026
33 | Qwen3.7 Plus | 946 | -21 / +21 | Jun 2026
34 | Mistral Medium 3.5 | 927 | -21 / +21 | Apr 2026
35 | Ring-2.6-1T | 920 | -21 / +21 | May 2026
36 | Claude 4.5 Haiku (Reasoning) | 901 | -22 / +22 | Oct 2025
37 | Gemma 4 31B (Reasoning) | 786 | -23 / +23 | Apr 2026
38 | gpt-oss-120b (high) | 779 | -23 / +23 | Aug 2025
39 | GPT-5.4 mini (Non-Reasoning) | 757 | -24 / +24 | Mar 2026
40 | Gemma 4 26B A4B (Reasoning) | 718 | -24 / +24 | Apr 2026
41 | GPT-5.4 nano (Non-Reasoning) | 716 | -25 / +25 | Mar 2026
42 | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 666 | -25 / +25 | Mar 2026
43 | Nova 2.0 Pro Preview (medium) | 638 | -25 / +25 | Nov 2025
44 | Gemini 3.1 Flash-Lite | 605 | -26 / +26 | Mar 2026
45 | gpt-oss-20B (high) | 528 | -27 / +27 | Aug 2025
46 | Solar Pro 3 | 468 | -28 / +28 | Apr 2026
47 | Granite 4.1 30B | 418 | -31 / +31 | Apr 2026
48 | Llama 4 Scout | 86 | -37 / +37 | Apr 2025
49 | Llama 4 Maverick | −11 | -38 / +38 | Apr 2025
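
The CI column helps when reading adjacent ranks. The sketch below checks whether the intervals of neighbouring entries overlap, using the first four rows of the leaderboard. The `Row` type and `intervals_overlap` helper are hypothetical names, and because this page does not state the confidence level or how the intervals are computed, overlap is only a rough heuristic and not a significance test.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    elo: float
    ci_minus: float  # magnitude of the lower offset, e.g. 24 for "-24"
    ci_plus: float   # upper offset, e.g. 24 for "+24"

    @property
    def interval(self) -> tuple[float, float]:
        return (self.elo - self.ci_minus, self.elo + self.ci_plus)


def intervals_overlap(a: Row, b: Row) -> bool:
    """True when the two Elo confidence intervals share at least one point."""
    a_lo, a_hi = a.interval
    b_lo, b_hi = b.interval
    return a_lo <= b_hi and b_lo <= a_hi


# First four leaderboard rows, copied from the table.
rows = [
    Row("Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)", 1783, 24, 24),
    Row("Claude Opus 4.8 (Adaptive Reasoning, Max Effort)", 1615, 23, 23),
    Row("GLM-5.2 (max)", 1524, 26, 26),
    Row("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 1519, 23, 23),
]

if __name__ == "__main__":
    for upper, lower in zip(rows, rows[1:]):
        verdict = "overlap" if intervals_overlap(upper, lower) else "separated"
        print(f"{upper.name} vs. {lower.name}: {verdict}")
```

Read this way, ranks 1 and 2 are clearly separated, while ranks 3 and 4 (1524 and 1519) fall within each other's intervals, so their order should not be over-interpreted.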


Frequently Asked Questions

GDPval-AA v2 is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.

GDPval-AA v2 compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.
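
The aggregation step can be illustrated with a toy rating fit. This page does not describe the fitting procedure, so the sketch below substitutes classic sequential Elo updates, repeated over the match list and then shifted so a chosen anchor sits at 1,000. The function name, the K-factor, the 400-point scale, and the sample matches are all assumptions; a production pipeline might instead fit a Bradley-Terry model and bootstrap the confidence intervals.

```python
from collections import defaultdict


def fit_elo(matches, anchor, anchor_rating=1000.0, k=16.0, scale=400.0, epochs=50):
    """Fit ratings from (winner, loser) pairs with sequential Elo updates.

    All parameters are illustrative defaults, not values taken from the
    GDPval-AA v2 methodology.
    """
    ratings = defaultdict(float)
    for _ in range(epochs):
        for winner, loser in matches:
            # Expected score of the winner before this result is applied.
            expected = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / scale))
            delta = k * (1.0 - expected)
            ratings[winner] += delta
            ratings[loser] -= delta
    # Shift every rating so that the anchor lands on anchor_rating.
    shift = anchor_rating - ratings[anchor]
    return {name: rating + shift for name, rating in ratings.items()}


if __name__ == "__main__":
    # Hypothetical judge verdicts: each tuple is (preferred, not preferred).
    matches = (
        [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
        + [("model_b", "human")] * 3 + [("human", "model_b")]
        + [("model_a", "human")] * 4
    )
    for name, rating in sorted(fit_elo(matches, anchor="human").items(),
                               key=lambda item: -item[1]):
        print(f"{name:8s} {rating:7.1f}")
```

Sequential updates depend on match order, which is one reason batch fits are generally preferred when all comparisons are available at once.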

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) has the highest GDPval-AA v2 Elo rating, 1,783, among models with published GDPval-AA v2 results.

GDPval-AA v2 covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA v2 instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.

Explore Evaluations

๐Ÿ‘ Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index

A composite benchmark aggregating nine challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

๐Ÿ‘ GDPval-AA v2 Leaderboard
GDPval-AA v2 Leaderboard

GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

๐Ÿ‘ APEX-Agents-AA Benchmark Leaderboard
APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

๐Ÿ‘ ๐œยฒ-Bench Telecom Benchmark Leaderboard
๐œยฒ-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

๐Ÿ‘ ๐œยณ-Banking Benchmark Leaderboard
๐œยณ-Banking Benchmark Leaderboard

A fintech customer-support benchmark from the ๐œ-Knowledge framework that tests whether agents can navigate a large unstructured knowledge base and execute multi-step tool calls to resolve realistic banking workflows.

๐Ÿ‘ Terminal-Bench Hard Benchmark Leaderboard
Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

๐Ÿ‘ Terminal-Bench v2.1 Benchmark Leaderboard
Terminal-Bench v2.1 Benchmark Leaderboard

A verified refresh of Terminal-Bench v2.0 โ€” 89 curated tasks across software engineering, system administration, data processing, model training, and security, with environment and instruction fixes so scores reflect agent capability rather than environment gaps.

๐Ÿ‘ SciCode Benchmark Leaderboard
SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

๐Ÿ‘ Artificial Analysis Long Context Reasoning Benchmark Leaderboard
Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

๐Ÿ‘ AA-Omniscience: Knowledge and Hallucination Benchmark
AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

๐Ÿ‘ IFBench Benchmark Leaderboard
IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

๐Ÿ‘ Humanity's Last Exam Benchmark Leaderboard
Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

๐Ÿ‘ CritPt Benchmark Leaderboard
CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

๐Ÿ‘ ITBench-AA Benchmark Leaderboard
ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

๐Ÿ‘ Artificial Analysis Openness Index
Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

๐Ÿ‘ MMLU-Pro Benchmark Leaderboard
MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

๐Ÿ‘ Global-MMLU-Lite Benchmark Leaderboard
Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

๐Ÿ‘ LiveCodeBench Benchmark Leaderboard
LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

๐Ÿ‘ MATH-500 Benchmark Leaderboard
MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

๐Ÿ‘ AIME 2025 Benchmark Leaderboard
AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

๐Ÿ‘ MMMU-Pro Benchmark Leaderboard
MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.