GDPval-AA v2 Leaderboard

GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models solve each task in an agentic loop, run through the Stirrup harness, with shell access and web browsing capabilities. Elo ratings are derived from blind pairwise comparisons of the resulting outputs.


GDPval-AA v2 uses 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.

The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication


GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

GDPval

2510.04374

openai/gdpval

GDPval-AA v2

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) ranks highest on GDPval-AA v2 with an Elo of 1783, followed by Claude Opus 4.8 (Adaptive Reasoning, Max Effort) at 1615 and GLM-5.2 (max) at 1524.

GDPval-AA v2 Elo

GDPval-AA v2 Leaderboard

Elo rating for performance on real-world work tasks · Anchored to a human baseline of 1,000 · Higher is better
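To give a feel for what a gap above the 1,000 baseline means, the sketch below converts a rating difference into an expected preference rate. It assumes the conventional Elo logistic curve with a 400-point scale; this page does not state which scale GDPval-AA v2 uses, so the function name, the scale, and the printed ratings are illustrative assumptions only.

```python
def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that A's output is preferred over B's under the standard Elo curve.

    Assumption: the 400-point scale is the conventional Elo default, not a
    value confirmed for GDPval-AA v2.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


HUMAN_BASELINE = 1000.0  # anchor value stated on the leaderboard

if __name__ == "__main__":
    # Illustrative ratings, not rows from the leaderboard.
    for rating in (1000.0, 1200.0, 1400.0):
        p = expected_win_probability(rating, HUMAN_BASELINE)
        print(f"Elo {rating:.0f} vs. baseline {HUMAN_BASELINE:.0f}: {p:.3f}")
```

Under this assumption, a rating equal to the baseline corresponds to a 50% preference rate, and a 400-point lead corresponds to 10-to-1 odds.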

Human Baseline (1,000)

Reasoning models are indicated by a lightbulb icon

Cost

GDPval-AA v2 Leaderboard: Cost per Task

Average cost per task (USD), broken down by input, cache hit, cache write, reasoning, and answer tokens


Average cost per task in the evaluation. Costs are split by input, cache hit, cache write, reasoning, and answer token pricing where canonical token counts are available.
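The cost split described above can be sketched as a per-category multiplication of token counts by prices. Everything in this example is hypothetical: the class names, the per-million-token prices, and the token counts are placeholders, and the page does not describe how Artificial Analysis actually computes the figures.

```python
from dataclasses import dataclass

CATEGORIES = ("input", "cache_hit", "cache_write", "reasoning", "answer")


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one task, one field per pricing category."""
    input: int
    cache_hit: int
    cache_write: int
    reasoning: int
    answer: int


@dataclass(frozen=True)
class PricePerMillion:
    """USD price per one million tokens in each category (hypothetical)."""
    input: float
    cache_hit: float
    cache_write: float
    reasoning: float
    answer: float


def cost_breakdown(usage: TokenUsage, prices: PricePerMillion) -> dict[str, float]:
    """Return the USD cost of one task, split by category."""
    return {c: getattr(usage, c) * getattr(prices, c) / 1_000_000 for c in CATEGORIES}


def average_cost_per_task(usages: list[TokenUsage], prices: PricePerMillion) -> float:
    """Mean of the total per-task cost across all tasks."""
    totals = [sum(cost_breakdown(u, prices).values()) for u in usages]
    return sum(totals) / len(totals)


if __name__ == "__main__":
    prices = PricePerMillion(input=2.0, cache_hit=0.2, cache_write=2.5,
                             reasoning=10.0, answer=10.0)
    task = TokenUsage(input=100_000, cache_hit=400_000, cache_write=50_000,
                      reasoning=20_000, answer=5_000)
    print(cost_breakdown(task, prices))
    print(average_cost_per_task([task], prices))
```

In an agentic run, cached input is often the largest token category, which is one reason to report cache hits and cache writes separately from fresh input.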

Elo Comparisons

GDPval-AA v2: Elo vs. Artificial Analysis Intelligence Index

GDPval-AA v2 Elo · Artificial Analysis Intelligence Index


Artificial Analysis Intelligence Index v4.1 includes: GDPval-AA v2, 𝜏³-Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, AA-LCR. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Token Usage

GDPval-AA v2 Leaderboard: Output Tokens per Task

Output tokens used to run one task, broken down by reasoning and answer tokens


The average number of answer and reasoning tokens produced per benchmark task in this evaluation.

Average Turns

GDPval-AA v2: Average Turns per Task

Average number of turns per task


Elo vs. Release Date

GDPval-AA v2: Elo vs. Release Date


GDPval-AA v2 Leaderboard

Rank	Creator	Name	Elo	CI	Release Date
1	Anthropic	Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	1783	-24 / +24	Jun 2026
2	Anthropic	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	1615	-23 / +23	May 2026
3	Z AI	GLM-5.2 (max)	1524	-26 / +26	Jun 2026
4	Anthropic	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	1519	-23 / +23	Apr 2026
5	OpenAI	GPT-5.5 (xhigh)	1509	-22 / +22	Apr 2026
6	OpenAI	GPT-5.5 (high)	1486	-22 / +22	Apr 2026
7	MiniMax	MiniMax-M3	1414	-22 / +22	Jun 2026
8	OpenAI	GPT-5.4 (xhigh)	1409	-22 / +22	Mar 2026
9	Anthropic	Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)	1400	-22 / +22	Feb 2026
10	Google	Gemini 3.5 Flash (high)	1362	-22 / +23	May 2026
11	DeepSeek	DeepSeek V4 Pro (Reasoning, Max Effort)	1326	-22 / +22	Apr 2026
12	Alibaba	Qwen3.7 Max	1299	-22 / +22	May 2026
13	Xiaomi	MiMo-V2.5-Pro	1282	-22 / +22	Apr 2026
14	Z AI	GLM-5.1 (Reasoning)	1278	-22 / +22	Apr 2026
15	Kimi	Kimi K2.6	1204	-21 / +21	Apr 2026
16	Kimi	Kimi K2.7 Code	1203	-22 / +22	Jun 2026
17	DeepSeek	DeepSeek V4 Flash (Reasoning, Max Effort)	1196	-23 / +23	Apr 2026
18	OpenAI	GPT-5.4 mini (xhigh)	1187	-21 / +21	Mar 2026
19	NVIDIA	Nemotron 3 Ultra 550B A55B (Reasoning)	1180	-21 / +21	Jun 2026
20	MiniMax	MiniMax-M2.7	1177	-21 / +21	Mar 2026
21	Meta	Muse Spark	1164	-21 / +21	Apr 2026
22	Alibaba	Qwen3.6 27B (Reasoning)	1162	-21 / +21	Apr 2026
23	Alibaba	Qwen3.6 Plus	1161	-21 / +21	Apr 2026
24	OpenAI	GPT-5.5 (Non-reasoning)	1134	-21 / +21	Apr 2026
25	OpenAI	GPT-5.4 nano (xhigh)	1114	-20 / +20	Mar 2026
26	xAI	Grok 4.3 (Non-reasoning)	1107	-21 / +21	Apr 2026
27	xAI	Grok 4.3 (high)	1100	-21 / +21	Apr 2026
28	Alibaba	Qwen3.6 35B A3B (Reasoning)	1055	-21 / +21	Apr 2026
29	StepFun	Step 3.7 Flash	1031	-20 / +20	May 2026
30	Alibaba	Qwen3.5 122B A10B (Reasoning)	982	-21 / +21	Feb 2026
31	Google	Gemini 3.1 Pro Preview	974	-21 / +21	Feb 2026
32	Alibaba	Qwen3.5 397B A17B (Reasoning)	961	-21 / +21	Feb 2026
33	Alibaba	Qwen3.7 Plus	946	-21 / +21	Jun 2026
34	Mistral	Mistral Medium 3.5	927	-21 / +21	Apr 2026
35	InclusionAI	Ring-2.6-1T	920	-21 / +21	May 2026
36	Anthropic	Claude 4.5 Haiku (Reasoning)	901	-22 / +22	Oct 2025
37	Google	Gemma 4 31B (Reasoning)	786	-23 / +23	Apr 2026
38	OpenAI	gpt-oss-120b (high)	779	-23 / +23	Aug 2025
39	OpenAI	GPT-5.4 mini (Non-Reasoning)	757	-24 / +24	Mar 2026
40	Google	Gemma 4 26B A4B (Reasoning)	718	-24 / +24	Apr 2026
41	OpenAI	GPT-5.4 nano (Non-Reasoning)	716	-25 / +25	Mar 2026
42	NVIDIA	NVIDIA Nemotron 3 Super 120B A12B (Reasoning)	666	-25 / +25	Mar 2026
43	Amazon	Nova 2.0 Pro Preview (medium)	638	-25 / +25	Nov 2025
44	Google	Gemini 3.1 Flash-Lite	605	-26 / +26	Mar 2026
45	OpenAI	gpt-oss-20B (high)	528	-27 / +27	Aug 2025
46	Upstage	Solar Pro 3	468	-28 / +28	Apr 2026
47	IBM	Granite 4.1 30B	418	-31 / +31	Apr 2026
48	Meta	Llama 4 Scout	86	-37 / +37	Apr 2025
49	Meta	Llama 4 Maverick	-11	-38 / +38	Apr 2025
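When reading the table, adjacent ranks are often closer than their CI column suggests. The sketch below checks whether two rows' intervals overlap, using Elo and CI values copied from the table. Treating overlap as "not clearly separated" is a rough heuristic rather than a formal significance test, and the page does not state the confidence level of the intervals; the `Rating` type and function name are illustrative.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One leaderboard row: Elo plus the lower and upper CI offsets."""
    name: str
    elo: float
    ci_minus: float
    ci_plus: float

    @property
    def low(self) -> float:
        return self.elo - self.ci_minus

    @property
    def high(self) -> float:
        return self.elo + self.ci_plus


def intervals_overlap(a: Rating, b: Rating) -> bool:
    """True when the two confidence intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


if __name__ == "__main__":
    fable = Rating("Claude Fable 5", 1783, 24, 24)
    opus_48 = Rating("Claude Opus 4.8", 1615, 23, 23)
    glm_52 = Rating("GLM-5.2 (max)", 1524, 26, 26)
    opus_47 = Rating("Claude Opus 4.7", 1519, 23, 23)

    print(intervals_overlap(fable, opus_48))   # ranks 1 and 2
    print(intervals_overlap(glm_52, opus_47))  # ranks 3 and 4
```

With the values above, ranks 1 and 2 have disjoint intervals, while ranks 3 and 4 overlap, so their ordering is less certain.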

Example Tasks

Frequently Asked Questions

GDPval-AA v2 is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.

GDPval-AA v2 compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.
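One common way to turn pairwise wins into ratings is a Bradley-Terry fit, rescaled to Elo-style points and shifted so that a chosen reference sits at 1,000. The sketch below shows that approach. It is an assumption for illustration: this page does not specify the aggregation method, the 400-point scale, or how the human baseline enters the comparison, and all names and match data here are hypothetical.

```python
import math
from collections import defaultdict


def fit_elo(matches, anchor, anchor_rating=1000.0, scale=400.0, iterations=200):
    """Fit Bradley-Terry strengths to (winner, loser) pairs and map them to Elo points.

    Uses the classic minorization-maximization update. Each player should have
    at least one win and one loss, otherwise its strength drifts to an extreme.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    players = set()
    for winner, loser in matches:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            denom = 0.0
            for q in players:
                if q == p:
                    continue
                n = pair_counts.get(frozenset((p, q)), 0.0)
                if n:
                    denom += n / (strength[p] + strength[q])
            # Floor avoids log10(0) for a player with no wins.
            updated[p] = max(wins[p] / denom, 1e-12) if denom else strength[p]
        total = sum(updated.values())
        strength = {p: s * len(players) / total for p, s in updated.items()}

    raw = {p: scale * math.log10(s) for p, s in strength.items()}
    offset = anchor_rating - raw[anchor]
    return {p: r + offset for p, r in raw.items()}


if __name__ == "__main__":
    # Hypothetical judged matchups: (winner, loser).
    matches = [
        ("model_x", "human"), ("model_x", "human"), ("human", "model_x"),
        ("model_x", "model_y"), ("model_y", "model_x"),
        ("model_y", "human"), ("human", "model_y"),
    ]
    for name, rating in sorted(fit_elo(matches, anchor="human").items()):
        print(f"{name}: {rating:.1f}")
```

Unlike sequential Elo updates, a batch fit of this kind does not depend on the order in which matchups were judged, which suits a leaderboard built from a fixed pool of comparisons.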

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) has the highest GDPval-AA v2 Elo rating, 1,783, among models with published GDPval-AA v2 results.

GDPval-AA v2 covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA v2 instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating nine challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA v2 Leaderboard

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

𝜏³-Banking Benchmark Leaderboard

A fintech customer-support benchmark from the 𝜏-Knowledge framework that tests whether agents can navigate a large unstructured knowledge base and execute multi-step tool calls to resolve realistic banking workflows.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

Terminal-Bench v2.1 Benchmark Leaderboard

A verified refresh of Terminal-Bench v2.0 — 89 curated tasks across software engineering, system administration, data processing, model training, and security, with environment and instruction fixes so scores reflect agent capability rather than environment gaps.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.

URL: https://artificialanalysis.ai/evaluations/gdpval-aa
