
URL: https://www.vellum.ai/llm-leaderboard

LLM Leaderboard 2026: Compare Top AI Models


Updated 9 Jun 2026

LLM Leaderboard

This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as independently run evaluations by Vellum or the open-source community. We feature results from non-saturated benchmarks, excluding outdated benchmarks (e.g. MMLU).

Top models per task

Best in Reasoning (GPQA Diamond)

Model | Score
Claude 3 Opus | 95.4%
Claude Opus 4.7 | 94.2%
Claude Fable 5 | 94.1%
Claude Mythos 5 | 94.1%
Claude Opus 4.8 | 93.6%

Best in High School Math (AIME 2025)

Model | Score
Gemini 3 Pro | 100%
GPT 5.2 | 100%
Claude Opus 4.6 | 99.8%
Kimi K2 Thinking | 99.1%
GPT oss 20b | 98.7%

Best in Agentic Coding (SWE Bench)

Model | Score
Claude Mythos 5 | 95.5%
Claude Fable 5 | 95%
Claude Opus 4.8 | 88.6%
Claude Opus 4.7 | 87.6%
Claude Sonnet 4.5 | 82%

Best Overall (Humanity's Last Exam)

Model | Score
Claude Mythos 5 | 64.5%
Claude Opus 4.8 | 57.9%
Gemini 3 Pro | 45.8%
Kimi K2 Thinking | 44.9%
GPT-5.5 Pro | 43.1%

Best in Visual Reasoning (ARC-AGI 2)

Model | Score
GPT-5.5 | 85%
Claude Opus 4.6 | 68.8%
Claude Sonnet 4.6 | 58.3%
GPT 5.2 | 52.9%
Claude Opus 4.5 | 37.6%

Best in Multilingual Reasoning (MMMLU)

Model | Score
Gemini 3 Pro | 91.8%
Claude Opus 4.6 | 91.1%
Claude Opus 4.5 | 90.8%
Claude Opus 4.1 | 89.5%
Claude Sonnet 4.6 | 89.3%

Fastest and most affordable models

Fastest Models (Tokens/sec)

1. Llama 4 Scout: 2600 t/s
2. Llama 3.3 70b: 2500 t/s
3. Llama 3.1 70b: 2100 t/s
4. Llama 3.1 8b: 1800 t/s
5. Llama 3.1 405b: 969 t/s

Lowest Latency (TTFT)

1. GPT-5.3 Codex: 0.003s
2. Nova Micro: 0.3s
3. Llama 3.1 8b: 0.32s
4. Llama 4 Scout: 0.33s
5. Gemini 2.0 Flash: 0.34s

Cheapest Models (per 1M tokens)

1. Nova Micro: $0.04 / $0.14
2. Gemma 3 27b: $0.07 / $0.07
3. Gemini 1.5 Flash: $0.075 / $0.3
4. GPT oss 20b: $0.08 / $0.35
5. Gemini 2.0 Flash: $0.1 / $0.4
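
To make the per-1M-token prices above easier to compare, here is a minimal Python sketch that converts them into a cost per request. It assumes each price pair reads as input / output in USD per 1M tokens, which matches the "I/O cost" column used elsewhere on this page. The function name `request_cost_usd` and the 2,000 / 500 token workload are illustrative assumptions, not figures from the leaderboard.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the USD cost of one request, given prices quoted per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Prices copied from the list above, read as (input, output) in USD per 1M tokens.
PRICES = {
    "Nova Micro": (0.04, 0.14),
    "Gemma 3 27b": (0.07, 0.07),
    "Gemini 2.0 Flash": (0.10, 0.40),
}

# Hypothetical workload: 2,000 prompt tokens and 500 completion tokens per request.
for model, (price_in, price_out) in PRICES.items():
    cost = request_cost_usd(2_000, 500, price_in, price_out)
    print(f"{model}: ${cost:.6f} per request")
```

Because output tokens are usually priced higher than input tokens, the ranking can shift with the prompt-to-completion ratio of a workload, so it may be worth rerunning the loop with your own token counts.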

Compare models

MiniMax M3 vs Claude Mythos 5

Spec | MiniMax M3 | Claude Mythos 5
Context size | 1,048,576 | 1,000,000
Cutoff date | Mar 2026 | Jan 2026
I/O cost | $0.6 / $2.4 | $10 / $50
Max output | 512,000 | 128,000
Latency | - | -
Speed | - | -

Benchmark | MiniMax M3 | Claude Mythos 5
GPQA Diamond | 93 | 94.1
BFCL | - | -
MATH 500 | - | -
AIME 2025 | - | -
SWE Bench | 80.5 | 95.5
LiveCodeBench | - | -

Compare Personal AI harnesses

Compare with: Claude Cowork

Open source | MIT | MIT | Apache 2.0 | Proprietary | MIT
Time to set up | Easy | Moderate | Difficult | Easy | Moderate
Native channels | iOS, MacOS, Web, Voice, Email, Telegram, Slack, CLI | CLI / TUI | CLI, MacOS, Web | CLI, MacOS, Windows, Web | CLI / TUI
Memory | Managed memory | SQLite + markdown (you build the memory stack) | Basic memory, context loss | Limited | SQLite + markdown (you build the memory stack)
Security | Built-in security | DIY | DIY | No sandboxing | DIY
Hosting | Cloud or self-hosted | Self-hosted only | Self-hosted only | Anthropic cloud | Self-hosted only
Native integrations | Managed OAuth connections | No managed connectors | No managed connectors | MCP only | No managed connectors
Schedules | Cron + Heartbeat | Cron + Heartbeat | Cron + Heartbeat | Cron only | Cron + Heartbeat
Pricing | Free + API costs, Paid plans available | Free + DIY Hosting Costs + API costs | Free + DIY Hosting Costs + API costs | Paid plans available + API costs | Free + DIY Hosting Costs + API costs

Model Comparison

Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed
Claude Opus 4.6 | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s
Claude Sonnet 4.6 | 200,000 | Aug 2025 | $3 / $15 | 64,000 | 0.73s | 55 t/s
OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14s | 214 t/s
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Claude 3.7 Sonnet [R] | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95s | 78 t/s
Gemini 2.5 Pro | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30s | 191 t/s
GPT-5 | 400,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
Gemini 3 Pro | 10,000,000 | Apr 2025 | $2 / $12 | 650,000 | 30.3s | 128 t/s
Claude 4 Sonnet | 200,000 | Mar 2025 | $3 / $15 | 64,000 | 1.9s | -
Claude 4 Opus | 200,000 | Mar 2025 | $15 / $75 | 32,000 | 1.95s | -
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
Claude Opus 4.1 | 200,000 | Apr 2025 | $15 / $75 | 32,000 | - | -
GPT 5.1 | 200,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Claude Sonnet 4.5 | 200,000 | Apr 2025 | $3 / $15 | 160,000 | 31s | 69 t/s
GPT 5.2 | 400,000 | Aug 2025 | $1.5 / $14 | 16,000 | 0.6s | 92 t/s
Claude Mythos 5 | 1,000,000 | Jan 2026 | $10 / $50 | 128,000 | - | -
Claude Opus 4.8 | 1,000,000 | Jan 2026 | $5 / $25 | 128,000 | - | -
GPT-5.5 | 1,000,000 | Apr 2026 | $5 / $30 | 128,000 | - | -
Claude Opus 4.7 | 1,000,000 | Apr 2026 | $5 / $25 | 128,000 | - | -
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | -
GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25s | 48 t/s
Claude 3.7 Sonnet | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91s | 78 t/s
Grok 3 [Beta] | - | Nov 2024 | - | - | - | -
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
GPT-4.1 | 1,000,000 | Dec 2024 | $2 / $8 | 16,000 | - | -
GPT-4.1 mini | 1,000,000 | Dec 2024 | $0.4 / $1.6 | 16,000 | - | -
Claude Opus 4.5 | 200,000 | Apr 2025 | $5 / $25 | 64,000 | - | -
Claude Fable 5 | 1,000,000 | Jan 2026 | $10 / $50 | 128,000 | - | -
MiniMax M3 | 1,048,576 | Mar 2026 | $0.6 / $2.4 | 512,000 | - | -
OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43s | 220 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s
Llama 4 Behemoth | - | Nov 2024 | - | - | - | -
GPT-4.1 nano | 1,000,000 | Dec 2024 | $0.1 / $0.4 | 32,000 | - | -
GPT-5.5 Pro | 1,000,000 | Apr 2026 | $30 / $180 | 128,000 | - | -
GPT-5.3 Codex | 400,000 | Aug 2025 | $1.75 / $14 | 128,000 | 0.003s | 50 t/s

Context window, cost and speed comparison

Model | Context window | Input cost / 1M tokens | Output cost / 1M tokens | Speed (tokens/second) | Latency
MiniMax M3 | 1,048,576 | $0.6 | $2.4 | n/a | n/a
Claude Mythos 5 | 1,000,000 | $10 | $50 | n/a | n/a
Claude Fable 5 | 1,000,000 | $10 | $50 | n/a | n/a
Claude Opus 4.8 | 1,000,000 | $5 | $25 | n/a | n/a
GPT-5.5 | 1,000,000 | $5 | $30 | n/a | n/a
GPT-5.5 Pro | 1,000,000 | $30 | $180 | n/a | n/a
Claude Opus 4.7 | 1,000,000 | $5 | $25 | n/a | n/a
Claude Opus 4.6 | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds
Claude Sonnet 4.6 | 200,000 | $3 | $15 | 55 t/s | 0.73 seconds
GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 t/s | 0.003 seconds
DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4 seconds
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43 seconds
OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 t/s | 14 seconds
DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds
Claude 3.7 Sonnet [R] | 200,000 | $3 | $15 | 78 t/s | 0.95 seconds
GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25 seconds
Claude 3.7 Sonnet | 200,000 | $3 | $15 | 78 t/s | 0.91 seconds
Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 191 t/s | 30 seconds
Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a
Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds
Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45 seconds
Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 seconds
Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a
GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a
GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a
GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a
Claude 4 Sonnet | 200,000 | $3 | $15 | n/a | 1.9 seconds
Claude 4 Opus | 200,000 | $15 | $75 | n/a | 1.95 seconds
GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1 seconds
GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds
Claude Opus 4.1 | 200,000 | $15 | $75 | n/a | n/a
GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a
GPT 5.1 | 200,000 | $1.25 | $10 | n/a | n/a
Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3 seconds
Gemini 3 Pro | 10,000,000 | $2 | $12 | 128 t/s | 30.3 seconds
Claude Sonnet 4.5 | 200,000 | $3 | $15 | 69 t/s | 31 seconds
Claude Opus 4.5 | 200,000 | $5 | $25 | n/a | n/a
GPT 5.2 | 400,000 | $1.5 | $14 | 92 t/s | 0.6 seconds
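
The Speed and Latency columns above can be combined into a rough end-to-end response time. The Python sketch below assumes that Latency means time to first token (the page labels its latency ranking "TTFT") and that Speed is a steady output rate once generation starts. Both the additive model and the function name `estimated_response_seconds` are illustrative assumptions, and real timings will vary with load and prompt length.

```python
def estimated_response_seconds(ttft_seconds, tokens_per_second, output_tokens):
    """Rough total time: wait for the first token, then stream the rest at a fixed rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


# (latency in seconds, speed in tokens/second) copied from the table above.
MODELS = {
    "Claude Opus 4.6": (1.6, 67),
    "Llama 4 Scout": (0.33, 2600),
    "Gemini 2.5 Pro": (30, 191),
}

# Hypothetical response length of 1,000 output tokens.
for model, (ttft, speed) in MODELS.items():
    seconds = estimated_response_seconds(ttft, speed, 1_000)
    print(f"{model}: about {seconds:.1f}s for 1,000 output tokens")
```

Under this model, a high tokens-per-second figure matters most for long outputs, while latency dominates short replies, so neither column alone ranks models for a given use case.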

Benchmark glossary

GPQA Diamond
Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
AIME 2025
Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
Humanity's Last Exam
A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
ARC-AGI 2
Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
MMMLU
Massive Multitask Language Understanding across multiple languages. Evaluates knowledge and reasoning in non-English contexts.