Updated 9 Jun 2026
LLM Leaderboard
This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as independently run evaluations by Vellum or the open-source community. We feature results from non-saturated benchmarks, excluding outdated benchmarks (e.g. MMLU).
Top models per task
Best in Reasoning (GPQA Diamond)
| Model | Score |
|---|---|
| Claude 3 Opus | 95.4% |
| Claude Opus 4.7 | 94.2% |
| Claude Fable 5 | 94.1% |
| Claude Mythos 5 | 94.1% |
| Claude Opus 4.8 | 93.6% |
Best in High School Math (AIME 2025)
| Model | Score |
|---|---|
| Gemini 3 Pro | 100% |
| GPT 5.2 | 100% |
| Claude Opus 4.6 | 99.8% |
| Kimi K2 Thinking | 99.1% |
| GPT oss 20b | 98.7% |
Best in Agentic Coding (SWE Bench)
| Model | Score |
|---|---|
| Claude Mythos 5 | 95.5% |
| Claude Fable 5 | 95% |
| Claude Opus 4.8 | 88.6% |
| Claude Opus 4.7 | 87.6% |
| Claude Sonnet 4.5 | 82% |
Best Overall (Humanity's Last Exam)
| Model | Score |
|---|---|
| Claude Mythos 5 | 64.5% |
| Claude Opus 4.8 | 57.9% |
| Gemini 3 Pro | 45.8% |
| Kimi K2 Thinking | 44.9% |
| GPT-5.5 Pro | 43.1% |
Best in Visual Reasoning (ARC-AGI 2)
| Model | Score |
|---|---|
| GPT-5.5 | 85% |
| Claude Opus 4.6 | 68.8% |
| Claude Sonnet 4.6 | 58.3% |
| GPT 5.2 | 52.9% |
| Claude Opus 4.5 | 37.6% |
Best in Multilingual Reasoning (MMMLU)
| Model | Score |
|---|---|
| Gemini 3 Pro | 91.8% |
| Claude Opus 4.6 | 91.1% |
| Claude Opus 4.5 | 90.8% |
| Claude Opus 4.1 | 89.5% |
| Claude Sonnet 4.6 | 89.3% |
Fastest and most affordable models
Fastest Models (Tokens/sec)
| Rank | Model | Speed |
|---|---|---|
| 1 | Llama 4 Scout | 2600 t/s |
| 2 | Llama 3.3 70b | 2500 t/s |
| 3 | Llama 3.1 70b | 2100 t/s |
| 4 | Llama 3.1 8b | 1800 t/s |
| 5 | Llama 3.1 405b | 969 t/s |
Lowest Latency (TTFT)
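| Rank | Model | Time to first token |
|---|---|---|
| 1 | GPT-5.3 Codex | 0.003s |
| 2 | Nova Micro | 0.3s |
| 3 | Llama 3.1 8b | 0.32s |
| 4 | Llama 4 Scout | 0.33s |
| 5 | Gemini 2.0 Flash | 0.34s |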
Cheapest Models (per 1M tokens)
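| Rank | Model | Input / Output cost |
|---|---|---|
| 1 | Nova Micro | $0.04 / $0.14 |
| 2 | Gemma 3 27b | $0.07 / $0.07 |
| 3 | Gemini 1.5 Flash | $0.075 / $0.3 |
| 4 | GPT oss 20b | $0.08 / $0.35 |
| 5 | Gemini 2.0 Flash | $0.1 / $0.4 |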
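The input / output prices above are quoted per 1M tokens, so the cost of a workload is a weighted sum of its input and output token counts. A minimal sketch of that arithmetic, using the Nova Micro prices listed above; the function name `request_cost` and the example token counts are illustrative assumptions, not part of the leaderboard.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Return the cost in USD, given prices quoted per 1M tokens."""
    per_million = 1_000_000
    input_cost = (input_tokens / per_million) * input_price
    output_cost = (output_tokens / per_million) * output_price
    return input_cost + output_cost


# Nova Micro is listed at $0.04 input / $0.14 output per 1M tokens.
# The token counts below are hypothetical.
cost = request_cost(2_000_000, 500_000, input_price=0.04, output_price=0.14)
print(f"${cost:.2f}")  # prints "$0.15"
```

Because output tokens are usually priced higher than input tokens, the ranking of "cheapest" models can shift with the input-to-output ratio of a given workload.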
Compare models
MiniMax M3 vs Claude Mythos 5
| | MiniMax M3 | Claude Mythos 5 |
|---|---|---|
| Context size | 1,048,576 | 1,000,000 |
| Cutoff date | Mar 2026 | Jan 2026 |
| I/O cost | $0.6 / $2.4 | $10 / $50 |
| Max output | 512,000 | 128,000 |
| Latency | - | - |
| Speed | - | - |

| Benchmark | MiniMax M3 | Claude Mythos 5 |
|---|---|---|
| GPQA Diamond | 93 | 94.1 |
| BFCL | - | - |
| MATH 500 | - | - |
| AIME 2025 | - | - |
| SWE Bench | 80.5 | 95.5 |
| LiveCodeBench | - | - |
Compare Personal AI harnesses
| | Vellum | Hermes | OpenClaw | Claude Cowork |
|---|---|---|---|---|
| Open source | MIT | MIT | Apache 2.0 | Proprietary |
| Time to set up | Easy | Moderate | Difficult | Easy |
| Native channels | iOS, macOS, Web, Voice, Email, Telegram, Slack, CLI | CLI / TUI | CLI, macOS, Web | CLI, macOS, Windows, Web |
| Memory | Managed memory | SQLite + markdown (you build the memory stack) | Basic memory, context loss | Limited |
| Security | Built-in security | DIY | DIY | No sandboxing |
| Hosting | Cloud or self-hosted | Self-hosted only | Self-hosted only | Anthropic cloud |
| Native integrations | Managed OAuth connections | No managed connectors | No managed connectors | MCP only |
| Schedules | Cron + Heartbeat | Cron + Heartbeat | Cron + Heartbeat | Cron only |
| Pricing | Free + API costs, paid plans available | Free + DIY hosting costs + API costs | Free + DIY hosting costs + API costs | Paid plans available + API costs |
Model Comparison
| Model | Context size (tokens) | Cutoff date | I/O cost (per 1M tokens) | Max output (tokens) | Latency (TTFT) | Speed |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s |
| Claude Sonnet 4.6 | 200,000 | Aug 2025 | $3 / $15 | 64,000 | 0.73s | 55 t/s |
| OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14s | 214 t/s |
| DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s |
| Claude 3.7 Sonnet [R] | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95s | 78 t/s |
| Gemini 2.5 Pro | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30s | 191 t/s |
| GPT-5 | 400,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | - |
| Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s |
| Gemini 3 Pro | 10,000,000 | Apr 2025 | $2 / $12 | 650,000 | 30.3s | 128 t/s |
| Claude 4 Sonnet | 200,000 | Mar 2025 | $3 / $15 | 64,000 | 1.9s | - |
| Claude 4 Opus | 200,000 | Mar 2025 | $15 / $75 | 32,000 | 1.95s | - |
| GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s |
| GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s |
| Claude Opus 4.1 | 200,000 | Apr 2025 | $15 / $75 | 32,000 | - | - |
| GPT 5.1 | 200,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | - |
| Claude Sonnet 4.5 | 200,000 | Apr 2025 | $3 / $15 | 160,000 | 31s | 69 t/s |
| GPT 5.2 | 400,000 | Aug 2025 | $1.5 / $14 | 16,000 | 0.6s | 92 t/s |
| Claude Mythos 5 | 1,000,000 | Jan 2026 | $10 / $50 | 128,000 | - | - |
| Claude Opus 4.8 | 1,000,000 | Jan 2026 | $5 / $25 | 128,000 | - | - |
| GPT-5.5 | 1,000,000 | Apr 2026 | $5 / $30 | 128,000 | - | - |
| Claude Opus 4.7 | 1,000,000 | Apr 2026 | $5 / $25 | 128,000 | - | - |
| DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s |
| Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | - |
| GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25s | 48 t/s |
| Claude 3.7 Sonnet | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91s | 78 t/s |
| Grok 3 [Beta] | - | Nov 2024 | - | - | - | - |
| Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s |
| GPT-4.1 | 1,000,000 | Dec 2024 | $2 / $8 | 16,000 | - | - |
| GPT-4.1 mini | 1,000,000 | Dec 2024 | $0.4 / $1.6 | 16,000 | - | - |
| Claude Opus 4.5 | 200,000 | Apr 2025 | $5 / $25 | 64,000 | - | - |
| Claude Fable 5 | 1,000,000 | Jan 2026 | $10 / $50 | 128,000 | - | - |
| MiniMax M3 | 1,048,576 | Mar 2026 | $0.6 / $2.4 | 512,000 | - | - |
| OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43s | 220 t/s |
| Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s |
| Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s |
| Llama 4 Behemoth | - | Nov 2024 | - | - | - | - |
| GPT-4.1 nano | 1,000,000 | Dec 2024 | $0.1 / $0.4 | 32,000 | - | - |
| GPT-5.5 Pro | 1,000,000 | Apr 2026 | $30 / $180 | 128,000 | - | - |
| GPT-5.3 Codex | 400,000 | Aug 2025 | $1.75 / $14 | 128,000 | 0.003s | 50 t/s |
Context window, cost and speed comparison
| Model | Context Window (tokens) | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
|---|---|---|---|---|---|
| MiniMax M3 | 1,048,576 | $0.6 | $2.4 | n/a | n/a |
| Claude Mythos 5 | 1,000,000 | $10 | $50 | n/a | n/a |
| Claude Fable 5 | 1,000,000 | $10 | $50 | n/a | n/a |
| Claude Opus 4.8 | 1,000,000 | $5 | $25 | n/a | n/a |
| GPT-5.5 | 1,000,000 | $5 | $30 | n/a | n/a |
| GPT-5.5 Pro | 1,000,000 | $30 | $180 | n/a | n/a |
| Claude Opus 4.7 | 1,000,000 | $5 | $25 | n/a | n/a |
| Claude Opus 4.6 | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds |
| Claude Sonnet 4.6 | 200,000 | $3 | $15 | 55 t/s | 0.73 seconds |
| GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 t/s | 0.003 seconds |
| DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4 seconds |
| Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a |
| OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43 seconds |
| OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 t/s | 14 seconds |
| DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds |
| Claude 3.7 Sonnet [R] | 200,000 | $3 | $15 | 78 t/s | 0.95 seconds |
| GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25 seconds |
| Claude 3.7 Sonnet | 200,000 | $3 | $15 | 78 t/s | 0.91 seconds |
| Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 191 t/s | 30 seconds |
| Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds |
| Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45 seconds |
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 seconds |
| Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a |
| GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a |
| GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a |
| GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a |
| Claude 4 Sonnet | 200,000 | $3 | $15 | n/a | 1.9 seconds |
| Claude 4 Opus | 200,000 | $15 | $75 | n/a | 1.95 seconds |
| GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1 seconds |
| GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds |
| Claude Opus 4.1 | 200,000 | $15 | $75 | n/a | n/a |
| GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a |
| GPT 5.1 | 200,000 | $1.25 | $10 | n/a | n/a |
| Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3 seconds |
| Gemini 3 Pro | 10,000,000 | $2 | $12 | 128 t/s | 30.3 seconds |
| Claude Sonnet 4.5 | 200,000 | $3 | $15 | 69 t/s | 31 seconds |
| Claude Opus 4.5 | 200,000 | $5 | $25 | n/a | n/a |
| GPT 5.2 | 400,000 | $1.5 | $14 | 92 t/s | 0.6 seconds |
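Latency and speed can be combined into a rough end-to-end response time: the time to first token plus the time needed to stream the output. A back-of-the-envelope sketch, assuming the Latency column is time to first token and the Speed column is sustained output throughput; the function name `estimated_response_time` and the 1,000-token output length are illustrative assumptions.

```python
def estimated_response_time(ttft_seconds, tokens_per_second, output_tokens):
    """Rough wall-clock seconds: time to first token plus streaming time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


# (latency in seconds, speed in tokens/second) taken from the table above.
# The 1,000-token output length is hypothetical.
models = {
    "Claude Opus 4.6": (1.6, 67),
    "Llama 4 Scout": (0.33, 2600),
    "Gemini 2.5 Pro": (30, 191),
}
for name, (ttft, speed) in models.items():
    seconds = estimated_response_time(ttft, speed, output_tokens=1_000)
    print(f"{name}: {seconds:.1f}s")
```

Under these assumptions, a model with high latency but high throughput can still finish a long output sooner than a low-latency, slower model, so neither column alone determines perceived speed.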
Benchmark glossary
- GPQA Diamond: Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
- AIME 2025: Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
- SWE-Bench Verified: Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
- Humanity's Last Exam: A crowd-sourced exam of extremely hard questions spanning a broad range of academic disciplines. Intended as a final closed-ended academic benchmark for frontier models.
- ARC-AGI 2: Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
- MMMLU: A multilingual version of Massive Multitask Language Understanding (MMLU). Evaluates knowledge and reasoning in non-English contexts.
