Independent analysis of AI
Understand the AI landscape to choose the best model and provider for your use case
Highlights
Personalized model recommender
Get personalized recommendations based on your priorities for intelligence, speed, and cost
Explore agents for general work, coding, customer support, and more
Compare AI agents across capabilities, pricing, and platform support
Explore premium plans
Access expanded benchmark data, custom visualizations, industry reports, and more
GLM-5.2 (max)
Kimi K2.7 Code
HyperNova 60B 2605
Gemma 4 12B (Non-reasoning)
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
North Mini Code
LFM2.5-8B-A1B
Gemma 4 12B (Reasoning)
MiniCPM5-1B (Reasoning)
Nemotron 3 Ultra 550B A55B (Reasoning)
grok-imagine-video-1.5-preview
MiniMax-M3
Intelligence
Intelligence of leading AI models based on our independent evaluations
Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index v4.1 includes: GDPval-AA v2, 𝜏³-Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, AA-LCR. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.
Artificial Analysis Intelligence Index by Open Weights / Proprietary
Indicates whether the model weights are available. Models are labelled as 'Commercial Use Restricted' if the weights are available but commercial use is limited (typically requires obtaining a paid license).
Cost per Intelligence Index Task
Weighted average cost per Intelligence Index task. Each evaluation’s cost is calculated from input, cache hit, cache write, reasoning, and answer token prices, divided by task count, and weighted by its Intelligence Index weight.
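The weighted average described above can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the `EvalUsage` type, the token-type keys, and the normalization by total weight are assumptions made for this example, and all prices are taken to be in USD per million tokens.

```python
from dataclasses import dataclass

# Token categories named in the definition above (key names are hypothetical).
TOKEN_TYPES = ("input", "cache_hit", "cache_write", "reasoning", "answer")


@dataclass
class EvalUsage:
    name: str
    weight: float      # the evaluation's Intelligence Index weight
    task_count: int    # number of tasks in the evaluation
    tokens: dict       # token type -> tokens consumed across the evaluation


def eval_cost_usd(tokens, prices_per_million):
    """Total USD cost of one evaluation from per-type token counts and prices."""
    total = 0.0
    for token_type in TOKEN_TYPES:
        total += tokens.get(token_type, 0) * prices_per_million.get(token_type, 0.0)
    return total / 1_000_000


def cost_per_index_task(evals, prices_per_million):
    """Weighted average of (evaluation cost / task count) across evaluations."""
    total_weight = sum(e.weight for e in evals)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        e.weight * eval_cost_usd(e.tokens, prices_per_million) / e.task_count
        for e in evals
    )
    return weighted / total_weight
```

Dividing by the total weight keeps the result on a per-task scale even when the weights do not sum to 1.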
Frontier Language Model Intelligence, Over Time
Performance, cost, and execution time for leading coding agents on end-to-end software engineering tasks
Artificial Analysis Coding Agent Index
Image & Video Leaderboards
Top models from our Image Arena and Video Arena leaderboards, with 95% confidence intervals
Text to Image Leaderboard
Intelligence Evaluations
Agentic real-world work tasks, (Elo-500)/2000
Agentic coding & terminal use
Agentic tool use
Long context reasoning
Knowledge
1 - hallucination rate
Reasoning & knowledge
Scientific reasoning
Coding
Instruction following
Physics reasoning
Long-horizon agentic tasks
Kubernetes incident root-cause analysis
Visual reasoning
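The first entry in the list above reports its score as (Elo-500)/2000. A minimal sketch of that rescaling and its inverse is shown below; the function names are hypothetical, and no clamping is applied because the page does not say whether scores are bounded.

```python
def normalize_elo(elo):
    """Rescale an Elo rating using the (Elo - 500) / 2000 formula."""
    return (elo - 500) / 2000


def denormalize_elo(score):
    """Recover the Elo rating from a normalized score."""
    return score * 2000 + 500
```

Under this formula an Elo of 500 maps to 0 and an Elo of 2500 maps to 1.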
While model intelligence generally translates across use cases, specific evaluations may be more relevant for certain use cases.
AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, penalizes bad guesses, and provides a comprehensive view of which models produce factually reliable outputs across different domains.
AA-Omniscience Index
AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct.
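One scoring rule consistent with the properties stated above (range of -100 to 100, zero when correct and incorrect answers balance, no penalty for refusing) is sketched below. The exact formula is an assumption for illustration; the benchmark's methodology is the authority on how the index is actually computed.

```python
def omniscience_index(correct, incorrect, abstained):
    """Score in [-100, 100]: +1 per correct answer, -1 per incorrect answer,
    0 per abstention, averaged over all questions and scaled by 100.
    This formula is an assumed reading of the description, not the official one."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total
```

Under this rule a model that abstains on a question it would have answered incorrectly scores higher than one that guesses, which matches the stated intent of penalizing hallucinations but not refusals.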
GDPval-AA v2
GDPval-AA v2 evaluates AI models on real-world, economically valuable tasks across a wide range of occupations
GDPval-AA v2 Leaderboard
Artificial Analysis Openness Index assesses how 'open' models are on the basis of their availability and transparency across different components.
Artificial Analysis Openness Index: Components
Artificial Analysis Openness Index vs. Artificial Analysis Intelligence Index
Output Tokens
Output tokens of leading AI models based on our independent evaluations
Output Tokens per Intelligence Index Task
The number of tokens required per Intelligence Index task. This is calculated by multiplying the output tokens per eval by the relative weights of each benchmark in the Intelligence Index, then dividing by task count (excluding repeats).
Price and Cost
Price and real-world costs of leading AI models based on our independent evaluations
Cost per Intelligence Index Task
Cost to Run Artificial Analysis Intelligence Index
The cost to run the evaluations in the Artificial Analysis Intelligence Index, calculated using the model's input, cache hit, cache write, reasoning, and answer token prices and the number of tokens used across evaluations (excluding repeats).
Pricing: Cache Hit, Input, and Output
Price per token for cached prompts (previously processed), typically offering a significant discount compared to the regular input price, represented as USD per million tokens. The values shown here are the cache hit price; cache write and cache storage are billed separately and vary by provider. See "Cache pricing by provider" for detail.
Price per token included in the request/message sent to the API, represented as USD per million tokens.
The blended cache price shown here uses cache hit price only. Other caching costs differ by provider:
- Anthropic: charges a separate cache write fee, with different rates for 5-minute and 1-hour TTLs (1-hour TTL is more expensive).
- Google (Vertex/Gemini): charges a per-hour cache storage fee in addition to cache hit pricing. Some providers also use tiered pricing for prompts above 200K tokens.
- OpenAI, DeepSeek, others: typically charge only cache hit pricing with no write or storage fee.
See Prompt Caching for the full breakdown.
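To see why the cache hit price alone can understate caching cost for some providers, the fee types listed above can be combined in a small sketch. Every parameter name and unit here is an assumption for illustration (token prices in USD per million tokens, storage in USD per million tokens per hour); no real provider rates are used.

```python
def caching_cost_usd(
    hit_tokens,
    hit_price,
    write_tokens=0,
    write_price=0.0,
    stored_tokens=0,
    storage_price_per_hour=0.0,
    hours_stored=0.0,
):
    """Sum of cache hit, cache write, and cache storage charges in USD.
    Providers that charge only for cache hits leave the other fees at zero."""
    per_million = 1_000_000
    hit = hit_tokens * hit_price / per_million
    write = write_tokens * write_price / per_million
    storage = stored_tokens * storage_price_per_hour * hours_stored / per_million
    return hit + write + storage
```

A provider with a write fee would pass `write_tokens` and `write_price`, while one with hourly storage billing would pass the three storage parameters instead.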
Price per token generated by the model (received from the API), represented as USD per million tokens.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
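The selection rule above (first-party figure when available, otherwise the median across providers) can be sketched as follows. The function name and the dictionary layout are hypothetical, chosen only for this example.

```python
from statistics import median


def representative_figure(by_provider, first_party=None):
    """Return the first-party provider's figure if present,
    otherwise the median across all providers.

    by_provider: dict mapping provider name -> measured value.
    first_party: name of the model creator's own API, or None if there is none.
    """
    if not by_provider:
        raise ValueError("no provider measurements available")
    if first_party is not None and first_party in by_provider:
        return by_provider[first_party]
    return median(by_provider.values())
```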
Speed & Latency
Comparison of first-party API performance
Output Speed
Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).
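A minimal sketch of this measurement, assuming each streamed chunk is recorded as an (arrival time in seconds, token count) pair. Excluding the first chunk's tokens from the count is an assumption made here so that only tokens received after the clock starts are measured.

```python
def output_speed_tps(chunks):
    """Tokens per second over the generation phase of a streamed response.

    chunks: list of (arrival_time_seconds, token_count) in arrival order.
    The clock starts when the first chunk arrives, so time to first token
    is excluded from the measurement.
    """
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to measure speed")
    first_time = chunks[0][0]
    last_time = chunks[-1][0]
    elapsed = last_time - first_time
    if elapsed <= 0:
        raise ValueError("chunk timestamps must increase")
    tokens_after_first = sum(count for _, count in chunks[1:])
    return tokens_after_first / elapsed
```

For example, `output_speed_tps([(0.5, 5), (1.0, 50), (1.5, 50)])` counts 100 tokens over 1.0 second of generation.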
Time per Intelligence Index Task
The weighted average time (seconds) per Artificial Analysis Intelligence Index task. This is calculated by dividing output tokens per task by output speed, weighted by the relative weights of each benchmark in the Intelligence Index.
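The calculation above can be sketched as follows, assuming a single output speed per model and a list of (weight, output tokens per task) pairs, one per benchmark. Both the input layout and the normalization by total weight are assumptions for this example.

```python
def time_per_task_seconds(evals, output_speed_tps):
    """Weighted average generation time per task.

    evals: list of (index_weight, output_tokens_per_task), one per benchmark.
    output_speed_tps: the model's output speed in tokens per second.
    """
    if output_speed_tps <= 0:
        raise ValueError("output speed must be positive")
    total_weight = sum(weight for weight, _ in evals)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weight * tokens / output_speed_tps for weight, tokens in evals)
    return weighted / total_weight
```

With weights 0.25 and 0.75, 2,000 and 4,000 output tokens per task, and 100 tokens per second, the result is 0.25 × 20 + 0.75 × 40 = 35 seconds.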
Comprehensive benchmarking of GPUs for language model inference
Compare leading Text to Video and Image to Video models
Compare leading Image Generation and Image Editing models
Compare leading Text to Speech models
API Provider Performance
Output Speed vs. Price: gpt-oss-120b (high)
Smaller, emerging providers are offering high output speeds at competitive prices.
Price per token, shown in USD per million tokens. Price is a blend of cache hit, input, and output token prices using the selected ratio (default 7:2:1 cache-input-output).
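The blend described above is a ratio-weighted average of the three prices. A minimal sketch, with a hypothetical function name and illustrative prices:

```python
def blended_price(cache_hit, input_price, output_price, ratio=(7, 2, 1)):
    """Blend cache hit, input, and output prices (USD per million tokens)
    using a cache:input:output ratio. The default is the 7:2:1 ratio."""
    cache_w, input_w, output_w = ratio
    total = cache_w + input_w + output_w
    if total <= 0:
        raise ValueError("ratio must sum to a positive number")
    return (cache_w * cache_hit + input_w * input_price + output_w * output_price) / total
```

With illustrative prices of 0.10, 1.00, and 4.00 USD per million tokens, the default ratio gives (0.7 + 2.0 + 4.0) / 10 = 0.67 USD per million tokens.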
Figures represent median (P50) measurement over the past 72 hours to reflect sustained changes in performance.
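A rolling median of this kind can be sketched as below, assuming measurements are stored as (timestamp, value) pairs. The function name and data layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def p50_last_72h(samples, now):
    """Median of measurements taken in the 72 hours up to `now`.

    samples: iterable of (timestamp, value) pairs with timezone-aware timestamps.
    """
    cutoff = now - timedelta(hours=72)
    recent = [value for ts, value in samples if cutoff <= ts <= now]
    if not recent:
        raise ValueError("no measurements in the past 72 hours")
    return median(recent)
```

Using the median rather than the mean keeps a single slow or fast outlier from moving the reported figure, so only sustained changes show up.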
Pricing (Cache Hit, Input, and Output): gpt-oss-120b (high)
Output Speed: gpt-oss-120b (high)
