Best AI Models June 2026 — Top 10 by SWE-Bench, Pricing & Context

Apr 22, 2026 (updated Jun 16, 2026)

Tags: model-comparison, api-guide, pricing, tutorial

TL;DR (June 16, 2026): Claude Opus 4.8 is the practical #1 for any team outside the US: 88.6% SWE-Bench Verified, 56 on the Artificial Analysis Intelligence Index (max mode), and 1486 Elo on LM Arena (thinking variant). GPT-5.5 follows at 82.6% SWE-Bench, Opus 4.7 at 82.0%, and Gemini 3.5 Flash at 78.8%. The cheapest API in the top 10 is DeepSeek V4-Pro at $0.45/M input; Qwen 3.7 Plus, which sits outside the top 10, is lower still at $0.40/M. Claude Fable 5 scored higher historically (95.0% SWE-Bench, 60 AA Index) but was withdrawn from all non-US customers on June 13, 2026 after a US Commerce Department export-control directive, so it is reference-only in the table below.

Top 10 — June 2026 At-a-Glance

Rank	Model	SWE-Bench Verified	AA Intelligence	Input $/M	Output $/M	Context	ofox Model ID
1	Claude Opus 4.8	88.6%	56	$5.00	$25.00	1M	`anthropic/claude-opus-4.8`
2	GPT-5.5 (xhigh)	82.6%	55	$4.25	$25.50	1M	`openai/gpt-5.5`
3	Claude Opus 4.7	82.0%	54	$5.00	$25.00	1M	`anthropic/claude-opus-4.7`
4	Gemini 3.5 Flash	78.8%	50	$1.50	$9.00	1M	`google/gemini-3.5-flash`
5	Claude Sonnet 4.6 (max)	—	47	$3.00	$15.00	1M	`anthropic/claude-sonnet-4.6`
6	Gemini 3.1 Pro Preview	—	46	$2.00	$12.00	1M	`google/gemini-3.1-pro-preview`
7	Qwen 3.7 Max	—	46	$2.50	$7.50	1M	`bailian/qwen3.7-max`
8	Grok 4.20	—	—	$4.00	$12.00	2M	`x-ai/grok-4.20`
9	Kimi K2.6	vendor-reported only	—	$0.95	$4.00	262K	`moonshotai/kimi-k2.6`
10	DeepSeek V4-Pro	—	—	$0.45	$0.88	1M	`deepseek/deepseek-v4-pro`
—	Claude Fable 5	95.0% (historical)	60 (historical)	—	—	1M	— (export-controlled)

Sources, as of June 13, 2026: vals.ai SWE-Bench Verified, Artificial Analysis Intelligence Index, ofox.ai/en/models pricing. Em-dashes mark scores not currently listed on the independent leaderboard. The bottom row is reference only — see below.

About Claude Fable 5 [Withdrawn — US export controls per US Commerce Department, June 13, 2026]. On June 13, 2026 the US Commerce Department issued an export-control directive restricting distribution of Claude Fable 5 and its underlying Mythos 5 model to foreign nationals — including non-US citizens working inside Anthropic. Because the controls apply by nationality rather than geography, Anthropic disabled both models for all non-US customers worldwide, not just specific regions. This is not an Anthropic product retirement; it is policy-driven. Historical benchmark scores remain on the leaderboards for reference, but neither model is callable through ofox.ai or any other public endpoint for non-US users as of this writing. For most production teams, Claude Opus 4.8 is the practical top model — 88.6% SWE-Bench Verified, no policy uncertainty.

How These Rankings Work

Four independent benchmarks and leaderboards measure different things, and they disagree, which is the point.

LM Arena (formerly LMSYS Chatbot Arena) uses blind human preference votes. Two models answer the same prompt; users pick the better response without knowing which model is which. With 6.8M+ votes across 366 models as of June 10, 2026, it remains the largest human-preference dataset in existence. Scores are Elo ratings — the same system used in chess.

SWE-Bench Verified measures whether a model can resolve real GitHub issues. An agent gets a repo, a bug report, and a test suite. It passes if the tests go green. No partial credit. This is the benchmark that actually predicts whether a model will be useful in a coding agent.

GPQA Diamond tests graduate-level science questions in biology, physics, and chemistry — questions designed to be “Google-proof.” Human PhD experts score around 65-70%. Models above 85% are doing something that most domain experts cannot.

Artificial Analysis Intelligence Index aggregates multiple benchmarks (MMLU-Pro, GPQA, AIME, LiveCodeBench, and more) into a single composite score, useful for a quick overall comparison.

LM Arena — Human Preference Leaderboard

LM Arena top 10 as of June 10, 2026:

Rank	Model	Elo Score	Votes
1	claude-fable-5 (Anthropic) †	1510 ± 11	2,883
2	claude-opus-4-6-thinking (Anthropic)	1504 ± 4	42,797
3	claude-opus-4-7-thinking (Anthropic)	1502 ± 5	28,900
4	claude-opus-4-6 (Anthropic)	1498 ± 4	45,808
5	claude-opus-4-7 (Anthropic)	1492 ± 5	29,924
6	muse-spark (Meta, preliminary)	1487 ± 6	13,511
7	gemini-3.1-pro-preview (Google)	1487 ± 4	55,403
8	gemini-3-pro (Google)	1486 ± 4	41,317
9	claude-opus-4-8-thinking (Anthropic)	1486 ± 7	9,190
10	gpt-5.5-high (OpenAI)	1481 ± 5	24,620

† Claude Fable 5 stays on the leaderboard as a historical entry; the model itself was withdrawn from non-US customers on June 13, 2026 (see “About Claude Fable 5” above). The 1510 Elo is from 2,883 votes captured before the directive, while the full leaderboard dataset spans 6.8M+ votes across all 366 models.

Two things stand out on the usable leaderboard. First, with Fable 5 set aside, Anthropic still holds five of the nine remaining top-ten spots: four Opus 4.6/4.7 variants plus Opus 4.8-thinking. Second, Opus 4.8-thinking sits only at #9 despite leading SWE-Bench. Arena rewards conversational polish; SWE-Bench rewards finished code. The Elo gap between #2 and #10 is 23 points, which is wider than the listed ± margins (so statistically meaningful) but not a blowout.
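To put the 23-point gap in concrete terms, here is a minimal sketch that converts a rating difference into an expected head-to-head preference rate using the textbook Elo logistic formula, and checks whether two ± ranges overlap. The function names are illustrative, and the leaderboard's own fitting procedure may differ from this formula, so treat the output as a rough reading rather than LM Arena's official interpretation.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def intervals_overlap(a: float, a_pm: float, b: float, b_pm: float) -> bool:
    """True if [a - a_pm, a + a_pm] and [b - b_pm, b + b_pm] share any point."""
    return (a - a_pm) <= (b + b_pm) and (b - b_pm) <= (a + a_pm)


# (rating, +/- margin) pairs copied from the table above.
opus_46_thinking = (1504, 4)  # rank 2
gpt_55_high = (1481, 5)       # rank 10
fable_5 = (1510, 11)          # rank 1, historical entry

win_rate = elo_expected_score(opus_46_thinking[0], gpt_55_high[0])
gap_overlaps = intervals_overlap(*opus_46_thinking, *gpt_55_high)
top_overlaps = intervals_overlap(*fable_5, *opus_46_thinking)

print(f"#2 vs #10 expected preference rate: {win_rate:.1%}")
print("#2 and #10 ranges overlap:", gap_overlaps)
print("#1 and #2 ranges overlap:", top_overlaps)
```

Under these assumptions the #2 model would be preferred in roughly 53% of pairings against #10: a real but modest edge. The #1 and #2 ranges overlap, which is consistent with Fable 5's small vote count.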

For teams evaluating xAI’s API — Grok 4.20 pricing, model IDs, and a working setup — see the Grok API pricing and setup guide.

Best for Coding — SWE-Bench Verified

Claude Opus 4.8 leads usable SWE-Bench Verified at 88.6% (vals.ai, verified June 13, 2026). Fable 5 scored 95.0% historically but was withdrawn from non-US customers on June 13.

Rank	Model	SWE-Bench Verified	Notes
1	Claude Opus 4.8	88.6%	Released May 2026; practical leader
2	GPT-5.5	82.6%	Released April 23, 2026
3	Claude Opus 4.7	82.0%	Still strong; cheaper migration path than 4.8
4	Gemini 3.5 Flash	78.8%	Lowest-cost top-5 entry
—	Claude Fable 5	95.0% (historical)	Withdrawn — US export controls (Jun 13, 2026)

The spread between Opus 4.8 and Gemini 3.5 Flash is 9.8 percentage points — large enough that model choice matters for production coding agents. The price spread across the same four is wider still: Gemini 3.5 Flash bills $1.50/$9.00 per million tokens vs. Opus 4.8’s $5.00/$25.00, a 3.3× input-cost gap for that 9.8-point benchmark drop. Whether the gap is worth paying for depends on how much your agent fails and retries at the lower tier.
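The retry trade-off can be sketched with a simple expected-cost model. This is a back-of-the-envelope illustration, not a measurement: it assumes each attempt succeeds independently, uses the SWE-Bench Verified score as a crude stand-in for per-attempt success rate, and picks a hypothetical 200K input / 20K output tokens per attempt. Substitute your own agent's numbers.

```python
def cost_per_attempt(in_price: float, out_price: float,
                     in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one attempt; prices are in $ per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def expected_cost_per_success(attempt_cost: float, success_rate: float) -> float:
    """Retry until success with independent attempts: expected attempts = 1 / p."""
    return attempt_cost / success_rate


# Hypothetical workload size per attempt (an assumption, not from the article).
IN_TOKENS, OUT_TOKENS = 200_000, 20_000

# Prices and SWE-Bench Verified scores from the tables above.
opus_attempt = cost_per_attempt(5.00, 25.00, IN_TOKENS, OUT_TOKENS)
flash_attempt = cost_per_attempt(1.50, 9.00, IN_TOKENS, OUT_TOKENS)

opus_cost = expected_cost_per_success(opus_attempt, 0.886)
flash_cost = expected_cost_per_success(flash_attempt, 0.788)

# Success rate at which the cheaper model's expected cost matches Opus 4.8's.
break_even_rate = flash_attempt / opus_cost

print(f"Opus 4.8 expected cost per resolved task:  ${opus_cost:.3f}")
print(f"Gemini 3.5 Flash expected cost per task:   ${flash_cost:.3f}")
print(f"Flash break-even success rate:             {break_even_rate:.1%}")
```

On token cost alone, the cheaper tier stays ahead in this sketch unless its real-world success rate falls far below its benchmark score. The model ignores latency, engineer time spent on failed runs, and tasks that never succeed on retry, which are often the real reasons to pay for the higher tier.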

Kimi K2.6 is still vendor-reported only. Moonshot AI publishes internal SWE-Bench numbers in the 70s, but as of June 13, 2026, vals.ai does not list an independently verified score. Treat any K2.6 SWE-Bench figure outside vals.ai as vendor-reported.

For a deeper look at Claude’s coding performance, the Claude Opus 4.7 review and upgrade guide covers the architectural jump from 4.6 to 4.7. The Qwen 3.7 Plus vs. Max benchmark covers the same family’s coding numbers in detail.

Best for Reasoning — Composite Intelligence Index

The Artificial Analysis Intelligence Index aggregates GPQA Diamond, MMLU-Pro, AIME, LiveCodeBench, and several other reasoning benchmarks into one composite score.

As of June 16, 2026, Fable 5 stays at #1 on the leaderboard view as a historical entry, but for any non-US team the usable order is Opus 4.8 first, then GPT-5.5 xhigh, then Opus 4.7. The Intelligence Index is updated on a rolling weekly basis, so absolute scores drift from one snapshot to the next; the ordering below is the more durable signal.

Model	AA Intelligence Index	Notes
Claude Fable 5 (with fallback)	60	Historical — withdrawn Jun 13, 2026
Claude Opus 4.8 (max)	56	Practical #1 outside the US
GPT-5.5 (xhigh)	55	Slower / more expensive than default mode
Claude Opus 4.7 (max)	54	Cheaper migration target than 4.8
GPT-5.5 (high)	53
Gemini 3.5 Flash	50	Cheapest top-tier entry
Claude Sonnet 4.6 (max)	47
GPT-5.5 (medium)	47
Gemini 3.1 Pro Preview	46
Qwen 3.7 Max	46

Source: Artificial Analysis leaderboard, pulled June 16, 2026.

Two observations. First, GPT-5.5’s xhigh reasoning mode (55) sits one point behind Opus 4.8 (56), close enough to call a tie, but xhigh is significantly slower and costlier than GPT-5.5’s default mode, so the tie only matters if your workload tolerates the latency. Second, Qwen 3.7 Max (46) matches Gemini 3.1 Pro Preview on the index at a slightly higher input price ($2.50/M vs. $2.00/M) but a lower output price ($7.50/M vs. $12.00/M). Both offer a 1M-token context, so Qwen’s actual edge is output cost plus Bailian’s domestic-network routing, not input cost or context length.

For a hands-on look at Gemini 3.1 Pro’s reasoning and its 1M-token context window, see the Gemini 3.1 Pro API guide.

Best Value — Price-Performance

The frontier tier has roughly an 11× price spread on input tokens between Qwen 3.7 Plus ($0.40/M) and GPT-5.5 ($4.25/M), and the cheaper options that have independent scores still land within striking distance of the most expensive on the Intelligence Index (Gemini 3.5 Flash at 50 vs. Opus 4.8 at 56).

Model	Input ($/M)	Output ($/M)	Context	SWE-Bench Verified	AA Intelligence
Qwen 3.7 Plus	$0.40	$1.60	1M	—	—
DeepSeek V4-Pro	$0.45	$0.88	1M	—	—
Kimi K2.6	$0.95	$4.00	262K	vendor-reported	—
GLM 5.1	$1.40	$4.40	200K	—	—
Gemini 3.5 Flash	$1.50	$9.00	1M	78.8%	50
Gemini 3.1 Pro Preview	$2.00	$12.00	1M	—	46
GPT-5.4	$2.13	$12.80	1M	—	—
Qwen 3.7 Max	$2.50	$7.50	1M	—	46
Claude Sonnet 4.6	$3.00	$15.00	1M	—	47
Grok 4.20	$4.00	$12.00	2M	—	—
GPT-5.5	$4.25	$25.50	1M	82.6%	55
Claude Opus 4.7	$5.00	$25.00	1M	82.0%	54
Claude Opus 4.8	$5.00	$25.00	1M	88.6%	56

Prices as of June 16, 2026, pulled from ofox.ai/en/models. Em-dashes indicate the model is not currently scored on the independent leaderboard. Claude Fable 5 omitted from the value table — not callable for non-US customers.

The big value plays. DeepSeek V4-Pro at $0.45/M input is roughly 11× cheaper than Opus 4.8 on input and 28× cheaper on output. Qwen 3.7 Plus is even lower at $0.40/M but with fewer independent benchmark anchors. For high-volume coding agents where you’re not pushing the SWE-Bench ceiling, both make economic sense as defaults with Opus 4.8 as the escalation target. The DeepSeek API pricing guide breaks down the V4-Pro cache discount that lowers input cost further on repeated context.
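The multiples quoted above can be reproduced directly from the value table. The snippet below is a small sketch for checking them or for rerunning the comparison when prices change; the `PRICES` dictionary and `price_ratio` helper are illustrative names, not part of any SDK.

```python
# (input, output) prices in $ per million tokens, copied from the value table.
PRICES = {
    "Qwen 3.7 Plus": (0.40, 1.60),
    "DeepSeek V4-Pro": (0.45, 0.88),
    "Qwen 3.7 Max": (2.50, 7.50),
    "GPT-5.5": (4.25, 25.50),
    "Claude Opus 4.8": (5.00, 25.00),
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input multiple, output multiple) of `expensive` over `cheap`."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


for expensive, cheap in [
    ("Claude Opus 4.8", "DeepSeek V4-Pro"),
    ("GPT-5.5", "Qwen 3.7 Plus"),
    ("Qwen 3.7 Max", "Qwen 3.7 Plus"),
]:
    in_mult, out_mult = price_ratio(expensive, cheap)
    print(f"{expensive} vs {cheap}: {in_mult:.1f}x input, {out_mult:.1f}x output")
```

Input and output multiples can differ a lot for the same pair, so weight them by your own input-to-output token mix before comparing.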

For mainland China teams comparing Qwen tiers head-to-head, the Qwen 3.7 Plus vs. Qwen 3.7 Max benchmark covers when the roughly 6× input-price jump from Plus to Max (about 4.7× on output) is worth it.

Best Open-Weight Model

Kimi K2.6 from Moonshot AI is the strongest open-weight model that’s actually downloadable. Moonshot describes it as a 1-trillion-parameter Mixture-of-Experts architecture with 32B active parameters per forward pass, a 262K context window, and API pricing of $0.95/M input and $4.00/M output. It appears on LM Arena and on Artificial Analysis as the strongest sub-frontier open model. An independent SWE-Bench Verified listing has not been published yet.

GLM 5.1 (Zhipu) is the second open-weight option in the same band — $1.40/$4.40 with a 200K context. GLM 5.2 weights and the API are rolling out: per Z.AI’s June 2026 announcement, API access opens next week. The GLM 5.2 access guide tracks the rollout.

For teams that need to self-host, audit model weights, or build on top of a modifiable base, K2.6 and GLM 5.1 are the two open models that belong in the same conversation as the closed-source frontier as of June 2026.

Which Model Should You Use? (June 2026)

Pick by task, not by headline ranking.

Coding agents. Claude Opus 4.8: 88.6% SWE-Bench Verified, $5/$25, 1M context. Fable 5 scored higher but has been unavailable to non-US customers since June 13. If you’re already running Opus 4.7, staying on it is the lower-risk option: same family, same pricing, 82.0% SWE-Bench.

Long-context analysis and research. Gemini 3.1 Pro Preview. Its AA Intelligence Index of 46 is below the Claude/GPT band, but its 1M-token context costs $2.00/M input versus $4.25 to $5.00 for that band, which matters when whole document corpora go into the prompt. Grok 4.20’s 2M context is larger still if you need it.

High-volume production. DeepSeek V4-Pro ($0.45/$0.88) or Qwen 3.7 Plus ($0.40/$1.60). Both are below the frontier on benchmarks but close enough for most tasks at roughly an order of magnitude cheaper. Use Opus 4.8 as the escalation tier for the requests they can’t close.

Best reasoning at frontier. GPT-5.5 (xhigh mode) ties Opus 4.8 at 55-56 AA Intelligence, but verify that xhigh’s latency works for your workload before committing — the default mode is materially faster.

Self-hosted. Kimi K2.6 is the strongest open-weight option. GLM 5.1 if you prefer Zhipu’s stack; GLM 5.2 API arrives next week.

Multimodal and OCR. For document parsing and OCR-heavy workloads, the picks shift — the best AI model for OCR in 2026 covers Gemini 3.1 Pro, GPT-5.5 vision, and the open models specifically on this axis.

The full Claude vs. GPT vs. Gemini comparison guide goes deeper on use-case-specific recommendations across the three frontier families.
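The "cheap default, Opus 4.8 as escalation tier" pattern recommended for high-volume production can be sketched as a small routing loop. This is one possible shape, not an ofox.ai feature: `call_model` and `is_acceptable` are hypothetical callables you would supply (for example, a wrapper around a chat-completions call and a check that tests pass), and the stub below exists only so the sketch runs offline.

```python
from typing import Callable, Sequence, Tuple


def answer_with_escalation(
    prompt: str,
    tiers: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try each model ID in order and return the first acceptable reply.

    If no tier passes the check, the last tier's reply is returned as-is.
    """
    if not tiers:
        raise ValueError("tiers must contain at least one model ID")
    reply = ""
    for model_id in tiers:
        reply = call_model(model_id, prompt)
        if is_acceptable(reply):
            break
    return model_id, reply


# Model IDs from the top-10 table: cheapest first, Opus 4.8 as the last resort.
TIERS = ["deepseek/deepseek-v4-pro", "anthropic/claude-opus-4.8"]


def fake_call_model(model_id: str, prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs without a network."""
    return "PATCH" if model_id.startswith("anthropic/") else ""


used_model, reply = answer_with_escalation(
    "Fix the failing test", TIERS, fake_call_model, is_acceptable=bool
)
print(used_model, reply)
```

The quality of the acceptance check decides whether this saves money: a check that is too lenient ships bad answers from the cheap tier, and one that is too strict escalates everything.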

Where to Access These Models

Every model in the leaderboard above — Claude Opus 4.8, Opus 4.7, Sonnet 4.6, GPT-5.5, GPT-5.4, Gemini 3.1 Pro Preview, Gemini 3.5 Flash, Qwen 3.7 Max/Plus, Kimi K2.6, Grok 4.20, GLM 5.1, DeepSeek V4-Pro — is available through ofox.ai with a single API key and OpenAI-compatible endpoints. Claude Fable 5 and Mythos 5 are excluded per the June 13, 2026 US Commerce export-control directive; the rest of the catalog is unaffected. For teams operating from mainland China, the same endpoint is reachable without a VPN, which is why this list is the one most ofox users build against.

One key, one billing account, one SDK integration. Switch models by changing one string:

Python (OpenAI SDK):

import os

from openai import OpenAI

# Read the key from the environment instead of hard-coding it.
client = OpenAI(
    base_url="https://api.ofox.ai/v1",
    api_key=os.environ["OFOX_API_KEY"],
)

# Switch between any model on this leaderboard by changing `model`.
response = client.chat.completions.create(
    model="anthropic/claude-opus-4.8",  # or "openai/gpt-5.5", "google/gemini-3.1-pro-preview"
    messages=[{"role": "user", "content": "Refactor this Django view to use async..."}],
)
print(response.choices[0].message.content)

Node.js (OpenAI SDK):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.ofox.ai/v1",
  apiKey: process.env.OFOX_API_KEY, // read the key from the environment
});

// Top-level await requires an ES module context (for example, a .mjs file).
const response = await client.chat.completions.create({
  model: "bailian/qwen3.7-max", // or "moonshotai/kimi-k2.6", "deepseek/deepseek-v4-pro"
  messages: [{ role: "user", content: "Refactor this Django view to use async..." }],
});
console.log(response.choices[0].message.content);

Model IDs in the table above (provider/model.version) are what go in the model parameter — they are not OpenAI’s or Anthropic’s native naming. For a complete setup guide, including failover and per-team budget routing, see AI API aggregation: access every model through one endpoint.

Last updated: June 16, 2026. Benchmark sources verified on publication: vals.ai SWE-Bench Verified (June 13, 2026), LM Arena text leaderboard (June 10, 2026, 6.8M+ votes across 366 models), Artificial Analysis Intelligence Index (June 2026). Pricing from ofox.ai/en/models. Claude Fable 5 / Mythos 5 withdrawal per US Department of Commerce export-control directive, effective June 13, 2026. The previous April 2026 ranking had Opus 4.7 in the top slot and DeepSeek V3.2 as the value pick; for the intermediate snapshot, see AI Model Rankings May 2026.

URL: https://ofox.ai/blog/llm-leaderboard-best-ai-models-ranked-2026/