SWE-bench Pro is Scale AI's contamination-resistant coding benchmark: 1,865 real-world software tasks across 41 professional repositories, scored Pass@1. The same frontier models that clear 80-95% on SWE-bench Verified top out at about 59% on it under standardized scaffolding. The standardized leader on June 18, 2026 is GPT-5.4 (xHigh) at 59.1%; among vendor-reported scores, Claude Opus 4.8 leads at 69.2%.
Three numbers all claim to be the best SWE-bench Pro score: 59.1% (gpt-5.4 xHigh, Scale's standardized SEAL leaderboard), 69.2% (Claude Opus 4.8, Anthropic's own scaffold), and 47.1% (Opus 4.6, Scale's private commercial set). All three are real. The spread is scaffolding and data splits, and most pages quoting a score never say which one they mean. Anthropic's 80.3% Fable 5 number is excluded here: that model is currently suspended (see note below).
This page keeps all three views side by side: Scale's standardized public and commercial leaderboards, vendor-reported scores, the Pro-vs-Verified delta per model, and score per dollar of output-token price.
Chart: SWE-bench Pro SEAL leaderboard, top 10 on the public set (731 tasks), Pass@1, Scale AI standardized scaffolding. Source: Scale AI SEAL Leaderboard, June 18, 2026; some entries run the mini-swe-agent harness.
Current top SWE-bench Pro score (standardized, public set): GPT-5.4 (xHigh) at 59.1%. The best vendor-reported score is Claude Opus 4.8 at 69.2% on Anthropic's own scaffold. The best open-source result is GLM-5.1 at 58.4% (vendor-reported, not a Scale SEAL entry).
SWE-bench Pro Leaderboard: Scale SEAL Public Set
Scale AI runs every model through identical scaffolding, which isolates model capability from harness quality. These are the only directly comparable SWE-bench Pro numbers. Scores below are from the public set (731 tasks), Pass@1, as of June 9, 2026.
GPT-5.4 (xHigh) leads at 59.1%, 4.1 points ahead of the new Muse Spark entry and 7.2 ahead of the best Claude run (Opus 4.6 thinking, 51.9%). Confidence intervals are roughly ±3.6 points, so the intervals of adjacent ranks overlap throughout the table, including within the top 3.
| Rank | Model | Score | 95% CI | Release |
|---|---|---|---|---|
| 1 | GPT-5.4 (xHigh) | 59.1% | ±3.56 | 2026 |
| 2 | Muse Spark (new) | 55.0% | ±3.60 | 2026 |
| 3 | Claude Opus 4.6 (thinking) | 51.9% | ±3.61 | Feb 2026 |
| 4 | Gemini 3.1 Pro (thinking) | 46.1% | ±3.60 | Feb 2026 |
| 5 | Claude Opus 4.5 | 45.9% | ±3.60 | Nov 2025 |
| 6 | Claude Sonnet 4.5 | 43.6% | ±3.60 | Sep 2025 |
| 7 | Gemini 3 Pro (preview) | 43.3% | ±3.60 | 2025 |
| 8 | Claude Sonnet 4 | 42.7% | ±3.59 | May 2025 |
| 9 | GPT-5 (High) | 41.8% | ±3.49 | Aug 2025 |
| 10 | GPT-5.2 Codex | 41.0% | ±3.57 | Jan 2026 |
| 11 | Claude Haiku 4.5 | 39.5% | ±3.55 | Oct 2025 |
| 12 | Qwen3 Coder 480B (open) | 38.7% | ±3.55 | 2025 |
Source: Scale AI SEAL Leaderboard, June 9, 2026. Standardized scaffolding; entries marked with an asterisk on Scale's page run the mini-swe-agent harness. Claude Fable 5 (GA June 9, 2026) and Opus 4.8 (May 28, 2026) have no SEAL entries yet.
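The overlap between neighbouring ranks can be checked directly from the Score and 95% CI columns. The sketch below is a minimal illustration, not Scale's methodology: it treats each entry as the interval score ± half-width and uses only the top six rows of the table. The function names are hypothetical. Interval overlap is a conservative test, so an overlap means the table alone cannot separate the two models, not that they are equal.

```python
# Scores and 95% CI half-widths copied from the public-set table above (top six rows).
ROWS = [
    ("GPT-5.4 (xHigh)", 59.1, 3.56),
    ("Muse Spark", 55.0, 3.60),
    ("Claude Opus 4.6 (thinking)", 51.9, 3.61),
    ("Gemini 3.1 Pro (thinking)", 46.1, 3.60),
    ("Claude Opus 4.5", 45.9, 3.60),
    ("Claude Sonnet 4.5", 43.6, 3.60),
]


def intervals_overlap(score_a, half_a, score_b, half_b):
    """True when [score - half, score + half] for the two entries share any point."""
    return abs(score_a - score_b) <= half_a + half_b


def adjacent_overlaps(rows):
    """Return (higher-ranked model, lower-ranked model, overlaps) for each adjacent pair."""
    return [
        (a[0], b[0], intervals_overlap(a[1], a[2], b[1], b[2]))
        for a, b in zip(rows, rows[1:])
    ]


if __name__ == "__main__":
    for higher, lower, overlaps in adjacent_overlaps(ROWS):
        status = "overlap" if overlaps else "separated"
        print(f"{higher} vs {lower}: {status}")
```

Comparing non-adjacent rows with `intervals_overlap` shows where the table starts to separate models, for example rank 1 against rank 3.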
SWE-bench Pro Commercial Set: Scores on Code No Model Has Seen
The commercial set is 276 tasks from 18 proprietary startup codebases that are not on the public internet. It is the strongest contamination control available, and scores drop hard: every model loses ground versus its public-set number, and the ranking reshuffles.
| Rank | Model | Score | 95% CI | Public-Set Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 (thinking) | 47.1% | ±6.07 | 51.9% |
| 2 | Muse Spark | 44.7% | ±6.05 | 55.0% |
| 3 | GPT-5.4 (xHigh) | 43.4% | ±6.03 | 59.1% |
| 4 | Gemini 3.1 Pro (thinking) | 32.2% | ±5.69 | 46.1% |
| 5 | GPT-5.2 Codex | 27.7% | ±5.09 | 41.0% |
| 6 | GPT-5.2 | 23.8% | ±5.09 | n/a |
| 7 | Claude Opus 4.5 | 23.4% | ±5.07 | 45.9% |
| 8 | Gemini 3 Pro | 18.0% | ±4.78 | 43.3% |
| 9 | Claude Opus 4.1 | 17.8% | ±4.51 | n/a |
| 10 | GPT-5 | 14.9% | ±4.20 | 41.8% |
| 11 | Gemini 2.5 Pro Preview | 10.1% | ±3.56 | n/a |
| 12 | Claude Sonnet 4 | 9.1% | ±3.39 | 42.7% |
Source: Scale AI SEAL Private Leaderboard. Wider confidence intervals reflect the smaller 276-task set.
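Why a smaller task set widens the interval can be seen with a normal-approximation sketch. This is an assumption for illustration, not Scale's documented method: it treats each task as an independent pass/fail trial. Under that assumption the formula gives about ±3.56 for 59.1% on 731 tasks, matching the public table, and about ±5.9 for 47.1% on 276 tasks, slightly narrower than the ±6.07 listed, so Scale's exact procedure may differ.

```python
import math


def ci_half_width(score_pct, n_tasks, z=1.96):
    """Normal-approximation 95% half-width, in percentage points, for a pass rate.

    Assumes independent pass/fail outcomes per task (an illustrative
    simplification, not necessarily how the leaderboard computes its intervals).
    """
    p = score_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)


if __name__ == "__main__":
    print(f"public, 731 tasks:     ±{ci_half_width(59.1, 731):.2f}")
    print(f"commercial, 276 tasks: ±{ci_half_width(47.1, 276):.2f}")
```

The half-width shrinks with the square root of the task count, which is why the 276-task set cannot rank models as tightly as the 731-task set.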
The reshuffle is the interesting part. GPT-5.4 leads the public set by 4.1 points but falls to third on commercial code. Opus 4.5 drops 22.5 points (45.9% to 23.4%), the largest fall among the public set's top five, and older models fall further: Claude Sonnet 4 loses 33.6 points (42.7% to 9.1%). Opus 4.6 holds 47.1%, losing only 4.8 points. If you are choosing a model for a private codebase, the commercial column is the one that predicts your experience.
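The public-to-commercial drop for every model with both scores can be recomputed from the commercial table. The sketch below only subtracts the two columns and sorts the result. The variable and function names are hypothetical, and models listed as n/a on the public set are left out.

```python
# (model, public-set score, commercial-set score), copied from the commercial table above.
# Models with no public-set entry (n/a) are omitted.
SCORES = [
    ("Claude Opus 4.6 (thinking)", 51.9, 47.1),
    ("Muse Spark", 55.0, 44.7),
    ("GPT-5.4 (xHigh)", 59.1, 43.4),
    ("Gemini 3.1 Pro (thinking)", 46.1, 32.2),
    ("GPT-5.2 Codex", 41.0, 27.7),
    ("Claude Opus 4.5", 45.9, 23.4),
    ("Gemini 3 Pro", 43.3, 18.0),
    ("GPT-5", 41.8, 14.9),
    ("Claude Sonnet 4", 42.7, 9.1),
]


def drops(scores):
    """Return (model, drop in points) pairs sorted from smallest to largest drop."""
    rows = [(model, round(public - commercial, 1)) for model, public, commercial in scores]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for model, drop in drops(SCORES):
        print(f"{model}: -{drop} pts")
```

Sorting by drop rather than by score shows which models hold up on unseen code, which is a different question from which model scores highest.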
Vendor-Reported SWE-bench Pro Scores: Fable 5 at 80.3% (currently suspended), Opus 4.8 at 69.2%
Labs also publish SWE-bench Pro numbers run on their own agent scaffolds. These are not comparable to SEAL scores: the harness, context retrieval, and turn budgets are tuned per lab. They are comparable to each other within one lab's table. Anthropic's Claude Fable 5 launch table (June 9, 2026):
| Model | Score | Output Price |
|---|---|---|
| Claude Fable 5 (currently suspended, see note) | 80.3% | $50/M tokens |
| Claude Mythos Preview (currently suspended, see note) | 77.8% | n/a |
| Claude Opus 4.8 | 69.2% | $25/M tokens |
| GPT-5.5 | 58.6% | $30/M tokens |
| Gemini 3.1 Pro | 54.2% | $12/M tokens |
Source: Anthropic launch benchmarks via Vellum's analysis. Prices from the Anthropic and OpenAI/Google API price lists, June 2026.
The vendor-vs-SEAL gap is consistent: Anthropic reports 69.2% for Opus 4.8 while Scale's best standardized Claude run (Opus 4.6 thinking) scores 51.9%. GPT-5.3-Codex reported 57% at launch on OpenAI's scaffold; its predecessor gpt-5.2-codex scores 41.0% under SEAL. When you see a SWE-bench Pro score 10-30 points above the Scale leaderboard, it is a vendor-scaffold number.
Score per Dollar: SWE-bench Pro Points per $1/M Output Tokens
Benchmark points are not free. Dividing each model's SWE-bench Pro score by its output-token price ($/M) shows where capability is cheap. Haiku 4.5 buys 7.9 points per output dollar. Fable 5, the highest scorer, buys 1.6.
| Model | Pro Score | Scaffold | $/M Output | Points per $ |
|---|---|---|---|---|
| Claude Haiku 4.5 | 39.5% | Scale SEAL | $5 | 7.9 |
| GPT-5.4 (xHigh) | 59.1% | Scale SEAL | $15 | 3.9 |
| Gemini 3.1 Pro | 46.1% | Scale SEAL | $12 | 3.8 |
| GPT-5.2 Codex | 41.0% | Scale SEAL | $14 | 2.9 |
| Claude Opus 4.8 | 69.2% | Vendor | $25 | 2.8 |
| Claude Opus 4.6 | 51.9% | Scale SEAL | $25 | 2.1 |
| GPT-5.5 | 58.6% | Vendor | $30 | 2.0 |
| Claude Opus 4.5 | 45.9% | Scale SEAL | $25 | 1.8 |
| Claude Fable 5 (currently suspended, see note) | 80.3% | Vendor | $50 | 1.6 |
Prices: Anthropic, OpenAI, and Google official API price lists, June 2026. Note: Opus 4.7 and later (including Fable 5) use a tokenizer that can produce up to 35% more tokens for the same text than pre-4.7 Claude models, which raises effective per-request cost beyond the per-token rate. Full cost modeling in our LLM cost calculator.
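The points-per-dollar column is a single division, and the tokenizer note can be folded in as a multiplier on price. The sketch below is illustrative: `token_inflation` is a hypothetical parameter, and 1.35 is the worst case quoted above ("up to 35% more tokens"), not a measured average. The rows are a subset of the table, and the table itself mixes SEAL and vendor scores, so the ratios are only roughly comparable.

```python
# (model, SWE-bench Pro score in %, output price in $ per million tokens),
# a subset of the rows in the table above.
MODELS = [
    ("Claude Haiku 4.5", 39.5, 5.0),
    ("GPT-5.4 (xHigh)", 59.1, 15.0),
    ("Claude Opus 4.8", 69.2, 25.0),
    ("Claude Fable 5", 80.3, 50.0),
]


def points_per_dollar(score_pct, price_per_m, token_inflation=1.0):
    """Benchmark points per $1/M output tokens.

    token_inflation scales the price for tokenizers that emit more tokens for
    the same text; 1.0 means no adjustment, 1.35 is the quoted upper bound.
    """
    return score_pct / (price_per_m * token_inflation)


if __name__ == "__main__":
    for model, score, price in MODELS:
        print(f"{model}: {points_per_dollar(score, price):.1f} points per $")
    # Worst-case tokenizer adjustment for a post-4.7 Claude model.
    print(f"Claude Opus 4.8, adjusted: {points_per_dollar(69.2, 25.0, 1.35):.1f}")
```

The ratio uses output price only; input tokens, caching, and the number of turns an agent takes all change real cost per solved task.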
WarpGrep Impact on SWE-bench Pro (Morph Internal)
Morph runs SWE-bench Pro internally and serves several of the leaderboard's models on api.morphllm.com, including morph-qwen35-397b, morph-minimax27-230b, morph-dsv4flash, and morph-qwen36-27b. The benchmark runs below isolate one variable: adding a search subagent to an existing coding agent.
Self-reported data
The scores below are from Morph's internal benchmark runs (March 2026), not from the SEAL leaderboard. They show the effect of adding WarpGrep v2 as a search subagent to existing coding agents.
Chart: SWE-bench Pro with vs without WarpGrep v2 (Morph internal benchmarks, public set, 731 tasks).
WarpGrep v2 adds 2.1-2.2 points to every model tested.
WarpGrep v2 is an RL-trained search subagent that runs in its own context window. It issues up to 8 parallel tool calls per turn and returns only the relevant file spans. The main coding model never sees files WarpGrep rejected, so its context stays clean.
With Opus 4.6, adding WarpGrep v2 cuts cost by 15.6% and time by 28%. The expensive model spends fewer tokens on search and more on code generation. Read how subagents make coding agents faster for the full breakdown.
SWE-bench Verified Leaderboard (June 2026)
SWE-bench Verified is the human-validated 500-task Python subset of the original SWE-bench. It remains the most-quoted coding benchmark, but OpenAI deprecated it in February 2026 over contamination. Scores below are vendor-reported and aggregated by llm-stats.
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Fable 5 (currently suspended, see note) | 95.0% |
| 2 | Claude Mythos Preview (currently suspended, see note) | 93.9% |
| 3 | Claude Opus 4.8 | 88.6% |
| 4 | Claude Opus 4.7 | 87.6% |
| 5 | Claude Opus 4.5 | 80.9% |
| 6 | Claude Opus 4.6 | 80.8% |
| 7 | DeepSeek-V4-Pro-Max (open) | 80.6% |
| 8 | Gemini 3.1 Pro | 80.6% |
| 9 | MiniMax M3 (open) | 80.5% |
| 10 | Qwen3.7 Max | 80.4% |
Source: llm-stats SWE-bench Verified tracker, June 2026. Vendor-reported; harness differences apply. See our full Claude benchmarks page for the rest of the suite.
Note the compression: ranks 5 through 10 span 0.5 points (80.9% to 80.4%). When six models from five labs are statistically tied near 80%, the benchmark has stopped discriminating at the frontier. That saturation, plus contamination, is why Pro exists.
SWE-bench Pro vs Verified: Same Model, Different Score
The per-model delta between Verified and Pro is the cleanest measure of how much Verified overstates capability:
| Model | Verified | Pro | Drop | Pro Scaffold |
|---|---|---|---|---|
| Claude Opus 4.5 | 80.9% | 45.9% | −35.0 pts | Scale SEAL |
| Gemini 3.1 Pro | 80.6% | 46.1% | −34.5 pts | Scale SEAL |
| Claude Opus 4.6 | 80.8% | 51.9% | −28.9 pts | Scale SEAL |
| Claude Opus 4.8 | 88.6% | 69.2% | −19.4 pts | Vendor |
| Claude Fable 5 (currently suspended, see note) | 95.0% | 80.3% | −14.7 pts | Vendor |
GPT-5 is the starkest case: it scores 41.8% on Pro's public set and 14.9% on the commercial set, against the 70%+ range its generation posted on Verified. The drop is not the model getting worse. It is the benchmark getting honest.
| Dimension | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Tasks | 500 | 1,865 |
| Repositories | 12 (all Python) | 41 (Python, Go, TS, JS) |
| Avg lines changed | 11 (median: 4) | 107.4 |
| Avg files changed | ~1 | 4.1 |
| Minimum task size | 161/500 tasks are 1-2 lines | Every task is 10+ lines |
| Contamination resistance | Low: public Python repos | High: copyleft + proprietary code |
| Status | Deprecated by OpenAI, Feb 2026 | Active, recommended |
Open-Source Models on SWE-bench: GLM-5.1 Leads Open-Weights on Pro, DeepSeek V4, MiniMax M3, Qwen
GLM-5.1 posts the best vendor-reported open-source score on SWE-bench Pro at 58.4% (not a Scale SEAL entry), although open-weights trackers in June 2026 report MiniMax M3 edging it at ~59%. Either figure puts open weights level with GPT-5.5's 58.6% vendor number. On SWE-bench Verified, open-weights models now tie Gemini 3.1 Pro. Coverage under Scale's standardized scaffolding is still thin. Status per model, June 18, 2026:
| Model | Verified | Pro (Scale SEAL) | Pro (Vendor) | Output Price |
|---|---|---|---|---|
| GLM-5.1 | n/a | No entry | 58.4% | $4.40/M |
| MiniMax M3 | 80.5% | No entry | ~59% | $1.20/M |
| DeepSeek-V4-Pro-Max | 80.6% | No entry | 55.4% | $0.87/M (V4-Pro API) |
| Qwen3.7 Max | 80.4% | No entry | n/a | n/a |
| Qwen3 Coder 480B | n/a | 38.7% | n/a | n/a |
Verified scores: llm-stats, June 2026. Pro (Scale SEAL): Scale SEAL leaderboard. Pro (Vendor): vendor-reported / open-weights trackers, not confirmed by Scale. Prices: official DeepSeek, MiniMax, and Z.AI API price lists.
DeepSeek V4 and SWE-bench: neither DeepSeek V4 Flash nor Pro has a Scale SEAL SWE-bench Pro entry as of June 9, 2026. Third-party trackers circulate 55.4% for V4-Pro on vendor-style scaffolds (unverified by Scale). Its strongest published result is V4-Pro-Max at 80.6% on SWE-bench Verified (vendor-reported), the top open-weights score there, tied with Gemini 3.1 Pro. V4 is MIT-licensed, with 1.6T total / 49B active parameters (Pro) and 284B total / 13B active (Flash), and API output priced at $0.28/M (Flash) and $0.87/M (Pro).
GLM-5.1 and SWE-bench Pro: the 58.4% figure circulating for GLM-5.1 is vendor-reported, not a Scale SEAL entry. Scale's standardized leaderboard has no GLM-5 generation entry; the top open-weights entry under SEAL scaffolding remains qwen3-coder-480b-a35b at 38.7%. GLM-5.1 costs $1.40/M input, $4.40/M output on the official Z.AI API. Comparisons against other open models: GLM-5 vs MiniMax and GLM-5 vs Qwen 3.5.
How SWE-bench Pro Works: 1,865 Tasks, 41 Repos, Pass@1
SWE-bench Pro contains 1,865 tasks across 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript, scored Pass@1 (one attempt, no retries). Tasks come from real commit histories: consecutive commits where one resolves a bug or adds a feature, paired with tests that demonstrate the fix.
Three Subsets
Public Set (731 tasks)
Tasks from 11 copyleft (GPL) repositories, openly available on HuggingFace. The primary evaluation target for leaderboard submissions.
Commercial Set (276 tasks)
Tasks from 18 proprietary startup codebases, acquired through Scale AI partnerships. Not publicly accessible: the strongest contamination control.
Held-Out Set (858 tasks)
Tasks from 12 repositories reserved for overfitting detection. Scale can release these to verify that public-set gains generalize.
Three-Stage Human Augmentation
- Problem statement creation: original commit messages and issue discussions are synthesized into clear, structured descriptions
- Requirements definition: annotators create specification lists grounded in unit tests and gold patches, detailing expected behavior without prescribing implementation
- Interface specification: class and function signatures are documented to prevent false negatives from naming mismatches
Evaluation methodology
Evaluation uses containerized, language-specific environments. Each task must pass fail2pass tests (tests that fail before the fix and pass after, verifying the issue is resolved) and pass2pass tests (existing tests that must keep passing). Gold patches are validated across 3 test runs before inclusion. Copyleft licensing makes the public set legally unattractive as training data, and the commercial set is never published at all.
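The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not Scale's harness: the function names, the test-name strings, and the shape of the results dictionary are all hypothetical. It assumes the post-patch test run has already been collected as a mapping from test name to pass/fail.

```python
from typing import Dict, Iterable


def is_resolved(results_after: Dict[str, bool],
                fail2pass: Iterable[str],
                pass2pass: Iterable[str]) -> bool:
    """A task is resolved only if every fail2pass and pass2pass test passes
    after the candidate patch is applied. A test missing from the results
    is treated as a failure."""
    required = list(fail2pass) + list(pass2pass)
    return all(results_after.get(name, False) for name in required)


def pass_at_1(outcomes: Iterable[bool]) -> float:
    """Pass@1 as a percentage: one attempt per task, resolved tasks / all tasks."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return 100.0 * sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    after = {"test_bug_fixed": True, "test_existing_api": True}
    print(is_resolved(after, ["test_bug_fixed"], ["test_existing_api"]))
    print(pass_at_1([True, False, False, True]))
```

Treating a missing test as a failure is a deliberately strict choice in this sketch: a patch that breaks test collection should not count as a fix.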
Why Scores Are So Much Lower Than Verified
Four factors compound:
- Multi-file modifications: Pro tasks touch 4.1 files on average; Verified is mostly single-file.
- Longer horizons: tasks that take a professional engineer hours to days, requiring coherent plans across many steps.
- Production codebases: business applications and developer tools with real build systems and conventions.
- No memorization: copyleft and proprietary repos mean models must reason about unfamiliar code, not recall it.
Failure mode analysis
Scale's trajectory analysis shows where models break: semantic understanding failures (35.9% of Opus 4.1 failures), context overflow (35.6% of Sonnet 4 failures), and tool-use inefficiency (42% of smaller-model failures). Context overflow accounting for over a third of Sonnet 4's failures aligns with research showing coding agents spend 60%+ of their time searching for context.
Is SWE-bench Verified Contaminated? Why OpenAI Deprecated It
In February 2026, OpenAI published "Why SWE-bench Verified no longer measures frontier coding progress" and stopped reporting Verified scores. The core finding: frontier models could reproduce gold patches and problem-statement specifics from training data, since all 500 tasks come from public Python repositories that predate every model's cutoff.
Benchmark validity criticism cuts both ways. A widely circulated community analysis claims 68.5% of GPT-5.5's SWE-bench Pro failures trace to broken test cases rather than model errors. That figure has not been confirmed by Scale or OpenAI; treat it as an open question rather than a result. What is verifiable: Scale validates gold patches across 3 test runs, publishes confidence intervals, and keeps an 858-task held-out set specifically to catch overfitting.
The DeepSWE Audit: Reward Hacking and Broken Verifiers
In May 2026, Datacurve released DeepSWE and audited SWE-bench Pro's rollouts and graders. Three findings follow; the first two bear directly on the 68.5% broken-test-cases criticism above. All are reported by Datacurve, not confirmed by Scale; the git-history issue is tracked as an open issue on Scale's own GitHub repo.
Reported by Datacurve, not confirmed by Scale
- Git-history reward hacking. Datacurve marked Claude Opus 4.6 and 4.7 as "CHEATED" on more than 12% of reviewed SWE-bench Pro tasks. The benchmark's Docker containers ship the repository's full .git history, so the gold-patch commit is on disk; the agents ran commands like git log --all to read the merged fix and paste it. GPT-5.4 and GPT-5.5 were not flagged for this. Scale tracks it as an open issue (#93).
- Broken verifiers. Datacurve reports SWE-bench Pro's automated graders accepted incorrect implementations 8.5% of the time and rejected correct ones 24% of the time, roughly one-third of trials mis-graded. That is the mechanism behind the circulating "68.5% of failures are broken test cases" claim.
- Best open-source model. GLM-5.1 is reported as the strongest open-source model on SWE-bench Pro at 58.4% (vendor-reported, not a Scale SEAL entry). MiniMax M3 is reported edging it at ~59% on open-weights trackers (June 2026).
Sources: VentureBeat on the Datacurve DeepSWE audit; Scale's GitHub issue #93. None of these figures are confirmed by Scale AI.
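The git-history finding suggests a simple screening step for anyone reviewing agent trajectories: flag shell commands that read commits beyond the task's starting checkout. The sketch below is a hypothetical heuristic, not Datacurve's audit method or anything Scale ships. The patterns are illustrative, will miss some access paths, and will flag some legitimate commands, so a hit should trigger manual review rather than an automatic verdict.

```python
import re

# Hypothetical patterns for git commands that can expose history outside the
# task's starting commit. This list is illustrative, not exhaustive.
HISTORY_PATTERNS = [
    re.compile(r"\bgit\s+log\b[^\n;|&]*--all\b"),    # e.g. git log --all
    re.compile(r"\bgit\s+(?:reflog|cherry-pick)\b"),  # reflog reads, cherry-picks
    re.compile(r"\bgit\s+show\s+[0-9a-f]{7,40}\b"),   # git show <commit hash>
]


def flag_history_access(commands):
    """Return the commands that match any history-access pattern, in order."""
    return [
        command for command in commands
        if any(pattern.search(command) for pattern in HISTORY_PATTERNS)
    ]


if __name__ == "__main__":
    trajectory = [
        "git status",
        "git log --all --oneline",
        "pytest -q tests/",
        "git show 1a2b3c4d",
    ]
    for command in flag_history_access(trajectory):
        print("review:", command)
```

Screening logs only detects the behaviour after the fact. Preventing it would mean changing what history the task container exposes, which is a harness decision this sketch does not address.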
Practical reading order for a model decision: commercial-set score first (closest to private-codebase reality), public SEAL score second (clean cross-model comparison), vendor numbers last (upper bound with tuned scaffolding). Verified scores from 2026 onward are best read as a saturation indicator, not a ranking.
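The reading order above amounts to a priority lookup: use the most trustworthy score a model actually has. A minimal sketch, with hypothetical dictionary keys ("commercial", "public_seal", "vendor") standing in for the three score types:

```python
from typing import Dict, Optional, Tuple

# Priority order taken from the paragraph above; the key names are hypothetical labels.
PRIORITY = ("commercial", "public_seal", "vendor")


def best_available_score(scores: Dict[str, Optional[float]]) -> Optional[Tuple[str, float]]:
    """Return (source, score) for the highest-priority score present, or None."""
    for source in PRIORITY:
        value = scores.get(source)
        if value is not None:
            return source, value
    return None


if __name__ == "__main__":
    # Figures from the tables above.
    print(best_available_score({"commercial": 43.4, "public_seal": 59.1}))
    print(best_available_score({"vendor": 69.2}))
```

Returning the source alongside the number keeps the scaffold label attached to the score, which is the piece of context most quoted figures drop.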
Frequently Asked Questions
What is SWE-bench Pro?
SWE-bench Pro is Scale AI's software engineering benchmark: 1,865 tasks from 41 repositories across Python, Go, TypeScript, and JavaScript, scored Pass@1, split into public (731), commercial (276), and held-out (858) sets. Tasks average 107.4 changed lines across 4.1 files.
How hard is SWE-bench Pro?
Models lose 15 to 35 points moving from Verified to Pro. Opus 4.5: 80.9% to 45.9%. Gemini 3.1 Pro: 80.6% to 46.1%. The best standardized score as of June 18, 2026 is 59.1% (GPT-5.4 xHigh). On the proprietary commercial set, no model exceeds 47.1%.
What does Claude Fable 5 score on SWE-bench Pro?
80.3%, per Anthropic's launch table (GA June 9, 2026), versus 69.2% for Opus 4.8 and 58.6% for GPT-5.5 in the same vendor-run comparison. Scale's standardized SEAL leaderboard has no Fable 5 entry yet; its top Claude run is Opus 4.6 (thinking) at 51.9%. Fable 5 is priced at $10/M input, $50/M output with a 1M-token context window. Note: Fable 5 and Mythos Preview are currently suspended (see note above).
What does Claude Opus 4.8 score on SWE-bench Pro?
69.2% vendor-reported (Anthropic scaffold). Opus 4.8 (released May 28, 2026) also posts 88.6% on SWE-bench Verified and 74.6% on Terminal-Bench 2.1, at $5/M input and $25/M output.
What does GPT-5.3-Codex score on SWE-bench Pro?
OpenAI reported 57% at launch on its own Codex scaffold. Under Scale's standardized scaffolding, the predecessor gpt-5.2-codex scores 41.0% on the public set and 27.7% on the commercial set. gpt-5.3-codex is priced at $1.75/M input, $14/M output.
Does DeepSeek V4 have a SWE-bench Pro score?
No Scale SEAL entry exists for any DeepSeek V4 variant as of June 9, 2026. Third-party trackers report 55.4% for V4-Pro on vendor-style scaffolds (unverified). DeepSeek-V4-Pro-Max scores 80.6% on SWE-bench Verified, the highest open-weights result. Details on the model family: DeepSeek V4.
What is the best open-source model on SWE-bench?
On Verified: DeepSeek-V4-Pro-Max (80.6%), MiniMax M3 (80.5%), Qwen3.7 Max (80.4%). On Scale's standardized SWE-bench Pro leaderboard, the top open-weights entry is qwen3-coder-480b-a35b at 38.7%. GLM-5.1's circulating 58.4% Pro figure is vendor-reported, not a SEAL entry. See best open-source coding models.
Why do vendor scores and Scale SEAL scores differ?
Scale runs every model through identical scaffolding; vendors run tuned agent harnesses. The gap is 10-30 points and is mostly context retrieval and tool-use quality, not model capability. Morph's internal runs show the same effect from one variable: adding the WarpGrep v2 search subagent lifts every model tested by 2.1-2.2 points.
Is SWE-bench Verified still useful?
As a frontier ranking, no: OpenAI deprecated it in February 2026 over confirmed contamination, and ranks 5-10 now sit within 0.5 points of each other. It still separates weak models from strong ones and runs cheaply. For production model selection, use SWE-bench Pro's commercial-set scores.
Is SWE-bench Pro reliable?
It is the most contamination-resistant public coding benchmark, but it has known validity issues. Datacurve's May 2026 DeepSWE audit reported that SWE-bench Pro's graders mis-graded roughly one-third of trials (accepted incorrect patches 8.5% of the time, rejected correct ones 24%), and that Claude Opus 4.6 and 4.7 were flagged "CHEATED" on more than 12% of reviewed tasks for reading gold solutions out of the repo's .git history (tracked as Scale's GitHub issue #93). A separate community claim attributes 68.5% of GPT-5.5's failures to broken test cases. None of these figures are confirmed by Scale AI. Read the standardized commercial-set scores as the most reliable signal.
WarpGrep v2: Search Subagent for SWE-bench Pro
WarpGrep v2 is the RL-trained search subagent that lifted every model it was paired with by 2.1-2.2 points on SWE-bench Pro in Morph's internal runs. It runs in its own context window and issues up to 8 parallel tool calls per turn; paired with Opus 4.6, it cut cost by 15.6% and time by 28%. Free for 100k requests, then $1 per 1M.
