Voozh

OpenAI announced GPT-5.6 Sol on June 26, 2026 with a stack of benchmark numbers that look like a clean record. Terminal-Bench state of the art, the only model past 50% on Agent’s Last Exam in code mode, cyber evals that match a top competitor on a third of the tokens. The catch you need to read first: you cannot run any of it. Sol ships as a government-gated limited preview through the OpenAI API and Codex only, restricted to roughly 20 partners whose names were individually approved by the US government. It is not in ChatGPT, and there is nothing to sign up for today.

So the benchmarks are not buying advice. They answer one question, and only one: is GPT-5.6 Sol worth waiting for, or should you move on with a model you can already use? That is what this piece sorts out. We walk through what each headline benchmark measures, put every number next to the GPT-5.5 and Claude Mythos 5 baseline you already have, and finish with an honest wait-or-move-on verdict. Every figure here comes from OpenAI’s own framing and early secondary coverage, not from a test we ran.

TL;DR

GPT-5.6 Sol is in a limited preview: OpenAI API and Codex only, not in ChatGPT, about 20 government-approved partners. General availability is “coming weeks” per OpenAI.
The reported scores are strong but secondary-sourced. Treat them as OpenAI’s claims, not measured results, until the model opens up.
Headline numbers (per OpenAI / early coverage): Terminal-Bench 2.1 SOTA, Agent’s Last Exam code mode above 50%, ExploitBench parity at roughly a third of the output tokens.
Wait if your work is agentic coding, long terminal tasks, or defensive security and you can stall a few weeks.
Do not bother waiting if you need a model in production now. The alternatives you can test today close most of the gap.

Read this before you read the scores

Benchmarks tell you what a model can do. They do not tell you whether you can use it. For GPT-5.6 Sol those are two different facts, and the second one dominates right now.

The launch is gated by the US administration under a June 2, 2026 executive order that set up benchmarking and assessment for new AI models. OpenAI agreed as a temporary step. In its words, quoted by MacRumors, “We are taking this short-term step because we believe it is the strongest path to broader availability in the coming weeks.” OpenAI says general availability in ChatGPT, Codex, and the API is coming in the coming weeks. Until then, the scores are a preview of something you cannot buy.

That framing matters for how you read the rest of this article. A 4-point Terminal-Bench lead is meaningful if you can deploy it. It is a reason to keep watching, not to halt your roadmap, if you cannot. If you want the full picture of what Sol is and why it is locked, our GPT-5.6 Sol explainer covers the family and the gate. The exact API model identifiers have not been published yet, so there is nothing to wire up even if you wanted to.

Terminal-Bench 2.1: the headline number

Terminal-Bench measures how well a model completes real tasks in a terminal: editing files, running commands, chaining tools, recovering from errors. It is the closest public proxy for “can this thing do agentic coding work end to end” rather than answer a single prompt. That is why OpenAI led with it.

👁 Image

Per OpenAI and early coverage, on Terminal-Bench 2.1 the new “ultra” configuration, Sol Ultra, scores about 91.91%, with standard Sol around 88.8%. The baselines you already have for context: Claude Mythos 5 around 88% and GPT-5.5 around 83.4%. If those hold, Sol’s standard mode roughly ties Mythos 5, and Sol Ultra pulls a few points clear of the field.

The “ultra” part is doing real work in that top score. Per OpenAI’s announcement, ultra mode “goes beyond a single agent by leveraging subagents to accelerate complex work.” So the 91.91% is not one model thinking harder; it is one model spawning helpers. That is a genuine capability shift, and it also means the headline figure does not map cleanly onto a single GPT-5.5 call. For a head-to-head on the models you can run today, our Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.5 comparison is the better reference while Sol stays locked.

Agent’s Last Exam: the “only model past 50%” claim

Agent’s Last Exam is a hard agentic benchmark built to resist saturation: multi-step tasks where the model has to plan, use tools, and follow through without a human nudging it back on track. Code mode is the slice that stresses software work specifically.

Per early coverage, GPT-5.6 Sol scores about 50.9% in code mode and is described as the only model above 50%. That framing is the point. On a benchmark where most frontier models sit in the 40s, clearing half is the kind of jump OpenAI wants to anchor the launch on.

Read it with the same caution as the Terminal-Bench figure. 50.9% is a claim from secondary reporting, not a number we measured, and “the only model above 50%” is a snapshot that other labs will push on within weeks. The honest read: if your work is genuinely agentic, long-horizon coding where a model has to drive a task to completion, this is the benchmark that argues for waiting. If your work is shorter request-and-response coding, the gap over a model you already run is smaller than the headline suggests.

ExploitBench: efficiency over raw score

The third benchmark is the most interesting one for the wait-or-move-on call, because it is not really about a bigger score. ExploitBench (and the related ExploitGym) measure cybersecurity capability. Sol is tuned to find software vulnerabilities and write fixes while resisting efforts to craft full exploit chains. This is a defensive posture, not an offensive hacking model, and OpenAI calls it its “most robust safety stack to date.”

Per early coverage, on ExploitBench Sol is competitive with Anthropic’s Mythos Preview while using roughly one third of the output tokens. The same pattern shows up on the science side: on GeneBench v1, OpenAI reports an improvement over GPT-5.5 using fewer tokens.

👁 Image

The token story is the one with real budget consequences. If Sol hits a similar quality bar at a third of the output tokens, the effective cost per solved task drops well below what the $5 input / $30 output per million tokens rate card suggests on paper. That is the efficiency argument for waiting: not that Sol is smarter on every prompt, but that it may get to the same answer cheaper on the workloads it is tuned for. The OpenAI deployment safety system card is where the safety and cyber framing is documented, and it is worth reading before you treat any cyber number as load-bearing.

How to read these scores against your baseline

Put the three benchmarks together and a shape appears. Sol’s case is strongest on long, agentic, tool-heavy work: terminal tasks, multi-step coding, defensive security sweeps. On those, it claims a few points of headroom over Mythos 5 and a wider gap over GPT-5.5, plus a token-efficiency edge.

What the benchmarks do not show is just as important. There is no published max output token limit, no stated knowledge cutoff, no confirmed modality list. The context window is reported as roughly 1.5M tokens by one outlet and “not specified” by another, so treat it as unconfirmed.

The verdict: wait or move on

Here is the honest cut.

Wait if: your core workload is agentic coding, long terminal sessions, or defensive security, and you can hold for a few weeks. The Terminal-Bench lead, the Agent’s Last Exam result, and the ExploitBench token efficiency all point at that exact profile. If a few percentage points on those tasks change your economics, Sol is worth watching closely. Watch for general availability and, more importantly, for independent benchmarks that confirm or deflate the launch numbers.

Do not bother waiting if: you need a model in production now, or your work is shorter request-and-response coding, chat, summarization, or classification. You cannot get Sol today regardless, the model IDs are not even published, and the alternatives you can run right now close most of the gap on everyday work. Waiting on a locked model to ship before you fix a problem you have today is the wrong trade. The smarter move is to pick a frontier model you can actually use; our roundup of the frontier models you can use today matches each one to the job Sol is being hyped for.

One more honest note: even when GA lands, the first wave will be GPT-5.6 across the tier lineup, Terra and Luna included, not just Sol. Terra is positioned as roughly 2x cheaper than GPT-5.5 with similar performance, which is the tier most teams will end up using. So “waiting for Sol” may really mean waiting to pick the right tier, and that is a calmer decision than the benchmark headlines imply.

Where Apidog fits while you wait

You cannot test Sol yet. You can test everything you would otherwise reach for in the meantime. Mythos 5, GPT-5.5, Gemini, and the rest all expose OpenAI-compatible or standard HTTP APIs, and you can drive them, assert on their responses, and compare behavior in Apidog today. Set up a request, point it at each model’s endpoint, and you have a repeatable harness for the decision this article is about.

👁 Image

That harness is also your day-one readiness for Sol. The day your preview access lands, or GA opens, you swap in the endpoint and model ID and run the same scenarios you already built. No new tooling, no scramble. Download Apidog to build those tests against the models you can use now, so you are ready the moment the gated one opens up.

Conclusion

GPT-5.6 Sol’s benchmarks are strong, narrowly so on the agentic and security work it was tuned for, and they are still just claims under a government gate you cannot pass today. Wait if that frontier profile is your job and you can hold a few weeks. Otherwise, move on with a model you can ship now and revisit when Sol gets independent numbers and a public endpoint.

Build your evaluation harness against the models you can use today in Apidog, so you are ready to test Sol the day your access lands.

URL: https://apidog.com/blog/gpt-5-6-sol-benchmarks/

⇱

TL;DR

Read this before you read the scores

Terminal-Bench 2.1: the headline number

Agent’s Last Exam: the “only model past 50%” claim

ExploitBench: efficiency over raw score

How to read these scores against your baseline

The verdict: wait or move on

Where Apidog fits while you wait

Conclusion