URL: https://tech-insider.org/gpt-5-5-launch-openai-april-23-terminal-bench-2026/

GPT-5.5 Launch: 82.7% Terminal-Bench, $5 API [2026]


May 26, 2026

OpenAI launched GPT-5.5 on April 23, 2026, ending six months of speculation about whether the company could close the agentic-coding gap with Anthropic. The model rolled out to ChatGPT Plus, Pro, Business, and Enterprise subscribers the same day, with API access following on April 24. It posts an 82.7% score on Terminal-Bench 2.0, an 84.9% rating on GDPval, and a 58.6% on SWE-Bench Pro, three benchmarks that OpenAI has spent the past year arguing matter more than the saturated academic tests dominating the 2024 leaderboards.

The pricing is the surprise. Standard GPT-5.5 lands at $5 per million input tokens and $30 per million output tokens with a 1-million-token context window: flat versus GPT-5.4 on the input side and only modestly higher on output, despite gains of 0.9 to 7.6 percentage points on the three benchmarks where both models are scored. GPT-5.5 Pro, the deliberative variant aimed at long-horizon research and codebase-wide refactors, is priced at $30 input / $180 output per million tokens. Both numbers reframe the cost structure of frontier reasoning models: Anthropic’s Claude Opus 4.7 still commands $15 input / $75 output, meaning OpenAI now undercuts the closest competitor on raw token economics while matching or beating it on most tasks.

This analysis examines what shipped, where GPT-5.5 leads, where it loses, and what the launch tells you about the trajectory of the 2026 model race — including stock-market reaction, competitive response from Anthropic and Google, the rollout schedule across consumer and enterprise tiers, and the five things to watch before OpenAI’s expected GPT-6 announcement later this year.

The April 23 Launch: Three Models, One System Card

OpenAI’s announcement was structured around three product tiers rather than a single flagship. GPT-5.5 Instant is the default model surfaced inside ChatGPT for free, Plus, and Pro users — a routing-layer model that responds quickly with light reasoning. GPT-5.5 Thinking is the deliberative variant exposed in the model picker for problems that benefit from extended chain-of-thought. GPT-5.5 Pro, the most expensive tier, is reserved for ChatGPT Pro ($200/month) and Enterprise tiers, plus an API endpoint targeted at research labs, quant funds, and law firms running long-form analysis pipelines.

The system card, published the same day and updated on April 24 to reflect API availability, characterizes GPT-5.5 as the company’s “strongest agentic coding model to date.” That framing is significant. OpenAI has explicitly moved away from positioning new releases around academic benchmark scores like MMLU and GPQA — both of which have effectively saturated for frontier models — and toward agentic evaluation suites that measure end-to-end task completion. The launch post emphasizes that GPT-5.5 understands tasks earlier, asks for less clarification, uses tools more efficiently, checks its own work, and “keeps going until it is done.”

The model also ships with a more aggressive personalization layer inside ChatGPT, including richer memory integration, persistent access to past chats, and (where users opt in) connected services like Gmail, Google Drive, and Microsoft 365. OpenAI quantified the safety improvements in the system card: GPT-5.5’s individual factual claims are reportedly 23% more likely to be correct than the comparison baseline, and full responses contain a factual error roughly 3% less often. The launch materials stop short of describing a single percentage hallucination rate, but the directional improvement is consistent with what Anthropic claimed for its Opus 4.7 release in March.

Headline Benchmarks: Where GPT-5.5 Leads and Where It Loses

The benchmark story is more nuanced than the press release suggests. GPT-5.5 takes the top spot on three of the seven evaluations OpenAI highlighted, but Claude Opus 4.7 still wins on real-world GitHub issue resolution and tool-coordination workloads, and Gemini 3.1 Pro retains its lead on autonomous web research. The pattern matches what independent reviewers including Vals AI, Artificial Analysis, and the SWE-Bench team have been reporting since the Opus 4.7 launch: the frontier has fragmented into specialty leaders rather than one model dominating every test.

| Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro | Leader |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 | 82.7% | 75.1% | 69.4% | 68.5% | GPT-5.5 |
| SWE-Bench Pro (public) | 58.6% | 57.7% | 64.3% | 54.2% | Claude Opus 4.7 |
| SWE-Bench Verified | 82.6% | - | 82.0% | - | GPT-5.5 |
| GDPval (wins or ties) | 84.9% | 83.0% | 80.3% | 67.3% | GPT-5.5 |
| MCP Atlas (tool use) | 75.3% | - | 79.1% | - | Claude Opus 4.7 |
| Humanity’s Last Exam (no tools) | 41.4% | - | 46.9% | - | Claude Opus 4.7 |
| BrowseComp | 84.4% | - | - | 85.9% | Gemini 3.1 Pro |
| OSWorld-Verified (computer use) | 78.7% | - | - | - | GPT-5.5 |
| Tau2-bench Telecom | 98.0% | - | - | - | GPT-5.5 |
| FinanceAgent | 60.0% | - | - | - | n/a |
| Internal IB modeling | 88.5% | - | - | - | n/a |
| OfficeQA Pro | 54.1% | - | - | - | n/a |

Benchmark scores reported by OpenAI in the GPT-5.5 launch materials and the April 24 system card update. Dashes indicate scores not published in the comparison set.

The 13.3-percentage-point Terminal-Bench gap over Opus 4.7 is the most material number on the table. Terminal-Bench 2.0 evaluates agent performance on long-running shell-based tasks — installing dependencies, debugging environment errors, recovering from failed commands, and producing working artifacts — and is widely considered a leading indicator for autonomous developer workflows. A 13-point lead at this level of saturation is unusual, and several independent reviewers including Jake Handy (HandyAI Substack) and Alex Lavaee verified the result within 48 hours of release. The SWE-Bench Pro loss to Opus 4.7 is the launch’s most acknowledged weakness: Anthropic’s model resolves real GitHub issues end-to-end at 64.3% to OpenAI’s 58.6%, a 5.7-point delta that matters for enterprises standardizing on a single coding agent.
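The head-to-head margins quoted in this section can be recomputed from the published scores. What follows is a minimal Python sketch rather than anything from OpenAI’s materials: the `SCORES` layout and the `leader` and `margin` helpers are illustrative names introduced here, and only a subset of the benchmarks with at least two published scores is included.

```python
# Scores in percent, copied from the launch comparison figures in this article.
# Models with no published score for a benchmark are simply left out.
SCORES = {
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "GPT-5.4": 75.1,
        "Claude Opus 4.7": 69.4, "Gemini 3.1 Pro": 68.5,
    },
    "SWE-Bench Pro": {
        "GPT-5.5": 58.6, "GPT-5.4": 57.7,
        "Claude Opus 4.7": 64.3, "Gemini 3.1 Pro": 54.2,
    },
    "SWE-Bench Verified": {"GPT-5.5": 82.6, "Claude Opus 4.7": 82.0},
    "MCP Atlas": {"GPT-5.5": 75.3, "Claude Opus 4.7": 79.1},
    "BrowseComp": {"GPT-5.5": 84.4, "Gemini 3.1 Pro": 85.9},
}


def leader(benchmark):
    """Return the model with the highest published score on a benchmark."""
    scores = SCORES[benchmark]
    return max(scores, key=scores.get)


def margin(benchmark, a, b):
    """Percentage-point gap (a minus b), rounded to one decimal place."""
    return round(SCORES[benchmark][a] - SCORES[benchmark][b], 1)


if __name__ == "__main__":
    for name in SCORES:
        gap = margin(name, "GPT-5.5", leader(name))
        print(f"{name}: leader = {leader(name)}, GPT-5.5 gap to leader = {gap}")
```

A negative margin means the first model trails the second, which is how the SWE-Bench Pro and MCP Atlas deficits show up.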

The Pricing Reset: $5/$30 Is the Real Headline

Near-parity pricing with GPT-5.4 (flat on input, 20% higher on output), combined with a 1-million-token context window and a token-efficiency improvement that early users estimate at roughly 40% fewer output tokens on Codex-style tasks, represents a quiet but consequential reset. The effective cost of completing equivalent agentic coding work on GPT-5.5 versus the previous generation is closer to a 20% increase than the doubling the headline price would suggest, according to early calculations published by HandyAI and corroborated by developer-experience reports on the OpenAI Developer Forum.

| Model | Input ($/1M) | Output ($/1M) | Context | Provider |
| --- | --- | --- | --- | --- |
| GPT-5.5 | $5.00 | $30.00 | 1M tokens | OpenAI |
| GPT-5.5 Pro | $30.00 | $180.00 | 1M tokens | OpenAI |
| GPT-5.4 | $5.00 | $25.00 | 1M tokens | OpenAI |
| GPT-4o | $2.50 | $10.00 | 128K tokens | OpenAI |
| Claude Opus 4.7 | $15.00 | $75.00 | 1M tokens (beta) | Anthropic |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K tokens | Anthropic |
| Gemini 3.1 Pro | $2.50 | $15.00 | 2M tokens | Google |
| DeepSeek V4 | $0.55 | $1.74 | 128K tokens | DeepSeek |

Frontier model API pricing as of April 26, 2026. Batch and Flex modes on OpenAI’s API run at half the listed standard rate; Priority processing runs at 2.5x standard.

The Anthropic comparison is the one OpenAI clearly designed for. GPT-5.5 lists at one-third the input cost and 40% of the output cost of Opus 4.7, while leading on Terminal-Bench, GDPval, and OSWorld-Verified. For startups building coding agents, customer-support automation, or document-processing pipelines that consume hundreds of millions of tokens per week, the unit-economics gap is now large enough to drive switching even where Opus 4.7 retains a benchmark edge. Three early-stage AI infrastructure startups — Factory, Cognition, and Magic — confirmed to TechCrunch within hours of launch that they were running GPT-5.5 in production evaluation pipelines by the morning of April 24.

Google’s Gemini 3.1 Pro remains the volume play. At $2.50 input / $15 output and a 2-million-token context window, it is still the cheapest frontier model for high-context retrieval workloads, and its BrowseComp lead means autonomous web research agents continue to default to it. DeepSeek V4 occupies its own pricing tier — roughly 9x cheaper on input than GPT-5.5 — but trails meaningfully on the agentic evaluations OpenAI now emphasizes, and the open-weight distribution model means enterprises pay infrastructure costs rather than per-token rates.
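To make the unit-economics comparison concrete, the list prices can be turned into a per-request cost estimate. The Python sketch below is illustrative only: `request_cost` and the tier names are hypothetical helpers rather than part of any provider SDK, the prices are the list rates quoted in this article, and the Batch, Flex, and Priority multipliers are described here only for OpenAI’s API.

```python
# List prices in USD per one million tokens: (input, output), as quoted above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.5 Pro": (30.00, 180.00),
    "GPT-5.4": (5.00, 25.00),
    "Claude Opus 4.7": (15.00, 75.00),
    "Gemini 3.1 Pro": (2.50, 15.00),
    "DeepSeek V4": (0.55, 1.74),
}

# Multipliers on the standard rate. The article states these for OpenAI's API
# only; applying them to other providers would be an unsupported assumption.
TIER_MULTIPLIER = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.5}


def request_cost(model, input_tokens, output_tokens, tier="standard"):
    """Estimated USD cost of one request at list price for the given tier."""
    input_price, output_price = PRICES[model]
    usd = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return usd * TIER_MULTIPLIER[tier]


if __name__ == "__main__":
    # Hypothetical workload: 200K input tokens and 20K output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 200_000, 20_000):.2f}")
```

The estimate ignores prompt caching and any difference in how many tokens each model needs to finish the same task, so it compares list prices rather than true per-task cost.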

Market Reaction: Microsoft, Nvidia, and the Hyperscaler Trade

The launch landed during a constructive tape for Big Tech. The S&P 500 was within 1.5% of all-time highs heading into the four-stock earnings window of Alphabet, Microsoft, Amazon, and Meta, all reporting between April 28 and May 1. Microsoft, which holds 27% of OpenAI’s commercial economics and remains the largest single contracted compute customer for OpenAI workloads despite the November 2025 AWS Bedrock agreement, traded up 2.1% in the session following the announcement. Nvidia advanced 1.8% on the news that GPT-5.5’s training run consumed an order of magnitude more H200 and Blackwell GPU-hours than GPT-5.4, although OpenAI declined to publish exact compute totals.

Oracle, which executed a five-year $300 billion compute commitment with OpenAI as part of the Stargate buildout, gained 3.4% — the largest single-day move among hyperscalers tied to OpenAI’s roadmap. The trade reflects a market view that GPT-5.5’s economics, particularly the implied 40% token-efficiency improvement, make the Stargate capex pencil out more cleanly than the consensus view three months ago, when OpenAI’s revenue miss disclosed in the Wall Street Journal raised questions about the timeline. CoreWeave, which carries an estimated $66.8 billion AI-infrastructure backlog including a multibillion-dollar Anthropic contract signed in March, traded flat — reflecting investor uncertainty about whether GPT-5.5’s launch erodes Anthropic’s competitive position enough to slow Claude inference demand.

“This is the cleanest agentic-coding lead OpenAI has held since GPT-4 Turbo’s launch in late 2024,” said Gil Luria, head of technology research at D.A. Davidson, in a note to clients the morning of April 24. “The pricing structure tells you OpenAI is now competing on unit economics, not just capability — that’s a fundamentally different posture than the 2025 model releases and it suggests the marginal cost of training and serving these systems has come down materially.”

Anthropic’s Position: SWE-Bench Pro and Tool Coordination Remain Defended

Anthropic faces the most direct competitive pressure from the launch. Opus 4.7, released March 5, was the model that had set the agentic-coding benchmark bar throughout Q1 2026, with a SWE-Bench Verified score of 82.0% and an industry-leading 64.3% on SWE-Bench Pro. GPT-5.5 now edges Opus on SWE-Bench Verified by 0.6 percentage points while still trailing on SWE-Bench Pro by 5.7 points and on MCP Atlas tool-use by 3.8 points. The split is narrow enough that head-to-head buyer evaluations will likely come down to specific workload mix, latency requirements, and price.

Anthropic chief product officer Mike Krieger has not commented publicly on the GPT-5.5 release as of April 26, but the company’s developer relations team began circulating updated head-to-head evaluation guides to enterprise customers within 12 hours of OpenAI’s launch. Internal pricing pressure is the more immediate concern. Anthropic’s Opus 4.7 input rate is 3x what OpenAI now charges for GPT-5.5; if customer churn accelerates, an Opus 4.7 price cut or a faster-than-planned Opus 4.8 announcement becomes a likely defensive move before the summer enterprise renewal cycle.

“OpenAI has finally caught up on what we’d call the legibility of agentic work — the ability for the model to explain what it’s doing and recover gracefully from errors,” said Nathan Benaich, founder of Air Street Capital and author of the State of AI Report, in a Substack post on April 24. “Claude has held that lead since Sonnet 3.5. GPT-5.5 closes most of the gap, but Anthropic still wins on the messy, long-horizon refactoring work that drives a lot of revenue inside the big coding-agent startups.”

Google’s Counter: Gemini 3.1 Pro and the BrowseComp Lead

Google’s response is structural rather than tactical. The Gemini 3.1 Pro release in late March, paired with the April 20 announcement of the TPU 8t and 8i inference chips and a $21 billion Broadcom partnership, signaled that Google intends to compete on compute economics rather than chase OpenAI’s coding-agent positioning. Gemini 3.1 Pro retains the lead on BrowseComp (85.9% vs. 84.4%), holds the only 2-million-token production context window, and undercuts even GPT-5.5 on input pricing at $2.50 per million tokens.

The Vertex AI platform also benefits from Google’s enterprise-distribution advantage in Workspace, education, and the public sector. Demis Hassabis, DeepMind co-founder and CEO, addressed the launch obliquely on April 25 at a UK government AI policy event, telling attendees that “the more interesting frontier is no longer the benchmark race — it’s the cost per useful task at production scale, where TPUs give us a structural advantage we expect to widen through 2027.” The quote was reported by The Financial Times and circulated widely in AI-investor newsletters over the weekend.

The Stanford AI Index Context: A 2.7% Gap That Just Widened

The launch lands ten days after the Stanford HAI 2026 AI Index Report, released April 13, which framed the U.S.-China frontier-model gap at 2.7% as of March 2026 and noted that SWE-Bench Verified performance had climbed from roughly 60% to near 100% across the top models in a single year. GPT-5.5’s 82.6% SWE-Bench Verified score sits below the headline saturation figure the report cited, but the comparison is misleading: Stanford’s “near 100%” is a composite of top human-baseline-equivalent performance across the full benchmark suite, not a single model’s Verified score.

“This year’s AI Index reveals a widening gap between how quickly AI is advancing and our ability to measure and manage it,” Stanford HAI wrote in the report’s framing. GPT-5.5 is a textbook example: the model’s most consequential capabilities — long-horizon agentic coding, computer use, and multi-tool coordination — are exactly the categories where existing benchmarks struggle to keep pace, and OpenAI’s introduction of evaluations like Terminal-Bench 2.0 and OSWorld-Verified is part of a broader industry shift away from saturated academic tests toward harder, more representative workload simulations.

The report’s labor-market data underscores the commercial stakes. AI skills now appear in 2.5% of all U.S. job postings, up 55% year-over-year. Mentions of the “Agentic AI” skill cluster jumped over 280% in a single year — from 0.06% of postings in 2024 to 0.23% in 2025, or roughly 90,000 postings. GPT-5.5’s positioning around agentic capability is the most direct commercial expression of that hiring trend yet.

Computer Use, Tool Coordination, and the Agentic Stack

The OSWorld-Verified score of 78.7% is the launch’s most underappreciated number. OSWorld-Verified measures end-to-end agent performance on real desktop environments — filling out forms, navigating menus, recovering from broken application states — and is the closest available proxy for true computer-use capability. GPT-5.5 outperforms every public competitor on this benchmark by margins of 10-15 percentage points, depending on the reviewer, and the gap explains why early integration partners including Lindy, Reflection, and Adept-derived startups are running production migration pilots.

Tau2-bench Telecom, at 98.0%, indicates near-perfect performance on customer-service agent simulations in a constrained domain. The FinanceAgent score of 60.0% and the internal investment-banking modeling task at 88.5% — a private OpenAI evaluation built with cooperation from several Wall Street firms — point to a deliberate enterprise vertical strategy. OfficeQA Pro at 54.1% indicates document-processing capability that is competitive but not industry-leading, and it is the area where Anthropic’s Sonnet 4.6 still has the strongest specialty case for cost-sensitive deployments.

The 1M Context Window: What Actually Changes

OpenAI matched Anthropic’s Opus 4.7 context expansion by bringing the full 1-million-token window to both GPT-5.5 and GPT-5.5 Pro at launch. The practical effect: developers can now feed entire mid-size codebases, multi-hour transcripts, or 700-page legal documents into a single request without resorting to RAG architectures. Internal OpenAI evaluations report that needle-in-a-haystack retrieval accuracy across the full 1M-token range remains above 95% — meaningfully better than the degradation curve observed on GPT-4 Turbo’s 128K window two years ago.

Google’s Gemini 3.1 Pro retains the 2M-token lead and remains the default choice for ultra-long-context applications like full-codebase comprehension or thorough video transcript analysis. But the cost picture has shifted: at $5/$30 with the efficiency improvements OpenAI ships in GPT-5.5, the effective per-task cost of long-context reasoning is now competitive with Gemini for most enterprise workloads under 800K tokens. The 1M boundary is also where prompt-caching benefits become meaningful — OpenAI’s caching system reduces input costs by 50% on repeated portions of long prompts, which substantially favors agentic loops with stable system prompts and shifting working memory.
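The caching effect on an agentic loop can be sketched with simple arithmetic. The Python example below is a rough model under stated assumptions, not a description of OpenAI’s billing logic: it assumes the stable prefix is billed at full price on the first step and at the 50% discount cited above on every later step, that fresh tokens always pay full price, and that output costs are ignored. The function name and the workload sizes are hypothetical.

```python
INPUT_PRICE = 5.00      # USD per 1M input tokens, GPT-5.5 list price above
CACHE_DISCOUNT = 0.50   # discount on repeated prompt portions, as cited above


def loop_input_cost(prefix_tokens, fresh_tokens_per_step, steps, cached=True):
    """Input-side USD cost of an agent loop that resends a stable prefix.

    Assumes every step after the first gets a cache hit on the whole prefix.
    """
    if steps <= 0:
        return 0.0
    full_rate = INPUT_PRICE / 1_000_000
    prefix_rate = full_rate * (1 - CACHE_DISCOUNT) if cached else full_rate
    # First step: nothing is cached yet, so everything pays the full rate.
    first = (prefix_tokens + fresh_tokens_per_step) * full_rate
    # Later steps: discounted prefix plus full-price fresh tokens.
    later = (steps - 1) * (
        prefix_tokens * prefix_rate + fresh_tokens_per_step * full_rate
    )
    return first + later


if __name__ == "__main__":
    # Hypothetical loop: 100K-token stable prefix, 5K fresh tokens, 10 steps.
    with_cache = loop_input_cost(100_000, 5_000, 10, cached=True)
    without_cache = loop_input_cost(100_000, 5_000, 10, cached=False)
    print(f"cached: ${with_cache:.2f}, uncached: ${without_cache:.2f}")
```

Under these assumptions the hypothetical ten-step loop costs about $3.00 on the input side with caching versus $5.25 without, and the saving grows as the stable prefix becomes a larger share of each request.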

Safety Evaluations and the Updated System Card

The April 24 system card update is the most extensive safety disclosure OpenAI has published for a non-frontier-rebadged release. Headline figures: individual factual claims that are reportedly 23% more likely to be correct than the comparison baseline, responses that contain a factual error roughly 3% less often, and what the system card describes as “substantially improved performance on adversarial jailbreaking suites” without quantifying the delta. The Preparedness Framework evaluations rate GPT-5.5 below the “high risk” threshold across cybersecurity, biosecurity, autonomy, and persuasion categories, consistent with OpenAI’s GPT-5.4 rating.

External red-teaming partners including Apollo Research, METR, and the U.S. AI Safety Institute were granted pre-launch access under the standard NDA framework. METR’s public summary, posted April 24, notes that GPT-5.5 demonstrates “meaningfully improved long-horizon agentic capability” but that the improvement remains within the company’s internal expectations and does not trigger a higher Preparedness tier. The summary is the most consequential third-party validation of the system card claims and is the document most likely to be cited in upcoming Congressional AI oversight hearings scheduled for May.

Five Predictions for the Next 90 Days

1. Anthropic announces an Opus 4.7 price cut before the end of May. The 3x input-price gap to GPT-5.5 is unsustainable at the current capability delta. Expect Anthropic to cut Opus pricing to roughly $10/$50 per million tokens — a 33% reduction — to preserve enterprise share before the Q2 renewal window closes.

2. Google ships a Gemini 3.2 Pro update within 60 days. The BrowseComp lead is the only category where Gemini retains a clear win against GPT-5.5, and Google will not let that erode through a multi-quarter release cycle. The 3.2 update will emphasize agentic coding and tool coordination — categories where Google has trailed since the Sonnet 3.5 era.

3. DeepSeek announces a V4.1 release optimized for the same agentic workloads. DeepSeek’s V4, released March 12 on Huawei Ascend hardware, was tuned for academic benchmarks rather than agentic evaluation. Expect a V4.1 that explicitly targets Terminal-Bench, SWE-Bench Pro, and OSWorld-Verified — and is priced aggressively to maintain DeepSeek’s order-of-magnitude cost lead.

4. xAI announces Grok 4 with a comparable computer-use benchmark. xAI has been silent on Grok 4 since the Colossus 2 power dispute. The GPT-5.5 launch resets the competitive bar for what counts as a frontier release, and Musk’s company will not be able to ship a Grok 4 announcement without a competitive OSWorld-Verified or Terminal-Bench number.

5. Enterprise AI spending on agentic infrastructure outpaces RAG by H2 2026. The combination of strong agentic benchmarks, falling per-task costs, and Stanford-confirmed labor-market demand for agentic AI skills should drive a meaningful budget shift away from retrieval-augmented generation toward true agent orchestration platforms. Watch for Series B and Series C rounds at LangChain, CrewAI, and the LlamaIndex commercial entity within the next two quarters.

Historical Context: From GPT-4 to GPT-5.5 in 37 Months

The release pace has accelerated dramatically. GPT-4 launched in March 2023; GPT-4 Turbo arrived eight months later; GPT-4o landed in May 2024; GPT-5 was announced in late 2025; GPT-5.4 shipped early in 2026; and GPT-5.5 is the second material release of the year. That cadence — roughly one substantive frontier release per quarter — is well above what the 2024 industry consensus expected, and it tracks the pattern Anthropic established with the Claude 3.5 / 3.7 / 4.0 / 4.5 / 4.6 / 4.7 series.

The reasoning-mode innovation pioneered by OpenAI’s o1 release in September 2024 and Anthropic’s Claude 3.7 Sonnet “extended thinking” mode in early 2025 has now been fully integrated into the base model lineups of both companies. GPT-5.5 Thinking and GPT-5.5 Pro represent OpenAI’s view that deliberation depth should be a product tier rather than a separate model family, which is the same architectural conclusion Anthropic reached with Sonnet 4.6 and Opus 4.7.

What to Watch Next

The four-stock earnings window between April 28 and May 1 is the most immediate signal. Microsoft’s commentary on Azure AI demand, GitHub Copilot revenue (which now runs heavily on GPT-5.5 inference), and capex guidance for FY27 will set the tone for the next quarter of AI-infrastructure investment. Alphabet’s Google Cloud growth rate and any update on the Anthropic compute commitment will signal how rapidly Gemini is gaining enterprise share against the OpenAI/Azure stack. Amazon’s commentary on Bedrock usage will show whether the November 2025 OpenAI deal is materially affecting AWS AI revenue mix.

The May 2026 RSAC and Build conferences will reveal which enterprise tooling stacks have moved fastest to GPT-5.5 integration. The June Apple WWDC event will indicate whether Apple Intelligence’s expanded model integration — rumored to add GPT-5.5 alongside the existing on-device foundation models — proceeds on schedule. And OpenAI’s expected developer day announcement in late summer is the likely venue for the GPT-6 capability preview, which several reporters at The Information have suggested will emphasize multimodal world models and embodied AI evaluation rather than another agentic-coding leap.

Frequently Asked Questions

When did GPT-5.5 launch?

OpenAI launched GPT-5.5 to ChatGPT Plus, Pro, Business, and Enterprise users on April 23, 2026. The API became available the following day, April 24, 2026, with the system card updated to reflect production availability.

How much does GPT-5.5 cost on the API?

Standard GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens. GPT-5.5 Pro, the deliberative variant, is priced at $30 per million input tokens and $180 per million output tokens. Batch and Flex pricing run at half the standard rate; Priority processing is 2.5x the standard rate. Both models include a 1-million-token context window.

What is the difference between GPT-5.5 Instant, GPT-5.5 Thinking, and GPT-5.5 Pro?

GPT-5.5 Instant is the default fast-response model surfaced for most ChatGPT users. GPT-5.5 Thinking is the deliberative variant that uses extended chain-of-thought for harder problems. GPT-5.5 Pro is the most expensive tier, restricted to ChatGPT Pro and Enterprise subscribers and available via a separate API endpoint at $30/$180 per million tokens.

Does GPT-5.5 beat Claude Opus 4.7 on coding benchmarks?

GPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs. 69.4%), SWE-Bench Verified (82.6% vs. 82.0%), and GDPval (84.9% vs. 80.3%). Claude Opus 4.7 still leads on SWE-Bench Pro (64.3% vs. 58.6%), MCP Atlas tool-use (79.1% vs. 75.3%), and Humanity’s Last Exam without tools (46.9% vs. 41.4%). The competitive position depends on workload mix.

How does GPT-5.5 compare to Gemini 3.1 Pro?

GPT-5.5 leads Gemini 3.1 Pro on every published OpenAI launch benchmark except BrowseComp, where Gemini posts 85.9% to GPT-5.5’s 84.4%. Gemini 3.1 Pro is meaningfully cheaper ($2.50/$15 per million tokens) and offers a 2-million-token context window, making it the volume play for high-context retrieval workloads.

What is the GPT-5.5 context window?

GPT-5.5 and GPT-5.5 Pro both ship with a 1-million-token context window, matching Claude Opus 4.7 and trailing only Gemini 3.1 Pro’s 2-million-token window among frontier production models. OpenAI reports needle-in-a-haystack retrieval accuracy above 95% across the full 1M-token range.

Is GPT-5.5 available in the ChatGPT free tier?

Yes. GPT-5.5 Instant rolled out as the default model for ChatGPT free users on April 23, 2026, with rate limits significantly lower than the Plus tier. GPT-5.5 Thinking and GPT-5.5 Pro are restricted to paid tiers (Plus $20/month, Pro $200/month, Business, and Enterprise).

Will GPT-5.5 impact Microsoft, Nvidia, and Oracle stock?

Microsoft traded up 2.1%, Nvidia 1.8%, and Oracle 3.4% in the session following the launch. Oracle was the biggest hyperscaler beneficiary, reflecting the investor view that GPT-5.5’s improved economics make the $300 billion Stargate compute commitment pencil out more cleanly. CoreWeave, more exposed to Anthropic inference demand, traded flat.

Is GPT-5.5 better than GPT-5.4?

GPT-5.5 improves on GPT-5.4 across every published launch benchmark: Terminal-Bench 2.0 (82.7% vs. 75.1%), SWE-Bench Pro (58.6% vs. 57.7%), and GDPval (84.9% vs. 83.0%). Standard API pricing is flat on input ($5 per million tokens) and rises 20% on output ($30 vs. $25). The improved token efficiency means net per-task cost increases are typically in the 15-25% range.

Does GPT-5.5 support computer use and tool calls?

Yes. GPT-5.5 scores 78.7% on OSWorld-Verified, the leading public benchmark for autonomous desktop computer use, and 75.3% on MCP Atlas for tool coordination. Both numbers represent meaningful improvements over GPT-5.4. The OSWorld-Verified result puts OpenAI ahead of every published competitor on computer-use evaluation as of April 26, 2026, while Claude Opus 4.7 still leads on MCP Atlas at 79.1%.

Sources: GPT-5.5 Wikipedia entry · Stanford HAI 2026 AI Index Report · SWE-Bench official benchmark · Anthropic Claude product page · Google DeepMind Gemini.

Marcus Chen

Senior Tech Reporter

Marcus Chen is a Senior Tech Reporter at Tech Insider covering cloud computing, enterprise software, and the business of technology. Before joining TI, he spent five years at ZDNet covering digital transformation across European enterprises and three years at The Register reporting on cloud infrastructure. Marcus is known for his deep dives into cloud cost optimization and multi-cloud strategy. He holds a degree in Computer Science from Imperial College London and speaks regularly at KubeCon and CloudNative events.

© 2026 Tech Insider Media AB. All rights reserved.