Google released Gemini 3.1 Pro on February 19, 2026, and its benchmark numbers are hard to ignore. The model scored 77.1% on ARC-AGI-2, a test specifically designed to prevent AI from relying on memorised answers — it forces genuine reasoning on problems the model has never encountered before. That is more than double the 31.1% scored by Gemini 3 Pro when it launched just three months ago. Google has since launched Gemini 3.5, whose Flash model now beats Gemini 3.1 Pro on coding and agentic benchmarks.
This article covers everything you need to know about Gemini 3.1 Pro, including what changed from the previous version, how it performs against Claude Opus 4.6 and GPT-5.2, what it costs, and where you can access it right now. If you’re deciding whether it belongs in your AI workflow, the data below makes the comparison clear.
- Google released Gemini 3.1 Pro on February 19, 2026, currently available as a preview
- ARC-AGI-2 score: 77.1%, a 2.5x improvement over Gemini 3 Pro’s 31.1%
- Pricing starts at $2 per million input tokens, 60% less than Claude Opus 4.6’s $5
- Supports a 1 million token context window with up to 75% cost savings via context caching
- Available now in the Gemini app, Google AI Studio, Vertex AI, and GitHub Copilot
Gemini 3.1 Pro is Google’s updated flagship AI model and the most capable model in the Gemini 3 family. The .1 increment is new for Google; previous generations moved straight from the base model to a .5 update, which signals this is a meaningful capability jump rather than a minor patch.
Google targets it at complex tasks, not everyday chat. It handles text, images, audio, video, and entire code repositories in a single session, with a 1 million token input context window and up to 64,000 tokens of output per call. Three configurable thinking levels (Low, Medium, and High) let you control the trade-off between reasoning depth, speed, and cost depending on what the task demands.
“Gemini 3.1 Pro is here. Hitting 77.1% on ARC-AGI-2, it’s a step forward in core reasoning (more than 2x 3 Pro). With a more capable baseline, it’s great for super complex tasks like visualizing difficult concepts, synthesizing data into a single view, or bringing creative…”
Sundar Pichai (@sundarpichai), February 19, 2026
The reasoning improvement is the headline. Gemini 3.1 Pro scored 77.1% on ARC-AGI-2 versus 31.1% for Gemini 3 Pro. ARC-AGI-2 is designed to test whether a model can solve logic patterns it has genuinely never seen — it deliberately avoids anything a model could memorise from training data. A 2.5x gain on that benchmark in one update is significant.
Coding performance improved too. The model scored 80.6% on SWE-Bench Verified and 68.5% on Terminal-Bench 2.0. The Terminal-Bench result placed first among publicly ranked models at launch, while the SWE-Bench Verified result sits a fraction behind Claude Opus 4.6’s 80.8%. Google describes it as an “agentic coding model” built for edit-then-test workflows that require fewer tool calls per completed task.
Gemini 3.1 Pro introduces selectable thinking levels, a feature designed for cost optimisation at scale. You choose how much reasoning budget the model applies based on what you’re doing. A quick document summary doesn’t need the same processing depth as designing a data pipeline or debugging a complex codebase. Google has not yet published detailed performance breakdowns per level, but the system is built to prevent you from paying for heavy reasoning when you don’t need it.
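One way to picture the thinking-level trade-off is a small routing table that assigns a level to each kind of task before a request is sent. The sketch below is illustrative only: it calls no Google SDK, and the task categories, the lowercase level names, and the `RequestPlan` fields are all assumptions rather than anything Google has documented.

```python
from dataclasses import dataclass

# The three levels named in the article; the lowercase spellings are an assumption.
THINKING_LEVELS = ("low", "medium", "high")

# Hypothetical mapping from task category to level; tune it for your own workload.
TASK_TO_LEVEL = {
    "summary": "low",
    "extraction": "low",
    "code_review": "medium",
    "pipeline_design": "high",
    "debugging": "high",
}


@dataclass(frozen=True)
class RequestPlan:
    """Hypothetical container for the settings a request would carry."""
    model: str
    thinking_level: str
    prompt: str


def plan_request(task_type: str, prompt: str, default: str = "medium") -> RequestPlan:
    """Pick a thinking level for a task type and bundle it with the prompt."""
    level = TASK_TO_LEVEL.get(task_type, default)
    if level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking level: {level!r}")
    return RequestPlan(
        model="gemini-3.1-pro-preview",
        thinking_level=level,
        prompt=prompt,
    )
```

Unknown task types fall back to the middle level here, which is a design choice for the sketch: it avoids paying for the deepest reasoning by default while not starving an unclassified task.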
Gemini 3.1 Pro leads on most major benchmarks, though the picture is nuanced depending on task type. According to Digital Applied’s benchmark tracker, the model currently holds first place on 12 out of 18 tracked metrics.
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT-5.2 |
|---|---|---|---|---|
| ARC-AGI-2 | 77.1% | 68.8% | 60.4% | 52.9% |
| GPQA Diamond | 94.3% | 91.3% | 74.1% | 93.2% |
| SWE-Bench Verified | 80.6% | 80.8% | 79.6% | 80.0% |
| Terminal-Bench 2.0 | 68.5% | 65.4% | 59.1% | 60.0% |
| GDPval-AA Elo | 1,317 | 1,606 | 1,633 | 1,462 |
| Context window | 1M tokens | 1M tokens (beta) | 1M tokens (beta) | 400K tokens |
| Input price (per 1M tokens) | $2 | $5 | $3 | $1.75 |
| Output price (per 1M tokens) | $12 | $25 | $15 | $14 |
*ARC-AGI-2 score for Claude Opus 4.6 is estimated; Gemini 3.1 Pro leads by over 8 percentage points per Google. GPT-5.2 benchmark data not publicly confirmed at time of writing.
Gemini 3.1 Pro leads Claude Opus 4.6 on ARC-AGI-2 by more than 8 percentage points and on GPQA Diamond by 3 points, at 40% of the input-token price ($2 versus $5 per million).
The one area where Claude holds a clear lead is GDPval-AA, a benchmark measuring performance on complex expert-level office work. Claude Sonnet 4.6 scores 1,633 Elo and Claude Opus 4.6 scores 1,606, versus Gemini 3.1 Pro’s 1,317. If your work involves high-stakes research synthesis, multi-step legal analysis, or complex long-form writing, that gap is real and worth factoring in.
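To see which model tops each row, the benchmark scores in the table above can be tallied with a few lines of Python. This is a minimal sketch that uses only the five benchmark rows from this article's table, not the 18 metrics Digital Applied tracks; the function names are illustrative.

```python
from collections import Counter

# Scores copied from the comparison table above; higher is better on every row.
SCORES = {
    "ARC-AGI-2": {
        "Gemini 3.1 Pro": 77.1, "Claude Opus 4.6": 68.8,
        "Claude Sonnet 4.6": 60.4, "GPT-5.2": 52.9,
    },
    "GPQA Diamond": {
        "Gemini 3.1 Pro": 94.3, "Claude Opus 4.6": 91.3,
        "Claude Sonnet 4.6": 74.1, "GPT-5.2": 93.2,
    },
    "SWE-Bench Verified": {
        "Gemini 3.1 Pro": 80.6, "Claude Opus 4.6": 80.8,
        "Claude Sonnet 4.6": 79.6, "GPT-5.2": 80.0,
    },
    "Terminal-Bench 2.0": {
        "Gemini 3.1 Pro": 68.5, "Claude Opus 4.6": 65.4,
        "Claude Sonnet 4.6": 59.1, "GPT-5.2": 60.0,
    },
    "GDPval-AA Elo": {
        "Gemini 3.1 Pro": 1317, "Claude Opus 4.6": 1606,
        "Claude Sonnet 4.6": 1633, "GPT-5.2": 1462,
    },
}


def leaders(scores):
    """Return the top-scoring model for each benchmark."""
    return {bench: max(row, key=row.get) for bench, row in scores.items()}


def win_counts(scores):
    """Count how many benchmarks each model leads."""
    return Counter(leaders(scores).values())
```

On these five rows, Gemini 3.1 Pro leads three, with Claude Opus 4.6 ahead on SWE-Bench Verified and Claude Sonnet 4.6 ahead on GDPval-AA.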
Gemini 3.1 Pro pricing is the same as Gemini 3 Pro. For prompts under 200,000 tokens, you pay $2 per million input tokens and $12 per million output tokens. For longer sessions in the 200,000 to 1,000,000 token range, that moves to $4 per million input and $18 per million output.
Context caching cuts those costs by up to 75% for repeated content across long sessions, useful for anyone running the model against large documents or codebases regularly.
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Under 200K tokens | $2 | $12 |
| 200K to 1M tokens | $4 | $18 |
| With context caching | up to 75% off | up to 75% off |
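The tiers above translate into a simple per-request estimate. The sketch below rests on a few assumptions: the tier is chosen by the prompt's input token count, the boundary follows the table's "under 200K" wording, and the caching discount is applied at its best-case 75% to the cached share of input tokens only. Actual billing rules may differ, so treat the result as a rough guide.

```python
# Rates in USD per 1M tokens, taken from the pricing table above.
RATES = {
    "short": (2.00, 12.00),   # prompts under 200K tokens: (input, output)
    "long": (4.00, 18.00),    # prompts from 200K to 1M tokens
}
LONG_PROMPT_THRESHOLD = 200_000
MAX_CONTEXT = 1_000_000
CACHE_DISCOUNT = 0.75  # best case of the "up to 75%" figure (assumption)


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_fraction: float = 0.0) -> float:
    """Estimate the USD cost of one request under the assumptions above."""
    if not 0 <= input_tokens <= MAX_CONTEXT:
        raise ValueError("input_tokens must be between 0 and 1,000,000")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")

    tier = "short" if input_tokens < LONG_PROMPT_THRESHOLD else "long"
    input_rate, output_rate = RATES[tier]

    # Cached input tokens are billed at the discounted rate, the rest at full rate.
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1 - CACHE_DISCOUNT)

    input_cost = billable_input * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost + output_cost
```

Under these assumptions, a 100,000-token prompt with 10,000 output tokens comes to about $0.32, while a 500,000-token prompt with the same output lands in the higher tier at about $2.18.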
Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens, 2.5 times the input cost and roughly twice the output cost for the same workload. For most use cases, Gemini 3.1 Pro is the more cost-effective choice, unless your tasks fall into the expert-level category where Claude still leads.
Gemini 3.1 Pro launched in preview on February 19, 2026. It is available today in the Gemini app, Google AI Studio, Vertex AI, and GitHub Copilot.
The API model ID is gemini-3.1-pro-preview. A second variant, gemini-3.1-pro-preview-customtools, is available for agentic workflows where precise function-call performance matters. Simon Willison’s first-day testing noted significant latency at launch: one query took over 100 seconds. Google is expected to resolve this as the preview scales.
General availability is expected soon. For a broader look at where Gemini 3.1 Pro sits in the current AI field, see our February 2026 AI model rankings.
Alongside the 3.1 Pro launch, Google added music generation to the Gemini app via Lyria 3, its most capable music model to date. You describe a track in plain text or upload an image, and Gemini generates a 30-second song complete with AI-written lyrics and cover art. The rollout began on February 18, 2026 on desktop, with mobile support following shortly after.
Lyria 3 is a meaningfully better model than its predecessor. It handles more complex track structures, gives you greater control over musical style and vocals, and generates lyrics as part of the output rather than requiring a separate step. Every track is watermarked using Google’s SynthID technology, so AI-generated music stays identifiable even after editing or re-encoding.
The feature supports eight languages at launch: English, German, Spanish, French, Hindi, Japanese, Korean, and Portuguese, with more planned. It is available to all Gemini users aged 18 and over. The current limit is 30-second tracks, suited for social media clips, demos, and short-form content. Google has not confirmed when longer track support will arrive.
This is a Gemini app feature powered by Lyria 3, not a capability of the 3.1 Pro model itself. You do not need a paid Gemini plan to access it, though availability may vary by region during the initial rollout.
For developers and technically focused users, the case is strong. Gemini 3.1 Pro leads most benchmarks at a price that makes Claude Opus 4.6 hard to justify for the same tasks. The GitHub Copilot integration means you can access it inside your existing editor without any workflow changes.
For non-technical users, the answer depends on what you actually do. On GDPval-AA expert office tasks, the Claude models still hold the lead: Claude Sonnet 4.6 has an Elo of 1,633 and Claude Opus 4.6 has 1,606, versus 1,317 for Gemini 3.1 Pro. For sophisticated analytical writing, research synthesis, and high-context reasoning tasks, that gap may matter.
The clearest cases for choosing Gemini 3.1 Pro are coding, novel reasoning problems, high-volume API usage where cost matters, and any workflow that benefits from a 1 million token context window. For context on where Claude sits right now, see our Claude Sonnet 4.6 release breakdown.
Gemini 3.1 Pro is a genuine step up from Gemini 3 Pro. Its ARC-AGI-2 score and coding benchmarks place it at or near the top of the current leaderboard for most tasks, and at $2 per million input tokens and $12 per million output tokens, it is one of the most cost-effective flagship models available right now. The main caveat is expert-level tasks, where the Claude models still lead by a clear margin.
If you haven’t tried it, Google AI Studio gives you free access today via the gemini-3.1-pro-preview model ID. If you’re on a GitHub Copilot plan, it’s already available in your editor.