
URL: https://thenewstack.io/cursor-composer-benchmarks/

Cursor bets on cheaper coding with Composer 2.5 and Kimi K2.5 - The New Stack


2026-05-20 07:00:00
Cursor bets on cheaper coding with Composer 2.5 and Kimi K2.5
AI Agents / AI Models / Developer tools


Cursor's Composer 2.5 undercuts Opus 4.7 and GPT-5.5 on price, posts gains on Terminal-Bench and SWE-Bench, but real-world coding tests loom.
May 20th, 2026 7:00am by Meredith Shubel
Image source: Adriandra Karuniawan via Unsplash+

Cursor announced this week that Composer 2.5 is available in Cursor, only two months after the release of Composer 2, which beat Opus 4.6 on coding benchmarks at a fraction of the price. It’s another burst in the company’s streak of model releases, marking the fourth Composer in the last seven months. 

Cursor says the latest iteration brings major upgrades to long-running coding tasks, complex instruction-following, and training efficiency, as well as behavioral improvements in “communication style and effort calibration,” but time will tell whether benchmark gains translate into real-world improvements. 

A cheaper contender in the coding model line-up

Like its predecessor, Composer 2.5 is built on Moonshot's Kimi K2.5, an open-source, natively multimodal agentic model, but Cursor says it should now outperform Composer 2 on intelligence and behavior.

In its announcement, Cursor attributes these improvements to scaled training, more complex reinforcement learning (RL), and new learning methods. The benchmarks show how far Composer 2.5 has moved beyond Composer 2: from 61.7% to 69.3% on Terminal-Bench 2.0, and from 52.2% to 63.2% on Cursor's own CursorBench v3.1.
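To put those jumps in perspective, here is a minimal Python sketch that turns the four scores quoted above into absolute gains (percentage points) and relative gains. The dictionary layout and the `gains` helper are illustrative assumptions, not anything Cursor publishes.

```python
# Scores quoted in the article, in percent: (Composer 2, Composer 2.5).
# The structure and names here are illustrative only.
scores = {
    "Terminal-Bench 2.0": (61.7, 69.3),
    "CursorBench v3.1": (52.2, 63.2),
}


def gains(before, after):
    """Return (gain in percentage points, relative gain in percent), rounded to 1 decimal."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


for name, (composer_2, composer_2_5) in scores.items():
    points, relative = gains(composer_2, composer_2_5)
    print(f"{name}: +{points} points ({relative}% relative improvement)")
```

By this arithmetic, the CursorBench gain (11.0 points) is larger than the Terminal-Bench gain (7.6 points), though CursorBench is Cursor's own benchmark.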

And while Composer 2.5 still hasn’t surpassed Opus 4.7’s and GPT-5.5’s scores (save inching past GPT-5.5 by 2% on SWE-Bench Multilingual), it’s definitely giving Anthropic and OpenAI a run for their money. 

But benchmarks are just that — benchmarks.

Image: Cursor

While the benchmarks offer an interesting, high-level comparison of the industry’s main contenders, they don’t provide any real assurance for how these models will perform in the real world.  

As one Redditor commented: “Haven’t tested it yet but the benchmarks are wild. What’s interesting is that raw model performance doesn’t always translate to actual coding productivity. I’ve seen plenty of ‘better’ models still generate code that needs heavy cleanup or doesn’t fit the project context properly.”

Instead, they posit that the real test of Composer 2.5 will be whether it can handle multi-file changes while staying consistent with an existing codebase: “Anyone who’s used Claude or GPT-4 for actual projects knows that intelligence on benchmarks ≠ usefulness in practice.”

Cursor aims to improve long-running agent work

Cursor also says Composer 2.5 has leveled up on long-running coding tasks, for which it trained the model with targeted textual feedback to tackle tricky credit assignment during RL: “The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better.” 

By constructing and inserting short hints into the local context, Cursor aims to target specific mistakes while still retaining the bigger-picture RL objective. 
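As a rough illustration of the idea, and not a description of Cursor's actual training pipeline, the toy Python sketch below models a trajectory as a list of steps and places a short hint immediately before the step where the model could have behaved better. The `Step` class, the `insert_hint` function, and the example contents are all hypothetical. How the hinted context is then used during RL is not covered in the article and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str     # hypothetical labels: "agent" for model actions, "hint" for feedback
    content: str


def insert_hint(trajectory, mistake_index, hint):
    """Return a new trajectory with a hint placed just before the step at mistake_index.

    The original list is left unchanged.
    """
    if not 0 <= mistake_index < len(trajectory):
        raise IndexError("mistake_index is outside the trajectory")
    return (
        trajectory[:mistake_index]
        + [Step("hint", hint)]
        + trajectory[mistake_index:]
    )


# Made-up three-step trajectory where the second step is the weak point.
trajectory = [
    Step("agent", "read failing test"),
    Step("agent", "edit unrelated file"),
    Step("agent", "run tests"),
]
hinted = insert_hint(trajectory, 1, "The failure is in the parser module.")
for step in hinted:
    print(f"{step.role}: {step.content}")
```

The point of the sketch is only the placement: the feedback sits in the local context next to the specific mistake rather than arriving as a single signal at the end of the run.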

Barely a day after release, it’s still too early to tell if this training will make a real difference, but an early glimpse at user feedback suggests long-running tasks could still give developers trouble.

As one Redditor notes, “Composer 2.5 starts to work in agent mode, then all of a sudden it thinks it’s in ask mode and stops to work. When I prompt it to continue it tries to understand where it was in the task and only finishes what it just was working on, yet forgets about everything else in the pipeline.”

More synthetic data training, more unexpected reward hacking

According to Cursor, Composer 2.5 was trained on 25 times as many synthetic tasks as Composer 2, using a range of approaches to generate them. But such a breadth of synthetic task creation had at least one sour side effect: unexpected reward hacking. 

As Cursor itself admits: “As the model became more adept, Composer 2.5 was able to find increasingly sophisticated workarounds to solve the task at hand,” such as reverse-engineering a Python type-checking cache. 

Are you always getting what you pay for? 

Composer 2.5 costs $0.50 per million input tokens and $2.50 per million output tokens. Upgrading to the “faster” tier will set you back $3.00 per million input tokens and $15.00 per million output tokens, but you’re left with the same intelligence.

Whether or not the better latency is worth the sixfold price increase, one thing is for certain: Composer 2.5 is considerably cheaper than both Opus 4.7 and GPT-5.5, with Anthropic’s model standing at $25 per million output tokens, OpenAI’s at $30 per million output tokens, and both companies at $5 per million input tokens.
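For a sense of scale, the short Python sketch below applies the per-million-token prices quoted above to a single workload. The workload size (2 million input tokens, 500,000 output tokens) is a made-up assumption, as are the `PRICES` table layout and the `job_cost` helper. Real bills depend on each vendor's current pricing and on any caching or discounts, which are not modeled here.

```python
# USD per million tokens, as quoted in the article.
PRICES = {
    "Composer 2.5": {"input": 0.50, "output": 2.50},
    "Composer 2.5 (faster)": {"input": 3.00, "output": 15.00},
    "Opus 4.7": {"input": 5.00, "output": 25.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of one job at the listed per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 500_000
baseline = job_cost("Composer 2.5", INPUT_TOKENS, OUTPUT_TOKENS)

for model in PRICES:
    cost = job_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:.2f} ({cost / baseline:.1f}x Composer 2.5)")
```

On this assumed workload, the faster tier costs six times the standard tier and Opus 4.7 costs ten times as much. GPT-5.5 costs about 11 times as much, a figure that shifts with the mix of input and output tokens because its output price is 12 times Composer 2.5's while its input price is 10 times.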

Whether lower prices are enough to push developers to make the switch is the question. “We have to ask ourselves if Opus 4.7 is 10x better,” comments one Redditor, to which another replies: “For some tasks — yes. I’m not a huge fan of Composer for UI. But it’s great for small, targeted tasks. Also, he is excellent at explaining details.”

Either way, Cursor says an improvement is already in the works. Last month, Cursor announced a partnership with SpaceX on model training. The company now teases that it is working with SpaceXAI to train “a significantly larger model from scratch, using 10x more total compute” that it expects “to be a major leap in model capability.” 

Given this week’s news about Composer 2.5’s prices, developers will have to wonder how much that larger model will cost.

Meredith Shubel is a technical writer covering cloud infrastructure and enterprise software. She has contributed to The New Stack since 2022, profiling startups and exploring how organizations adopt emerging technologies. Beyond The New Stack, she ghostwrites white papers, executive bylines,...
TNS owner Insight Partners is an investor in: Anthropic, OpenAI.