URL: https://thenewstack.io/will-it-mythos-benchmark/

Will it Mythos? One coder's verdict on Anthropic's blend of debugging - The New Stack
AI Models / AI Operations / Security

Will it Mythos? One coder’s verdict on Anthropic’s blend of debugging

The independent developer community is benchmarking Anthropic's Mythos AI model on security bug detection, testing claims about its vulnerability-hunting capabilities.
Jun 24th, 2026 4:51pm by Adrian Bridgwater

Anthropic’s rollercoaster ride with Claude Fable 5 and Mythos 5 has left various deposits of fallout this year, but the mêlée hasn’t stopped independent software application developers from testing the models’ purported powers via their own methods and channels.

Austin, Texas-based software developer Joe Cooper (aka swelljoe) has said he is skeptical about how well Mythos might find really challenging security bugs and cyber vulnerabilities.

A rich enough blend for debugging – will it Mythos? 

“The idea was to gather up bugs that were specifically found by Mythos, as covered by Anthropic’s own documentation [and start to build a benchmarking service],” wrote Cooper, on his blog.

“[I wanted to] find the commit from before the bug was fixed, verify that a top-tier model (Opus, in this case) can identify and understand the bug if pointed right at it, and add that to our corpus for benchmarking whether models going in blind can accurately detect and describe the bug,” explained Cooper. 
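The corpus-and-scoring idea Cooper describes can be sketched roughly as follows. This is a minimal illustration and not Cooper's actual code: the `CorpusEntry` fields, the keyword-matching rule in `report_detects`, and the placeholder repository values are all assumptions made for the example. A real harness would likely need a more careful judge than substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One known bug, pinned to the commit before the fix (all fields hypothetical)."""
    bug_id: str
    repo_url: str
    pre_fix_commit: str
    buggy_files: frozenset[str]
    keywords: frozenset[str]


def report_detects(entry: CorpusEntry, report: str) -> bool:
    """Crude scoring rule (an assumption): the blind report must name
    at least one buggy file and at least one root-cause keyword."""
    text = report.lower()
    names_file = any(path.lower() in text for path in entry.buggy_files)
    names_cause = any(word.lower() in text for word in entry.keywords)
    return names_file and names_cause


def detection_rate(entries, reports_by_bug) -> float:
    """Fraction of corpus entries that a model's blind reports detected."""
    if not entries:
        return 0.0
    hits = sum(
        report_detects(entry, reports_by_bug.get(entry.bug_id, ""))
        for entry in entries
    )
    return hits / len(entries)


# Placeholder entry: not a real bug, repository, or commit.
example = CorpusEntry(
    bug_id="example-1",
    repo_url="https://example.invalid/project.git",
    pre_fix_commit="0000000",
    buggy_files=frozenset({"src/parser.c"}),
    keywords=frozenset({"overflow", "out-of-bounds"}),
)
```

Running each model blind against the same entries and comparing `detection_rate` values is one way to get the kind of hard numbers the benchmark is after.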

In homage to Will It Blend?, the YouTube infomercial series for Blendtec blenders in which founder Tom Dickson attempts to blend everything from a whole chicken to an iPhone 4, Cooper amusingly titled his May analysis piece Will It Mythos?

In describing his process, Cooper noted that he had previously built a tool called Nelson to automate bug hunting in his own projects. Nelson works with a variety of models via Claude Code, Gemini CLI, and OpenAI-compatible APIs, and he had “already noticed there are surprising differences” in how effectively those models identify bugs.

But he wanted hard numbers, and so he mostly used Claude to cook up a benchmark suite that borrows some code from Nelson and takes Mythos to task.

Zero-day vulnerabilities in every major operating system

Since Cooper set out to test bugs gathered from “Anthropic’s own documentation”, it is worth remembering what Anthropic claimed: that during its testing, Mythos Preview proved capable of identifying and then exploiting zero-day vulnerabilities in “every major operating system and every major web browser” when directed by a user to do so.

Ultimately, he wanted to find out whether models going in blind can accurately detect and describe a bug. Cooper’s system was built so that the models used can look at the whole code repository and follow logic across file boundaries, but they’re not told what to look for.

“The toughest bugs are multi-file bugs. The models were free to look at all files, but one often needs to know the context to know that a given usage is a problem. This is a hard problem for any security reviewer, human or AI,” Cooper says.  
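To make the multi-file point concrete, here is a small, generic sketch of a bug that only appears when two files are read together. It is not drawn from Cooper's corpus. The file names, function names, and the path traversal scenario are all hypothetical, chosen only because the pattern is easy to show in a few lines.

```python
from pathlib import Path


# --- storage.py (hypothetical file) ---
def resolve_upload(base: Path, name: str) -> Path:
    """Join `name` onto `base`.

    Contract (assumed for this example): callers pass an already
    sanitized file name. Read alone, this function looks harmless.
    """
    return (base / name).resolve()


# --- handler.py (hypothetical file) ---
def handle_download(base: Path, requested: str) -> Path:
    """Passes raw request input straight through, breaking the contract above.

    Read alone, this also looks harmless: it simply delegates.
    """
    return resolve_upload(base, requested)


# --- helper used only to demonstrate the problem ---
def stays_inside(base: Path, target: Path) -> bool:
    """True if `target` is `base` itself or lives somewhere beneath it."""
    base = base.resolve()
    return target == base or base in target.parents
```

Neither function is obviously wrong in isolation. A reviewer only sees the flaw after noticing that `handle_download` never sanitizes `requested`, so a value such as `../../etc/passwd` resolves to a path outside `base`.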

He said he assumed Mythos has more advanced tooling and perhaps runs the software in a debugger or performs fuzz testing.

Some credence that Mythos is particularly good at this 

“Guessing at everything Mythos might do is beyond the goals of this project for now. But there are bugs in this corpus that are extremely hard to find, giving some credence to the notion that Mythos is particularly good at this problem,” conceded Cooper.

Raw model capability is not the same as gate-all security

As opinions on Anthropic’s new model abilities start to crystallize, we will gradually gain more real-world insight into their working mechanics.

Conor Sherman, global CISO at runtime security specialist Sysdig, is a glass-half-full kind of guy. He tells The New Stack that Mythos stands apart from other AI models.

“It almost certainly leads in raw capability,” Sherman says. “So in that sense, Mythos is engineered to ‘blend’ an expansive range of bugs (but perhaps not a whole chicken or iPhone), and it has an inherent ability to perform well when hunting out the multi-file bugs that stump everything else seems to be where it’s most notably ahead.”

But there are caveats here: Sherman advises that it’s important to recognize the difference between simply leading a benchmark and gating all security progress on one model, like some kind of cyber wonder-panacea. 

“A less capable but less expensive model wired into the right scaffolding can close most of that gap,” explains Sherman. “The edge that actually matters for defenders isn’t model-versus-model on a corpus of known bugs. Security in an age of AI-powered threats requires runtime context and real-time signals that equip teams to act at the speed at which today’s attackers move.”

No tool should be its own judge

Fabien Renaudineau, co-founder and co-CEO at AI-native synthetic testing company Mozark, tells The New Stack that in terms of pure capability, agentic testing and debugging tools are improving very quickly, and it “would be a mistake” to underestimate them. 

“Systems in this class can already help engineers reason through larger code contexts, identify likely failure paths, and accelerate remediation,” Renaudineau says. “But whatever the benchmark, blend or breed… production reliability is not determined only by what an agent can do.”

Renaudineau draws a line here and says that debugging reliability can only ever be determined by how a tool’s output is verified, and by what we choose to base that verification on – and the first principle here is that an “agent itself cannot be the credible judge” of its own reliability.

“Verification has to be independent of the agent, reproducible and tested against real-world conditions of use – not only against the controlled environment in which the agent was developed or benchmarked,” Renaudineau adds.

What DevSecOps developers should think next

As for the original Will it Mythos man himself, Joe Cooper noted that on the core question of whether Anthropic’s model can identify really challenging bugs, his benchmark answers “with a resounding, maybe” in this test.

“Mythos, maybe, really is better than the other current models at finding security bugs, as it found four bugs that no model in this experiment found. But I’ll keep testing. It’s possible prompt or tooling or harness changes can enable better results from the current crop of publicly available models,” concluded Cooper. “Over time, I’ll evolve the corpus. It may become a more generic CVE-based benchmark if Anthropic stops bragging about specific bugs.”

With interest in Anthropic’s latest batch so fervent, and with the work carried out in areas such as Project Glasswing so cloak-and-dagger, the fact that the software engineering community is working to provide so much (hopefully always) impartial analysis must be a good thing.

Adrian Bridgwater is a technology journalist with three decades of press experience. He has an extensive background in communications, starting in print media, newspapers and also television. Primarily working as an analysis writer dedicated to a software application development ‘beat’,...
TNS owner Insight Partners is an investor in: OpenAI, Anthropic, Sysdig.