URL: https://thenewstack.io/ai-training-data-quality/

Cleaner AI training data, fewer bugs: Sonar's SonarSweep explained - The New Stack



Cleaner AI training data, fewer bugs: Sonar’s SonarSweep explained

SonarSweep filters flawed AI training data, reducing bugs and security vulnerabilities in generated code by 41%. Learn how it works.
Jun 11th, 2026 8:00am by Joe Tyler
Sonar sponsored this post. Insight Partners is an investor in Sonar and TNS.

Large language models have moved quickly from novelty to daily infrastructure in software development. We are no longer using AI just to answer isolated questions in a chat window. We are using it to generate services, infrastructure configurations, test cases, and production code, often at a speed that traditional review processes were never designed to handle.

Models can clearly generate code; the question is whether teams can trust what they produce. A big part of that challenge starts far upstream, in the data used to train the models themselves, a clear example of the “Garbage In, Garbage Out” principle. The AI research team at Sonar has explored this problem extensively and has built a technology called SonarSweep to address it.

Public code repositories give models enormous breadth across languages and frameworks, but they also contain outdated libraries, insecure patterns, brittle implementations, and poor maintenance habits. Models learn from all of it. They do not reliably distinguish between examples that reflect sound engineering and examples that merely compile.

That matters because functional code is not the same as production-ready code. In an enterprise setting, code also needs to be secure, maintainable, and resilient under real-world conditions. If those qualities are inconsistent in the training data, they will also be inconsistent in the output.

The risk of unvetted code

LLMs are essentially sophisticated statistical systems. They do not possess an inherent understanding of “good” versus “bad” engineering. They optimize for the most probable solution based on the provided context and the patterns learned from the training data.

That means weaknesses in the training corpus can persist in model behavior in ways that are easy to underestimate. Research, including findings from Anthropic late last year, suggests that even relatively small samples of “poisoned” or low-quality data can be encoded into models’ inner representations and lead to dangerous behaviors.

In a paper I co-authored last year, “Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis,” we observed this exact issue. All models considered in our study generated a mixture of simple and sophisticated bugs, vulnerabilities, and code smells (maintainability issues). There were also some interesting trends between models: certain labs appear to prioritize security performance, while others succeed in writing maintainable code.

To highlight the complexity of these issues, consider path traversal vulnerabilities. Detecting them typically requires taint analysis (following data from an input source to a sensitive sink). While an LLM might generate a block of code that performs a specific function correctly, it may fail to account for how unvalidated user input could manipulate file paths or inject malicious commands three functions down the line.
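As a minimal sketch of the path traversal case, assuming a hypothetical request handler that reads files from an upload directory: the function names and the `params` shape below are illustrative and not taken from the paper. The untrusted value enters in `handle_request` (the source), passes through `load_document` untouched, and reaches the file read (the sink) two calls later.

```python
import tempfile
from pathlib import Path
from typing import Callable

Reader = Callable[[Path, str], str]


def read_unsafe(base_dir: Path, filename: str) -> str:
    # Sink: the untrusted name is joined and opened without any check,
    # so "../" segments can climb out of base_dir.
    return (base_dir / filename).read_text()


def read_safe(base_dir: Path, filename: str) -> str:
    # Resolve the final location first, then confirm it is still inside base_dir.
    base = base_dir.resolve()
    target = (base / filename).resolve()
    if not target.is_relative_to(base):  # requires Python 3.9+
        raise ValueError(f"path escapes base directory: {filename!r}")
    return target.read_text()


def load_document(base_dir: Path, name: str, reader: Reader) -> str:
    # Intermediate layer: passes the tainted value along unchanged.
    return reader(base_dir, name)


def handle_request(base_dir: Path, params: dict, reader: Reader) -> str:
    # Source: a user-controlled request parameter (hypothetical shape).
    return load_document(base_dir, params["file"], reader)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        uploads = root / "uploads"
        uploads.mkdir()
        (uploads / "notes.txt").write_text("public notes")
        (root / "secret.txt").write_text("api-key")

        attack = {"file": "../secret.txt"}
        leaked = handle_request(uploads, attack, read_unsafe)
        try:
            handle_request(uploads, attack, read_safe)
            blocked = False
        except ValueError:
            blocked = True
        ok = handle_request(uploads, {"file": "notes.txt"}, read_safe)
    return {"leaked": leaked, "blocked": blocked, "ok": ok}


if __name__ == "__main__":
    print(demo())
```

Both readers return the right content for a well-behaved file name, which is why a narrow functional test would not tell them apart. Only tracing the parameter from source to sink shows the difference.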

For enterprise teams, this is the real risk. A model can produce code that looks correct, passes a narrow test, and still introduces issues that increase review time, technical debt and security exposure. At scale, those flaws do not stay isolated. They spread through pull requests, internal tools and downstream systems.

Why data quality engineering matters

Organizations are becoming increasingly interested in building and owning their AI. If they want their AI-assisted coding to be context-aware, trustworthy, and cost-effective, they need to treat their code and other data as critical assets. That means applying quality controls before the model ever learns from their datasets.

This is where data quality engineering becomes essential. Rather than accepting public code corpora (and even their own data) as-is, teams can inspect, filter, and improve training data so the model learns from stronger examples. 

My team has put this into practice with SonarSweep, an approach that “sweeps” datasets before the model ever sees them. This ensures training data is properly scrutinized, which is crucial for companies seeking to understand and improve their agentic development practices. There are four key phases to sweeping datasets:
 

  1. Analyze the code deeply. This goes beyond keyword filtering. Static analysis can identify bugs, security vulnerabilities, and maintainability issues across large datasets.
  2. Synthesize examples. Create high-quality training examples for underrepresented tasks and domains. Optionally, an organization may use its own codebase to embed specific understanding into the model.
  3. Remediate what can be remediated. If insecure or outdated patterns can be automatically corrected, the model can learn from the improved version instead of the flawed one.
  4. Curate aggressively. Not every example deserves equal weight in a training set. A quality gate removes low-signal data and maximizes diversity.
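
The four phases above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not SonarSweep's implementation: the two regex rules stand in for a real static analyzer, the single string replacement stands in for automated remediation, and every name here (`Sample`, `RULES`, `sweep`, and so on) is hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Sample:
    code: str
    issues: list[str] = field(default_factory=list)


# Toy rules standing in for a real static analyzer (hypothetical).
RULES = {
    "weak-hash": re.compile(r"hashlib\.md5\("),
    "dynamic-eval": re.compile(r"\beval\("),
}


def analyze(sample: Sample) -> Sample:
    # Phase 1: attach the names of every rule the code violates.
    issues = [name for name, rule in RULES.items() if rule.search(sample.code)]
    return Sample(sample.code, issues)


def synthesize(reference_snippets: list[str]) -> list[Sample]:
    # Phase 2: wrap vetted snippets for underrepresented tasks as new samples.
    # In practice these would be generated or written, then analyzed as well.
    return [analyze(Sample(code)) for code in reference_snippets]


def remediate(sample: Sample) -> Sample:
    # Phase 3: only the weak-hash rule has an automatic fix in this sketch.
    if "weak-hash" in sample.issues:
        fixed = sample.code.replace("hashlib.md5(", "hashlib.sha256(")
        return analyze(Sample(fixed))
    return sample


def curate(samples: list[Sample]) -> list[Sample]:
    # Phase 4: quality gate that drops samples with remaining issues
    # and exact duplicates.
    seen: set[str] = set()
    kept: list[Sample] = []
    for sample in samples:
        key = sample.code.strip()
        if sample.issues or key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


def sweep(raw: list[str], reference_snippets: list[str]) -> list[str]:
    samples = [analyze(Sample(code)) for code in raw]
    samples += synthesize(reference_snippets)
    samples = [remediate(s) for s in samples]
    return [s.code for s in curate(samples)]


RAW = [
    "import hashlib\nh = hashlib.md5(data)\n",  # fixable: weak hash
    "result = eval(user_input)\n",              # not fixable here: dropped
    "total = sum(values)\n",
    "total = sum(values)\n",                    # duplicate: dropped
]
SYNTHETIC = ["def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"]

if __name__ == "__main__":
    for code in sweep(RAW, SYNTHETIC):
        print(code)
```

In this run the MD5 sample is kept in its remediated form, the `eval` sample and the duplicate are removed, and the synthetic snippet is added. A real quality gate would also weigh diversity and signal, which exact-match deduplication only hints at.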

The payoff is measurable

This is not just a theoretical argument. Training on “swept” data in our model release from the end of last year led to a 41% reduction in the density of security vulnerabilities and a 41% reduction in the density of bugs in the model’s generated output.
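
To make "density" concrete, here is a small worked example. It assumes density means issues per 1,000 lines of generated code, which the article does not specify, and the counts are invented purely to show what a 41% relative reduction looks like. They are not results from the model release.

```python
def issue_density(issue_count: int, lines_of_code: int) -> float:
    """Issues per 1,000 lines of code (assumed unit)."""
    return 1000 * issue_count / lines_of_code


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Illustrative counts only: the same amount of generated code, fewer issues.
baseline = issue_density(issue_count=100, lines_of_code=50_000)
swept = issue_density(issue_count=59, lines_of_code=50_000)

if __name__ == "__main__":
    print(f"baseline: {baseline:.2f} issues/KLOC")
    print(f"swept:    {swept:.2f} issues/KLOC")
    print(f"reduction: {relative_reduction(baseline, swept):.0%}")
```

Because density normalizes by code size, the comparison stays meaningful even when two models produce different amounts of code for the same tasks.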

That kind of improvement matters because the ROI is not only technical. It is operational. Within an agentic coding session, when the model writes code containing fewer bugs and vulnerabilities, the agent will spend less time and fewer tokens in the loops of the Agent Centric Development Lifecycle (AC/DC) — a framework built for how software is developed today. Within the AC/DC method, AI agents generate most of the code, and teams manage the development loops in three core stages: Guide, Verify, and Solve. 

Furthermore, training the model on your codebase will reduce token usage and help it get up to speed with your architecture and best practices at the start of a session. Cleaner, better-structured code reduces token use as well, because agents spend less time re-reading files, rebuilding context, and working around unnecessary complexity.

Sonar’s research supports that point: Across six matched pairs of codebases and roughly 660 Claude Code task runs, agents working in SonarQube-verified codebases used about 7% fewer input tokens and 8% fewer output tokens, with no meaningful change in task completion.
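
A rough way to translate those percentages into spend is sketched below. Only the 7% and 8% figures come from the text above. The token volumes and per-million-token prices are placeholders chosen for easy arithmetic, not any vendor's actual pricing, and the function names are hypothetical.

```python
def apply_savings(
    input_tokens: float,
    output_tokens: float,
    input_saving: float = 0.07,   # about 7% fewer input tokens (from the study)
    output_saving: float = 0.08,  # about 8% fewer output tokens (from the study)
) -> tuple[float, float]:
    return input_tokens * (1 - input_saving), output_tokens * (1 - output_saving)


def session_cost(
    input_tokens: float,
    output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Placeholder workload and prices (assumptions, not measured values).
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 100_000
INPUT_PRICE = 3.0    # currency units per million input tokens
OUTPUT_PRICE = 15.0  # currency units per million output tokens

baseline_cost = session_cost(INPUT_TOKENS, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)
verified_cost = session_cost(
    *apply_savings(INPUT_TOKENS, OUTPUT_TOKENS), INPUT_PRICE, OUTPUT_PRICE
)

if __name__ == "__main__":
    saved = baseline_cost - verified_cost
    print(f"baseline: {baseline_cost:.2f}")
    print(f"verified: {verified_cost:.2f}")
    print(f"saved:    {saved:.2f} ({saved / baseline_cost:.1%})")
```

The blended saving depends on the ratio of input to output tokens and on their relative prices, so teams would need to substitute their own volumes and rates before drawing conclusions.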

Generating quality from the ground up

The next frontier in AI coding is not simply bigger models or faster generation. It is better foundations. We are reaching the point where the limiting factor is no longer how much code AI can produce. It is how quickly organizations can verify that code and decide whether it deserves to move forward. 

To close the trust gap, organizations must build quality into their system from the beginning, including the datasets that shape model behavior before deployment. The teams that do this well will be the ones that get the most value from AI-assisted development, because their agents will be more efficient, spend less time correcting preventable defects, and have more time to put trustworthy software into production.

For engineering leaders and developers, the mandate is clear: development teams need to understand the implications of different agent configurations on their token spend and development output.

Sonar is the industry standard for code verification and automated code review, trusted by 75% of the Fortune 100. Its SonarQube platform analyzes over 750 billion lines of code daily, helping to prevent outages, reduce risk, lower technical debt, and ensure compliance.
Joe Tyler is a specialist in Large Language Models with a background in natural language processing and real-world AI applications. He is currently an AI Researcher at Sonar based in London, leveraging generative AI to revolutionize code quality and security....
TNS owner Insight Partners is an investor in: SonarSource, Anthropic, Sonar.