
URL: https://thenewstack.io/beyond-shift-left-improving-ai-training-data/

Beyond 'Shift Left': Improving AI Training Data - The New Stack


AI Engineering / Large Language Models / Security

Beyond ‘Shift Left’: Improving AI Training Data

Instead of just generating more code faster and creating a downstream review bottleneck, we can train models to generate better code from the start.
Oct 28th, 2025 6:00am by Manish Kapur
Image by Wilfried Pohnke from Pixabay
Sonar sponsored this post. Insight Partners is an investor in Sonar and TNS.

The software development world is grappling with a new “engineering productivity paradox.” On one hand, AI-powered coding assistants are generating a staggering volume of code. For example, Google has said that 30% of its code uses AI-generated suggestions. On the other hand, engineering velocity has not seen a proportional jump: productivity gains are estimated at 10%.

This discrepancy highlights a critical bottleneck: All that AI-generated code must be reviewed, verified and often fixed by human developers. The core issue isn’t the quantity of AI-generated code; it’s the quality.

“Garbage in, garbage out” has been a maxim in computing for decades. Today, it’s the central challenge for coding large language models (LLMs), which are trained on vast, unfiltered data sets of public code repositories. The inconvenient truth is that these repositories are riddled with bugs, security vulnerabilities and “code smells” that contribute to technical debt. When an LLM learns from this flawed data, it learns to replicate these flaws.

Recent studies confirm this. Analyses of leading LLMs by Sonar show they all share common blind spots, consistently producing code with high-severity vulnerabilities and a deep-seated tendency to write code that is hard to maintain.

This flood of problematic code places an even greater burden on human reviewers, shifting the bottleneck rather than eliminating it and creating the very productivity paradox we’re trying to solve.

Shifting Left of ‘Shift Left’

For years, the industry has championed the “shift left” movement — a practice focused on identifying and fixing quality and security issues as early as possible in the software development life cycle (SDLC). We moved testing from a final pre-production phase to an integrated part of CI/CD pipelines, and static analysis tools were integrated directly into the developer’s IDE. The goal was simple: Find it early, fix it cheaply.

But AI-assisted code generation breaks this model. The “beginning” of the life cycle is no longer when a developer writes the first line of code. The life cycle now begins before that — inside the LLM itself, with the data it was trained on.

If an AI tool generates code that is already insecure or buggy, the “shift left” battle is already half-lost. We are, in effect, playing defense, using our best developers as a final backstop to catch the mistakes of our most “productive” new tools.

The logical, necessary evolution of this concept is to shift even further left. We must move our focus from only reviewing AI-generated code to improving the source. The new frontier for code quality and security is the LLM’s training data.

Curating the AI’s ‘Education’

A new approach is emerging to tackle this problem head-on. The concept involves applying a “sweep” to the massive data sets used to train and fine-tune coding models.

Imagine using a powerful, large-scale static analysis engine — one that understands thousands of bug patterns, security vulnerabilities and maintainability issues — and turning it loose on petabytes of training data. This engine can identify, remediate and filter out problematic code before it ever becomes part of the LLM’s “education.”
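
To make the idea concrete, here is a minimal sketch of the "identify and filter" half of such a sweep, written in Python. It is an illustration under stated assumptions, not a description of how SonarSweep or any production analyzer works: the three rules, the `Finding` type, and the `analyze` and `sweep` functions are hypothetical names chosen for this example, and the remediation step is left out entirely.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One issue reported by the toy analyzer (hypothetical type)."""
    rule: str
    line: int


def analyze(source: str) -> list[Finding]:
    """Toy static analysis: flag a few well-known Python pitfalls."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Code that does not even parse is treated as a finding.
        return [Finding("syntax-error", exc.lineno or 0)]

    findings = []
    for node in ast.walk(tree):
        # Security: eval()/exec() on arbitrary input is an injection risk.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings.append(Finding("dangerous-call", node.lineno))
        # Bug-prone: a bare "except:" swallows every exception.
        elif isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(Finding("bare-except", node.lineno))
        # Bug-prone: mutable default arguments are shared across calls.
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for default in node.args.defaults + node.args.kw_defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    findings.append(Finding("mutable-default", node.lineno))
    return findings


def sweep(samples: list[str]) -> tuple[list[str], list[tuple[str, list[Finding]]]]:
    """Split training samples into (clean, rejected-with-findings)."""
    clean, rejected = [], []
    for sample in samples:
        findings = analyze(sample)
        if findings:
            rejected.append((sample, findings))
        else:
            clean.append(sample)
    return clean, rejected


if __name__ == "__main__":
    # Tiny stand-in for a training corpus; real data sets are vastly larger.
    corpus = [
        "def add(a, b):\n    return a + b\n",
        "def run(cmd):\n    return eval(cmd)\n",
        "def push(x, acc=[]):\n    acc.append(x)\n    return acc\n",
        "try:\n    risky()\nexcept:\n    pass\n",
    ]
    clean, rejected = sweep(corpus)
    print(f"kept {len(clean)} of {len(corpus)} samples")
    for sample, findings in rejected:
        print([f.rule for f in findings])
```

Only the clean samples would go on to training or fine-tuning. A real engine would cover far more rules and languages, and would attempt to repair flawed samples rather than simply discard them.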

The results of this approach are profound. At Sonar, our early findings with our new service, SonarSweep, have shown that models fine-tuned on such remediated data produce code with significantly fewer flaws. In one analysis, this “sweeping” process led to models that generated code with up to 67% fewer security vulnerabilities and 42% fewer bugs, all without degrading the functional correctness of the output.

This represents a fundamental change in our approach to AI-assisted development. Instead of just generating more code faster and creating a downstream review bottleneck, we can train models to generate better code from the start.

True velocity isn’t just about raw output; it’s about the amount of high-quality, secure and maintainable code that makes it to production with minimal human friction. By ensuring our AI models learn from our best examples, not our worst, we reduce the review burden and free human developers to focus on what they do best: solving complex problems and building what’s next.

Sonar is the industry standard for code verification and automated code review, trusted by 75% of the Fortune 100. Its SonarQube platform analyzes over 750 billion lines of code daily, helping to prevent outages, reduce risk, lower technical debt, and ensure compliance.
Manish Kapur is VP of Product and Solutions Marketing at Sonar, where he oversees go-to-market strategy and outbound product management for tools used by development teams to analyze, verify, and remediate code at scale. He has spent his career at...