URL: https://thenewstack.io/machine-learning-for-automated-root-cause-analysis-promise-and-pain/

Machine Learning for Automated Root Cause Analysis: Promise and Pain

Machine learning has the potential to revolutionize root cause analysis, but it must overcome data, computational and interpretability challenges.
Mar 11th, 2024 6:21am by Yuval Lev
CNCF sponsored this post.

Let’s envision a world where root causes are instantly identified the moment any system degradation occurs:

Maria, an e-commerce site reliability engineer, wakes up to an alert that the site’s checkout success rate has dropped 15% over the last 30 minutes due to higher-than-normal failure rates. With traditional monitoring tools, this would take hours of manual analysis to troubleshoot.

Instead, within seconds, Maria’s AIOps platform sends a notification showing the root cause: A dependency used by the payment microservice has been degraded, slowing transaction-processing times. The latest version of the payment service couldn’t handle the scale placed on the prior version.

The AIOps platform then details all affected components and APIs involved in this event. With this insight, Maria immediately knows both the blast radius and scope of the issue. She quickly resolves the problem by rolling back the last update made to the payment service, and checkout success rates are restored without any further customer impact. Going from alert to resolution took less than 5 minutes.

This level of automated root cause analysis delivers immense benefits:

  • Rapid detection: Analysis of the “blast radius” — connecting alert indicators to potential service degradations and outages — is done in seconds.
  • Alert fatigue reduction: By consolidating alerts and forming a cohesive picture of a production issue, automated root cause analysis focuses on the core issues that need repair.
  • Precise targeting: The exact root cause across all layers is signaled, along with its probabilistic impact on site reliability and revenue.
  • Faster recovery: By understanding root cause and blast radius from the start, teams can precisely mitigate issues rather than reactively firefighting.
  • Proactive prevention: Over time, patterns emerge showing systemic deficiencies, such as a Redis cluster needing failover configuration, and teams can make targeted improvements.
  • Exponential return on investment: Downtime x mean time to resolution (MTTR) reduction + engineer time savings + customer loyalty gains + risk/liability reduction. When cloud-designed ML automation boasts advanced data collection capabilities, the capital expenditures (CapEx) also decrease.

Why ML Troubleshooting Is Hard

This promise seems almost too good to be true. And indeed, multiple barriers obstruct the path to production-grade ML pipelines for root cause analysis.

To understand why, think about your production environment as if it were a car. You’re driving on the freeway when your engine starts rattling, sputtering and eventually stalling. If you were trying to replace your mechanic with an ML algorithm to identify the root cause, what are some of the challenges you might encounter?

  • No wiring diagram: Where are each sensor, actuator and pump located, and how do they fit together? Who manufactured every part of the automobile, and where can you source a new component if one breaks? Without a multidimensional topology mapping all dependencies, ML models have zero context for traversing interrelated failures. Manually creating this wiring chart is enormously complex at scale. And when you need real-time answers in the middle of a production incident, it only gets harder.
  • Not accounting for blind spots: How many times has the car been repaired or refurbished, and what third-party parts does it have? Relying on user-provided telemetry often leaves blind spots across production layers, and those gaps make automatic triage almost impossible. There needs to be another way of collecting complete environment data.
  • Can’t reason from past experience: If you’ve had the same car for a long time, you might instinctively know whether a failing alternator, loose suspension components or a broken flywheel is causing your problem. ML models lack this ability to zoom in on probable culprits tied to a business impact, such as a drop in checkout conversion, and thereby reduce noise.
  • Communicating the diagnosis: Once a failing head gasket is identified as the root cause, the mechanic still needs to explain the diagnosis and severity to the car owner. Likewise with engineers and customers — and generic ML model correlations can’t help here.

Let’s explore these pitfalls inhibiting automated root cause analysis further:

1. No machine-readable system topology

ML models can only spot patterns in data they can access. Without an existing topology mapping the thousands of interdependent services, containers, APIs and infrastructure elements, models have no pathway to traverse failures across domains.

Manually creating this topology is remarkably complex and sometimes impossible as production environments dynamically scale across hybrid cloud infrastructure.
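To make "machine-readable topology" concrete, here is a minimal sketch that assumes the topology is already available as a simple dependency map. The service names and the `blast_radius` helper are hypothetical and are not taken from any particular AIOps product; a real environment would have to discover and refresh this map automatically.

```python
from collections import deque

# Hypothetical topology: each key depends on the services listed as its value.
DEPENDS_ON = {
    "checkout": ["payment", "cart"],
    "payment": ["payment-db", "fraud-check"],
    "cart": ["redis"],
    "fraud-check": [],
    "payment-db": [],
    "redis": [],
}


def blast_radius(depends_on, degraded):
    """Return every service that directly or transitively depends on `degraded`."""
    # Invert the edges so we can walk from a dependency up to its callers.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk over the inverted graph.
    affected, queue = set(), deque([degraded])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius(DEPENDS_ON, "payment-db")))  # ['checkout', 'payment']
```

Once the map exists, the traversal is trivial. The hard part is building and maintaining the map as the environment scales and changes.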

2. Root cause inference at scale

Even with a topology, searching during an incident poses scalability issues. Existing ML libraries cannot handle production causality analysis.

To diagnose checkout failure, should we evaluate payment APIs or database clusters? Intuitively, an engineer would prioritize services tied to revenue delivery. But generic ML techniques lack this reasoning, forcing an exponential search across all topology layers — like holding a microphone to every inch of a car engine.

Advanced algorithms are needed to traverse topology graphs during incidents, weighing and filtering options based on business criticality. Both simple and intricate failure chains must be unpacked, all before revenue and trust disappear.
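One way to picture a search that weighs options by business criticality is a best-first walk over the dependency graph with a fixed inspection budget. The sketch below is only an illustration: the topology, the `CRITICALITY` weights, the `ANOMALY` scores and the scoring rule (criticality multiplied by anomaly) are all assumptions, not a description of any production algorithm.

```python
import heapq

# Hypothetical topology and scores, for illustration only.
DEPENDS_ON = {
    "checkout": ["payment", "cart"],
    "payment": ["payment-db", "fraud-check"],
    "cart": ["redis"],
}
CRITICALITY = {"checkout": 1.0, "payment": 0.9, "payment-db": 0.8,
               "fraud-check": 0.5, "cart": 0.4, "redis": 0.3}
ANOMALY = {"checkout": 0.7, "payment": 0.8, "payment-db": 0.95,
           "fraud-check": 0.2, "cart": 0.1, "redis": 0.1}


def rank_suspects(start, depends_on, criticality, anomaly, budget=3):
    """Best-first walk from the alerting service toward its dependencies.

    Inspects at most `budget` services, always expanding the one with the
    highest criticality * anomaly score, so low-value branches are skipped.
    """
    def score(service):
        return criticality.get(service, 0.0) * anomaly.get(service, 0.0)

    heap = [(-score(start), start)]  # heapq is a min-heap, so negate scores
    seen = {start}
    inspected = []
    while heap and len(inspected) < budget:
        _, service = heapq.heappop(heap)
        inspected.append(service)
        for dep in depends_on.get(service, []):
            if dep not in seen:
                seen.add(dep)
                heapq.heappush(heap, (-score(dep), dep))
    return inspected


print(rank_suspects("checkout", DEPENDS_ON, CRITICALITY, ANOMALY))
```

With these made-up numbers the walk goes from `checkout` to `payment` to `payment-db` and never spends budget on `cart` or `redis`, which is the pruning an engineer would do intuitively.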

3. Interpretability for humans

Finally, ML troubleshooting creates a new challenge: how to make inferences understandable to humans. Identifying patterns in metrics data reveals statistical correlations between events, but not causal priority chains:

  • Event A (high memory usage) frequently corresponds to Event Z (checkout errors).
  • Therefore, there is a high probability that Event A causes Event Z.

But this diagnosis doesn’t answer the questions that provide actionable insights to engineers:

  • What was the blast radius on revenue and reliability?
  • How do we communicate this to decision-makers?
  • How do we prioritize fixing Event A versus Event B, which also correlates with Event Z?
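The limits of correlation alone can be seen in a small sketch. The per-minute series below are invented for illustration, and treating Event B as CPU usage is an assumption; the `pearson` helper is a plain textbook implementation, not a specific library's API.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-minute samples during an incident.
memory_usage_a = [40, 42, 45, 60, 75, 90]    # Event A: memory usage (%)
cpu_usage_b = [30, 31, 35, 50, 70, 85]       # Event B: CPU usage (%), assumed
checkout_errors_z = [1, 1, 2, 6, 11, 15]     # Event Z: checkout errors

r_a = pearson(memory_usage_a, checkout_errors_z)
r_b = pearson(cpu_usage_b, checkout_errors_z)
print(f"A vs Z: {r_a:.3f}, B vs Z: {r_b:.3f}")
```

Both coefficients come out very close to 1, so the statistic by itself gives an engineer no basis for choosing between A and B, and says nothing about revenue impact or what to tell decision-makers.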

Solving this final-mile problem requires models that capture and visualize root cause probability, business-impact sequencing, risk levels and mitigation recommendations.

The Vision

While core machine learning techniques show promise, purpose-built solutions are necessary to address the complexity of causality analysis at production scale. Combining specialized topology inference, heuristic graph search algorithms and interpretable data science unlocks the power of automated root cause analysis. But it requires advances in data collection, service mapping, ML and the communication of technical insights — all with the goal of remediation.

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon Europe in Paris, from March 19-22.

The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure including Kubernetes, OpenTelemetry, and Argo. CNCF is the neutral home for cloud native collaboration, bringing together the industry’s top developers, end users, and vendors.
Yuval Lev, CTO and co-founder of the zero-instrumentation production intelligence pioneer Senser, honed his skills as a tech leader at DriveNets, a leader in cloud native networking solutions. Senser was recently recognized as an Intellyx 2023 Digital Innovator of the...
TNS owner Insight Partners is an investor in: Root.