Let’s envision a world where root causes are instantly identified the moment any system degradation occurs:
Maria, an e-commerce site reliability engineer, wakes up to an alert that the site’s checkout success rate has dropped 15% over the last 30 minutes due to higher-than-normal failure rates. With traditional monitoring tools, this would take hours of manual analysis to troubleshoot.
Instead, within seconds, Maria’s AIOps platform sends a notification showing the root cause: A dependency used by the payment microservice has been degraded, slowing transaction-processing times. The latest version of the payment service couldn’t handle the scale placed on the prior version.
The AIOps platform then details all affected components and APIs involved in this event. With this insight, Maria immediately knows both the blast radius and scope of the issue. She quickly resolves the problem by rolling back the last update made to the payment service, and checkout success rates are restored without any further customer impact. Going from alert to resolution took less than 5 minutes.
This level of automated root cause analysis delivers immense benefits:
This promise seems almost too good to be true. And indeed, multiple barriers obstruct the path to production-grade ML pipelines for root cause analysis.
To understand why, think about your production environment as if it were a car. You’re driving on the freeway when your engine starts rattling, sputtering and eventually stalling. If you were trying to replace your mechanic with an ML algorithm to identify the root cause, what are some of the challenges you might encounter?
Let’s look more closely at the pitfalls that inhibit automated root cause analysis:
1. No machine-readable system topology
ML models can only spot patterns in data they can access. Without an existing topology mapping the thousands of interdependent services, containers, APIs and infrastructure elements, models have no pathway to traverse failures across domains.
Manually creating this topology is remarkably complex and sometimes impossible as production environments dynamically scale across hybrid cloud infrastructure.
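To make "machine-readable topology" concrete, here is a minimal sketch in Python. It assumes the topology is already available as a simple map from each component to the components it depends on; the service names (`checkout`, `payment`, `fraud-lib` and so on) and the `blast_radius` helper are hypothetical and only loosely mirror Maria's scenario. Given a degraded component, it walks the dependency edges in reverse to find everything that could be affected.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "checkout": ["payment", "cart"],
    "payment": ["fraud-lib", "payments-db"],
    "cart": ["cart-db"],
    "fraud-lib": [],
    "payments-db": [],
    "cart-db": [],
}


def blast_radius(depends_on, degraded):
    """Return every component that directly or transitively depends on `degraded`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first traversal over the inverted graph.
    affected, queue = set(), deque([degraded])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(blast_radius(DEPENDS_ON, "fraud-lib")))  # ['checkout', 'payment']
```

The traversal itself is trivial. The hard part, as noted above, is obtaining and maintaining a map like `DEPENDS_ON` for an environment that changes continuously.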
2. Root cause inference at scale
Even with a topology, searching it during an incident poses scalability issues. General-purpose ML libraries are not built for causality analysis at production scale.
To diagnose checkout failure, should we evaluate payment APIs or database clusters? Intuitively, an engineer would prioritize services tied to revenue delivery. But generic ML techniques lack this reasoning, forcing an exponential search across all topology layers — like holding a microphone to every inch of a car engine.
Advanced algorithms are needed to traverse topology graphs during incidents, weighing and filtering options based on business criticality. Both simple and intricate failure chains must be unpacked, all before revenue and trust disappear.
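One way to picture a criticality-weighted traversal is a best-first search with a fixed inspection budget. The sketch below is an illustration under stated assumptions, not a production algorithm: the `rank_suspects` function, the scoring rule (criticality multiplied by an anomaly score) and all the numbers are hypothetical. It starts at the failing service, always expands the most promising dependency next, and stops after inspecting `budget` nodes instead of searching the whole graph.

```python
import heapq


def rank_suspects(depends_on, criticality, anomaly, start, budget=10):
    """Best-first walk from the failing service toward its dependencies.

    Nodes are expanded in order of criticality * anomaly score, and at most
    `budget` nodes are inspected, so the search never has to touch the
    whole graph.
    """
    def priority(node):
        return criticality.get(node, 0.0) * anomaly.get(node, 0.0)

    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-priority(start), start)]
    seen = {start}
    inspected = []
    while frontier and len(inspected) < budget:
        neg_score, node = heapq.heappop(frontier)
        inspected.append((node, -neg_score))
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                heapq.heappush(frontier, (-priority(dep), dep))
    # The highest-scoring inspected nodes are the leading candidates.
    return sorted(inspected, key=lambda item: item[1], reverse=True)


# Hypothetical inputs: topology, business criticality and anomaly scores.
depends_on = {
    "checkout": ["payment", "cart"],
    "payment": ["fraud-lib", "payments-db"],
    "cart": ["cart-db"],
}
criticality = {"checkout": 1.0, "payment": 0.9, "cart": 0.5,
               "fraud-lib": 0.8, "payments-db": 0.9, "cart-db": 0.4}
anomaly = {"checkout": 0.6, "payment": 0.7, "cart": 0.1,
           "fraud-lib": 0.9, "payments-db": 0.2, "cart-db": 0.1}

suspects = rank_suspects(depends_on, criticality, anomaly, "checkout", budget=4)
print([name for name, _ in suspects])
# ['fraud-lib', 'payment', 'checkout', 'payments-db']
```

With a budget of four, the low-priority `cart` branch is never inspected. That is the filtering effect described above: the search follows the revenue-critical, anomalous path first.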
3. Interpretability for humans
Finally, ML troubleshooting creates a new challenge: how to make inferences understandable to humans. Identifying patterns in metrics data reveals statistical correlations between events, but not causal priority chains:
But this diagnosis doesn’t answer the questions that provide actionable insights to engineers:
Solving this final-mile problem requires models that capture and visualize root-cause probability, business-impact sequencing, risk levels and mitigation recommendations.
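As a rough illustration of what such output could look like, the sketch below defines a small record holding the four elements just listed and renders a ranked, plain-language summary. The `Finding` class, its field names and the example values are all hypothetical, invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    component: str
    probability: float    # estimated root-cause probability, between 0 and 1
    business_impact: str
    risk_level: str
    mitigation: str


def summarize(findings):
    """Render findings as plain text, most probable root cause first."""
    ordered = sorted(findings, key=lambda f: f.probability, reverse=True)
    lines = []
    for rank, f in enumerate(ordered, start=1):
        lines.append(
            f"{rank}. {f.component} ({f.probability:.0%} likely, "
            f"risk: {f.risk_level}): {f.business_impact}. "
            f"Suggested action: {f.mitigation}."
        )
    return "\n".join(lines)


# Illustrative values only; they are not taken from any real incident.
report = summarize([
    Finding("payments-db", 0.11, "slower settlement reports", "low",
            "monitor query latency"),
    Finding("payment-service", 0.82, "checkout success rate down", "high",
            "roll back the latest release"),
])
print(report)
```

The point of a structure like this is that an engineer reads a ranked list of causes with a recommended action next to each, not a table of correlated metrics.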
While core machine learning techniques show promise, purpose-built solutions are necessary to address the complexity of causality analysis at production scale. Combining specialized topology inference, heuristic graph search algorithms and interpretable data science unlocks the power of automated root cause analysis. But it requires advances in data collection, service mapping, ML and the communication of technical insights — all with the goal of remediation.