
URL: https://thenewstack.io/training-a-ml-model-to-forecast-kubernetes-node-anomalies/


Training a ML Model to Forecast Kubernetes Node Anomalies

We chose a Bayesian network approach to reduce the amount of data required to forecast node failure.
Oct 19th, 2022 7:24am by Jerry Lee
CNCF sponsored this post.

This is part of a series of contributed articles leading up to KubeCon + CloudNativeCon on Oct. 24-28. 

It’s no surprise that using artificial intelligence to improve IT system operations is in the spotlight, considering the five benefits experts attribute to it: proactive management, faster remediation, improved productivity, efficient collaboration and better application performance.

Forecasting system anomalies and reducing alert noise with machine learning are considered key ways to improve the performance of IT operations. Two trends are driving adoption: the growing use of open source, standard stacks such as Kubernetes and Prometheus, which enable the collection of high-quality data such as metrics and logs, and the increasing accuracy of machine learning.

Challenges

However, to increase the accuracy of machine learning, organizations need to collect proper data sets to train the model. That means various types of outages need to occur, and information such as metrics, events and logs from the relevant monitoring targets must be continuously collected and fed to the model.

Even if an individual organization continuously collects data sets, it must collect large-scale data over a fairly long period to reach a useful accuracy level in the machine learning model. It also takes effort to label whether each monitoring target is in an anomalous state. Additionally, validating the results from the ML model is difficult and labor intensive.

Approach

To overcome these challenges and apply machine learning to forecast anomalies in computing resources, my team decided to use a Bayesian network approach to secure training data at the initial stage. A Bayesian network can start from experts’ rule sets and still reach a certain level of model performance. This helps organizations gather a basic data set to train the model even when they don’t yet have enough data of their own.
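
As a rough illustration of how expert knowledge can seed a Bayesian network before any outage data exists, the sketch below hand-codes a tiny network in which the node state influences three observed symptoms. The symptom names, every probability and the simplifying assumption that symptoms are conditionally independent given the node state are all hypothetical; they are not values or structure from our system.

```python
# Hypothetical expert-supplied numbers; none of them come from the article.
PRIOR_ANOMALY = 0.02  # P(anomaly) before any symptom is observed

# symptom -> (P(symptom present | anomaly), P(symptom present | healthy))
LIKELIHOODS = {
    "high_cpu": (0.70, 0.10),
    "high_memory": (0.60, 0.05),
    "dropped_packets": (0.40, 0.02),
}


def posterior_anomaly(observed: dict) -> float:
    """Return P(anomaly | observed symptoms) using Bayes' rule.

    `observed` maps a symptom name to True (present) or False (absent).
    Symptoms are assumed conditionally independent given the node state.
    """
    p_anomaly = PRIOR_ANOMALY
    p_healthy = 1.0 - PRIOR_ANOMALY
    for symptom, present in observed.items():
        p_if_anomaly, p_if_healthy = LIKELIHOODS[symptom]
        p_anomaly *= p_if_anomaly if present else 1.0 - p_if_anomaly
        p_healthy *= p_if_healthy if present else 1.0 - p_if_healthy
    # Normalize so the two hypotheses sum to 1.
    return p_anomaly / (p_anomaly + p_healthy)


if __name__ == "__main__":
    all_present = {name: True for name in LIKELIHOODS}
    print(f"P(anomaly | all symptoms) = {posterior_anomaly(all_present):.3f}")
```

With no observations the function returns the prior, and each observed symptom moves the estimate up or down. Once real labeled data accumulates, the hand-set probabilities could be replaced with values estimated from that data.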

We focused on monitoring Kubernetes nodes because standard open source software such as Prometheus, Node Exporter and cAdvisor can be installed to generate the data sets needed to evaluate anomalies in Kubernetes resources.

For the metric pipeline, we chose Kafka together with a Prometheus-Kafka adapter to receive metric feeds from Prometheus. To consume data from the Kafka topics and provide learning data sets, our team decided to develop a metric evaluation engine that pre-evaluates metrics using rule bases from system experts.
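
To make the consuming side of the metric pipeline more concrete, here is a minimal sketch of decoding one message value into a typed sample. The JSON shape (name, value, timestamp and labels fields), the instance label used as the node identifier and the MetricSample type are assumptions for illustration; check the actual output format of the adapter you deploy. The Kafka consumer itself is left out so the example stays self-contained.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MetricSample:
    """One metric reading for one node (hypothetical type for this sketch)."""
    name: str
    node: str
    value: float
    timestamp: datetime


def parse_sample(raw: bytes) -> MetricSample:
    """Decode one Kafka message value into a MetricSample.

    Assumed message shape (adjust to your adapter's real output):
    {"name": "...", "value": "...", "timestamp": "...Z", "labels": {"instance": "..."}}
    """
    doc = json.loads(raw)
    return MetricSample(
        name=doc["name"],
        node=doc.get("labels", {}).get("instance", "unknown"),
        value=float(doc["value"]),
        # Older Python versions reject a trailing "Z" in fromisoformat(), so normalize it.
        timestamp=datetime.fromisoformat(doc["timestamp"].replace("Z", "+00:00")),
    )


if __name__ == "__main__":
    message = (
        b'{"name": "node_load1", "value": "0.5", '
        b'"timestamp": "2022-10-19T07:24:05Z", '
        b'"labels": {"instance": "node-1:9100"}}'
    )
    print(parse_sample(message))
```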

The pre-evaluation results from the engine are stored in a data mart for the machine learning pipeline. The machine learning pipeline is configured on Kubeflow, a machine learning pipeline platform that operates on Kubernetes. We chose TensorFlow as the machine learning engine for the anomaly forecasting model, and the evaluation results are stored in MariaDB.

Implementation Details

The figure below shows the overall solution architecture, covering the entire process from the metric feed to saving the evaluation result.

[Figure: overall solution architecture, from the metric feed to the saved evaluation results]

The whole path, from metric collection by Prometheus to the evaluation of those metrics by the machine learning model, should complete within 30 seconds. That budget matches our Prometheus collection interval of 30 seconds. (Prometheus’s own built-in default scrape interval is 1 minute, but 30 seconds is a common setting in Kubernetes deployments.) An interval of 30 seconds to 1 minute is widely accepted as a best practice for system monitoring, so each batch of metrics needs to be evaluated before the next one arrives.

In a Kubernetes cluster, Node Exporter, cAdvisor and Kubernetes together provide about 5,000 metrics per minute. However, the number of must-have metrics for Kubernetes node and pod anomaly forecasting varies with the number of resources you are monitoring. About 40 to 50 metrics per Kubernetes resource are enough, so the process needs to filter for the must-have metrics to minimize data processing time.
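
The filtering step described above could look something like the following sketch, which keeps only samples whose metric name is on an allowlist. The four names are real Node Exporter metrics chosen as examples, but the list itself is an assumption; the actual 40 to 50 must-have metrics are not enumerated in this article.

```python
# Hypothetical allowlist; a real one would hold the 40 to 50 must-have metrics.
MUST_HAVE_METRICS = frozenset({
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_filesystem_avail_bytes",
    "node_network_receive_drop_total",
})


def keep_must_have(samples):
    """Yield only the samples whose metric name is on the allowlist.

    `samples` is any iterable of dicts with at least a "name" key.
    A generator is used so that dropped samples never accumulate in memory.
    """
    for sample in samples:
        if sample["name"] in MUST_HAVE_METRICS:
            yield sample


if __name__ == "__main__":
    incoming = [
        {"name": "node_cpu_seconds_total", "value": 1234.5},
        {"name": "go_gc_duration_seconds", "value": 0.0001},
        {"name": "node_filesystem_avail_bytes", "value": 5.0e9},
    ]
    for sample in keep_must_have(incoming):
        print(sample["name"])
```

Filtering as early as possible in the pipeline means every later stage handles a small fraction of the raw feed, which is where most of the time savings would come from.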

The metric evaluator pre-evaluates the target nodes using metrics related to CPU, memory, file system and network, applying preset rules from the experts’ guidelines, and saves the evaluation results every 30 seconds. The implemented pipeline can process the metrics from the cluster, but a full pass sometimes takes more than 30 seconds.
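
A rule-based pre-evaluation of this kind can be sketched as a set of named predicates applied to a per-node snapshot. The NodeSnapshot fields, rule names and thresholds below are hypothetical placeholders, not the experts’ actual guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSnapshot:
    """Metrics for one node over one interval; usage fields are fractions from 0.0 to 1.0."""
    cpu_usage: float
    memory_usage: float
    filesystem_usage: float
    dropped_packets_per_sec: float


# Hypothetical expert rules: rule name -> predicate that is True when the rule is violated.
EXPERT_RULES = {
    "cpu_saturated": lambda s: s.cpu_usage > 0.90,
    "memory_pressure": lambda s: s.memory_usage > 0.85,
    "filesystem_nearly_full": lambda s: s.filesystem_usage > 0.90,
    "packets_dropped": lambda s: s.dropped_packets_per_sec > 100.0,
}


def pre_evaluate(snapshot: NodeSnapshot) -> dict:
    """Apply every rule and return a label row suitable for storing as training data."""
    violated = [name for name, rule in EXPERT_RULES.items() if rule(snapshot)]
    return {"anomaly": bool(violated), "violated_rules": violated}


if __name__ == "__main__":
    busy_node = NodeSnapshot(cpu_usage=0.95, memory_usage=0.50,
                             filesystem_usage=0.30, dropped_packets_per_sec=0.0)
    print(pre_evaluate(busy_node))
```

Recording which rules fired, not just the final flag, keeps the labels explainable and makes it easier to compare the rule base against the model later.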

The machine learning pipeline reads the pre-evaluation results every 30 seconds and uses them both to train the model and to evaluate system anomalies, so the information can be used for IT operations. The saved evaluation results can be used to reduce alert noise and manage system outages proactively.
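
One way to turn the stored pre-evaluation results into forecasting examples is to pair a window of recent intervals with whether an anomaly follows shortly afterward. The article does not specify how training examples are framed, so the windowing scheme, the function name and the default window and horizon sizes below are assumptions. The output is plain Python lists that could then be handed to a framework such as TensorFlow.

```python
def make_training_examples(features, labels, window=4, horizon=2):
    """Pair each window of past feature rows with a forecast label.

    features: list of per-interval feature rows (for example, one row every 30 seconds)
    labels:   list of 0/1 pre-evaluation results aligned with `features`
    window:   number of consecutive intervals used as model input
    horizon:  number of following intervals checked for an anomaly

    Returns a list of (x, y) pairs where x is the flattened window and
    y is 1 if any anomaly occurs in the `horizon` intervals after it.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    examples = []
    last_start = len(features) - window - horizon
    for start in range(last_start + 1):
        end = start + window
        x = [value for row in features[start:end] for value in row]
        y = int(any(labels[end:end + horizon]))
        examples.append((x, y))
    return examples


if __name__ == "__main__":
    cpu = [[0.2], [0.3], [0.5], [0.7], [0.9], [0.95], [0.4], [0.3]]
    flags = [0, 0, 0, 0, 0, 1, 0, 0]
    for x, y in make_training_examples(cpu, flags):
        print(x, "->", y)
```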

Implications from the Implementation

After the pipeline and system were deployed, it was difficult to train the machine learning models because there was no outage or cluster issue for several days. Our team had to induce anomalous conditions so that the pipeline would produce pre-evaluation results to train the models.

End-to-end processing might take more than 30 seconds if you need to monitor more than two Kubernetes clusters, unless the pipeline process is horizontally scaled. We considered adjusting the metric filter logic to reduce the target metrics and shorten processing time, but our team decided to use a 1-minute process time so as not to lose business context.
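
The scaling trade-off above comes down to simple arithmetic, sketched below. It assumes the work divides evenly across replicas with no coordination overhead, which real pipelines rarely achieve, and the per-cluster processing times in the example are made-up numbers, not measurements from our system.

```python
import math


def replicas_needed(clusters: int, seconds_per_cluster: float,
                    budget_seconds: float = 30.0) -> int:
    """Estimate how many pipeline replicas keep one cycle within the time budget.

    Assumes perfectly parallel work: total processing time is split evenly
    across replicas. Treat the result as a lower bound.
    """
    total_seconds = clusters * seconds_per_cluster
    return max(1, math.ceil(total_seconds / budget_seconds))


if __name__ == "__main__":
    # Hypothetical: each cluster takes 20 seconds to process on one replica.
    print(replicas_needed(clusters=3, seconds_per_cluster=20.0))
    # The same load against a 1-minute budget.
    print(replicas_needed(clusters=3, seconds_per_cluster=20.0, budget_seconds=60.0))
```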

The evaluation results are largely explained by a few key metrics: CPU usage, memory usage, storage, high network traffic and dropped packets. The remaining metrics had little influence on the anomaly results. Creating more subtle machine learning models might require a longer training period.

The team is still discussing when we can turn off the rule-based pre-evaluation used to train the models; we haven’t turned it off yet.

Automating and monitoring the pipeline are key success factors since the entire process is quite complex.

Improvement Opportunities

Overall, the feedback process will be crucial. The IT operation team should be able to provide feedback about anomaly detection for successful or failed cases, and the feedback should be fed to the machine learning model.
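
A feedback loop like this could start as something very small: merge operator-confirmed outcomes into the model’s predictions and count the disagreements. The data shapes and the apply_feedback name below are hypothetical, intended only to show the idea.

```python
def apply_feedback(predictions, feedback):
    """Merge operator feedback into model predictions.

    predictions: {(node, interval): bool} anomaly flags produced by the model
    feedback:    {(node, interval): bool} what operators confirmed actually happened

    Returns (corrected_labels, false_positives, false_negatives). An anomaly
    reported by operators that the model has no prediction for counts as a
    false negative.
    """
    corrected = dict(predictions)
    false_positives = 0
    false_negatives = 0
    for key, actual in feedback.items():
        predicted = predictions.get(key, False)
        if predicted and not actual:
            false_positives += 1
        elif actual and not predicted:
            false_negatives += 1
        corrected[key] = actual  # operator feedback overrides the model
    return corrected, false_positives, false_negatives


if __name__ == "__main__":
    model_output = {("node-1", 1): True, ("node-2", 1): False}
    operator_notes = {("node-1", 1): False, ("node-2", 1): True}
    print(apply_feedback(model_output, operator_notes))
```

The corrected labels can be fed back as training data, while the two counters give a running view of how often the model raises false alarms or misses real incidents.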

Correlation between Kubernetes resources also needs to be considered as one of the inputs to the model. Pod and volume anomalies might be the cause of node failure, and the machine learning model should be improved to account for that correlation.

In addition to metrics, other inputs such as Kubernetes events and application logs would help improve the model’s performance.

To hear more about cloud native topics, join the Cloud Native Computing Foundation and the cloud native community at KubeCon + CloudNativeCon North America 2022 in Detroit (and virtual) from Oct. 24-28.

Jerry Lee is head of product at NexClipper. He has more than 25 years of experience as an executive for startups and Fortune 500 companies, leading digital transformation and global operations in private and public clouds. He has been with...