
URL: https://thenewstack.io/creating-a-path-for-prometheus-success/


Creating a Path for Prometheus Success

A look at the challenges that can easily disrupt smooth operations with Prometheus, and how to overcome them.
Feb 28th, 2024 6:21am by Arthur Sens
Image from djgis on Shutterstock.
CNCF sponsored this post.

Prometheus is an easy-to-use, open source monitoring and alerting toolkit. Its popularity is no doubt due to its efficient time-series database, flexible query language (PromQL) and general scalability. Furthermore, its support for dynamic service discovery, native integration with Kubernetes and alerting capabilities make it a great choice for monitoring in dynamic, cloud native environments. Prometheus also has an active, open source community that contributes to continuous improvements and growing adoption.

Yet despite all the benefits that Prometheus offers, many challenges can easily disrupt smooth operations. Let’s take a look at some of them.

A Tale of Cardinal Inexperience

It’s quite common for people who are inexperienced with Prometheus to encounter high-cardinality problems. These issues can lead to Prometheus instances growing much faster than expected, thereby creating scalability and performance problems.

In Prometheus, cardinality refers to the number of unique time series. Each distinct combination of a metric name and its label values is stored as a separate series, so a high-cardinality situation occurs when a large number of distinct labels or label values is being generated.

This often arises from misuse or misunderstanding of labels. For example, adding highly dynamic labels (like timestamps, unique identifiers or user IDs) to metrics can rapidly increase the number of time series stored.
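To make the multiplication effect concrete, here is a minimal sketch that computes a worst-case series count for a single metric. The label names and value counts are hypothetical, and the product is only an upper bound, since real workloads rarely produce every possible combination.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: the product of the
    number of distinct values seen for each label."""
    return prod(label_value_counts.values())


# Hypothetical metric that carries only bounded labels.
bounded = {"method": 5, "status_code": 8, "instance": 20}

# The same metric after adding a hypothetical per-user label.
with_user_id = {**bounded, "user_id": 10_000}

print(worst_case_series(bounded))       # 800
print(worst_case_series(with_user_id))  # 8000000
```

A single unbounded label turns a few hundred series into millions, which is why identifiers like these usually belong in logs or traces rather than in metric labels.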

High cardinality can result in a series of unfortunate events:

Increased Storage Requirements

High cardinality leads to a dramatic increase in the number of time series that Prometheus needs to store, which can quickly consume storage resources. Of course, this can get expensive.
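As a rough illustration, the Prometheus storage documentation suggests estimating disk needs as retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample, with an average of about 1 to 2 bytes per sample. The sketch below applies that formula. The series counts, the 15-second scrape interval and the choice of 2 bytes per sample are assumptions for illustration, not measurements.

```python
def estimated_disk_bytes(active_series, scrape_interval_s,
                         retention_days, bytes_per_sample=2.0):
    """Rough capacity estimate: retention seconds * samples/second * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed scenarios: same scrape interval and retention, different cardinality.
for series in (100_000, 10_000_000):
    size_gb = estimated_disk_bytes(series, scrape_interval_s=15,
                                   retention_days=15) / 1e9
    print(f"{series:>10} series -> {size_gb:.2f} GB")
```

Because the estimate is linear in the number of active series, a hundredfold jump in cardinality means roughly a hundredfold jump in disk usage at the same retention.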

Performance Degradation

Query performance can suffer significantly in high-cardinality scenarios. Prometheus has to process a larger number of time series, which can slow down query responses and increase CPU and memory usage.

Management Overhead 

Managing and maintaining a Prometheus instance with high cardinality becomes more challenging. It requires more careful tuning and possibly more sophisticated infrastructure solutions.

Making Sure Your Storage Management Doesn’t Go A-WAL

The write-ahead log (WAL) in Prometheus is a mechanism used to ensure data integrity and prevent data loss in case of a crash or unexpected shutdown. Whenever Prometheus records new data, it first writes that data to the WAL, housed on the filesystem of the server where Prometheus is running, before it is written to the database.

This approach means that if Prometheus restarts for any reason, it can use the WAL to recover any data that was not yet written to the database. The WAL acts as a record of what should be in the database, ensuring that no data is lost if the system crashes.
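The general write-ahead pattern can be sketched in a few lines. This is a toy illustration of the idea, not Prometheus's actual WAL format or code: the `TinyStore` class, its JSON-lines log and the sample metric name are all hypothetical.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy in-memory store that logs every write to disk before applying it."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log on startup

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                record = json.loads(line)
                self.data[record["series"]] = record["value"]

    def write(self, series, value):
        record = json.dumps({"series": series, "value": value})
        # Step 1: append to the log and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(record + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # Step 2: only then update the in-memory state.
        self.data[series] = value


wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(wal_file)
store.write("http_requests_total", 42)
store.write("http_requests_total", 43)

# Simulate a crash: discard the old object and start fresh from the log.
recovered = TinyStore(wal_file)
print(recovered.data)  # {'http_requests_total': 43}
```

Note that the replay step reads every record in the log, so in this sketch the startup time grows with the amount of logged data.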

However, one of the main challenges with the WAL is the time it takes to replay it after a crash or restart. When Prometheus restarts, it needs to process the WAL to reconstruct its in-memory state. This process can be time-consuming, especially if there’s a lot of data in the WAL.

In practical terms, this means that if the WAL replay process takes a long time, Prometheus can experience significant downtime with monitoring and alerting being temporarily unavailable — not exactly ideal for systems that rely on real-time monitoring.

Scaling without Complexity? LOL!

Handling scalability in Prometheus, especially in large-scale and dynamic environments, often requires adopting additional strategies and tools. Although Prometheus is a monolithic application, it performs many distinct functions: scraping and storing metrics, serving queries, evaluating alerting and recording rules and more.

If a particular setup depends heavily on a single Prometheus feature, you may be forced to scale up the entire Prometheus instance even though only one part of it needs to scale. This is where distributed setups and tools like Thanos and Cortex come into play.

Both of them extend Prometheus by adding a global query view, supporting the Prometheus query API natively, and providing efficient storage and multicluster support. They also allow for long-term storage of Prometheus metrics in object storage (like AWS S3 or Google Cloud Storage), making it more cost-effective and scalable. However, while Thanos and Cortex components can be scaled separately, thereby solving the monolithic scaling issue of Prometheus, all of their additional components require some level of expertise and effort to maintain.

In short, while immensely helpful, both Thanos and Cortex introduce additional components into the monitoring architecture, which increases complexity in terms of deployment, management and troubleshooting.

Creating a Framework for Success

If you want to use Prometheus without encountering these storage and scalability woes, join our presentation on using Prometheus-Operator at the CNCF-hosted co-located Events Europe, as well as our hands-on workshop a few days later at KubeCon.

You’ll learn how to reap all the rewards of Prometheus without risking a thing. You’ll also get to meet me and my colleague Nicolas Takashi — we’re platform engineers at Coralogix — along with our esteemed co-presenters, Bartłomiej Płotka and Mahmoud Amin, senior software engineers at Google, and Jesus Vazquez from Grafana.

See you there!

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon Europe in Paris, from March 19–22.

The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure including Kubernetes, OpenTelemetry, and Argo. CNCF is the neutral home for cloud native collaboration, bringing together the industry’s top developers, end users, and vendors.
Arthur Sens is a platform engineer at Coralogix, with a mixture of site reliability engineering and software engineering backgrounds. He actively contributes to the Prometheus ecosystem, maintaining Prometheus-Operator and Prometheus client_golang while mentoring new open source software contributors.