URL: https://thenewstack.io/know-the-hidden-costs-of-diy-prometheus/

Know the Hidden Costs of DIY Prometheus - The New Stack



Know the Hidden Costs of DIY Prometheus

Building your own Prometheus instance can only take you so far, and businesses are finding that running it in-house often is neither scalable nor reliable enough.
Mar 14th, 2023 9:46am by George Hamilton
Chronosphere sponsored this post.

When it arrived on the scene, the open source Prometheus monitoring toolkit gave overworked observability teams a metrics-based way to verify that their environments were working as needed.

Yet building out your own Prometheus instance often can only take you so far, and businesses are finding out that running Prometheus in-house is neither scalable nor reliable enough to handle their rapidly growing cloud native environments.

Why Start with Prometheus for Metrics-Based Observability?

DIY (do it yourself) Prometheus is a natural starting point for many companies as they begin their cloud native journey. It’s free; it’s open source; and there are great community contributions and support.

However, as a cloud native environment grows and engineers demand more data to optimize their apps and infrastructure, Prometheus requires a more complex architecture, and more staff bandwidth, to scale. Nearly every organization eventually reaches a point where managing a complex Prometheus implementation in-house is anything but free: observability becomes more costly, and consumes more engineering resources, than the production environment it monitors.

Chronosphere, a Palo Alto Networks company, is the observability platform built for control in the modern, containerized world. Recognized as a leader by major analyst firms, Chronosphere empowers customers to focus on the data and insights that matter to reduce data complexity, optimize costs, and remediate issues faster. Visit chronosphere.io.

What Are the Four Common Challenges of DIY Prometheus?

1. Data Becomes Hard to Find

You know you’re bumping up against the limits of Prometheus when engineers complain that they can’t quickly locate observability data. To scale Prometheus, you need to spin up separate instances and have each one scrape and store data from specific services. This manually shards the load across your Prometheus instances, but it causes problems as you scale.
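The manual sharding described above can be sketched as a static map from instances to the services they scrape. This is only an illustration: the instance and service names are hypothetical, and a real deployment would express the assignment in each instance's scrape configuration rather than in Python.

```python
# Hypothetical assignment of services to Prometheus instances.
# Every name here is invented for illustration.
SHARD_MAP = {
    "prometheus-a": ["checkout", "payments"],
    "prometheus-b": ["search", "catalog"],
    "prometheus-c": ["auth"],
}


def instance_for(service, shard_map=SHARD_MAP):
    """Return the Prometheus instance that holds metrics for a service."""
    for instance, services in shard_map.items():
        if service in services:
            return instance
    raise KeyError(f"no Prometheus instance scrapes {service!r}")


if __name__ == "__main__":
    # Every dashboard, alert and engineer needs this lookup in some form.
    print(instance_for("search"))
```

The lookup is trivial with three instances, but the map has to be kept current by hand, which is where the trouble starts as the number of services and instances grows.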

Dashboards and Alerts for Prometheus Instances

From a dashboarding and alerting perspective, you need to tell each dashboard or alert which Prometheus instance to query for its data. A single dashboard or alert may also need data from multiple Prometheus instances, so you federate them: another instance pulls a subset of the data from the original instances.

The bottom line is that scaling Prometheus leads to more federated nodes and a much more complicated Prometheus topology. As you do this across zones or regions, you need yet another Prometheus instance to federate and combine the data from each zone or region. Engineers need to remember which Prometheus instance contains the data they are looking for. You’ll likely hear from them that it takes too long to find data, run queries and fix issues.
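As a rough illustration of what federation involves, the sketch below builds the URLs a higher-level instance would scrape to pull a subset of series from each lower-level instance. Prometheus exposes a `/federate` endpoint that takes `match[]` selectors; the host names and the selector used here are hypothetical, and in practice these values would live in the federating instance's scrape configuration.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL that selects series using match[] parameters."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical per-zone instances feeding one global instance.
    zone_instances = ["http://prom-zone-a:9090", "http://prom-zone-b:9090"]
    for base in zone_instances:
        print(federate_url(base, ['{job="checkout"}']))
```

Each additional zone or region adds another entry to the list, and each dashboard that spans zones depends on the right selectors being federated.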

2. Poor Reliability Results in Data Loss

Out of the box, a Prometheus instance is a single point of failure: if it goes down, you lose active data and access to historical data. That is why it’s recommended to run two instances that scrape the same endpoints. If one goes down, you still have a copy of your metrics.

Relying on Dashboards

Another best practice is to run a load balancer in front of the instances and point your dashboards at it. This generally works for reliability in the sense that a query still reaches a copy of the data. The problem is that each query is served by one instance, so during rolling restarts of your Prometheus instances you’ll see a gap in your data for the period that instance was down. The bottom line, again, is that you may need a longer-term or remote storage solution, or you may need to distribute across multiple cloud regions and cloud providers for fault tolerance. This adds complexity and operational burden for your engineering teams.
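To make the gap problem concrete, here is a minimal sketch of two replicas that scraped the same target, one of which missed samples during a restart. It assumes, for simplicity, that both replicas record samples at identical timestamps, which real replicas generally do not. The merge step is not something Prometheus does on its own; it stands in for what a deduplicating storage layer would have to do.

```python
def find_gaps(samples, scrape_interval=30):
    """Return (start, end) timestamp pairs where samples are missing."""
    gaps = []
    for (t0, _), (t1, _) in zip(samples, samples[1:]):
        if t1 - t0 > scrape_interval:
            gaps.append((t0, t1))
    return gaps


def merge_replicas(primary, secondary):
    """Combine samples from two replicas, preferring the primary on overlap."""
    merged = dict(secondary)
    merged.update(dict(primary))
    return sorted(merged.items())


if __name__ == "__main__":
    # Illustrative (timestamp, value) samples; replica A restarted mid-window.
    replica_a = [(0, 1.0), (30, 1.1), (120, 1.4)]
    replica_b = [(30, 1.1), (60, 1.2), (90, 1.3), (120, 1.4)]
    print(find_gaps(replica_a))
    print(find_gaps(merge_replicas(replica_a, replica_b)))
```

A dashboard that happens to read only from replica A sees the hole, even though the missing samples exist on replica B.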

3. Longer Data Retention Gets Expensive

Teams often ask to retain more data for longer so they can troubleshoot more effectively. However, Prometheus is not efficient at storing long-term data: it has no built-in downsampling capabilities.

How Longer Data Retention Gets Expensive

As an example, storing one time series for six months at a scrape interval of 30 seconds takes approximately 8,100 KB. If you could downsample to a one-hour resolution over the same six months, it would take approximately 67.5 KB, a 120-fold reduction. So as you store more and more long-term data, downsampling becomes very valuable for efficiency. There are some workarounds, but they add complexity and engineering time to manage.
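The figures above can be reproduced with a short calculation. The sketch assumes a single time series, a six-month window of 180 days, 16 bytes per uncompressed sample (an 8-byte timestamp plus an 8-byte value) and kilobytes of 1,024 bytes. None of these assumptions is stated in the article; they are simply the values that make the arithmetic match. Actual on-disk sizes will differ because Prometheus compresses samples.

```python
BYTES_PER_SAMPLE = 16  # assumption: 8-byte timestamp + 8-byte value, uncompressed
RETENTION_DAYS = 180   # assumption: "six months" taken as 180 days


def storage_kib(scrape_interval_seconds,
                days=RETENTION_DAYS,
                bytes_per_sample=BYTES_PER_SAMPLE):
    """Approximate storage for one series, in units of 1,024 bytes."""
    samples = days * 24 * 3600 // scrape_interval_seconds
    return samples * bytes_per_sample / 1024


if __name__ == "__main__":
    raw = storage_kib(30)            # 30-second scrape interval
    downsampled = storage_kib(3600)  # one-hour resolution
    print(raw, downsampled, raw / downsampled)
```

The ratio between the two results is simply the ratio of the intervals, 3,600 seconds to 30 seconds, which is why coarser resolution pays off so quickly for long retention.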

4. Data Growth Forces Tough Trade-Offs

A clear sign you’re bumping up against the limits of DIY Prometheus is that you’re being forced to make difficult trade-offs between data collection and cost. In a perfect world, we would capture everything so we always have the data we need. In practical terms, though, the volume of observability data increases faster than your production environment as you transition from cloud to cloud native.

If you were running on VMs and are now running on containers, your infrastructure and cloud bills are pretty much the same for the same cluster size. But instead of tens of VMs, you’re now running hundreds or thousands of containers, each generating as much telemetry data as a VM did. Your observability costs can end up higher than the cost of the infrastructure supporting your apps. If you’re cutting back on monitoring as you move to containerized applications, it’s likely time for a more scalable solution.

So if Prometheus can’t keep up, what’s to be done? When you’ve gotten as much as you can out of your DIY Prometheus implementation, it’s time to consider a Prometheus alternative. When evaluating solutions, an important consideration is how that solution can leverage the investment you’ve made in your existing Prometheus environment, specifically:

  • Instrumentation
  • Data collection
  • Data presentation (dashboards and alerts)

A managed solution should leverage your instrumentation and data presentation, but alleviate the increasing cost and operational burden of managing an observability platform in-house.

How Chronosphere Can Help

Chronosphere was built from the ground up for cloud native scale, complexity and reliability. Chronosphere helps engineers be more productive by giving them faster and more actionable alerts that they can triage rapidly. Plus, it allows them to spend less time on monitoring instrumentation and more time delivering innovation that grows your business.

The Data

According to Forrester Research, a typical Chronosphere customer sees a 165% return on investment and $7.75 million in benefits over three years. The average customer reduces their observability data volumes by 48% after transformation, while improving their observability metrics.

To learn more, read the Forrester Total Economic Impact study.

George Hamilton is the director of product marketing at Chronosphere. George has over 25 years of experience in the technology industry and has held many product marketing roles at leading tech companies such as CloudHealth by VMware, XebiaLabs/Digital.ai and Dell...