Organizations are turning in droves to Prometheus to monitor their container and microservice estates, but larger companies often run headlong into a wall: They face scaling challenges when they move beyond a handful of apps.
Monitoring monolithic environments used to be relatively straightforward. You had a certain number of static physical servers and virtual machines and a finite number of metrics to watch. Today the number of entities to track is exploding because of containers and the migration to microservice architectures.
If servers sitting in data centers were pets (as they have been described) requiring our constant attention, and cloud instances are more like cattle (you don’t care about a single one because you have a lot), then containers are more akin to locusts. There are a lot of them, sometimes hundreds per machine, new ones appear all the time, and when used in conjunction with an orchestrator like Kubernetes, their lifetime can be very short. This makes it much harder to keep track of them, and if you’re not careful, they can cause a lot of damage.
As the complexity and distribution of environments increase, so does the number of entities you need to monitor. Additionally, you might want to monitor more attributes to ensure you have an accurate picture of what is going on, or, in the case of troubleshooting or incident response, what was going on. The latter is particularly problematic in these ephemeral environments: by the time you want to understand the root cause of a problem, the resources in question have often already been decommissioned. Monitoring solutions therefore have to store enough history for forensics.
Increasingly, teams that need cloud monitoring are turning to Prometheus, an open source CNCF project. Prometheus has become the go-to tool developers use to collect and make sense of metrics in cloud-native environments. It is supported by a large community: 6,300 contributors from more than 700 companies, 13,500 code commits and 7,200 pull requests.
A typical cloud-native application stack (such as Kubernetes, NGINX, MongoDB, Kafka and Go) exposes Prometheus metrics by default. Prometheus is designed as a Go program that scales vertically. It is easy to deploy as, say, a single container or on a single host. This means it’s very easy to get started with Prometheus to get visibility into your first Kubernetes cluster. But it also means that as your infrastructure grows, you will hit its limit.
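To make "exposes Prometheus metrics" concrete, here is a minimal sketch of what a scraped endpoint returns: plain text in the Prometheus exposition format, one line per time series. The helper `render_metric` and the sample metric are hypothetical, written only for illustration. A real service would normally use an official client library rather than hand-rolling this.

```python
def _escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. This helper is a
    hypothetical illustration, not part of any Prometheus library.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is deterministic.
            label_str = ",".join(
                f'{key}="{_escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_metric(
            "http_requests_total",
            "Total HTTP requests.",
            "counter",
            [
                ({"method": "get", "code": "200"}, 1027),
                ({"method": "post", "code": "500"}, 3),
            ],
        )
    )
```

Each distinct combination of metric name and label values is its own time series, which is why the number of series grows so quickly as containers multiply.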
As your environment grows, the number of time series you need to track skyrockets, and at a certain point a single Prometheus instance won’t be able to keep up. The straightforward option would be to run a fleet of Prometheus servers across the enterprise, but this comes with several challenges. For example, managing and federating data across tens or hundreds of Prometheus servers is not easy. Similarly, figuring out enterprise workflows, single sign-on and role-based access control, and adhering to SLAs or compliance requirements, are not easy problems either. As applications grow, operating an all-encompassing monitoring solution without disrupting developer work becomes a huge manageability and reliability issue.
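A rough back-of-the-envelope calculation shows why the series count skyrockets. The sketch below assumes every metric on every target carries every label, so it gives an upper bound rather than a measurement. The function name and all the numbers are hypothetical.

```python
from math import prod


def estimate_active_series(targets, metrics_per_target, label_cardinalities):
    """Upper bound on active time series.

    Each metric on each target fans out into one series per combination of
    label values. Real workloads are usually lower, because not every metric
    carries every label.
    """
    return targets * metrics_per_target * prod(label_cardinalities)


if __name__ == "__main__":
    # Hypothetical small cluster: 50 pods, 200 metrics each,
    # two labels with 5 and 4 distinct values.
    print(estimate_active_series(50, 200, [5, 4]))    # 200000
    # Same workload scaled to 5,000 pods.
    print(estimate_active_series(5000, 200, [5, 4]))  # 20000000
```

The growth is multiplicative: adding pods, metrics or label values each scales the total, and short-lived containers keep adding new series on top of that.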
To deal with that, companies have adopted a few approaches.
A simple first step is to have a separate Prometheus server for every namespace or for every cluster. This approach is clearly hard to scale beyond a certain point, and it has the added disadvantage of creating a large number of disconnected data silos. This makes troubleshooting cumbersome because most issues span multiple services, teams or clusters. Not only is it hard to find the same metric in each environment, you then have to stitch together the data to try to understand what is happening.
Another common approach is to use open source tools such as Cortex or Thanos to federate multiple Prometheus servers. They are powerful tools that enable you to query servers in a centralized way, collect the data and then share it in a single dashboard. However, like any data-intensive distributed system, they require substantial skills and resources to operate.
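The stitching described above can be sketched as follows. The example takes instant-query responses in the JSON shape returned by the Prometheus HTTP API (`/api/v1/query`), tags each sample with the cluster it came from, and aggregates across clusters. The function names, the `cluster` label and the sample data are assumptions made for illustration. Fetching the responses from each server is left out.

```python
def merge_instant_vectors(responses_by_cluster):
    """Merge instant-vector query responses from several Prometheus servers.

    `responses_by_cluster` maps a cluster name to the parsed JSON body of an
    instant query. A hypothetical `cluster` label is added to every sample so
    that its origin survives the merge.
    """
    merged = []
    for cluster, response in responses_by_cluster.items():
        if response.get("status") != "success":
            continue  # skip clusters whose query failed
        for sample in response["data"]["result"]:
            labels = dict(sample["metric"])
            labels["cluster"] = cluster
            timestamp, value = sample["value"]  # the API encodes the value as a string
            merged.append({"metric": labels, "value": [timestamp, float(value)]})
    return merged


def sum_by_label(samples, label):
    """Sum sample values grouped by one label; samples without it group under ''."""
    totals = {}
    for sample in samples:
        key = sample["metric"].get(label, "")
        totals[key] = totals.get(key, 0.0) + sample["value"][1]
    return totals


if __name__ == "__main__":
    # Hypothetical responses from two per-cluster Prometheus servers.
    responses = {
        "us-east": {
            "status": "success",
            "data": {
                "resultType": "vector",
                "result": [
                    {"metric": {"service": "checkout"}, "value": [1700000000, "12.5"]},
                ],
            },
        },
        "eu-west": {
            "status": "success",
            "data": {
                "resultType": "vector",
                "result": [
                    {"metric": {"service": "checkout"}, "value": [1700000000, "7.5"]},
                ],
            },
        },
    }
    merged = merge_instant_vectors(responses)
    print(sum_by_label(merged, "service"))
```

Even this toy version has to handle failed queries and assumes metric and label names match across clusters. Doing that by hand for every investigation is what makes silos so costly.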
For companies that start with Prometheus and then look for a commercial solution to serve up a holistic view, it is important not to lose the work already invested in standardizing on Prometheus: dashboards, alerts, exporters and more. However, that is not the only thing you should consider. If you go this route, insist on the support of these core criteria:
If you find a commercial tool that meets these criteria, you should be able to swap it into existing Prometheus integrations with a minimum of fuss and sidestep the scalability problems companies are running into. Developers adore Prometheus, with good reason, and due diligence now will help you to ensure that they can still use the metrics they love.