URL: https://thenewstack.io/3-key-configuration-challenges-for-kubernetes-monitoring-with-prometheus/

3 Key Configuration Challenges for Kubernetes Monitoring with Prometheus
Kubernetes / Microservices / Observability

The most common challenges facing platform operators and SREs for onboarding new workloads to Prometheus and configuring the tool ecosystem.
Sep 30th, 2020 10:00am by Jürgen Etzlstorfer
Feature image via Pixabay.
Dynatrace sponsored this post.
Jürgen Etzlstorfer
Jürgen is a Technology Strategist at Dynatrace. He is passionate about cloud technologies, self-healing applications, and automation. In the Dynatrace Innovation Lab, he researches emerging technologies and how to leverage them in his daily work. When he is not working, you can find him outdoors, biking, hiking or running. You can follow him at @jetzlstorfer.

Observability is essential to running large workloads in Kubernetes clusters. Prometheus is a monitoring system and time-series database that has proven adept at managing large-scale, dynamic Kubernetes environments. In fact, Prometheus is considered a foundational building block for running applications on Kubernetes and has become the de facto open source standard for visibility and monitoring in Kubernetes environments.

Although Prometheus is open source, it does not come for free: it takes considerable configuration effort to monitor Kubernetes workloads properly. In this article, part one of a two-part piece on Prometheus, I highlight the most common challenges that platform operators and site reliability engineers (SREs) face when onboarding new workloads to Prometheus and configuring the tool ecosystem needed to manage it, along with potential solutions for each challenge.

Disclaimer: In this article, I don’t discuss high-availability or multicluster setups of Prometheus. Instead, I focus on how to scale Prometheus to onboard more applications and to create dashboards for each application, so that more people can use it. If you are interested in high-availability setups, you can refer to projects such as Thanos or VictoriaMetrics.

To start getting Prometheus ready in your organization, you can configure scraping to pull metrics from your services, build dashboards on top of your data using Grafana, and define alerts for important metrics breaching thresholds in your production environment (see figure below).
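As a minimal sketch of the scraping step, the following Python snippet builds a Prometheus configuration with one scrape job that discovers pods through Kubernetes service discovery. The job name, the 30-second interval, and the `prometheus.io/scrape` pod annotation are assumptions for illustration; the annotation is a widely used convention rather than something Prometheus enforces. The output is JSON, which YAML parsers generally accept, so no third-party YAML library is needed.

```python
import json


def build_scrape_config(job_name: str, interval: str = "30s") -> dict:
    """Return a Prometheus config with one Kubernetes pod-discovery scrape job."""
    return {
        "global": {"scrape_interval": interval},
        "scrape_configs": [
            {
                "job_name": job_name,
                # Discover scrape targets from the pods known to the Kubernetes API.
                "kubernetes_sd_configs": [{"role": "pod"}],
                "relabel_configs": [
                    {
                        # Assumption: pods opt in with the annotation
                        # prometheus.io/scrape: "true".
                        "source_labels": [
                            "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
                        ],
                        "action": "keep",
                        "regex": "true",
                    },
                    {
                        # Copy the pod's namespace into a regular label.
                        "source_labels": ["__meta_kubernetes_namespace"],
                        "target_label": "namespace",
                    },
                ],
            }
        ],
    }


def render(config: dict) -> str:
    """Serialize the configuration deterministically so diffs stay small."""
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_scrape_config("kubernetes-pods")))
```

Dashboards and alerts are configured separately on top of the metrics this job collects.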


As soon as you are comfortable with Prometheus as your weapon of choice, your next challenge will be scaling and managing Prometheus for your whole fleet of applications and environments. Naturally, automation is needed so that new applications can be onboarded fast and safely.

Figure: Automated monitoring and dashboarding for all applications running in our Kubernetes cluster makes it easier and faster to onboard new applications.

Challenge 1: Onboarding and Configuring Applications 

Modern workloads often consist of hundreds or thousands of microservices, either as multiple instances of the same application or different smaller applications talking to each other, all orchestrated by Kubernetes. These workloads are not running on a single cluster or in a single environment, but are spread over multiple clusters and environments (or “stages” such as development, hardening, and production).

For example, Uber’s workloads had grown to over 4,000 microservices as of late 2019. To manage and operate complex applications like these, you need advanced observability, which demands dedicated configurations for scraping, dashboarding, and alerting for each application. Not only do you have to create these configurations, but you also have to apply them to each environment, which is often done manually and ad hoc every time something changes.

The problem: Managing all of these configurations for both Prometheus and Grafana takes a huge manual effort.

Solution: Leverage GitOps to Stay in Control

Instead of applying configurations ad hoc, you can take a “GitOps” approach, where a Git repository holds all configurations, as well as documentation and code, and an operator component applies them automatically to the systems being managed, such as Prometheus, Grafana, or even a Kubernetes cluster. Instead of making direct changes to the Prometheus configuration or Grafana dashboards, you first commit every change to the Git repository, and the operator then synchronizes it to Prometheus, Grafana, or other tools. The centralized Git repository remains the single source of truth.

Among the many benefits of the GitOps approach is that all configurations are versioned and every change leaves an audit trail, so you can identify when and why each change happened. If a change turns out to be problematic, you can roll it back easily. With Git as the central repository, the workflow also aligns with developers, who already base their workflows on Git. You can even promote a configuration from one stage to the next using pull requests, a concept that has already proven successful in development processes.

The figure below shows a Git repository and an operator added as an intermediate layer to manage all configuration files. The operator must hold the logic and permissions to apply the configuration to the underlying systems.

Figure: Manually applied configurations vs the GitOps approach.
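The operator's core decision, working out which files in Git differ from what is currently applied, can be sketched in a few lines of Python. This is a simplified illustration, not the logic of any particular GitOps tool: `plan_sync`, `SyncPlan`, and the file paths are hypothetical names, and a real operator would also need credentials and API calls to apply the plan.

```python
import hashlib
from dataclasses import dataclass, field


def digest(content: str) -> str:
    """Content hash used to compare desired and applied configuration."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class SyncPlan:
    create: list = field(default_factory=list)
    update: list = field(default_factory=list)
    delete: list = field(default_factory=list)

    @property
    def in_sync(self) -> bool:
        return not (self.create or self.update or self.delete)


def plan_sync(desired: dict, applied: dict) -> SyncPlan:
    """Compare the Git checkout with the state last applied to the target.

    desired: path -> file content at the current Git commit
    applied: path -> digest of the content last applied to the target system
    """
    plan = SyncPlan()
    for path, content in sorted(desired.items()):
        if path not in applied:
            plan.create.append(path)
        elif applied[path] != digest(content):
            plan.update.append(path)
    # Anything applied but no longer in Git was removed by a commit.
    plan.delete = sorted(set(applied) - set(desired))
    return plan


if __name__ == "__main__":
    desired = {"prometheus/rules.yaml": "v2", "grafana/cart.json": "{}"}
    applied = {"prometheus/rules.yaml": digest("v1")}
    print(plan_sync(desired, applied))
```

Because the plan is derived only from the repository content and the recorded state, reverting a commit in Git produces a plan that rolls the target systems back as well.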

Challenge 2: Manual Creation of Configurations and Dashboards

Setting up a GitOps single source of truth that is version controlled and holds all configurations as code is a first step. But there are still a lot of manual configurations to deal with.

Learning and writing PromQL, the Prometheus query language, is not a trivial task, and it is only one piece of the bigger picture. Besides PromQL queries, you need Grafana dashboard configurations (written in JSON) to get a comprehensive overview of your applications. You also need alerting rules (written in YAML) in Prometheus to be alerted to production issues. You may need an engineer or two just for writing PromQL and alerting rules, which requires different skills than configuring dashboards in Grafana.

The problem: You need a team of engineers knowledgeable in different configuration languages to write and maintain all the manual configurations.

Solution: Code Generation Empowers Scaling

Code generation to the rescue! Instead of manually writing queries and rules for Prometheus and its Alertmanager, as well as dashboard configurations for Grafana, you can use code generators to reduce the manual work.

One great example is generating Prometheus alerting and recording rules based on SRE concepts such as the Golden Signals, the RED method, or the USE method, which are widely considered to cover the most useful and critical metrics. Another example is generating Grafana dashboards (for examples, see uber/grafana-dash-gen, metalmatze/slo-libsonnet, and prometheus-operator/kube-prometheus on GitHub, and Scripted Dashboards on the Grafana Labs website).
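To make the idea concrete, here is a minimal sketch of such a generator in Python. It emits a Prometheus rule group for the RED method (rate, errors, duration) for a single service. The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `service` and `status` labels, and the thresholds are assumptions about how the application is instrumented; they are not prescribed by Prometheus or by the projects listed above.

```python
import json


def red_rules(service: str, error_ratio: float = 0.05,
              latency_seconds: float = 0.5) -> dict:
    """Generate a Prometheus rule group covering rate, errors, and duration."""
    selector = f'service="{service}"'
    rate = f"sum(rate(http_requests_total{{{selector}}}[5m]))"
    errors = f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))'
    p95 = (
        "histogram_quantile(0.95, "
        f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])) by (le))"
    )
    return {
        "groups": [
            {
                "name": f"{service}-red",
                "rules": [
                    # Rate: precompute the request rate as a recording rule.
                    {
                        "record": "service:http_requests:rate5m",
                        "expr": rate,
                        "labels": {"service": service},
                    },
                    # Errors: share of requests answered with a 5xx status.
                    {
                        "alert": "HighErrorRate",
                        "expr": f"{errors} / {rate} > {error_ratio}",
                        "for": "10m",
                        "labels": {"severity": "page", "service": service},
                        "annotations": {
                            "summary": f"{service}: error ratio above {error_ratio}"
                        },
                    },
                    # Duration: 95th percentile latency from a histogram.
                    {
                        "alert": "HighLatency",
                        "expr": f"{p95} > {latency_seconds}",
                        "for": "10m",
                        "labels": {"severity": "ticket", "service": service},
                        "annotations": {
                            "summary": f"{service}: p95 latency above {latency_seconds}s"
                        },
                    },
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(red_rules("cart"), indent=2))
```

Calling `red_rules` once per service yields consistent rules for the whole fleet, and changing a threshold becomes a one-line change to the generator call instead of an edit in every rule file.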

Bottom line: Using code generators speeds up configuration efforts. The generated files are stored in the Git repository to reap all the benefits I discussed earlier. The image below compares manual configuration with code-generated configuration and shows how the latter approach does the heavy lifting and reduces the chance of user error.

Figure: Manually writing configurations vs using code generators.

Challenge 3: Configurations Drift Out of Sync 

Once you start using code generators, you end up with lots of auto-generated configuration files. Those configurations, stored in the Git repository, are independent of each other. There is no control mechanism to base them on the same input files; in fact, that might not even be possible, since code generators might rely on different kinds of inputs.

For example, changing the input for code generator 1 produces a result that is now out of sync with the output of code generator 2 or 3, because there is no synchronization mechanism between the generated files. To mitigate this, a change to one input could trigger the execution of all generators, but the actual problem is that the input for each generated file is in a different format, since the code generators are independent solutions. Only a few solutions tackle this, such as prometheus-operator/kube-prometheus.
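The mitigation mentioned above, re-running every generator whenever any input changes, can be sketched by fingerprinting all generator inputs together and recording that fingerprint for each generated file. The functions and file names below are hypothetical. The sketch only detects stale outputs; it does nothing about the underlying problem that each generator expects its input in a different format.

```python
import hashlib
import json


def inputs_fingerprint(inputs: dict) -> str:
    """Hash all generator inputs together (path -> file content)."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def stale_outputs(manifest: dict, inputs: dict) -> list:
    """Return generated files that were built from an older set of inputs.

    manifest: output path -> fingerprint recorded when the file was generated
    """
    current = inputs_fingerprint(inputs)
    return sorted(path for path, seen in manifest.items() if seen != current)


if __name__ == "__main__":
    inputs = {"generator1/input.yaml": "a", "generator2/input.json": "b"}
    manifest = {
        "rules.yaml": inputs_fingerprint(inputs),
        "dashboard.json": "outdated-fingerprint",
    }
    print(stale_outputs(manifest, inputs))
```

A check like this could run in the pipeline that validates the Git repository, failing the build when any generated file lags behind the inputs.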

The problem: Manual work is required to bring a desired change into each input format and to eventually create a new generation of configuration files.

Solution: Use Abstraction to Foster Reuse and Keep Generated Files in Sync

Abstraction in software engineering fosters reuse, and this same concept can help overcome the challenge of configuration files drifting out of sync. Introducing an intermediate language to cover common SRE concepts can help provide a mutual understanding and technical foundation to build upon.

The image below shows how introducing an intermediate language, such as Jsonnet or a language you define yourself, allows you to define common concepts once and generate specific configuration files for different platforms like Prometheus and Grafana. Using this higher-level language enables you to abstract away implementation details. The language you use must be able to express all concepts that are prevalent in the Prometheus monitoring domain.

In recent years, a consensus has emerged to focus on terminology and concepts that stem from the SRE community. A mature concept to build upon is the service-level objective (SLO), which lets you define objectives for each microservice. Putting SLOs into machine- and human-readable code (using YAML files) allows you to generate the configuration for multiple tools and keep all configurations consistent with the defined service-level objective. This reduces complexity and makes it easier to operate and scale your Prometheus environments.
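A rough sketch of this idea in Python: one SLO definition is the single input, and both a Prometheus alerting rule and a Grafana panel are derived from it, so they cannot drift apart. The `Slo` class and the metric `http_requests_total` with its `service` and `status` labels are hypothetical. The panel dictionary loosely follows Grafana's dashboard JSON model, and the alert is a deliberately simple threshold on the short-term error ratio rather than a full error-budget burn-rate alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One service-level objective, the single source for all generated config."""
    service: str
    objective: float  # target success ratio, e.g. 0.999

    @property
    def error_budget(self) -> float:
        # Rounded to hide floating-point noise such as 0.0010000000000000009.
        return round(1.0 - self.objective, 6)

    def error_ratio_query(self, window: str = "5m") -> str:
        sel = f'service="{self.service}"'
        errors = f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[{window}]))'
        total = f"sum(rate(http_requests_total{{{sel}}}[{window}]))"
        return f"{errors} / {total}"


def to_alert_rule(slo: Slo) -> dict:
    """Prometheus alerting rule that fires when errors exceed the budget."""
    return {
        "alert": "SloErrorBudgetExceeded",
        "expr": f"{slo.error_ratio_query()} > {slo.error_budget}",
        "for": "10m",
        "labels": {"service": slo.service},
    }


def to_grafana_panel(slo: Slo) -> dict:
    """Grafana panel plotting the same query with the budget as a threshold."""
    return {
        "type": "timeseries",
        "title": f"{slo.service} error ratio (SLO {slo.objective:.2%})",
        "targets": [{"expr": slo.error_ratio_query(), "refId": "A"}],
        "fieldConfig": {"defaults": {"thresholds": {
            "mode": "absolute",
            "steps": [
                {"color": "green", "value": None},
                {"color": "red", "value": slo.error_budget},
            ],
        }}},
    }


if __name__ == "__main__":
    slo = Slo(service="cart", objective=0.999)
    print(to_alert_rule(slo)["expr"])
    print(to_grafana_panel(slo)["title"])
```

Changing `objective` in one place updates the alert threshold and the dashboard threshold together, which is the synchronization that independent generators lack.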

Figure: Compare and contrast the old approach of no abstraction with the new approach of SRE concept-based abstraction.

But this is all just half of the story! In part two, I will detail how Prometheus, when coupled with another open-source solution called Keptn, can deliver automated, advanced observability for your K8s environment more quickly.
