
URL: https://thenewstack.io/cncf-projects-integration-production/

Why Prometheus couldn't see Cilium metrics at 2 a.m. - The New Stack



Why Prometheus couldn’t see Cilium metrics at 2 a.m.

Tame the Kubernetes "integration tax." Learn how to wire CNCF projects like Prometheus and Cilium for production-grade reliability.
May 10th, 2026 10:00am by Rishi Mondal
CNCF sponsored this post.

I still remember the first time we lost sleep over something that wasn’t a bug.

It was a Tuesday. Grafana dashboards showed blank panels for Cilium network metrics. Hubble was working fine: DNS visibility, TCP flows, and HTTP latency were all there in the Hubble UI. But the on-call engineer staring at Grafana at 2 a.m. couldn’t see any of it. The reason? Prometheus had no ServiceMonitors wired to Cilium’s agent and operator pods. Two Cloud Native Computing Foundation (CNCF) projects, both installed correctly, were completely invisible to each other.

This is what’s known as the integration tax. It’s the hidden cost of running multiple CNCF projects together in production, and it’s where most platform teams spend 80% of their time: not installing projects, not tuning them individually, but wiring them together so they actually talk to each other.
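The missing wiring from the 2 a.m. incident can be sketched as a small generator for a ServiceMonitor, the Prometheus Operator resource that tells Prometheus which Services to scrape. This is a minimal illustration, not the configuration from the incident: the `k8s-app: cilium` selector, the port name `metrics`, the `release` label, and the function name are all assumptions that would need to match your own Cilium and Prometheus installation.

```python
import json


def cilium_service_monitor(namespace="kube-system", release_label="kube-prometheus"):
    """Build a ServiceMonitor manifest that points Prometheus at the Cilium agent."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": "cilium-agent",
            "namespace": namespace,
            # Assumption: your Prometheus selects ServiceMonitors by this label.
            "labels": {"release": release_label},
        },
        "spec": {
            # Assumption: the agent's metrics Service carries this label.
            "selector": {"matchLabels": {"k8s-app": "cilium"}},
            "namespaceSelector": {"matchNames": [namespace]},
            # Assumption: the Service exposes a port named "metrics".
            "endpoints": [
                {"port": "metrics", "interval": "30s", "honorLabels": True}
            ],
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(cilium_service_monitor(), indent=2))
```

A second, near-identical manifest would be needed for the operator pods, which is exactly the kind of detail that is easy to miss when each project is installed on its own.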


Every team builds the same stack. Every team breaks it differently.

The CNCF landscape has about 250 projects. In practice, most production Kubernetes platforms settle on the same core stack of 20–30 cloud native tools. Prometheus for monitoring. ArgoCD for GitOps. Cilium for networking. cert-manager for TLS. Velero for backups. Sealed Secrets for credentials. Kyverno for policy. You install them. You write values files. 

Wiring happens. Then the failures begin, and they’re never in any single project’s issue tracker.

Where CNCF projects collide

cert-manager versus ingress controllers. We ran into this issue across three cloud providers. cert-manager’s HTTP-01 ACME challenge expects to serve a token over plain HTTP. But if your ingress controller enforces a global HTTP-to-HTTPS redirect (which it should, for security), every ACME validation request gets redirected before reaching cert-manager’s solver pod. Certificate renewals fail silently. You find out when customers see expired TLS warnings in their browsers. The fix? DNS-01 challenges via Route 53, Cloud DNS, or Azure DNS. But that requires cloud-specific IAM scoping that no Helm chart configures by default. You discover these limitations only after the incident.
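As a rough sketch of the DNS-01 fix, the snippet below builds a cert-manager ClusterIssuer that uses the Route 53 solver instead of HTTP-01. The ACME server (Let's Encrypt), the secret name, and the function name are assumptions for illustration; the article does not say which CA or naming was used. The IAM permissions that let cert-manager write TXT records are deliberately not shown, because they are the cloud-specific part that has to be scoped per account.

```python
import json

LETSENCRYPT_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"
LETSENCRYPT_PROD = "https://acme-v02.api.letsencrypt.org/directory"


def dns01_cluster_issuer(email, region, hosted_zone_id, staging=True):
    """Build a ClusterIssuer that validates domains via Route 53 TXT records."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": "acme-dns01"},
        "spec": {
            "acme": {
                "server": LETSENCRYPT_STAGING if staging else LETSENCRYPT_PROD,
                "email": email,
                # Hypothetical secret name for the ACME account key.
                "privateKeySecretRef": {"name": "acme-dns01-account-key"},
                "solvers": [
                    {
                        "dns01": {
                            # No HTTP request is involved, so an ingress-level
                            # HTTP-to-HTTPS redirect cannot break validation.
                            "route53": {
                                "region": region,
                                "hostedZoneID": hosted_zone_id,
                            }
                        }
                    }
                ],
            }
        },
    }


if __name__ == "__main__":
    issuer = dns01_cluster_issuer("ops@example.com", "eu-west-1", "Z0000000EXAMPLE")
    print(json.dumps(issuer, indent=2))
```

Cloud DNS and Azure DNS would each need their own solver block, with different fields, in place of `route53`.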

Prometheus versus kubelet. Here’s one that took us weeks to diagnose. kubelet exposes metrics on four scrape paths. Two of them, /metrics and /metrics/probes, both emit process_start_time_seconds with identical timestamps because they’re the same process. Prometheus dutifully scrapes both, sees duplicate samples, and fires PrometheusDuplicateTimestamps. The alert is noisy. The root cause is invisible without reading the kubelet source. And the fix is a Jsonnet relabeling rule that drops an entire scrape endpoint.

None of these are bugs. Every project works exactly as documented. The failures live in the gaps.
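The kubelet fix described above was written in Jsonnet; the sketch below shows the same idea in Python, as a hedged illustration only. It builds the endpoint list for a kubelet ServiceMonitor and leaves out one path entirely. Which endpoint the author dropped is not stated, so dropping `/metrics/probes` is an assumption, as are the port name `https-metrics` and the function name.

```python
# The four metrics paths served by the kubelet.
KUBELET_PATHS = ["/metrics", "/metrics/cadvisor", "/metrics/probes", "/metrics/resource"]


def kubelet_endpoints(dropped_paths=("/metrics/probes",), port="https-metrics"):
    """Return ServiceMonitor endpoints for every kubelet path except the dropped ones."""
    endpoints = []
    for path in KUBELET_PATHS:
        if path in dropped_paths:
            # Skipping the endpoint removes the second copy of
            # process_start_time_seconds along with everything else on it.
            continue
        endpoints.append(
            {
                "port": port,  # Assumption: name of the kubelet Service port.
                "scheme": "https",
                "path": path,
                "honorLabels": True,
            }
        )
    return endpoints


if __name__ == "__main__":
    for endpoint in kubelet_endpoints():
        print(endpoint["path"])
```

Dropping a whole endpoint is a blunt instrument. If the probe metrics matter to you, a narrower alternative is a metric relabeling rule with `action: drop` that matches only `process_start_time_seconds` on one of the two paths.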


Cluster API gave us one workflow for four clouds

Before Cluster API (CAPI), provisioning clusters meant choosing a cloud vendor’s CLI. eksctl for AWS. gcloud container clusters create for GCP. az aks create for Azure. Each had its own lifecycle model, upgrade path, and disaster recovery story. You weren’t just locked into a cloud; you were locked into its opinions about managing Kubernetes.

CAPI changed the game. Your cluster is now a set of Kubernetes-native resources (Cluster, MachineDeployment, MachinePool), and a cloud-specific provider translates them into infrastructure. We run CAPA on AWS, CAPG on GCP, CAPZ on Azure, and CAPH on Hetzner bare metal. The bootstrap sequence is identical everywhere: k3d management cluster → deploy provider → create workload cluster → clusterctl move to make it self-managing.
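The bootstrap sequence above can be written down as a dry-run plan. The sketch below only builds and prints the commands; it does not execute anything. The cluster names, version string, and kubeconfig path are hypothetical, and real runs need provider credentials plus a few steps omitted here (applying the generated manifest with kubectl, and initializing the provider on the workload cluster before the move).

```python
def bootstrap_steps(provider, cluster_name, k8s_version, workload_kubeconfig):
    """Return the provider-independent bootstrap sequence as argv lists."""
    return [
        # 1. Throwaway local management cluster.
        ["k3d", "cluster", "create", "capi-mgmt"],
        # 2. Install Cluster API plus the infrastructure provider.
        ["clusterctl", "init", "--infrastructure", provider],
        # 3. Render the workload cluster manifest (written to stdout).
        ["clusterctl", "generate", "cluster", cluster_name,
         "--kubernetes-version", k8s_version],
        # 4. Pivot the CAPI resources so the workload cluster manages itself.
        ["clusterctl", "move", "--to-kubeconfig", workload_kubeconfig],
    ]


if __name__ == "__main__":
    # Only the provider name changes between clouds.
    for provider in ("aws", "gcp", "azure", "hetzner"):
        print(f"# {provider}")
        for argv in bootstrap_steps(provider, "demo", "v1.31.1", "./demo.kubeconfig"):
            print(" ".join(argv))
```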

But here’s where the real value emerges: Day-2 operations. A Kubernetes version upgrade becomes a one-line change to a MachineDeployment. CAPI handles cordon, drain, and rolling replacement. A MachineHealthCheck automatically removes unhealthy nodes. Disaster recovery means recreating a management cluster, restoring Velero backups from cloud storage, and letting CAPI resources reconcile. The entire cluster rebuilds itself from the Git state. This is where Cluster API—like the rest of the CNCF stack—reveals whether your integration work actually holds together under pressure.
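To make the "one-line change" concrete, here is a minimal sketch that bumps the Kubernetes version on a MachineDeployment and then lists which fields differ. The manifest is trimmed to the relevant fields, and the names and version numbers are made up for illustration.

```python
import copy


def bump_kubernetes_version(machine_deployment, new_version):
    """Return a copy of the MachineDeployment with only the version changed."""
    updated = copy.deepcopy(machine_deployment)
    updated["spec"]["template"]["spec"]["version"] = new_version
    return updated


def changed_paths(old, new, prefix=""):
    """List dotted paths whose values differ between two nested dicts."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(set(old) | set(new)):
            child = f"{prefix}.{key}" if prefix else key
            paths += changed_paths(old.get(key), new.get(key), child)
        return paths
    return [] if old == new else [prefix]


if __name__ == "__main__":
    md = {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "MachineDeployment",
        "metadata": {"name": "demo-workers"},  # hypothetical name
        "spec": {
            "clusterName": "demo",
            "replicas": 3,
            "template": {"spec": {"clusterName": "demo", "version": "v1.30.4"}},
        },
    }
    upgraded = bump_kubernetes_version(md, "v1.31.1")
    print(changed_paths(md, upgraded))
```

The diff is a single field in Git; the cordon, drain, and rolling replacement happen in the controllers once that change is applied.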

The architecture that finally stopped the bleeding

After years of firefighting integration failures across clouds, we landed on a pattern that finally made things sustainable: a two-repo GitOps split. This approach applies whether you’re using commercial platforms or assembling your own stack from open source projects.

Platform repo: 100+ Helm charts with production-tested defaults. Cilium NetworkPolicies baked into each chart. Prometheus ServiceMonitors pre-wired. cert-manager annotations are configured for the right challenge type. This configuration is shared across all clusters in all clouds.

Config repo: One per customer or environment. Only values that genuinely vary between clusters: domain names, node counts, GCP project IDs, AWS account roles, and Hetzner server types.

ArgoCD watches both. When we fix the Prometheus duplicate timestamps issue in the platform repo, that fix propagates to every cluster (AWS, GCP, Azure, bare metal) via a version bump. One pull request. No per-cluster tickets. No relying on someone to remember, “Oh, we need to update the relabeling rule on three different systems.” The integration logic lives in code.
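The two-repo split boils down to layering a small per-cluster file over shared defaults. The sketch below shows that layering as a recursive merge; the keys and values are invented for illustration and are not taken from the actual repositories. In practice Helm performs an equivalent merge when multiple values files are supplied.

```python
import copy


def deep_merge(defaults, overrides):
    """Overlay per-cluster overrides on platform defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Platform repo: shared, production-tested defaults (hypothetical keys).
    platform_defaults = {
        "prometheus": {"serviceMonitors": True, "retention": "15d"},
        "certManager": {"challenge": "dns01"},
        "cluster": {"domain": None, "nodeCount": 3},
    }
    # Config repo: only what genuinely varies for this cluster.
    cluster_config = {"cluster": {"domain": "prod.example.com", "nodeCount": 12}}

    print(deep_merge(platform_defaults, cluster_config))
```

Because the per-cluster file never restates the defaults, a fix made in the platform layer reaches every cluster on the next version bump.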

Hard-won lessons from production

Generate your monitoring, don’t assemble it. We use Jsonnet to produce the entire kube-prometheus stack from a single per-cluster vars file. Custom alerting mixins — Velero backup age, CloudNativePG replication lag, kubelet certificate expiry — live as Jsonnet libraries alongside upstream rules. A single build.sh produces everything. Reproducible. Diffable. Version-controlled. When Prometheus upgrades break your custom rules, the diff is immediate, and the fix is testable before it reaches production.
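The author's pipeline uses Jsonnet; the Python sketch below only illustrates the underlying idea of generating alert rules from a per-cluster vars file rather than hand-editing YAML. The vars keys, the alert name, and the threshold are hypothetical, and `velero_backup_last_successful_timestamp` is assumed to be the Velero metric exposed in your cluster.

```python
import json


def velero_backup_age_rule(cluster_vars):
    """Generate a PrometheusRule that fires when the last good backup is too old."""
    max_age_seconds = cluster_vars["velero_max_backup_age_hours"] * 3600
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "velero-backup-age", "namespace": "monitoring"},
        "spec": {
            "groups": [
                {
                    "name": "velero",
                    "rules": [
                        {
                            "alert": "VeleroBackupTooOld",
                            # Assumption: metric name as exported by Velero.
                            "expr": (
                                "time() - velero_backup_last_successful_timestamp"
                                f" > {max_age_seconds}"
                            ),
                            "for": "15m",
                            "labels": {
                                "severity": "critical",
                                "cluster": cluster_vars["cluster_name"],
                            },
                        }
                    ],
                }
            ]
        },
    }


if __name__ == "__main__":
    cluster_vars = {"cluster_name": "demo", "velero_max_backup_age_hours": 26}
    print(json.dumps(velero_backup_age_rule(cluster_vars), indent=2, sort_keys=True))
```

Writing the output with sorted keys keeps it stable from run to run, so a review diff shows only what actually changed.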

Embed NetworkPolicies in charts, not in post-deployment runbooks. We ship Cilium NetworkPolicy templates inside 20+ Helm charts. Each chart declares its own egress requirements: what external APIs it calls and what internal services it needs. Reverse-engineering network rules from Hubble flow logs after deployment is like writing tests after shipping. Your policies drift. Security becomes guesswork. Embedding them in charts means the policy lives where it’s maintained.
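A chart-embedded policy might be generated along these lines. This is a sketch under assumptions, not one of the shipped templates: the `app.kubernetes.io/name` label convention, the HTTPS-only external egress, and the kube-dns labels are choices made for the example. The DNS rule is included because FQDN-based egress rules rely on Cilium observing the pod's DNS lookups.

```python
import json


def chart_network_policy(app, external_fqdns, internal_apps):
    """Build a CiliumNetworkPolicy declaring one chart's egress requirements."""
    egress = [
        {
            # Allow DNS lookups and let Cilium inspect them for toFQDNs rules.
            "toEndpoints": [
                {
                    "matchLabels": {
                        "k8s:io.kubernetes.pod.namespace": "kube-system",
                        "k8s:k8s-app": "kube-dns",
                    }
                }
            ],
            "toPorts": [
                {
                    "ports": [{"port": "53", "protocol": "ANY"}],
                    "rules": {"dns": [{"matchPattern": "*"}]},
                }
            ],
        }
    ]
    if external_fqdns:
        egress.append(
            {
                "toFQDNs": [{"matchName": fqdn} for fqdn in external_fqdns],
                "toPorts": [{"ports": [{"port": "443", "protocol": "TCP"}]}],
            }
        )
    for target in internal_apps:
        egress.append(
            {"toEndpoints": [{"matchLabels": {"app.kubernetes.io/name": target}}]}
        )
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{app}-egress"},
        "spec": {
            "endpointSelector": {"matchLabels": {"app.kubernetes.io/name": app}},
            "egress": egress,
        },
    }


if __name__ == "__main__":
    policy = chart_network_policy("billing", ["api.example.com"], ["postgres"])
    print(json.dumps(policy, indent=2))
```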

Automate disaster recovery at bootstrap. Our provisioning creates cloud storage buckets (S3, GCS, Azure Blob) for Velero backups during initial cluster setup — not as a follow-up task that lives in a Jira ticket for six months. If you can run the bootstrap, you can recover from total cluster loss. Disaster recovery stops being a hope and becomes a testable reality.

Encrypt secrets, then commit them. Every credential — deploy keys, cloud IAM, and TLS certs — gets encrypted with Sealed Secrets before it touches Git. The decryption key gets backed up to cloud storage. Your Git repository becomes a complete, auditable record of every cluster’s state, including secrets. Drift detection works. Recovery is one pull request and one clusterctl move away.
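One way to back this rule with automation is a check that refuses plain Secret manifests before they reach Git. The sketch below is a hypothetical guard, not part of Sealed Secrets itself; it assumes the manifests have already been parsed into dictionaries and that encryption is done separately with the `kubeseal` CLI.

```python
def plaintext_secrets(manifests):
    """Return the names of manifests that are plain (unencrypted) Secrets."""
    return [
        manifest.get("metadata", {}).get("name", "<unnamed>")
        for manifest in manifests
        if manifest.get("kind") == "Secret"
    ]


def check_commit(manifests):
    """Raise if any plain Secret is present; SealedSecret resources pass."""
    offenders = plaintext_secrets(manifests)
    if offenders:
        raise ValueError(f"unencrypted Secret manifests: {', '.join(offenders)}")


if __name__ == "__main__":
    staged = [
        {
            "apiVersion": "bitnami.com/v1alpha1",
            "kind": "SealedSecret",
            "metadata": {"name": "deploy-key"},
        },
        {"apiVersion": "v1", "kind": "Secret", "metadata": {"name": "tls-cert"}},
    ]
    print(plaintext_secrets(staged))
```

Wired into a pre-commit hook or CI job, a check like this turns "encrypt, then commit" from a convention into something enforced.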

Let machines enforce policy. Kyverno blocks deployments that are missing resource limits. Kubescape continuously scans CIS benchmarks and feeds violations into Prometheus alerts. Combined with Cilium network segmentation, your security posture becomes something auditors verify from Git history and live cluster state — not from a spreadsheet last updated two quarters ago.
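The kind of rule Kyverno enforces here can be illustrated with a few lines of plain Python. This is not Kyverno's policy language or its implementation, only a sketch of the check it performs at admission time: every container must declare limits for the required resources. The function name and the required resource list are assumptions.

```python
def containers_missing_limits(pod_spec, required=("cpu", "memory")):
    """Map each container name to the resource limits it fails to declare."""
    missing = {}
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        absent = [resource for resource in required if resource not in limits]
        if absent:
            missing[container["name"]] = absent
    return missing


if __name__ == "__main__":
    pod_spec = {
        "containers": [
            {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
            {"name": "sidecar", "resources": {"limits": {"cpu": "100m"}}},
        ]
    }
    # A non-empty result is what an enforcing policy would reject.
    print(containers_missing_limits(pod_spec))
```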

The compounding cost

The integration tax isn’t a one-time fee. Every Kubernetes version bump, every Helm chart upgrade, and every new CNCF project introduces new integration surfaces. If your monitoring is hand-crafted YAML, upgrading kube-prometheus from v0.13 to v0.17 means manually diffing hundreds of generated files. If it’s Jsonnet, it’s a one-line change. Without that kind of automation, the debt compounds.

The CNCF ecosystem is extraordinarily powerful. But power without integration is just a list of Helm installs. The work that actually matters—drift detection, coordinated updates, and disaster recovery automation—happens in the wiring. That’s where your platform either survives its second year or just becomes a collection of tools you stop trusting.

Readers interested in exploring the framework discussed in this article further can find the project’s source code and detailed documentation on the KubeAid repository and the KubeAid website.

Rishi Mondal is an SRE at Obmondo, where he builds and maintains KubeAid, an open-source Kubernetes management platform integrating 25+ CNCF projects. He's a CNCF KubeStellar Maintainer, Docker Captain, Linux Foundation Mentor, and GSoC Mentor at CNCF. He was featured...