Everyone talks about observability like it is a solved problem. Pick a stack, wire it up, done. In practice, every team I have seen attempt this discovers the same thing: observability becomes its own project. A week becomes a month. The stack itself needs babysitting. And at some point you realise half the job is keeping the monitoring alive rather than actually using it.
This is not a tooling problem. It is a platform problem. And until you treat it that way, you will keep having the same fights.
The Real Reasons It Takes So Long
Cardinality kills you before you notice
Prometheus is the default metrics choice for good reason. It is powerful, well supported, and the ecosystem around it is mature. It is also extremely sensitive to cardinality, and most teams find this out the hard way.
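All it takes is one bad label choice, such as a team emitting high-cardinality identifiers like request IDs or user IDs as metric labels, and you are tuning memory instead of building features. Every distinct combination of label values becomes its own time series, so an unbounded label means an unbounded number of series. The problem compounds because cardinality issues are invisible until they are not: everything looks fine until Prometheus OOMs at 2am.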
The fix is not a Prometheus configuration change. It is treating high-cardinality labels as a platform policy problem. If teams can emit arbitrary labels without guardrails, you will be fighting this forever. Dropping or aggregating labels at the collector layer before data hits Prometheus buys significant headroom on ingestion, and recording rules keep the queries you run most often cheap once the data is there. But this requires a platform decision, not a per-team one.
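To make the guardrail idea concrete, here is a minimal sketch of the kind of label policy a platform team could enforce before metrics reach storage. It is illustrative only: the `CardinalityGuard` class, the forbidden label names, and the per-label budget of 100 values are all assumptions of mine, not part of Prometheus or the OTel collector. In a real deployment this logic would live in the collector or a shared client library rather than in application code.

```python
from dataclasses import dataclass, field

# Hypothetical platform policy: label names that must never appear on metrics.
FORBIDDEN_LABELS = {"request_id", "user_id", "session_id", "trace_id"}
# Assumed budget: how many distinct values one label may take per metric.
MAX_VALUES_PER_LABEL = 100


@dataclass
class CardinalityGuard:
    """Tracks label values per metric and enforces the policy above."""

    # Maps (metric name, label name) to the set of values seen so far.
    seen: dict = field(default_factory=dict)

    def sanitize(self, metric: str, labels: dict) -> dict:
        clean = {}
        for name, value in labels.items():
            if name in FORBIDDEN_LABELS:
                # Drop identifiers outright; they belong in logs or traces.
                continue
            values = self.seen.setdefault((metric, name), set())
            if value not in values and len(values) >= MAX_VALUES_PER_LABEL:
                # Over budget: collapse new values into one bucket so the
                # number of series stays bounded.
                clean[name] = "other"
                continue
            values.add(value)
            clean[name] = value
        return clean


guard = CardinalityGuard()
print(guard.sanitize("http_requests_total", {"route": "/checkout", "user_id": "42"}))
```

The point of the sketch is the shape of the decision, not the numbers: the policy is defined once, centrally, and every team's metrics pass through it.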
Log volume costs spiral without intervention
The instinct is to keep everything. The reality is that keeping everything in hot storage is expensive, and teams only discover how expensive when the bill arrives.
The pattern that works is tiered retention with aggressive filtering at the collector level before anything hits storage. Drop debug logs in production at the OTel collector, not at query time. Most teams do it the wrong way around, paying for storage they never query and then making panicked retention cuts that they regret during the next incident.
Deciding what to keep, at what tier, for how long, is not a decision individual teams should be making independently. It is a platform decision with cost, compliance, and operational implications.
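As a rough illustration of what "filter first, then tier" means, the sketch below routes a log record to a retention tier or drops it before storage. The tier names, retention periods, and the `route_log` function are hypothetical placeholders; in practice this would be expressed in collector and Loki configuration, and the actual values are exactly the platform decision described above.

```python
from typing import Optional

# Hypothetical policy table: severity -> (storage tier, retention in days).
# The numbers are placeholders, not recommendations.
RETENTION_POLICY = {
    "info": ("cold", 30),
    "warn": ("hot", 14),
    "error": ("hot", 30),
}
# Assumed fallback for severities the policy does not list.
DEFAULT_POLICY = ("hot", 7)


def route_log(record: dict, environment: str = "production") -> Optional[dict]:
    """Return the record annotated with tier and retention, or None to drop it."""
    severity = str(record.get("severity", "info")).lower()
    if severity == "debug":
        if environment == "production":
            # Dropped before storage, so it is never paid for.
            return None
        # Outside production, keep debug logs briefly for development use.
        return {**record, "tier": "hot", "retention_days": 1}
    tier, days = RETENTION_POLICY.get(severity, DEFAULT_POLICY)
    return {**record, "tier": tier, "retention_days": days}


print(route_log({"severity": "DEBUG", "body": "cache miss"}))
print(route_log({"severity": "error", "body": "payment failed"}))
```

Keeping the policy in one table makes the cost and compliance trade-offs reviewable in a single place instead of scattered across team configs.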
The OTel collector YAML problem
OpenTelemetry is the right long term bet. The collector is genuinely powerful. The configuration is also deeply unpleasant until you have been through it a few times.
Receivers, processors, exporters, pipelines: the YAML surface area is large and the blast radius of a misconfiguration is real. The mistake most platform teams make is letting every team write their own collector configuration. You end up with as many collector configs as you have teams, each slightly different, each with its own quirks, none of them easy to maintain at scale.
Treat the collector config as a platform concern. Own the base configuration. Template it. Let teams extend it within defined guardrails. This is the difference between an observability platform and a collection of observability configurations that happen to exist in the same organisation.
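One way to picture "own the base, let teams extend within guardrails" is a small rendering step that merges team extensions into a platform-owned base config. The sketch below assumes the config is handled as a plain dictionary before being written out as YAML. `BASE_CONFIG`, the endpoint URL, the allowlist, and `render_team_config` are all hypothetical; the component names (`otlp`, `memory_limiter`, `batch`, `otlphttp`, `attributes`) are real collector components, but which ones you allow is your own call.

```python
import copy

# Platform-owned base config. The endpoint is a placeholder.
BASE_CONFIG = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {
        "memory_limiter": {"check_interval": "1s", "limit_mib": 512},
        "batch": {},
    },
    "exporters": {"otlphttp": {"endpoint": "https://telemetry.example.internal"}},
    "service": {
        "pipelines": {
            "logs": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlphttp"],
            }
        }
    },
}

# Assumed guardrail: the only processor types teams may add.
ALLOWED_TEAM_PROCESSORS = {"attributes", "filter", "resource"}


def render_team_config(team_processors: dict) -> dict:
    """Merge team processors into a copy of the base config."""
    config = copy.deepcopy(BASE_CONFIG)
    for name, settings in team_processors.items():
        # Collector component IDs have the form type[/name].
        kind = name.split("/", 1)[0]
        if kind not in ALLOWED_TEAM_PROCESSORS:
            raise ValueError(f"processor type {kind!r} is not allowed for teams")
        config["processors"][name] = settings
        for pipeline in config["service"]["pipelines"].values():
            # Insert before 'batch' so batching stays the final step.
            position = pipeline["processors"].index("batch")
            pipeline["processors"].insert(position, name)
    return config


team_a = {"attributes/team-a": {"actions": [{"key": "team", "value": "a", "action": "insert"}]}}
print(render_team_config(team_a)["service"]["pipelines"]["logs"]["processors"])
```

Teams never touch receivers, exporters, or the memory limiter here, so a mistake in a team extension has a much smaller blast radius than a hand-written config.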
Nobody owns the stack
The deeper problem underneath all of these is ownership. Observability stacks assembled from open source components do not own themselves. Somebody has to be responsible for the Prometheus upgrade, the Loki retention policy, the Grafana dashboard standards, the OTel collector base config.
In most organisations nobody explicitly owns this. It grows organically, maintained by whoever has time, which means it is maintained by nobody consistently. The result is a stack that works until it does not, and when it does not, the blame is diffuse and the fix is slow.
What a Sane Stack Looks Like in 2026
For teams starting fresh or looking to reduce the operational burden of what they already have, here is what I would recommend based on what I have seen work in practice.
Metrics: Consider VictoriaMetrics instead of vanilla Prometheus if cardinality is already causing problems. It is drop-in compatible with the Prometheus ecosystem, significantly more memory efficient at scale, and requires less tuning. For smaller deployments Prometheus is still fine; just go in with cardinality guardrails from day one.
Logs: Loki with aggressive retention policies and collector-level filtering. Accept that you will not keep everything and make that decision deliberately rather than reactively.
Traces: Tempo. It integrates cleanly with the rest of the Grafana ecosystem and the operational overhead is reasonable.
Frontend: Grafana as the unified query and visualisation layer across all three signals. The cross-signal correlation capability is where the real value is: being able to jump from a metric spike to the relevant logs to the relevant traces without switching tools.
Collector: OpenTelemetry Collector as the single ingestion layer. Own the base config at the platform level.
The Shift That Actually Fixes It
The teams I have seen get observability right treat it as a product they build and maintain for their engineering organisation, not a set of tools they install and hope for the best.
That means a dedicated owner or team. A roadmap. An adoption process that brings engineering teams onto the platform rather than letting everyone build their own. Standards for what good looks like. And a feedback loop with the teams consuming the platform so you know what is actually useful versus what is just noise on a dashboard nobody looks at.
Observability taking forever to set up is a symptom. The underlying condition is treating it as an infrastructure task rather than a platform capability. Fix the ownership model and the tooling problems become manageable.
I cover observability as a platform discipline in depth in The Comprehensive Guide to Platform Engineering, including reference architectures, retention strategy, and how to build the business case for treating observability as a first class platform investment.
