Is the term “observability” still productive? When it first entered common parlance around 2017, it served to distinguish the practice from traditional infrastructure and application monitoring. Back when applications were a bit more static, SREs would set up “monitors” on their critical applications and receive alerts when certain thresholds were crossed, e.g., an unwanted CPU spike, a customer-facing latency concern, or perhaps an entire service being knocked offline.
But what modern observability solutions are trying to unlock is the ability to ask real-time questions of our systems without having to define in advance what may go wrong. In a distributed world of microservices, it’s simply not possible to set up monitors on everything that matters; we need the ability to deal with unexpected situations and react accordingly.
Given the success of various observability solutions (Datadog’s IPO in 2019, Lightstep’s acquisition by ServiceNow in 2021), many startups and large enterprises alike co-opted the term observability, and it became a catch-all for any solution that offers some sort of insight into your digital business.
So I sat down with some of the most respected thought leaders in observability to ask the tough questions.
Just like teams don’t necessarily want more tools, they also don’t necessarily want new terms. Many companies still lack basic monitoring, leaving considerable room for most enterprises to improve their observability practices. So while it’s fun to think about carving out further distinctions, such as understandability and controllability, as practices and platforms continue to evolve, the group here seems to reach consensus that observability still holds as a term and practice.
There’s agreement that the cost of observability is a legitimate concern. Modern applications are becoming more complex at an unprecedented rate, driven by AI, and as a result, operational costs that scale alongside that complexity may become untenable for many businesses. That said, the nuance lies in whether the value scales alongside the cost. Too often, costs are increasing while the user experience is deteriorating. Different solutions to this problem are discussed in the roundtable, ranging from improved sampling to the new practice of Bring Your Own Cloud (BYOC), which enables customers to store their data in their own cloud while maintaining a SaaS experience.
OpenTelemetry (OTel) is the second most active open source project in the Cloud Native Computing Foundation (CNCF), behind only Kubernetes. Funnily enough, many Kubernetes users still prefer Prometheus, at least for metrics. So while the idea of open standards for observability telemetry has unanimous support within the group, OTel still has plenty of room to grow in execution, particularly in areas like logging and networking.
The group as a whole certainly wants to avoid the overpromises of AIOps back in 2017. That said, AI-driven development is moving faster than ever, and some on the panel go so far as to say that removing humans from the loop in IT management is a legitimate possibility. Others on the panel stick more to traditional messaging around the importance of socio-technical systems and keeping humans in the loop, even as machines take on more responsibility. I think it’s fair to say that humans aren’t going anywhere anytime soon when it comes to performance and reliability engineering… but leveraging AI solutions will certainly be part of any modern SRE toolkit.