In a meeting last year with a bunch of senior observability leaders from cloud native companies, I asked everyone to tell me their least favorite telemetry type: metrics, events, logs, traces or whatever. I was pretty confident the dominant answer would be logs. Nothing against logs, but I had recently heard this group express the hot take that “during an incident, if you’ve gone to the logs, you’ve already failed.”
I was wrong. To my surprise, they answered almost unanimously: events. Events were the most despised telemetry type. I followed up by asking, why do you dislike events so much? Again the answer was nearly unanimous: Lack of definition about what they are and how you can use them.
I get it. In researching events, I’ve found four or five different definitions, and no one seems to have nailed down the best way to use them in a troubleshooting workflow.
Since that meeting, our team has spent a lot of time thinking about events and how we can make them useful as a first-class telemetry citizen. The team did extensive research and then got to work building a function to track change events. Just recently, we announced the ability to ingest events in our observability platform.
I want to step back and explore why events are so critical and how they can help.
Change is the leading cause of errors. In a steady state, a system should continue to operate consistently for an indefinite period of time. Unfortunately, in a modern DevOps environment, our systems change dozens of times a day. We ship new code, we turn feature flags on and off, we deploy new infrastructure, we scale it up and down, and we even change observability solutions. And business doesn’t stand still either; it’s in constant flux based on the time of day, day of the week, season of the year, world events, competition and a million other factors we can’t track.
The only way to stay on top of change is to contextually link your systems so that when you get an alert, you can quickly see what occurred in the same time frame that might have introduced the breaking change. Each of those changes is what we call an event.
[Image: Observability UI showing an event alert]
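As a minimal sketch of that linking, assume every change and every alert carries a timestamp. Finding what might have introduced a breaking change then comes down to a time-window filter. The function name, the `(timestamp, description)` tuple shape, the 30-minute lookback and the sample data are all hypothetical, chosen for illustration rather than taken from any particular platform.

```python
from datetime import datetime, timedelta


def changes_near_alert(changes, alert_time, lookback=timedelta(minutes=30)):
    """Return changes recorded within `lookback` before the alert, newest first.

    `changes` is an iterable of (timestamp, description) tuples; this shape
    is an assumption made for the sketch, not a real ingestion format.
    """
    window_start = alert_time - lookback
    in_window = [c for c in changes if window_start <= c[0] <= alert_time]
    # Newest first: the most recent change is usually the first suspect.
    return sorted(in_window, key=lambda c: c[0], reverse=True)


# Hypothetical sample data: three changes and one alert on the same day.
alert_time = datetime(2024, 1, 15, 14, 30)
changes = [
    (datetime(2024, 1, 15, 9, 0), "deployed checkout v1.4.1"),
    (datetime(2024, 1, 15, 14, 10), "enabled new-pricing feature flag"),
    (datetime(2024, 1, 15, 14, 25), "deployed checkout v1.4.2"),
]

suspects = changes_near_alert(changes, alert_time)
for timestamp, description in suspects:
    print(timestamp.isoformat(), description)
```

The morning deploy falls outside the window, so only the two afternoon changes come back as candidates. A real system would likely also filter by service or environment, which this sketch leaves out.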
An event is a discrete change to a system, a workload or an observability platform. Here are some examples of events and how they might help you troubleshoot an issue:
[Image: System integrations that can create change events]
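To make the definition concrete, here is one way a single change event could be represented and serialized. This is a sketch under assumptions, not a real schema. The `ChangeKind` values mirror the kinds of change described earlier (code deploys, feature flags, infrastructure, scaling, observability configuration), and every class and field name is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ChangeKind(str, Enum):
    """Hypothetical categories of change, taken from the examples in this article."""
    DEPLOY = "deploy"
    FEATURE_FLAG = "feature_flag"
    INFRASTRUCTURE = "infrastructure"
    SCALING = "scaling"
    OBSERVABILITY_CONFIG = "observability_config"


@dataclass
class ChangeEvent:
    """One discrete change to a system, a workload or an observability platform."""
    kind: ChangeKind
    title: str
    timestamp: datetime
    labels: dict = field(default_factory=dict)  # free-form context, e.g. service name

    def to_json(self) -> str:
        payload = asdict(self)
        # Enum and datetime values are converted to plain strings for JSON.
        payload["kind"] = self.kind.value
        payload["timestamp"] = self.timestamp.isoformat()
        return json.dumps(payload, sort_keys=True)


event = ChangeEvent(
    kind=ChangeKind.FEATURE_FLAG,
    title="enabled new-pricing flag",
    timestamp=datetime(2024, 1, 15, 14, 10, tzinfo=timezone.utc),
    labels={"service": "checkout"},
)
encoded = event.to_json()
print(encoded)
```

Keeping the record small (what changed, when, and a few labels to join on) is a deliberate choice here: the labels are what would let an event be lined up against the metrics and traces for the same service.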
Like any observability signal, events cannot stand alone. They play an important role in the troubleshooting workflow alongside metrics, traces and logs. While metrics can tell you the symptom of a problem and are the primary driver in mean time to detect (MTTD) results, events can quickly tell you what changed. Alongside tracing, which will help you find the location of the problem, events help you remediate and stop the customer pain. From there, you might dig into the logs to start understanding why the problem happened so that you can get to the root cause and fix the underlying issue.
We call this workflow the three phases of observability: Know about an issue, triage it and then understand it, all while working toward remediation as quickly as possible.
[Image: Three phases of observability]
I originally called this piece “in defense of events,” and hopefully now you understand why and are open to giving them a chance. They complement and enhance your other telemetry types, making it faster to get critical context into your alerts.