Today’s digital-first organizations need to create superb experiences for their customers — or risk irrelevance. Ideally, this requires resolving any operational issues before the end user has realized there’s something wrong. However, for most organizations, it’s not that easy. Digital operations teams are drowning in a tsunami of events. Existing tooling is unable to cope; manual processes and multiple point solutions translate into interruptions and escalations for overburdened responders.
Not only does this drive up mean time to resolve (MTTR) and hurt customer loyalty, but it can also damage team morale and increase burnout rates. Organizations need their best and brightest to work on high-value innovation projects. They don’t want engineers to stop their work every five minutes for interruptions that could be automated. The sheer volume of events, from dozens if not hundreds of sources, makes the problem worse: filtering and contextualizing these alerts to know where to take action is a daunting task.
PagerDuty is seeing a 70% increase in event volumes year over year across its customer base. Why? The increase can be partly attributed to the surge in remote work and a corresponding proliferation of systems. Teams are also building more sophisticated models: to get ahead of incidents, they examine their data more often. Observability systems that watch multiple signals generate a deluge of metrics, which in turn generates more events. However, this surge in data brings some inevitable problems:
Solving the issues above is where event orchestration can help. Event orchestration routes each event toward the most appropriate set of actions. PagerDuty’s event orchestration functionality, for example, analyzes and enriches events, applies logic to them and acts on them automatically as they occur, in real time and within microseconds. This enables our customers to take all the events coming in from 650+ integrations and apply logic and automation to determine what should be done with each one (the next best action) at machine speed.
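To make the routing idea concrete, here is a minimal Python sketch of first-match rule evaluation, assuming each event is a plain dictionary and each rule pairs a condition with a destination, a priority and some enrichment fields. The names `Rule` and `orchestrate`, the event fields and the destinations are all hypothetical; this is not PagerDuty’s Event Orchestration API or its event payload format.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Event = dict[str, Any]  # hypothetical event shape: a flat dictionary of fields


@dataclass
class Rule:
    """Hypothetical routing rule: a predicate plus the actions to apply on a match."""
    name: str
    condition: Callable[[Event], bool]
    route_to: str                # hypothetical destination, e.g. an on-call service
    priority: str = "P3"
    annotations: dict[str, str] = field(default_factory=dict)


def orchestrate(event: Event, rules: list[Rule], default_route: str = "triage") -> Event:
    """Evaluate rules in order; the first matching rule decides routing and enrichment."""
    for rule in rules:
        if rule.condition(event):
            # Rule annotations are merged last, adding context such as a runbook name.
            return {
                **event,
                "route_to": rule.route_to,
                "priority": rule.priority,
                "matched_rule": rule.name,
                **rule.annotations,
            }
    # No rule matched: send the event to a catch-all route instead of dropping it.
    return {**event, "route_to": default_route, "matched_rule": None}


rules = [
    Rule("db-critical",
         lambda e: e.get("source") == "postgres" and e.get("severity") == "critical",
         route_to="database-oncall", priority="P1",
         annotations={"runbook": "restart-replica"}),
    Rule("disk-warning",
         lambda e: "disk" in e.get("summary", "").lower(),
         route_to="infra-queue", priority="P4"),
]

result = orchestrate({"source": "postgres", "severity": "critical", "summary": "Replica lag"}, rules)
print(result["route_to"], result["priority"])  # database-oncall P1
```

Evaluating rules in a fixed order and stopping at the first match keeps the outcome for any given event predictable, and an unmatched event still lands somewhere (here, a hypothetical `triage` route) instead of being lost.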
Because automation can be nested, one automated action can start a diagnostic process, learn more about the event and then use that information to decide what to do next. This allows organizations to take human processes and automate them. It also allows us to enrich events: creating context, removing machine jargon and making them actionable for humans, as required.
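Nested automation, where one action’s output drives the next decision, might look roughly like the following sketch. It assumes a stubbed diagnostic (`run_disk_diagnostic`) in place of a real script or runbook, a hypothetical lookup table that turns a raw check name into plain language, and made-up follow-up actions; none of these names come from PagerDuty.

```python
from typing import Any

Event = dict[str, Any]


def run_disk_diagnostic(event: Event) -> dict[str, Any]:
    """Hypothetical diagnostic step; a real one might run a script on the affected host."""
    # Stubbed: reuse the usage figure already on the event instead of querying the host.
    usage = event.get("disk_used_pct", 0)
    return {"disk_used_pct": usage, "log_dir_bloated": usage > 90}


def enrich(event: Event, diagnostics: dict[str, Any]) -> Event:
    """Attach diagnostic context and replace a raw check name with plain language."""
    friendly = {"chk_fs_pct_hi": "Disk almost full"}  # hypothetical jargon map
    check = event.get("check", "")
    return {**event, **diagnostics, "summary": friendly.get(check, check)}


def next_action(event: Event) -> str:
    """Pick the follow-up step from the enriched event (action names are hypothetical)."""
    if event.get("log_dir_bloated"):
        return "auto-remediate: rotate logs"
    if event.get("disk_used_pct", 0) > 80:
        return "notify: low-urgency ticket"
    return "suppress"


def handle(event: Event) -> tuple[Event, str]:
    # Nested automation: the diagnostic feeds enrichment, which feeds the next decision.
    diagnostics = run_disk_diagnostic(event)
    enriched = enrich(event, diagnostics)
    return enriched, next_action(enriched)


enriched, action = handle({"check": "chk_fs_pct_hi", "host": "web-3", "disk_used_pct": 95})
print(enriched["summary"], "->", action)  # Disk almost full -> auto-remediate: rotate logs
```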
There are two big wins from this. Either there’s a high-priority incident, for which the responder knows exactly what to do and where to start, or, ideally, there’s no incident at all. The event has been automatically resolved and developers can get on with their job without any interruptions. Perhaps they can even enjoy some well-earned rest.
There are three key innovations behind this engine:
Nesting rules like this has some interesting mathematical consequences. On the face of it, PagerDuty has built a more powerful rule engine that lets users nest rules and use advanced conditions. Behind the scenes, we’re allowing users to build “directed acyclic graphs” (DAGs) within their event ingestion pipeline.
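One way to picture nested rules as a directed acyclic graph is to treat each rule set as a node and each “jump to another rule set” as an edge. The sketch below, which assumes hypothetical `NestedRule` and rule-set structures rather than PagerDuty’s own configuration format, uses Python’s standard `graphlib.TopologicalSorter` (Python 3.9+) to reject configurations that contain a cycle, then walks the graph for a single event.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Optional

Event = dict[str, Any]


@dataclass
class NestedRule:
    """Hypothetical rule: a condition, fields to add, and an optional jump to another set."""
    condition: Callable[[Event], bool]
    annotate: dict[str, str] = field(default_factory=dict)
    next_set: Optional[str] = None  # None means stop after this rule


RuleSets = dict[str, list[NestedRule]]


def validate_acyclic(rule_sets: RuleSets) -> list[str]:
    """Return rule-set names in topological order; raises CycleError if jumps form a cycle."""
    # TopologicalSorter expects node -> predecessors, so a jump A -> B adds A to B's set.
    graph: dict[str, set[str]] = {name: set() for name in rule_sets}
    for name, rules in rule_sets.items():
        for rule in rules:
            if rule.next_set is not None:
                graph.setdefault(rule.next_set, set()).add(name)
    return list(TopologicalSorter(graph).static_order())


def evaluate(event: Event, rule_sets: RuleSets, start: str = "start") -> Event:
    """Follow edges from `start`; in each set the first matching rule annotates and jumps."""
    current: Optional[str] = start
    while current is not None:
        for rule in rule_sets[current]:
            if rule.condition(event):
                event = {**event, **rule.annotate}
                current = rule.next_set
                break
        else:
            current = None  # no rule in this set matched, so evaluation ends here
    return event


rule_sets: RuleSets = {
    "start": [NestedRule(lambda e: e["source"] == "k8s", {"team": "platform"}, next_set="k8s")],
    "k8s": [NestedRule(lambda e: "OOMKilled" in e["summary"], {"action": "raise-memory-limit"})],
}
print(validate_acyclic(rule_sets))  # ['start', 'k8s']
out = evaluate({"source": "k8s", "summary": "pod OOMKilled"}, rule_sets)
print(out["team"], out["action"])  # platform raise-memory-limit
```

Validating acyclicity up front matters for the walk in `evaluate`: without it, two rule sets that jump to each other could loop forever.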
What this really means is that customers are now able to build a finite state machine. These machines have some very useful properties. They can track the state of an incoming event and all the data associated with it. State machines can pick apart that data and turn it into a particular set of actions in a highly deterministic fashion. Thus, users can push an event into this high-tech “vending machine” and it will follow all the rules, processes and logic they have defined; with 100% certainty, users will know what they’ll get on the other side.
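The determinism described above can be illustrated with a small finite state machine in which the transition table has exactly one next state for every (state, input) pair and the input is derived from the event by a pure function. The states, input symbols and event fields (`signature`, `severity`, `remediation_ok`) are hypothetical, and the outcome of remediation is assumed to be already recorded on the event so the example stays self-contained.

```python
from typing import Any

Event = dict[str, Any]

# Hypothetical transition table: (state, input symbol) -> next state.
# Exactly one entry per pair is what makes the machine deterministic.
TRANSITIONS: dict[tuple[str, str], str] = {
    ("received", "known_issue"): "remediating",
    ("received", "noise"): "suppressed",
    ("received", "other"): "paging",
    ("remediating", "fixed"): "resolved",
    ("remediating", "not_fixed"): "paging",
}
TERMINAL = {"suppressed", "resolved", "paging"}


def classify(state: str, event: Event) -> str:
    """Map the event's data to one input symbol for the current state (a pure function)."""
    if state == "received":
        if event.get("signature") in {"disk-full", "stale-cache"}:
            return "known_issue"
        if event.get("severity") == "info":
            return "noise"
        return "other"
    if state == "remediating":
        return "fixed" if event.get("remediation_ok") else "not_fixed"
    raise ValueError(f"no input defined for state {state!r}")


def run(event: Event, state: str = "received") -> list[str]:
    """Drive the machine to a terminal state and return the path of states visited."""
    path = [state]
    while state not in TERMINAL:
        state = TRANSITIONS[(state, classify(state, event))]
        path.append(state)
    return path


e = {"signature": "disk-full", "remediation_ok": True}
print(run(e))  # ['received', 'remediating', 'resolved']
```

Because `classify` reads only the event and the current state, the same event always takes the same path, which is the property the “vending machine” analogy relies on.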
This could include outcomes like automated remediation or suppressing or enriching an event. The point is that determinism gives users precise control over what happens to events as they’re being ingested. It allows automation to be deployed precisely and with low risk. And this precision gives users the confidence to try more automation use cases, knowing exactly what will happen if they take a particular course of action.
So what’s the bottom line for event orchestration? It provides a set of tools that can be applied effectively across a variety of use cases. Think: automated diagnostics that flag alerts that didn’t auto-resolve, helping to speed up resolution, or suppression of non-urgent notifications that arrive outside responders’ working hours. Or how about automatically identifying noisy parts of the infrastructure? Or identifying known root causes and automatically informing teammates? Event orchestration can do all of the above.
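Two of these use cases, suppressing non-urgent notifications outside working hours and spotting noisy sources, reduce to simple checks. The sketch below assumes a hypothetical 09:00 to 17:00 weekday schedule, an `urgency` field on each event and a count-based noise threshold; a real setup would take schedules and thresholds from team configuration rather than constants.

```python
from collections import Counter
from collections.abc import Iterable
from datetime import datetime, time
from typing import Any

Event = dict[str, Any]

WORKING_HOURS = (time(9, 0), time(17, 0))  # hypothetical responder schedule, weekdays only


def should_suppress(event: Event, now: datetime) -> bool:
    """Hold low-urgency events that arrive outside working hours or on weekends."""
    start, end = WORKING_HOURS
    in_hours = now.weekday() < 5 and start <= now.time() < end  # Monday=0 ... Friday=4
    return event.get("urgency") == "low" and not in_hours


def noisy_sources(events: Iterable[Event], threshold: int) -> list[str]:
    """Return sources that produced at least `threshold` events in the given window."""
    counts = Counter(e.get("source", "unknown") for e in events)
    return sorted(src for src, n in counts.items() if n >= threshold)


late = datetime(2024, 3, 6, 22, 30)  # a Wednesday evening
print(should_suppress({"urgency": "low"}, late))   # True
print(should_suppress({"urgency": "high"}, late))  # False
print(noisy_sources([{"source": "lb-1"}] * 5 + [{"source": "db"}], threshold=3))  # ['lb-1']
```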
It’s all about helping to get rid of manual work, dealing with known issues and enabling responders to get to work faster. In today’s uncompromising digital ops environments, nothing less will do.