URL: https://thenewstack.io/fighting-incidents-with-end-to-end-event-driven-automation/

Fighting Incidents with End-to-End Event-Driven Automation - The New Stack



Fighting Incidents with End-to-End Event-Driven Automation

Struggling with high MTTR and too much toil? Employ a crawl, walk, run strategy toward automation for better, faster incident response.
Apr 13th, 2023 7:20am by Frank Emery
PagerDuty sponsored this post.

The volume of incidents today’s technical teams face is unprecedented, as is the pressure to perform. Companies want to protect revenue and customer experience. And customers across industries have high expectations for digital experiences. They want them fast, flawless and highly available, and have low tolerance for gaps in service. According to PwC, one in three customers would stop doing business with a brand they loved after one bad experience.

The teams tasked with keeping these services available are inundated by alert noise. Responders are unsure what information they need to resolve an incident and where that information is located. Finding this information, plus completing the same manual, repetitive tasks for each incident, wastes too much of their time.

To reduce mean time to resolution (MTTR) and keep customers and response teams happy, organizations need to leverage automation. But this isn’t a one-and-done ordeal, or something that can be accomplished and scaled within a single sprint. It’s a commitment to better incident-response practices, complete with challenges to overcome and stages of the journey.

PagerDuty is the global leader in AI-first operations management serving more than 35,000 organizations worldwide. The PagerDuty Operations Cloud is a comprehensive, multi-product operations cloud platform that sits at the center of the enterprise technology stack.

Challenges We Hear from Our Customers about Automation

From our time working with customers, from small startups to Fortune 100 companies, to help drive better incident-response best practices, we’ve heard the most common challenges of adopting automation.

Here are the top three:

Too busy firefighting: When incidents are coming in fast, all teams can feel like they’re being pulled into crisis mode. They can’t get ahead of the issues fast enough to complete their assigned work, much less tackle initiatives to improve incident response.

No buy-in: Leaders across industries are looking at how to be the most competitive on the market and how to do so with as little cost as possible. Long initiatives like crafting automation can be seen as a distraction if they don’t have tangible benefits to an organization’s bottom line.

Can’t scale: Some organizations are working toward deploying automation but are reaching a stumbling block. They can’t scale. Some teams have detailed auto-remediations built for their services. Others are still stuck doing manual work. There’s no standardization.

When these challenges are at play within an organization, it may be time to employ a crawl, walk, run approach to creating and deploying automation.

How to Employ a Crawl, Walk, Run Approach to Automation

The first step is to determine who is part of the team and at what level you plan to execute. One of the best ways to get an organization to buy in to automation is to start with a small pilot team automating some low-hanging fruit that improves day-to-day work for a specific team, group or service. Share that automation with other teams and see adoption spread. This will drive interest in building more automation, helping a grassroots initiative succeed. And with better MTTR, proven results and less customer impact, you’re more likely to get executive buy-in as well.

Crawl

If the event stream is too overwhelming for your team, start at the source and stem the flow. Crawling toward better incident response automation starts with two things: suppression and pausing transient alerts. Compared to other forms of automation, these are relatively easy to execute. Plus, they immediately help responders gain back time and reduce alert fatigue.

Suppression is used to stop an incident from sending a notification to a responder for an event that’s known to have little to no value. According to AIOps customer data, 50% of noise compression comes from suppression. Suppression can reduce incident volumes via broad rules targeting those events that the team never needs to know about.

For example, a developer team at PagerDuty suppresses events until a certain number of them have arrived, at which point they turn suppression off and allow Event Orchestration to start creating incidents.
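The threshold pattern in that example can be sketched in a few lines of Python. This is an illustration only, not PagerDuty's Event Orchestration rule syntax: the `ThresholdSuppressor` class, its `threshold` parameter and the `"disk-latency"` key are all assumptions made for the sketch.

```python
from collections import defaultdict


class ThresholdSuppressor:
    """Suppress events per key until `threshold` of them have arrived.

    Hypothetical helper for illustration; not a PagerDuty API.
    """

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: dict[str, int] = defaultdict(int)

    def should_create_incident(self, event_key: str) -> bool:
        """Count the event and report whether it should pass through."""
        self.counts[event_key] += 1
        # Events below the threshold are suppressed; the rest pass through.
        return self.counts[event_key] >= self.threshold


suppressor = ThresholdSuppressor(threshold=3)
decisions = [suppressor.should_create_incident("disk-latency") for _ in range(4)]
print(decisions)  # [False, False, True, True]
```

A production rule would usually also reset the counter after a time window so that old events do not count toward the threshold forever; that is left out here to keep the sketch short.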

Pausing notifications allows users to suspend the creation of an incident for a predefined period. Once that time period lapses, the incident will be created normally. This automation is best used for flagging incidents with clearly defined conditions. An example of this could be a company that pauses certain high CPU usage incidents for five minutes, only creating an incident if the high CPU usage turns out to be sustained.
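The pause logic can be sketched as follows, assuming a monitoring check that reports periodically whether the condition is still firing. The `PausedAlerts` class, the `"web-1:cpu"` key and the explicit `now` timestamps (in seconds) are hypothetical and exist only to make the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class PausedAlerts:
    """Hold alerts for `pause_seconds` and escalate only those still firing."""

    pause_seconds: float
    first_seen: dict[str, float] = field(default_factory=dict)

    def observe(self, key: str, firing: bool, now: float) -> bool:
        """Record one check result; return True when an incident should be created."""
        if not firing:
            # The condition cleared, so the transient alert is dropped.
            self.first_seen.pop(key, None)
            return False
        # Remember when this condition first started firing.
        started = self.first_seen.setdefault(key, now)
        return now - started >= self.pause_seconds


pause = PausedAlerts(pause_seconds=300)
print(pause.observe("web-1:cpu", firing=True, now=0))    # False: pause just started
print(pause.observe("web-1:cpu", firing=True, now=120))  # False: still inside the pause
print(pause.observe("web-1:cpu", firing=True, now=300))  # True: high CPU lasted five minutes
```

If the check had reported `firing=False` at any point inside the window, the timer would have been cleared and no incident created, which is the behavior that filters out transient spikes.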

Walk

Once you’ve decreased the noise in your environment and your teams are getting fewer incidents, it’s time to make those incidents easier to resolve with the proper data. You can do this by enriching events, alerts and incidents.

Event enrichment allows you to speed up triage by ensuring responders have incidents populated with relevant contextual information. Teams can normalize event data so incidents look the same across an organization. This is especially helpful for network operation centers (NOCs) or other L1 response teams who want consistency across the events that come in and don’t have the time to learn the nuances of the hundreds of teams that they support.
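As a rough illustration of normalization, the sketch below maps tool-specific field names onto one shared schema. The alias table and the three target fields (`summary`, `host`, `severity`) are assumptions chosen for the example, not a schema taken from PagerDuty or any particular monitoring tool.

```python
# Hypothetical aliases: each target field lists the names different tools might use.
FIELD_ALIASES = {
    "summary": ("summary", "title", "message"),
    "host": ("host", "hostname", "instance"),
    "severity": ("severity", "level", "sev"),
}


def normalize_event(raw: dict) -> dict:
    """Map tool-specific keys onto one shared schema."""
    event = {}
    for target, aliases in FIELD_ALIASES.items():
        # Take the first alias that is present; fall back to "unknown".
        event[target] = next((raw[a] for a in aliases if a in raw), "unknown")
    # Use one casing for severity so downstream rules can compare values directly.
    event["severity"] = str(event["severity"]).lower()
    return event


print(normalize_event({"title": "Disk full", "hostname": "db-2", "level": "CRITICAL"}))
# {'summary': 'Disk full', 'host': 'db-2', 'severity': 'critical'}
```

With every event in the same shape, an L1 team can triage on `summary`, `host` and `severity` without knowing which tool emitted the event.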

Alert enrichment goes a layer deeper. Once the event officially becomes an alert, responders can define the severity with which an alert should be created. This ensures that notifications are routed to the correct escalation policy, saving time during response.

For the alerts that are grouped into an incident, incident enrichment allows users to define the priority and notes that an incident has when it is initially created. This means that you’re more certain when an incident is a P1, and all hands need to be on deck, versus a P4, which you don’t need to interrupt your dinner for. It’s a quality-of-life improvement for anyone on call. Notes are also useful for populating knowledge-base articles, internal wikis or providing information on how a responder should proceed.
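One way to picture incident enrichment is a small rule table consulted when the incident is created. The `Incident` fields, the `RULES` entries and the note text below are all hypothetical; a real setup would express the same idea in whatever rule format the incident platform provides.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    service: str
    severity: str
    priority: str = "P4"
    note: str = ""


# Hypothetical rules: (service, severity) -> (priority, note for the responder).
RULES = {
    ("checkout", "critical"): ("P1", "Customer-facing outage: all hands on deck."),
    ("checkout", "warning"): ("P3", "Check recent deploys before escalating."),
}


def enrich_incident(incident: Incident) -> Incident:
    """Set priority and an initial note from the first matching rule."""
    priority, note = RULES.get(
        (incident.service, incident.severity),
        ("P4", "No rule matched; handle during business hours."),
    )
    incident.priority = priority
    incident.note = note
    return incident


page = enrich_incident(Incident("Payments failing", "checkout", "critical"))
print(page.priority, "-", page.note)  # P1 - Customer-facing outage: all hands on deck.
```

Defaulting unmatched incidents to the lowest priority is a design choice made for the sketch; some teams prefer the opposite default so that unknown failures are never under-prioritized.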

Run

The last step of this journey is auto-remediation. Incidents resolve themselves with automation as the L0 responder. No humans are required to respond. One way to achieve this is with webhooks that can be triggered on incident creation. Or you can call in other forms of automation, whether that’s through PagerDuty or another vendor. While some organizations can arrive at this level of sophistication on their own, this automation is difficult to build, and scaling it across an organization can pose many challenges. In fact, this is one of the top reasons why people turn to PagerDuty. Partnership during this phase can help take some of the strain off individual teams that are responsible for developing their own automation or site reliability engineering teams that are responsible for creating it organizationwide.
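A webhook-triggered remediation might be structured like the sketch below: a handler receives the incident-created payload, looks up a registered action and falls back to a human when no automation applies or the automation fails. The payload shape (`type`, `host`), the `remediation` decorator and the `clear_temp_files` action are assumptions for illustration; they are not PagerDuty's webhook schema, and the HTTP layer is omitted.

```python
from typing import Callable

# Registry of automated actions, keyed by a hypothetical incident type.
REMEDIATIONS: dict[str, Callable[[str], str]] = {}


def remediation(incident_type: str):
    """Decorator that registers a function as the fix for one incident type."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        REMEDIATIONS[incident_type] = func
        return func
    return register


@remediation("disk_full")
def clear_temp_files(host: str) -> str:
    # Placeholder: a real action would run a vetted cleanup job on the host.
    return f"cleared temp files on {host}"


def handle_incident_created(payload: dict) -> dict:
    """Act as the L0 responder; escalate to a human if automation cannot help."""
    action = REMEDIATIONS.get(payload.get("type", ""))
    if action is None:
        return {"status": "escalated", "detail": "no automation registered"}
    try:
        detail = action(payload["host"])
    except Exception as exc:  # Any failure hands the incident to a person.
        return {"status": "escalated", "detail": f"automation failed: {exc}"}
    return {"status": "resolved", "detail": detail}


print(handle_incident_created({"type": "disk_full", "host": "db-2"}))
print(handle_incident_created({"type": "cert_expired", "host": "web-1"}))
```

Keeping the escalation path explicit matters: automation that fails silently leaves an incident with no responder at all, so every branch here either resolves the incident or hands it to a person.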

Looking to Automate on a Global Scale across Your Technical Ecosystem?

Whether you’re just starting the crawl stage of your automation journey or are already running with auto-remediation, PagerDuty AIOps can help you achieve fewer incidents with faster resolution. And our new feature, Global Event Orchestration, can help you create and scale automation across even the most complex technical ecosystems. For more information, you can take our product tour or register for our webinar.

Frank Emery is a principal product manager on the AIOps team at PagerDuty. He has a background in mathematics, machine learning and big data, and is focused on solving problems in the event automation space.
TNS owner Insight Partners is an investor in: Pragma.