URL: https://thenewstack.io/fast-focused-incident-response-reduce-system-noise-by-98/

Fast, Focused Incident Response: Reduce System Noise by 98%

With less time spent firefighting and more on innovating, AIOps can empower engineers and developers to drive bigger strategic gains for their organizations.
Dec 2nd, 2022 9:05am by Julia Nasser
PagerDuty sponsored this post.

Today’s organizations are stuck in a bind. Overwhelmingly, they want to embrace digital transformation to work more efficiently and deliver the experiences customers and employees crave. But the IT complexity this kind of project ushers in can stretch technical teams to the limit, leaving them exhausted and despondent.

This is where AIOps and automation can generate some big wins. However, it’s not always easy to know where and how value can be delivered, or which tools should be deployed.

Noise reduction is one area where digital Ops teams can start gaining some quick wins. Applying machine learning capabilities effectively to correlate alerts can help to suppress noise and dramatically enhance the ability of responders to get the job done quickly and efficiently.

A Noisy World

The basic goal of AIOps is to help developers and engineers easily discover and quickly resolve issues to minimize IT downtime. But they can’t do so when overwhelmed by a flood of alerts. The bottom line is that incident responders are drowning in information. Research shows that 69% of DevOps and ITOps teams are struggling with alert noise on a daily basis.

To help, organizations can turn to several tools and techniques. The best understood today is deduplication (“dedup”), which works on services with API integrations. It lets users group multiple events triggered by the same issue, using a dedup key.
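To make the dedup idea concrete, here is a minimal sketch in Python. It assumes each incoming event is a dictionary carrying a `dedup_key` and a `summary`; the function name and the sample events are hypothetical and only illustrate the grouping behavior, not any vendor's implementation.

```python
def deduplicate(events):
    """Collapse events that share a dedup key into one alert each."""
    alerts = {}  # dedup_key -> alert; dicts keep insertion order
    for event in events:
        key = event["dedup_key"]
        if key not in alerts:
            # First event with this key opens the alert.
            alerts[key] = {"dedup_key": key, "summary": event["summary"], "count": 0}
        # Every later event with the same key only bumps the counter.
        alerts[key]["count"] += 1
    return list(alerts.values())


# Hypothetical sample events: three about one disk issue, one about latency.
events = [
    {"dedup_key": "db-01/disk-full", "summary": "Disk usage at 91% on db-01"},
    {"dedup_key": "db-01/disk-full", "summary": "Disk usage at 93% on db-01"},
    {"dedup_key": "web-02/high-latency", "summary": "p99 latency above 2s on web-02"},
    {"dedup_key": "db-01/disk-full", "summary": "Disk usage at 95% on db-01"},
]
alerts = deduplicate(events)
for alert in alerts:
    print(alert["dedup_key"], alert["count"])
```

Four events become two alerts here, so a responder would be notified twice rather than four times.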

Then there’s suppression: front-of-pipe rules that suppress any nonactionable events. Service routing is another useful tool, ensuring that incoming events are actionable and mapped to services that each represent a specific area or application.
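A rough sketch of how suppression and routing could fit together at the front of the pipeline follows. The rule predicates, the `component` field and the service names are all assumptions made for illustration; real rule engines are configured rather than hand-coded like this.

```python
import re

# Hypothetical suppression rules: each returns True for a nonactionable event.
SUPPRESSION_RULES = [
    lambda e: e.get("severity") == "info",
    lambda e: re.search(r"\bheartbeat\b", e.get("summary", ""), re.IGNORECASE) is not None,
]

# Hypothetical routing table: event component -> owning service.
ROUTES = {"checkout": "Checkout Service", "login": "Login Service"}


def process(event):
    """Return None for suppressed events, else the event tagged with a service."""
    if any(rule(event) for rule in SUPPRESSION_RULES):
        return None  # suppressed before it can notify anyone
    service = ROUTES.get(event.get("component"), "Unrouted")
    return {**event, "service": service}


routed = process(
    {"severity": "critical", "summary": "Payment failures", "component": "checkout"}
)
print(routed["service"])
```

Events that land in the "Unrouted" bucket are worth reviewing, since they suggest a missing route or a source that should be suppressed.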

Perhaps the least understood area of noise suppression is the use of machine learning and heuristics to group multiple alerts into one incident. Taken together, these capabilities could reduce system noise by as much as 98%. Let’s take a closer look at how it works.

PagerDuty is the global leader in AI-first operations management serving more than 35,000 organizations worldwide. The PagerDuty Operations Cloud is a comprehensive, multi-product operations cloud platform that sits at the center of the enterprise technology stack.

Detecting and Pausing Transient Alerts

Transient alerts are frustrating. Responders are often forced to switch what they’re doing to undertake a review, only to find the alert soon auto-resolves via an integration. They may even have woken up in the middle of the night to take a look. Yet historical data can be a good predictor of transient alerts.

In line with this assumption, we designed a prediction model for transient alerts. It began with definitions and discovery — deciding what transient alerts are, and then creating a labeled data set with historical data to train and validate the model. Any alerts resolved via integration were assumed not to have required human action.
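The labeling step can be sketched as follows. This assumes each historical alert records which service it belonged to and how it was resolved; the field names (`service`, `resolved_by`) and the per-service rate are hypothetical. The rate is only a naive baseline for exploring labeled data, not the prediction model described here.

```python
from collections import defaultdict


def label_transient(alert):
    """Label an alert transient if an integration resolved it.

    Mirrors the assumption in the text: integration-resolved alerts
    are taken not to have required human action.
    """
    return alert["resolved_by"] == "integration"


def transient_rate_by_service(history):
    """Share of each service's past alerts that were labeled transient."""
    totals = defaultdict(int)
    transient = defaultdict(int)
    for alert in history:
        totals[alert["service"]] += 1
        if label_transient(alert):
            transient[alert["service"]] += 1
    return {service: transient[service] / totals[service] for service in totals}


# Hypothetical historical alerts.
history = [
    {"service": "checkout", "resolved_by": "integration"},
    {"service": "checkout", "resolved_by": "integration"},
    {"service": "checkout", "resolved_by": "user"},
    {"service": "login", "resolved_by": "user"},
]
rates = transient_rate_by_service(history)
print(rates)
```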

Next came phase two: testing the prediction model offline and online. This led to the development of two models — a prediction model and a real-time rolling-count algorithm that were run in A/B tests during the early-access program.
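The article does not spell out the rolling-count algorithm, so the following is only one plausible reading: count how often the same alert key fires within a sliding time window and treat it as flapping once a threshold is reached. The class name, window length and threshold are assumptions.

```python
from collections import defaultdict, deque


class RollingCounter:
    """Flag an alert key as flapping when it fires often within a time window."""

    def __init__(self, window_seconds=600, threshold=3):
        self.window = window_seconds
        self.threshold = threshold
        self.seen = defaultdict(deque)  # key -> timestamps still inside the window

    def is_flapping(self, key, timestamp):
        times = self.seen[key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


counter = RollingCounter(window_seconds=600, threshold=3)
results = [counter.is_flapping("db-01/cpu", t) for t in (0, 100, 200, 2000)]
print(results)
```

The third firing within ten minutes is flagged. The fourth arrives long after the window has passed, so the count starts over.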

Based on performance and accuracy, we chose a winner: the prediction model. It significantly outperformed the real-time rolling counts, recording a higher accuracy for 66% of services. This solution can help users to automatically eliminate unnecessary noise from flapping alerts. But it’s not the only way machine learning can help under-pressure incident responders.

Intelligently Grouping Alerts

Noise from duplicate or very similar alerts is arguably even more common than the issue of transient alerts. It means responders are pinged over and over for what is essentially the same issue. But it can be mitigated with capabilities that use machine learning to look for text similarities in incoming alert summaries.

The model then clusters similar alerts into the same incident. User feedback on grouping errors can also be ingested and learned from to improve future grouping.
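As a simplified stand-in for the machine learning described here, the sketch below groups alert summaries by word overlap (Jaccard similarity) using a greedy pass. The threshold and the sample summaries are assumptions, and a production system would use a learned model rather than this heuristic.

```python
def tokens(summary):
    """Lowercase a summary and split it into a set of words."""
    return set(summary.lower().split())


def jaccard(a, b):
    """Word overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_similarity(summaries, threshold=0.6):
    """Greedily place each summary in the first group it resembles."""
    groups = []  # each group is a list of indexes into `summaries`
    for i, summary in enumerate(summaries):
        for group in groups:
            representative = summaries[group[0]]
            if jaccard(tokens(summary), tokens(representative)) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # nothing similar enough: open a new group
    return groups


summaries = [
    "High CPU usage on web-01",
    "High CPU usage on web-02",
    "Disk full on db-01",
    "High CPU usage on web-03",
]
groups = group_by_similarity(summaries)
print(groups)
```

The three CPU alerts share four of six distinct words with the first one, so they land in one group, while the disk alert opens its own.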

Duplicate alert noise can also be reduced by analyzing when alerts arrive. Machine learning is used to assess the optimal cutoff point after which no more alerts can be added to a particular group. Again, this relies on historical data, specifically how far apart in time alerts tend to arrive for particular services.
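One simple way to picture a data-driven cutoff is to take a high percentile of historical gaps between related alerts and use it as the grouping window. This percentile heuristic is an assumption standing in for the machine learning mentioned above, and the sample numbers are made up for illustration.

```python
def estimate_cutoff(historical_gaps, percent=90):
    """Pick the gap (in seconds) that covers `percent` of past arrivals."""
    ordered = sorted(historical_gaps)
    # Integer arithmetic for ceil(percent * n / 100) - 1, clamped at 0.
    index = max(0, (percent * len(ordered) + 99) // 100 - 1)
    return ordered[index]


def group_by_time(timestamps, cutoff):
    """Add alerts to the current group until `cutoff` seconds after it opened."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][0] <= cutoff:
            groups[-1].append(ts)
        else:
            groups.append([ts])  # past the cutoff: start a new group
    return groups


# Hypothetical gaps (seconds) between related alerts for one service.
gaps = [5, 8, 12, 20, 30, 45, 60, 90, 120, 900]
cutoff = estimate_cutoff(gaps)
groups = group_by_time([0, 30, 100, 130, 400, 450], cutoff)
print(cutoff, groups)
```

The one very large gap (900 seconds) is ignored by the percentile, which keeps a single outlier from stretching the window for every group.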

Of course, such settings can also be applied manually, and in some cases, responders will have good insight into what works best. But the power of intelligent algorithms is to spot the data patterns that human eyes usually miss, helping to optimize things like alert compression.

This is how PagerDuty’s Intelligent Alert Grouping solution works. But organizations can supercharge their use of such tools further, with a few simple steps.

Because such tools work partly by analyzing text similarity, organizations should try to be as consistent as possible when naming service resources and entities. For example, a resource named “login database” in one alert and “login db” in another may not immediately be recognized as the same thing, which will decrease long-term accuracy. Human-readable names for service resources and entities can also help improve grouping accuracy.
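A small audit script can help spot inconsistent naming before it affects grouping. The sketch below uses Python's standard `difflib.SequenceMatcher` to flag resource names that look alike but are not identical; the function name, threshold and sample names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_inconsistent_names(names, threshold=0.7):
    """Return pairs of distinct names similar enough to be the same resource."""
    flagged = []
    for a, b in combinations(sorted(set(names)), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flagged.append((a, b))
    return flagged


# Hypothetical resource names collected from alert payloads.
names = ["login database", "login db", "payments api", "login database"]
flagged = find_inconsistent_names(names)
print(flagged)
```

Each flagged pair is a candidate for renaming to a single agreed name at the alert source.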

These suggestions are not an exhaustive list of AIOps capabilities in alert noise reduction, but they do hopefully illustrate the kinds of wins incident response teams can generate. It ultimately boils down to more productive, effective responders, fewer distractions and a better customer experience.

With less time spent firefighting and more on innovating, AIOps can empower engineers and developers to drive bigger strategic gains for their organizations.

In a digital era characterized by fierce competition, that is a compelling reason to take a look.

Julia Nasser is a senior product manager in AIOps at PagerDuty focused on the areas of noise reduction and change events. Her background includes working with enterprise and B2B companies to solve complex problems using analytics and data science. Julia...