URL: https://thenewstack.io/a-simple-safe-path-for-automating-remediation/


A Simple, Safe Path for Automating Remediation

For remediation -- taking action to mitigate or fix an incident -- the idea of automation has traditionally been met with considerable resistance.
Sep 2nd, 2020 10:14am by Rachel Obstler
PagerDuty sponsored this post.

Rachel Obstler
Rachel is vice president of product at PagerDuty, where she is responsible for product direction, customer experience and pricing. She has over 10 years of experience in SaaS and over 15 years in product management. Rachel holds a B.S. from MIT and an MBA from Stanford University Graduate School of Business.

The COVID-19 pandemic has put IT incident response teams under a level of pressure they have never seen before, illustrated most pointedly by the recent surges in incidents. As shelter-in-place orders were given, some industries like retail, online learning and collaboration experienced intense strain on their digital operations, seeing incidents more than double compared to pre-pandemic levels. And while these teams are being asked to handle more time-critical work than ever, IT budgets are under severe scrutiny as some companies attempt to cope with contracting revenue streams.

One obvious response to this situation is to invest in automation. Automating repetitive and manual tasks gets them done faster, reduces labor costs, and in many cases provides a higher level of accuracy. However, when it comes to remediation — taking action to mitigate or fix an incident — the idea of automation has traditionally been met with considerable resistance.

Risk vs. Reward

IT organizations are notoriously risk-averse, and with good reason. One hour of downtime can cost an organization hundreds of thousands of dollars, and COVID-19 has raised the stakes even higher. Right now, about a third of Americans are working at home because of the pandemic. According to one study, 54% want to keep working at home even when the health crisis has passed. This means that if a collaboration system is down, employees can’t communicate by walking to a cubicle on the other side of the office. Their work simply grinds to a halt.

There’s more pressure on customer-facing applications as well. US e-commerce sales jumped 49% in April and many are speculating that the habit of online shopping is here to stay — even after COVID-19. For many businesses, this means that broken shopping carts or pages that fail to load are a disaster. And disaster means revenue lost. Costco’s website was down for a few hours during Thanksgiving and experts estimated the retailer lost $11 million in sales.

The Path to Automation

With so much at stake, is it safe to automate remediation? What if it works perfectly for 95% of the incidents, but makes things catastrophically worse for the other 5%? This is a fair question, but it’s based on a false premise. It assumes that automation is an all-or-nothing proposition — either fully manual or fully automated, where a machine does everything. In fact, automation often can and should be implemented in stages.

The safe path to automation has several steps. Each step in the automation evolution builds on the last, although some steps may be omitted in some circumstances.

  • Phase 1: Identify candidates for automation. The first step is determining the types of incidents that teams encounter on a repetitive basis. These are the incidents where automated remediation makes the most sense and delivers the highest rewards. Not everything will make the cut: stick to manual processes for situations where the investment doesn’t justify the outcome.
  • Phase 2: Human-initiated automation. The next step is automating runbooks and making those automated scripts available to all the individuals who might be notified about an issue. Then they can remediate the issue by simply pressing a button. Scripts can be refined as necessary. The goals here are to resolve repetitive incidents in exactly the same way every time, regardless of who may receive the alert or when it may occur, and verify that the scripts actually work.
  • Phase 3: Human and machine co-existence. In this step, a human is paged but the automation initiates simultaneously. The human function is to make sure that the automated remediation worked. For projects where the remaining human steps are particularly complex, this is often the final phase.
  • Phase 4: Machine-initiated with human fallback. In this step, remediation is automatically initiated upon incident detection, and a human is paged only if the automation fails to resolve the problem. Keep in mind, even if your automation is succeeding, it’s important to regularly observe its results so that you’re able to course-correct if necessary. You may also want to make an investment in your infrastructure to correct the root cause of the failure if it happens frequently enough; even a short downtime may be unacceptable to your business.
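
To make the difference between Phases 2, 3 and 4 concrete, here is a minimal Python sketch of how an incident handler might behave in each mode. It is an illustration under stated assumptions, not PagerDuty's API or any particular product's behavior: the names `Mode`, `Outcome`, `handle_incident`, `runbook`, `verify` and `page` are all hypothetical, and the runbook, health check and paging function are supplied by the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    """Hypothetical labels for Phases 2 to 4 described above."""
    HUMAN_INITIATED = 2   # responder is paged and runs the runbook on demand
    CO_EXISTENCE = 3      # responder is paged while the runbook runs
    MACHINE_FIRST = 4     # runbook runs first; responder is paged only on failure


@dataclass
class Outcome:
    paged: bool           # was a human notified?
    ran_automation: bool  # did this handler start the runbook itself?
    resolved: bool        # did the health check pass after the runbook?


def handle_incident(
    mode: Mode,
    runbook: Callable[[], None],   # assumed: the automated remediation script
    verify: Callable[[], bool],    # assumed: health check, True when recovered
    page: Callable[[str], None],   # assumed: sends a notification to a human
) -> Outcome:
    if mode is Mode.HUMAN_INITIATED:
        # Phase 2: the handler only notifies; a person presses the button.
        page("Incident detected; the runbook is available to run on demand.")
        return Outcome(paged=True, ran_automation=False, resolved=False)

    paged = False
    if mode is Mode.CO_EXISTENCE:
        # Phase 3: notify immediately so a person can confirm the result.
        page("Incident detected; automated remediation started, please confirm.")
        paged = True

    try:
        runbook()
        resolved = verify()
    except Exception:
        # A crashing runbook or health check counts as a failed remediation.
        resolved = False

    if mode is Mode.MACHINE_FIRST and not resolved:
        # Phase 4: a person is the fallback, paged only when automation fails.
        page("Automated remediation did not resolve the incident.")
        paged = True

    return Outcome(paged=paged, ran_automation=True, resolved=resolved)


if __name__ == "__main__":
    service = {"healthy": False}
    sent = []

    def restart_service() -> None:
        service["healthy"] = True  # stand-in for a real remediation step

    result = handle_incident(
        Mode.MACHINE_FIRST, restart_service, lambda: service["healthy"], sent.append
    )
    print(result, sent)
```

Because the mode is a single parameter, a team could move one incident type from Phase 2 to Phase 4 gradually, and move it back just as easily if the runbook proves unreliable.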

This approach to automation has several very important benefits. First, there are gains at every step in terms of time saved. It is absolutely not necessary to complete all the steps to obtain the benefits of automation or establish cost justification. Second, this is a very low-risk approach. The efficacy of the automated remediation gets tested again and again under production conditions, and human supervision is never entirely eliminated. Finally, this approach to automation can be implemented at a gradual pace, with no disruption to existing processes.

Making Automated Remediation Work Best For You

For all its benefits, some individuals resist the idea of automation, sometimes because it’s simply a new way of doing things and sometimes because they don’t trust it. For these reasons alone, making automated remediation successful always starts with people. Organizations need a plan developed by humans, along with human interactions throughout the journey. Automation can make it easier for remediation teams to cope, and it can be implemented without a lot of risk if companies take a step-by-step approach rather than trying to jump immediately from manual processes to full automation. That said, automation is going to become increasingly important over time, both for keeping the lights on and for avoiding employee burnout.

Feature image via Pixabay.
