
URL: https://thenewstack.io/how-to-automate-incident-management-with-code-and-get-better-results/

How to Automate Incident Management with Code and Get Better Results - The New Stack


2020-10-30 10:22:25

How to Automate Incident Management with Code and Get Better Results

Does it really take only four steps to achieve automated incident management with code? We say yes, and this is how.
Oct 30th, 2020 10:22am by Mike Mackrory
Feature image via Pixabay.
Torq sponsored this post. Insight Partners is an investor in Torq and TNS.
Mike Mackrory
Mike is a global citizen who has settled down in the Pacific Northwest, for now. By day he works as a Lead Engineer on a DevOps team, and by night, he writes and tinkers with other technology projects. When he's not tapping on the keys, he can be found hiking, fishing, and exploring both the urban and rural landscape with his kids. Always happy to help out another developer, he has a definite preference for helping those who bring gifts of gourmet donuts, craft beer, and single-malt Scotch.

When something goes wrong in your production environment, you want your best and brightest minds to start working on the problem as soon as possible. The best time for me to work on a production problem is when I’ve got my day planned, I’ve just finished my morning coffee, and my mind is primed and ready for action. Unfortunately, production incidents seldom occur at this time; usually, it’s more like 3:24 a.m. I know from experience that someone who stumbles out of bed and fumbles for their phone and laptop is not going to be in top form for at least ten minutes or so, if you’re lucky!

What if you could apply the skills of a prepared, focused engineer to problems that occur at any time of the day or night? This article discusses the evolutionary jump from a support model that depends on a sleepy human to one in which the system itself automatically handles triage and the initial response to an incident. You’ll find yourself with a more resilient system, and your engineers will be able to perform at their best and add more value for your customers. It’s a win-win situation.

The Next Evolution: Response-as-Code

As software development has evolved, the process of building and supporting applications has become more straightforward and organized. For example, with Infrastructure-as-Code (IaC), you describe infrastructure in machine-readable definition files and then check these files into the code repository alongside your source code, giving you a single source of truth for your application and the infrastructure you need to provide for its deployment.

A response-as-code plan is similar; you check in solutions and tools alongside your code, which can provide the foundation for automatically identifying and resolving problems without the need to involve an engineer. I will show you how to implement this plan below; but first, let me explain why you should consider doing it.

As someone who has supported production systems for many years, I’ve noticed a couple of things. The DevOps system is excellent for establishing ownership and producing a better product, but when engineers have their hands full developing new features and supporting systems, the risk of burnout and alert fatigue increases. Implementing a new plan will take time, and you will need to convince your team that it’s worth it, but the result will be less time spent troubleshooting common problems, along with a shorter mean time to detection (MTTD) and a shorter mean time to resolution (MTTR).

Get Started with Your Response-as-Code Plan

Your response-as-code plan will have a few critical components. You will begin by building some generic components that you can use across all of your projects, to identify and resolve common problems. You’ll also need project-specific components to accomplish the same thing for issues that are specific to each project. Then, you will connect all of these components for a comprehensive system that will automatically handle most problems.
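As a rough illustration of how generic and project-specific components might be connected, the sketch below keeps both kinds of response in one registry that lives in the repository alongside the service code. Every name in it (`ResponseRegistry`, `clear_temp_files`, `restart_worker_pool`, and the alert names) is hypothetical, and the response bodies are placeholders rather than real remediation logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# A response receives the alert payload and returns True if it resolved the problem.
Response = Callable[[dict], bool]


@dataclass
class ResponseRegistry:
    """Maps alert names to automated responses (hypothetical structure)."""
    generic: Dict[str, Response] = field(default_factory=dict)
    project: Dict[str, Response] = field(default_factory=dict)

    def lookup(self, alert_name: str) -> Optional[Response]:
        # A project-specific response takes precedence over a generic one.
        return self.project.get(alert_name) or self.generic.get(alert_name)


def clear_temp_files(alert: dict) -> bool:
    # Placeholder: a real response would free disk space on the affected host.
    return True


def restart_worker_pool(alert: dict) -> bool:
    # Placeholder: a real response would restart this project's workers.
    return True


registry = ResponseRegistry(
    generic={"disk_space_low": clear_temp_files},
    project={"queue_backlog": restart_worker_pool},
)
```

A lookup that returns `None` is the signal that no automated response exists yet and a human needs to be involved.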

Step 1: Begin with Your Playbook

Most teams that I’ve worked on have put together a compilation of scripts and solutions for specific problems. A playbook can take many forms, from a shared document to a complex knowledge base. If you don’t have a playbook yet, then you should gather knowledge from your team members and compile one.

Whatever form your playbook takes, it will enable you to identify some common production problems that your team faces. Begin by determining whether each problem is unique to one service or is a more generic problem across multiple services. For example, you might occasionally run into disk space issues or sudden spikes in traffic that cause a degradation in performance. Once you have identified a problem and determined how to detect it programmatically, you can design an automatic response to resolve it. Keep in mind that a programmatic reaction might not work in every situation, so you need an escalation path that involves an actual human when the problem breaches a certain threshold.
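To make the disk space example concrete, here is a minimal sketch of a programmatic check with two thresholds: one that triggers an automated response and a higher one that escalates to a human. The threshold values and function names are assumptions chosen for illustration; sensible values depend on your systems.

```python
import shutil

# Hypothetical thresholds, expressed as the fraction of the disk in use.
REMEDIATE_AT = 0.80
ESCALATE_AT = 0.95


def disk_used_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def decide(used_fraction: float) -> str:
    """Map a usage reading to 'ok', 'remediate', or 'escalate'."""
    if used_fraction >= ESCALATE_AT:
        return "escalate"  # automation is not enough, page a human
    if used_fraction >= REMEDIATE_AT:
        return "remediate"  # run the automated cleanup
    return "ok"
```

Calling `decide(disk_used_fraction("/"))` on a schedule gives you the detection half. The remediation and the paging integration are deliberately left out.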

One thing that I’ve found invaluable for implementing this step is to leverage your existing monitoring and application performance monitoring (APM) solutions. Many of these products allow you to set up alerts based on specific criteria. An alert can then call an API or a webhook that invokes a script to rectify the problem. In the past, I’ve had alerts invoke an AWS Lambda function to resolve infrastructure needs automatically.
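The webhook-to-script idea could look something like the following Lambda-style handler. This is only a sketch, not the author’s implementation. It assumes the alert arrives as a JSON string under `event["body"]` (as it would behind an HTTP proxy integration), and the payload fields `alert` and `host`, the `REMEDIATIONS` table, and `free_disk_space` are all hypothetical.

```python
import json


def free_disk_space(host: str) -> str:
    # Placeholder for a real remediation script.
    return f"cleaned temporary files on {host}"


# Hypothetical mapping from alert name to remediation function.
REMEDIATIONS = {"disk_space_low": free_disk_space}


def handler(event, context=None):
    """Lambda-style entry point; the event and payload shapes are assumptions."""
    body = json.loads(event.get("body") or "{}")
    alert, host = body.get("alert"), body.get("host")
    action = REMEDIATIONS.get(alert)
    if action is None or host is None:
        # No automated response is known, so signal that a human should look.
        return {"statusCode": 422, "body": json.dumps({"escalate": True})}
    return {"statusCode": 200, "body": json.dumps({"result": action(host)})}
```

Keeping the dispatch table separate from the handler means a new remediation is a one-line addition that can be reviewed like any other code change.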

Step 2: Identify and Build Patterns to Detect Problems

Once you’ve picked off some of the low-hanging fruit by solving common problems based on your playbook, it’s time to think bigger. Look across your organization and identify the core technology stack, then begin compiling a library of code solutions that can automatically detect common problems. You can also reference previous production problems, which will help you identify and resolve the same problems programmatically in the future.
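One entry in such a detection library might be a traffic-spike detector. The sketch below flags a spike when the latest sample exceeds a multiple of the recent average. The function name, window size, and factor are illustrative assumptions, and a production detector would likely need something more robust than a simple mean.

```python
from statistics import mean
from typing import Sequence


def is_traffic_spike(
    requests_per_min: Sequence[float],
    window: int = 5,
    factor: float = 3.0,
) -> bool:
    """True when the latest sample exceeds `factor` times the mean of the
    `window` samples that precede it."""
    if len(requests_per_min) < window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(requests_per_min[-(window + 1):-1])
    return baseline > 0 and requests_per_min[-1] > factor * baseline
```

For example, with a history of `[100, 110, 90, 105, 95]` the baseline is 100, so a next sample of 400 counts as a spike while 120 does not.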

At this point, it’s worth mentioning the work that StackPulse has been doing in this space. In their quest to make the tech world a more reliable place and provide resources for SREs and developers, they’ve already compiled standard playbooks for Redis, RabbitMQ, and other technologies.

Step 3: Build and Share Solutions

You can also begin compiling a library of potential solutions along with your collection of problem identification and troubleshooting tools. I mentioned an AWS Lambda function that I built to automatically resolve infrastructure problems under specific conditions. The pattern that I used in that solution could be applied to remediate many issues within AWS, and the logic could be ported over to other cloud and on-premises solutions as well.
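The article does not describe the author’s Lambda pattern in detail, so the following is a guess at one portable shape such a pattern could take: check health, remediate, verify, and escalate if the problem persists. The function name `respond` and its parameters are hypothetical. The callables are injected so that the same logic could wrap AWS, another cloud, or on-premises tooling.

```python
from typing import Callable


def respond(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    escalate: Callable[[str], None],
    max_attempts: int = 2,
) -> bool:
    """Try an automated fix up to `max_attempts` times, verifying after each
    attempt; escalate to a human if the system is still unhealthy."""
    for _ in range(max_attempts):
        if is_healthy():
            return True
        remediate()
    if is_healthy():
        return True
    escalate(f"still unhealthy after {max_attempts} remediation attempts")
    return False
```

Because verification is part of the loop, a remediation that silently fails still ends in an escalation rather than a false sense of safety.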

The greater potential of these first three steps will become more apparent when you begin to share what you’ve built with others and encourage them to participate. I’ve yet to meet an engineer who didn’t get excited about automating solutions and, more importantly, reducing the risk of an after-hours phone call to fix a problem.

Step 4: Keep the Ball Rolling and Continue Coding Defensively

Importantly, these steps aren’t a one-and-done solution. Implementing your plan will require constant awareness and maintenance as you add new features and technologies. You should strive to build a team and an organizational culture that invests in a robust response-as-code component for all new work moving forward. As I said above, automating responses to potential problems reduces the time that it takes to resolve production problems and saves wear and tear on your engineers.
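One way to keep that investment from eroding might be a small check, run in CI, that fails when a new alert is added without either an automated response or an explicit decision to handle it manually. The function and argument names below are hypothetical.

```python
from typing import Iterable, List


def uncovered_alerts(
    alert_names: Iterable[str],
    automated: Iterable[str],
    manual_only: Iterable[str] = (),
) -> List[str]:
    """Return alerts that have neither an automated response nor an explicit
    manual-only marker, sorted for stable output."""
    return sorted(set(alert_names) - set(automated) - set(manual_only))
```

A test such as `assert not uncovered_alerts(all_alerts, responses, manual_only)` then turns "remember to write a response" into something the build enforces.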

Moving Forward and Improving Continuously

As in the wider DevOps movement, your focus will be on building and establishing strong and resilient patterns for your teams to follow. You should be continuously looking for new ways to improve your process of designing, developing, and deploying software. A robust response-as-code plan will help you move your teams to the next level, and when you’ve mastered it, you’ll be ready for the next iteration of improvements and innovation.

And on the topic of improving continuously, it’s key to be aware of the modern incident response tooling that is becoming more readily available. You can learn more in StackPulse’s article “How the Incident Response Software Stack Has Evolved.”

If this is a topic that interests you, you should sign up for early access to the tools and community that StackPulse is building. You can sign up and learn more about what they have to offer here.
