For today’s digital-first organizations, software problems often become business problems. As companies’ revenue and customer experience increasingly move online, incidents and disruptions — and their associated downtime — will have a bigger impact on revenue, customer satisfaction and employee productivity.
Many IT disruptions are reasonably well understood, both in how to triage them and in how to remediate them, even when the fix is only temporary. Diagnosing alerts from noisy services usually begins with the same steps. “Fix it for right now” remediation steps are also often the same, involving simple service reboots and failovers.
These repetitive actions are good candidates for applying automation to allow faster response, avoid interrupting subject matter experts (SMEs), decrease errors and increase productivity.
IT operators must resolve severe outages as quickly as possible, which is why they track metrics such as mean time to resolution (MTTR) and error budgets. In these cases, service restoration is the highest priority, regardless of whose work is disrupted.
Once you meet service level objectives (SLOs), driving IT support efficiency becomes a concern. All the less-severe incidents, IT events and monitoring alerts can drive up support costs and pull senior engineers away from their primary work, reducing the velocity of new features. Unfortunately, the situation in many organizations is far from ideal. Research reveals that a fifth of organizations suffer a “high impact” (equaling a 25% or greater loss of productivity) from being interrupted by unplanned work stemming from IT incidents and outages. For 47% of organizations, the impact is “significant,” meaning a 10%–25% productivity loss.
Much of this toil can be traced back to operators who lack the knowledge or access to fix problems on their own and must escalate to senior engineers for resolution. Many first responders in operations centers are unfamiliar with the many systems an enterprise runs, and they likely lack the skills to diagnose and remediate an issue unless clear instructions are available, such as in a runbook. They also may not have the requisite access privileges to run tests or make changes to production, whether because of lower skill levels or because companies need to keep their environments locked down for compliance reasons.
Often, these responders are left drowning in signals and alerts, unable to filter out the noise from huge data volumes and unable to do anything other than escalate for help. As a result, senior engineers are called to help even with basic triage tasks simply because they have access privileges to the impacted systems. These interruptions can consume hours each week, distracting engineers from development projects. Incidents end up involving far too many engineers, who spend their time on basic things like running tests to show their code is not causing the problem.
Automating predictable, repeating steps in incidents can reduce needless escalations to experts, empower first responders to take more actions and (ideally) eliminate calling any humans at all. Consider a typical incident response workflow:
[Figure: Typical incident response workflow]
Employing AIOps to detect problems from alerts and label incidents is a major way to increase speed and efficiency. You won’t need responders staring at glass to find problems; AIOps can filter through a lot of repeated noise and false alerts to find real problems that need action. With AIOps in charge of triggering your incident workflows, you can automate tasks through resolution, closure and even the final fix by developers.
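To make the noise-filtering idea concrete, here is a minimal sketch of one small piece of it: collapsing repeated alerts into a single incident. Real AIOps tooling does far more (correlation across services, learned baselines and so on), so treat this as a simplified stand-in. The `Alert` fields, the `group_alerts` function and the five-minute window are all assumptions made for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # hypothetical field: which service raised the alert
    check: str        # hypothetical field: which check fired
    timestamp: float  # seconds since some reference point


def group_alerts(alerts, window_seconds=300.0):
    """Collapse repeated alerts for the same (service, check) into incidents.

    An alert joins the open incident for its key if it arrives within
    window_seconds of that incident's most recent alert; otherwise it
    opens a new incident.
    """
    incidents = []  # each incident is a list of alerts
    latest = {}     # (service, check) -> index of the most recent incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        idx = latest.get(key)
        if idx is not None and alert.timestamp - incidents[idx][-1].timestamp <= window_seconds:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            latest[key] = len(incidents) - 1
    return incidents


# Illustrative data only: four raw alerts become three incidents.
raw_alerts = [
    Alert("api", "latency", 0.0),
    Alert("api", "latency", 60.0),
    Alert("db", "disk", 100.0),
    Alert("api", "latency", 1000.0),
]
incidents = group_alerts(raw_alerts)
```

In this sketch the two `api` latency alerts one minute apart become one incident, while the one that arrives much later opens a new incident, so a workflow would be triggered three times rather than four.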
The diagram above shows there are many opportunities to improve incident response with automation. But where should you start?
It’s a balance between your confidence in the automation, the value or cost of the incident and how frequently the task occurs. Common incidents with proven automated steps for diagnosis and remediation are good candidates for AIOps to trigger. From there, follow a similar process to prioritize your incident response.
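One hedged way to picture that balance is a simple score that multiplies the three factors. This is only a sketch: the multiplicative formula, the `Candidate` fields and every number in the example are assumptions chosen for illustration, and a real prioritization would likely weigh risk and effort as well.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    confidence: float           # 0.0-1.0: how proven the automation is (assumed scale)
    cost_per_incident: float    # estimated cost of handling one incident manually
    incidents_per_month: float  # how often the task occurs


def priority_score(candidate):
    """Rough expected monthly value of automating this task."""
    return (
        candidate.confidence
        * candidate.cost_per_incident
        * candidate.incidents_per_month
    )


def rank(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=priority_score, reverse=True)


# Made-up numbers, purely to show the mechanics.
candidates = [
    Candidate("restart web service", 0.9, 200.0, 30.0),
    Candidate("database failover", 0.5, 5000.0, 1.0),
    Candidate("collect disk diagnostics", 0.95, 50.0, 80.0),
]
ranked_names = [c.name for c in rank(candidates)]
```

With these invented inputs, the frequent and well-proven service restart ranks first, and the rare, less-proven failover ranks last even though each failover incident costs more.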
Automate diagnosis and remediation steps for serious outages to speed resolution. Then focus on increasing efficiency by automating recurring diagnostics and remediation actions that occur across many kinds of incidents. You can safely automate and trigger lower-risk actions such as read-only diagnostic pulls with AIOps, giving downstream personnel the information they need, even when they are paged.
You can automate common remediation actions and make them available to responders. This automation can use secrets management tools such as Vault to enable privileged actions in production environments without sharing credentials, making it safer to delegate to responders. When the likely cause of an incident is obvious, and the remediation automation is proven, you can have AIOps trigger the remediation to enable self-healing without needing to call any responders.
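The tiers described above (read-only diagnostics that run automatically, proven remediation that responders can run, and self-healing when the cause is obvious) can be sketched as a small decision gate. The function name, the `cause_confidence` input and the 0.9 threshold are assumptions for illustration; how you measure "obvious" and "proven" is up to your own tooling.

```python
from enum import Enum


class Action(Enum):
    AUTO_RUN = "auto_run"                      # trigger without calling anyone
    OFFER_TO_RESPONDER = "offer_to_responder"  # one-click action for a human
    ESCALATE = "escalate"                      # hand off to a subject matter expert


def decide(read_only, proven, cause_confidence, auto_threshold=0.9):
    """Pick how an automated step should be handled.

    read_only: the step only gathers diagnostics and changes nothing.
    proven: the automation has a track record of working.
    cause_confidence: 0.0-1.0 estimate that the likely cause is correct
        (a hypothetical signal, for example from alert classification).
    """
    if read_only:
        # Low-risk diagnostic pulls are safe to trigger automatically.
        return Action.AUTO_RUN
    if proven and cause_confidence >= auto_threshold:
        # Obvious cause plus proven remediation: self-healing.
        return Action.AUTO_RUN
    if proven:
        # Proven remediation, unclear cause: let the responder decide.
        return Action.OFFER_TO_RESPONDER
    # Unproven remediation still needs an expert.
    return Action.ESCALATE


diagnostics = decide(read_only=True, proven=False, cause_confidence=0.2)
self_heal = decide(read_only=False, proven=True, cause_confidence=0.95)
delegated = decide(read_only=False, proven=True, cause_confidence=0.6)
novel = decide(read_only=False, proven=False, cause_confidence=0.95)
```

Keeping the gate explicit like this makes it easy to start conservatively and lower the threshold only as confidence in a given remediation grows.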
What you choose to automate first comes with an opportunity cost. So finding the tasks that can generate the biggest financial impact is your path to success.
Here are five key design principles that will help organizations automate incident remediation to dramatically reduce worker toil, free talent for innovation and optimize how they resolve incidents.
Automation isn’t a panacea for incident response. The idea is to let machines take on manual and repetitive tasks where possible. When incidents are complex or novel, humans need to get involved. Even in cases where SMEs are required to step in, automated processes can speed up their work by proactively gathering the detailed diagnostic data they need to determine root causes and the right remediation steps.
In a digital-first world, automation should be on top of the to-do list for every IT function.