
URL: https://thenewstack.io/better-incident-management-requires-more-than-just-data/

Better Incident Management Requires More than Just Data - The New Stack



Better Incident Management Requires More than Just Data

You should expect your company's number of incidents to increase as you begin to embrace a culture of failure and learning.
Sep 22nd, 2021 10:00am by Cole Potrocky
Cole Potrocky
Cole Potrocky is the co-founder and CTO of Kintaba and was a founding engineer on the Facebook Workplace team.

It’s 1900, and the cobra population is out of control in Delhi. The British Government, outsiders by any definition, brainstorm ways to deal with the issue. They stumble upon an obvious, if macabre, answer: pay people for each cobra head they bring to the crown. Everything goes swimmingly for a while: the cobra population is going down and people are getting paid well for facilitating the decline, perhaps too well. People start breeding cobras to collect a higher bounty. The cobra population explodes. The problem is now worse than ever.

The British Government got caught in a perverse incentive: they started rewarding people for making their problem worse. Perverse incentives are everywhere today, and they arise when a problem is poorly understood. One must ask: how do incentives distort the problem I’m trying to solve?

“There is a quality even meaner than outright ugliness or disorder, and this meaner quality is the dishonest mask of pretended order, achieved by ignoring or suppressing the real order that is struggling to exist and to be served.”

— Jane Jacobs, The Death and Life of Great American Cities

To the uninitiated, all complexity looks like chaos. Real order requires understanding. Real understanding requires context. I’ve seen teams all over the tech world abuse data and metrics because they don’t relate them to their larger context: what are we trying to solve, and how might we be fooling ourselves to reinforce our own biases?

In no place is this more true than in the world of incident management. Things go wrong in businesses, large and small, every single day. Those failures often go unreported, as most people see failure through the lens of blame, and no one wants to admit they made a mistake.

Because of that, site reliability engineering (SRE) teams establishing their own incident management process often invest in the wrong initial metrics. Many teams are overly concerned with reducing mean time to resolution (MTTR). Like the British government, those teams rely too heavily on their metrics and fail to consider the larger context. Incidents are almost always going to be underreported initially: people don’t want to admit things are going wrong. If people are judged on their ability to close incidents quickly, they’ll close incidents too early or declare them too late.

Companies just adopting an incident response strategy should focus on metrics that help normalize failure as a regular component of doing business. Incident count is one of those metrics: paradoxically, you should expect your company’s number of incidents to increase as you begin to embrace a culture of failure and learning.
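As a rough illustration of the two metrics discussed above, here is a minimal Python sketch that computes a monthly incident count and MTTR from a list of incident records. The `Incident` class, its field names, and the sample data are all hypothetical, and treating MTTR as the average of resolved-minus-declared time is an assumption about how a team might define it, not something prescribed here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    declared_at: datetime
    resolved_at: datetime


def monthly_summary(incidents):
    """Return {(year, month): (incident_count, mttr)}, keyed by declaration month."""
    durations = defaultdict(list)
    for inc in incidents:
        key = (inc.declared_at.year, inc.declared_at.month)
        durations[key].append(inc.resolved_at - inc.declared_at)
    # MTTR here is the arithmetic mean of the resolution times in each month.
    return {
        key: (len(ds), sum(ds, timedelta()) / len(ds))
        for key, ds in sorted(durations.items())
    }


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2021, 8, 3, 9, 0), datetime(2021, 8, 3, 10, 0)),
    Incident(datetime(2021, 9, 1, 14, 0), datetime(2021, 9, 1, 14, 30)),
    Incident(datetime(2021, 9, 7, 8, 0), datetime(2021, 9, 7, 11, 30)),
    Incident(datetime(2021, 9, 20, 22, 0), datetime(2021, 9, 21, 0, 0)),
]

summary = monthly_summary(incidents)
for (year, month), (count, mttr) in summary.items():
    print(f"{year}-{month:02d}: {count} incidents, MTTR {mttr}")
```

In this made-up data, both the count and the MTTR go up from August to September. Read on their own, the numbers look like a regression; read in context, they may simply mean that more incidents, including harder ones, are now being declared.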

Three Ways to Actually Improve Incident Response

  • Embrace failure. Early on you need to normalize failure, so watching for an increasing number of incidents is important. Once you have a track record of actively recording incidents, you can consider measuring MTTR because you’ll have a proper baseline. And you’ll have created a culture where, if a major incident requires a lot of time to reach a resolution, you won’t be so influenced by measuring MTTR that you take a counterproductive action like closing out the incident early.
  • Work to understand the context. No metric works without an understanding of the larger context. The C-suite should practice scuttlebutt and watch incidents take place (or even participate) to understand the pain points of day-to-day incidents, and to understand how metrics rarely show the full picture. As an executive, if your level of involvement is just wanting to see a report each month with metrics like MTTR decreasing, you’ll never actually create a resilient culture.
  • Don’t suffocate the process. When your product isn’t working, it can be tempting to push your responders to fix things more quickly. Build trust with your teams, and acknowledge that they’re composed of human beings who need time and space to solve difficult issues. Micromanaging or over-optimizing during active incidents just increases stress and, paradoxically, will reduce your teams’ ability to respond effectively.

No combination of metrics can, on its own, tell you how effective your company is at incident response. Data is just the starting point: it informs a hypothesis that company leaders need to confirm by using their eyes and ears. All of this together will help you build a realistic view of your company’s incident response. Remember: your goal as a company isn’t to reduce MTTR, it’s to learn from failure and build a more resilient organization.
