It’s 1900, and the cobra population is out of control in Delhi. The British Government, outsiders by any definition, brainstorm ways to deal with the issue. They stumble upon an obvious, if macabre, answer: pay people for each cobra head they bring to the crown. Everything goes swimmingly for a while: the cobra population is going down and people are getting paid well for facilitating the decline, perhaps too well. People start breeding cobras to collect a higher bounty. The cobra population explodes. The problem is now worse than ever.
The British Government got caught in a perverse incentive: they started rewarding people for making the problem worse. Perverse incentives are everywhere today, and they arise when a problem is poorly understood. One must ask: how do incentives distort the problem I’m trying to solve?
“There is a quality even meaner than outright ugliness or disorder, and this meaner quality is the dishonest mask of pretended order, achieved by ignoring or suppressing the real order that is struggling to exist and to be served.”
— Jane Jacobs, The Death and Life of Great American Cities
To the uninitiated, all complexity looks like chaos. Real order requires understanding. Real understanding requires context. I’ve seen teams all over the tech world abuse data and metrics because they don’t relate them to the larger context: what are we trying to solve, and how might we be fooling ourselves to reinforce our own biases?
Nowhere is this more true than in the world of incident management. Things go wrong in businesses, large and small, every single day. Those failures often go unreported, as most people see failure through the lens of blame, and no one wants to admit they made a mistake.
Because of that fact, site reliability engineering (SRE) teams establishing their own incident management process often invest in the wrong initial metrics. Many teams are overly concerned with reducing MTTR: mean time to resolution. Like the British government, those teams are overly relying on their metrics and not considering the larger context. Incidents are almost always going to be underreported initially: people don’t want to admit things are going wrong. If people are judged on their ability to close incidents quickly, they’ll close incidents too early, or declare them too late.
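To make the distortion concrete, here is a minimal sketch of the arithmetic. The timestamps, the 20-minute declaration delay and the 10-minute early close are invented for illustration, and `mttr_minutes` is a hypothetical helper, not part of any incident management tool.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttr_minutes(incidents):
    """Mean time to resolution in minutes over (declared_at, resolved_at) pairs."""
    return mean(
        (resolved - declared).total_seconds() / 60
        for declared, resolved in incidents
    )


# Hypothetical moment at which both outages actually began.
start = datetime(2024, 1, 1, 9, 0)

# Honest reporting: declared when the problem starts, closed when it is truly fixed.
honest = [
    (start, start + timedelta(minutes=90)),
    (start, start + timedelta(minutes=60)),
]

# Gamed reporting: the same outages, declared 20 minutes late and closed 10 minutes early.
gamed = [
    (declared + timedelta(minutes=20), resolved - timedelta(minutes=10))
    for declared, resolved in honest
]

honest_mttr = mttr_minutes(honest)
gamed_mttr = mttr_minutes(gamed)
print(f"honest MTTR: {honest_mttr} min, gamed MTTR: {gamed_mttr} min")
```

The outages are identical in both lists; only the reporting changed, yet the metric appears to improve by 30 minutes.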
Companies just adopting an incident response strategy should focus on metrics that help normalize failure as a regular component of doing business. Incident count is one of those metrics: paradoxically, you should expect your company’s number of incidents to increase as you begin to embrace a culture of failure and learning.
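Tracking incident count can be as simple as tallying declarations per month. The sketch below assumes a hypothetical list of declaration dates; in practice these would come from whatever system your team uses to record incidents.

```python
from collections import Counter
from datetime import date

# Hypothetical declaration dates, invented for illustration.
declared_on = [
    date(2024, 1, 5),
    date(2024, 2, 3),
    date(2024, 2, 27),
    date(2024, 3, 2),
    date(2024, 3, 14),
    date(2024, 3, 30),
]

# Count declared incidents per calendar month.
counts = Counter(d.strftime("%Y-%m") for d in declared_on)

for month in sorted(counts):
    print(month, counts[month])
```

A rising count in the early months of a new process may mean that more failures are being reported, not that more failures are occurring.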
No combination of metrics can, on its own, tell you how effective your company is at incident response. Data is just the starting point: it informs a hypothesis that company leaders need to confirm by using their eyes and ears. Together, the data and those firsthand observations will help you build a realistic view of your company’s incident response. Remember: your goal as a company isn’t to reduce mean time to resolution; it’s to learn from failure and build a more resilient organization.