URL: https://thenewstack.io/the-need-to-decouple-human-error-from-incident-response/

The Need to Decouple Human Error from Incident Response - The New Stack

A lot of software management is still quick to blame incidents on human error over the complexity of those machines we’re interacting with.
May 25th, 2022 3:00am by Jennifer Riggins
VALENCIA, Spain — Science fiction may be about humans versus the machines. However, a lot of software management is still quick to blame incidents on human error over the complexity of those machines we’re interacting with.

At a time in which we understand the impact of both burnout and psychological safety on teams, finding “human error” included in a root cause analysis is just bad business. The blame game must end as it’s disruptive to the team and organizational resiliency.

In her lightning talk “Whyhappn instead of Whodunnit,” independent software engineer Silvia Pina begged the KubeCon + CloudNativeCon Europe 2022 audience to remove the term “human error” from their vocabulary. When we are talking about consistently complex systems with unknown unknowns and increasingly sophisticated attack vectors, an incident can’t come down to just one person.

As Charity Majors contends, the smallest unit of software delivery and ownership is a team. It’s time to shift our focus from blaming the individual to applying Pina’s perspective of systems thinking and organizational psychology to increasing resiliency.

Even Aviation Doesn’t Talk Human Error Anymore

The concept of human error in technology is adapted from the aviation industry. “Because the system or the machine is considered really reliable and all safety issues come from the fact that humans are operating it, so humans are the weak link,” Pina explained. Or at least we were perceived to be.

Over time, human error in aviation changed from the cause of failure to a symptom of failure. Safety is no longer perceived as inherent to the system, so progress has been redefined as a better understanding of the ways in which tools, tasks, and the environment interact.

Alas, human error is still being applied to reasons behind software incidents.

“It’s like an Agatha Christie story trying to figure out who has committed the crime, or, in this case, the incident,” Pina said. “This ties to an old view of human error that comes from aviation where high reliability is a requirement.” Reliability is of course a requirement in software engineering, but not at the 100% uptime an airplane full of people demands.

Like aviation, distributed software systems have high levels of complexity. But these systems also have a huge amount of variability. “This level of variability requires some level of adjustment,” she said. “This is one of the reasons we are successful, but this is also one of the reasons why there are failures.” Teams must accept that failures will occur, no matter what they do to plan against them.

Software engineering, unlike aviation, also embraces failure as an opportunity to experiment and learn. This is even a critical part of site reliability engineering practice: allowing for an error budget, and applying observability and chaos engineering to learn by pushing systems to their limits and, sometimes, to failure.
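To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic, assuming an availability target measured over a rolling window. The function names, the 30-day window and the 99.9% target are illustrative assumptions, not figures from Pina’s talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target tolerates.

    `slo` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return how many minutes of the budget are left after observed downtime."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of tolerated failure.
    print(round(error_budget_minutes(0.999), 1))
    # After a 30-minute incident, about 13.2 minutes remain for experiments.
    print(round(budget_remaining(0.999, 30.0), 1))
```

Framed this way, an incident spends a budget the team agreed on in advance rather than pointing at an individual who failed.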

The Psychological Safety of High-Performing Organizations

Success and failure are better perceived as two sides of the same coin. Pina calls this new view of human error more like a “no view.” We no longer need to have human error as a category in postmortems.

“We should take away the focus from the individual and try to look at what organizations can do,” she said.

At this level, she recommends considering the five characteristics that are common to high-reliability organizations, which are:

  1. Preoccupied with failure — try to identify warning signs for all possible failures at technical, process or human levels
  2. Reluctant to simplify — embrace complexity, don’t look for simple answers, understand need for specialization, upskilling and training, as well as automation
  3. Sensitive to operations — maintain a global view and look to understand work-as-done, embracing candid employee feedback
  4. Committed to resilience — failure becomes a learning opportunity, teams constantly looking for ways to recover more quickly
  5. Defer to expertise — anyone can ask questions or provide answers, expertise is valued more than authority

“Failure has a role in how these [elite] organizations work,” Pina explained. “We build resilience to failure by focusing on helping people to cope with complexity under pressure.”

Image: A Magritte painting which has the shadow of a man with a hat and the opposite, next to the words WhyHappen

This means, she says, keeping awareness at an organizational level, and spreading the lessons throughout. With this in mind, the blameless postmortem is essential to learn the root causes of an incident. A postmortem is an important mechanism for continuous learning and improvement in incident response, but only if the finger-pointing is left out.

“We move from this very human tendency to judge to a point where we can then understand why a failure happens,” Pina said. “And this is why we need to no longer talk about human error.”

This is also why zero trust culture centers on moving away from the assumption that humans are the weakest link in any security chain and toward making security everyone’s job, and then, at a technical level, on enforcing collaborative governance. Yes, human error is a leading cause of Kubernetes security incidents, but that’s because the orchestration system is very weak on security-minded defaults.

Red Hat even found that these Kubernetes incidents were caused by misconfigurations. But if that is the repeated error, that’s a systemic and procedural issue, along with a technical one, not the error of one teammate. High-performing organizations understand that they must improve processes and tech in response, not play the blame game.
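One way to treat repeated misconfiguration as a process problem is to check for it automatically before anything ships. The sketch below is an assumption-laden illustration, not a method from the talk or from Red Hat: it takes a pod spec already parsed into a Python dict and reports containers whose `securityContext` does not explicitly set a few hardening fields. The choice of fields and the helper name `find_unhardened` are hypothetical.

```python
from typing import Any

# Hardened values we want set explicitly (an illustrative policy, not a standard).
HARDENED = {
    "runAsNonRoot": True,
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
}


def find_unhardened(pod_spec: dict[str, Any]) -> list[str]:
    """Return one finding per container field that is not set to the hardened value."""
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        for field, wanted in HARDENED.items():
            if ctx.get(field) != wanted:
                findings.append(f"{name}: {field} should be {wanted}")
    return findings


if __name__ == "__main__":
    # Hypothetical pod spec with no securityContext at all.
    spec = {"containers": [{"name": "web", "image": "example/web:1.0"}]}
    for finding in find_unhardened(spec):
        print(finding)
```

Run in a review or CI step, a check like this reports the gap in the configuration itself, so the conversation is about the missing guardrail and not about who wrote the manifest.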

Psychological safety is essential to building organizational resilience to failure. Pina says, therefore, it’s leadership’s job to help people cope with complexity under pressure.

Decoupling human error from incident response gives you perspective, she explained, and you see things anew, just like in a René Magritte painting.

Jennifer Riggins is a tech storyteller and journalist, event and panel host. She bridges the gap between business, culture and technology, with her work grounded in the developer experience. She has been a working writer since 2003, and is based...
KubeCon + CloudNativeCon and Red Hat are sponsors of The New Stack.