
URL: https://thenewstack.io/the-power-of-the-debriefing-to-get-at-root-causes/

The Power of the Debriefing to Get at Root Causes - The New Stack


2021-03-05 10:22:31

The Power of the Debriefing to Get at Root Causes

Mar 5th, 2021 10:22am by Jennifer Riggins

Every production system has issues. Every production system fails. This is why a team, and the organization as a whole, must go “through the process of change and creating a healthy and supportive culture of learning,” said Amir Shaked, senior vice president of research and development at web application security provider PerimeterX, where they have 300 fully-Dockerized microservices.

Speaking to a common theme across Chaos Carnival, where he gave his presentation, Shaked explained how PerimeterX learned to implement a wide communication channel to help prevent repeated incidents, because it helped bridge trust gaps. One of the most effective ways to do this is through debriefings.


With this in mind, Shaked’s team started looking at repeated issues: those constant but seemingly minor production failures where “minor risks become catastrophic as you scale,” he said.

As he looked to examine these repeated issues, things that logically a business would want to fix or prevent in the future, Shaked immediately felt pushback. The team had a strong fear of judgment: Why do you ask so many questions? Why don’t you trust us?

“If you have team members afraid or feeling that they are being judged or insecure in their work environments, they are going to underperform and as a team you are not going to be able to learn and adapt as you should,” Shaked said.

So, about three years ago, he set about establishing a new process for the team, focused on revamping how they analyze different kinds of failure.

Because, he said, “Assuming you have the right foundation of engineers, if you fix the process, anything can happen.”

Shaked shared PerimeterX’s debriefing process with the virtual audience of Chaos Carnival, and now The New Stack shares it with you today.

Debriefs Focus on Root Causes

An incident happens — a customer calls and complains. Usually, that’s how you find out about it.

Shaked said, “When they don’t have a resolution, they page the engineering team, usually waking them up. They find the problem and fix it but will resent the fact they had to wake up to fix it.”

But, he added, “If that’s the end, you will have similar issues again because you don’t have the root causes.”

“Humans make mistakes. This is why we need to fix the process and not try to fix the people.” — Amir Shaked, PerimeterX

The PerimeterX team pinpointed that they were missing that crucial last step — analyzing after the fact to learn lessons and stop recent history from repeating itself.

In their first new debrief, they realized that the incident in question was caused by code being deployed into production by mistake. An engineer was merging into the main branch. The code failed the test, but it was late, so the engineer decided to pause everything and look at it the next day.

Shaked said, “What he didn’t know was that the microservice that he was working on, a different addition was made by a DevOps engineer that automatically deployed into production — with an autoscale.”

He said they could have focused on why there was a merge in the first place, why the developer didn’t know about the autoscaling, or how microservices are deeply complex and don’t autoscale easily.

Instead, their new debriefing zoomed in on why there was a misunderstanding about how to treat the main branch.

The team determined together that the main branch equals production. That means that, no matter what, any change involving the main branch is considered a drastic change.

Shaked’s team had to intentionally remove judgment from the debriefing process. He says that when you just assume that people are doing their jobs, and when you’re focusing on the process, you can take away the blame and get to the root cause.

As a team matures, it will start learning from smaller incidents too. PerimeterX holds a debrief meeting within 24 to 72 hours after an incident is resolved. About two to three weeks after the debrief, the team holds a checkpoint meeting to make sure the immediate tasks were incorporated.
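The timing described above can be turned into a small scheduling helper. What follows is only a sketch under stated assumptions: the function names `debrief_window` and `checkpoint_window` are hypothetical, and the article does not say how PerimeterX actually schedules these meetings. It simply computes the 24-to-72-hour debrief window and the two-to-three-week checkpoint window from the timestamps involved.

```python
from datetime import datetime, timedelta


def debrief_window(resolved_at: datetime) -> tuple[datetime, datetime]:
    """Return the (earliest, latest) time for the debrief meeting.

    Assumes the 24-to-72-hour window is counted from the moment of resolution.
    """
    return resolved_at + timedelta(hours=24), resolved_at + timedelta(hours=72)


def checkpoint_window(debrief_at: datetime) -> tuple[datetime, datetime]:
    """Return the (earliest, latest) time for the checkpoint meeting.

    Assumes "two to three weeks" is counted from the debrief meeting itself.
    """
    return debrief_at + timedelta(weeks=2), debrief_at + timedelta(weeks=3)


if __name__ == "__main__":
    # Hypothetical incident resolved in the middle of the night.
    resolved = datetime(2021, 3, 1, 2, 0)
    earliest, latest = debrief_window(resolved)
    print(f"Hold the debrief between {earliest} and {latest}")

    # Hypothetical debrief held the following morning, inside the window.
    first, last = checkpoint_window(datetime(2021, 3, 2, 10, 0))
    print(f"Hold the checkpoint between {first} and {last}")
```

Putting both windows in a calendar as soon as an incident is resolved is one way to make sure the follow-up actually happens.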

Conduct a Debrief, Not a Retro

A retrospective is the most sacred of agile rituals. A retro, as it’s usually called, is used by teams to reflect on their way of working and to continuously get better at what they do. PerimeterX probably did hold retros to examine its debrief process, but not to examine specific incidents.

A debrief, on the other hand, is a formulaic activity to examine any incident that may have a severe impact on your operation.

One thing retros and debriefs have in common is asking a lot of questions. For PerimeterX’s debriefing sessions, they ask the following:

  • What happened? This is a detailed timeline of events, from the moment the issue started rolling into production through to analysis and resolution. As PagerDuty’s Julie Gunderson pointed out, using a simple chat tool like Slack during the incident helps to timestamp events.
  • What’s the impact? Shaked says you have to convey the cost impact, how many and which customers were affected, and the complaints received. You need the full scope, as it’s vital to get everyone to understand why you are delving into the problem. “Understanding the bigger picture, the more you do it, they will focus on that and focus on the bigger impact. And the learning will propagate to have resolutions sooner,” he said.
  • How is everything related? Follow-up and action items are necessary for a debrief to be complete. Try to find patterns as you learn more about your system and how it fails.
  • Did we identify the issue in under a certain amount of time? PerimeterX sets five minutes. You need a timeframe to establish consistency, but that timeframe will vary by team.
  • How long until we fixed the problem? Again, this varies by team, from under an hour to within ten minutes to automatically. The goal of chaos engineering is to study your system to both shore it up and automate as many fixes as possible.
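The questions above map naturally onto a structured record. The following is a minimal sketch, not PerimeterX’s actual template: the `Debrief` and `TimelineEvent` classes and all field names are hypothetical. It also assumes that time to fix is measured from detection to resolution, which the talk does not specify. Only the five-minute detection target comes from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime   # timestamp, e.g. copied from the incident chat channel
    note: str      # what happened at that moment


@dataclass
class Debrief:
    started_at: datetime     # when the issue started rolling into production
    detected_at: datetime    # when the team identified the issue
    resolved_at: datetime    # when the fix was in place
    customers_affected: int  # part of "What's the impact?"
    timeline: list[TimelineEvent] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_fix(self) -> timedelta:
        # Assumption: measured from detection, not from the start of the issue.
        return self.resolved_at - self.detected_at

    def detected_within(self, target: timedelta = timedelta(minutes=5)) -> bool:
        # The default mirrors the five-minute target; other teams will differ.
        return self.time_to_detect() <= target


if __name__ == "__main__":
    # All values below are made up for illustration.
    debrief = Debrief(
        started_at=datetime(2021, 3, 1, 23, 40),
        detected_at=datetime(2021, 3, 1, 23, 52),
        resolved_at=datetime(2021, 3, 2, 0, 30),
        customers_affected=3,
        action_items=["Document that the main branch equals production"],
    )
    print("Time to detect:", debrief.time_to_detect())
    print("Time to fix:", debrief.time_to_fix())
    print("Met detection target:", debrief.detected_within())
```

Recording the same fields for every incident makes it easier to spot patterns across debriefs and keeps the questions predictable for the people being asked.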

Next comes the discussion of what needs to be done in order to make sure all the above goals are met, followed by a plan of action to make the system even better.

The ‘Drastic’ Cultural Change Driven by Streamlined Debriefs

Shaked said these changes to debriefs led to a drastic cultural change over time, but that they had to learn from their mistakes along the way.

First and foremost, they uncovered a lack of trust in the then newly promoted Shaked, who was coming in to “install” that new process and culture.

Inevitably your team will start playing the blame game, which he says you have to nip in the bud as quickly as possible.

“When the focus is on the process and the system, it’s not about who caused the incident. It’s setting the ground to creating the learning opportunities and improvement.” — Amir Shaked, PerimeterX

“If you see it starting to happen, you need to interfere politely and calmly,” Shaked advised.

Keep your debrief narrowly focused on one incident — not broader themes like retrospectives — and focus on the what, not the who. And remember to go easy on the why questions.

He explained, “You need to ask why someone did something, but you don’t want to create self-doubt — you want to focus on the process not the behavior.”

They also realized a debrief is a moot ritual if you don’t include follow-up action items, which you then check back on.

But sometimes you need to communicate in the moment. That’s why they implemented a crisis mode process, a proverbial big red button, and clarified what it is and when to press it so that it wakes everyone up. Having everyone around the table during a major issue bridges knowledge gaps and leads to a faster solution.

Shaked said a good debrief all comes down to process consistency, so people know the questions they are going to be asked ahead of time, which helps keep everything more positive.

He said, “Keeping calm and making it clear there is a path forward is really important for a change environment, especially when there’s a very serious incident with a very high impact.”

Over the last three years, through the simple act of honed debriefing, PerimeterX has learned some valuable lessons about both its teams and its systems. At the top of that list: never try to fix the humans. Trust that you have a good team, and accept that humans are going to make mistakes.

Download your own copy of PerimeterX’s free debrief template.

Chaos Carnival was organized by MayaData, a sponsor of The New Stack.

Feature image by Jens P. Raak from Pixabay.

Jennifer Riggins is a tech storyteller and journalist, event and panel host. She bridges the gap between business, culture and technology, with her work grounded in the developer experience. She has been a working writer since 2003, and is based...
TNS owner Insight Partners is an investor in: Root, MayaData.