URL: https://thenewstack.io/top-12-best-practices-for-better-incident-management-postmortems/

Top 12 Best Practices for Better Incident Management Postmortems - The New Stack


DevOps / Observability / Security

Top 12 Best Practices for Better Incident Management Postmortems

Learn how to start off on the right foot with postmortem meetings by implementing this postmortem process.
Dec 2nd, 2020 4:00am by Steve Tidwell
Feature image via Pixabay.
Torq sponsored this post. Insight Partners is an investor in Torq and TNS.

Poorly implemented postmortems for IT incidents can be painful for everyone involved; they cost money, and worse yet, they can fail to address the root cause of the problem. In this post, we will discuss some of the pitfalls of postmortems and introduce several best practices that can help smooth the postmortem process — including choosing the right personnel, creating a culture of accountability, and conducting “blameless” postmortems. In short, we will explain what you need to do to improve the postmortem process for everyone involved.

What Is a Postmortem?

Steve Tidwell
Steve has been working in the tech industry for over two decades, and has done everything from end-user support to scaling a global data ingestion and analysis platform to handle data analysis for some of the largest streaming events on the web. He has worked for a number of companies helping to improve their operations and automate their infrastructure.

According to Merriam-Webster, a postmortem is “an analysis or discussion of an event after it is over.” In the tech world, postmortem meetings are a key component of the overall incident management process. They are conducted after an undesirable outcome in order to determine what went wrong, why it went wrong, and how it can be avoided in the future.

Postmortems are not limited to the tech world. Many industries and organizations utilize this process to create a feedback loop that allows for continuous improvement. Regardless of the industry, though, a postmortem will almost always follow the same basic format:

  1. What was the intended outcome?
  2. What actually happened?
  3. Why did it happen?
  4. How can it be avoided in the future?
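
The four questions map naturally onto a small record type. The following is a minimal sketch in Python, not a prescribed format: the `Postmortem` class and the `render` and `unanswered` helpers are hypothetical names introduced only for illustration. It shows one way to keep the answers together and to spot questions that have not been answered yet.

```python
from dataclasses import dataclass, fields


@dataclass
class Postmortem:
    """Hypothetical record holding answers to the four basic questions."""
    intended_outcome: str
    what_happened: str
    why_it_happened: str
    how_to_avoid: str


# Question text for each field, in the order the basic format uses.
QUESTIONS = {
    "intended_outcome": "What was the intended outcome?",
    "what_happened": "What actually happened?",
    "why_it_happened": "Why did it happen?",
    "how_to_avoid": "How can it be avoided in the future?",
}


def render(pm: Postmortem) -> str:
    """Return the postmortem as numbered question/answer text."""
    lines = []
    for number, field in enumerate(fields(pm), start=1):
        lines.append(f"{number}. {QUESTIONS[field.name]}")
        lines.append(f"   {getattr(pm, field.name)}")
    return "\n".join(lines)


def unanswered(pm: Postmortem) -> list[str]:
    """List the questions whose answers are still blank."""
    return [
        QUESTIONS[field.name]
        for field in fields(pm)
        if not getattr(pm, field.name).strip()
    ]
```

A moderator could call `unanswered` before closing the meeting to confirm that all four questions have been addressed.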

Retrospectives vs. Postmortems

Postmortems resemble Agile retrospectives in intent, but there are a few key differences. Postmortems are normally held as soon as possible after an event or incident occurs. Retrospectives are held on a regular schedule, typically at the end of each sprint, as part of a wider Agile strategy that also includes sprint planning and a daily standup.

Although there are different ways to implement a retrospective, they usually look something like this:

  1. What went well during the project, sprint, or prior period?
  2. What didn’t go so well?
  3. What would we like to see in the future?

What to Avoid in a Postmortem Process

So can postmortems go wrong? Very easily, as it turns out. In an organization without proper accountability or a well-planned postmortem process, the most common problem is usually finger-pointing — or what is sometimes called “The Blame Game.”

Many people can probably relate to this scenario. A poorly moderated postmortem discussion would go something like this:

  1. Question: “What was the intended outcome?”
    Answer: “To successfully deploy new code and features to production.”
  2. Question: “What actually happened?”
    Answer: “The website went down during a regularly scheduled deployment.”
  3. Question: “Why did that happen?”
    Developers might answer: “QA signed off. They didn’t have a proper test strategy and let a bug slip into production.”
    QA might answer: “Ops didn’t configure the production environment correctly. If it weren’t for that, we would have caught this before it went out.”
    Ops might answer: “If the code had been written correctly, the application wouldn’t have crashed in the first place.”
  4. Question: “How can it be avoided in the future?”
    Developers might answer: “QA needs to do a better job in the future!”
    QA might answer: “Ops needs to do a better job in the future!”
    Ops might answer: “Developers need to do a better job in the future!”
    Management: “Sigh…”

The Blameless Postmortem

Google’s SRE Book lays out an excellent postmortem strategy in the chapter titled “Postmortem Culture: Learning from Failure.” It discusses why postmortems need to be conducted objectively (hint: people are hard-wired to point fingers) and why collaboration is a better approach (because most people want to learn from their mistakes and make things work better for everyone else too).

Torq is a no-code automation platform for security and operations teams. Easy workflow building, endless integrations, and out-of-the-box templates deliver value in minutes — not weeks. Torq and TNS are under common control.

A practical implementation of a blameless postmortem would look something like this:

  1. Question: “What was the intended outcome?”
    Answer: “To successfully deploy new code and features to production.”
  2. Question: “What actually happened?”
    Answer: “The website went down during a regularly scheduled deployment.”
  3. Question: “Why did that happen?”
    Answer: “The staging and production environments were different. A bug that didn’t manifest in the staging environment manifested in production. That caused the application to crash.”
  4. Question: “How can it be avoided in the future?”
    Answer: “We should include additional checks in the code to improve our ability to catch error conditions and prevent the application from crashing. We should make sure that the staging and production environments are identical. If that’s not possible, we should implement additional testing using a canary deployment (or other means) to catch bugs before they are fully deployed to production.”

The last step should also include a list of actionable items, with an owner assigned to each one. A routine follow-up should also be conducted to ensure that those action items were actually completed in a timely manner.
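
As a rough illustration of that follow-up step, the sketch below tracks action items and reports the ones that are late or have no owner. It is a minimal Python example under stated assumptions: the `ActionItem` class, the `overdue` and `unowned` helpers, the team names, and the dates are all hypothetical and not taken from any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical action item recorded during a postmortem."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


def unowned(items: list[ActionItem]) -> list[ActionItem]:
    """Return items that nobody has been assigned to."""
    return [item for item in items if not item.owner.strip()]


# Illustrative data only, loosely based on the remedies discussed above.
example_items = [
    ActionItem("Add checks to catch error conditions", "dev-team", date(2020, 12, 9)),
    ActionItem("Align staging and production", "ops-team", date(2020, 12, 16), done=True),
    ActionItem("Add a canary deployment stage", "", date(2020, 12, 4)),
]

for item in overdue(example_items, today=date(2020, 12, 10)):
    print(item.due, item.owner or "(no owner)", item.description)
```

Running a check like this on a regular schedule is one way to carry out the routine follow-up. In practice the same query would more likely be a saved filter in the ticketing system.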

Notice that at no point in our blameless postmortem scenario did anyone attempt to blame another group. Instead, they conducted an objective analysis of the incident. This process would also include a proper root cause analysis, along with a list of possible remedial actions. You can also get ahead of the blame game by proactively avoiding some common communication mistakes among teams.

Potential Postmortem Pitfalls

The problem with trying to instill an accountable yet blameless culture in organizations is that, as we mentioned earlier, humans tend to be hard-wired to point the finger — whether it’s at themselves or someone else.

For an example of how you can avoid “the blame game,” check out “Blameless postmortems don’t work. Here’s what does.” In short: make sure that your process is solid, hold people to the process, keep in mind that you are dealing with human beings, stay “blame aware,” and work with your teams to help them understand healthier ways to interact and improve.

Postmortem Best Practices

The following are a few best practices and tips to help you on your journey to a better postmortem process:

  1. Obtain buy-in from management, from the bottom all the way to the top. Without some kind of authority behind your process, it will most likely go nowhere.
  2. Assign a process owner. This individual will be responsible for all follow-up, including scheduling meetings.
  3. Keep the overall process simple. Complicated processes make gaining acceptance more difficult. A lack of acceptance begets non-compliance.
  4. Create a project in your ticketing system dedicated solely to tracking incident workflow.
  5. Keep the ticket workflow simple.
    •  For example, a simple workflow might be something like:
      1. Incident in progress
      2. Incident resolved
      3. Root cause analysis
      4. Incident follow-up
      5. Incident closed
  6. Keep the amount of information required for a ticket to a minimum. With fewer fields in the ticket, it will be easier for people to identify the information that facilitates the process. It will also increase the likelihood that the ticket will be filled out properly.
    • A minimalist ticket might look like the following:
      1. Title
      2. Executive Summary
      3. List of personnel who participated in resolving the incident
      4. Ticket owner (incident owner)
      5. Incident date
      6. Start and end time of the incident. (We recommend using UTC if your organization spans more than one time zone. This will also help keep the timeline accurate when reviewing server or chat logs, since correlation is easier when it doesn’t require conversion.)
      7. Incident timeline
      8. What happened?
      9. Why did it happen (i.e., RCA)?
      10. Attachments, links, graphs, logs, or other information
      11. Sub-tickets with suggested follow-up actions
      12. Due date for follow-up
  7. Enforce ticket creation whenever a major incident occurs. This can be done by the individual or team responding to the incident, or by an Incident Coordinator.
  8. Once the incident is over, assign the ticket to an owner. The owner will be responsible for following up on the root cause analysis and ensuring that action items that were created during postmortem discussions are completed.
  9. Appoint a process owner to ensure that tickets in the incident project move through the workflow. In addition, the process owner should be responsible for scheduling meetings as needed.
  10. You should initiate a postmortem when you have:
    1. Major outages that impact end users
    2. Failed deployments
    3. Security breaches
    4. Data loss
    5. Missed deadlines
    6. Repeated or unresolved incidents
  11. You should avoid a postmortem when you have:
    1. Minor problems
    2. Proactive maintenance to prevent larger problems
    3. Scheduled work (unless the work itself causes an incident)
  12. Finally, stamp out finger-pointing wherever possible and try to create a culture of “blame-awareness” and cooperation.
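
Several of these practices can be expressed in a few lines of code: the ticket workflow from item 5, the UTC recommendation from item 6, and the trigger criteria from item 10. The following Python sketch is illustrative only. The `Stage` enum, the helper functions, and the lowercase trigger labels are hypothetical names chosen for this example, and the sketch assumes a strictly linear workflow.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Stage(Enum):
    """Workflow stages, in the order listed in item 5."""
    IN_PROGRESS = 1
    RESOLVED = 2
    ROOT_CAUSE_ANALYSIS = 3
    FOLLOW_UP = 4
    CLOSED = 5


def next_stage(current: Stage) -> Stage:
    """Advance a ticket by exactly one stage."""
    order = list(Stage)  # members iterate in definition order
    index = order.index(current)
    if index == len(order) - 1:
        raise ValueError("incident is already closed")
    return order[index + 1]


def to_utc(stamp: datetime) -> datetime:
    """Convert a timezone-aware timestamp to UTC for the ticket."""
    if stamp.tzinfo is None:
        raise ValueError("timestamp must carry a timezone")
    return stamp.astimezone(timezone.utc)


def incident_duration(start: datetime, end: datetime) -> timedelta:
    """Length of the incident, computed from UTC-normalized times."""
    return to_utc(end) - to_utc(start)


# Hypothetical labels for the incident types listed in item 10.
POSTMORTEM_TRIGGERS = {
    "major outage",
    "failed deployment",
    "security breach",
    "data loss",
    "missed deadline",
    "repeated incident",
}


def needs_postmortem(incident_type: str) -> bool:
    """Return True when the incident type is on the trigger list."""
    return incident_type.strip().lower() in POSTMORTEM_TRIGGERS
```

Normalizing timestamps once, when they enter the ticket, means that responders in different offices can report local times without skewing the incident timeline.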

This article will point you in the right direction when it comes to postmortems, but there are many variables that organizations will need to assess in order to determine what will work best for them. Keep in mind that the postmortem process itself should be reassessed over time in order to account for changes in requirements and to make sure that it is still optimal for your organization.

There are many excellent articles that describe how different companies have implemented their version of the “blameless postmortem.” In particular, see “Blameless PostMortems and a Just Culture,” “How to run a blameless postmortem,” and “Tuning Blameless Postmortems.”
