Towards More Effective Incident Postmortems

Apr 27, 2020
Last Updated:
Apr 27, 2020

An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.

    As our systems grow in scale and complexity, outages are inevitable, no matter how hard we try to provide uninterrupted service. When an outage occurs, the most important and immediate step is, of course, to fix the underlying issue and keep the relevant stakeholders and customers informed. Many incidents can be rectified quickly with infrastructure automation, runbooks, feature flags, version control and continuous delivery, and people can be kept in the loop with chatops and status pages. These actions help fix the situation at hand, but they do not really help you understand what failed and why. And understanding what failed and why is a crucial step towards preventing similar occurrences going forward.

    This is where incident postmortems come in - the next logical step after any incident is to dissect and analyze the what, the why and the how of the incident. Ideally, this should be done for every single incident, and not just the high-severity or high-impact ones.

    An incident postmortem is a report that records the details of an incident, the impact it had on the service, the team that was assembled to address the event, the immediate steps taken to mitigate the damage, the actions taken to resolve the incident and the lessons learnt that can help the team minimize the impact of future incidents. These lessons can in turn affect how you think about a particular component of your system, or sometimes just how mitigation steps could be done faster in specific cases. That is a big deal, to say the least.

    What is an incident postmortem?

    An incident postmortem is a process that takes place after an incident occurs. It is the process of analyzing the incident and identifying its root causes, so that those causes can be addressed before they lead to future incidents. It is important to understand what an incident postmortem is, why it matters, and how it should be conducted.

    Importance of Incident Postmortems

    An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.

    There are several reasons why doing an incident postmortem is incredibly important:

    • Serves as a documentation tool: It gives team members the ability to record the nitty-gritty of an incident, ensuring that it won’t be forgotten. A well-documented incident becomes invaluable to a team since it includes not only a description of what happened but also details of the actions that were taken, which serve as a reference point for remediation and mitigation of future incidents.
    • Helps build trust and transparency with customers and relevant stakeholders of a particular service or application when posted publicly. This also helps build confidence amongst users that the necessary steps are being taken to prevent future disruptions to the services provided.
    • Instills a culture of learning. As rightly said, “The cost of failure is education”. It also helps shift the focus from the immediate now to the future. This is why conducting blameless postmortems becomes crucial. More on being blameless is covered later in this blog.
    • Serves as an opportunity to gain insights that drive infrastructure improvement: when services and applications fail in new and interesting ways, we learn which areas need work.

    Incident Postmortem - What does it consist of?

    Incident postmortems are also called RCAs (Root Cause Analyses) or incident reviews. At Squadcast, we prefer the term Incident Reviews, but to keep this easier to digest, we are going to refer to them by the more popular “Incident Postmortem” for the rest of this article. When it comes to an incident postmortem, there is no one-size-fits-all approach or even a universally accepted standard for doing different kinds of postmortems. The postmortem process varies across organizations, and sometimes even within companies, from casual to highly formal, depending on the size and culture of the teams, the nature of the product and the severity of the incident.

    Regardless of the names and the approach, the end goal remains the same - to keep relevant stakeholders informed and to use the incident as a learning opportunity, not only to fix a weakness but to make systems more resilient as a whole. Gathering information for a postmortem can take considerable time and effort, and the postmortem meeting (if needed) might occur days, or even weeks, after the actual incident, depending on its severity.

    A typical postmortem process covers the below-outlined aspects, in no particular order:

    • A high-level outline or summary: This covers the ‘what’ and ‘why’ of the incident, the severity and business impact on customers or users, the people involved in the response process and the resolution of the incident. This is particularly beneficial to managers and application owners who need to communicate details of the incident to top management and relevant outside stakeholders.
    • Causes: This part of the postmortem addresses the more technical and operational aspects of the incident. It starts with the causes and triggers, explains the origins of the failure and highlights the underlying cause - what made the system break. A popular method for getting to the root cause is the 5 Whys process, which was first made popular by Toyota.
    • Effects: After analyzing the causes of the incident in detail, the team is tasked with measuring and analyzing the effects on business, services, and users. This step of the postmortem process also analyses the extent and severity of the incident. One example is the impact on an e-commerce business when its payment service is down and customers cannot complete purchases.
    • Resolution: This step starts with a detailed look at the incident timeline, covering the time of failure, the time when the incident was recognized and handled, the team involved in the process, and the procedures followed to remediate the problem. This part can also include a review of failed attempts, which can serve as a reference when a similar incident occurs, saving valuable time.
    • Conclusion: Outlines the key takeaways, recommendations and next steps to prevent the same or a similar incident in the future.
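
    To make the five sections above concrete, here is a minimal sketch of how a postmortem could be held as a structured record. The class name, field names and sample data are all assumptions made for illustration, not a Squadcast schema; the causes list is treated as an ordered chain of answers to successive "why?" questions, in the spirit of the 5 Whys.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Postmortem:
    """Hypothetical record with one field per section listed above."""
    summary: str = ""
    causes: List[str] = field(default_factory=list)      # ordered answers to successive "why?" questions
    effects: List[str] = field(default_factory=list)
    resolution: List[str] = field(default_factory=list)  # timeline steps, including failed attempts
    conclusion: List[str] = field(default_factory=list)  # takeaways and next steps

    def underlying_cause(self) -> Optional[str]:
        """Return the last answer in the why-chain, or None if no causes are recorded."""
        return self.causes[-1] if self.causes else None

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in the order listed above."""
        names = ["summary", "causes", "effects", "resolution", "conclusion"]
        return [name for name in names if not getattr(self, name)]


# Illustrative data only, not taken from a real incident.
draft = Postmortem(
    summary="Checkout requests failed for part of the afternoon.",
    causes=[
        "The payment service returned errors.",
        "Its database connection pool was exhausted.",
        "The pool size was never raised as traffic grew.",
    ],
)
print(draft.underlying_cause())
print(draft.missing_sections())
```

    A check like missing_sections() is one simple way to notice that a draft is incomplete before it is circulated.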

    Successful postmortems are blameless

    “Blameless postmortems are a tenet of SRE culture. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems”

    A critical factor in successful incident postmortems is that they are blameless. A culture that seeks to point fingers at the person who may have caused an outage through error or omission is unlikely to get truthful answers during a review, thus negating the intent behind the whole exercise of having an incident postmortem in the first place.

    Through blameless postmortems, the aim is to have a nurturing environment where every “mistake” is seen as an opportunity to strengthen the system. Blameless postmortems shift the focus from allocating blame to investigating the underlying causes and reasons why an individual or team faced an outage, and to emphasizing the effective prevention plans that can be put in place.

    Many teams, including Google and us here at Squadcast, have adopted the culture of the blameless postmortem, which paves the way to building resilience in their teams and systems.

    Blameless postmortems can be challenging to write, since the postmortem format clearly identifies the actions that led to the incident. However, removing blame from a postmortem gives the team the confidence to escalate issues without fear. The next section outlines the steps that can be taken to conduct effective blameless postmortems.

    To help teams develop a culture around blameless incident postmortem reviews, it is also worth giving them an easy and automated way to capture incident information and publish the final report with reusable checklists and templates, which could make incident postmortem meetings less dreadful. In fact, having an automated timeline and templates that are auto-populated with incident metrics and other details as part of your incident management tool can help the process be more consistent and productive for every incident that occurs.

    Conducting effective Incident Postmortems - The process

    For postmortems to be blameless and effective at reducing recurring incidents, the review process should incentivize teams to identify root causes and fix them. A well-conducted postmortem allows teams to come together to achieve better goals in a less stressful environment. The exact method can depend on team culture.

    Here are a few best practices that can ensure the effectiveness of postmortems:

    1. Start with an incident timeline

    Before the postmortem meeting takes place, build a timeline of significant activity - from chat conversations, incident details and more - and make it the premise of the meeting. You can streamline the entire postmortem process with automated incident timeline building, collaborative editing and actionable insights, and formalize your own postmortem process to make it as easy as possible for your team to respond to issues.

    The goal is to understand all contributing root causes and to document the incident for pattern discovery, which allows you to set better context during the postmortem meeting. This step also plays a key role in enacting effective preventative actions to reduce the likelihood or impact of recurrence.
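
    As a rough illustration of what automated timeline building might involve, the sketch below merges events from several sources into one chronological list. The Event type, the source labels and the sample events are hypothetical; a real incident management tool would pull these from its own integrations.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str  # illustrative labels such as "alert", "chat" or "deploy"
    note: str


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.at)


def render(timeline: Iterable[Event]) -> List[str]:
    """Format each event as 'HH:MM [source] note'."""
    return [f"{e.at:%H:%M} [{e.source}] {e.note}" for e in timeline]


# Made-up events for illustration.
alerts = [Event(datetime(2020, 4, 27, 10, 2), "alert", "Error rate above threshold")]
chat = [
    Event(datetime(2020, 4, 27, 10, 15), "chat", "Rollback proposed"),
    Event(datetime(2020, 4, 27, 10, 5), "chat", "On-call engineer acknowledges"),
]
deploys = [Event(datetime(2020, 4, 27, 9, 58), "deploy", "Release shipped")]

timeline = build_timeline(alerts, chat, deploys)
for line in render(timeline):
    print(line)
```

    Sorting everything onto a single clock is what lets the meeting start from an agreed sequence of events rather than from individual recollections.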

    2. Conduct a postmortem meeting with anyone internal to the team who was affected by the incident

    Bringing together the people affected by an incident in a structured and collaborative way lets them contribute more cohesively to the postmortem meeting with what they learnt from the incident. This also helps build trust and resiliency within teams. The formal incident postmortem document, which records the details of the incident along with how the team remedied it, can help teams handle future incidents.

    At this step, a formal template can help you record all key details and build consistency across all your incident postmortems.

    At Squadcast we use our own incident postmortem feature, which helps build an insightful timeline in a matter of minutes. This is especially useful as automation ensures that you can quickly have a system-generated postmortem for pretty much any incident, big or small. There are also a few predefined postmortem templates available from the likes of Google, Azure, and others. You can also choose to create new templates or modify existing ones. What’s more, these are available to download in MD and PDF formats!
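
    For teams without such a feature, a template can be as simple as a Markdown skeleton with placeholders. The sketch below uses Python's standard string.Template; the section headings and field names are assumptions chosen for illustration and are not the Squadcast, Google or Azure templates mentioned above.

```python
from string import Template

# Hypothetical Markdown skeleton; adjust the sections to your own process.
POSTMORTEM_TEMPLATE = Template("""\
# Postmortem: $title

Severity: $severity
Owner: $owner

## Summary
$summary

## Causes
$causes

## Effects
$effects

## Resolution
$resolution

## Conclusion
$conclusion
""")


def render_postmortem(**fields: str) -> str:
    """Fill the template; substitute() raises KeyError if any field is missing."""
    return POSTMORTEM_TEMPLATE.substitute(**fields)


# Illustrative values only.
report = render_postmortem(
    title="Checkout outage",
    severity="Sev 1",
    owner="Payments on-call",
    summary="Checkout requests failed for part of the afternoon.",
    causes="Database connection pool exhausted.",
    effects="Customers could not complete purchases.",
    resolution="Pool size raised and service restarted.",
    conclusion="Add an alert on pool saturation.",
)
print(report)
```

    Using substitute() rather than safe_substitute() is a deliberate choice here: a missing section fails loudly instead of leaving a stray placeholder in the published report.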

    3. Define roles and owners along with having a moderator

    Another key aspect to keep in mind during a postmortem meeting is to have well-defined roles and owners, along with a moderator who can ensure the meeting stays on track and avoids any hint of a “blamestorming” session. It is helpful to give the owners of the postmortem process guidelines on how the meetings should be run.

    The owner of the review is tasked with managing the meeting and chronicling the subsequent report. It is advisable that the owner be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact. Usually, the moderator is the owner of the incident review and is responsible for maintaining order and giving every participant the chance to speak.

    4. Determine the urgency of an incident by setting the right thresholds

    Not all incidents are equal. Each incident in an organisation should be associated with a measurable severity level based on the impact it has on its business and customers. Associating incidents with the right severity level can help you prioritize your postmortem process. For instance, Sev 1 or higher incidents definitely require a postmortem, while for less severe incidents, postmortems can be automated with a tool like Squadcast.

    That said, if need be, teams should also be provided with an option to request a postmortem for any incident that doesn't meet the threshold.
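
    The threshold rule described above can be sketched in a few lines. This is only one possible policy: the function name and the "full" and "automated" labels are made up for the example, and it assumes that severity 1 is the most severe level and that larger numbers are milder.

```python
def postmortem_mode(severity: int, requested: bool = False, threshold: int = 1) -> str:
    """Return "full" when a review meeting is warranted, else "automated".

    Assumes severity 1 is the most severe and larger numbers are milder.
    """
    if severity < 1:
        raise ValueError("severity levels start at 1")
    # At or above the threshold, or explicitly requested by the team.
    if severity <= threshold or requested:
        return "full"
    return "automated"


print(postmortem_mode(1))                  # full
print(postmortem_mode(3))                  # automated
print(postmortem_mode(3, requested=True))  # full
```

    The requested flag covers the case where a team asks for a postmortem on an incident that does not meet the threshold.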

    5. Devil’s in the Details - incident metrics and other key information captured

    Capturing as many details as possible about what happened and what was done during the incident helps teams avoid ambiguity. Details such as links to tickets, status updates and incident state documents like monitoring charts, along with screenshots and relevant graphics or dashboards, become a powerful data set that captures the fine details of an incident.

    It is also crucial that, along with summarizing key details, you capture important incident-related metrics that let you associate hard numbers with incidents and their impact. Metrics such as Mean Time to Resolution (MTTR), the SLO and the extent of any SLO breach, error budget consumed, severity of the incident and number of minutes of downtime can be considered for postmortem tracking. With consistent measurement of these metrics, you can analyze incident trends over time.
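
    Two of these metrics are straightforward to compute once start and resolution times are recorded. The sketch below is a simplified illustration: the function names are hypothetical, the 30-day window is an assumed default, and the error budget is modelled purely as allowed minutes of downtime for an availability SLO.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - started) over a list of (started, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed(downtime_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Fraction of the error budget used up by the given downtime."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


# Made-up incidents: one resolved in 30 minutes, one in 60 minutes.
incidents = [
    (datetime(2020, 4, 1, 9, 0), datetime(2020, 4, 1, 9, 30)),
    (datetime(2020, 4, 14, 22, 0), datetime(2020, 4, 14, 23, 0)),
]
print(mean_time_to_resolution(incidents))
print(round(error_budget_minutes(0.999), 1))
print(round(budget_consumed(21.6, 0.999), 2))
```

    With a 99.9% target over 30 days the budget works out to roughly 43.2 minutes, so about 21.6 minutes of downtime would use around half of it.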

    The key to conducting effective incident postmortems that can help you improve your team and systems is to have a process and stick to it. And making sure it is effective requires commitment at all levels in the organization.

    6. Publish and track postmortems promptly

    Once the postmortem review meeting is completed, the final but important step is to publish the postmortem promptly and distribute it as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings along with a link to the full report.

    Google states that “A prompt postmortem tends to be more accurate because information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!”

    Regular application of these practices results in better system design, less downtime, and more effective and happier engineers.

    Written By: Anusuya Kannabiran
    April 27, 2020
    A New Era for Squadcast
    A New Era for Squadcast
    December 12, 2022
    Transparency in Incident Response
    Transparency in Incident Response
    December 16, 2019
    Mean Time to Resolve (MTTR) –What It Is? and how to reduce it using Squadcast.
    Mean Time to Resolve (MTTR) –What It Is? and how to reduce it using Squadcast.
    September 3, 2019

    Towards More Effective Incident Postmortems

    Towards More Effective Incident Postmortems
    Apr 27, 2020
    Last Updated:
    Apr 27, 2020

    An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.

    As our systems grow in scale and complexity, outages are inevitable, no matter how hard we try to provide uninterrupted services. When an outage occurs, the most important and immediate step is, of course, fixing the underlying issue and keeping the relevant stakeholders and customers informed. A lot of the incidents can be quickly rectified with tools like infrastructure automation, runbooks, feature flags, version control, continuous delivery and people can be kept in the loop with chatops and status pages. These actions, though beneficial to fix the situation at hand, do not really help understand what failed and why. And understanding what failed and why is a crucial step towards preventing similar occurrences going forward.

    This is where incident postmortems come in - the next logical step after any incident is to dissect and analyze the why, how and the what of the incident. And ideally, this should really be done for every single incident, and not just the high severity or high impact ones.

    An incident postmortem is a report that records the details of an incident, the impact it has on the service, the team that was assembled to address the event, the immediate steps taken to mitigate the damage,the actions taken to resolve the incident and the lessons learnt that can help the team minimize the impact of future incidents. These lessons can in turn affect how you think about a particular component of your system, or sometimes just how mitigation steps could be done faster in specific cases. Which is a big deal, to say the least.

    What is an incident postmortem?

    An incident postmortem is a process that takes place after an incident occurs. It is the process of analyzing the incident and identifying its root causes. These root causes can then be addressed in future incidents. It is important to understand what an incident postmortem is, why it is important, and how it should be conducted.

    Importance of Incident Postmortems

    An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.

    There are several reasons why doing an incident postmortem are incredibly important :

    • Serves as a documentation tool: It provides team members with the ability to record the nitty-gritties of an incident ensuring that it won’t be forgotten. A well-documented incident becomes invaluable to a team since it not only includes a description of what happened but also details on the actions that were taken that serves as a reference point for remediation and mitigation of future incidents
    • Helps build trust and transparency with customers and relevant stakeholders of a particular service or application when posted publicly. This also helps build confidence amongst users that necessary steps are taken to prevent any future disruptions to the services provided
    • Instills a culture of learning. As rightly said, “The cost of failure is education”. It also helps shift the focus from the immediate now to the future. This is why conducting blameless postmortems becomes crucial. More on being blameless is covered later in this blog.
    • Serves as an opportunity to get more insights to drive improvement in infrastructure when services and applications fail in new and interesting ways for us to realise what areas need improvement

    Incident Postmortem - What does it consist of?

    Incident Postmortems are also called RCAs (Root Cause Analysis) or incident reviews. At Squadcast, we prefer the term Incident Reviews but to keep this easier to digest, we are going to refer to them by the more popular “Incident Postmortem”, for the rest of this article. When it comes to an incident postmortem, there is no one-size-fits-all approach or even a universally accepted standard for doing different kinds of post-mortems. The Postmortem process varies across organizations and sometimes even within companies depending on the size and culture of the teams, from casual to highly formal, depending on the nature of the product or the severity of the incident.

    Regardless of the names and the approach, the end goal remains the same - to keep relevant stakeholders informed and as a learning opportunity not only to fix a weakness but to make systems more resilient as a whole. The whole incident postmortem process can take considerable time and effort to gather information and the postmortem meeting (if needed) might occur days, or even weeks, after the actual incident depending on the severity of the same.

    A typical postmortem process covers the below-outlined aspects, in no particular order:

    • A high-level outline or summary: This covers the ‘what’ and ‘why’ of the incident, the severity and business impact on customers or users, people involved in the response process and the resolution of the incident. This is particularly beneficial to managers and application owners who need to communicate details of the incident to the top management and relevant outside stakeholders.
    • Causes: This part of the postmortem addresses more technical and operational aspects of the incident starting with the causes and triggers, explaining the origins of the failure and highlights the underlying cause - what made the system to break. A popular method to get to the root cause is called the 5 Whys Process - which was first made popular by Toyota.
    • Effects: Post analyzing the deeper granularity on the causes of the incident, the team is now tasked with measuring and analyzing the effects on business, services, and users. This step of the postmortem process also analyses the extent and severity of the incident. For instance, the impact on the business when a payment service was down on an e-commerce website affecting its customer’s experience in purchase.
    • Resolution: This step starts with a diagnostic dissection into details of the Incident Timeline covering the time of failure, the time when the incident was recognized and handled, the team involved in the process, procedures taken to remediate the problem. This part can also include a review of failed attempts which can serve as a reference to the team when a similar incident occurs, saving valuable time.
    • Conclusion. : Outlines the key takeaways, recommendations and next steps to ensure prevention of the same or similar incident in the future.
    Unified Incident Response Platform
    Try for free
    Seamlessly integrate On-Call Management, Incident Response and SRE Workflows for efficient operations.
    Automate Incident Response, minimize downtime and enhance your tech teams' productivity with our Unified Platform.
    Manage incidents anytime, anywhere with our native iOS and Android mobile apps.
    Try for free

    Successful postmortems are blameless

    “Blameless postmortems are a tenet of SRE culture. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems”

    A critical factor in incident postmortem to be successful is that they are blameless. A culture that seeks to point fingers at the person who may have caused an outage through error or omission is unlikely to get truthful answers during a review, thus negating the intent behind the whole exercise of having an incident postmortem in the first place.

    Through blameless postmortems, the aim is to have a nurturing environment where every “mistake” is seen as an opportunity to strengthen the system. Blameless postmortems shift from allocating blame to investigating the underlying cause and reasons, why an individual or team faced an outage, and also emphasizing the effective prevention plans that can be put in place.

    Many teams, including us here at Squadcast similar to Google, have adopted the culture of the blameless postmortem which paves way to build resilience in its teams and systems.

    Blameless postmortems can tend to be challenging to write since the postmortem format clearly identifies the actions that led to the incident. However, removing blame from a postmortem provides the team the confidence to escalate issues without fear. The next section outlines the steps that can be taken to conduct effective blameless postmortems.

    In order to ensure that teams develop a culture around blameless incident postmortem reviews, it should also be noted that empowering teams with an easy and automated way to capture incident information and publish the final report with reusable checklists and templates, could potentially make incident postmortem meetings less dreadful. In fact, having an automated timeline and templates that are auto-populated with incident metrics and other details as part of your incident management tool can help the process be more consistent and productive for every incident that occurs.

    Conducting effective Incident Postmortems - The process

    In order for postmortems to be blameless and effective at reducing recurring incidents, the review process can incentivize teams to identify root causes to fix them. A well-conducted postmortem allows teams to come together to achieve better goals in a less stressful environment. The exact method can depend on team culture.

    Here are a few best practices that can ensure the effectiveness of postmortems:

    1. Start with an incident timeline

    Prior to conducting an effective postmortem meeting , the premise of the meeting should be around the timeline of significant activity - from chat conversations, incident details and more. You can streamline the entire postmortem process with automated incident timeline building, collaborative editing, actionable insights, and formalize your own postmortem process to make it as easy as possible for your team to respond to issues.

    The goal is to understand all contributing root causes, document the incident for pattern discovery, which allows you to set a better context during the post mortem meeting. This step also plays a key role in enacting effective preventative actions to reduce the likelihood or impact of recurrence.

    2. Conduct a postmortem meeting with anyone internal to the team who was affected by the incident

    A structured, collaborative approach that brings together the people affected by an incident leads to a more cohesive discussion of what they learnt from it. This also helps build trust and resiliency within teams. The formal incident postmortem document, which records the details of the incident along with how the team remedied it, can help teams handle future incidents.

    At this step, a formal template can help you record all key details and build consistency across all your incident postmortems.

    At Squadcast we use our own incident postmortem feature that helps build an insightful timeline in a matter of minutes. This is especially useful as automation ensures that you can quickly have a system-generated postmortem for pretty much any incident, big or small. There are also a few predefined postmortem templates available from the likes of Google, Azure, and others. You can also choose to create new templates or modify existing ones. What’s more, these are available to download in MD and PDF formats!
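
    To make the idea of a reusable template concrete, here is a minimal sketch that fills a Markdown postmortem template using Python's standard `string.Template`. The section names and field names are assumptions chosen for illustration. They do not reproduce the Squadcast, Google or Azure templates.

```python
from string import Template

# Hypothetical template layout; adapt the sections to your own process.
POSTMORTEM_TEMPLATE = Template("""\
# Postmortem: $title

Severity: $severity
Owner: $owner

## Summary
$summary

## Impact
$impact

## Timeline
$timeline

## Action items
$action_items
""")


def render_postmortem(**fields: str) -> str:
    """Fill the template with incident details.

    substitute() raises KeyError if any field is missing, so an
    incomplete report fails loudly instead of being published with gaps.
    """
    return POSTMORTEM_TEMPLATE.substitute(**fields)
```

    Failing on a missing field is a deliberate choice here: it nudges every postmortem toward the same set of recorded details, which is the consistency a template is meant to provide.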

    3. Define roles and owners along with having a moderator

    Another key aspect of a postmortem meeting is to have well-defined roles and owners, along with a moderator who keeps the meeting on track and avoids any hint of a “blamestorming” session. It helps to have guidelines for the owners of the postmortem process on how the meetings should be run.

    The owner of the review is tasked with managing the meeting and writing the subsequent report. The owner should be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact. In most cases, the moderator is the owner of the incident review and is responsible for maintaining order and giving every participant the chance to speak.

    4. Determine the urgency of an incident by setting the right thresholds

    Not all incidents are equal. Each incident in an organisation should be associated with a measurable severity level based on the impact it has on the business and its customers. Associating incidents with the right severity level can help you prioritize your postmortem process. For instance, incidents of Sev 1 or higher definitely require a postmortem, while for less severe incidents, postmortems can be automated with a tool like Squadcast.

    That said, if need be, teams should also be provided with an option to request a postmortem for any incident that doesn't meet the threshold.
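
    The threshold rule described above can be sketched as a small decision function. This assumes the common convention that a lower number means a more severe incident (so "Sev 1 or higher" means Sev 1 and Sev 0). The threshold value, the enum and the function name are hypothetical.

```python
from enum import Enum


class PostmortemKind(Enum):
    FULL_REVIEW = "full review"            # meeting plus written report
    AUTOMATED_REPORT = "automated report"  # system-generated, no meeting


# Assumed threshold: Sev 1 and anything more severe gets a full review.
FULL_REVIEW_THRESHOLD = 1


def postmortem_kind(severity: int, requested_by_team: bool = False) -> PostmortemKind:
    """Decide which kind of postmortem an incident gets.

    Lower numbers are more severe. A team can always request a full
    review for an incident that does not meet the threshold.
    """
    if severity < 0:
        raise ValueError("severity must be zero or a positive integer")
    if severity <= FULL_REVIEW_THRESHOLD or requested_by_team:
        return PostmortemKind.FULL_REVIEW
    return PostmortemKind.AUTOMATED_REPORT
```

    For example, `postmortem_kind(3)` returns `PostmortemKind.AUTOMATED_REPORT`, while `postmortem_kind(3, requested_by_team=True)` returns `PostmortemKind.FULL_REVIEW`.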

    5. Devil’s in the Details - incident metrics and other key information captured

    Capturing as many details as possible about what happened and what was done during the incident helps teams avoid ambiguity. Details such as links to tickets, status updates, and incident state documents like monitoring charts, along with screenshots and relevant graphics or dashboards, become a powerful data set that captures the fine details of an incident.

    Along with summarizing key details, it is crucial to capture incident-related metrics that attach hard numbers to incidents and their impact. Metrics such as Mean Time to Resolution (MTTR), the SLO, the extent of the SLO breach, error budget consumed, incident severity and minutes of downtime can be considered for postmortem tracking. With consistent measurement of these metrics, you can analyze incident trends over time.
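
    A minimal sketch of how a few of these metrics can be computed is shown below. It assumes an availability-style SLO measured over a fixed window (30 days by default) and treats resolution time as the gap between an incident's start and its resolution. The function names and the window length are illustrative assumptions.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - started) over a non-empty list of incidents."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO.

    For example, a 99.9% target over 30 days allows about 43.2 minutes.
    """
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(downtime_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Fraction of the error budget used; a value above 1.0 means the SLO was breached."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)
```

    Recording these values in every postmortem, computed the same way each time, is what makes trend analysis across incidents meaningful.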

    The key to conducting effective incident postmortems that help you improve your team and systems is to have a process and stick to it. Making sure it is effective requires commitment at all levels of the organization.

    6. Publish and track postmortems promptly

    Once the postmortem review meeting is completed, the final but important step is to publish the postmortem promptly and distribute it as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings along with a link to the full report.

    Google states that “A prompt postmortem tends to be more accurate because information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!”

    Regular application of these practices results in better system design, less downtime, and more effective and happier engineers.

    Related Reading

    Many resources are available if you want to learn more about how to conduct effective postmortems. Here are a few of our suggestions:

    Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.

      Copyright © Squadcast Inc. 2017-2024