An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.
As our systems grow in scale and complexity, outages are inevitable, no matter how hard we try to provide uninterrupted services. When an outage occurs, the most important and immediate step is, of course, fixing the underlying issue and keeping the relevant stakeholders and customers informed. Many incidents can be rectified quickly with tools like infrastructure automation, runbooks, feature flags, version control, and continuous delivery, and people can be kept in the loop with ChatOps and status pages. These actions, though useful for fixing the situation at hand, do not really help us understand what failed and why. And understanding what failed and why is a crucial step towards preventing similar occurrences going forward.
This is where incident postmortems come in - the next logical step after any incident is to dissect and analyze the why, the how and the what of the incident. Ideally, this should be done for every single incident, not just the high-severity or high-impact ones.
An incident postmortem is a report that records the details of an incident, the impact it had on the service, the team that was assembled to address the event, the immediate steps taken to mitigate the damage, the actions taken to resolve the incident, and the lessons learnt that can help the team minimize the impact of future incidents. These lessons can in turn affect how you think about a particular component of your system, or sometimes just how mitigation steps could be done faster in specific cases, which is a big deal, to say the least.
An incident postmortem is also a process that takes place after an incident occurs: analyzing the incident and identifying its root causes so that they can be addressed before they lead to future incidents. It is important to understand what an incident postmortem is, why it matters, and how it should be conducted.
There are several reasons why doing an incident postmortem is incredibly important:
Incident postmortems are also called RCAs (Root Cause Analysis) or incident reviews. At Squadcast, we prefer the term Incident Review, but to keep this easier to digest, we are going to refer to them by the more popular “Incident Postmortem” for the rest of this article. When it comes to an incident postmortem, there is no one-size-fits-all approach or even a universally accepted standard. The postmortem process varies across organizations, and sometimes even within companies, from casual to highly formal, depending on the size and culture of the teams, the nature of the product, and the severity of the incident.
Regardless of the name and the approach, the end goal remains the same - to keep relevant stakeholders informed and to treat the incident as a learning opportunity, not only to fix a weakness but to make systems more resilient as a whole. Gathering information for the postmortem can take considerable time and effort, and the postmortem meeting (if needed) might occur days, or even weeks, after the actual incident, depending on its severity.
A typical postmortem process covers the following aspects, in no particular order:
“Blameless postmortems are a tenet of SRE culture. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems”
A critical factor in the success of incident postmortems is that they are blameless. A culture that seeks to point fingers at the person who may have caused an outage through error or omission is unlikely to get truthful answers during a review, which negates the intent behind having an incident postmortem in the first place.
Through blameless postmortems, the aim is to have a nurturing environment where every “mistake” is seen as an opportunity to strengthen the system. Blameless postmortems shift the focus from allocating blame to investigating the underlying causes of an outage that an individual or team faced, and to the prevention plans that can be put in place.
Many teams, including Google and us here at Squadcast, have adopted the culture of the blameless postmortem, which paves the way to building resilience in teams and systems.
Blameless postmortems can be challenging to write, since the postmortem format clearly identifies the actions that led to the incident. However, removing blame from a postmortem gives the team the confidence to escalate issues without fear. The next section outlines the steps that can be taken to conduct effective blameless postmortems.
To help teams develop a culture around blameless incident postmortem reviews, it also helps to give them an easy, automated way to capture incident information and publish the final report using reusable checklists and templates, which can make incident postmortem meetings less dreadful. In fact, having an automated timeline and templates that are auto-populated with incident metrics and other details as part of your incident management tool can make the process more consistent and productive for every incident that occurs.
For postmortems to be blameless and effective at reducing recurring incidents, the review process should encourage teams to identify root causes and fix them. A well-conducted postmortem allows teams to come together to achieve better goals in a less stressful environment. The exact method can depend on team culture.
Here are a few best practices that can ensure the effectiveness of postmortems:
1. Start with an incident timeline
An effective postmortem meeting should be built around a timeline of significant activity, drawn from chat conversations, incident details and more, and that timeline should be prepared before the meeting. You can streamline the entire postmortem process with automated incident timeline building, collaborative editing, and actionable insights, and formalize your own postmortem process to make it as easy as possible for your team to respond to issues.
The goal is to understand all contributing root causes and to document the incident for pattern discovery, which allows you to set better context during the postmortem meeting. This step also plays a key role in enacting effective preventative actions that reduce the likelihood or impact of recurrence.
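As a rough illustration of automated timeline building, the sketch below merges events from several sources into one chronological list. The `TimelineEvent` type, the source labels, and the sample events are all hypothetical, invented for this example; a real incident management tool would pull these from its own integrations.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the incident timeline (hypothetical structure)."""
    timestamp: datetime
    source: str   # illustrative labels such as "chat", "alert", "deploy"
    summary: str


def build_timeline(*sources: List[TimelineEvent]) -> List[TimelineEvent]:
    """Merge events from any number of sources and sort them by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


def format_timeline(events: List[TimelineEvent]) -> str:
    """Render the timeline as plain text, one event per line."""
    return "\n".join(
        f"{e.timestamp:%H:%M} UTC [{e.source}] {e.summary}" for e in events
    )


def at(hour: int, minute: int) -> datetime:
    """Helper for the made-up sample data below."""
    return datetime(2024, 1, 5, hour, minute, tzinfo=timezone.utc)


# Made-up sample events, purely for illustration.
alerts = [TimelineEvent(at(10, 2), "alert", "API 5xx rate above threshold")]
chat = [
    TimelineEvent(at(10, 5), "chat", "On-call acknowledges the alert"),
    TimelineEvent(at(10, 40), "chat", "Rollback confirmed healthy"),
]
deploys = [
    TimelineEvent(at(9, 58), "deploy", "Release rolled out"),
    TimelineEvent(at(10, 31), "deploy", "Rollback started"),
]

timeline = build_timeline(chat, alerts, deploys)
print(format_timeline(timeline))
```

Keeping every event in one sorted list makes it easier to see, for example, that the deploy preceded the first alert, which is the kind of context the meeting can then start from.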
2. Conduct a postmortem meeting with anyone internal to the team who was affected by the incident
Bringing together the people affected by an incident in a structured, collaborative way lets them contribute more cohesively to the postmortem meeting by sharing what they learnt from the incident. This also helps build trust and resilience within teams. The formal incident postmortem document, which records the details of the incident along with how the team remedied it, can help teams handle future incidents.
At this step, a formal template can help you record all key details and build consistency across all your incident postmortems.
At Squadcast we use our own incident postmortem feature that helps build an insightful timeline in a matter of minutes. This is especially useful because automation ensures that you can quickly have a system-generated postmortem for pretty much any incident, big or small. There are also a few predefined postmortem templates available from the likes of Google, Azure, and others. You can also choose to create new templates or modify existing ones. What’s more, these are available to download in MD and PDF formats!
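To make the idea of a reusable template concrete, here is a minimal sketch that fills a Markdown postmortem template using Python's standard `string.Template`. The section names and fields are assumptions chosen for illustration; they are not Squadcast's, Google's, or Azure's actual template.

```python
from string import Template
from typing import List

# Hypothetical template layout; adapt the sections to your own process.
POSTMORTEM_TEMPLATE = Template("""\
# Postmortem: $title

- Severity: $severity
- Owner: $owner
- Impact: $impact

## Timeline
$timeline

## Contributing causes
$causes

## Action items
$actions
""")


def bullet_list(items: List[str]) -> str:
    """Format a list of strings as Markdown bullets."""
    return "\n".join(f"- {item}" for item in items)


def render_postmortem(title: str, severity: str, owner: str, impact: str,
                      timeline: List[str], causes: List[str],
                      actions: List[str]) -> str:
    """Fill every placeholder in the template and return Markdown text."""
    return POSTMORTEM_TEMPLATE.substitute(
        title=title,
        severity=severity,
        owner=owner,
        impact=impact,
        timeline=bullet_list(timeline),
        causes=bullet_list(causes),
        actions=bullet_list(actions),
    )


# Made-up incident details, purely for illustration.
report = render_postmortem(
    title="Checkout API outage",
    severity="Sev 1",
    owner="On-call SRE",
    impact="Checkout requests failed for 42 minutes",
    timeline=["09:58 Release rolled out", "10:02 Alert fired",
              "10:40 Rollback confirmed healthy"],
    causes=["Configuration change shipped without a canary stage"],
    actions=["Add a canary stage to the deploy pipeline"],
)
print(report)
```

Because `Template.substitute` raises `KeyError` when a placeholder has no value, a report with a missing section fails loudly instead of being published incomplete, which supports the consistency a template is meant to provide.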
3. Define roles and owners along with having a moderator
Another key aspect to keep in mind during a postmortem meeting is to have well-defined roles and owners, along with a moderator who can ensure the meeting stays on track and avoids any hint of a “blamestorming” session. It is helpful to have guidelines for the owners of the postmortem process on how the meetings should be run.
The owner of the review is tasked with managing the meeting and chronicling the subsequent report. It is advisable that the owner be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact. In most cases, the moderator is the owner of the incident review and is responsible for maintaining order and giving every participant the chance to speak.
4. Determine the urgency of an incident by setting the right thresholds
Not all incidents are equal. Each incident in an organisation should be associated with a measurable severity level based on the impact it has on the business and its customers. Associating incidents with the right severity level can help you prioritize your postmortem process. For instance, incidents of Sev 1 or higher definitely require a postmortem, while for less severe incidents, postmortems can be automated with a tool like Squadcast.
That said, if need be, teams should also be provided with an option to request a postmortem for any incident that doesn't meet the threshold.
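The threshold rule described in this step can be sketched as a small decision function. This is only an illustration under stated assumptions: severities are integers where a lower number means a more severe incident (Sev 1 being the most severe), the threshold is Sev 1, and the names `ReviewKind` and `review_kind` are hypothetical.

```python
from enum import Enum


class ReviewKind(Enum):
    FULL = "full postmortem with a review meeting"
    AUTOMATED = "automated postmortem report"


def review_kind(severity: int, requested: bool = False,
                threshold: int = 1) -> ReviewKind:
    """Decide what kind of postmortem an incident gets.

    Assumption: lower numbers are more severe, so an incident meets the
    threshold when severity <= threshold. A team can still request a full
    postmortem for an incident that does not meet it.
    """
    if severity < 1:
        raise ValueError("severity levels start at 1 in this sketch")
    if severity <= threshold or requested:
        return ReviewKind.FULL
    return ReviewKind.AUTOMATED


print(review_kind(1).value)                  # meets the threshold
print(review_kind(3).value)                  # below the threshold
print(review_kind(3, requested=True).value)  # team asked for a review
```

Making the threshold a parameter keeps the policy in one place, so it can be tightened or relaxed without touching the rest of the process.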
5. Devil’s in the Details - incident metrics and other key information captured
Capturing as many details as possible about what happened and what was done during the incident helps teams avoid ambiguity. Details such as links to tickets, status updates, and incident state documents like monitoring charts, along with screenshots and relevant graphics or dashboards, become a powerful data set that captures the fine details of an incident.
It is also crucial that, along with summarizing key details, you capture important incident-related metrics that let you associate hard numeric data with incidents and their impact. Metrics such as Mean Time to Resolution (MTTR), the SLO, the extent of the SLO breach, error budget consumed, severity of the incident, and number of minutes of downtime can be considered for postmortem tracking. With consistent measurement of these metrics, you can analyze incident trends over time.
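To show how a few of these metrics relate, here is a minimal sketch that computes MTTR, total downtime, and the share of an error budget consumed. It assumes a time-based availability SLO of 99.9% over a 30-day window and treats each incident as a (started, resolved) pair; the function names and incident data are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (started, resolved)


def mean_time_to_resolution(incidents: List[Incident]) -> timedelta:
    """Average time from the start of an incident to its resolution."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def downtime_minutes(incidents: List[Incident]) -> float:
    """Total downtime across all incidents, in minutes."""
    seconds = sum((resolved - started).total_seconds()
                  for started, resolved in incidents)
    return seconds / 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime for a time-based availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Made-up incidents: 42 minutes and 18 minutes of downtime.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 42)),
    (datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 14, 18)),
]

mttr = mean_time_to_resolution(incidents)
downtime = downtime_minutes(incidents)
budget = error_budget_minutes(slo_target=0.999)  # assumed 99.9% SLO
consumed = downtime / budget                     # above 1.0 means a breach

print(f"MTTR: {mttr}")
print(f"Downtime: {downtime:.0f} min of a {budget:.1f} min budget")
print(f"Error budget consumed: {consumed:.0%}")
```

In this example the two incidents together exceed the monthly budget, so the SLO is breached even though neither incident was long on its own, which is the kind of trend that only shows up when the metrics are tracked consistently.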
The key to conducting effective incident postmortems that can help you improve your team and systems is to have a process and stick to it. Making sure it is effective requires commitment at all levels of the organization.
6. Publish and track postmortems promptly
Once the postmortem review meeting is completed, the final but important step is to publish the postmortem promptly and distribute it as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings along with a link to the full report.
Google states that “A prompt postmortem tends to be more accurate because information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!”
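As one possible way to standardize the announcement, the sketch below builds the summary email with Python's standard `email.message.EmailMessage`. It only constructs the message; sending it is left to whatever mail system you use. The addresses, URL, and function name are placeholders invented for this example.

```python
from email.message import EmailMessage
from typing import List


def build_postmortem_email(title: str, summary: str, learnings: List[str],
                           report_url: str, sender: str,
                           recipients: List[str]) -> EmailMessage:
    """Build a plain-text email announcing a published postmortem."""
    message = EmailMessage()
    message["Subject"] = f"Postmortem published: {title}"
    message["From"] = sender
    message["To"] = ", ".join(recipients)

    learnings_text = "\n".join(f"- {item}" for item in learnings)
    message.set_content(
        f"{summary}\n\n"
        f"Key learnings:\n{learnings_text}\n\n"
        f"Full report: {report_url}\n"
    )
    return message


# Placeholder values, purely for illustration.
email = build_postmortem_email(
    title="Checkout API outage",
    summary="Checkout requests failed for 42 minutes on 5 January.",
    learnings=["Deploys need a canary stage",
               "Rollback runbook was out of date"],
    report_url="https://example.com/postmortems/checkout-api",
    sender="sre@example.com",
    recipients=["engineering@example.com", "support@example.com"],
)
print(email)
```

Generating the message from the same data as the report keeps the summary, the key learnings, and the link consistent, and makes prompt publication a one-step task.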
Regular application of these practices results in better system design, less downtime, and more effective and happier engineers.
If you are interested in learning more about how to conduct effective postmortems, there are many resources out there that you may want to check out. Here are a few of our suggestions:
Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.