Incidents that disrupt services are unavoidable, but every breakdown is an opportunity to learn and improve. This blog is a deep dive into best practices to follow across the lifecycle of an incident, helping teams build a sustainable and reliable product the SRE way.
As the saying goes, “Every problem we face is a blessing in disguise”. Along similar lines, every incident in system infrastructure helps product development and engineering teams better understand the capabilities of the system architecture. This, in turn, helps organizations build a sustainable and reliable product.
In this blog, we break down the complexities of handling an incident into a well-structured format, with the intent of helping you handle every incident effectively.
ITIL 2011 defines an incident as:
“an unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a Configuration Item that has not yet impacted an IT service [but has potential to do so]”
Clearly, in order to maintain acceptable service levels, it is important to resolve incidents and restore normal services as quickly as possible.
ITIL defines a standard lifecycle of an incident. While the actual activities that occur during each phase have changed over time, it is still a good starting point for a detailed description of incidents.
Incidents are identified through reports from monitoring systems or by manual identification. Once an incident is identified, it is logged. An incident log can be used to validate that all incidents have been addressed and to identify trends. At this point, the incident is categorized by adding information such as severity, functional area, and ownership. These three activities (identification, logging, and categorization) were once the responsibility of a first-level monitoring technician; nowadays they are normally automated.
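To make the logging and categorization steps concrete, here is a minimal sketch of how they might be automated. The severity thresholds, the `OWNERS` mapping, and the team names are all hypothetical, chosen only for illustration; they are not defined by ITIL.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping from functional area to owning team.
OWNERS = {"payments": "payments-oncall", "auth": "identity-oncall"}


@dataclass
class Incident:
    title: str
    source: str  # "monitoring" or "manual"
    severity: str = "unclassified"
    functional_area: str = "unknown"
    owner: str = "unassigned"
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def categorize(incident, error_rate, area):
    """Attach severity, functional area, and ownership to an incident."""
    # Thresholds are illustrative assumptions, not standard values.
    if error_rate >= 0.5:
        incident.severity = "SEV1"
    elif error_rate >= 0.1:
        incident.severity = "SEV2"
    else:
        incident.severity = "SEV3"
    incident.functional_area = area
    incident.owner = OWNERS.get(area, "default-oncall")
    return incident


# The incident log: every identified incident is appended here so that
# it can later be checked for closure and mined for trends.
incident_log = []


def log_incident(title, source, error_rate, area):
    """Log a newly identified incident and categorize it in one step."""
    incident = categorize(Incident(title, source), error_rate, area)
    incident_log.append(incident)
    return incident
```

For example, `log_incident("Checkout errors", "monitoring", 0.6, "payments")` would record a SEV1 incident owned by the hypothetical `payments-oncall` team.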
This stage is about notifying the right people to address an incident. In many modern environments, identifying the correct responders can be a complex process. Similarly, many organizations have elaborate escalation processes to bring in specialists or subject-matter experts (SMEs) when the initial responders need help. Modern incident management systems can reduce turnaround times by using rules to automate this.
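As a rough illustration of rule-based notification and escalation, the sketch below picks an escalation chain from the first matching rule and moves one step up the chain for every interval the incident stays unacknowledged. The rules, responder names, and the 15-minute step are assumptions made for this example and do not reflect any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    matches: Callable[[dict], bool]  # predicate over incident attributes
    responders: List[str]            # ordered escalation chain


# Hypothetical routing rules, evaluated in order; first match wins.
RULES = [
    Rule(lambda i: i["area"] == "database", ["db-oncall", "db-sme"]),
    Rule(lambda i: i["severity"] == "SEV1",
         ["primary-oncall", "incident-commander"]),
]
DEFAULT_CHAIN = ["primary-oncall", "engineering-manager"]


def escalation_chain(incident):
    """Return the escalation chain of the first rule that matches."""
    for rule in RULES:
        if rule.matches(incident):
            return rule.responders
    return DEFAULT_CHAIN


def who_to_notify(incident, minutes_unacknowledged, step_minutes=15):
    """Pick the responder for the current escalation level.

    The level rises by one for every `step_minutes` without an
    acknowledgement and stops at the end of the chain.
    """
    chain = escalation_chain(incident)
    level = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[level]
```

Keeping the rules as data rather than scattered conditionals makes it easier to review and change who gets paged without touching the escalation logic.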
Once notified, incident responders gather information about the incident using observability tools. In addition to the current state of the system, root cause analyses (RCAs) of similar past incidents can be valuable sources of data. This information is used to build a hypothesis about the probable cause of the incident and to decide on a fix.
The responder team applies the fix proposed in the previous step and, typically, observes the system for a little while to confirm that the incident has been resolved. It often takes several iterations of trial and error before an incident is resolved. Each trial provides more information to refine the hypothesis and formulate better fixes.
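The "observe for a little while" step can be made explicit. One possible approach, sketched here with assumed values, is to treat the fix as confirmed only after several consecutive healthy samples of a key metric. The error-rate metric, the 1% threshold, and the three-sample window are illustrative choices, not recommendations from ITIL.

```python
def is_recovered(samples, threshold=0.01, required_healthy=3):
    """Return True if the most recent samples are all healthy.

    `samples` is a list of error rates in chronological order. The fix
    is considered confirmed only when the last `required_healthy`
    samples are all below `threshold`.
    """
    if len(samples) < required_healthy:
        return False  # not enough observation time yet
    return all(s < threshold for s in samples[-required_healthy:])
```

Requiring a run of healthy samples, rather than a single good reading, guards against declaring success during a brief lull; if the check fails, the team returns to the hypothesis with the new data.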
The incident is marked closed when confirmation is received that normal services have resumed. The definition of confirmation varies but it is often wise to use multiple independent confirmations, for example:
An important part of incident closure is deciding and logging follow-up actions. This usually requires a postmortem that does an RCA and a process review of the incident. The process review will generate follow-up steps to improve the incident management process. The RCA will determine if
The incident lifecycle gives a clear picture of the activities an incident management team carries out when it encounters an incident. Now let’s look at the best practices a team should adopt to make incident management a less stressful activity.
The ITIL incident lifecycle provides a way to handle an incident, but best practices come only from extensive practical experience in managing incidents. This section is about keeping an incident management team productive through a structured approach. The following practices help a team work efficiently and avoid burnout.
The first step is to delegate the work involved among all team members. Handling incidents requires everyone to know who is responsible for which work. Clear information about each individual's roles and responsibilities helps team members make key decisions independently. The basic roles in handling an incident are:
The incident command system was originally formulated in 1968 by a fire disaster response team to delegate roles and responsibilities to every member of a team. It was later adopted for managing incidents across software and cloud infrastructure systems.
The incident response framework revolves around the three Cs, the goals of effective incident management. They are:
This is about delegation of roles among an incident management team.
This stage is about setting up a designated war room, a centralized space where team members can coordinate with each other to resolve an incident faster. Here, the team can use Slack, telephone, or video conferencing to maintain and record a communication log of incident-related traffic and alerts.
This stage is about the incident commander maintaining a live incident document in which all details of an incident are recorded diligently. This live document can be hosted on a wiki and must be accessible to every other team member, enabling them to contribute data about the incident. This practice ensures transparency among team members and stakeholders.
This happens when the incident responders need to change during an ongoing incident. This could be because their shift has ended or because they are exhausted. When the team changes, whatever work each member was doing must be seamlessly handed over to the new team. This includes the overall status, the progress of the investigation or corrective action, and more. A real-time incident state document is invaluable for this.
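A handover is much easier when the incident state document has a predictable shape. The sketch below is one hypothetical structure for such a document, with a helper that produces a handover note and records the change of commander in the timeline. The field names and note format are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentState:
    summary: str
    status: str                      # e.g. "investigating", "mitigating"
    commander: str
    open_actions: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)


def hand_over(state, new_commander):
    """Build a handover note and transfer command to `new_commander`."""
    # Build the note first so it names the outgoing commander.
    lines = [
        f"Incident: {state.summary}",
        f"Status: {state.status}",
        f"Outgoing commander: {state.commander}",
        f"Incoming commander: {new_commander}",
        "Open actions:",
    ]
    lines += [f"  - {action}" for action in state.open_actions]
    # Record the handover so the timeline stays complete.
    state.timeline.append(
        f"Handover: {state.commander} -> {new_commander}"
    )
    state.commander = new_commander
    return "\n".join(lines)
```

Because the note is generated from the same document the team has been updating throughout the incident, the incoming responders see the overall status and every open action without relying on anyone's memory.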
Finally, this stage is about putting into practice the incident management strategies that have helped in resolving past incidents. Doing so goes a long way toward reducing the mean time to recovery and sparing the incident management team unnecessary stress. Some of them are:
After every non-trivial incident, it is important to run a postmortem. There are some important outcomes of a good postmortem:
The outcomes are achieved by reviewing the incident and identifying its root cause.
When postmortems are focussed on assigning responsibility (i.e., blame), most participants will be primarily concerned with not being blamed. Conversely, a focus on what went wrong allows the participants to be more objective and less worried about protecting themselves. It also recognizes that humans make mistakes and that it is more effective to address the circumstances that contribute to errors than to seek humans who don’t make mistakes.
There is no value in postmortems if they do not generate results. Track and reward postmortem outcomes:
Postmortems without outcomes or action items are usually a sign that they’re ineffective.
The lessons from a postmortem are wasted if they are not applied to all systems and teams across the organization. Sharing and transparency help ensure that the lessons learned percolate throughout the organization. Some steps to encourage transparency:
Signs of a failing postmortem culture must be immediately addressed. Culture is not a set of principles in a document but behavior that is rewarded or penalized. Some failings are:
Additional reading: Towards More Effective Incident Postmortems
Incidents are common events that should be handled in a standard pattern. ITIL defines a good template to follow. A few good practices can really help improve the effectiveness of an incident management process:
We hope this blog gives you a better and deeper understanding of the best practices to follow during the lifecycle of an incident, enabling you to handle critical incidents in your organization without much hassle or burnout.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.