In an always-on world, companies look to systems & processes to keep their services up & running at all times. Squadcast's latest post outlines a few best practices in incident management to restore services during unplanned downtime.
In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime.
Incident Management processes are typically used by SRE, DevOps, NOC and other IT teams to respond to incidents that affect services and to restore uptime. Teams that follow ITIL and ITSM practices have similar processes in place, with slightly different terminology.
For the purpose of categorizing the different aspects of Incident Management, we can go over the different stages of an Incident Lifecycle.
What is IT incident management?
Incident management is the process of managing an event that disrupts the normal function of a system, network, or process. Incidents can be caused by hardware or software problems and can result from a single event or a series of events. While the process varies with the size of the organization, most organizations handle incidents through a series of processes that share one goal: to identify the root cause of the incident and take corrective action.
An organization’s Incident Management process is meant to tie these stages together seamlessly and cover the entire lifecycle of the incident, from incident trigger to post-incident reviews and postmortems. It is also important to note that these practices are meant to be dynamic, evolving constantly with the people, systems, and architectures involved.
This post outlines some best practices to keep in mind while implementing or improving your processes.
Incident Detection & Classification:
The initial details you receive about an incident while on-call save a lot of time in the triage and mitigation process. Configuring the right data fields and Event Tags to automate this level of classification is a must.
Set up Deduplication rules to group all similar alerts together. This also ensures that your on-call team is not notified for the same incident repeatedly.
In the details field, send only the vital information that can assist in remediation.
Make sure you add any other important data manually as a part of the classification process after the team has been alerted.
Make sure to send alerts only for relevant and actionable events even if other events are also being sent into your incident management tool.
Make sure you configure Deduplication and Suppression Rules so that you are not notified for unimportant alerts. These could otherwise cause severe alert fatigue and hurt your team’s response times and productivity.
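As a rough illustration of how deduplication and suppression rules might behave, here is a minimal Python sketch. It is not how Squadcast or any particular tool implements these rules. The `AlertGrouper` class, the alert fields (`service`, `check`, `tags`) and the fixed time window are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Incident:
    dedup_key: str
    first_seen: float
    last_seen: float
    alert_count: int = 1


class AlertGrouper:
    """Hypothetical grouper: suppresses tagged alerts, merges repeats into one incident."""

    def __init__(self, window_seconds: float, suppressed_tags: Set[str]) -> None:
        self.window_seconds = window_seconds
        self.suppressed_tags = suppressed_tags
        self.open_incidents: Dict[str, Incident] = {}

    def ingest(self, alert: dict, now: float) -> Optional[Incident]:
        """Return a new Incident when responders should be notified, else None."""
        # Suppression: drop alerts carrying any tag marked as non-actionable.
        if self.suppressed_tags & set(alert.get("tags", [])):
            return None

        # Deduplication: alerts for the same service and check share one key.
        key = f'{alert["service"]}:{alert["check"]}'
        incident = self.open_incidents.get(key)
        if incident is not None and now - incident.last_seen <= self.window_seconds:
            # Same problem seen again inside the window: count it, do not re-notify.
            incident.alert_count += 1
            incident.last_seen = now
            return None

        # First occurrence (or the window has lapsed): open a fresh incident.
        incident = Incident(dedup_key=key, first_seen=now, last_seen=now)
        self.open_incidents[key] = incident
        return incident
```

With a 300-second window, a second `http_5xx` alert for the same service arriving a minute after the first is folded into the existing incident, so the on-call engineer is paged once rather than twice.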
A crucial form of incident classification is prioritization. This helps the on-call team understand the severity of the issue at first glance. Configure automation to assign priority to every incident routed to your alerting tool.
The prioritization matrix followed by an organization should always be linked to service and customer impact. This gives the on-call team the clarity needed to understand the situation.
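One way to picture such a matrix is as a simple lookup keyed on service impact and customer impact. The sketch below is only an example: the three impact levels, the `P1` to `P4` labels and the specific cell values are assumptions, and every organization will define its own.

```python
LEVELS = ("low", "medium", "high")

# Hypothetical matrix. Rows: service impact. Columns: customer impact
# in the order given by LEVELS (low, medium, high).
PRIORITY_MATRIX = {
    "low":    ("P4", "P3", "P2"),
    "medium": ("P3", "P2", "P1"),
    "high":   ("P2", "P1", "P1"),
}


def assign_priority(service_impact: str, customer_impact: str) -> str:
    """Look up the priority for an incident; unknown levels raise an error."""
    column = LEVELS.index(customer_impact)  # ValueError on an unknown level
    return PRIORITY_MATRIX[service_impact][column]  # KeyError on an unknown level
```

Keeping the matrix as plain data makes it easy to review with stakeholders and to change without touching the lookup logic.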
Triage and Collaboration:
Configure your incident routing and escalation policies to always reach the right responder. Assign tags to indicate severity or priority and configure routing rules to ensure that the first responder is always the right responder.
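The routing and escalation idea above can be sketched in a few lines of Python. This is an illustrative model only: the tag names (`service`, `priority`), the responder names and the timings are hypothetical, and real tools express these policies through their own configuration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    responder: str      # hypothetical on-call schedule or person


# Hypothetical policies keyed by (service tag, priority tag).
POLICIES: Dict[Tuple[str, str], List[EscalationStep]] = {
    ("payments", "P1"): [
        EscalationStep(0, "payments-primary"),
        EscalationStep(10, "payments-secondary"),
        EscalationStep(20, "engineering-manager"),
    ],
}

# Fallback used when no routing rule matches the incident's tags.
DEFAULT_POLICY: List[EscalationStep] = [
    EscalationStep(0, "platform-primary"),
    EscalationStep(30, "platform-secondary"),
]


def select_policy(tags: Dict[str, str]) -> List[EscalationStep]:
    """Routing rule: pick the escalation policy that matches the incident's tags."""
    key = (tags.get("service", ""), tags.get("priority", ""))
    return POLICIES.get(key, DEFAULT_POLICY)


def notified_so_far(policy: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Everyone who should have been notified after the given unacknowledged time."""
    return [s.responder for s in policy if s.after_minutes <= minutes_unacknowledged]
```

For an unacknowledged `P1` incident on the `payments` service, this model notifies the primary immediately and the secondary after ten minutes, while incidents without a matching rule fall back to the default policy.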
In a high-fire situation, the ease with which you can communicate can make or break the customer perception and ultimately the impact on your bottom line. Having a platform-specific collaboration space can reduce the time taken to assemble elsewhere to discuss the incident.
If you use Slack for this, make sure there is a dedicated channel for all incident-related discussion. Keeping the conversation in one place helps reduce MTTR.
It’s important to keep both customers and customer-facing internal teams in the know of all mitigation activities. This is easier when you automate all communication updates and manage it from one place.
Add the relevant teams as stakeholders so they can see what is being done to mitigate an incident. You can also provide additional details on a private status page for internal teams.
Maintain a Public Status Page and update it constantly. The first thing a user does when facing service issues is look at your status page, so make sure it has all the essential information a user needs to understand the impact of the issue.
Automate as much as possible. Connect your tools to take action directly from within the incident management platform itself. Little steps go a long way.
Document any attempts at resolution or mitigation as soon as you have taken the steps. What you perceive to be a small problem might not be small for someone else on your team.
Maintain a repository of Runbooks and RCAs / Incident Reviews for you and your team to go back and review resolution steps for similar incidents in the future.
Drive a collaborative Incident Review process complete with Root-Cause-Analysis (RCA) to get to a fine-grained understanding of any incident as quickly as possible.
Always run an Incident Review process for Medium and High severity incidents. Remember to be Blameless. At the time of a crisis, it’s important to focus on the `What`, `Why`, `How` and `What Next` rather than the `Who`.
Maintain a checklist of tasks that have to be completed for longer-term remediation.
Ensuring that you learn from every incident should be the biggest takeaway from your Incident Response process.
Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution, and create a knowledge base to effectively handle incidents.