A few minutes of unexpected downtime can have catastrophic effects! Having a great incident response plan is more than a luxury - it is a necessity for organisations of all sizes today. This blog outlines key activities that can help you formulate a better incident response plan.
Picture this scenario - your organisation has suffered a catastrophic outage, phones are ringing off the hook and customers are ranting online. Unfortunately, you do not have a reliable plan to deal with this unexpected event. Already under significant pressure, you start throwing resources at the problem. However, without a proper incident response plan in place, the remediation process is haphazard and inefficient. This further increases the time it takes to respond and leaves your already unhappy customers in the dark.
In this blog we answer some of the major questions that come to mind while designing and implementing your incident response plan. Building an intelligent incident response plan is not something that can be achieved overnight. It requires commitment, planning and quite a few tweaks before it works well for your organisation. Like a whole built from interlocking parts, an incident response plan is only as good as the parts it is composed of. But everyone needs a place to start. This blog looks at the benefits, the things to consider while creating the plan, and the tools you need. While writing this blog we have also strived to be industry, domain and tool agnostic, so our recommendations should also work with your existing on-call setup.
Why do you need an incident response plan?
A major incident that affects the lion’s share of your customers is inevitable. Even the most carefully architected system is likely to break down under unforeseen circumstances. These outages can have a crippling effect on your reputation, from which it may be difficult to recover. An incident response plan is the foundation from which you can start the response and recovery process after a major outage. A great incident response plan needs to take post-outage communication strategies into account as well. A notable example of effective communication after a major outage can be found in Slack’s blog post addressed to their users.
Building a great incident response plan
Analysis: The first step is to take a long hard look at your IT environment and take stock of dependencies, points of failure and other bottlenecks that will impede recovery. This may include taking stock of human factors as well.
How well do your Ops and Dev teams know the environment?
Who are the most important users of your product and what does their typical usage workflow look like?
Preparation: Estimate the disruption that various possible IT failures would cause to your business.
What are the implications of a downtime of critical infrastructure on the business?
What happens if the DNS service gets knocked out?
Are there sufficient redundancies in place to handle an outage if your internet vendor goes offline?
In the absence of key Ops personnel, can we train existing engineers to perform emergency triage?
In this phase it is also worthwhile to keep track of the vital commitments, such as SLAs and SLOs, that your organisation will be expected to uphold.
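To make an SLO concrete during the preparation phase, it can help to translate it into a downtime budget. The following is a minimal sketch in Python, assuming an availability target measured over a 30-day window. The function name and the example targets are illustrative assumptions, not figures from any particular agreement.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime allowed in the window under the given SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    # The budget is the fraction of the window that is allowed to be unavailable.
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

A number like this gives the Ops team a shared yardstick when deciding how much redundancy a given dependency deserves.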
Simulating Scenarios: This part involves making plans for the most likely disaster scenarios. Some of the things to consider may include:
What steps are to be taken while informing customers about the outage?
What are the legal and compliance issues that will need to be addressed for each possible outage?
Dry Runs of Catastrophic Outages: Now that you have the skeleton of an incident response plan in place, it's time to have dry runs and see how well your team performs. Some of the things to keep in mind:
How well does the on-call team perform under pressure?
Are the tools being used for collaboration effective?
Are there things that can be automated?
Learning / Retrospectives: This is the part where you compile the learnings from your dry runs.
Were the non-technical stakeholders kept in the loop during the incident response process?
What are the improvements you can make to your existing plan?
Do you still need an incident response plan if you are a startup / smaller organisation?
For a smaller organisation, an effective incident response plan is even more essential, since any outage can potentially lead to a loss of trust. Startups may see a sudden increase in their number of users, and as you scale up your infrastructure, your incident response plan needs to adjust accordingly.
Even the smallest of startups (a 2-4 person team) can start thinking about improving their incident response process. While the plan may be less formal than one for a larger organisation, processes like documentation and runbook automation will still have a large impact. For example, suppose you are a small startup in charge of looking after the technical infrastructure of a large financial services organisation. Documenting your incident response process with the help of automated incident timelines, and providing observability to the larger organisation with role-based access, can help you avoid potential liabilities in the future. In many sectors, especially finance and cybersecurity, compliance rules require proper documentation of every major incident that occurs.
What are the immediate advantages of having a great incident response plan?
Quicker resolution of major incidents: This is the most obvious benefit, but one that needs to be mentioned nonetheless. A clear incident response plan can reduce MTTR (Mean Time to Resolve) for outages. When multiple outages come from a specific technical area, it also becomes easier to pinpoint whether the cause is a technical issue or a problem with the on-call team.
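To track whether your plan is actually shortening resolution times, you can compute MTTR from your incident records. Here is a minimal sketch in Python, assuming each incident is stored as a pair of detection and resolution timestamps. The record format and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the (resolved - detected) duration over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident records: (detected_at, resolved_at)
SAMPLE_INCIDENTS = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 45)),
    (datetime(2024, 1, 12, 22, 30), datetime(2024, 1, 13, 0, 0)),
    (datetime(2024, 2, 2, 14, 15), datetime(2024, 2, 2, 14, 30)),
]

if __name__ == "__main__":
    print(mean_time_to_resolve(SAMPLE_INCIDENTS))
```

Grouping the same calculation by service or by on-call rotation is one way to see where the longest resolutions cluster.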
More organised on-call teams: Your on-call team knows what is expected of them during a major outage. They are aware of the documentation that needs to be kept at each stage of the incident resolution process, when to escalate incidents and the follow-ups required. Less time is wasted on deciding responsibilities and remediation measures.
Standardised processes and documentation: Having a standardised process in place helps you categorise and evaluate your response to major incidents. Improvements in the performance of your on-call team can be better understood, and weaknesses can be identified and resolved. The more detailed information you have regarding past incidents and the steps taken to resolve them, the easier it will be for a new Ops team to fix things if an outage occurs again.
Assigning roles for incidents: Having clear-cut roles saves precious time and creates a level of specialisation, so that individual members of the on-call team can focus on their areas of responsibility. Defining roles for the incident response team (incident commander, technical lead etc.) may be useful if you have a larger technology stack and on-call team. As your technology stack grows you may need roles for more specialised technical experts.
Understanding the tools that will help you build a better response plan
Runbooks: Having specific runbooks for incidents can drastically cut down the time required to respond to incidents. Newer employees who may not be as familiar with your organisation’s production environment can rely on them to fix issues. A shortcoming of runbooks is that if your production environment changes very often, the associated runbooks will need to be updated just as often.
Postmortems / Retrospectives: A blameless postmortem after every major incident helps build resilience and a culture of learning in your organisation. There are several great templates out there that will walk you through the process of creating retrospectives.
Automation and Self-healing tools: If you have self-healing systems in place, you may want to set up a way of suppressing alerts for problems that can be fixed autonomously without human supervision. However, creating a system that can detect minor outages and preemptively fix them without human intervention will require more advanced technical skill.
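One possible way to suppress alerts that self-healing already handles is to stay quiet unless the same alert keeps recurring, which suggests the automatic fix is not holding. The sketch below shows this idea in Python. The class name, the check names and the thresholds are all assumptions made for illustration, and it does not reflect how any specific alerting tool works.

```python
from collections import defaultdict, deque


class SelfHealingSuppressor:
    """Suppress alerts for auto-remediated checks unless they recur too often."""

    def __init__(self, auto_healed_checks: set[str], max_recurrences: int = 3,
                 window_seconds: float = 600.0) -> None:
        self.auto_healed_checks = auto_healed_checks  # hypothetical check names
        self.max_recurrences = max_recurrences
        self.window_seconds = window_seconds
        self._recent: dict[str, deque[float]] = defaultdict(deque)

    def should_page(self, check_name: str, timestamp: float) -> bool:
        """Return True if a human should be notified about this alert."""
        if check_name not in self.auto_healed_checks:
            return True  # nothing fixes this automatically, so always page
        recent = self._recent[check_name]
        recent.append(timestamp)
        # Drop occurrences that fall outside the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        # Page only when the self-healing action does not seem to be holding.
        return len(recent) > self.max_recurrences
```

For example, with `max_recurrences=2` and a 60-second window, the first two alerts for an auto-healed check stay silent and the third one within the window pages a human.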
Proactively tracking the production environment: Over time your production environment will change as new dependencies are introduced. Your on-call team and development team need to be in sync regarding these changes. Major deployments, where problems often occur, can be coordinated with the Ops team. There are tools that automatically track when new services or dependencies are created in your microservice environment.
Create a War Room for major incidents: Creating a war room during a major outage provides a highly focused environment for tackling it. Nothing beats a war room for creating the sense of immediacy and cooperation that is needed to fix major outages. As an organisation you also need to determine the severity of incident that will necessitate a war room, and the protocols to be followed.
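Writing the severity rules down as data makes the war room decision less of a judgment call in the heat of the moment. The following Python sketch shows one way to encode such a policy. The severity levels, thresholds and protocol fields are hypothetical and would need tuning to your own organisation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # widespread customer-facing outage
    SEV2 = 2  # major feature degraded
    SEV3 = 3  # minor or internal-only impact


@dataclass(frozen=True)
class ResponseProtocol:
    open_war_room: bool
    update_interval_minutes: int
    notify_stakeholders: bool


# Hypothetical policy table mapping each severity to its protocol.
PROTOCOLS = {
    Severity.SEV1: ResponseProtocol(True, 15, True),
    Severity.SEV2: ResponseProtocol(True, 30, True),
    Severity.SEV3: ResponseProtocol(False, 120, False),
}


def classify(percent_customers_affected: float, core_feature_down: bool) -> Severity:
    """Map impact to a severity level using illustrative thresholds."""
    if core_feature_down and percent_customers_affected >= 25:
        return Severity.SEV1
    if core_feature_down or percent_customers_affected >= 5:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    severity = classify(percent_customers_affected=40, core_feature_down=True)
    print(severity.name, PROTOCOLS[severity])
```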
Chat and collaboration tools: Slack, MS Teams and email remain some of the most common tools of communication during an outage. Many incident management tools can automatically create rooms/channels in Slack for a particular incident. This is especially helpful for alerting the non-technical stakeholders in your team to major outages.
Automated incident timeline creation tools: These tools keep track of the measures taken to handle an incident, starting from the earliest ones. They also serve as helpful aids during retrospectives. For certain domains (financial or security), having detailed contextual information for an incident may be a regulatory requirement.
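At its core, an incident timeline is an ordered log of who did what and when. The sketch below is a minimal Python illustration of that idea, not a description of how any particular product implements it. The class names and fields are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    actor: str
    action: str


@dataclass
class IncidentTimeline:
    incident_id: str
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, actor: str, action: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, defaulting to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, action)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the entries in chronological order, one per line."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at.isoformat()} [{e.actor}] {e.action}" for e in ordered)
```

Sorting at render time means entries can be recorded out of order, for instance when a responder backfills an action during the retrospective.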
Social media tools to communicate with customers across different channels: You can use social media tools that post to multiple channels to communicate with your customers after a major outage. This can include tools that automatically post the latest updates from your Status Page. If your product is a platform that is used by thousands of users worldwide, it is not uncommon to face harsh criticism online. In such a case you may want to weigh the benefits of an automated incident response.
While crafting your incident response plan it is also worthwhile to decide which parts of your infrastructure to fix first. It is always advisable to get the most basic functionality working and to communicate this to your users.
While this blog covers the most common ways you can start building an incident response plan, it is by no means an exhaustive document. These recommendations only scratch the surface of building a comprehensive plan. No plan survives contact with a major catastrophe, but that doesn't mean you shouldn't start planning. As with all other planning exercises, its effectiveness will be put to the test when you face your first major incident. In a modern distributed architecture stack, not only the application but also the environment/hardware it is deployed on may be changing constantly. Having a culture of collaboration is an essential part of better incident response. Depending upon your organisation, it can take anything from a couple of weeks to a couple of months to come up with a better incident response plan that works well for you.
What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organisation? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.