As the pandemic wears on, remote incident management has become the norm for businesses worldwide. Here we share some best practices that helped us handle incidents remotely and make on-call less stressful.
With the onset of remote work due to Covid-19, remote incident management has become the norm for businesses worldwide. Organisations that were used to physical war rooms now have to coordinate teams through Slack, MS Teams or other collaboration tools. This sudden, unplanned transition has created a unique set of problems.
Now that we have a few months of experience managing incidents remotely, here are some best practices we have found effective. Although these practices are already recommended for effective incident management in general, we believe this list is a great starting point for staying on top of incidents and preventing major outages while working remotely.
In this blog, we list some ideas that you can implement immediately, including communicating better among stakeholders, having detailed plans to deal with outages, and documenting and learning from past failures. Here are the ways you can make this transition work in your favor and ensure that on-call remains as stress-free as possible.
1. Have a strong communication plan:
This includes using Slack, MS Teams or any other collaboration tool to communicate incidents. It is essential to have a contingency plan in place in case your usual communication software goes down. No one wants to spend hours making phone calls to fix issues. A remote incident management team is like a pit-stop crew, but situated miles apart and sometimes in different timezones.
The Slack outage in the first week of 2021 underlines how important it is to keep communication channels open. Status pages help here. A private status page is invaluable to the engineers already working on fixing the issue (especially in larger teams). It also helps your PR and communications team by providing an accurate picture of the size of the outage and the progress being made. A public status page lets your customers know whether parts of your product are still operational and indicates the progress being made on returning to full functionality.
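As a rough illustration of the contingency idea, here is a minimal Python sketch that tries a list of communication channels in priority order and falls back to the next one when a channel fails. The function names (`notify_with_fallback`, `send_chat`, `send_email`) are hypothetical stand-ins, not the API of Slack, MS Teams or any other tool.

```python
from typing import Callable, List, Optional, Tuple

# A channel is a (name, send function) pair; the send function returns True on success.
Channel = Tuple[str, Callable[[str], bool]]


def notify_with_fallback(message: str, channels: List[Channel]) -> Optional[str]:
    """Try each channel in priority order and return the name of the first that works."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat any error (for example, the chat tool being down) as a failed delivery.
            continue
    return None  # every channel failed; time to pick up the phone


# Hypothetical senders standing in for real integrations.
def send_chat(message: str) -> bool:
    raise ConnectionError("chat tool is down")


def send_email(message: str) -> bool:
    return True


used = notify_with_fallback(
    "Database latency is high",
    [("chat", send_chat), ("email", send_email)],
)
print(used)
```

The point of the sketch is only that the fallback order is decided before the outage, not during it.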
2. Have an information repository of your system in hand:
Earlier, if you needed any piece of information about your system, it was as simple as moving a few desks over and asking the person concerned. Now, if that person is unavailable on Slack, the information you need to quickly fix the outage is hard to get. Having a centralized information system with all the essential information is invaluable. Before the pandemic hit, too many organisations kept their important information on post-it notes stuck all over the place. Needless to say, this won't work when your team is working remotely. You need a searchable repository of vital information to save precious time and effort.
3. Have dry-runs/simulations of catastrophic failures:
It is a good idea to run a dry-run or simulation to see how effectively your team can handle a severe failure while remote. It can reveal areas of improvement in your incident response strategy.
4. Automate more:
Some things are quick fixes or easy to tackle when you are physically present in the office. These may be scripts that are run manually or meetings that can be avoided. Reducing toilsome activities is a long-term goal that assumes greater importance when working remotely. Burnout from working remotely is a serious issue, and tackling toil with automation should be a high priority. Automation should ideally include running scripts, monitoring clusters, scheduling maintenance, and auto-configuring cloud-based virtual machines when the need arises.
Having detailed runbooks will be of great help when a major incident occurs. Automated runbooks can be a game changer when it comes to diagnosing and fixing systems that have gone offline. Whether you are using Ansible, Rundeck or any other tool, even the simplest runbook is better than fixing things manually and starting from scratch every time. You can read more about runbooks in this blog.
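To make the idea of "even the simplest runbook" concrete, here is a minimal sketch in Python. It is not how Ansible or Rundeck represent runbooks; the `Step` class, `run_runbook` function and the three example steps are all hypothetical and only simulate their actions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_runbook(steps: List[Step]) -> List[str]:
    """Run the steps in order, stop at the first failure, and return a log."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            log.append("Stopping: escalate to a human.")
            break
    return log


# Simulated system state and actions (assumptions for illustration only).
state = {"service_running": False}


def check_disk() -> bool:
    return True  # pretend there is enough free disk space


def restart_service() -> bool:
    state["service_running"] = True
    return True


def verify_health() -> bool:
    return state["service_running"]


log = run_runbook([
    Step("Check free disk space", check_disk),
    Step("Restart the service", restart_service),
    Step("Verify health endpoint", verify_health),
])
print("\n".join(log))
```

Even in this toy form, the runbook records what was tried and where it stopped, which is what saves time during the next incident.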
5. Fight alert fatigue (even more proactively):
Alert fatigue is arguably even more damaging when working remotely. Configuring monitoring tools and tweaking alerting thresholds plays a very important role in reducing alert noise. Additionally, our team proactively reduces alert noise by creating deduplication, event routing and tagging rules. Having mandatory off days for on-call engineers to avoid burnout also helps considerably.
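A deduplication rule can be as simple as "do not notify again for the same service and check within a time window". The sketch below illustrates that idea in Python; it is an assumption about what such a rule might look like, not the rule syntax of any particular platform, and the five-minute window is an arbitrary example value.

```python
from typing import Dict, Iterable, List, Tuple

# An alert is (timestamp in seconds, service, check).
Alert = Tuple[float, str, str]


def deduplicate(alerts: Iterable[Alert], window_seconds: float = 300.0) -> List[Alert]:
    """Keep an alert only if the same (service, check) has not notified within the window."""
    last_notified: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for timestamp, service, check in sorted(alerts):
        key = (service, check)
        previous = last_notified.get(key)
        if previous is None or timestamp - previous >= window_seconds:
            kept.append((timestamp, service, check))
            last_notified[key] = timestamp
    return kept


incoming = [
    (0, "api", "latency"),
    (60, "api", "latency"),   # repeat within the window, suppressed
    (120, "db", "disk"),      # different key, kept
    (400, "api", "latency"),  # window has elapsed, kept
]
kept = deduplicate(incoming)
print(kept)
```

In this example four incoming alerts become three notifications; on a noisy system the reduction is usually far larger.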
6. Coordinate with dev teams before deployment:
Monitor your infrastructure during major deployments, and have rollbacks in place in case things go wrong. Since some of the most catastrophic failures happen during deployments, you need a way to monitor system health during that time and initiate rollbacks if required.
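One way to picture "monitor system health and initiate rollbacks if required" is a small gate that watches an error-rate metric after a deployment. The Python sketch below is an illustration under assumed thresholds (5% error rate, two consecutive bad samples); the `deploy` and `rollback` callables are hypothetical placeholders for your real tooling.

```python
from typing import Callable, Iterable, List


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    max_error_rate: float = 0.05,
    max_consecutive_bad: int = 2,
) -> str:
    """Deploy, then watch error-rate samples and roll back after repeated bad samples."""
    deploy()
    consecutive_bad = 0
    for rate in error_rates:
        if rate > max_error_rate:
            consecutive_bad += 1
            if consecutive_bad >= max_consecutive_bad:
                rollback()
                return "rolled_back"
        else:
            consecutive_bad = 0  # a healthy sample resets the counter
    return "healthy"


events: List[str] = []
result = deploy_with_rollback(
    deploy=lambda: events.append("deploy"),
    rollback=lambda: events.append("rollback"),
    error_rates=[0.01, 0.08, 0.12, 0.02],  # sampled after the deployment
)
print(result, events)
```

Requiring consecutive bad samples is a design choice that avoids rolling back on a single noisy reading.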
7. Have a clear incident chain of command and roles:
Have you planned for contingencies when your usual leadership is on leave or unreachable? A clear incident chain of command prevents last-minute confusion in a time-sensitive and stressful situation.
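A chain of command can be written down as an ordered list of candidates per role, so that the next person steps in when the first choice is unreachable. The following Python sketch shows that idea; the role names and people are invented for illustration, and the rule that one person holds only one role is an assumption.

```python
from typing import Dict, List, Optional, Set


def assign_roles(chain: Dict[str, List[str]], available: Set[str]) -> Dict[str, Optional[str]]:
    """For each role, pick the first candidate who is available and not already assigned."""
    assignments: Dict[str, Optional[str]] = {}
    assigned: Set[str] = set()
    for role, candidates in chain.items():
        chosen = next(
            (person for person in candidates if person in available and person not in assigned),
            None,
        )
        assignments[role] = chosen
        if chosen is not None:
            assigned.add(chosen)
    return assignments


# Hypothetical chain of command: candidates are listed in order of preference.
chain = {
    "incident_commander": ["priya", "marco", "lena"],
    "communications_lead": ["marco", "sam"],
    "scribe": ["lena", "sam"],
}

# Priya is on leave, so Marco leads and Sam takes over communications.
roles = assign_roles(chain, available={"marco", "lena", "sam"})
print(roles)
```

A role that comes back as `None` is a gap in the plan that is better discovered in a dry-run than during an outage.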
8. Invest in an incident management platform:
If you haven't done so already, investing in a dedicated incident management platform will go a long way in making on-call less stressful, with the help of features like escalation policies and alert deduplication rules. Furthermore, many such platforms have dashboards that let you track the performance of your on-call team as well as the quality of service. There are still on-call teams that use spreadsheets to track schedules. While this was manageable (though not recommended) in pre-Covid times, the situation now requires more clarity and efficiency. Easy-to-use on-call schedules in incident management platforms can be a great help for your team in planning their workload. Since engineers know beforehand whether they will be on-call, they can plan their other activities accordingly. A healthy rotation in on-call schedules also helps prevent burnout.
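The core of an on-call schedule is a rotation that anyone can look up in advance. As a minimal sketch, assuming weekly shifts and a simple round-robin order (real platforms also handle overrides, time zones and escalation layers), the lookup could be written in Python like this. The engineer names and start date are made up.

```python
from datetime import date
from typing import List


def on_call_for(day: date, engineers: List[str], rotation_start: date, shift_days: int = 7) -> str:
    """Return who is on call on a given day in a round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shift_index = (day - rotation_start).days // shift_days
    return engineers[shift_index % len(engineers)]


engineers = ["asha", "ben", "chen"]
start = date(2021, 1, 4)  # a Monday, chosen as an example start of the rotation

print(on_call_for(date(2021, 1, 6), engineers, start))   # first week
print(on_call_for(date(2021, 1, 11), engineers, start))  # second week
print(on_call_for(date(2021, 1, 25), engineers, start))  # fourth week, rotation wraps around
```

Because the schedule is a pure function of the date, every engineer can check weeks ahead whether they are on call.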
After a major outage occurs, automated incident timelines are invaluable for remote teams to figure out measures that were taken to fix things. At Squadcast, we rely on the automated incident timeline to have a real-time view of the progress towards incident resolution. Automated timelines are also of great help when creating incident postmortems subsequently. It becomes much easier to figure out the strengths and weaknesses of your on-call response if you are armed with a detailed timeline of events.
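To show what a timeline captures, here is a small Python sketch that records who did what and when, then renders it for a postmortem. It is only an illustration of the concept and not Squadcast's implementation; the `IncidentTimeline` class and the sample events are hypothetical, and timestamps are passed in explicitly here where a real tool would record them automatically.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class IncidentTimeline:
    title: str
    events: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, when: datetime, actor: str, action: str) -> None:
        """Append one (time, actor, action) entry to the timeline."""
        self.events.append((when, actor, action))

    def duration(self) -> timedelta:
        """Time between the first and last recorded events."""
        if not self.events:
            return timedelta(0)
        times = [when for when, _, _ in self.events]
        return max(times) - min(times)

    def render(self) -> List[str]:
        """Return the timeline as text lines in chronological order."""
        lines = [f"Timeline: {self.title}"]
        for when, actor, action in sorted(self.events):
            lines.append(f"{when:%H:%M} {actor}: {action}")
        return lines


timeline = IncidentTimeline("Checkout errors")
timeline.record(datetime(2021, 1, 4, 10, 0), "monitoring", "alert triggered")
timeline.record(datetime(2021, 1, 4, 10, 4), "asha", "acknowledged")
timeline.record(datetime(2021, 1, 4, 10, 35), "asha", "rolled back release, resolved")

print("\n".join(timeline.render()))
print("Total duration:", timeline.duration())
```

With entries like these, questions such as "how long until someone acknowledged?" can be answered from data instead of memory.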
As stated earlier, an incident response team during a major outage is like the pit crew of a Formula 1 team, trying to get as much done as possible in the shortest amount of time. Like a pit crew, incident management teams do their best work when each member knows exactly what they need to look after.
We hope this list is as useful to you as it has been to us. This is not an exhaustive list of best practices for managing incidents while working remotely, so we would love to hear from you. What other practices or ways of working helped you tackle incidents remotely? Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.