Incident management is stressful, and even more so during the holidays. This checklist covers what to watch for so that your on-call team stays calm if an incident occurs.
It sucks to be on-call when processes are not well defined and streamlined. Especially around the holidays.
You really don't want to hear your phone going off repeatedly just as you're sitting down to Christmas dinner with your loved ones or getting around to unwrapping the good presents (the ones with the sparkly wrapping paper :P).
Your on-call team’s stress levels reflect the health of your system, the cleanliness of your code and the culture of your organization. So it's incredibly important to do everything in your power to make life easier for your on-call team, because doing so brings a host of benefits to your engineering team as a whole.
Don't leave your on-call team feeling like this sorry little Charmander.
The first course of action is to define a framework with a good set of rules to be followed. Especially around the holiday season, you can make a pre-holiday checklist.
Create sensible Schedules and Rotations with more people to share the load:
In most cases, the stress of on-call falls on just a few engineers. On-call burnout is a serious issue in the SRE and DevOps world and more so around the holidays given the small list of people willing to be on-call at this time (or in the case of startups, just one or two).
To start with, expand your on-call team so that the stress doesn’t fall on just a few. It’s important for everyone to get their vacation time, and distributing the load across a larger team will go a long way.
Have a foolproof system in place to override those Schedules / Rotations when needed:
You can automatically override schedules when an alert or incident is clearly meant for a specific person or team, or when the action needed to resolve it immediately is obvious. This can be done using custom automated Incident Tags that route notifications directly to the relevant folks or trigger pre-defined actions or scripts.
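As a rough illustration of tag-based routing, here is a minimal Python sketch. It is not how any particular incident management platform implements Incident Tags; the `Incident` class, the `ROUTING_RULES` table, and the team names are all hypothetical and exist only to show the idea of a tag match taking priority over the regular schedule.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    # Free-form key/value tags attached to the incident (hypothetical shape).
    tags: dict = field(default_factory=dict)


# Hypothetical rules: a (tag key, tag value) pair maps to whoever should be notified.
ROUTING_RULES = {
    ("service", "payments"): "payments-team",
    ("severity", "low"): "next-business-day-queue",
}


def route(incident: Incident, scheduled_on_call: str) -> str:
    """Return the override target if a tag rule matches, else the scheduled person."""
    for (key, value), target in ROUTING_RULES.items():
        if incident.tags.get(key) == value:
            return target
    return scheduled_on_call
```

With these rules, an incident tagged `service=payments` goes straight to the payments team, while an untagged incident still pages whoever is on the schedule.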
Use “Vacation Mode” to hand-off on-call shifts for both planned & unplanned time off:
Schedules and rotations bring some order to on-call, but they still do not account for people taking time off. Being able to let someone take over your shift for an emergency or a planned vacation is a boon. It’s important that the on-call schedules accurately reflect these hand-offs.
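To make the hand-off idea concrete, the sketch below resolves who is on call for a given day from a weekly rotation plus a list of overrides. Everything here is an assumption made for illustration: the `Override` record, the weekly rotation length, and the names are hypothetical and do not describe a specific product's "Vacation Mode".

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Override:
    original: str  # person who is handing off the shift
    cover: str     # person taking over
    start: date    # first day of the hand-off (inclusive)
    end: date      # last day of the hand-off (inclusive)


def on_call_for(day: date, rotation: list, rotation_start: date, overrides: list) -> str:
    """Pick the scheduled person from a weekly rotation, then apply any override."""
    weeks_elapsed = (day - rotation_start).days // 7
    scheduled = rotation[weeks_elapsed % len(rotation)]
    for o in overrides:
        if o.original == scheduled and o.start <= day <= o.end:
            return o.cover
    return scheduled


if __name__ == "__main__":
    rotation = ["asha", "ben", "chen"]
    overrides = [Override("ben", "chen", date(2024, 12, 24), date(2024, 12, 26))]
    print(on_call_for(date(2024, 12, 25), rotation, date(2024, 12, 16), overrides))
```

Because the override is applied at lookup time, the schedule always shows who will actually be paged, which is the property the paragraph above asks for.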
Some best practices here would be -
Follow a “No Deploys” practice for your Engineering teams during weekends and holidays:
This is forked from the essential No Deploy Fridays practice that is common knowledge in the on-call community. In today's world, it should be possible for your infrastructure to recognize a failed deploy and roll back automatically. While this may not be the case for all systems and teams, the least you can do is have practices in place that help teams quickly recognize when an error occurs.
It is general practice to be available for at least a full working day after a new deploy, simply to be aware of how the push is functioning and to be able to respond quickly if it isn’t.
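One way to enforce this in a deploy pipeline is a small date check. The sketch below is one possible interpretation, not an established standard: it blocks deploys on weekends and holidays, and also on any day that is not followed by a working day, so someone is around to watch the release. The `HOLIDAYS` set is a hypothetical example and would come from your own calendar.

```python
from datetime import date, timedelta

# Hypothetical holiday calendar; replace with your organization's own dates.
HOLIDAYS = {date(2024, 12, 25), date(2025, 1, 1)}


def is_working_day(day: date, holidays=HOLIDAYS) -> bool:
    """Monday to Friday, excluding listed holidays."""
    return day.weekday() < 5 and day not in holidays


def deploy_allowed(day: date, holidays=HOLIDAYS) -> bool:
    """Allow a deploy only if today and tomorrow are both working days."""
    next_day = day + timedelta(days=1)
    return is_working_day(day, holidays) and is_working_day(next_day, holidays)
```

Under this rule Fridays are blocked automatically, since the following day is a Saturday, and so is the day before a holiday.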
“Always code as if the guy who ends up maintaining, or testing your code will be a violent psychopath who knows where you live.” ~ Dave Carhart
Making Incidents Context Rich:
Half the stress of on-call stems from having little to no information about why something went down. More hours are spent looking for context on an incident than actually resolving it, which leads to a higher Mean-Time-To-Resolve (MTTR).
How can you add context to incidents?
Proactive Incident Management using SLOs and Error Budgets:
A proactive incident management approach entails understanding which incidents are likely to occur and having a plan in place for them. A reactive approach, on the other hand, means scrambling to find the right thing to do when an incident takes you by surprise. One useful way to be proactive rather than reactive is to understand the trends in your Service Level Objectives (SLOs) and error budget graphs. By correlating the consumption of your error budget with the incidents that have occurred, you should be able to predict potential customer-impacting downtime.
Based on the types of incidents that have occurred, you can then write automated scripts to resolve and mitigate them.
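The arithmetic behind an error budget is simple enough to sketch. The Python below assumes an availability SLO measured over a rolling window and incident downtime recorded in minutes; the function names and the linear burn-rate projection are illustrative assumptions, not a prescribed method.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime in the window, e.g. slo_target=0.999 over 30 days."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(downtimes_min: list, slo_target: float, window_days: int) -> float:
    """Fraction of the error budget used up by the recorded incident downtimes."""
    return sum(downtimes_min) / error_budget_minutes(slo_target, window_days)


def days_until_exhausted(consumed_fraction: float, days_elapsed: float):
    """Project remaining days at the current burn rate; None if nothing is burning."""
    if consumed_fraction <= 0:
        return None
    burn_per_day = consumed_fraction / days_elapsed
    return max((1.0 - consumed_fraction) / burn_per_day, 0.0)
```

For example, a 99.9% SLO over 30 days allows roughly 43 minutes of downtime. If two incidents have already used half of that in the first 10 days, the same burn rate would exhaust the budget in about 10 more days, which is a signal to slow down risky changes before the holidays.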
Having a Resolution & Remediation Plan in place:
There are many reasons why services fail. Some are known, some are unknown. It is easier to fight fires knowing there’s always a solution.
The first step of incident resolution is to minimize customer impact as soon as possible. The next step is to figure out a longer-term remediation for the incident. This comes from a practice of maintaining playbooks or building a knowledge base for different types of incidents that can guide on-call folks.
Squadcast Actions: It’s always good to have a predefined remediation plan in place. Make sure you integrate all the tools you would use to take action, like your CI/CD platform or infrastructure automation tools, so that you can take those actions immediately and directly from your incident management platform when an incident occurs. For example, you can roll back a feature to its previous version or rebuild a project in response to an alert that is firing. If you have these things established, then in most cases you should be able to take care of an incident before your customer is impacted by it.
Runbooks: In cases where you already know the resolution steps for an incident, having an executable script can save you a lot of time. With runbooks, resolution is just a click away instead of being a manual, repetitive task.
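At its simplest, an executable runbook is an ordered list of named steps that stops at the first failure so a human can take over. The sketch below shows that shape in Python; the step names and the placeholder actions are hypothetical, and a real runbook would call your own tooling instead of returning canned values.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that reports success or failure.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run steps in order, record each result, and stop at the first failure."""
    log = []
    for name, action in steps:
        ok = action()
        log.append((name, ok))
        if not ok:
            break  # leave the remaining steps for a human to decide on
    return log


if __name__ == "__main__":
    # Placeholder actions standing in for real commands (hypothetical).
    restart_cache_runbook = [
        ("drain traffic from node", lambda: True),
        ("restart cache service", lambda: True),
        ("verify health check", lambda: True),
    ]
    for name, ok in run_runbook(restart_cache_runbook):
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Returning the log rather than only printing it means the results can be attached to the incident, which also adds context for whoever picks it up next.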
There are plenty of ways to make your on-call experience better, but understanding why these things are important and communicating that to the broader engineering team is crucial. It's important to know that the sanity of your on-call team reflects the health of your systems and the culture of your organization as a whole.
So it becomes a prime responsibility of the entire team to give on-call folks a good experience. Let’s take this to heart today and improve the way we do incident management!