
How to avoid on-call burnout

Incident management is stressful, and even more so during the holidays. This is a checklist of things to watch out for so that your on-call team stays calm if an incident occurs.

Why is on-call so stressful?

It sucks to be on-call when processes are not well defined and streamlined, especially around the holidays.

You really don't want to hear your phone going off repeatedly right when you're sitting down for Christmas dinner with your loved ones or getting around to unwrapping the good presents (the ones with the sparkly wrapping paper :P).

Your on-call team’s stress levels reflect the health of your system, the cleanliness of your code and the culture of your organization. So, it's incredibly important to do everything in your power to make things easier for your on-call team, because doing so brings a host of benefits to your overall engineering team.

Don't leave your on-call team feeling like this sorry little Charmander.

What can you do to make on-call easier?

The first course of action is to define a framework with a good set of rules to follow. Around the holiday season in particular, you can turn these rules into a pre-holiday checklist.

Create sensible Schedules and Rotations with more people to share the load:

In most cases, the stress of on-call falls on just a few engineers. On-call burnout is a serious issue in the SRE and DevOps world, and more so around the holidays, given how few people are willing to be on-call at this time (in the case of startups, often just one or two).

To start with, expand your on-call team so that the stress doesn’t fall on just a few. It’s important for everyone to get their vacation time off, and distributing the load across a larger team will go a long way.
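As a rough illustration of sharing the load, the sketch below builds a simple round-robin rotation. The engineer names, the shift length and the `build_rotation` helper are hypothetical and are not part of any Squadcast API.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=1):
    """Assign consecutive shifts to engineers in round-robin order."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    schedule = []
    people = cycle(engineers)
    for i in range(shifts):
        # Each shift starts shift_days after the previous one.
        shift_start = start + timedelta(days=i * shift_days)
        schedule.append((shift_start, next(people)))
    return schedule


if __name__ == "__main__":
    team = ["asha", "ben", "chen", "dee"]  # hypothetical names
    for day, person in build_rotation(team, date(2019, 12, 23), 8):
        print(day.isoformat(), person)
```

With four people instead of two, each engineer covers half as many shifts over the same period.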

Have a foolproof system in place to override those Schedules / Rotations when needed:

You can automatically override schedules when an alert or incident is clearly meant for a specific person or team, or when it is obvious what action will resolve the alert immediately. This can be done using custom automated Incident Tags that route notifications directly to the relevant folks or trigger pre-defined actions or scripts.
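The routing idea can be sketched as a small rule table. The tag names, team names and the `route_incident` function below are hypothetical illustrations and do not reflect Squadcast's actual configuration format.

```python
# Hypothetical routing rules: each rule pairs the tags an incident must carry
# with the responder that should be notified instead of the scheduled on-call.
ROUTING_RULES = [
    ({"service": "payments", "severity": "high"}, "payments-oncall"),
    ({"layer": "frontend"}, "frontend-team"),
]


def route_incident(tags, default="scheduled-oncall"):
    """Return the responder of the first rule whose tags all match.

    Falls back to whoever is on the regular schedule when no rule matches.
    """
    for required, responder in ROUTING_RULES:
        if all(tags.get(key) == value for key, value in required.items()):
            return responder
    return default
```

Rules are checked in order, so more specific rules should come first.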

Overriding Schedules with Automated Incident Tags to Route to the right responders in Squadcast.

Use “Vacation Mode” to hand off on-call shifts for both planned & unplanned time off:

Schedules and rotations bring some order to on-call, but they still do not account for people taking time off. Having the ability to let someone take over your shift in case of emergencies or planned vacation is a boon. It’s important that the on-call schedules accurately reflect this.

Some best practices here would be -

  1. Let your team know well in advance of a planned vacation so that the necessary changes can be made to the on-call schedules and rotations.
  2. If you are the primary on-call for any services or systems, set yourself on Vacation Mode and find or ask someone else to take your on-call shift before your vacation begins.
  3. Be willing to do the same favour for someone else when they need it, if you are available. Track your on-call hours as well as those of others on the team so that you are not overburdened.
  4. If you have an emergency and need someone else to take your on-call shift at short notice, ask them whether they have the bandwidth to do that for you. Ideally, pick someone who hasn’t been on-call for a while.
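Tracking on-call hours makes it easier to find a fair candidate to cover a shift. A minimal sketch, assuming you keep a simple tally of recent on-call hours per engineer (the names, numbers and the `pick_cover` helper are hypothetical):

```python
def pick_cover(hours_by_engineer, unavailable):
    """Pick the available engineer with the fewest recent on-call hours."""
    candidates = {
        name: hours
        for name, hours in hours_by_engineer.items()
        if name not in unavailable
    }
    if not candidates:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(candidates, key=lambda name: (candidates[name], name))


recent_hours = {"asha": 40, "ben": 12, "chen": 12, "dee": 30}
suggestion = pick_cover(recent_hours, unavailable={"ben"})
```

This only suggests whom to ask first; you should still check that the person actually has the bandwidth.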

Using Vacation Mode for On-call Schedules in Squadcast.

Following a “No Deploys” practice for your Engineering teams during the weekends and holidays:

This is forked from the essential No Deploy Fridays practice that is common knowledge in the on-call community. In today's world, it should be possible for your infrastructure to recognise a failed deploy and roll back automatically. While this may not be the case for all systems and teams, the least you can do is put practices in place that help teams quickly recognise when an error occurs.

It is general practice to be available for at least a full working day after a new deploy, simply to be aware of how the push is functioning and to be able to respond quickly if it isn’t.
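A deploy freeze like this can be enforced with a small guard in your deploy pipeline. The sketch below is one possible check, assuming a hand-maintained holiday list. The `HOLIDAYS` set and the `deploys_allowed` function are hypothetical, and blocking Fridays as well as weekends is an assumption based on the No Deploy Fridays practice.

```python
from datetime import date

# Hypothetical holiday list; replace with your organization's calendar.
HOLIDAYS = {date(2019, 12, 25), date(2020, 1, 1)}


def deploys_allowed(day, holidays=HOLIDAYS):
    """Return False on Fridays, weekends and listed holidays."""
    # date.weekday(): Monday is 0, Friday is 4, Sunday is 6.
    if day.weekday() >= 4:
        return False
    return day not in holidays
```

A pipeline step could call `deploys_allowed(date.today())` and stop the deploy when it returns False.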

“Always code as if the guy who ends up maintaining, or testing your code will be a violent psychopath who knows where you live.” ~ Dave Carhart

Making Incidents Context Rich:

Half the stress of on-call stems from having little to no information about why something went down. Often, more hours are spent looking for context on an incident than actually resolving it, leading to a higher Mean-Time-To-Resolve (MTTR).

How can you add context to incidents?

  • Make sure all relevant Tags are attached to your incidents, either automatically or manually. Example: Backend issue / Frontend issue; Severity: High / Low.
  • Make sure severities are clearly defined and updated for every incident. With this level of clarity, your on-call team will be able to tell whether something needs to be done immediately or whether they have some time to find a fix after their holidays.
  • On-call teams struggle with switching between various tools to find the information they need. One way to fix this is to carefully configure your alert source integrations within your incident management tool, so that useful contextual info is automatically added to every incident. For example, your knowledge base, your runbooks or any useful information from your monitoring, logging, tracing or visualization tools can add significant context to an incident and help responders decide faster how to react. This could be time series data, graphs or post-mortems of similar incidents addressed in the past.

Tagging Incidents to make them more context-rich in Squadcast.

Proactive Incident Management using SLOs and Error Budgets:

A proactive incident management approach entails understanding which incidents are likely to occur and having a plan in place. A reactive approach, on the other hand, means scrambling to find the right things to do when an incident occurs, because you are taken by surprise. One useful way to be proactive rather than reactive is to understand trends from your Service Level Objectives (SLOs) and error budget graphs. By correlating the consumption of your error budget with the incidents that have occurred, you should be able to predict potential customer-impacting downtime.

Based on the types of incidents that have occurred, you can then write automatable scripts to resolve and mitigate them.
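To make the error budget idea concrete, here is a minimal sketch of the arithmetic for a request-based SLO. The `error_budget` function and the example figures are assumptions for illustration only. Real SLO tooling may define the budget differently, for example in minutes of downtime.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, fraction_of_budget_consumed) for a window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    # The budget is the share of requests the SLO permits to fail.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return allowed_failures, float("inf")
    return allowed_failures, failed_requests / allowed_failures


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 250 failures.
allowed, consumed = error_budget(0.999, 1_000_000, 250)
```

In this hypothetical window the budget is about 1,000 failed requests, of which roughly a quarter has been consumed. A budget that drains quickly after a particular type of incident is a signal to prioritise a fix or an automated mitigation for it.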

Having a Resolution & Remediation Plan in place:

There are many reasons why services fail. Some are known, some are unknown. It is easier to fight fires knowing there’s always a solution.

The first step of incident resolution is to minimize customer impact as soon as possible. The next step is to figure out a longer-term remediation for the incident. This comes from a practice of maintaining playbooks or a knowledge base for different types of incidents that can guide on-call folks.

Squadcast Actions: It’s always good to have a predefined remediation plan in place. Make sure you integrate all the tools that you would use to take action, like your CI/CD platform or infrastructure automation tools, so that you can take those actions immediately and directly from your incident management platform when an incident occurs. For example, you can roll back a feature to its previous version or rebuild a project in response to an alert that is firing. If you have these things established, then in most cases you should be able to ensure that an incident is taken care of before your customer is impacted by it.

Runbooks: In cases where you already know the resolution steps for an incident, having an executable script can save you a lot of time. With runbooks, resolution is just a click away, rather than a manual and repetitive process.
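The idea of an executable runbook can be sketched as a lookup from incident type to a script. This is a generic illustration, not the Squadcast Runbooks API. The `runbook` decorator, the `disk_full` incident type and the handler are hypothetical, and the handler only describes the action instead of performing it.

```python
# Registry mapping an incident type to the function that resolves it.
RUNBOOKS = {}


def runbook(incident_type):
    """Decorator that registers a function as the runbook for an incident type."""
    def register(func):
        RUNBOOKS[incident_type] = func
        return func
    return register


@runbook("disk_full")
def clear_temp_files(context):
    # A real runbook would run the cleanup; this sketch only reports the step.
    return f"would clear temp files on {context['host']}"


def run_runbook(incident_type, context):
    """Run the registered runbook, or return None if no runbook is known."""
    handler = RUNBOOKS.get(incident_type)
    if handler is None:
        return None
    return handler(context)
```

Returning None for unknown incident types leaves room to fall back to manual investigation.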

Using Squadcast Actions to reduce your MTTR.

Using Squadcast Runbooks for faster recovery.

There are plenty of ways to make your on-call experience better, but understanding why these things are important and communicating this to the broader engineering team is crucial. It's important to know that the sanity of your on-call team reflects the health of your systems and the culture of your organization as a whole.

So it becomes a prime responsibility of the entire team to give on-call folks a good experience. Let’s take this to heart today and improve the way we do incident management!

December 20, 2019
Author: Prakya Vasudevan
