Things to do to make on-call less stressful
Amiya Adwitiya
Prakya Vasudevan
January 30, 2020

Doing on-call management in a way that’s better, less stressful and actually works to improve your incident response processes, uptime & reliability

If you belong to a team that’s doing any DevOps, SRE, Operations or Development, you are invariably required to be on at least one on-call rotation. This can mean hours, days or weeks of just being there to attend to the dreaded pager when it starts ringing. And sometimes, even weekends and holidays. A lucky few get little to no pages while on-call but let’s face it, that’s a rarity.

Even when you don’t get paged, just the anticipation of the pager going off at random hours of the night can be draining.

It’s unfortunate that an alarming number of people consider being on-call the single most dreadful part of their job. 

On-call shouldn’t suck, but there is no getting rid of it. There is no such thing as a perfect system: systems will fail, and humans going on-call will always be part of keeping them reliable. What helps is giving on-call the kind of attention it deserves. Your on-call function is a direct reflection of your engineering practices and company culture, so if your team dreads it, you have some serious issues to fix.

Making on-call better is an organizational goal, not just the engineering team’s. A lot can be done to improve your on-call function, and this post briefly outlines some of the steps you can take.

1. Setting the right expectations for on-call from the start

There are a lot of misconceptions around being on-call, and it is important to establish its real objective. A few common misconceptions are:

  • “I need to know everything before I go on-call”: Not true. You don’t need to know everything; like everything else, on-call is a learning process.
  • “I need to find a long-term fix”: On-call is focused on quick fixes and immediate mitigation. Long-term resolutions can and should be arrived at collaboratively, after the incident.
  • “I need to do this alone”: It is daunting even to think you may need to do all of this alone, and you’re most certainly not expected to. Asking for help should be an integral part of the process. Collaboration is everything.

In some cases, people assume they’d be a “hero” if they found the long-term fix for an issue and make this an implicit goal, however long it takes. That makes things harder for the rest of the team, for whom on-call may already be stressful. Recognizing employee effort and outcomes is important, but it shouldn’t come at the cost of setting a bar the rest of the team can’t live up to.

It is important to set expectations straight for your engineers going on-call, and to ensure you have effective on-call schedules and rotations across the team to avoid pitfalls like the “hero” example above.
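A rotation can be as simple as a round-robin over the team, with a secondary on-caller as backup. A minimal sketch (the helper and all names are hypothetical, just to illustrate the idea):

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Round-robin weekly schedule: each engineer takes primary in turn,
    with the next person in line as secondary backup."""
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

for week_start, primary, secondary in weekly_rotation(
        ["asha", "ben", "chen", "divya"], date(2020, 2, 3), 4):
    print(week_start, "primary:", primary, "secondary:", secondary)
```

Real scheduling tools add overrides, time zones and holiday handling on top, but the core fairness property is just this: everyone cycles through primary and secondary at the same cadence.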

2. Create an on-boarding plan

It really helps if you feel you know enough to get started with on-call, as opposed to being blindly thrown into it. The famous 40/70 rule for informed decision making, often attributed to Colin Powell, forms the basis of this.

The idea is that you should act once you have between 40% and 70% of the information you need to make the right decision. Below 40%, you’re playing blind man’s buff; waiting for more than 70%, you’ve probably lost a lot of time in the resolution process.

An on-boarding program that helps you figure out how to get that 40–70% of the information you need can go a long way. Every good on-boarding plan should ultimately help you to:

i) Understand the system and its components:

  • Provide an overview of each system, its owners, architecture and components. When an incident hits, this information will help you understand the impact it may have and prepare accordingly, as well as know who to call on for more help.
  • Understand the dependencies each component has on others. This will help in the investigation/triage phase and get you to the RCA faster.
  • The fastest way to understand your system behaviour from an on-call perspective is to go through past incidents and see how they were resolved. You can also access your knowledge base and runbooks to understand the typical resolution processes in place.
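Those dependency relationships are worth writing down in machine-readable form, because blast-radius questions ("postgres is down, what else breaks?") are just graph traversals. A sketch under an assumed, hypothetical component map:

```python
# Hypothetical map: component -> components that depend directly on it
DEPENDENTS = {
    "postgres": ["billing-api", "auth-api"],
    "auth-api": ["web-frontend"],
    "billing-api": ["web-frontend"],
    "web-frontend": [],
}

def impacted_by(component, graph=DEPENDENTS):
    """Walk downstream through the graph to find everything a failure
    in `component` can take with it (transitive dependents)."""
    seen, stack = set(), [component]
    while stack:
        current = stack.pop()
        for dependent in graph.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return sorted(seen)

print(impacted_by("postgres"))  # ['auth-api', 'billing-api', 'web-frontend']
```

During triage, the same map read in reverse tells you which upstream components to suspect when a service degrades.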

ii) Know what tools you may need:

Understanding the tools of the trade is half the trade itself. This also opens up opportunities to improve your on-call process. More often than not, tooling is a crucial area where broader inefficiencies can be addressed.

It becomes important to build and maintain open documentation that outlines: 

  • The observability tools that are currently used in the organization as well as the metrics and events being tracked by them
  • Any logging sources and visualizations to understand log data 
  • Tools typically used for tracing and storing traces
  • The incident management tool in use
  • Other related resources and solutions for on-call management, status pages, CI/CD, ChatOps, ticket management, customer communication, etc.


iii) Train for on-call:

Ensure engineers know what’s coming when they go on-call and feel prepared for it. Training not only informs your teams but also helps them understand the cultural nuances of how the on-call process works.

A few ways to do this would be: 

  • Make shadowing a part of the process. This way, new on-callers get to see what goes into resolving an incident and the kind of resources needed, so that when the pager rings they can handle it with far less stress.
  • Assign someone the role of scribe, responsible for documenting the incident in preparation for RCAs, reviews and status updates.
  • Have newcomers take on low-severity incidents first. This provides hands-on experience with all the tools available, so they don’t panic or feel lost when they’re actually on-call.
  • Create high-severity incidents in a simulated environment and have the team resolve them together, to prepare them for the worst.
  • Use scenario-based games like Wheel of Misfortune, customized for your organization, to make the whole learning process more fun.

iv) Use the knowledge base and contribute effectively to growing it:

Having a knowledge base is probably the most important part of the whole on-call journey; this cannot be stressed enough. Much of the stress of being on-call disappears when engineers know they can resolve whatever comes their way. This kind of confidence can be instilled simply by pointing to past incidents and showing how they were resolved.

To ensure that your team understands this, you can start with:

  • Going through existing runbooks to understand the resolution process 
  • Going over how to look at historical incidents, and what was done to resolve them
  • Explaining the importance of post-incident-reviews or post-mortems and the most common ways incident reviews are done in the organization
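Runbooks pay off fastest when they are structured enough to be searched during an incident, not just read afterwards. A minimal sketch of what one searchable entry might hold (the fields, service names and the matching helper are all hypothetical):

```python
# Hypothetical runbook entry: a minimal, searchable record
runbook = {
    "title": "High latency on billing-api",
    "symptoms": ["p99 latency alert", "timeouts in checkout"],
    "quick_checks": [
        "Check recent deploys to billing-api",
        "Check postgres connection pool saturation",
    ],
    "mitigation": "Roll back the latest deploy; scale the pool if saturated.",
    "escalate_to": "payments-team",
}

def matches(entry, alert_text):
    """Crude symptom match, to surface relevant runbooks during triage."""
    return any(s.lower() in alert_text.lower() for s in entry["symptoms"])

print(matches(runbook, "ALERT: p99 latency alert on billing-api"))  # True
```

Even this crude keyword match beats scrolling a wiki at 3 a.m.; the point is that every resolved incident should leave behind an entry in this shape.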

How we are making on-call not suck at Squadcast

At Squadcast, we are always brainstorming ways to improve on-call practices, and we try to incorporate them into our product to make them accessible and simple for anyone using it. Here are a few practices we follow to keep on-call a smooth experience.

  • We ensure that incidents with high severity or moderate-to-high impact get a postmortem report outlining the incident and the resolution process.
  • We use our knowledge base of post-mortems and historical incidents to get our new recruits started with on-call.
  • We use incident deduplication to ensure that we do not get inundated with alerts.
  • We use auto-generated incident tags to identify, classify and enrich incident context.
  • We use our virtual war rooms to call in SMEs and other members of the on-call team for help when needed. On-call is daunting, and your tool should let you collaborate instantly when needed.
  • We refer to our automated incident timelines while writing postmortem reports and when looking for similar past incidents.
  • We use our private status page, where all components (internal and external facing) and their dependencies are clearly mapped out. It also shows our incident history, implemented resolutions and subsequently published RCAs.
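Of these, deduplication is the easiest to reason about concretely: repeated alerts that share the same stable fields should fold into one open incident rather than page the on-caller again. A minimal in-memory sketch of the idea (not our actual implementation; the fields and store are hypothetical):

```python
import hashlib

open_incidents = {}  # fingerprint -> incident record (in-memory, illustrative)

def fingerprint(alert):
    """Dedup key built from stable fields only (timestamps/ids excluded)."""
    key = f"{alert['service']}|{alert['check']}|{alert['severity']}"
    return hashlib.sha256(key.encode()).hexdigest()

def ingest(alert):
    fp = fingerprint(alert)
    if fp in open_incidents:
        open_incidents[fp]["count"] += 1   # duplicate: suppress the page
        return open_incidents[fp]
    incident = {"alert": alert, "count": 1}
    open_incidents[fp] = incident          # first occurrence: page on-call
    return incident

a = {"service": "billing-api", "check": "latency", "severity": "high"}
ingest(a)
ingest(a)
print(len(open_incidents), open_incidents[fingerprint(a)]["count"])  # 1 2
```

The design choice that matters is which fields go into the fingerprint: too few and unrelated alerts collapse together; too many (e.g. a timestamp) and nothing ever deduplicates.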

Enjoyed this? We would love to hear from you! What do you struggle with as a DevOps/SRE? Do you have ideas on how on-call could be done better in your organization?
Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.
