How to run on-call in a way that is less stressful and actually improves your incident response processes, uptime and reliability
If you belong to a team that’s doing any DevOps, SRE, Operations or Development work, you are almost certainly on at least one on-call rotation. This can mean hours, days or weeks of being available to attend to the dreaded pager when it starts ringing, sometimes even on weekends and holidays. A lucky few get little to no pages while on-call but let’s face it, that’s a rarity.
Even when you don’t get paged, just the anticipation of the pager going off at random hours of the night can be draining.
It’s unfortunate that an alarming number of people consider being on-call the single most dreadful part of their job.
On-call shouldn’t suck, but there is no getting rid of it. There is no such thing as a perfect system. Systems will fail, and humans going on-call will always be a part of keeping systems reliable. What can help is giving on-call the kind of attention it deserves. Your on-call function is a direct reflection of your engineering practices and company culture. So, if your team dreads it, then you’ve got some serious issues to fix.
On-call shouldn't suck. And making it better is an organizational goal, not just the engineering team’s. A lot can be done to improve your on-call function, and this post briefly outlines some of the steps you can take.
There are a lot of misconceptions around being on-call and it is important to establish the real objective for on-call. A few common ones are:
In some cases, people assume that they’d be a “hero” if they found the long-term fix for an issue and tend to make this an implicit goal, however long it takes. This makes it hard for the rest of the team for whom on-call may already be very stressful. It is super important to recognize employee effort and outcomes but this shouldn’t come at the cost of making the process harder for the rest to live up to.
It is important to set clear expectations for the engineers who are going on-call and to ensure that you have effective on-call schedules and rotations across your team to avoid pitfalls like the above “hero” example.
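As a rough illustration of what a fair rotation can look like, here is a minimal Python sketch of a round-robin schedule. The function name `build_rotation`, the engineer names and the one-week shift length are all assumptions made for this example; they are not taken from any particular on-call tool.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, shifts, shift_days=7):
    """Return (shift_start, shift_end, engineer) tuples in round-robin order.

    Hypothetical helper: real schedules also need overrides, time zones
    and escalation layers, none of which are modelled here.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    schedule = []
    # cycle() repeats the team list so every engineer gets an equal share.
    for i, engineer in enumerate(islice(cycle(engineers), shifts)):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, engineer))
    return schedule


if __name__ == "__main__":
    # Placeholder names and dates, for illustration only.
    for begin, end, who in build_rotation(["asha", "ben", "chen"], date(2024, 1, 1), 6):
        print(f"{begin} to {end}: {who}")
```

Because the order simply cycles through the team, nobody ends up carrying more shifts than anyone else over a full cycle, which is one simple way to keep a single “hero” from absorbing the load.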
It really helps if you feel like you know enough to get started with on-call as opposed to being blindly thrown into it. The famous 40/70 rule for informed decision making, commonly attributed to Colin Powell, forms the basis of this.
The idea is that you need somewhere between 40% and 70% of the information to make the right decision. With anything below 40%, you’re playing blind man’s buff; if you wait until you have more than 70%, you’ve probably lost a lot of time in the resolution process.
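The rule can be expressed as a simple threshold check. The sketch below is only an illustration: the 40% and 70% cut-offs come from the rule itself, while the checklist items and the idea of measuring “information” as the share of answered questions are assumptions made for this example.

```python
def info_fraction(checklist):
    """Share of checklist questions the responder can already answer."""
    if not checklist:
        raise ValueError("checklist must not be empty")
    return sum(1 for known in checklist.values() if known) / len(checklist)


def decision_zone(fraction, low=0.40, high=0.70):
    """Classify a 0..1 information fraction against the 40/70 thresholds."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction < low:
        return "gather more information"
    if fraction <= high:
        return "act now"
    return "acting late"


# Hypothetical incident checklist: True means the answer is already known.
checklist = {
    "which service is affected": True,
    "when did it start": True,
    "what changed recently": True,
    "who owns the component": False,
    "what is the root cause": False,
}

print(decision_zone(info_fraction(checklist)))
```

With three of the five questions answered, the responder falls inside the 40-70% band and should start acting rather than keep investigating.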
An on-boarding program that helps you figure out how to get that 40-70% of the information you need can go a long way. Every good on-boarding plan must ultimately help you to:
i) Understand the system and its components:
ii) Know what tools you may need:
Understanding the tools of the trade is half the trade itself. This also opens up opportunities to improve your on-call process. More often than not, tooling is a crucial area where broader inefficiencies can be addressed.
It becomes important to build and maintain open documentation that outlines:
Here are some really awesome resources that can help you take the right call in terms of tooling:
iii) Train for on-call:
Ensure that you know what’s coming when going on-call and feel fully prepared for it. Getting your teams ready through training can not only be informational but also help them understand the cultural nuances of how the on-call process works.
A few ways to do this would be:
iv) Use the knowledge base and contribute effectively to growing it:
Having a knowledge base is probably the most important part of the whole on-call journey. This cannot be stressed enough. A lot of the stress involved with being on-call can be reduced if engineers know they can resolve anything that comes their way. This kind of confidence can be instilled by simply pointing at your past incidents and showing how they were resolved.
To ensure that your team understands this, you can start with:
At Squadcast, we are always brainstorming ways to improve on-call practices. We also try to incorporate them into our product to make them accessible and simple for anyone using it. Here are a few practices we follow to ensure on-call can be a very smooth experience.
Access our Free On-call Onboarding Checklist here. Please note that you can choose to use this directly or tweak it to fit your current processes and needs.
Enjoyed this? We would love to hear from you! What do you struggle with as a DevOps/SRE? Do you have ideas on how on-call could be done better in your organization?
Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.