Things to do to make on-call less stressful

Jan 30, 2020
Last Updated:
May 2, 2024
A better, less stressful way to manage on-call that actually improves your incident response processes, uptime and reliability

    If you belong to a team that’s doing any DevOps, SRE, Operations or Development, you are invariably required to be on at least one on-call rotation. This can mean hours, days or weeks of just being there to attend to the dreaded pager when it starts ringing. And sometimes, even weekends and holidays. A lucky few get little to no pages while on-call but let’s face it, that’s a rarity.

    Even when you don’t get paged, just the anticipation of the pager going off at random hours of the night can be draining.

    It’s unfortunate that an alarming number of people consider being on-call the single most dreadful part of their job. 

    On-call shouldn’t suck, but there is no getting rid of it. There is no such thing as a perfect system. Systems will fail, and humans going on-call will always be part of keeping them reliable. What helps is giving on-call the attention it deserves. Your on-call function is a direct reflection of your engineering practices and company culture. So, if your team dreads it, you’ve got some serious issues to fix.

    Making on-call better is an organizational goal, not just the engineering team’s. A lot can be done to improve your on-call function, and this post briefly outlines some of the steps you can take.

    Setting the right expectations for on-call from the start

    There are a lot of misconceptions around being on-call, and it is important to establish its real objective. A few common misconceptions are:

    • “I need to know everything before I go on-call”: This is not true. You don’t need to know everything. Like everything else, it is a learning process.
    • “I need to find a long-term fix”: On-call is focused on quick fixes and immediate mitigation. Long-term resolutions can and should be arrived at collaboratively.
    • “I need to do this alone”: It is daunting even to think that you may need to do all of this alone, and you’re most certainly not expected to. You can ask for help, and it is important for this to be an integral part of the process. Collaboration is everything.

    In some cases, people assume that they’d be a “hero” if they found the long-term fix for an issue and tend to make this an implicit goal, however long it takes. This makes it hard for the rest of the team for whom on-call may already be very stressful. It is super important to recognize employee effort and outcomes but this shouldn’t come at the cost of making the process harder for the rest to live up to.

    It is important to set clear expectations for the engineers who are going on-call, and to have effective on-call schedules and rotations across your team, to avoid pitfalls like the “hero” example above.
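
    One way to keep a rotation fair, and to make sure nobody is on-call alone, is a simple round-robin with a primary and a secondary for every shift. The sketch below is only an illustration: the engineer names, the weekly shift length and the `build_rotation` helper are all assumptions, not the behaviour of any particular scheduling tool.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    Round-robin: the secondary for a given week becomes the primary
    the following week, so every shift has a backup to escalate to.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers so nobody is on-call alone")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

# Hypothetical team and start date, for illustration only.
rotation = build_rotation(["asha", "ben", "chen"], date(2020, 2, 3), 4)
for week_start, primary, secondary in rotation:
    print(week_start, "primary:", primary, "secondary:", secondary)
```

    Because each secondary has just watched a week of incidents before taking over as primary, the hand-off carries context with it.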

    Create an on-boarding plan

    It really helps to feel like you know enough to get started with on-call, as opposed to being thrown into it blindly. The well-known 40/70 rule for informed decision making forms the basis of this.

    The idea is that you should have between 40% and 70% of the information you need to make the right decision. With less than 40%, you’re playing blind man’s buff; by the time you have more than 70%, you’ve probably lost a lot of time in the resolution process.

    An on-boarding program that helps you figure out how to get that 40-70% of the information can go a long way. Every good on-boarding plan must ultimately help you to:

    i) Understand the system and its components:

    • Provide an overview of all the systems, their owners, architecture and components. When an incident hits, this information helps you understand its likely impact and prepare for it, as well as know who to call on for more help.
    • Understand the dependencies that each component has on others. This will help in the investigation/triage phase and get you to the RCA faster.
    • The fastest way to understand your system behaviour from an on-call perspective is to go through past incidents and see how they were resolved. You can also access your knowledge base and runbooks to understand the typical resolution processes in place.
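
    Component dependencies are easier to reason about during an incident if they are written down as data. As a rough sketch, assuming a hypothetical set of services and a made-up `impacted_by` helper, a dependency map can answer the question "if this component fails, what else is affected?":

```python
from collections import deque

# Hypothetical map: each service lists the components it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
    "reports": ["db"],
}

def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: component -> services that depend on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)

print(impacted_by("auth", DEPENDS_ON))  # ['api', 'web']
```

    The same map, read in the other direction, tells the on-call engineer which upstream components to check first when a service misbehaves.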

    ii) Know what tools you may need:

    Understanding the tools of the trade is half the trade itself. This also opens up opportunities to improve your on-call process. More often than not, tooling is a crucial area where broader inefficiencies can be addressed.

    It becomes important to build and maintain open documentation that outlines: 

    • The observability tools that are currently used in the organization as well as the metrics and events being tracked by them
    • Any logging sources and visualizations to understand log data 
    • Tools typically used for tracing and storing traces
    • The incident management tool
    • Other related resources and solutions for on-call management, status pages, CI/CD, ChatOps, ticket management, customer communication, etc.

    Here are some really awesome resources that can help you take the right call in terms of tooling:

    iii) Train for on-call:

    Ensure that you know what’s coming when going on-call and feel fully prepared for it. Training is not only informational; it also helps your teams understand the cultural nuances of how the on-call process works.

    A few ways to do this would be: 

    • Make shadowing a part of the process. This way, new engineers get to understand what goes into resolving an incident and the kind of resources needed. That pager can ring anytime, and they should be able to handle it with absolutely no stress.
    • Assign someone the role of a scribe, who is responsible for documenting the incident so the team is prepared for RCAs, reviews and status updates.
    • Taking on low-severity incidents is a good way to get prepared. It also provides hands-on experience with all the tools available, so you don’t panic and feel lost when you’re actually on-call.
    • You can create high-severity incidents in a simulated environment and get the team to resolve them together to prepare for the worst.
    • You can also use scenario-based games like Wheel-of-Misfortune and customize them for your organization. This will make the whole learning process more fun.
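
    A scenario-based drill does not need special tooling to get started. The snippet below is a minimal sketch of the "spin the wheel" step only; the scenarios, the trainee names and the `spin_wheel` helper are hypothetical placeholders, and real scenarios are best drawn from your own past incidents.

```python
import random

# Hypothetical scenarios; replace with ones drawn from your own past incidents.
SCENARIOS = [
    "Primary database is out of disk space",
    "TLS certificate expired on the public API",
    "Deploy caused a spike in 5xx errors",
    "Message queue backlog is growing",
]

def spin_wheel(scenarios, trainees, seed=None):
    """Pick one scenario and one trainee to lead the drill."""
    rng = random.Random(seed)  # passing a seed makes a drill repeatable
    return rng.choice(scenarios), rng.choice(trainees)

scenario, lead = spin_wheel(SCENARIOS, ["asha", "ben", "chen"])
print(f"{lead} leads the drill: {scenario}")
```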

    iv) Use the knowledge base and contribute effectively to growing it:

    Having a knowledge base is probably the most important part of the whole on-call journey. This cannot be stressed enough. A lot of the stress involved with being on-call can be reduced if engineers know they can resolve anything that comes their way. This kind of confidence can be instilled simply by pointing at your past incidents and showing how they were resolved.

    To ensure that your team understands this, you can start with:

    • Going through existing runbooks to understand the resolution process 
    • Going over how to look at historical incidents, and what was done to resolve them
    • Explaining the importance of post-incident reviews or postmortems and the most common ways incident reviews are done in the organization

    How we are making on-call not suck at Squadcast

    At Squadcast, we are always brainstorming ways to improve on-call practices. We also try to incorporate them into our product to make them accessible and simple for anyone using it. Here are a few practices we follow to make on-call a smooth experience.

    • We ensure that every incident with high severity, or with moderate to high impact, has a postmortem report outlining the incident and the resolution process.
    • We use our knowledge base of postmortems and historical incidents to get our new recruits started with on-call.
    • We use incident deduplication to ensure that we do not get inundated with alerts.
    • We use auto-generated incident tags to identify, classify and enrich incident context.
    • We use our virtual war rooms to call in SMEs and other members of the on-call team for help when needed. We understand that on-call is daunting, and your tool should let you collaborate instantly when needed.
    • We use our automated incident timelines as a reference while writing our postmortem reports and to understand similar incidents, if any.
    • We use our private status page, where all the components (internal and external facing) and their dependencies are clearly mapped out. It also shows our incident history, the resolutions implemented and the RCAs subsequently published.
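
    To make the idea of incident deduplication concrete, here is a generic sketch of one common approach: alerts that share a key and arrive within a time window are merged into one incident. This is not Squadcast's implementation; the `(service, check)` key, the 10-minute window and the `deduplicate` helper are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta

def deduplicate(alerts, window=timedelta(minutes=10)):
    """Group alerts that share a (service, check) key within `window`.

    `alerts` is an iterable of (timestamp, service, check) tuples.
    Returns a list of incidents, each with a count of merged alerts.
    """
    incidents = []
    latest = {}  # key -> index of the most recent incident for that key
    for timestamp, service, check in sorted(alerts):
        key = (service, check)
        index = latest.get(key)
        if index is not None and timestamp - incidents[index]["last_seen"] <= window:
            # Same problem seen again recently: merge instead of paging again.
            incidents[index]["count"] += 1
            incidents[index]["last_seen"] = timestamp
        else:
            latest[key] = len(incidents)
            incidents.append({"key": key, "first_seen": timestamp,
                              "last_seen": timestamp, "count": 1})
    return incidents

# Hypothetical alert stream, for illustration only.
t0 = datetime(2020, 1, 30, 2, 0)
alerts = [
    (t0, "api", "latency"),
    (t0 + timedelta(minutes=2), "api", "latency"),
    (t0 + timedelta(minutes=3), "db", "disk"),
    (t0 + timedelta(minutes=30), "api", "latency"),
]
for incident in deduplicate(alerts):
    print(incident["key"], incident["count"])
```

    In this example four alerts become three incidents: the two latency alerts two minutes apart are merged, while the one that arrives half an hour later opens a new incident.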

    Access our Free On-call Onboarding Checklist here. Please note that you can choose to use this directly or tweak it to fit your current processes and needs.

    Enjoyed this? We would love to hear from you! What do you struggle with as a DevOps/SRE? Do you have ideas on how on-call could be done better in your organization?
    Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.

    Written By:
    Amiya Adwitiya
    Prakya Vasudevan