🔥 Now Live: Our Latest Enterprise-Grade Feature - Live Call Routing!

Error Budgets and their Dependencies

Feb 3, 2021
Last Updated:
Feb 3, 2021
Share this post:
Error Budgets and their Dependencies

Does your team struggle with not having balanced error budget, that impacts your reliabilty & pace of innovation? Adam Hammond in this blog talks about error budget - accountable for planned & unplanned outages that your systems may encounter & how teams can calculate error budget efficiently.

Table of Contents:

    In our last few articles, we’ve discussed SLOs and how important picking them correctly can make or break for your application’s performance. Today we’re going to cover error budgets, which are used to account for planned and unplanned outages that your systems may encounter. In essence, error budgets exist to cover you when your systems fail and to allow time for upgrades and feature improvement. No system can be expected to be 100% performant, and even if it were, you need to have time available for maintenance. Activity like database major version upgrades can cause significant downtime when they occur. Error budgets allow you to plan ahead and put aside time for your team to manage their services while providing customers with lead time so that they can plan for the downstream impacts of your service going offline.

    Error Budget Calculator

    An introduction to service calculations

    There is an easy trap to fall into when it comes to determining your error budgets. Calculating your error budgets - as with everything in regards to process improvement - is a journey. Most people would usually say “well, my error budget is simply the left-over time once my SLO is taken away” and that formula for them might look like this:

    Error Budget = 100% - Service SLO

    However, this is incorrect and is “starting at the end”. This is definitely your aspirational error budget, but it doesn’t take into account your service’s current performance and what the current state of your service’s error budget is. The initial equation for your error budget is as follows:

    Error Budget = Projected Downtime + Projected Maintenance

    If you remember from our previous article on SLOs, we need to do a lot of research into understanding factors like how performant our customers expect our system to be, but another part of that is, understanding maintenance and existing application error rates. The projections will most likely track very closely to your past performance, unless your service’s performance has been widely variable in the past. When you first define your error budget, it is acceptable to baseline it against what your service can currently provide. If you can only deliver an SLO of 85%, there is no point promising 90%. However, once you have established your baseline error budget, you must never allow it to move below your starting point. Error budgets decrease, they do not increase. The first port of call for most organisations when implementing their error budget is to focus on maintenance as you usually get the best “bang for buck”; there are usually processes that can be improved or better software versions to be installed. This is where your SRE teams come in to help deliver streamlined, automated, and focused software pipelines that minimise application downtime. Move away from manual, labour intensive processes and single-click developer experiences to minimise intentional error budget usage.

    The point of error budgets it to allow you to focus on where your product improvement hours are spent. New features can be implemented if you have not utilised your error budget, consider service improvement if it is nearly consumed, and you absolutely must focus all resources on stabilizing your service if your error budget is in deficit. Ultimately, an error budget is designed to help you understand where you should focus your engineering resources to ensure your SLOs are met. The final stage of our error budget baseline is to compare it against the SLO that we intend to maintain for our service. We can do this by simply reverting to our calculation from the beginning:

    Expected Service SLO = 100% - Error Budget

    It is at this point, that you can determine the immediate direction you need to take in regards to service improvement. If your error budget is running higher than expected, you should focus on reducing it. Once you’ve completed your initial service improvements to bring your error budget into line (if any was required), you can then finally use the “simple” calculation to determine your error budgets:

    Error Budget = 100% - Service SLO

    The important thing to note is that things like customer expectations serve as minimums in terms of SLOs, so we don’t include them in our initial calculations. At the beginning of our error budget journey, we are understanding our current state and in a lot of cases, it is probably less than the desired target. Another key aspect to keep in mind is that if our SLO performance is ever less than the minimum, then we need to reduce our actual error budget via service improvement as soon as possible.

    Unified Incident Response Platform
    Try for free
    Seamlessly integrate On-Call Management, Incident Response and SRE Workflows for efficient operations.
    Automate Incident Response, minimize downtime and enhance your tech teams' productivity with our Unified Platform.
    Manage incidents anytime, anywhere with our native iOS and Android mobile apps.
    Try for free

    What is downtime, really?

    In our calculations, we separated our downtimes into two categories: unexpected and maintenance. To properly calculate our error budgets, we need definitions for what “downtime” is, in general, and then we also need to differentiate between the two categories. For our purposes, a suitable definition for downtime is “systems are not in a state to meet the required metric”. This specifically targets the SLO and it’s associated metric.

    We then further define our two categories, with “maintenance downtime” being “downtime caused by an intentional disruption due to system maintenance” and “unexpected downtime” simply being “all other downtime”. We differentiate between these two types of downtime not specifically to build the error budget, but to provide us with guidance on how we can improve them. For example, if we want to reduce maintenance we need process improvement, but if we want to reduce unexpected downtime we probably need to fix bugs or errors within our services. These categories provide strategic guidance on where we need to look for potential error budget savings when we need to deliver better service to our customers.

    Calculating our error budgets

    Now that we have all of our required definitions and formulas, now it’s a simple process to actually calculate our error budgets. In fact, a quick visit to our maintenance procedures and our metrics dashboard should suffice:

    • Determine our total downtime by retrieving our current monthly error rates from our metric dashboards.
    • Find out how much downtime is scheduled for our maintenance each month.
    • Calculate our unexpected downtime amount by subtracting scheduled downtime from actual error rates.

    Now we have our three metrics: total downtime, maintenance downtime, and unexpected downtime. Now, let’s return to Bill Palmer at Acme Interfaces, Inc for a practical look at how effective error budgets can be, and how we can use all of this information to calculate them appropriately.

    Integrated Reliability Automation Platform
    Platform
    PagerDuty
    FireHydrant
    Squadcast
    Incident Retrospectives
    APM, Monitoring, ITSM,Ticketing Integrations
    Incident
    Notes
    On Call Rotations
    Built-In Public and Private Status Page
    Advanced Error Budget Tracking
    Try For free
    Platform
    Incident Retrospectives
    APM, Monitoring, ITSM,Ticketing Integrations
    Incident
    Notes
    On Call Rotations
    Built-In Public and Private Status Page
    Advanced Error Budget Tracking
    PagerDuty
    FireHydrant
    Squadcast
    Try For free

    “Help, Bill! The system is too slow!”

    Bill Palmer sat at his desk, exasperated. Acme Interfaces had been putting off their database upgrade for years. He received an email from their cloud provider today, advising that the database would be upgraded forcibly if no action was taken in the next four months. Coming in at 15TBs and feeding into over 500 interfaces, their database was at the heart of the business. As part of the upgrade, everything would need to be tested along with the actual upgrade itself. Bill required hours for the upgrade, but it actually looked like Acme Interfaces was going over their error budget by a few minutes every month. Now that their cloud provider had forced their hand, something needed to be done.

    He pulled up an excel spreadsheet with service metrics and began looking for places for a quick win.

    Within a few minutes, he’d found what looked like the root of their error budget deficit. Looking at the error reporting for HTTP requests over the last year at Acme Interfaces, relatively simple requests were returning HTTP 50X errors at quite large volumes and for unknown reasons. He’d made a promise to Dan that he’d get the error rate lower than 10% to get the error budget back in surplus for the upgrade; it was time to get to work. He looked at the detailed statistics and noticed that about half of the errors were 503s and 504s, and the other half were 500s. He just didn’t understand how there could be so many transport errors.

    He picked up his phone and dialed the NOC.

    “*Ring, Ring…. Ring, Ring….* Hello, Acme NOC, this is Charlie.”

    “G’day Charlie, this is Bill, the CTO, do you have a few minutes to discuss some statistics I’m reviewing.”

    “Sure, Bill.”

    “Excellent. I’m just taking a look at our HTTP error codes  for the past year and for some reason we return a lot of bad gateway and service unavailable errors, do you know why that would be the case?”

    “Sure do, Bill. Our load balancer software is on a really old version. It’s got a bug, where it hits a memory leak and won’t be able to parse requests back from the backend servers. That’s what throws the 502s. After a few minutes, the server will restart but because it is our load balancer we can’t easily take it out of service so we return 503s. We used to have to manually restart the servers, but we implemented a script that checks for health and can reboot within a few minutes.”

    Bill paused for a few moments. “...Is there a reason why the infrastructure team hasn’t upgraded to a new version of the load balancer?”

    “Well, that’s the problem, we don’t really have anyone dedicated to the load balancers. They were setup up a few years ago as part of a project, and now the NOC just fixes them up when they go a bit crazy. The vendor has confirmed that the newer version of the software doesn’t have the bug but we just don’t have the expertise to manage that at the moment. We also restart them all at night which takes about an hour which would cause 503s.”

    “Okay, well thanks for the information, Charlie. I’ll see what we can do. *click*”

    Bill started to write up all the information he had gained from the phone call with Charlie.

    After he was done, he called Jenny.

    “Jenny, can you please do me a favour and find out how much a System Administration course for our Load Balancing software would be, please?”

    “Sure, is this about those HTTP errors?”

    “You know it!”

    Bill continued to look at the whiteboard, and just knew the fastest way to improve performance would be to bring the load balancer up to scratch, and get the NOC team up-skilled to handle these systems. They’ve been improving these systems in spite of not having any official training, so they definitely are great operators.

    Bill’s phone rang, “Bill, it’s Jenny. I just got off the phone with them and they said they could do a 20% discount on the training with a group larger than 10 people and that it would be $10,000 a head.”

    “Okay, get back to them and book in two sessions of 15 people each. I want the whole NOC to be up-skilled on the Load Balancer immediately. Draw up a project proposal for shift-left knowledge transfer from some of the application teams as well as SRE development for the NOC team. Their skills are wasted waiting for fires to break out, I know they can get this environment up to where it needs to be.”

    “Sounds good, I’ll get onto it now!”

    *****

    Bill surveyed the room, taking in the hundreds of leaders from across Acme Interfaces, as he prepared to talk about his team’s development over the last six months.

    “Hi everyone, I’m sure most of you know me by now, but I’m Bill, the new-ish CTO. Today I’m going to be talking about how we were able to eliminate a major barrier to our database upgrade by analysing and refining the error budgets for our HTTP requests.”

    “Six months ago, we were seeing an error rate on HTTP requests of up to 15% per month which was well above our expected error budget of 10%. About 5% of these were caused by application errors, but 8.5% of these errors were being seen at the load balancer and were due to availability issues. We wanted our error budget to be 10% or less request errors, but we were tracking 5% above that. We had to improve something if we wanted to meet that target.”

    “I got onto the NOC and spoke with Charlie who enlightened me to some issues we were having with our load balancer: it hadn’t been updated for a few years and a bug was causing all these errors. Further exacerbating the issue, no one with the skills to actually upgrade the load balancer worked at the company so that wasn’t an immediate option.”

    “Jenny got onto the vendor and arranged training for the entire NOC. Within three weeks they were all skilled up, then we began our project to upgrade the load balancers. With everyone skilled up, it only took us two weeks to upgrade all of the servers and we were able to do this during downtime that was previously reserved for maintenance (otherwise known as restarting servers due to the bug). We’ve also begun transitioning all of the existing NOC operators to new SRE-based roles that will allow them to assume greater responsibility for the improvement of our core infrastructure.”

    “Within two months of defining our current state error budget, we had used them to identify where our issues were coming from, resolved those issues, and, now we’ve been able to meet (and exceed) our target of less than 10% HTTP request errors. We’ve also used the experience to refine our NOC and give our staff greater responsibility.”

    “I’d heartily recommend everyone has a look at the internal error budgets that you are responsible for, as I am very sure that it can only have positive outcomes for the business. Thanks for attending my session, and I hope the rest of the retreat goes well.”

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

    squadcast
    What you should do now
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    What you should do now?
    Here are 3 ways you can continue your journey to learn more about Unified Incident Management
    Discover the platform's capabilities through our Interactive Demo.
    See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    Share the article
    Share this blog post on Facebook, Twitter, Reddit or LinkedIn.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare our plans and find the perfect fit for your business.
    See Redis' Journey to Efficient Incident Management through alert noise reduction With Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare Squadcast & PagerDuty / Opsgenie
    Compare and see if Squadcast is the right fit for your needs.
    Compare our plans and find the perfect fit for your business.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Discover the platform's capabilities through our Interactive Demo.
    Enjoyed the article? Explore further insights on the best SRE practices.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Enjoyed the article? Explore further insights on the best SRE practices.
    Written By:
    February 3, 2021
    February 3, 2021
    Share this post:
    Subscribe to our LinkedIn Newsletter to receive more educational content
    Subscribe now

    Subscribe to our latest updates

    Enter your Email Id
    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.
    FAQ
    More from
    Adam Hammond
    How small changes to your SLOs can be SMART for your business - A narrative case study
    How small changes to your SLOs can be SMART for your business - A narrative case study
    November 17, 2020
    Choosing SLOs that users need, not the ones you want to provide
    Choosing SLOs that users need, not the ones you want to provide
    October 1, 2020
    Keeping your teams and customers in the loop during downtime
    Keeping your teams and customers in the loop during downtime
    August 12, 2020

    Error Budgets and their Dependencies

    Error Budgets and their Dependencies
    Feb 3, 2021
    Last Updated:
    Feb 3, 2021

    Does your team struggle with not having balanced error budget, that impacts your reliabilty & pace of innovation? Adam Hammond in this blog talks about error budget - accountable for planned & unplanned outages that your systems may encounter & how teams can calculate error budget efficiently.

    In our last few articles, we’ve discussed SLOs and how important picking them correctly can make or break for your application’s performance. Today we’re going to cover error budgets, which are used to account for planned and unplanned outages that your systems may encounter. In essence, error budgets exist to cover you when your systems fail and to allow time for upgrades and feature improvement. No system can be expected to be 100% performant, and even if it were, you need to have time available for maintenance. Activity like database major version upgrades can cause significant downtime when they occur. Error budgets allow you to plan ahead and put aside time for your team to manage their services while providing customers with lead time so that they can plan for the downstream impacts of your service going offline.

    Error Budget Calculator

    An introduction to service calculations

    There is an easy trap to fall into when it comes to determining your error budgets. Calculating your error budgets - as with everything in regards to process improvement - is a journey. Most people would usually say “well, my error budget is simply the left-over time once my SLO is taken away” and that formula for them might look like this:

    Error Budget = 100% - Service SLO

    However, this is incorrect and is “starting at the end”. This is definitely your aspirational error budget, but it doesn’t take into account your service’s current performance and what the current state of your service’s error budget is. The initial equation for your error budget is as follows:

    Error Budget = Projected Downtime + Projected Maintenance

    If you remember from our previous article on SLOs, we need to do a lot of research into understanding factors like how performant our customers expect our system to be, but another part of that is, understanding maintenance and existing application error rates. The projections will most likely track very closely to your past performance, unless your service’s performance has been widely variable in the past. When you first define your error budget, it is acceptable to baseline it against what your service can currently provide. If you can only deliver an SLO of 85%, there is no point promising 90%. However, once you have established your baseline error budget, you must never allow it to move below your starting point. Error budgets decrease, they do not increase. The first port of call for most organisations when implementing their error budget is to focus on maintenance as you usually get the best “bang for buck”; there are usually processes that can be improved or better software versions to be installed. This is where your SRE teams come in to help deliver streamlined, automated, and focused software pipelines that minimise application downtime. Move away from manual, labour intensive processes and single-click developer experiences to minimise intentional error budget usage.

    The point of error budgets it to allow you to focus on where your product improvement hours are spent. New features can be implemented if you have not utilised your error budget, consider service improvement if it is nearly consumed, and you absolutely must focus all resources on stabilizing your service if your error budget is in deficit. Ultimately, an error budget is designed to help you understand where you should focus your engineering resources to ensure your SLOs are met. The final stage of our error budget baseline is to compare it against the SLO that we intend to maintain for our service. We can do this by simply reverting to our calculation from the beginning:

    Expected Service SLO = 100% - Error Budget

    It is at this point, that you can determine the immediate direction you need to take in regards to service improvement. If your error budget is running higher than expected, you should focus on reducing it. Once you’ve completed your initial service improvements to bring your error budget into line (if any was required), you can then finally use the “simple” calculation to determine your error budgets:

    Error Budget = 100% - Service SLO

    The important thing to note is that things like customer expectations serve as minimums in terms of SLOs, so we don’t include them in our initial calculations. At the beginning of our error budget journey, we are understanding our current state and in a lot of cases, it is probably less than the desired target. Another key aspect to keep in mind is that if our SLO performance is ever less than the minimum, then we need to reduce our actual error budget via service improvement as soon as possible.

    Unified Incident Response Platform
    Try for free
    Seamlessly integrate On-Call Management, Incident Response and SRE Workflows for efficient operations.
    Automate Incident Response, minimize downtime and enhance your tech teams' productivity with our Unified Platform.
    Manage incidents anytime, anywhere with our native iOS and Android mobile apps.
    Try for free

    What is downtime, really?

    In our calculations, we separated our downtimes into two categories: unexpected and maintenance. To properly calculate our error budgets, we need definitions for what “downtime” is, in general, and then we also need to differentiate between the two categories. For our purposes, a suitable definition for downtime is “systems are not in a state to meet the required metric”. This specifically targets the SLO and it’s associated metric.

    We then further define our two categories, with “maintenance downtime” being “downtime caused by an intentional disruption due to system maintenance” and “unexpected downtime” simply being “all other downtime”. We differentiate between these two types of downtime not specifically to build the error budget, but to provide us with guidance on how we can improve them. For example, if we want to reduce maintenance we need process improvement, but if we want to reduce unexpected downtime we probably need to fix bugs or errors within our services. These categories provide strategic guidance on where we need to look for potential error budget savings when we need to deliver better service to our customers.

    Calculating our error budgets

    Now that we have all of our required definitions and formulas, now it’s a simple process to actually calculate our error budgets. In fact, a quick visit to our maintenance procedures and our metrics dashboard should suffice:

    • Determine our total downtime by retrieving our current monthly error rates from our metric dashboards.
    • Find out how much downtime is scheduled for our maintenance each month.
    • Calculate our unexpected downtime amount by subtracting scheduled downtime from actual error rates.

    Now we have our three metrics: total downtime, maintenance downtime, and unexpected downtime. Now, let’s return to Bill Palmer at Acme Interfaces, Inc for a practical look at how effective error budgets can be, and how we can use all of this information to calculate them appropriately.

    Integrated Reliability Automation Platform
    Platform
    PagerDuty
    FireHydrant
    Squadcast
    Incident Retrospectives
    APM, Monitoring, ITSM,Ticketing Integrations
    Incident
    Notes
    On Call Rotations
    Built-In Public and Private Status Page
    Advanced Error Budget Tracking
    Try For free
    Platform
    Incident Retrospectives
    APM, Monitoring, ITSM,Ticketing Integrations
    Incident
    Notes
    On Call Rotations
    Built-In Public and Private Status Page
    Advanced Error Budget Tracking
    PagerDuty
    FireHydrant
    Squadcast
    Try For free

    “Help, Bill! The system is too slow!”

    Bill Palmer sat at his desk, exasperated. Acme Interfaces had been putting off their database upgrade for years. He received an email from their cloud provider today, advising that the database would be upgraded forcibly if no action was taken in the next four months. Coming in at 15TBs and feeding into over 500 interfaces, their database was at the heart of the business. As part of the upgrade, everything would need to be tested along with the actual upgrade itself. Bill required hours for the upgrade, but it actually looked like Acme Interfaces was going over their error budget by a few minutes every month. Now that their cloud provider had forced their hand, something needed to be done.

    He pulled up an excel spreadsheet with service metrics and began looking for places for a quick win.

    Within a few minutes, he’d found what looked like the root of their error budget deficit. Looking at the error reporting for HTTP requests over the last year at Acme Interfaces, relatively simple requests were returning HTTP 50X errors at quite large volumes and for unknown reasons. He’d made a promise to Dan that he’d get the error rate lower than 10% to get the error budget back in surplus for the upgrade; it was time to get to work. He looked at the detailed statistics and noticed that about half of the errors were 503s and 504s, and the other half were 500s. He just didn’t understand how there could be so many transport errors.

    He picked up his phone and dialed the NOC.

    “*Ring, Ring…. Ring, Ring….* Hello, Acme NOC, this is Charlie.”

    “G’day Charlie, this is Bill, the CTO, do you have a few minutes to discuss some statistics I’m reviewing.”

    “Sure, Bill.”

    “Excellent. I’m just taking a look at our HTTP error codes  for the past year and for some reason we return a lot of bad gateway and service unavailable errors, do you know why that would be the case?”

    “Sure do, Bill. Our load balancer software is on a really old version. It’s got a bug, where it hits a memory leak and won’t be able to parse requests back from the backend servers. That’s what throws the 502s. After a few minutes, the server will restart but because it is our load balancer we can’t easily take it out of service so we return 503s. We used to have to manually restart the servers, but we implemented a script that checks for health and can reboot within a few minutes.”

    Bill paused for a few moments. “...Is there a reason why the infrastructure team hasn’t upgraded to a new version of the load balancer?”

    “Well, that’s the problem, we don’t really have anyone dedicated to the load balancers. They were setup up a few years ago as part of a project, and now the NOC just fixes them up when they go a bit crazy. The vendor has confirmed that the newer version of the software doesn’t have the bug but we just don’t have the expertise to manage that at the moment. We also restart them all at night which takes about an hour which would cause 503s.”

    “Okay, well thanks for the information, Charlie. I’ll see what we can do. *click*”

    Bill started to write up all the information he had gained from the phone call with Charlie.

    After he was done, he called Jenny.

    “Jenny, can you please do me a favour and find out how much a System Administration course for our Load Balancing software would be, please?”

    “Sure, is this about those HTTP errors?”

    “You know it!”

    Bill continued to look at the whiteboard, and just knew the fastest way to improve performance would be to bring the load balancer up to scratch, and get the NOC team up-skilled to handle these systems. They’ve been improving these systems in spite of not having any official training, so they definitely are great operators.

    Bill’s phone rang, “Bill, it’s Jenny. I just got off the phone with them and they said they could do a 20% discount on the training with a group larger than 10 people and that it would be $10,000 a head.”

    “Okay, get back to them and book in two sessions of 15 people each. I want the whole NOC to be up-skilled on the Load Balancer immediately. Draw up a project proposal for shift-left knowledge transfer from some of the application teams as well as SRE development for the NOC team. Their skills are wasted waiting for fires to break out, I know they can get this environment up to where it needs to be.”

    “Sounds good, I’ll get onto it now!”

    *****

    Bill surveyed the room, taking in the hundreds of leaders from across Acme Interfaces, as he prepared to talk about his team’s development over the last six months.

    “Hi everyone, I’m sure most of you know me by now, but I’m Bill, the new-ish CTO. Today I’m going to be talking about how we were able to eliminate a major barrier to our database upgrade by analysing and refining the error budgets for our HTTP requests.”

    “Six months ago, we were seeing an error rate on HTTP requests of up to 15% per month which was well above our expected error budget of 10%. About 5% of these were caused by application errors, but 8.5% of these errors were being seen at the load balancer and were due to availability issues. We wanted our error budget to be 10% or less request errors, but we were tracking 5% above that. We had to improve something if we wanted to meet that target.”

    “I got onto the NOC and spoke with Charlie who enlightened me to some issues we were having with our load balancer: it hadn’t been updated for a few years and a bug was causing all these errors. Further exacerbating the issue, no one with the skills to actually upgrade the load balancer worked at the company so that wasn’t an immediate option.”

    “Jenny got onto the vendor and arranged training for the entire NOC. Within three weeks they were all skilled up, then we began our project to upgrade the load balancers. With everyone skilled up, it only took us two weeks to upgrade all of the servers and we were able to do this during downtime that was previously reserved for maintenance (otherwise known as restarting servers due to the bug). We’ve also begun transitioning all of the existing NOC operators to new SRE-based roles that will allow them to assume greater responsibility for the improvement of our core infrastructure.”

    “Within two months of defining our current state error budget, we had used them to identify where our issues were coming from, resolved those issues, and, now we’ve been able to meet (and exceed) our target of less than 10% HTTP request errors. We’ve also used the experience to refine our NOC and give our staff greater responsibility.”

    “I’d heartily recommend everyone has a look at the internal error budgets that you are responsible for, as I am very sure that it can only have positive outcomes for the business. Thanks for attending my session, and I hope the rest of the retreat goes well.”

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

    squadcast
    What you should do now
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    What you should do now?
    Here are 3 ways you can continue your journey to learn more about Unified Incident Management
    Discover the platform's capabilities through our Interactive Demo.
    See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    Share the article
    Share this blog post on Facebook, Twitter, Reddit or LinkedIn.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare our plans and find the perfect fit for your business.
    See Redis' Journey to Efficient Incident Management through alert noise reduction With Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare Squadcast & PagerDuty / Opsgenie
    Compare and see if Squadcast is the right fit for your needs.
    Compare our plans and find the perfect fit for your business.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Discover the platform's capabilities through our Interactive Demo.
    Enjoyed the article? Explore further insights on the best SRE practices.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Enjoyed the article? Explore further insights on the best SRE practices.
    Written By:
    February 3, 2021
    February 3, 2021
    Share this post:

    Subscribe to our latest updates

    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.
    In this blog:
      Subscribe to our LinkedIn Newsletter to receive more educational content
      Subscribe now
      FAQ
      Learn how organizations are using Squadcast
      to maintain and improve upon their Reliability metrics
      Learn how organizations are using Squadcast to maintain and improve upon their Reliability metrics
      mapgears
      "Mapgears simplified their complex On-call Alerting process with Squadcast.
      Squadcast has helped us aggregate alerts coming in from hundreds...
      bibam
      "Bibam found their best PagerDuty alternative in Squadcast.
      By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
      tanner
      "Squadcast helped Tanner gain system insights and boost team productivity.
      Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability...
      Alexandre Lessard
      System Analyst
      Martin do Santos
      Platform and Architecture Tech Lead
      Sandro Franchi
      CTO
      Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2022 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Mid-Market Asia Pacific Incident Management on G2 Users love Squadcast on G2
      Squadcast awarded as "Best Software" in the IT Management category by G2 🎉 Read full report here.
      What our
      customers
      have to say
      mapgears
      "Mapgears simplified their complex On-call Alerting process with Squadcast.
      Squadcast has helped us aggregate alerts coming in from hundreds of services into one single platform. We no longer have hundreds of...
      Alexandre Lessard
      System Analyst
      bibam
      "Bibam found their best PagerDuty alternative in Squadcast.
      By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
      Martin do Santos
      Platform and Architecture Tech Lead
      tanner
      "Squadcast helped Tanner gain system insights and boost team productivity.
      Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability metrics we have...
      Sandro Franchi
      CTO
      Revamp your Incident Response.
      Peak Reliability
      Easier, Faster, More Automated with SRE.
      Incident Response Mobility
      Manage incidents on the go with Squadcast mobile app for Android and iOS devices
      google playapple store
      Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2 Users love Squadcast on G2
      Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2
      Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2
      Users love Squadcast on G2
      Copyright © Squadcast Inc. 2017-2024