Does your team struggle to balance its error budget, hurting both reliability and the pace of innovation? In this blog, Adam Hammond explains error budgets, which account for the planned and unplanned outages your systems may encounter, and how teams can calculate them efficiently.
In our last few articles, we’ve discussed SLOs and how picking them correctly can make or break your application’s performance. Today we’re going to cover error budgets, which are used to account for planned and unplanned outages that your systems may encounter. In essence, error budgets exist to cover you when your systems fail and to allow time for upgrades and feature improvement. No system can be expected to be 100% performant, and even if it were, you would still need time available for maintenance. Activities like major database version upgrades can cause significant downtime when they occur. Error budgets allow you to plan ahead and put aside time for your team to manage their services while providing customers with lead time so that they can plan for the downstream impacts of your service going offline.
There is an easy trap to fall into when it comes to determining your error budgets. Calculating your error budgets - as with everything in regards to process improvement - is a journey. Most people would usually say “well, my error budget is simply the left-over time once my SLO is taken away” and that formula for them might look like this:
Error Budget = 100% - Service SLO
However, this is incorrect and is “starting at the end”. This is definitely your aspirational error budget, but it doesn’t take into account your service’s current performance and what the current state of your service’s error budget is. The initial equation for your error budget is as follows:
Error Budget = Projected Downtime + Projected Maintenance
If you remember from our previous article on SLOs, we need to do a lot of research into factors like how performant our customers expect our system to be, but another part of that is understanding maintenance and existing application error rates. The projections will most likely track very closely to your past performance, unless your service’s performance has been widely variable in the past. When you first define your error budget, it is acceptable to baseline it against what your service can currently provide. If you can only deliver an SLO of 85%, there is no point promising 90%. However, once you have established your baseline error budget, you must never allow it to grow beyond your starting point. Error budgets decrease; they do not increase. The first port of call for most organisations when implementing their error budget is to focus on maintenance, as you usually get the best “bang for buck”; there are usually processes that can be improved or better software versions to be installed. This is where your SRE teams come in to help deliver streamlined, automated, and focused software pipelines that minimise application downtime. Move away from manual, labour-intensive processes and towards single-click developer experiences to minimise intentional error budget usage.
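The baseline formula above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample figures, and the choice to project next period's downtime as the simple average of past periods are all assumptions, not something prescribed by this article.

```python
from statistics import mean

def baseline_error_budget(unexpected_minutes, maintenance_minutes, period_minutes):
    """Error Budget = Projected Downtime + Projected Maintenance.

    Assumption: each projection is the mean of the past periods supplied.
    Returns the budget in minutes and as a percentage of the period.
    """
    projected_downtime = mean(unexpected_minutes)
    projected_maintenance = mean(maintenance_minutes)
    budget_minutes = projected_downtime + projected_maintenance
    budget_percent = 100 * budget_minutes / period_minutes
    return budget_minutes, budget_percent

# Hypothetical history: three 30-day months, in minutes of downtime.
PERIOD_MINUTES = 30 * 24 * 60
minutes, percent = baseline_error_budget([120, 90, 150], [60, 60, 60], PERIOD_MINUTES)
print(f"Baseline error budget: {minutes:.0f} min ({percent:.2f}% of the period)")
```

If your past performance has been widely variable, a plain average will understate the bad months, so treat the result as a starting point rather than a promise.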
The point of error budgets is to allow you to focus on where your product improvement hours are spent. You can implement new features if you have not used up your error budget, you should consider service improvement if it is nearly consumed, and you absolutely must focus all resources on stabilizing your service if your error budget is in deficit. Ultimately, an error budget is designed to help you understand where you should focus your engineering resources to ensure your SLOs are met. The final stage of our error budget baseline is to compare it against the SLO that we intend to maintain for our service. We can do this by simply reverting to our calculation from the beginning:
Expected Service SLO = 100% - Error Budget
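As a rough sketch of that comparison, the snippet below reuses the 85% and 90% figures mentioned earlier. The function names are hypothetical, and all values are percentages.

```python
def expected_slo(error_budget_percent):
    """Expected Service SLO = 100% - Error Budget."""
    return 100.0 - error_budget_percent

def budget_gap(error_budget_percent, target_slo_percent):
    """Percentage points of error budget to eliminate to reach the target SLO.

    A positive result means the service currently falls short of the target.
    """
    return target_slo_percent - expected_slo(error_budget_percent)

# A baseline error budget of 15% gives an expected SLO of 85%.
print(expected_slo(15.0))      # 85.0
# Against a 90% target, 5 points of error budget have to go.
print(budget_gap(15.0, 90.0))  # 5.0
```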
It is at this point that you can determine the immediate direction you need to take with regard to service improvement. If your error budget is running higher than expected, you should focus on reducing it. Once you’ve completed your initial service improvements to bring your error budget into line (if any were required), you can then finally use the “simple” calculation to determine your error budgets:
Error Budget = 100% - Service SLO
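To make the "simple" calculation concrete, it helps to turn the percentage into time. The sketch below assumes a time-based SLO measured over a fixed window; the function name and the 30-day window are illustrative choices.

```python
from datetime import timedelta

def allowed_downtime(slo_percent, period):
    """Error Budget = 100% - Service SLO, expressed as time within `period`."""
    budget_fraction = (100 - slo_percent) / 100
    return timedelta(seconds=period.total_seconds() * budget_fraction)

# How much downtime does each SLO leave over a 30-day window?
for slo in (90, 99, 99.9):
    print(f"{slo}% SLO -> {allowed_downtime(slo, timedelta(days=30))}")
```

A 99.9% SLO over 30 days leaves roughly 43 minutes for every outage and every maintenance window combined, which is why maintenance is usually the first place to look for savings.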
The important thing to note is that things like customer expectations serve as minimums in terms of SLOs, so we don’t include them in our initial calculations. At the beginning of our error budget journey, we are understanding our current state and in a lot of cases, it is probably less than the desired target. Another key aspect to keep in mind is that if our SLO performance is ever less than the minimum, then we need to reduce our actual error budget via service improvement as soon as possible.
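The three-way rule described earlier (new features, service improvement, or stabilisation) could be encoded as a small helper like the one below. The 80% threshold for "nearly consumed" is purely an assumption for illustration; pick whatever cut-off suits your team.

```python
def engineering_focus(budget, consumed, nearly_consumed_at=0.8):
    """Suggest where engineering time should go for the current period.

    `budget` and `consumed` must share a unit (minutes, failed requests, ...).
    `nearly_consumed_at` is a hypothetical threshold, not a standard value.
    """
    if consumed > budget:
        return "stabilise the service"  # error budget is in deficit
    if consumed >= nearly_consumed_at * budget:
        return "service improvement"    # error budget is nearly consumed
    return "new features"               # error budget still available

print(engineering_focus(budget=180, consumed=60))   # new features
print(engineering_focus(budget=180, consumed=150))  # service improvement
print(engineering_focus(budget=180, consumed=200))  # stabilise the service
```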
In our calculations, we separated our downtimes into two categories: unexpected and maintenance. To properly calculate our error budgets, we need definitions for what “downtime” is, in general, and then we also need to differentiate between the two categories. For our purposes, a suitable definition for downtime is “systems are not in a state to meet the required metric”. This specifically targets the SLO and its associated metric.
We then further define our two categories, with “maintenance downtime” being “downtime caused by an intentional disruption due to system maintenance” and “unexpected downtime” simply being “all other downtime”. We differentiate between these two types of downtime not specifically to build the error budget, but to provide us with guidance on how we can improve them. For example, if we want to reduce maintenance we need process improvement, but if we want to reduce unexpected downtime we probably need to fix bugs or errors within our services. These categories provide strategic guidance on where we need to look for potential error budget savings when we need to deliver better service to our customers.
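Under those definitions, splitting recorded outages into the two categories is mechanical. The sketch below assumes each outage record carries a duration and a flag saying whether it was an intentional maintenance disruption; the `Outage` type and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Outage:
    minutes: float
    planned_maintenance: bool  # True if caused by intentional system maintenance

def downtime_totals(outages):
    """Return total, maintenance, and unexpected downtime in minutes."""
    maintenance = sum(o.minutes for o in outages if o.planned_maintenance)
    # "Unexpected downtime" is simply all other downtime.
    unexpected = sum(o.minutes for o in outages if not o.planned_maintenance)
    return {
        "total": maintenance + unexpected,
        "maintenance": maintenance,
        "unexpected": unexpected,
    }

# Hypothetical month: one maintenance window and two incidents.
month = [Outage(60, True), Outage(25, False), Outage(40, False)]
print(downtime_totals(month))
```

Keeping the two figures separate shows at a glance whether process improvement or bug fixing is the more promising place to recover budget.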
Now that we have all of our required definitions and formulas, it’s a simple process to actually calculate our error budgets. In fact, a quick visit to our maintenance procedures and our metrics dashboard should suffice.
Now we have our three metrics: total downtime, maintenance downtime, and unexpected downtime. Now, let’s return to Bill Palmer at Acme Interfaces, Inc for a practical look at how effective error budgets can be, and how we can use all of this information to calculate them appropriately.
Bill Palmer sat at his desk, exasperated. Acme Interfaces had been putting off their database upgrade for years. He had received an email from their cloud provider that day, advising that the database would be upgraded forcibly if no action was taken in the next four months. Coming in at 15 TB and feeding into over 500 interfaces, their database was at the heart of the business. As part of the upgrade, everything would need to be tested along with the actual upgrade itself. Bill required hours for the upgrade, but it actually looked like Acme Interfaces was going over their error budget by a few minutes every month. Now that their cloud provider had forced their hand, something needed to be done.
He pulled up an Excel spreadsheet with service metrics and began looking for places for a quick win.
Within a few minutes, he’d found what looked like the root of their error budget deficit. Looking at the error reporting for HTTP requests over the last year at Acme Interfaces, relatively simple requests were returning HTTP 50X errors at quite large volumes and for unknown reasons. He’d made a promise to Dan that he’d get the error rate lower than 10% to get the error budget back in surplus for the upgrade; it was time to get to work. He looked at the detailed statistics and noticed that about half of the errors were 502s and 503s, and the other half were 500s. He just didn’t understand how there could be so many transport errors.
He picked up his phone and dialed the NOC.
“*Ring, Ring…. Ring, Ring….* Hello, Acme NOC, this is Charlie.”
“G’day Charlie, this is Bill, the CTO. Do you have a few minutes to discuss some statistics I’m reviewing?”
“Excellent. I’m just taking a look at our HTTP error codes for the past year and for some reason we return a lot of bad gateway and service unavailable errors, do you know why that would be the case?”
“Sure do, Bill. Our load balancer software is on a really old version. It’s got a bug, where it hits a memory leak and won’t be able to parse requests back from the backend servers. That’s what throws the 502s. After a few minutes, the server will restart but because it is our load balancer we can’t easily take it out of service so we return 503s. We used to have to manually restart the servers, but we implemented a script that checks for health and can reboot within a few minutes.”
Bill paused for a few moments. “...Is there a reason why the infrastructure team hasn’t upgraded to a new version of the load balancer?”
“Well, that’s the problem, we don’t really have anyone dedicated to the load balancers. They were set up a few years ago as part of a project, and now the NOC just fixes them up when they go a bit crazy. The vendor has confirmed that the newer version of the software doesn’t have the bug but we just don’t have the expertise to manage that at the moment. We also restart them all at night, which takes about an hour and causes 503s.”
“Okay, well thanks for the information, Charlie. I’ll see what we can do. *click*”
Bill started to write up all the information he had gained from the phone call with Charlie.
After he was done, he called Jenny.
“Jenny, can you please do me a favour and find out how much a System Administration course for our Load Balancing software would be, please?”
“Sure, is this about those HTTP errors?”
“You know it!”
Bill continued to look at the whiteboard, and just knew the fastest way to improve performance would be to bring the load balancer up to scratch, and get the NOC team up-skilled to handle these systems. They had been improving these systems in spite of not having any official training, so they were definitely great operators.
Bill’s phone rang. “Bill, it’s Jenny. I just got off the phone with them and they said they could do a 20% discount on the training with a group larger than 10 people and that it would be $10,000 a head.”
“Okay, get back to them and book in two sessions of 15 people each. I want the whole NOC to be up-skilled on the Load Balancer immediately. Draw up a project proposal for shift-left knowledge transfer from some of the application teams as well as SRE development for the NOC team. Their skills are wasted waiting for fires to break out, I know they can get this environment up to where it needs to be.”
“Sounds good, I’ll get onto it now!”
Bill surveyed the room, taking in the hundreds of leaders from across Acme Interfaces, as he prepared to talk about his team’s development over the last six months.
“Hi everyone, I’m sure most of you know me by now, but I’m Bill, the new-ish CTO. Today I’m going to be talking about how we were able to eliminate a major barrier to our database upgrade by analysing and refining the error budgets for our HTTP requests.”
“Six months ago, we were seeing an error rate on HTTP requests of up to 15% per month, which was well above our expected error budget of 10%. About 5% of requests were failing because of application errors, but 8.5% were failing at the load balancer due to availability issues. We wanted our error budget to be 10% or fewer request errors, but we were tracking up to 5 percentage points above that. We had to improve something if we wanted to meet that target.”
“I got onto the NOC and spoke with Charlie who enlightened me to some issues we were having with our load balancer: it hadn’t been updated for a few years and a bug was causing all these errors. Further exacerbating the issue, no one with the skills to actually upgrade the load balancer worked at the company so that wasn’t an immediate option.”
“Jenny got onto the vendor and arranged training for the entire NOC. Within three weeks they were all skilled up, then we began our project to upgrade the load balancers. With everyone skilled up, it only took us two weeks to upgrade all of the servers and we were able to do this during downtime that was previously reserved for maintenance (otherwise known as restarting servers due to the bug). We’ve also begun transitioning all of the existing NOC operators to new SRE-based roles that will allow them to assume greater responsibility for the improvement of our core infrastructure.”
“Within two months of defining our current state error budget, we had used it to identify where our issues were coming from and resolved those issues, and now we’ve been able to meet (and exceed) our target of less than 10% HTTP request errors. We’ve also used the experience to refine our NOC and give our staff greater responsibility.”
“I’d heartily recommend everyone has a look at the internal error budgets that you are responsible for, as I am very sure that it can only have positive outcomes for the business. Thanks for attending my session, and I hope the rest of the retreat goes well.”
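Bill's budget is request-based rather than time-based, but the same approach applies. The sketch below groups HTTP 5xx responses by where they originate and compares the total against a 10% budget. The request counts are invented for illustration and are not Acme's figures; the mapping of 502 and 503 to the load balancer follows Charlie's explanation in the story.

```python
LOAD_BALANCER_CODES = {502, 503}  # per Charlie's explanation in the story

def error_rates(status_counts, total_requests):
    """Return error rates (% of all requests), split by likely source."""
    rates = {"application": 0.0, "load_balancer": 0.0}
    for status, count in status_counts.items():
        if status < 500:
            continue  # only server-side errors count against this budget
        source = "load_balancer" if status in LOAD_BALANCER_CODES else "application"
        rates[source] += 100 * count / total_requests
    rates["total"] = rates["application"] + rates["load_balancer"]
    return rates

# Hypothetical month of traffic, keyed by HTTP status code.
counts = {200: 8800, 500: 500, 502: 400, 503: 300}
rates = error_rates(counts, sum(counts.values()))
ERROR_BUDGET_PERCENT = 10.0
print(rates)
print("over budget" if rates["total"] > ERROR_BUDGET_PERCENT else "within budget")
```

Splitting the rate by source is what points to the quick win: in this made-up data, fixing the load balancer alone would bring the total back under budget.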