To incentivize reliability in your platform, your team needs shared goals that measure and quantify the capabilities of your product or service along with the customer experience. Define the path to "Always-On" services by understanding a few key SRE fundamentals and their implications: SLIs, SLOs, and SLAs.
Framing SRE metrics for building or scaling a product is quite a daunting task.
In an SRE journey, embracing risks and resolving them with proper service-level metrics is widely regarded as the best way to achieve reliability.
In this blog, we explore the key differences between these basic site reliability metrics and their implications for building a sustainable and reliable product.
SRE practices are becoming more prevalent and more sought after. And the prerequisite for success in SRE is availability.
These acronyms (SLIs, SLOs, and SLAs) are the primary metrics of Site Reliability Engineering (SRE).
An SLI, as defined in Google’s SRE book, is “a carefully defined quantitative measure of some aspect of the level of service that is provided.”
SLIs are measurements of the characteristics of a service. They directly gauge the behaviours that have the greatest impact on the customer experience. The most common SLIs are the Four Golden Signals: latency, traffic, errors, and saturation.
Other variations are USE (Utilization, Saturation, and Errors) and RED (Rate, Errors, and Duration).
The formula used to calculate SLI is,
SLI = Good Events * 100 / Valid Events
If the value of SLI is 100, the performance of the system is ideal and if it drops to 0, the system is broken.
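The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function name `sli_percent` and the request counts are hypothetical, and "good" and "valid" events are assumed to have been counted already by whatever telemetry you use.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events * 100 / valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events * 100 / valid_events


# Hypothetical counts: 9,990 successful requests out of 10,000 valid ones.
availability = sli_percent(9_990, 10_000)
print(f"Availability SLI: {availability:.2f}%")
```

An SLI of 100 here means every valid event was good, and 0 means none were.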
An SLI is product (service) centric, which means it always revolves around measuring the capabilities or characteristics of a product or service.
SLOs are key threshold values for each SLI that quantify the availability and quality of service. They are an objective measure of your product’s reliability, or performance goals.
SLOs as explained in Google’s SRE workbook, “Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.”
These are numerical reliability or performance targets that a developer or an SRE should maintain while building and scaling a product. Any change to the product or service must keep it within these defined target values.
SLOs should be customer-centric: they should relate directly to the customer experience. The core purpose of SLOs is to quantify how reliable the product and its services are for the customer.
SLOs can also be used to drive other improvements. For example, you could set an SLO for backup duration if you wanted to maintain or improve it.
SLA is an agreement between the service provider and customer about service deliverables.
With an SLA, the consumer would have a clear idea about the proposed product or service in terms of functionality, reliability, and performance.
Google’s CRE Life Lessons series defines an SLA as follows: “An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid.”
An SLA is a vendor-user agreement and a customer-centric metric that defines the committed functionality, performance, and reliability of a product or service, as well as the penalty for non-compliance. It also helps establish transparency and trust between the company and its customers. If the company breaches the terms agreed in the SLA, it is liable to compensate its customers for the loss incurred.
If you are a business that sells any product or service to your customers and assures them about your product capabilities, then you should draft your SLI, SLO and SLA now!
These service-level metrics would help you gain customer trust and improve your system reliability and performance.
As a service provider/vendor, you should start by coming up with key performance indicators that measure your product's performance, which forms your SLIs. Remember these are a direct measure of your system’s behaviour in every stage of your business.
Secondly, you have to set target values for these indicators, which form your SLOs. This is a completely data-driven phase: you gather data from customer queries and stakeholders' expectations, extract the insights, and finalize the target or threshold values needed to achieve better reliability.
As the final step, you should create your SLA. Here you list the reliability values you commit to, which helps your customers understand your product's capabilities.
Thus, SLIs are the foundational blocks for building SLOs, which in turn underpin the overall reliability promised in your SLA.
Customer experience plays an important part in deciding key SRE metrics. SLOs are the focus points in deciding the assured system reliability the company would offer to the end-users.
Choosing the appropriate SLOs or target values is in itself a complex task!
Here are a few key practices that can help you in deciding the right SLOs.
The target value of a service level is always measured only by an SLI.
There is an intricate dependency between SLIs and SLOs, and it acts as a control when measuring and monitoring the entire system architecture. So, according to Google,
"A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound."
Lower bound SLOs ≤ SLI ≤ Upper bound SLOs
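The bound check above can be written as a small helper. This is a sketch under stated assumptions: the name `meets_slo` and the sample values are hypothetical, and either bound may be left out, since an availability SLO usually has only a lower bound while a latency SLO usually has only an upper bound.

```python
from typing import Optional


def meets_slo(sli: float,
              lower: Optional[float] = None,
              upper: Optional[float] = None) -> bool:
    """Return True when lower <= sli <= upper, skipping any bound left as None."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Hypothetical availability SLI (in %) checked against a lower-bound SLO.
print(meets_slo(99.97, lower=99.95))

# Hypothetical 99th-percentile latency (in ms) checked against an upper-bound SLO.
print(meets_slo(320.0, upper=300.0))
```

The first check passes because the SLI sits above its lower bound; the second fails because the latency exceeds its upper bound.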
What should you do while choosing the right SLOs?
Reliability values in SLA < Historical Average of your availability SLOs
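As a rough illustration of this rule, the check below compares a proposed SLA figure against the average of past availability numbers. It assumes the "historical average" is a plain mean of monthly availability percentages; the function name `sla_is_safe` and the sample figures are hypothetical.

```python
from statistics import mean
from typing import Sequence


def sla_is_safe(sla_percent: float, historical_availability: Sequence[float]) -> bool:
    """Return True when the SLA promise is strictly below the historical average."""
    if not historical_availability:
        raise ValueError("need at least one historical data point")
    return sla_percent < mean(historical_availability)


# Hypothetical monthly availability figures, in percent.
history = [99.97, 99.93, 99.98, 99.96]

print(sla_is_safe(99.9, history))   # promise below the average of the history
print(sla_is_safe(99.99, history))  # promise above the average of the history
```

Keeping the SLA below what you have actually achieved leaves a margin between your internal targets and the promise that carries a penalty.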
Google emphasises the importance of defining the objectives in practice:
“SLOs should specify how they’re measured and the conditions under which they’re valid.” An SLO target should never be 100%, but you can specify the conditions and the time window over which the assured reliability holds. For example, you can specify the SLO targets in the performance curve as,
This is where error budgets in SRE come in handy: an error budget is the rate at which SLOs can be missed. It provides a clear, objective metric that determines how unreliable a service is allowed to be over a specific period. It also helps to establish a balance between reliability and innovation. According to Google's SRE book, "An error budget is just an SLO for meeting other SLOs!"
Google's Motivation for Error Budgets defines an error budget as,
“the tool SRE uses to balance service reliability with the pace of innovation”
“the amount of error that your service can accumulate over a certain period before your users start being unhappy.”
An SLI is expressed as a percentage, and the objectives derived from SLIs are the SLOs. The error budget is then whatever remains when the SLO target is subtracted from 100%.
The formula for Error budget is,
Error budget = [100 - Internal Availability SLOs] (in %)
So, in the above example, if the internal availability SLO is 99.95%, then the corresponding error budget would be (100 - 99.95) = 0.05%. That is, up to 0.05% of valid events can fail without breaching the SLO.
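The arithmetic can be checked with a short sketch. The budget calculation follows the formula above; converting it into minutes of downtime is an added illustration that assumes a time-based availability SLO and a 30-day window. Both function names are hypothetical.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100 - internal availability SLO, in percent."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return 100 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Translate the budget into minutes of downtime over the window.

    Assumption: availability is measured by time, not by request count.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100


slo = 99.95
print(f"Error budget: {error_budget_percent(slo):.2f}%")
print(f"Allowed downtime in 30 days: {allowed_downtime_minutes(slo):.1f} minutes")
```

For a 99.95% SLO this works out to a budget of 0.05%, or roughly 21.6 minutes in a 30-day window.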
According to Google's SRE blog, you have to measure every service you offer with an availability SLO; without one, you cannot make informed decisions about making your systems more reliable. And the more reliability you promise, the more expensive the service becomes to operate. So, by quantifying your services with availability SLOs, you can choose either to allow greater momentum for product development (at the cost of some reliability) or to make your systems more reliable (at the cost of slower product development).
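One way to make this trade-off concrete is to track how much of the error budget is left and let that drive the decision. The sketch below is illustrative only: the function names, the event counts, and the simple "ship while budget remains" policy are assumptions, not a policy prescribed by Google or by this article.

```python
def remaining_budget_fraction(slo_percent: float,
                              good_events: int,
                              valid_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    allowed_bad = valid_events * (100 - slo_percent) / 100  # budget, in events
    actual_bad = valid_events - good_events                 # budget spent so far
    return 1 - actual_bad / allowed_bad


def release_decision(remaining_fraction: float) -> str:
    """Hypothetical policy: ship while budget remains, otherwise prioritise reliability."""
    return "ship features" if remaining_fraction > 0 else "focus on reliability"


# Hypothetical window: 1,000,000 valid requests, 600 failures, 99.9% SLO.
remaining = remaining_budget_fraction(99.9, 999_400, 1_000_000)
print(f"{remaining:.0%} of the error budget left: {release_decision(remaining)}")
```

With these numbers the budget allows about 1,000 failed requests, so 600 failures leave roughly 40% of it unspent and development can keep its pace.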
And to improve your services, you can build "deemed SLIs", or approximate SLIs, to measure how reliable your platform is for customers at a very granular level. These help you measure low-level outages and drive the operational response, which lets you refine your customers' expectations. This, in turn, helps you scale your product for a better customer experience.
Delivering product value solely depends on the performance and reliability of your services. Service level metrics act as a key tool to measure and quantify the capabilities of your product/service.
And, yes, it is necessary to define the path of how you are going to deliver the commitment towards "always-on" services. Appropriate SLOs and SLIs will help you define that path.
We hope that this article has helped you understand SLIs, SLOs, and SLA in a better way so that you can use them in improving your customer experience and overall product and service capabilities.