In this blog, Danny Mican, a Senior Site Reliability Engineer, outlines how to implement SLOs from scratch using the IIDARR process. He also emphasizes that SLOs must be actionable and backed by a feedback loop, since they play an important role in the debate of features vs. technical debt.
Succeeding with Service Level Objectives requires buy-in and coordination across multiple business units: Product, Engineering, and Site Reliability. Without a structured strategy and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, complete failure. This post describes a process to help organizations scale adoption of Service Level Objectives. It assumes familiarity with core Service Level Objective concepts, as defined in the Google SRE book.
Many resources cover the technical specifics of choosing SLIs and defining and measuring objectives. Few discuss strategies for incorporating Service Level Objectives into regular operations and decision making. It’s important to have an organizational process that governs SLOs throughout the entire product lifecycle and provides the foundation for incorporating them into regular operations.
The goal is to establish a process that allows teams to accurately understand their customers’ experience, empowers them to detect issues before customers report them in order to maintain service, and allows organizations to understand their historical performance and use it to inform the trade-off between addressing technical debt and feature velocity.
The following elements describe the phases that are essential to successfully implement Service Level Objectives:
Initial SLO adoption also benefits from centralizing the implementation in a single resource or tool:
Together, the elements form a complete process. None stands alone; treating them as individual steps risks partial implementation and, eventually, failure.
Identify: Defining what should be instrumented, and alerted on, has no inherent value on its own. A company could identify 100 important operations a day and be no better off, because there would be no way to automatically collect data around those operations and act on it. Ad-hoc research without building data around it provides one-off insights and no recurring value.
Instrument: Collecting data without understanding how it relates to a customer has very little value. This is often seen with infrastructure metrics. For example: is high resource usage an issue? It may deviate from a baseline, but only the customer’s experience provides the context to determine whether any individual resource’s usage is actually a problem.
Even if the data is passively available during incidents, it is hard to establish a baseline for normal behavior, understand whether the system is deviating from it, and tie it to the customer experience.
Instrumentation requires guidance from SLOs to determine its importance. It also requires proactive alerting in order to ensure that it is regularly used and incorporated into decision making.
Define: Understanding the target customer experience and expectations is good, but has no value if there isn’t a system in place to notify the team when there is a risk of negative impact. Definition is supported by instrumentation and action. Defining the experience is essential, but only when that definition becomes actionable (through alerting) can an organization ensure a certain level of service for its customers.
Alert: Alerting is essential for closing the feedback loop and connecting engineering and customers (through an agreed service level). Alerting ensures that SLO breaches are actionable and not just informational. Without the previous steps, alerting can easily be misdirected at low-impact metrics, which defeats the whole objective of implementing SLOs: tying your product or service to the customer experience you provide.
Report/Refine: Reporting and refining establishes a regular interval to review SLO performance, which helps to inform strategy. This is the final stage in the SLO lifecycle and helps product and the wider organization incorporate SLOs into strategic action.
This step specifies and chooses a service operation to use as the foundation for an SLO, based on importance. Google calls these Service Level Indicators, and this step focuses on understanding which operations need to be measured. Identifying requires understanding the relative importance of each operation a service supports in order to identify the most important transactions.
Most teams today rely on intuition to classify important operations for services. However, this isn't necessarily done with enough information about the service itself. Some heuristics for establishing the importance of operations are:
For example, an operation that supports fetching a user profile and is called irregularly is less important for an authentication company than a transaction that authenticates a user and is called hundreds of times per second.
This step should produce a list of important operations that a service performs, ranked in order by importance. It’s important to remember that many operations may span multiple individual HTTP endpoints. Google describes strategies to identify SLIs in depth, in their recent Art of SLOs course.
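The ranking described above can be sketched in code. This is a minimal illustration, not part of the post: the operation names, traffic numbers, and criticality weights are hypothetical, and a real ranking would draw on actual traffic data and business context.

```python
# Hypothetical sketch: rank a service's operations by importance using
# request volume weighted by a manually assigned criticality score.
# All names and numbers below are illustrative.

operations = [
    {"name": "authenticate_user", "requests_per_sec": 500, "criticality": 1.0},
    {"name": "fetch_user_profile", "requests_per_sec": 2, "criticality": 0.5},
    {"name": "update_settings", "requests_per_sec": 10, "criticality": 0.7},
]

def importance(op):
    # Simple heuristic: traffic volume scaled by business criticality.
    return op["requests_per_sec"] * op["criticality"]

ranked = sorted(operations, key=importance, reverse=True)
for op in ranked:
    print(op["name"], round(importance(op), 1))
```

With these example numbers, the authentication operation ranks first, matching the intuition in the earlier example of an authentication company.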
This stage produces a single written entry that describes:
An example of this is available through the Art of SLOs Google Course under “Developing SLOs and SLIs” Section:
The next step is to actually collect the data that will be used to implement the SLO. There are two components to this: how the data will be instrumented, so that there is a record of every transaction with sufficient data to determine whether it was successful, and which system the data will live in. The logical strategy is already defined in the Identify step. The strictest constraint is where the data will live: the system must support self-service and alerting in order to scale SLOs.
Many established organizations will have this defined through the metric provider that the organization offers.
After choosing the metric store, the next step is to actually collect the data. This is done through either white-box or black-box monitoring and will be technology- or provider-specific. A base level of instrumentation will usually be necessary. Even without service-emitted metrics, request data is often available at the load balancer or queue level from many cloud providers.
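As a sketch of white-box instrumentation, the following decorator records every request with enough data (operation name, latency, success) to later compute an SLI. This is an assumption-laden illustration: in practice you would emit these values to your metrics provider as counters or histograms, and the `records` list here merely stands in for that backend.

```python
import time

records = []  # stand-in for a real metrics backend

def instrumented(operation):
    """Record latency and success for every call to the wrapped function."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                # One record per transaction: enough data to judge success.
                records.append({
                    "operation": operation,
                    "latency_s": time.monotonic() - start,
                    "success": ok,
                })
        return wrapper
    return decorator

@instrumented("authenticate_user")  # hypothetical operation name
def authenticate_user(user):
    return True  # placeholder for real authentication logic
```

Every call to `authenticate_user` now leaves a record that can be aggregated into a success-rate or latency SLI.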
This step formally defines a service target. Google has already written a lot about choosing targets, which won’t be repeated here. It’s important to favor choosing a value and gradually refining it later, over choosing a “perfect” initial value.
A good heuristic is to look at historical performance and choose a target that is consistently achievable over the interval defined in the Identify stage (usually 7, 14, or 30 days). This can be done by consulting a monitoring system, taking a simple average of the target value, and using that as the initial objective. For example, if the average latency of an operation over the last week or month was 200ms, start with this value. Where there is no historical data, make a guess at what is reasonable, keeping in mind the kind of customer experience you want to achieve. This initial value can often be discovered from implicit constraints, such as what a human might consider reasonable, or explicit constraints, such as minimum times governed by physics. Any value can trivially be refined with low effort after data is collected.
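The heuristic above is a one-liner in code. The latency samples here are made up; the post's simple average is used, though a percentile (e.g. p95) is a common alternative when the distribution has a long tail.

```python
# Sketch: derive an initial latency objective from historical samples.
# The sample latencies are illustrative, chosen to average ~200ms.
latencies_ms = [180, 190, 200, 210, 220, 195, 205]

initial_target_ms = sum(latencies_ms) / len(latencies_ms)
print(f"initial latency objective: {initial_target_ms:.0f}ms")  # 200ms
```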
This stage should enhance the text created during the Identify step to include the SLO.
Alerting is essential in order to make objectives “living”. It allows engineers to be notified in real time when their error budgets are being exhausted. A structured, generic approach to alerting on SLOs allows us to build default tooling and provide a default alerting policy. As long as we accurately express a customer’s experience in terms of an SLO, alerting becomes a generic, templated formula.
The recommended strategy is called Multiple Burn Rate Alerts and is described in detail by Google in their SRE Workbook.
Using this approach, each SLO should have at least 2 alerts:
This strategy is relatively straightforward; the math is easy to calculate and can be templated (it is described in detail in the SRE Workbook).
This stage produces 2 alerts.
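The burn-rate math behind these alerts can be sketched as follows. The burn rate is the observed error rate divided by the error budget (1 minus the SLO target). The thresholds below (14.4x for a fast-burn page, 6x for a slow-burn ticket) follow common SRE Workbook recommendations, but the exact thresholds, windows, and a 99.9% example target are assumptions that should be tuned per service.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'sustainable' the budget is burning."""
    budget = 1.0 - slo_target
    return error_rate / budget

def should_page(error_rate_1h, error_rate_6h, slo_target=0.999):
    # Fast burn: page only when both windows confirm the burn,
    # which keeps one short spike from waking someone up.
    return (burn_rate(error_rate_1h, slo_target) >= 14.4 and
            burn_rate(error_rate_6h, slo_target) >= 14.4)

def should_ticket(error_rate_6h, slo_target=0.999):
    # Slow burn: lower-urgency ticket alert.
    return burn_rate(error_rate_6h, slo_target) >= 6.0
```

For a 99.9% target (0.1% budget), a sustained 2% error rate burns the budget 20x faster than sustainable and would trigger the paging alert.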
Succeeding with SLOs requires:
It’s important to review SLO performance on a regular interval. The closer this interval is to an organization’s iteration interval (sprint, week, etc.), the more SLO performance can inform the decision between shoring up reliability and feature work. This stage allows SLOs to be used to help think about and quantify risk, compare availability between services, and orient future work along two poles:
The decision of feature velocity vs technical debt is one of the most important things that SLOs enable. This is the stage that is used to inform that decision.
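One concrete input to that decision is how much error budget remains at review time. The following sketch computes it; the request counts and 99.9% target are illustrative numbers, not from the post.

```python
def error_budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the error budget left over the review window."""
    allowed_failures = total_requests * (1.0 - slo_target)
    return 1.0 - (failed_requests / allowed_failures)

# Illustrative: a 99.9% SLO over 10M requests allows 10,000 failures.
# With 4,000 failures so far, 60% of the budget remains, which argues
# for feature work; a nearly exhausted budget argues for reliability.
remaining = error_budget_remaining(10_000_000, 4_000, 0.999)
print(f"{remaining:.0%} of the error budget remains")  # 60%
```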
In the IIDARR system, each element is anchored to the customer and is used to help understand their perspective. ‘Identify’ chooses the operations that are most important to the customer. ‘Instrument’ captures the customer experience and makes explicit where gaps may occur in any given instrumentation strategy. ‘Define’ makes an SLO a direct representation of actual customers making requests (invoking operations). The objective of the process is to arrive at a system that can alert on incidents tied to customer experience. This becomes the stepping stone to quantifying and measuring customer experience, making the whole process actionable.
Some common risks that may keep organizations from successfully adopting SLOs across every team in the org:
Many of the largest challenges to succeeding with Service Level Objectives aren’t technical. Service Level Objectives aren’t magic and don’t require difficult engineering to achieve, but they do require a clear and explicit process, one which carefully considers feedback loops, in order to scale adoption. Implementing SLOs from scratch is easier to tackle if you are first sold on why doing it is important. Keep in mind that it is crucial for your SLOs to be actionable and to follow a feedback loop, as they will play an important role in the debate of features vs. technical debt.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.