Succeeding With Service Level Objectives

March 12, 2020

In this blog, Danny Mican, a Senior Site Reliability Engineer, outlines how to implement SLOs from scratch using the IIDARR process. He also emphasizes that SLOs must be actionable and embedded in feedback loops, since they play an important role in the debate of features vs. technical debt.

    Implementing SLOs From Scratch With the IIDARR Process 

    Succeeding with Service Level Objectives requires buy-in and coordination across multiple business units: Product, Engineering, and Site Reliability. Without a structured strategy and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, complete failure. This post describes a process to help organizations scale adoption of Service Level Objectives. It assumes familiarity with core Service Level Objective concepts, as defined in the Google SRE book.

    Using SLOs Throughout the Product Life Cycle

    Many resources cover the technical specifics of choosing SLIs and defining and measuring objectives. Few discuss strategies for incorporating Service Level Objectives into regular operations and decision making. It’s important to have an organizational process that governs SLOs throughout the entire product lifecycle and provides the foundation for incorporating them into regular operations.

    The goal is to establish a process that allows teams to accurately understand their customers’ experience, empowers teams to detect issues before customers report them in order to maintain service, and allows organizations to understand their historical performance and use it to inform the decision between addressing technical debt and feature velocity.

    The IIDARR Rule-of-Thumb for SLO Implementation

    The following elements describe the phases essential to successfully implementing Service Level Objectives:

    • Identify - Determines the operations and flows that are essential to support customers.  This phase produces Service Level Indicator (SLI) definitions.
    • Instrument (Measure) - Captures data from the identified operations in order to later act on it. This phase produces concrete, queryable metrics.
    • Define - Quantifies the customer experience and simplifies it by expressing it as a Service Level Objective. This phase produces a Service Level Objective, defined in writing, explaining how it is calculated.
    • Alert (Action) - Enforces an SLO through automated detection of when a service level (and therefore the customer experience) is being impacted. Alerting establishes an actionable connection between customers, engineering, and product. This phase produces an alert living in an alerting system such as Prometheus, Datadog, etc., hooked up to an incident management system that can notify the owning teams.
    • Report/Refine - Reporting provides a view into performance week over week, month over month, and quarter over quarter. This historical record keeping enables organizations to see what level of service clients are actually receiving. Refining establishes a formal cadence for review, supporting continuous improvement and an understanding of how clients use the product (i.e., which transactions are important) and what level of service is being achieved. Succeeding with SLOs requires the ability to look back historically and find avenues for service performance improvements. Finally, reporting democratizes SLOs and makes them available to incident responders, management, and leadership teams.


    Initial SLO adoption also benefits from centralizing the implementation in a single resource or tool:

    • Inventory - Inventorying progress is important in order to see a global view of SLOs per team and/or project. It also helps teams understand the implementation status of all the available SLOs. Inventorying is helpful during the initial deployment of SLOs across teams to get a birds-eye view of the rollout. Centralized inventorying should be used until teams are familiar with the process and each team has an SLO that they have fully incorporated into their process. Even after teams start to autonomously handle their SLOs, it is still extremely useful to have a centralized, searchable view into all SLOs - searchable by team, service, and SLI type (see the sketch below).
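    As a rough illustration, a centralized inventory can start as something as simple as a structured, filterable list. The sketch below is hypothetical (a real rollout might use a spreadsheet, wiki, or dedicated tool), but it shows the searchable team/service/SLI-type view described above:

    ```python
    # Hypothetical sketch of a centralized, searchable SLO inventory.
    from dataclasses import dataclass

    @dataclass
    class SLOEntry:
        team: str
        service: str
        sli_type: str  # e.g. "availability" or "latency"
        phase: str     # furthest IIDARR phase reached for this SLO

    inventory = [
        SLOEntry("payments", "checkout-api", "availability", "alert"),
        SLOEntry("identity", "auth-service", "latency", "define"),
    ]

    def find(entries, **filters):
        """Search the inventory by team, service, or SLI type."""
        return [e for e in entries
                if all(getattr(e, k) == v for k, v in filters.items())]

    print(find(inventory, team="identity"))  # birds-eye view per team
    ```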

    Things to Keep in Mind During Implementation

    All the elements together form a complete process. None stands alone; treated as individual steps, they risk partial implementation and, eventually, failure.

    Identify: Defining what should be instrumented and alerted on has no inherent value by itself. A company could identify 100 important operations a day and be no better off, because there is no way to automatically collect data around those operations and act on it. Ad-hoc research without building data collection around it provides one-off insights and no recurring value.

    Instrument: Collecting data without understanding how it relates to a customer has very little value. This is commonly seen with infrastructure metrics. For example: is high resource usage an issue? It may deviate from a baseline, but it’s only the customer’s experience that provides the context needed to determine whether any individual resource’s usage is actually an issue.

    Even if the data is passively available during incidents, it is hard to establish a baseline for normal behavior, understand whether the system is deviating from it, and tie that back to the customer experience.

    Instrumentation requires guidance from SLOs to determine its importance. It also requires proactive alerting in order to ensure that it is regularly used and incorporated into decision making. 

    Define: Understanding the target customer experience and expectations is good, but has no value if there isn't a system in place to notify the team when there is a risk of negative impact. Definition is supported by instrumentation and action. Defining the experience is essential, but only when that definition becomes actionable (through alerting) can an organization ensure a certain level of service for its customers.

    Alert: Alerting is essential for closing the feedback loop and connecting engineering and customers (through an agreed service level). Alerting ensures that SLO breaches are actionable and not just informational. Without the previous steps, alerting can easily be misdirected at low-impact metrics, which undermines the whole objective of implementing SLOs: tying your product or service to the customer experience you're providing.

    Report/Refine: Reporting and refining establish a regular interval to review SLO performance, which helps inform strategy. This is the final stage in the SLO lifecycle and helps product and the broader organization incorporate SLOs into strategic action.

    Approach

    Identify (Service Level Indicator)

    This step specifies and chooses a service operation to use as the foundation for an SLO, based on importance. Google calls these Service Level Indicators, and this step focuses on understanding which operations need to be measured. Identifying requires understanding the relative importance of each operation a service supports in order to surface the most important transactions.

    Most teams today rely on intuition to classify the important operations of their services, but this isn't necessarily done with enough information about the service itself. Some heuristics for establishing the importance of operations are:

    • The operation that earns the most money
    • The operation with the most traffic
    • A “coarse-grained” SLI based on all service traffic

    For example, an operation that supports fetching a user profile and is called irregularly is less important for an authentication company than a transaction that authenticates a user and is called hundreds of times per second.  

    This step should produce a list of important operations that a service performs, ranked by importance. It’s important to remember that many operations may span multiple individual HTTP endpoints. Google describes strategies to identify SLIs in depth in their recent Art of SLOs course.

    This stage produces a single written entry that describes:

    • What is being measured
    • The type (availability/latency)
    • The specification
    • How it is being measured
    • From where it is being measured

    An example of this is available in the Art of SLOs Google course, under the “Developing SLOs and SLIs” section:

    (Figure: SLI examples from the Art of SLOs course)
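    As a rough illustration, the same written entry could also be captured as a structured record. The schema and values below are hypothetical, not taken from the course:

    ```python
    # Hypothetical schema for the written SLI entry produced by Identify.
    from dataclasses import dataclass

    @dataclass
    class SLIDefinition:
        what: str           # what is being measured
        sli_type: str       # availability or latency
        specification: str  # the SLI as a ratio of good events to total events
        how: str            # how it is being measured
        where: str          # from where it is being measured

    profile_sli = SLIDefinition(
        what="User profile fetches",
        sli_type="availability",
        specification="proportion of GET /profile requests returning 2xx",
        how="request counters emitted by the service",
        where="measured at the load balancer",
    )
    ```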


    Instrument 

    The next step is to actually collect the data that will be used to implement the SLO. There are two components to this: at which logical level the data will be collected, and how it will be instrumented so that there is a record of every transaction with sufficient data to determine whether it was successful, along with which system the data will live in. The logical strategy is already defined in the Identify step. The strictest constraint is where the data will live: the system must support self-service and alerting in order to scale SLOs.

    Many established organizations will already have this defined through their standard metrics provider.

    After choosing the metric store, the next step is to actually collect the data. This is done through either white-box or black-box monitoring and will be technology- or provider-specific. A base level of instrumentation will usually be necessary in order to identify operations. Even without service-emitted metrics, request data is often available at the load balancer or queue level from many cloud providers. A white-box sketch follows.
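    As an example, a minimal white-box sketch using the Python prometheus_client library might look like the following. The operation and handler names are hypothetical; the point is that every transaction is recorded with enough data (operation and status) to later classify it as good or bad:

    ```python
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        "http_requests_total", "Total requests by operation and status",
        ["operation", "status"],
    )
    LATENCY = Histogram(
        "http_request_duration_seconds", "Request latency by operation",
        ["operation"],
    )

    def handle_get_profile(request):
        """Hypothetical handler for the 'fetch user profile' operation."""
        start = time.monotonic()
        status = "500"
        try:
            response = fetch_profile(request)  # hypothetical business logic
            status = str(response.status)
            return response
        finally:
            # Record every transaction, success or failure.
            REQUESTS.labels("get_profile", status).inc()
            LATENCY.labels("get_profile").observe(time.monotonic() - start)

    start_http_server(8000)  # expose /metrics for the monitoring system to scrape
    ```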

    Define (Service Level Objective)

    This step formally defines a service target. Google has already written a lot about choosing targets, which won’t be repeated here. It’s important to favor choosing a value and gradually refining it later, over choosing a “perfect” initial value.  

    A good heuristic is to look at historic performance and choose a target that is consistently achievable over the interval defined in the Identify stage (usually 7, 14, or 30 days). This can be done by consulting a monitoring system, taking a simple average of the target value, and using that as the initial objective. For example, if the average latency of an operation over the last week or month was 200ms, start with this value. Where there is no historical data, make a guess at what is reasonable, keeping in mind the kind of customer experience you want to achieve. This initial value can often be discovered from either implicit constraints, such as what a human might think is reasonable, or explicit constraints, such as minimum amounts of time governed by physics. Any value can be refined with low effort after data is collected.
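    As a toy illustration of this heuristic (with invented data), take the observed success ratio over the window and shave off a small buffer so the initial objective is consistently achievable:

    ```python
    # Invented sample: one success ratio per day over the SLO window.
    daily_success_ratios = [0.9992, 0.9987, 0.9995, 0.9990]

    observed = sum(daily_success_ratios) / len(daily_success_ratios)
    initial_slo = round(observed - 0.0005, 4)  # small buffer below the average

    print(f"observed: {observed:.4%}, initial SLO target: {initial_slo:.4%}")
    # observed: 99.9100%, initial SLO target: 99.8600%
    ```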

    This stage should enhance the text created during the Identify step to include the SLO.

    Alert (Actionable Objectives)

    Alerting is essential in order to make objectives “living”. Alerting allows engineers to be notified in real time when their budgets are being exhausted. Having a structured, generic approach to alerting on SLOs allows us to build default tooling and provide a default alerting policy. As long as we accurately express a customer's experience in terms of an SLO, alerting becomes a generic, templatable formula.

    The recommended strategy is called Multiple Burn Rate Alerts and is described in detail by Google in their SRE Workbook.

    Using this approach, each SLO should have at least 2 alerts:

    • Active - Alert (to your alerting tool) - triggered when 2% of the budget has been exhausted in 1 hour
    • Passive - Log (to your communications tool) - triggered when 10% of the budget has been exhausted in 1 day

    This strategy is relatively straightforward; the math can be templated and is easy to calculate (described in detail in the SRE Workbook), as the sketch below shows.
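    A sketch of that templated math, assuming a 30-day SLO window and a 99.9% availability objective (both illustrative values):

    ```python
    # Multiple burn rate alerts, per the SRE Workbook: the burn rate is how
    # many times faster than "exactly on budget" errors must arrive for the
    # given budget fraction to be consumed within the lookback window.
    SLO = 0.999                 # 99.9% availability objective (illustrative)
    SLO_WINDOW_HOURS = 30 * 24  # 30-day window
    ERROR_BUDGET = 1 - SLO

    def burn_rate(budget_fraction, alert_window_hours):
        return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours

    for name, fraction, window in [("active (alert)", 0.02, 1),
                                   ("passive (log)", 0.10, 24)]:
        rate = burn_rate(fraction, window)
        threshold = rate * ERROR_BUDGET  # error rate that trips the alert
        print(f"{name}: burn rate {rate:.1f}x, "
              f"fires when error rate > {threshold:.2%} over {window}h")

    # active (alert): burn rate 14.4x, fires when error rate > 1.44% over 1h
    # passive (log): burn rate 3.0x, fires when error rate > 0.30% over 24h
    ```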

    This stage produces 2 alerts.

    Report/Refine (Revisit Objective)

    Succeeding with SLOs requires:

    • Historic SLO data
    • Periodically revisiting this data

    It’s important to review SLO performance at a regular interval. The closer this interval is to an organization's iteration interval (sprint, week, etc.), the more SLO performance can inform the decision between shoring up reliability and feature work. This stage allows SLOs to be used to help think about and quantify risk, compare availability between services, and orient future work along two poles:

    • Risk Aversion / Shore up Reliability / Tech Debt
    • Feature Velocity / Constant deploys / New Features

    The decision between feature velocity and technical debt is one of the most important things that SLOs enable, and this is the stage used to inform that decision. A minimal reporting sketch follows.
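    As a rough illustration (with invented counts), attainment and error budget consumption per review interval can be computed directly from good/total event counts:

    ```python
    # Invented weekly counts: (good events, total events) per review interval.
    SLO = 0.999

    weekly_counts = {
        "2020-W10": (1_995_400, 1_996_000),
        "2020-W11": (1_988_100, 1_990_000),
    }

    for week, (good, total) in weekly_counts.items():
        attainment = good / total
        budget_consumed = (1 - attainment) / (1 - SLO)
        print(f"{week}: attainment {attainment:.4%}, "
              f"error budget consumed {budget_consumed:.0%}")
    # A week near 100% budget consumption argues for reliability work;
    # a week well under budget leaves room for feature velocity.
    ```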

    The Customer

    In the IIDARR system, each element is anchored to the customer and is used to help understand their perspective. ‘Identify’ chooses the operations that are most important to the customer. ‘Instrument’ captures the customer experience, and makes explicit where gaps may occur under any given instrumentation strategy. A ‘Defined’ SLO is a direct representation of actual customers making requests (invoking operations). The objective of the process is to arrive at a system that alerts on incidents tied to customer experience. This becomes the stepping stone to quantifying and measuring customer experience, making the whole process actionable.


    (Figure: overview of alerting stages)

    Myths and Anti-patterns That You Can Avoid

    Some common risks that may keep organizations from successfully adopting SLOs across every team in the org:

    • Reliance on Hope - Successful org-wide adoption of SLOs requires more than an ad-hoc strategy. “Hope is not a strategy” - Ben Treynor
    • SRE “does” SLOs for teams - SLOs tie customer experience to individual service performance, so it’s essential for individual product teams to deploy SLOs for themselves.
    • SLOs are static - SLOs are meant to be iterative in nature. There is no room for improvement if you collect data without introducing feedback loops.
    • Feedback loops don’t have automated enforcement - If feedback loops are opt-in or best-effort, teams can easily fall behind on them. This can happen if a team decides it doesn’t want to Report/Refine, or doesn’t want to alert on its SLOs. These are critical communication points between the customer, product, and engineering, and missing them can leave an organization unable to characterize the performance of its systems in terms of customer experience.

    Conclusion

    Many of the largest challenges to succeeding with Service Level Objectives aren't technical. Service Level Objectives aren’t magic and don’t require difficult engineering to achieve, but they do require a clear and explicit process, one that carefully considers feedback loops, in order to scale adoption. The challenge of implementing SLOs from scratch is easier to tackle if you are first sold on why it matters. Above all, keep your SLOs actionable and embedded in feedback loops; they will play an important role in the debate of features vs. technical debt.

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.
