
Choosing SLOs that users need, not the ones you want to provide

Oct 1, 2020
Last Updated:
Oct 1, 2020

In this two-part blog series, Adam Hammond talks about how you can build sustainable SLOs that are appropriate for your users, your technology platform, and your business, which in turn will help you make your systems robust, your customers happy, and your business boom.


    Service Level Objectives (SLOs) are a powerful operational tool that uses metric-based targets to constrain activities that may have a negative impact on users (such as maintenance or failed deployments). You may have heard the term used contractually within Service Level Agreements (SLAs), where SLOs identify guarantees for IT platforms (SaaS, IaaS, PaaS, etc.). However, they are far more than that: SLOs can be used not only by the “business people” but also by technical staff to drive process improvement and technological advancement. As metric-based indicators, they show you what needs to be improved in your systems and their capabilities, and where you can get the best “bang for your buck” when focusing your work efforts.

    However, SLOs must be driven by data, and that data can only come from your customers. A lot of IT professionals tend to think that they know the best metrics, and they do; the only problem is that they are the best metrics for monitoring systems, not for improving customer satisfaction. Today, we’re going to help you build sustainable SLOs that are appropriate for your users, your technology platform, and your business, and that will help you make your systems robust, your customers happy, and your business boom.

    Asking the right questions

    Now that we have an idea of what SLOs are, we need to establish a data-based approach that will result in positive user outcomes. This is a two-stage process: gathering data, and then using that data to build your SLOs. The source data comes from three main places: your users, your system, and your business processes. Prepare to go out and talk to clients on Zoom calls, trawl through logs, and understand the maintenance and support lifecycle of your system. There is no prescribed set of questions; they are subjective, and everyone’s scenario will be different. It is also important to remember the Pareto Principle: roughly 80% of your users use about 20% of your system. Therefore, you will get the best value out of this exercise by targeting and providing SLOs for the most commonly used parts of your system.
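    As a rough illustration of the Pareto idea, the sketch below counts requests per feature and picks the smallest set of features that accounts for 80% of traffic. The feature names, the sample log, and the helper `top_features_by_usage` are hypothetical; in practice the input would come from your own access logs.

```python
from collections import Counter


def top_features_by_usage(requests, coverage=0.80):
    """Return the most-used features that together reach `coverage` of traffic.

    `requests` is an iterable with one feature name per request.
    Returns (list_of_features, fraction_of_traffic_covered).
    """
    counts = Counter(requests)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0

    selected, covered = [], 0
    # Walk features from most to least used until the coverage goal is met.
    for feature, hits in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append(feature)
        covered += hits
    return selected, covered / total


# Hypothetical access-log sample: one entry per request.
sample_log = (
    ["search"] * 50 + ["checkout"] * 30 + ["profile"] * 10
    + ["export"] * 6 + ["admin"] * 4
)

features, share = top_features_by_usage(sample_log)
print(features, f"{share:.0%}")  # ['search', 'checkout'] 80%
```

    In this made-up sample, two of the five features carry 80% of the requests, so those two would be the first candidates for SLOs.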

    Example Questions

    • When do my users actively or passively use my system?
    • How much maintenance do I need to perform and how regularly does it need to be?
    • What tolerance would my users have for outages?
    • Would my users consider my application critical to their business?
    • How well is my system performing at the moment?
    • What levels of performance do my users require?

    Determining SLOs

    When you have finished your data gathering exercise, it is time to focus on actually setting your SLOs. SLOs will generally - but not always - fall into the following categories:

    [Figure: SLO categories]

    These categories cover most of the things that people consider to be aspects of quality. They also translate easily into metrics that you can use to objectively measure your system against the requirements of your SLOs. Finally, when you define your SLOs, remember that a good SLO should be S.M.A.R.T.

    1. Specific: an SLO should expressly state what it measures (e.g. “we want to measure availability by testing whether a request can be made to the server”, not “we want the server to be up”).
    2. Measurable: the SLO should be something that can be measured (e.g. “disk latency should be less than 5ms”, not “the disk should be quick”).
    3. Achievable: you should be able to meet your SLOs (e.g. if an underlying service has an SLO of 95%, you cannot guarantee 100%).
    4. Relevant: your SLO should reflect the user experience (e.g. an appropriate metric for a web server is response time, not CPU activity).
    5. Timebound: an SLO should cover a period that is appropriate for how your system is used (e.g. if your users only use your system between 9 AM and 5 PM, a 24-hour SLO will only dilute your actual metrics and hide issues).
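    To see how a poorly chosen window can dilute a metric, here is a minimal sketch comparing the same outage measured over a 24-hour window and over a 9-to-5 window. The one-hour outage is an assumed scenario chosen only for illustration.

```python
def availability(window_hours, outage_hours):
    """Fraction of the measurement window during which the system was up."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return (window_hours - outage_hours) / window_hours


# Assumed scenario: users work 9 AM to 5 PM (8 hours) and a single
# 1-hour outage happens in the middle of the working day.
outage_hours = 1
full_day = availability(24, outage_hours)
business_hours = availability(8, outage_hours)

print(f"24-hour window: {full_day:.2%}")        # 95.83%
print(f"9-to-5 window:  {business_hours:.2%}")  # 87.50%
```

    The 24-hour figure looks healthier, yet users lost an eighth of their working day, which is the number the 9-to-5 window reports.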

    Now, let’s get down to creating an SLO. Whether an SLO is Achievable or Relevant does not affect its wording, but it does dictate whether that SLO should be set at all. For example, if the average time to retrieve a file is five minutes, you would not guarantee that the file can be delivered faster than that (because, on average, it won’t be). Alternatively, if your users only care that files are consistently, if eventually, delivered to them, then a retrieval-time SLO is probably not for you. In this case, the best SLO would be one that guarantees that a set proportion of files is successfully delivered to users, regardless of the time taken to retrieve and deliver them (i.e. percentage of successful retrievals).

    Once we’ve determined that an SLO is appropriate, let’s get it down on paper. Remember, the wording must be Specific, Measurable, and Timebound. If it is not all of these things, it simply cannot be used as an SLO. Let’s consider an example. A system processes stock trades, and a regulatory body dictates that all requests must be finalised within 300ms. The company running the system wants to offer an SLO that requests, averaged over 30 days, are completed in under 250ms. The system currently responds to 98% of requests within 232ms on a 30-day rolling average. The SLO text would look like this:

    [Figure: SLO text for the stock trading example]

    Is this a good SLO? Yes. The system already exceeds the SLO, so it is Achievable. There is a legal requirement that requests are finalised within 300ms, and the SLO target sits inside that limit, so it is Relevant. We are Specific about the metric we want to guarantee our performance against, which is the request response time. We have limited our SLO to a 30-day period, which allows us to run reporting that is Timebound. Finally, our metric is Measurable via a Prometheus metric. We have met all the requirements for a SMART SLO that has been tailored to the user experience.
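    The check behind an SLO like this can be sketched in a few lines. The example below is plain Python rather than a real monitoring query: the latency samples, the timestamps, and the helper names `rolling_average_ms` and `slo_met` are all hypothetical, and in practice the numbers would come from your monitoring system.

```python
from datetime import datetime, timedelta

SLO_TARGET_MS = 250.0            # target from the stock trading example
SLO_WINDOW = timedelta(days=30)  # rolling window from the same example


def rolling_average_ms(samples, now, window=SLO_WINDOW):
    """Mean latency of samples whose timestamp falls in (now - window, now].

    `samples` is an iterable of (timestamp, latency_ms) pairs.
    Returns None when no sample falls inside the window.
    """
    cutoff = now - window
    in_window = [ms for ts, ms in samples if cutoff < ts <= now]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


def slo_met(samples, now, target_ms=SLO_TARGET_MS):
    """True when the rolling average exists and is below the target."""
    avg = rolling_average_ms(samples, now)
    return avg is not None and avg < target_ms


# Hypothetical measurements, chosen so the average matches the 232ms above.
now = datetime(2020, 10, 1, 12, 0)
samples = [
    (now - timedelta(days=45), 900.0),  # older than 30 days, ignored
    (now - timedelta(days=20), 228.0),
    (now - timedelta(days=10), 240.0),
    (now - timedelta(days=1), 228.0),
]

print(rolling_average_ms(samples, now), slo_met(samples, now))  # 232.0 True
```

    Note that an average can hide slow outliers, so a stricter variant might check a percentile instead; that choice depends on what your users actually need.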

    How to account for maintenance and scheduled downtime in your SLOs

    Everyone needs to maintain their systems; some systems are highly available and need no downtime, while others need some. The simple answer is to bake your maintenance into the SLO. If you know you can provide 97% availability for a system over a month, but you need 14 hours of maintenance (about 2% of a 30-day month), then only offer 95%. It is better to underpromise and overdeliver than to be red-faced (and out of pocket) because your system has been offline when you knew in advance that it would be.
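    The arithmetic above can be written out as a small sketch. It assumes a 30-day month and rounds the offer down to a whole percentage point; both are illustrative choices, and `offered_availability` is a hypothetical helper name.

```python
import math

HOURS_IN_MONTH = 30 * 24  # assumes a 30-day month (720 hours)


def offered_availability(achievable_pct, maintenance_hours, hours=HOURS_IN_MONTH):
    """Subtract planned maintenance from achievable availability.

    Returns (maintenance_pct, offer_pct), where the offer is rounded down
    to a whole percent so that the promise errs on the cautious side.
    """
    maintenance_pct = maintenance_hours / hours * 100
    offer_pct = math.floor(achievable_pct - maintenance_pct)
    return maintenance_pct, offer_pct


maintenance_pct, offer = offered_availability(97.0, 14)
print(f"maintenance: {maintenance_pct:.2f}% -> offer {offer}%")
# maintenance: 1.94% -> offer 95%
```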

    Providing Better Service (and Increasing your SLO guarantees)

    Now that we have our SLOs, they’re SMART, but… we are just not meeting our targets (or want to exceed them). What do we do? We need to make our systems performant enough to overcome this challenge. While demanding in terms of effort, this is right in the SRE wheelhouse, and will predominantly rely on your expertise and knowledge to improve your system performance. If users require faster requests, streamline your proxy config. If disk reads are too slow, consider high-IOPS or higher-throughput alternatives. If batch jobs are taking too long, right-size the instances so that they finish within the required time. Some more difficult approaches may include changing your operating system, database platforms, or even development frameworks. It depends entirely on your ability to analyse and understand the factors in your system that affect your SLOs, and to mitigate those issues through proper SRE practice.

    There are also other options aside from the more technical approach: improved monitoring and disaster recovery. By improving your monitoring, you can ensure that problems are caught before they affect your SLOs. Your disaster recovery plan is key to managing and maintaining your SLOs. Disasters come when we least expect them, so practising and improving DR procedures means that if disaster strikes, you are able to restore service as quickly as possible. This will limit the overall impact to SLOs by ensuring that any disaster downtime is limited to only that which is strictly necessary to recover your systems.

    Using these processes, you can deliver SLOs that will please your users and make their experience with your systems a delight. By meeting (and hopefully, exceeding) their expectations, you will build lifelong customers that will evangelise your business and products.

    To be Continued...

    In the second part of this blog, we will be looking at an example based on Bill from The Phoenix Project that will highlight how “achieving SLOs” is not always good for business if those SLOs aren’t derived from customer needs.

