You deployed a service to your Kubernetes cluster. How do you know it is working as expected? In this blog, Gigi Sayfan, author of “Mastering Kubernetes”, talks about Kubernetes observability tools like Prometheus, Grafana and Jaeger, and how to utilize them to set proper SLOs and make sure the service meets its objectives.
You deployed a service to your Kubernetes cluster. How do you know if it works as expected? The best practice is to set a Service Level Objective (SLO) that the DevOps/SRE team and the developers responsible for the service commit to maintain. If the SLO is not met, corrective action must be taken. But how do you choose your SLOs, and how do you keep track of them? In this article we will answer these questions and more. We will cover the terminology of service levels, review methods for evaluating service health and performance, and learn about Kubernetes observability tools like Prometheus, Grafana and Jaeger, and how to utilize them to set proper SLOs and make sure the service meets its objectives.
Let's start with some terminology.
A Service Level Indicator (SLI) is a measurable quantity of some relevant aspect of the service. For example, the error rate of an endpoint on your service is a common SLI. Services can have hundreds or thousands of metrics associated with them, but usually just a small number of SLIs. Some very interesting metrics, like the number of requests per second, are not strictly SLIs, because the service can't control the number of incoming requests; however, a service is always designed to handle a certain volume of requests. If the number of requests per second exceeds some threshold, the level of service might decline.
A Service Level Objective (SLO) is the agreed-upon number or range for an SLI that the service must meet to be considered healthy. For example, an error rate SLO might require an error rate below 1% for every endpoint. Since distributed systems are very dynamic, temporary spikes are often not considered violations. SLOs are internal targets: when they are violated, the operators and developers must take action to restore the level of service, but customers and users are unaware of the SLO. SLOs can change periodically. For example, when a new service is in beta there may be no SLO at all, or only a very limited one. Later, when the service is fully launched, it will have much tighter SLOs. But stricter SLOs often come with a steep price. The goal is not to maximize the SLO, but to find the sweet spot of a highly available and reliable service that is still cheap enough to operate.
A Service Level Agreement (SLA) is similar to the SLO, but it is a commitment that the service developers make to the users of the service. It is a contract that includes consequences for missing it. For example, if a cloud provider has an outage of some critical service, it may give credits to the affected customers. SLAs are tied to SLOs, but are more lenient. For example, if the SLO for service availability is 99.999% (five nines), the SLA may be just 99%. This is a huge difference. Five nines means at most 5 minutes and 15 seconds of downtime per year. 99% availability means the service may be down for up to 3.65 days. The idea is that SLAs should be violated very rarely, if at all, because they are visible externally, impact the reputation of the company, and carry serious consequences.
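The downtime arithmetic above is easy to verify. Here is a minimal sketch (the function name is mine, not from any library) that converts an availability percentage into the maximum allowed downtime per year:

```python
def max_downtime_per_year(availability_pct: float) -> float:
    """Return the maximum allowed downtime per year, in minutes."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

# Five nines: roughly 5 minutes and 15 seconds of downtime per year.
print(round(max_downtime_per_year(99.999), 2))            # 5.26 (minutes)

# 99% availability: roughly 3.65 days of downtime per year.
print(round(max_downtime_per_year(99.0) / (24 * 60), 2))  # 3.65 (days)
```

Running these numbers for a proposed SLA is a quick sanity check before committing to it contractually.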
The big question is how to choose your SLIs. There are several schools of thought with a lot of overlap. Everyone agrees that you want a small set of SLIs that gives you a reasonable picture of your service. Note that when something goes wrong you will need much more data to actually debug, troubleshoot and do root cause analysis. Let's review the common approaches: the USE method, the RED method and the Four Golden Signals. Before we go ahead, let's take a moment to appreciate the concise and catchy names of all these approaches.
For performance-oriented SLIs, Brendan Gregg came up with the USE method. USE stands for Utilization, Saturation and Errors. The idea is to measure for each physical resource like CPU, memory or disk its USE.
Utilization: the average time the resource was busy servicing work.
Saturation: the amount of extra work the resource could not service (often queued).
Errors: the number of error events.
It's interesting to note that utilization is an average, so even if utilization is low, a spike of high utilization can cause saturation and performance issues.
For CPU and memory, utilization and saturation are the most important. For the network, utilization is the most important, and for I/O devices, utilization, saturation and errors all matter.
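The pitfall of averaged utilization is easy to see with a toy example. The samples below are hypothetical per-second CPU busy fractions: the average looks healthy, yet two seconds at 100% busy mean work was queuing and the resource was saturated:

```python
# Hypothetical per-second CPU busy fractions over a 10-second window.
samples = [0.10, 0.12, 0.08, 1.00, 1.00, 0.11, 0.09, 0.10, 0.12, 0.08]

avg_utilization = sum(samples) / len(samples)            # looks healthy
saturated_seconds = sum(1 for s in samples if s >= 1.0)  # work queued here

print(f"average utilization: {avg_utilization:.0%}")     # 28%
print(f"seconds at saturation: {saturated_seconds}")     # 2
```

This is why the USE method treats saturation as a separate signal rather than inferring it from average utilization.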
The Four Golden Signals are used by Google SRE. They are Latency, Traffic, Errors and Saturation.
Latency measures how long it takes to handle a request. When measuring latency it's best to separate successful requests from failures. A failed request may bail out very early and skew your measurements of successful requests.
Traffic measures the incoming demand on the system. For services exposing HTTP or gRPC APIs it is usually requests per second. For other systems with persistent connections it may be the number of connections, network IO or memory.
For services that expose multiple endpoints it might make sense to track the traffic per endpoint. For example, some endpoints may return cached content, while others might require heavy processing.
Errors measures the rate of failed requests. It's important to distinguish here between invalid or forbidden requests (4xx errors for HTTP services) and actual service errors (5xx errors for HTTP services). The first category is the caller's problem and shouldn't count against the service's error budget; it makes sense to ensure that such invalid requests are rejected quickly, without dedicating too many resources, to blunt DDoS attacks, since the service can't control the number of invalid requests sent its way. The second category is what the service developers and operators should strive to reduce.
Saturation measures how close your service is to exhausting some critical resource. It is similar to saturation from the USE method, but often focuses on high-level resources. Saturation is often related to utilization and even errors, but in a nuanced way. For example, degradation in performance can happen even before the service reaches 100% utilization. Also, by design a service may throttle requests and return errors under a certain load. The goal of the saturation signal is to identify that the service is close to being saturated and lead to some corrective action.
The RED method was developed by Tom Wilkie, who was an SRE at Google and used the Four Golden Signals. The RED method drops saturation, because it is needed mostly for more advanced cases, and because people remember things that come in threes better :-).
So, RED stands for:
Rate - traffic from Four Golden Signals
Errors - errors from Four Golden Signals
Duration - latency from Four Golden Signals
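The three RED signals can be tracked per endpoint with a tiny in-memory recorder. This is purely an illustrative sketch (the class and endpoint names are mine); in production you would use a metrics client library such as prometheus_client rather than rolling your own:

```python
from collections import defaultdict

class RedMetrics:
    """Toy per-endpoint tracker for the RED signals: Rate, Errors, Duration."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request counts per endpoint
        self.errors = defaultdict(int)      # Errors: failed requests per endpoint
        self.durations = defaultdict(list)  # Duration: latencies per endpoint

    def observe(self, endpoint: str, duration_ms: float, ok: bool) -> None:
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_ms)

    def error_rate(self, endpoint: str) -> float:
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

metrics = RedMetrics()
metrics.observe("/search", 120.0, ok=True)
metrics.observe("/search", 450.0, ok=False)
print(metrics.error_rate("/search"))  # 0.5
```

Note that the recorder is keyed by endpoint, which matches the earlier point about tracking traffic and latency per endpoint rather than for the service as a whole.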
The name RED was inspired by the catchy name of the USE method, so RED draws on both the Four Golden Signals and the USE method.
Kubernetes is different from traditional systems, as you very well know. Kubernetes decides where your service runs and on which nodes your service pods are scheduled. This means there is a dichotomy between monitoring your cluster infrastructure and monitoring your services. Also, if you don't specify limits and quotas properly, your services can cannibalize each other. This has implications for the way you do capacity planning. Assuming you have taken care of the infrastructure, or you deploy your cluster in a managed environment like GKE, AKS or EKS, let's focus on your services.
There are several steps you should follow. Let's address them in order.
Before setting SLOs it's important to establish a baseline. If you don't know how much traffic your service receives and how it behaves under that load, you're not ready to commit to an SLO. You can observe your service in a staging environment, but I recommend running the service in production for a while before establishing SLOs. The reason is that the real world will surprise you every time, and various patterns will emerge. It is typically difficult to simulate real-world traffic to a new service.
The observation should involve looking at your service SLIs over a period of several days at least.
Note that since it's the first time you deploy your service in production you may discover various serious issues with your implementation, your design and the service interaction with other services. Once your service stabilizes you should restart your observation period.
The next step is to choose the proper thresholds and ranges for your SLIs that correspond to your observations. For example, if you observed your error rate fluctuates between 0.1% and 0.3% then you may set a threshold of 0.5% as your SLO (as long as it's OK with the business stakeholder). This way you account for some volatility beyond what you observe, but your SLO is still grounded in what you saw in practice.
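The "observed range plus headroom" reasoning can be written down as a few lines of arithmetic. The daily error rates and the headroom factor below are hypothetical, chosen only to mirror the example in the text:

```python
# Hypothetical daily error rates (%) observed during a two-week baseline.
observed = [0.12, 0.18, 0.22, 0.15, 0.30, 0.25, 0.11,
            0.19, 0.28, 0.14, 0.21, 0.17, 0.26, 0.13]

worst = max(observed)   # 0.30% -- the worst day actually seen
headroom = 1.5          # arbitrary cushion for volatility beyond the baseline
proposed_slo = round(worst * headroom, 2)

print(f"proposed error-rate SLO: < {proposed_slo}%")  # < 0.45%
```

Whether the cushion is 1.5x the worst day or a round number like 0.5% is a judgment call to make together with the business stakeholders.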
Note that the granularity of the SLOs is very important. For example, if your service has five different endpoints, you can set an SLO for the latency of the entire service or for each endpoint independently.
Latency SLOs are especially problematic for services that depend on other services or shared resources that may have unpredictable response time and may sometimes be temporarily unavailable.
Your service may implement retry logic, but even if a request eventually succeeds it might violate the latency SLO.
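This interaction between retries and latency SLOs is worth quantifying. The sketch below (function name and parameters are mine) computes the worst-case latency of a request that eventually succeeds after retries, under the simplifying assumption of a fixed per-attempt timeout and fixed backoff:

```python
def worst_case_latency(attempt_timeout_ms: float, retries: int,
                       backoff_ms: float) -> float:
    """Worst case: every attempt but the last times out,
    with a fixed backoff between attempts."""
    attempts = retries + 1
    return attempts * attempt_timeout_ms + retries * backoff_ms

latency_slo_ms = 500
worst = worst_case_latency(attempt_timeout_ms=300, retries=2, backoff_ms=100)
print(worst, worst <= latency_slo_ms)  # 1100.0 False
```

Even though the request ultimately succeeds, at 1100 ms it blows well past a 500 ms latency SLO, so retry budgets and per-attempt timeouts have to be designed with the SLO in mind.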
In an ideal world you will deploy your service and it will always meet its SLO. In the real world the SLO will be violated from time to time. It could be bugs or misconfiguration in your code, or problems with your dependencies. A naive approach is to consider every violation an outage, stop all development and address it immediately. But this approach is actually counter-productive. It is very expensive, if not impossible, for example, to ensure that the latency will NEVER exceed a threshold X.
Instead an SLO should have an error budget. The error budget is the rate of requests that are allowed to violate the SLO. For example, 0.01% error rate means that 1 request out of 10,000 may violate the SLO, but no corrective action needs to be taken. You can review the SLO error rate every day or some other convenient period like every sprint. If the errors exceed the budget then you stop normal development and the developers should focus on reducing the error rate before going back to feature development.
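The error budget itself is just a fraction of the request volume, which makes it straightforward to compute and track. Here is a minimal sketch (function name and the consumed count are mine, for illustration):

```python
def error_budget(slo_pct: float, total_requests: int) -> int:
    """Number of requests in the period that may violate the SLO."""
    return round(total_requests * (1 - slo_pct / 100))

# A 99.99% SLO leaves a 0.01% budget: 1 request in 10,000 may fail.
budget = error_budget(99.99, 1_000_000)
consumed = 37  # hypothetical violations observed so far this period
print(f"budget: {budget}, remaining: {budget - consumed}")
# budget: 100, remaining: 63
```

Reviewing the remaining budget at a fixed cadence, such as daily or per sprint, is what turns the budget into an actual decision rule about when to pause feature work.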
SLOs shouldn't be set in stone. As the system evolves and becomes more stable or you update your dependencies you should verify that your SLOs are adequate. It is also worthwhile checking with your stakeholders that your SLOs are not too stringent. If a system or a service can tolerate more lenient limits then you may save yourself some work and/or money by loosening your SLOs or increasing your error budget.
It is sometimes very easy to meet SLOs by over-provisioning, AKA throwing hardware at the problem. This is particularly easy on Kubernetes, where you can just set a horizontal pod autoscaler and a cluster autoscaler with no ceiling and let them run rampant. If you get sloppy, first your costs will climb, and then you'll run into problems that even extra resources can't solve. You should stay vigilant and make sure your service is efficient in its resource usage. This is where the USE method is very effective, and you can set USE-based objectives for your service.
Now that we have covered a lot of the concepts and theory behind monitoring and SLOs, let's check out Prometheus, the de-facto standard metrics collection platform for Kubernetes.
Prometheus was the second project to graduate from the CNCF, after Kubernetes itself. It focuses on metrics collection and alert management. Prometheus scrapes metrics from your targets at regular intervals, and you can define recording rules that are evaluated at regular intervals over the metrics that correspond to your SLIs.
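As a concrete illustration, here is what a recording rule for an error-rate SLI might look like in a Prometheus rules file. The metric name, label matcher and rule name are illustrative assumptions; adjust them to match your own instrumentation:

```yaml
groups:
  - name: sli-recording-rules
    interval: 30s
    rules:
      # Error-rate SLI: fraction of 5xx responses over the last 5 minutes.
      - record: service:http_error_rate:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```

Precomputing the SLI this way keeps dashboards and alerting expressions simple, because they can query the recorded series directly instead of repeating the ratio.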
Here are some unique features of Prometheus:
A multi-dimensional data model, where time series are identified by a metric name and key/value labels.
PromQL, a flexible query language that leverages this dimensionality.
A pull model: metrics are scraped over HTTP from targets found via service discovery or static configuration.
Autonomous single-server nodes, with no reliance on distributed storage.
Here is a diagram that shows the architecture of Prometheus:
Prometheus has a web UI too. It is decent and can show data as well as graphs:
However, the real action starts when you add Grafana for visualization. Prometheus and Grafana integrate nicely, and Grafana has fantastic widgets, graphs and dashboards. Here is an example of a cool dashboard with multiple widgets and a lot of information:
With Grafana you can also navigate conveniently between different dashboards and views.
Prometheus is a great metrics collection platform as well as an alert manager. You can also record custom metrics that correspond 1:1 to your high-level SLIs. Grafana is a fantastic dashboarding solution. Together they let you monitor your SLIs and SLOs, both visually on a regular basis to detect trends and anomalies that require attention, and via alerts that trigger when your SLOs are in jeopardy.
Jaeger is a distributed tracing platform. It is also a graduated CNCF project and is considered best in class in the Kubernetes world. Jaeger adheres to the OpenTracing API. It records traces and spans and lets you track the path of a request through your system, as well as how long it spent in each component. This information is invaluable when it comes to studying your system under real traffic, understanding dependencies and setting your SLOs. In addition, when things go wrong and you start missing SLOs, Jaeger will be one of the first tools you reach for to understand what went wrong.
Tracing systems in general and Jaeger in particular are pretty complicated on their own. Here is the Jaeger architecture:
The primary value of Jaeger over logs or metrics is that the information it collects is transaction-bound or request-bound. Here is what the Jaeger UI looks like when dissecting a complex application:
The combination of Prometheus and Jaeger (and of course your logs) is a great foundation for establishing SLOs, keeping track of your SLIs and responding when something goes wrong or needs adjustment.
Your SLOs are the heartbeat of your system. SLO violations (including exceeding the error budget) should be a rare occurrence if you designed and implemented a strong observability solution and took advantage of the facilities Kubernetes itself provides, plus additional tooling from the ecosystem like Prometheus and Jaeger. But systems are dynamic: your developers will deploy new features, your dependencies might have their own challenges, and your business priorities may value agility and speed of development over stability and reliability. Any combination of these reasons will sooner or later lead to an incident, such as a performance issue or an outage. By the way, performance issues often become outages, since no system can wait forever: when a component takes too long to respond, the waiting system might just bail out. Timeouts are the most direct example.
When an incident occurs, it's important to respond properly. You should have a runbook that allows you to recover quickly, and later analyze the root cause and take steps to ensure the incident does not repeat (or acknowledge that you explicitly accept the risk).
This is where platforms like Squadcast shine and can help you manage and improve your reliability posture.
In this article we covered what SLOs are, how they relate to SLIs and SLAs, and the best tools on Kubernetes to arrive at your SLOs and keep them up to date. We specifically highlighted Prometheus, Grafana and Jaeger, but you may definitely use other tools instead of, or in addition to, those tools. The main takeaway is that SLOs are critical for healthy day-2 operations, and they form the quantitative basis for your observability and reliability story.