Introduction: SRE Best Practices

October 4, 2022
5 min read

Site Reliability Engineering (SRE) is a practice that emerged at Google because of its need for highly reliable and scalable systems. SRE unifies operations and development teams and implements DevOps principles to ensure system reliability, scalability, and performance. 

There’s plenty of documentation on tactics for adopting automation and implementing infrastructure as code, but practical ops-focused SRE best practices based on real-world experience are harder to find. This article will explore 6 SRE best practices based on feedback from SREs and technical subject matter experts. Here is a list of topics we will cover.

SRE Best Practices
| SRE Best Practice | Benefit |
| --- | --- |
| Define the role of the SRE | Removes role ambiguity and clarifies responsibilities. |
| Automate toil and make time for strategic tasks | Emphasizes automation of simple tasks and enables humans to focus on more complex work. |
| Monitor using SLIs and SLOs | Improves visibility and helps determine if SLAs are met. |
| Maintain a transparent status page | Summarizes infrastructure performance and availability for all stakeholders. |
| Categorize incident severities | Helps quantify incident impact and prioritize incident management tasks. |
| Conduct post-mortems and share them publicly | Encourages transparency and continuous learning. |

SRE Best Practices

The six SRE best practices below are based on feedback from experienced SREs and focus on the operational side of site reliability engineering. 

Define the role of the SRE

The Site Reliability Engineer, also called an SRE, has several responsibilities:

  • Designing systems to monitor, automate, and achieve the highest uptime with the lowest operational effort
  • Enabling developers to iterate and move fast without sacrificing reliability
  • Incident management
  • Performing root cause analysis (RCA)
  • Conducting post-mortems (more on these later in the article)
  • Creating documentation to minimize tribal knowledge  

SREs should spend most of their time automating their tasks to avoid constantly working on “toil.” Toil is a catchall term for operational tasks that involve repetitive manual configuration or lack long-term strategic value. Without automation, toil consumes the engineering team’s time; with automation, engineers can focus on more complex tasks. 

This diagram represents the optimal time allocation for an SRE. 

The two categories of SRE work. (Source)

Automate toil and leave time for strategic tasks

To avoid wasting valuable engineering time, SREs should work on automating every repetitive task, so teams focus less on toil and more on innovation. SREs use scripts, programs, and frameworks to automate and monitor those tasks. 

Within high-performing teams, eliminating toil is a core SRE function. From a tactical perspective, there are many ways to implement this best practice. The key is to minimize the human time spent on simple tasks that automation can handle. 
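As a concrete illustration, the sketch below shows one way such automation might look in practice: a short Python script (the paths and retention window here are hypothetical) that replaces a manual disk-cleanup chore by deleting log files older than a retention window. The same pattern applies to any repetitive, scriptable task: codify it once, schedule it, and stop doing it by hand.

```python
# Hypothetical example: automate a repetitive disk-cleanup chore so it no
# longer consumes on-call time. Paths and thresholds are illustrative only.
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")   # assumed log location
RETENTION_DAYS = 14                # assumed retention policy


def cleanup_old_logs(log_dir: Path = LOG_DIR, retention_days: int = RETENTION_DAYS) -> int:
    """Delete *.log files older than the retention window; return the count removed."""
    cutoff = time.time() - retention_days * 86_400
    removed = 0
    for log_file in log_dir.glob("*.log"):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    print(f"Removed {cleanup_old_logs()} stale log files")
```

Run from cron or a CI scheduler, a script like this turns a recurring manual task into a few lines of reviewed, versioned code.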

Monitor using SLIs and SLOs

Effective monitoring is a crucial part of SRE. Metrics should be measured as close to the user as possible, since user experience is what most businesses care about most. Organizations should define their most important metrics; SREs then use these metrics to build the three key indicators: SLIs, SLOs, and SLAs.

SLI vs. SLO vs. SLA
| Key Indicator | Goal | Stakeholders |
| --- | --- | --- |
| SLIs (Service Level Indicators) | Collect metrics in a standardized way to gain insight into the system's performance | Development and product teams |
| SLOs (Service Level Objectives) | Set the uptime objective for the company | Development team, product team, and company executives |
| SLAs (Service Level Agreements) | Set expectations with the general public about the reliability of your services | Clients, consumers, and the general public |

SLIs: Service Level Indicators

SLIs are used to collect metrics in standardized ways. Here is a breakdown of common SLI types. 

Common SLI Types
| Type of SLI | Description |
| --- | --- |
| Availability | Percentage of requests that resulted in a successful response. |
| Latency | Percentage of requests that returned faster than the defined threshold. |
| Quality | Percentage of requests that were served in a non-optimal (degraded) manner because the service was impaired. |
| Freshness | Percentage of data that was successfully updated within the defined threshold. |
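To make these definitions concrete, here is a minimal sketch of how availability and latency SLIs could be computed from raw request data. The request records, field names, and the 300 ms latency threshold are assumptions for illustration.

```python
# Minimal sketch: compute availability and latency SLIs from request records.
# The record fields and the 300 ms latency threshold are illustrative assumptions.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 450},
    {"status": 500, "latency_ms": 80},
    {"status": 200, "latency_ms": 95},
]

LATENCY_THRESHOLD_MS = 300

# Availability SLI: percentage of requests with a successful (non-5xx) response.
availability = 100 * sum(r["status"] < 500 for r in requests) / len(requests)

# Latency SLI: percentage of requests that returned faster than the threshold.
latency_sli = 100 * sum(r["latency_ms"] < LATENCY_THRESHOLD_MS for r in requests) / len(requests)

print(f"Availability SLI: {availability:.1f}%")  # 75.0%
print(f"Latency SLI: {latency_sli:.1f}%")        # 75.0%
```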

SLOs: Service Level Objectives

SLOs are the goals the organization must accomplish, formulated using the service level indicators (SLIs) explained in the previous section. They should be published internally in a place that is easily accessible to both technical and non-technical stakeholders.
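SLOs are often paired with an error budget, the amount of unreliability a service is allowed over a given window. The arithmetic below is a simple sketch assuming a 99.9% availability SLO measured over a 30-day period; the measured SLI value is made up for illustration.

```python
# Sketch: error budget implied by a 99.9% availability SLO over a 30-day window.
SLO_TARGET = 0.999             # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window in minutes

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget: {error_budget_minutes:.1f} minutes of downtime")  # ~43.2 minutes

# If the measured SLI for the window is 99.95%, roughly half the budget remains.
measured_sli = 0.9995
budget_consumed = (1 - measured_sli) / (1 - SLO_TARGET)
print(f"Budget consumed: {budget_consumed:.0%}")  # 50%
```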

SLAs: Service Level Agreements

These are contracts with clients and consumers about what to expect from the service. They are usually legally binding, with monetary implications if they are not met.

Maintain a transparent status page

Customers need to understand the system's status at all times. If there is an outage, customers should know about it as soon as possible. That builds trust and prevents them from troubleshooting an issue they cannot control.

Status pages reflect the status of services in real time. They should be clear and concise, with a color-coded indicator for each customer-facing service. In case of failure, the page should immediately report which services are failing and why. It is also good practice to accompany status updates with email or RSS notifications. 

The Squadcast status page at https://status.squadcast.com.
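Many teams back such a page with a simple machine-readable status feed. The sketch below shows the kind of summary document a status page might render from per-service health checks; the service names, states, and field names are hypothetical, not a specific status-page API.

```python
# Hypothetical sketch: aggregate per-service health checks into the kind of
# summary a status page might render. Service names and states are made up.
import json
from datetime import datetime, timezone

service_states = {
    "API": "operational",
    "Web dashboard": "operational",
    "Notifications": "degraded_performance",
}

# Overall status is healthy only if every exposed service is operational.
overall = "operational" if all(s == "operational" for s in service_states.values()) else "partial_outage"

status_feed = {
    "updated_at": datetime.now(timezone.utc).isoformat(),
    "overall_status": overall,
    "services": [{"name": name, "status": state} for name, state in service_states.items()],
}

print(json.dumps(status_feed, indent=2))
```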

Categorize incident severities

With enough time and complexity, errors happen. When they do, they must be addressed in an organized manner.

Incidents have different severity levels; generally, these are P0, P1, P2, and P3. Severity determines the actions to be taken and the required response time.

Severity Levels
| Severity | Examples | Action | Response time |
| --- | --- | --- | --- |
| P0 (Critical) | The site is unavailable for one or several reasons: a DDoS attack, a bad configuration, a faulty deployment, or a third-party incident. It can also be related to a security issue, such as PII exposure. | Page (push to on-call, call to action, email, Slack, war room). Most of the time, several teams and multiple stakeholders are involved. Engineers perform RCA in real time. | Immediate (within 5 minutes) |
| P1 (Major) | The site is partially affected due to one or more services failing or a provider incident. The impact may also be intermittent. | Page (push to on-call, email). It usually involves fewer teams than a P0 but has to be resolved relatively quickly to prevent a degraded user experience. | Fast (within 20-30 minutes) |
| P2 (Minor) | Some of the site's non-critical functionality is affected, such as recommendations not loading correctly, some images not showing up, or pages loading too slowly. | Slack, email, and notify a single team. Check whether there is an easy remediation or a fix to apply; if not, the work can sometimes wait until the next working day. An item should be placed in the backlog and prioritized accordingly. | Standard (within a few days) |
| P3 (Irrelevant/Bug) | The incident does not affect users directly, or users may not even be aware of it, such as an elevated error rate that client applications retry on. | Notification channels may include email or Slack, but no immediate response is required. SREs should review during working hours. | Slow (within a few days or weeks) |
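Codifying this matrix keeps alert routing consistent. The sketch below captures the table above as a simple lookup; the channel names and response targets mirror the table, while the data structure itself is an illustrative choice, not a specific tool's API.

```python
# Sketch: encode the severity matrix above as a lookup so alert routing and
# response-time targets stay consistent. Values mirror the table; the structure
# is illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    channels: tuple[str, ...]        # notification channels to use
    response_target_minutes: int     # expected time to first response


SEVERITY_POLICIES = {
    "P0": SeverityPolicy(("page", "call", "email", "slack", "war-room"), 5),
    "P1": SeverityPolicy(("page", "email"), 30),
    "P2": SeverityPolicy(("slack", "email"), 3 * 24 * 60),    # within a few days
    "P3": SeverityPolicy(("email", "slack"), 14 * 24 * 60),   # days to weeks
}


def policy_for(severity: str) -> SeverityPolicy:
    """Return the notification policy for a given severity level."""
    return SEVERITY_POLICIES[severity]


print(policy_for("P0"))
```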

Conduct post-mortems and share them publicly 

Shortly after an incident, SREs should do two things:

  • Address the issues: Critical errors are often patched or hot-fixed in an improvised way, which is usually not a permanent solution. If that is the case, those issues should be placed in the backlog to be revisited by the development teams. SREs should also review, during working hours, any issues that were not fixed while on-call.
  • Draft a post-mortem: A post-mortem is a briefing on what happened during the incident: what occurred, why, and how to prevent it in the future. Every post-mortem should include clear documentation and action items placed in the backlog and prioritized according to severity (a minimal record structure is sketched below).

It is also a great idea to share post-mortems with the public, since doing so improves transparency and strengthens customer trust.
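A lightweight, consistent structure makes post-mortems easier to write and to revisit later. The sketch below is one hypothetical way to represent the essential fields and action items; none of these field names come from a specific tool or template.

```python
# Hypothetical sketch of a minimal post-mortem record with prioritized action
# items; field names are illustrative, not tied to any specific tool.
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    severity: str          # e.g. "P1", used for backlog prioritization
    owner: str


@dataclass
class PostMortem:
    incident_id: str
    summary: str           # what happened
    root_cause: str        # why it happened
    prevention: str        # how to prevent it in the future
    action_items: list[ActionItem] = field(default_factory=list)


pm = PostMortem(
    incident_id="2022-10-04-checkout-outage",
    summary="Checkout unavailable for 18 minutes after a bad deployment.",
    root_cause="A configuration change removed the database connection pool limit.",
    prevention="Add a config-validation step to the deployment pipeline.",
    action_items=[ActionItem("Add config validation to CI", "P1", "platform-team")],
)
print(pm.incident_id, "-", len(pm.action_items), "action item(s)")
```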

Conclusion

These best practices are elaborated on further in the chapters below.

Chapter 1: SLA vs. SLO: Understand the similarities and differences between SLA and SLO, follow a case study, and learn the best practices for implementing them.

Chapter 2: Reliability vs. Availability: Learn the difference between reliability and availability and understand how to calculate them by following examples.

Chapter 3: DevOps vs SRE: Learn the origins of DevOps and site reliability engineering (SRE) and understand how they compare and where they overlap.

Chapter 4: O11y: Learn how o11y is different from systems monitoring and follow the best practices and recommendations for a successful implementation of an o11y solution.

Chapter 5: Runbook Template: Learn how to design the right runbook template for your use cases and follow a format and an example to get you started.

Chapter 6: Microservices Security: Understand the key aspects of microservices security and learn the best practices that help prevent misconfigurations and vulnerabilities.

Chapter 7: On-Call Rotation: Plan a successful on-call rotation schedule and learn the best practices for shift composition, hand-off, post-mortem meetings, escalation, creating cheat-sheets, and more.

Chapter 8: Canary Deployment: Learn how to roll out new features in a live SaaS environment with the canary deployment strategy, avoiding downtime and easy rollbacks.

Chapter 9: Golden Signals: Learn how to efficiently scale and enhance operational performance of software systems as SREs, by utilizing 4 key metrics, the "Golden Signals".

Chapter 10: SRE Tools: Learn how teams use various SRE tools to monitor and manage software systems for reliability, performance, and availability.

Chapter 11: Runbook Automation: Learn how to leverage runbook automation to improve IT operations, reduce risk of errors, and ensure compliance.

Check back, as more chapters are coming soon!
