Site Reliability Engineering (SRE) is a practice that emerged at Google because of its need for highly reliable and scalable systems. SRE unifies operations and development teams and implements DevOps principles to ensure system reliability, scalability, and performance.
There’s plenty of documentation on tactics for adopting automation and implementing infrastructure as code, but practical, ops-focused SRE best practices based on real-world experience are harder to find. This article explores six SRE best practices based on feedback from SREs and technical subject matter experts. Here is a list of topics we will cover.
SRE Best Practices
The six SRE best practices below are based on feedback from experienced SREs and focus on the operational side of site reliability engineering.
Define the role of the SRE
The Site Reliability Engineer, also called an SRE, has several responsibilities:
- Designing systems to monitor, automate, and achieve the highest uptime with the lowest operational effort
- Enabling developers to iterate quickly while keeping systems reliable
- Incident management
- Performing root cause analysis (RCA)
- Conducting post-mortems (more on these later in the article)
- Creating documentation to minimize tribal knowledge
SREs should spend most of their time automating their tasks so that they are not constantly working on “toil”. Toil is a catchall term for operational tasks that involve repetitive manual configuration or lack long-term strategic value. Without automation, toil consumes the engineering team’s time. With automation, engineers can focus on more complex tasks.
This diagram represents the optimal time allocation for an SRE.
Automate toil and leave time for strategic tasks
To avoid wasting valuable engineering time, SREs should work on automating every repetitive task, so teams focus less on toil and more on innovation. SREs use scripts, programs, and frameworks to automate and monitor those tasks.
Within high-performing teams, eliminating toil is a core SRE function. From a tactical perspective, there are many ways to implement this best practice. The key is to limit the human time spent on simple tasks that automation can handle.
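One way to decide what to automate first is to rank repeated tasks by the total time they consume. The sketch below is only an illustration under stated assumptions: the work log, the task names, and the `automation_candidates` helper are hypothetical and do not come from any particular tool.

```python
from collections import Counter

# Hypothetical work log of (task name, minutes spent).
# The entries are invented for illustration only.
WORK_LOG = [
    ("rotate TLS certificate", 30),
    ("restart stuck worker", 10),
    ("restart stuck worker", 10),
    ("design capacity plan", 120),
    ("restart stuck worker", 15),
    ("rotate TLS certificate", 30),
]


def automation_candidates(log, min_occurrences=2):
    """Return repeated tasks as (task, total minutes), largest total first.

    A task counts as repeated when it appears at least
    `min_occurrences` times in the log (an assumed threshold).
    """
    minutes = Counter()
    occurrences = Counter()
    for task, spent in log:
        minutes[task] += spent
        occurrences[task] += 1
    repeated = [
        (task, minutes[task])
        for task in minutes
        if occurrences[task] >= min_occurrences
    ]
    return sorted(repeated, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for task, total in automation_candidates(WORK_LOG):
        print(f"{task}: {total} min")
```

One-off strategic work, such as the capacity plan in the sample log, is filtered out because it is not repeated, which leaves only the tasks where automation is likely to pay off.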
Monitor using SLIs and SLOs
Effective monitoring is a crucial part of SRE. Metrics should be measured as close to the user as possible, since user experience is what most businesses ultimately care about. Organizations should first define their most important metrics. SREs then use these metrics to build three key constructs: SLIs, SLOs, and SLAs.
SLIs: Service Level Indicators
SLIs are used to collect metrics in standardized ways. Here is a breakdown of common SLI types.
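As a minimal sketch, two widely used SLIs (availability and latency) can be computed as the ratio of good events to total events. The `Request` record, the 5xx rule for failures, and the 300 ms threshold are assumptions chosen for illustration, not values from this article.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests served within threshold_ms (assumed threshold)."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Invented sample data for illustration.
SAMPLE = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(200, 200.0),
    Request(404, 900.0),
]

if __name__ == "__main__":
    print(f"availability: {availability_sli(SAMPLE):.0%}")
    print(f"latency:      {latency_sli(SAMPLE):.0%}")
```

Expressing each SLI as a good-to-total ratio keeps every indicator on the same 0 to 1 scale, which makes it straightforward to compare against a target.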
SLOs: Service Level Objectives
SLOs are the goals the organization must accomplish and are formulated using the service level indicators (SLIs) explained in the previous section. These should be published internally in a place easily accessible to technical and non-technical stakeholders.
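An availability SLO translates directly into an error budget, meaning the amount of unreliability the target still allows. The following sketch shows the arithmetic, assuming a 30-day window and a 99.9% target. Both values and the function names are illustrative choices.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime for an availability SLO over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Error budget left after the downtime already observed (may be negative)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows 0.001 * 43,200 minutes of downtime.
    print(f"budget:    {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, downtime_minutes=30):.1f} min")
```

A negative remaining budget indicates that the SLO has been missed for the window.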
SLAs: Service Level Agreements
SLAs are contracts with clients or consumers about what to expect from the service. They are usually legally binding and carry monetary implications if not met.
Maintain a transparent status page
Customers need to understand the system's status at all times. If there's an outage on a system, the customer has to know about it as soon as possible. That helps build trust and prevents them from troubleshooting an issue they cannot control.
Status pages reflect the status of services in real time. They should be clear and concise and have a color-coded indicator for each service exposed to customers. In case of failure, they should immediately report which services are failing and why. It also helps to accompany the status page with email or RSS notifications.
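The idea of one color-coded indicator per service, plus a reason when something is failing, can be sketched as a small data model. The status names, colors, service names, and text layout below are assumptions for illustration. A real status page would typically be rendered as HTML or served by a dedicated product.

```python
from enum import Enum


class Status(Enum):
    # Assumed color coding, one color per state.
    OPERATIONAL = "green"
    DEGRADED = "yellow"
    OUTAGE = "red"


def render_status_page(services):
    """Render one line per service from a {name: (Status, reason)} mapping.

    The reason is shown only for services that are not operational.
    """
    lines = []
    for name, (status, reason) in sorted(services.items()):
        line = f"[{status.value}] {name}: {status.name.lower()}"
        if status is not Status.OPERATIONAL and reason:
            line += f" ({reason})"
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical services and reasons.
    print(render_status_page({
        "web": (Status.OPERATIONAL, ""),
        "api": (Status.OUTAGE, "database failover in progress"),
    }))
```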
Categorize incident severities
With enough time and complexity, errors happen. When they do, they must be addressed in an organized manner.
Incidents have different severities: generally, those are P0, P1, P2, and P3. Severity determines the action to be taken and response time.
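The mapping from severity to action and response time can be written down as an explicit policy. The sketch below assumes that P0 is the most severe level. The classification rules, response times, and actions are hypothetical placeholders that each organization would replace with its own.

```python
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # assumed to be the most severe
    P1 = 1
    P2 = 2
    P3 = 3


# Hypothetical policy: severity -> (response time in minutes, action).
RESPONSE_POLICY = {
    Severity.P0: (15, "page the on-call engineer immediately"),
    Severity.P1: (60, "page the on-call engineer"),
    Severity.P2: (24 * 60, "open a ticket for the next business day"),
    Severity.P3: (7 * 24 * 60, "add to the backlog"),
}


def classify(customer_facing, full_outage, workaround_exists):
    """Assign a severity using simple, assumed rules."""
    if customer_facing and full_outage:
        return Severity.P0
    if customer_facing and not workaround_exists:
        return Severity.P1
    if customer_facing:
        return Severity.P2
    return Severity.P3


if __name__ == "__main__":
    severity = classify(customer_facing=True, full_outage=False,
                        workaround_exists=False)
    minutes, action = RESPONSE_POLICY[severity]
    print(f"{severity.name}: respond within {minutes} min, {action}")
```

Keeping the policy in one table means responders do not have to decide during an incident how urgently to act.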
Conduct post-mortems and share them publicly
Shortly after an incident, SREs should do two things:
- Address the issues: Critical errors are often patched or hot-fixed in an improvised way during the incident, which is usually not a permanent solution. In that case, the permanent fix should be placed in the backlog to be revisited by the development teams. SREs should also review, during working hours, any issues that were not fixed while on call.
- Draft a post-mortem: A post-mortem is a written summary of the incident. It captures what happened, why it happened, and how to prevent it from happening again. Every post-mortem should therefore have clear documentation and action items that are placed in a backlog and prioritized according to severity.
It is also a good idea to share post-mortems publicly, since doing so brings transparency and strengthens customers’ trust.
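A post-mortem with the elements described above (what happened, why, how to prevent it, and prioritized action items) can be sketched as a simple template. The field names and the sample incident are hypothetical. Teams usually keep this in a shared document or ticketing system rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    severity: str  # "P0" (assumed most severe) through "P3"
    owner: str


@dataclass
class PostMortem:
    title: str
    what_happened: str
    root_cause: str
    prevention: str
    action_items: list = field(default_factory=list)

    def prioritized_backlog(self):
        """Action items ordered by severity, most severe first."""
        return sorted(self.action_items, key=lambda item: item.severity)


if __name__ == "__main__":
    # Invented incident for illustration.
    report = PostMortem(
        title="Checkout API outage",
        what_happened="Checkout requests failed for 25 minutes.",
        root_cause="Connection pool exhausted after a configuration change.",
        prevention="Validate pool settings in staging before rollout.",
        action_items=[
            ActionItem("Document pool tuning", "P3", "docs team"),
            ActionItem("Replace hot-fix with permanent patch", "P1", "dev team"),
        ],
    )
    for item in report.prioritized_backlog():
        print(f"{item.severity}: {item.description} ({item.owner})")
```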
These best practices are covered in more detail in the following chapters.
Chapter 1: SLA vs. SLO: Understand the similarities and differences between SLA and SLO, follow a case study, and learn the best practices for implementing them.
Chapter 2: Reliability vs. Availability: Learn the difference between reliability and availability and understand how to calculate them by following examples.
Chapter 3: DevOps vs SRE: Learn the origins of DevOps and site reliability engineering (SRE) and understand how they compare and where they overlap.
Chapter 4: O11y: Learn how o11y is different from systems monitoring and follow the best practices and recommendations for a successful implementation of an o11y solution.
Chapter 5: Runbook Template: Learn how to design the right runbook template for your use cases and follow a format and an example to get you started.
Chapter 6: Microservices Security: Understand the key aspects of microservices security and learn the best practices that help prevent misconfigurations and vulnerabilities.
Chapter 7: On-Call Rotation: Plan a successful on-call rotation schedule and learn the best practices for shift composition, hand-off, post-mortem meetings, escalation, creating cheat-sheets, and more.
Chapter 8: Canary Deployment: Learn how to roll out new features in a live SaaS environment with the canary deployment strategy, avoiding downtime and enabling easy rollbacks.
Chapter 9: Golden Signals: Learn how to efficiently scale and enhance operational performance of software systems as SREs, by utilizing 4 key metrics, the "Golden Signals".
Chapter 10: SRE Principles: Learn how to measure system performance for user-facing services, set service level objectives to define availability, and use error budgets to balance development and reliability.
Chapter 11: SRE Tools: Learn how teams use various SRE tools to monitor and manage software systems for reliability, performance, and availability.
Chapter 12: Runbook Automation: Learn how to leverage runbook automation to improve IT operations, reduce risk of errors, and ensure compliance.