The past two decades have seen DevOps and site reliability engineering (SRE) rise in popularity and become hallmarks of high-performing teams. SRE began as a sysadmin practice at Google in 2003. A few years later, DevOps started its rise in popularity thanks to teams — like the Flickr software engineering team — promoting tighter alignment between developers and IT operations and the first DevOpsDays conference in 2009.
Even though DevOps started as a culture and SRE a collection of best practices, we have come to think of DevOps engineers and SREs as simply job titles today. In some cases, the terms are even used interchangeably.
This article explains the differences between DevOps and SRE, explores their origins, and defines how our industry now views DevOps and SRE roles and responsibilities.
DevOps vs. SRE Key Concepts
DevOps began as a culture emphasizing collaboration between developers and IT operations. Over time, the term DevOps evolved to describe job titles, tools, and practices. Today, we can describe DevOps as a superset of practices and principles that includes site reliability engineering. SRE has a narrower focus on operations and system reliability.
While DevOps is conceptually a superset that includes SRE, in practice, “DevOps engineering” now often refers to implementing, automating, and maintaining CI/CD pipelines. SRE is now synonymous with ensuring uptime and adhering to service level objectives (SLOs), which aligns with the “ops” part of DevOps.
Before digging deeper into the origins of each and comparing the two, let us summarize the core ideas in the table below.
What is DevOps?
In the early stages of the Internet, software deployments were viewed as critical events that posed various risks for an application. These risks caused a rift between software development teams and operations teams.
Development teams are tasked with delivering new features to the end-users. Operations teams provide stable, reliable, and secure infrastructure to run the software. Even though both teams aim to deliver the best possible application to the end-users, deploying new code inherently creates new risks in production. In many cases, this created tension between teams.
The rocky relationship between software and operations teams is how the idea of “DevOps,” short for development and operations, came about. The goal of DevOps is to build a more collaborative and experimental culture within the software and operations teams.
10 Deploys Per Day
In 2009, John Allspaw and Paul Hammond gave their now famous presentation called 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. This one presentation started the conversation about why development and operations teams need to, and should, collaborate. These conversations gave rise to the term DevOps.
The idea of multiple production deployments a day was groundbreaking. Deploying that routinely allowed software teams to focus on features and not worry about each deployment.
Reducing the Cognitive Load
One big obstacle DevOps attempts to reduce is the cognitive load required for software engineers to deploy their code from the local machine to production. Reducing the developer’s need to understand all the nuances of software deployments allows them to focus more energy on creating quality software.
The ultimate goal is to provide the software teams with a “self-service” model. A self-service model allows the developers to implement their own infrastructure, playbooks, and CI/CD pipelines.
A CI/CD (Continuous Integration, Continuous Deployment) pipeline allows the development teams to deploy their application to pre-production environments — like dev, QA, or staging servers — for testing purposes. A pipeline also allows the developers to test and verify their code changes will not cause any issues for their end users.
This gives developers the freedom to move at their own speed without waiting on operations to implement the infrastructure their project needs. It also keeps operations from becoming the bottleneck or constraint in delivering the software to the company’s clients.
What is Site Reliability Engineering (SRE)?
In 2003, Ben Treynor joined Google and founded the site reliability team. As of 2016, Google employed over 1,000 Site Reliability Engineers, or SREs, across its entire organization.
The idea behind SRE is to ask what would happen if you put a software engineer in an operations role. For example, instead of manually connecting to the production server, copying the new version's source code to it, and launching the new version, an SRE could use pipelines, or automated processes, to build repeatable tasks for software deployment and infrastructure provisioning. Similarly, applications that use caching systems often need their cache flushed or cleared, and an SRE could write a script that automates the flushing process.
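The cache-flushing example can be sketched in a few lines of Python. This is only an illustration: `InMemoryCache` and `flush_all` are hypothetical names standing in for whatever caching system and tooling a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InMemoryCache:
    """Hypothetical stand-in for a real caching system."""
    entries: dict = field(default_factory=dict)

    def flush(self) -> int:
        """Remove every entry and return how many were removed."""
        removed = len(self.entries)
        self.entries.clear()
        return removed


def flush_all(caches: Dict[str, InMemoryCache]) -> Dict[str, int]:
    """Flush each named cache and report how many entries were cleared."""
    return {name: cache.flush() for name, cache in caches.items()}
```

Wrapping the task in a function like this makes it repeatable and easy to call from a pipeline instead of being run by hand on a server.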
Operational Development Practices
After deploying software to a production environment, a new issue arises: how can teams manage software running in production? SREs are tasked with developing processes and tools that allow SREs to maintain the software and developers to manage the application. This includes tasks such as restarting the application or viewing the running application's logs to triage errors.
Operational development practices also include documenting how to deploy software into the production environment. This is one area where SRE and DevOps teams can collaborate to provide concise deployment steps.
Some SRE teams provide a checklist for teams to follow to determine if their software is ready for reliable deployment and operations. These checklists provide standard items to verify to ensure software meets the SRE team’s operational requirements.
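A readiness checklist like this can also be expressed as code so it is applied the same way to every service. The sketch below is an assumption-laden example: the checklist items and the service fields (`runbook_url`, `health_endpoint`, `alerts`) are hypothetical, and a real team would define its own.

```python
from typing import Callable, Dict

# Hypothetical checklist items; each maps a description to a check function.
CHECKS: Dict[str, Callable[[dict], bool]] = {
    "has runbook": lambda svc: bool(svc.get("runbook_url")),
    "exposes health check": lambda svc: bool(svc.get("health_endpoint")),
    "has alerting configured": lambda svc: len(svc.get("alerts", [])) > 0,
}


def review(service: dict) -> Dict[str, bool]:
    """Evaluate every checklist item against a service description."""
    return {item: check(service) for item, check in CHECKS.items()}


def is_production_ready(service: dict) -> bool:
    """A service is ready only when every checklist item passes."""
    return all(review(service).values())
```

Returning the per-item results, rather than a single yes or no, tells the development team exactly which requirement is still unmet.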
Tools for Operating Software
SRE teams build systems and tools to automate the maintenance and deployment of production software. These systems and tools may be accessed via a command-line interface (CLI), allowing the development teams to reboot a server or view logs for a specific application. Some tools may be accessed programmatically via an application programming interface (API) to help automate the application deployment or maintenance steps.
The same systems and tools allow development teams to maintain their applications in a self-service-like manner. For example, providing developers access only to the staging environment allows them to be self-sufficient without jeopardizing production stability. This approach enables the SRE team to focus on providing systems and tools without becoming a blocker for development.
Another example is providing a graphical user interface for the development teams to view production logs. This gives the development teams the ability not only to triage issues in production but also to gain insight into how the application performs and make improvements where needed.
Responsibilities of DevOps vs. SRE Teams
The sections below detail non-exhaustive lists of the primary responsibilities of DevOps vs. SRE teams.
DevOps Team Responsibilities
Increase Developer Productivity
A key responsibility of the DevOps team is making developers as productive as possible. Highly productive teams can estimate development time more accurately, focus more on writing code than maintaining infrastructure, release new code faster, and improve job satisfaction.
Manage Software Releases
The process of delivering software into a production platform involves a set of steps executed in a specific order. CI/CD pipelines automate these steps. Automation reduces human error and the risk of negatively impacting production.
A CI/CD system also provides automated testing. Automated unit tests check blocks of code, and integration test suites test the overall application. By providing test results in the CI/CD pipeline, development teams have almost immediate feedback. This feedback allows them to improve their code faster than manual testing would.
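The idea of ordered, automated steps with fast feedback can be sketched as follows. This is a simplified illustration rather than how any particular CI/CD product works; `run_pipeline` and the step names in the usage note are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that fails."""
    log: List[str] = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            # Later steps (such as deploy) never run after a failure.
            break
    return log
```

A pipeline defined as build, unit tests, integration tests, and deploy would stop before deploying if any test step returned False, which is how automation keeps a failing change out of production.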
Define Processes and Procedures
DevOps teams should also define and document processes and procedures to align the activities of developers and SREs. Clear processes allow the application teams to operate independently of the operations team. This independence decouples development activities from the platform maintenance events and prevents the operations team from blocking the release of new software.
SRE Team Responsibilities
Manage Infrastructure
Once an application is deployed to production, the need to operate the application while it is serving clients comes next. Infrastructure as Code (IaC) tools are commonly used to configure an application platform.
Managing IaC means using configuration files to define the desired state and letting a configuration automation tool (like Terraform) maintain the infrastructure instead of manually configuring parameters for every node and infrastructure component.
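The desired-state idea can be illustrated with a small comparison function. This is a conceptual sketch only, not how Terraform or any other tool is implemented; `plan_changes` and the resource names used with it are hypothetical.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, dict], actual: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare desired and actual state and list what must change."""
    return {
        # Declared in configuration but not yet running.
        "create": sorted(name for name in desired if name not in actual),
        # Running, but with settings that differ from the configuration.
        "update": sorted(
            name for name in desired
            if name in actual and desired[name] != actual[name]
        ),
        # Running, but no longer declared in configuration.
        "delete": sorted(name for name in actual if name not in desired),
    }
```

The operator edits only the desired state; the tool works out the steps needed to get there, instead of someone configuring each node by hand.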
Respond to Incidents
Incidents ensue once an application is deployed to production. An incident could be an application not responding or a database generating errors. These situations require the SRE team to acknowledge the incident, triage, and resolve it.
Additionally, SREs communicate to stakeholders via status pages during an incident and with post-mortem analysis after an incident. Transparent and real-time communication is critical to building trust with end-users and business managers.
Manage Observability Tools
Post-deployment, an application team needs to know how its application is performing in production. Observability, otherwise known as monitoring, gives application teams visibility into application and system logs, metrics such as CPU utilization and latency, and traces of a transaction's path through infrastructure tiers.
Observability also allows SRE teams to maintain service level objectives (SLOs) by defining goals for a metric’s range of values during normal operations. Below are examples of SLOs:
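As a hedged illustration of checking a metric against an objective, the sketch below assumes a hypothetical availability SLO of 99.9% of requests succeeding. The function names and the target value are assumptions for the example, not figures from any specific team.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded; no traffic counts as fully available."""
    if total == 0:
        return 1.0
    return successful / total


def meets_slo(successful: int, total: int, target: float = 0.999) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(successful, total) >= target
```

For instance, 9,995 successful requests out of 10,000 is 99.95% availability, which meets a 99.9% target, while 9,980 out of 10,000 (99.8%) does not.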
DevOps started as a set of tenets, best practices, and techniques that promote collaboration between developers and operators of web applications. Site reliability engineering is the term Google used to define the enhanced role of systems engineers who support the operations of critical software and platforms as a service.
While both terms are now associated with job titles, DevOps engineers focus more on code release management while SREs manage the platform's stability. Together, they rapidly deliver new software features into production environments without compromising user experience.