DevOps vs SRE: Origins, Roles & Responsibilities

Over the past two decades, DevOps and site reliability engineering (SRE) have risen in popularity and become hallmarks of high-performing teams. SRE began as a sysadmin practice at Google in 2003. A few years later, DevOps gained momentum thanks to the first DevOpsDays conference in 2009 and to teams, like the Flickr software engineering team, that promoted tighter alignment between developers and IT operations.

Even though DevOps started as a culture and SRE a collection of best practices, we have come to think of DevOps engineers and SREs as simply job titles today. In some cases, the terms are even used interchangeably.

This article will explain the differences in DevOps vs. SRE, explore their origins, and define how our industry now views DevOps and SRE roles and responsibilities.

DevOps vs. SRE Key Concepts

DevOps began as a culture emphasizing collaboration between developers and IT operations. Over time, the term DevOps evolved to describe job titles, tools, and practices. Today, we can describe DevOps as a superset of practices and principles that includes site reliability engineering. SRE has a narrower focus on operations and system reliability.

DevOps is a broader concept than site reliability engineering (SRE).

While DevOps is conceptually a superset that includes SRE, in practice, “DevOps engineering” now often refers to implementing, automating, and maintaining CI/CD pipelines. SRE is now synonymous with ensuring uptime and adhering to service level objectives (SLOs), which aligns with the “ops” part of DevOps.

Even though DevOps is a broader concept than SRE, the two now overlap as organizational responsibilities.

Before digging deeper into the origins of each and comparing the two, let us summarize the core ideas in the table below.


DevOps vs. SRE Engineer Comparison
	DevOps Engineer	Site Reliability Engineer
Key measurements	Lead time to deliver features and deployment frequency	Service level indicators and service level objectives
Key responsibilities	Deliver code frequently, efficiently, and rapidly	Ensure platform uptime and performance
Focus	Continuous integration and delivery (CI/CD)	Observability and systems administration
Sample Tools	Jenkins, Istio	Prometheus, Squadcast
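
To make the two DevOps measurements in the table concrete, here is a minimal sketch of how lead time and deployment frequency could be computed. The deployment records and function names are hypothetical and exist only for illustration; real teams would pull these timestamps from their version control and deployment tooling.

```python
from datetime import datetime, timedelta

# Hypothetical records: (commit time, production deploy time).
deployments = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0)),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 3, 10, 0)),
    (datetime(2024, 1, 4, 8, 0), datetime(2024, 1, 4, 14, 0)),
]


def mean_lead_time(records):
    """Average time between a commit and its production deployment."""
    total = sum(
        (deployed - committed for committed, deployed in records),
        timedelta(),
    )
    return total / len(records)


def deployment_frequency(records, period_days):
    """Average number of deployments per day over the observed period."""
    return len(records) / period_days


print(mean_lead_time(deployments))
print(deployment_frequency(deployments, period_days=7))
```

Shorter lead times and a higher deployment frequency both indicate that changes are reaching users faster.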

What is DevOps?

In the early stages of the Internet, software deployments were viewed as critical events that posed various risks for an application. These risks caused a rift between software development teams and operations teams.

Development teams are tasked with delivering new features to the end-users. Operations teams provide stable, reliable, and secure infrastructure to run the software. Even though both teams aim to deliver the best possible application to the end-users, deploying new code inherently creates new risks in production. In many cases, this created tension between teams.

The rocky relationship between software and operations teams is how the idea of “DevOps,” or development operations, came about. DevOps' goal is to build a culture within the software and operations teams that is more collaborative and experimental.

10 Deploys Per Day

In 2009, John Allspaw and Paul Hammond gave their now famous presentation called 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. This one presentation started the conversation about how development and operations teams can and should collaborate. These conversations gave rise to the term DevOps.

The idea of multiple production deployments a day was groundbreaking. This allowed software teams to focus on features and not worry about deployment.

Reducing the Cognitive Load

One big obstacle DevOps attempts to reduce is the cognitive load required for software engineers to deploy their code from the local machine to production. Reducing the developer’s need to understand all the nuances of software deployments allows them to focus more energy on creating quality software.

The ultimate goal is to provide the software teams with a “self-service” model. A self-service model allows the developers to implement their own infrastructure, playbooks, and CI/CD pipelines.

A CI/CD (Continuous Integration, Continuous Deployment) pipeline allows the development teams to deploy their application to pre-production environments (like dev, QA, or staging servers) for testing purposes. A pipeline also allows the developers to test and verify that their code changes will not cause any issues for their end users.
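
As a rough illustration of the pipeline idea, the sketch below runs a list of stages in order and stops at the first failure, so a broken build or failing test never reaches a later environment. The stage names and placeholder functions are assumptions made for this example; a real pipeline would be defined in a CI/CD tool and would call actual build and test commands.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            print(f"stage '{name}' failed; later stages are skipped")
            break
        completed.append(name)
    return completed


# Placeholder steps; real ones would invoke build and test tooling.
def build() -> bool:
    return True


def unit_tests() -> bool:
    return True


def deploy_to_staging() -> bool:
    return True


stages = [
    ("build", build),
    ("unit-tests", unit_tests),
    ("deploy-staging", deploy_to_staging),
]

print(run_pipeline(stages))
```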

This gives them the freedom to move at their own speed and not have to wait on operations to implement the infrastructure their project needs. It also keeps operations from becoming the bottleneck or constraint in delivering the software to the company’s clients.

What is Site Reliability Engineering (SRE)?

In 2003, Ben Treynor joined Google and founded the site reliability team. As of 2016, Google employed over 1,000 Site Reliability Engineers, or SREs, across its entire organization.

The idea behind SRE is to ask what would happen if you put a software engineer in an operations role. For example, instead of manually connecting to the production server, copying the new version's source code to it, and launching the new version, an SRE could use pipelines, or automated processes, to make software deployment and infrastructure provisioning repeatable. Similarly, applications that use caching systems often need their cache flushed or cleared, and an SRE could write a script that automates the flushing process.
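
A cache-flushing script like the one mentioned above might look roughly like this. The InMemoryCache class is a hypothetical stand-in for a real caching system, used so the example runs on its own; an actual script would call the client library of whichever cache is in use.

```python
class InMemoryCache:
    """Stand-in for a real caching system (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def flush(self):
        """Remove every entry and return how many were removed."""
        removed = len(self._data)
        self._data.clear()
        return removed


def flush_caches(caches):
    """Flush each named cache and report how many entries were cleared."""
    return {name: cache.flush() for name, cache in caches.items()}


caches = {"sessions": InMemoryCache(), "pages": InMemoryCache()}
caches["sessions"].set("user:1", "alice")
caches["pages"].set("/home", "<html>...</html>")
caches["pages"].set("/about", "<html>...</html>")

report = flush_caches(caches)
print(report)
```

Returning a per-cache count gives the operator a simple record of what the automated run actually did.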

Operational Development Practices

After deploying software to a production environment, a new issue arises: how can teams manage software running in production? SREs are tasked with developing processes and tools that allow SREs to maintain the software and developers to manage the application. This includes tasks such as restarting the application or viewing the running application's logs to triage errors.

Operational development practices also include documenting how software is deployed into the production environment. This is one area where SREs and DevOps can collaborate to provide concise deployment steps.

Some SRE teams provide a checklist for teams to follow to determine if their software is ready for reliable deployment and operations. These checklists provide standard items to verify to ensure software meets the SRE team’s operational requirements.

Example Deployment Readiness Checklist
General
Ownership	Service owners are identified. Contact information and methods are provided.
SLI Defined	Service level indicators are clearly defined
SLO Defined	Service level objectives are clearly defined
SLA Defined	Service level agreements are clearly defined (where applicable)
Deployment
Deployment Strategy	The automated deployment strategy has been documented. Example strategies include blue-green and canary.
Continuous Integration	Engineers commit their changes, and the system kicks off automated builds, tests, and deployment to a lower-level environment.
Continuous Delivery	Deploying to production is as simple as the click of a button. Changelogs and release notes represent what changes exist in each environment.
Static Code Analysis	Code is automatically scanned and then formatted or linted according to the standards.
Disaster Recovery
Disaster recovery (DR)	DR plans have been documented and tested.
Backups	Backups of data occur regularly.
Redundancy	Services should include at least two instances and could require deployment in multiple regions or locations.
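
A checklist like the one above can also be checked automatically. The sketch below is one possible approach, assuming each service is described by a simple dictionary of true or false answers; the item keys are hypothetical names chosen to mirror the checklist rows.

```python
# Hypothetical keys mirroring the checklist rows above.
REQUIRED_ITEMS = [
    "ownership",
    "sli_defined",
    "slo_defined",
    "deployment_strategy",
    "continuous_integration",
    "continuous_delivery",
    "static_code_analysis",
    "dr_plan_tested",
    "backups",
    "redundancy",
]


def missing_items(service: dict) -> list:
    """Return checklist items the service has not satisfied."""
    return [item for item in REQUIRED_ITEMS if not service.get(item, False)]


def is_ready(service: dict) -> bool:
    """A service is ready only when no required item is missing."""
    return not missing_items(service)


service = {item: True for item in REQUIRED_ITEMS}
service["backups"] = False

print(is_ready(service))
print(missing_items(service))
```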

Tools for Operating Software

SRE teams build systems and tools to automate the maintenance and deployment of production software. These systems and tools may be accessed via a command-line interface (CLI), allowing the development teams to reboot a server or view logs for a specific application. Some tools may be accessed programmatically via an application programming interface (API) to help automate the application deployment or maintenance steps.

The same systems and tools allow development teams to maintain their applications in a self-service-like manner. For example, providing developers access only to the staging environment allows them to be self-sufficient without jeopardizing production stability. This approach enables the SRE team to focus on providing systems and tools without becoming a blocker for development.
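
The staging-only access described above could be enforced with a simple permission check inside a self-service tool. This is a minimal sketch; the role names, environment names, and restart_service function are assumptions for illustration rather than the interface of any particular tool.

```python
# Hypothetical mapping of roles to the environments they may operate in.
ALLOWED_ENVIRONMENTS = {
    "developer": {"staging"},
    "sre": {"staging", "production"},
}


def can_operate(role: str, environment: str) -> bool:
    """Check whether a role may run operations in an environment."""
    return environment in ALLOWED_ENVIRONMENTS.get(role, set())


def restart_service(role: str, environment: str, service: str) -> str:
    """Pretend to restart a service, refusing unauthorized requests."""
    if not can_operate(role, environment):
        raise PermissionError(f"{role} may not operate in {environment}")
    # A real tool would trigger the restart here.
    return f"restarted {service} in {environment}"


print(restart_service("developer", "staging", "api"))
```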

Another example is providing a graphical user interface for the development teams to view production logs. This lets the development teams not only triage issues in production but also gain insight into how the application performs and make improvements where needed.

Responsibilities of DevOps vs. SRE Teams

The sections below detail non-exhaustive lists of the primary responsibilities of DevOps vs. SRE teams.

DevOps Team Responsibilities

Increase Developer Productivity

A key responsibility of the DevOps team is making developers as productive as possible. Highly productive teams can estimate development time more accurately, focus more on writing code than maintaining infrastructure, release new code faster, and improve job satisfaction.

Manage Software Releases

The process of delivering software into a production platform involves a set of steps executed in a specific order. CI/CD pipelines automate these steps. Automation reduces human error and the risk of negatively impacting production.

A CI/CD system also provides automated testing. Automated unit tests check blocks of code, and integration test suites test the overall application. By providing test results in the CI/CD pipeline, development teams get almost immediate feedback. This feedback allows them to improve their code faster than manual testing would.

Define Processes and Procedures

DevOps teams should also define and document processes and procedures to align the activities of developers and SREs. Clear processes allow the application teams to operate independently of the operations team. This independence decouples development activities from the platform maintenance events and prevents the operations team from blocking the release of new software.

SRE Team Responsibilities

Manage Configuration

Once an application is deployed to production, it must be operated while it serves clients. Infrastructure as Code (IaC) tools are commonly used to configure an application platform.

Managing IaC means using configuration files to define the desired state and letting a configuration automation tool (like Terraform) maintain the infrastructure instead of manually configuring parameters for every node and infrastructure component.
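
The desired-state idea can be illustrated with a small comparison between what the configuration files declare and what currently exists. This is a simplified sketch and not how Terraform or any specific tool is implemented; the resource names and settings are made up for the example.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Compare desired and actual state and list the changes needed."""
    to_create = {k: v for k, v in desired.items() if k not in actual}
    to_update = {
        k: v for k, v in desired.items() if k in actual and actual[k] != v
    }
    to_delete = [k for k in actual if k not in desired]
    return {"create": to_create, "update": to_update, "delete": to_delete}


# Hypothetical resources: what the config declares vs. what is running.
desired = {
    "web-1": {"size": "medium"},
    "web-2": {"size": "medium"},
    "db-1": {"size": "large"},
}
actual = {
    "web-1": {"size": "small"},
    "db-1": {"size": "large"},
    "old-worker": {"size": "small"},
}

print(plan_changes(desired, actual))
```

Because the tool computes the difference, engineers only edit the declared state and never configure each node by hand.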

Respond to Incidents

Incidents will occur once an application is deployed to production. An incident could be an application not responding or a database generating errors. These situations require the SRE team to acknowledge the incident, triage it, and resolve it.

Additionally, SREs communicate to stakeholders via status pages during an incident and with post-mortem analysis after an incident. Transparent and real-time communication is critical to building trust with end-users and business managers.
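
The acknowledge, triage, and resolve sequence can be modeled as a small state machine that also records a timeline for later post-mortem analysis. The class and status names below are assumptions for illustration; incident management tools each define their own workflow.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    TRIAGED = "triaged"
    RESOLVED = "resolved"


# Each status may only move to the next one in the sequence.
NEXT_STATUS = {
    Status.OPEN: Status.ACKNOWLEDGED,
    Status.ACKNOWLEDGED: Status.TRIAGED,
    Status.TRIAGED: Status.RESOLVED,
}


class Incident:
    """Tracks an incident through its lifecycle (hypothetical model)."""

    def __init__(self, summary: str):
        self.summary = summary
        self.status = Status.OPEN
        self.timeline = [Status.OPEN]

    def advance(self) -> Status:
        """Move to the next status, or fail if already resolved."""
        if self.status not in NEXT_STATUS:
            raise ValueError("incident is already resolved")
        self.status = NEXT_STATUS[self.status]
        self.timeline.append(self.status)
        return self.status


incident = Incident("database generating errors")
incident.advance()  # acknowledged
incident.advance()  # triaged
incident.advance()  # resolved
print([status.value for status in incident.timeline])
```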

Manage Observability Tools

Post-deployment, an application team needs to know how their application is performing in production. Observability, which is closely related to monitoring, gives application teams visibility through application and system logs, metrics such as CPU utilization and latency, and traces that follow a transaction's path through infrastructure tiers.

Observability also allows SRE teams to maintain service level objectives (SLOs) by defining goals for a metric’s range of values during normal operations. Below are examples of SLOs:

SLO Examples
Service Availability	Time (as a percentage) a service is available for use, such as 99.9%
Error Rates	Error counts or percentages for a given application, such as a 0.5% error rate
Security	A measurement of security controls, such as the percentage of Linux OS instances updated with the latest security patch
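
The first two SLO examples can be checked with simple arithmetic over request counts. The sketch below assumes availability is measured as the share of successful requests, which is one common choice but not the only one; the request counts are made up, and the targets come from the table above.

```python
def availability_percent(successful: int, total: int) -> float:
    """Share of requests served successfully, as a percentage."""
    if total == 0:
        return 100.0  # no traffic means nothing failed
    return 100 * successful / total


def error_rate_percent(errors: int, total: int) -> float:
    """Share of requests that returned an error, as a percentage."""
    if total == 0:
        return 0.0
    return 100 * errors / total


def meets_slos(successful: int, errors: int, total: int,
               availability_target: float = 99.9,
               error_rate_target: float = 0.5) -> bool:
    """Check both measurements against their SLO targets."""
    return (
        availability_percent(successful, total) >= availability_target
        and error_rate_percent(errors, total) <= error_rate_target
    )


# Hypothetical traffic for one measurement window.
print(meets_slos(successful=99_950, errors=50, total=100_000))
print(meets_slos(successful=99_000, errors=1_000, total=100_000))
```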

Conclusion

DevOps started as a set of tenets, best practices, and techniques that promote collaboration between developers and operators of web applications. Site reliability engineering is the term Google used to define the enhanced role of systems engineers who support the operations of critical software and platforms as a service.

While both terms are now associated with job titles, DevOps engineers focus more on code release management while SREs manage the platform's stability. Together, they rapidly deliver new software features into production environments without compromising user experience.