From SysAdmin to SRE: How to evolve your skillset

Dec 16, 2020
Last Updated: May 2, 2024

Are you wondering what it takes to become an SRE from a SysAdmin background? This blog covers the growth areas and technical skills needed to successfully transition to an SRE role.

    The last decade has seen widespread adoption of SRE practices based on the best practices laid out by Google. Many SysAdmins have observed this trend and are now considering becoming SREs. This raises the question: how much do the skills of an SRE and a SysAdmin overlap?

    Both roles are concerned with IT operations, and there is significant overlap in their responsibilities. Broadly, Google has defined SRE as software engineering principles applied to IT operations at scale. What does this mean in reality? In practice, SRE applies a handful of key principles, such as embracing risk, reducing toil and automating wherever possible, to IT operations. It frequently involves technologies that may be new to some SysAdmins.

    In this blog we look at the growth areas and skills a SysAdmin needs to pick up to become an SRE. The transition requires some mindset changes and some new technical skills, but it shouldn't be difficult for an experienced SysAdmin. Here are the changes you need to make to your mindset and skills to successfully transition to an SRE role.

    Mindset Changes

    Embracing Risk

    As a SysAdmin, the primary focus of your work has been to maintain order and keep the systems under your care running smoothly. SysAdmins have traditionally focused on keeping their infrastructure stable and secure and on eliminating any risk of failure. SREs, on the other hand, recognize that some amount of failure is inevitable. The error budget is an SRE concept that quantifies how much downtime your infrastructure can have before you are in breach of an SLO (service level objective). Armed with that knowledge, an SRE can decide either to support agility and allow riskier changes or to be more safety conscious and risk averse. This allows SREs to leverage risk for the benefit of the product rather than futilely attempting to eliminate risk and potentially becoming a bottleneck.
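
    To make the error budget arithmetic concrete, here is a minimal Python sketch. It assumes an availability SLO expressed as a fraction (for example 0.999) over a fixed window. The function names and the 25% "caution" threshold are hypothetical choices for illustration, not part of any standard.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` before an availability SLO is breached."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def release_posture(remaining: timedelta, budget: timedelta,
                    caution_fraction: float = 0.25) -> str:
    """Map the remaining budget to a release stance.

    The 25% threshold is an illustrative assumption; teams pick their own.
    """
    if remaining <= timedelta(0):
        return "freeze risky changes"
    if remaining < budget * caution_fraction:
        return "slow down"
    return "ship normally"


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of allowed downtime.
budget = error_budget(0.999, timedelta(days=30))
downtime_so_far = timedelta(minutes=12)
print(budget, release_posture(budget - downtime_so_far, budget))
```

    The point of the sketch is that the decision to allow riskier changes follows from a number everyone can see, rather than from a general aversion to failure.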

    Reducing Toil

    Much of SRE concerns itself with removing toil. In this context, toil refers to tasks that are repetitive and add no enduring value to the upkeep of your infrastructure. Removing it often means automating jobs that are repetitive and time-consuming. By capping toil at half of their working time, SREs free up the rest to improve other aspects of the system, which encourages improvements in stability and performance and leaves room for creative solutions. SysAdmins are all too familiar with the repetitive configuration of hardware and software to fit the needs of their organisation. Most experienced SysAdmins have developed automation practices that work well within their own organisation but are not standardised. As an SRE, you are expected to know standardised practices that work across organisations of all types and across the major tech stacks. Automation software such as Puppet, Chef and Ansible helps minimise repetitive steps and frees you for more substantive and thorough work.
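
    One simple way to check the "half of the work" limit is to log hours per task and compute the share spent on toil. The sketch below assumes a hand-kept log of (task, hours, is_toil) entries; the task names and numbers are made up for illustration.

```python
from collections import defaultdict

TOIL_CAP = 0.5  # the "half of the work" limit discussed above


def toil_fraction(entries):
    """Share of logged hours spent on toil; entries are (task, hours, is_toil)."""
    total = sum(hours for _, hours, _ in entries)
    if total == 0:
        return 0.0
    toil = sum(hours for _, hours, is_toil in entries if is_toil)
    return toil / total


def automation_candidates(entries, top=3):
    """Toil tasks ranked by hours consumed, largest first."""
    hours_by_task = defaultdict(float)
    for task, hours, is_toil in entries:
        if is_toil:
            hours_by_task[task] += hours
    ranked = sorted(hours_by_task.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


# Hypothetical week of work for one engineer.
week = [
    ("rotate TLS certs by hand", 6.0, True),
    ("restart stuck workers", 9.0, True),
    ("capacity dashboard", 10.0, False),
    ("ticket-driven user provisioning", 8.0, True),
    ("load-test tooling", 7.0, False),
]

share = toil_fraction(week)
if share > TOIL_CAP:
    print(f"toil is {share:.0%} of the week; automate first:")
    for task, hours in automation_candidates(week):
        print(f"  {task}: {hours}h")
```

    Ranking toil by hours consumed gives a rough order in which to automate, which is usually more useful than the single percentage.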

    Automate all the things

    Automation is a substantial part of good SRE practice: it takes over the tasks that have been identified as toil. This can include running scripts when certain events occur, monitoring clusters, automating full-scale deployments, managing infrastructure as code and auto-configuring virtual machines in the cloud. SREs automate to keep their workload in check, so that it does not grow linearly with the number of users or machines they maintain. Other benefits include more reliable deployments, improved performance and lower costs all around.
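
    Much configuration automation rests on one idea: describe the state you want, compare it with the state you have, and apply only the difference. The Python sketch below illustrates that idea with plain dictionaries of package versions. It is an in-memory illustration under that assumption, not how Puppet, Chef, Ansible or any other tool is actually implemented, and the package names and versions are made up.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]  # (action, package, version)


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> List[Step]:
    """Compute the steps that turn `actual` into `desired`."""
    steps: List[Step] = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            steps.append(("install", name, version))
        elif actual[name] != version:
            steps.append(("change", name, version))
    for name in sorted(actual.keys() - desired.keys()):
        steps.append(("remove", name, actual[name]))
    return steps


def apply_changes(actual: Dict[str, str], steps: List[Step]) -> Dict[str, str]:
    """Return a new state with `steps` applied; the input is left untouched."""
    result = dict(actual)
    for action, name, version in steps:
        if action == "remove":
            result.pop(name, None)
        else:
            result[name] = version
    return result


# Hypothetical desired and actual state for one machine.
desired = {"nginx": "1.24", "chrony": "4.3"}
actual = {"nginx": "1.22", "telnetd": "0.17"}

steps = plan_changes(desired, actual)
converged = apply_changes(actual, steps)
print(steps)
print(plan_changes(desired, converged))  # nothing left to do on a second run
```

    Because a second run produces an empty plan, the same description can be applied to one machine or a thousand without extra manual work, which is what keeps workload from growing linearly with fleet size.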

    Dealing with failure: Understanding SLOs and blameless postmortems

    SysAdmins are familiar with the RCA (Root Cause Analysis) process: when a failure occurs, the root cause is identified and a solution is put in place. The best practices Google has laid out for SREs go beyond root causes and seek to understand the weaknesses in the system that led to the breakdown. Blameless postmortems encourage you to find flaws in the existing reporting and operational processes rather than in individuals. Good SRE practice insists on keeping people in the loop when failure occurs, including your customers. This is a cultural shift for SysAdmins, who rarely keep customers in the loop when things go down. These practices also include a formal, written incident postmortem process. The conclusions from a postmortem must then be fed back into the planning process for future deployments. From an SRE's viewpoint, failure takes on a fresh perspective: it is an opportunity to learn from your mistakes and do better next time around.

    Soft Skills

    SRE culture demands much greater collaboration with other parts of the organisation. While SLOs bring greater transparency to operations, achieving consensus on those objectives and deciding on the next step can often be challenging. Business teams, product management, developers and SREs all have slightly different goals and incentives. Bridging the gap between these stakeholder perspectives may require conflict resolution skills. Explaining the trade-off between feature development and stability, and how error budgets can help decide between them, requires strong communication skills. Finally, good negotiation skills help ensure that SRE goals are accepted in the face of pressure from Business, Product or Development.

    Technical Skills

    Transitioning from SysAdmin to SRE requires brushing up on or acquiring various technical skills.

    • Programming & Testing Skills: The emphasis on toil reduction and automation in SRE requires significantly stronger programming and testing skills. Typically, an SRE should know one highly productive scripting language like Python and one high-performance systems language like Go.
    • Infrastructure as Code: Traditionally, infrastructure deployment has been a slow, manual, labour-intensive process. Because of this, it is expensive, inelastic, inconsistent and unreliable. Infrastructure as Code (IaC) is an automation technique that brings the rigor of software engineering to infrastructure management. Tools like Ansible, Terraform, Puppet or Chef can be used to power an IaC initiative.
    • Cloud, Containers & Container Orchestration: Cloud and container services make something that was previously difficult to automate, physical hardware, manageable via standardised APIs. As an added benefit, they are often cheaper, more flexible and faster to provision than traditional hardware. They have also made the IaC technique far more powerful and useful. Knowledge of AWS, Kubernetes and Docker is now considered a basic skill for SREs.
    • Modern Monitoring Tools: Active checking systems, metrics collection and log aggregation have been the traditional mainstays of monitoring. More recently, code instrumentation and distributed tracing have been added to this arsenal. Older de facto standard tools like Nagios, Ganglia and rsyslog have largely been superseded by tools like Prometheus, Datadog and the ELK stack. APMs like New Relic are now key for instrumentation, and OpenTelemetry is promising as a vendor-neutral standard for distributed tracing. Familiarity with these platforms is a significant requirement for a good SRE.
    • Statistical Analysis: SRE culture demands hard data to support decision making. With the vast volumes of data generated by monitoring tools, some basic statistical analysis is necessary to turn it into actionable information. That information can be used for capacity planning, release planning, continuous improvement and incident response.
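
    As a small example of the kind of basic statistical analysis mentioned in the last item, the Python sketch below summarises request latencies. The sample values and the 200 ms threshold are made up for illustration, and the percentile uses the simple nearest-rank method, which is one of several common definitions.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in the range 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fraction_within(samples, threshold):
    """Share of samples at or below `threshold`."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 99, 130, 115]

print("mean:", statistics.mean(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
print("within 200 ms:", fraction_within(latencies_ms, 200))
```

    A single outlier pulls the mean far above what a typical request experiences, which is why latency objectives are usually stated as percentiles or as the share of requests under a threshold rather than as averages.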

    Conclusion

    SysAdmins and SREs are both expected to be drivers of reliability and of change that benefits customers. If you are a SysAdmin, you have doubtless carried out many operations at the systems level that will be invaluable to you as an SRE. The necessary areas of growth include learning to adapt to change, since the SRE practices in vogue today may very well change tomorrow. An SRE is someone who brings practices that have been a mainstay of software development at scale to the operations side. This crossover pays dividends for the organisation, as SREs find solutions to recurrent problems without investing in more manpower and hardware. The future of SRE is bright as more organisations seek to cut costs and streamline their IT operations.

    Written By: Biju Chacko, Nir Sharma