7 Tips On Building And Maintaining An SRE Team In Your Company

January 22, 2021

In today's "always on" world, Reliability is a primary business KPI. Plant the culture of Reliability by implementing these 7 simple tips to build a solid SRE team in your organization.

    Many of today’s hottest jobs didn’t exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were unheard of back then. Another relatively new role in demand is the Site Reliability Engineer, or SRE. By one estimate, 64% of SRE teams are less than three years old. Despite being new, the role adds a lot of value to an organization.

    SRE vs DevOps

    Site reliability engineering essentially merges development and operations into one discipline. Many people mix up SRE and DevOps. The two intertwine, but DevOps serves as the set of principles and SRE as the practice that puts them to work.

    Any company looking to implement site reliability engineering in its organization might want to start by following these seven tips to build and maintain an SRE team.

    1. Start Small and Internally

    Chances are your company needs an SRE team but doesn’t need a whole department right away. The role of site reliability engineering is to keep an online service reliable through alert creation, incident investigation, root cause remediation, and incident postmortems.

    The average tech-based company runs into bugs every so often. In the past, operations and development teams would come together to fix those issues in a piece of software or a service. An SRE approach merges those two functions into one.

    If you’re just starting to build your SRE team, you can begin by putting together a few people from your operations and technical departments and giving them sole responsibility for maintaining a service’s reliability.

    2. Get the Right People

    Once you’re ready to scale, the time might come when you’ll need additional help for your site reliability engineering team. SRE professionals are in high demand: at the time of writing, there were more than 1,300 site reliability engineering jobs listed on Indeed.

    The key to finding the right people for your SRE team is to know what you’re looking for. Here are a few qualifications to look for in a site reliability engineer.

    • Problem-solving and troubleshooting skills: Much of the SRE team’s work involves addressing incidents and issues in software. Often, these problems sit in systems or applications that the team didn’t create. So the ability to debug quickly, even without in-depth knowledge of a system, is a must-have skill.
    • A knack for automation: Toil can become a big problem in many tech-based services. The right site reliability engineer will look for ways to automate away the toil, reducing manual work to a minimum so that staff only deal with high-priority items.
    • Constant learning: As systems evolve, so will problems. Good SREs have to keep brushing up their knowledge of the systems, code, and processes that change with time.
    • Teamwork: Addressing incidents is rarely a one-person job, so SREs need to work well in teams. Collaboration and communication are skills to look out for.
    • Bird’s eye view perspective: When addressing bugs, it can be easy to get caught up in the wrong things when you’re stuck in the middle of it. That’s why good SREs need the ability to see the bigger picture and find solutions in larger contexts. A successful site reliability engineer will find the root cause and create an overarching solution.

    3. Define your SLOs

    An SRE team is far more likely to succeed with service level objectives in place. Service level objectives, or SLOs, are target values for the key performance metrics of a service, which are known as service level indicators (SLIs). SLOs vary depending on the kind of service a business offers. Generally, a user-facing serving system will track availability, latency, and throughput as indicators. Storage-based systems will often place more emphasis on latency, availability, and durability.

    Setting up SLOs also involves choosing the values that a company would like to maintain for each indicator. The numbers in your SLOs are the minimum thresholds that the system should hold to. When setting an SLO, don’t base it solely on current performance, as this might commit you to targets that are unrealistic to sustain. Keep your objectives simple and avoid absolutes. The fewer SLOs you have in place, the better, so measure only the indicators that matter most to you.
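
    As a rough illustration of this idea, the sketch below checks measured indicators against SLO thresholds. The `Slo` class, the indicator names, and every numeric target are hypothetical values chosen for the example, not figures from any particular tool or team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO: an indicator name plus the threshold to hold."""
    indicator: str
    target: float
    higher_is_better: bool = True

    def is_met(self, measured: float) -> bool:
        # Availability should stay at or above its target;
        # latency should stay at or below its target.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (1.0 when there is no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


# Illustrative targets only: a short list, with no 100% absolutes.
slos = [
    Slo("availability", target=0.999),
    Slo("p95_latency_ms", target=300.0, higher_is_better=False),
]

# Made-up measurements for one reporting window.
measured = {
    "availability": availability(good_requests=99_950, total_requests=100_000),
    "p95_latency_ms": 340.0,
}

report = {slo.indicator: slo.is_met(measured[slo.indicator]) for slo in slos}
print(report)
```

    With these made-up numbers the availability objective is met while the latency objective is not, which is the kind of signal a team would review before deciding what to work on next.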

    4. Set holistic systems to handle incident management

    Incident management is one of the most important aspects of site reliability engineering. In a survey by Catchpoint, 49% of respondents said that they had worked on an incident in the last week or so. When handling incidents, a system needs to be in place to keep the debugging and maintenance process as smooth as possible.

    One of the most important aspects of an incident management system is keeping track of on-call responsibilities. SRE team responsibilities can get extremely exhausting without an effective means to control the flow of on-call incidents. Using a system like Squadcast can help resolve incidents with more clarity and structure.
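
    To make the idea of tracking on-call responsibilities concrete, here is a minimal sketch of a weekly round-robin rotation. It is not how Squadcast or any other product works internally; the function names, engineer names, and dates are all assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks):
    """Assign one primary on-call engineer per week, round-robin."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    shifts = []
    for week, engineer in enumerate(islice(cycle(engineers), weeks)):
        shift_start = start + timedelta(weeks=week)
        shift_end = shift_start + timedelta(days=6)
        shifts.append((shift_start, shift_end, engineer))
    return shifts


def on_call_for(shifts, day):
    """Return who is on call on a given day, or None if the day is not covered."""
    for shift_start, shift_end, engineer in shifts:
        if shift_start <= day <= shift_end:
            return engineer
    return None


# Names and dates are invented for illustration.
rotation = build_rotation(["asha", "ben", "chen"], start=date(2021, 1, 4), weeks=4)
for shift_start, shift_end, engineer in rotation:
    print(f"{shift_start} to {shift_end}: {engineer}")
```

    Even a simple schedule like this makes it clear who owns an incident at any moment and spreads the load evenly, which helps keep on-call work from becoming exhausting.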

    5. Accept failure as part of the norm

    Most people don’t like experiencing failure, but if your company wants to maintain a healthy and productive SRE team, each member must get used to accepting failure as part of the profession. No system is ever perfect, especially in the early stages of development.

    Many SRE teams make the mistake of setting the bar too high right away and committing to unrealistic SLO definitions and targets. A better operational practice is to aim for a minimum viable target and then slowly raise it once the team and the company as a whole build up confidence.
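
    A quick calculation shows why raising targets gradually matters. The sketch below converts an availability target into the downtime it permits over a window. The 30-day window and the three targets are assumptions picked for illustration, and `allowed_downtime` is a hypothetical helper.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window_days: int = 30) -> timedelta:
    """Downtime permitted in the window while still meeting the target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    window = timedelta(days=window_days)
    # The share of the window that may be spent unavailable.
    return window * (1.0 - availability_target)


# Each step up in the target shrinks the room left for failure.
for target in (0.99, 0.995, 0.999):
    print(f"{target:.1%} -> {allowed_downtime(target)}")
```

    Over 30 days, a 99% target leaves roughly 7.2 hours of downtime, while 99.9% leaves only about 43 minutes, so each extra nine demands far more from the team.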

    6. Perform incident postmortems to learn from failures and mistakes

    There’s an old saying: “Dead men tell no tales.” But that isn’t the case with system incidents. There is much to learn from incidents even after the problems have been resolved. That’s why it’s a great practice to perform incident postmortems so that SRE teams can learn from their mistakes. A proper SRE approach takes the best practices for postmortems into account.

    When performing post-incident analysis, there are several things that site reliability teams must examine. First, they should look into the causes and triggers of the failure. What caused the system to fail? Second, the team should pinpoint as many of the effects as they can find. What did the system failure affect? For example, a payment gateway error might have caused a discrepancy in payments or collections, which can be a headache if left unaddressed for even a few days. Lastly, a successful postmortem will propose solutions and recommendations in case a similar error occurs in the future.
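
    One way to keep a postmortem honest is to capture those three questions in a simple record and check that none is left empty. The `Postmortem` class below is a hypothetical sketch, and the example cause and effect are invented, loosely based on the payment gateway scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record mirroring the three questions in the text."""
    title: str
    causes: list = field(default_factory=list)
    effects: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

    def missing_sections(self):
        """List the sections that still need to be filled in."""
        sections = {
            "causes": self.causes,
            "effects": self.effects,
            "recommendations": self.recommendations,
        }
        return [name for name, items in sections.items() if not items]

    def is_complete(self):
        """True once every section has at least one entry."""
        return not self.missing_sections()


# Example content is invented for illustration.
pm = Postmortem(title="Payment gateway errors")
pm.causes.append("Expired API credential on the gateway client")
pm.effects.append("Some payments recorded twice, causing a collections discrepancy")

print(pm.missing_sections())
```

    In this example the record is flagged as incomplete until at least one recommendation is written down, which nudges the team to finish the analysis instead of stopping at the cause.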

    7. Maintain a simple incident management system

    A team structure alone isn’t enough to create a productive SRE team. There also needs to be a project and incident management system in place. Various services and IT management tools are available to SRE teams today. When choosing one, team managers should consider ease of use, communication barriers, available integrations, and collaboration capabilities.

    Setting Your SRE Team Up For Success

    An SRE team can be likened to an aircraft maintenance crew fixing a plane while it’s 50,000 feet in the air. Setting your SRE team up for success is crucial, as they ensure that your company’s service is available to your clients. While errors and bugs are inevitable in any software as a service, they can be kept to a minimum, making outages and errors a rare occurrence. But for that to happen, you’ll need a solid SRE team in place, proactively finding ways to avoid errors and ready to spring into action when duty calls.

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

    Written by: Squadcast