The Evolution of Incident Management from On-Call to SRE

March 7, 2023

Incident Management has evolved considerably over the last couple of decades. It was traditionally limited to an on-call team and an alerting system. Today it includes automated Incident Response combined with a complex set of SRE workflows.

    Importance of Reliability

    While the number of active internet users and people consuming digital products has been on the rise for a while, it is actually the combination of increased user expectations and competitive digital experiences that has pushed organizations to deliver highly Reliable products and services.

    The bottom line is, customers have the right to seek reliable software, and the right to expect the product to work when they need it. And it is the responsibility of organizations to build Reliable products.

    Having said that, no software can be 100% reliable. Even achieving 99.9% reliability is a monumental task. As engineering infrastructure grows more complex by the day, Incidents become inevitable. Triaging and remediating issues quickly, with minimal impact, is what makes all the difference.
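
    To put a figure such as 99.9% in perspective, it helps to convert it into the downtime it permits. The snippet below is a minimal sketch of that arithmetic. The function name and the 30-day window are assumptions chosen for illustration, not something the figures above prescribe.

```python
# Illustrative arithmetic: how much downtime a given availability target allows.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted over a window of `days`."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

    Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month, which is why every minute of triage counts.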

    From the Vault: Recapping Incidents & Outages from the past

    Let’s look back at some notable outages from the past that have had a major impact on both businesses and end users alike.

    October 2021 - A mega outage took down Facebook, WhatsApp, Messenger, Instagram and Oculus VR for roughly six hours, during which none of those products could be used.

    November 2021 - A downstream effect of a Google Cloud outage led to outages across multiple GCP products. This also indirectly impacted many non-Google companies.

    December 2022 - An issue with Amazon’s Search impacted at least 20% of all global users for almost an entire day.

    January 2023 - Most recently, the Federal Aviation Administration (FAA) suffered an outage due to a failed scheduled maintenance, causing 32,578 flights to be delayed and a further 409 to be cancelled altogether. Needless to say, the monetary impact was massive. Share prices of numerous U.S. air carriers fell steeply in the immediate aftermath.

    Reliability trends as of 2023

    These are just a few of the major outages that have impacted users on a global scale. In reality, incidents such as these are far more frequent than most people realize. While businesses and business owners bear the brunt of such outages, the impact is experienced by end users too, resulting in a poor User/Customer Experience (UX/CX).

    Here are some interesting stats on the impact of poor CX/UX:

    • It takes 12 positive user experiences to make up for one unresolved negative experience
    • 88% of web visitors are less likely to return to a site after a bad experience
    • And even a 1-second delay in page load can cause a 7% loss in customers

    And that is why resolving incidents quickly is CRITICAL! But (literally :p) the million dollar question is, how do you deal with incidents effectively? Let’s address this by first probing into the challenges of Incident Management.

    State of Incident Management today

    Evolving business and user needs have directly impacted Incident Management practices.

    1. Increasingly complex systems have led to increasingly complex Incidents
      The use of public cloud and microservices architecture has made it difficult to find out what went wrong, e.g. which service is impacted, or whether the outage has an upstream/downstream impact on other services. As a result, Incidents are more complex too.
    2. User expectations have grown considerably due to increased dependency on technology
      The widespread adoption of technology has made users more dependent on it and more comfortable using it. As a result, they are unwilling to put up with any kind of downtime or bad experience.
    3. Tool sprawl amid evolving business needs adds to the complexity
      The increasing number of tools within the tech stack to address complex requirements and use cases only adds to the complexity of Incident Management.
    “...you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they think they might need in the future.” -  Steve McGhee, Reliability Advocate, SRE, Google Cloud

    Evolution of Incident Management

    Over the years, the scope of activities associated with Incident Management has kept growing. Most of the evolution that has taken place can be bucketed into one of four categories: Technology, People, Process, and Tools.

    Technology

    When? What was it like?
    15 years ago
    • Most teams ran monolithic applications
    • These were easy-to-operate systems, with very little sophistication
    7 years ago
    • Sophisticated distributed systems in medium-to-large organizations were the norm
    • Growing adoption of microservices architecture and public clouds
    Today
    • Even the smallest teams run complex, distributed apps
    • Widespread adoption of microservices architecture and public cloud services

    People

    When? What was it like?
    15 years ago
    • Large Operations teams with manual workloads
    • Basic On-Call team with low-skilled labor
    7 years ago
    • Smaller, more efficient Ops teams with partially automated workload
    • Dedicated Incident Response teams with basic automation to notify On-Call
    Today
    • Fewer members in Operations; but fully automated workloads
    • Dedicated Response teams with instant & diverse notifications for On-Call

    Process

    When? What was it like?
    15 years ago
    • Manual processes (with very low/no automation)
    • Less stringent SLAs
    • Customers more accepting of outages
    7 years ago
    • Improved automation in systems architecture
    • More stringent SLAs
    • Customers less accepting of outages
    Today
    • Heavy reliance on automation due to prevailing system complexity
    • Strict SLAs
    • Little to no tolerance towards outages

    Tools

    When? What was it like?
    15 years ago
    • Less tooling involved
    • Basic monitoring/alerting solutions in place
    7 years ago
    • Improved operations tooling with IaC
    • Advanced monitoring/alerting with increased automation
    Today
    • Heavy operations tooling
    • Specialized tools associated with the Observability world

    Problems adjusting to modern Incident Management

    Now is the ideal time to address issues that are holding engineering teams back from doing Incident Management the right way.

    Managing Complexity

    Unclear Service Ownership and poor visibility are the foremost factors preventing engineering teams from making the most of their time during incident triage. This is a result of the adoption of distributed applications, in particular microservices.

    An unwieldy number of services makes it hard to track service health and the respective service owners. Tool sprawl (a great number of tools within the tech stack) makes it even more difficult to track dependencies and ownership.

    Lack of Automation

    Achieving a respectable amount of automation is still a distant dream for most incident response teams. Automating the incident management workflow across the infrastructure stack makes a great deal of difference in improving MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).
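
    As a rough illustration of what these two metrics measure, the sketch below computes them from incident timestamps. The `Incident` record and its field names are hypothetical, and MTTR is measured here from the moment the incident was triggered. Some teams measure it from acknowledgement instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class Incident:
    """Hypothetical incident record holding only the timestamps needed here."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def _mean(deltas: Iterable[timedelta]) -> timedelta:
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: Iterable[Incident]) -> timedelta:
    """Mean time from trigger to acknowledgement."""
    return _mean(i.acknowledged_at - i.triggered_at for i in incidents)


def mttr(incidents: Iterable[Incident]) -> timedelta:
    """Mean time from trigger to resolution (one common definition)."""
    return _mean(i.resolved_at - i.triggered_at for i in incidents)
```

    Tracking both numbers over time shows whether automation is shortening the notification step (MTTA), the remediation step (MTTR), or both.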

    The tasks that are still manual, but have great potential for automation during incident response, are:

    1. Ability to quickly notify the On-Call team of service outages/service degradation
    2. Ability to automate incident escalations to the senior/ more experienced responders/ stakeholders
    3. Providing the appropriate conference bridge for communication and documenting incident notes

    Poor Collaboration

    Poor collaboration during an incident is a major reason response teams are kept from doing what they do best. The process of informing members within the team, across teams, within the organization, and outside of the organization must be simplified and organized.

    Activities that can improve with better collaboration are:

    1. Bringing visibility of service health to team members, internal and external stakeholders, customers, etc. with a Status Page
    2. Maintaining a single source of truth in regard to incident impact and incident response 
    3. Conducting Root Cause Analyses (also called Postmortems or Incident Retrospectives) in a blameless way

    Lack of visibility into Service Health

    One of the most important (and responsible) activities for the response team is to facilitate complete transparency about incident impact, triage, and resolution to internal and external stakeholders as well as business owners. The problems:

    1. Absence of a platform, such as a Status Page, that can keep all stakeholders informed of impact, timelines, and resolution progress
    2. Inability to track the health of the dependent upstream/ downstream services and not just the affected service
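
    The second problem is essentially a graph question: given a failing service, which other services rely on it? The sketch below walks a dependency map to answer that. The service names and the `DEPENDS_ON` map are made up for illustration. In practice this data would come from a service catalog or similar source.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["catalog"],
    "payments": ["auth"],
    "catalog": [],
    "auth": [],
}


def impacted_services(failed: str, depends_on: dict) -> set:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so each service points at the services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(failed)  # only relevant if the map contains a cycle
    return seen
```

    With this example map, a failure in `auth` would flag `payments` and `checkout` as potentially affected, which is the kind of view a responder needs during triage.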

    Now, the timely question to probe is: what should Engineering teams start doing? And how can organizations support them in their Reliability journey?

    What can Engineering Leaders/ Teams do to mitigate the problem

    The facets of Incident Management today can be broadly classified into 3 categories:

    • On-Call Alerting
    • Incident Response (automated & collaborative)
    • Effective SRE

    Addressing the difficulties and devising appropriate processes and strategies around these categories can help engineering teams improve their Incident Management by 90%. Certainly sounds ambitious, so let's understand this in more detail.

    On-Call Alerting & Routing

    On-Call is the foundation of a good Reliability practice. There are two main aspects to On-Call alerting, and they are highlighted below.

    a. Centralizing Incident Alerting & Monitoring

    The crucial aspect of On-Call Alerting is the ability to bring all the alerts into a single/ centralized command centre. This is important because a typical tech stack is made up of multiple alerting tools monitoring different services (or parts of the infrastructure), put in place by different users. An ecosystem that can bring such alerts together will make Incident Management much more organized.
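
    One way to picture centralization is as a normalization step: every tool sends its own payload shape, and each is mapped into one common alert format. The sketch below assumes two made-up sources, `metrics-tool` and `uptime-checker`, with invented payload fields. Real tools will have different schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common alert shape used after normalization (hypothetical)."""
    source: str
    service: str
    severity: str
    summary: str


def from_metrics_tool(payload: dict) -> Alert:
    # Assumed payload shape: nested "labels" and "annotations" dictionaries.
    return Alert(
        source="metrics-tool",
        service=payload["labels"]["service"],
        severity=payload["labels"]["severity"].lower(),
        summary=payload["annotations"]["summary"],
    )


def from_uptime_checker(payload: dict) -> Alert:
    # Assumed payload shape: flat fields, with a "status" of "down" or "up".
    return Alert(
        source="uptime-checker",
        service=payload["check_name"],
        severity="critical" if payload["status"] == "down" else "info",
        summary=payload["message"],
    )


NORMALIZERS = {
    "metrics-tool": from_metrics_tool,
    "uptime-checker": from_uptime_checker,
}


def ingest(source: str, payload: dict) -> Alert:
    """Convert a tool-specific payload into the common Alert shape."""
    if source not in NORMALIZERS:
        raise ValueError(f"no normalizer registered for source {source!r}")
    return NORMALIZERS[source](payload)
```

    Once every alert looks the same, routing, deduplication, and reporting can be written once instead of once per tool.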

    b. On-Call Scheduling & intelligent routing

    While organized alerting is a great first step, effective Incident Response is all about having an On-Call Schedule in place and routing alerts to the concerned On-Call responder. In case of non-resolution or inaction, the alert should be escalated to the most appropriate engineer (or user).
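
    The routing and escalation logic described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `Shift` record, the 10-minute acknowledgement timeout, and the rule that each missed timeout moves the page one step up the escalation chain.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class Shift:
    """Hypothetical on-call shift; `end` is exclusive."""
    responder: str
    start: datetime
    end: datetime


def on_call(schedule: Sequence[Shift], at: datetime) -> Optional[str]:
    """Return the responder whose shift covers `at`, if any."""
    for shift in schedule:
        if shift.start <= at < shift.end:
            return shift.responder
    return None


def who_to_notify(
    triggered_at: datetime,
    now: datetime,
    schedule: Sequence[Shift],
    escalation_chain: Sequence[str],
    ack_timeout: timedelta = timedelta(minutes=10),
) -> str:
    """Pick who to page for an incident that is still unacknowledged at `now`."""
    primary = on_call(schedule, triggered_at)
    chain = ([primary] if primary else []) + list(escalation_chain)
    if not chain:
        raise ValueError("nobody is on call and the escalation chain is empty")
    # Each full timeout that passes without acknowledgement moves one level up.
    level = max(0, int((now - triggered_at) / ack_timeout))
    return chain[min(level, len(chain) - 1)]
```

    Capping the level at the end of the chain means the last person keeps getting paged rather than the alert silently falling off the list.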

    Incident Response (automated & collaborative)

    While On-Call scheduling and alert routing are the fundamentals, it is Incident Response that gives structure to Incident Management.

    a. Alert noise reduction and correlation

    Oftentimes, teams get notified of unnecessary events. More commonly, while an incident is being resolved, engineers keep getting notified of similar and related alerts that are better handled as one collective incident than as separate ones. With the right practices in place, incident/alert fatigue can be reduced with automation rules that suppress and deduplicate alerts.
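
    A minimal sketch of such rules is shown below. It assumes two simple policies that are not taken from any particular product: alerts from an explicitly suppressed service are dropped, and alerts with the same service and check name are treated as duplicates if they arrive within a fixed window of the last one that notified.

```python
from datetime import datetime, timedelta


class AlertFilter:
    """Decides whether an incoming alert should page anyone (illustrative only)."""

    def __init__(self, suppressed_services, dedup_window=timedelta(minutes=5)):
        self.suppressed = set(suppressed_services)
        self.dedup_window = dedup_window
        # Maps a dedup key to the time of the last alert that did notify.
        self._last_notified = {}

    def should_notify(self, service: str, check: str, at: datetime) -> bool:
        # Suppression rule: never notify for services marked as suppressed.
        if service in self.suppressed:
            return False
        # Deduplication rule: same service and check within the window is a repeat.
        key = (service, check)
        last = self._last_notified.get(key)
        if last is not None and at - last < self.dedup_window:
            return False
        self._last_notified[key] = at
        return True
```

    A design choice worth noting: the window is anchored to the last alert that notified, so a persistent problem still re-notifies once per window instead of going quiet forever.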

    b. Integration & Collaboration

    Integrating the infrastructure stack with the tools already used in the response process is possibly the simplest way to organize Incident Response. Collaboration can improve by establishing integrations with:

    1. ITSM tools for ticket management 
    2. ChatOps tools for communication 
    3. CI/CD tools for deployment/ quick rollback

    Effective SRE

    Engineering Reliability into a product requires the entire organization to adopt the SRE mindset and buy into the ideology. While On-Call is at one end of the spectrum, we at Squadcast believe that SRE (Site Reliability Engineering) is at the other end.

    But what exactly is SRE?

    For starters, SRE should not be confused with DevOps. While DevOps focuses on Principles, SRE focuses on Activities. SRE is fundamentally about taking an engineering approach to systems operations in order to achieve better reliability and performance. It puts a premium on monitoring, tracking bugs, and creating systems and automation that solve the problem in the long term.

    While Google was the birthplace of SRE, many top technology companies such as LinkedIn, Netflix, Amazon, Apple, and Facebook have adopted it and benefitted greatly.

    POV: Gartner predicts that, by 2027, 75% of enterprises will use SRE practices organization-wide, up from 10% in 2022.

    What difference will SRE make?

    Today users are expecting nothing but the very best. And an exclusive focus on SRE practices will help in:

    1. Providing a delightful User experience (or Customer experience)
    2. Improving feature velocity
    3. Providing fast and proactive issue resolution

    How does SRE add value to the business?

    SRE adds a ton of value to any business that is digital-first. Some of the key points are:

    1. Provides an engineering-driven and data-driven approach to improving customer satisfaction
    2. Enables you to measure toil and save time for strategic tasks
    3. Leverages automation
    4. Encourages learning from Incident Retrospectives
    5. Promotes communication through Status Pages

    The bottom line is, Reliability has evolved. You have to be proactive and preventive.
    Teams will have to fix things faster and keep getting better at it.

    And on that note, let’s look at the different SRE aspects that engineering teams can adopt for better Incident Management:

    a. Automated response actions

    Automating manual tasks and eliminating toil is one of the fundamental principles on which SRE is built. Be it automating workflows with Runbooks or automating response actions, SRE is a big advocate of automation, and response teams benefit widely from having it in place.
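
    As a rough picture of what an automated Runbook can look like, the sketch below registers a list of steps per alert type and runs them in order when that alert fires. The alert type `disk_full` and both steps are hypothetical placeholders that only return a description of what a real step would do.

```python
from typing import Callable, Dict, List

# Maps an alert type to the ordered steps of its runbook.
RUNBOOKS: Dict[str, List[Callable[[str], str]]] = {}


def runbook(alert_type: str):
    """Decorator that appends a step to the runbook for `alert_type`."""
    def register(step: Callable[[str], str]) -> Callable[[str], str]:
        RUNBOOKS.setdefault(alert_type, []).append(step)
        return step
    return register


@runbook("disk_full")
def clear_temp_files(service: str) -> str:
    # Placeholder: a real step would call out to the host or an orchestration tool.
    return f"cleared temp files on {service}"


@runbook("disk_full")
def notify_owner(service: str) -> str:
    # Placeholder: a real step would post to a chat channel or page the owner.
    return f"notified owner of {service}"


def run_runbook(alert_type: str, service: str) -> List[str]:
    """Run every registered step in order; return an empty list if none exist."""
    return [step(service) for step in RUNBOOKS.get(alert_type, [])]
```

    An empty result is a useful signal in itself: it marks an alert type that still depends entirely on manual response and is a candidate for automation.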

    b. Transparency

    SRE advocates for providing complete visibility into the health status of services and this can be achieved by the use of Status Pages. It also puts a premium on the need to have greater transparency and visibility of service ownership within the organization.

    c. Blameless culture

    When an incident occurs, SRE places great emphasis on examining the process rather than blaming the individuals involved. This blameless culture goes a long way in fostering a healthy team culture and promoting team harmony. RCAs conducted in this way are called Incident Retrospectives or Postmortems.

    d. SLO and Error Budget tracking

    This is all about using a metric-driven approach to balance Reliability and Innovation. It encourages the use of SLIs to keep track of service health. By actively tracking SLIs, teams can keep SLOs and Error Budgets in check and avoid breaching any customer SLAs.
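
    The relationship between an SLI, an SLO, and an Error Budget is easiest to see with numbers. The sketch below assumes a request-based availability SLI (good requests divided by total requests). The function name and the returned field names are illustrative choices.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize SLI and Error Budget consumption for a request-based SLO.

    `slo_target` is a fraction, e.g. 0.999 for a 99.9% availability objective.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # SLI: the fraction of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Error Budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget still unspent; negative means the budget is blown.
    budget_remaining = 1 - failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
        "slo_met": sli >= slo_target,
    }
```

    For example, with a 99.9% objective and one million requests, about 1,000 failures are allowed. If 400 have occurred, roughly 60% of the budget remains, which leaves room to ship changes. A negative remainder is a signal to prioritize reliability work.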

    Conclusion

    To summarize what you’ve just read, Squadcast is the only integrated platform that unites On-Call Alerting and Incident Management along with SRE workflows under one roof. Be it setting up On-Call Schedules, leveraging Event Intelligence for Alert Suppression, or Automating Incident Response, we have it all covered.

    If these Incident Management workflows align with your needs, feel free to go ahead and Sign up for a 2-week free trial. If you want to know more about Squadcast, then you can schedule a call with our Sales team for a quick demo.

    Squadcast is an incident management tool that’s purpose-built for SRE. Get rid of unwanted alerts, receive relevant notifications and integrate with popular ChatOps tools. Work in collaboration using virtual incident war rooms and use automation to eliminate toil.

    Written By: Vardhan NS