
Traditional vs Modern Incident Response

Feb 24, 2022
Last Updated: May 2, 2024

    What is Incident Response?

    An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident response is an organization’s effort to detect, analyze and correct the damage caused by an incident. The term most often refers to security incidents, and it is sometimes used more or less interchangeably with incident management.

    However, an incident can be of any nature; it doesn’t have to be tied to security. For example:

    • Physical damage to hardware or systems (fire, flooding)
    • Human error (misconfigurations, accidental deletion of data)
    • Malicious actors (denial of service attacks, malware, ransomware)

    Every incident is different and may require a different response. Incident response consists of the steps an organization takes to address an incident, such as an outage, and restore services to normal operation, often in real time.

    A good incident response plan can help your company respond quickly and effectively when an outage occurs. Keep in mind that incident response is not just a technical function to be done by a specific team. Instead, it is more of a corporate process that involves all areas of the business.

    Traditional vs Modern Incident Response

    The biggest change in the world of incident response was the widespread adoption of automation.

    Traditionally, incident response was a highly manual process. Everything from creating a ticket to patching a server required human interaction. It was effective until the world experienced the internet boom.

    Easy internet access has certainly opened up opportunities for people and businesses alike. According to IDC, 60% or more of organizations have increased their technology spending to embrace the digital future.

    The rise in use of digital platforms has resulted in complex infrastructures with multiple application dependencies. Hence, downtime and system failures for even a few minutes can incur huge monetary losses (in some cases, even millions).

    To avoid such losses, organizations have resorted to keeping teams on call 24/7. This puts a lot of pressure on incident response teams, who are required to manually monitor systems and keep track of alerts while trying to avoid fatigue. Automating some or most of the incident response process removes repetitive work and helps response teams be more effective with less effort.

    That's not to say that people are no longer involved with incident response. People are still involved in triage, troubleshooting, and postmortem analysis. It's just that those tasks are much less frequent than they were before automation became the norm.

    Incident Response used to be about reacting to what happened with a solution for an immediate ‘bleed stop’. Nowadays, it is more about being proactive and trying to prevent incidents altogether by understanding and gaining intelligence about why something has happened.

    Incident response and management have become more of a DevOps-based activity, where operational issues are addressed through code and automation rather than manual intervention.

    Responding to an Incident

    In the SRE (Site Reliability Engineering) realm, incident response can be divided into the following steps:

    1. Detect
    2. Respond
    3. Resolution and Recovery
    4. Postmortems

    Let’s expand on those and understand how incidents were responded to in the past, and how they are now.

    Detect

    This step is where you detect an issue or determine if there has been a breach. A breach or incident could originate from different sources.

    Traditional: The primary source of detection would most likely be calls or emails from the impacted users. Monitoring and alerting tools weren’t as ubiquitous as they are today.

    Modern: An issue will usually be caught through monitoring and alerting on metrics, or occasionally by people noticing something strange while they're doing their work. With alerting tools and the right on-call schedules in place, it is easier to detect such issues so they can be dealt with through due process.
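
    To make "alerting on metrics" concrete, here is a minimal sketch of a threshold alert over a sliding window of request outcomes. The class name, window size, and threshold are illustrative assumptions, not part of any particular monitoring tool; real alerting systems add deduplication, routing, and escalation on top of a check like this.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical check: fires when the error rate over the last
    `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.window = window
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # True means the request failed

    def record(self, failed: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self.samples.append(failed)
        if len(self.samples) < self.window:
            return False  # not enough data yet to judge
        error_rate = sum(self.samples) / len(self.samples)
        return error_rate > self.threshold


# Usage: ten healthy requests followed by three failures.
alert = ErrorRateAlert(window=10, threshold=0.2)
outcomes = [False] * 10 + [True] * 3
fired = [alert.record(failed) for failed in outcomes]
```

    In this run the alert stays quiet until the third failure pushes the windowed error rate above the threshold, which is the point at which an on-call engineer would be notified.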

    Respond

    This is the step where you analyze the issue at hand and take a call on whether to contain the damage or terminate the concerned services.

    Traditional: The limitations of technology made it difficult to connect globally. Cross-functional, co-located teams would come together to figure out the issue. This often forced people to drop the work at hand and focus on solving the issue, and the constant context switching hit developers hardest.

    Modern: Modern-day teams analyze metrics and logs to determine how bad the outage is before responding further. Is it a brief spike in errors? Are a few nodes going offline? Or is it a full-on service disruption? This is also where colleagues from other teams join in to help. ChatOps tools such as Slack and Microsoft Teams make this collaboration effective and keep the right people connected, even across the globe if needed.
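
    The triage questions above can be sketched as a small classification function. The field names and cut-off values below are assumptions chosen purely for illustration; each team would tune them to its own services.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of current metrics for one service."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    nodes_total: int
    nodes_offline: int
    spike_duration_s: int  # how long the error rate has been elevated


def classify(s: Snapshot) -> str:
    """Map a metrics snapshot to a rough severity label (illustrative cut-offs)."""
    offline_fraction = s.nodes_offline / s.nodes_total if s.nodes_total else 1.0
    if offline_fraction >= 0.5 or s.error_rate >= 0.5:
        return "full disruption"
    if s.nodes_offline > 0:
        return "partial degradation"
    if s.error_rate > 0.01 and s.spike_duration_s < 60:
        return "brief spike"
    if s.error_rate > 0.01:
        return "sustained errors"
    return "healthy"


brief = classify(Snapshot(error_rate=0.03, nodes_total=10, nodes_offline=0, spike_duration_s=20))
partial = classify(Snapshot(error_rate=0.05, nodes_total=10, nodes_offline=2, spike_duration_s=300))
full = classify(Snapshot(error_rate=0.9, nodes_total=10, nodes_offline=8, spike_duration_s=600))
```

    A label like this is only a starting point for the responder's decision to contain the damage or take the service down; it does not replace human judgment.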

    Resolution and Recovery

    Once you've analyzed and pinpointed the root cause, you need to resolve the issue and ensure the system has recovered, with the affected systems and devices up and running again.

    Traditional: The process was unstructured. There was a lack of coordination, which led to support staff tripping over one another and duplicating effort. The aim of recovery was to get the system up and running, and nothing much followed. Getting to the root cause was rarely an objective until the same issue occurred repeatedly.

    This changed with time as processes were put into place. But lack of automation meant that the on-call schedules were still not very efficient and there was a lot of manual work.

    Modern: These days, various tools and techniques are used to deal with issues. The decision is based on the issue that is being dealt with and the team's capabilities. For example, if you're experiencing network issues and your team has access to network engineering resources, they may be able to resolve the issue quickly by adjusting settings on routers or switches.

    Recovery is usually coordinated by the on-call incident handler, who is responsible for implementing a solution and making sure it holds. The SRE team then follows up with the manager to make sure the fix works as intended and, if necessary, to mitigate any damage caused by the outage. Another goal is to prevent such incidents from happening again.

    Postmortems

    A postmortem is written after the issue is resolved, and everything has calmed down. Once the postmortem write-up is ready, a meeting occurs and is led by an SRE manager or incident handler who distributes the postmortem notes to relevant parties within the organization.

    The goal of this meeting is to review what happened during the outage, why it happened, what was done to stop it, and how it could be prevented in the future. The postmortem then becomes part of an organization's operational history, allowing teams to learn from past mistakes and improve their overall reliability going forward.

    Traditional: Traditional postmortems were either internal reports that were never seen outside the company or formal reports submitted to external auditors.

    Both formats made it difficult to share detailed information about what happened and why it happened. Traditional postmortems were typically tactical documents that focused on how IT personnel responded to an incident.

    Modern: Postmortems are an established part of modern incident response and are generally written as after-the-fact documentation.

    Modern digital postmortems are more inclusive of all teams involved, including the stakeholders, and should be viewed as strategic documents that focus on lessons learned by the entire organization.

    They can be used for training purposes since they document case studies from completed investigations. They allow you to:

    • Analyze past issues
    • Find trends and make predictions about future risks
    • Learn from mistakes
    • Prevent a recurrence.
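
    As a small illustration of finding trends in past postmortems, the sketch below counts recurring root causes. The records and field names are made-up sample data, not taken from any real incident history, and a real analysis would draw on whatever your postmortem tool exports.

```python
from collections import Counter

# Hypothetical postmortem records; IDs, causes, and durations are invented.
postmortems = [
    {"id": "PM-1", "cause": "misconfiguration", "minutes_down": 42},
    {"id": "PM-2", "cause": "expired certificate", "minutes_down": 15},
    {"id": "PM-3", "cause": "misconfiguration", "minutes_down": 8},
    {"id": "PM-4", "cause": "disk full", "minutes_down": 30},
    {"id": "PM-5", "cause": "expired certificate", "minutes_down": 20},
    {"id": "PM-6", "cause": "misconfiguration", "minutes_down": 5},
]


def recurring_causes(records, min_count=2):
    """Return (cause, count) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(r["cause"] for r in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


trends = recurring_causes(postmortems)
```

    Causes that keep reappearing in a list like this are natural candidates for automation or for a follow-up action item.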

    An excellent example of a postmortem template, and what should be included, can be found in the first SRE book by Google. Also, do check out our blog on Postmortems.

    This brings us to the end of this blog. We have successfully explored what incident response is and how it has evolved with time.

    Written By: Kristijan Mitevski
      Copyright © Squadcast Inc. 2017-2024