Incident Response Guide

April 17, 2023

    Site reliability engineering (SRE) is a critical discipline that focuses on ensuring the continuous availability and performance of modern systems and applications. One of the most vital aspects of SRE is incident response, a structured process for identifying, assessing, and resolving system incidents that can lead to downtime, revenue loss, and brand reputation damage. 

    In this article, we discuss the importance of incident response, examining the key elements of triaging and troubleshooting and offering real-world examples to demonstrate their practical applications. We will then use our insights to create an ideal incident response plan that can be utilized by teams of all sizes to effectively manage and mitigate system incidents, ensuring the highest levels of service reliability and user satisfaction.

    Summary of incident response planning

    The table below summarizes the steps involved in incident response planning that we will explore in this article.

    Phase 1: Create an incident response plan

    • Analyze: Take stock of your current systems and environments. Which users make use of which systems, what are the bottlenecks, and what are the failure points?
    • Prepare: Calculate the impact and severity of various failures or outages. Do you have the necessary resources to respond to incidents and quickly bring systems back online?
    • Simulate scenarios: Make plans to inform customers and necessary stakeholders (e.g., legal and compliance) if failures occur. What else will happen while informing customers (e.g., engaging the production support team, setting up NOC calls, and potentially engaging vendor support teams)?
    • Learn from retrospective meetings: Compile learnings from dry runs. Did everyone have everything they needed? What was missing (communication, automation, or tools)?

    Phase 2: Finalize the incident response plan

    • Detect: Use modern detection tools. Ensure that there is a scheduled support team that will be notified of incidents and can quickly respond.
    • Triage: Observe the issues at hand. Involve colleagues and vendors who could help resolve issues as necessary. Notify consumers and stakeholders of issues as soon as possible.
    • Resolve and recover: Resolve the issue and ensure that the fix works correctly and the impact is mitigated. Notify consumers and stakeholders of the resolution and the total impact.
    • Contribute to the runbook: Investigate, automate, and (if possible) add the solution to the runbook. This way, if the issue reoccurs, it can be resolved automatically.
    • Conduct a postmortem: Engage all colleagues involved in triaging and resolving the issue. Identify ways to prevent the issue from occurring again. Document what happened and how to prevent it in the future.
    • Collect metrics: Ensure that metrics and reports are stored to measure historical incident response effectiveness. Review them periodically to track progress and identify automation potential or bottlenecks.

    Triage

    The first step in the incident response process is triaging, which involves determining the severity and scope of an incident. This phase aims to assess the situation and prioritize resources to address the problem. Triaging consists of three primary activities: detection, assessment, and prioritization. We will walk through two scenarios to illustrate this process.

    Detection 

    Monitoring tools and alerting systems can identify potential incidents using predefined criteria, such as error rates or response times, with anomaly detection and threshold-based alerts being common methods. On-call personnel may also be notified when customers report issues. In cases where the system does not generate any obvious symptoms or alerts, on-call personnel must conduct a deeper investigation based on user-reported issues.

    Scenario 1: Users are complaining that their orders are not getting filled when placing buy or sell orders on a stock exchange platform. This is a critical incident without any obvious alert.

    Scenario 2: A monitoring tool detects a sudden spike in error rates for a microservice, triggering an alert for the SRE team.
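    The threshold-based detection described above can be sketched in a few lines of Python. The 5% cutoff and the request-count inputs below are illustrative assumptions, not values from any particular monitoring tool:

```python
# Minimal sketch of threshold-based alerting: flag a window of request
# outcomes when its error rate exceeds a configured threshold.
# All names and thresholds are illustrative, not tied to any real tool.

def error_rate(successes: int, failures: int) -> float:
    """Fraction of failed requests in a window; 0.0 for an empty window."""
    total = successes + failures
    return failures / total if total else 0.0

def should_alert(successes: int, failures: int, threshold: float = 0.05) -> bool:
    """Trigger an alert when the window's error rate crosses the threshold."""
    return error_rate(successes, failures) > threshold

# 970 successes and 30 failures is a 3% error rate -> no alert
print(should_alert(970, 30))   # False
# 900 successes and 100 failures is a 10% error rate -> alert
print(should_alert(900, 100))  # True
```

    Real systems typically also require the condition to hold for some duration (e.g., Prometheus alert rules with a "for" clause) to avoid paging on momentary blips.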

    Assessment 

    Once an incident is detected, the SRE team must assess its impact on the system and its users. This involves understanding the root cause, affected components, and scope of the issue.

    Scenario 1: The customer support team notices many complaints about unfilled orders. The SRE team reviews the Service Catalog and identifies the service responsible for order routing. 

    Scenario 2: The SRE team identifies that the error rate spike is due to a recent deployment that introduced a bug in a specific service, affecting only a subset of users. 

    Prioritization 

    After assessing the incident, the SRE team assigns a severity level based on its impact and urgency. The severity level helps prioritize resources and determine the response time. If possible, the status page should be updated to show which services are impacted and the estimated resolution time. 

    The table below shows some sample severity levels for incidents.

    • SEV-0 (Critical): A complete system outage or massive data loss situation affecting all users
    • SEV-1 (High): Significant degradation of the system or an unusable major feature affecting a large number of users
    • SEV-2 (Moderate): Partial degradation of the system or a minor unusable feature affecting some users
    • SEV-3 (Low): A small issue that does not significantly impact the user experience or system performance

    Scenario 1: Since this is a widespread incident, it is classified as SEV-0.

    Scenario 2: The SRE team classifies the incident as a SEV-2 issue because it affects a small number of users but is not a complete service outage.
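    As a rough sketch, the severity table above can be encoded as a classification function. The affected-user cutoffs below are hypothetical; real teams tune them to their own SLOs:

```python
# Hypothetical mapping from incident impact to the SEV levels described
# above. The scope inputs and cutoffs are illustrative assumptions.

def classify_severity(affected_fraction: float, full_outage: bool) -> str:
    """Assign a severity level from the share of affected users."""
    if full_outage:
        return "SEV-0"          # complete outage: always critical
    if affected_fraction >= 0.5:
        return "SEV-1"          # a large share of users affected
    if affected_fraction >= 0.05:
        return "SEV-2"          # some users affected
    return "SEV-3"              # minimal user impact

# Scenario 1: platform-wide order failures
print(classify_severity(1.0, full_outage=True))   # SEV-0
# Scenario 2: a bug affecting a small subset of users
print(classify_severity(0.1, full_outage=False))  # SEV-2
```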

    Additional example 

    Let’s look at one more scenario in detail. In this example, a web application experiences increased latency and sporadic errors due to an issue connecting to the Redis cache.

    System overview

    • The application uses a three-tier architecture, with a front-end web server, a back-end application server, and a database server.
    • The back-end server is built using Python and Flask and relies on Redis for caching.
    • The system runs on AWS EC2 instances, with an Elastic Load Balancer (ELB) distributing traffic among the instances.
    • Monitoring and alerting are handled by Prometheus, Alertmanager, and Grafana.
    Figure: Grafana dashboard for Redis

    Triage

    1. A Prometheus alert notifies the SRE team of increased latency and error rates in the web application.
    2. The team checks the Grafana dashboard and confirms that the issue has persisted over the past 20 minutes.
    3. The SRE team identifies the affected components, including the back-end application server and Redis cache, and assesses the potential impact on users. 

    The next course of action is to troubleshoot and investigate the root cause, eventually fixing the issue as described in the next section of this article.


    Troubleshoot

    Once an incident has been triaged, the next step is engaging in troubleshooting to identify the root cause and find a solution. Troubleshooting typically involves data collection, hypothesis generation, and testing and validation.

    Data collection 

    The SRE team gathers relevant data, such as logs, metrics, and traces, to understand the issue better. This may involve querying monitoring tools, checking application logs, or using distributed tracing to identify problematic requests.

    Scenario 1: The SRE team determines that there has been a spike in RAM usage on one of the production bare metal machines that are used for high-performance computing. 

    Scenario 2: The SRE team queries monitoring tools to collect data on the error rates, response times, and resource utilization of the affected microservice.
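    The log-gathering step can be illustrated with a small tally over application log lines, which quickly shows which error type dominates. The log format and error names here are invented for the example:

```python
# Sketch of the data-collection step: scan application log lines and
# tally error types to see which component dominates. The log format
# ("<timestamp> ERROR <type> <detail>") is an assumption for illustration.
from collections import Counter

def tally_errors(log_lines):
    """Count occurrences of each 'ERROR <type>' message."""
    counts = Counter()
    for line in log_lines:
        if " ERROR " in line:
            # take the first token after "ERROR" as the error type
            counts[line.split(" ERROR ")[1].split()[0]] += 1
    return counts

logs = [
    "2023-04-17T10:00:01 ERROR RedisTimeout connect failed",
    "2023-04-17T10:00:02 INFO request served",
    "2023-04-17T10:00:03 ERROR RedisTimeout connect failed",
    "2023-04-17T10:00:04 ERROR HTTP500 upstream crash",
]
print(tally_errors(logs).most_common(1))  # [('RedisTimeout', 2)]
```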

    Hypothesis generation 

    Based on the collected data, the team formulates a hypothesis about the root cause of the incident. This step may require collaboration with other teams, such as developers, service owners, or network engineers, depending on the nature of the issue.

    Scenario 1: Upon further investigation by the SRE and development teams, it is hypothesized that the application queue size keeps growing, causing the spike in RAM usage. 

    Scenario 2: The SRE team hypothesizes that a bug in the deployment is causing the service to consume excessive resources, leading to slow response times and increased error rates.

    Testing and validation 

    The team tests its hypotheses by running experiments or modifying system configurations to validate or refute its theories. This step often involves iterative cycles of testing and refining hypotheses until the root cause is found.

    Scenario 1: The growth in queue size is confirmed using the logs. Since this troubleshooting effort involves multiple teams, the application owner suggests investigating the worker processes. If one or more worker processes is in a bad state (hung or zombied), the recommendation is to restart or force-kill the process (e.g., kill -6 <pid>, which sends SIGABRT and produces a core dump for later analysis) so that the application starts a new worker process and continues to pick up items from the queue, reducing the queue size.
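    The worker-process remediation from Scenario 1 might be sketched as follows. The worker records and idle-time cutoff are hypothetical; a real implementation would read process state from the application or the OS, and the dry-run default guards against signaling the wrong process:

```python
# Sketch of the Scenario 1 remediation: select workers idle past a cutoff
# and signal them so the application respawns fresh workers. The worker
# records and the 300-second cutoff are illustrative assumptions.
import os
import signal

def select_stuck_workers(workers, max_idle_seconds=300):
    """Return PIDs of workers idle longer than the cutoff."""
    return [w["pid"] for w in workers if w["idle_seconds"] > max_idle_seconds]

def abort_workers(pids, dry_run=True):
    """Send SIGABRT (kill -6) to each PID; dry_run only reports."""
    for pid in pids:
        if dry_run:
            print(f"would send SIGABRT to {pid}")
        else:
            os.kill(pid, signal.SIGABRT)

workers = [
    {"pid": 101, "idle_seconds": 12},
    {"pid": 102, "idle_seconds": 900},  # hung worker
]
abort_workers(select_stuck_workers(workers))  # would send SIGABRT to 102
```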

    Scenario 2: The SRE team temporarily rolls back the deployment to the previous version, observing decreased error rates and resource consumption, thus validating the hypothesis.

    Additional example

    We will now describe the troubleshooting process for the additional example in the previous section. 

    Troubleshooting

    1. The team examines the logs of the back-end application server and notices intermittent "Redis connection error" messages. They use AWS CloudWatch Logs Insights to search for and analyze these logs.
    # Example error message in the application logs
    
    RedisError: Error 110 connecting to <redis_host>:<redis_port>. Connection timed out.
    
    2. The SRE team checks the four golden signals (latency, traffic, errors, and saturation) for the back-end application server and Redis cache in Grafana. They find that the Redis cache is experiencing high latency.
    3. The team inspects the health status and metrics of the AWS resources, including the EC2 instances and the ELB, using the AWS Management Console. No issues are found.
    4. The team checks the Redis cache configuration and discovers that the connection pool size is set too low, causing the application to exhaust the available connections under high load.
    
    # Example Redis configuration in the Python application (before troubleshooting)
    import redis

    redis_pool = redis.ConnectionPool(host='redis_host', port=redis_port, db=0, max_connections=5)
    redis_cache = redis.Redis(connection_pool=redis_pool)

    Resolution

    1. The team increases the Redis connection pool size to a more appropriate value, reducing the likelihood of exhausting the available connections.
    # Example Redis configuration in the Python application (after troubleshooting)
    import redis

    redis_pool = redis.ConnectionPool(host='redis_host', port=redis_port, db=0, max_connections=50)
    redis_cache = redis.Redis(connection_pool=redis_pool)
    2. The updated back-end application server is deployed, and latency and error rates return to normal.
    3. The SRE team verifies that the issue is resolved by checking the Grafana dashboards and monitoring the golden signals for the back-end application server and the Redis cache.
    4. The team communicates the resolution to stakeholders and conducts a postmortem to learn from the incident and improve the incident response process.
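    Beyond resizing the pool, a common companion mitigation is retrying transient cache errors with exponential backoff so that brief connection failures degrade gracefully. This generic sketch is not from the incident above; production code would catch the Redis client's specific connection error rather than a broad Exception:

```python
# Generic retry-with-backoff sketch for transient errors, a common
# companion to sizing the connection pool correctly. The exception type
# and delays are illustrative; real code would catch redis.ConnectionError.
import time

def with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn, retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky():
    """Simulated cache call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "cached-value"

print(with_retries(flaky, sleep=lambda _: None))  # cached-value
```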

    Ideal incident response plan

    Incorporating the following detailed steps and examples into your incident response plan can help you create a robust and effective framework to guide your SRE team through the entire process, from incident identification to resolution and continuous improvement:

    1. Incident identification and classification: Define clear criteria for identifying incidents and classifying their severity. This should include performance-related factors, such as latency and error rates, as well as system availability and network incidents. Establish thresholds for triggering alerts and escalating incidents.
    2. Communication and escalation protocols: Develop clear communication channels and escalation protocols for the SRE team and other stakeholders. This may include using tools like Slack, email, or dedicated incident management platforms. Define the communication expectations for each stage of an incident, including regular updates on the incident status, resolution progress, and any changes in severity.
    3. Roles and responsibilities: Assign specific roles and responsibilities to team members for each stage of the incident response process. This may include incident commanders who oversee the response effort, subject matter experts who provide technical expertise, and communication coordinators who keep stakeholders informed. Ensure that all team members understand their roles and are trained to execute them effectively. The Service Catalog should be used to map services to the teams and individuals responsible for them.
    4. Incident response procedures: Document step-by-step procedures for responding to different types of incidents, such as triaging, troubleshooting, and recovery. These procedures should be tailored to your organization's specific infrastructure, technologies, and services. Include checklists and flowcharts to guide team members through the response process.
    5. Incident response tools and resources: Provide the SRE team with the necessary tools and resources to respond to incidents efficiently. This may include monitoring and alerting tools, Service Catalog, log aggregation, analysis platforms, and access to documentation and runbooks.
    6. Training and simulations: Although often overlooked, regular training on incident response plans, procedures, tools, and resources is crucial for SRE teams to minimize mean time to resolution (MTTR). While incidents may not follow a pattern, investing time and budget in training will equip SREs to triage and resolve issues, ensuring swift and efficient service restoration.
    7. Recovery and post-incident review: Establish procedures for recovering from incidents and conducting post-incident reviews, which can take place in postmortem meetings. These reviews should identify the root causes of the incident, assess the effectiveness of the response effort, and determine any improvements needed in the incident response plan or infrastructure. Most importantly, they should answer a key question: "Could this incident happen again?" Runbooks and the automation arsenal should be updated based on the answer. 
    8. Continuous improvement: Periodically review and update the incident response plan to ensure that it remains effective and aligned with your organization's changing needs and technologies. Incorporate lessons learned from real incidents and simulated exercises to improve the plan and procedures.

    Conclusion

    A comprehensive and effective incident response plan is essential for site reliability engineering teams to quickly and effectively address system incidents, minimizing downtime and potential losses. 

    This article discussed the importance of incident response and provided real-world examples to illustrate the process of triaging and troubleshooting. By following the steps and guidelines outlined in this article, SRE teams can develop a robust incident response plan that is tailored to their specific infrastructure and technology stacks, ensuring swift detection, assessment, prioritization, and resolution of incidents. 

    Regular training, continuous improvement, and post-incident reviews will help maintain the plan's effectiveness and keep it up to date with the organization's evolving needs. 

    By investing in a well-structured incident response plan, organizations can better safeguard their users, revenue, and brand reputation from the adverse effects of system outages and failures. Squadcast has a plethora of features that help with all the tenets mentioned in the article, including Service Catalog, Runbook Automation, Incident Analytics and Reliability Insights, Retrospectives, and Status Page.
