Best Practices in Incident Management

May 7, 2020

This post from Squadcast outlines a few best practices in incident management for restoring services during unplanned downtime.

    In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime.

    Incident Management processes are typically used by SRE, DevOps, NOC and other IT teams to respond to incidents that affect services and to restore their uptime. Teams that follow ITIL and ITSM practices have similar processes in place, with slightly different terminology.

    For the purpose of categorizing the different aspects of Incident Management, we can go over the different stages of an Incident Lifecycle.

    [Figure: Stages of an Incident Lifecycle]
    Note: This does not indicate the various inter-dependencies of each stage in the cycle and is only depicted in a simple format to provide a holistic overview of the Incident Lifecycle.

    What is IT incident management?

    Incident management is the process of managing an event that disrupts the normal function of a system, network, or process. Incidents can be caused by hardware or software problems and can result from a single event or a series of events. While the process varies with the size of the organization, most organizations handle incidents through a series of processes that share one goal: to identify the root cause of the incident and take corrective action.

    An organization’s Incident Management process is meant to tie these stages together seamlessly and cover the entire lifecycle of the incident, from incident trigger to post-incident reviews and postmortems. It is also important to note that these practices are meant to be dynamic and to evolve constantly with the people, systems, and architectures involved.

    This post outlines some best practices to keep in mind while implementing or improving your processes.

    Incident Detection & Classification: 

    • The initial details you receive about an incident while on-call save a lot of time in the triage and mitigation process. Configuring the right data fields and Event Tags to automate this level of classification is a must.
    • Set up Deduplication rules to group all similar alerts together. This also ensures that your on-call team is not notified for the same incident repeatedly.
    • In the details field, send only the vital information that can assist in remediation.
    • Make sure you add any other important data manually as a part of the classification process after the team has been alerted.
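    The deduplication idea above can be illustrated with a minimal sketch. It is not how Squadcast or any particular tool implements Deduplication rules; the alert fields (`service`, `name`, `time`, `tags`), the `Incident` class, and the 10-minute window are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    key: tuple            # hypothetical dedup key: (service, alert name)
    first_seen: datetime
    last_seen: datetime
    count: int = 1        # number of alerts folded into this incident
    tags: dict = field(default_factory=dict)


def deduplicate(alerts, window=timedelta(minutes=10)):
    """Fold alerts that share a key into one incident when each arrives
    within `window` of the previous alert with the same key."""
    latest = {}       # key -> most recent incident for that key
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["service"], alert["name"])
        current = latest.get(key)
        if current is not None and alert["time"] - current.last_seen <= window:
            current.count += 1
            current.last_seen = alert["time"]
        else:
            current = Incident(key, alert["time"], alert["time"],
                               tags=dict(alert.get("tags", {})))
            latest[key] = current
            incidents.append(current)
    return incidents


t0 = datetime(2020, 5, 7, 12, 0)
alerts = [
    {"service": "checkout", "name": "HighLatency", "time": t0, "tags": {"env": "prod"}},
    {"service": "checkout", "name": "HighLatency", "time": t0 + timedelta(minutes=2)},
    {"service": "checkout", "name": "HighLatency", "time": t0 + timedelta(minutes=30)},
    {"service": "search", "name": "DiskFull", "time": t0 + timedelta(minutes=1)},
]
incidents = deduplicate(alerts)
for incident in incidents:
    print(incident.key, incident.count)
```

    Here the first two `checkout` alerts collapse into one incident, so the on-call engineer would be notified once rather than twice, while the alert 30 minutes later opens a new incident.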

    Incident Alerting: 

    • Make sure to send alerts only for relevant and actionable events even if other events are also being sent into your incident management tool.
    • Configure Deduplication and Suppression Rules so that you are not notified for unimportant alerts. Otherwise, the noise can cause severe alert fatigue and hurt your team’s response times and productivity.
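    As a rough sketch of the alerting guidance above, the function below notifies only for events that are marked actionable and that no suppression rule matches. The rule format (a dictionary of field values that must all match) and the `actionable` flag are assumptions made for illustration, not the configuration syntax of any specific tool.

```python
def should_notify(event, suppression_rules):
    """Return True only if the event is actionable and no suppression rule matches it."""
    if not event.get("actionable", False):
        return False
    for rule in suppression_rules:
        # A rule matches when every field it names has the same value in the event.
        if all(event.get(name) == value for name, value in rule.items()):
            return False
    return True


# Hypothetical rules: silence everything from staging, and informational
# events from a batch reporting service.
suppression_rules = [
    {"env": "staging"},
    {"service": "batch-reports", "severity": "info"},
]

events = [
    {"service": "checkout", "env": "prod", "severity": "critical", "actionable": True},
    {"service": "checkout", "env": "staging", "severity": "critical", "actionable": True},
    {"service": "batch-reports", "env": "prod", "severity": "info", "actionable": True},
    {"service": "search", "env": "prod", "severity": "warning", "actionable": False},
]

to_page = [e for e in events if should_notify(e, suppression_rules)]
print([e["service"] for e in to_page])
```

    All four events can still be stored in the incident management tool; only the first one would page the on-call engineer.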

    Incident Prioritization: 

    • A crucial form of incident classification is prioritization. This helps the on-call team understand the severity of the issue at first glance. Configure automation to assign priority to every incident routed to your alerting tool.
    • The prioritization matrix followed by an organization should always be linked to service and customer impact. This gives the on-call team the clarity needed to understand the situation.
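    One way to picture a prioritization matrix tied to service and customer impact is the small lookup table below. The three impact levels, the P1 to P4 labels, and the cell values are assumptions for the sake of the example; each organization would define its own matrix.

```python
LEVELS = ("low", "medium", "high")

# Rows: service impact; columns: customer impact (both ordered low -> high).
# The cell values are illustrative, not a recommended standard.
PRIORITY_MATRIX = [
    ["P4", "P3", "P2"],   # low service impact
    ["P3", "P2", "P1"],   # medium service impact
    ["P2", "P1", "P1"],   # high service impact
]


def assign_priority(service_impact, customer_impact):
    """Look up the priority for a pair of impact levels.

    Raises ValueError if either level is not one of LEVELS.
    """
    row = LEVELS.index(service_impact)
    col = LEVELS.index(customer_impact)
    return PRIORITY_MATRIX[row][col]


print(assign_priority("high", "medium"))
print(assign_priority("low", "low"))
```

    Keeping the matrix as data rather than nested conditionals makes it easy to review with stakeholders and to change without touching the lookup logic.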

    Triage and Collaboration: 

    • Configure your incident routing and escalation policies to always reach the right responder. Assign tags to indicate severity or priority and configure routing rules to ensure that the first responder is always the right responder.
    • In a high-severity situation, how easily your team can communicate can make or break customer perception and, ultimately, the impact on your bottom line. A collaboration space built into your incident management platform removes the time otherwise spent assembling elsewhere to discuss the incident.
    • If you use Slack for this, assign a dedicated channel for all incident-related discussion to help reduce MTTR.
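    The routing and escalation guidance above could be sketched as follows. The tag names, policy names, and escalation timings are hypothetical and chosen only to show the shape of the logic: the first matching routing rule picks a policy, and the policy decides who should have been notified after a given time without acknowledgement.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    after_minutes: int   # minutes without acknowledgement before this step fires
    notify: str          # who gets notified at this step


# Routing rules are checked in order; the first rule whose tags all match wins.
ROUTING_RULES = [
    ({"service": "payments", "priority": "P1"}, "payments-oncall"),
    ({"service": "payments"}, "payments-team"),
]
DEFAULT_POLICY = "platform-oncall"

ESCALATION_POLICIES = {
    "payments-oncall": [
        EscalationStep(0, "primary"),
        EscalationStep(10, "secondary"),
        EscalationStep(20, "engineering-manager"),
    ],
    "payments-team": [EscalationStep(0, "payments-channel")],
    "platform-oncall": [EscalationStep(0, "platform-primary")],
}


def route(tags):
    """Return the name of the escalation policy for an incident's tags."""
    for match, policy_name in ROUTING_RULES:
        if all(tags.get(key) == value for key, value in match.items()):
            return policy_name
    return DEFAULT_POLICY


def responders_due(policy_name, minutes_unacknowledged):
    """List everyone who should have been notified by now under the policy."""
    steps = ESCALATION_POLICIES[policy_name]
    return [s.notify for s in steps if s.after_minutes <= minutes_unacknowledged]


policy = route({"service": "payments", "priority": "P1"})
print(policy, responders_due(policy, 12))
```

    Listing the more specific rule first is what lets a P1 payments incident reach the on-call engineer directly instead of the general team policy.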

    Incident Communication: 

    • It’s important to keep both customers and customer-facing internal teams informed of all mitigation activities. This is easier when you automate communication updates and manage them from one place.
    • Add the relevant teams as stakeholders so they can see what is being done to mitigate an incident. You can also provide additional details on a private status page for internal teams.
    • Maintain a Public Status Page and update it constantly. The first thing a user does when facing service issues is look at your status page, so make sure it has all the essential information a user needs to understand the impact of the issue.

    Incident Resolution: 

    • Automate as much as possible. Connect your tools to take action directly from within the incident management platform itself. Little steps go a long way.
    • Document any attempts at resolution or mitigation, as soon as you have taken the steps. What you perceive to be a small problem might not be the case for someone else on your team.
    • Maintain a repository of Runbooks and RCAs / Incident Reviews for you and your team to go back and review resolution steps for similar incidents in the future.

    Incident Review & Remediation: 

    • Start with an auto-generated incident timeline which has a chronological list of everything that was recorded during a live incident.
    • Drive a collaborative Incident Review process, complete with Root Cause Analysis (RCA), to reach a fine-grained understanding of any incident as quickly as possible.
    • Always run an Incident Review process for Medium and High severity incidents. Remember to be Blameless. At the time of a crisis, it’s important to focus on the `What`, `Why`, `How` and `What Next` rather than the `Who`.
    • Maintain a checklist of tasks that have to be completed for longer-term remediation.
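    To make the idea of an auto-generated incident timeline concrete, here is a minimal sketch that sorts recorded events chronologically and renders one line per event. The event fields (`at`, `actor`, `action`) and the sample entries are invented for illustration; a real tool would record these automatically during the incident.

```python
from datetime import datetime


def build_timeline(events):
    """Return the events as text lines in chronological order."""
    lines = []
    for event in sorted(events, key=lambda e: e["at"]):
        lines.append("{} [{}] {}".format(
            event["at"].strftime("%H:%M"), event["actor"], event["action"]))
    return lines


# Events may be recorded out of order by different tools.
events = [
    {"at": datetime(2020, 5, 7, 12, 14), "actor": "alice", "action": "rolled back deploy"},
    {"at": datetime(2020, 5, 7, 12, 0), "actor": "monitoring", "action": "incident triggered"},
    {"at": datetime(2020, 5, 7, 12, 3), "actor": "alice", "action": "acknowledged"},
    {"at": datetime(2020, 5, 7, 12, 20), "actor": "alice", "action": "resolved"},
]

timeline = build_timeline(events)
print("\n".join(timeline))
```

    Note that the timeline focuses on what happened and when; in a blameless review, the actor field is there for context, not for assigning fault.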

    Ensuring that you learn from every incident should be the biggest takeaway from your Incident Response process.

    Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution, and create a knowledge base to effectively handle incidents.

    Written by: Prakya Vasudevan