Better Incident Response: Incident Classification & Setting Severities with Tags

When responding to an incident, the first thing you need to know is how it affects customers and how badly it can affect your team. Teams typically address this with some form of incident classification, usually “incident severity levels”, which indicate the importance of every incident: how seriously various stakeholders are affected and whether the incident should be routed differently. Note that incident classification is not used to determine the root cause or find the resolution.

Implementing an incident classification step in your incident management software and process can significantly reduce both the MTTR and the stress involved in the first few minutes of an incident.

How to implement incident classification?

Apart from setting up on-call schedules and adopting best practices for handling various kinds of incidents, incident management is also about constantly refining processes and benchmarks to ultimately achieve higher system reliability. One way to refine processes is to use an incident classification scheme such as incident severities.

Every team has its own way of defining severities, and those definitions usually evolve from a basic classification framework. A common starting point is the SEV-1 to SEV-5 scale, outlined below:

  • SEV-1 incidents are critical and have a very large impact on the customer experience. These are typically major incidents that cause outages and hinder product or service usability for a large percentage of customers.
  • SEV-2 incidents are also critical but less severe than SEV-1 incidents. Incidents that impede product usage but affect a smaller percentage of customers fall under SEV-2.
  • SEV-3 incidents may be minor now but can have a significant impact if not addressed immediately. These may involve degradation of product stability without impacting product usage right away.
  • SEV-4 incidents are minor incidents indicating that the product is not performing to the required standard, without necessarily impacting product usability.
  • SEV-5 incidents are minor bugs that need to be fixed but don’t affect product usability.
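
As a rough illustration, the scale above could be modelled in Python as an ordered enum. This is only a sketch: the names `Severity`, `is_critical` and `most_severe` are made up for this example and are not part of any tool's API.

```python
from enum import IntEnum


class Severity(IntEnum):
    """SEV-1 (most severe) through SEV-5 (least severe)."""
    SEV1 = 1  # critical: outage affecting a large percentage of customers
    SEV2 = 2  # critical: usage impeded for a smaller percentage of customers
    SEV3 = 3  # minor for now, but significant if not addressed immediately
    SEV4 = 4  # minor: below the required standard, usability not necessarily hit
    SEV5 = 5  # minor bug: needs fixing, usability unaffected


def is_critical(severity: Severity) -> bool:
    """SEV-1 and SEV-2 are the two levels described above as critical."""
    return severity <= Severity.SEV2


def most_severe(severities: list[Severity]) -> Severity:
    """A lower number means a higher severity, so min() picks the worst one."""
    return min(severities)
```

Using an integer-backed enum keeps the levels comparable, so questions like "is this at least as bad as SEV-2?" become a plain comparison.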

However, severity alone may not capture other factors, such as how urgently the incident needs to be resolved or how it can affect other parts of the system. Some incident management tools attempt to solve this by adding other forms of classification, like incident urgency and incident criticality. Many solutions allow incident severity as the only form of classification, and in some cases severity is set manually rather than assigned automatically based on the incoming alert context.

There’s a clear opportunity to improve incident response processes with better incident classification. If implemented the right way, this can bring down MTTR significantly, reduce the toil involved in routing manually, and add more context to an incident during the primary analysis.

At Squadcast, we chose to add more flexibility to this process by creating a custom rule-based auto-tagging system instead of offering just a dropdown to manually select or assign tags. We define tags as key-value pairs: for example, the key could be severity with possible values SEV0, SEV1, SEV2, etc., or the key could be Team with possible values Backend, Frontend, Database, etc. With the Tagging and Routing features in Squadcast, you can set pretty much any kind of custom tag, which is assigned automatically based on the rules you define on top of the attributes passed in the incident payload. You can then use these tags to set routing rules, ensuring that the right responder is notified at the right time to bring down the resolution time.
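
To make the idea of key-value tags driven by payload rules more concrete, here is a minimal Python sketch. It is not Squadcast's implementation: `TagRule`, `auto_tag`, the payload fields `service` and `customers_affected_pct`, and the "later rules win" behaviour are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Payload = dict[str, Any]
Tags = dict[str, str]


@dataclass
class TagRule:
    """A condition on the payload plus the tags to add when it matches."""
    matches: Callable[[Payload], bool]
    tags: Tags


def auto_tag(payload: Payload, rules: list[TagRule]) -> Tags:
    """Collect tags from every matching rule; later rules win on key clashes."""
    assigned: Tags = {}
    for rule in rules:
        if rule.matches(payload):
            assigned.update(rule.tags)
    return assigned


# Hypothetical rules: the payload fields used here are invented for the example.
rules = [
    TagRule(lambda p: p.get("service") == "postgres", {"Team": "Database"}),
    TagRule(lambda p: p.get("customers_affected_pct", 0) >= 50, {"severity": "SEV1"}),
]

tags = auto_tag({"service": "postgres", "customers_affected_pct": 80}, rules)
print(tags)
```

Keeping tags as a plain dictionary means any downstream step, such as routing, only needs to look up a key rather than re-inspect the raw payload.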

In Part 2 of the Kevin Series, we illustrate how to use tags to set severities in Squadcast. We have more use-case based articles lined up to show you other ways to implement incident classification using tags - stay tuned!

P.S. In case you were wondering, Kevin has also previously set up his own alert deduplication rules to reduce alert noise in Squadcast.

Severities and Auto-Routing with Incident Tags

It's a warm afternoon on February 13th and Kevin is lazily dreaming about how his date is going to pan out the next day. His dream is suddenly disrupted by a torrent of database incidents pouring in. What's more annoying is that most of them are not particularly critical or even related to the class of issues he generally handles.

Kevin’s got a new ringtone for incidents: Love Me Do, in keeping with the Valentine spirit.

Also, he works with Kai, who is expected to handle all the low-severity incidents and typically everything that comes in with regard to query optimisation.

Kevin realises that he could be spending his time more effectively by:

  • Classifying his incidents by assigning the type or class of incidents that they fit into
  • Assigning severity to get to critical incidents faster
  • Automatically routing incidents based on tags to ensure that the right responder is alerted

This would allow more time for Kevin’s daydreaming!

Since they work in a relatively small company where on-call rotations are rather erratic, or handled by both of them when fires happen, he decides to make this process a whole lot better simply by routing more efficiently.

Plus, anticipating the same barrage of incidents while he’s on his date tomorrow, he decides to take matters into his own hands. He sees that the database incident is a query optimisation issue, and not even a severe one at that, judging by the visited_returned_ratio value in the payload.

	
    {
      "payload": {
        "id": 23,
        "issue": "SLOW_QUERY_PERF",
        "metric": {
          "visited_returned_ratio": 1300.2334,
          "time_interval": 10
        },
        "summary": "Slow query performance",
        "cluster_name": "cluster-prod-0-awsumdb",
        "cluster_id": 9,
        "hostnames": [
          "rpl0-awsumdb.cluster-prod-0-awsumdb.db.com",
          "rpl2-awsumdb.cluster-prod-0-awsumdb.db.com"
        ],
        "link": "",
        "created": "2020-02-13T13:00:00.116Z",
        "status": "open"
      }
    }
  


He then writes a rule that automatically adds tags to the incident, giving it more context and classifying it better:

Rule: re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000


Tags assigned:

  • issueType : optimisation
  • severity : low
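
For readers who want to see the logic of Kevin's rule spelled out, here is a Python approximation. It assumes that `re(field, pattern)` in the rule means "the pattern matches somewhere in the field" (hence `re.search`) and that `payload` refers to the object under the top-level "payload" key; the function name `tags_for` is made up for this sketch and is not how Squadcast evaluates rules internally.

```python
import re


def tags_for(event: dict) -> dict[str, str]:
    """Approximate: re(payload.issue, "QUERY") && visited_returned_ratio < 5000."""
    payload = event["payload"]
    # Assumption: re(...) succeeds if the pattern occurs anywhere in the field.
    is_query_issue = re.search("QUERY", payload["issue"]) is not None
    is_low_ratio = payload["metric"]["visited_returned_ratio"] < 5000
    if is_query_issue and is_low_ratio:
        return {"issueType": "optimisation", "severity": "low"}
    return {}


# A trimmed-down copy of the incident payload shown earlier.
event = {
    "payload": {
        "id": 23,
        "issue": "SLOW_QUERY_PERF",
        "metric": {"visited_returned_ratio": 1300.2334, "time_interval": 10},
    }
}

print(tags_for(event))
```

With the sample payload, both conditions hold ("SLOW_QUERY_PERF" contains "QUERY" and 1300.2334 is below 5000), so both tags are assigned.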

Now, at least, the incidents are classified. With a smug, satisfied smile, he sits back and admires his work of art. A quick thought crosses his mind and he rubs his hands in devious mischief.


He now uses routing rules and the issueType tag to automatically route such incidents to the right person going forward: in this case, Kai. That way, Kevin does not get disturbed for these kinds of issues anymore.
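
The routing step can be pictured as a lookup from tags to a responder. The sketch below is an assumption-laden illustration rather than Squadcast's routing engine: `ROUTING_RULES`, `DEFAULT_RESPONDER` and `route` are hypothetical names, and falling back to Kevin when no rule matches is a guess made for the example.

```python
# Each rule is (tag key, tag value, responder); rules are checked in order.
ROUTING_RULES = [
    ("issueType", "optimisation", "Kai"),
]

# Assumption for this sketch: anything that matches no rule still goes to Kevin.
DEFAULT_RESPONDER = "Kevin"


def route(tags: dict[str, str]) -> str:
    """Return the responder of the first routing rule whose tag matches."""
    for key, value, responder in ROUTING_RULES:
        if tags.get(key) == value:
            return responder
    return DEFAULT_RESPONDER


print(route({"issueType": "optimisation", "severity": "low"}))
```

Because routing reads only the tags, the tagging rules can change later without touching the routing rules, and vice versa.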


Kevin thoughtfully arrives at the conclusion that this is quite possibly the best gift he could give to his single friend on Valentine's Day.

Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, centralize SLO dashboards, unify internal and external SLIs, automate incident resolution with Squadcast Actions, and create a knowledge base to effectively handle incidents.

About the Author: Prakya Vasudevan
February 20, 2020

Copyright © Squadcast Inc. 2017-2020