Better Incident Response: Incident Classification & Setting Severities with Tags


Adding an incident classification step to your incident management software and process can significantly reduce MTTR and the stress of the first few minutes of an incident.

How to implement Incident classification?

Beyond setting up on-call schedules and adopting best practices for handling different kinds of incidents, incident management is also about continually refining processes and benchmarks to achieve higher system reliability. One way to refine a process is to classify incidents, for example by severity.

Every team defines severities in its own way, and those definitions evolve once a basic classification framework is in place. The most common starting point is the SEV-1 to SEV-5 scale, outlined below:

SEV-1 incidents are critical and have a very large impact on the customer experience. These are typically major incidents that cause outages and hinder product or service usability for a large percentage of customers.
SEV-2 incidents are also critical but less severe than SEV-1 incidents. They impede product usage but affect a smaller percentage of customers.
SEV-3 incidents may look minor but can have a significant impact if not addressed immediately. They may degrade product stability without affecting product usage right away.
SEV-4 incidents are minor incidents indicating that the product is not performing to the required standard, without necessarily affecting product usability.
SEV-5 incidents are minor bugs that need to be fixed but don’t affect product usability.
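
As a rough illustration, the scale above can be encoded as an ordered type with a small triage helper. The sketch below is in Python; the function name, its inputs, and the 50% cut-off for a "large percentage" of customers are assumptions made for the example, not part of any standard or of Squadcast.

```python
from enum import IntEnum


class Severity(IntEnum):
    """SEV-1 (most severe) through SEV-5 (least severe)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Hypothetical threshold: the scale only says "a large percentage".
LARGE_IMPACT_PCT = 50.0


def classify(blocks_usage: bool, customers_affected_pct: float,
             degrades_stability: bool, below_standard: bool) -> Severity:
    """Map a few yes/no observations onto the SEV-1 to SEV-5 scale."""
    if blocks_usage:
        # Usage is impeded: SEV-1 for wide impact, SEV-2 otherwise.
        if customers_affected_pct >= LARGE_IMPACT_PCT:
            return Severity.SEV1
        return Severity.SEV2
    if degrades_stability:
        # Stability is degrading, but usage is not yet affected.
        return Severity.SEV3
    if below_standard:
        # Below the required standard, usability intact.
        return Severity.SEV4
    # Anything left is treated as a minor bug.
    return Severity.SEV5
```

Because `IntEnum` members compare as integers, a check such as `severity <= Severity.SEV2` can be used to single out critical incidents.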

However, severity alone may not account for other factors, such as how urgently the incident needs to be solved or how it can affect other parts of the system. Some incident management tools address this by adding other forms of classification, like incident urgency and incident criticality. Many solutions, though, offer severity as the only form of classification, and in some cases it has to be set manually rather than assigned automatically from the context of the incoming alert.

There’s a clear opportunity to improve incident response processes with better incident classification. Implemented the right way, it can bring down MTTR significantly, reduce the toil of routing manually, and add more context to an incident during the primary analysis.

At Squadcast, we chose to add more flexibility to this process by creating a custom rule-based auto-tagging system instead of offering just a dropdown for manually selecting or assigning tags. We define tags as key-value pairs. For example, the key could be severity with possible values SEV0, SEV1, SEV2, etc., or the key could be Team with possible values Backend, Frontend, Database, etc. With the Tagging and Routing features in Squadcast, you can set up pretty much any kind of custom tag, which is then assigned automatically based on the rules you define over the attributes passed in the incident payload. You can then use these tags in routing rules, ensuring that the right responder is notified at the right time to bring down the resolution time.

In Part 2 of the Kevin Series, we illustrate how to use tags to set severities in Squadcast. We have more use-case-based articles lined up to show you other ways to implement incident classification using tags - stay tuned!

P.S. In case you were wondering, Kevin has previously also set up his own alert deduplication rules to reduce alert noise in Squadcast.

Severities and Auto-Routing with Incident Tags

It's a warm afternoon on February 13th, and Kevin is lazily dreaming about how his date is going to pan out the next day. His dream is suddenly disrupted by a torrent of database incidents pouring in. What's more annoying is that most of them are not particularly critical or even related to the class of issues he generally handles.

Kevin’s got a new ringtone for incidents. Love Me Do, in keeping with the Valentine spirit.

Also, he works with Kai, who is expected to handle all the low-severity incidents and typically everything that comes in with regard to query optimization.

Kevin realises that he could be spending his time more effectively by:

Classifying his incidents by the type or class they fit into
Assigning severity to get to critical incidents faster
Automatically routing incidents based on tags to ensure that the right responder is alerted

This would allow more time for Kevin’s daydreaming!

Given that they work in a relatively small company where on-call rotations are rather erratic, or handled by both of them when fires happen, he decides to make this process a whole lot better by simply routing more efficiently.

Plus, anticipating the same barrage of incidents while he’s on his date tomorrow, he decides to take matters into his own hands. He sees that the database incident is a query optimisation incident, and not even a severe one at that, based on the visited_returned_ratio value in the payload.

	
    {
      "payload": {
        "id": 23,
        "issue": "SLOW_QUERY_PERF",
        "metric": {
          "visited_returned_ratio": 1300.2334,
          "time_interval": 10
        },
        "summary": "Slow query performance",
        "cluster_name": "cluster-prod-0-awsumdb",
        "cluster_id": 9,
        "hostnames": [
          "rpl0-awsumdb.cluster-prod-0-awsumdb.db.com",
          "rpl2-awsumdb.cluster-prod-0-awsumdb.db.com"
        ],
        "link": "",
        "created": "2020-02-13T13:00:00.116Z",
        "status": "open"
      }
    }

He then writes a rule that automatically adds tags to the incident, giving it more context and classifying it better:

Rule: re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000

Tags assigned:

issueType : optimisation
severity : low
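
To make the rule's logic concrete, here is a minimal Python sketch of how such a rule could be evaluated against the payload above. It assumes that `re(field, pattern)` in the rule behaves like a regular-expression search; the function name `auto_tags` and the trimmed-down incident are made up for this example, and none of this is Squadcast's actual implementation.

```python
import re

# Trimmed copy of the incident above, keeping only the fields the rule reads.
incident = {
    "payload": {
        "issue": "SLOW_QUERY_PERF",
        "metric": {"visited_returned_ratio": 1300.2334, "time_interval": 10},
    }
}


def auto_tags(incident: dict) -> dict:
    """Return the tags that Kevin's rule would attach to an incident."""
    payload = incident.get("payload", {})
    issue = payload.get("issue", "")
    ratio = payload.get("metric", {}).get("visited_returned_ratio")

    tags = {}
    # Assumed equivalent of:
    # re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000
    if re.search("QUERY", issue) and ratio is not None and ratio < 5000:
        tags["issueType"] = "optimisation"
        tags["severity"] = "low"
    return tags


print(auto_tags(incident))
```

An incident whose ratio is 5000 or more, or whose issue does not contain QUERY, comes back with no tags from this rule and would need a rule of its own.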

Finally, he's done ensuring that the incidents are at least classified. With a satisfied smirk, he sits back and admires his work of art. A quick thought jumps into his head and he rubs his hands in devious mischief.

He now uses routing rules and the issueType tag to automatically route such incidents to the right person going forward: in this case, Kai. That way, Kevin no longer gets disturbed by these kinds of issues.
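
Routing on a tag can be pictured as a simple lookup from tag to responder. The Python sketch below is only an illustration: the `ROUTES` table, the `route` function, and the choice of Kevin as the fallback responder are assumptions for the example and do not reflect how Squadcast stores or evaluates routing rules.

```python
# Each hypothetical routing rule: (tag key, tag value, responder).
ROUTES = [
    ("issueType", "optimisation", "Kai"),
]

# Assumed fallback when no routing rule matches.
DEFAULT_RESPONDER = "Kevin"


def route(tags: dict) -> str:
    """Return the responder for an incident, based on its tags."""
    for key, value, responder in ROUTES:
        if tags.get(key) == value:
            return responder
    return DEFAULT_RESPONDER


print(route({"issueType": "optimisation", "severity": "low"}))  # Kai
print(route({"severity": "high"}))                              # Kevin
```

The first matching rule wins here, so rule order matters once more than one rule is defined.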

Kevin thoughtfully arrives at the conclusion that this is quite possibly the best gift he could give to his single friend on Valentine's Day.

In fact, he believes he has cracked the secret code of "gifting" for any occasion, for his on-call team members (flaunts an evil grin).

Read More on: Incident Severity Level Classification

Written By:

Prakya Vasudevan

February 20, 2020


