When responding to an incident, the first thing you need to know is how it impacts customers and how badly it can affect your team. This is typically addressed through some form of incident classification, usually “incident severity levels”, which indicate the importance of each incident: how seriously various stakeholders are affected and whether the incident needs to be routed differently. Note that incident classification is not used to determine the root cause or find the resolution.
Implementing an incident classification step in your incident management software and process can significantly reduce both MTTR and the stress involved in the first few minutes of an incident.
Apart from setting up on-call schedules and adopting best practices for handling various kinds of incidents, incident management is also about constantly refining processes and benchmarks to ultimately achieve higher system reliability. One way of refining processes is to use an incident classification scheme such as incident severities.
Every team has their own unique way of defining severities. But this evolves once they have a basic classification framework for defining the severity of an incident. The most common starting point is the SEV 1 - SEV 5 scale, outlined below:
However, other factors, such as the urgency required in solving the incident or how the incident can affect other parts of the system, may not be taken into account when assigning a severity. Some incident management tools attempt to solve this by adding other forms of classification, like incident urgency and incident criticality. Many solutions allow incident severity as the only form of classification, and in some cases the severity has to be set manually instead of being assigned automatically based on the incoming alert context.
There’s a clear opportunity to improve incident response processes with better incident classification. Implemented the right way, it can bring down MTTR significantly, reduce the toil involved in manual routing, and add more context to an incident during the primary analysis.
At Squadcast, we chose to add more flexibility to this process by creating a custom rule-based auto-tagging system instead of having just a dropdown to manually select or assign tags. We define tags as key-value pairs. For example, the key could be severity and the possible values could be SEV0, SEV1, SEV2, etc., or the key could be Team and the possible values could be Backend, Frontend, Database, etc. With the Tagging and Routing features in Squadcast, you can set pretty much any kind of custom tag, which is then assigned automatically based on the rules you define on top of the attributes passed in the incident payload. You can then use these tags to set routing rules, ensuring that the right responder is notified at the right time to bring down the resolution time.
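To make the idea of rule-based key-value tagging concrete, here is a minimal Python sketch. It is not Squadcast's implementation: `TagRule`, `apply_tag_rules`, and the payload fields `service` and `error_rate` are hypothetical names chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]


@dataclass
class TagRule:
    """A condition over the incident payload plus the tags to add when it holds."""
    condition: Callable[[Payload], bool]
    tags: Dict[str, str]


def apply_tag_rules(payload: Payload, rules: List[TagRule]) -> Dict[str, str]:
    """Return the key-value tags from every rule whose condition matches.

    Later rules overwrite earlier ones when they set the same key.
    """
    tags: Dict[str, str] = {}
    for rule in rules:
        if rule.condition(payload):
            tags.update(rule.tags)
    return tags


# Hypothetical rules: payload field names and thresholds are assumptions.
rules = [
    TagRule(lambda p: p.get("service") == "postgres", {"Team": "Database"}),
    TagRule(lambda p: p.get("error_rate", 0) > 0.5, {"severity": "SEV1"}),
]

example_tags = apply_tag_rules({"service": "postgres", "error_rate": 0.9}, rules)
```

Keeping tags as plain key-value pairs means the same mechanism can carry severity, team, or any other classification without a new field for each one.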
In Part 2 of the Kevin Series, we illustrate how to use tags to set severities in Squadcast. We have more use-case-based articles lined up to show you other ways to implement incident classification using tags - stay tuned!
P.S. In case you were wondering, Kevin has previously also set up his own alert deduplication rules to reduce alert noise in Squadcast.
It's February 13th on a warm afternoon and Kevin is lazily dreaming about how his date is going to pan out the next day. His dream is suddenly disrupted by a torrent of database incidents that pour in. What's more annoying is that most of them are not particularly critical or even related to the class of issues he generally handles.
Kevin’s got a new ringtone for incidents. Love Me Do, in keeping with the Valentine spirit.
Also, he works with Kai, who is expected to handle all the low-severity incidents and typically everything that comes in with regard to query optimization.
Kevin realised that he could be spending his time more effectively by
This would allow more time for Kevin’s day dreaming!
Given that they work in a relatively small company where on-call rotations are rather erratic, or are handled by both of them when fires break out, he decides to make this process a whole lot better by simply routing more efficiently.
Plus, anticipating the same barrage of incidents while he’s on his date tomorrow, he decides to take matters into his own hands. He sees that the database incident is a query optimisation based incident. And not even a severe one at that, based on the visited_returned_ratio value in the payload.
He then writes a rule that auto-adds tags to the incident, giving it more context and classifying it better:
Rule: re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000
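For readers who want to see what Kevin's rule checks, here is a rough Python equivalent. It is only a sketch: it assumes `re(...)` behaves like an unanchored regular-expression search, and the tag values in `QUERY_TAGS` are hypothetical, since the rule above does not show which values Kevin assigns.

```python
import re
from typing import Any, Dict

# Hypothetical tag values, used only for illustration.
QUERY_TAGS = {"issueType": "query-optimization", "severity": "low"}


def matches_rule(payload: Dict[str, Any]) -> bool:
    """Mirror: re(payload.issue, "QUERY") && payload.metric.visited_returned_ratio < 5000."""
    issue = str(payload.get("issue", ""))
    metric = payload.get("metric")
    ratio = metric.get("visited_returned_ratio") if isinstance(metric, dict) else None
    if not isinstance(ratio, (int, float)):
        # A missing or non-numeric ratio never matches in this sketch.
        return False
    return re.search("QUERY", issue) is not None and ratio < 5000


def tags_for(payload: Dict[str, Any]) -> Dict[str, str]:
    """Return the tags to attach, or an empty dict when the rule does not match."""
    return dict(QUERY_TAGS) if matches_rule(payload) else {}
```

Both conditions must hold: an incident that mentions QUERY but has a ratio of 5000 or more is left untagged, so it is not treated as a low-severity one.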
Finally, he's done ensuring that the incidents are at least classified. With a smug, satisfied look, he sits back and admires his work of art. A quick thought jumps through his head and he rubs his hands in devious mischief.
He now uses routing rules and the issueType tag to automatically route such incidents to the right person going forward: in this case, Kai. That way, Kevin does not get disturbed by these kinds of issues anymore.
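Tag-based routing of this kind can be sketched in a few lines of Python. This is an illustration under assumptions rather than how Squadcast evaluates routing rules: the `route` function, the first-match-wins order, and the tag value `query-optimization` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Each routing rule is (tag key, tag value, responder).
RoutingRule = Tuple[str, str, str]


def route(tags: Dict[str, str], rules: List[RoutingRule], default: str) -> str:
    """Return the responder of the first rule whose tag matches, else the default."""
    for key, value, responder in rules:
        if tags.get(key) == value:
            return responder
    return default


# Hypothetical rule: query-optimization incidents go to Kai.
routing_rules: List[RoutingRule] = [("issueType", "query-optimization", "Kai")]

responder = route({"issueType": "query-optimization"}, routing_rules, default="Kevin")
```

Anything that does not carry the matching tag falls through to the default responder, so untagged incidents still reach someone.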
Kevin thoughtfully arrives at the conclusion that this is quite possibly the best gift he could give to his single friend on Valentine's day.
In fact, he believes he has cracked the "gifting" secret code for any occasion, for his on-call team members (flaunts an evil grin).
Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, centralize SLO dashboards, unify internal and external SLIs, automate incident resolution with Squadcast Actions, and create a knowledge base to handle incidents effectively.