There is a famous quote that goes like this…
‘For every minute spent organizing, an hour is earned.’
At least in the world of incident response, nothing is more apt than this. Digital infrastructure these days is made up of multiple services, and an outage could result from one impacted service or several. So it's essential to have a catalog of all the services along with the point of contact (service owner) responsible for maintaining each one.
However, in the absence of service ownership details, the incident response will go on for longer than necessary. And even basic questions such as these will seem like a mystery to everyone involved:
Not having answers to these questions makes incident response a reactive process, with an obvious gap between Mean Time to Detection and Mean Time to Recovery. This not only hurts metrics closely tied to team goals (such as MTTA & MTTR) but also increases the chances of more customers getting exposed to the issue.
Most readers here will argue that maintaining Service Ownership is an age-old practice. Rightly so: documenting the list of services and their respective owners has been a standard practice for Infrastructure and Operations teams over the years because they were responsible for the system’s performance and uptime.
In recent years, it's not about - ‘are ownership details documented?’
Rather it's about - ‘where are ownership details documented?’
The foremost questions you need to ask yourself (and your team) are -
‘Do we have the details stored in the right place?’
‘Are the details centralized and easily accessible by everyone?’
‘Can everyone quickly access it during emergencies?’
‘Is there automation in place to alert the right people?’
Better Ownership & Greater Transparency. Response teams must be able to access ownership details in mere seconds, not minutes. And the best place to document these details isn’t just any random tool, but an Incident Management platform such as Squadcast.
And to meet this need, we’ve built a feature that can act as a centralized Service Directory, highlighting the health status of Services and their respective owners. This not only makes incident response less chaotic but is also the first step in making it a proactive process, rather than a reactive process.
Before we get into the details of how modern incident response teams are using our Service Catalog to prevent incidents from spiralling out of control, let’s spend some time understanding what it actually means to ‘own Services’.
Service ownership is the act where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle. Since Service owners are the SMEs (subject matter experts) for their services – it makes a lot of sense for them to own response and resolution of production issues. This not only promotes a stable product but also bridges the gap between engineering teams and the impact they have on customers.
When it comes to Incident Management, being organized is a superpower that can prevent you from losing millions of dollars in a short window of downtime, all thanks to the timely availability of information. Conversely, every minute spent scrambling for data will only lead to more tickets and escalations.
Our Service Catalog is a Service Directory that acts like a centralized knowledge base containing all the specifics of each service and the personnel within the team responsible for maintaining it.
It can typically answer questions such as:
Having all the service-related information in a centralized location can make Service Ownership less chaotic for the team not only at the time of an outage, but also when there is a partial service degradation.
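To make the idea of a centralized directory concrete, here is a minimal in-memory sketch. It is not Squadcast's data model or API; the class names, field names, and sample services are all hypothetical and only illustrate the kind of lookup a catalog makes fast.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One catalog record. Every field name here is illustrative."""
    name: str
    owner_team: str
    on_call_contact: str
    health: str = "healthy"  # e.g. "healthy", "degraded", "down"
    tags: dict[str, str] = field(default_factory=dict)


class ServiceCatalog:
    """A single directory of services, keyed by service name."""

    def __init__(self) -> None:
        self._entries: dict[str, ServiceEntry] = {}

    def register(self, entry: ServiceEntry) -> None:
        self._entries[entry.name] = entry

    def owner_of(self, service_name: str) -> str:
        """Return the owning team, or raise if the service is undocumented."""
        if service_name not in self._entries:
            raise KeyError(f"no ownership record for {service_name!r}")
        return self._entries[service_name].owner_team

    def unhealthy(self) -> list[ServiceEntry]:
        """List every service that is not currently healthy."""
        return [e for e in self._entries.values() if e.health != "healthy"]


# Hypothetical sample data.
catalog = ServiceCatalog()
catalog.register(
    ServiceEntry("checkout-api", "payments", "payments-oncall@example.com")
)
catalog.register(
    ServiceEntry(
        "search-indexer",
        "discovery",
        "discovery-oncall@example.com",
        health="degraded",
    )
)

print(catalog.owner_of("checkout-api"))
print([e.name for e in catalog.unhealthy()])
```

The point of the sketch is that "who owns this?" and "what is unhealthy right now?" become single lookups instead of a search across wikis and spreadsheets.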
But associating ownership with services is not as easy as it sounds. There are numerous processes and best practices that should be followed. Let’s read about that in the next section.
Now let’s understand what exactly a Service is within Squadcast’s ecosystem.
Services in Squadcast represent specific systems, applications, or core components of your infrastructure for which alerts are generated, and incidents get created.
In the simplest terms, a Service in Squadcast can be summarized as a component that you want to constantly monitor for uptime, report incidents at the slightest hint of performance degradation, and have certain people on-call to quickly remediate the issue.
For every service created in Squadcast, appropriate service owners should be defined.
Establishing the culture of ‘owning Services’ will help you take the next big leap in your reliability journey, and every member involved in the process should buy in to the cause. This includes everyone in incident response, from the incident commander to the on-call engineers working on L1 issues.
So in the next section of this blog, let’s understand the best practices to keep in mind while configuring services and ownership. To check out the best practices to reduce MTTR for Services configured in Squadcast, refer to this guide.
First, create a list of all the services that are critical to your business. This should include both *Technical Services and *Business Services that need to be monitored 24/7. Start by differentiating between the two types of services, then assign ownership to the appropriate teams, because even a few seconds of degradation or downtime can upset customers and stakeholders.
*Technical Service - a discrete piece of code or functionality within the product owned by the engineering team
*Business Service - can be a combination of one or more Technical Services that have a direct impact on the business/ customer
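As a rough illustration of how the two service types relate, the sketch below models a Business Service as a list of the Technical Services it depends on. All service names, team names, and function names are made up for the example.

```python
# Hypothetical mapping: technical service -> owning team.
TECHNICAL_OWNERS = {
    "payment-gateway": "payments",
    "cart-service": "storefront",
    "inventory-db": "platform",
}

# Hypothetical mapping: business service -> technical services it is built on.
BUSINESS_SERVICES = {
    "online-checkout": ["cart-service", "payment-gateway", "inventory-db"],
}


def impacted_business_services(technical_service: str) -> list[str]:
    """Business services that depend on the given technical service."""
    return sorted(
        name
        for name, parts in BUSINESS_SERVICES.items()
        if technical_service in parts
    )


def owners_for_business_service(business_service: str) -> list[str]:
    """Teams that own at least one component of the business service."""
    parts = BUSINESS_SERVICES[business_service]
    return sorted({TECHNICAL_OWNERS[part] for part in parts})


print(impacted_business_services("payment-gateway"))
print(owners_for_business_service("online-checkout"))
```

With a mapping like this, an alert on one Technical Service can be traced to the customer-facing Business Services it affects, and to every team that may need to be involved.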
Using appropriate naming conventions will make incident response less chaotic during times of urgency. When naming services:
Every Service should be wholly owned by a team or an individual. Ideally, this should be the same team responsible for developing and maintaining the service because they are the Subject Matter Experts who understand how the service works and should be notified when something goes wrong.
On-call rotations are key to distributing the load equally among team members. Based on your organization’s requirements and structure, you should build out a roster (a full-blown on-call calendar) that shows how many individuals will be on-call at a given time and who will be notified straight away for certain severe incidents.
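A simple way to picture a roster is a round-robin rotation. The sketch below assumes fixed-length shifts and a single person on-call at a time; the names, start date, and function are hypothetical, and a real calendar would also need overrides, time zones, and escalation layers.

```python
from datetime import date


def on_call_for(
    day: date,
    roster: list[str],
    rotation_start: date,
    shift_days: int = 7,
) -> str:
    """Return who is on-call on `day` in a fixed-length round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count how many full shifts have passed, then wrap around the roster.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return roster[shifts_elapsed % len(roster)]


# Hypothetical roster and start date.
ROSTER = ["asha", "ben", "chen"]
START = date(2024, 1, 1)

print(on_call_for(date(2024, 1, 10), ROSTER, START))
```

Because the schedule is computed from the roster rather than remembered, everyone gets the same answer to "who is on-call today?", and shifts are spread evenly across the team.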
The best practice is to:
‘Tags’ help in classifying services appropriately, and classifying services adds a lot more context to them based on incident impact. For example:
Setting up ownership for services is only the first step towards better incident response. To strengthen the value that it's adding, you can do the following:
SLOs (Service Level Objectives) are among the best indicators for measuring service functionality. Various functional targets should be established for every service. These targets could take the form of the expected amount of uptime, an acceptable amount of latency, the number of errors, the error rate, etc.
But the key point is to make sure the owner keeps tabs on these performance indicators, along with some form of automation that can notify the owner(s) as and when the targets are not being met.
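Here is a minimal sketch of what such a check could look like, assuming just two targets (availability and error rate) measured over a fixed window. The `SLO` class, the threshold values, and the message format are all illustrative and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SLO:
    """Targets for one service. Names and thresholds are illustrative."""
    service: str
    owner: str
    min_availability: float  # e.g. 0.999 means 99.9% uptime
    max_error_rate: float    # fraction of requests allowed to fail


def check_slo(
    slo: SLO,
    total_requests: int,
    failed_requests: int,
    uptime_seconds: float,
    window_seconds: float,
) -> list[str]:
    """Return the names of the targets that were missed in this window."""
    breaches = []
    availability = uptime_seconds / window_seconds
    error_rate = failed_requests / total_requests if total_requests else 0.0
    if availability < slo.min_availability:
        breaches.append("availability")
    if error_rate > slo.max_error_rate:
        breaches.append("error_rate")
    return breaches


def owner_alert(slo: SLO, breaches: list[str]) -> Optional[str]:
    """Build the message for the owner, or None when all targets are met."""
    if not breaches:
        return None
    return f"[{slo.service}] {slo.owner}: SLO missed for {', '.join(breaches)}"


slo = SLO("checkout-api", "payments", min_availability=0.999, max_error_rate=0.01)
breaches = check_slo(
    slo,
    total_requests=10_000,
    failed_requests=250,
    uptime_seconds=86_000,
    window_seconds=86_400,
)
print(owner_alert(slo, breaches))
```

In practice the returned message would be routed to the owner through whatever alerting channel the team already uses; the sketch only shows the comparison of measurements against targets.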
Analytics is another useful medium to understand the health of the service. By analyzing a service’s past behavior, you can get answers to various questions like:
The key point is, analytics can be leveraged to decipher various patterns in a Service’s behavior. This data can be used to drive home numerous insights that can improve on-call and incident response processes.
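As one small example of this kind of analysis, the sketch below computes MTTA and MTTR from a list of past incidents. The record layout and timestamps are invented for illustration, and MTTR is assumed here to be measured from the moment the incident was triggered to the moment it was resolved.

```python
from datetime import datetime
from statistics import mean
from typing import Optional

# Hypothetical incident history for one service.
incidents = [
    {
        "service": "checkout-api",
        "triggered": datetime(2024, 3, 1, 10, 0),
        "acknowledged": datetime(2024, 3, 1, 10, 4),
        "resolved": datetime(2024, 3, 1, 10, 34),
    },
    {
        "service": "checkout-api",
        "triggered": datetime(2024, 3, 5, 2, 0),
        "acknowledged": datetime(2024, 3, 5, 2, 6),
        "resolved": datetime(2024, 3, 5, 3, 0),
    },
]


def mean_minutes(records: list[dict], start_key: str, end_key: str) -> Optional[float]:
    """Average gap in minutes between two timestamps, or None with no data."""
    if not records:
        return None
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```

Grouping the same calculation by service, by time of day, or by owner is a natural next step for spotting patterns.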
Most of all, having open discussions with the team is very important in maintaining team harmony. It can also help to bolster confidence and increase psychological safety, as service degradations are inevitable. Exchanging perspectives and settling on an approach to deliver maximum uptime is the best way forward.
Customers and stakeholders tend to be happier when they see a healthy and functioning service. A functioning service is thus a result of proactive incident response, which is itself a byproduct of well-defined Service Ownership.