As an SRE in an organization whose infrastructure is growing rapidly and has many interdependencies, you may have struggled to configure it all on an incident management platform. With a smaller team and a monolithic architecture, it is relatively easy to connect the infrastructure to your incident management platform and create rules for escalations and alerting. But what happens when a large on-call team spread across time zones looks after infrastructure with hundreds of microservices running concurrently? How do you configure all of that in your incident management platform while keeping in mind the load your on-call team will be under?
Since most platforms let you create services that accept alerts from monitoring tools, should you create a hundred such services, one for every component of your infrastructure?
We will be tackling questions like these in this blog. But before we dive deeper, here are a few things to be aware of.
Q: What are the key aspects this article will address?
A: In this blog, we look at ways your team can configure an incident management platform, in particular Squadcast, so that you don't waste precious time when responding to incidents.
Q: What won't this article cover?
A: Unfortunately, no single solution will work for every type of situation, so this post does not offer one. It seeks to bring some clarity to the problem with a set of best practices that should cover most production systems out there.
Some of the concerns you may have while modelling your services are:
- Will I be alerted on time?
- How do I avoid irrelevant alerts?
- Is the alert getting routed to the right person?
- Am I getting alerts for the most critical pieces of my infrastructure?
As a modern incident management platform, Squadcast aggregates and routes alerts from monitoring tools. It provides a centralised dashboard for tracking and prioritising alerts, taking action and ultimately resolving the incident (the latter part will be covered in our blog titled “Intelligent Incident Response Plan”). Owing to its flexible configuration capabilities, there are many ways you can set up alerting for services within Squadcast.
This blog takes into account the different kinds of infrastructure (monolithic/microservices or distributed) and types of on-call teams that are present.
Before we get started with the best practices, here are some Squadcast-specific features that you need to know about before configuring the platform.
- Squads: These are groups of on-call engineers and non-technical users that can be organized by business function or technology.
- Services: A service is a logical group of alert sources whose alerts can be tagged, deduplicated or routed to the right person or team. Services are most commonly used to represent individual parts of your infrastructure. Please note that a service can receive alerts from more than one monitoring tool.
- Tags: Tags in Squadcast can be created automatically to attach context-rich information to alerts. You can create your own rules for tagging alerts.
- Routing: Routing in Squadcast is used when you want alerts to be sent to someone who is not the default recipient. This is helpful when a specific part of your infrastructure is facing issues that require more specialised knowledge.
- Escalation Policies: These policies see to it that a critical alert is never missed. You can configure them to ensure that the right users and squads are alerted at the right time.
- On-call Rotations: On-call schedules are used to determine who will be notified when an incident is triggered. This helps you build a balanced on-call culture and ensures that no critical alerts are missed.
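To make the relationship between these features concrete, here is a minimal sketch of how squads, services and escalation policies relate to each other. It is plain Python for illustration only and is not the Squadcast API: the class names, fields, member names and the `squads_to_notify` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Squad:
    """A group of on-call engineers and other stakeholders (hypothetical model)."""
    name: str
    members: list[str]


@dataclass
class EscalationStep:
    """Notify `squad` once an incident has been unacknowledged for `after_minutes`."""
    after_minutes: int
    squad: Squad


@dataclass
class Service:
    """A logical group of alert sources with one escalation policy."""
    name: str
    alert_sources: list[str]
    escalation_policy: list[EscalationStep] = field(default_factory=list)

    def squads_to_notify(self, minutes_unacknowledged: int) -> list[str]:
        """Return the names of the squads whose escalation step has been reached."""
        return [
            step.squad.name
            for step in self.escalation_policy
            if step.after_minutes <= minutes_unacknowledged
        ]


# Example data; the squad, member and source names are made up.
backend = Squad("backend", ["asha", "ben"])
managers = Squad("managers", ["carol"])
api = Service(
    name="public-api",
    alert_sources=["metrics-monitor", "error-tracker"],  # more than one source is allowed
    escalation_policy=[EscalationStep(0, backend), EscalationStep(15, managers)],
)

print(api.squads_to_notify(5))   # ['backend']
print(api.squads_to_notify(20))  # ['backend', 'managers']
```

The point of the sketch is that a service owns its alert sources and its escalation policy, while squads are reusable across services.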
These features provide the backbone for the best practices in alerting for your organisation. While the solutions described in this blog are generic, with a little tweaking, chances are they will work for you. We have tried to be as inclusive as possible while creating these best practices. Before we get started on modelling your system in Squadcast, here are the assumptions we are making about the alerting systems you have in place.
Monitoring: We are assuming that you are already monitoring all the important aspects of your infrastructure. This includes alerting, metric collection, log aggregation and tracing/instrumentation practices. We are also assuming that you have a good mix of proactive, reactive and investigative alerts in place, and that you have categorised the alerts based on whether they relate to the infrastructure or to the application side (business dependent).
Relevant Alerting: The alerts you have in place are linked to important parts of your infrastructure and are already optimised. This includes alerts that are actionable and not over sensitive (the right threshold). This also includes having the right deduplication rules in place to mitigate alert noise. We are also assuming that you can add identifying information to your alert payloads.
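As an illustration of what "identifying information" and deduplication can look like in practice, the sketch below builds a deduplication key from the fields that describe what is failing and collapses repeated alerts. The payload field names (`service`, `instance`, `check`) and both helper functions are assumptions for this example; your monitoring tools and your incident management platform will have their own formats and rules.

```python
import hashlib
import json


def dedup_key(alert: dict) -> str:
    """Build a stable key from the fields that identify what is failing."""
    identity = {k: alert.get(k) for k in ("service", "instance", "check")}
    canonical = json.dumps(identity, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def deduplicate(alerts: list[dict]) -> list[dict]:
    """Keep the first alert for each key and count later repeats on it."""
    seen: dict[str, dict] = {}
    for alert in alerts:
        key = dedup_key(alert)
        if key in seen:
            seen[key]["repeat_count"] += 1
        else:
            seen[key] = {**alert, "repeat_count": 0}
    return list(seen.values())


# Hypothetical payloads: the first two describe the same problem.
alerts = [
    {"service": "checkout", "instance": "node-3", "check": "latency_p99", "value": 2.4},
    {"service": "checkout", "instance": "node-3", "check": "latency_p99", "value": 2.9},
    {"service": "inventory", "instance": "node-1", "check": "disk_free", "value": 0.04},
]
unique = deduplicate(alerts)
print(len(unique))  # 2
```

Note that the measured `value` is deliberately left out of the key, so a fluctuating metric does not create a new incident for every reading.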
Our recommendations assume that the alerting system you have in place presently is well suited to the type of business and tech stack that you are using.
The way you model your system will depend on several factors. First we will be looking at the kind of architecture you have in place.
Architecture
For the purpose of this blog post, we will consider the following as different types of architecture that you may be using:
- Monolithic Architecture: All of your core functionality is concentrated in a single executable application with related infrastructure dependencies like app servers, databases, load balancers etc. Your SRE team is responsible for maintenance of this part of the infrastructure.
- Distributed: A distributed architecture has multiple interdependent executable applications that communicate with each other and with their related infrastructure dependencies. These may or may not be replicated. We will assume that the number of internal units is low enough that they can be committed to memory.
- Microservices: A distributed architecture with a very large number of components. Due to the sheer number of these services, it is not feasible to create individual Squadcast services for each component.
- Multiple Unrelated Applications: Though less commonly found, these can be treated as a special case of the types of architecture mentioned above. This scenario may come into being when you need an incident management system with a proprietary application framework that doesn't fit into any of the above. This kind of architecture may be seen in organisations that require compartmentalised applications for security or compliance reasons.
- Kubernetes based architecture: Some types of alerts from this kind of infrastructure are eliminated or automatically resolved by Kubernetes itself. Other than this, there is no significant difference from a common microservice architecture.
Response Team Organisation
- All-in-One Incident Response Team: In this type of setup, all responders are organised into one team. Because of this, you may need little or no alert routing in your incident management platform.
- Service based: For larger organisations with more complex infrastructure, each application may have a dedicated team. Each team maintains their application and the infrastructure it depends on. Some examples are:
  - Public API Team
  - Inventory Service Team
- Infrastructure Layer based: This type of team organisation can be found in larger companies. In addition to application teams, there are teams that specialise in managing certain kinds of technology. Examples include:
  - Inventory System Team
  - Database Team
  - Load balancer Team
  - Networking Team
- L1/L2/L3 Teams: In this system, teams are organised into first responders and escalation teams. This type of team organisation can be considered a special case of the types mentioned above and, for the sake of simplicity, we will not be discussing it separately.
Recommendations for Configuring Services
The best way to configure Squadcast for your organization depends on two things: the kind of architecture your application has and the kind of on-call team you have. Find the combination that matches your setup in the scenarios below.
Monolithic Architecture for All-in-One Incident Response Teams
Squads: Creation of one squad in Squadcast is sufficient for this kind of architecture. This squad will have members of the on-call team or any non-technical stakeholder if required.
Services: Creation of a single service in Squadcast will suffice and all alerts from monitoring tools can be sent to this service.
Tagging: Event tagging is optional in this scenario.
Routing: Alert Routing is not strictly necessary, unless you have an on-call team with varying levels of expertise.
Monolithic Architecture for Service Based Teams
Squads: Each business specific team will need their own squad with relevant escalation policies. One additional cross-functional team may be required for handling infrastructure related issues.
Services: Individual services have to be created in Squadcast for each function/team specific area and one for infrastructure related issues.
Tagging: Event tagging is optional in this scenario.
Routing: Alert Routing is optional in this scenario.
Monolithic Architecture for Infrastructure Layer Based Teams
Squads: One team will be required for each infrastructure layer being monitored in the backend. Alerts will be sent to the team responsible for handling the incident.
Services: You will need to create separate services in Squadcast for each layer of infrastructure being monitored.
Tagging: Event Tagging is optional in this scenario.
Routing: Alert Routing is optional in this scenario.
Distributed Architecture for All-in-One Incident Response Teams
Squads: One squad has to be created in Squadcast. This squad will include all members of the on-call team or any other non-technical stakeholder.
Services: One service needs to be created in Squadcast for each critical application service being monitored.
Tagging: Event tagging is optional in this scenario.
Routing: Alert routing is optional in this scenario.
Distributed Architecture for Service Based Teams
Squads: Multiple squads have to be created in Squadcast for respective business teams.
Services: One service needs to be created in Squadcast for each application in the distributed system.
Tagging: Event Tagging is optional in this scenario.
Routing: Alert Routing is optional in this scenario.
Distributed Architecture for Infrastructure Layer Based Teams
Squads: One squad has to be created in Squadcast for each infrastructure team. This is required since each team needs separate routing and escalation policies.
Services: You have two options to choose from while configuring services for this type:
- One Squadcast service for each source. This will require information in the alert payload to distinguish whether the alert is application related or infra layer related.
- Separate services for each infrastructure layer and for application related alerts.
Tagging: Event Tagging is required if you are using one Squadcast service per infra layer.
Routing: Alert Routing is required to send alerts to specific engineers in charge of respective infrastructure layers.
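The sketch below shows one way the tagging and routing described in this scenario could fit together: a `layer` field in the alert payload becomes a tag, and the tag decides which infrastructure squad receives the alert. The `layer` and `host` payload fields, the squad names and both functions are hypothetical and only illustrate the idea; in Squadcast you would express this with tagging and routing rules rather than code.

```python
# Squad that receives anything not claimed by an infrastructure layer (assumed name).
APPLICATION_SQUAD = "application-squad"

# Hypothetical mapping from the payload's `layer` value to the owning squad.
LAYER_SQUADS = {
    "database": "database-squad",
    "load_balancer": "load-balancer-squad",
    "network": "networking-squad",
}


def tags_from_payload(payload: dict) -> dict:
    """Copy the identifying payload fields into tags; default to the application layer."""
    tags = {"layer": payload.get("layer", "application")}
    if "host" in payload:
        tags["host"] = payload["host"]
    return tags


def route_by_layer(tags: dict) -> str:
    """Pick the squad for an alert from its `layer` tag."""
    return LAYER_SQUADS.get(tags["layer"], APPLICATION_SQUAD)


db_alert = {"layer": "database", "host": "db-2", "message": "replication lag high"}
app_alert = {"message": "HTTP 500 rate high"}

print(route_by_layer(tags_from_payload(db_alert)))   # database-squad
print(route_by_layer(tags_from_payload(app_alert)))  # application-squad
```

Alerts without a recognised layer fall back to the application squad, so a missing field never leaves an alert unrouted.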
Microservice Architecture for All-in-One Incident Response Teams
Squads: One squad has to be created in Squadcast for the on-call team. Escalation policies will need to be created depending on the nature of your application.
Services: Alert payloads will need to be customised to have information regarding the affected service and instance/node. This information will be used to add visible contextual information to the incident tags.
Tagging: Tags will need to include information about the affected service and the instance or node.
Routing: Routing is optional in this scenario.
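As a rough illustration of tagging in this scenario, the sketch below pulls the affected service and instance out of a customised alert payload and turns them into tags. The payload layout (a nested `labels` object plus a `severity` field), the `TAG_RULES` table and the helper names are all assumptions for the example, not a description of how Squadcast evaluates tagging rules.

```python
from typing import Any, Optional

# Hypothetical tagging rules: tag name -> path into the alert payload.
TAG_RULES = {
    "service": ("labels", "service"),
    "instance": ("labels", "instance"),
    "severity": ("severity",),
}


def lookup(payload: dict, path: tuple) -> Optional[Any]:
    """Follow `path` through nested dicts; return None if any key is missing."""
    current: Any = payload
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def build_tags(payload: dict) -> dict:
    """Apply every tagging rule; mark missing values so gaps are visible."""
    tags = {}
    for name, path in TAG_RULES.items():
        value = lookup(payload, path)
        tags[name] = str(value) if value is not None else "unknown"
    return tags


payload = {
    "labels": {"service": "cart", "instance": "cart-7f9c"},
    "severity": "critical",
}
print(build_tags(payload))
# {'service': 'cart', 'instance': 'cart-7f9c', 'severity': 'critical'}
```

Tagging missing values as `unknown` instead of dropping them makes it easier to spot monitoring tools whose payloads still need to be customised.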
Microservice Architecture for Service Based Teams
Squads: One squad has to be created in Squadcast for each team.
Services: One service for each team. Alert payloads will need to be customised to have information regarding the affected service and instance or node. This information will be used to add visible contextual information to the incident tags.
Tagging: Tags will need to include information about the affected service and the instance or node.
Routing: Alerts will need to be routed to each specific squad. For example, if it’s an error related to the payment gateway monitoring service it has to be routed to that specific business team.
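With a large number of microservices, listing every service in a routing table quickly becomes unmanageable. One possible approach, sketched below, is to route on naming patterns in the service tag, with a catch-all rule at the end. The patterns, squad names and the `squad_for_service` function are hypothetical and assume that your microservices follow a consistent naming scheme.

```python
from fnmatch import fnmatchcase

# Hypothetical routing rules, checked in order; the first matching pattern wins.
ROUTING_RULES = [
    ("payment-*", "payments-squad"),
    ("inventory-*", "inventory-squad"),
    ("*", "platform-squad"),  # catch-all so that no alert is left unrouted
]


def squad_for_service(service_tag: str) -> str:
    """Return the squad responsible for the microservice named in the tag."""
    for pattern, squad in ROUTING_RULES:
        if fnmatchcase(service_tag, pattern):
            return squad
    raise LookupError(f"no routing rule matches {service_tag!r}")


print(squad_for_service("payment-gateway"))  # payments-squad
print(squad_for_service("inventory-sync"))   # inventory-squad
print(squad_for_service("search-indexer"))   # platform-squad
```

Because rules are evaluated in order, specific patterns should come before the catch-all.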
Microservice Architecture for Infrastructure Layer Based Teams
Squads: One squad has to be created in Squadcast for each team looking after specific parts of the infrastructure.
Services: One service connected to the infra layer for each team. Alert payloads will need to be customised to have information regarding the affected infra layer, and instance or node. Without this identifying information it will be much harder to fix issues.
Tagging: Tags will need to include information about the affected service and the instance or node.
Routing: Alerts will need to be routed to each specific squad.
Note: All other scenarios involving this kind of setup can be modelled the same way, but this is not recommended as you may lose some analytics and access control capabilities.
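The tagging and routing recommendations from the scenarios above can be condensed into a small lookup table. The sketch below is only a summary aid: the key names and the `recommend` function are made up for this example, and "conditional" stands for the distributed case where tagging depends on which service layout you choose.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    tagging: str
    routing: str


# (architecture, team type) -> whether tagging and routing are needed.
RECOMMENDATIONS = {
    ("monolithic", "all_in_one"): Recommendation("optional", "optional"),
    ("monolithic", "service_based"): Recommendation("optional", "optional"),
    ("monolithic", "infrastructure_layer"): Recommendation("optional", "optional"),
    ("distributed", "all_in_one"): Recommendation("optional", "optional"),
    ("distributed", "service_based"): Recommendation("optional", "optional"),
    ("distributed", "infrastructure_layer"): Recommendation("conditional", "required"),
    ("microservices", "all_in_one"): Recommendation("required", "optional"),
    ("microservices", "service_based"): Recommendation("required", "required"),
    ("microservices", "infrastructure_layer"): Recommendation("required", "required"),
}


def recommend(architecture: str, team_type: str) -> Recommendation:
    """Look up the tagging and routing recommendation for one combination."""
    return RECOMMENDATIONS[(architecture, team_type)]


rec = recommend("microservices", "all_in_one")
print(rec.tagging, rec.routing)  # required optional
```

The table makes the overall pattern easy to see: the more components and the more specialised the teams, the more you depend on tags and routing rules.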
Conclusion: Depending on the nature of your infrastructure as well as the size and composition of your on-call staff, combinations of the above guidelines would be ideal for your organization. Initially, you may need to do several tests to determine the best way to model services in Squadcast depending on your specific needs. If you are a large organization with multiple interconnected services, our recommendations will assist you in implementing a framework that will optimize your alerting processes and help reduce your MTTR (Mean Time To Resolve).
Our next blog in this series, titled “Intelligent Incident Response”, will help you understand how to mitigate the impact of an incident or fix the underlying issue with the help of Squadcast, while ensuring that you learn from every incident, which should be the biggest takeaway from your incident response process.