With the rise of digital platforms, IT infrastructure has grown far more complex, to the point where multiple application interdependencies coexist with varied architectures and on-call team types. This blog looks at how you can model your infrastructure in Squadcast to reduce the time it takes to respond to and resolve incidents.
As an SRE at an organization with a rapidly growing infrastructure and several interdependencies, you may have struggled to configure an incident management platform. If you have a smaller team with a monolithic architecture in place, it is still relatively easy to connect the infrastructure to your incident management platform and create rules for escalations and alerting. But what happens if you have a large on-call team spread across time zones, looking after infrastructure that has hundreds of microservices running concurrently? How do you configure it all in your incident management platform while keeping in mind the load your on-call team will be under?
Since most platforms let you create services that accept alerts from monitoring tools, should you create a hundred such services, one for every component of your infrastructure?
We will be tackling questions like these in this blog. But before we dive deeper, here are a few things to be aware of.
Q: What key aspects does this article address?
A: In this blog, we look at ways your team can configure an incident management platform, in particular Squadcast, to ensure that you don't waste precious time responding to incidents.
Q: What won't this article cover?
A: Unfortunately, no single solution will work for every type of situation. This post seeks to provide some clarity on this problem. We have put together a set of best practices that should cover most production systems out there.
Some of the concerns you may have while modelling your services are
As a modern incident management platform, Squadcast aggregates and routes alerts from monitoring tools. It provides a centralised dashboard for tracking and prioritising alerts, taking action and ultimately resolving the incident (the latter part will be covered in our blog titled “Intelligent Incident Response Plan”). Owing to its flexible configuration capabilities, there are many ways you can set up alerting for services within Squadcast.
This blog takes into account the different kinds of infrastructure (monolithic/microservices or distributed) and types of on-call teams that are present.
Before we get started with the best practices, here are some Squadcast-specific features that you need to know while configuring the platform.
These features provide the backbone for the best practices in alerting for your organisation. While the solutions described in this blog are generic, with a little tweaking, chances are they will work for you. We have tried to be as inclusive as possible while creating these best practices. Before we get started on modelling your system in Squadcast, here are the assumptions we are making about the alerting systems you have in place.
Monitoring: We are assuming that you are already monitoring all the important aspects of your infrastructure. This includes alerting, metric collection, log aggregation and tracing/instrumentation practices. We are also assuming that you have a good mix of proactive, reactive and investigative alerts in place. Further, we assume you have categorised the alerts based on whether they relate to the infrastructure or to the application side (business dependent).
Relevant Alerting: The alerts you have in place are linked to important parts of your infrastructure and are already optimised. This means the alerts are actionable and not oversensitive (they use the right thresholds). It also means the right deduplication rules are in place to mitigate alert noise. We are also assuming that you can add identifying information to your alert payloads.
Our recommendations assume that the alerting system you have in place presently is well suited to the type of business and tech stack that you are using.
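To make the last assumption under Relevant Alerting concrete, here is a minimal Python sketch of an alert payload that carries identifying information. The `build_alert` helper and the field names (`service`, `instance`, `layer`) are hypothetical and chosen only for illustration; they are not a Squadcast payload schema, so substitute whatever fields your monitoring tool actually emits.

```python
import json


def build_alert(message, service, instance, layer, severity="warning"):
    """Return an alert payload (dict) that carries identifying information.

    All field names here are illustrative assumptions, not a Squadcast schema.
    """
    return {
        "message": message,
        "severity": severity,
        # Identifying information that tagging and routing can rely on later.
        "service": service,
        "instance": instance,
        "layer": layer,
    }


alert = build_alert(
    message="p99 latency above threshold",
    service="payment-gateway",
    instance="node-17",
    layer="application",
)
print(json.dumps(alert, indent=2))
```

Keeping the identifying fields at the top level of the payload is a design choice made for this sketch: it keeps any later tagging or routing rule a simple key lookup.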
The way you model your system will depend on several factors. First, we will look at the kind of architecture you have in place.
For the purpose of this blog post, we will consider the following as the different types of architecture that you may be using:
The recommended Squadcast configurations below are grouped by type of architecture and type of on-call team.
Squads: Creating one squad in Squadcast is sufficient for this kind of architecture. This squad will include the members of the on-call team and, if required, any non-technical stakeholders.
Services: Creation of a single service in Squadcast will suffice and all alerts from monitoring tools can be sent to this service.
Tagging: Event tagging is optional in this scenario.
Routing: Alert Routing is not strictly necessary, unless you have an on-call team with varying levels of expertise.
Squads: Each business-specific team will need its own squad with relevant escalation policies. One additional cross-functional team may be required for handling infrastructure-related issues.
Services: Individual services have to be created in Squadcast for each function- or team-specific area, plus one for infrastructure-related issues.
Tagging: Event tagging is optional in this scenario.
Routing: Alert Routing is optional in this scenario.
Squads: One team will be required for each infrastructure layer being monitored in the backend. Alerts will be sent to the team responsible for handling the incident.
Services: You will need to create separate services in Squadcast for each layer of infrastructure being monitored.
Tagging: Event Tagging is optional in this scenario.
Routing: Alert Routing is optional in this scenario.
Squads: One squad has to be created in Squadcast. This squad will include all members of the on-call team and any non-technical stakeholders.
Services: One service needs to be created in Squadcast for each critical application service being monitored.
Tagging: Event tagging is optional in this scenario.
Routing: Alert routing is optional in this scenario.
Squads: Multiple squads have to be created in Squadcast for respective business teams.
Services: One service needs to be created in Squadcast for each application in the distributed system.
Tagging: Event Tagging is optional in this scenario.
Routing: Alert Routing is optional in this scenario.
Squads: One squad has to be created in Squadcast for each infrastructure team. This is required since each team needs separate routing and escalation policies.
Services: You have two options to choose from while configuring services for this type:
Tagging: Event Tagging is required if you are using one Squadcast service per infra layer.
Routing: Alert Routing is required to send alerts to specific engineers in charge of respective infrastructure layers.
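The interplay of tagging and routing in this scenario can be sketched in Python as follows. This models the logic only: in practice you would configure tagging and routing in Squadcast itself, not in your own code. The `LAYER_OWNERS` table, the responder names, the `layer` payload field and both helper functions are hypothetical.

```python
# Hypothetical mapping from infrastructure layer to the engineer in charge.
LAYER_OWNERS = {
    "database": "db-oncall",
    "network": "network-oncall",
    "compute": "compute-oncall",
}


def tag_event(payload):
    """Derive a layer tag from the payload; a missing layer becomes 'unclassified'."""
    return {"layer": payload.get("layer", "unclassified")}


def route_event(tags, default="infra-escalation"):
    """Pick a responder from the layer tag, falling back to a default policy."""
    return LAYER_OWNERS.get(tags["layer"], default)


event = {"message": "replication lag high", "layer": "database"}
print(route_event(tag_event(event)))
```

The fallback matters: an alert that arrives without a recognisable layer still reaches someone instead of being dropped.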
Squads: One squad has to be created in Squadcast for the on-call team. Escalation policies will need to be created depending on the nature of your application.
Services: Alert payloads will need to be customised to have information regarding the affected service and instance/node. This information will be used to add visible contextual information to the incident tags.
Tagging: Tags will need to include information about the affected service and the instance or node.
Routing: Routing is optional in this scenario.
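As an illustration of the tagging step for a single on-call squad, the sketch below turns the identifying fields of a payload into incident tags. The `extract_tags` helper and the `service` and `instance` keys are assumptions made for this example and do not describe how Squadcast implements tagging.

```python
REQUIRED_FIELDS = ("service", "instance")


def extract_tags(payload):
    """Return incident tags built from the identifying fields of a payload.

    Missing or empty fields are tagged 'unknown' so that gaps stay visible
    on the incident instead of silently disappearing.
    """
    return {field: str(payload.get(field) or "unknown") for field in REQUIRED_FIELDS}


tags = extract_tags(
    {"message": "disk 91% full", "service": "orders", "instance": "node-3"}
)
print(tags)
```

Because every incident lands with the same squad here, the tags exist purely to give responders context at a glance.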
Squads: One squad has to be created in Squadcast for each team.
Services: One service for each team. Alert payloads will need to be customised to have information regarding the affected service and instance or node. This information will be used to add visible contextual information to the incident tags.
Tagging: Tags will need to include information about the affected service and the instance or node.
Routing: Alerts will need to be routed to each specific squad. For example, an error related to the payment gateway monitoring service has to be routed to the business team that owns it.
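The payment gateway example can be modelled as a lookup from the affected service to its owning squad. This is a sketch under stated assumptions: the squad names, the `SQUAD_BY_SERVICE` table and the `route_to_squad` helper are all hypothetical, and the real routing rules would live in your Squadcast configuration.

```python
# Hypothetical mapping of affected service to the owning business squad.
SQUAD_BY_SERVICE = {
    "payment-gateway": "payments-squad",
    "checkout": "storefront-squad",
}


def route_to_squad(tags, fallback="platform-squad"):
    """Return the squad that should receive an incident carrying these tags."""
    return SQUAD_BY_SERVICE.get(tags.get("service"), fallback)


print(route_to_squad({"service": "payment-gateway", "instance": "node-17"}))
```

Routing on the service tag rather than on the alert source means a new monitoring tool can be added without touching the routing table.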
Squads: One squad has to be created in Squadcast for each team looking after specific parts of the infrastructure.
Services: One service connected to the infra layer for each team. Alert payloads will need to be customised to have information regarding the affected infra layer and the instance or node. Without this identifying information it will be much harder to fix issues.
Tagging: Tags will need to include information about the affected service and the instance or node.
Routing: Alerts will need to be routed to each specific squad.
Note: All other scenarios involving this kind of setup can be modelled like this, but doing so is not recommended, as you may lose some analytics and access control capabilities.
Conclusion: Depending on the nature of your infrastructure as well as the size and composition of your on-call staff, combinations of the above guidelines would be ideal for your organization. Initially, you may need to do several tests to determine the best way to model services in Squadcast depending on your specific needs. If you are a large organization with multiple interconnected services, our recommendations will assist you in implementing a framework that will optimize your alerting processes and help reduce your MTTR (Mean Time To Resolve).
Our next blog in this series, titled “Intelligent Incident Response”, will help you understand what needs to be done to mitigate impact or fix the issue with the help of Squadcast, all the while ensuring that you learn from every incident, which should be the biggest takeaway from your Incident Response process.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.