With the rise of digital platforms, IT infrastructure has grown so complex that multiple application interdependencies now coexist with varied architectures and on-call team types. This blog looks at how you can model your infrastructure in Squadcast to reduce the time it takes to respond to and resolve incidents.
As an SRE at an organization whose rapidly growing infrastructure has many interdependencies, you may have struggled with configuring things on an incident management platform. If you have a smaller team with a monolithic architecture in place, it is still relatively easy to connect the infrastructure to your incident management platform and create rules for escalations and alerting. But what happens if you have a large on-call team spread across time zones looking after infrastructure that has hundreds of microservices running concurrently? How do you configure it all in your incident management platform while keeping in mind the load your on-call team will be under?
Since most platforms let you create services that accept alerts from monitoring tools, should you create 100 such services, one for every component of your infrastructure?
We will be tackling questions like these in this blog. But before we dive deeper, here are a few things to be aware of.
Q: What key aspects does this article address?
A: In this blog, we look at ways your team can configure an incident management platform, in particular Squadcast, to ensure that you don't waste precious time when responding to incidents.
Q: What won't this article cover?
A: Unfortunately, no single solution will work for every type of situation. This post seeks to provide some clarity on the problem. We have put together a set of best practices that should cover most production systems out there.
Some of the concerns you may have while modelling your services are
As a modern incident management platform, Squadcast aggregates and routes alerts from monitoring tools and provides a centralised dashboard for tracking and prioritising alerts, taking action and ultimately resolving the incident (the resolution part will be covered in our blog titled “Intelligent Incident Response Plan”). Owing to its flexible configuration capabilities, there are many ways you can set up alerting for services within Squadcast.
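To make the "one service per component?" question more concrete, here is a minimal sketch of the alternative: routing alerts to a small number of services based on tags in the alert. This is plain Python for illustration only and is not the Squadcast API; the rule format, the tag names (`team`, `category`) and the service names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical routing rules: (required tags, destination service).
# Rules are checked in order; the first match wins.
ROUTING_RULES = [
    ({"team": "payments"}, "payments-service"),
    ({"category": "infrastructure"}, "platform-infra"),
]
DEFAULT_SERVICE = "catch-all"


def route(alert_tags):
    """Return the service an alert should go to, based on its tags."""
    for required, service in ROUTING_RULES:
        if all(alert_tags.get(key) == value for key, value in required.items()):
            return service
    return DEFAULT_SERVICE


def group_by_service(alerts):
    """Group a list of alerts (dicts with a 'tags' dict) by destination service."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[route(alert.get("tags", {}))].append(alert)
    return dict(grouped)
```

With rules like these, hundreds of components can share a handful of services, as long as each alert carries enough tags to be routed to the right on-call team.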
This blog takes into account the different kinds of infrastructure (monolithic/microservices or distributed) and types of on-call teams that are present.
Before we get started with the best practices, here are some Squadcast-specific features that you need to know about when configuring the platform.
These features provide the backbone for the best practices in alerting for your organisation. While the solutions described in this blog are generic, with a little tweaking, chances are they will work for you. We have tried to be as inclusive as possible while creating these best practices. Before we get started on modelling your system in Squadcast, here are the assumptions we are making about the alerting systems you have in place.
Monitoring: We are assuming that you are already monitoring all the important aspects of your infrastructure. This includes alerting, metric collection, log aggregation and tracing/instrumentation practices. We are also assuming that you have a good mix of proactive, reactive and investigative alerts in place, and that you have categorised the alerts based on whether they relate to the infrastructure or to the application side (business dependent).
Relevant Alerting: The alerts you have in place are linked to important parts of your infrastructure and are already optimised. This means the alerts are actionable and not oversensitive (they use the right thresholds). It also means having the right deduplication rules in place to mitigate alert noise. We are also assuming that you can add identifying information to your alert payloads.
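As a rough illustration of the last two assumptions, the sketch below attaches identifying tags to an alert payload and suppresses repeats of the same alert within a time window. It is a generic Python example, not Squadcast's deduplication feature; the field names (`tags`, `check`), the 300-second window and the choice of key fields are assumptions made for the example.

```python
import hashlib
import time


def enrich_alert(alert, service, environment, team):
    """Return a copy of the alert with identifying tags attached."""
    enriched = dict(alert)
    enriched["tags"] = {"service": service, "environment": environment, "team": team}
    return enriched


def dedup_key(alert):
    """Build a stable key from the fields that identify 'the same' alert."""
    tags = alert.get("tags", {})
    raw = "|".join(
        [tags.get("service", ""), tags.get("environment", ""), alert.get("check", "")]
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Deduplicator:
    """Flags an alert as a duplicate if the same key was seen recently.

    The window slides: every occurrence, duplicate or not, resets the timer.
    """

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._last_seen = {}

    def is_duplicate(self, alert, now=None):
        now = time.time() if now is None else now
        key = dedup_key(alert)
        last = self._last_seen.get(key)
        self._last_seen[key] = now
        return last is not None and (now - last) < self.window_seconds
```

Which fields go into the key is the real design decision: too few and unrelated alerts get merged, too many and every alert looks unique.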
Our recommendations assume that the alerting system you have in place presently is well suited to the type of business and tech stack that you are using.
The way you model your system will depend on several factors. First, we will look at the kind of architecture you have in place.
For the purpose of this blog post, we will consider the following as the different types of architecture that you may be using:
Before we recommend the best way to configure Squadcast for your organization, please select the type of architecture and on-call team you have.
Conclusion: Depending on the nature of your infrastructure as well as the size and composition of your on-call staff, a combination of the above guidelines would be ideal for your organization. Initially, you may need to run several tests to determine the best way to model services in Squadcast for your specific needs. If you are a large organization with multiple interconnected services, our recommendations will assist you in implementing a framework that will optimize your alerting processes and help reduce your MTTR (Mean Time To Resolve).
Our next blog in this series, titled “Intelligent Incident Response”, will help you understand what needs to be done to mitigate the impact or fix the issue with the help of Squadcast, while ensuring that you learn from every incident, which should be the biggest takeaway from your incident response process.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.