On the quest to provide the best uptime, software platforms depend on complex interconnected microservices. This often leaves them vulnerable to cascading failures creating a massive deluge of alerts from monitoring tools when things go wrong. In this blog, we explore how Squadcast can be configured to curb alert noise for better productivity with the help of the most advanced deduplication features.
Modern software platforms depend on complex interconnected microservices for smooth operation. As stakeholders rush to deliver consistent service to their customers, maintaining uptime and service stability becomes one of the most important factors for customer satisfaction. To provide the best on-time services, organisations end up creating a complex web of interconnected microservices (often as a collection of redundant services to ensure there is minimal downtime).
However, this type of complex infrastructure has its own set of shortcomings. As interdependent systems experience breakdowns, often a cascade of failures occurs, triggering a massive deluge of alerts. In one of our previous blogs, we explored alert noise and how it can kill productivity and introduced you to Squadcast's deduplication rules. In this post we will take a look at service dependency based and status based deduplication rules, their benefits, use cases and how to implement them.
To reduce the amount of alert noise many monitoring tools have their own deduplication features. However, in the case of interdependent microservices a dedicated incident management platform that aggregates all alerts is always a better option. This comes as handy especially with organisations that are experiencing exponential growth and may have a haphazard web of interconnected services that are reliant on the handful of people who built it for it’s proper functioning.
For the sake of your on-call team’s sanity it is vital that you have clear noise reduction rules in place. A fully featured incident management solution will have the capabilities to ensure that this tangled web of microservices is not bombarding your on-call team with unnecessary alerts. For instance, an ecommerce platform may employ tens if not hundreds of microservices for something as simple as searching for a book and then placing an order for it.
Enter Squadcast, a super-customisable platform that helps you reduce alert noise. Be it too many alerts from the same critical service or alerts that are going out from dependent services.
A typical on-call engineer’s time is too valuable to be wasted on inconsequential alerts. Squadcast gives granular control over alerts and configuration for interdependent services. The platform is built with the experience of serving organisations dealing with hundreds of requests per second. Each organisation has an unique set of interdependent systems running on top of the other. Squadcast allows you to customise the level of alert noise you are comfortable with.
Regular deduplication rules in incident management platforms are in place to ensure that the on-call engineer is not overwhelmed by alerts coming from the same source in a short period of time. Our status based deduplication goes a step further by allowing granular control over the alerts being received. This includes checking the state of the incident status(whether “triggered”, “suppressed” ), and then deciding if deduplication should be done. This feature gives you that additional control of narrowing down the list of past incidents(based on the status they have) against which deduplication is to be considered.
Below screenshots show how you can configure this feature to deduplicate based on the status of a single critical service.
As stated earlier, Squadcast offers a very high level of granularity in the creation and application of deduplication rules. This includes creating complex regex expressions to cover any scenario and allowing dedupe based on the status (triggered, suppressed or acknowledged). For example, if the database service is affected it is possible to generate an alert based on the incident status for any subsequent failure.
When your platform relies on a variety of microservices working in tandem, we see dependencies springing up (that can lead to cascading failures if a critical service goes down). For instance, a typical design dependency for e-commerce sites features a central database service that is linked with a payment gateway and frontend service. Any failure in database would create two alerts:
With Squadcast’s service dependency based deduplication rules in place, only a single alert will be generated if the two services are configured to be interdependent.
Once you have properly configured your incident management platform to combine the alerts for these two dependent services you will be able to deal with the problem without being drowned in alert noise.
Below screenshots highlight deduplication rules configured for any interdependent service. In this case subsequent alerts are automatically deduplicated.
Squadcast also gives you the flexibility to configure granular rules to combine alerts based on the JSON payload from your monitoring services.This high degree of control lets you prepare for novel scenarios as the dependency map of your organisation evolves. Once the dependent services are mapped in the platform deduplication can be enabled by ticking the checkbox as shown in the screenshot.
Replicating the dependency map of your physical infrastructure to prevent spamming your on-call engineers with alerts.
An incident management platform that responds proactively and helps you in learning from past failures is indispensable in today’s world of microservices and distributed computing. As customer expectations rise the pressure to have 99% and above uptime rises in step. Understandably, maintaining that very high level of uptime can have a significant human cost if not managed properly. Status based and service based deduplication are just two among many of Squadcast’s features to ensure that alert noise does not become unbearable and gives organisations the power to make the lives of on-call engineers better.
We hope these practices help you reduce alert noise and improve your on-call experience. We’d love to hear from you on other best practices that can be followed to better on-call.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.