It is important to invest time and effort in understanding why a system performs the way it does and how we can improve it. Companies continue with practices that yield successful results, but ignoring anti-patterns can be far worse than choosing rigid processes. In this blog post, we will explore anti-patterns in incident response and why you should unlearn them.
Alerting everyone each time an incident is detected is not the best of practices. Sometimes notifying everyone is easier or it adds value. For example,
This practice may not be ideal when teams scale. You will end up notifying people who have nothing to do with the incident. This may result in alert fatigue, where people get used to tuning out notifications and then miss the incidents that do need their attention.
Having on-call rotations and targeted alerting can help route incidents efficiently and prevent burnout.
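As a rough illustration of an on-call rotation, the sketch below works out who holds the current shift so that only that person is paged. The names, the start date, and the one-week shift length are all assumptions made for the example, not a recommendation for any particular tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical weekly rotation for one team; the names are placeholders.
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # assumed first shift
SHIFT_LENGTH = timedelta(weeks=1)                            # assumed shift length


def on_call_at(moment: datetime) -> str:
    """Return the person whose shift contains `moment`."""
    # Count whole shifts elapsed since the rotation started, then wrap around.
    shifts_elapsed = (moment - ROTATION_START) // SHIFT_LENGTH
    return ROTATION[shifts_elapsed % len(ROTATION)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(f"Page only: {on_call_at(now)}")
```

Paging a single, predictable person per shift is what keeps the rest of the team from being alerted about incidents they cannot act on.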
Responders deal with critical incidents where stakeholders expect constant status updates. Updates are great because they keep everyone in the loop and may surface more solutions. Sometimes, teams deal with minor incidents, which they can resolve quickly and then pass on updates to the relevant members. However, while dealing with critical incidents, teams may be forced to focus more on sending updates than on resolving the incident. This may compromise the resolution process.
To address this issue, a dedicated person can be assigned to handle communication and provide timely updates to the stakeholders.
There is a perception that while dealing with critical incidents, people will move around with lots of discussion, chaos, and panic. This is not always true. When multiple people are responding to an incident, it is absolutely critical that they collaborate and keep everyone in sync with the actions being taken. Chaos and panic can worsen the situation and should be avoided by defining clear roles and responsibilities. Teams should have an incident commander who makes decisions and authorizes changes that can impact the outcome. Teams also use chat rooms to give updates and maintain records effectively. By setting up these processes, teams can ensure effective communication and prevent chaos and panic.
Debating the severity of an incident at the last minute is a waste of people’s time. That time should be spent resolving the incident. It is important to define unambiguous severity levels for incidents, because responses, plans, and policies are chosen based on the severity. Ideally, rules should be technically driven, clear, and automated so that every incident comes with a pre-defined severity level.
Training and drills should be conducted to educate teams on how to handle these situations better.
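To make the idea of automated, technically driven severity rules concrete, here is a minimal sketch. The signal fields, the thresholds, and the `SEV1` to `SEV3` labels are assumptions chosen for illustration; a real team would substitute the criteria it has agreed on in advance.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    """Hypothetical facts known when an incident is detected."""
    customer_facing: bool
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    affected_services: int


def classify_severity(signal: IncidentSignal) -> str:
    """Assign a severity from fixed rules so nobody has to debate it later."""
    # Assumed rule: customers are affected and at least half of requests fail.
    if signal.customer_facing and signal.error_rate >= 0.5:
        return "SEV1"
    # Assumed rule: customers are affected, or more than one service is.
    if signal.customer_facing or signal.affected_services > 1:
        return "SEV2"
    # Everything else is treated as low severity.
    return "SEV3"


if __name__ == "__main__":
    print(classify_severity(IncidentSignal(True, 0.8, 1)))
```

Because the rules are evaluated in a fixed order, the same inputs always produce the same severity, which is the property that removes the debate.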
Teams fail to inform the right responders when they don't have a mechanism to associate incidents with the people responsible for them. To find the right person, teams go back and forth, slowing down the process. The right people may also go unnotified when multiple teams are involved and team structures are complex. It is important to have an identifiable and reachable person for every team. There should be a clear, well-oiled mechanism to route alerts to the right responders to ensure smooth routing and escalation.
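One simple way to picture such a mechanism is a lookup from service to an ordered escalation chain. The sketch below is only an illustration: the service names, the responder names, and the fallback to an incident commander are all hypothetical.

```python
# Hypothetical ownership map: each service lists who to page, in order.
ESCALATION_CHAINS = {
    "payments-api": ["payments-oncall", "payments-lead", "engineering-manager"],
    "search": ["search-oncall", "search-lead"],
}

# Assumed fallback so that an unmapped service still reaches a named person.
FALLBACK_CHAIN = ["incident-commander"]


def responder_for(service: str, unacknowledged_pages: int) -> str:
    """Pick who to page next, given how many earlier pages went unacknowledged."""
    chain = ESCALATION_CHAINS.get(service, FALLBACK_CHAIN)
    # Move one step up the chain per missed page, stopping at the last entry.
    level = min(unacknowledged_pages, len(chain) - 1)
    return chain[level]


if __name__ == "__main__":
    print(responder_for("payments-api", 0))
    print(responder_for("payments-api", 1))
```

Keeping the map explicit means every service has an identifiable person to reach, and escalation follows a known order instead of a search for whoever might know.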
Postmortems are important for incident response because they help you learn from the events that happened in the past and help you plan your future actions.
There are various reasons that result in postmortem failures,
Without postmortems, you fail to recognize what’s working and where you can improve. Most importantly, they help you avoid making the same mistakes in the future. Hence, postmortems should be an integral part of the incident response process and must be done sincerely.
Organizations find comfort in practices that return successful results and like to continue with those practices. However, at times you cannot anticipate certain events and established solutions do not work. Having flexible policies and processes can help you adapt to changing requirements and find the right solutions when needed. You should not be reckless, but you also should not be afraid to introduce sensible changes. Some changes will slow things down in the short term but promise faster and better results in the long run.
Incidents are confusing at the best of times. People taking up different roles without informing others just adds to the confusion. In high-pressure situations, people are expected to act quickly, often with limited information coming in and little clarity on who needs to do what. This only makes the situation worse. Hence, it is important to define the right roles and responsibilities for people. Individuals, for their part, should keep others involved and informed about a change when needed.
Incident response is a field where we constantly look for processes and stability, but ignoring anti-patterns can be far worse than choosing optimal solutions or rigid processes.
Incident response teams need to identify issues early on, so they can help save time, prevent frustration, and reduce refactoring in the long run. Hence, it is very important to unlearn anti-patterns and learn new processes that can help accelerate incident response.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like Runbooks to eliminate toil.