Imagine being an Ops engineer in a team just struck by tragedy. [sigh…]
Alarms start ringing, and incident response is in full force. It may sound like the situation is under control, but it isn’t.
There’s panic everywhere. The on-call team is scrambling for the heavenly door to redemption. And the one thing that doesn’t stop: stakeholder inquiries.
This situation is bad. But it could be worse.
Now imagine being a less-experienced Ops engineer in a relatively small on-call team struck by tragedy. If you don’t have sufficient guidance, let alone moral support, you’re toast.
If being ‘on-call during an incident’ is a battle, then ensuring ‘things don’t go as badly the next time’ is a war. And the formal term in IT for this ‘war’ is Site Reliability Engineering (SRE).
The ability to do good SRE hinges on the ability of the response team to be ‘prepared’. This is exactly what Runbooks can help SRE / response teams with. Runbooks can help teams be prepared…for just about anything.
GitLab has defined it best.
Runbooks are a collection of documented procedures that explain how to carry out a particular process, be it starting, stopping, debugging, or troubleshooting a particular system.
Here’s another apt definition, specific to incident response:
A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. It doesn’t necessarily need to be a critical incident.
A Runbook can also be used to document the standard procedures for maintenance and upgrades.
The schoolboy error that most teams commit while documenting such procedures is scattering them across Google Docs, Notion, Confluence, or other tools. Or worse, storing them in physical books.
Valuable time is then lost looking for the right document instead of fixing the issue, and things go from bad to worse. The problem lies not in the document itself, but in ‘where it’s documented’.
Hence it is recommended to store such procedural documents in a centralized location that the entire incident response team can access: within their incident response tool.
Incident response teams can thus leverage Squadcast’s Runbooks for this purpose. Teams can store checklists of tasks that need to be performed manually for an incident.
These checklists can be ‘steps to be performed’ in the event of a SEV-1 incident or during routine checks, maintenance, and upgrades. They can also be technical or functional instructions to debug and fix a certain issue, along with the code that needs to be manually executed.
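To make the idea of a checklist with attached commands concrete, here is a minimal sketch in Python. It is not Squadcast’s data model; the `Step` and `Runbook` classes, the service name `example-web.service`, and the step wording are all assumptions made for illustration. The commands are stored as reference text only and are never executed by this code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One checklist item. `command` is reference text for the operator."""
    description: str
    command: str = ""
    done: bool = False


@dataclass
class Runbook:
    """A named, ordered checklist (hypothetical structure, for illustration)."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the first step not yet checked off, or None when finished."""
        return next((s for s in self.steps if not s.done), None)

    def progress(self) -> str:
        completed = sum(s.done for s in self.steps)
        return f"{completed}/{len(self.steps)} steps done"


# Hypothetical runbook: the service name below is a placeholder.
restart_runbook = Runbook(
    name="Restart a stuck web service",
    steps=[
        Step("Tell the incident channel that a restart is starting"),
        Step("Restart the service", command="systemctl restart example-web.service"),
        Step("Confirm the service is active", command="systemctl is-active example-web.service"),
    ],
)

restart_runbook.steps[0].done = True        # operator ticks off the first item
print(restart_runbook.progress())
print(restart_runbook.next_step().description)
```

Keeping the command next to the step description is the point: the responder sees what to do and exactly what to type in one place, and the checklist state shows where a handover should pick up.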
Leveraging Runbooks is thus a means to reduce MTTA (mean time to acknowledge) and MTTR (mean time to resolve) by avoiding the delays caused by scrambling for external documents. Instead, you can use Runbooks to store procedural steps, associate them with incidents, and assign tasks to relevant users.
This is a good segue into understanding the different types of Runbooks.
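For readers who want to see how these two metrics are typically calculated, the sketch below averages them over a couple of made-up incidents. It assumes MTTA is measured from trigger to acknowledgement and MTTR from trigger to resolution; teams define the start and end points differently, and the timestamps here are invented purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 40)),
    (datetime(2024, 1, 2, 22, 0), datetime(2024, 1, 2, 22, 10), datetime(2024, 1, 2, 23, 20)),
]


def mean(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


# MTTA: average of (acknowledged - triggered); here (4 min + 10 min) / 2.
mtta = mean([ack - trig for trig, ack, _ in incidents])
# MTTR: average of (resolved - triggered); here (40 min + 80 min) / 2.
mttr = mean([res - trig for trig, _, res in incidents])

print(f"MTTA: {mtta}, MTTR: {mttr}")
```

Any minutes a responder spends hunting for the right document sit inside the resolved-minus-triggered interval, which is why they show up directly in MTTR.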
Manual Runbooks contain step-by-step instructions to be followed by an operator. They could include commands to be executed in the event of a certain condition.
Semi-automated Runbooks comprise a combination of manual and automated steps which can be executed directly from Squadcast. While the steps and/or commands may be documented, operators might have to manually execute them.
Fully automated Runbooks require no manual intervention and are automatically executed based on preset conditions. They are part of workflows that get executed on the occurrence of a particular event. For example, automatically restart the server if CPU usage spikes to 100%.
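The three kinds of Runbook described above differ mainly in who triggers the action. The sketch below models that difference for the CPU example; it is an illustration under stated assumptions, not how Squadcast implements it. The `Mode` labels, `run_runbook`, `restart_server`, and the `approve` callback are hypothetical names, and the middle mode is modeled as "automation waits for a human to approve", which is one possible reading of a mixed manual and automated flow.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    MANUAL = "manual"            # operator reads and performs every step
    SEMI_AUTOMATED = "semi"      # automation runs only after a human approves
    AUTOMATED = "automated"      # runs as soon as the preset condition holds


def run_runbook(
    mode: Mode,
    cpu_percent: float,
    action: Callable[[], str],
    approve: Callable[[], bool] = lambda: False,
) -> str:
    """Decide what happens for one CPU reading under the given mode."""
    if cpu_percent < 100.0:
        return "no action: condition not met"
    if mode is Mode.MANUAL:
        return "show steps to operator"
    if mode is Mode.SEMI_AUTOMATED and not approve():
        return "waiting for operator approval"
    return action()


def restart_server() -> str:
    # Placeholder action: real code would call your own orchestration tooling.
    return "server restarted"


print(run_runbook(Mode.MANUAL, 100.0, restart_server))
print(run_runbook(Mode.SEMI_AUTOMATED, 100.0, restart_server, approve=lambda: True))
print(run_runbook(Mode.AUTOMATED, 100.0, restart_server))
```

The same action function is reused in every mode; only the gate in front of it changes. That makes it straightforward to start with a manual Runbook and promote it to automation once the team trusts the steps.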
Now that we’ve established what a Runbook does, let’s run through some use cases and understand how a Runbook comes to the rescue of response teams.
When an incident occurs, the response team is expected to quickly spring into action and restore affected services. But how are they to know what steps to take, especially if they are inexperienced and new to the team?
Even if the plan of action is as simple as having to restart a service, successful SRE teams document the necessary steps or prepare guidelines for the course of action. These could include the commands that need to be executed, names of senior personnel that need to be informed, best practices, etc.
This is one such use case of leveraging Runbooks.
Performing maintenance and upgrades, taking daily backups, applying patches, and updating the respective teams about scheduled downtimes are tasks to be performed routinely. Since such tasks require the same commands to be executed and the same course of action to be followed, response teams are better off documenting and templatizing these steps and executing them when necessary.
This is a better alternative to researching the commands every single time, since unfamiliar commands can behave unexpectedly and potentially cause downtime. By using Runbooks to store the commands and procedures to be followed during maintenance, response teams can complete the tasks quickly and prevent unexpected behavior from the system.
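As a small illustration of what templatizing a routine command can look like, the sketch below stores a backup command once and fills in the variable parts each time. The command, paths, and the `render_backup_command` helper are hypothetical; only Python’s standard `string.Template` and `shlex.quote` are used. The function only builds the command string and does not run it.

```python
import shlex
from string import Template

# Hypothetical maintenance template; adapt the command to your own environment.
BACKUP_TEMPLATE = Template("tar -czf $dest/$name-$date.tar.gz $source")


def render_backup_command(name: str, source: str, dest: str, date: str) -> str:
    """Fill in the template, shell-quoting every value that is substituted."""
    return BACKUP_TEMPLATE.substitute(
        name=shlex.quote(name),
        source=shlex.quote(source),
        dest=shlex.quote(dest),
        date=shlex.quote(date),
    )


command = render_backup_command(
    name="db", source="/var/lib/db", dest="/backups", date="2024-01-01"
)
print(command)
```

`Template.substitute` raises `KeyError` when a placeholder has no value, so a half-filled command fails loudly instead of being pasted into a terminal with a gap in it.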
At times when senior devs are on leave and not reachable for providing guidance, junior devs will most likely undergo on-call stress. Instead of expecting junior devs to figure out remediation steps on their own, a better practice is to document response actions for common incidents and leverage that as a starting point for resolution.
This way, the stress on junior devs is alleviated, and unnecessary escalations to senior devs, for issues major and minor, are avoided.
One of the core tenets of SRE is automation. Some of the key objectives for an SRE are:
Be it a sequence of steps to be performed, or a set of commands to be executed, Runbooks make all of this possible. Either by setting up Runbooks to execute manually or by enabling human-in-the-loop automation, SREs can gain a ton of value by leveraging Runbooks.
Runbooks help Developers & SREs automate toil and give On-call teams access to templatized actions. The bigger the organization, the more important it becomes to implement Runbooks.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like Runbooks to eliminate toil.