How do you structure a runbook?

Since a Runbook is a step-by-step guide to performing complex tasks and processes, it should be structured around five qualities: 1) Actionable, 2) Accessible, 3) Accurate, 4) Authoritative, 5) Adaptable.

How do you test a runbook?

The best way to test a Runbook is to reproduce, in a simulation or test environment, the conditions under which it would be executed in production. The steps defined in the Runbook are then executed one after the other in that controlled environment.
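As a minimal sketch of this idea, the test below reproduces a failure condition in a simulated environment and runs the Runbook steps in order. The environment dict, the step names, and the 90% disk threshold are all hypothetical and only stand in for a real test system.

```python
from typing import Callable, Dict, List

# Hypothetical simulated environment: a plain dict standing in for a test system.
Env = Dict[str, object]
Step = Callable[[Env], None]


def clear_temp_files(env: Env) -> None:
    """Step 1 (simulated): free up disk space."""
    env["disk_used_pct"] = 40


def restart_service(env: Env) -> None:
    """Step 2 (simulated): bring the service back up."""
    env["service_running"] = True


def run_runbook(steps: List[Step], env: Env) -> List[str]:
    """Execute the steps one after the other and record the order they ran in."""
    executed = []
    for step in steps:
        step(env)
        executed.append(step.__name__)
    return executed


# Reproduce the condition the Runbook is written for: full disk, service down.
test_env: Env = {"disk_used_pct": 98, "service_running": False}
executed = run_runbook([clear_temp_files, restart_service], test_env)

# The test passes only if the simulated system ends up healthy.
assert test_env["service_running"] is True
assert test_env["disk_used_pct"] < 90
```

In a real setup the steps would act on a staging system rather than a dict, but the shape of the test stays the same: set up the condition, run the steps, check the outcome.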

What do you put in a runbook?

A Runbook should contain the steps that need to be taken, or the commands that need to be executed, one after the other.

What is a hybrid runbook?

A hybrid Runbook, also referred to as a semi-automated Runbook, is one that is partially automated and partially manual. While a part of it can be executed automatically, a part of it has to be triggered by manual human intervention.
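One way to picture a hybrid Runbook is a list of steps where some are flagged as needing operator approval. The sketch below is an assumption about how such a Runbook might be modelled; `RunbookStep`, `run_hybrid`, and the `approve` callback are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunbookStep:
    name: str
    action: Callable[[], str]
    needs_approval: bool = False  # True marks a step a human must trigger


def run_hybrid(steps: List[RunbookStep], approve: Callable[[str], bool]) -> List[str]:
    """Run automated steps directly; pause at manual steps until approved."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append(f"{step.name}: stopped, awaiting operator")
            break
        log.append(f"{step.name}: {step.action()}")
    return log


steps = [
    RunbookStep("collect_logs", lambda: "done"),
    RunbookStep("failover_database", lambda: "done", needs_approval=True),
]

# Here the operator has not approved yet, so the run stops before the failover.
log = run_hybrid(steps, approve=lambda name: False)
```

In practice the `approve` callback could prompt an operator in a terminal or wait for a button click in an incident tool; it is kept as a plain function here so the sketch stays non-interactive.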

What is a Webhook in a runbook?

A Runbook is a pre-documented set of tasks and actions to be executed when a particular event occurs, while a Webhook is a means of executing those commands or actions in an external service via a series of HTTP requests. Using a Runbook and a Webhook in combination can therefore help you automate repeatable tasks and actions.
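To make the HTTP part concrete, here is a small sketch that builds the kind of POST request a Runbook step might send to an external service. The URL, the incident ID, and the payload fields are placeholders chosen for illustration; a real service defines its own endpoint and payload format.

```python
import json
import urllib.request


def build_webhook_request(url: str, incident_id: str, action: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing a Runbook action."""
    payload = json.dumps({"incident_id": incident_id, "action": action}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and values, for illustration only.
request = build_webhook_request(
    "https://example.com/hooks/runbook", "INC-123", "restart_service"
)

# Sending is left commented out so the sketch has no side effects:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```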

What is a runbook in DevOps?

Runbook is a concept in DevOps for automating routine tasks such as scheduled maintenance, routine system updates, and responses to recurring system alerts and outages. When pre-defined conditions are met, a sequential set of steps can be executed automatically, semi-automatically, or manually. This set of steps is called a Runbook.

What is a Runbook? Importance of Runbook | Site Reliability Engineering

Need for Runbooks

Imagine being an Ops engineer in a team just struck by tragedy. [sigh…]

Alarms start ringing, and incident response is in full force. It may sound like the situation is in control.

WRONG!

There’s panic everywhere. The on-call team is scrambling for the heavenly door to redemption. But the one thing that doesn’t stop: stakeholder inquiries.

This situation is bad. But it could be worse.

Now imagine being a less-experienced Ops engineer in a relatively small on-call team struck by tragedy. If you don’t have sufficient guidance, let alone moral support, you’re toast.

If being ‘on-call during an incident’ is a battle, then ensuring ‘things don’t go as badly the next time’ is a war. And the formal term in IT for this ‘war’ is Site Reliability Engineering (SRE).

The ability to do good SRE hinges on the ability of the response team to be ‘prepared’. This is exactly what Incident Response Runbooks can help SRE / response teams with. Runbooks can help teams be prepared…for just about anything.

What are Runbooks?

GitLab has defined it best.

Runbooks are a collection of documented procedures that explain how to carry out a particular process, be it starting, stopping, debugging, or troubleshooting a particular system.

Here’s another apt definition, specific to incident response -

A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. It doesn’t necessarily need to be a critical incident.

A Runbook can also be used to document the standard procedures for maintenance and upgrades.

The school-boy error that most teams commit whilst documenting such procedural actions is storing them in Google Docs/ Notion/ Confluence/ other tools. Or worse, storing them in physical books.

Valuable time is lost looking for the right document instead of fixing the issue, and things go from bad to worse. The problem lies not in the document itself, but in ‘where it’s documented’.

Hence it is recommended to store such procedural documents in a centralized location that is accessible by the entire incident response team: within their Incident Response tool.

Incident response teams can thus leverage Squadcast’s Runbooks for this purpose. Teams can store checklists of tasks that need to be performed manually for an incident.

These checklists can be ‘steps to be performed’ in the event of either a SEV-1 incident, or during routine checks/ maintenance/ upgrades, or they can be technical/ functional instructions to debug and fix a certain issue, along with the code that needs to be manually executed.

Leveraging Runbooks is thus a means to reduce MTTA (mean time to acknowledge) and MTTR (mean time to resolve) by avoiding delays spent scrambling for external documents. Instead, you can use Runbooks to store procedural steps, associate them with incidents, and assign tasks to relevant users.
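For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them: MTTA as the average time from alert to acknowledgement, and MTTR as the average time from alert to resolution. Exact definitions vary between teams, and the incident timings here are made-up values used purely to illustrate the arithmetic.

```python
from statistics import mean
from typing import List, Tuple

# (triggered, acknowledged, resolved) in minutes since the alert fired.
# These three records are invented for illustration, not real measurements.
incidents: List[Tuple[int, int, int]] = [(0, 4, 30), (0, 6, 50), (0, 2, 40)]


def mtta(records: List[Tuple[int, int, int]]) -> float:
    """Mean time to acknowledge: average of (acknowledged - triggered)."""
    return mean(ack - start for start, ack, _ in records)


def mttr(records: List[Tuple[int, int, int]]) -> float:
    """Mean time to resolve: average of (resolved - triggered)."""
    return mean(done - start for start, _, done in records)


print(f"MTTA: {mtta(incidents)} min, MTTR: {mttr(incidents)} min")
```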

Types of Runbooks

This is a good segue into understanding the different types of Runbooks.

1. Manual Runbooks (interchangeably referred to as Playbooks)

These are Runbooks that contain step-by-step instructions to be followed by an operator. They could include commands to be executed in the event of a certain condition.

2. Executable Runbooks

These are Runbooks that comprise a combination of manual and automated steps which can be executed directly from Squadcast. While the steps and/ or commands may be documented, operators might have to manually execute them.

3. Fully-automated Runbooks

These are Runbooks that require no manual intervention and are executed automatically based on preset conditions. They are part of workflows that get executed on the occurrence of a particular event. For example, a Runbook could automatically restart the server if CPU usage spikes to 100%.
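The CPU example can be sketched as a simple rule: a function that receives CPU samples and calls a restart action once the threshold is reached. The names `make_cpu_rule` and `web-01` are hypothetical, and the restart is simulated by appending to a list rather than touching a real server.

```python
from typing import Callable, List


def make_cpu_rule(threshold_pct: float, restart: Callable[[], None]) -> Callable[[float], bool]:
    """Return a rule that triggers `restart` when a CPU sample hits the threshold."""

    def on_sample(cpu_pct: float) -> bool:
        if cpu_pct >= threshold_pct:
            restart()
            return True
        return False

    return on_sample


# Simulated restart action: record which server would have been restarted.
restarts: List[str] = []
rule = make_cpu_rule(100.0, lambda: restarts.append("web-01"))

# Feed in a few CPU readings; only the last one meets the preset condition.
fired = [rule(sample) for sample in (35.0, 72.5, 100.0)]
```

A production rule would usually require the condition to hold for some time before acting, so that a single momentary spike does not cause a restart.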

How to Create a Runbook - Steps

Get a full understanding of your system’s architecture. Identify all processes, configurations and dependencies.
Brainstorm the most common issues that come up. What problems do you see people running into again and again? What kind of information does it take to resolve them?
Create a flowchart or diagram of the steps involved in resolving each issue from start to finish, that is, from when someone first encounters the problem until they've resolved it and gotten back to work. Add the details of key personnel (such as an Incident lead) who can help in keeping systems and processes running.
Before you deploy your runbooks, make sure that they have been thoroughly tested. Keep them in a place where everyone who needs them will be able to find them easily. Review them periodically to make sure they are up-to-date.
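The periodic review in the last step is easy to forget, so some teams may want to flag Runbooks that have not been reviewed recently. The sketch below assumes each Runbook has a recorded last-review date; the Runbook names, the dates, and the 90-day window are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_runbooks(last_reviewed: Dict[str, date], today: date, max_age_days: int = 90) -> List[str]:
    """Return the names of Runbooks whose last review is older than the allowed age."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review log.
reviews = {
    "restart-api": date(2024, 1, 10),
    "rotate-certs": date(2024, 5, 20),
}

needs_review = stale_runbooks(reviews, today=date(2024, 6, 1))
```

Passing `today` in explicitly keeps the check deterministic and easy to test.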

What are operational Runbooks?

Operational runbooks are created and maintained by the operations teams and provide a standardized approach to managing complex systems. They may include steps for deploying new software updates, monitoring system performance, responding to security incidents, and resolving service interruptions.

What is Runbook automation?

Runbook Automation tools automate the repetitive tasks that often take up your time and energy, such as provisioning new servers.

Automated Runbooks can be used in conjunction with existing tools, scripts, APIs, or manual commands; they provide a way for you to create workflows that span these different tools.
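As a rough illustration of a workflow that chains existing commands, the sketch below runs a list of commands in order and stops at the first failure. The two commands are stand-ins that only print a message; in a real Runbook they would be your own scripts or CLI tools.

```python
import subprocess
import sys
from typing import List


def run_commands(commands: List[List[str]]) -> List[str]:
    """Run each command in order; check=True raises an error on the first failure."""
    outputs = []
    for argv in commands:
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs


# Stand-in commands: each just prints a line using the current Python interpreter.
outputs = run_commands([
    [sys.executable, "-c", "print('backup complete')"],
    [sys.executable, "-c", "print('patch applied')"],
])
```

Stopping on the first failure is a deliberate choice here: later steps in a maintenance workflow often assume the earlier ones succeeded.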

Runbooks - Use cases

Now that we’ve established what a Runbook does, let’s run through some use cases and understand how a Runbook comes to the rescue of response teams.

A knowledgebase to bounce back from incidents

When an incident occurs, the response team is expected to quickly spring into action and restore affected services. But how are they to know what steps to take? Especially if they are inexperienced and new to the team.

Even if the plan of action is as simple as having to restart a service, successful SRE teams document the necessary steps or prepare guidelines for the course of action. These could include the commands that need to be executed, names of senior personnel that need to be informed, best practices, etc.

This is one such use case of leveraging Incident Response Runbooks.

Documenting standard procedures during maintenance/ upgrades

Performing maintenance/ upgrades, daily backups, applying patches, updating the respective teams of scheduled downtimes, etc., are tasks to be performed routinely. Since such tasks require the same commands to be executed and the same course of action to be followed, response teams are better off documenting/ templatizing these steps and executing them when necessary.

This is a better alternative to researching the commands every single time, since unknown or unfamiliar commands can behave unexpectedly and potentially cause downtime. By using Runbooks to store commands and procedures to be followed during maintenance, response teams can complete the tasks quickly, as well as prevent any unexpected behavior from the system.

Preventing unnecessary Escalations

At times when senior devs are on leave and not reachable for providing guidance, junior devs will most likely undergo on-call stress. Instead of expecting junior devs to figure out remediation steps on their own, a better practice is to document response actions for common incidents and leverage that as a starting point for resolution.

This way, the stress levels of junior devs can be alleviated, and unnecessary escalations to senior devs, for issues major and minor, can be prevented.

Automation for better SRE

One of the core tenets of SRE is automation. Some of the key objectives for an SRE are:

Codifying actions into executable code
Setting up automated checklists that improve the speed of diagnosing and resolving incidents
Reducing toil while reviewing audit logs
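An automated diagnostic checklist, for instance, could be modelled as a set of named checks that each return pass or fail. This is only a sketch: the metric names, values, and thresholds below are hypothetical, and real checks would query a monitoring system instead of a dict.

```python
from typing import Callable, Dict


def run_checklist(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every check and report 'pass' or 'fail' for each one by name."""
    return {name: ("pass" if check() else "fail") for name, check in checks.items()}


# Hypothetical metrics snapshot standing in for a monitoring query.
metrics = {"disk_used_pct": 95, "open_connections": 120}

report = run_checklist({
    "disk below 90%": lambda: metrics["disk_used_pct"] < 90,
    "connections below 500": lambda: metrics["open_connections"] < 500,
})

# The failing checks tell the responder where to look first.
failing = [name for name, status in report.items() if status == "fail"]
```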

Be it a sequence of steps to be performed, or a set of commands to be executed, Runbooks make all of this possible. Either by setting up Runbooks to execute manually or by enabling human-in-the-loop automation, SREs can gain a ton of value by leveraging Runbooks.

Conclusion

Runbooks help Developers & SREs automate toil and give On-Call teams access to templatized actions. The bigger the organization, the more important it becomes to implement Runbooks.
