Fundamentally, a runbook is a set of instructions that — when followed precisely — result in a system producing a specific outcome or reaching a desired state. For example, a runbook can define a process to restore a network device to a working state.
As modern IT infrastructures continue to grow in complexity and scale, triaging potential incidents becomes more and more time-consuming. Runbooks help reduce mean time to resolve (MTTR) by providing engineers with a proven recovery path, and automation helps scale the benefits. A platform-agnostic runbook template provides process stability and reliability, and an automation strategy can provide the confidence and repeatability needed to recover quickly.
This article takes a deep dive into runbook templates to help you bring some order to the chaos of a potential disaster scenario.
Runbook template basics
There are some basic details you should include in a well-structured runbook template. The goal is to be complete but concise. That means a runbook should provide readers with all the detail and context needed to complete the task but not overwhelm them or overload the document with unnecessary details that can become confusing.
The table below details the key components of a quality runbook template.
Triggering a runbook
The first iteration of a runbook, or even the first few iterations, will likely be triggered manually. For example, tasks to recover a website that has crashed or offboard an employee for HR should be performed by a human before being automated.
As the process improves, the runbook may be triggered via an API or ticketing system. Cloud monitoring solutions like AWS CloudWatch are great examples of services that can detect issues in a production system, highlight them using interactive graphs, and even trigger an automated response. As the runbook evolves, the automated response can start to handle some of the responsibilities of the engineer in charge of executing the runbook, and may eventually automate the process in its entirety.
Of course, a monitoring solution can be separate from a particular technology or provider. Custom monitoring solutions require more effort but can be just as effective. These solutions can be as basic as hooking up an out-of-the-box graphing solution like Grafana to a MySQL database or as complex as a Python script that builds an entire secondary region architecture and tweets when it is complete.
A runbook example
As an example, we’ll use the case of an employee whose contract has been terminated for misconduct. The company has outlined the steps IT should take once they receive the email notifying them of the termination. This set of steps is essentially a runbook. The job of the IT team is to document this process and provide instructions clear enough to empower a repeatable and reliable result.
- Disable their user account for the internal system - The former employee has access to the internal sales and marketing system, and their credentials should be expired and/or account deleted so they can no longer access confidential information.
- Disable their GitHub account - The former employee is part of the company's GitHub organization. They should be removed from the organization as soon as possible so they can no longer access intellectual property like source code.
- Disable their AWS keys - The former employee has access to the AWS system as they required database access from time to time. Their AWS keys should be revoked so that they can no longer access the AWS infrastructure.
- Download their usage logs from both AWS and the internal system - To ensure no malicious actions were carried out, the company would like insight into what the employee was doing in their final days. This includes AWS CloudTrail logs, which show their activity on AWS, and the activity logs from the internal system, which show what data they accessed or modified before leaving.
- Store their usage logs in S3 - The data gathered in the previous step should be stored in S3 so it can be easily reviewed and any findings from the review can be validated at a later date.
- Investigate/audit usage logs - Finally, once the data has been stored in S3, it should be reviewed for malicious or suspicious activity. This could include accessing or modifying resources not usually associated with the employee’s role. Even unusual log-in times could indicate suspicious activity.
Given these requirements, the IT team is charged with going through the process in detail and documenting the actions required to accomplish each objective. Their deliverable is a well-documented, minimal set of easily reproducible steps to be added to the Task Details section of the runbook.
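One way to keep the Task Details section minimal and reproducible is to hold the steps as structured data and render the checklist from it. The sketch below is only an illustration: the class names, field names, and the validation wording are hypothetical and are not part of any standard runbook format.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookStep:
    """One entry in the Task Details section (field names are hypothetical)."""
    title: str
    action: str      # what the operator does
    validation: str  # how the operator confirms it worked (illustrative wording)


@dataclass
class Runbook:
    name: str
    trigger: str
    steps: list[RunbookStep] = field(default_factory=list)

    def render(self) -> str:
        """Render the runbook as a numbered plain-text checklist."""
        lines = [f"Runbook: {self.name}", f"Trigger: {self.trigger}"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{number}. {step.title}")
            lines.append(f"   Action: {step.action}")
            lines.append(f"   Validate: {step.validation}")
        return "\n".join(lines)


offboarding = Runbook(
    name="Offboard terminated employee",
    trigger="Termination email received by IT",
    steps=[
        RunbookStep(
            "Disable internal system account",
            "Expire the credentials or delete the account",
            "A login attempt with the old credentials is rejected",
        ),
        RunbookStep(
            "Disable GitHub account",
            "Remove the user from the company organization",
            "The user no longer appears in the organization member list",
        ),
        RunbookStep(
            "Disable AWS keys",
            "Revoke the user's AWS access keys",
            "No active access keys remain for the user",
        ),
    ],
)

print(offboarding.render())
```

Keeping a validation note next to each action makes it harder to document a step without also documenting how to confirm it worked.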
Automating the runbook
This simple example of a runbook requirement may seem trivial, but even a small mistake in executing the actions could lead to disastrous results. And there’s a reason the phrase “I’m only human” is so common. Humans make mistakes. That's an inevitability that should be taken into account when creating runbook steps. Screenshots or diagrams to go with complex instructions can help, but automating the task is ultimately the most reliable way to ensure a predictable result.
Let's go step-by-step using the example above and see how such a process could be automated using a script or set of scripts. The task details would then become a lot simpler and point the reader to the script(s) to run, explain how to run them, and advise how to validate their success.
- Disable their user account for the internal system - Most modern web applications expose a REST API that simple scripts can invoke to trigger most of the actions available in the frontend user interface, and sometimes more. Our automated solution starts with an API call that disables the user account.
- Disable their GitHub account - GitHub is an example of a web application with such an API. Similar to step one, our script can make a call to the GitHub API to remove the user from the company organization.
- Disable their AWS keys - Automation is a large part of the AWS ecosystem. AWS provides an API that can be used through software development kits (SDKs) in several languages, as well as a command line interface (CLI) that can perform almost any action offered by the various AWS services. We can use an SDK or the CLI in our script to revoke the user’s keys programmatically.
- Download their usage logs from both AWS and the internal system - This step is simply two more API calls. First, we query AWS CloudTrail for the user’s AWS activity. Then we call our internal system’s API to download any relevant user activity.
- Store their usage logs in S3 - Again, this is a simple API call. Both the AWS SDK and the AWS CLI can copy files to and from Amazon Simple Storage Service (S3).
- Investigate/audit usage logs - This can be done in a variety of different ways, but a simple script that searches the logs for certain words or patterns can quickly detect unusual activity. As the runbook evolves, this script may also evolve and do things such as link into custom machine learning services that can learn and detect suspicious patterns.
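A few of the steps above can be sketched in Python as follows. This is a minimal sketch under several assumptions, not a finished offboarding tool. The `iam_client` and `s3_client` parameters are assumed to be boto3 clients such as `boto3.client("iam")` and `boto3.client("s3")`. The organization name, username, bucket, and token are placeholders. The watched event-name pattern and the working hours are hypothetical audit rules. The internal system's API is left out because its shape is unknown.

```python
import json
import re
import urllib.request

# Hypothetical audit rule: event names a reviewer may want to look at first.
WATCHED_EVENT_NAMES = re.compile(r"^(Delete|Put|Create)")


def build_github_removal_request(org, username, token):
    """Build (but do not send) the GitHub REST request that removes an org member."""
    return urllib.request.Request(
        url=f"https://api.github.com/orgs/{org}/members/{username}",
        method="DELETE",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def deactivate_access_keys(iam_client, username):
    """Mark each of the user's access keys Inactive and return the key IDs.

    iam_client is assumed to be a boto3 IAM client.
    """
    key_ids = []
    response = iam_client.list_access_keys(UserName=username)
    for key in response["AccessKeyMetadata"]:
        iam_client.update_access_key(
            UserName=username, AccessKeyId=key["AccessKeyId"], Status="Inactive"
        )
        key_ids.append(key["AccessKeyId"])
    return key_ids


def archive_events(s3_client, bucket, key, events):
    """Store the collected events in S3 as JSON (s3_client: assumed boto3 S3 client)."""
    # default=str turns datetime values such as EventTime into strings.
    body = json.dumps(events, default=str).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


def flag_suspicious(events, work_hours=range(8, 19)):
    """Return events with a watched name or a timestamp outside working hours.

    Each event is assumed to be a dict with "EventName" (str) and
    "EventTime" (datetime), as in CloudTrail lookup_events results.
    """
    flagged = []
    for event in events:
        watched = WATCHED_EVENT_NAMES.match(event["EventName"]) is not None
        off_hours = event["EventTime"].hour not in work_hours
        if watched or off_hours:
            flagged.append(event)
    return flagged
```

The request builder returns the request without sending it, so the script can log or review the call first; passing the result to `urllib.request.urlopen` would perform the removal. Taking the AWS clients as parameters also lets the functions be exercised against stand-in objects before they touch a real account.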
Recommendations for designing a runbook template
Of course, our runbook is not perfect and may take time to reach an ideal state. In fact, it may never reach a final state and simply continue to adapt and evolve. Below are some runbook template recommendations that can help you get the most out of your runbooks.
Don’t try to automate everything on day one
Attempting to script every step from day one can lead to confusion and even mistakes. It’s important to perform the task manually at least once to fully understand and explain the process being automated.
A picture truly paints a thousand words. Use screenshots and diagrams so that a reader can follow along with the process and be confident that everything is executing as expected.
Remember to validate
Once the runbook has been followed, you should validate that the system is in the desired state. In some cases, this can be a single check. In other cases, validation may be necessary on a step-by-step basis. Validation steps should be included with the runbook steps.
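Step-by-step validation can be sketched as a small runner that pairs every action with a check and stops at the first check that fails. The `Step` type, the `run_with_validation` function, and the in-memory `state` dictionary are hypothetical names used only for illustration; in a real runbook the actions and checks would call the systems involved.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    action: Callable[[], None]  # performs the runbook step
    check: Callable[[], bool]   # returns True when the step succeeded


def run_with_validation(steps: list[Step]) -> list[str]:
    """Run each step, validate it, and stop at the first failed check."""
    completed = []
    for step in steps:
        step.action()
        if not step.check():
            raise RuntimeError(f"Validation failed after step: {step.name}")
        completed.append(step.name)
    return completed


# Stand-in for real systems: a dictionary that records what has been disabled.
state = {"account_active": True, "keys_active": True}

steps = [
    Step(
        "Disable user account",
        lambda: state.update(account_active=False),
        lambda: state["account_active"] is False,
    ),
    Step(
        "Disable AWS keys",
        lambda: state.update(keys_active=False),
        lambda: state["keys_active"] is False,
    ),
]

print(run_with_validation(steps))
```

Stopping at the first failed check keeps later steps from running against a system that is not in the expected state.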
Know how much automation is too much
Think about the consequences of automating the runbook. Sometimes it may be wise to require some manual intervention, even if it's just to trigger the automation process.
For example, a temporary network blip may trigger a response to spin up a production infrastructure in a secondary region and switch all production traffic to this region. In a time-consuming, expensive, and customer-impacting case like this, it may make sense for a human to first decide if the blip is cause for concern or if they are happy that service has returned to normal in an acceptable time frame.
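The idea of keeping a human in the loop can be sketched as a gate in front of the expensive action. The function below is a hypothetical illustration: `should_fail_over`, its parameters, and the threshold of three failed checks are assumptions, not a recommended policy.

```python
def should_fail_over(health_checks, required_failures, approved_by_human):
    """Allow failover only after sustained failures and explicit human approval.

    health_checks is a list of booleans, oldest first, where True means the
    check passed. A short blip will not fill the window with failures.
    """
    if required_failures < 1:
        raise ValueError("required_failures must be at least 1")
    recent = health_checks[-required_failures:]
    sustained = len(recent) == required_failures and not any(recent)
    return sustained and approved_by_human


# A single failed check surrounded by successes is treated as a blip.
print(should_fail_over([True, False, True], 3, approved_by_human=True))

# Three consecutive failures still wait for a person to approve.
print(should_fail_over([True, False, False, False], 3, approved_by_human=False))
print(should_fail_over([True, False, False, False], 3, approved_by_human=True))
```

Automation still does the detection here. The human decision is reduced to a single approval, which matches the idea of manual intervention that only triggers the automated process.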
Runbooks are invaluable to a growing enterprise. As a solution grows, it is inevitable that things will sometimes go wrong. Using a quality runbook template can bring order to the chaos of solution engineering. By following a familiar structure, the runbook reader can put aside the stress of reinventing the wheel, overengineering a solution, or preparing a business-ready document. With a runbook in place, all they need to do is follow the steps.
Further, a structured format opens up the possibility of process automation. As recommended steps become more refined and reliable, they become easier to automate, either via third-party solutions or custom scripts. Some companies have begun to realize the importance of establishing this structure quickly and have built runbook solutions to provide out-of-the-box runbook templates that can save organizations months of trial and error.