Chapter 5:

Runbook Template: Best Practices & Example

December 27, 2022
15 min

Fundamentally, a runbook is a set of instructions that, when followed precisely, result in a system producing a specific outcome or reaching a desired state. For example, a runbook can define a process to restore a network device to a working state.

As modern IT infrastructures continue to grow in complexity and scale, triaging potential incidents becomes more and more time-consuming. Runbooks help reduce mean time to resolve (MTTR) by providing engineers with a proven recovery path, and automation helps scale the benefits. A platform-agnostic runbook template provides process stability and reliability, and an automation strategy can provide the confidence and repeatability needed to recover quickly.

This article takes a deep dive into runbook templates and helps you bring some order to the chaos of a potential disaster scenario.


Runbook template basics

There are some basic details you should include in a well-structured runbook template. The goal is to be complete but concise. That means a runbook should provide readers with all the detail and context needed to complete the task but not overwhelm them or overload the document with unnecessary details that can become confusing. 

The table below details the key components of a quality runbook template.

Runbook template components

| Runbook component | Description | Example |
| --- | --- | --- |
| Task ID | Usually a reference and a link to the ticket created in the organization's project management system or incident board (Jira, Asana, Trello). It tells the reader where to search for more information and where to log any details pertaining to the runbook's execution. | INC-101 |
| Task Name | A quick description of the task (2 to 3 words). | Employee Offboarding |
| Task Description | A longer description of the task. This doesn’t need to go into too much detail and should not specify how the task should be performed at a technical level. | Employee has been dismissed for misconduct and needs to be removed from all relevant systems. |
| Task Details | Steps required to execute this task. This is the core of the runbook. Each step should be outlined in a simple format: describe the required action, the reason for the action, and, if required, how to validate and/or troubleshoot the step. | Step 1. Power on the machine. Step 2. Input credentials. ... Step n. Power off the machine. |
| Team executing this task | Team responsible for this task. | DevOps |
| Task Owner | Team member responsible for executing the task or coordinating the team. | Alice@example.com |
| Time to complete this task | Particularly useful when performing a task that will affect production systems. Provide an expected value up front and an actual value once the action has been completed. | Estimated time: 10 - 20 minutes; Started: 11/11/22 11:00:00; Completed: 11/11/22 11:11:00 |
| Status | A status gives all stakeholders insight into the issue or task in question. | ASSIGNED, IN_PROGRESS, BLOCKED, or COMPLETE |
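
One way to keep every runbook consistent with these components is to capture the template as a small data structure. The sketch below is a minimal illustration, not a prescribed schema: the class and field names (`Runbook`, `Step`, `estimated_minutes`, and so on) are assumptions chosen to mirror the components listed above, and the sample values are the examples given for each component.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    ASSIGNED = "ASSIGNED"
    IN_PROGRESS = "IN_PROGRESS"
    BLOCKED = "BLOCKED"
    COMPLETE = "COMPLETE"


@dataclass
class Step:
    action: str           # what to do
    reason: str = ""      # why the action is needed
    validation: str = ""  # how to confirm or troubleshoot the step


@dataclass
class Runbook:
    task_id: str
    task_name: str
    task_description: str
    team: str
    owner: str
    estimated_minutes: Tuple[int, int]  # (low, high) estimate
    steps: List[Step] = field(default_factory=list)
    status: Status = Status.ASSIGNED
    started: Optional[datetime] = None
    completed: Optional[datetime] = None

    def actual_minutes(self) -> Optional[float]:
        """Return the elapsed time in minutes, or None if the task is unfinished."""
        if self.started is None or self.completed is None:
            return None
        return (self.completed - self.started).total_seconds() / 60


runbook = Runbook(
    task_id="INC-101",
    task_name="Employee Offboarding",
    task_description="Employee has been dismissed for misconduct.",
    team="DevOps",
    owner="Alice@example.com",
    estimated_minutes=(10, 20),
    steps=[Step("Power on the machine."), Step("Input credentials.")],
    started=datetime(2022, 11, 11, 11, 0, 0),
    completed=datetime(2022, 11, 11, 11, 11, 0),
)
print(runbook.actual_minutes())  # 11.0
```

Recording the estimate and the actual duration side by side makes it easy to spot runbooks whose estimates need revisiting.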

Triggering a runbook

The first iteration, or the first few iterations of a runbook, will likely be triggered by a manual process. For example, tasks to recover a website that has crashed or offboard an employee for HR should be performed by a human before being automated. 

As the process improves, the runbook may be triggered via an API or ticketing system. Cloud monitoring solutions like Amazon CloudWatch are good examples of services that can detect issues in a production system, highlight them using interactive graphs, and even trigger an automated response. As the runbook evolves, the automated response can start to handle some of the responsibilities of the engineer in charge of executing the runbook, and may eventually automate the process in its entirety.
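
The progression from a manual trigger to an automated one can be sketched as a small routing function. This is an illustration under stated assumptions: the alert dictionary and its `name` and `state` keys are placeholders rather than any monitoring product's real payload, the `RUNBOOKS` mapping is hypothetical, and `open_ticket` and `run_automation` stand in for whatever ticketing or automation hooks your organization uses.

```python
from typing import Callable, Dict, Optional

# Hypothetical mapping from alert name to the runbook that handles it.
RUNBOOKS: Dict[str, str] = {"website-down": "INC-101"}


def select_runbook(alert: dict) -> Optional[str]:
    """Return the runbook ID for an active alert, or None if nothing applies."""
    if alert.get("state") != "ALARM":
        return None
    return RUNBOOKS.get(alert.get("name", ""))


def handle_alert(
    alert: dict,
    open_ticket: Callable[[str], None],
    run_automation: Optional[Callable[[str], None]] = None,
) -> str:
    """Route an alert to a human (ticket) or to automation, if it exists yet."""
    runbook_id = select_runbook(alert)
    if runbook_id is None:
        return "ignored"
    if run_automation is None:
        # Early iterations: a human executes the runbook from the ticket.
        open_ticket(runbook_id)
        return "ticket opened"
    # Later iterations: the automated response takes over.
    run_automation(runbook_id)
    return "automation triggered"


tickets = []
print(handle_alert({"name": "website-down", "state": "ALARM"}, tickets.append))
```

Keeping the automation hook optional lets the same entry point serve a runbook at every stage of its evolution.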

Of course, a monitoring solution can be separate from a particular technology or provider. Custom monitoring solutions require more effort but can be just as effective. These solutions can be as basic as hooking up an out-of-the-box graphing solution like Grafana to a MySQL database or as complex as a Python script that builds an entire secondary region architecture and tweets when it is complete.

A runbook example

As an example, we’ll use the case of an employee whose contract has been terminated for misconduct. The company has outlined the steps IT should take once they receive the email notifying them of the termination. This set of steps is essentially a runbook. The job of the IT team is to document this process and provide instructions clear enough to empower a repeatable and reliable result. 

| Runbook component | Value |
| --- | --- |
| Task ID | ACME-INC-108 |
| Task Name | Employee Offboarding - Elmer Fudd |
| Task Description | Employee has been dismissed for misconduct. Any active credentials need to be revoked, users need to be offboarded from all internal systems, and recent activity needs to be reviewed. |
| Task Details | For full details, see the instructions below: disable the user account in the Acme management portal; remove the user from GitHub; revoke AWS keys; download activity logs from the Acme management portal; download activity logs from AWS; store activity logs; audit activity logs. |
| Team executing this task | DevSecOps |
| Task Owner | joe.bloggs@acme.com |
| Time to complete this task | Estimated time: 40 - 60 minutes; Started: 01/11/22 14:20:00; Completed: TBD |
| Status | IN_PROGRESS |

  1. Disable their user account for the internal system - The former employee has access to the internal sales and marketing system, and their credentials should be expired and/or account deleted so they can no longer access confidential information.
  2. Disable their GitHub account - The former employee is part of the company's GitHub organization. They should be removed from the organization as soon as possible so they can no longer access intellectual property like source code.
  3. Disable their AWS keys - The former employee has access to the AWS system as they required database access from time to time. Their AWS keys should be revoked so that they can no longer access the AWS infrastructure.
  4. Download their usage logs from both AWS and the internal system - To ensure no malicious actions were carried out, the company would like insight into what the employee was doing in their final days. This includes AWS CloudTrail logs to gain insight into their activity on AWS and the activity logs from the internal system to gain insight into what data they accessed or modified before leaving.
  5. Store their usage logs in S3 - The data gathered in the previous step should be stored in S3 so it can be easily reviewed and any findings from the review can be validated at a later date.
  6. Investigate/audit usage logs - Finally, once the data has been stored in S3, it should be reviewed for malicious or suspicious activity. This could include accessing or modifying resources not usually associated with the employee’s role, or logging in at unusual times.

Given these requirements, the IT team is charged with going through the process in detail and documenting the actions required to accomplish each objective. Their deliverable is a well-documented, minimal set of easily reproducible steps to be added to the Task Details section of the runbook. 

Automating the runbook 

This simple example of a runbook requirement may seem trivial, but even a small mistake in executing the actions could lead to disastrous results. And there’s a reason the phrase “I’m only human” is so common. Humans make mistakes. That's an inevitability that should be taken into account when creating runbook steps. Screenshots or diagrams to go with complex instructions can help, but automating the task is ultimately the most reliable way to ensure a predictable result.

Let's go step-by-step using the example above and see how such a process could be automated using a script or set of scripts. The task details would then become a lot simpler and point the reader to the script(s) to run, explain how to run them, and advise how to validate their success.

  1. Disable their user account for the internal system - Most modern web applications expose a REST API that simple scripts can call to trigger most of the actions available through the frontend user interface, and potentially more. The start of our automated solution would involve a call to an API to disable the user account.
  2. Disable their GitHub account - GitHub is an example of a web application with such an API. Similar to step one, our script can make a call to the GitHub API to remove the user from the company organization. 
  3. Disable their AWS keys - Automated solutions are a huge part of the AWS ecosystem, and to empower its users it provides an API that can be interacted with using software development kits (SDKs) written in multiple different languages, as well as a command line interface (CLI) that can perform almost any action offered by the various AWS services. We can use the API or CLI in our script to revoke the user’s keys programmatically.
  4. Download their usage logs from both AWS and the internal system - This step is simply two more API calls. First, we can invoke AWS CloudTrail to download the AWS user logs. Then we can invoke our internal system's API to download any relevant user activity.
  5. Store their usage logs in S3 - Again, a simple API call. The AWS SDK and the AWS CLI allow you to copy files to and from Amazon Simple Storage Service (S3).
  6. Investigate/audit usage logs - This can be done in a variety of different ways, but a simple script that searches the logs for certain words or patterns can quickly detect unusual activity. As the runbook evolves, this script may also evolve and do things such as link into custom machine learning services that can learn and detect suspicious patterns.
[Figure: A script that can automate our example runbook.]
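
As a rough sketch of how the six steps could hang together in code, the example below builds the GitHub removal request, scans logs for suspicious patterns, and runs the steps in order with a dry-run mode. It is an illustration under assumptions, not the script used by any particular team: `ScriptStep`, `run_offboarding`, `audit_logs`, and the `SUSPICIOUS` patterns are hypothetical names and values, and the internal-system and AWS steps are left as placeholders to be filled in with your own API or SDK calls. The GitHub request targets the REST endpoint for removing an organization member and is only constructed here, never sent.

```python
import re
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

GITHUB_API = "https://api.github.com"


def build_github_removal(org: str, username: str, token: str) -> urllib.request.Request:
    """Step 2: build (but do not send) the request that removes an org member."""
    return urllib.request.Request(
        f"{GITHUB_API}/orgs/{org}/members/{username}",
        method="DELETE",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical patterns; a real audit would use rules agreed with security.
SUSPICIOUS = [re.compile(p, re.IGNORECASE) for p in (r"delete", r"export", r"createaccesskey")]


def audit_logs(lines: Iterable[str]) -> List[str]:
    """Step 6: return the log lines that match any suspicious pattern."""
    return [line for line in lines if any(p.search(line) for p in SUSPICIOUS)]


@dataclass
class ScriptStep:
    name: str
    action: Callable[[], object]


def run_offboarding(steps: Iterable[ScriptStep], dry_run: bool = True) -> List[Tuple[str, str]]:
    """Run steps in order and stop at the first failure.

    In dry-run mode no action is invoked; each step is only reported.
    """
    results: List[Tuple[str, str]] = []
    for step in steps:
        if dry_run:
            results.append((step.name, "SKIPPED (dry run)"))
            continue
        try:
            step.action()
            results.append((step.name, "OK"))
        except Exception as exc:
            results.append((step.name, f"FAILED: {exc}"))
            break
    return results


# Placeholder actions: swap in calls to the internal API, GitHub, and AWS.
steps = [
    ScriptStep("Disable internal account", lambda: None),
    ScriptStep("Remove from GitHub", lambda: build_github_removal("acme", "efudd", "TOKEN")),
    ScriptStep("Revoke AWS keys", lambda: None),
    ScriptStep("Download logs", lambda: None),
    ScriptStep("Store logs in S3", lambda: None),
    ScriptStep("Audit logs", lambda: audit_logs(["DeleteBucket backups"])),
]
for name, outcome in run_offboarding(steps):
    print(f"{name}: {outcome}")
```

Defaulting to a dry run means an engineer has to opt in to making changes, and stopping at the first failure keeps a half-completed offboarding visible instead of silently continuing.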

Recommendations for designing a runbook template

Of course, our runbook is not perfect and may take time to reach an ideal state. In fact, it may never reach a final state and simply continue to adapt and evolve. Below are some runbook template recommendations that can help you get the most out of your runbooks.

Don’t try to automate everything on day one

Attempting to script every step from day one can lead to confusion and even mistakes. It’s important to perform the task manually at least once to fully understand and explain the process being automated. 

Document clearly

A picture truly paints a thousand words. Use screenshots and diagrams so that a reader can follow along with the process and be confident that everything is executing as expected. 

Remember to validate 

Once the runbook has been followed, you should validate that the system is in the desired state. In some cases, this can be a single check. In other cases, validation may be necessary on a step-by-step basis. Validation steps should be included with the runbook steps.
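
Step-by-step validation can be expressed by pairing each action with a check that must pass before the next step runs. The following is a minimal sketch; `run_with_validation` and the in-memory `account` dictionary are hypothetical stand-ins for real actions and real system checks.

```python
from typing import Callable, Iterable, List, Tuple

# Each entry is (step name, action to perform, check that confirms it worked).
ValidatedStep = Tuple[str, Callable[[], object], Callable[[], bool]]


def run_with_validation(steps: Iterable[ValidatedStep]) -> List[str]:
    """Execute each step, then its validation; stop at the first failed check."""
    completed: List[str] = []
    for name, action, validate in steps:
        action()
        if not validate():
            raise RuntimeError(f"validation failed after step: {name}")
        completed.append(name)
    return completed


# Stand-in for a real system: the account starts out enabled.
account = {"enabled": True}

done = run_with_validation([
    (
        "Disable user account",
        lambda: account.update(enabled=False),
        lambda: account["enabled"] is False,
    ),
])
print(done)  # ['Disable user account']
```

Failing loudly on the first unsuccessful check stops later steps from running against a system that is not in the expected state.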

Know how much automation is too much

Think about the consequences of automating the runbook. Sometimes it may be wise to require some manual intervention, even if it's just to trigger the automation process.

For example, a temporary network blip may trigger a response to spin up a production infrastructure in a secondary region and switch all production traffic to this region. In a time-consuming, expensive, and customer-impacting case like this, it may make sense for a human to first decide if the blip is cause for concern or if they are happy that service has returned to normal in an acceptable time frame.
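
A manual approval gate of this kind can be very small. The sketch below assumes a hypothetical grace period and an `approved_by` field recorded by whoever signs off on the failover; neither comes from a specific tool.

```python
from typing import Optional


def failover_decision(
    outage_seconds: float,
    grace_seconds: float,
    approved_by: Optional[str] = None,
) -> str:
    """Decide whether to fail over to the secondary region.

    Short blips are ignored; longer outages still need a human to approve.
    """
    if outage_seconds < grace_seconds:
        return "wait"
    if not approved_by:
        return "awaiting approval"
    return "failover"


# A 30-second blip with a (hypothetical) 300-second grace period.
print(failover_decision(30, 300))  # wait
```

The automation still does the heavy lifting once approval is given, but the expensive, customer-impacting decision stays with a person.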

Conclusion

Runbooks are invaluable to a growing enterprise. As a solution grows, it is inevitable that things will sometimes go wrong. Using a quality runbook template can bring order to the chaos of solution engineering. By following a familiar structure, the runbook reader can put aside the stress of reinventing the wheel, overengineering a solution, or preparing a business-ready document. With a runbook in place, all they need to do is follow the steps.

Further, a structured format opens up the possibility of process automation. As recommended steps become more refined and reliable, they become easier to automate, either via third-party solutions or custom scripts. Some companies have realized the importance of establishing this structure quickly and have built runbook solutions that provide out-of-the-box runbook templates, which can save organizations months of trial and error.

Copyright © Squadcast Inc. 2017-2023