Automated Runbooks = Faster Recovery

November 11, 2019

Runbook automation can speed up your incident management. Learn how you can implement runbook automation and reduce toil.

    A runbook is a predefined set of steps or procedures, usually executed manually by a systems engineer. For instance, say you want to upgrade an application in production and you have a documented set of steps for doing so. That document is a runbook. It contains procedures to start, stop, supervise, and debug the system.

    Recent research shows that 80% of the time spent by engineering teams is invested in triaging incidents. Over the past few years, the shift to microservices has sharply increased codebase complexity. Managing and monitoring many microservice endpoints means a large number of checkpoints and alerts. As a result, outages generate too many incidents, and engineering teams get buried in operational work. To get a better handle on incidents, teams can use automated and executable runbooks to set up auto-mitigation or remediation. These runbooks should be triggered by events or logs and create incidents for engineers only when necessary.

    Broadly speaking, runbooks can be categorized as:

    1. Procedural Runbooks: These are manual runbooks in which you simply follow the technical documentation and run the steps. A systems engineer uses standard tools to access production systems and follows the procedure by hand.
    2. Executable Runbooks: These are like procedural runbooks in that a systems engineer follows the procedure as described. In addition, the engineer can run an automation task (a shell script, a PowerShell script, or any other script) from their own machine against a target system to fix the problem.
    3. Automated Runbooks: As the name suggests, automated runbooks run automatically without any manual interaction.

    This blog talks about automated runbooks and a few automation tools. Automated runbooks let us automate time-consuming and repetitive tasks on one or more servers.

    Listed below are a few instances where automated runbooks can potentially save the day:
    1. Active Directory:
      We can use automated runbooks to update Active Directory whenever a new user is onboarded. A runbook can create the user account and assign the user to multiple groups, which ensures that they have the appropriate permissions and are part of the organizational domain. We can also add any other activities needed when a new employee joins. Automating these manual tasks helps users onboard quickly.
    2. Virtual Machine/Service Management:
      We can use automated runbooks to manage our virtual machines (VMs) or services, in scenarios such as:
      * Restarting VMs after patching
      * Checking the status of a service
      * Restarting services running in VMs after deployments
      When VMs are in a hung state or not serving any traffic or requests, a quick-fix runbook can run against the active incident and mitigate it.
    3. Log Archive:
      Another use case is log management: runbooks can either delete old data or archive it into Azure log tables. Later, you can analyze these tables to find recurring themes, for example which types of errors your web app servers encountered in the last 30 days. Insights from that data can help you improve the reliability of the product.
    4. Monitoring:
      Another use case is monitoring. Using runbooks, we can monitor machine responsiveness. Is the host available on the network? How much disk space is left on the machine? How healthy are the daemons or services? What is the resource utilization of the servers? Using any scripting language, we can fetch these details and attach them to incidents or use them to start our investigations.
    5. Configuration Management:
      Runbooks can deploy standard baseline configurations. These configurations could relate to services, clients, or network equipment; even mobile devices can be configured. This way we can meet a minimum security standard as required by the organizational security policy. We can also deploy OS and application configuration using runbooks, and if any software or patches need to be deployed, we can achieve that with runbook automation.

    Here are a few runbook automation tools that we can use for the above:

    Azure Automation:
    Azure Automation is Microsoft's cloud-hosted automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources. It supports PowerShell runbooks, PowerShell Workflow and graphical runbooks, and Python runbooks. We can trigger these runbooks from Azure alerts, webhooks, schedules, Logic Apps, another runbook, or watcher tasks.

    In this example, we have created a PowerShell runbook to restart the WebApps servers. We can run this runbook on a schedule as required, or trigger it from a webhook.

    [Figure: PowerShell runbook that restarts the WebApps servers]
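
    The PowerShell runbook from the example is not reproduced here. Since Azure Automation also supports Python runbooks, the restart-then-verify pattern behind such a runbook might be sketched roughly as follows. This is only an illustration under assumptions: `restart` and `is_healthy` are hypothetical callables that you would supply yourself, for example thin wrappers around whatever restarts the web app and probes its health endpoint.

```python
import time
from typing import Callable


def restart_and_verify(
    restart: Callable[[], None],
    is_healthy: Callable[[], bool],
    attempts: int = 3,
    wait_seconds: float = 5.0,
) -> bool:
    """Restart a service, then confirm that it recovered.

    `restart` and `is_healthy` are hypothetical callables supplied by the
    caller; they are not part of any Azure API.
    Returns True as soon as a health check passes, False if all attempts fail.
    """
    for _ in range(attempts):
        restart()
        time.sleep(wait_seconds)  # give the service time to come back up
        if is_healthy():
            return True
    return False
```

    Returning a boolean rather than raising lets the caller decide what to do next, for instance escalating to an engineer only when every attempt has failed.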


    Rundeck:
    Rundeck is a web-accessible console for dispatching commands and scripts to your nodes. It can also be used for deployments, operations tasks, and more. Rundeck lets you create jobs from existing scripts, run commands on selected nodes, or schedule jobs to run at a later time. In short, Rundeck lets you automate routine or ad hoc tasks by creating runbooks.

    Rundeck Features:

    1. Multi-step workflows.
    2. Distributed command execution.
    3. Jobs can be run on demand (ad hoc) or on a schedule.
    4. A graphical web console for running commands and jobs.
    5. A command-line interface tool and a web API for operating it from code.
    6. Logging of all command and job execution history for audit purposes.
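
    To give a feel for "operating it from code", here is a minimal sketch that builds (but does not send) an HTTP request to run a Rundeck job through its web API. It assumes the job-run endpoint `/api/<version>/job/<id>/run` and the `X-Rundeck-Auth-Token` header; the API version `41`, the server URL, the job ID, and the option names are placeholders, so check them against your own Rundeck server before relying on this.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_run_job_request(
    base_url: str,
    job_id: str,
    token: str,
    api_version: int = 41,  # assumed value; use the version your server supports
    options: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build a POST request that asks Rundeck to run the given job."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    body = json.dumps({"options": options or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

    Passing the returned object to `urllib.request.urlopen` would actually trigger the job; keeping the build step separate makes it easy to inspect or log the request first.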

    Rundeck can integrate with other tools in several ways:

    • Rundeck plugin: a plugin implemented in Java or shell script and installed into a Rundeck server.
    • External service: an external service that is used by a Rundeck plugin or the Rundeck core.
    • External plugin: a plugin installed in another tool that interacts with Rundeck through its API.


    Ansible:
    Ansible is a very powerful open-source configuration management tool. Ansible uses playbooks to deploy, manage, and configure anything from a single server to multi-server environments. A playbook is similar to a runbook: it defines a set of procedures.

    Ansible Features:

    1. Agentless: No agent or client software needs to be installed on the nodes you manage, unlike Puppet or Chef.
    2. Python Supported: Ansible is written in Python, and most of its modules are Python as well. Python is required on the control node, and most modules also expect a Python interpreter on the managed nodes.
    3. Secure SSH: Ansible connects to servers over SSH (Secure Shell) to carry out operations. SSH is an encrypted network protocol that supports key-based, password-less authentication, which keeps connections secure without an extra agent.
    4. Push Architecture: Ansible follows a push-based architecture for deploying configuration. Whenever you want to change a configuration, you update the playbook and run it, and Ansible takes care of the rest. In short, the control node manages all the configuration and pushes it to the target servers.

    An Ansible playbook, written in YAML, declaratively defines your configuration. Let's look at an example playbook that installs Nginx on servers using Ansible.

    [Figure: Ansible playbook that installs Nginx]
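
    The original playbook is not reproduced here. As a rough sketch of what such a playbook contains, the snippet below builds the structure as Python data and renders it as JSON; since JSON is also valid YAML, the output should be loadable as a playbook. The host group `webservers` is a hypothetical inventory name, and the use of the `ansible.builtin.apt` module assumes Debian or Ubuntu targets.

```python
import json
from typing import Any, Dict, List


def nginx_playbook(hosts: str = "webservers") -> List[Dict[str, Any]]:
    """Return a one-play playbook that installs and starts Nginx.

    `hosts` is a hypothetical inventory group name.
    """
    return [
        {
            "name": "Install and start Nginx",
            "hosts": hosts,
            "become": True,  # package installation needs elevated privileges
            "tasks": [
                {
                    "name": "Install the nginx package",
                    "ansible.builtin.apt": {
                        "name": "nginx",
                        "state": "present",
                        "update_cache": True,
                    },
                },
                {
                    "name": "Ensure nginx is running and enabled at boot",
                    "ansible.builtin.service": {
                        "name": "nginx",
                        "state": "started",
                        "enabled": True,
                    },
                },
            ],
        }
    ]


def render(playbook: List[Dict[str, Any]]) -> str:
    """Serialize the playbook structure as JSON text."""
    return json.dumps(playbook, indent=2)
```

    Each task names the desired end state (`present`, `started`) rather than the commands to run, which is what makes the playbook declarative.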


    Squadcast Runbooks:
    Squadcast Runbooks let you level up your incident management with a next-generation Reliability Orchestration Engine based on Site Reliability Engineering (SRE). It is designed to host and execute runbook automation in response to operational events or incidents. By using Squadcast Runbooks you can remove toil and repetitive tasks from your system. We have already seen how to create a runbook using Azure Automation; now let's see how easily we can create one using Squadcast.

    Let's say your Squadcast dashboard shows that your web app servers are using a large amount of resources, which could be due to high CPU usage or high traffic. To mitigate this, we want an automation that checks whether resource utilization on the web app servers has crossed a certain threshold, say 65%. To do this, we would create a runbook in Squadcast and schedule it to run automatically on incident tickets.

    [Figure: Squadcast runbook automation]
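
    The check described in this scenario could look roughly like the Python 3 sketch below. It is not Squadcast's own code: the function names are made up for illustration, and how the utilization samples are collected (for example from your monitoring agent) is left to the caller. Only the 65% threshold comes from the scenario above.

```python
from statistics import mean
from typing import Sequence

CPU_THRESHOLD_PERCENT = 65.0  # threshold taken from the scenario above


def utilization_breached(
    samples: Sequence[float], threshold: float = CPU_THRESHOLD_PERCENT
) -> bool:
    """Return True when the average of the samples exceeds the threshold."""
    if not samples:
        raise ValueError("at least one utilization sample is required")
    return mean(samples) > threshold


def incident_note(
    host: str, samples: Sequence[float], threshold: float = CPU_THRESHOLD_PERCENT
) -> str:
    """Format a one-line summary that could be attached to an incident."""
    status = "ABOVE" if utilization_breached(samples, threshold) else "within"
    return (
        f"{host}: average utilization {mean(samples):.1f}% "
        f"is {status} the {threshold:.0f}% threshold"
    )
```

    Averaging several samples instead of reacting to a single reading helps avoid triggering remediation on a short spike.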

    Runbooks Support:

    Currently, Squadcast Runbooks supports the following languages:

    • Shell script
    • Lua script
    • Python3 script
    • NodeJS script
    • Ansible configuration

    Here are some best practices for runbooks:

    1. Know your Application:
      Within our application, we need to consider which processes need improvement. Once we identify processes that could benefit from runbook automation, we can start gathering requirements.
    2. Gather Requirements:
      While gathering requirements, we should focus on determining the input and output values for our runbook, and whether each input is supplied automatically or has to be entered by the user.
    3. Use of Integration Packs:
      An integration pack gives us additional runbook activities. For example, if you want to automate user onboarding and that includes working with Active Directory user accounts, you will need to register and deploy the integration pack for Active Directory.
    4. Single or Multiple Host Runbook:
      We need to know whether the runbook will run against a single host or multiple hosts at the same time, because its design depends on that decision.
    5. Runbook Execution Trigger:
      We should know how the runbook is going to be executed. Will it run on a schedule? Will it be run manually from time to time? Will it need any kind of user interaction?
    6. Runbook Logs:
      We should also decide which logs will be needed once the runbook has executed, and where those logs will be saved for future reference or debugging.
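
    Several of these practices (explicit inputs, single or multiple hosts, and logging) can be reflected directly in the shape of a script. The following is a minimal Python skeleton under assumptions: the flag names `--host`, `--dry-run`, and `--log-file` are invented for this example, and the actual remediation step is left as a placeholder comment.

```python
import argparse
import logging
from typing import List, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Declare the runbook's inputs explicitly (hypothetical flag names)."""
    parser = argparse.ArgumentParser(description="Example runbook skeleton")
    parser.add_argument("--host", action="append", required=True, dest="hosts",
                        help="target host; repeat the flag for multiple hosts")
    parser.add_argument("--dry-run", action="store_true",
                        help="log what would be done without changing anything")
    parser.add_argument("--log-file", default=None,
                        help="where to keep execution logs for later debugging")
    return parser.parse_args(argv)


def run(hosts: List[str], dry_run: bool, log: logging.Logger) -> List[str]:
    """Visit each host, log the action taken, and return the hosts processed."""
    processed = []
    for host in hosts:
        if dry_run:
            log.info("dry run: would remediate %s", host)
        else:
            log.info("remediating %s", host)
            # the real remediation step would go here
        processed.append(host)
    return processed
```

    A scheduler, a webhook handler, or an engineer at a terminal can all call the same entry point with different arguments, so the trigger can change without rewriting the runbook.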

    How do you write a runbook?

    A runbook is a collection of procedures for dealing with common issues. Runbooks help your team deal with these situations more efficiently. Here are a few general steps to keep in mind while writing a runbook:

    • Get a full understanding of your system's architecture. Identify all processes, configurations, and dependencies.
    • Brainstorm the most common issues that come up. What problems do you see people running into again and again? What kind of information does it take to resolve them?
    • Create a flowchart or diagram of the steps involved in resolving each issue from start to finish, from when someone first encounters the problem until they have resolved it and gotten back to work. Add details of key personnel (such as an incident lead) who can help keep systems and processes running.
    • Before you deploy your runbooks, make sure that they have been thoroughly tested. Keep them in a place where everyone who needs them will be able to find them easily. Review them periodically to make sure they are up-to-date.

    What should a runbook include?

    A runbook should include the following:

    • Detailed, clear, and concise steps for dealing with specific problems, such as system failures and security breaches
    • Who is on call to resolve an incident, what resources are available to them, and who can assist them in resolving it
    • A runbook may also include emergency contact information, procedures for data backup and recovery, and a list of critical systems with their dependencies
    • Keep runbooks in a place where everyone who needs them can easily find them. Review them periodically to make sure they are up-to-date
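
    One way to keep these elements from being forgotten is to store each runbook as structured data and check it before publishing. The sketch below is only an illustration: the field names and the 90-day review interval are assumptions made for this example, not a standard.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class Runbook:
    """Hypothetical record of the pieces a runbook should contain."""
    title: str
    steps: List[str]
    on_call: str
    escalation_contacts: List[str] = field(default_factory=list)
    critical_systems: List[str] = field(default_factory=list)
    last_reviewed: date = field(default_factory=date.today)

    def missing_sections(self) -> List[str]:
        """List the required sections that are still empty."""
        missing = []
        if not self.steps:
            missing.append("steps")
        if not self.on_call.strip():
            missing.append("on_call")
        if not self.escalation_contacts:
            missing.append("escalation_contacts")
        return missing

    def needs_review(self, today: date, max_age_days: int = 90) -> bool:
        """Return True when the last review is older than the allowed age."""
        return today - self.last_reviewed > timedelta(days=max_age_days)
```

    A periodic job could run `needs_review` over every stored runbook and flag the stale ones to their owners.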

    What is the difference between a runbook and an SOP?

    A runbook is a predefined set of technical steps, procedures, or documentation that is usually executed manually by a systems engineer. A runbook can also contain information related to application deployment, monitoring, and maintenance. Standard operating procedures (SOPs), by contrast, describe the steps required to complete specific activities or tasks. They can be used to ensure that industry rules and regulations are followed in an organization.

    Playbooks versus runbooks

    A runbook is a step-by-step procedure that helps ensure the technical aspects of an organization's systems continue to function smoothly. A playbook is more general—outlining an organization's approach to a task and the responsibilities of its workers. While both a runbook and a playbook include information on technical aspects, a playbook will likely go into greater detail about the cultural, compliance, or user experience aspects of a task.


    With the right amount of automation and strategic process management, you can improve incident remediation instructions and keep runbooks updated in a timely manner. When the next incident occurs, the documentation is then current and available to the right person at the right time.

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

    Written By: Shreyash Naithani