
Automated Runbooks = Faster Recovery

Nov 11, 2019
Last Updated:
August 29, 2024

Runbook automation can speed up your incident management. Learn about runbooks and how you can implement runbook automation and reduce toil.

    A runbook is a predefined set of steps or procedures, usually executed manually by a systems engineer. For instance, if you want to upgrade an application in production and have a documented, defined set of steps for doing so, that document is a runbook. It contains procedures to start, stop, supervise, and debug the system.

    Recent research shows that 80% of the time spent by engineering teams is invested in triaging incidents. Over the past few years, the shift to microservices has resulted in an exponential increase in code-base complexity. Managing and monitoring many microservice endpoints means a large number of checkpoints and alerts. As a result, we end up with too many incidents during outages, and engineering teams get buried in operational work. To get a better handle on incidents, teams can use automated and executable runbooks to set up auto-mitigation or remediation. These runbooks should be triggered by events or logs and should create incidents for engineers only when necessary. Broadly speaking, runbooks can be categorized as:

    1. Procedural Runbooks: Manual runbooks where you simply follow the technical documentation and run the steps. A systems engineer uses standard tools to access production systems and follows the procedure by hand.
    2. Executable Runbooks: Like procedural runbooks, in that systems engineers follow the procedure as described. In addition, they can run an automation task from their own machine (a shell script, a PowerShell script, or any other script) against a target system to fix the problem.
    3. Automated Runbooks: As the name suggests, automated runbooks run automatically without any manual interaction.

    This blog talks about automated runbooks and a few automation tools. Automated runbooks allow us to automate time-consuming and repetitive tasks on one or more servers.

    Listed below are a few instances where automated runbooks can potentially save the day:
    1. Active Directory:
      We can use automated runbooks to update Active Directory when a new user is onboarded. Using these runbooks, we can create a user account and assign the user to multiple groups, which ensures that they have the appropriate permissions and are part of an organizational domain. We can also add any other activities needed when a new employee joins. Automating these manual tasks helps users onboard quickly.
    2. Virtual Machine/Service Management:
      We can use automated runbooks to manage our virtual machines (VMs) or services, in scenarios such as:
      * Restarting VMs after patching
      * Checking the status of a service
      * Restarting services running in VMs after deployments
      When you see VMs in a hung state or not serving any traffic or requests, create a quick-fix runbook to run against the active incident and mitigate it.
    3. Log Archive:
      One use case is automating log management by creating runbooks that either delete old data or archive it into Azure log tables. Later, you can use these Azure log tables to analyze the data and extract patterns, such as the types of errors the web app servers encountered in the last 30 days. That analysis can in turn be used to improve the reliability of the product.
    4. Monitoring:
      Another use case is monitoring. Using runbooks, we can monitor host responsiveness: Is the host available on the network? How much disk space is left on the machine? How healthy are the daemons or services? What is the resource utilization of the servers? With any scripting language we can fetch these details and attach them to incidents or use them to start an investigation.
    5. Configuration Management:
      Standard baseline configurations can be deployed using runbooks. These configurations could relate to services, clients, or network equipment; even mobile devices can be configured. This way we can meet a minimum security standard as required by the organizational security policy. We can also deploy OS and application configuration using runbooks, and any software or patches that need to be deployed can be rolled out using runbook automation.

    The sections that follow look at a few runbook automation tools that can be used for these tasks.
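
    Before turning to specific tools, the monitoring and log-archive use cases above can be sketched in plain Python. This is a minimal, tool-agnostic illustration using only the standard library, not the API of any product named in this post; the function names, the 30-day default, and the `*.log` naming pattern are assumptions made for the example.

```python
import gzip
import shutil
import socket
import time
from pathlib import Path


def disk_free_percent(path="/"):
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def host_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def archive_old_logs(log_dir, max_age_days=30, now=None):
    """Gzip every *.log file older than `max_age_days` and delete the original.

    Returns the list of archive paths that were created.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    archived = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        if log_file.stat().st_mtime >= cutoff:
            continue  # still recent, leave it alone
        target = log_file.with_name(log_file.name + ".gz")
        with open(log_file, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        log_file.unlink()
        archived.append(target)
    return archived
```

    A runbook could call these helpers and attach the results to an incident, or open one only when a check fails.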

    Azure Automation:
    Azure Automation is Microsoft's cloud-hosted automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features, and provides complete control during deployment, operations, and decommissioning of workloads and resources. Runbooks can be written as PowerShell, PowerShell Workflow, graphical, or Python runbooks. We can trigger these runbooks from Azure Alerts, webhooks, schedules, Logic Apps, another runbook, or watcher tasks.

    In this example, we have created a PowerShell runbook to restart the web app servers. We can run this runbook on a schedule as required, or trigger it from a webhook.

    [Figure: PowerShell runbook]
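
    The PowerShell runbook from the example is not reproduced here. As a rough analogue in Python (a language Azure Automation runbooks can also be written in), a restart runbook might look like the sketch below. It assumes a Linux host where the web application runs as a systemd unit; the unit name `webapp.service` and both function names are hypothetical, and nothing here calls an Azure API.

```python
import subprocess


def service_is_active(unit, run=subprocess.run):
    """Return True if systemd reports the unit as active."""
    # `systemctl is-active --quiet` exits with status 0 only when the unit is active.
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def restart_if_down(unit, run=subprocess.run):
    """Restart `unit` only when it is not active; return the action taken."""
    if service_is_active(unit, run=run):
        return "already-running"
    # check=True raises CalledProcessError if the restart itself fails.
    run(["systemctl", "restart", unit], check=True)
    return "restarted"
```

    The `run` parameter exists only so the command runner can be replaced in tests; in normal use `restart_if_down("webapp.service")` shells out to `systemctl`.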

    Rundeck:

    Rundeck is a web-accessible console for dispatching commands and scripts to your nodes. It can also be used for deployments, operations tasks, and more. Rundeck lets you create jobs from existing scripts, run commands on selected nodes, or schedule jobs to run at a later time. In short, with Rundeck you can automate routine or ad hoc tasks by creating runbooks.

    Rundeck Features:

    1. Rundeck supports multi-step workflows.
    2. It supports distributed command execution.
    3. Jobs can be executed on demand or run by the scheduler.
    4. Rundeck provides a graphical web console for executing jobs and commands.
    5. It provides a command-line interface tool and a Web API so it can be operated from code.
    6. It logs the history of all command and job executions for audit purposes.

    Rundeck can integrate with other tools in several ways:

    • Rundeck plugin: A plugin implemented in Java or as a shell script and installed into the Rundeck server.
    • External service: An external service that is used by a Rundeck plugin or by the Rundeck core.
    • External plugin: A plugin installed in another tool that interacts with Rundeck through its API.

    Ansible:

    Ansible is a powerful open-source configuration management tool. It uses playbooks to deploy, manage, and configure anything from a single server to multi-server environments. A playbook is similar to a runbook: it defines a set of procedures.

    Ansible Features:

    1. Agentless: No software or client agent needs to be installed on your nodes to manage them, unlike Puppet or Chef.
    2. Python Supported: Ansible is written in Python and ships with a large collection of Python modules. Installing Ansible therefore also brings in Python as a dependency, and most modules expect a Python interpreter on the managed nodes as well.
    3. Secure SSH: Ansible connects to servers over Secure Shell (SSH) to perform any operation. SSH is an encrypted network protocol that supports key-based, password-less authentication, which keeps connections secure without extra setup on the nodes.
    4. Push Architecture: Ansible follows a push-based architecture for deploying configuration. Whenever you want to change a configuration, update the playbook and run it; Ansible takes care of the rest. In short, a central control node manages all the configuration and pushes it to the target servers.

    An Ansible playbook is written in YAML and declaratively defines your configuration. Let's look at an example playbook that installs Nginx on servers using Ansible.

    [Figure: Playbook installing Nginx servers using Ansible]
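
    The original playbook screenshot is not available, so here is a minimal sketch of what such a playbook could contain, built as a Python data structure and printed as JSON. It assumes Debian or Ubuntu targets (hence the `ansible.builtin.apt` module) and a hypothetical inventory group named `webservers`; it is not the author's original playbook. Because JSON documents are generally accepted by YAML parsers, the output can be saved to a file and passed to `ansible-playbook`, or rewritten by hand in the more usual YAML style.

```python
import json


def nginx_playbook(hosts="webservers"):
    """Build a one-play playbook that installs and starts Nginx."""
    return [
        {
            "name": "Install and start Nginx",
            "hosts": hosts,  # hypothetical inventory group
            "become": True,  # package installation needs root privileges
            "tasks": [
                {
                    "name": "Install the nginx package",
                    "ansible.builtin.apt": {
                        "name": "nginx",
                        "state": "present",
                        "update_cache": True,
                    },
                },
                {
                    "name": "Ensure nginx is running and enabled at boot",
                    "ansible.builtin.service": {
                        "name": "nginx",
                        "state": "started",
                        "enabled": True,
                    },
                },
            ],
        }
    ]


if __name__ == "__main__":
    # Print the playbook so it can be redirected to a file.
    print(json.dumps(nginx_playbook(), indent=2))
```

    Each task states the desired end state (`present`, `started`) rather than the commands to run, which is what makes the playbook declarative and safe to run repeatedly.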

    Squadcast:

    Squadcast Runbooks allow you to level up your incident management with the next-generation Reliability Orchestration Engine based on Site Reliability Engineering (SRE). It is designed to host and execute runbook automation in response to operational events or incidents. By using Squadcast runbooks you can remove toil, or repetitive tasks, from your system. We have already seen how to create a runbook using Azure Automation; let's see how easily we can create one using Squadcast.

    Let's say your Squadcast dashboard shows that your web app servers are using a large amount of resources, possibly due to high CPU usage or high traffic. To mitigate this, we want an automation that checks whether resource utilization on the web app servers has crossed a certain threshold, say 65%. To do this, we create a runbook in Squadcast and schedule it to run on incident tickets automatically.

    [Figure: Squadcast runbook automation]
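
    The check at the heart of such a runbook is small. Here is a minimal Python 3 sketch of the threshold test, assuming the utilization samples (in percent) have already been collected by your monitoring stack; the function name and the way samples are passed in are assumptions for illustration, not part of the Squadcast API.

```python
def needs_mitigation(cpu_samples, threshold=65.0):
    """Return True when average utilization (in percent) exceeds `threshold`."""
    if not cpu_samples:
        raise ValueError("at least one utilization sample is required")
    average = sum(cpu_samples) / len(cpu_samples)
    return average > threshold


# Example: three recent CPU readings from a web app server, in percent.
recent = [72.5, 68.0, 70.1]
if needs_mitigation(recent):
    print("Threshold crossed: run the mitigation step")
else:
    print("Utilization within limits: no action needed")
```

    Averaging over several samples rather than reacting to a single reading is a deliberate choice here, so that a short spike does not trigger remediation on its own.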

    Runbooks Support:

    Currently, Squadcast Runbooks support the following languages:

    • Shell script
    • Lua script
    • Python3 script
    • NodeJS script
    • Ansible configuration

    Here are some best practices for runbooks:

    1. Know your Application:
      Consider which processes within the application need improvement. Once we have identified processes that could benefit from runbook automation, we can start gathering requirements.
    2. Gather Requirements:
      While gathering requirements, focus on determining the input and output values for the runbook, and whether inputs are supplied automatically or need to be entered by the user.
    3. Use of Integration Packs:
      An integration pack gives us additional runbook activities. For example, if you want to automate user onboarding and that includes working with Active Directory user accounts, you will need to register and deploy the integration pack for Active Directory.
    4. Single or Multiple Host Runbook:
      We need to know whether the runbook will run against a single host or multiple hosts at the same time, because the runbook has to be designed around that decision.
    5. Runbook Execution Trigger:
      We should know how the runbook will be executed. Will it run on a schedule? Will it be run manually from time to time? Will it need any kind of user interaction?
    6. Runbook Logs:
      We should also decide which logs will be needed once the runbook has executed and where they will be saved for future reference or debugging.
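
    Several of these practices (explicit inputs, single or multiple hosts, and execution logs) can be seen in a small Python skeleton. This is only a sketch under assumptions: `run_step` is a hypothetical placeholder for the real remediation, and the argument names and default log file are invented for the example.

```python
import argparse
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("runbook")


def parse_args(argv=None):
    """Declare the runbook's inputs explicitly instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Example runbook skeleton")
    parser.add_argument("hosts", nargs="+", help="one or more target hosts")
    parser.add_argument("--log-file", default="runbook.log",
                        help="where execution logs are written")
    return parser.parse_args(argv)


def run_step(host):
    """Hypothetical placeholder for the real remediation step on one host."""
    log.info("running step on %s", host)
    return host, "ok"


def run_on_hosts(hosts, step=run_step, max_workers=4):
    """Run `step` on every host in parallel and collect the status per host."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(step, hosts))


def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(filename=args.log_file, level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    results = run_on_hosts(args.hosts)
    for host, status in results.items():
        log.info("%s finished with status %s", host, status)
    # A non-zero exit code tells the scheduler or trigger that something failed.
    return 0 if all(status == "ok" for status in results.values()) else 1
```

    Calling `main(["web-1", "web-2"])` runs the placeholder step on both hosts and writes the execution log to `runbook.log`; the same function works for a single host.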


    How do you write a runbook?

    A runbook is a collection of procedures for dealing with common issues, and it can help your team deal with these situations more efficiently. Here are a few general steps to keep in mind while writing a runbook:

    • Get a full understanding of your system's architecture. Identify all processes, configurations, and dependencies.
    • Brainstorm the most common issues that come up. What problems do you see people running into again and again? What kind of information does it take to resolve them?
    • Create a flowchart or diagram of the steps involved in resolving each issue from start to finish, from when someone first encounters the problem until they have resolved it and gotten back to work. Add details of key personnel (such as an incident lead) who can help keep systems and processes running.
    • Before you deploy your runbooks, make sure that they have been thoroughly tested. Keep them in a place where everyone who needs them can find them easily, and review them periodically to make sure they are up to date.

    What should a runbook include?

    A runbook should include the following:

    • Detailed, clear, and concise steps for dealing with specific problems, such as system failures and security breaches.
    • Who is on call to resolve an incident, what resources are available to them, and who can assist them in resolving it.
    • Optionally, emergency contact information, procedures for data backup and recovery, and a list of critical systems with their dependencies.
    • Keep runbooks in a place where everyone who needs them can easily find them, and review them periodically to make sure they are up to date.

    What is the difference between a runbook and an SOP?

    A runbook is a predefined set of technical steps, procedures, or documentation that is usually executed manually by a systems engineer. A runbook can also contain information related to application deployment, monitoring, and maintenance. SOPs (standard operating procedures), by contrast, are descriptions of the steps required to complete specific activities or tasks. They can be used to ensure that industry rules and regulations are followed in an organization.

    Playbooks versus runbooks

    A runbook is a step-by-step procedure that helps ensure the technical aspects of an organization's systems continue to function smoothly. A playbook is more general—outlining an organization's approach to a task and the responsibilities of its workers. While both a runbook and a playbook include information on technical aspects, a playbook will likely go into greater detail about the cultural, compliance, or user experience aspects of a task.


    Conclusion

    With the right amount of automation and strategic process management, you can improve incident remediation instructions and ensure runbooks are updated in a timely manner. This ensures that when the next incident occurs, the documentation is up to date and available to the right person at the right time.

    Written By: Shreyash Naithani
    A Runbook is a predefined set of steps or procedures that is usually executed manually by a systems engineer. For instance: say you want to upgrade an application on production, and you have a defined set of steps that are documented. We call this a runbook. It contains procedures to begin, stop, supervise, and debug the system.

    Recent research shows that 80% of the time spent by engineering teams is invested in triaging incidents. Over the past few years, shift to microservices has resulted in an exponential increase in code-base complexity. Managing and monitoring several microservice endpoints means a large number of checkpoints and alerts. As a result, we end up having too many incidents during outages and engineering teams get buried in operational work. To get a better handle on incidents, teams can use Automated and Executable Runbooks to set up auto-mitigation or remediation. These run book should be triggered by Events/Logs and create incidents for engineers only when necessary. Broadly speaking, Runbooks can be categorized as:

    ‍

    1. Procedural Runbooks: Procedural Runbooks are manual runbooks where you have to just follow the technical documents and run the steps. Here, a systems engineer will use standard tools to access production systems and follow the procedure manually.
    2. Executable Runbooks: Executable Runbooks are like procedural Runbooks where systems engineers will follow the procedure as described. Additionally, systems engineers can also run an automation task from his or her machine (could be Shell-Script, Powershell or any other scripts) on a target system and fix the problem.
    3. Automated Runbooks: As the name suggests automated runbooks runs automatically without any manual interaction.This blog talks about Automated Runbooks and a few automation tools.
      Automated Runbooks allow us to automate time-consuming and repetitive tasks. Using these, we can automate any tasks on one or more servers.
      Listed below are a few instances where automated runbooks can potentially save the day:
    1. Active Directory:
      We can use automated runbooks to update Active directories when any new user is onboarded onto the system. Using these runbooks, we can create a user account and assign the user to multiple groups. This will ensure that they have the appropriate permissions and are part of an organizational domain. We could also add activities that might be needed when any new employee is onboarded.And with automated runbooks we can automate these manual tasks and help users to onboard quickly.
    2. Virtual Machine/Service Management:
      We can use automated runbooks to manage our Virtual Machine(VM) or services. These can be in scenarios such,
      * Need to restart VMs after patching
      * To know any service status
      * Want to restart any services running in VMs after deployments.
      When you see VMs in a hung state or not serving any traffic/requests, create a quick fix type of runbook to run on top of the active incident and mitigate them.
    3. Log Archive:
      One of the use cases is to automate log management by creating runbooks which can either delete your old data or archive your data into some azure log tables. Later you can use these Azure log table to analyze and get some themes out of them. It could be what types of error our webApps server encountered in the last 30 days. Again, by looking into that data you can improve the reliability of the product.
    4. Monitoring:
      Another use case scenario would be monitoring. Using runbooks, we can monitor computer responsiveness. Is the host available on the network? How much disk space is left on the machine? How is the health of the daemon or services? What is the resource utilization for servers? By using any scripting language we can fetch these details and update them on incidents or start our investigations.
    5. Configurations Management:
      Deploying standard baseline configurations can be done using runbooks. Configurations could be related to services, clients or network equipment. Even mobile devices can be configured. This way we can meet a certain minimum-security standard as per the organizational security policy. We can also deploy OS and app configuration using runbooks. And if any software or patching needs to be deployed we can achieve it using runbook automation. Here are a few runbook automation tools, that we may use for the above -
      ‍

    Azure Automation:
    Azure Automation is Microsoft's cloud-hosted automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources. It uses PowerShell Runbook or Powershell Graphical/Workflows or Python Runbook. We can trigger these runbooks from Azure Alerts, Webhooks, Schedule, Logic Apps, Another Runbook or watcher tasks.

    In this example, we have created one Powershell Runbook to restart the WebApps servers. We can schedule this Powershell runbook as per requirement also we can trigger it from Webhooks or by schedule.

    Powershell Runbook

    Rundeck:

    Rundeck is a web-accessible console for dispatching commands and scripts to your nodes. It can also be used for deployments, operations tasks and more. Rundeck lets you create jobs made from existing scripts, run commands on selected nodes or schedule jobs to run at a later time.In short, using Rundeck you can automate routine or ad hoc tasks by creating runbooks.

    Rundeck Features:

    1. Rundeck supports multi-step workflows.
    2. Distributed command execution.
    3. Job Execution can be done with ad-hoc demands or we can set it with the scheduler.
    4. Rundeck provides a graphical web console for job execution and command.
    5. It’s a command line interface tool with Web API to operate it from code.
    6. It logs all the command or job execution history for audit purposes.

    Rundeck can integrate with tools in several manners.

    • A Rundeck plugin implemented in Java or shell script, installed into a Rundeck server.
    • External Service.
    • An external service that is used by a Rundeck plugin or the Rundeck core.
    • External Plugin.
    • A plugin is installed in another tool that interacts with Rundeck through its API.

    Ansible:

    Ansible is a very powerful open-source configuration management tool. Ansible uses 'Playbook' to deploy, manage and configure anything from a single-server to multi-server environments. Here Playbook is similar to runbooks where you can define a set of procedures.

    Ansible Features:

    1. Agentless: It means there is no need for any software or client/agent to manage your nodes unlike Puppet or chef.
    2. Python Supported: Ansible is built on python and provides a lot of python features and modules. Once you install ansible you will see python is also getting installed on your servers.
    3. Secure SSH: Ansible uses a secure shell to connect to the servers to do any operation. Secure shell is the password-less network authentication protocol. This makes ansible fast and more secure than others.
    4. Push Architecture: Ansible follows push-based architecture for deploying any configuration. Whenever you want to push any configuration, just update the playbook and push. It will take care of the rest. In short, the central server manages all the configuration and pushes it to the target servers.

    Ansible Playbook written in YAML, declaratively defines your configuration. Let's see one example of a playbook here we are installing Nginx servers using Ansible.

    ‍

    playbook installing by Nginx servers using Ansible

    Squadcast:

    Squadcast Runbooks will allow you to up level up your Incident Management with the next generation Reliability Orchestration Engine based on Site Reliability Engineering (SRE). It is designed to host and execute runbooks automation in response to operational events or incidents. By using Squadcast runbooks you can remove the toil or repetitive tasks from your system. We have already seen how we can create runbook using Azure Automation lets see how easy we can create it using Squadcast.

    Let's say your Squadcast dashboard is showing that your Web Apps servers are using a large amount of resources, and could be due to high CPU or high traffic. To mitigate this, we want to create an automation that checks if resource utilization on web apps server has been increased, and has crossed a certain threshold—say 65%. To do this, we would create a runbook in Squadcast, and schedule this runbook to run on incident tickets automatically.

    Squadcast Runbook Automation

    Runbooks Support:

    Currently, Squadcast Runbooks supports the below languages...

    • Shell script
    • Lua script
    • Python3 script
    • NodeJS script
    • Ansible configuration

    Here are some best practices for runbook:

    1. Know your Application:
      Within our application, we need to consider which processes need improvement and when we define processes that could benefit from automation using runbook, we need to start gathering requirements.
    2. Gather Requirement:
      While gathering requirements we should focus on determining input and output values for our runbook whether its automatically supplied or the user needs to input these values.
    3. Use of Integration pack:
      The Integration pack gives us additional runbook activities. For example: if you want to automate user onboarding and that includes working with an active directory user account, you are going to need to register and deploy the integration pack for the active directory.
    4. Single or multiple host runbook:
      We need to know whether we are going to run our automation runbook for single or multiple host at the same time because we need to design our runbook based on that decision.
    5. Runbook Execution Trigger:
      We should know how we are going to execute the runbook. Will it be a schedule? Is it going to be done periodically so manual? Will it need any kind of user interaction?
    6. Runbook Logs:We should also focus on what logs will be needed once runbook is executed and where we are going to save these logs for future or debug purposes.


    How do you write a runbook?

    A runbook is a collection of procedures for dealing with common issues, and it helps your team handle these situations more efficiently. Here are a few general steps to keep in mind while writing a runbook:

    • Get a full understanding of your system's architecture. Identify all processes, configurations, and dependencies.
    • Brainstorm the most common issues that come up. What problems do you see people running into again and again? What kind of information does it take to resolve them?
    • Create a flowchart or diagram of the steps involved in resolving each issue, from the moment someone first encounters the problem until it is resolved and they are back to work. Add details of key personnel (such as an incident lead) who can help keep systems and processes running.
    • Before you deploy your runbooks, make sure that they have been thoroughly tested. Keep them in a place where everyone who needs them will be able to find them easily. Review them periodically to make sure they are up-to-date.

    What should a runbook include?

    A runbook should include the following:

    • Detailed, clear, and concise steps for dealing with specific problems, such as system failures and security breaches
    • Who is on call to resolve an incident, what resources are available to them, and who can assist them in resolving it
    • A runbook may also include emergency contact information, procedures for data backup and recovery, and a list of critical systems with their dependencies
    • Keep runbooks in a place where everyone who needs them can easily find them. Review them periodically to make sure they are up-to-date 
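
    One way to make this checklist concrete is to model a runbook as a small data structure. The Python sketch below is only an illustration: the `Runbook` class, its field names, and the 90-day review interval are assumptions chosen for the example, not a Squadcast format.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class Runbook:
    """Hypothetical container for the items a runbook should include."""
    title: str
    steps: List[str]                      # clear, concise remediation steps
    on_call: str                          # who resolves the incident
    escalation_contacts: List[str] = field(default_factory=list)
    critical_dependencies: List[str] = field(default_factory=list)
    last_reviewed: date = field(default_factory=date.today)

    def is_due_for_review(self, today: date, max_age_days: int = 90) -> bool:
        """True when the last review is older than `max_age_days` (assumed 90)."""
        return today - self.last_reviewed > timedelta(days=max_age_days)
```

    A periodic job could call `is_due_for_review(date.today())` on each stored runbook and flag the stale ones for their owners.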

    What is the difference between a runbook and an SOP?

    A runbook is a predefined set of technical steps, procedures, or documentation that is usually executed manually by a systems engineer. A runbook can also contain information related to application deployment, monitoring, and maintenance. SOPs (standard operating procedures), by contrast, describe the steps required to complete specific activities or tasks. They can be used to ensure that an organization follows industry rules and regulations.

    Playbooks versus runbooks

    A runbook is a step-by-step procedure that helps ensure the technical aspects of an organization's systems continue to function smoothly. A playbook is more general: it outlines an organization's approach to a task and the responsibilities of its workers. While both a runbook and a playbook include information on technical aspects, a playbook will likely go into greater detail about the cultural, compliance, or user experience aspects of a task.

    Conclusion

    With the right amount of automation and strategic process management, you can improve incident remediation instructions and keep runbooks updated in a timely manner. This ensures that when the next incident occurs, the documentation is current and available to the right person at the right time.

    Written By:
    Shreyash Naithani
    November 11, 2019
    SRE
    Incident Response
    Incident Management