Traditional runbooks can become 10x more useful if they are automated or at least made executable (partly, if not fully). Shreyash Naithani from the Microsoft Azure SRE team, author of "Practical Site Reliability Engineering", talks about how to take advantage of runbooks to eliminate toil.
A runbook is a predefined set of steps or procedures, usually executed manually by a systems engineer. For instance, say you want to upgrade an application in production and you have a documented set of steps for doing so. We call this a runbook. It contains procedures to start, stop, supervise, and debug the system.
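To make the idea of an executable runbook concrete, here is a minimal Python sketch that treats a runbook as an ordered list of named steps and stops at the first step that fails. The step names and the `run_runbook` helper are hypothetical, chosen only for illustration; real steps would call your own deployment tooling.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure. Returns a log of results."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            # Later steps are skipped so a human can take over from a known point.
            break
    return log


# Hypothetical upgrade runbook; the lambdas stand in for real commands.
upgrade_steps = [
    ("stop service", lambda: True),
    ("install new version", lambda: False),
    ("start service", lambda: True),
]
upgrade_log = run_runbook(upgrade_steps)
```

Keeping each step as a small function means the same runbook can be run fully automatically or one step at a time by an engineer.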
Recent research shows that 80% of the time spent by engineering teams goes into triaging incidents. Over the past few years, the shift to microservices has brought a steep increase in code-base complexity. Managing and monitoring many microservice endpoints means a large number of checkpoints and alerts. As a result, outages generate too many incidents and engineering teams get buried in operational work. This is where automated and executable runbooks come in: they rescue engineering teams by setting up auto-mitigation or remediation on top of these incidents.
These runbooks can be triggered on the basis of events or logs, and should create incidents only if they need further attention from the engineering team. Broadly speaking, runbooks can be categorised as:
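The event-triggered behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Event` and `RunbookDispatcher` classes and the event kinds are hypothetical and do not correspond to any particular product's API. An incident is recorded only when no runbook exists for the event or the runbook fails to remediate it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    kind: str    # hypothetical event type, e.g. "disk_full"
    source: str  # host or service that emitted the event


@dataclass
class RunbookDispatcher:
    # Maps an event kind to a remediation function that returns True on success.
    runbooks: Dict[str, Callable[[Event], bool]] = field(default_factory=dict)
    incidents: List[str] = field(default_factory=list)

    def register(self, kind: str, runbook: Callable[[Event], bool]) -> None:
        self.runbooks[kind] = runbook

    def handle(self, event: Event) -> str:
        runbook = self.runbooks.get(event.kind)
        if runbook is not None and runbook(event):
            return "remediated"
        # No runbook, or remediation failed: escalate to the engineering team.
        self.incidents.append(f"{event.kind} on {event.source}")
        return "incident"


dispatcher = RunbookDispatcher()
dispatcher.register("disk_full", lambda e: True)      # stand-in for a cleanup script
dispatcher.register("service_down", lambda e: False)  # stand-in for a failed restart
```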
Rundeck:
Rundeck is a web-accessible console for dispatching commands and scripts to your nodes. It can also be used for deployments, ops tasks, and so on. Within Rundeck you can easily create Jobs (triggered by the scheduler or on demand) and dispatch scripts or user-defined commands to selected nodes. In short, Rundeck lets you automate routine or ad hoc tasks by creating runbooks.
Rundeck Features:
Rundeck can integrate with other tools in several ways.
Ansible:
Ansible is a very powerful open source configuration management tool. Ansible uses 'Playbooks' to deploy, manage and configure anything from a single server to multi-server environments. A Playbook is similar to a runbook: it lets you define a set of procedures.
Ansible Features:
Ansible Playbooks are written in YAML and declaratively define your configuration. Let's look at an example playbook that installs Nginx on servers using Ansible.
Squadcast:
Squadcast Runbooks allow you to level up your incident management with the next-generation Reliability Orchestration Engine based on Site Reliability Engineering (SRE). It is designed to host and execute runbook automation in response to operational events or incidents. By using Squadcast Runbooks you can remove toil and repetitive tasks from your system. We have already seen how to create a runbook using Azure Automation; let's see how easily we can create one using Squadcast.
Let's take a scenario: your Squadcast dashboard is alerting that your WebApps servers are using more resources than usual, perhaps due to high CPU load or high traffic. As part of auto-mitigation, we would like automation that checks whether resource utilization on the WebApps servers has crossed a certain threshold, say 65%, and if so adds more WebApps servers to handle the requests for the time being. To achieve this, we can write a simple runbook using Squadcast and execute it from incident tickets manually, or automatically using any scheduler.
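The scaling decision in this scenario can be sketched in Python as below. This is not Squadcast's API: the `servers_to_add` function, the `max_servers` cap, and the assumption that load spreads evenly across servers are all hypothetical, added only to illustrate the 65% threshold check. A real runbook would read utilization from your monitoring system and call your cloud provider to add servers.

```python
import math

CPU_THRESHOLD = 65.0  # percent, the example threshold from the scenario


def servers_to_add(utilization_percent: float,
                   current_servers: int,
                   threshold: float = CPU_THRESHOLD,
                   max_servers: int = 10) -> int:
    """Return how many servers to add so average utilization is at or below the threshold.

    Assumes total load stays constant and is spread evenly across all servers.
    """
    if utilization_percent <= threshold:
        return 0  # below the threshold: nothing to do
    # Total load is current_servers * utilization; find the fleet size that
    # brings the per-server average down to the threshold.
    needed = math.ceil(current_servers * utilization_percent / threshold)
    needed = min(needed, max_servers)  # hypothetical safety cap on fleet size
    return max(needed - current_servers, 0)
```

For example, under these assumptions four servers averaging 80% utilization would need one more server, since 4 × 80 / 65 rounds up to 5.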
Runbooks Support:
Currently, Squadcast Runbooks support the following languages...
Here are some best practices for runbooks:
Conclusion
With the right amount of automation and strategic process management, you can improve incident remediation instructions and ensure runbooks are updated in a timely manner. This ensures that when the next incident occurs, the documentation is up to date and available to the right person at the right time.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.