Being on call is a critical part of the job for many operations and engineering teams that must keep their services reliable and available. It is one of the essential duties that help meet service-level agreements (SLAs). This article describes the most important tenets of on-call activities and provides industry examples of scheduling and performing these activities for a global team of site reliability engineers (SREs).
Summary of key concepts
Someone who is on call is available during a set period of time and ready to quickly respond to production incidents to avoid a breach of SLAs that might critically impact the business. Typically, an SRE team spends at least 25% of its time performing on-call activities; for example, one week per month might be spent on call.
Traditionally, SREs are engineering teams that treat on-call activity as more than just an engineering problem. Keeping up with technical changes, managing work, scheduling, and balancing workloads are some of the biggest challenges. Site reliability engineering is also a culture that any organization needs to inculcate.
The following are the key elements that need to be considered for successful on-call management within SRE teams.
SRE teams are traditionally created to support complex distributed software systems. These systems may be deployed in multiple data centers around the world.
Using a “follow-the-sun” approach
Usually, SRE teams are located in various geographic locations to take advantage of the varied time zones. The SRE team could be set up as shown in the org chart below, with a global SRE director to whom regional SRE managers report. SREs then report to their respective managers; these are the engineers who perform most of the on-call responsibilities.
As long as the team is large enough, on-call shifts can be designed via the “follow-the-sun” model. For example, suppose the company’s headquarters are in Chicago, which uses US Central Time (CT). With a distributed team, the SRE team in the US could be on call from 10 am to 4 pm CT. By that time, the working day in Australia has started, so the Sydney team then goes on call from 8 am to 2 pm Sydney time (4 pm to 10 pm CT). The Singapore team would then be the third in the rotation, picking up on-call duty from 11 am to 5 pm Singapore time (10 pm to 4 am CT). Finally, the last shift in the on-call cycle goes to the London team, which picks up the on-call responsibilities from the Singapore team at 9 am London time and continues to 3 pm London time (4 am to 10 am CT), handing the shift back to the US team at 10 am CT.
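The on-call rotation described above is summarized in the following table.

| Team | Shift in local time | Shift in US Central Time |
| --- | --- | --- |
| US (Chicago) | 10 am to 4 pm | 10 am to 4 pm |
| Sydney | 8 am to 2 pm | 4 pm to 10 pm |
| Singapore | 11 am to 5 pm | 10 pm to 4 am |
| London | 9 am to 3 pm | 4 am to 10 am |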
At the end of each shift, a “handover” process is necessary, in which information about on-call and other important issues is communicated to the team taking over on-call duties. This cycle is repeated for five consecutive working days. For the weekend, assuming less traffic than on the weekdays, SRE on-call staff could be reduced to one person per shift. This individual should be compensated with an additional day off in the following week or, if they prefer, monetary compensation.
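The weekday rotation described above can be sketched as a simple lookup. This is a minimal illustration rather than a scheduling tool: the shift windows are taken from the example and expressed as hours on the headquarters clock, while the names `SHIFTS`, `on_call_region`, and `coverage_gaps` are hypothetical.

```python
# Shift windows from the example above, as hours on the headquarters
# clock (US Central Time). Each window is half-open: [start, end).
SHIFTS = [
    ("US", 10, 16),
    ("Sydney", 16, 22),
    ("Singapore", 22, 4),  # wraps past midnight
    ("London", 4, 10),
]


def _covers(start, end, hour):
    """Return True if `hour` falls inside the half-open window."""
    if start < end:
        return start <= hour < end
    # The window wraps past midnight (for example 22 -> 4).
    return hour >= start or hour < end


def on_call_region(hq_hour):
    """Return the region on call at the given headquarters hour (0-23)."""
    if not 0 <= hq_hour <= 23:
        raise ValueError("hour must be in the range 0-23")
    for region, start, end in SHIFTS:
        if _covers(start, end, hq_hour):
            return region
    raise LookupError(f"no region covers hour {hq_hour}")


def coverage_gaps():
    """List the hours that are not covered by exactly one region."""
    gaps = []
    for hour in range(24):
        owners = [r for r, s, e in SHIFTS if _covers(s, e, hour)]
        if len(owners) != 1:
            gaps.append(hour)
    return gaps
```

Running `coverage_gaps()` before publishing a schedule is a cheap way to confirm that the four shifts tile the full 24 hours with no gap or overlap. Local shift times would still need adjusting when daylight saving changes in any of the regions.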
While the above is true for most SRE teams, some organizations are structured as having a disproportionately large centralized office augmented with very small satellite teams. In that scenario, if the responsibilities are split across different regions, the small satellite teams may feel overworked and marginalized and eventually become demoralized, leading to high turnover. Having full ownership of responsibility at a single location is then considered worth the cost of having to handle on-call issues outside of normal working hours.
Scheduling a rotation using a single-region team
If the organization does not have a multi-region team, the on-call schedule could be designed by splitting the year into quarters: January to March, April to June, July to September, and October to December. The existing team should be divided into groups, each assigned to one of these quarters, with the overnight work shifting to the next group every three months.
Being on call overnight for a few days every week is not healthy. A schedule that cycles every three months instead of every few days better supports the human sleep cycle and is less taxing on people’s schedules.
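The quarterly rotation can be expressed in a few lines. The sketch below assumes the team has been split into an ordered list of groups that take the overnight duty in turn; the function names and the group labels are hypothetical.

```python
def quarter_of(month):
    """Map a calendar month (1-12) to its quarter (1-4)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in the range 1-12")
    return (month - 1) // 3 + 1


def overnight_group(month, groups):
    """Return the group covering overnight work in the given month.

    Groups take turns in list order, one quarter each. With fewer than
    four groups, the list simply cycles.
    """
    if not groups:
        raise ValueError("at least one group is required")
    return groups[(quarter_of(month) - 1) % len(groups)]
```

For example, with groups `["A", "B", "C", "D"]`, group `"A"` covers January to March and group `"D"` covers October to December.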
Vacation and PTO management
Since the SRE role is critical to ensuring the entire platform’s availability and reliability, it is imperative to manage the personal time off (PTO) schedule. The team should be large enough that people who are not on call can cover for those who are absent, and on-call support must be prioritized over development work.
Suppose there are five SREs reporting to the regional SRE manager in North America (SRE MGR NA). Two SREs will be on call during the week, following the usual cycle, as described in the section above. The other three SREs will be doing development work. In case of an emergency, a development SRE will swap with an on-call SRE. To compensate the dev SRE for doing on-call work, monetary compensation should be provided or their on-call rotation in the subsequent shift altered.
Each geographic location will have its own local holidays, such as Diwali in India or Thanksgiving in the USA. During these times, SRE teams globally should be able to swap their shifts. Globally common holidays, such as New Year’s Day, should be treated as a weekend on-call rotation, i.e., with limited personnel support and relaxed pager response time.
At the start of each shift, the on-call engineer receives a summary of the main incidents, things to observe, and any existing issues that need to be resolved as a part of the handover from the previous shift. The SRE then prepares for the on-call session, opening all the dashboards, the command line terminal, and monitoring consoles along with the ticket queue.
The following duties encompass an on-call shift:
SRE and developer teams determine SLIs and create alerts based on metrics. In addition to the metrics, event-based monitoring systems can also be configured to alert based on events. For example, suppose the SREs (during their development time when not on call) and the engineering team decide to use the metric cassandra_threadpools_activetasks as an SLI to monitor the health of the production Cassandra cluster. In this case, the SRE can set up the alert in a Prometheus alerting rules YAML file. In the snippet below, the highlighted annotation can be used to publish alerts. This annotation can be used to integrate with a modern incident response management platform.
Once the alert condition is met, Prometheus Alertmanager routes the alert to the incident response management platform. The on-call SRE must drill down into the metrics, check the number of active tasks in the dashboard, and determine what is creating the high number of tasks. Once the cause is determined, remedial actions must be performed.
Another example is event-based monitoring: if a host or a cluster is not reachable, a ping failure alert is triggered. The SRE should investigate the issue and escalate the alert to the networking team if it’s determined that it’s an issue with the network. This process can be integrated with a modern incident response management platform to diagnose and take remedial actions using runbook automation features.
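To make the shape of such an alert concrete, the sketch below builds the structure of a Prometheus alerting rule for the cassandra_threadpools_activetasks metric as a Python dictionary. It is an illustration under stated assumptions, not the article's own snippet: the group name, alert name, threshold, duration, severity label, and summary annotation are all hypothetical, and the result is printed as JSON only to avoid a third-party YAML dependency. In practice the same structure would normally be written by hand in a YAML rules file.

```python
import json


def build_alert_rule(threshold, duration="5m"):
    """Build a Prometheus-style alerting rule group as a plain dict.

    The threshold and duration are hypothetical values; choose them
    from the SLI agreed between the SRE and engineering teams.
    """
    return {
        "groups": [
            {
                "name": "cassandra-health",  # hypothetical group name
                "rules": [
                    {
                        "alert": "CassandraHighActiveTasks",  # hypothetical
                        "expr": f"cassandra_threadpools_activetasks > {threshold}",
                        # Fire only if the condition holds for this long.
                        "for": duration,
                        "labels": {"severity": "critical"},
                        "annotations": {
                            # Hypothetical annotation text for responders.
                            "summary": "High number of active Cassandra tasks",
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_alert_rule(threshold=100), indent=2))
```

Keeping the threshold and duration as parameters makes it easy to review them in one place when the SLI is revised.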
These alerting systems should be integrated globally with a ticketing system such as Jira, a ticket management platform offered by Atlassian. Each alert should automatically create a ticket. Once the alert is actioned, the ticket must be updated by the SRE and all other stakeholders who took appropriate actions on the alert.
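As a rough illustration of turning an alert into a ticket, the sketch below maps an alert to the JSON body used when creating an issue through Jira's REST API. It only builds the payload and does not send it. The alert shape (a dict with `labels` and `annotations`), the project key `SRE`, and the issue type `Task` are assumptions that would need to match the actual alerting system and Jira project.

```python
def ticket_payload_from_alert(alert, project_key="SRE"):
    """Build a Jira create-issue payload from an alert dictionary.

    `alert` is assumed to carry `labels` and `annotations` dicts.
    The project key and issue type are hypothetical defaults.
    """
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    name = labels.get("alertname", "UnknownAlert")
    severity = labels.get("severity", "unknown")
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": f"[{severity}] {name}",
            "description": annotations.get("summary", "No summary provided."),
            "issuetype": {"name": "Task"},
        }
    }
```

A webhook receiver could call this function for each incoming alert and submit the result to the ticketing system, so that no alert is handled without a corresponding ticket.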
At the beginning of the on-call period, the SRE should be ready with a terminal console to use ssh or any other organization-provided CLI tools. The on-call SRE might be engaged by the customer support or engineering team for help on a technical issue. For example, suppose an action taken by a user on the platform (say, clicking the checkout button on the cart) generates a unique request ID. This request travels through multiple components in the distributed system (a load balancing service, compute service, database service, etc.). A request ID may be provided to the SRE, who is expected to present details on the lifecycle of that request ID. Examples of this information might include which components and instance machines logged the request, the time taken by each component to process it, and how the request traveled through the critical path.
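The request lifecycle described above can be reconstructed from structured log records. The following is a minimal sketch that assumes each service logs a record with the hypothetical fields `request_id`, `component`, `instance`, `start_ms`, and `end_ms`; real log formats will differ.

```python
# Hypothetical log records collected from several services.
RECORDS = [
    {"request_id": "req-1", "component": "database", "instance": "db-2",
     "start_ms": 30, "end_ms": 75},
    {"request_id": "req-1", "component": "load-balancer", "instance": "lb-1",
     "start_ms": 0, "end_ms": 5},
    {"request_id": "req-2", "component": "compute", "instance": "app-9",
     "start_ms": 1, "end_ms": 2},
    {"request_id": "req-1", "component": "compute", "instance": "app-3",
     "start_ms": 5, "end_ms": 30},
]


def trace_request(records, request_id):
    """Summarize the path and timings of one request across components."""
    # Keep only this request's records, ordered by when each hop started.
    hops = sorted(
        (r for r in records if r["request_id"] == request_id),
        key=lambda r: r["start_ms"],
    )
    time_ms = {}
    for hop in hops:
        spent = hop["end_ms"] - hop["start_ms"]
        # Sum the time if a component handled the request more than once.
        time_ms[hop["component"]] = time_ms.get(hop["component"], 0) + spent
    return {
        "path": [hop["component"] for hop in hops],
        "instances": [hop["instance"] for hop in hops],
        "time_ms": time_ms,
    }
```

Calling `trace_request(RECORDS, "req-1")` yields the path load-balancer, compute, database along with the time spent in each. An empty path for a request ID means no service logged it, which is a signal to look for problems elsewhere.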
In the case where the issue isn’t apparent (for example, if the request ID mentioned above wasn’t logged by any service), an SRE might be required to investigate whether there are any issues with the network. To rule this out, the SRE might capture packets using Wireshark or tcpdump (open source packet analysis software) to see whether or not there are any network-related issues. This activity could be time-consuming, and the SRE might leverage the network team’s support and have them analyze the packets.
The collective knowledge obtained from the diverse backgrounds of SREs in a particular team will certainly be helpful in troubleshooting such issues. All these steps should be documented and used later to train new SREs as part of the onboarding process.
During production hours, the on-call SRE should own the deployment process. If something goes wrong, the SRE should be in a position to address the issues and roll back or roll forward the changes. SREs should perform only emergency deployments that impact production and should know enough to help the deployment and development teams avoid any production impact. Deployment practices should be well documented and tightly coupled with the change management process.
SREs actively monitor tickets being routed in their queues that are waiting to be actioned. These tickets are either escalated by other teams or generated by the alerting software and are a lower priority than the tickets generated by active production issues.
It is a standard practice that each ticket must have a comment about SRE actions that have taken place. If no action was taken, the tickets must be escalated to the relevant teams. Ideally, the queue should be empty during the handover from one SRE team to another.
Issues that monitoring did not anticipate have the greatest impact on SREs who are on call. If the monitoring and alerting software doesn’t alert the SRE, or if there is an unknown issue that only comes to light when the customer reports it, the on-call SRE must use all their knowledge and know-how to quickly fix the issue. The SRE should also engage the appropriate development teams to assist.
Once the issue is resolved, a ticket containing detailed documentation of the incident, a list of all the steps taken to resolve the issue, and a description of what could be automated and alerted on should be presented to all relevant stakeholders. The development work arising from this ticket should be prioritized over all other development work.
To ensure a smooth handover of the on-call responsibilities to the next personnel, certain conventions must be followed during each transition of SRE responsibilities, including handing a summary packet to the person who is next on call. All the tickets in the shift must be documented and updated with the resolution steps, clarification questions from the development team, and any other comments and remarks from the SRE personnel. These tickets should be classified based on their impact on the platform’s availability and reliability. Any production-impacting tickets should be highlighted and tagged, especially for the post-mortem. Once the on-call SRE compiles this list of tickets and classifies them, it should be posted to a common handover communication channel. Depending on the team’s choice, this could be a separate channel in Slack or Microsoft Teams, or the collaboration tools available in incident response tools such as Squadcast. This summary should be available to the entire SRE organization.
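The handover summary described above lends itself to light automation. The sketch below groups a shift's tickets by impact and tags production-impacting ones for the post-mortem. The ticket fields (`id`, `impact`, `resolution`) and the three impact levels are hypothetical, and posting the result to a chat channel is left out.

```python
# Hypothetical impact levels, listed from most to least severe.
IMPACT_ORDER = ["production", "degraded", "none"]


def build_handover_summary(tickets):
    """Return a plain-text summary with the most severe tickets first."""
    lines = []
    for impact in IMPACT_ORDER:
        matching = [t for t in tickets if t["impact"] == impact]
        if not matching:
            continue
        lines.append(f"{impact.upper()} ({len(matching)})")
        for ticket in matching:
            # Production-impacting tickets are tagged for the post-mortem.
            tag = " [post-mortem]" if impact == "production" else ""
            lines.append(f"  {ticket['id']}: {ticket['resolution']}{tag}")
    return "\n".join(lines)
```

Generating the summary from ticket data, rather than writing it by hand, helps keep the handover consistent from one shift to the next.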
The post-mortem is a weekly meeting in which all the engineering leads and SREs participate. The following flowchart summarizes the scope of post-mortem meetings.
The most important outcome of the post-mortem is ensuring that the issues discussed never recur. This is not straightforward to achieve; while the remediation steps could be as simple as writing a script or adding additional checks in the code, they could also be as complicated as requiring the entire application to be redesigned. Both SRE and development teams should work closely to get as close as possible to the goal of avoiding repeated problems.
Designing an escalation plan
Whenever an issue occurs, a ticket must be created that contains all relevant documentation, action items, and feature requests. Much of the time, it is not immediately clear which team will handle a particular ticket. Squadcast has tagging and routing features that automate this process, which helps address this challenge. Initially, a ticket might start off as a customer service ticket; depending on the severity and kind of incident that the ticket represents, it might be escalated to SRE or engineering teams. Once the appropriate team acts on the ticket, it will then be routed back to customer support. Depending on the resolution, it could either be communicated to the customer or simply be documented and closed. Additionally, this ticket will be part of the handoff process and discussed in the post-mortem for further analysis. The best practice is to classify and assign the issue as accurately as current knowledge allows, which reduces alert fatigue.
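A routing rule of this kind might look like the following sketch. The severity scale (1 is the most severe, 4 the least), the incident kinds, and the team names are all hypothetical and stand in for whatever classification an organization actually uses.

```python
def route_ticket(severity, kind):
    """Pick the team that should act on a ticket.

    Hypothetical rules: low-severity tickets (3 or 4) stay with customer
    support; severe ones go to SRE or engineering depending on the kind.
    """
    if severity >= 3:
        return "customer-support"
    if kind == "infrastructure":
        return "sre"
    if kind == "application":
        return "engineering"
    return "customer-support"


def ticket_path(severity, kind):
    """Return the sequence of teams a ticket passes through."""
    owner = route_ticket(severity, kind)
    if owner == "customer-support":
        return ["customer-support"]
    # Escalated tickets return to customer support once acted on.
    return ["customer-support", owner, "customer-support"]
```

Writing the rules down as code, even in this simplified form, makes the escalation path reviewable and testable instead of leaving it to individual judgment during an incident.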
Optimizing the pager load
Members of engineering teams that are on call often get paged. Pages must be targeted to the appropriate teams, and paging frequency must be kept to a minimum. The customer support team is usually the first point of contact in case of an issue that is page-worthy. From then on, multiple investigation and escalation steps need to be followed, as discussed in the section above.
Every alert must have a documented plan and steps for resolution. For example, in the case of customer-reported issues, the SRE should see whether there are any ongoing alerts and see if those alerts somehow correlate to the customer issue.
There are four possible outcomes of customer-reported events, as shown in the table below.
The goal should be to only page the SRE for true positive issues. In situations where it is not possible to determine which team to page, a central communication channel must be opened where SRE, engineering, customer support, and other team members can join in, discuss, and resolve the issue.
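One plausible reading of the four outcomes, assumed here because it is the standard classification, is the combination of two yes/no questions: whether the reported issue is confirmed as real, and whether a related alert fired. The sketch below encodes that assumption; the function names are hypothetical.

```python
def classify_report(issue_confirmed, alert_fired):
    """Classify a customer-reported event into one of four outcomes.

    Assumes the standard true/false positive/negative scheme.
    """
    if issue_confirmed and alert_fired:
        return "true positive"
    if alert_fired:
        return "false positive"
    if issue_confirmed:
        return "false negative"
    return "true negative"


def should_page_sre(issue_confirmed, alert_fired):
    """Page the SRE only for true positives, per the stated goal."""
    return classify_report(issue_confirmed, alert_fired) == "true positive"
```

Under this reading, a false negative (a real issue with no alert) is not a reason to page by default but is a strong signal that monitoring coverage needs to be extended.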
Unlike a cheat sheet that summarizes all commands for a given platform (such as Kubernetes or Elasticsearch), a runbook is a summary of the specific steps, including commands, that an SRE requires to respond to a particular incident. Time is a critical factor when solving problems, and solutions can be implemented much more quickly if the engineer on call is knowledgeable and has a checklist of action items and commands handy.
A single runbook could be managed centrally as a team, or individuals may want to have their own. The following is a sample runbook.
It could be argued that many of the things in a runbook could be automated and performed using Jenkins or implemented as other CI jobs. That is the ideal scenario, but not all SRE teams are mature enough to implement everything correctly. A good runbook could also be a useful guide to creating automation. There is never a time when an engineer doesn’t need to rely on the command line, so whatever needs to be typed into a terminal repeatedly is an ideal candidate for a runbook.
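As a step toward that automation, a runbook can be kept as structured data and rendered into a checklist. The sketch below is illustrative only: the incident title and the choice of steps are hypothetical, and the placeholders in angle brackets must be filled in for a real environment.

```python
# A hypothetical runbook for a hypothetical incident type.
RUNBOOK = {
    "title": "Service not responding",
    "steps": [
        ("Check pod status", "kubectl get pods -n <namespace>"),
        ("Inspect recent logs", "kubectl logs <pod-name> --tail=100"),
        ("Restart the deployment",
         "kubectl rollout restart deployment/<name> -n <namespace>"),
    ],
}


def render_checklist(runbook):
    """Render a runbook as a numbered plain-text checklist."""
    lines = [runbook["title"]]
    for number, (action, command) in enumerate(runbook["steps"], start=1):
        lines.append(f"{number}. [ ] {action}: {command}")
    return "\n".join(lines)
```

Because each step pairs a description with a command, the same data could later feed a CI job that runs the commands directly.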
Any engineering team that needs to introduce new features should get SRE approval for them; anything that goes into production must have SRE approval. Each change should be discussed as thoroughly as possible with regard to dependencies, load testing, capacity planning, and disaster recovery.
Necessary development tickets should be created for monitoring and alerting required by the SRE, and both the platform feature and its corresponding alerting should hit production in tandem. Ideally, the changes should be initially deployed in A/B or canary fashion. A company-wide platform should be used to create, execute, schedule, and manage changes.
When someone from the engineering team creates a change request (CR), it is sent to higher management and the SRE team for approval. The person who creates a CR must attach to it all the documentation of the impact the change might have on production systems, along with related tickets. Once both SRE and managerial approvals are granted, the CR is scheduled to be executed at the opener’s requested time.
Once the CR is executed, functionality and platform checkouts must be performed to ensure the stability of the platform. If something breaks, it should be immediately rolled back by the SRE. Upon successful execution, it should be documented and closed on the central company-wide change management platform.
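The CR lifecycle described above can be modeled as a small state machine. This is a sketch under assumptions: the state names (`open`, `scheduled`, `closed`, `rolled_back`) and the class itself are hypothetical, and a real change management platform will track far more detail.

```python
class ChangeRequest:
    """Minimal model of a change request (CR) and its approvals."""

    REQUIRED_APPROVERS = {"sre", "management"}

    def __init__(self, title, impact_doc):
        # A CR must carry documentation of its production impact.
        if not impact_doc:
            raise ValueError("impact documentation is required")
        self.title = title
        self.impact_doc = impact_doc
        self.approvals = set()
        self.state = "open"

    def approve(self, role):
        """Record an approval; schedule the CR once both are present."""
        if role not in self.REQUIRED_APPROVERS:
            raise ValueError(f"unknown approver role: {role}")
        if self.state != "open":
            raise RuntimeError("only an open CR can be approved")
        self.approvals.add(role)
        if self.approvals == self.REQUIRED_APPROVERS:
            self.state = "scheduled"

    def execute(self, checks_passed):
        """Execute the CR; roll back if post-change checks fail."""
        if self.state != "scheduled":
            raise RuntimeError("CR needs SRE and management approval first")
        self.state = "closed" if checks_passed else "rolled_back"
        return self.state
```

For example, a CR approved only by the SRE team stays `open` and cannot be executed, while one with both approvals moves to `scheduled` and ends as either `closed` or `rolled_back` depending on the post-change checks.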
Training and documentation
It is necessary to have proper documentation regarding SRE tools and services; this helps not only with onboarding new engineers but also upskilling existing staff. It requires a significant effort to maintain and keep documentation up to date, but it is worthwhile. A proficient SRE should be aware of all the distributed software systems and at least know the following:
- The architecture and block diagram of the entire platform
- How the distributed software components communicate with each other
- Software component interdependencies
- Requirements for the startup and shutdown of individual services
- How each application and software component caches its data and how to refresh the cache
- Provisioned infrastructure both locally and in the cloud
Once the basics mentioned above are understood, the next step is to learn the core SRE services like monitoring, automation, and development. If the organization uses any open-source monitoring platforms, their installation, customization, and deployment documentation should be read, and appropriate notes for performing those tasks should be documented.
Once all the knowledge is acquired, it is a good idea to put these newly acquired skills to the test in internal non-production environments. This also provides an opportunity to address lapses in the documentation, if any.
After acquiring the technical know-how, the SRE in training should be allowed to shadow on-call sessions with more senior members of the team. During the shadowing process, the new SRE sits alongside the primary on-call engineer to learn about the on-call process in practice. As SREs progress in learning, they should be given more and more responsibilities until they can perform on-call duties independently.
There are many considerations involved in designing a successful and sustainable on-call rotation plan. They range from organizing schedules and planning shift transitions to designing an escalation process and conducting post-mortem analysis meetings after each incident. Squadcast is designed to organize, implement, measure, and automate the activities involved in on-call rotations and other typical site reliability engineering tasks.