The holiday season's peak traffic is the most challenging period for SREs and on-call engineers. In this blog, we highlight what SREs can do to make the holiday season less chaotic.
The recently concluded Black Friday weekend was likely the most challenging shift of the year for on-call engineers in the retail and e-commerce sector. Peak-traffic events push systems to their limits, and engineering teams spend the weeks beforehand under considerable pressure preparing for them.
The holiday season, globally and especially in the US, is a busy time for shopping enthusiasts, and that excitement brings a surge of website traffic with it. Eager customers are shopping online far more than they are visiting local stores these days, in part due to the pandemic.
Online retail sales in the US are about $1.4 billion on a normal day. However, on peak traffic days like Black Friday, sales are more than 5x that amount. On Black Friday 2018, U.S. online sales totaled $6.22 billion, and on Cyber Monday 2018, sales surged to $7.9 billion, the biggest online sales day up to that point in the US.
Such increased web traffic means the load will hit your systems hard, which in turn means pagers buzzing, alert notifications flying, grumpy stakeholders, and unhappy customers. This is the worst scenario for a business: at exactly the moment you should be making more money, you are losing customers and brand value.
Whether your servers crashed because transactions per second spiked or because page load time tripled, failed transactions can mean losses of thousands of dollars for every second of downtime. Downtime costs per minute are roughly $220K at Amazon and around $40K at Walmart, making outages scary and expensive.
Ask any engineer working in E-Commerce or Retail, and they will talk about the ‘capacity planning horror show’ they typically face during such peak seasons with systems firing alerts all over the place. But it doesn’t need to be this way. This blog by Google Cloud talks about how teams can prepare early, perform testing, and leverage war rooms to quickly overcome downtime during peak season.
Adopting best practices and converting these learnings into action items will not only help on-call engineers and SREs enjoy a chaos-free holiday break, it will also teach them a thing or two about their customers and how their systems respond to periodic spikes in traffic.
For example, if your systems received 1,000 qps (queries per second) during peak hours of last year's Black Friday, and your business has grown by roughly 20% since then, you need to ensure your systems can handle at least 1,200 qps this Black Friday, ideally with an extra 20-30% of headroom on top of that.
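The arithmetic above can be captured in a small helper. This is a hedged sketch: `projected_peak_qps` is a hypothetical function name, and the growth and headroom figures are the illustrative numbers from the example, not universal constants.

```python
# Hypothetical capacity-planning helper: project next year's peak load from
# last year's observed peak plus business growth, with safety headroom on top.
def projected_peak_qps(last_peak_qps: float, growth_rate: float, headroom: float = 0.3) -> float:
    """Return the qps target to load-test against.

    growth_rate and headroom are fractions, e.g. 0.2 for 20% growth.
    """
    return last_peak_qps * (1 + growth_rate) * (1 + headroom)

# Last Black Friday peaked at 1,000 qps; the business grew ~20%; plan 30% headroom.
target = projected_peak_qps(1000, growth_rate=0.20, headroom=0.30)
print(round(target))  # 1560
```

The point of writing it down is that the target becomes explicit and reviewable, rather than a number someone remembers from last year.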
Analyzing postmortems of past incidents will give you a fair idea of the limitations of your infrastructure. You will understand the breaking point of various services and how to tackle outages if they were to happen again.
The ideal way to take away learnings from past incidents is to make a checklist of action items from previous outages and ensure they are addressed this time around.
A good Site Reliability Engineering practice is to routinely perform load testing and performance testing:
Load testing is the process of determining how a system behaves when many users access it at the same time. Observing a system under extreme load reveals its breaking point and answers questions such as whether the system can sustain a given load and what its true operating capacity is.
Performance testing, on the other hand, measures system attributes such as speed, scalability, reliability, and stability, and how the system adapts to change under normal load conditions. The idea is to validate that the system performs efficiently at loads both below and approaching its breaking point.
Ideally, these should be ongoing practices, not something done a few days before a major event, as a last-minute test will not give you a complete picture of system behaviour under stress. Routine load tests help SREs stay prepared and scale up or scale out accordingly.
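To make the idea concrete, here is a minimal load-test harness using only the Python standard library. In practice you would reach for a dedicated tool such as Locust, k6, or JMeter; `request_fn` below is a hypothetical stand-in for an HTTP call against a staging environment.

```python
# Minimal load-test sketch: fire N calls with a given concurrency and
# summarize error rate and tail latency. Stdlib only; illustrative, not a
# replacement for a real load-testing tool.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(request_fn, total_requests: int, concurrency: int):
    """Run total_requests calls with the given concurrency and summarize results."""
    latencies, failures = [], 0

    def one_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            return time.perf_counter() - start, True
        except Exception:
            return time.perf_counter() - start, False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for latency, ok in pool.map(one_call, range(total_requests)):
            latencies.append(latency)
            failures += 0 if ok else 1

    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th-percentile latency
    return {"requests": total_requests,
            "error_rate": failures / total_requests,
            "p95_seconds": p95}

# Example: simulate a backend that takes ~5 ms per request.
report = load_test(lambda: time.sleep(0.005), total_requests=100, concurrency=10)
print(report["requests"], report["error_rate"])  # 100 0.0
```

Running the same harness at increasing concurrency levels is a quick way to see where error rate or p95 latency starts to degrade.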
‘You can’t fix what you can’t see.’ This well-known saying applies squarely to production issues. One of the most important aspects of running systems effectively in production is making them observable and taking proactive measures when a red flag is raised.
Having a clear view of your system makes early recognition and preemptive solving of problems possible. Getting the right data at the right time, with associated context, is a game changer for anyone who wants better system stability. For example, if an outage occurs and an on-call engineer is notified, it is ideal to give them context on why the outage occurred. With a dashboard of key system metrics to refer to, they can debug the issue faster, shorten the outage, and bring down the overall MTTR.
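Attaching that context can be as simple as bundling recent metrics with the page itself. The sketch below assumes hypothetical integration points: `fetch_recent_metrics` stands in for a query to your time-series database, and the dashboard URL is an invented example.

```python
# Hedged sketch: enrich an alert with recent metric context before paging,
# so the on-call responder lands with the "why", not just the "what".
import json

def fetch_recent_metrics(service: str) -> dict:
    # Stand-in: in reality, query your monitoring backend for the last N minutes.
    return {"error_rate": 0.12, "p95_latency_ms": 2300, "cpu_percent": 91}

def build_alert(service: str, summary: str) -> dict:
    metrics = fetch_recent_metrics(service)
    return {
        "service": service,
        "summary": summary,
        "context": metrics,  # key system metrics captured at alert time
        "dashboard": f"https://grafana.example.com/d/{service}",  # hypothetical link
    }

alert = build_alert("checkout", "Error rate above SLO threshold")
print(json.dumps(alert, indent=2))
```

The design choice here is deliberate: the responder should never have to hunt for the dashboard or the triggering metrics at 3 a.m.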
There are different observability tools for monitoring different system signals: log aggregation, APM, time-series databases, distributed tracing, and metrics collection tools. The table below gives a better understanding of the different tools and how and when SREs can use them. (Read more here: https://www.squadcast.com/blog/top-observability-tools-for-devops-engineers-and-sres)
It is worth noting that observability is what truly sets competent SRE teams apart.
Establishing realistic SLO targets is the most important of the techniques covered in this blog because it directly impacts business goals. SLOs are also the most direct way to measure user experience, because they are defined in terms of reliably delivering service to end users.
But how can SLOs accurately measure user experience?
By tracking page load times: when the time taken to load a page exceeds a set threshold, we can infer that users are not enjoying their browsing experience. Similarly, if server responses fail more than a certain number of times within a given time span, we can infer that users are not getting the expected service. Such scenarios, which directly indicate an outage or degraded user service, can be measured numerically by SLOs.
Thus, using alerts to track SLOs and adjust the error budget accordingly is an effective way to identify system limitations and fix them preemptively, before customers or stakeholders start complaining about downtime or outages. This is where observability tools come into the picture: they supply the measurements behind the SLOs, turning good and bad user experience into numbers and validating the need for SLO-based alerting.
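The error-budget arithmetic implied above is simple enough to sketch directly. The function name and figures below are illustrative assumptions: a 99.9% availability SLO over one million requests allows 1,000 failures.

```python
# Minimal error-budget sketch: given an SLO target and request counts,
# how much of the error budget remains after the failures seen so far.
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left (negative means the budget is blown)."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures

# 99.9% availability over 1,000,000 requests allows 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(round(remaining, 2))  # 0.6 (60% of the budget is left)
```

An alerting rule can then fire when the remaining budget drops below some threshold, or when the burn rate suggests the budget will be exhausted before the SLO window ends.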
Besides the techniques mentioned above, there are numerous on-call, alerting, and monitoring best practices that can be followed by engineering teams during such peak-traffic events.
The spike in traffic will inevitably lead to incidents and outages, or at the very least impact some services. This is bound to create chaos and increased stress for on-call engineers. During such times, engineering teams should take the right measures to prevent on-call burnout. Encouraging on-call engineers to take vacation either before or after the peak season is one way teams can compensate for the extreme stress levels experienced on call. Read more in ‘How to avoid on-call burnout’.
Other best practices to keep in mind include techniques such as ‘Alert-as-Code’ and configuring proactive alert checks that monitor system health. Read more on this topic in ‘Best practices in Incident Management’.
Automate, Automate, Automate...
Ensure that automation rules are properly configured to reach the right responders. Teams can also leverage automation to prioritize responding to high-severity incidents, and to send alerts only for relevant, actionable events even while other events continue to be monitored. Deduplication and suppression rules can be configured to group similar alerts together and prevent unimportant alerts from paging anyone.
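Deduplication is conceptually simple: fingerprint each alert by its identifying fields and suppress repeats within a time window. The sketch below is a generic illustration, not any specific tool's API; the field names and the five-minute window are assumptions.

```python
# Sketch of alert deduplication: group alerts by a fingerprint of their
# identifying fields and suppress repeats within a time window.
import hashlib
import time

class Deduplicator:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_seen = {}  # fingerprint -> timestamp of last delivered alert

    def fingerprint(self, alert):
        key = f"{alert['service']}|{alert['check']}|{alert['severity']}"
        return hashlib.sha256(key.encode()).hexdigest()

    def should_notify(self, alert, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        last = self.last_seen.get(fp)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: suppress it
        self.last_seen[fp] = now
        return True

dedup = Deduplicator(window_seconds=300)
alert = {"service": "checkout", "check": "http_5xx", "severity": "critical"}
print(dedup.should_notify(alert, now=0))    # True  (first alert is delivered)
print(dedup.should_notify(alert, now=120))  # False (repeat inside the window)
print(dedup.should_notify(alert, now=600))  # True  (window elapsed, notify again)
```

Suppression rules work the same way, except the decision is based on severity or a maintenance schedule rather than recency.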
These are a few ways SRE teams can prepare for peak traffic events. Be it Black Friday or a random stock-clearance sale, the website load should not break the system. And if the system does break, this article can serve as a good reference point for SREs and on-call engineers to prevent such issues from recurring.
If you lead a team of SREs, or if Site Reliability Engineering as a function interests you, then you must attend this webinar: Reliability Reimagined: How SREs Spearhead Competitive CX. Register for free to hear from Sr. SRE at Google, VP of Engineering at Squadcast and other panelists.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.