What can SREs do to make holiday season’s peak traffic less chaotic?

Dec 3, 2021

Last Updated:

Dec 3, 2021

Holiday season's peak traffic is the most challenging period for SREs and on-call engineers. In this blog, we have highlighted the things that SREs can do to make the holiday season less chaotic.

The recently concluded Black Friday weekend may well have been the most challenging shift of the year for on-call engineers working in the Retail or E-Commerce sector. Since such peak-traffic events push systems to their limits, engineering teams are under a lot of pressure while preparing for them.

This is because the holiday season, globally and especially in the US, is a busy period for shopping enthusiasts, and this excitement brings with it a lot of website traffic. These days, eager customers visit websites more actively than they visit local stores, in part due to the pandemic.

Online retail sales in the US are about $1.4 billion on a normal day. However, on peak traffic days like Black Friday, sales are roughly four to five times that amount or more. On Black Friday 2018, U.S. online sales totalled $6.22 billion, and on Cyber Monday 2018, sales surged to $7.9 billion, the biggest online sales day in the US up to that point.

Such increased web traffic means the load will hit the systems hard, which in turn means pagers buzzing, alert notifications flying, grumpy stakeholders, unhappy customers, and much more. This is the worst scenario for businesses because, just when you should be making more money, you are actually losing customers and brand value.

Whether your servers have crashed because of increased transactions per second or because the page load time increased 3x, failed transactions could mean losses of thousands of dollars for every second of downtime. Downtime costs per minute are roughly $220K at Amazon and around $40K at Walmart, making outages scary and expensive.

The role of SRE / Infrastructure teams

Ask any engineer working in E-Commerce or Retail, and they will talk about the ‘capacity planning horror show’ they typically face during such peak seasons with systems firing alerts all over the place. But it doesn’t need to be this way. This blog by Google Cloud talks about how teams can prepare early, perform testing, and leverage war rooms to quickly overcome downtime during peak season.

Adopting best practices and converting these learnings into action items will not only help on-call engineers / SREs enjoy a chaos-free holiday break, but it will also help them understand a thing or two about their customers and how systems respond to a periodic increase in footfall.

For example, if your systems received 1,000 qps (queries per second) during the peak hours of last year's Black Friday, and your business has grown by about 20% since then, you need to ensure your systems can handle roughly 10%-30% more qps this Black Friday, that is, about 1,100 to 1,300 qps.
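The sizing arithmetic in this example can be sketched in a few lines of Python. The function name `peak_qps_range` and the 10-point margin on either side of the growth estimate are illustrative assumptions, not part of any tool mentioned in this post.

```python
def peak_qps_range(last_peak_qps: float, growth: float, margin: float = 0.10) -> tuple[float, float]:
    """Return (low, high) expected peak qps for this year.

    growth and margin are fractions (0.20 means 20%). The margin brackets the
    growth estimate on both sides; 0.10 is an assumed default, not a standard.
    """
    low = last_peak_qps * (1 + growth - margin)
    high = last_peak_qps * (1 + growth + margin)
    return low, high


# Last year's peak was 1,000 qps and the business grew by about 20%.
low, high = peak_qps_range(1000, 0.20)
print(f"Plan capacity for roughly {low:.0f} to {high:.0f} qps")
```

In practice you would likely add headroom on top of the upper bound, since a sale that goes better than expected is exactly when you do not want to run out of capacity.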

So what can teams do to make the holiday season less chaotic?

Learn from past incidents
Load testing and Performance testing
Observability: Monitoring, Logging, Tracing, etc.
SLO-based alerting: Revisit SLOs and plan releases keeping in mind peak season traffic
Other best practices: SRE Automation, On-call, Alerting & Monitoring

Learn from past Incidents

Analyzing postmortems of past incidents will give you a fair idea of the limitations of your infrastructure. You will understand the breaking point of various services and how to tackle outages if they were to happen again.

The ideal way to take away learnings from past incidents is to make a checklist of action items from previous outages and ensure they are addressed this time around.

Load testing and Performance testing to understand system thresholds

A good Site Reliability Engineering practice is routinely performing:

Load tests on the systems to understand the stress levels that they can hold and
Performance tests to understand how the system behaves in normal load conditions.

By definition, Load testing is the process of determining the behavior of a system when multiple users access it at the same time. Observing how a system behaves under extreme load helps us determine its breaking point. It also answers questions such as how long the system can sustain a particular load and what its operating capacity is.

On the other hand, Performance testing measures system attributes such as speed, scalability, reliability, and stability, and how the system adapts to change during normal load conditions. The idea here is to validate whether the system performs efficiently when the load is both below and above its breaking threshold.

Ideally, these should be ongoing practices and not something done a few days prior to a major event, because a one-off test will not give you a complete picture of system behavior under stress. Routine load tests will help SREs be more prepared and assist them in scaling up or scaling out accordingly.
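As a rough illustration of what a routine load test measures, here is a minimal sketch that calls a target from several threads and summarises latency and errors. Everything in it is an assumption for illustration: `run_load_test` and `fake_checkout` are hypothetical names, and `fake_checkout` merely sleeps in place of a real request to a staging endpoint. A dedicated load-testing tool would be the better choice for real traffic volumes.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(target, concurrency: int, requests_per_worker: int) -> dict:
    """Call `target` from `concurrency` threads and summarise latency and errors."""

    def worker(_worker_id):
        latencies, errors = [], 0
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            try:
                target()
            except Exception:
                errors += 1  # count the failure but keep the run going
            latencies.append(time.perf_counter() - start)
        return latencies, errors

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(worker, range(concurrency)))

    all_latencies = sorted(lat for lats, _ in results for lat in lats)
    total_errors = sum(errs for _, errs in results)
    # Simple nearest-rank style estimate of the 95th percentile.
    p95_index = max(0, int(len(all_latencies) * 0.95) - 1)
    return {
        "requests": len(all_latencies),
        "errors": total_errors,
        "p50_s": statistics.median(all_latencies),
        "p95_s": all_latencies[p95_index],
    }


def fake_checkout():
    """Hypothetical stand-in for a real request (e.g. an HTTP call to staging)."""
    time.sleep(0.001)


# Step the simulated user count up and watch where latency or errors start to climb.
for users in (2, 5, 10):
    print(users, run_load_test(fake_checkout, users, 10))
```

Running the same script on a schedule, and comparing the numbers from run to run, is what turns a one-off test into the ongoing practice described above.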

Make Observability truly actionable

‘You can’t fix what you can’t see.’ This well-known saying applies directly to fixing production issues. One of the most important aspects of running systems effectively in production is making them more observable and taking proactive measures when a red flag is raised.

Having a clear view of your system makes early recognition and preemptive solving of problems possible. Getting the right data at the right time with associated context is a game changer for those who want better system stability. For example, if an outage occurs and an on-call engineer gets notified, it is ideal to give them more context on why the outage occurred. By referring to a dashboard with key system metrics recorded, they can debug the issue faster, reduce the duration of the outage, and bring down the overall MTTR.

Different observability tools cover different needs: log aggregation, APM, time series databases, distributed tracing, and metrics collection. The table below gives a better understanding of the different tools and how/when SREs can use them (read more here: https://www.squadcast.com/blog/top-observability-tools-for-devops-engineers-and-sres).

Observability Areas	Tools
Metrics Collection	Prometheus, Datadog, Pingdom, etc.
Visualization	Grafana, Kibana, etc.
Logging	Loggly, Logstash, etc.
Application Performance Monitoring	AppDynamics, New Relic, Dynatrace, etc.
Distributed Tracing	Zipkin, Jaeger, etc.

It is worth noting that ‘Observability’ is often what sets apart ‘competent SRE teams’ from ‘incompetent SRE teams’.

SLO-based Alerting: Revisit SLOs and plan releases keeping in mind peak season traffic

Establishing realistic SLO targets is the most important of all techniques covered in this blog because it directly impacts business goals. They are also the easiest way to measure user experience because they are established on the basis of reliably providing service to end users.

But how can SLOs accurately measure user experience?

Take page load times as an example: if the time taken to load a page exceeds a set threshold more often than is acceptable, we can infer that users are not enjoying their browsing experience. Similarly, if server responses fail more than a certain number of times within a given time span, we can infer that users are not getting the expected service. Such scenarios, which directly indicate an outage or degraded user service, can be measured numerically by SLOs.
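To make the page-load example concrete, here is a minimal sketch of turning raw load times into a service level indicator (SLI) and comparing it against an SLO target. The function names, the 500 ms threshold, the 95% target, and the sample data are all assumptions chosen for illustration.

```python
def latency_sli(load_times_ms: list[float], threshold_ms: float) -> float:
    """Fraction of page loads that finished within the threshold (the SLI)."""
    if not load_times_ms:
        return 1.0  # no traffic means nothing violated the threshold
    good = sum(1 for t in load_times_ms if t <= threshold_ms)
    return good / len(load_times_ms)


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Made-up page load times in milliseconds; two of the ten are slower than 500 ms.
samples = [120, 180, 250, 900, 210, 3100, 190, 230, 205, 175]
sli = latency_sli(samples, threshold_ms=500)
print(f"SLI = {sli:.0%}, meets 95% SLO: {meets_slo(sli, 0.95)}")
```

The same shape works for the server-response example: count successful responses instead of fast page loads, and the ratio becomes an availability SLI.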

Thus, using alerts to keep track of SLOs and of the remaining error budget is an effective way to identify system limitations and fix them preemptively, before customers or stakeholders start complaining about downtime. This is where Observability tools come into the picture: they provide the numerical representation of good/bad UX on which SLO-based alerting depends.
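One common way to alert on SLOs is to watch how quickly the error budget is being spent, often called the burn rate. The sketch below is a simplified, single-window version under stated assumptions: the function names are hypothetical, and the paging threshold of 10 is an arbitrary example value that each team would tune for itself.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def burn_rate(slo_target: float, window_requests: int, window_failures: int) -> float:
    """Observed error rate divided by the allowed error rate.

    1.0 means the budget is being spent exactly as fast as the SLO allows.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window_requests == 0:
        return 0.0
    return (window_failures / window_requests) / (1 - slo_target)


def should_page(rate: float, threshold: float = 10.0) -> bool:
    """Page a human only when the budget is burning much faster than planned."""
    return rate >= threshold


# 99.9% availability SLO; 150 failures out of 10,000 requests in the last window.
rate = burn_rate(0.999, 10_000, 150)
print(f"burn rate ~ {rate:.1f}, page on-call: {should_page(rate)}")
```

Alerting on burn rate rather than on every individual failure keeps pages tied to user impact, which matters most when peak traffic makes small error spikes routine.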

Additional Resources: To understand the importance of SLO tracking for SREs, read this blog. This is the link to our open-source SLO Tracker.

Other best practices: SRE Automation, On-call, Alerting & Monitoring

Besides the techniques mentioned above, there are numerous on-call, alerting, and monitoring best practices that can be followed by engineering teams during such peak-traffic events.

The spike in traffic will inevitably lead to incidents and outages or, at the very least, impact some services. This is bound to create chaos and increase stress for on-call engineers. During such times, engineering teams should take the right measures to prevent on-call burnout. Encouraging on-call engineers to take vacation either before or after the peak season is one way teams can compensate for the extreme stress levels experienced during on-call. Read more on ‘How to avoid on-call burnout’.

Other best practices to keep in mind include techniques such as ‘Alert-as-Code’ and configuring proactive alert checks that monitor system health. Read more on this topic here: ‘Best practices in Incident Management’.

Automate, Automate, Automate...

Ensure that automation rules are properly configured so that alerts reach the right responders. Teams can also leverage automation to prioritize responding to high-severity incidents, and to send alerts only for relevant and actionable events even if other events are monitored. Deduplication and Suppression rules can be configured to group similar alerts together and prevent unimportant alerts from being reported.
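The idea behind deduplication, suppression, and severity-based prioritization can be sketched as follows. This is not how Squadcast or any particular platform implements these rules: the `Alert` fields, the `reduce_noise` function, the 5-minute window, and the choice to suppress `info` alerts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str     # "critical", "warning" or "info"
    timestamp: float  # seconds since some reference point


def reduce_noise(alerts, dedup_window_s=300, suppressed_severities=frozenset({"info"})):
    """Drop suppressed alerts, collapse repeats, and put critical alerts first."""
    last_seen = {}  # (service, check) -> timestamp of the most recent occurrence
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.severity in suppressed_severities:
            continue  # suppression: monitored, but never notified
        key = (alert.service, alert.check)
        previous = last_seen.get(key)
        last_seen[key] = alert.timestamp
        if previous is not None and alert.timestamp - previous < dedup_window_s:
            continue  # deduplication: a repeat of a recent alert for the same check
        kept.append(alert)
    # Prioritization: critical first, then warnings, then anything else.
    order = {"critical": 0, "warning": 1}
    return sorted(kept, key=lambda a: (order.get(a.severity, 2), a.timestamp))


alerts = [
    Alert("checkout", "latency", "warning", 0),
    Alert("cart", "disk", "info", 10),
    Alert("payments", "5xx", "critical", 30),
    Alert("checkout", "latency", "warning", 60),    # repeat within 5 minutes
    Alert("checkout", "latency", "warning", 1000),  # same check, much later
]
for alert in reduce_noise(alerts):
    print(alert.severity, alert.service, alert.check, alert.timestamp)
```

Because the window is measured from the most recent occurrence, a check that keeps firing every minute stays collapsed into one notification until it has been quiet for the full window.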

Conclusion

These are a few ways for SRE teams to prepare for peak traffic events. Be it Black Friday or some random stock clearance sale, the website load should not break the system. And if the system does break, this article can be a good reference point for SREs / on-call engineers to prevent such issues from occurring again.

If you lead a team of SREs, or if Site Reliability Engineering as a function interests you, then you should attend this webinar: Reliability Reimagined: How SREs Spearhead Competitive CX. Register for free to hear from a Sr. SRE at Google, the VP of Engineering at Squadcast, and other panelists.

What you should do now

Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
Enjoyed the article? Explore further insights on the best SRE practices.


Written By:

Vardhan NS
