What can SREs do to make holiday season’s peak traffic less chaotic?

December 3, 2021

The recently concluded Black Friday weekend was potentially the most challenging shift of the year for on-call engineers working in the Retail or E-Commerce sector. Because such peak-traffic events push systems to their limits, engineering teams are under a lot of pressure while preparing for them.

This is because the holiday season, globally and especially in the US, is a busy period for shopping enthusiasts, and that excitement brings a lot of website traffic. Eager customers are visiting websites more than local stores these days, in part due to the pandemic.

Online retail sales in the US are about $1.4 billion on a normal day. On peak traffic days like Black Friday, however, sales are roughly four to six times that amount. On Black Friday 2018, U.S. online sales totalled $6.22 billion, and on Cyber Monday 2018, sales surged to $7.9 billion, the biggest online sales day in the US up to that point.

Such increased web traffic means the load will hit the systems hard, which in turn means pagers buzzing, alert notifications flying, grumpy stakeholders, unhappy customers, and much more. This is the worst scenario for businesses: at the very moment you should be making more money, you are losing customers and brand value.

Whether your servers have crashed because of increased transactions per second or the page load time has increased 3x, failed transactions could mean losses of thousands of dollars for every second of downtime. Downtime costs per minute are roughly $220K at Amazon and around $40K at Walmart, making outages scary and expensive.

The role of SRE / Infrastructure teams

Ask any engineer working in E-Commerce or Retail, and they will talk about the ‘capacity planning horror show’ they typically face during such peak seasons, with systems firing alerts all over the place. But it doesn’t need to be this way. This blog by Google Cloud describes how teams can prepare early, perform testing, and leverage war rooms to quickly overcome downtime during peak season.

Adopting best practices and converting these learnings into action items will not only help on-call engineers / SREs enjoy a chaos-free holiday break, but it will also help them understand a thing or two about their customers and how systems respond to a periodic increase in footfall.

For example, if your systems received 1,000 qps (queries per second) during the peak hours of last year’s Black Friday, and your business has grown by about 20% since then, you should expect roughly 1,200 qps this Black Friday. Because growth estimates are imprecise, ensure your systems can handle a range around that figure, for example 10%-30% growth in qps (1,100 to 1,300 qps).
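The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the function name `projected_peak_qps` is hypothetical, and the numbers are the ones from the example above, not a recommendation for any particular system.

```python
import math


def projected_peak_qps(last_peak_qps: int, growth_percent: int) -> int:
    """Scale last year's peak by a growth percentage, rounding up to a whole qps."""
    if last_peak_qps < 0 or growth_percent <= -100:
        raise ValueError("inputs must describe a non-negative load")
    # Integer percentages keep the arithmetic exact for whole-number inputs.
    return math.ceil(last_peak_qps * (100 + growth_percent) / 100)


last_year_peak = 1000  # qps observed during last year's peak hours
expected = projected_peak_qps(last_year_peak, 20)
low, high = (projected_peak_qps(last_year_peak, pct) for pct in (10, 30))
print(f"expected peak: {expected} qps, planning range: {low}-{high} qps")
```

In practice, teams often add extra headroom on top of the projected range. How much to add is a judgment call that this sketch does not try to make.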

So what can teams do to make the holiday season less chaotic?

  • Learn from past incidents
  • Load testing and Performance testing
  • Observability - Monitoring, Logging, Tracing, etc.
  • SLO-based alerting - Revisit SLOs and plan releases keeping in mind peak season traffic
  • Other best practices - SRE Automation, On-call, Alerting & Monitoring

Learn from past Incidents

Analyzing postmortems of past incidents will give you a fair idea of the limitations of your infrastructure. You will understand the breaking point of various services and how to tackle outages if they were to happen again.

The ideal way to take away learnings from past incidents is to make a checklist of action items from previous outages and ensure they are addressed this time around.

Load testing and Performance testing to understand system thresholds

A good Site Reliability Engineering practice is routinely performing:

  • Load tests, to understand how much stress the systems can withstand, and
  • Performance tests, to understand how the system behaves in normal load conditions.

By definition, load testing is the process of determining the behavior of a system when multiple users access it at the same time. The behavior of systems under extreme load helps us determine their breaking point. It also answers questions such as whether the system can sustain a particular load and what its operating capacity is.

Performance testing, on the other hand, measures system attributes such as speed, scalability, reliability, and stability, and how the system adapts to change during normal load conditions. The idea here is to validate that the system performs efficiently when the load is both below and above the breaking point.

Ideally, these should be ongoing practices and not something done a few days before a major event, because last-minute tests will not give you a complete picture of system behaviour under stress. Routine load tests help SREs be more prepared and assist them in scaling up or scaling out accordingly.
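As a rough illustration of what a load test measures, here is a minimal sketch using only the Python standard library. It is not a replacement for a dedicated load-testing tool. The names `run_load_test` and `fake_checkout` are hypothetical, and `fake_checkout` simulates a request with a short sleep, where a real test would call a staging endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable, Dict, Tuple


def run_load_test(handler: Callable[[], None], concurrency: int, total_requests: int) -> Dict[str, float]:
    """Call `handler` total_requests times from `concurrency` threads and summarize the results."""

    def timed_call(_: int) -> Tuple[float, bool]:
        start = time.perf_counter()
        try:
            handler()
            ok = True
        except Exception:
            ok = False  # count failures instead of aborting the run
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(latency for latency, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    # 99 percentile cut points; requires at least two samples.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "requests": total_requests,
        "errors": errors,
        "p50_ms": cuts[49] * 1000,
        "p95_ms": cuts[94] * 1000,
    }


def fake_checkout() -> None:
    """Hypothetical stand-in for a real request."""
    time.sleep(0.002)


if __name__ == "__main__":
    # Step the concurrency up to see where latency or errors start to climb.
    for concurrency in (5, 20, 50):
        print(concurrency, run_load_test(fake_checkout, concurrency, 200))
```

Running the same script routinely, and comparing the latency percentiles and error counts between runs, is one simple way to notice when the breaking point has moved.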

Make Observability truly actionable

‘You can’t fix what you can’t see.’ This well-known saying applies directly to fixing production issues. One of the most important aspects of running systems effectively in production is making your system more observable and taking proactive measures when a red signal is flagged.

Having a clear view of your system makes it possible to recognize problems early and solve them preemptively. Getting the right data at the right time, with associated context, is a game changer for those who want better system stability. For example, if an outage occurs and an on-call engineer gets notified, it is ideal to give them more context on why the outage occurred. By referring to a dashboard of key system metrics, they can debug the issue faster, reduce the duration of the outage, and bring down the overall MTTR.

There are different observability tools for different purposes, such as log aggregation, APM, time series databases, distributed tracing, and metrics collection. The table below gives a better understanding of the different tools and how/when SREs can use them. (Read more here: https://www.squadcast.com/blog/top-observability-tools-for-devops-engineers-and-sres)

Observability Area | Tools
Metrics Collection | Prometheus, DataDog, Pingdom, etc.
Visualization | Grafana, Kibana, etc.
Logging | Loggly, Logstash, etc.
Application Performance Monitoring | AppDynamics, NewRelic, Dynatrace, etc.
Distributed Tracing | Zipkin, Jaeger, etc.

It is worth noting that ‘Observability’ is what truly sets apart ‘competent SRE teams’ from ‘incompetent SRE teams’.

SLO-based Alerting - Revisit SLOs and plan releases keeping in mind peak season traffic

Establishing realistic SLO targets is the most important of all techniques covered in this blog because it directly impacts business goals. SLOs are also the easiest way to measure user experience because they are established on the basis of reliably providing service to end users.

But how can SLOs accurately measure user experience?

By tracking page load times. When the time taken to load a page exceeds a set threshold more often than is acceptable, we can infer that users are not enjoying their browsing experience. Similarly, if server responses fail more than a certain number of times within a given time span, we can infer that users are not getting the expected service. Such scenarios, which directly indicate an outage or degraded user service, can be measured numerically by SLOs.

Thus, using alerts to keep track of SLOs and adjust the error budget accordingly is an effective way to identify system limitations and fix them before customers or stakeholders start complaining about downtime. This is where observability tools come into the picture: they provide the numerical representation of good or bad UX that SLO-based alerting relies on.
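To make the idea concrete, here is a minimal sketch of a page-load SLO check with an error budget. All names (`SloReport`, `evaluate_latency_slo`) and all numbers (a 2,000 ms threshold, a 90% target, alerting at 80% of the budget) are assumptions chosen for illustration, not values from any real system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SloReport:
    sli: float              # fraction of page loads at or under the threshold
    budget_consumed: float  # share of the error budget used (1.0 means exhausted)
    should_alert: bool


def evaluate_latency_slo(
    load_times_ms: Sequence[float],
    threshold_ms: float,
    slo_target: float,
    alert_at: float = 0.8,
) -> SloReport:
    """Compare observed page load times against an SLO and report error budget usage."""
    if not load_times_ms:
        raise ValueError("need at least one measurement")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    total = len(load_times_ms)
    bad = sum(1 for t in load_times_ms if t > threshold_ms)
    sli = (total - bad) / total
    # The error budget is the number of slow loads the SLO tolerates in this window.
    allowed_bad = total * (1 - slo_target)
    budget_consumed = bad / allowed_bad
    return SloReport(sli, budget_consumed, budget_consumed >= alert_at)


if __name__ == "__main__":
    window = [800.0] * 19 + [3000.0]  # one slow page load out of twenty
    report = evaluate_latency_slo(window, threshold_ms=2000, slo_target=0.90)
    print(f"SLI {report.sli:.0%}, budget used {report.budget_consumed:.0%}, alert: {report.should_alert}")
```

Alerting before the budget is fully spent, rather than on every slow request, is the design choice that keeps this kind of alert tied to user experience instead of to individual events.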

Additional Resources: To understand the importance of SLO tracking for SREs, read this blog. This is the link to our open-source SLO Tracker.

Other best practices - SRE Automation, On-call, Alerting & Monitoring

Besides the techniques mentioned above, there are numerous on-call, alerting, and monitoring best practices that can be followed by engineering teams during such peak-traffic events.

The spike in traffic will inevitably lead to incidents and outages or, at the very least, impact some services. This is bound to create chaos and increase stress for on-call engineers. During such times, engineering teams should take the right measures to prevent on-call burnout. Encouraging on-call engineers to take vacation either before or after the peak season is one way teams can compensate for the extreme stress they experience during on-call. Read more on ‘How to avoid on-call burnout’.

Other best practices to keep in mind include techniques such as ‘Alert-as-Code’ and configuring proactive alert checks that monitor system health. Read more on this topic here: ‘Best practices in Incident Management’.
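‘Alert-as-Code’ generally means keeping alert definitions in version-controlled code or configuration, so that changes are reviewed like any other change. The sketch below shows the idea in plain Python. It does not use any particular tool's format, and the rule names, metric names, and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str       # key expected in the metrics snapshot
    threshold: float  # the rule fires when the metric is above this value
    severity: str


# Rules live next to the service code and go through code review.
RULES = [
    AlertRule("HighErrorRate", "http_5xx_ratio", 0.01, "critical"),
    AlertRule("SlowCheckout", "checkout_p95_ms", 2000, "warning"),
]


def firing_rules(rules: Sequence[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return the names of rules whose metric is present and above its threshold."""
    return [
        rule.name
        for rule in rules
        if rule.metric in metrics and metrics[rule.metric] > rule.threshold
    ]


if __name__ == "__main__":
    snapshot = {"http_5xx_ratio": 0.03, "checkout_p95_ms": 1500}
    print(firing_rules(RULES, snapshot))
```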

Automate, Automate, Automate...

Ensure that automation rules are properly configured so that alerts reach the right responders. Teams can also leverage automation to prioritize high severity incidents. Even if many events are monitored, send alerts only for the relevant and actionable ones. Deduplication and Suppression rules can be configured to group similar alerts together and prevent unimportant alerts from getting reported.
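The deduplication and suppression behaviour described above can be sketched as follows. This is a simplified illustration, not how any specific incident management platform implements it. The `Alert` fields, the severity labels, and the choice to suppress "info" alerts are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str  # "critical", "warning" or "info" (labels assumed for this sketch)
    message: str


SUPPRESSED_SEVERITIES = {"info"}          # suppression rule: never notify on these
SEVERITY_ORDER = {"critical": 0, "warning": 1}


def triage(alerts: Iterable[Alert]) -> List[List[Alert]]:
    """Drop suppressed alerts, group the rest by (service, check), most severe groups first."""
    groups: Dict[Tuple[str, str], List[Alert]] = {}
    for alert in alerts:
        if alert.severity in SUPPRESSED_SEVERITIES:
            continue
        # Deduplication: alerts for the same service and check share one group.
        groups.setdefault((alert.service, alert.check), []).append(alert)
    return sorted(
        groups.values(),
        key=lambda group: min(SEVERITY_ORDER.get(a.severity, 2) for a in group),
    )


if __name__ == "__main__":
    incoming = [
        Alert("checkout", "latency", "warning", "p95 above threshold"),
        Alert("cart", "disk", "info", "disk 70% full"),
        Alert("payments", "errors", "critical", "5xx spike"),
        Alert("checkout", "latency", "warning", "p95 still above threshold"),
    ]
    for group in triage(incoming):
        print(group[0].service, group[0].check, f"{len(group)} alert(s)")
```

Each group would then be routed to a responder as a single notification, which is what keeps the pager quiet when the same check fires repeatedly.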

Conclusion

These are a few ways for SRE teams to prepare for peak traffic events. Be it Black Friday or some random stock clearance sale, the website load should not break the system. And if the system does break, this article can be a good reference point for SREs / on-call engineers to prevent such issues from occurring again.

If you lead a team of SREs, or if Site Reliability Engineering as a function interests you, then you must attend this webinar: Reliability Reimagined: How SREs Spearhead Competitive CX. Register for free to hear from Sr. SRE at Google, VP of Engineering at Squadcast and other panelists.

Written By:
Vardhan NS

Last Updated:
May 2, 2024
Holiday season's peak traffic is the most challenging period for SREs and on-call engineers. In this blog, we have highlighted the things that SREs can do to make the holiday season less chaotic.
