
URL: https://thenewstack.io/sre-tips-to-prepare-for-black-friday/

2020-11-25 03:00:39
DevOps / Security

SRE Tips to Prepare for Black Friday

Preparing for Black Friday is an ongoing project for engineering teams who are responsible for building, deploying, and operating production workloads.
Nov 25th, 2020 3:00am by Austin Gunter
Gremlin sponsored this post.


Austin Gunter
Austin has been working with high scale cloud technology for over a decade. He leads Gremlin's technical product marketing team and spends his spare time meditating, training, and hanging out with his cat Franklin.

Preparing for Black Friday, or any peak traffic event, is an ongoing project for engineering teams who are responsible for building, deploying, and operating production workloads.

Since Site Reliability Engineers (SREs) and engineering teams are probably staying home this year due to COVID-19, we can’t make preparations alongside teammates in our offices. We’ll need to accomplish the same work from our workstations at home, where we’ll also convene our war rooms and manage any incidents that may arise.

Here’s a list of the ways that SREs from companies like Dropbox, Amazon, and Netflix have prepared for peak traffic this holiday season.

Review Past Incidents

Reviewing past incidents is a powerful way to understand how your system has failed previously, and it offers a lot of insight into how the system actually behaves in production. Armed with this insight, you’ll be more confident in the case of an outage. It will also give you a checklist of questions to ask your teams:

  1. Have we validated fixes for past incidents in light of any new code changes? To prevent the drift into failure, it’s important to revisit fixes for past bugs to ensure the reliability of code and configuration updates.
  2. Are we prepared with the right amount of infrastructure and correct autoscaling rules to handle a surge in traffic?
  3. Have we tested the reliability of our application’s critical paths? Validating that the core functionality of our application will perform under stress will make a massive difference to our company’s bottom line.
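
The second question lends itself to a quick back-of-the-envelope check. The sketch below is illustrative only: the function names, the per-instance throughput, and the headroom factor are assumptions made for the example, not figures from any of the teams mentioned in this post. You would substitute numbers from your own load tests.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.25) -> int:
    """Instances required to serve peak_rps while keeping `headroom` spare capacity.

    All inputs are assumptions you would take from your own load tests.
    """
    if rps_per_instance <= 0 or not 0 <= headroom < 1:
        raise ValueError("rps_per_instance must be > 0 and headroom must be in [0, 1)")
    # Only plan to use (1 - headroom) of each instance's measured throughput.
    usable_rps = rps_per_instance * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


def autoscaling_max_is_sufficient(
    peak_rps: float, rps_per_instance: float, autoscaling_max: int, headroom: float = 0.25
) -> bool:
    """True if the autoscaling group's maximum size covers the expected peak."""
    return instances_needed(peak_rps, rps_per_instance, headroom) <= autoscaling_max


if __name__ == "__main__":
    # Hypothetical numbers: 12,000 requests/s at peak, 200 requests/s per instance.
    print(instances_needed(12_000, 200))
    print(autoscaling_max_is_sufficient(12_000, 200, autoscaling_max=60))
```

If the check fails, the autoscaling maximum (or the underlying account quota) needs raising before the event rather than during it.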

Get to Know Your ‘Problem Services’

A pragmatic way to identify “problem services” is to ask your team “which services do folks avoid writing code for?” Once you have a list of these services, you can start looking into how to make sure those services don’t cause any headaches on the big day.

Do a little bit of digging to see how those services tend to fail and how the rest of the system responds. Once you understand the failure patterns of a given service, the reliability mechanisms become more obvious. Does the service need a bit more redundancy? Does it have issues with autoscaling properly? Is the connection to an upstream service a little fragile?

Run a Remote FireDrill to Test Your Observability and Runbooks

A FireDrill is a planned event that validates people and processes. Specifically, it is designed to run a team through the proper actions to take when a specific problem arises. Like business continuity plans, FireDrills should be a regular and expected facet of our incident management preparation.

Now that we’re working from home, it’s important to do a dress rehearsal so that we find the gaps in our process before we end up troubleshooting an incident from the living room in the middle of Thanksgiving. Are our alerts set up properly, or are we getting paged for non-issues and missing alerts for real problems? Will our dashboards give us the right data, so that we can resolve an incident quickly? And are our runbooks up to date, complete, and accurate?
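
One way to answer the alerting question after a FireDrill is to compare the alerts that actually fired with the alerts the drill scenario should have triggered. The following is a minimal sketch under that assumption; the function name and the alert names are hypothetical, and in practice the two sets would come from your paging tool and your drill plan.

```python
from __future__ import annotations


def review_drill_alerts(fired: set[str], expected: set[str]) -> dict[str, list[str]]:
    """Compare alerts that fired during a drill with those the scenario should trigger.

    noise   -> fired but not expected (pages for non-issues)
    missed  -> expected but never fired (real problems nobody was paged for)
    correct -> fired and expected
    """
    return {
        "noise": sorted(fired - expected),
        "missed": sorted(expected - fired),
        "correct": sorted(fired & expected),
    }


if __name__ == "__main__":
    # Hypothetical alert names for a drill that simulates a checkout outage.
    report = review_drill_alerts(
        fired={"checkout-5xx", "disk-80pct"},
        expected={"checkout-5xx", "db-replica-lag"},
    )
    for category, alerts in report.items():
        print(f"{category}: {alerts}")
```

Anything under "missed" is a gap to fix before the event; anything under "noise" is a candidate for tuning or removal.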

Create a One-Pager for Your Whole Company About the Event

One of the more time-consuming elements of incident management is making sure that everyone is on the same page. Publishing a company wiki page about the traffic spike and sharing it across your organization will save valuable minutes in the event of an outage.

Here’s a starter list of topics you can include:

  1. Why you expect the traffic spike and how long you estimate it to last.
  2. Contact information for all on-call people and a link to the rotation calendar (this should be easily accessible in the first place).
  3. Known system trouble spots, like potential bottlenecks or single points of failure. This allows everyone in the organization to keep an eye out for potential problems.
  4. Primary database query plans and any expected query pattern changes, including how long these queries take to run under normal conditions.
  5. Scaling bounds and known capacity limits, such as a capacity limit on Lambdas.
  6. Results from Chaos Engineering experiments run on services.
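
For a limit like the Lambda one in item 5, a rough steady-state estimate is that concurrent executions are approximately requests per second multiplied by average duration in seconds. The sketch below applies that estimate. The function names and every number are hypothetical, so check the actual concurrency quota on your own account before relying on it.

```python
def estimated_concurrency(requests_per_second: float, avg_duration_seconds: float) -> float:
    """Steady-state estimate of concurrent executions (requests/s x seconds per request)."""
    return requests_per_second * avg_duration_seconds


def max_sustainable_rps(concurrency_limit: int, avg_duration_seconds: float) -> float:
    """Highest request rate a given concurrency limit can absorb at this duration."""
    if avg_duration_seconds <= 0:
        raise ValueError("avg_duration_seconds must be > 0")
    return concurrency_limit / avg_duration_seconds


if __name__ == "__main__":
    # Hypothetical values: 2,000 requests/s at peak, 500 ms average duration,
    # and an assumed concurrency limit of 1,000.
    needed = estimated_concurrency(2_000, 0.5)
    print(f"estimated concurrency at peak: {needed:.0f}")
    print(f"max sustainable rate at the limit: {max_sustainable_rps(1_000, 0.5):.0f} requests/s")
```

Numbers like these are worth putting on the one-pager, because they tell everyone how close the expected peak sits to a hard ceiling.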

Reproduce Past Incidents with Chaos Engineering

Sometimes we think we have a fix for our past incidents, but we never actually go and test that the fix works. This can be for a number of reasons: inadequate tooling, hesitance to test in production, or perhaps even laziness. But this is a core use case for Chaos Engineering. Because Chaos Engineering enables engineers to precisely and repeatedly recreate turbulent production conditions, we can often reproduce what led to a major incident and verify that a fix does work.
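
As a small illustration of the idea, suppose a past incident was caused by an upstream pricing service failing, and the fix was to fall back to the last known price. The sketch below injects that failure in-process and checks that the fallback holds. Every name here is hypothetical, and this is a simplified stand-in rather than Gremlin's API or a production-grade experiment.

```python
class UpstreamUnavailable(Exception):
    """Raised when the (hypothetical) upstream pricing service cannot be reached."""


def get_price(sku, fetch, cache):
    """The fix under test: serve the last known price when the upstream call fails."""
    try:
        price = fetch(sku)
    except UpstreamUnavailable:
        if sku in cache:
            return cache[sku]
        raise  # no cached value to fall back on
    cache[sku] = price
    return price


def inject_failure(fetch, fail_calls):
    """Wrap `fetch` so that its first `fail_calls` calls raise UpstreamUnavailable."""
    remaining = fail_calls

    def wrapper(sku):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise UpstreamUnavailable(f"injected failure for {sku}")
        return fetch(sku)

    return wrapper


if __name__ == "__main__":
    cache = {"sku-1": 9.99}
    flaky_fetch = inject_failure(lambda sku: 10.49, fail_calls=1)
    print(get_price("sku-1", flaky_fetch, cache))  # served from cache during the outage
    print(get_price("sku-1", flaky_fetch, cache))  # upstream has recovered
```

The same pattern applies at larger scale: recreate the condition that caused the incident, then assert on the behavior the fix was supposed to guarantee.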

Uneventful Black Fridays

There’s an apt quote for 2020 that goes, “may you live in interesting times.” But when it comes to our on-call rotations and system behavior, we’d prefer things be boring and predictable. We hope that the above list can help your team prepare for a Black Friday full of happy customers and plenty of downtime with your loved ones.

Gremlin is the world’s first hosted Chaos Engineering service with a mission to help build a more reliable internet. It turns failure into resilience by enabling engineers to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.