Gremlin sponsored this post.
Preparing for Black Friday, or any peak traffic event, is an ongoing project for engineering teams who are responsible for building, deploying, and operating production workloads.
Since Site Reliability Engineers (SREs) and engineering teams are probably staying home this year due to COVID-19, we won’t be making preparations alongside teammates in our offices. We’ll need to accomplish the same work from our workstations at home, where we’ll also convene our war rooms and manage any incidents that may arise.
Here’s a list of the ways that SREs from companies like Dropbox, Amazon, and Netflix have prepared for peak traffic this holiday season.
Reviewing past incidents is a powerful way to understand how your system has failed previously, and it offers a lot of insight into how the system actually behaves in production. Armed with this insight, you’ll be more confident in the case of an outage. It will also give you a checklist of questions to ask your teams.
A pragmatic way to identify “problem services” is to ask your team, “Which services do folks avoid writing code for?” Once you have a list of these services, you can start looking into how to make sure those services don’t cause any headaches on the big day.
Do a little bit of digging to see how those services tend to fail and how the rest of the system responds. Once you understand the failure patterns of a given service, the reliability mechanisms become more obvious. Does the service need a bit more redundancy? Does it have issues with auto-scaling properly? Is the connection to an upstream service a little fragile?
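If your past incidents are recorded somewhere queryable, a quick tally can help surface these failure patterns. The following is a minimal sketch, not a prescribed method: the `incidents` records, their field names, and the `failure_patterns` helper are all hypothetical and stand in for whatever your incident tracker or postmortem documents provide.

```python
from collections import Counter

# Hypothetical incident records. In practice these would be exported from
# your incident tracker or transcribed from postmortems.
incidents = [
    {"service": "checkout", "failure_mode": "upstream timeout"},
    {"service": "search", "failure_mode": "slow auto-scaling"},
    {"service": "checkout", "failure_mode": "upstream timeout"},
    {"service": "inventory", "failure_mode": "single replica lost"},
]


def failure_patterns(records):
    """Count incidents per (service, failure mode), most frequent first."""
    counts = Counter((r["service"], r["failure_mode"]) for r in records)
    return counts.most_common()


for (service, mode), count in failure_patterns(incidents):
    print(f"{service}: {mode} ({count} incident(s))")
```

A repeated pair such as `checkout` with `upstream timeout` points at a fragile upstream connection, while a pair like `search` with `slow auto-scaling` points at a scaling problem, which maps directly onto the questions above.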
A FireDrill is a planned event that validates people and processes. Specifically, it is designed to run a team through the proper actions to take when a specific problem arises. Like business continuity plans, FireDrills should be a regular and expected facet of our incident management preparation.
Now that we’re working from home, it’s important to do a dress rehearsal so that we find gaps in our process before we end up troubleshooting an incident from the living room in the middle of Thanksgiving. Are our alerts set up properly, or are we getting paged for non-issues and missing alerts for real problems? Will our dashboards give us the right data, so that we can resolve an incident quickly? And are our runbooks up to date, complete, and accurate?
One of the more time-consuming elements of incident management is making sure that everyone is on the same page. Publishing a company wiki page about the traffic spike and sharing it across your organization will save valuable minutes in the event of an outage.
Here’s a starter list of topics you can include:
Sometimes we think we have a fix for our past incidents, but we never actually go and test that the fix works. This can be for a number of reasons: inadequate tooling, hesitance to test in production, or perhaps even laziness. But this is a core use case for Chaos Engineering. Because Chaos Engineering enables engineers to precisely and repeatedly recreate turbulent production conditions, we can often reproduce what led to a major incident and verify that a fix does work.
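To make the idea concrete, here is a minimal, self-contained sketch of recreating a failure and checking that a fix holds. It is an illustration only and does not use Gremlin or any other Chaos Engineering tool. Every name in it (`UpstreamTimeout`, `inject_failures`, `upstream_price`, `get_price`, `run_experiment`) is hypothetical, and the "fix" being verified is assumed to be a fallback to a cached value when an upstream call fails.

```python
import random


class UpstreamTimeout(Exception):
    """Stands in for the failure seen in a hypothetical past incident."""


def inject_failures(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise UpstreamTimeout."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise UpstreamTimeout("injected fault")
        return func(*args, **kwargs)
    return wrapped


def upstream_price(item_id):
    """Hypothetical upstream dependency; healthy unless wrapped."""
    return {"item": item_id, "price": 100, "source": "upstream"}


def get_price(item_id, fetch, cache):
    """The fix under test: serve the cached value when the upstream fails."""
    try:
        result = fetch(item_id)
        cache[item_id] = result
        return result
    except UpstreamTimeout:
        if item_id in cache:
            return {**cache[item_id], "source": "cache"}
        raise  # nothing cached, so the failure reaches the caller


def run_experiment(failure_rate, calls=1000, seed=42):
    """Replay the incident conditions; return the fraction of calls served."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = inject_failures(upstream_price, failure_rate, rng)
    cache = {"sku-1": upstream_price("sku-1")}  # warm cache, as in steady state
    served = 0
    for _ in range(calls):
        try:
            get_price("sku-1", flaky_fetch, cache)
            served += 1
        except UpstreamTimeout:
            pass
    return served / calls


if __name__ == "__main__":
    # Even with every upstream call failing, the fallback should serve all requests.
    print(run_experiment(failure_rate=1.0))
```

The seeded random generator is what makes the experiment repeatable: the same fault pattern can be replayed before and after a change. A real experiment would inject the fault into a running system rather than into a wrapped function, but the shape is the same: state the expected behavior, inject the failure, and compare.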
There’s an apt quote for 2020 that goes, “may you live in interesting times.” But when it comes to our on-call rotations and system behavior, we’d prefer things be boring and predictable. We hope that the above list can help your team prepare for a Black Friday full of happy customers and plenty of downtime with your loved ones.