
URL: https://thenewstack.io/6-lessons-learned-from-netflixs-new-years-eve-outage/

6 Lessons Learned from Netflix's New Year's Eve Outage - The New Stack



6 Lessons Learned from Netflix’s New Year’s Eve Outage

Resilience is not just a companywide effort, but an industrywide one too. Read about lessons learned from Netflix’s New Year’s Eve outage.
Sep 1st, 2021 8:21am by Adam LaGreca
Rookout sponsored this post.
Adam LaGreca
Adam is the founder of 10KMedia, a boutique public relations agency for B2B DevOps. Prior, he was the director of communications for DigitalOcean, Datadog and Gremlin.

I recently had the opportunity to sit down with Jeremy Edberg, MinOps CEO and previously a site reliability engineer (SRE) for Netflix and Reddit, as well as Liran Haimovitch and John Egan, founders of Rookout and Kintaba respectively, to talk about incident management best practices.

In the tech industry, we feel like we are on the cutting edge, but in reality we are often playing catch-up to other industries. For example, the aviation industry has already learned that trying to reduce your incident count is counterproductive if the goal is to become more resilient. What you actually want is a culture that embraces incidents: one that files problems early and often, distributes the learnings, and in turn drastically reduces the chances of SEV0 or SEV1 disasters.

To open up the discussion, Jeremy talked about a Netflix outage he experienced in 2012. It was one minute past midnight on New Year’s Eve when he received an alert that user signups were broken. His gut was telling him that the issue must be time-related, but he was repeatedly assured that couldn’t be the case, as everything was in Greenwich Mean Time (GMT) and anything time-related would have broken eight hours earlier.

After three hours of troubleshooting, they found the problem: The user signup flow required a database table to be created once a year because it stored a log of the creation in Pacific Time. No one had created the table before midnight, so the system broke when it couldn’t find the table it was looking for.
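
The eight-hour gap is easy to see in a few lines of Python. This is only a sketch of the failure mode as described above: the `signups_<year>` naming scheme, the function name and the specific dates are assumptions made for illustration, not details from Netflix's system.

```python
from datetime import datetime, timedelta, timezone

# Pacific Standard Time as a fixed UTC-8 offset (valid in January; ignores DST).
PST = timezone(timedelta(hours=-8), "PST")


def yearly_table_name(instant: datetime, tz: timezone) -> str:
    """Hypothetical scheme: one signup table per calendar year, as seen in tz."""
    return f"signups_{instant.astimezone(tz).year}"


# Midnight GMT: the moment a GMT-based time bug would have surfaced.
midnight_gmt = datetime(2013, 1, 1, 0, 0, tzinfo=timezone.utc)
# One minute past midnight Pacific: when the signup alert fired (illustrative year).
alert_time = datetime(2013, 1, 1, 0, 1, tzinfo=PST)

# At midnight GMT the Pacific calendar year has not rolled over yet.
print(yearly_table_name(midnight_gmt, PST))
# At the alert time the code looks for a table nobody has created.
print(yearly_table_name(alert_time, PST))
# The distance between the two moments.
print(alert_time - midnight_gmt)
```

Anything keyed to the Pacific calendar year keeps working at midnight GMT and only fails once the year changes on the West Coast, which is why "it would have broken eight hours ago" was a misleading reassurance.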

Jeremy was tempted to say “I told you so,” but as we all know, that isn’t productive during a retrospective. So in the interest of productivity, I’ve put together six takeaways about modern incident management from the conversation:

1) Trust your gut. Often it’s the fear of being wrong that prevents us from taking action. No one wants to posit an incorrect theory, let alone set off a fire alarm that pages everyone. But creating a positive incident culture means everyone should feel empowered to be open and speak up.

2) Declare early and often. Creating a positive incident culture also means that issues are filed early and often. This is also known as the “big red button”: since the 1950s, factories have had big red buttons that can be pressed at any time by anyone. The insight here is that you actually want your incident count to go up, by making it easier to declare incidents, because addressing them early will prevent them from snowballing into SEV0/SEV1 disasters.

3) Involve the entire organization. We have a tendency in the tech industry to silo the responsibility of resilience to SREs. But the truth is that problems come from everywhere, so any employee should be able to press that big red button. Yes, oftentimes a problem is identified inside a Datadog dashboard or PagerDuty alert. But problems can also be flagged inside a support ticket or a customer complaint. Giving everyone the keys to declare an incident means that problems will be surfaced, and resolved, much faster. Moreover, after a problem is addressed, it shouldn’t be the job of SREs to handle everything, including looping in customer reps, legal and PR as the incident unfolds. Adopting modern tooling should help orchestrate a lot of that process.

4) Developers should be on the hook for reliability. Bad code deployments are a leading cause of SEV0 or SEV1 incidents, according to Gremlin’s “State of Chaos Engineering Report.” Long gone are the days when developers write code, then throw it over the wall to operations. They need to have skin in the game and be on the hook for that code being reliable. This means adopting modern tooling like observability and live debugging for more effective troubleshooting and root cause analysis.

Rookout empowers engineers to solve customer issues 5x faster, by making debugging easy and accessible in any environment; from cloud-native to on-prem and from dev to production. Rookout allows engineers to get the data they need instantly, without additional coding, restarts, or redeployments.

5) Automate what you can. As the Netflix story demonstrates, if creating a new table before midnight is a necessary repeatable task, why not automate it? Jeremy explained in the conversation that because it was a simple task, everyone just assumed that someone else would do it. In an ideal world, predictable and repeatable tasks should be automated, saving the manual work of incident management for the truly unique, black-swan, unpredictable events.

6) Read more postmortems. After every incident, there should be an effort made to document, in at least one sentence, what happened and one takeaway action for preventing it from happening again. These are conversations that happen around water coolers or are kept in one engineer’s head, but especially in a remote world, it’s important that these learnings are documented and distributed. One big myth about modern incident management is that postmortems need to be long, complicated documents with multiple data fields. But the truth is that this often serves as a deterrent to reading the postmortem, or to ever writing it in the first place. Getting something down, even if it’s just a couple of sentences, from the person who was there when the incident happened is crucial to improving resilience. NASA is known for actually reading other companies’ postmortems, because the agency is hungry for more learnings that aren’t being generated internally. (Check out postmortem.io if you want to build up this habit yourself.)
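
As a rough illustration of takeaway 5, the yearly table creation could be turned into a small idempotent job that a scheduler runs well before year-end. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database; the `signups_<year>` table names, the column layout and the `ensure_yearly_tables` helper are all hypothetical and not taken from Netflix's actual system.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Pacific Standard Time as a fixed UTC-8 offset (a simplification that ignores DST).
PST = timezone(timedelta(hours=-8), "PST")


def ensure_yearly_tables(conn: sqlite3.Connection, now: datetime) -> list[str]:
    """Make sure this year's and next year's tables exist. Safe to run repeatedly."""
    year = now.astimezone(PST).year
    ensured = []
    for y in (year, year + 1):
        name = f"signups_{y}"  # y is an int, so interpolating it into SQL is safe
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} ("
            "user_id TEXT PRIMARY KEY, created_at TEXT NOT NULL)"
        )
        ensured.append(name)
    conn.commit()
    return ensured


# Example run against an in-memory database, as if scheduled in early December.
conn = sqlite3.connect(":memory:")
run_time = datetime(2012, 12, 1, tzinfo=timezone.utc)
tables = ensure_yearly_tables(conn, run_time)
tables_again = ensure_yearly_tables(conn, run_time)  # second run is a no-op
print(tables)
```

Because the statement uses `IF NOT EXISTS`, the job can run daily without harm, and creating next year's table a full year ahead means a single missed run no longer takes signups down.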

As you can see, resilience is not just a companywide effort, but an industrywide one too. Reading about Netflix’s New Year’s Eve outage may very well convince some of you to make sure that a new database table is automatically created before midnight! These are the kinds of learnings we can share with one another, simply by being more open and transparent.


PagerDuty is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: Pragma.