URL: https://thenewstack.io/how-amazon-prime-videos-engineering-teams-build-resilience/

⇱ How Amazon Prime Video Engineering Builds Team Resilience - The New Stack


2022-02-08 03:00:43
Observability / Tech Culture

How Amazon Prime Video Engineering Builds Team Resilience

At Chaos Carnival, speakers from the streaming giant revealed how they are using machine learning to help avoid incidents when traffic spikes.
Feb 8th, 2022 3:00am by Jennifer Riggins
ChaosNative sponsored this post. Insight Partners is an investor in ChaosNative and TNS.

Highly distributed software systems allow organizations to scale faster, and give their engineers more speed, control and autonomy. But as companies scale, so does complexity.

No company knows that better than Amazon Prime Video, which employs cross-functional teams in multiple geographies serving millions of users across thousands of APIs. With the added complexity of live events and popular streaming premieres, its teams have had to learn to grapple with sharp spikes in traffic and workloads.

As a result, continuous resilience is at the core of Amazon’s cross-organizational success. For Prime Video, that comes down to addressing scale, complexity and impact on customers.

Two memorable sessions from January’s Chaos Carnival, the annual users’ conference hosted by ChaosNative, came from Prime Video’s resilience and chaos engineering team, which centers on supporting the company’s DevOps teams in continuously improving how they predict, prepare, operate, and learn.

In order to keep up with often unpredictable traffic loads at a global scale, Prime Video is experimenting with a mix of human creativity, machine learning and team resilience scores.

[Figure: “Supporting DevOps,” shown as a continuous loop of predict > prepare > operate > learn, alongside three related points: proactive reliability, provide actionable insights, build tools.]

Machine Learning for Continuous Resilience

Olga Hall, director of technical programs at Prime Video, kicked off the panel on achieving continuous resilience in DevOps with machine learning — by reflecting on Kiwi cricket.

Her team was preparing to release Amazon Prime in New Zealand, so it had to be ready to live stream the popular matches, which can be played over three to five days, lasting at least six hours per day. Traffic usually peaks when everyone tunes in for the final hour of the match, but the timing of that spike often can’t be predicted in advance.

ChaosNative Inc. provides products and services for the reliability of cloud native DevOps built on top of the popular open source Chaos engineering project LitmusChaos. ChaosNative offers the hosted Litmus service at cloud.chaosnative.com. ChaosNative and TNS are under common control.

Drawing on lessons from DevOps bestsellers “Accelerate” and “Architecting for Scale,” Hall’s team ran some science experiments ahead of its southernmost launch to date. It created machine-learning models to figure out workload shape and demand, applied chaos engineering to simulate failure scenarios, and practiced incident recovery.

One of Amazon founder Jeff Bezos’s mantras has famously been: “Good intentions never work, you need good mechanisms to make anything happen.”

Hall’s team looked for ways to build continuous resilience mechanisms around five principles:

  1. Workload modeling. What kind of event, with what audience size, at what time? Which devices? Which regions? While customers are watching live sports, what will happen with the on-demand content?
  2. Play game days around those models.
  3. Run failure injections in parallel. In addition to automatic load testing in production across all the services during off-hours for customers, the team performs stress testing and performance testing in non-production environments. It also runs injections for latency, always checking for consistent timeout settings.
  4. Contingencies and alternate pathways. Making sure fallbacks and failovers automatically kick in at varying levels of architecture.
  5. Observability across everything.
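
As a rough illustration of the latency injection and timeout check in principle 3, the sketch below walks a chain of service calls and flags any inner call whose timeout is at least as long as that of the caller waiting on it. The article does not say how Prime Video defines “consistent,” so that rule is an assumption here, as are the `Call` type, the service names and every number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    caller: str
    callee: str
    timeout_ms: int  # how long the caller waits for the callee


def inconsistent_timeouts(chain):
    """Return (upstream, downstream) pairs where the inner call is allowed
    to run at least as long as the outer call that is waiting on it."""
    return [
        (up, down)
        for up, down in zip(chain, chain[1:])
        if down.timeout_ms >= up.timeout_ms
    ]


def survives_latency(call, base_latency_ms, injected_ms):
    """True if the call still finishes inside its timeout after injection."""
    return base_latency_ms + injected_ms <= call.timeout_ms


# Hypothetical call chain; names and values are illustrative only.
chain = [
    Call("edge", "playback", 1000),
    Call("playback", "catalog", 800),
    Call("catalog", "database", 900),  # longer than its caller's 800 ms
]

for up, down in inconsistent_timeouts(chain):
    print(f"{down.caller}->{down.callee} ({down.timeout_ms} ms) can outlive "
          f"{up.caller}->{up.callee} ({up.timeout_ms} ms)")
```

A game day could then call `survives_latency` with increasing amounts of injected delay to see where each hop starts timing out.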

Hall’s team uncovered a pattern and decided to split the experiments into two buckets — controllable and uncontrollable.

  • Controllable inputs, meaning workload modeling and game days, are where the team applies machine learning to run continuous or timed experiments.
  • Uncontrollable inputs, meaning failure injections and contingency planning, are where humans make the decisions, so engineers can have fun experimenting.

Observability is needed across both to really gain from this mashup of automatic anomaly detection and human-led scientific experimentation.

The next step will be applying machine learning and artificial intelligence to things that the team so far deems uncontrollable.

Machine Learning for Workload Forecasting

Workload forecasting is rather like weather forecasting. You’re predicting how workloads will vary in the future under increasingly complex and unpredictable circumstances.

But while the climate crisis is making historical patterns less reliable as the basis for forecasting, at Prime Video, teams rely on what Ali Jalali, an Amazon applied scientist, dubbed “normal circumstances” — before performing experiments like suddenly increasing a customer base.

In the same Chaos Carnival panel where Hall spoke, Jalali said there are also a lot of variables for Amazon Prime’s teams to consider, including:

  • Customer metrics
  • Feature rollout
  • Long-term planning
  • Cloud-based tools
  • New marketing strategies
  • Seasonalities, like days of the week, monthly, quarterly

Jalali’s team needs to take all those variables and determine an optimal future risk level, which, for Prime Video, he said, is somewhere between 90% and 95%. With that in mind, his team uses “classical time series models to essentially narrow down the area for the forecast, and then use more advanced technologies, like deep learning, to really zoom into that area and find the exact numbers.”
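
To make the first, “classical” stage concrete, here is a minimal sketch that builds a seasonal baseline from past load and widens it by an empirical quantile of the residuals. Reading the 90% to 95% risk level as a quantile is an assumption made for illustration, not something Jalali specified. The deep-learning refinement stage is not shown, and the function names and sample numbers are hypothetical.

```python
import math


def seasonal_baseline(history, season):
    """Average load at each position in the season, over all full seasons."""
    full = len(history) // season
    if full == 0:
        raise ValueError("need at least one full season of history")
    used = history[-full * season:]
    return [
        sum(used[s * season + i] for s in range(full)) / full
        for i in range(season)
    ]


def quantile(values, q):
    """Nearest-rank empirical quantile for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def forecast_band(history, season, q=0.95):
    """Return (baseline, upper bound) for each position in the next season."""
    base = seasonal_baseline(history, season)
    full = len(history) // season
    used = history[-full * season:]
    residuals = [value - base[i % season] for i, value in enumerate(used)]
    margin = quantile(residuals, q)
    return [(b, b + margin) for b in base]


# Two hypothetical "seasons" of three load samples each.
history = [100, 200, 300, 110, 210, 290]
for baseline, upper in forecast_band(history, season=3):
    print(f"baseline={baseline:.0f} plan_for={upper:.0f}")
```

The band produced this way is the “narrowed-down area” a more detailed model would then search within.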

He says this combination has worked for his team, but that it still has a way to go in terms of workload forecasting for a baseline and then scaling up. It gets harder when a live sports event streams at the same time as a big on-demand premiere.

“Resiliency is the intersection of complexity, scale and impact.”

— Olga Hall, director of technical programs, Amazon Prime Video

When Amazon Prime released the second season of the wildly popular Indian action series “Mirzapur,” the company was able to easily predict the time of a huge peak in viewership. Its teams then leveraged machine-learning models to predict the traffic spike when combined with any other live events and partner content releases.

“We need a predictive model that can tell us ahead of time what’s going to happen at that exact moment in time, so that we can prepare for it,” Jalali said.

With this in mind, Prime Video has built a library of past Amazon events to create a similarity engine, which engineers combine with data pouring in via social media and IMDb ratings, to predict hype. Then they test capacity and resiliency against that hype.
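
One simple way to picture a similarity engine is a nearest-neighbor lookup over a library of past events. The sketch below is only that: the three features (social buzz, rating, live or not), the event names and the use of Euclidean distance are all assumptions, since the article does not describe how Prime Video's engine works internally.

```python
import math

# Hypothetical library of past events.
# Feature order: social buzz (0-1), rating (0-1), is_live (0 or 1).
LIBRARY = {
    "cricket_final": [0.9, 0.2, 1.0],
    "series_premiere": [0.8, 0.9, 0.0],
    "catalog_refresh": [0.1, 0.1, 0.0],
}


def most_similar(upcoming, library, k=1):
    """Names of the k past events closest to the upcoming one."""
    ranked = sorted(
        library.items(),
        key=lambda item: math.dist(upcoming, item[1]),
    )
    return [name for name, _ in ranked[:k]]


upcoming_event = [0.85, 0.8, 0.0]  # illustrative values for a new release
print(most_similar(upcoming_event, LIBRARY, k=2))
```

The traffic recorded for the closest past events could then seed the load profile used in capacity tests.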

The teams can even automate regional considerations. For example, Jalali said that a lot of Indians are live streaming events from their phones. But since data is so expensive, they will gather together in free public Wi-Fi spots. So, his team has trained models to test under those conditions.

With all of this in place, the Prime Video teams are then able to automate the delegation of loads to different data centers based on availability and latency. They make reactive decisions in real time, including CPU and memory optimizations, which allows for auto-tuning and autoscaling.
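
A toy version of that delegation logic, assuming a greedy policy that fills the lowest-latency healthy data center first, might look like the following. The `DataCenter` fields, the policy and the numbers are hypothetical; the talk did not describe the actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    healthy: bool
    latency_ms: float
    free_capacity: int  # units of load it can still accept


def route(load, centers):
    """Assign load to healthy data centers, lowest latency first."""
    plan = {}
    remaining = load
    for dc in sorted((c for c in centers if c.healthy),
                     key=lambda c: c.latency_ms):
        if remaining == 0:
            break
        take = min(remaining, dc.free_capacity)
        if take:
            plan[dc.name] = take
        remaining -= take
    if remaining:
        raise RuntimeError(f"{remaining} units of load could not be placed")
    return plan


centers = [
    DataCenter("dc-a", healthy=True, latency_ms=20, free_capacity=100),
    DataCenter("dc-b", healthy=False, latency_ms=5, free_capacity=1000),
    DataCenter("dc-c", healthy=True, latency_ms=50, free_capacity=500),
]
print(route(250, centers))
```

The unhealthy data center is skipped even though it has the best latency, which is the availability half of the decision.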

This includes a new carbon footprint model that Jalali says factors in how the power is generated and the machine type.

Machine Learning to Reduce Incident Management

Geoffrey Robinson, principal technical program manager at Prime Video, also on the Chaos Carnival panel, compared incident management automation with adaptive cruise control, which has evolved from speed control all the way toward autonomy.

Robinson’s team is dedicated to answering questions like, “What are the things that engineers have to do multiple times and where can we automate that? How can we improve our process so that they can use their brainpower to solve more strategic needs?”

His team focuses on ways to reduce cognitive load at one of the most stressful times — when the pager calls.

One of his team’s objectives is to reduce time to mitigation. It uses data to uncover patterns for things like false alarms, allowing for easier error flagging and troubleshooting. The tooling also highlights the likely culprits: any deployments made in the last 15 minutes.
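
The “last 15 minutes” heuristic is easy to sketch. The example below assumes deployments are available as (service, timestamp) pairs, which is an illustrative data shape rather than anything Robinson described, and the service names and dates are made up.

```python
from datetime import datetime, timedelta


def recent_deployments(deployments, incident_start,
                       window=timedelta(minutes=15)):
    """Services deployed within `window` before the incident began."""
    return [
        service
        for service, deployed_at in deployments
        if timedelta(0) <= incident_start - deployed_at <= window
    ]


# Hypothetical incident and deployment log.
incident = datetime(2022, 1, 27, 2, 0)
deployments = [
    ("catalog", datetime(2022, 1, 27, 1, 50)),   # 10 minutes before
    ("playback", datetime(2022, 1, 27, 1, 30)),  # 30 minutes before
    ("search", datetime(2022, 1, 27, 2, 5)),     # after the incident began
]
print(recent_deployments(deployments, incident))
```

Surfacing this short list on the incident ticket is one way to spare the on-call engineer a manual search through deployment history.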

Through machine learning, his team got the incident onboarding process down to five minutes, starting with a ticket-declaration service. With all the tagging, after resolution, the team feeds more live incident data back into the model, which then feeds into game-day simulations.

“Anything we can automate, like adaptive cruise control, we can feed back into that incident,” Robinson said. “So before the next incident occurs, we know that data will either be available to the team that’s troubleshooting or they’ve gone through the game days and they’ve tested to make sure that things are ready for it.”

This data-informed automation will likely decrease incidents over time, Hall said. “We see a future for all of us — for us as a team, for many of you in the audience — where an engineer sees a problem or issue only once,” she said. “This is a repeatable, controllable, understandable problem.”

But for now there’s still a human involved for those rare events or anomalies that fall outside machine-recognizable patterns. Just like we haven’t yet eliminated the need for humans to drive cars, Amazon Prime’s process isn’t automating the humans out of incident management — yet. It’s just trying to keep them from waking up unnecessarily at two in the morning.

Team Resilience Score

Prime Video colleague Sudeepa Prakash kicked off her Chaos Carnival talk by asking the live audience how they react when an executive establishes a new scoring system. About three-quarters of the audience members admitted that their companies were in the process of introducing new top-down measurements, although that doesn’t mean they were happy about it.

Prakash, a senior product manager at Prime Video, told the crowd not just how her company developed its team resilience score as a mechanism to encourage teams’ preparedness and drive operational readiness. She also gave tips for introducing new metrics-based concepts without scaring everyone off.

The focus must be on teams aligning around achieving proactive reliability, which Prakash said is to “influence operational excellence through preparedness to avoid failures.” To address the unyielding impact of complexity, they chose to anchor around the goal of availability because, as she said, “When you have a higher resilience, you’ll perhaps have a higher availability.”

The term resilience at Prime Video translates to:

  • Preparing to avoid failures.
  • Operating successfully in the presence of failures.
  • Accepting that failures are inevitable, so you need to have contingencies.

The company looked to turn this concept into a mechanism. But Prakash emphasized that building a tool was just a means to a specific end — alignment around availability.

This of course maps back to Prime Video’s continuous improvement cycle of predict, prepare, operate and learn. Managers asked the resilience and chaos engineering team to look at existing engineering practices.

Amazon is really good at recording root cause analysis, Prakash said, so the company already had the data for what she called “the more meaty components of the score,” which reflected prior issues.

The score components also all came from existing tools, including:

  • Deployment safety measures. What are different tools that you can deploy that aren’t going to cause issues?
  • Operational readiness review.
  • Unit testing and integration testing.
  • Root cause analysis. Specifically, the organization wanted to run checks for any open action items from a root cause analysis.

[Figure: “Reduce Cognitive Load,” a dark gray diamond with the score 83.40 in the middle and four points: operational readiness review, code coverage, CoE action items and deployment safety. Deployment safety, the most heavily weighted component, shows the least dark gray and needs the most work.]

The team resilience score brings all of this data together into a single score per team that everyone can see in one place. It is built around four score goals, with the following weighting:

  • Deployment safety: 40%
  • Operational readiness review: 30%
  • Correction of Error (CoE) action items: 15%
  • Code coverage: 15%
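
The weighting above amounts to a weighted average. As a worked sketch, assuming each component is itself scored from 0 to 100 (the article does not say how components are scaled), the overall score could be computed like this. The component values are invented for illustration and are not meant to reproduce the figure's 83.40.

```python
# Weights in percent, taken from the list above.
WEIGHTS = {
    "deployment_safety": 40,
    "operational_readiness_review": 30,
    "coe_action_items": 15,
    "code_coverage": 15,
}


def resilience_score(components):
    """Weighted average of component scores, each assumed to be 0-100."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


# Hypothetical team: strong everywhere except deployment safety.
team = {
    "deployment_safety": 60,
    "operational_readiness_review": 100,
    "coe_action_items": 100,
    "code_coverage": 90,
}
print(resilience_score(team))
```

Because deployment safety carries 40% of the weight, a gap there pulls the overall score down further than the same gap in code coverage would.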

The goal of this diamond-shaped scoring grid is to provide actionable insights, based on regular reporting, democratized by the teams — without increasing cognitive load. At a glance, teams can see where they stand, highlighting any missing components in light gray. This is where they first look to reduce repetitive actions and automate wherever possible.

“Visualization is important,” Prakash said. “By just looking at this, teams are able to make quick decisions of what they are going to prioritize.”

What feeds into each team resilience score is different and decided on at the team level. And the scores widely vary based on team complexity.

If a particular area is already optimized for operational excellence, the team moves on to a different area of improvement. Over time, Prakash said, the organization has learned that it needs to be flexible and constantly iterate on scores, getting continuous feedback from the teams. However, while scoring may change, it shouldn’t change frequently, and the goalposts shouldn’t be moved without a clear reason.

A Resilience Score Can Help Set Priorities

Transparency is key to a successful implementation of a team resilience score, with team members being able to deep dive into any reasons for lost points. While the score is automated, each team has an override button to add notes or further data if it believes it has satisfied the criteria.

Prakash emphasized that the team resilience score is a mechanism to help teams, paired with a tool to provide actionable insights, but it is “not a report card, not a means to mandate processes, not only for leaders — it is meant to be a tool for the teams.”

She warned to “treat these scores as proxies, not another evaluation or performance mechanism. It’s meant for the teams to prioritize what they should and should not be working on.”

And while it has the Prime Video teams aligning around resiliency, in true chicken-or-egg fashion, the company isn’t even sure if a high team resilience score correlates with high availability — or the other way around. It is, however, clear that team resilience is grounded in continuous improvement, so scores should continue to rise.

Jennifer Riggins is a tech storyteller and journalist, event and panel host. She bridges the gap between business, culture and technology, with her work grounded in the developer experience. She has been a working writer since 2003, and is based...
Amazon Web Services and ChaosNative are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Pragma, Root, Unit, ChaosNative.