URL: https://thenewstack.io/chaos-engineering-now-part-of-aws-well-architected-framework/

Chaos Engineering Now Part of AWS Well-Architected Framework - The New Stack


The eighth update to the Well-Architected Framework (WAF) was recently announced. It now includes Chaos Engineering as a requirement for a reliable system.
Aug 31st, 2020 9:23am by Taylor Smith
Gremlin sponsored this post.
Taylor Smith
Taylor is a Technical Product Marketing Manager at Gremlin, helping customers develop more reliable systems. He is passionate about modern applications and infrastructure. Previously he developed organic and inorganic strategies for Cisco and NetApp.

This summer, Amazon Web Services announced the eighth update to the Well-Architected Framework (WAF) since its launch in 2012. The WAF was originally available only internally, to AWS's own architecture and developer teams, as a best-practice guide for building and reviewing the infrastructure of applications launched under the AWS brand. Since its public release in 2015, the WAF has been revised regularly based on lessons learned from customers, and it defines a set of best practices for anyone architecting high-performing, resilient, and efficient cloud infrastructure and applications.

The framework includes five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. The latest update makes a notable addition to the Reliability Pillar. Along with guidance on workload architectures for building resilient distributed systems, the authors now name Chaos Engineering as a requirement for a reliable system. Seth Eliot, principal reliability solutions architect with AWS Well-Architected, notes, “You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results.”

Previous versions of the WAF mention injecting failure and performing GameDays, which are days dedicated to experimenting on our systems, but this latest update from AWS expressly calls out the necessity and benefits of Chaos Engineering for reliable systems.

The Early Days of Chaos Engineering

Chaos Engineering is the process of intentionally experimenting on a system by injecting precise and measured amounts of failure, to observe how our systems and teams respond for the purpose of improving reliability. The update to the Reliability Pillar rightly calls out how Netflix popularized the practice and that Amazon was applying the process even before that.

Gremlin is the world’s first hosted Chaos Engineering service with a mission to help build a more reliable internet. It turns failure into resilience by enabling engineers to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

The practice of Chaos Engineering began with the recognition that things can fail at any moment, and that preparing for those failures is better than wishfully assuming they will never happen to our systems. Clouds were built on commodity hardware, so instances were expected to fail. Adding in random node shutdowns, as Chaos Monkey was designed to do, was not meant to create harm. Rather, it was meant as a forcing function for engineers to design systems that could handle random host failure in an environment that a company no longer directly controlled.

Since the early days of cloud computing, AWS has become far more reliable, and outages caused by failing AWS infrastructure have become rare, in part because AWS performs Chaos Engineering internally. However, while microservices and agile methodologies have increased the velocity of innovation, they have also added complexity to systems. We can be less concerned about an EC2 instance failing under our application, but small configuration changes can introduce unknown dependencies, so that one service breaking in what was thought to be a loosely coupled architecture has the potential to bring down an entire application.

The Evolution of Failure Testing

Chaos Engineering, as a practice, has evolved in two ways. First, simple node failures are not enough to test newer, more distributed systems of increasing complexity. SREs and application teams can inject latency, up to complete loss of a service connection, to test how a service handles common network failures between services; or inject high CPU and memory usage to simulate the symptoms of high load. This more comprehensive testing finds flaws in reliability mechanisms, such as circuit breakers and autoscaling, that may not be properly in place or that react too quickly or too slowly. Testing for these failures helps ensure customers are never stranded with a full cart and no way to check out, or with a bill to pay and no way to access their funds.
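
To make the latency case concrete, here is a minimal sketch in Python of a latency injection against a circuit breaker. It is an illustration only, not Gremlin's or AWS's tooling: `FlakyService`, `CircuitBreaker`, `run_latency_experiment`, and all thresholds are hypothetical names and values chosen for this example, and latency is simulated as a number rather than a real network delay so the result is deterministic.

```python
from dataclasses import dataclass
from typing import List


class DependencyTimeout(Exception):
    """Raised when a simulated call takes longer than the caller's timeout."""


@dataclass
class FlakyService:
    """Hypothetical stand-in for a downstream service with injectable latency."""
    base_latency_ms: float = 20.0
    injected_latency_ms: float = 0.0  # the injected failure: extra delay per call

    def call(self, timeout_ms: float) -> str:
        latency = self.base_latency_ms + self.injected_latency_ms
        if latency > timeout_ms:
            raise DependencyTimeout(f"{latency:.0f} ms exceeds {timeout_ms:.0f} ms")
        return "ok"


class CircuitBreaker:
    """Simplified breaker: opens after `threshold` consecutive failures.

    A production breaker would also have a half-open state to probe for
    recovery; that is omitted here to keep the sketch short.
    """

    def __init__(self, service: FlakyService, threshold: int = 3,
                 timeout_ms: float = 100.0) -> None:
        self.service = service
        self.threshold = threshold
        self.timeout_ms = timeout_ms
        self.consecutive_failures = 0
        self.calls_to_dependency = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def call(self) -> str:
        if self.is_open:
            return "fallback"  # fail fast without contacting the dependency
        self.calls_to_dependency += 1
        try:
            result = self.service.call(self.timeout_ms)
        except DependencyTimeout:
            self.consecutive_failures += 1
            return "fallback"
        self.consecutive_failures = 0
        return result


def run_latency_experiment(injected_latency_ms: float, calls: int = 5) -> List[str]:
    """Inject latency, send `calls` requests, and return what callers saw."""
    breaker = CircuitBreaker(FlakyService(injected_latency_ms=injected_latency_ms))
    return [breaker.call() for _ in range(calls)]
```

With no injected latency, every call returns `"ok"`. With 500 ms injected against a 100 ms timeout, callers see only fallbacks, and after three failures the breaker stops contacting the slow dependency at all, which is the behavior an experiment like this is meant to confirm.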

Second, the AWS Reliability Pillar emphasizes a core principle of the practice: running thoughtful, controlled experiments rather than randomly injecting failure. The WAF states:

“Run tests that inject failures regularly into pre-production and production environments. Hypothesize how your workload will react to the failure, then compare your hypothesis to the testing results and iterate if they do not match.”
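
The hypothesize, inject, compare loop in that guidance can be sketched in a few lines of Python. This is a toy model under stated assumptions, not an AWS or Gremlin API: `Workload`, `run_experiment`, the round-robin retry policy, and the error-rate threshold are all hypothetical and exist only to show the shape of a controlled experiment, including the rollback step.

```python
from dataclasses import dataclass


class Workload:
    """Hypothetical service with several replicas and client-side retries."""

    def __init__(self, replicas: int, retries: int) -> None:
        self.healthy = [True] * replicas
        self.retries = retries

    def handle(self, request_id: int) -> bool:
        # Try the request's "home" replica first, then the next ones in order.
        for attempt in range(self.retries + 1):
            replica = (request_id + attempt) % len(self.healthy)
            if self.healthy[replica]:
                return True
        return False


@dataclass
class Result:
    observed_error_rate: float
    hypothesis_held: bool


def error_rate(workload: Workload, requests: int = 300) -> float:
    failures = sum(1 for i in range(requests) if not workload.handle(i))
    return failures / requests


def run_experiment(workload: Workload, kill: int, max_error_rate: float) -> Result:
    """Hypothesis: with `kill` replicas down, errors stay <= max_error_rate."""
    for i in range(kill):          # inject the failure
        workload.healthy[i] = False
    try:
        observed = error_rate(workload)
    finally:
        for i in range(kill):      # always roll back, even if measuring fails
            workload.healthy[i] = True
    return Result(observed, observed <= max_error_rate)
```

For example, `run_experiment(Workload(replicas=3, retries=1), kill=1, max_error_rate=0.01)` confirms the hypothesis, because every request that lands on the dead replica succeeds on its retry. The same experiment with `retries=0` fails about a third of requests, so the hypothesis does not hold and the design needs another iteration.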

Engineers can now build failure testing into every stage of their development lifecycle. Similar to test-driven development, developing and testing with failure in mind lowers the cost of bug fixing by finding availability bugs up front, rather than waiting for customers to find them. Incorporating failure testing into our usual integration testing shows whether applications work well together under both good and bad conditions.

Once your code has been tested and prepared for failure in pre-production, move testing to production. Performing GameDays in production will, as the report notes, “help you understand where improvements can be made and can help develop organizational experience in dealing with events.”

These exercises test not only our tools — such as autoscaling and load balancing — for self-healing, but also third-party dependencies and teams. For instance, we can prepare for a loss of connection to APIs that are out of our control, such as payment providers or a service run by another team. We can also run teams through their playbooks, to prepare them to leverage monitoring tools and make sure they can react to and resolve incidents quickly.
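
As a small illustration of preparing for the loss of a dependency outside our control, the sketch below uses Python's standard `unittest.mock` to simulate an unreachable payment provider in a test. The `checkout` function, its retry queue, and the `"pending"`/`"paid"` statuses are hypothetical and stand in for whatever fallback a real application would choose.

```python
from typing import Callable, Dict, List
from unittest.mock import Mock


def checkout(cart: Dict, charge: Callable[[int], None], retry_queue: List[Dict]) -> str:
    """Charge the cart; if the provider is unreachable, park the order."""
    try:
        charge(cart["total"])
    except ConnectionError:
        retry_queue.append(cart)  # keep the order instead of losing it
        return "pending"
    return "paid"


def test_checkout_survives_provider_outage() -> None:
    # Simulate the outage: every call to the provider raises ConnectionError.
    charge = Mock(side_effect=ConnectionError("provider unreachable"))
    retry_queue: List[Dict] = []

    status = checkout({"id": 1, "total": 42}, charge, retry_queue)

    assert status == "pending"
    assert retry_queue == [{"id": 1, "total": 42}]
    charge.assert_called_once_with(42)


def test_checkout_happy_path() -> None:
    charge = Mock()  # provider reachable: the call simply returns
    retry_queue: List[Dict] = []

    assert checkout({"id": 2, "total": 10}, charge, retry_queue) == "paid"
    assert retry_queue == []
```

A mocked test like this only checks the code path. Running the equivalent experiment against a real environment, by actually blocking traffic to the provider, is what exercises timeouts, monitoring, and the team's playbook.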

In 2016, Gremlin founders Kolton Andrus and Matthew Fornaciari took their Chaos Engineering experience from Amazon and Netflix to build the first fully hosted platform for safely and securely running experiments. The mission is to democratize the practice and extend the benefits of failure injection that many companies began to see when writing their own scripts and using open source tools like Chaos Monkey and Toxiproxy.

With thoughtful chaos experiments readily available for more engineers, we can build systems using the Well-Architected Framework that are operationally excellent and more reliable — systems that our customers can depend on.

Amazon Web Services is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: Pragma.