URL: https://thenewstack.io/how-chaos-engineering-helps-you-reduce-cloud-spend/

How Chaos Engineering Helps You Reduce Cloud Spend - The New Stack



How Chaos Engineering Helps You Reduce Cloud Spend

Cloud platforms provide a number of cost-optimization features — like resource sizing and autoscaling. The trick is knowing how to use these features.
Dec 28th, 2020 8:35am by Andre Newman
Gremlin sponsored this post.

Andre Newman
Andre is a technical writer for Gremlin where he writes about the benefits and applications of Chaos Engineering. Prior to joining Gremlin, he worked as a consultant for startups and SaaS providers where he wrote on DevOps, observability, SIEM, and microservices. He has been featured in DZone, StatusCode Weekly, and Next City.

Cloud platforms opened the floodgates for engineering teams to run enterprise-scale applications at much lower cost than traditional on-premises data centers. That said, cloud computing can still get expensive, especially as you scale up your operations. The Flexera 2020 State of the Cloud report found that optimizing existing cloud use for cost savings was the top initiative for 73% of organizations, and that respondents were over budget on cloud spend by an average of 23%.

Fortunately, cloud platforms provide a number of cost-optimization features — like resource sizing, on-demand infrastructure, and autoscaling. The trick is knowing how to use these features, while also providing high performance and high reliability in your applications.

In this article, we’ll look at a few different ways you can reduce your cloud spend and how to use Chaos Engineering to do so safely and intelligently.

Right-Size Your Infrastructure

There’s a balance to strike between provisioning enough capacity and not paying for unused capacity, but finding this balance is tough. For example, how do you:

  • Right-size a virtual machine instance so that it isn’t excessively idle, but can still handle changes in demand?
  • Scale down idle resources without inadvertently creating a bottleneck?
  • Know that you can reliably scale your applications?

We need a safe way to validate that our changes are right for our environment, and Chaos Engineering provides one. Chaos Engineering is the practice of deliberately injecting precise, controlled amounts of harm into a system to test how it handles failure. By observing how our systems respond, we can find their weaknesses and make them more resilient.

How does this apply to right-sizing cloud infrastructure? Imagine we have a group of virtual machine instances that we want to scale once CPU usage reaches a certain threshold (e.g. 80% across all nodes for more than one minute). Traditionally, in order to test this autoscaling rule, we’d either need to wait for traffic to organically reach this threshold, or simulate the traffic ourselves using complex scripts. But with Chaos Engineering, we can easily consume CPU cycles across the cluster. We can then monitor our instances and applications to make sure that:

  • The new systems start up correctly.
  • We can load balance traffic between our systems.
  • The customer experience isn’t negatively affected.
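To confirm that an experiment actually crossed the scaling threshold, we can evaluate the rule against our own monitoring samples. The following is a minimal sketch rather than any cloud provider's API. The function name `breach_sustained`, the sample layout, and the reading of "across all nodes" as the mean CPU of the group are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple

# Hypothetical sample layout: (seconds since start, CPU % for each node).
Sample = Tuple[float, Sequence[float]]


def breach_sustained(samples: Sequence[Sample],
                     threshold: float = 80.0,
                     min_duration: float = 60.0) -> bool:
    """Return True if mean CPU stayed above `threshold` for longer than
    `min_duration` seconds in one unbroken run.

    Assumes `samples` is sorted by timestamp and every sample lists at
    least one node.
    """
    run_start = None  # timestamp where the current breach began
    for timestamp, cpu_by_node in samples:
        if mean(cpu_by_node) > threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start > min_duration:
                return True
        else:
            run_start = None  # breach interrupted, start over
    return False
```

If this returns True for the samples collected during the CPU attack but no new instances appear, the autoscaling rule itself is the likely suspect.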

Of course, we also want to make sure that we can scale back down when resources aren’t in use. We don’t want to pay for resources we’re not using. So once your systems scale up, halt your experiment and continue monitoring your instance group to make sure that it automatically scales back down.
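The scale-up and scale-down check can also be automated from a record of the instance count over time. This is a rough sketch under stated assumptions: `scaled_up_and_back` is a hypothetical helper, and the instance counts are assumed to come from whatever monitoring tool we already use, sampled from before the experiment until some time after it was halted.

```python
from typing import Sequence


def scaled_up_and_back(instance_counts: Sequence[int], baseline: int) -> bool:
    """Return True if the group grew beyond `baseline` at some point and
    the last observed count is back at `baseline`."""
    if not instance_counts:
        return False
    grew = max(instance_counts) > baseline
    returned = instance_counts[-1] == baseline
    return grew and returned
```

A series such as `[3, 3, 5, 5, 4, 3]` with a baseline of 3 passes, while `[3, 5, 5, 5]` fails because the group never shrank, which is exactly the case where we would keep paying for idle capacity.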

Be Smart About Redundancy

Having redundant systems is essential for maintaining service during a failure. Organizations that don’t have redundancy risk losing as much as $220,000 for every minute of downtime. A common strategy is to create a replica of your environment and run it in a separate location (known as active-active redundancy). This has a better chance of protecting you during a major outage, but it’s also extremely expensive. Not only are you doubling your operating costs, but you have the added costs of transferring data between both environments.

Alternatively, you can create a replica of your environment that remains on standby and only operates when the primary fails (known as active-passive redundancy). This has the advantage of being lower cost, but it may take longer to spin up during a failover. In this case, we need a way to test our failover strategy to make sure that the replica automatically kicks in and handles load without downtime.

For example, let’s say we have two virtual machine instance groups placed behind a load balancer. One instance group is our primary group, while the second is our failover group. With Chaos Engineering, we can drop all network traffic between the load balancer and the instances in our primary group, to simulate a regional or zonal outage. We can then monitor traffic flow and application availability to make sure that:

  1. The load balancer detects the primary outage and redirects traffic to the secondary group.
  2. The secondary instance group can start up and serve traffic with minimal delays.
  3. Users don’t experience significant delays or data loss.

If we fail to meet any of these conditions, we can halt the attack and immediately return the flow of traffic to the primary group while we troubleshoot the problem. Approaching redundancy this way is effective for making sure that your redundant systems are working correctly and that you’re protected in case of an outage.
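The three conditions above can be turned into explicit abort criteria that we check during the experiment. The sketch below assumes a time-ordered request log that we collect ourselves. The tuple layout, the group name `"secondary"`, and both helper functions are hypothetical, and the limits passed to `should_halt` are values each team would choose for itself.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical log entry:
# (seconds since the attack started, group that served it, success flag)
Request = Tuple[float, str, bool]


def failover_report(requests: Sequence[Request],
                    secondary: str = "secondary") -> dict:
    """Summarize a failover experiment from a time-ordered request log."""
    first_secondary_ok: Optional[float] = None
    failed = 0
    for timestamp, group, ok in requests:
        if not ok:
            failed += 1
        elif group == secondary and first_secondary_ok is None:
            first_secondary_ok = timestamp
    return {
        "failover_seconds": first_secondary_ok,  # None: never failed over
        "failed_requests": failed,
    }


def should_halt(report: dict, max_failover_seconds: float,
                max_failed_requests: int) -> bool:
    """Return True if the experiment breached its abort conditions."""
    took = report["failover_seconds"]
    return (took is None
            or took > max_failover_seconds
            or report["failed_requests"] > max_failed_requests)
```

Deciding the limits before the attack starts keeps the decision to halt objective rather than a judgment call made under pressure.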

Find Unused Resources

It’s easy for cloud resources to become abandoned over time, for any number of reasons:

  • Teams create temporary test or demo environments that they forget to decommission.
  • Misconfigured autoscaling rules create new resources, but don’t remove unused resources.
  • Applications change and no longer use old systems, but engineers keep those systems running because they’re not sure if they’re still in use.
  • Engineers leave the company and forget to document older systems.

The challenge of removing abandoned resources is not knowing whether those resources are still being used. What if that compute instance that’s been running for three years is actually hosting a critical service? Even if the service isn’t critical, will destroying it cause some other, unexpected problem in our application?

Fortunately, we can use Chaos Engineering to test whether a service is essential without deleting or shutting down the instance. As with redundancy, we can drop network traffic to the host to simulate a host failure, then observe the impact on our application. If we’re worried that this is an important production server, we can lower the magnitude of the attack by adding latency to network calls instead. If adding a reasonably small amount of latency (e.g. 150 ms) causes a corresponding drop in throughput, then we’ll know that the application depends on this server. If not, we can scale up to a blackhole attack, which drops all network traffic to the host. In any case, we can always halt the experiment and return service to normal before we do additional testing.
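The escalation logic in this experiment can be sketched in a few lines. This is an illustration under stated assumptions, not a feature of any particular tool: the throughput figures are assumed to be requests per second taken from our own monitoring, both function names are hypothetical, and the 5% cutoff is an arbitrary placeholder that each team would tune.

```python
from statistics import mean
from typing import Sequence


def throughput_drop(baseline_rps: Sequence[float],
                    experiment_rps: Sequence[float]) -> float:
    """Fractional drop in mean throughput during the experiment
    (0.25 means throughput fell by 25%)."""
    base = mean(baseline_rps)
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return (base - mean(experiment_rps)) / base


def next_step(drop: float, critical_drop: float = 0.05) -> str:
    """Suggest what to do after a latency experiment.

    `critical_drop` is a hypothetical cutoff, not a recommended value.
    """
    if drop >= critical_drop:
        return "halt: host appears to be in the request path"
    return "escalate: try dropping all traffic to the host"
```

Comparing against a baseline measured just before the attack helps separate the effect of the added latency from normal fluctuations in traffic.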

Conclusion

Reducing cloud spend is an ongoing challenge for SRE teams, especially as cloud platforms roll out new services and features. Chaos Engineering can help reduce your costs by helping you right-size your infrastructure, be more intelligent about redundancy, and uncover unused resources — all while helping you keep your applications running reliably.

Gremlin is the world’s first hosted Chaos Engineering service with a mission to help build a more reliable internet. It turns failure into resilience by enabling engineers to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.