Gremlin sponsored this post.
Cloud platforms opened the floodgates for engineering teams to run enterprise-scale applications at much lower cost than traditional on-premises data centers. That said, cloud computing can still get expensive — especially as you scale up your operations. The Flexera 2020 State of the Cloud report found that cost savings was the number one priority for 73% of organizations, and that 23% had gone over budget on cloud spend.
Fortunately, cloud platforms provide a number of cost-optimization features — like resource sizing, on-demand infrastructure, and autoscaling. The trick is knowing how to use these features, while also providing high performance and high reliability in your applications.
In this article, we’ll look at a few different ways you can reduce your cloud spend and how to use Chaos Engineering to do so safely and intelligently.
There’s a balance to strike between provisioning enough capacity and not paying for unused capacity, but finding this balance is tough. For example, how do you:
We need a safe way to validate that our changes are right for our environment, and Chaos Engineering gives us one. Chaos Engineering is the practice of deliberately testing systems for failure by injecting them with precise amounts of harm. By observing how our systems respond to this failure, we can make them more resilient.
How does this apply to right-sizing cloud infrastructure? Imagine we have a group of virtual machine instances that we want to scale once CPU usage reaches a certain threshold (e.g. 80% across all nodes for more than one minute). Traditionally, in order to test this autoscaling rule, we’d either need to wait for traffic to organically reach this threshold, or simulate the traffic ourselves using complex scripts. But with Chaos Engineering, we can easily consume CPU cycles across the cluster. We can then monitor our instances and applications to make sure that:
Of course, we also want to make sure that the group scales back down when the extra capacity is no longer needed, since we don’t want to pay for resources we’re not using. Once the systems have scaled up, halt the experiment and keep monitoring the instance group to confirm that it automatically scales back down.
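To make the scale-up and scale-down check concrete, here is a minimal sketch of how such a rule could be evaluated against CPU samples collected during and after an experiment. It is not tied to any cloud provider's autoscaler: the names `ScalingRule` and `evaluate`, the 10-second sampling interval, and the 40% scale-down threshold are all assumptions made for illustration. Only the 80% threshold and the one-minute window come from the example above.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    scale_up_cpu: float = 80.0    # percent; threshold from the example above
    scale_down_cpu: float = 40.0  # percent; assumed value, not from the article
    sustain_seconds: int = 60     # how long the condition must hold


def evaluate(samples, rule, interval_seconds=10):
    """Decide what the autoscaler should do given recent CPU samples.

    samples: average CPU percentages across all nodes, oldest first,
             one reading every interval_seconds (hypothetical format).
    Returns "scale_up", "scale_down", or "hold".
    """
    # Number of consecutive samples needed to span the sustain window.
    needed = rule.sustain_seconds // interval_seconds + 1
    if len(samples) < needed:
        return "hold"
    window = samples[-needed:]
    if all(s >= rule.scale_up_cpu for s in window):
        return "scale_up"
    if all(s <= rule.scale_down_cpu for s in window):
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    rule = ScalingRule()
    # Hypothetical readings: a CPU attack starts after the first sample.
    during_attack = [35.0, 90.0, 92.0, 91.0, 93.0, 90.0, 94.0, 92.0]
    # The attack is halted and CPU drops back to idle levels.
    after_halt = during_attack + [20.0] * 7
    print(evaluate(during_attack, rule))  # scale_up
    print(evaluate(after_halt, rule))     # scale_down
```

Comparing these expected decisions with what the instance group actually did during the experiment shows whether the rule fires when it should, and whether it releases capacity afterwards.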
Having redundant systems is essential for maintaining service during a failure. Organizations that don’t have redundancy risk losing as much as $220,000 for every minute of downtime. A common strategy is to create a replica of your environment and run it in a separate location (known as active-active redundancy). This has a better chance of protecting you during a major outage, but it’s also extremely expensive. Not only are you doubling your operating costs, but you also take on the added cost of transferring data between the two environments.
Alternatively, you can create a replica of your environment that remains on standby and only operates when the primary fails (known as active-passive redundancy). This has the advantage of being lower cost, but it may take longer to spin up during a failover. In this case, we need a way to test our failover strategy to make sure that the replica automatically kicks in and handles load without downtime.
For example, let’s say we have two virtual machine instance groups placed behind a load balancer. One instance group is our primary group, while the second is our failover group. With Chaos Engineering, we can drop all network traffic between the load balancer and the instances in our primary group, to simulate a regional or zonal outage. We can then monitor traffic flow and application availability to make sure that:
If we fail to meet any of these conditions, we can halt the attack and immediately return the flow of traffic to the primary group while we troubleshoot the problem. Testing redundancy this way confirms that your redundant systems work correctly and that you’re protected in case of an outage.
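The halt-or-continue decision can be sketched as a small check over probe results gathered while the attack is running. This is an illustrative sketch only: the `Probe` record, the `failover_verdict` function, and the zero-failure default are hypothetical, and the two checks it applies (the failover group took over, and requests did not fail) are taken from the goal stated earlier of failing over without downtime.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    served_by: str  # "primary", "failover", or "" if nothing answered
    ok: bool        # True if the request succeeded


def failover_verdict(probes, max_failed=0):
    """Return "pass" if the failover group served traffic and no more
    than max_failed requests failed; otherwise return "halt"."""
    failed = sum(1 for p in probes if not p.ok)
    took_over = any(p.served_by == "failover" and p.ok for p in probes)
    if failed > max_failed or not took_over:
        return "halt"
    return "pass"


if __name__ == "__main__":
    # Hypothetical probes taken while traffic to the primary is dropped.
    clean = [Probe("failover", True) for _ in range(5)]
    gap = [Probe("", False), Probe("failover", True), Probe("failover", True)]
    print(failover_verdict(clean))  # pass
    print(failover_verdict(gap))    # halt
```

Raising `max_failed` relaxes the check for setups where a brief gap during the switch is acceptable.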
It’s easy for cloud resources to become abandoned over time, for any number of reasons:
The challenge of removing abandoned resources is not knowing whether those resources are still being used. What if that compute instance that’s been running for three years is actually hosting a critical service? Even if the service isn’t critical, will destroying it cause some other, unexpected problem in our application?
Fortunately, we can use Chaos Engineering to test whether a service is essential without deleting or shutting down the instance. As with redundancy, we can drop network traffic to the host to simulate a host failure, then observe the impact on our application. If we’re worried that this is an important production server, we can lower the magnitude of the attack by adding latency to network calls instead. If adding a reasonably small amount of latency (e.g. 150 ms) causes a measurable drop in throughput, then we’ll know this is a critical server. If not, we can scale up our attack to a blackhole attack, which drops the host’s network traffic entirely. In any case, we can always halt the experiment and return service to normal before we do additional testing.
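The escalation decision described here comes down to comparing throughput before and during the latency attack. A minimal sketch, assuming throughput is measured in requests per second and that a drop of more than 5% counts as significant (both the unit and the `tolerance` value are assumptions, as are the function names):

```python
def throughput_drop(baseline_rps, observed_rps):
    """Fractional drop in throughput relative to the baseline (0.0 to 1.0)."""
    if baseline_rps <= 0:
        raise ValueError("baseline_rps must be positive")
    return max(0.0, (baseline_rps - observed_rps) / baseline_rps)


def next_step(baseline_rps, rps_with_latency, tolerance=0.05):
    """Return "halt" if the latency attack hurt throughput, which suggests
    the host is critical, or "escalate" if it is safe to move on to a
    blackhole attack."""
    drop = throughput_drop(baseline_rps, rps_with_latency)
    if drop > tolerance:
        return "halt"
    return "escalate"


if __name__ == "__main__":
    # Hypothetical measurements taken before and during a latency attack.
    print(next_step(1000.0, 700.0))  # halt
    print(next_step(1000.0, 990.0))  # escalate
```

The right tolerance depends on how noisy the application's normal throughput is, so it helps to measure the baseline over several runs before drawing conclusions.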
Reducing cloud spend is an ongoing challenge for SRE teams, especially as cloud platforms roll out new services and features. Chaos Engineering can reduce your costs by helping you right-size your infrastructure, be more intelligent about redundancy, and uncover unused resources, all while keeping your applications running reliably.