
URL: https://thenewstack.io/use-monitoring-insights-to-optimize-cost/

Use Monitoring Insights to Optimize Cost - The New Stack



Use Monitoring Insights to Optimize Cost

Environment-as-a-service as a cost-savings measure.
Nov 19th, 2021 10:00am by Nati Shalom
Feature image by Pavel Charny from Pixabay.
Nati Shalom
Nati Shalom is the CTO and Founder of Cloudify. He is a serial entrepreneur and widely published thought leader and speaker in open source, multicloud orchestration, network virtualization, DevOps, and edge computing. Nati has received multiple recognitions including YCombinator and is one of the leaders of the Cloud Native and DevOps Israel groups.

One of the side effects of the “gold rush” to the public cloud is that many organizations are now facing rapidly growing cloud costs, including large and sudden spikes. About 55% of respondents to Anodot’s survey say they have been “surprised” by cloud costs or had an incident where cloud costs suddenly spiked. This problem gets worse as software companies scale. Many hyperscale companies have surpassed the point at which 50% of their total COGS are allocated to cloud spend, making cloud infrastructure cost optimization a strategic imperative, as I noted in my previous post, “Reducing Cloud Spend Need Not Be a Paradox.”

In this post, I’ll compare two methods for controlling cloud cost: cost monitoring vs. cost efficiency. I’ll also use a specific example to illustrate how we can gain 10x better efficiency, and get closer to the optimum, by creating separate, optimized environments for development and production.

Some of this cost escalation is the result of human error. Consider the following real-world examples:

“One employee selected the wrong EC2 instance, and it cost the company nearly $40,000 over the course of a couple of days before the error was caught and corrected.”

“Internal users have left cloud-based GPUs spinning, even after work on them has stopped.” — Danny Zalkind, Kenshoo’s DevOps group manager

Methods for Achieving Cost-Efficiency

Cost-efficiency is all about matching the right infrastructure to the job.

Cloud offers a wide range of infrastructure resources at varying costs. For example, on EC2 alone AWS currently offers nearly 400 different instances with choices across storage options, networking, and operating systems. Complicating this further is that users can choose from machines located in 24 regions and 77 availability zones around the world.

This is just a small fraction of the options you can choose from to optimize your infrastructure, and this list keeps on growing. For the sake of simplicity, I grouped the primary optimization options into three main categories.

  1. Policy refers to usage patterns. For example, a decommissioning policy sets a time limit on a workload so that idle resources are not left running. Autoscaling is another policy-driven approach, which matches infrastructure capacity to real-time demand. A placement policy can be used to select, at runtime, the right infrastructure target for a particular workload based on availability, location, etc. Repatriation is a policy in which we use a hybrid cloud to offload some workloads to a dedicated and highly optimized infrastructure purpose-built to run them. Dropbox storage and the Netflix CDN are examples of such workloads.

  2. HW Profile refers to the choice of the specific compute or storage resource combination that provides the best cost/performance ratio. This category alone includes thousands of possible combinations ranging from Spot to a dedicated bare-metal machine.

  3. Architecture refers to the selection of a specific platform architecture such as EKS, ECS, Servlets, etc. Quite often the choice of a platform requires that the application be written specifically for that platform to achieve the best cost/performance ratio.

[Image: Methods for achieving cost-efficiency]
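As a rough illustration of the decommissioning policy described in the first category, the sketch below flags workloads that have outlived their time limit. The `Workload` class, the `ttl` field, and the sample workloads are all hypothetical names made up for this example. A real implementation would read this data from your cloud provider or orchestrator.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Workload:
    name: str
    started_at: datetime
    ttl: timedelta  # hypothetical per-workload time limit


def expired_workloads(workloads, now):
    """Return the workloads whose time limit has elapsed as of `now`."""
    return [w for w in workloads if now - w.started_at >= w.ttl]


# Illustrative data only: one workload past its limit, one within it.
now = datetime(2021, 11, 19, 10, 0, tzinfo=timezone.utc)
workloads = [
    Workload("gpu-training", now - timedelta(hours=30), timedelta(hours=24)),
    Workload("dev-sandbox", now - timedelta(hours=2), timedelta(hours=8)),
]

to_decommission = [w.name for w in expired_workloads(workloads, now)]
print(to_decommission)
```

Running a check like this on a schedule is one simple way to catch cases such as GPUs left spinning after the work on them has stopped.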

Cost Monitoring vs. Cost Efficiency

Cost monitoring tells you where your infrastructure costs are being spent, and it may also highlight areas of potential inefficiency. However, it provides very little direction on how to fix those inefficiencies. As with any other monitoring system, cost monitoring can quickly overwhelm you with data, making it hard to separate the crucial insights from the noise.

Cost efficiency, on the other hand, continuously looks at how to match specific workloads to the right choice of architecture and infrastructure. Quite often that optimization will involve code or architecture changes, and therefore it requires more dedicated and continuous engineering work.

The following example is a good demonstration of that difference. In this example, we use the same containerized workload and run it on two different platforms on the same cloud provider, EKS and ECS. As can be seen in this benchmark, by choosing ECS over EKS we can save 67%. In this specific case, this optimization comes at the expense of portability.

The lesson from this very simple example is that cost-efficiency is an ongoing engineering task. We constantly have to choose between conflicting tradeoffs that can have long-term implications and that cannot be addressed just by throwing a tool at the problem.

[Image: In this example, ECS saves 67% over EKS, but at the cost of limiting workload portability.]
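To make the arithmetic behind a figure like "67% savings" explicit, here is a minimal sketch. The monthly costs are hypothetical values chosen only to reproduce the percentage. They are not the numbers from the benchmark.

```python
def savings_fraction(baseline_cost, optimized_cost):
    """Fraction of the baseline cost saved by switching to the optimized option."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1 - optimized_cost / baseline_cost


# Hypothetical monthly costs for the same containerized workload.
eks_monthly = 300.0
ecs_monthly = 99.0

saving = savings_fraction(eks_monthly, ecs_monthly)
print(f"{saving:.0%}")
```

The same function works for any pair of platforms, which helps when the comparison has to be repeated as prices and workloads change.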

Achieving the Optimum — the Right Infrastructure for the Job

The theoretical optimum in terms of efficiency is to tailor the infrastructure specifically for each particular workload. This is practically impossible but, nevertheless, it gives us a higher benchmark that we can strive to achieve.

As the number of infrastructure choices and platforms continues to grow, it becomes harder to match workloads to infrastructure at the level of individual resources.

While the number of infrastructure choices is extremely high, the number of types of workload environments is relatively low. This is especially so if we’re looking at the workloads that consume the bulk of our infrastructure resources.

Environment-as-a-Service (EaaS) provides a means by which we create an optimized stack for each workload environment. Examples of such environment types include:

  • Development and production environments

  • Machine learning environments

  • Environment per project/customer/product

We refer to those environments as “certified environments.”

Example — Optimizing Development vs. Production Environments

In the following example, we took a typical Kubernetes-based environment, which includes the Kubernetes cluster as well as shared infrastructure services such as network, storage, and database.

We created multiple versions of that same stack. The first is optimized for production on AWS and Azure. In this case, we chose a fully managed stack: managed Kubernetes (EKS, AKS), managed storage, and a managed database. For the development environment, we chose a stack optimized for low cost and agility: Minikube and K3S as the lightweight Kubernetes, single-instance storage (Minio), Postgres, and a simple network, all running on a single VM.

The following diagram shows the specific mapping of the different flavors of that environment.

[Image: Environment as a service example]
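One way to picture this mapping is as a small catalog keyed by environment type. The sketch below is only an illustration built from the components named above. The dictionary layout, key names, and `resolve_stack` helper are assumptions made for this example, not an actual EaaS API.

```python
# Hypothetical catalog of certified environments, using the components
# named in the text. Keys and field names are illustrative only.
CERTIFIED_ENVIRONMENTS = {
    "production-aws": {
        "kubernetes": "EKS",
        "storage": "managed",
        "database": "managed",
    },
    "production-azure": {
        "kubernetes": "AKS",
        "storage": "managed",
        "database": "managed",
    },
    "development": {
        "kubernetes": "Minikube / K3S",
        "storage": "Minio (single instance)",
        "database": "Postgres",
        "network": "simple",
        "hosts": "single VM",
    },
}


def resolve_stack(purpose, cloud=None):
    """Look up the stack for an environment purpose and optional cloud."""
    key = purpose if cloud is None else f"{purpose}-{cloud}"
    try:
        return CERTIFIED_ENVIRONMENTS[key]
    except KeyError:
        raise ValueError(f"no certified environment named {key!r}") from None


print(resolve_stack("production", "aws")["kubernetes"])
print(resolve_stack("development")["hosts"])
```

Keeping the catalog small is the point: a handful of named stacks is easier to optimize and certify than every possible resource combination.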

Achieving 10X Cost Saving

We used real-time cost monitoring to measure the actual cost per environment. As we expected, the development environment cost was equal to the cost of a single VM and was 10x lower than that of the plain vanilla production stack.

[Image: A hypothetical display of a total of costs]

What’s interesting is that the detailed resource breakdown shows that creating a managed resource brings with it a significantly higher number of resources and associated hidden costs, as we can see in the following resource breakdown table:

[Image: Data view by cost]

You should note that in this case, we didn’t include things like geo-redundancy, which would double the number of production resources as well as increase the bandwidth and networking cost. The dynamic nature of those production resources also makes it close to impossible to predict the actual cost, whereas in the development stack all the resources run in a single VM, which makes cost prediction fairly deterministic.
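The comparison can be sketched with a few lines of arithmetic. Every rate below is a hypothetical figure, in cents per hour, picked so that the ratio comes out at 10x to mirror the result described above. The resource names and the 730-hour month are also assumptions for illustration, not measured data.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def monthly_cost_cents(hourly_rates_cents, hours=HOURS_PER_MONTH):
    """Sum the hourly rate of every resource and scale it to a month."""
    return sum(hourly_rates_cents.values()) * hours


# Hypothetical rates: the development stack is a single VM, so its
# cost is one predictable line item.
development = {"single_vm": 10}

# Hypothetical rates: a managed stack pulls in several billable resources.
production = {
    "managed_kubernetes": 10,
    "worker_nodes": 40,
    "managed_database": 30,
    "managed_storage": 10,
    "load_balancer_and_nat": 10,
}

dev_cost = monthly_cost_cents(development)
prod_cost = monthly_cost_cents(production)
ratio = prod_cost / dev_cost

# Geo-redundancy doubles the number of production resources.
# Extra bandwidth and networking cost is ignored in this sketch.
geo_ratio = (2 * prod_cost) / dev_cost

print(f"production / development: {ratio:.0f}x")
print(f"with geo-redundancy: {geo_ratio:.0f}x")
```

The single-VM side of the calculation depends on one rate and one duration, which is why its cost is easy to predict ahead of time.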

Final Notes

Cost-efficiency is an ongoing engineering task. Moving our workload to the cloud, choosing a cloud-native stack and automation tools, using spot instances where possible, and so on is a good start, but it would still leave us far from the optimum.

A cost monitoring tool can help us detect anomalies and show us where our infrastructure budget is going, but it doesn’t replace the ongoing engineering work needed to optimize our stacks.

With the number of infrastructure choices and platforms continuously growing, it’s going to be close to impossible to optimize stacks at the granular infrastructure resource level. EaaS simplifies this engineering work by taking a more coarse-grained approach in which we organize our environments into a small number of highly optimized stacks based on their target usage. One common example is separating the production and development environments, as we demonstrated above.

With this approach, we can get much closer to the theoretical optimum.

It’s Not Just About Cost

The move to Environment-as-a-Service and certified environments brings with it additional benefits other than cost-efficiency (such as better agility) by democratizing our development environment. Stay tuned for more in this regard.

Minio is a sponsor of The New Stack.