URL: https://thenewstack.io/scaling-environments-with-opentelemetry-and-service-mesh/

Scaling Environments with OpenTelemetry and Service Mesh

OTel Baggage and service meshes like Istio and Linkerd can be used together to implement highly scalable dev, preview and test environments.
Oct 17th, 2023 11:13am by Anirudh Ramanathan
Feature image from Pakorn Khantiyaporn on Shutterstock.
Signadot sponsored this post.

With microservices, each team is dealing with smaller pieces of the application at a time, modularizing development and operational complexity. On the flip side, however, it has created a need to validate and test that all the pieces are working well together. This need has given rise to many new classes of solutions in the past couple of years — ephemeral environments, on-demand environments, preview environments, etc. They share a common purpose: helping to ensure that functionality works as a whole, as early in the development life cycle as possible.

All these classes of microservice environments have traditionally been set up as fully separate copies of the entire set of microservices. These stacks may, in fact, share infrastructure underneath: they might run in the same Kubernetes cluster in different namespaces, on single-node clusters, or even (at smaller scale) as Docker containers on some local or remote node. However, this very notion of running each microservice and all its dependencies as a stack separate from every other stack has some drawbacks:

  1. Cost scaling: They scale in cost with the number of microservices, and often end up needing workarounds to keep costs in check, both in terms of effort to maintain and infrastructure spend. The cost implications may make developers queue up on some shared environment to accomplish their testing.
  2. Stale dependencies and divergence from production: Each environment contains its own copy of each dependency, which is difficult to keep in sync, especially as changes are made to each microservice and pushed continuously. In addition, third-party dependencies and integrations with cloud services may behave differently in these environments than in staging or production, increasing the likelihood of an “it worked in test but not in prod” class of issues.
  3. Increased operational overhead: Every environment is a full stack that has to be kept running, so operational costs go up even if a person owns just a single microservice in the stack.
  4. Suboptimal developer experience: It is difficult for a platform team to support each of these environments, often leading to poor developer experience and low usage. The time it takes to set up the environment also affects developer productivity. The more microservices you have, the slower these environments are to bring up.

Many workarounds have been explored to deal with these drawbacks in practice, but I want to introduce a different way of thinking about environments that has several benefits over previous approaches.

Signadot is a Kubernetes-native platform that empowers AI coding agents to verify code at scale. Combining fast, scalable ephemeral environments with a validation framework built for complex distributed systems, Signadot ensures high-velocity code generation results in safely merged pull requests.

Rethinking Microservice Environments

When we’re developing microservices, each developer or development team is working on changing a small part of the overall whole. Regardless of how often releases land in production, it is common for each microservice to have its own CI/CD process that sends updates to some higher environment like staging. Given this setup and the desire to test early in the development life cycle, we can think of each microservice dev/preview/test environment as being a combination of what’s changed and the “latest” versions of everything else.

[Image]

As shown above, we define the latest versions of all the microservices in the stack as the baseline environment. The baseline environment serves as the default version of every microservice dependency for any environment that is set up and is continuously updated from each CI/CD process. It’s often a single Kubernetes cluster, like staging (or even production). For each new dev/test/preview environment, we only deploy “what changed” (referred to as the sandbox above), which is often a small number of microservices in comparison with the overall number, and share any unchanged dependencies with the baseline environment.
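
To make the idea concrete, here is a minimal sketch of how a sandbox environment could be resolved: only the changed services are overridden, and everything else falls through to the baseline. The service names, workload names and the `resolve_environment` helper are hypothetical, chosen purely for illustration.

```python
# Hypothetical service and workload names, for illustration only.
BASELINE = {
    "frontend": "frontend-baseline",
    "svcA": "svcA-baseline",
    "svcB": "svcB-baseline",
}


def resolve_environment(baseline, sandbox):
    """Return the workload that serves each service in one environment.

    `sandbox` holds only the services that changed; every other service
    is shared with the baseline instead of being copied.
    """
    unknown = sorted(set(sandbox) - set(baseline))
    if unknown:
        raise KeyError(f"sandbox overrides unknown services: {unknown}")
    return {name: sandbox.get(name, workload) for name, workload in baseline.items()}


# A sandbox that changes only svcA still "sees" frontend and svcB from baseline.
environment = resolve_environment(BASELINE, {"svcA": "svcA-pr-123"})
print(environment)
```

Note that the cost of a new environment here is proportional to the size of the override map, not to the size of the baseline.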

This methodology shares some similarities with canarying in production, but in this case, there is a greater emphasis on isolating microservices sufficiently to create sandboxes that can be used during the development process. In the following section, we’ll look at how such a system of sandbox environments can be built in practice.

Request Tenancy

In the previous section, we looked at the logical construct of a sandbox, which combines things under test with a common set of dependencies from the baseline environment. In practice, such a system relies on two key ideas: request tenancy and routing.

[Image]

Taking the figure above, we assume that a request can be tagged with a special identifier that indicates which tenant is sending the request. As long as this tenancy information is passed along the chain from service to service as the call traverses the system, we can use it to make a routing decision: that a particular request should be satisfied by a “sandboxed” version of `svcA` rather than the baseline’s version of `svcA`. So, we need two components to make this type of flow work:

  1. A way to tag requests with tenancy using a special identifier as they flow through a network of microservices.
  2. A way to make a localized routing decision based on the presence of the identifier specified above.

Thankfully, passing a piece of request context along has become simple in modern microservices, thanks to OpenTelemetry. With OpenTelemetry instrumentation in place, this functionality is already available: a special `baggage` header is automatically forwarded to the next microservice in the chain. So, as long as our microservices are instrumented with OpenTelemetry, a request tagged at the entry point keeps its tag across every hop with no additional effort.
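
To show what that forwarding amounts to, here is a rough standard-library sketch of reading a `baggage` header from an incoming request and carrying it onto an outgoing one. OpenTelemetry's propagators do this for you; the code below is not the OpenTelemetry API. The `sandbox-id` key and the helper names are assumptions made for this example, and header names are assumed to be lowercased already.

```python
from urllib.parse import quote, unquote


def parse_baggage(header):
    """Parse a `baggage` header value into a dict, dropping optional properties."""
    entries = {}
    for member in header.split(","):
        member = member.strip()
        if "=" not in member:
            continue
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # ignore ";property" suffixes
        entries[key.strip()] = unquote(value.strip())
    return entries


def format_baggage(entries):
    """Serialize a dict back into a `baggage` header value."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def outgoing_headers(incoming_headers):
    """Headers for a downstream call that carry the caller's baggage forward."""
    entries = parse_baggage(incoming_headers.get("baggage", ""))
    return {"baggage": format_baggage(entries)} if entries else {}


# "sandbox-id" is a hypothetical key used to tag the tenant.
print(outgoing_headers({"baggage": "sandbox-id=pr-123, user=alice;prop=1"}))
```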

Now, when it comes to actually making the routing decision, the most natural solution is service meshes such as Istio, Linkerd, etc. These meshes enable the creation of rules to make exactly these types of localized routing decisions. Therefore, we end up with something like this:

[Image]
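
In a real setup the mesh would express this as a routing rule, but the decision itself is small enough to sketch in a few lines. The following is an illustration of the logic only, not Istio or Linkerd configuration; the `sandbox-id` baggage key, the route table shape and the destination names are all hypothetical.

```python
SANDBOX_KEY = "sandbox-id"  # hypothetical baggage key carrying the tenant


def tenant_of(headers):
    """Extract the tenant identifier from the `baggage` header, if present."""
    for member in headers.get("baggage", "").split(","):
        key, _, value = member.strip().partition("=")
        if key.strip() == SANDBOX_KEY:
            return value.split(";", 1)[0].strip() or None
    return None


def route(service, headers, sandbox_routes):
    """Pick the destination for a call to `service`.

    `sandbox_routes` maps tenant -> {service: destination}. Anything not
    overridden for the request's tenant goes to the baseline.
    """
    overrides = sandbox_routes.get(tenant_of(headers), {})
    return overrides.get(service, f"{service}-baseline")


routes = {"pr-123": {"svcA": "svcA-pr-123"}}
tagged = {"baggage": "sandbox-id=pr-123"}

print(route("svcA", tagged, routes))  # sandboxed svcA
print(route("svcB", tagged, routes))  # unchanged dependency, served by baseline
print(route("svcA", {}, routes))      # untagged traffic never reaches the sandbox
```

The decision is local: each hop needs only the request's own headers and the route table, which is what lets the baseline keep serving untagged traffic undisturbed.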

One of the big wins of using such a system is that testing multiple microservices together becomes extremely simple. Features often span multiple microservices, which makes them hard to test together until they have all landed in some common shared environment. Here, it’s possible to create a new tenant that is a combination of two other tenants just by controlling the identifier with which we tag the request, which opens up new ways of collaborating during the microservice-building process.
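
One way to picture a combined tenant is as the union of the overrides of the tenants it is built from. The sketch below assumes a route table of the form tenant -> {service: destination}; that shape, the `combine_tenants` helper and all names in it are hypothetical.

```python
def combine_tenants(routes, new_tenant, *tenants):
    """Return a copy of `routes` with `new_tenant` set to the union of `tenants`.

    Raises ValueError if two tenants override the same service differently,
    since a single request cannot be routed to both.
    """
    combined = {}
    for tenant in tenants:
        for service, destination in routes[tenant].items():
            if combined.get(service, destination) != destination:
                raise ValueError(f"conflicting overrides for {service}")
            combined[service] = destination
    return {**routes, new_tenant: combined}


routes = {
    "pr-123": {"svcA": "svcA-pr-123"},  # one developer's change
    "pr-456": {"svcB": "svcB-pr-456"},  # another developer's change
}
routes = combine_tenants(routes, "feature-x", "pr-123", "pr-456")
print(routes["feature-x"])
```

Requests tagged with `feature-x` would then exercise both changes together, while each original tenant keeps working on its own.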

Data Isolation

Above we used a simple stateless microservice, where we were using an L7 protocol like HTTP or gRPC, which made request labeling and routing easy. In practice, there are databases, message queues, cloud dependencies, webhooks, etc., for which isolation using request tenancy might not be sufficient.

For example, testing schema changes to a database that a microservice uses might require setting up an ephemeral database instance or logical database to achieve the necessary isolation. In these cases where request tenancy is insufficient, you can use a higher isolation level. Two higher levels are typically used: logical isolation and infrastructure isolation.

Logical isolation is when you use the same underlying infrastructure (say, a PostgreSQL database cluster), but set up some unit of tenancy within it, like a new database or schema for that particular tenant. Infrastructure isolation is the catch-all, offering dedicated infrastructure for that particular tenant, such as a separate PostgreSQL database cluster. In either case, you can use configuration mechanisms like environment variables or ConfigMaps in Kubernetes to wire the ephemeral logical or physical resource into the rest of the sandbox.
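
As a small sketch of that wiring, a sandboxed service could build its connection string from environment variables and fall back to the baseline's values when none are set. The variable names (`DB_HOST`, `DB_PORT`, `DB_NAME`) and the default host and database names are assumptions for this example.

```python
import os


def database_url(environ=os.environ):
    """Build a PostgreSQL URL, defaulting to the shared baseline database."""
    host = environ.get("DB_HOST", "postgres.baseline.svc")  # hypothetical default
    port = environ.get("DB_PORT", "5432")
    name = environ.get("DB_NAME", "orders")  # hypothetical default
    return f"postgresql://{host}:{port}/{name}"


# Baseline: nothing overridden.
print(database_url({}))
# Logical isolation: same cluster, tenant-specific database.
print(database_url({"DB_NAME": "orders_pr_123"}))
# Infrastructure isolation: a dedicated cluster for the tenant.
print(database_url({"DB_HOST": "postgres.pr-123.svc"}))
```

The service code stays the same at every isolation level; only the configuration injected into the sandboxed workload changes.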

The level of isolation to choose depends on the use case, but there is a clear trade-off: Higher levels increase the operational work involved in setting up and managing infrastructure, while offering less interference from other actors in the rest of the system. In practice, logical isolation suffices in most cases, except where the data store itself lacks such a provision, or in certain performance/load-testing scenarios.

Message Queues

For message queues, it is simplest to build tenancy information into the messages themselves (as is enabled by OpenTelemetry) and make a decision at the consuming microservice whether a particular message is relevant to itself. The key idea here is to enable the consumers to consume messages selectively so that they don’t end up processing messages intended for a different tenant.

In a system like Apache Kafka, this is done by setting up a separate consumer group per tenant, and then making application-layer changes to the consumer libraries to implement this kind of logic to consume messages selectively.
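
The selective-consumption logic could look roughly like the sketch below. It assumes every consumer instance sits in its own consumer group (so each sees every message), that message headers have already been decoded into a plain dict, and that the tenant travels in a hypothetical `sandbox-id` header. None of this is a Kafka client API.

```python
def should_process(message_headers, consumer_tenant, sandboxed_tenants):
    """Decide whether this consumer instance should handle a message.

    consumer_tenant:   tenant of a sandboxed consumer, or None for the baseline.
    sandboxed_tenants: tenants that run their own sandboxed copy of this consumer.
    """
    message_tenant = message_headers.get("sandbox-id")
    if consumer_tenant is not None:
        # A sandboxed consumer handles only its own tenant's messages.
        return message_tenant == consumer_tenant
    # The baseline handles everything no sandboxed consumer will pick up.
    return message_tenant not in sandboxed_tenants


sandboxed = {"pr-123"}
message = {"sandbox-id": "pr-123"}

print(should_process(message, "pr-123", sandboxed))  # sandboxed consumer
print(should_process(message, None, sandboxed))      # baseline consumer
```

With this rule, each message is processed by exactly one version of the consumer, as long as the baseline knows which tenants currently have a sandboxed copy.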

Async Jobs and Third-Party Dependencies

In some cases, a microservice may not participate in request flows at all, but act in a completely asynchronous manner, like a cron job that performs some operation periodically, or may itself be the point of origin of requests. In this case, you can still create a “sandbox” for a new version of it, but the tenancy would be specified for that particular sandboxed instance of the microservice itself. Essentially, our “tenant” in this context becomes an entire microservice, rather than a request.
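
One possible reading of this, sketched below, is that the sandboxed instance takes its tenant from its own configuration and stamps it on every request it originates. The `SANDBOX_ID` variable, the `sandbox-id` baggage key and the `origin_headers` helper are all hypothetical.

```python
import os
from urllib.parse import quote


def origin_headers(environ=os.environ):
    """Headers for requests that originate in this instance (e.g., a cron job).

    A sandboxed instance is configured with SANDBOX_ID; the baseline
    instance has no such variable and sends untagged requests.
    """
    tenant = environ.get("SANDBOX_ID")
    if not tenant:
        return {}
    return {"baggage": f"sandbox-id={quote(tenant, safe='')}"}


print(origin_headers({"SANDBOX_ID": "pr-123"}))  # sandboxed job tags its calls
print(origin_headers({}))                        # baseline job sends no tag
```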

This same method applies also in the cases where a third-party dependency exists that does not respect tenancy headers, or if you’re using a custom protocol where adding header metadata is not possible. The key idea is to fall back to using configuration for isolation wherever it is not possible to use request tenancy.

Conclusion

The approach of creating environments using request tenancy and tunable isolation solves several drawbacks of the traditional setup of preview, test and dev environments in Kubernetes. Specifically, since we deploy only as few microservices as needed for each environment, it is highly cost-effective, even at scale, as evidenced by companies that run several hundred such environments internally, such as Uber (SLATE), Lyft (Staging Overrides) and DoorDash.

It also ensures high-fidelity testing against the latest dependencies and is quick to set up, bringing wins in terms of developer experience and productivity. This approach also opens up new ways for developers and development teams working on different microservices to collaborate more seamlessly.

We at Signadot are building a Kubernetes-native solution that makes it easy to create these types of environments and use them as preview, dev and test environments in Kubernetes. We’re excited to help make this possible and reduce the complexity involved in operationalizing the above. You can read more about Signadot’s approach in our documentation or come talk to us on our community Slack channel!

Anirudh Ramanathan is CTO of Signadot where he focuses on cloud native development. Prior to this, he worked at Google focusing on Kubernetes core controllers and extensibility. He's also a committer on the Apache Spark project with a focus on...
TNS owner Insight Partners is an investor in: Pragma, Docker.