
URL: https://thenewstack.io/the-importance-of-feedback-loops-in-distributed-systems/

The Importance of Feedback Loops in Distributed Systems - The New Stack



The Importance of Feedback Loops in Distributed Systems

Distributed tracing gives you request-level visibility into what’s happening in your system. It’s the glue that binds diagnostic data together.
Sep 24th, 2020 9:35am by Austin Parker
ServiceNow sponsored this post.
This is part of a series on Distributed Tracing. For a list of other articles in this series, check out the introductory post.
Austin Parker
Austin Parker is the Principal Developer Advocate at Lightstep and a maintainer on the OpenTracing and OpenTelemetry projects. In addition to his professional work, he's taught college classes, spoken about all things DevOps and Distributed Tracing, and even found time to start a podcast. Austin is also the co-author of Distributed Tracing in Practice, published by O'Reilly Media.

There’s an aphorism that’s not usually applied to software, but I think it’s a useful one to keep in mind: “If at first you don’t succeed, try, try again.” When I say it, you may think that I’m referring simply to the act of development, an encouraging phrase to pick you up after an attempted bug fix fails to get the results you want. While it’s useful for that, it’s more interesting to think of it from the perspective of a user of software. When you click a button on a website, and it doesn’t work… what do you do? Well, most likely, you’ll click it again. Maybe you’ll refresh the page. If you’re particularly interested, driven, or savvy, you might open the browser’s developer tools pane and poke around in the JavaScript console or network tab, trying to glean some information about what’s happening and what the problem could be.

This basic loop encompasses a lot of what we do as developers trying to understand our systems. You do something, see what happens, gather more information, and try to do it again after tweaking some things. Most systems will behave in a similar way. If I’m calling some external service and it doesn’t respond, my service should try again after a few moments. If I’m trying to write a value to a database, I shouldn’t give up if it fails — I should try again, depending on why it failed. That concept of why, though, is important. I need to have context for what happened, in order to determine how to try again. These feedback loops are vital for everything we do in life! If I’m trying to get a large piece of furniture through a door, it’s not going to do a lot of good if I simply keep pushing when it gets stuck. I need to evaluate why it’s stuck, then change what I’m doing. Blindly pushing away just gets me a ruined couch, a broken door, and back pain.
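To make the "depending on why it failed" part concrete, here is a minimal Python sketch of a retry loop that looks at the kind of failure before trying again. The names `TransientError`, `PermanentError`, and `call_with_retry` are hypothetical, invented for this illustration; a real client library will have its own exception types and may already ship a retry policy.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. invalid input)."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); retry only transient failures, backing off each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
        # PermanentError is deliberately not caught, so it propagates at once:
        # pushing harder on a stuck couch does not help.
```

The `sleep` parameter is injectable only so the loop can be exercised without actually waiting.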

In the previous installment of this series, I talked about the rise of distributed systems, and some of the reasons they’ve become so popular. Now, let’s think about them in the context of what we just discussed: the notion of feedback loops. When I’m developing a small application, my feedback loop is very short. I can make a change to a line of code, recompile and re-run the software, and immediately see what changed. This is pretty useful, to say the least, when it comes to not only understanding my software but improving it. The connection between a change and its result is extremely obvious and easy to quantify. Imagine, though, a distributed system where I can only change small parts of it at once. My changes may be a relative drop in the ocean, but even a small drop can ripple outwards and eventually become a mighty wave.

Consider the following scenario: I have a service that receives some data (it doesn’t really matter what kind). Maybe I’m taking a value and re-encoding or reformatting it as part of an integration with a new source of customer data. The service gets deployed, and all’s right with the world. But one day, a ticket comes in. The data format is being changed, so I need to change how I convert it. No problem: a couple of lines of code, a couple of test cases and let’s even assume we gracefully handle both the old and new data formats. Heck, it’s a Friday, let’s go ahead and deploy it — what’s the worst that can happen, right? I push my code, the PR merges, and I knock off for a relaxing weekend of wistfully looking out the window and watching YouTube, remembering the times before Covid-19.


I did everything right in this scenario, didn’t I? In isolation, of course I did. I wrote test cases, I defensively programmed, I made sure everything worked in staging, I had someone review my code, I double-checked the specifications and the documentation… everything should be fine. Except, what if it’s not fine? What if the new conversion logic is slower, even if only slightly, than the old code? What if my defensive programming, checking that I can convert data in both the old and the new format, has added latency to the critical path of my application? And, depending on where my service is being called from, what happens when those other services get backed up? The extra milliseconds don’t seem noticeable at first, but maybe they do matter… and let’s say they do. Suddenly, an older service, four or five hops away from mine, starts timing out because of the additional latency I added. Those timeouts cause rippling failures, as other services begin to time out as well, or even begin to fail with errors and crash. These timeouts and crashes eventually lead to unexpectedly high load on the primary database for the application, which starts to fall over, and suddenly my small change has caused a complete outage. Whoops!
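The arithmetic behind that first timeout can be sketched in a few lines of Python. Everything here is made up for illustration: the service names, the per-hop work times, and the timeouts are assumptions, and the model is deliberately simple (one sequential call per hop, no queuing, no retries).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hop:
    name: str
    own_work_ms: float           # time spent inside this service itself
    timeout_ms: Optional[float]  # how long it waits on its downstream call


def timed_out_callers(chain):
    """chain runs from the outermost caller to the innermost service.

    Returns the names of services whose downstream call exceeds their timeout.
    """
    failures = []
    downstream_ms = 0.0
    for hop in reversed(chain):
        if hop.timeout_ms is not None and downstream_ms > hop.timeout_ms:
            failures.append(hop.name)
            downstream_ms = hop.timeout_ms  # the caller gives up at its timeout
        downstream_ms += hop.own_work_ms
    return failures


def build_chain(converter_ms):
    # Hypothetical call chain: legacy -> gateway -> orders -> enrich -> converter
    return [
        Hop("legacy", 20, 90),
        Hop("gateway", 10, 200),
        Hop("orders", 15, 150),
        Hop("enrich", 20, 100),
        Hop("converter", converter_ms, None),  # my service, the innermost hop
    ]
```

With these numbers, `timed_out_callers(build_chain(40))` returns an empty list, while `timed_out_callers(build_chain(48))` returns `["legacy"]`. Eight extra milliseconds in my service leave every nearby caller comfortably inside its budget, yet push the service four hops away from 85 ms to 93 ms of downstream latency, just past its 90 ms timeout.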

Alright, let’s pause: how do we fix this? Well, that’s actually a tricky question to answer! These sorts of systemic failures can be solved in many different ways, and that’s one of the things that makes them so challenging to tackle. Do we address the direct cause: our service deployment? Perhaps, and you could even say that it’s the root cause of the outage. But really, is it? A lot of other things had to go wrong, after all, to cause the entire system to fall over. A bunch of other services, for example, started to time out and fail; that’s a cause. Legacy services started to crash because they couldn’t handle the timeouts; that’s also a cause. When those services restarted, it caused additional load on the database, which caused the database to fail: also a cause. We can pull the camera out a bit too, though. The reason we deployed a new version of our service in the first place was a request for a change in data formats, which caused this whole rigmarole to begin with. We could zoom out even further, if we wanted to: why did the data format change? Was that avoidable? It’s important to understand that nothing is just a technical problem, just a computer problem. Everything starts, and ends, with people.

Diagnosing and solving these problems can be challenging! As you can hopefully see, it’s not enough to just have data about what’s going on. You need the context of why things are happening. You need the ability to work in reverse — from effect, to cause — and the ability to understand how services in your system are connected together, and how changes in one part of the system can affect other services, even those that aren’t immediately upstream or downstream of it. Generating this data can be hard — you need something that is capable of being integrated into a variety of services, each of which could be deployed in a distinct way, running in data centers around the world. You’d then need a way to collect this information and display it in a human-readable form, and build tools to help you interpret it by allowing you to search and query it.
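One small piece of this, knowing which services could be affected by a change even when they aren't direct neighbors, can be sketched as a walk over a call graph. This is only an illustration: the `calls` mapping and the service names in the usage note are hypothetical, and in practice the hard part is getting an accurate map of who calls whom in the first place.

```python
from collections import deque


def affected_services(calls, changed):
    """Return every service that directly or transitively calls `changed`.

    `calls` maps each service name to the list of services it calls.
    """
    # Invert the graph: for each service, who calls it?
    callers = {}
    for service, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(service)

    # Breadth-first walk outward from the changed service, toward its callers.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen
```

Given a map in which `legacy` calls `gateway`, `gateway` and `reports` both call `orders`, `orders` calls `enrich`, and `enrich` calls `converter`, asking for the services affected by `converter` returns all five of the others, including `reports`, which never talks to `converter` directly.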

Now, thankfully, there are a lot of great solutions to these problems. As I mentioned in the last part of this series, quite a few tools have been developed over the years to monitor services, collect diagnostic information about them, and allow you to try and puzzle out what’s wrong. One common problem people run into, though, is that they all work a little bit differently across different languages, runtimes, and deployment strategies. Most critical, though, is the lack of context. It’s this specific issue that distributed tracing addresses. If you think back to the beginning of this piece, figuring out why something happened was the critical part of our problem-solving feedback loop. Distributed tracing gives you the why, by giving you request-level visibility into what’s happening in your system. It’s not a panacea, and it doesn’t solve all your problems on its own, but it’s the glue that binds the diagnostic data you receive from your services together. How does it work, though, and what is a distributed trace, really? In the next part of this series, we’ll dive into the technical details of OpenTelemetry and explain exactly what a trace is.
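Until then, a rough sketch may help show what "request-level visibility" means in practice. The Python below uses only the standard library and is not the OpenTelemetry API; `Span`, `start_span`, `outgoing_headers`, the header names, and the two toy services are all assumptions made up for this example. The idea it illustrates is that each unit of work records a span, and the caller passes its trace and span identifiers along with the request so the pieces can be stitched back together later.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # the span that caused this one, if any
    name: str
    duration_ms: float = 0.0


FINISHED = []  # stand-in for a real collector or tracing backend


@contextmanager
def start_span(name, headers=None):
    """Record a span, continuing the caller's trace if headers carry one."""
    headers = headers or {}
    span = Span(
        trace_id=headers.get("trace-id") or uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=headers.get("parent-span-id"),
        name=name,
    )
    started = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_ms = (time.perf_counter() - started) * 1000
        FINISHED.append(span)


def outgoing_headers(span):
    """Context to attach to a downstream request (hypothetical header names)."""
    return {"trace-id": span.trace_id, "parent-span-id": span.span_id}


def converter_service(headers):
    with start_span("converter.convert", headers):
        pass  # the actual conversion work would happen here


def gateway_service():
    with start_span("gateway.handle") as span:
        converter_service(outgoing_headers(span))
```

After one call to `gateway_service()`, `FINISHED` holds two spans that share a `trace_id`, and the converter's `parent_id` equals the gateway's `span_id`. That parent link, plus the recorded durations, is what lets you work backward from a slow or failed request to the hop that caused it. Real systems typically carry this context in a standardized header such as the W3C `traceparent` header rather than the ad hoc names used here.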

TNS owner Insight Partners is an investor in: Pragma.