When you think about “tracing,” what do you think of? It’s one of those words in software development that is overloaded with meaning. Some people may think of a “stack trace,” the familiar blob of text issued by an application runtime that shows each function call preceding a point in the code where an exception occurred. Other people may go a step further, and think of the action of tracing — navigating through logs from different services and computers, literally tracing the path of a request as it moves through a system. Conveniently enough, “distributed tracing” is really a combination of both of these processes — you can think of it as the “call stack” for a distributed system, a way to represent a single request as it flows from one computer to another.
Now, if this doesn’t make a ton of sense, that’s normal — I just dropped a lot of terminology on you! This series is going to demystify distributed tracing, starting from the basics. Today, I’m going to talk about distributed systems — what they are, why we use them, and why the rise of distributed systems has made tracing so important.
So, let’s start at the beginning: code. A computer, without software, can’t do a whole lot — it can do a lot of math, really fast, but not a whole lot else. Software is just instructions to the computer, and code is a human-readable way to express those instructions. You can code in a lot of different programming languages — Java, C#, Go, JavaScript, Ruby, and hundreds more — but all of the programs you write have things in common with each other. Software needs to do something to be useful to people.
Twenty years ago, the way we used software was very different from the way we use it today. We didn’t have “the cloud,” and the internet itself was a nascent technology. That said, a new type of software system had been in development since the 1970s: the distributed system. The idea of a distributed system wasn’t new, per se, but by the 1970s computer technology had advanced to the point where such systems were feasible. In a distributed system, computers can act as both clients and servers, allowing tasks to be performed on different machines. These systems can leverage economies of scale, allowing large quantities of messages or data to be stored on a central server, which lightweight clients can access over a network. The servers take care of the “heavy lifting” of processing the data, while the clients simply make requests for what they need. This basic idea led to more codified forms, such as a three-tier or n-tier architecture, or even a peer-to-peer architecture, where an “application” is spread out across multiple independent services, working in concert with each other to satisfy a user’s request.
A note on architectures, tiers, and layers: Formally, “tiers” and “layers” are not substitutable; a tier refers to a discrete, physical unit, whereas a layer is a logical group of software components. That said, the two terms are often used interchangeably in conversation.
As high-speed internet access became more prevalent throughout the United States and the rest of the world, software architecture changed with it. Rather than specialized client software on home computers, web browsers began to act as an interface to more complex server applications running in remote data centers. These server applications, in turn, began to grapple with a problem — scaling. Not the kind of scaling you do trying to climb a wall — although, I’m sure that many programmers were driven up the wall trying to bring more capacity online! Scaling an application under load can be challenging, depending on how it’s designed. If your application is stateful (as in, it maintains some sort of “user state” in-memory that needs to exist for a long period of time), then it can be extremely difficult to add capacity — especially when you need more memory, storage, or CPUs that can only be obtained by physically buying and installing more servers. These challenges led to changing techniques: creating stateless services, and breaking them into smaller units of functionality. If you’ve heard of a “service-oriented architecture” (or SOA), this is where it came into its own — being able to split up a service into different parts, communicating with each other over a network, made it possible to more easily scale your application in response to demand.
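To make the stateful versus stateless distinction a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the shopping cart scenario, the class names, and `SharedStore`, which stands in for an external database or cache that a real service would reach over the network. It shows one reason in-memory state gets in the way of adding capacity: a second copy of a stateful service knows nothing about what the first copy remembers.

```python
from typing import Dict, List


class StatefulCartService:
    """Keeps every user's cart in this process's own memory."""

    def __init__(self) -> None:
        self._carts: Dict[str, List[str]] = {}

    def add_item(self, user: str, item: str) -> List[str]:
        self._carts.setdefault(user, []).append(item)
        return list(self._carts[user])


class SharedStore:
    """Hypothetical stand-in for an external store such as a database or cache."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    def get(self, key: str) -> List[str]:
        return list(self._data.get(key, []))

    def put(self, key: str, value: List[str]) -> None:
        self._data[key] = list(value)


class StatelessCartService:
    """Holds no user data itself; every request reads and writes the shared store."""

    def __init__(self, store: SharedStore) -> None:
        self._store = store

    def add_item(self, user: str, item: str) -> List[str]:
        cart = self._store.get(user)
        cart.append(item)
        self._store.put(user, cart)
        return cart


def demo():
    # Two stateful copies: the second request lands on a copy that has
    # never seen this user, so the first item appears to be lost.
    a, b = StatefulCartService(), StatefulCartService()
    a.add_item("ana", "book")
    stateful_view = b.add_item("ana", "pen")

    # Two stateless copies sharing one store: either copy can serve the user.
    store = SharedStore()
    c, d = StatelessCartService(store), StatelessCartService(store)
    c.add_item("ana", "book")
    stateless_view = d.add_item("ana", "pen")

    return stateful_view, stateless_view
```

Calling `demo()` returns `(['pen'], ['book', 'pen'])`. The stateless version can be scaled by starting more copies, at the cost of a network hop to the store on every request, which is exactly the kind of extra moving part that makes these systems harder to reason about.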
It is into this world that distributed tracing found its purchase. If your application is split across many individual servers, you need a way to understand the behavior and performance of that entire system, rather than just its individual parts. The failure mode of your application changes — an individual service crashing may result in unexpected or unexplained behavior in a completely different part of the system. When these failures occur, you need more than just a stack trace logged to the offending machine — you need to be able to see the entire request, from beginning to end. Developers came up with a lot of different solutions to this problem — centralized logging, remote debugging, and a variety of other tools to aid in diagnosing problems with distributed systems. Over time, though, the problems continued to compound. Applications became more complex, more distributed. New deployment platforms and tooling — virtual machines, containers, Kubernetes — made it easier to create more complex applications, with more moving parts. The cloud made it possible to easily provision new infrastructure and scale it around the world, and all of this led to even more complexity and confusion.
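As a rough illustration of what “seeing the entire request” can look like, here is a small Python sketch. It is not the API of any real tracing library. The `Span` record, the `start_span` helper, the `COLLECTED` list (a stand-in for a central collector), and the three toy “services” are all assumptions made up for this example. The services are plain functions here, whereas in a real system each call would be a network request that carries the trace context along with it.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work done by one service on behalf of a request."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # the span that caused this one, if any
    service: str
    operation: str


# Hypothetical stand-in for a central collector that every service reports to.
COLLECTED: List[Span] = []


def start_span(service: str, operation: str, parent: Optional[Span] = None) -> Span:
    """Record a span, reusing the parent's trace ID or starting a new trace."""
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        service=service,
        operation=operation,
    )
    COLLECTED.append(span)
    return span


def database(parent: Span) -> None:
    start_span("database", "SELECT order", parent)


def orders(parent: Span) -> None:
    span = start_span("orders", "get_order", parent)
    database(span)


def frontend() -> str:
    span = start_span("frontend", "GET /order/42")
    orders(span)
    return span.trace_id


def render_trace(trace_id: str) -> List[str]:
    """Rebuild the request's 'call stack' from the collected spans."""
    spans = [s for s in COLLECTED if s.trace_id == trace_id]
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in spans:
            if s.parent_id == parent_id:
                lines.append("  " * depth + f"{s.service}: {s.operation}")
                walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_trace(frontend())))
```

Running the script prints the frontend span at the top, the orders span indented beneath it, and the database span indented beneath that. The shared trace ID is what ties the records from different services together, and the parent ID is what gives them their shape.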
Let’s look at this in a bit more detail — what are the problems that crop up with these distributed systems and why do we need these tools?
One solution to this constellation of complexity is distributed tracing, specifically distributed tracing built for cloud native, polyglot applications. But what is distributed tracing? Why do we need it? What, exactly, are the problems caused by these distributed systems? In the next part of this series, I’ll cover the issues that distributed systems can lead to, and why distributed tracing is the backbone of understanding how our systems function.