Shannon Information: Discovering Atoms of Communication

Physical objects have atoms; information has bits. Claude Shannon believed that information, although intangible, can be quantified…

Casey Cheng

Mar 21, 2022

16 min read

Information is the difference between current beliefs and facts. Sun and Rain icons are created by Freepik from Flaticon, edited with permission by the author.

Shannon Information – We found the atoms of information

Image by qimono from Pixabay.

When a ball rolls off the edge of a table, it drops. I never questioned why, because from the day we are born we interact with physical objects and observe how they respond. If we see enough repetitions of an event, we start accepting it as "how the world works".

Not only are we somewhat familiar with the physics of the world already, but the discovery of atoms in the early 1800s also graced us with a much more granular look into the fundamentals of why it worked. There is concrete evidence – observable, and quantifiable.

But what about abstract ideas? We can neither see information nor touch it. We know it exists, but its lack of physicality makes it harder to understand and much less intuitive to work with than it needs to be. It raises the question… Are there indivisible units, or some grand unified theory, that shape information the way atoms shape objects?

Perhaps.

If we trace the roots back to ground zero, what we will find is Shannon’s Information Theory.

What is information?

Think of the last time you read a great Medium article and said to yourself, "This is incredibly informative," as you leaned back in your chair and ruminated in absolute gratification.

Why? Why is it informative?

Most people use the word "informative" when there is something to be learned. It should either…

introduce a fresh perspective,
reinforce current beliefs, or
invalidate previous truths.

Whichever it is, it’s informative because it changed our understanding of the world – our beliefs.

The world we live in is so unreasonably dynamic and complex, it’s impossible to attain absolute knowledge in every facet of life. Every day, we are forced to navigate through our lives with incomplete information. As new information presents itself, we calibrate our beliefs to better align with reality so that we can make better decisions.

Since we are not 100% certain about most things, our standpoint is rarely a simple binary yes or no. Instead, our beliefs are better expressed on a continuous spectrum between 0 and 1 that indicates our confidence level.

Beliefs are usually in-betweens, not binary. Sun, Rain, Delete, and Checked icons are created by Freepik and hqrloveq from Flaticon, edited with permission by the author.

If someone asked about the weather, without knowing any better, we would have a 50/50 guess. But given a weather forecast report that contains the relevant information, we would shift our bias towards one or the other.

Information changes beliefs.

While we used a weather forecast report as an analogy, I hope I’m not painting the wrong picture about how information needs to be a report, numbers, or some bar charts. It doesn’t have to be.

Information is intangible.

It is this abstract "thing" that can manifest itself in any shape or form – visuals, sounds, taste, and even smell. What matters is that when consumed, it changes our perspective.

As if the idea of information itself is not fuzzy enough, beliefs are also subjective. There are more than 7 billion people in the world, each with their own experiences, culture, and values. When we experience the world so differently, it’s inevitable that we find polarising perspectives and beliefs. What you deem informative may not be informative for someone else.

Imagine reading a kindergarten book. I doubt we would feel the same level of satisfaction as a 2-year-old child.

Information is personal.

Point is –

Information is this notion of meaningful knowledge. Its presence depends on what we already know about the world. If it brings us one step closer to the ground truths, then there is information.

Information changes beliefs.

Information is intangible.

Information is personal.

Can we take something fuzzy like this and formally define it?

Can we quantify information?

The answer is yes, and it looks like this…

The formula for Shannon’s Information Content.
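
The formula image did not survive in this copy. In its standard form, and using h(x) as an assumed name for the information content of an outcome x with probability p(x), it reads:

```latex
h(x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x)
```

With a base-2 logarithm the result is measured in bits.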

But rather than taking it at face value and accepting that it defines information, it’s more meaningful to understand why it makes sense.

I propose that we derive the formula from the ground up.

Earlier, we’ve made a connection between information and beliefs. If information exists when beliefs are challenged, then we know that information needs to be defined as a function of beliefs.

That aside, it’s important to note that information and beliefs don’t increase together. Quite the opposite. When we observe an event that we perceived as improbable, it surprises us and makes us question the correctness of our beliefs. Hence, we have the most to gain when we are most wrong, and the least to gain when we are already right.

The more our beliefs already align with facts, the less information we are getting. Sun and Rain icons are created by Freepik from Flaticon, edited with permission by the author.

If we have to pen down this inverse relationship in a mathematical construct, it would look like this:

Information as a function of our beliefs, p(x).

The amount of information is inverse to our beliefs. Image by the author.

Visually, the curve fits the description of an inverse relationship which is what we’re looking for. It’s a great start, but probabilities should only range from 0 to 1. We need to fix that.

Apart from that, there are two other minor defects violating the logic that we are trying to establish. The problem lies at the two extremes of p(x) = 0 and p(x) = 1, where the curve gets ever closer to the axes but never actually touches them.

That doesn’t make sense.

When the probability of an event is 0, it means that it will never happen. If it’s never going to happen, ever, then we won’t be able to quantify the information nor have the need to do so.

On the other end of the spectrum, when the probability of an event is 1, we are simply observing a known fact so there should be exactly zero new information to be learned.

When p(x) = 0, information should be undefined.
When p(x) = 1, information should be zero.

This calls for a bit of modification to our formula. Applying a logarithm has some interesting properties that can help us satisfy all the criteria.

Information as a function of the log of our beliefs, p(x).

Adding a log improves the curve. Delete and Checked icons are created by hqrloveq from Flaticon, edited with permission by the author.

Now that the two extremes are fixed, the curve is a much more reasonable way to explain information. From here on, we can rearrange our formula with a bit of mathe-magic and it will give us the exact same formula that Claude Shannon used to describe information content.

Deriving Shannon’s Information Content.
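
The derivation image is missing from this copy, so the steps below are a reconstruction from the surrounding text rather than the author's exact figure. The symbol h(x) and the starting form 1/p(x) are assumptions based on the "inverse relationship" described earlier.

```latex
\begin{aligned}
h(x) &= \frac{1}{p(x)} && \text{(first attempt: a plain inverse)} \\
h(x) &= \log_2 \frac{1}{p(x)} && \text{(second attempt: take the logarithm)} \\
     &= \log_2 1 - \log_2 p(x) \\
     &= -\log_2 p(x)
\end{aligned}
```

At p(x) = 1 this gives exactly 0, and as p(x) approaches 0 it grows without bound, which matches the two criteria for the extremes.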

While the formula looked like gibberish at first, if we peek behind the curtains, Shannon Information Content is just a geeky way of expressing that the amount of information is the magnitude of change in beliefs.
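
To make the formula concrete, here is a minimal Python sketch. The function name `information_content` is made up for illustration, and raising an error at p = 0 is one possible way to honour the "undefined" criterion, not something the article prescribes.

```python
import math

def information_content(p: float) -> float:
    """Shannon information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        # p = 0 is deliberately left undefined, as argued in the text
        raise ValueError("p must be greater than 0 and at most 1")
    return math.log2(1.0 / p)

for p in (1.0, 0.5, 0.125):
    print(f"p = {p}: {information_content(p)} bits")
```

A certain event (p = 1) carries 0 bits, a coin flip (p = 0.5) carries 1 bit, and a 1-in-8 event carries 3 bits.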

How do we use Shannon’s Information Content?

Having a formula gives us a quantifiable way to keep track of the known and unknown in a system. It gives us a systematic way to answer these 3 questions:

How much is there to know?
How much did we learn?
How much more do we not know?

To demonstrate this, I have prepared 8 identical treasure chests for us. All of them are empty except one, so we have a 1/8 chance of unlocking the correct chest.

The treasure chests are shuffled so we don’t know which one has the treasure. Treasure icon is created by Smashicons from Flaticon, edited with permission by the author.

The first question is – how much information do we need to unlock the correct chest with 100% certainty? A little? A lot? Rather than toying with such a vague idea of certainty, Claude Shannon’s Information Content argues that we need exactly 3 units of information, or 3 bits, as he calls it.

The probability of finding treasure is 1/8, hence we need 3 bits of information.

What Shannon is claiming… is that if we have 3 bits of information, we will know all there is to know about the entire system. Let’s open up the chests one by one, observe how information is gained along the way, and verify the validity of this statement.

As we unlock the chests, we learn more information about the system. Treasure (Closed, Filled), Delete, and Checked icons are created by Smashicons and hqrloveq from Flaticon, edited with permission by the author.

Opening up the first chest reveals that it’s empty. There was a 7/8 chance of observing such an event, so we’ve learned 0.193 bits of information. For each empty treasure chest that we unlock, the probability of finding another empty chest gets incrementally lower, and as a result, the amount of information that we get from each chest increases steadily.

As we gain more information, the remaining uncertainty about the system decreases. At the halfway point, we have gained 1 bit of information, leaving only 2 bits to uncover. In layman’s terms, it means we are more certain about where the treasure is than when we started. This seems appropriate as there are only 4 chests left.

On the 6th chest, there is a sudden surge in information gain because we manage to find the treasure (which had a probability of 1/3). Beyond this point, we stop getting any information because we already know for certain that the last two chests are going to be empty.

There is an interesting observation to be made here.

When we unlocked the correct chest, we learned a whopping 1.585 bits of information in one go, bringing our cumulative information to 3 bits. This is exactly the total amount of information in the system.
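
The chest-by-chest bookkeeping can be reproduced with a short sketch. The helper name `open_chests` and its parameters are hypothetical; it assumes chests are opened in order and that every unopened chest is equally likely to hold the treasure.

```python
import math

def open_chests(n_chests: int, treasure_position: int) -> list[float]:
    """Bits learned from each chest, opened in order up to and including the treasure."""
    gains = []
    for i in range(1, treasure_position + 1):
        remaining = n_chests - i + 1             # unopened chests before this one
        if i < treasure_position:
            p = (remaining - 1) / remaining      # probability this chest is empty
        else:
            p = 1 / remaining                    # probability this chest holds the treasure
        gains.append(math.log2(1 / p))
    return gains

running_total = 0.0
for chest, bits in enumerate(open_chests(8, 6), start=1):
    running_total += bits
    print(f"chest {chest}: {bits:.3f} bits, cumulative {running_total:.3f} bits")
```

Changing the second argument moves the treasure, so the same function can be used to check other positions.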

Intuitively, this makes sense because if we know where the only treasure chest is, then all other options will be nullified. But did the numbers add up because of sheer coincidence? Just to convince ourselves, let’s hide the treasure in the 5th chest instead.

Observe.

Keeping the treasure in the 5th chest instead of the 6th still gives us the same total of information. Treasure (Closed, Filled), Delete, and Checked icons are created by Smashicons and hqrloveq from Flaticon, edited with permission by the author.

No matter where we put it, it will always match up to the total information. Here’s the mathematical proof to support the statement.

The information content will always add up to the total information, regardless of where we keep the treasure.
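
The proof image is not available in this copy. One way to write the argument, offered as a reconstruction with assumed symbols N for the number of chests and k for the position of the treasure, is as a telescoping sum:

```latex
\begin{aligned}
\text{total}
  &= \underbrace{\sum_{i=1}^{k-1} \log_2 \frac{N-i+1}{N-i}}_{\text{empty chests } 1 \ldots k-1}
   + \underbrace{\log_2 (N-k+1)}_{\text{treasure in chest } k} \\
  &= \log_2 \frac{N}{N-k+1} + \log_2 (N-k+1) \\
  &= \log_2 N
\end{aligned}
```

The result does not depend on k. For N = 8 it equals 3 bits wherever the treasure is hidden.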

Hopefully, this convinces you that Shannon Information Content is a sensible, objective way to quantify uncertainty in terms of probabilities.

But there is one problem.

One "unit" of information is an obscure quantity that can either be a lot or very little. It’s the equivalent of saying that I need one "unit" of milk, which is arguably less descriptive than one "carton" or one "liter" of milk.

Without an interpretation of its unit, the number stays just a number.

How useful is one unit of information?

Claude Shannon proposes that we measure 1 bit as the amount of information needed to reduce our uncertainty by half.

Perhaps it’s a bit easier to visualize that if we revisit the halfway point.

Each bit of information halves the uncertainty. Treasure (Closed, Filled), and Delete icons are created by Smashicons and hqrloveq from Flaticon, edited with permission by the author.

Notice that the moment we unlock the 4th chest, we accumulate exactly 1 bit of information. With this one bit of information, our probability space reduces by half – from 8 to 4 chests.

With another bit, it will reduce our uncertainty by half again, leaving us with only 2 possible chests. As you can already imagine, having 3 bits would simply reduce it down to a single, definite choice. This is also proof that our system only has 3 bits of information in total.

Another way to grasp the idea is by relating one bit of information to one yes or no question that eliminates half of our choices. 3 bits of information means we need to ask 3 questions to understand everything about the system. For our example, we can ask…

Question 1: Is it in any of the chests above?
Question 2: Is it on the right?
Question 3: Is it on the left?

Each bit of information halves the uncertainty. Treasure (Closed, Filled), Delete, and Checked icons are created by Smashicons and hqrloveq from Flaticon, edited with permission by the author.

The idea works both ways. If the system we are dealing with has 10 bits of information, it means there are 2¹⁰, or 1024, equally likely possibilities to choose from. This line of thinking gives us a feel for the odds and uncertainty to expect from any type of problem.
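
The link between bits and yes/no questions can be sketched as a halving search. The function `locate` and the zero-based chest index are illustrative assumptions. The question asked here ("is it in the lower half?") is a generic stand-in for the picture-based questions above.

```python
def locate(n_chests: int, treasure_index: int) -> int:
    """Count the yes/no questions needed to pin down the treasure by halving."""
    low, high = 0, n_chests          # candidates are indices low .. high - 1
    questions = 0
    while high - low > 1:
        mid = (low + high) // 2
        questions += 1               # ask: "is the treasure in the lower half?"
        if treasure_index < mid:
            high = mid               # yes: discard the upper half
        else:
            low = mid                # no: discard the lower half
    assert low == treasure_index     # a single candidate is left
    return questions

print(locate(8, 5))       # questions needed for 8 chests
print(locate(1024, 700))  # questions needed for 1024 possibilities
```

Each question is worth one bit, so 8 chests take 3 questions and 1024 possibilities take 10, regardless of where the treasure sits.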

Not only did we give meaning to the word "information", but we also gave it a number to express its magnitude, and that completely changed the way we understand and work with information.

Practical Application of Shannon’s Information Content

Shannon’s Information Content extends far beyond just trivial guessing games. One of its key contributions is in the field of communication.

For any communication to take place, we always have a source, channel, and receiver.

A speech requires a voice (source) to travel through air (channel) before reaching someone else’s ears (receiver).
A phone call requires a phone (source) that transmits signals through the telephone line (channel) into another phone (receiver).

In the latter example, we also need an encoder to convert our voices into digital streams of binary 0’s and 1’s to ease the process of transmission. On the receiving end, we use a decoder to convert the signals back into voices.

A perfectly efficient communication system. Telephone, encoder, and transmission tower by Freepik, Flat Icons, Mehwish from Flaticon, edited with permission by the author.

From a theoretical standpoint, this setup works. But in reality, channels are often noisy. Your neighbor playing their heavy metal music on full blast? Noise. Magnetic interference on the copper wire during transmission? Noise. There is always noise that degrades our message. Instead of hearing "ten", we may hear "tin". What we send isn’t necessarily what will be received.

To achieve reliable communication over an unreliable channel, we have to add redundancy to drown out the noise. For instance, we could program our encoder to repeat our message 3 times, "ten – ten – ten". This way, even if the message arrives corrupted as "ten – tin – ten", the decoder can still recover the original message by inferring "ten" from the majority.
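
The repeat-three-times idea is known as a repetition code. A minimal sketch follows, with hypothetical `encode` and `decode` helpers that work on lists of symbols. It assumes symbols are only altered, never dropped or inserted.

```python
from collections import Counter

def encode(message: list[str], repeats: int = 3) -> list[str]:
    """Repeat every symbol `repeats` times before sending."""
    return [symbol for symbol in message for _ in range(repeats)]

def decode(received: list[str], repeats: int = 3) -> list[str]:
    """Recover each symbol by majority vote within its block of repeats."""
    decoded = []
    for start in range(0, len(received), repeats):
        block = received[start:start + repeats]
        decoded.append(Counter(block).most_common(1)[0][0])
    return decoded

sent = encode(["ten"])              # the word is sent three times
corrupted = ["ten", "tin", "ten"]   # noise changes the middle copy
print(decode(corrupted))            # majority vote restores the original word
```

The vote only succeeds while fewer than half of the copies in a block are corrupted.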

A realistic communication system where noise flips the binary bits hence distorting the message. Telephone, encoder, and transmission tower by Freepik, Flat Icons, Mehwish from Flaticon, edited with permission by the author.

But like whack-a-mole, we fix one problem, another one pops up. Adding surplus information to re-create the shorter original message slows our transmission speed. If we had to repeat ourselves 3 times, that means we are transmitting 3 times slower.

Therefore, balancing the trade-off between accuracy and the speed of transmission is necessary, but since accuracy takes precedence for most applications, the question becomes – how much speed do we need to sacrifice?

With a little bit of probability theory, we can calculate the probability of error for our system and plot it against the transmission rate.

Trade-off curve. As we add redundancy to reduce the error rate, the rate of transmission trends down. Image by the author.

Repeat this for the same encoding algorithm with varying amounts of redundancy and we get ourselves a nice trade-off curve. But sadly, we won’t be able to do this for algorithms that we haven’t invented yet.
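
For a repetition code the points on such a curve can be computed directly. The sketch below assumes a binary symmetric channel in which each transmitted bit is flipped independently with probability `flip_prob`. The value 0.1 is an arbitrary example and not a figure from the article.

```python
from math import comb

def repetition_error(flip_prob: float, repeats: int) -> float:
    """Probability that a majority vote over an odd number of repeats decodes a bit wrongly."""
    majority = repeats // 2 + 1
    # The vote fails when at least `majority` of the copies are flipped.
    return sum(
        comb(repeats, k) * flip_prob**k * (1 - flip_prob) ** (repeats - k)
        for k in range(majority, repeats + 1)
    )

flip_prob = 0.1  # assumed channel noise level, for illustration only
for repeats in (1, 3, 5, 7):
    rate = 1 / repeats  # useful bits per transmitted bit
    error = repetition_error(flip_prob, repeats)
    print(f"repeats = {repeats}: rate = {rate:.3f}, error = {error:.5f}")
```

Each extra pair of repeats lowers the error probability and lowers the rate, which is the trade-off the curve shows.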

Without the trade-off curve, it becomes very hard to decide if we should use a different channel, or maybe we should try a bit harder to come up with a better encoding algorithm that will get us the desired error rate. Without a marked target, the archer is bound to miss his shots.

Claude Shannon, with his framework for quantifying information, was able to calculate the maximum rate of transmission for any given channel at which the error rate can still be made arbitrarily low. In other words, given the inherent noise that comes with the channel, how "good" can the channel be at transmitting our messages, provided that we have the most efficient encoding algorithm?

He termed this upper limit the Channel Capacity (a.k.a Shannon’s Limit) in his Noisy-Channel Coding Theorem. I hope the formula rings a bell.

In a binary symmetric channel, the "0" and "1" have a p(e) rate of error, which we can use to calculate the information content, or entropy, to be precise. Image by the author.
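
The formula image is missing here. The standard capacity of a binary symmetric channel, which is presumably what the figure showed, can be written as follows, with C and H_2 as assumed symbol names and p(e) as in the caption:

```latex
\begin{aligned}
C &= 1 - H_2\big(p(e)\big) \\
  &= 1 - \left[\, p(e) \log_2 \frac{1}{p(e)} + \big(1 - p(e)\big) \log_2 \frac{1}{1 - p(e)} \,\right]
\end{aligned}
```

Each term inside the brackets is a probability multiplied by a Shannon information content, which is why the formula looks familiar.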

The formula reads intuitively – if we take the full capacity of a noiseless binary channel, 1 bit per transmitted bit, and subtract the uncertainty that noise introduces, namely the entropy of whether each bit arrives correct or flipped, then what remains is the usable capacity of the channel. This is the best transmission rate that the channel is capable of.
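
As a numeric illustration, the capacity of a binary symmetric channel is one minus the binary entropy of its error rate. The function names below are made up for this sketch, and the error rates are arbitrary examples.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a yes/no event that happens with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def bsc_capacity(p_error: float) -> float:
    """Capacity, in bits per transmitted bit, of a binary symmetric channel."""
    return 1.0 - binary_entropy(p_error)

for p_error in (0.0, 0.1, 0.5):
    print(f"p(e) = {p_error}: capacity = {bsc_capacity(p_error):.3f}")
```

A noiseless channel keeps its full 1 bit, while a channel that flips bits half of the time carries nothing at all.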

While the Channel Capacity doesn’t help us devise the optimal encoding algorithm, it showed the world what was possible. Reminiscent of the 4-minute mile, once the theoretical limit was known, it paved the way for huge advancements in the way we communicate.

Concluding Remarks

This one application is but the tip of the iceberg. Shannon’s Information Theory introduced information content which some may refer to as the fundamental particles of communications. While a physical object has atoms, information has bits.

It gave the world a way to take any form of communication – from birds chirping to drawings to Morse code – and universally compare them.

Shannon’s brilliant work has since revolutionized the way we work with information, serving as a key concept in data compression, cryptography, and even data science, amongst mountains of other great things.

As data scientists, we are always in the habit of wanting more data or more "information". But I suppose Claude Shannon just showed us that while having information is great, understanding information can be even more impactful.


Data Science, Editor’s Picks, Entropy, Information Theory, Thoughts And Theory


URL: https://towardsdatascience.com/shannon-information-theory-discovering-particles-of-information-ab2c136c6a25/