VOOZH about

URL: https://towardsdatascience.com/what-does-entropy-measure-an-intuitive-explanation-a7f7e5d16421/

⇱ What does Entropy Measure? An Intuitive Explanation | Towards Data Science


What does Entropy Measure? An Intuitive Explanation

Entropy can be thought of as the probability of seeing certain patterns in data. Here's how it works.

11 min read

Entropy may seem abstract, but it has an intuitive side: as the probability of seeing certain patterns in data. Here’s how it works.

👁 Background Credit: Joe Maldonado @unsplash
Background Credit: Joe Maldonado @unsplash

In data science, there are many concepts linked to the notion of entropy. The most basic one is Shannon’s information entropy, defined for any distribution, P(x), through the formula:

👁 Image

Where the sum is over all the possible categories in C.

There are other related concepts that have similarly looking formulae:

Despite the ubiquity of entropy-like formulae, there are rarely discussions on the intuitions behind the formula: Why is the logarithm involved? Why are we multiplying P(x) and log P(x)? While many articles mention terms like "information", "expected surprise", the intuitions behind them are missing.

It turns out, just like probabilities, entropy can be understood through a counting exercise, and it can be linked to a sort of log-likelihood for distributions. Furthermore, this counting can be linked to the literal number of bytes in a computer. These interpretations will enable us to demystify the many facts about entropy. Curious? Let’s get started!

Counting Entropy

👁 Probability's counting definition is what makes it so intuitive (Photo by Ibrahim Rifath on Unsplash)
Probability’s counting definition is what makes it so intuitive (Photo by Ibrahim Rifath on Unsplash)

Probability can be defined operationally: When we say a coin has a 50% chance to land a head, it means that if we were to flip the coin a million times, the number of heads will get pretty close to half a million. This fraction will get closer to the 50% probability as the number of trials increase. This definition is what makes probabilities so intuitive.

Is there a similar interpretation for entropy? There is, although the counting is a little bit tricker: it will require some basic combinatorics.

How many ways are there to arrange N different balls? There are N choices for the first one, N − 1 for the second one… etc. The answer is N!, or the factorial symbol:

👁 Image

Just like in the definition of probabilities, we’ll be working with very large numbers. So it is helpful to approximate this object through Sterling’s approximation:

👁 Image

Where log indicates the natural logarithm; analogous formulae also exist if we use alternative bases like log₂ and log₁₀ (this will determine the units in which we measure entropy). The big-O notation indicates the validity of the approximation as N gets large. The term N log N will be the origin of p log p in entropy’s definition.

We are now ready to derive what entropy is counting. Imagine there is a large number of distinguishable objects, or distinguishable data points. These N data points are grouped into say c categories, like in the figure below

👁 Image

What is the total number of ways this can be done? Keeping in mind that we don’t care about the ordering of our data in any category, the answer is captured by the classic multinomial coefficients:

👁 Image

Where we have used the Ω symbol to denote the number of configurations.

Just like the case for probability, we are only interested in the large N behaviors. When dealing with such large numbers, it is helpful to take the logarithm, so we can use Sterling’s approximation to make things more manageable:

👁 Image

The formula can be simplified by utilizing the fact that the sum of all the nᵢ equals N,

👁 Image

if we substitute nᵢ/NP(i), we get exactly the entropy formula. Alternatively, we can write (for large N):

👁 Image

So we arrived at an operational definition of entropy:

Entropy counts the # of ways of categorizing large amounts of data that resemble a given probability distribution (in logarithmic units and per number of data point)

This counting exercise lies at the heart of information theory, which we’ll turn to next.

Entropy as Information

So how does our concept of entropy relate to literal bits of 0s and 1s in a computer?

👁 The concept of entropy can linked to information when we apply it to patterns of 1s and 0s (credit: Gerd Altmann)
The concept of entropy can linked to information when we apply it to patterns of 1s and 0s (credit: Gerd Altmann)

Imagine a binary sequence of some fix length N. Intuitively we know that it contains N bits of information: because it literally takes N bits to store the sequence in a hard-drive or memory.

But what if the sequence has some interesting patterns like the following?

  • 000000000000000000000000000
  • 010101010101010101010101010
  • 000000010000000000000000000

In these cases, the binary sequence representation would be very inefficient. We intuitively know that there are more efficient ways to store these sequence: we can specify the patterns as opposed to all the literal bits, and the amount of meaningful information in these sequences should be smaller.

So if we ignore the subtle patterns of digit repetitions, and just look at the basic statistical properties of the digits (proportions of 0s and 1s), how much better can we do in terms of storing those sequences?

This is where our entropy counting formula can help us: it can count the total number of sequence given some fixed proportions of 0s and 1s.

In the case where the proportions of 0s and 1s are 50/50, the total number of possibilities is (in the large N limit):

👁 Image

We see that this just approximately yields the number of all the possible binary sequences 2ᴺ. So the number of bits required to store the sequence is still N. This isn’t surprising, as we know that random sequences should be impossible to compress: it has the maximum N bits of information.

But what if the proportions are no longer 50/50? We should expect some potential savings. In this case, the total number of bits required to store one sequence would be:

👁 Image

Let’s sanity check the case when the number of 0s is much smaller than the number of 1s, say nN. In this case the _P_₁ term can be ignored, and the number of bits is given by:

👁 Image

So the amount of information is proportional to n instead of N. This is because we now only need to store the position of each 0 instead of the whole sequence.

This illustrate the power of entropy as it relates to physical bits and bytes in a computer. In summary,

The information entropy specifies the expected number of bit per length that is required to store a sequence generated by a given probability distribution.

In other words, entropy is a sort of optimal compression ratio for a fixed proportion of characters in a sequence. This is the way that entropy is linked to information.

Beyond thinking of sequence as our subject of interest, we can turn our attention to the distributions themselves. This view point allows us to interpret entropy as a sort of probability (or log-likelihood).

Entropy as Log-Likelihood

Entropy counts the number of possibilities. We want to convert this to a sort of probability. To do this we simply need to normalize the counts.

What is the total number of ways to categorize N data points into c categories? The answer is simple, because each data point has c choices:

👁 Image

We can now divide the entropy’s count by the total to obtain a probability (substituting nᵢ/NP(i)):

👁 Image

This way, entropy becomes the probability (asymptotic because of large N) of observing a particular distribution from randomly categorizing data points:

Entropy can be viewed as the log-likelihood of observing a given distribution (per data point)

There is a hidden assumption in our discussion though, as we are treating every configuration as equally likely in our calculations. What happens if some categories are more favored over others?

We can consider some reference distribution Q(x). If each data point has a chance Q(x) to be in a particular category x, the probability of observing _n_₁ in category 1, _n_₂ in category 2 and so on is given by the multinomial probability:

👁 Image

Once again, we can go through Sterling’s approximation. The calculation is very similar to the previous ones, except we have a extra Q(i) in the end

👁 Image

Substituting nᵢ/NP(i), the term inside the exponential becomes the Kullback–Leibler divergence. So our equation can be summarized as

👁 Image

Where we have used the common notation of KL-divergence inside the exponential. The KL-divergence is a generalization of Shannon’s information entropy, and our equations make our interpretation even more precise:

The Kullback-Leibler divergence of P on Q is the negative log-likelihood (per data point) of observing P when sampling data according to Q

Once again, all this is assuming N is very large.

A few facts about KL-divergence now become obvious:

  1. KL-divergence is always non-negative: this is because a probability can never be larger than 1.
  2. KL-divergence can be infinite: this happens when two distributions have no overlap, and so the counting exercise yields 0 = exp[–∞].
  3. KL-divergence is zero if and only if P = Q: when we sample data according to Q, we expect the resulting distributions to look like Q – this expectation is exact at large N.

Armed with this new understanding, we are now ready to reinterpret facts about various entropic concepts in data science!

An Entropic Sampler

Below we’ll discuss the intuitions behind some common entropy-like variables in data science. We’ll once again remind the reader that the large N limit is implicitly assumed.

Cross Entropy

This is useful for training categorical variables. It is defined as

👁 Image

Note that we’ve rewritten the definition as a sum of the KL-divergence and Shannon’s information entropy. This may look a little unfamiliar, since when we train machine learning model, we only compute an estimate of it through our samples (say S)

👁 Image

Using our counting intuition, we conclude that

Minimizing cross-entropy is equivalent to maximizing the log-likelihood of observing the same statistics as those from our sampled data, if we sample our data from a distribution Q that is being trained

This puts the cross-entropy loss on a similar conceptual ground as the L2 loss in regressions: they are both some sort of log-likelihood functions.

Mutual Information

Mutual information can be thought of as a generalized sort of correlation between two variables. Denoted by I, it is defined through the KL-divergence

👁 Image

Where in the KL-divergence computation, we are comparing a distribution of two variables, against the distribution of considering each variable separately.

Our counting intuitions give us a very nice interpretation:

Mutual information is the negative log-likelihood (per data point) of obtaining a given distribution on two variables, when we sample the two variables independently based on their marginalized distributions

This explains why mutual information is a powerful tool that can capture nonlinear correlations between variables.

The Inevitable Increase In Entropy?

Finally, we are ready to address one of the most well-known facts about entropy: the laws of thermodynamics and the inevitable increase in entropy.

It is important to keep in mind though that there are two concepts of entropy at play here:

  1. Shannon’s Information Entropy in Data Science
  2. Entropy in Thermal Physics

The increase in Entropy is a physical law that only applies in the second case. However, the entropy in physics can be viewed as a special case of Shannon’s entropy when applied to physical systems, so there is a connection there.

What this means in terms of the counting exercise then, is that the number of possibilities will inevitably increase. This makes intuitive sense, because when a physical (chaotic) system is not constrained, it should eventually sample all the possibilities. It’s a bit similar to the famous Murphy’s law that stats "anything that can go wrong will go wrong".

From a data science perspective, if we believe that our data are results of some dynamical systems, then it might makes sense to maximize entropy: because if we believe all the variables have been taken into account, there is no reason to think that our data would not explore all the possibilities. In other words, we want to consider all the possibilities/combinations – even ones that are not present in our data. This is perhaps what grants entropic concepts their super powers in data science, that

By counting all the possibilities, entropy is a conservative measure of our ignorances

This viewpoint was explored in another one of my articles on entropy.

Conclusion

By interpreting the formula for entropy as counting possibilities, we are able to understand entropy’s role in information theory and think of entropy as a sort of probability. This interpretation is ultimately what makes various entropic concepts meaningful and useful.

Please share your thoughts and feedback if you have any, happy reading! 👋

If you enjoy this article, you might be interested in some of my other related articles:

Entropy Is Not Disorder: A Physicist’s Perspective

A Physicist’s View of Machine Learning: The Thermodynamics of Machine Learning

Why We Don’t Live in a Simulation


Written By

Tim Lou, PhD

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles