All the Way from Information Theory to Log Loss in Machine Learning

Entropy, cross-entropy, log loss, and the intuition behind them.

Sep 23, 2020

In 1948, Claude Shannon introduced information theory in his 55-page paper "A Mathematical Theory of Communication". Information theory is where we start the discussion that will lead us to log loss, a widely used cost function in machine learning and deep learning models.

The goal of information theory is to efficiently deliver messages from a sender to a receiver. In the digital age, information is represented by bits, 0 and 1. According to Shannon, sending one bit of information to the recipient reduces the recipient's uncertainty by a factor of two. Thus, the amount of information received is tied to how much the uncertainty is reduced.

Consider the case of flipping a fair coin. The probability of heads being the side facing up, P(Heads), is 0.5. After you (the recipient) are told that heads is up, P(Heads) becomes 1. Thus, 1 bit of information is sent to you and the uncertainty is reduced by a factor of two. The factor by which the uncertainty is reduced is the inverse of the probability of the event (1 / 0.5 = 2), and the amount of information we get is determined by that factor.

The number of bits of information can easily be calculated by taking the logarithm (base 2) of the reduction in uncertainty. For the coin flip, log2(2) = 1 bit.

(Image by author)
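The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original article; the function name `information_bits` is a hypothetical choice.

```python
import math

def information_bits(probability: float) -> float:
    """Bits of information received when an event with this probability is observed."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # The reduction in uncertainty is the inverse of the probability,
    # e.g. 1 / 0.5 = 2 for a fair coin.
    uncertainty_reduction = 1.0 / probability
    return math.log2(uncertainty_reduction)

print(information_bits(0.5))   # fair coin: 1.0 bit
print(information_bits(0.25))  # one of four equally likely outcomes: 2.0 bits
```

The less likely an event is, the more bits we receive when we learn that it happened.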

Let’s go over a slightly more complicated case. Two of your friends go to a store to buy a particular t-shirt and there are 4 different colors available.

Your friend Julia is a little indecisive and she tells you that she can pick any color. Your other friend John tells you that he likes the color blue and he is very likely to buy a blue t-shirt.

You definitely have more uncertainty about Julia's decision than about John's. Entropy is a measure that quantifies this uncertainty. To be more precise, it is the average amount of information received from samples drawn from a probability distribution.

The following table shows the probability distributions of the t-shirt colors that Julia and John might buy.

The probability distributions of Julia and John buying a t-shirt (image by author)

Let’s start with Julia. If Julia picks blue, the uncertainty is reduced by a factor of 4 (1 / 0.25). Taking log base 2 gives 2 bits (the unit of entropy is the bit). Thus, in the case of blue, the amount of information we get is 2 bits. Since entropy is the average amount of information over samples, we repeat the same calculation for the other colors and weight each result by the probability of that color. They all result in the same number of bits since the probabilities are the same. For Julia, the entropy is calculated as follows:

(image by author)
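Written out as a worked equation, Julia's calculation looks as follows. This is a reconstruction from the numbers given in the text, since the original figure is not reproduced here; the symbol H is a notational choice.

```latex
H(\text{Julia}) = \sum_{i=1}^{4} 0.25 \cdot \log_2\!\left(\frac{1}{0.25}\right)
                = 4 \cdot 0.25 \cdot 2
                = 2 \text{ bits}
```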

For John, the steps are the same but the result is different.

(image by author)

The entropy is higher for Julia, so we have more uncertainty about her decision, which is what we expected at the beginning.

We have calculated the entropy. It is time to introduce the formula:

The formula of Entropy (image by author)

Note: We did not include the minus sign in our calculations because we took the logarithm of the inverse probability, and log(1 / p) is equal to -log(p).
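For reference, the standard Shannon entropy formula can be written in LaTeX both with the minus sign and with the inverse probability used in the calculations above. The symbols X and p(x) are notational assumptions and may differ from those in the original figure.

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x) = \sum_{x} p(x) \log_2 \frac{1}{p(x)}
```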

We have two events, each with 4 possible outcomes. The first event is Julia buying a t-shirt and the second event is John buying a t-shirt. The entropies are 2 bits and 1.19 bits, respectively. In other words, on average, we receive 2 bits of information about the first event and 1.19 bits of information about the second one.
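The entropy calculation can be sketched in Python as shown below. Julia's uniform distribution follows from the text. John's distribution is only available in the original table image, so `john_hypothetical` is an assumed distribution, chosen because its entropy comes out close to the 1.19 bits quoted above; it is not necessarily the author's actual table.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits: the probability-weighted average of log2(1 / p)."""
    # Outcomes with zero probability contribute nothing and are skipped.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# Julia picks each of the four colors with equal probability (from the text).
julia = [0.25, 0.25, 0.25, 0.25]

# ASSUMPTION: a hypothetical distribution for John, with blue listed first.
john_hypothetical = [0.75, 0.10, 0.10, 0.05]

print(entropy_bits(julia))                        # 2.0
print(round(entropy_bits(john_hypothetical), 2))  # 1.19
```

The more the probability mass is concentrated on one outcome, the lower the entropy becomes.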

We are building our way towards the concepts used in machine learning. The next topic is cross-entropy, which is the average message length.

The color that your friend picks is transmitted to you digitally (i.e. with bits). The following table represents two different encodings used to transfer information about the choice of John.

(image by author)

In case 1, two bits are used for every color. Thus, the average message length is 2 bits.

(image by author)

This encoding is acceptable for Julia but not for John. The entropy of the probability distribution of John’s choices is 1.19 bits, so using 2 bits on average to send information about his choice is not optimal.

In case 2, the cross-entropy turns out to be 1.3 bits. It is still more than 1.19 bits but clearly better than case 1.

(image by author)
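The average message length is the sum of each code length weighted by the probability of its color. The sketch below illustrates this in Python. The fixed 2-bit code matches case 1 in the text. Both `john_hypothetical` and the variable-length code are assumptions, because the original encoding table is an image that is not reproduced here, so the second result will not match the 1.3 bits quoted above.

```python
def average_message_length(probabilities, code_lengths):
    """Expected number of bits per message for a given code."""
    return sum(p * n for p, n in zip(probabilities, code_lengths))

# ASSUMPTION: a hypothetical distribution for John, with blue listed first.
john_hypothetical = [0.75, 0.10, 0.10, 0.05]

# Case 1 from the text: two bits for every color.
fixed_length = [2, 2, 2, 2]

# ASSUMPTION: a hypothetical variable-length code (e.g. 0, 10, 110, 111)
# that spends fewer bits on the most likely color.
variable_length = [1, 2, 3, 3]

print(round(average_message_length(john_hypothetical, fixed_length), 2))     # 2.0
print(round(average_message_length(john_hypothetical, variable_length), 2))  # 1.4
```

Giving the shortest code to the most likely color is what brings the average below 2 bits.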

But where does the word "cross" come from? When calculating the cross-entropy, we are actually comparing two different probability distributions. One is the actual probability distribution of the variable and the other is the predicted distribution implied by the number of bits chosen for each outcome.

(image by author)

The cross-entropy can be expressed as a function of the true and predicted distributions as follows:

The formula of cross-entropy (image by author)
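For reference, the standard cross-entropy formula in LaTeX notation is given below, where p is the true distribution and q is the predicted one. The symbols are notational assumptions and may differ from those in the original figure.

```latex
H(p, q) = -\sum_{x} p(x) \log_2 q(x)
```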

If you take a look at the calculations we have done to find the cross-entropy, you will notice that the steps overlap with this formula.
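To make the connection between code lengths and cross-entropy concrete, here is a minimal Python sketch. It relies on the standard fact that a code of n bits corresponds to a predicted probability of 2^-n. The distribution `john_hypothetical` and the variable-length code are assumptions for illustration, not the values from the original tables.

```python
import math

def cross_entropy_bits(true_probs, predicted_probs):
    """Cross-entropy in bits between a true and a predicted distribution."""
    return sum(
        p * math.log2(1.0 / q)
        for p, q in zip(true_probs, predicted_probs)
        if p > 0
    )

def implied_distribution(code_lengths):
    """Predicted probabilities implied by code lengths: n bits means 2^-n."""
    return [2.0 ** -n for n in code_lengths]

# ASSUMPTION: a hypothetical true distribution for John, with blue listed first.
john_hypothetical = [0.75, 0.10, 0.10, 0.05]

fixed = implied_distribution([2, 2, 2, 2])     # 0.25 for every color
variable = implied_distribution([1, 2, 3, 3])  # hypothetical code lengths

print(round(cross_entropy_bits(john_hypothetical, fixed), 2))              # 2.0
print(round(cross_entropy_bits(john_hypothetical, variable), 2))           # 1.4
print(round(cross_entropy_bits(john_hypothetical, john_hypothetical), 2))  # 1.19
```

When the predicted distribution equals the true one, the cross-entropy equals the entropy, which is the lowest value it can take.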

We can now start our discussion on how cross-entropy is used in the field of machine learning. Cross-entropy loss (i.e. log loss) is a widely used cost function for machine learning and deep learning models.

Cross-entropy quantifies the comparison of two probability distributions. In supervised learning tasks, we have a target variable that we are trying to predict. The actual distribution of the target variable and our predictions are compared using the cross-entropy. The result is the cross-entropy loss, also known as log loss.

There is a slight difference between the cross-entropy and the cross-entropy loss. When calculating the loss, the natural logarithm is usually used instead of log base 2, so the result is measured in nats rather than bits.

The cross-entropy loss:

Cross-entropy loss (image by author)
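For reference, the cross-entropy loss for a single observation can be written in LaTeX as follows. Here C is the number of classes, y_i is the true probability of class i, and \hat{y}_i is the predicted probability; these symbols are notational assumptions and may differ from those in the original figure.

```latex
L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \ln \hat{y}_i
```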

Let’s do an example. We have a classification problem with 4 classes. The prediction of our model for a particular observation is shown below:

(image by author)

Since we know the true class, the true probability distribution is 100% for that class and zero for all others. According to our model, this observation belongs to class 1 with 80% probability. The cross-entropy loss for this particular observation is calculated as below:

(image by author)

Since the true probability is zero for all classes except for the actual class, only the predicted probability of the actual class contributes to the cross-entropy loss.
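The single-observation calculation can be sketched in Python as follows. The 80% probability for class 1 comes from the text. The sketch assumes that class 1 is the true class, and the remaining three predicted probabilities are hypothetical, since the original table is an image; they do not affect the result.

```python
import math

def cross_entropy_loss(y_true, y_pred):
    """Cross-entropy loss (natural log) for one observation."""
    # Classes with a true probability of zero contribute nothing and are skipped.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# ASSUMPTION: class 1 (the first entry) is the true class.
y_true = [1, 0, 0, 0]

# 0.80 is from the text; the other three values are hypothetical.
y_pred = [0.80, 0.10, 0.05, 0.05]

loss = cross_entropy_loss(y_true, y_pred)
print(round(loss, 4))  # 0.2231, i.e. -ln(0.8)
```

Changing the hypothetical probabilities of the other classes leaves the loss unchanged, which is exactly the point made above.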

Please keep in mind that this is the loss on a particular observation. The loss on the training or test set is the average of the cross-entropies of all observations in that set.
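Averaging over a set can be sketched as below. The function takes, for each observation, the probability that the model assigned to the true class; the three example values are hypothetical and do not come from the article.

```python
import math

def mean_log_loss(true_class_probabilities):
    """Average log loss, given the predicted probability of the true class per observation."""
    losses = [-math.log(p) for p in true_class_probabilities]
    return sum(losses) / len(losses)

# ASSUMPTION: hypothetical probabilities assigned to the true class
# for three observations.
print(round(mean_log_loss([0.8, 0.6, 0.9]), 2))  # roughly 0.28
```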

Why Log Loss?

You may wonder why the log loss is used instead of classification accuracy as a cost function.

The following table shows the predictions of two different models on a relatively small set that consists of 5 observations.

(image by author)

Both models correctly classified 4 observations out of 5. Thus, in terms of classification accuracy, these models have the same performance. However, the probabilities reveal that Model 1 is more confident in its predictions. Thus, it is likely to perform better in general.

Log loss (i.e. cross-entropy loss) provides a more robust and accurate evaluation of classification models.
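The difference between accuracy and log loss can be illustrated with a small Python sketch. The labels and predicted probabilities below are hypothetical and assume a binary classification task; they are not the values from the original table. They are chosen so that both models classify 4 out of 5 observations correctly while one is more confident.

```python
import math

def accuracy(labels, probs, threshold=0.5):
    """Fraction of observations classified correctly at the given threshold."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    return sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)

def binary_log_loss(labels, probs):
    """Average log loss, where probs holds the predicted probability of class 1."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(labels, probs)
    ]
    return sum(losses) / len(losses)

# ASSUMPTION: hypothetical labels and predicted probabilities of class 1.
labels = [1, 0, 1, 1, 0]
model_1 = [0.95, 0.10, 0.90, 0.40, 0.05]  # confident when it is right
model_2 = [0.60, 0.45, 0.55, 0.40, 0.45]  # barely on the right side of 0.5

print(accuracy(labels, model_1), accuracy(labels, model_2))  # 0.8 0.8
print(round(binary_log_loss(labels, model_1), 2))            # 0.25
print(round(binary_log_loss(labels, model_2), 2))            # 0.64
```

Accuracy cannot tell these two models apart, while log loss rewards the model that assigns higher probability to the correct class.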

Thank you for reading. Please let me know if you have any feedback.

Written By

Soner Yıldırım

URL: https://towardsdatascience.com/all-the-way-from-information-theory-to-log-loss-in-machine-learning-c78488dade15/