Introduction to Probabilistic Classification: A Machine Learning Perspective

Guide to go from predicting labels to predicting probabilities

Dec 11, 2021

Source: Naser Tamimi on Unsplash.

You are capable of training and evaluating classification models, with both linear and non-linear model structures. Well done! Now, you want class probabilities instead of class labels. Look no further. This is the article you are looking for. It walks you through the different evaluation metrics, their pros and cons, and optimal model training for multiple ML models.

Classifying cats and dogs

Imagine creating a model with the sole purpose of classifying cats and dogs. The classification model will not be perfect and will therefore wrongly classify certain observations. Some cats will be classified as dogs and vice versa. That’s life. In this example, the model classifies 100 cats and dogs. The confusion matrix is a commonly used visualization tool to show prediction accuracy, and Figure 1 shows the confusion matrix for this example.

Figure 1: Confusion matrix for classification of 100 cats and dogs. Source: Author.

Let’s focus on the 12 observations where the model predicts a cat while in reality it is a dog. If the model predicts a 51% probability of cat and it turns out to be a dog, that is certainly possible. But what if the model predicts a 95% probability of cat and it turns out to be a dog? That seems highly unlikely.

Figure 2: Predicted probability of cat and the classification threshold. Source: Author.

Classifiers use a predicted probability and a threshold to classify the observations. Figure 2 visualizes the classification for a threshold of 50%. A threshold of 50% seems intuitive, but nothing prevents you from adjusting it. So, for classification, the only thing that ultimately matters is the ordering of the observations. Changing the objective to predicting probabilities instead of labels requires a different approach. For this, we enter the field of probabilistic classification.
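As a minimal sketch of the thresholding step, assume a handful of made-up predicted probabilities of "cat" (the values and the helper name `classify` are hypothetical, not taken from the example behind Figure 2). Moving the threshold changes the labels, but never the ranking of the observations:

```python
import numpy as np

# Hypothetical predicted probabilities of "cat" for six animals.
p_cat = np.array([0.05, 0.30, 0.49, 0.51, 0.80, 0.95])


def classify(probabilities, threshold=0.5):
    """Label an observation 1 (cat) when its probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)


labels_50 = classify(p_cat, threshold=0.5)  # the intuitive 50% threshold
labels_90 = classify(p_cat, threshold=0.9)  # a stricter, equally valid threshold

print(labels_50)  # [0 0 0 1 1 1]
print(labels_90)  # [0 0 0 0 0 1]
```

Both label vectors come from the same probabilities, which is why a label-based metric cannot tell you how good the probabilities themselves are.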

Evaluation metric 1: Logloss

Let us generalize from cats and dogs to class labels of 0 and 1. Class probabilities are real numbers between 0 and 1. The model objective is to match predicted probabilities with class labels, i.e. to maximize the likelihood, given in Eq. 1, of observing the class labels given the predicted probabilities.

Equation 1: Likelihood for class labels y and predicted probabilities based on features x.
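For reference, the standard Bernoulli likelihood that fits this caption can be written as follows. The notation is an assumption on my side: $\hat{p}_i$ denotes the predicted probability of class 1 for observation $i$ with features $x_i$, and $N$ is the number of observations.

```latex
\mathcal{L} = \prod_{i=1}^{N} \hat{p}_i^{\,y_i} \left(1 - \hat{p}_i\right)^{1 - y_i},
\qquad \hat{p}_i = P\left(y_i = 1 \mid x_i\right)
```

Each factor equals $\hat{p}_i$ when $y_i = 1$ and $1 - \hat{p}_i$ when $y_i = 0$, so the likelihood is large when the model assigns high probability to the labels that were actually observed.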

A major drawback of the likelihood is that as the number of observations grows, the product of the individual probabilities becomes increasingly small. So, with enough data, the likelihood will underflow the floating-point precision of any computer. Next to that, a product of many terms is cumbersome to differentiate. That’s the reason the logarithm of the likelihood is preferred, commonly referred to as the loglikelihood: it turns the product into a sum. A logarithm is a monotonically increasing function of its argument. Therefore, maximization of the log of a function is equivalent to maximization of the function itself.

Equation 2: Logloss for class labels y and predicted probabilities based on features x.
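Written out in its standard form, the Logloss is the negative average of the log of the Bernoulli likelihood. As an assumption on notation, $\hat{p}_i$ is the predicted probability of class 1 for observation $i$ and $N$ is the number of observations.

```latex
\text{Logloss} = -\frac{1}{N} \sum_{i=1}^{N}
\left[ y_i \log \hat{p}_i + \left(1 - y_i\right) \log\left(1 - \hat{p}_i\right) \right]
```

Only one of the two terms is active per observation: $\log \hat{p}_i$ when $y_i = 1$ and $\log(1 - \hat{p}_i)$ when $y_i = 0$.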

Nonetheless, the loglikelihood still scales with the number of observations, so the average loglikelihood is a better metric, as it is comparable across datasets of different sizes. In practice, most people minimize the negative average loglikelihood instead of maximizing the average loglikelihood, because optimizers normally minimize functions. Data scientists commonly refer to this metric as Logloss, as given in Eq. 2. For a more elaborate discussion of the Logloss and its relation to the evaluation metrics normally used in classification model evaluation, I refer you to this article.
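To make the metric concrete, here is a small sketch that computes the negative average loglikelihood by hand and compares it with `log_loss` from Sklearn. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical class labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.7, 0.4])

# Negative average loglikelihood, written out term by term.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# The same quantity from the library.
library = log_loss(y_true, p_pred)

print(manual, library)  # the two values agree
```

A confident but wrong prediction, say a probability of 0.99 for an observation of class 0, would contribute -log(0.01) to the sum, which is why the Logloss punishes overconfidence so heavily.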

Evaluation metric 2: Brier Score

Next to the Logloss, the Brier Score, as given in Eq. 3, is commonly used as an evaluation metric for predicted probabilities. In essence, it is a quadratic loss on the predicted probabilities and the class labels. Note the similarity to the Mean Squared Error (MSE) used in regression model evaluation.

Equation 3: Brier Score for class labels y and predicted probabilities based on features x.
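In its standard binary form, the Brier Score is the mean squared difference between predicted probability and class label. As an assumption on notation, $\hat{p}_i$ is the predicted probability of class 1 for observation $i$ and $N$ is the number of observations.

```latex
\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{p}_i - y_i \right)^2
```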

However, a notable difference with the MSE is that the minimum achievable Brier Score is generally not 0. The labels are either 0 or 1 while the probabilities lie in between, so even a model that predicts the true probabilities incurs a positive squared loss. Simply said, the minimum is not 0 if the underlying process is non-deterministic, which is the reason to use probabilistic classification in the first place. To cope with this problem, the probabilities are commonly evaluated on a relative basis against other probabilistic classifiers, using for instance the Brier Skill Score.
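A small simulation illustrates both points. It is a sketch under stated assumptions: the true probabilities are drawn uniformly between 0 and 1, the reference forecast is the observed base rate, and the helper `brier_skill_score` is a hypothetical name for the usual definition of one minus the ratio of the two Brier Scores.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(42)

# Hypothetical non-deterministic process: each observation has its own
# true probability of being class 1.
true_p = rng.uniform(0, 1, size=100_000)
y = rng.binomial(1, true_p)

# Even a model that predicts the true probabilities has a Brier Score above 0
# (here roughly 1/6, the average of p * (1 - p) for uniform p).
brier_oracle = brier_score_loss(y, true_p)

# Reference forecast: always predict the observed base rate.
p_reference = np.full_like(true_p, y.mean())


def brier_skill_score(y_true, p_model, p_ref):
    """1 means perfect, 0 means no better than the reference forecast."""
    return 1.0 - brier_score_loss(y_true, p_model) / brier_score_loss(y_true, p_ref)


bss = brier_skill_score(y, true_p, p_reference)
print(brier_oracle, bss)
```

The skill score puts the raw number into context: it tells you how much of the reference loss the model removes, rather than how far it is from an unreachable 0.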

Example with dummy data

In this section I will show an example of the steps to go from classification to probability estimation using dummy data. The example will show multiple ML models, ranging from Logistic Regression (LR) to Random Forests (RF). Let us first create dummy data using Sklearn. The dummy dataset contains both informative and redundant features, and multiple clusters per class are introduced.
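A dataset with these properties could be generated along the following lines. This is a sketch and not necessarily the exact setup behind the figures: the sample size, the number of features, the split ratio and the random seed are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# All sizes and the seed below are illustrative assumptions.
X, y = make_classification(
    n_samples=2_000,
    n_features=20,
    n_informative=5,         # features that carry signal
    n_redundant=5,           # linear combinations of the informative ones
    n_clusters_per_class=2,  # multiple clusters per class
    random_state=0,
)

# Keep half of the data out-of-sample for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)  # (1000, 20) (1000, 20)
```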

The dummy data is classified using the ML model structures:

The ML models’ ability to classify correctly is evaluated using the ROC-AUC score. Figure 3 shows that all ML models do a fairly good job at classifying the dummy data, i.e. ROC-AUC > 0.65, with the RBF Support Vector Machine (RBF SVM) and the Random Forest (RF) performing best.
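The sketch below fits the four model structures named in this article (LR, RBF SVM, DT and RF) and computes the out-of-sample ROC-AUC. The dataset parameters and hyperparameters are assumptions, so the numbers will not match Figure 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative dummy data; all sizes and seeds are assumptions.
X, y = make_classification(
    n_samples=2_000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1_000),
    # probability=True makes the SVM expose predict_proba
    # (Sklearn fits a Platt-style sigmoid internally).
    "RBF SVM": SVC(kernel="rbf", probability=True, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p_test = model.predict_proba(X_test)[:, 1]  # probability of class 1
    auc[name] = roc_auc_score(y_test, p_test)
    print(f"{name}: ROC-AUC = {auc[name]:.3f}")
```

Note that ROC-AUC only looks at the ranking of the predicted probabilities, so a high score says nothing yet about how trustworthy the probabilities themselves are.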

Figure 3: ROC-AUC score on out-of-sample data for different ML model structures. Source: Author.

However, remember that the model objective is predicting probabilities. It is nice that the ML models accurately classify the observations, but how well do the models predict class probabilities? There are two routes to evaluate the predicted probabilities:

Quantitatively with the Brier Score and Logloss;
Qualitatively with the calibration plot.

Quantitative evaluation of probabilities

Firstly, the ML models are quantitatively evaluated using the Brier Score and Logloss. Figure 4 shows that RBF SVM and RF perform best at probability estimation based on the Brier Score (left) and the Logloss (right). Note that the Logloss of the Decision Tree (DT) is relatively high; to understand the reason for this, I refer you to this article.

Figure 4: Brier Score (left) and Logloss (right) on out-of-sample data for different ML model structures. Source: Author.

Qualitative evaluation of probabilities

Secondly, the ML models are qualitatively evaluated using the calibration plot. The goal of the calibration plot is to show and evaluate whether predicted probabilities match the actual fraction of positives. The plot groups the predicted probabilities into uniform buckets and, per bucket, compares the mean predicted probability to the fraction of positives. Figure 5 shows the calibration plot for our example. You can see that the LR and RBF SVM are well calibrated, i.e., the mean predicted probability matches the fraction of positives nicely. However, inspecting the distribution of predicted probabilities for LR shows that the predicted probabilities are more centered than for the RBF SVM. Next to that, you see that the DT is ill-calibrated and the distribution of predicted probabilities seems wrong.
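The numbers behind such a plot can be obtained with `calibration_curve` from Sklearn. In the sketch below the probabilities are simulated, and the labels are drawn from those same probabilities, so the hypothetical "model" is well calibrated by construction.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Hypothetical predicted probabilities; labels are drawn with exactly
# these probabilities, so the predictions are calibrated by construction.
p_pred = rng.uniform(0, 1, size=20_000)
y_true = rng.binomial(1, p_pred)

# Ten uniform buckets: fraction of positives vs. mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_true, p_pred, n_bins=10, strategy="uniform")

for m, f in zip(mean_pred, frac_pos):
    print(f"mean predicted = {m:.2f}, fraction of positives = {f:.2f}")
```

For a well-calibrated model the two columns are close in every bucket, which corresponds to points on the diagonal of the calibration plot. Plotting `mean_pred` against `frac_pos` gives the upper panel, and a histogram of `p_pred` gives the lower one.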

Figure 5: Calibration plot (upper) and distribution of probabilities (lower) for different ML models. Source: Author.

Why do predicted probabilities not match posterior probabilities?

Niculescu-Mizil & Caruana explain in their 2005 paper "Predicting Good Probabilities With Supervised Learning" why some ML models produce predicted probabilities that are distorted in comparison to the posterior probabilities. Let us start with the root cause. When a classification model is not trained to decrease the Logloss, the predicted probabilities need not match the posterior probabilities. A solution is to map the predicted probabilities to posterior probabilities after model training, which is known as post-training calibration. Frequently used probability calibration techniques are:

Platt Scaling (Platt, 1999)
Isotonic Regression (Zadrozny & Elkan, 2001)
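Both techniques learn a one-dimensional map from the model output to a probability, using held-out data. The sketch below simulates overconfident scores (an assumption for illustration) and fits both maps by hand. The Platt step is simplified to a plain logistic fit on the scores and leaves out the target smoothing described in the original paper.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical held-out calibration set with known true probabilities.
true_p = rng.uniform(0.05, 0.95, size=5_000)
y_cal = rng.binomial(1, true_p)

# Distorted model output: three times too extreme on the logit scale.
raw_score = 3.0 * np.log(true_p / (1 - true_p))
p_raw = 1 / (1 + np.exp(-raw_score))

# Platt Scaling (simplified): fit a sigmoid on the raw scores.
platt = LogisticRegression(C=1e6)  # large C, i.e. hardly any regularization
platt.fit(raw_score.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(raw_score.reshape(-1, 1))[:, 1]

# Isotonic Regression: fit a non-decreasing step function on the raw scores.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
p_iso = iso.fit_transform(raw_score, y_cal)


def brier(p):
    """Mean squared difference between probabilities and labels."""
    return np.mean((p - y_cal) ** 2)


print(brier(p_raw), brier(p_platt), brier(p_iso))
```

Platt Scaling has only two parameters (slope and intercept), whereas Isotonic Regression is non-parametric and can fit any monotonic distortion, at the price of needing more calibration data.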

Figure 6: Model performance after post-training probability calibration with Platt Scaling and Isotonic Regression. Source: (Niculescu-Mizil & Caruana, 2005)

Calibrating the Decision Tree and Random Forest

The ML models are calibrated using Platt Scaling and Isotonic Regression, both of which are easily coded in Sklearn. Note that the LR is not calibrated post-training, because this model structure is trained to decrease the Logloss and therefore has calibrated probabilities by default.

The only tunable parameter is the number of cross-validation folds used for probability calibration. Niculescu-Mizil & Caruana (2005) show that a small calibration set can deteriorate performance and that the performance improvement is largest for Boosted and Bagged Decision Trees and Support Vector Machines. Our example contains 8,000 observations in the test dataset. For a 5-fold cross-validation, 1,600 observations are reserved for the calibration set in each fold.
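In Sklearn, both techniques are available through `CalibratedClassifierCV`, where `method="sigmoid"` corresponds to Platt Scaling and `method="isotonic"` to Isotonic Regression. The sketch below calibrates a Decision Tree with 5 folds. The dataset and its sizes are assumptions, so the scores will differ from the figures in this article.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dummy data; all sizes and seeds are assumptions.
X, y = make_classification(
    n_samples=2_000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Baseline: the uncalibrated Decision Tree.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
scores = {"uncalibrated": brier_score_loss(y_test, tree.predict_proba(X_test)[:, 1])}

# Post-training calibration with 5-fold cross-validation: in each fold the
# tree is fit on 4/5 of the training data and the calibration map on the rest.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(
        DecisionTreeClassifier(random_state=0), method=method, cv=5
    )
    calibrated.fit(X_train, y_train)
    scores[method] = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

for name, value in scores.items():
    print(f"{name}: Brier Score = {value:.3f}")
```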

Figure 7: Brier Score for Platt Scaling and Isotonic Regression for different ML models and different calibration set sizes. Source: (Niculescu-Mizil & Caruana, 2005)

Model evaluation after probability calibration

Let’s see whether probability calibration improves the Brier Score, Logloss and the calibration plot. Figure 7 below shows the Brier Score and Logloss after probability calibration. Isotonic Regression performs on par with Platt Scaling, because the dataset contains a sufficient number of observations. Given the non-parametric nature of Isotonic Regression, I advise caution in cases with small calibration set sizes. If you intend to do probabilistic classification on small data sizes, I suggest you use prior information and explore the field of Bayesian classification.

Figure 7: Brier Score (left) and Logloss (right) after post-training calibration on out-of-sample data. Source: Author.

Figure 8 shows the calibration plots after post-training calibration. You see an improvement in the predicted probabilities for the SVM, DT and RF. Next to that, the distributions of predicted probabilities cover the range of [0, 1] completely and provide accurate mean predicted probabilities.

Figure 8: Calibration plot (upper) and distribution of probabilities (lower) after post-training calibration of the ML models. Source: Author.

Does probability calibration impact the classification ability?

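A monotonic calibration map such as Platt Scaling does not change the ordering of predicted probabilities. It only changes their values to better match the observed fraction of positives. Small differences in ranking can still arise, for instance because Isotonic Regression introduces ties or because cross-validated calibration averages several fitted models. Figure 9 shows that after probability calibration, the model’s classification ability, as measured by the ROC-AUC score, is either equal or better.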
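The ordering argument can be checked directly. In the sketch below, simulated scores are pushed through a strictly increasing sigmoid, comparable to a fitted Platt map with a positive slope. The scores and the parameters `a` and `b` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical labels and raw scores that are higher on average for class 1.
y = rng.integers(0, 2, size=1_000)
raw = rng.normal(loc=y, scale=1.0)

# A strictly increasing sigmoid map with made-up parameters.
a, b = 0.7, -0.2
calibrated = 1 / (1 + np.exp(-(a * raw + b)))

auc_raw = roc_auc_score(y, raw)
auc_cal = roc_auc_score(y, calibrated)

print(auc_raw, auc_cal)  # identical, because the ranking is unchanged
```

Since ROC-AUC depends only on the ranking of the observations, any strictly increasing map leaves it untouched.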

Figure 9: ROC-AUC Score after post-training calibration for different ML model structures. Source: Author.

Predicting probabilities? Calibrate the model!

The dummy example shows that post-training calibration is essential to accurately estimate class probabilities for the discussed ML models that are not trained to decrease the Logloss.

Additional resources

The following tutorials and lectures were very helpful for my own understanding of probabilistic classification. I have ranked the resources by (personal) importance and I highly recommend checking them out.

Academia

Professor Sanjay Lall, Electrical Engineering @ Stanford.
Research Scientist Andreas Müller, Computer Science @ Columbia.

Industry

Data Scientist Becky Tucker, @ Netflix.
Data Scientist Gordon Chen, @ Oracle.

References

[1] Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61–74).

[2] Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML (pp. 609–616).

[3] Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. Proc. 22nd International Conference on Machine Learning (ICML’05).

Written By

Lars ter Braak

URL: https://towardsdatascience.com/introduction-to-probabilistic-classification-a-machine-learning-perspective-b4776b469453/