Introduction to Probabilistic Classification: A Machine Learning Perspective
A guide to going from predicting labels to predicting probabilities
You can train and evaluate classification models, both linear and non-linear. Well done! Now you want class probabilities instead of class labels. Look no further: this is the article you are looking for. It walks you through the different evaluation metrics, their pros and cons, and optimal model training for multiple ML models.
Classifying cats and dogs
Imagine creating a model with the sole purpose of classifying cats and dogs. The classification model will not be perfect and will therefore misclassify some observations. Some cats will be classified as dogs and vice versa. That’s life. In this example, the model classifies 100 cats and dogs. The confusion matrix is a commonly used visualization tool to show prediction accuracy, and Figure 1 shows the confusion matrix for this example.
Let’s focus on the 12 observations where the model predicts a cat while in reality it is a dog. If the model predicts a 51% probability of cat and the animal turns out to be a dog, that is certainly plausible. But what if the model predicts a 95% probability of cat and it turns out to be a dog? That should be highly unlikely.
Classifiers use a predicted probability and a threshold to classify the observations. Figure 2 visualizes the classification for a threshold of 50%. It seems intuitive to use a threshold of 50%, but nothing prevents you from adjusting it. So, for classification, the only thing that matters in the end is the ordering of the observations. Changing the objective to predicting probabilities instead of labels requires a different approach. For this, we enter the field of probabilistic classification.
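As a minimal sketch of the thresholding step, the snippet below turns probabilities into labels for two thresholds. The probabilities and the helper name `classify` are made up for illustration and are not taken from the example in Figure 2.

```python
import numpy as np

# Hypothetical predicted probabilities of "cat" for six animals (made-up values).
p_cat = np.array([0.95, 0.80, 0.51, 0.49, 0.30, 0.05])


def classify(probabilities, threshold=0.5):
    """Label an observation 'cat' when its probability reaches the threshold."""
    return np.where(probabilities >= threshold, "cat", "dog")


labels_50 = classify(p_cat, threshold=0.5)
labels_70 = classify(p_cat, threshold=0.7)

print(labels_50)
print(labels_70)
```

Raising the threshold only moves the cut-off along the ranked observations. The labels say nothing about whether 0.95 really means "a cat 95% of the time".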
Evaluation metric 1: Logloss
Let us generalize from cats and dogs to class labels of 0 and 1. Class probabilities are any real number between 0 and 1. The model objective is to match predicted probabilities with class labels, i.e. to maximize the likelihood, given in Eq. 1, of observing class labels given the predicted probabilities.
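Eq. 1 is not reproduced in this text. Assuming it is the standard Bernoulli likelihood for binary labels, it can be written as follows, where the symbols are my own notation: y_i is the class label (0 or 1) of observation i, p_i is the predicted probability of class 1, and N is the number of observations.

```latex
\mathcal{L} = \prod_{i=1}^{N} p_i^{\,y_i} \, (1 - p_i)^{1 - y_i}
```

Each factor equals p_i when the label is 1 and 1 - p_i when the label is 0, so the likelihood is large when high probabilities are assigned to the classes that actually occur.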
A major drawback of the likelihood is that as the number of observations grows, the product of the individual probabilities becomes increasingly small. So, with enough data, the likelihood will underflow the numerical precision of any computer. Next to that, a product of many terms is cumbersome to differentiate. That’s the reason the logarithm of the likelihood is preferred, commonly referred to as the loglikelihood. A logarithm is a monotonically increasing function of its argument. Therefore, maximization of the log of a function is equivalent to maximization of the function itself.
Nonetheless, the loglikelihood still scales with the number of observations, so the average loglikelihood is a better metric for comparing models across datasets of different sizes. In practice, most people minimize the negative average loglikelihood instead of maximizing the average loglikelihood, because optimizers normally minimize functions. Data scientists commonly refer to this metric as Logloss, as given in Eq. 2. For a more elaborate discussion of the Logloss and its relation to the evaluation metrics normally used in classification model evaluation, I refer you to this article.
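The underflow is easy to reproduce. The sketch below uses made-up random labels and probabilities (not the article's data) and compares the raw likelihood with the loglikelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)        # made-up class labels
p = rng.uniform(0.05, 0.95, size=n)   # made-up predicted probabilities

# Per-observation likelihood terms: p if the label is 1, else 1 - p.
terms = np.where(y == 1, p, 1.0 - p)

likelihood = np.prod(terms)            # product of many numbers below 1
loglikelihood = np.sum(np.log(terms))  # sum of their logarithms

print(likelihood)
print(loglikelihood)
```

With a few thousand observations the product is already too small for a 64-bit float and collapses to zero, while the sum of logarithms remains an ordinary finite number.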
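To make the definition concrete, here is a small sketch that computes the negative average loglikelihood by hand and compares it with `log_loss` from scikit-learn. The function name `logloss`, the clipping constant `eps` and the toy data are my own choices for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss


def logloss(y_true, p_pred, eps=1e-15):
    """Negative average loglikelihood for binary labels."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))


y_true = [1, 0, 1, 1, 0]            # made-up labels
p_pred = [0.9, 0.2, 0.6, 0.4, 0.1]  # made-up predicted probabilities

manual = logloss(y_true, p_pred)
reference = log_loss(y_true, p_pred)

print(manual, reference)
```

Both numbers should agree, which confirms that Logloss is simply the sign-flipped, averaged loglikelihood.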
Evaluation metric 2: Brier Score
Next to the Logloss, the _Brier Score_, as given in Eq. 3, is commonly used as an evaluation metric for predicted probabilities. In essence, it is a quadratic loss on the predicted probabilities and the class labels. Note the similarity to the Mean Squared Error (MSE) used in regression model evaluation.
However, a notable difference with the MSE is that, in practice, the minimum attainable Brier Score is not 0. The Brier Score is the squared loss between 0/1 labels and probabilities, so it only reaches 0 when every prediction is exactly 0 or 1 and correct. Simply said, the minimum is not 0 if the underlying process is non-deterministic, which is the reason to use probabilistic classification in the first place. For example, if the true probability of class 1 is 0.7, even the perfect forecast of 0.7 has an expected Brier Score of 0.7 × 0.3 = 0.21. To cope with this problem, the probabilities are commonly evaluated relative to other probabilistic classifiers, for instance with the Brier Skill Score.
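The following sketch computes the Brier Score by hand, checks it against `brier_score_loss` from scikit-learn, and derives a Brier Skill Score. The data are made up, and the reference forecast (always predicting the base rate) is one common choice that I assume here; other references are possible.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # made-up labels
p_model = np.array([0.8, 0.3, 0.7, 0.6, 0.2, 0.4, 0.9, 0.1])     # made-up probabilities

# Brier Score: mean squared difference between probability and label.
brier_model = np.mean((p_model - y_true) ** 2)
brier_sklearn = brier_score_loss(y_true, p_model)

# Assumed reference forecast: always predict the observed base rate.
p_reference = np.full(len(y_true), y_true.mean())
brier_reference = brier_score_loss(y_true, p_reference)

# Brier Skill Score: 1 is perfect, 0 is no better than the reference.
brier_skill_score = 1.0 - brier_model / brier_reference

print(brier_model, brier_reference, brier_skill_score)
```

A positive skill score means the model improves on the reference forecast, and a negative one means it does worse.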
Example with dummy data
In this section I will show an example of the steps to go from classification to probability estimation using dummy data. The example will show multiple ML models, ranging from Logistic Regression to Random Forests. Let us first create dummy data using Sklearn. The dummy dataset contains both informative and redundant features, and multiple clusters per class are introduced.
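A dataset with these properties can be generated with `make_classification`. The sketch below is one possible setup; the sample size, feature counts, split ratio and random seed are assumptions of mine, not the values used for the figures.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# All numbers below are illustrative assumptions.
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=5,         # features that carry signal
    n_redundant=5,           # linear combinations of the informative features
    n_clusters_per_class=2,  # multiple clusters per class
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```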
The dummy data is classified using the ML model structures:
- Logistic Regression (LR),
- Support Vector Machine (SVM),
- Decision Tree (DT),
- Random Forest (RF).
Each ML model’s ability to classify correctly is evaluated using the ROC-AUC score. Figure 3 shows that all ML models do a fairly good job at classifying the dummy data, i.e. ROC-AUC > 0.65, with RBF SVM and RF performing best.
However, remember that the model objective is predicting probabilities. It is nice that the ML models accurately classify the observations, but how well do the models predict class probabilities? There are two routes to evaluate the predicted probabilities:
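A sketch of this comparison is shown below. The dataset settings, hyperparameters and the helper `positive_class_score` are my own assumptions, so the scores will not match Figure 3. Because ROC-AUC only depends on the ranking, the SVM is scored with its `decision_function` here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (assumed settings).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RBF SVM": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}


def positive_class_score(model, X):
    """Return a ranking score for class 1 (hypothetical helper)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc[name] = roc_auc_score(y_test, positive_class_score(model, X_test))

for name, value in auc.items():
    print(f"{name}: ROC-AUC = {value:.3f}")
```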
- Quantitatively with the Brier Score and Logloss;
- Qualitatively with the calibration plot.
Quantitative evaluation of probabilities
Firstly, the ML models are quantitatively evaluated using the Brier Score and Logloss. Figure 4 shows that RBF SVM and RF perform best at probability estimation based on the Brier Score (left) and the Logloss (right). Note that the Logloss of the DT is relatively high; to understand the reason for this, I refer you to this article.
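The quantitative check can be sketched as follows for three of the models. Dataset settings and hyperparameters are assumptions, so the numbers will differ from Figure 4. One thing the sketch makes visible: a fully grown decision tree typically predicts probabilities of exactly 0 or 1, and the Logloss penalizes each confident mistake very heavily.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (assumed settings).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),  # unpruned tree
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p_test = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "brier": brier_score_loss(y_test, p_test),
        "logloss": log_loss(y_test, p_test),
    }

for name, s in scores.items():
    print(f"{name}: Brier = {s['brier']:.3f}, Logloss = {s['logloss']:.3f}")
```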
Qualitative evaluation of probabilities
Secondly, the ML models are qualitatively evaluated using the calibration plot. The goal of the calibration plot is to show whether predicted probabilities match the actual fraction of positives. The plot groups the predicted probabilities into uniform buckets and compares the mean predicted probability in each bucket to the fraction of positives. Figure 5 shows the calibration plot for our example. You can see that the LR and RBF SVM are well calibrated, i.e., the mean predicted probability matches the fraction of positives nicely. However, inspecting the distribution of predicted probabilities for LR shows that they are more concentrated around the center than for the RBF SVM. Next to that, you see that the DT is ill-calibrated and the distribution of its predicted probabilities seems wrong.
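The numbers behind a calibration plot can be obtained with `calibration_curve`. The sketch below does this for a Logistic Regression on an assumed dummy dataset and also counts how many predictions fall into each bucket. Plotting is left out to keep the example short.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small illustrative dataset (assumed settings).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_test = model.predict_proba(X_test)[:, 1]

# Ten uniform buckets on [0, 1]; empty buckets are dropped from the result.
fraction_of_positives, mean_predicted = calibration_curve(
    y_test, p_test, n_bins=10, strategy="uniform"
)

# Distribution of the predicted probabilities over the same buckets.
counts, edges = np.histogram(p_test, bins=10, range=(0.0, 1.0))

for mp, fp in zip(mean_predicted, fraction_of_positives):
    print(f"mean predicted = {mp:.2f}, fraction of positives = {fp:.2f}")
print(counts)
```

For a well-calibrated model the two printed columns are close to each other, which corresponds to points lying near the diagonal of the calibration plot.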
Why do predicted probabilities not match posterior probabilities?
Niculescu-Mizil & Caruana explain in their 2005 paper "Predicting Good Probabilities With Supervised Learning" why some ML models produce distorted predicted probabilities in comparison to the posterior probabilities. Let us start by explaining the root cause. When a classification model is not trained to decrease the Logloss, the predicted probabilities do not match the posterior probabilities. A solution is to map the predicted probabilities to posterior probabilities after model training, which is known as post-training calibration. Frequently used probability calibration techniques are:
- Platt Scaling (Platt, 1999)
- Isotonic Regression (Zadrozny & Elkan, 2001)
Calibrating the Decision Tree and Random Forest
The ML models are calibrated using Platt scaling and Isotonic regression, which are both easily coded in Sklearn. Note, the LR is not calibrated because this model structure is trained to decrease the Logloss and therefore has calibrated probabilities by default.
The only tunable parameter is the number of cross-validation folds used for probability calibration. Niculescu-Mizil & Caruana (2005) show that a small calibration set can deteriorate performance and that the improvement is largest for boosted and bagged decision trees and Support Vector Machines. Our example contains 8,000 observations in the test dataset. With 5-fold cross-validation, 1,600 observations are reserved for the calibration set in each fold.
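In Sklearn, both techniques are available through `CalibratedClassifierCV`, where `method="sigmoid"` corresponds to Platt scaling. The sketch below calibrates a Decision Tree with 5-fold cross-validation. The dataset settings and hyperparameters are assumptions, and the numbers will not match the figures.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (assumed settings).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "uncalibrated": DecisionTreeClassifier(random_state=0),
    "platt": CalibratedClassifierCV(
        DecisionTreeClassifier(random_state=0), method="sigmoid", cv=5
    ),
    "isotonic": CalibratedClassifierCV(
        DecisionTreeClassifier(random_state=0), method="isotonic", cv=5
    ),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    p_test = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "brier": brier_score_loss(y_test, p_test),
        "logloss": log_loss(y_test, p_test),
    }

for name, r in results.items():
    print(f"{name}: Brier = {r['brier']:.3f}, Logloss = {r['logloss']:.3f}")
```

With `cv=5`, each fold fits the tree on four fifths of the training data and fits the calibration map on the remaining fifth; the final prediction averages the five calibrated models.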
Model evaluation after probability calibration
Let’s see whether probability calibration improves the Brier Score, Logloss and the calibration plot. Figure 7 shows the Brier Score and Logloss after probability calibration. Isotonic Regression performs on par with Platt Scaling, because the dataset contains a sufficient number of observations. Given the non-parametric nature of Isotonic Regression, I must warn against using it with small calibration sets. If you intend to do probabilistic classification on small datasets, I suggest you use prior information and explore the field of Bayesian classification.
Figure 8 shows the calibration plots after post-training calibration. You see an improvement in the predicted probabilities for the SVM, DT and RF. Next to that, the distribution of predicted probabilities now covers the range [0, 1] completely, and the mean predicted probabilities are accurate.
Does probability calibration impact the classification ability?
Calibration does not change the ordering of predicted probabilities. It only changes the predicted probabilities to better match the observed fraction of positives. Figure 9 shows that after probability calibration, the model’s classification ability, as measured by the ROC-AUC score, is either equal or better.
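The reason the ordering matters can be illustrated with simulated data. In the sketch below, labels are drawn so that the probabilities `p` are calibrated by construction, and `p_distorted` is a strictly increasing transformation of them. Both the simulation and the choice of transformation are my own assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(1)
n = 5000

# Simulated, calibrated-by-construction probabilities and matching labels.
p = rng.uniform(0.0, 1.0, size=n)
y = (rng.uniform(0.0, 1.0, size=n) < p).astype(int)

# A strictly increasing distortion keeps the ranking but breaks calibration.
p_distorted = p ** 3

auc_original = roc_auc_score(y, p)
auc_distorted = roc_auc_score(y, p_distorted)

brier_original = brier_score_loss(y, p)
brier_distorted = brier_score_loss(y, p_distorted)

print(auc_original, auc_distorted)
print(brier_original, brier_distorted)
```

The ROC-AUC is identical for both sets of probabilities because the ranking is the same, while the Brier Score is clearly worse for the distorted ones. A monotonic calibration map works the other way around: it repairs the probabilities without touching the ranking.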
Predicting probabilities? Calibrate the model!
The dummy example clearly shows that post-training calibration is essential to accurately estimate class probabilities for the discussed ML models.
Additional resources
The following tutorials and lectures were very helpful for my own understanding of probabilistic classification. I have ranked the resources by (personal) importance and highly recommend checking them out.
Academia
- Professor Sanjay Lall, Electrical Engineering @ Stanford.
- Research Scientist Andreas Müller, Computer Science @ Columbia.
Industry
References
[1] Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61–74).
[2] Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML (pp. 609–616).
[3] Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. Proc. 22nd International Conference on Machine Learning (ICML’05).