Why we need to deal with imbalanced classes
An illustration of the effects of imbalanced classes on model evaluation
Class imbalance naturally occurs in certain types of classification problems, such as credit attribution (the data set usually contains many more approved credits than rejected ones) or fraud detection (fraud usually represents a small percentage of the overall transactions).
Class imbalance means that one of the modalities of a categorical variable is over-represented with respect to the others, such as in the example below:
It is recommended to handle class imbalance before training a model, and the suggested methods usually fall into one of the following categories:
I have linked to some resources for each type of approach if you want to find out more about a topic. This post focuses instead on illustrating why class imbalance needs to be dealt with in the first place, and the effects it can otherwise have on model performance. It is assumed that you are familiar with performance measures for classification models; if not, here is a good starting point.
The chosen performance metric needs to be aware of class imbalance
Let us consider an example similar to the one illustrated by the pie chart, where class 0 makes up 10% of the data set and class 1 the remaining 90% (a 1:9 ratio). We would like to build a classification model to predict 0 or 1 based on some features.
In this case, choosing accuracy as the performance measure (i.e. the fraction of correct predictions out of the total predictions) might lead to training a dummy model that always predicts the most frequent class. For such a model the confusion matrix will be given by:
Based on the accuracy formula, accuracy = (TP + TN) / (TP + TN + FP + FN), we can see that our dummy model reaches an accuracy of 0.9.
This kind of number might look good on paper, but it comes from a model that does not actually explain much and will generalize poorly to new data. To prevent this kind of behavior, models are in practice often evaluated according to several metrics. Such an extreme case would easily be caught by computing the ROC AUC score, for instance (0.5 in this case).
But choosing a good metric for model evaluation is rarely obvious, and this is especially true in the case of class imbalance. Dealing with class imbalance at the source can help remove this extra concern.
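The numbers above can be reproduced with a few lines of scikit-learn. This is only a minimal sketch: the synthetic labels (100 samples of class 0 and 900 of class 1) and the placeholder features are assumptions made for illustration, not the data set used in this post.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Assumed synthetic labels: 10% class 0, 90% class 1.
y = np.array([0] * 100 + [1] * 900)
# Placeholder features: a dummy model ignores them anyway.
X = np.zeros((len(y), 1))

# A model that always predicts the most frequent class (class 1 here).
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)
y_score = dummy.predict_proba(X)[:, 1]  # constant score for every sample

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, y_pred)
accuracy = accuracy_score(y, y_pred)
auc = roc_auc_score(y, y_score)

print(cm)        # all samples end up in the "predicted 1" column
print(accuracy)  # 0.9
print(auc)       # 0.5, no better than random ranking
```

The accuracy looks fine, while the ROC AUC shows that the model cannot separate the two classes at all.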
Using the default classification threshold might result in poor performance
The classification threshold is often forgotten when performing model evaluation but it provides an extra degree of freedom that can help tune a trained model in order to reach the desired performance.
The raw output of a classification model is generally a continuous value between 0 and 1 representing the probability of the outcome. It is converted into class 0 or class 1 based on the chosen classification threshold. The threshold is set by default to 0.5, a value that splits the theoretical range of the model output in half (this implicitly assumes a roughly symmetric output distribution, a hypothesis that does not always hold in practice, although some models are robust enough when it does not):
Training a model based on an imbalanced data set leads to a very skewed output distribution. In this case, using the default value for the classification threshold might result in poor performance.
Even non-probabilistic models such as random forests still rely on the assumption that the sampling used to perform bootstrap aggregation is representative. This assumption does not necessarily hold with imbalanced classes and can lead to poor overall performance.
In the case of a skewed output distribution, an "optimal" classification threshold can be computed from the ROC curve or from the Precision-Recall curve as illustrated here, or by defining a custom grid search score, a topic I have already covered here.
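As a rough sketch of the curve-based approach, the snippet below picks one threshold from the ROC curve and one from the Precision-Recall curve. Several choices here are assumptions made for illustration and are not taken from this post: the synthetic data set (with the minority class labelled 1), the logistic regression model, and the two selection criteria (maximizing TPR minus FPR, often called Youden's J, and maximizing F1). Other criteria are equally valid depending on the business cost of each error type.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Assumed synthetic data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]  # predicted probability of class 1

# Threshold from the ROC curve: maximize TPR - FPR.
fpr, tpr, roc_thresholds = roc_curve(y_val, scores)
# Skip index 0, which is an artificial threshold above every score.
best_roc = np.argmax((tpr - fpr)[1:]) + 1
roc_threshold = roc_thresholds[best_roc]

# Threshold from the Precision-Recall curve: maximize F1.
precision, recall, pr_thresholds = precision_recall_curve(y_val, scores)
p, r = precision[:-1], recall[:-1]  # the last point has no threshold
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0 / 0
pr_threshold = pr_thresholds[np.argmax(f1)]

print(f"ROC-based threshold: {roc_threshold:.3f}")
print(f"PR-based threshold:  {pr_threshold:.3f}")

# Apply a tuned threshold instead of the default 0.5.
y_pred_tuned = (scores >= roc_threshold).astype(int)
```

The threshold should be selected on validation data and then checked on a separate test set, otherwise the reported performance will be optimistic.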
Balancing classes does not come for free
We have seen the importance of dealing with imbalanced data, but applying such transformations to the data is not necessarily without consequences. Depending on the chosen method, it can introduce bias, lead to overfitting or remove important information.
After balancing the classes, we can check that the overall model performance is not degraded too much by the balancing technique. For the example I considered before, the ROC curve for the imbalanced data (red) is almost superimposed on the ROC curve for the balanced data (green), meaning little loss in performance for the added robustness.
If the results obtained for the balanced data set are much better, or too good to be true, something probably went wrong. A common mistake is oversampling before splitting the data set into train and test sets or before cross-validating (for an example and more detailed explanations see here). This leads to data leakage and evaluation metrics that cannot be trusted.
The purpose of this post was to illustrate the effects of class imbalance on trained models and on some evaluation metrics. Different methods for dealing with class imbalance are already used by data scientists everywhere, but none of them come free of cost, and their effect on model performance needs to be evaluated properly.
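To make the order of operations concrete, here is a minimal sketch that splits first and only then oversamples the training set. The synthetic data set and the choice of plain random oversampling with `sklearn.utils.resample` are assumptions made for illustration; any other balancing technique would follow the same order.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Assumed synthetic data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# 1. Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Oversample the minority class within the training set only.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority],
    y_train[minority],
    replace=True,           # sample with replacement
    n_samples=n_majority,   # match the majority class size
    random_state=0,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

print(np.bincount(y_train_bal))  # balanced training set
print(np.bincount(y_test))       # test set still imbalanced
```

Because duplicated minority samples never reach the test set, the evaluation reflects performance on data with the original class distribution. The same rule applies to cross-validation: the resampling step has to happen inside each training fold, not before the folds are created.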