In Machine Learning and Data Science we often come across the term Imbalanced Data Distribution, which generally arises when the observations in one class are much more or much less numerous than in the other classes. Because Machine Learning algorithms tend to increase accuracy by reducing the overall error, they do not take the class distribution into account. This problem is prevalent in applications such as Fraud Detection, Anomaly Detection, Facial Recognition, etc.
Standard ML techniques such as Decision Tree and Logistic Regression are biased towards the majority class and tend to ignore the minority class. They tend to predict only the majority class, so the minority class is misclassified far more often than the majority class. In more technical terms, if the class distribution in our dataset is imbalanced, the model is prone to having negligible or very low recall for the minority class.
Imbalanced Data Handling Techniques:
Two algorithms are widely used for handling imbalanced class distribution: SMOTE and NearMiss.
SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods to solve the imbalance problem. It aims to balance the class distribution by increasing the number of minority class examples. Rather than replicating existing examples, SMOTE synthesises new minority instances between existing minority instances: it generates virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied to the processed data.
More insight into how the SMOTE algorithm works
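The interpolation step of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, its parameters, and the toy points are made up for this example, neighbours are found by brute force, and base points are picked at random.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Illustrative only: brute-force neighbour search, Euclidean distance.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Interpolate between the base and one randomly chosen neighbour.
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class in 2-D (made-up points).
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5), (3.0, 1.5), (1.0, 2.5)]
new_points = smote_sketch(minority_points, n_new=4, k=3)
for point in new_points:
    print(point)
```

Every synthetic point lies on the line segment between two existing minority points, which is why SMOTE adds new information instead of duplicating rows.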
NearMiss is an under-sampling technique. It aims to balance the class distribution by eliminating majority class examples. Instead of discarding them at random, it uses the distances between majority and minority instances to decide which majority instances to keep. To limit the information loss that affects most under-sampling techniques, near-neighbor methods are widely used.
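The basic intuition behind near-neighbor methods is to compute the distances between the majority class instances and the minority class instances, and to keep the n majority instances that are closest to the minority class.
There are several variations of the NearMiss algorithm, which differ in how these n closest instances in the majority class are found.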
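One such variation, usually called NearMiss-1, keeps the majority points whose average distance to their k nearest minority points is smallest. The sketch below illustrates that rule in plain Python; the function name `nearmiss1_sketch`, its parameters, and the toy points are assumptions made for this example, not library code.

```python
import math


def nearmiss1_sketch(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""
    k = min(k, len(minority))

    def mean_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        return sum(dists[:k]) / k

    # Rank majority points by that average distance, smallest first.
    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]


# Toy data (made-up points): minority clustered near the origin, majority spread out.
minority_points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
majority_points = [(1.0, 1.0), (5.0, 5.0), (0.6, 0.6), (9.0, 0.0), (2.0, 0.0), (7.0, 7.0)]

# Keep as many majority points as there are minority points, so the classes end up balanced.
kept = nearmiss1_sketch(majority_points, minority_points, n_keep=len(minority_points))
print(kept)  # prints [(0.6, 0.6), (1.0, 1.0), (2.0, 0.0)]
```

The far-away majority points are discarded, so the retained majority sample concentrates near the class boundary.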
This article aims to give a better understanding of, and hands-on practice in, choosing between different imbalanced data handling techniques.
The dataset consists of transactions made by credit cards. It contains 492 fraud transactions out of 284,807 transactions, which makes it highly unbalanced: the positive class (frauds) accounts for about 0.17% of all transactions.
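To see why such a split makes accuracy misleading, the arithmetic can be worked through directly. The sketch below uses only the two counts stated above; the "always legitimate" classifier is a hypothetical baseline introduced for illustration.

```python
# Class counts as stated in the article: 492 frauds out of 284,807 transactions.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

fraud_share = n_fraud / n_total

# A hypothetical "model" that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_total  # every legitimate transaction is correct
baseline_recall = 0 / n_fraud          # no fraud is ever caught

print(f"fraud share:       {fraud_share:.3%}")
print(f"baseline accuracy: {baseline_accuracy:.3%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

A model that never predicts fraud is right on more than 99.8% of transactions while catching no fraud at all, which is why recall on the minority class is the number to watch.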
Column summary of the dataset (pandas DataFrame.info() output):
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
Number of records per class:
0 284315
1 492
Shapes of the training and test sets after the split:
Number transactions X_train dataset: (199364, 29)
Number transactions y_train dataset: (199364, 1)
Number transactions X_test dataset: (85443, 29)
Number transactions y_test dataset: (85443, 1)
Classification report of the first model on the test set:
precision recall f1-score support
0 1.00 1.00 1.00 85296
1 0.88 0.62 0.73 147
accuracy 1.00 85443
macro avg 0.94 0.81 0.86 85443
weighted avg 1.00 1.00 1.00 85443
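The per-class figures in such a report follow from three confusion counts. The sketch below shows the formulas; the counts are hypothetical, chosen only because they are consistent with the rounded fraud-class row above (147 actual frauds), and are not taken from the original run.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts for the fraud class (assumption, see lead-in):
# 91 frauds caught (tp), 56 frauds missed (fn), 12 false alarms (fp).
tp, fp, fn = 91, 12, 56
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.88 recall=0.62 f1=0.73
```

Recall depends only on the frauds themselves (caught versus missed), so it is unaffected by the huge number of correctly classified legitimate transactions that dominates accuracy.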
The accuracy rounds to 100%, but did you notice something strange?
The recall of the minority class is very low (0.62). This shows that the model is biased towards the majority class, so it is not the best model. Now we will apply different imbalanced data handling techniques and see their accuracy and recall results.
Class counts before and after SMOTE oversampling:
Before OverSampling, counts of label '1': [345]
Before OverSampling, counts of label '0': [199019]
After OverSampling, the shape of train_X: (398038, 29)
After OverSampling, the shape of train_y: (398038, )
After OverSampling, counts of label '1': 199019
After OverSampling, counts of label '0': 199019
Look! The SMOTE algorithm has oversampled the minority class until it equals the majority class: both classes now have the same number of records (199,019). Now see the accuracy and recall results after applying the SMOTE algorithm (oversampling).
Classification report on the test set after SMOTE:
precision recall f1-score support
0 1.00 0.98 0.99 85296
1 0.06 0.92 0.11 147
accuracy 0.98 85443
macro avg 0.53 0.95 0.55 85443
weighted avg 1.00 0.98 0.99 85443
Wow! The accuracy has dropped to 98% compared to the previous model, but the recall of the minority class has improved to 92%. In terms of recall this is a good model compared to the previous one, although the precision of the minority class has fallen to 0.06. Now we will apply the NearMiss technique to under-sample the majority class and see its accuracy and recall results.
Class counts before and after NearMiss undersampling:
Before Undersampling, counts of label '1': [345]
Before Undersampling, counts of label '0': [199019]
After Undersampling, the shape of train_X: (690, 29)
After Undersampling, the shape of train_y: (690, )
After Undersampling, counts of label '1': 345
After Undersampling, counts of label '0': 345
The NearMiss algorithm has undersampled the majority class until it equals the minority class. Here, the majority class has been reduced to the size of the minority class, so that both classes have the same number of records (345).
Classification report on the test set after NearMiss:
precision recall f1-score support
0 1.00 0.56 0.72 85296
1 0.00 0.95 0.01 147
accuracy 0.56 85443
macro avg 0.50 0.75 0.36 85443
weighted avg 1.00 0.56 0.72 85443
This model is better than the first model at detecting the minority class, whose recall is now 95%. But because of the undersampling of the majority class, the recall of the majority class has dropped to 56%, and the overall accuracy with it. So in this case SMOTE gives the best combination of accuracy and recall, and I'll go ahead and use that model! :)