Training a decision tree against unbalanced data

Last Updated : 23 Jul, 2025

When building machine learning models like decision trees, imbalanced data—where one class significantly outnumbers another—presents a unique challenge. For example, in credit card transactions, legitimate transactions may vastly outnumber fraudulent ones. Without adjustments, decision trees tend to favor the majority class, often misclassifying the minority class.

Decision trees are effective classifiers, constructing logical rules to separate classes. However, on imbalanced datasets, trees tend to maximize node purity, leading them to favor the majority class at the expense of the minority class due to: Decision trees often emphasize the more frequent class, which can lead to high misclassification rates for the minority class.

Strategies to Address Imbalance

Cost-Sensitive Learning: Assign higher misclassification costs to the minority class (e.g., fraud), encouraging the model to consider it more seriously.
Alternative Splitting Criteria: Use metrics like Hellinger distance instead of traditional ones like information gain, as it better handles skewed distributions.
Sampling Techniques: Balance the dataset by oversampling the minority class or undersampling the majority class, or by using wrapper frameworks that combine sampling with the splitting metric.
Adjusted Evaluation Metrics: Accuracy alone is misleading in imbalanced settings. Instead, prioritize metrics like precision, recall, and F1-score to assess the model’s performance on the minority class more accurately.

Example in Practice

Consider a decision tree classifying medical records as healthy or diseased. With 99.5% healthy cases, the tree may default to "healthy" predictions, yielding high accuracy but misclassifying all diseased cases. By applying cost-sensitive learning or alternative splitting criteria, the tree can prioritize correct classification of the minority diseased cases.

Let's now see a code example in this scenario:

Training a decision tree against unbalanced data

Creating An Unbalanced Dataset

This code generates an imbalanced synthetic dataset with three classes using make_classification, with class distribution controlled by the weights parameter (70%, 20%, and 10% for each class). It then displays the class distribution and splits the data into training and testing sets, preparing it for training a classification model.

Output:

Class distribution: Counter({np.int64(0): 700, np.int64(1): 200, np.int64(2): 100})

Training the Decision Tree with Adjustments

This code initializes a decision tree classifier with balanced class weights to address imbalanced data, sets a controlled depth (max_depth=4) and minimum leaf samples for stability. After training, it evaluates the model with precision, recall, and F1 scores, and calculates the ROC AUC score using a one-vs-rest approach for multi-class. Finally, it displays the confusion matrix and visualizes the decision tree structure.

Output:

Classification Report:
 precision recall f1-score support

 0 1.00 1.00 1.00 217
 1 1.00 0.98 0.99 56
 2 0.96 1.00 0.98 27

 accuracy 1.00 300
 macro avg 0.99 0.99 0.99 300
weighted avg 1.00 1.00 1.00 300

ROC AUC Score: 1.00

The benefit of using balanced trees can be observed in the following decision boundaries:

👁 Decision_trees

Difference between decision boundaries by using default and balanced weights

Key Takeaways

Recognize Imbalance Effects: Imbalanced data skews decision trees toward the majority class.
Apply Balancing Strategies: Cost-sensitive learning, alternative splitting, and sampling improve minority class predictions.
Use Appropriate Metrics: Precision, recall, and F1-score provide a clearer picture of model performance on the minority class.

Comment

Article Tags:

Machine Learning

AI-ML-DS

Data Science Questions

AI-ML-DS With Python