Choosing the right depth for a decision tree is crucial to avoiding overfitting, a common issue where the model fits the training data too well but fails to generalize to new data. The core idea is to balance the complexity of the model against its ability to generalize. Here, we will explore how to set the depth of a decision tree to prevent overfitting. Let's discuss a few techniques for preventing overfitting in decision trees:
Use cross-validation: divide the dataset into multiple folds, train decision trees with varying depths on some folds, and validate on the held-out fold, rotating so that every fold is used for validation once. This approach identifies the depth that generalizes best to unseen data. For instance, training trees with depths from 1 to 15 might reveal that depth 7 achieves the best validation accuracy without overfitting.
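A minimal sketch of this idea, assuming scikit-learn and its bundled Iris dataset (the dataset, the 5-fold setting, and the variable names are illustrative choices, not taken from the original example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris is used only as a small, convenient demo dataset.
X, y = load_iris(return_X_y=True)

depths = list(range(1, 16))
cv_means = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # 5-fold cross-validation: each fold serves as the validation set once.
    scores = cross_val_score(tree, X, y, cv=5)
    cv_means.append(scores.mean())

# np.argmax returns the first maximum, i.e. the shallowest depth with the best score.
best_depth = depths[int(np.argmax(cv_means))]
print(f"Best depth by 5-fold CV: {best_depth} (mean accuracy {max(cv_means):.3f})")
```

Preferring the shallowest depth among ties is a deliberate choice here: when two depths score the same, the simpler tree is less likely to overfit.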
Set a maximum depth for the tree, typically between 3 and 10, based on the complexity of the data. Limiting depth prevents the model from capturing noise or irrelevant patterns. For example, a tree with a maximum depth of 5 may generalize better than a deeper tree that overfits by learning minor data irregularities.
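As a rough illustration, the sketch below compares a depth-limited tree with an unrestricted one. It assumes scikit-learn and the Iris dataset; `max_depth=3` and the 80/20 split are arbitrary demo values rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris and an 80/20 stratified split, chosen only for the demo.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A depth-limited tree and a tree that grows until its leaves are pure.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

for name, model in [("max_depth=3", shallow), ("unrestricted", deep)]:
    print(
        f"{name}: depth={model.get_depth()}, "
        f"train={model.score(X_train, y_train):.3f}, "
        f"test={model.score(X_test, y_test):.3f}"
    )
```

On a small, clean dataset the gap between the two trees may be minor; the difference tends to matter more on noisy data.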
Track training and validation accuracy as tree depth increases. Overfitting becomes evident when validation accuracy peaks or plateaus (and may start to fall) while training accuracy continues to rise. For example, if validation accuracy plateaus at depth 8 but training accuracy keeps improving, depth 8 is a reasonable choice, since deeper trees add complexity without improving generalization.
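One way to track both curves is scikit-learn's `validation_curve`. The sketch below assumes the Iris dataset and 5-fold cross-validation, both chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris is used only as a small demo dataset.
X, y = load_iris(return_X_y=True)
depth_range = list(range(1, 16))

# Each returned array has shape (number of depths, number of CV folds).
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    param_name="max_depth",
    param_range=depth_range,
    cv=5,
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depth_range, train_mean, val_mean):
    print(f"depth={depth:2d}  train={tr:.3f}  validation={va:.3f}")

# Pick the shallowest depth at which mean validation accuracy peaks.
chosen_depth = depth_range[int(np.argmax(val_mean))]
print("Chosen depth:", chosen_depth)
```

Plotting `train_mean` and `val_mean` against `depth_range` makes the point where the two curves diverge easier to see than the printed table.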
Use GridSearchCV or RandomizedSearchCV to efficiently identify the best tree depth by testing a range of values, such as 1 to 15. These methods automate the search process and determine the optimal depth based on cross-validation results. For example, grid search might suggest depth 6 as ideal for balancing accuracy and generalization.
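A minimal `GridSearchCV` sketch under the same assumptions (Iris data, an 80/20 split, 5-fold cross-validation; none of these settings are confirmed by the original example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris and an 80/20 stratified split, chosen only for the demo.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Try every depth from 1 to 15, scoring each with 5-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": list(range(1, 16))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_depth = grid.best_params_["max_depth"]
# By default the best model is refit on the full training set, so it can be scored directly.
test_acc = grid.score(X_test, y_test)
print(f"Best depth: {best_depth}, CV accuracy: {grid.best_score_:.3f}, test accuracy: {test_acc:.3f}")
```

`RandomizedSearchCV` takes a similar form but samples a fixed number of candidates, which is mainly useful when several hyperparameters are searched together.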
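Apply pruning parameters such as min_samples_split, min_samples_leaf, or ccp_alpha to simplify the tree. The first two are pre-pruning constraints that stop the tree from creating splits supported by too few samples, while ccp_alpha enables cost-complexity (post-)pruning, which removes branches whose contribution does not justify their added complexity. Pruning reduces complexity and improves generalization by eliminating branches that do not significantly improve predictions. For instance, setting pruning parameters might reduce a tree from a depth of 15 to an effective depth of 8, resulting in better performance on validation data.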
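The sketch below shows both kinds of pruning side by side, assuming scikit-learn and the Iris dataset. The values `min_samples_split=4` and `min_samples_leaf=2` mirror the ones in the output further down; the `ccp_alpha` value is picked arbitrarily from the pruning path for illustration and would normally be tuned by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris and an 80/20 stratified split, chosen only for the demo.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: a fully grown tree with no pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: refuse splits that would leave too few samples in a node or leaf.
pre_pruned = DecisionTreeClassifier(
    min_samples_split=4, min_samples_leaf=2, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the candidate alphas, then refit with one of them.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
# Illustrative choice only: take the middle alpha; clamp guards against tiny negative float error.
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train={model.score(X_train, y_train):.3f}, test={model.score(X_test, y_test):.3f}"
    )
```

Larger `ccp_alpha` values prune more aggressively, so sweeping over `path.ccp_alphas` and scoring each candidate on validation data is a common way to pick one.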
Let's understand this with the example below:
This code demonstrates five techniques to prevent overfitting in decision trees.
Output:
Method 4: Best Depth found by Grid Search is 4
Method 5: Pruning with min_samples_split=4, min_samples_leaf=2
Training Accuracy (Pruned): 0.9666666666666667
Test Accuracy (Pruned): 1.0