Gradient Boosting is a boosting algorithm in which each new model is trained to reduce the loss (such as mean squared error or cross-entropy) of the ensemble built so far, using a form of gradient descent. In each iteration the algorithm computes the gradient of the loss function with respect to the current predictions and then trains a new weak model to predict the negative of this gradient. The new model's predictions are then added to the ensemble (the combined prediction of all models so far), and the process is repeated until a stopping criterion is met.
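For squared-error loss, the link between "gradient" and "error" can be made explicit. The short derivation below is a sketch; the notation ($F_{m-1}$ for the ensemble before the new model is added) is assumed here for illustration and does not come from the article.

$$
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
-\left.\frac{\partial L}{\partial F}\right|_{F = F_{m-1}(x)} = y - F_{m-1}(x)
$$

Under this loss the negative gradient is simply the residual, which is why fitting a weak model to the gradient amounts to fitting it to the current errors.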
A key feature of Gradient Boosting is shrinkage, which scales the contribution of each new model by a learning rate (denoted η).
There is a trade-off between the learning rate and the number of estimators (trees): a smaller learning rate usually means more trees are required to achieve optimal performance.
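The trade-off can be explored with a small experiment. This is a minimal sketch that assumes scikit-learn is available and uses synthetic data from `make_regression`; the dataset, the two learning rates, and the tree count are illustrative choices, not values from the article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; any tabular dataset would do).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for lr in (0.5, 0.05):
    model = GradientBoostingRegressor(
        n_estimators=200, learning_rate=lr, max_depth=3, random_state=0
    )
    model.fit(X_train, y_train)
    # staged_predict yields test predictions after 1, 2, ..., n_estimators trees.
    test_mse = [
        mean_squared_error(y_test, pred) for pred in model.staged_predict(X_test)
    ]
    results[lr] = test_mse
    best_n = test_mse.index(min(test_mse)) + 1
    print(f"learning_rate={lr}: lowest test MSE {min(test_mse):.1f} at {best_n} trees")
```

Comparing the two printed lines shows how many trees each learning rate needed to reach its best test error on this particular split; the exact numbers will vary with the data and seed.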
The ensemble consists of multiple trees, each trained to correct the errors of the previous one. In the first iteration, Tree 1 is trained on the original feature matrix X and the true labels y. It makes predictions, which are used to compute the errors (residuals).
In the second iteration, Tree 2 is trained using the same feature matrix X, with the errors from Tree 1 as labels, so Tree 2 learns to predict the errors of Tree 1. This process continues for all the trees in the ensemble: each subsequent tree is trained to predict the errors left by the trees before it.
After each tree is trained, its predictions are shrunk by multiplying them by the learning rate η, which ranges from 0 to 1. This helps prevent overfitting by ensuring each tree has a smaller impact on the final model.
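Once all trees are trained, predictions are made by summing the contributions of all the trees. The final prediction is given by the formula:

$$
\hat{y} = \hat{y}_1 + \eta\, r_1 + \eta\, r_2 + \dots + \eta\, r_N
$$

where $\hat{y}_1$ is the prediction of Tree 1, $\eta$ is the learning rate, and $r_1, r_2, \dots, r_N$ are the errors predicted by each subsequent tree.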
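The training loop and the final summation described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the article's own code: it uses scikit-learn's `DecisionTreeRegressor` as the weak learner, assumes squared-error loss, and starts from the mean of the targets (a common convention) instead of a first full tree. The function names `fit_gradient_boosting` and `predict_gradient_boosting` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit a squared-error boosted ensemble; return the initial value and the trees."""
    initial = float(np.mean(y))              # constant starting prediction (assumption)
    prediction = np.full(len(y), initial)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction           # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)               # the new tree learns to predict those errors
        prediction += learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return initial, trees


def predict_gradient_boosting(X, initial, trees, learning_rate=0.1):
    """Sum the initial value and the shrunken contribution of every tree."""
    prediction = np.full(X.shape[0], initial)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction


# Toy data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

initial, trees = fit_gradient_boosting(X, y)
y_hat = predict_gradient_boosting(X, initial, trees)

baseline_mse = np.mean((y - initial) ** 2)
boosted_mse = np.mean((y - y_hat) ** 2)
print(f"MSE of constant model: {baseline_mse:.4f}")
print(f"MSE after boosting:    {boosted_mse:.4f}")
```

The same learning rate must be used at fit and predict time, since each tree was trained against residuals that already reflect the shrunken updates.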
The main differences between AdaBoost and Gradient Boosting are as follows:
| Feature | AdaBoost | Gradient Boosting |
|---|---|---|
| Weight update strategy | Increases the weights of misclassified samples so that the next learner focuses more on them. | Updates predictions by minimizing a loss function using the negative gradient. |
| Base learners | Typically uses decision stumps (decision trees with a single split) as weak learners. | Can use a wide range of base learners, such as decision trees and linear models. |
| Sensitivity to noise | More sensitive to noisy data and outliers due to aggressive reweighting. | Generally less sensitive, as it smooths updates using gradients. |
| Optimization technique | Classically formulated in terms of weighted classification error and not an explicit loss, although it can be viewed as implicitly minimizing an exponential loss. | Explicitly minimizes a differentiable loss function. |
| Boosting mechanism | Learners are trained sequentially with sample reweighting. | Learners are trained sequentially with residual fitting (gradient descent). |
| Interpretability | Easier to interpret due to simple weak learners. | Harder to interpret if complex models are used. |
| Use case | Suitable for clean datasets with few outliers. | Suitable for complex problems with varying loss functions. |
Here are two examples that demonstrate how Gradient Boosting works for classification and regression. But before that, let's understand the main gradient boosting parameters.
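As a rough guide, and assuming scikit-learn's implementation is the one in use (the text does not name the library), the most commonly tuned parameters can be set as in the sketch below. The values shown are illustrative, not recommendations from the article.

```python
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting stages (trees) added sequentially
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=3,         # depth limit for each individual tree
    subsample=1.0,       # fraction of samples used per tree; < 1.0 gives stochastic boosting
    max_features=None,   # number of features considered per split (None = all)
    random_state=0,      # seed for reproducible results
)

# Inspect the configured values.
params = clf.get_params()
print(params["n_estimators"], params["learning_rate"], params["max_depth"])
```

`GradientBoostingRegressor` accepts the same arguments, so the same knobs apply to the regression example.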
Now we start building our models with Gradient Boosting.
We use Gradient Boosting Classifier to predict digits from Digits dataset.
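A minimal sketch of such a classifier, assuming scikit-learn and its bundled Digits dataset. The split ratio, seed, and hyperparameters here are assumptions, so the accuracy it prints may differ slightly from the figure reported in the output.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load 8x8 digit images flattened into 64 features per sample.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative assumptions.
gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbc.fit(X_train, y_train)

accuracy = accuracy_score(y_test, gbc.predict(X_test))
print(f"Gradient Boosting Classifier accuracy is : {accuracy:.2f}")
```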
Output:
Gradient Boosting Classifier accuracy is : 0.98
We use Gradient Boosting Regressor on the Diabetes dataset to predict continuous values:
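A minimal sketch of the regression case, again assuming scikit-learn and its bundled Diabetes dataset. The split, seed, and hyperparameters are assumptions, so the error it prints will not necessarily match the figure reported in the output.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the 10-feature diabetes progression dataset.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative assumptions.
gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X_train, y_train)

# Root mean squared error on the held-out split.
rmse = float(np.sqrt(mean_squared_error(y_test, gbr.predict(X_test))))
print(f"Root mean Square error: {rmse:.2f}")
```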
Output:
Root mean Square error: 56.39
Gradient Boosting is an effective and widely used machine learning technique for both classification and regression problems. It builds models sequentially, with each one focused on correcting the errors made by the previous models, which leads to improved performance.