Battle of the Ensemble – Random Forest vs Gradient Boosting
Two of the most popular algorithms in the world of machine learning, who will win?
If you have spent some time in the world of machine learning, you will undoubtedly have heard of a concept called the bias-variance tradeoff. It is one of the most important concepts for any machine learning practitioner to understand.
Essentially, the bias-variance tradeoff is a conundrum in machine learning which states that models with low bias will usually have high variance and vice versa.
Bias is the difference between the model's expected (average) prediction and the actual value. A model with high bias is oversimplified and, as a result, underfits the data.
Variance, on the other hand, represents a model’s sensitivity to small fluctuations in the training data. A model with high variance is sensitive to noise and, as a result, overfits the data. In other words, the model fits the training data well but fails to generalise to unseen (testing) data.
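To make the two failure modes concrete, here is a minimal sketch that fits a very shallow and an unrestricted decision tree to the same data. The noisy sine-wave dataset and all variable names are assumptions made for illustration; they are not part of the Titanic case study later in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# High bias: a single split cannot follow the curve (underfitting)
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_train, y_train)

# High variance: an unrestricted tree memorises the training noise (overfitting)
deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_train, y_train)

for name, model in [("max_depth=1", stump), ("max_depth=None", deep)]:
    print(name,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "test R^2:", round(model.score(X_test, y_test), 3))
```

The shallow tree should score poorly on both sets, while the unrestricted tree should fit the training set almost perfectly and do noticeably worse on the test set.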
With that in mind, in this article, I would like to share one of several techniques to balance the tradeoff between bias and variance: ensemble methods.
First of all, what are ensemble methods?
Ensemble methods involve aggregating multiple machine learning models with the aim of decreasing both bias and variance. Ideally, the result from an ensemble method will be better than that of any individual machine learning model.
There are 3 main types of ensemble methods:
- Bagging
- Boosting
- Stacking
For the purpose of this article, we will only focus on the first two: bagging and boosting. Specifically, we will examine and contrast two machine learning models: random forest and gradient boosting, which utilise the technique of bagging and boosting respectively.
Furthermore, we will proceed to apply these two algorithms in the second half of this article to solve the Titanic survival prediction competition in order to see how they work in practice.
Decision tree
Before we begin, it is important that we first understand what a decision tree is as it is fundamental to the underlying algorithm for both random forest and gradient boosting.
A decision tree is a supervised learning algorithm that forms the foundation of tree-based models such as random forest and gradient boosting. Decision trees can be used for both classification and regression problems.
Each internal node of a tree represents a single variable and a split point on that variable (assuming that the variable is numeric). The leaf nodes of the tree contain the output value that the tree uses to make a prediction.
Let’s take the Kaggle house prices prediction competition as an example.
Suppose we are building a decision tree model that takes in a variety of features of a house (e.g. the number of bedrooms, the lot size and the location of its neighbourhood) in order to predict its final sale price.
For simplicity, let’s say the end result of our model looks something like this:
Given a random house, our model is now able to traverse from the very top (root node) of the decision tree down to the bottom (leaf nodes) of the tree and spit out a predicted price for that particular home.
More concretely, based on this model, a house with more than two bedrooms and a lot size larger than 11,500 square feet will have a predicted price of $233,000 and so on.
Of course, a decision tree can get more complex and sophisticated than the one shown above, with more depth and a higher number of nodes, which will, in turn, enable the tree to capture a more detailed relationship between the predictors and the target variable.
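As a rough sketch of how such a tree can be built and inspected with scikit-learn, consider the snippet below. The eight houses are hypothetical toy data, made up so that large houses on large lots sell for $233,000 as in the example above; the other prices and the column names are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: only the $233,000 figure comes from the example above
houses = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 3, 4, 4, 5],
    "lot_size": [5000, 7000, 9000, 8000, 12000, 11000, 13000, 15000],
    "price": [120000, 135000, 140000, 180000, 233000, 185000, 233000, 233000],
})

# Limit the depth so the tree stays as small as the one in the example
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(houses[["bedrooms", "lot_size"]], houses["price"])

# Print the learned split rules as text
print(export_text(tree, feature_names=["bedrooms", "lot_size"]))

# Predict the price of a new house by walking from the root to a leaf
new_house = pd.DataFrame({"bedrooms": [3], "lot_size": [12500]})
predicted_price = tree.predict(new_house)[0]
print(predicted_price)
```

Raising max_depth lets the tree add more splits and capture finer detail, at the cost of a higher risk of overfitting.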
Random forest (bagging)
Now that we understand what a decision tree is and how it works, let us examine our first ensemble method, bagging.
Bagging, also known as bootstrap aggregating, refers to the process of creating and merging a collection of independent, parallel decision trees using different subsets of the training data (bootstrapped datasets).
Pay close attention to the words independent and parallel. Decision trees built using random forest have no knowledge of, and no influence on, the other trees in the model. This is a key defining feature of bagging.
Once all the trees are built, the model aggregates their individual predictions into the final prediction: for classification it returns the mode of the trees' predictions (majority voting), and for regression it returns their average.
I hope it is clear by now that bagging reduces the dependence on a single tree by spreading the risk of error across multiple trees, which also indirectly reduces the risk of overfitting.
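The bagging procedure described above can be sketched by hand in a few lines. This is a simplified illustration rather than how scikit-learn implements random forest internally; the synthetic dataset, the number of trees (25) and the variable names are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # Bootstrap: sample training rows with replacement
    idx = rng.randint(0, len(X_train), size=len(X_train))
    # max_features="sqrt" mimics random forest's random feature subsets at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng.randint(10**6))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: each tree votes and the majority class wins (25 is odd, so no ties)
votes = np.array([tree.predict(X_test) for tree in trees])  # shape (25, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Bagged accuracy:", (y_pred == y_test).mean())
```

Note that each tree is trained without looking at any other tree, so the loop could run in parallel. scikit-learn's RandomForestClassifier follows the same idea but averages the trees' predicted class probabilities instead of counting hard votes.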
Obviously, random forest is not without its flaws and shortcomings. Here are some scenarios when you should and should not use random forest:
When to use random forest
- It can be used for both classification (RandomForestClassifier) and regression (RandomForestRegressor) problems
- You are interested in the significance of predictors (feature importance)
- You need a quick benchmark model, as random forests are quick to train and require minimal preprocessing e.g. feature scaling
- You have messy data, e.g. missing values or outliers
When not to use random forest
- You are solving a complex, novel problem
- Transparency is important
- Prediction time is important as the model needs time to aggregate the result from multiple decision trees before arriving at the final prediction
To wrap up on random forest, here are some key hyperparameters to consider:
- n_estimators: controls how many individual decision trees will be built
- max_depth: controls how deep each individual decision tree can go
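As a small sketch of what these two hyperparameters do, the snippet below fits a forest on synthetic data (an assumption for illustration) and inspects the fitted trees. The attributes estimators_ and feature_importances_ and the method get_depth() are part of scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X, y)

# n_estimators sets the number of fitted trees
print(len(rf.estimators_))
# max_depth caps how deep any single tree can grow
print(max(tree.get_depth() for tree in rf.estimators_))
# Feature importances are normalised so that they sum to 1
print(rf.feature_importances_.round(3))
```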
Gradient boosting (boosting)
Boosting, on the other hand, takes an iterative approach to combine a number of weak, sequential models to create one strong model by focusing on the mistakes in the prior iterations.
A weak model is one that is only slightly better than random guessing whereas a strong model is one that is strongly correlated with the true classification.
A crucial distinction between boosting and bagging is that decision trees under boosting are not built independently. Instead, they are built sequentially, with each tree learning from the mistakes of the ones that come before it.
It is also worth noting that there are other variations of boosting e.g. AdaBoost (adaptive boosting), XGBoost (extreme gradient boosting) and LightGBM (light gradient boosting) but for the purpose of this article, we will solely focus on gradient boosting.
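To illustrate the sequential idea, here is a bare-bones sketch of gradient boosting for regression with squared error, where each new tree is fitted to the residuals left by the trees before it. It is a simplified illustration under stated assumptions (synthetic data, 100 rounds, a learning rate of 0.1), not scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant guess
errors = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    # For squared error, the negative gradient is simply the residual
    residual = y - prediction
    # A shallow (weak) tree learns what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a damped correction to the running prediction
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print("Training MSE before:", round(errors[0], 4), "after:", round(errors[-1], 4))
```

Unlike the bagging loop, each iteration here depends on the predictions of all previous iterations, which is why the trees cannot be built in parallel.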
Similar to the section above, here are some scenarios when you should and should not use gradient boosting:
When to use gradient boosting
- It can be used for both classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems
- You are interested in the significance of predictors (feature importance)
- Prediction time is important: once trained, gradient boosting predicts quickly, even though training is slower than for random forest because successive trees cannot be built in parallel
When not to use gradient boosting
- Transparency is important
- Training time is important or when you have limited compute power
- Your data is really noisy as gradient boosting tends to emphasise even the smallest error and as a result, it can overfit to noise in the data
Moreover, here are some key hyperparameters to consider for gradient boosting:
- learning_rate: scales the contribution of each tree, which affects both how quickly the algorithm converges and whether it finds a good solution
- max_depth: controls how deep each individual decision tree can go (the trees under gradient boosting are typically shallower than those under random forest)
- n_estimators: controls how many successive trees will be built (there are usually a higher number of trees under gradient boosting than random forest)
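The interplay between learning_rate and n_estimators can be seen in a small sketch: with the same number of trees, a smaller learning rate leaves the training loss higher, so it usually needs more trees to catch up. The synthetic data below is an assumption for illustration; train_score_ and estimators_ are attributes of scikit-learn's GradientBoostingClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Same number of shallow trees, two different learning rates
slow = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                  learning_rate=0.01, random_state=0).fit(X, y)
fast = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                  learning_rate=0.1, random_state=0).fit(X, y)

# train_score_ holds the training loss after each boosting iteration
print("Final training loss, learning_rate=0.01:", round(slow.train_score_[-1], 4))
print("Final training loss, learning_rate=0.1: ", round(fast.train_score_[-1], 4))
# One tree is fitted per boosting iteration
print(len(fast.estimators_))
```

A lower training loss is not automatically better, since it says nothing about overfitting; that is why the case study tunes these values with cross-validation.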
Titanic case study
As promised, let’s now apply random forest and gradient boosting in an actual project, the Titanic survival prediction competition, in order to reinforce what we have covered so far in this article.
If you would like to follow along, check out the full notebook on my GitHub here.
Let’s first take a look at the first 5 rows of the dataset.
Now, we will perform some feature engineering and data preprocessing to get our data ready for modelling. Specifically, we will do the following:
- Fill missing values in the Age column with the average passenger age
- Combine SibSp and Parch features into a single feature: family_size
- Create a new feature, cabin_missing, which acts as an indicator for missing data in the Cabin column
- Encode the Sex column by assigning 0 to male passengers and 1 to female passengers
- Split the data into a training set (80%) and a test set (20%)
I will spare the details in this article but if you are interested in the rationale and the actual code behind these steps, kindly refer to my notebook.
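For readers who want a rough idea of what these steps might look like in pandas, here is a minimal sketch. It is not the code from the notebook: the ten rows are hypothetical, summing SibSp and Parch and dropping the raw columns afterwards are assumptions, and only the column names come from the Titanic dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows using the Titanic column names; not the real dataset
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male", "male",
            "female", "male", "female", "female", "male"],
    "Age": [22, 38, np.nan, 35, np.nan, 27, 54, 14, 4, np.nan],
    "SibSp": [1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    "Parch": [0, 0, 0, 0, 0, 2, 0, 0, 1, 0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan,
              np.nan, "E46", np.nan, "G6", np.nan],
})

# Fill missing ages with the average passenger age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].mean())
# Combine SibSp and Parch into one feature (summing them is an assumption)
titanic["family_size"] = titanic["SibSp"] + titanic["Parch"]
# Flag missing cabins: 1 if Cabin is missing, else 0
titanic["cabin_missing"] = titanic["Cabin"].isnull().astype(int)
# Encode Sex: 0 for male, 1 for female
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Drop the raw columns that have been replaced (also an assumption)
features = titanic.drop(columns=["Survived", "SibSp", "Parch", "Cabin"])
labels = titanic["Survived"]

# 80% training set, 20% test set
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```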
RandomForestClassifier
To see the default hyperparameters for this model:
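# Default hyperparameters for RandomForestClassifier
# (recent scikit-learn versions print only non-default values, so use get_params())
from sklearn.ensemble import RandomForestClassifier
print(RandomForestClassifier().get_params())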
Before we fit the model to the training data, we can use GridSearchCV to find the optimal set of hyperparameters.
# Set up GridSearchCV
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier(n_jobs = -1, random_state = 10)
params = {
'n_estimators': [5, 50, 250],
'max_depth': [2, 4, 8, 16, 32, None]
}
cv = GridSearchCV(rf, params, cv = 5, n_jobs = -1)
# Fit GridSearchCV to training set
cv.fit(X_train, Y_train)
# Best parameters
cv.best_params_
- max_depth: 4
- n_estimators: 50
In other words, the best random forest model found by the grid search for this training set contains 50 decision trees with a maximum depth of 4.
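As a side note, GridSearchCV also exposes the cross-validated score of the winning combination and, with its default refit=True, a model already refitted on the full training set. The sketch below shows both on synthetic data, which is an assumption standing in for the Titanic features; the smaller grid is also chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)

params = {"n_estimators": [5, 50], "max_depth": [2, 4, None]}
cv = GridSearchCV(RandomForestClassifier(random_state=10), params, cv=5)
cv.fit(X_train, Y_train)

# Mean cross-validated accuracy of the winning combination
print(cv.best_params_, round(cv.best_score_, 3))

# With the default refit=True, the best model is already refitted on the whole training set
best_rf = cv.best_estimator_
print("Test accuracy:", round(best_rf.score(X_test, Y_test), 3))
```

Refitting the model manually with the best hyperparameters, as done in this article, makes it possible to time the fit separately.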
Finally, we can proceed to fit our model using this set of hyperparameters and subsequently assess its performance on the test set.
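# Imports for timing and scoring (score is an alias for precision_recall_fscore_support)
import time
from sklearn.metrics import precision_recall_fscore_support as score
# Instantiate RandomForestClassifier with best hyperparameters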
rf = RandomForestClassifier(n_estimators = 50, max_depth = 4, n_jobs = -1, random_state = 42)
# Fit model
start = time.time()
rf_model = rf.fit(X_train, Y_train)
end = time.time()
fit_time = end - start
# Predict
start = time.time()
Y_pred = rf_model.predict(X_test)
end = time.time()
pred_time = end - start
# Time and prediction results
precision, recall, fscore, support = score(Y_test, Y_pred, average = 'binary')
print(f"Fit time: {round(fit_time, 3)} / Predict time: {round(pred_time, 3)}")
print(f"Precision: {round(precision, 3)} / Recall: {round(recall, 3)} / Accuracy: {round((Y_pred==Y_test).sum() / len(Y_pred), 3)}")
- Fit time: 0.469
- Predict time: 0.141
- Precision: 0.797
- Recall: 0.689
- Accuracy: 0.799
# Confusion matrix for RandomForestClassifier
import seaborn as sns
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(Y_test, Y_pred)
sns.heatmap(matrix, annot = True, fmt = 'd')
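To connect the confusion matrix to the precision, recall and accuracy figures, here is a short sketch that derives all three from the four cells of the matrix. The labels and predictions are hypothetical and are not taken from the Titanic test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = survived, 0 = did not survive)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                   # of predicted survivors, how many survived
recall = tp / (tp + fn)                      # of actual survivors, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were correct

print(precision, recall, accuracy)
# Cross-check against scikit-learn's own metric functions
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```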
GradientBoostingClassifier
Now, let’s see how gradient boosting stacks up against random forest.
Similarly, to see the default hyperparameters for this model:
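# Default hyperparameters for GradientBoostingClassifier
# (recent scikit-learn versions print only non-default values, so use get_params())
from sklearn.ensemble import GradientBoostingClassifier
print(GradientBoostingClassifier().get_params())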
Use GridSearchCV to find the best hyperparameters.
# Set up GridSearchCV
gb = GradientBoostingClassifier(random_state = 10)
params = {
'n_estimators': [5, 50, 250, 500],
'max_depth': [1, 3, 5, 7, 9],
'learning_rate': [0.01, 0.1, 1, 10, 100]
}
cv = GridSearchCV(gb, params, cv = 5, n_jobs = -1)
# Fit GridSearchCV to training set
cv.fit(X_train, Y_train)
# Best parameters
cv.best_params_
- learning_rate: 0.01
- max_depth: 3
- n_estimators: 250
As we can see, the trees built using gradient boosting are shallower than those built using random forest (a maximum depth of 3 versus 4), but what is even more significant is the difference in the number of estimators between the two models. Gradient boosting has significantly more trees than random forest (250 versus 50).
This confirms what we have discussed earlier about the structure of random forest and gradient boosting and the way in which they operate.
Next, let’s fit our gradient boosting model to the training data.
# Instantiate GradientBoostingClassifier with best hyperparameters
gb = GradientBoostingClassifier(n_estimators = 250, max_depth = 3, learning_rate = 0.01, random_state = 42)
# Fit model
start = time.time()
gb_model = gb.fit(X_train, Y_train)
end = time.time()
fit_time = end - start
# Predict
start = time.time()
Y_pred = gb_model.predict(X_test)
end = time.time()
pred_time = end - start
# Time and prediction results
precision, recall, fscore, support = score(Y_test, Y_pred, average = 'binary')
print(f"Fit time: {round(fit_time, 3)} / Predict time: {round(pred_time, 3)}")
print(f"Precision: {round(precision, 3)} / Recall: {round(recall, 3)} / Accuracy: {round((Y_pred==Y_test).sum() / len(Y_pred), 3)}")
- Fit time: 1.112
- Predict time: 0.006
- Precision: 0.812
- Recall: 0.703
- Accuracy: 0.81
# Confusion matrix for GradientBoostingClassifier
matrix = confusion_matrix(Y_test, Y_pred)
sns.heatmap(matrix, annot = True, fmt = 'd')
Here, we observe that gradient boosting has a longer fit time but a much shorter predict time compared to random forest.
Again, this aligns with our initial expectation as training is done iteratively under gradient boosting, which explains the longer fit time. However, once the model is ready, gradient boosting takes a much shorter time to make a prediction compared to random forest.
To recap, random forests:
- Create independent, parallel decision trees
- Work better with fewer, deeper decision trees
- Have a short fit time but a long predict time
In contrast, gradient boosting:
- Builds trees in a successive manner where each tree improves upon the mistakes made by previous trees
- Works better with many shallow decision trees
- Has a long fit time but a short predict time
Thank you for reading.