VOOZH about

URL: https://www.geeksforgeeks.org/machine-learning/random-forest-regression-in-python/

⇱ Random Forest Regression in Python - GeeksforGeeks


  • Courses
  • Tutorials
  • Interview Prep

Random Forest Regression in Python

Last Updated : 6 Apr, 2026

Random Forest is an ensemble learning method that combines multiple decision trees to produce more accurate and stable predictions. It can be used for both classification and regression tasks, where regression predictions are obtained by averaging the outputs of several trees.

  • Multiple decision trees: Builds many trees and combines their predictions.
  • Ensemble approach: Reduces errors compared to a single decision tree.
  • Regression prediction: Produces continuous values by averaging predictions from all trees.
👁 random_forest
Random Forest

Working of Random Forest Regression

Random Forest Regression works using the bagging (Bootstrap Aggregating) technique:

  • Multiple decision trees are trained on different random subsets of the dataset with replacement to train each tree.
  • Each tree uses a random subset of features while splitting nodes.
  • Due to this each tree learns slightly different data and features, which increases model diversity.
  • The final prediction is obtained by averaging the predictions from all decision trees.
👁 random-forest-regression22
Random Forest Regression

Implementation

We will be implementing random forest regression on salaries data.

1. Importing Libraries

Here we are importing numpy, pandas, matplotlib and scikit learn.

  • RandomForestRegressor: This is the regression model that is based upon the Random Forest model.
  • LabelEncoder: This class is used to encode categorical data into numerical values.
  • train_test_split: This function is used to split a dataset into training and testing sets.
  • mean_squared_error, r2_score: Used to evaluate regression performance.

2. Importing Dataset

Now let's load the dataset in the panda's data frame. For better data handling and leveraging the handy functions to perform complex tasks in one go.

You can download dataset from here.

Output:

👁 df
Dataset

Output:

👁 dfinfo
Info of the dataset

3. Data Preparation

Here the code will extracts two subsets of data from the Dataset and stores them in separate variables.

  • Extracting Features: It extracts the features from the DataFrame and stores them in a variable named X.
  • Extracting Target Variable: It extracts the target variable from the DataFrame and stores it in a variable named y.

4. Encoding categorical columns

If the dataset contains object type columns, they are converted into numeric form using Label Encoding so that the machine learning model can process them.

5. Splitting Dataset

The dataset is divided into training and testing sets so that the model is trained on one portion and evaluated on unseen data. This prevents overly optimistic performance results.

6. Random Forest Regressor Model

The model is trained using the training dataset.

  • n_estimators=100: Number of decision trees in the forest.
  • random_state=42: Ensures reproducible results.
  • oob_score=True: Uses out-of-bag samples to estimate model performance.

7. Making predictions and Evaluating

The code evaluates the trained Random Forest Regression model:

  • oob_score_: Retrive out-of-bag (OOB) score which estimates the model's generalization performance.
  • Predictions are made on the test set.
  • Evaluates the model's performance using the Mean Squared Error (MSE) and R-squared (R2) metrics.

Output:

Out-of-Bag Score: 0.2280694384742593
Mean Squared Error: 616145000.0
R-squared: 0.9878292345679013

  • Out-of-Bag Score: Measures how well the model generalizes on unseen samples, a low value indicates weaker generalization.
  • Mean Squared Error: Average squared difference between actual and predicted values, lower values indicate better predictions.
  • R-squared: Indicates how much variance in the target variable is explained by the model, values close to 1 show a strong fit.

8. Visualizing

Now let's visualize the results obtained by using the RandomForest Regression model on our salaries dataset.

  • Creates a grid of prediction points covering the range of the feature values.
  • Plots the real data points as blue scatter points.
  • Plots the predicted values for the prediction grid as a green line.
  • Adds labels and a title to the plot for better understanding.

Output:

👁 Screenshot-2023-12-04-101235

9. Visualizing a Single Decision Tree

The code visualizes one of the decision trees from the trained Random Forest model. Plots the selected decision tree, displaying the decision making process of a single tree within the ensemble.

Output:

👁 downloa
Single Decision Tree from the Random Forest Model

Download full code from here

Applications

Random Forest Regression is widely used in many real world problems for predicting continuous values.

  • Predicting numerical values: Used for tasks like house price prediction, stock price forecasting or customer lifetime value estimation.
  • Risk analysis: Helps identify risk factors in areas such as healthcare and finance.
  • High dimensional data: Works well with datasets that contain many input features.
  • Complex relationships: Can model complex and nonlinear relationships between features and the target variable.

Advantages

Random Forest Regression offers several benefits when working with complex datasets.

  • Handles non linearity: Captures complex relationships between features and the target variable.
  • Reduces overfitting: Combines multiple decision trees to produce more stable predictions.
  • Robust to outliers: Averaging predictions from many trees reduces the effect of extreme values.
  • Works with large datasets: Performs well even with high dimensional data.
  • Handles missing data: Can maintain good accuracy even when some values are missing.
  • No feature scaling required: Does not require normalization or scaling of input data.

Limitations

Random Forest Regression also has some limitations that should be considered when using the model.

  • Computational complexity: Training many decision trees can be slow and computationally expensive.
  • Less interpretability: Harder to interpret compared to simpler models like linear regression or a single decision tree.
  • Memory intensive: Storing multiple trees requires significant memory for large datasets.
  • Overfitting on noisy data: The model may still overfit when the dataset contains a lot of noise.
  • Sensitive to imbalanced data: Performance may decrease when one class is much more frequent than others.
Comment