URL: https://www.geeksforgeeks.org/machine-learning/xgboost-for-regression/

XGBoost for Regression

Last Updated : 28 Oct, 2025

XGBoost (Extreme Gradient Boosting) is an optimized, scalable implementation of gradient boosting for supervised learning tasks such as regression and classification. In regression, XGBoost predicts continuous numeric values by minimizing a loss function (for example, squared error, usually reported as MSE or RMSE) while applying regularization to limit overfitting.

Why Use XGBoost for Regression

XGBoost is particularly effective for regression problems due to:

  • Handling Missing Values: Automatically handles missing data without requiring imputation.
  • Feature Importance: Provides insight into which features impact predictions.
  • Scalability: Trains efficiently on large datasets and supports GPU acceleration.
  • Ensemble Learning: Combines multiple weak models to create a strong predictive model.

Loss Functions and Regularization

XGBoost constructs its models by minimizing an objective function that balances two aspects:

  • Prediction Accuracy — measured using a loss function
  • Model Complexity — controlled via regularization

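Formally, the objective function is:

$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

Where:

  • $L(y_i, \hat{y}_i)$ quantifies the error between the actual value $y_i$ and the predicted value $\hat{y}_i$, summed over the $n$ training samples.
  • $\Omega(f_k)$ penalizes overly complex trees to avoid overfitting, summed over the $K$ trees $f_k$ in the ensemble.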
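
To make the two parts of the objective concrete, here is a small pure-Python sketch that adds a squared-error loss to a tree penalty. It assumes the commonly used penalty $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2 + \alpha \sum_j |w_j|$, where $T$ is the number of leaves and $w_j$ are the leaf weights. The function names and the toy numbers are hypothetical.

```python
def squared_error(y_true, y_pred):
    """Sum of squared errors over all samples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))


def tree_penalty(leaf_weights, gamma=0.0, lam=1.0, alpha=0.0):
    """Assumed penalty: gamma * T + 0.5 * lam * sum(w^2) + alpha * sum(|w|)."""
    n_leaves = len(leaf_weights)
    l2 = 0.5 * lam * sum(w * w for w in leaf_weights)
    l1 = alpha * sum(abs(w) for w in leaf_weights)
    return gamma * n_leaves + l2 + l1


def objective(y_true, y_pred, trees, gamma=0.0, lam=1.0, alpha=0.0):
    """Loss term plus the penalty of every tree in the ensemble."""
    penalty = sum(tree_penalty(t, gamma, lam, alpha) for t in trees)
    return squared_error(y_true, y_pred) + penalty


# Toy values: three samples and two trees described only by their leaf weights.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 6.0]
trees = [[0.5, -0.5], [1.0]]

obj = objective(y_true, y_pred, trees, gamma=0.1, lam=1.0, alpha=0.0)
print(round(obj, 4))  # 2.55 (loss 1.5 + penalty 1.05)
```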

Loss Functions

XGBoost supports multiple loss functions depending on the task:

1. Regression (continuous target):

$$L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$

This is also referred to as squared error loss ("reg:squarederror" in XGBoost). It penalizes larger errors more heavily, which suits regression tasks where extreme deviations matter.

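2. Binary Classification (0/1 target):

$$L(y_i, \hat{p}_i) = -\left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]$$

This is the logistic loss, where $\hat{p}_i$ is the predicted probability of class 1. It is used when predictions are probabilities between 0 and 1. XGBoost exposes it as "binary:logistic" for binary classification; "reg:logistic" applies the same loss to targets in [0, 1].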
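
XGBoost builds each tree from the first and second derivatives of the loss with respect to the current prediction. The sketch below computes both for the two losses above. It assumes the squared error is scaled as $\tfrac{1}{2}(y - \hat{y})^2$ so the gradient is simply $\hat{y} - y$, and that the logistic loss is differentiated with respect to the raw score before the sigmoid. The function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def squared_error_grad_hess(y, pred):
    """Derivatives of 0.5 * (y - pred)**2 with respect to pred."""
    return pred - y, 1.0


def logistic_grad_hess(y, raw_score):
    """Derivatives of the log loss with respect to the raw (pre-sigmoid) score."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)


g_sq, h_sq = squared_error_grad_hess(y=3.0, pred=2.5)
g_lg, h_lg = logistic_grad_hess(y=1, raw_score=0.0)

print(g_sq, h_sq)  # -0.5 1.0
print(g_lg, h_lg)  # -0.5 0.25
```

A negative gradient means the prediction is too low, so the next tree should push it upward.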

Working in Regression

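1. During tree building, XGBoost calculates the gain for each possible split:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Here $G$ and $H$ are the sums of the first and second derivatives of the loss over the samples in a node, the subscripts $L$ and $R$ denote the left and right child, $\lambda$ is the L2 regularization strength, and $\gamma$ is the penalty for adding a leaf. The formula is written for $\alpha = 0$; with L1 regularization, each $G$ is replaced by $\operatorname{sign}(G)\,\max(|G| - \alpha, 0)$.

2. A split is accepted only if Gain > 0, ensuring that the split improves the model after considering regularization.

3. Leaf weights are calculated as:

$$w_j^{*} = -\frac{\operatorname{sign}(G_j)\,\max(|G_j| - \alpha,\ 0)}{H_j + \lambda}$$

This shows how L2 regularization (λ) shrinks leaf weights by enlarging the denominator, and how L1 regularization (α) further encourages zero weights: whenever $|G_j| \le \alpha$, the weight is exactly zero. With $\alpha = 0$ the expression reduces to $-G_j / (H_j + \lambda)$.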
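
The split and leaf-weight rules can be checked by hand with a short pure-Python sketch. It assumes the standard second-order formulas: a node score of $T_\alpha(G)^2 / (H + \lambda)$ and a leaf weight of $-T_\alpha(G) / (H + \lambda)$, where $T_\alpha$ is soft-thresholding by α. The function names and toy gradients are hypothetical.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the L1 effect); 0.0 if |g| <= alpha."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, lam=1.0, alpha=0.0):
    """Optimal leaf weight: -T_alpha(G) / (H + lambda)."""
    return -soft_threshold(G, alpha) / (H + lam)


def node_score(G, H, lam=1.0, alpha=0.0):
    """Structure score of one node: T_alpha(G)^2 / (H + lambda)."""
    return soft_threshold(G, alpha) ** 2 / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0, alpha=0.0):
    """Half the score improvement of the two children over the parent, minus gamma."""
    left = node_score(G_L, H_L, lam, alpha)
    right = node_score(G_R, H_R, lam, alpha)
    parent = node_score(G_L + G_R, H_L + H_R, lam, alpha)
    return 0.5 * (left + right - parent) - gamma


# Toy node under squared error (g = pred - y, h = 1 per sample).
g_left, g_right = [-2.0, -3.0], [4.0, 3.0]
G_L, H_L = sum(g_left), float(len(g_left))    # -5.0, 2.0
G_R, H_R = sum(g_right), float(len(g_right))  #  7.0, 2.0

gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)
print(round(gain, 4))                                       # 11.9333
print(round(leaf_weight(G_L, H_L, lam=1.0), 4))             # 1.6667
print(round(leaf_weight(G_L, H_L, lam=1.0, alpha=1.0), 4))  # 1.3333
```

Raising gamma above the unpenalized gain (for example, gamma=15 here) makes the gain negative, so the split would be rejected.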

Implementation

Step 1: Installation

Let's install the XGBoost package (for example, with "pip install xgboost").

Step 2: Importing libraries and Dataset

Here we import the seaborn and pandas libraries and load the mpg dataset from Seaborn, which we use for the rest of the example.
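
A minimal sketch of this step follows. Note that sns.load_dataset fetches the file from seaborn's online data repository on first use, so it needs network access. The inspection calls at the end are an addition for illustration.

```python
import pandas as pd
import seaborn as sns

# Downloads (and caches) the dataset on first use; requires network access.
df: pd.DataFrame = sns.load_dataset("mpg")

print(df.shape)
print(df.head())
print(df.isna().sum())  # shows which columns contain missing values
```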

Step 3: Data Preprocessing

We will convert categorical features into numerical values using one-hot encoding.
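
The encoding step can be sketched with pd.get_dummies. To keep the snippet runnable without a download, it uses a few hand-written stand-in rows with mpg-style column names; the values are illustrative only. With the real dataset, the same call applies to the loaded DataFrame.

```python
import pandas as pd

# Hand-written stand-in rows with mpg-style columns (illustrative values only).
df = pd.DataFrame({
    "mpg": [18.0, 24.0, 27.0, 26.0],
    "cylinders": [8, 4, 4, 4],
    "origin": ["usa", "japan", "japan", "europe"],
})

X = df.drop(columns=["mpg"])  # features
y = df["mpg"]                 # regression target

# One-hot encode the categorical column; drop_first removes one redundant level.
X_encoded = pd.get_dummies(X, columns=["origin"], drop_first=True, dtype=int)
print(X_encoded.columns.tolist())  # ['cylinders', 'origin_japan', 'origin_usa']
```

Dropping the first level is a choice rather than a requirement for tree models; it simply avoids a column that the others already determine.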

Step 4: Splitting Data

Split the data into training and testing sets, using 70% of the rows for training and the remaining 30% for testing.
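
A 70/30 split can be sketched with scikit-learn's train_test_split. The toy arrays and the random_state value are assumptions for illustration; with the real data, X and y would be the encoded features and the mpg column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature matrix and target (10 rows, 2 features).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 holds out 30% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```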

Step 5: Training XGBoost Regressor

We train the XGBoost regressor on the training set and evaluate it on the test set using RMSE and R².

Output:

RMSE: 2.967
R²: 0.834

Step 6: Hyperparameter Tuning

We use GridSearchCV to search over hyperparameter combinations and select the best-performing model.

Output:

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8}

Step 7: Feature Plotting

We plot the most important features of the trained model.

Output:

[Figure: Visualizing top features]

Limitations

  • Computationally intensive: Can be slow to train on very large datasets, especially with many trees.
  • Parameter tuning required: Requires careful tuning of hyperparameters (e.g., learning rate, max depth, number of estimators) for optimal performance.
  • Memory consumption: Can use a lot of RAM for large datasets or deep trees.
  • Less interpretable: Compared to linear regression, the final model is harder to interpret.