Train a model using LightGBM

Last Updated : 26 Mar, 2026

LightGBM is a gradient boosting framework that builds an ensemble of decision trees. Unlike traditional boosting implementations, which usually grow trees level-wise, it grows them leaf-wise (best-first). Its key features are:

Leaf-wise tree growth (usually lower loss for the same number of leaves)
Histogram-based learning (faster computation)
Efficient handling of large datasets
Supports parallel and distributed training

Note: Leaf-wise growth can overfit, especially on small datasets. Parameters such as num_leaves, max_depth and min_data_in_leaf limit tree complexity and help control this.

Core Techniques Used in LightGBM

1. Histogram-Based Learning

LightGBM converts continuous data into discrete bins, which:

Reduces memory usage
Speeds up training
Avoids repeated sorting
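
The binning idea can be sketched in a few lines of NumPy. This is only an illustration of the principle, not LightGBM's internal binning code: the random feature, the made-up gradients and the choice of 16 quantile bins are all assumptions (LightGBM's max_bin parameter defaults to 255).

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)      # one continuous feature (synthetic)
gradients = rng.normal(size=1000)    # per-row gradients (made up for the sketch)

n_bins = 16  # illustrative value only

# Interior bin edges taken from quantiles, so bins hold roughly equal row counts
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.digitize(feature, edges)  # integer bin id in [0, n_bins - 1]

# One pass over the data accumulates gradient sums and row counts per bin
grad_hist = np.bincount(bin_idx, weights=gradients, minlength=n_bins)
count_hist = np.bincount(bin_idx, minlength=n_bins)

# Split candidates are now the n_bins - 1 bin boundaries rather than every
# distinct feature value; prefix sums give the left-hand statistics of each
left_grad = np.cumsum(grad_hist)[:-1]
left_count = np.cumsum(count_hist)[:-1]

print("rows per bin:", count_hist)
```

Because the bin indices are computed once, every later split search works on the small histograms instead of re-sorting the raw values.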

2. Leaf-wise Tree Growth

Instead of splitting all nodes level by level, LightGBM:

Splits the leaf with the maximum gain (largest loss reduction)
Builds deeper, unbalanced trees that usually reach a lower loss for the same number of leaves
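
A toy version of best-first growth on a single feature might look like the sketch below. It assumes a squared-error gain and uses hypothetical helper names (best_split, grow_leaf_wise); LightGBM's real implementation works on histograms and gradients, so treat this only as an illustration of the "always split the best leaf" rule.

```python
import heapq
import numpy as np

def best_split(x, y):
    """Return (gain, threshold) of the best squared-error split, or (0.0, None)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    base = ((ys - ys.mean()) ** 2).sum()
    best_gain, best_thr = 0.0, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if base - sse > best_gain:
            best_gain, best_thr = base - sse, (xs[i] + xs[i - 1]) / 2
    return best_gain, best_thr

def grow_leaf_wise(x, y, num_leaves):
    """Repeatedly split the leaf with the largest gain until num_leaves is reached."""
    counter = 0  # unique tie-breaker so the heap never compares arrays
    gain, thr = best_split(x, y)
    heap = [(-gain, counter, thr, x, y)]  # max-heap on gain via negation
    finished = []                         # leaves that cannot be split further
    while heap and len(heap) + len(finished) < num_leaves:
        _, _, thr, lx, ly = heapq.heappop(heap)
        if thr is None:
            finished.append((lx, ly))
            continue
        for mask in (lx <= thr, lx > thr):
            counter += 1
            g, t = best_split(lx[mask], ly[mask])
            heapq.heappush(heap, (-g, counter, t, lx[mask], ly[mask]))
    return finished + [(lx, ly) for _, _, _, lx, ly in heap]

x = np.linspace(0, 1, 50)
y = np.sin(6 * x)
leaves = grow_leaf_wise(x, y, num_leaves=4)
print("leaf sizes:", [len(lx) for lx, _ in leaves])
```

A level-wise learner would instead split every leaf at the current depth before moving on, whether or not each split is worthwhile.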

3. Gradient-Based One-Side Sampling (GOSS)

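Keeps all data points with large gradients (the instances the model currently fits worst)
Randomly samples from the small-gradient data and up-weights the sampled points to compensate
Improves training efficiency without much loss in accuracy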
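
The sampling step can be sketched with NumPy as follows. The function name goss_sample is hypothetical, and the rates 0.2 and 0.1 are chosen to match the defaults of LightGBM's top_rate and other_rate parameters; the re-weighting factor (1 - top_rate) / other_rate follows the description of GOSS in the LightGBM paper. This is a simplified sketch, not the library's code.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, weights) selected by a simplified GOSS step."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rows ordered by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]                      # always kept
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Sampled small-gradient rows are up-weighted so their total
    # contribution approximates that of the full small-gradient set
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

grads = np.random.default_rng(1).normal(size=1000)  # synthetic gradients
idx, w = goss_sample(grads)
print(len(idx), "of", len(grads), "rows used")
```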

4. Exclusive Feature Bundling (EFB)

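Bundles mutually exclusive sparse features (features that are rarely non-zero on the same row) into a single feature
Reduces the effective number of features
Speeds up histogram construction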
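
A minimal sketch of the bundling idea, assuming two tiny hand-made integer features that are never non-zero on the same row. LightGBM itself bundles binned features and tolerates a small rate of conflicts, so this only shows the offset trick that keeps the original values recoverable.

```python
import numpy as np

# Two sparse features that are never non-zero on the same row (toy data)
f1 = np.array([0, 3, 0, 0, 1, 0])   # non-zero values in 1..3
f2 = np.array([2, 0, 0, 1, 0, 0])   # non-zero values in 1..2

assert not np.any((f1 != 0) & (f2 != 0))  # mutually exclusive

# Shift f2 past f1's value range so both fit into one column
offset = f1.max()
bundle = np.where(f2 != 0, f2 + offset, f1)
# bundle values: 0 = both zero, 1..3 came from f1, 4..5 came from f2

# The original features can be recovered from the bundle
f1_back = np.where(bundle <= offset, bundle, 0)
f2_back = np.where(bundle > offset, bundle - offset, 0)

print(bundle)
```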

Implementation to train a model using LightGBM

1. Install and Import Libraries

To train a model using LightGBM, we first need to install it in our runtime. The leading ! runs the command from a notebook cell.

!pip install lightgbm

Importing required libraries

We import the required Python libraries: NumPy, pandas, Seaborn, Matplotlib and scikit-learn.

2. Load Dataset and Preprocessing

The scikit-learn breast cancer dataset is loaded and split into training and testing sets using stratified sampling, so that both sets keep the same class proportions.
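
A minimal sketch of this step. The 80/20 split and random_state=42 are assumptions; the article does not state the values it used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data as a pandas DataFrame (features) and Series (target)
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed values
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```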

3. Exploratory Data Analysis (EDA)

Target Distribution: This plot helps check whether the dataset is balanced or imbalanced.
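
One way to draw such a plot, shown here as a sketch with plain Matplotlib (the original may have used Seaborn; the figure size and labels are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)

# Count rows per class; in this dataset 0 = malignant, 1 = benign
counts = data.target.value_counts().sort_index()
labels = [data.target_names[i] for i in counts.index]

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(labels, counts.values)
ax.set_xlabel("Class")
ax.set_ylabel("Number of samples")
ax.set_title("Target class distribution")
plt.tight_layout()
plt.show()
```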

Output:

[Figure: Target class distribution of the scikit-learn breast cancer dataset]

Correlation Matrix: The heatmap shows relationships between features and helps identify highly correlated variables.
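
A sketch of how the heatmap could be produced with pandas and Seaborn. The colour map and figure size are assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

# Pairwise Pearson correlation between all 30 features
corr = X.corr()

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0, ax=ax)
ax.set_title("Correlation Matrix")
plt.tight_layout()
plt.show()
```

Strongly correlated pairs, such as mean radius and mean perimeter, show up as dark off-diagonal cells.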

Output:

[Figure: Correlation Matrix]

4. Creating LightGBM dataset

LightGBM uses its own optimized dataset format for faster training and better memory usage.

5. Define Hyperparameters

These parameters control model learning, complexity, regularization and performance.
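
An illustrative parameter dictionary is sketched below. Only the objective and the two metrics can be inferred from the training log in this article; every numeric value is an assumption, chosen from commonly used starting points rather than taken from the original code.

```python
params = {
    "objective": "binary",                   # binary classification
    "metric": ["auc", "binary_logloss"],     # metrics reported during training
    "boosting_type": "gbdt",
    "learning_rate": 0.05,                   # assumed
    "num_leaves": 31,                        # main control of tree complexity
    "max_depth": -1,                         # -1 means no depth limit
    "min_data_in_leaf": 20,                  # guards against tiny leaves
    "feature_fraction": 0.8,                 # column subsampling per tree (assumed)
    "bagging_fraction": 0.8,                 # row subsampling (assumed)
    "bagging_freq": 5,                       # re-sample rows every 5 iterations
    "lambda_l1": 0.0,                        # L1 regularization
    "lambda_l2": 0.0,                        # L2 regularization
    "verbose": -1,                           # silence LightGBM's own messages
    "seed": 42,
}
```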

6. Train Model (Latest API)

The model is trained with early stopping, which halts training once the validation metrics have not improved for 50 rounds, and with per-iteration logging disabled for cleaner output.

Output:

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[22] train's auc: 0.996956 train's binary_logloss: 0.238247 valid's auc: 0.993056 valid's binary_logloss: 0.257051

7. Predictions

The model outputs predicted probabilities for the positive class, which are converted into binary predictions using a threshold of 0.5.

8. Model Evaluation

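Accuracy, precision, recall and F1 score are computed from the thresholded predictions. AUC is computed from the predicted probabilities and measures how well the model ranks positive cases above negative ones, independent of the threshold.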

Output:

Accuracy: 0.9473684210526315
Precision: 0.9583333333333334
Recall: 0.9583333333333334
F1 Score: 0.9583333333333334
AUC: 0.9930555555555556

9. Classification Report

Provides a detailed summary of precision, recall and F1-score for each class.

Output:

[Figure: Classification Report]

10. Feature Importance

Shows which features contribute most to the model’s predictions.

Output:

[Figure: Top Features]

11. Cross-Validation

Cross-validation checks whether the model performs consistently across different data splits, which gives a more reliable estimate than a single train-test split.

Output:

[Figure: Output after applying cross-validation]

URL: https://www.geeksforgeeks.org/machine-learning/train-a-model-using-lightgbm/