![]() |
VOOZH | about |
LightGBM is a machine learning tool created by Microsoft used for tasks like classification, regression, ranking and more. It is a gradient boosting technique where decision trees are built step by step to make predictions more accurate. Compared to other gradient boosting tools it is very fast and uses less memory.
One of the reasons why LightGBM is so fast is the way it handles and stores data before building trees. This special method of storing and organizing data is called LightGBM data structure.
LightGBM does not use regular data formats like Pandas DataFrame or NumPy arrays directly during training. Instead it converts data into a binary format called the Dataset object. This object is made to be very fast and light.
Before any model can be trained, data must be stored in memory but the way this data is stored can affect:
LightGBM uses a special structure to store and process data efficiently.
In LightGBM, your input data is turned into a special structure called lightgbm.Dataset. This structure:
You can create a Dataset in LightGBM like this:
Once the data is converted into a Dataset, LightGBM will not need to look back at the original data. Everything is stored in a way that is fast to use.
Let’s explore the main ideas behind LightGBM’s data structure in simple terms:
One of the most important features in LightGBM is called histogram-based binning.
Example: You have a column in your dataset with 1000 different numbers. Instead of checking all 1000 values, LightGBM creates bins or buckets like it might group values into 255 bins. Now instead of remembering all 1000 values it only remembers which bin each value belongs to.
This reduces:
By default LightGBM uses 255 bins per feature.
LightGBM stores data in a column-major format. This means:
This is different from row-wise storage used by other tools, which is slower for finding the best splits in decision trees.
Since the data is binned into integers (bin IDs), LightGBM can store them using 8-bit or even 4-bit integers instead of 32-bit or 64-bit floats. This reduces the memory cost by a large amount.
For example:
This makes LightGBM able to train on large datasets using less memory.
You can create and use a Dataset object easily in LightGBM. Here’s a simple python example:
Once created the Dataset is ready for training.
Now that data is stored in bins and grouped by columns it starts building trees. Here’s what happens:
Because the data is already binned and compressed these steps happen very quickly.
Here are some benefits of LightGBM’s data structure:
Here are some simple tips when working with LightGBM and its Dataset structure: