When we look at a dataset to which we need to apply machine learning algorithms, we often see different types of values across its features. Some are categorical, such as features containing "1, 2, 3" or "True or False", while others are continuous, such as the blood pressure of patients, which can take a range of values. Because data is often collected without much consideration for its structure or format, it can present challenges for those tasked with analyzing and interpreting it. In such situations it is often useful to convert continuous values to discrete ones when using machine learning algorithms that perform better on categorical data. The performance of some machine learning algorithms may also suffer when continuous features follow a non-standard probability distribution. This is where 'KBinsDiscretizer' comes into the picture.
'KBinsDiscretizer' is a data preprocessing transformer in the sklearn library that converts continuous data into bins and encodes those bins as discrete values. This is helpful when building machine learning models that work on discrete rather than continuous data. 'KBinsDiscretizer' computes the bin edges for each feature according to its 'strategy' parameter. We first initialize 'KBinsDiscretizer' with the desired parameter values and then fit it on the data we want to transform. Fitting determines the bin edges, and transforming then assigns each continuous value to a bin. Finally, the binned data is encoded according to the 'encode' parameter, which we discuss next. Used this way, 'KBinsDiscretizer' can be a valuable preprocessing step that may improve overall model performance.
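The initialize, fit, transform sequence described above can be sketched as follows. This is a minimal illustration, not the article's original listing: the toy array and the choice of five bins are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data: a single continuous feature with six samples.
X = np.array([[0.0], [1.0], [3.0], [5.0], [9.0], [10.0]])

# 1. Initialize the discretizer with the desired parameters.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")

# 2. Fit: the bin edges are learned separately for each feature.
disc.fit(X)
edges = disc.bin_edges_[0]  # uniform edges for feature 0: 0, 2, 4, 6, 8, 10

# 3. Transform: every value is replaced by the index of its bin.
X_binned = disc.transform(X)

print("Bin edges:", edges)
print("Binned values:", X_binned.ravel())
```

The fitted 'bin_edges_' attribute is a convenient way to check what the chosen strategy actually did before the binned data is passed to a model.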
'KBinsDiscretizer' takes a number of parameters that we are going to discuss now.
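As a rough illustration of the 'encode' parameter, the sketch below bins the same feature three times and changes only the encoding. The toy values and the three-bin setting are assumptions chosen so that each sample falls into a different bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy feature: each of the three values lands in its own bin.
X = np.array([[1.0], [4.0], [9.0]])

def bin_with(encode):
    """Discretize X into 3 equal-width bins using the given encoding."""
    disc = KBinsDiscretizer(n_bins=3, encode=encode, strategy="uniform")
    return disc.fit_transform(X)

ordinal = bin_with("ordinal")      # one column holding the bin index
dense = bin_with("onehot-dense")   # one 0/1 column per bin, as a dense array
sparse = bin_with("onehot")        # same one-hot layout, as a sparse matrix

print(ordinal)
print(dense)
print(sparse.toarray())
```

Ordinal encoding keeps the number of columns unchanged and preserves bin order, while one-hot encoding adds one column per bin and treats the bins as unordered categories.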
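The strategy parameter of 'KBinsDiscretizer' determines how the data is divided into discrete bins. It defines the bin edges, and therefore the bin widths. Each value of strategy bins the data in its own way. Let's discuss the different strategy values of KBinsDiscretizer in a little detail:
'uniform': all bins of a feature have the same width.
'quantile': bin edges are placed at quantiles, so every bin of a feature holds roughly the same number of samples.
'kmeans': bin edges are derived from one-dimensional k-means clustering, so the values in a bin share the same nearest cluster center.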
Output:
[[0. 0. 0.]
[1. 2. 2.]
[2. 2. 2.]]
Here we will implement all the strategies of KBinsDiscretizer in Scikit-Learn to demonstrate how each strategy discretizes data.
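The listing for this comparison is not reproduced here, so the following is a reconstruction rather than the original code. The data is copied from the "Original Data" printout, while 'n_bins=5' and 'encode="ordinal"' are assumptions inferred from the printed bin labels 0 to 4. The 'quantile' and 'kmeans' results may differ slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Data copied from the "Original Data" printout.
data = np.array([[1, 2, 63],
                 [4, 5, 9],
                 [7, 8, 0],
                 [7, 5, 21],
                 [8, 6, 4]])
print("Original Data:")
print(data)

results = {}
for strategy in ["uniform", "quantile", "kmeans"]:
    # Assumed settings: five bins per feature, ordinal (integer) encoding.
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    # Each column is binned independently of the others.
    results[strategy] = disc.fit_transform(data)
    print(f"Bins using '{strategy}' strategy:")
    print(results[strategy])
```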
Output:
Original Data:
[[ 1 2 63]
[ 4 5 9]
[ 7 8 0]
[ 7 5 21]
[ 8 6 4]]
Bins using 'uniform' strategy:
[[0. 0. 4.]
[2. 2. 0.]
[4. 4. 0.]
[4. 2. 1.]
[4. 3. 0.]]
Bins using 'quantile' strategy:
[[0. 0. 4.]
[1. 2. 2.]
[3. 4. 0.]
[3. 2. 3.]
[4. 3. 1.]]
Bins using 'kmeans' strategy:
[[0. 0. 4.]
[1. 1. 2.]
[3. 4. 0.]
[3. 1. 3.]
[4. 3. 1.]]
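Explanation of the Above Code:
Each column (feature) is binned independently, and the labels 0 to 4 in the output are the ordinally encoded bin indices.
With 'uniform', the bins have equal width, so the outlier 63 in the third column stretches the range and the values 9, 0 and 4 all land in bin 0.
With 'quantile', each bin holds roughly the same number of samples, so the five values of the third column are spread across all five bins.
With 'kmeans', bin edges are placed between one-dimensional k-means cluster centers, so values that lie close together tend to share a bin.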
Here we will use the Iris dataset to compare the accuracy of species prediction with and without discretization of the data. The Iris dataset contains features that we can use to determine the species of a given flower: Sepal Length in cm, Sepal Width in cm, Petal Length in cm and Petal Width in cm. The output we have to classify is the column named 'Species', which tells us which iris species the given flower measurements belong to. We will discretize just two features, Sepal Length in cm and Petal Length in cm, to show the effect of discretization on model accuracy. We will model the dataset with a Decision Tree Classifier, which can work well on discrete data. We will write the code using the 'sklearn' library and visualize the continuous data and the binned data. The link to the Iris dataset is present here - link.
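A minimal sketch of this comparison is shown below. It is not the article's original code: it loads Iris through 'load_iris' from scikit-learn instead of the linked file, and the train/test split, 'random_state', 'n_bins=3' and the 'uniform' strategy are all assumptions. The accuracies it prints depend on those choices, so no particular result is implied.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Columns: 0 sepal length, 1 sepal width, 2 petal length, 3 petal width.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: decision tree on the raw continuous features.
baseline = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc_raw = accuracy_score(y_test, baseline.predict(X_test))

# Discretize only sepal length (0) and petal length (2); pass the rest through.
binner = ColumnTransformer(
    [("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"), [0, 2])],
    remainder="passthrough",
)
binned_model = make_pipeline(binner, DecisionTreeClassifier(random_state=42))
binned_model.fit(X_train, y_train)
acc_binned = accuracy_score(y_test, binned_model.predict(X_test))

print(f"Accuracy without discretization: {acc_raw:.3f}")
print(f"Accuracy with discretization:    {acc_binned:.3f}")
```

Wrapping the discretizer in a pipeline means the bin edges are learned from the training split only, which avoids leaking information from the test split.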
This code transforms a two-dimensional dataset loaded from the Iris dataset using the KBinsDiscretizer from scikit-learn. Different discretization strategies (uniform, quantile, and kmeans) are applied to the data, and the effects of these strategies on the distribution of the data are seen in three distinct subplots using contour plots.
Three discretization strategies are defined by this code: "uniform," "quantile," and "kmeans." After loading the Iris dataset, it chooses to visualize the first two features, which are kept in the X_data variable. The following code will discretize these features according to the prescribed strategies, enabling comparisons of the effects of various discretization techniques on the distribution of data.
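The visualization described here can be sketched roughly as follows. This is an illustrative reconstruction, not the original listing: 'n_bins=4', the grid resolution, the figure size, the non-interactive backend and the output filename are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

strategies = ["uniform", "quantile", "kmeans"]
X_data = load_iris().data[:, :2]  # first two features of Iris

# Mesh grid covering the observed range of both features.
xx, yy = np.meshgrid(
    np.linspace(X_data[:, 0].min(), X_data[:, 0].max(), 300),
    np.linspace(X_data[:, 1].min(), X_data[:, 1].max(), 300),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, len(strategies), figsize=(15, 4))
for ax, strategy in zip(axes, strategies):
    # Fit on the data, then bin every grid point to reveal the bin boundaries.
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    disc.fit(X_data)
    grid_encoded = disc.transform(grid)

    # One contour layer per feature: vertical stripes, then horizontal stripes.
    ax.contourf(xx, yy, grid_encoded[:, 0].reshape(xx.shape), alpha=0.5)
    ax.contourf(xx, yy, grid_encoded[:, 1].reshape(xx.shape), alpha=0.5)
    ax.scatter(X_data[:, 0], X_data[:, 1], edgecolors="k")
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(f"strategy='{strategy}'")

plt.tight_layout()
fig.savefig("kbins_strategies.png")  # hypothetical output filename
```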
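Output: (figure not reproduced here: three side-by-side contour plots, one per strategy, with the Iris data points overlaid)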
This code creates a figure with subplots to showcase the impact of three discretization strategies (uniform, quantile, and k-means) applied to the first two features of the Iris dataset. It generates mesh grids for contour plots by defining a set of points within the range of the selected features. The code iterates through each discretization strategy, transforming the mesh grid using KBinsDiscretizer, and displays the resulting contour plot alongside the original scattered data points. The subplots are organized in a row, with adjustments made for proper layout, tick labels, and titles, providing a clear comparison of the effects of different discretization strategies on the data visualization.
Choosing a strategy for 'KBinsDiscretizer' is one of the most important decisions, because it determines how our continuous data is divided into discrete bins. The choice of strategy affects how our machine learning model performs, so it deserves careful attention. Here are some of the advantages and disadvantages of using a specific strategy in 'KBinsDiscretizer':
Advantages:
Disadvantages:
In this demonstration, we looked at the binning strategies that KBinsDiscretizer in Scikit-Learn provides. On a two-dimensional dataset taken from the Iris dataset, the code demonstrated the effects of the "uniform," "quantile," and "kmeans" strategies. We saw how these methods discretize the data by using ordinal encoding and a grid for visualization. The "uniform" strategy divides each feature's range into equal-width bins, the "quantile" strategy places bin edges at quantiles so that each bin holds roughly the same number of samples, and the "kmeans" strategy uses one-dimensional clustering to identify bin boundaries. This comparison showed that the choice of discretization strategy strongly affects how the data is distributed across bins and how it is interpreted. KBinsDiscretizer is most useful when researchers and data analysts choose the strategy that suits the properties of their dataset and their analytical objectives.