Data is the cornerstone of any analytical or machine-learning endeavor. However, real-world datasets are rarely perfect: they often contain missing values, which can cause errors during training and lead to biased or inaccurate results in data analyses and machine learning models. Strategies for dealing with missing values include imputation (replacing missing values with estimated or calculated values), removal of incomplete records, and more advanced techniques such as multiple imputation. Addressing missing values is an essential part of data cleaning and preparation. In this article, we will discuss how to handle missing values with the CatBoost model.
CatBoost (categorical boosting) is a machine learning algorithm developed by Yandex, a Russian multinational IT company. It is built on the gradient boosting framework and handles categorical features more effectively than traditional gradient boosting algorithms, using techniques such as ordered boosting, oblivious trees, and advanced handling of categorical variables to achieve high performance with minimal hyperparameter tuning. CatBoost also has a built-in hyperparameter, nan_mode, for handling missing values in numerical features, which lets us train on a dataset that contains missing values without a separate imputation step.
Missing values refer to the absence of data for certain observations or variables in a dataset. They can occur for various reasons, ranging from errors during data collection to intentional omissions, and they need to be handled carefully to obtain an accurate predictive model. Missing values are commonly represented in two ways in datasets, which are discussed below:
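As a small illustration of how missing entries look in practice, the sketch below assumes a pandas DataFrame. The toy column values are hypothetical. pandas treats both None and numpy.nan as missing, so isnull() flags either one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: None in a text column, np.nan in a numeric column.
df = pd.DataFrame({
    "Alley": ["Grvl", None, "Pave", None],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
})

# isnull() returns True for both None and NaN entries.
missing_mask = df.isnull()

# Summing the boolean mask gives the number of missing entries per column.
missing_counts = missing_mask.sum()
print(missing_counts)
```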
First, we need to install the CatBoost module in our runtime before proceeding further.
!pip install catboost
Now we will import all the required Python libraries, such as NumPy, Pandas, Matplotlib, Seaborn and scikit-learn.
Now we load a dataset from Kaggle. Then we will split it into training and testing sets (80:20) and prepare the categorical features that will be fed to CatBoost during training.
This code loads the Kaggle House Prices dataset and prepares it for modeling. Categorical features are converted to strings, and the data is then divided into features (X) and the target variable (y). The dataset is further split into training and testing sets in an 80:20 ratio. The variable categorical_features_indices holds the indices of the categorical features, which CatBoost needs to be told about during training.
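The preparation steps described above can be sketched roughly as follows. This is not the article's original code: the helper name prepare_data, the random seed, the rule "every non-numeric column is categorical", and the "Missing" placeholder for empty categories are all assumptions made for illustration. A tiny synthetic frame stands in for the Kaggle file so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data(data, target="SalePrice", test_size=0.2, seed=42):
    """Split a DataFrame 80:20 and locate its categorical columns."""
    X = data.drop(columns=[target])
    y = data[target]

    # Assumption: treat every non-numeric column as categorical.
    cat_cols = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]

    # Cast categorical columns to strings; missing categories get a
    # placeholder label (an assumption, not taken from the article).
    for col in cat_cols:
        X[col] = X[col].fillna("Missing").astype(str)

    # Positions of the categorical columns, as CatBoost expects them.
    cat_indices = [X.columns.get_loc(c) for c in cat_cols]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    return X_train, X_test, y_train, y_test, cat_indices


# Synthetic stand-in; with the real data you would pass the loaded DataFrame.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, 60.0, np.nan, 75.0, 85.0, 50.0, 90.0],
    "Alley": ["Grvl", None, "Pave", None, "Grvl", "Pave", None, "Grvl", "Pave", None],
    "SalePrice": [208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000],
})

X_train, X_test, y_train, y_test, categorical_features_indices = prepare_data(toy)
print(X_train.shape, X_test.shape, categorical_features_indices)
```

Numeric columns are left untouched here, so their NaN entries remain in place for CatBoost to handle during training.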
Exploratory Data Analysis (EDA) helps us gain deeper insight into the dataset.
Checking missing values
This step is directly relevant to this article and is important for any dataset. Missing values affect the model's predictions if they are not handled correctly. Here, we will see which columns of our dataset contain missing values, along with their total counts.
Output:
Columns with missing values:
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
LotFrontage 259
GarageYrBlt 81
GarageCond 81
GarageType 81
GarageFinish 81
GarageQual 81
BsmtFinType2 38
BsmtExposure 38
BsmtQual 37
BsmtCond 37
BsmtFinType1 37
MasVnrArea 8
MasVnrType 8
Electrical 1
dtype: int64
This code counts the null values in each column of the 'data' DataFrame. The columns are sorted in descending order by their number of missing values, and only those with a count greater than zero are printed.
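A minimal sketch of the check described above is shown here. The helper name missing_value_report and the toy frame are hypothetical; on the real dataset the function would be called on the loaded 'data' DataFrame.

```python
import numpy as np
import pandas as pd


def missing_value_report(data):
    """Return per-column missing counts, largest first, zeros dropped."""
    counts = data.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Hypothetical toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "PoolQC": [None, None, "Gd", None],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

report = missing_value_report(toy)
print("Columns with missing values:")
print(report)
```

Columns without any missing entries, such as LotArea in this toy frame, are left out of the report.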
Distribution of target variable
Visualizing the distribution of the target variable helps us spot potential problems in the dataset. In our dataset, the target variable is 'SalePrice'.
Output: [figure: histogram of 'SalePrice' with a KDE curve]
Using Seaborn, this code generates a histogram showing the distribution of the 'SalePrice' variable in the 'data' DataFrame. Setting kde=True overlays a Kernel Density Estimate curve, which gives a smooth depiction of the distribution.
To train the CatBoost model, we first create training and testing pools. A Pool is CatBoost's own dataset container, distinct from a NumPy array or pandas DataFrame, and its internal training routines are optimized for it. After that, we specify the hyperparameters used to train the model. Here we also handle missing values with CatBoost's built-in nan_mode hyperparameter.
Now we will evaluate our model in terms of MAE and R2 score, two of the most common regression metrics.
Output:
Mean Absolute Error (MAE): 17666.19
R2 Score: 0.9000
This code uses the trained model (model) to make predictions on the test set. The model's performance on the test data is then assessed with the Mean Absolute Error (MAE), which measures the average size of the prediction error, and the R-squared (R2) score, which measures goodness of fit.
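The two metrics can be computed with scikit-learn as in this sketch. The y_test and y_pred arrays are small made-up values so the arithmetic is easy to follow; in the article's setting y_pred would come from model.predict on the test pool.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted prices (each prediction is off by 10,000).
y_test = np.array([200000, 150000, 300000, 250000])
y_pred = np.array([210000, 140000, 290000, 260000])

# MAE: mean of the absolute differences between truth and prediction.
mae = mean_absolute_error(y_test, y_pred)

# R2: 1 minus (residual sum of squares / total sum of squares).
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R2 Score: {r2:.4f}")
```

For these toy values the MAE is 10000 and the R2 score is 0.968, since the residual sum of squares is 4e8 against a total sum of squares of 1.25e10.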
We can conclude that missing values are very common in real-world datasets and need to be handled carefully, as they can degrade a model's performance. CatBoost has a built-in mechanism for handling missing values during training. Our model achieved an R2 score of 0.90 on the test set, which suggests that it copes well with the missing values in this dataset. Hyperparameter tuning could improve the results further.