In this article, we will learn how to develop a machine learning model using Python which can predict the number of calories a person has burnt during a workout based on some biological measures.
Python libraries make it easy for us to handle the data and perform typical and complex tasks with a single line of code.
Refer to the links given below for the dataset used in the article:
To proceed with the model, you need to merge the two datasets. Refer to the link below to see how to merge two datasets.
How to join datasets with same columns and select one using Pandas?
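As a rough illustration of the merge step, the sketch below joins two small in-memory frames on a shared key. The frame names, the `User_ID` key, and the values are all assumptions made for the example; the article does not state the real file or column names, so adapt them to the downloaded data.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; names and values are made up.
exercise = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Duration": [25.0, 12.0, 6.0],
    "Heart_Rate": [104.0, 95.0, 87.0],
})
calories = pd.DataFrame({
    "User_ID": [3, 1, 2],
    "Calories": [30.0, 200.0, 70.0],
})

# Inner join on the shared key. validate="one_to_one" raises an error
# if the key is duplicated in either frame.
df = exercise.merge(calories, on="User_ID", how="inner", validate="one_to_one")
print(df)
```

With real files, the two frames would come from `pd.read_csv` instead of being built inline.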
Now let's load the dataset into a pandas DataFrame and print its first five rows.
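A minimal sketch of this step is shown below. It reads from an inline string so that it runs on its own; the rows are made up, and every column name other than Height, Weight, Duration, Body_Temp and Calories is an assumption. With the merged file on disk you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up rows standing in for the merged CSV (illustrative values only).
csv_text = """User_ID,Gender,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
1,male,40,180.0,80.0,25.0,104.0,40.6,200.0
2,female,25,165.0,58.0,12.0,95.0,40.2,70.0
3,male,60,175.0,77.0,6.0,87.0,39.1,30.0
4,female,33,170.0,65.0,18.0,99.0,40.4,110.0
5,female,29,160.0,55.0,9.0,90.0,39.8,45.0
6,male,51,185.0,90.0,21.0,101.0,40.5,150.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)
```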
Output:
Now let's check the size of the dataset.
Output:
(15000, 9)
Let's check which column of the dataset contains which type of data.
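One way to inspect the column types is `DataFrame.info()`, sketched here on a tiny made-up frame. The Gender and Age column names are assumptions for the example.

```python
import pandas as pd

# Tiny illustrative frame; the real dataset has more rows and columns.
df = pd.DataFrame({
    "Gender": ["male", "female"],
    "Age": [40, 25],
    "Body_Temp": [40.6, 40.2],
})

# info() prints each column's non-null count and dtype.
df.info()

# The same dtype information as a plain dictionary.
dtype_names = df.dtypes.astype(str).to_dict()
print(dtype_names)
```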
Output:
Now we will check the descriptive statistical measures of the data.
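The descriptive statistics can be obtained with `DataFrame.describe()`. The sketch below uses a small made-up frame so that it runs on its own.

```python
import pandas as pd

# Made-up values, only to show the shape of the summary table.
df = pd.DataFrame({
    "Duration": [5.0, 10.0, 15.0, 20.0],
    "Calories": [20.0, 60.0, 110.0, 170.0],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = df.describe()
print(summary)
```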
Output:
Exploratory data analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.
Output:
So, there is a roughly linear relationship between these two features, which is to be expected.
Output:
As expected, the longer the workout, the more calories are burnt. Beyond that, we cannot observe any clear relationship between calories burnt and the height or weight features.
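A comparison like the one described above can be sketched with one scatter plot per feature against Calories. The data below is synthetic and built so that only Duration drives Calories; it is not the article's dataset, and the plotting layout is an assumption rather than the original code.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: Calories depends on Duration only.
duration = rng.uniform(1, 30, n)
df = pd.DataFrame({
    "Height": rng.normal(170, 10, n),
    "Weight": rng.normal(70, 12, n),
    "Duration": duration,
    "Calories": 7 * duration + rng.normal(0, 5, n),
})

# One panel per feature, all sharing the Calories axis.
features = ["Height", "Weight", "Duration"]
fig, axes = plt.subplots(1, len(features), figsize=(12, 4), sharey=True)
for ax, col in zip(axes, features):
    ax.scatter(df[col], df["Calories"], s=5, alpha=0.5)
    ax.set_xlabel(col)
axes[0].set_ylabel("Calories")
fig.tight_layout()
```

In an interactive session, `plt.show()` displays the figure.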
Here we can make some real-life observations:
Output:
The distributions of the continuous features are close to normal, except for a few features such as Body_Temp and Calories.
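As a rough numeric complement to the visual check, sample skewness can flag columns that depart from a symmetric, normal-like shape. This is a swapped-in technique, not the method that produced the output above, and the data is synthetic: one column is symmetric by construction and the other is right-skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Synthetic columns chosen to contrast a symmetric and a skewed distribution.
df = pd.DataFrame({
    "Height": rng.normal(170, 10, n),    # symmetric by construction
    "Calories": rng.exponential(90, n),  # right-skewed by construction
})

# Skewness near 0 suggests symmetry; large positive values suggest a long right tail.
skewness = df.skew()
print(skewness.round(2))
```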
Output:
Output:
Here we have a serious problem of data leakage, as there is a feature that is very highly correlated with the target column, Calories.
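One way to find such a feature is to look at each column's correlation with the target and flag anything above a cut-off. The sketch below uses synthetic data, and the 0.9 threshold is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data: Calories is almost a linear function of Duration.
duration = rng.uniform(1, 30, n)
df = pd.DataFrame({
    "Height": rng.normal(170, 10, n),
    "Duration": duration,
    "Calories": 7 * duration + rng.normal(0, 5, n),
})

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()["Calories"].drop("Calories")
suspicious = corr_with_target[corr_with_target.abs() > THRESHOLD].index.tolist()
print(suspicious)
```

Flagged columns are candidates for closer inspection or removal before training.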
Now we will separate the features and the target variable and split them into training and validation data, which we will use to select the model that performs best on the validation data.
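A minimal sketch of the split using scikit-learn's `train_test_split` follows. The 10% validation share is inferred from the 13500/1500 shapes in the output; the `random_state` value and the placeholder feature names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic frame with five placeholder features and a Calories target.
df = pd.DataFrame(
    rng.normal(size=(100, 6)),
    columns=["f1", "f2", "f3", "f4", "f5", "Calories"],
)

# Separate the features from the target.
features = df.drop(columns=["Calories"])
target = df["Calories"]

# Hold out 10% of the rows for validation.
X_train, X_val, Y_train, Y_val = train_test_split(
    features, target, test_size=0.1, random_state=22
)
print(X_train.shape, X_val.shape)
```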
Output:
((13500, 5), (1500, 5))
Now, let's normalize the data to obtain stable and fast training.
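The scaling step can be sketched with scikit-learn's `StandardScaler`, assuming standardization (zero mean, unit variance) is what is meant by normalizing here. The scaler is fitted on the training data only and then applied to the validation data, so no validation statistics influence training. The arrays are synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and validation feature matrices.
X_train = rng.normal(loc=50, scale=10, size=(90, 5))
X_val = rng.normal(loc=50, scale=10, size=(10, 5))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the validation data.
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```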
Now let's train some widely used machine learning models and compare how well they fit our data.
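The comparison loop can be sketched as follows, using mean absolute error (MAE) as the error metric. The data is synthetic, the random forest uses fewer trees than the default to keep the example fast, and `XGBRegressor` is added only if the optional `xgboost` package is installed; all of these are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

try:  # xgboost is treated as optional in this sketch
    from xgboost import XGBRegressor
except ImportError:
    XGBRegressor = None

rng = np.random.default_rng(0)

# Synthetic regression data: a linear signal plus unit-variance noise.
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 1, 300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=22
)

models = [
    LinearRegression(),
    Lasso(),
    RandomForestRegressor(n_estimators=50, random_state=0),
    Ridge(),
]
if XGBRegressor is not None:
    models.append(XGBRegressor())

# Fit each model and record its MAE on the training and validation data.
results = {}
for model in models:
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    name = type(model).__name__
    results[name] = (train_mae, val_mae)
    print(f"{name}: training MAE = {train_mae:.3f}, validation MAE = {val_mae:.3f}")
```

A large gap between training and validation MAE for a model is a sign that it is overfitting.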
Output:
LinearRegression() :
Training Error : 17.893463692619434
Validation Error : 18.007896272831253
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, ...) :
Training Error : 7.89463304294701
Validation Error : 10.12050432946533
Lasso() :
Training Error : 17.915089584958036
Validation Error : 17.995033362288662
RandomForestRegressor() :
Training Error : 3.9877936746031746
Validation Error : 10.451300301587302
Ridge() :
Training Error : 17.893530494767777
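Validation Error : 18.00781790803129
Out of all the models trained above, RandomForestRegressor and XGBRegressor perform best, with nearly the same MAE on the validation data (about 10.45 and 10.12). Both have a clearly lower training error than validation error (about 3.99 and 7.89), which suggests some overfitting, most visibly for the random forest.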