Fish Weight Prediction (Regression Analysis for beginners) – Part 1
How to build an ML regression model using the top linear ML algorithms (Linear Regression, Lasso Regression, and Ridge Regression)
Table of contents
· Introduction
· Part 1.1 – Building ML model Pipeline.
· Part 1.2 – Analyze algorithms and methods.
  ∘ What are linear models?
  ∘ Comparison of the Algorithms
  ∘ Evaluation
· Conclusion
· References
Introduction
Today we will predict (estimate) the weight of a fish from its species name, vertical length, diagonal length, cross length, height, and diagonal width using linear models. I will use the top-down approach to solving the problem, which I explained in the previous article. First, in part 1.1, I will build a model, and then in part 1.2 I will explain how each algorithm and method works. This is a regression analysis problem for beginners. Understanding the main principles and methods behind this kind of problem will help you build your own ML regression models (house price prediction, for example).
Part 1.1 – Building ML model Pipeline.
In general, building an ML model is divided into 8 steps, shown in the figure below.
Usually, data scientists spend 80% of their time on the first three steps (also called Exploratory Data Analysis, EDA). The majority of ML applications follow this pipeline. The difference between easy and advanced applications lies in the EDA steps. We will follow the above-mentioned pipeline to solve the weight prediction problem as an example.
Step 1: Collect the data
The data is a public dataset that can be downloaded from Kaggle.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import combinations
import numpy as np
data = pd.read_csv("Fish.csv")
Step 2: Visualize the data (ask yourself these questions and answer them)
- What does the data look like?
data.head()
- Does the data have missing values?
data.isna().sum()
Note: In this case there are no missing values, since the problem is meant for beginners; however, this is not always the case. In the following articles, we will work with data that has missing values and see how to cope with them.
- What is the distribution of the numerical features?
data_num = data.drop(columns=["Species"])
fig, axes = plt.subplots(len(data_num.columns)//3, 3, figsize=(15, 6))
i = 0
for triaxis in axes:
    for axis in triaxis:
        data_num.hist(column = data_num.columns[i], ax=axis)
        i = i+1
As you can see, the distributions look reasonable. The target variable (Weight) is somewhat skewed, and there are methods to make its distribution more balanced. This is a slightly advanced topic, so let’s ignore it for now.
- What is the distribution of the target variable(Weight) with respect to fish Species?
sns.displot(
    data=data,
    x="Weight",
    hue="Species",
    kind="hist",
    height=6,
    aspect=1.4,
    bins=15
)
plt.show()
sns.pairplot(data, kind='scatter', hue='Species');
Target variable distribution with respect to species shows that there are some species such as Pike that have huge weight compared to others. This visualization gives us additional information on how the "species" feature can be used for prediction.
- What is the correlation between the target variable and the features?
plt.figure(figsize=(7,6))
corr = data_num.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, annot=True)
plt.show()
The pairwise correlation of columns shows that all the numerical features have a positive correlation with Weight. This means that the larger the lengths, height, or width, the higher the weight, which seems logical.
Step 3: Clean the data
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
ct = make_column_transformer(
    (StandardScaler(), ['Length1','Length2','Length3','Height','Width']),  # standardize to zero mean and unit variance
    (OneHotEncoder(handle_unknown="ignore"), ["Species"])
)
#create X and y values
data_cleaned = data.drop("Weight",axis=1)
y = data['Weight']
x_train, x_test, y_train, y_test = train_test_split(data_cleaned,y, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
X_train_normal = pd.DataFrame(ct.fit_transform(x_train))
X_test_normal = pd.DataFrame(ct.transform(x_test))
The Species column was converted into 7 features using OneHotEncoder, and the numerical features were scaled because linear models benefit from scaling. (I will explain the details in part 1.2 of the article.)
Step 4: Train the model
def models_score(model_name, train_data, y_train, val_data, y_val):
    #model_1
    if model_name=="Linear_Regression":
        reg = LinearRegression()
    #model_2
    elif model_name=="Lasso_Regression":
        reg = Lasso(alpha=0.1,tol=0.03)
    #model_3
    elif model_name=="Ridge_Regression":
        reg = Ridge(alpha=1.0)
    else:
        # fail loudly instead of returning undefined scores
        raise ValueError("please enter correct regressor name")
    reg.fit(train_data,y_train)
    pred = reg.predict(val_data)
    # scikit-learn metrics expect (y_true, y_pred); the order matters for R²
    score_MSE = mean_squared_error(y_val, pred)
    score_MAE = mean_absolute_error(y_val, pred)
    score_r2score = r2_score(y_val, pred)
    return round(score_MSE,2), round(score_MAE,2), round(score_r2score,2)
model_list = ["Linear_Regression","Lasso_Regression","Ridge_Regression"]
result_scores = []
for model in model_list:
    score = models_score(model,X_train_normal,y_train, X_test_normal,y_test)
    result_scores.append((model, score[0], score[1],score[2]))
    print(model,score)
In the code above, we train different models and evaluate each with the mean squared error (MSE), mean absolute error (MAE), and R² score.
Step 5: Evaluate
df_result_scores = pd.DataFrame(result_scores,columns=["model","mse","mae","r2score"])
df_result_scores
The best model has low MSE and MAE values and a high R² score. As the results show, simple linear regression performed best on this dataset compared to Lasso regression and Ridge regression.
Step 6: Hyperparameter tuning
Scikit-learn’s LinearRegression solves ordinary least squares directly, so it has no hyperparameter such as a learning rate to tune. A learning rate only comes into play when the same model is trained iteratively, as in SGDRegressor. So for now it is better to do hyperparameter tuning for Ridge or Lasso. Let’s do it for Ridge regression and see if it beats the score of simple linear regression.
from scipy.stats import loguniform
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV
space = dict()
space['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['alpha'] = loguniform(1e-5, 50)
model = Ridge()
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
search = RandomizedSearchCV(model, space, n_iter=100,
                            scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv, random_state=42)
result = search.fit(X_train_normal, y_train)
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
Best Score: -75.64939924236903
Best Hyperparameters: {'alpha': 0.24171039031894245, 'solver': 'sag'}
Now let’s fit a model with these hyperparameters and see the result.
reg = Ridge(alpha=0.24171039031894245, solver ="sag" )
reg.fit(X_train_normal,y_train)
pred = reg.predict(X_test_normal)
# scikit-learn metrics expect (y_true, y_pred); the order matters for R²
score_MSE = mean_squared_error(y_test, pred)
score_MAE = mean_absolute_error(y_test, pred)
score_r2score = r2_score(y_test, pred)
to_append = ["Ridge_hyper_tuned",round(score_MSE,2), round(score_MAE,2), round(score_r2score,2)]
df_result_scores.loc[len(df_result_scores)] = to_append
df_result_scores
Hyperparameter tuning improved on the default Ridge result; however, it is still not better than simple linear regression.
Step 7: Choose the best model and predict
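Instead of copying the best hyperparameters by hand, the fitted search object can hand back the tuned model directly. The sketch below assumes synthetic stand-in data from `make_regression` so that it runs without Fish.csv; the names `demo_search` and `tuned_ridge` are illustrative, and the search is smaller than the one above.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data so the sketch runs without Fish.csv.
X_demo, y_demo = make_regression(n_samples=150, n_features=12, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_search = RandomizedSearchCV(
    Ridge(),
    {"alpha": loguniform(1e-5, 50)},
    n_iter=20,
    scoring="neg_mean_absolute_error",
    cv=5,
    random_state=42,
)
demo_search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already refitted
# on the whole training set using the best hyperparameters found.
tuned_ridge = demo_search.best_estimator_
demo_mae = mean_absolute_error(y_te, tuned_ridge.predict(X_te))
print(demo_search.best_params_, round(demo_mae, 2))
```

This avoids transcription mistakes in long values such as the alpha printed by the search.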
# winner model
reg = LinearRegression()
reg.fit(X_train_normal,y_train)
pred = reg.predict(X_test_normal)
plt.figure(figsize=(18,7))
plt.subplot(1, 2, 1) # row 1, col 2 index 1
plt.scatter(range(0,len(X_test_normal)), pred,color="green",label="predicted")
plt.scatter(range(0,len(X_test_normal)), y_test,color="red",label="True value")
plt.legend()
plt.subplot(1, 2, 2) # index 2
plt.plot(range(0,len(X_test_normal)), pred,color="green",label="predicted")
plt.plot(range(0,len(X_test_normal)), y_test,color="red",label="True value")
plt.legend()
plt.show()
Besides computing MSE, MAE, and the R² score, we should visualize the predictions against the actual values. Why? Because with only an MSE or MAE score it is hard to judge how good the model is. In the visualization above, the predictions and true values are really close to each other, which means that the model works well.
Part 1.2 – Analyze algorithms and methods.
In this part, I will explain all the theoretical parts behind the algorithms, methods, and evaluation above.
What are linear models?
The term linear model means that the model is specified as a linear combination of features. Linear Regression, Lasso Regression, and Ridge Regression are linear algorithms.
- What is Linear Regression?
Linear regression is a supervised ML algorithm that predicts a target variable Y from independent variables X. Let’s take a simple example to see how it works: X is working experience and Y is salary.
From the visualization we can easily see the correlation between working experience and salary: the more experience a person has, the more money they earn.
We have a hypothesis function for linear regression:
h(x) = a + bx
where a and b are trainable parameters (also referred to as coefficients or weights).
a – the intercept on the y-axis
b – the slope
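Goal: find a and b that minimize the cost function, which is the ordinary least squares (OLS) criterion used by LinearRegression in Sklearn: the sum of squared differences between the predictions and the true values.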
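Written out for n training pairs (x_i, y_i), the cost function can be sketched as follows. The 1/n scaling is an assumption made here so that the cost equals the MSE; some texts use 1/(2n) or no scaling at all, and none of these choices changes which a and b minimize it.

```latex
J(a, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( a + b\,x_i - y_i \bigr)^2
```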
In the figure below, we can get an intuition for how the algorithm works. In the first panel, we have the hypothesis function h(x) = 32 and its OLS cost is calculated. After that, the parameters are updated to give h(x) = 1.6x + 29, and finally the fit reaches h(x) = 3.6x + 20. With each parameter update, the goal is to reduce the OLS cost.
How do we update the values of a and b to minimize MSE?
We update a and b using the gradient descent algorithm until convergence. But first, what is a gradient?
A gradient measures how much the output of a function changes if you change the inputs a little bit – Lex Fridman (MIT)
In short, the gradient measures how the cost function changes with respect to each of the weights. It is the slope of the function: the steeper the slope, the larger each update step and the faster the model learns. But if the gradient is zero, the weights are no longer updated, so the algorithm stops learning.
Each parameter is updated by subtracting the product of the learning rate and the gradient from its current value. The learning rate has a huge impact on finding the minimum. When the learning rate is too large, the updates bounce back and forth around the minimum of the function, whereas if it is too small, gradient descent will reach the minimum but take too much time.
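The update rule can be sketched in a few lines of NumPy. The experience and salary numbers below are made up for illustration, and `gradient_descent` is a hypothetical helper, not part of any library. The cost is assumed to be the MSE, so the gradients are 2 * mean(error) for a and 2 * mean(error * x) for b. For comparison, the same line is also fitted in closed form with `np.polyfit`, which is closer to how scikit-learn’s LinearRegression works, since it solves least squares directly rather than iterating.

```python
import numpy as np

# Hypothetical experience (years) and salary (thousands) values, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([24.0, 27.0, 31.0, 34.0, 38.0, 41.0])


def gradient_descent(x, y, learning_rate, n_steps):
    """Fit h(x) = a + b*x by minimizing the mean squared error."""
    a, b = 0.0, 0.0
    for _ in range(n_steps):
        error = (a + b * x) - y            # h(x_i) - y_i for every sample
        grad_a = 2.0 * error.mean()        # dJ/da
        grad_b = 2.0 * (error * x).mean()  # dJ/db
        a -= learning_rate * grad_a        # step against the gradient
        b -= learning_rate * grad_b
    return a, b


a_gd, b_gd = gradient_descent(x, y, learning_rate=0.02, n_steps=20000)

# Closed-form least-squares fit for comparison (np.polyfit returns the slope first).
b_ls, a_ls = np.polyfit(x, y, deg=1)

print(f"gradient descent: a={a_gd:.3f}, b={b_gd:.3f}")
print(f"least squares:    a={a_ls:.3f}, b={b_ls:.3f}")

# A learning rate that is too large overshoots on every step and diverges.
a_bad, b_bad = gradient_descent(x, y, learning_rate=0.2, n_steps=50)
print(f"lr=0.2 after 50 steps: a={a_bad:.3e}, b={b_bad:.3e}")
```

With the small learning rate the two fits agree, while the large one sends the parameters toward ever larger values instead of settling.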
- Lasso Regression
Lasso is a slight modification of linear regression. It performs L1 regularization, which means that it adds a penalty proportional to the sum of the absolute values of the coefficients. Why do we need it? Because it shrinks large weights and can drive small weights to exactly zero. In theory, this method is useful when we have a lot of features in the training data and we know that not all of them are important. Lasso regression helps to select the few most important features out of many.
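The feature-selection effect is easy to see on synthetic data. In the sketch below, which is an illustration under assumed data rather than the fish dataset, only the first two of eight features influence the target; the names `X_sparse` and `y_sparse` and the value alpha=0.5 are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: 8 features, but only the first two influence the target.
X_sparse = rng.normal(size=(200, 8))
y_sparse = 3.0 * X_sparse[:, 0] - 2.0 * X_sparse[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_sparse, y_sparse)
lasso = Lasso(alpha=0.5).fit(X_sparse, y_sparse)

print("OLS coefficients:  ", ols.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
print("features kept by Lasso:", np.flatnonzero(lasso.coef_))
```

Plain linear regression assigns a small nonzero weight to every noise feature, whereas Lasso sets such weights to exactly zero and also shrinks the weights it keeps.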
- Ridge Regression
Ridge Regression is also a slight modification of linear regression, with L2 regularization, which means that it penalizes the model by the sum of the squared values of the weights. It thus shrinks the absolute values of the coefficients and tends to spread the weight more evenly across them. Why do we need this? If we have a few features and we know that they all might affect the prediction, we can use ridge regression.
So linear regression has no regularization (no penalty on the parameters). It can sometimes assign a high weight to some features, which leads to overfitting on small datasets. That is why Lasso regression (L1 regularization) or Ridge regression (L2 regularization) is used to constrain the weights of the independent variables. In general, if the number of features is much smaller than the number of samples (#features << #rows), simple linear regression is likely to work better. However, if the number of features is not much smaller than the number of samples, linear regression tends to have high variance and low accuracy; in that case, Lasso and Ridge are more likely to work better.
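The shrinking behaviour can be sketched by fitting Ridge with increasing alpha on synthetic data. The data below is assumed for illustration, with two nearly identical features and one independent feature; the names `X_ridge`, `y_ridge`, and the alpha grid are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Synthetic data with two strongly correlated features plus one independent feature.
base = rng.normal(size=100)
X_ridge = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=100),  # near-duplicate of the first column
    rng.normal(size=100),
])
y_ridge = (2.0 * X_ridge[:, 0] + 2.0 * X_ridge[:, 1] + X_ridge[:, 2]
           + rng.normal(scale=0.5, size=100))

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_ridge, y_ridge)
    coef_norms.append(np.linalg.norm(ridge.coef_))  # overall size of the weights
    print(f"alpha={alpha:>8}: coef={ridge.coef_.round(3)}")
```

The coefficient vector gets smaller as alpha grows, but unlike Lasso the individual weights do not become exactly zero.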
Comparison of the Algorithms
Evaluation
All regression models share the same evaluation methods. The most common ones are the mean squared error, the mean absolute error, and the R² score. Let’s take an example and calculate each score by hand.
y_true = [1, 5, 3]
y_pred = [2, 3, 4]
MSE(y_true, y_pred) = 1/3 * [(1–2)² + (5–3)² + (3–4)²] = 2
MAE(y_true, y_pred) = 1/3 * [|1–2| + |5–3| + |3–4|] = 1.33
R2score(y_true, y_pred) = 1 – [(1–2)² + (5–3)² + (3–4)²] / [(1–3)² + (5–3)² + (3–3)²] = 0.25
Our goal is to find a model which has low MSE or MAE and a high R² score (the best is 1).
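The hand calculations can be checked against scikit-learn with the same three values. The sketch also shows one detail worth knowing: the metric functions take the arguments as (y_true, y_pred), and for R² swapping them gives a different number because the denominator is computed around the mean of the first argument.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [1, 5, 3]
y_pred = [2, 3, 4]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, round(mae, 2), r2)  # 2.0 1.33 0.25

# MSE and MAE are symmetric in their arguments, but R² is not.
r2_swapped = r2_score(y_pred, y_true)
print(r2_swapped)  # -2.0
```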
Conclusion
There is no ready-made formula for deciding which algorithm to choose for each task and dataset. That is why it is still important to try several algorithms and evaluate each of them. However, we need to know the intuition behind each algorithm. As the theory suggests, since we have about 160 data points and only 12 features, linear regression was more likely to work better, which turned out to be the case. However, since there was a high linear correlation between the training features, Lasso and Ridge also achieved decent results. In a future article, I will try the same on different kinds of datasets, and we will see how the scores change.
The code is available in my GitHub repository here.
References
[1] www.geeksforgeeks.org, ML | Linear Regression (2018)
[2] Gaurav Sharma, 5 Regression Algorithms you should know – Introductory Guide! (2021)
[3] Wenwei Xu, What’s the difference between Linear Regression, Lasso, Ridge, and ElasticNet in Sklearn? (2019)
[4] Niklas Donges, Gradient Descent: An Introduction to 1 of Machine Learning’s Most Popular Algorithms (2021)
[5] L.E. Melkumova, S.Ya. Shatskikh, Comparing Ridge and LASSO estimators for data analysis (2017), 3rd International Conference "Information Technology and Nanotechnology"
[6] Jason Brownlee, Hyperparameter Optimization With Random Search and Grid Search (2020), Machine Learning Mastery
[7] Aarshay Jain, A Complete Tutorial on Ridge and Lasso Regression in Python (2016)
[8] scikit-learn.org, 1.1. Linear Models
[9] scikit-learn.org, 3.3.4. Regression metrics