Fish Weight Prediction (Regression Analysis for beginners) – Part 1

How to build an ML regression model using the top linear ML algorithms (Linear Regression, Lasso Regression, and Ridge Regression)

Gurami Keretchashvili

Dec 20, 2021

Photo by Rachel Hisko on Unsplash

· Introduction
· Part 1.1 – Building ML model Pipeline.
· Part 1.2 – Analyze algorithms and methods.
  ∘ What are linear models?
  ∘ Comparison of the Algorithms
  ∘ Evaluation
· Conclusion
· References

Introduction

Today we will predict (estimate) the weight of a fish from its species name, vertical length, diagonal length, cross length, height, and diagonal width using linear models. I will follow the top-down approach to solving the problem, which I explained in the previous article. First, in part 1.1, I will build a model, and then in part 1.2 I will explain how each algorithm and method works. This is a regression analysis problem for beginners. Understanding the main principles and methods used here will help you build your own ML regression models (house price prediction, for example).

Part 1.1 – Building ML model Pipeline.

In general, building an ML model is divided into 8 steps, shown in the figure below.

[Image: the 8 steps of the ML model-building pipeline]

Data scientists are often said to spend about 80% of their time on the first three steps (also called Exploratory Data Analysis, EDA). The majority of ML applications follow this pipeline; what mostly separates easy applications from advanced ones is the EDA steps. We will follow this pipeline to solve the weight prediction problem as an example.

Step 1: Collect the data

The data is a public dataset that can be downloaded from Kaggle.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import combinations
import numpy as np
data = pd.read_csv("Fish.csv")

Step 2: Visualize the data (Ask yourself these questions and answer)

What does the data look like?

data.head()

image by Author

Does the data have missing values?

data.isna().sum()

image by Author

Note: As you can see, there are no missing values here, since the problem is meant for beginners. This is not always the case. In the following articles, we will work with data that has missing values and see how to cope with them.

What is the distribution of the numerical features?

data_num = data.drop(columns=["Species"])

fig, axes = plt.subplots(len(data_num.columns)//3, 3, figsize=(15, 6))
i = 0
for triaxis in axes:
    for axis in triaxis:
        data_num.hist(column=data_num.columns[i], ax=axis)
        i = i+1

[Image: histograms of the numerical features]

As you can see, the distributions look reasonable. The target variable (Weight) seems a little skewed, and there are methods that can balance its values. This is a slightly advanced topic, so let's ignore it for now.

What is the distribution of the target variable(Weight) with respect to fish Species?

sns.displot(
 data=data,
 x="Weight",
 hue="Species",
 kind="hist",
 height=6,
 aspect=1.4,
 bins=15
)
plt.show()

sns.pairplot(data, kind='scatter', hue='Species');

[Image: Weight distribution by species, and pairplot of the features]

Target variable distribution with respect to species shows that there are some species such as Pike that have huge weight compared to others. This visualization gives us additional information on how the "species" feature can be used for prediction.

What is the correlation between the target variable and the features?

plt.figure(figsize=(7,6))
corr = data_num.corr()
sns.heatmap(corr, 
 xticklabels=corr.columns.values,
 yticklabels=corr.columns.values, annot=True)
plt.show()

[Image: correlation heatmap of the numerical columns]

The pairwise correlation of the columns shows that all the numerical features are positively correlated with Weight. This means that the larger the lengths, height, or width, the higher the weight. This seems logical as well.

Step 3: Clean the data

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler 

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

ct = make_column_transformer(
    (StandardScaler(), ['Length1','Length2','Length3','Height','Width']),  # standardize to zero mean and unit variance
    (OneHotEncoder(handle_unknown="ignore"), ["Species"])  # one 0/1 column per species
)
#create X and y values
data_cleaned = data.drop("Weight",axis=1)
y = data['Weight']

x_train, x_test, y_train, y_test = train_test_split(data_cleaned,y, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

X_train_normal = pd.DataFrame(ct.fit_transform(x_train))
X_test_normal = pd.DataFrame(ct.transform(x_test))

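[Image: output of the code above]

The Species column was converted into 7 features using OneHotEncoder, and the numerical features were scaled, because linear models, especially regularized ones such as Lasso and Ridge, benefit from features on a comparable scale. (I will explain the details in part 1.2 of the article.)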
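To make the two transformations concrete, here is a minimal sketch on a tiny made-up table. The values and the reduced set of columns are invented for illustration and are not taken from Fish.csv; the sketch also assumes scikit-learn's default settings for both transformers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Length1": [10.0, 20.0, 30.0, 40.0],
    "Species": ["Bream", "Pike", "Bream", "Perch"],
})

toy_ct = make_column_transformer(
    (StandardScaler(), ["Length1"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Species"]),
)
out = toy_ct.fit_transform(toy)
# The result may be a sparse matrix or a NumPy array; normalize to an array
toy_transformed = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

scaled_length = toy_transformed[:, 0]     # standardized: mean 0, standard deviation 1
species_one_hot = toy_transformed[:, 1:]  # one 0/1 column per distinct species

print(scaled_length.round(2))
print(species_one_hot)
```

Each row of the one-hot block contains exactly one 1, in the column of that row's species, while the scaled column keeps the ordering of the original lengths but is centered on zero.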

Step 4: Train the model

def models_score(model_name, train_data, y_train, val_data, y_val):
    model_list = ["Linear_Regression", "Lasso_Regression", "Ridge_Regression"]
    # model_1
    if model_name == "Linear_Regression":
        reg = LinearRegression()
    # model_2
    elif model_name == "Lasso_Regression":
        reg = Lasso(alpha=0.1, tol=0.03)
    # model_3
    elif model_name == "Ridge_Regression":
        reg = Ridge(alpha=1.0)
    else:
        print("please enter correct regressor name")

    if model_name in model_list:
        reg.fit(train_data, y_train)
        pred = reg.predict(val_data)

        # scikit-learn metrics take (y_true, y_pred); the order matters for R²
        score_MSE = mean_squared_error(y_val, pred)
        score_MAE = mean_absolute_error(y_val, pred)
        score_r2score = r2_score(y_val, pred)
        return round(score_MSE, 2), round(score_MAE, 2), round(score_r2score, 2)

model_list = ["Linear_Regression", "Lasso_Regression", "Ridge_Regression"]
result_scores = []
for model in model_list:
    score = models_score(model, X_train_normal, y_train, X_test_normal, y_test)
    result_scores.append((model, score[0], score[1], score[2]))
    print(model, score)

In the code above, we train the different models and measure their performance with mean squared error (MSE), mean absolute error (MAE), and R² score.

Step 5: Evaluate

df_result_scores = pd.DataFrame(result_scores,columns=["model","mse","mae","r2score"])
df_result_scores

[Image: table of scores for each model]

The best model has low MSE and MAE values and a high R² score. As the results show, simple linear regression performed best on this dataset compared to Lasso regression and Ridge regression.

Step 6: Hyperparameter tuning

Scikit-learn's LinearRegression solves ordinary least squares directly, so it has no regularization strength to tune. A learning rate becomes a hyperparameter only when the same linear model is trained with gradient descent, as in SGDRegressor. So for now it is better to do hyperparameter tuning for Ridge or Lasso. Let’s do it for Ridge regression and see whether it beats the score of simple linear regression.

from scipy.stats import loguniform
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV

space = dict()
space['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['alpha'] = loguniform(1e-5, 50)
model = Ridge()

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

search = RandomizedSearchCV(model, space, n_iter=100, 
 scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv, random_state=42)
result = search.fit(X_train_normal, y_train)

print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: -75.64939924236903

Best Hyperparameters: {'alpha': 0.24171039031894245, 'solver': 'sag'}

Now let’s fit these hyperparameters and see the result.

reg = Ridge(alpha=0.24171039031894245, solver="sag")
reg.fit(X_train_normal, y_train)
pred = reg.predict(X_test_normal)
# scikit-learn metrics take (y_true, y_pred); the order matters for R²
score_MSE = mean_squared_error(y_test, pred)
score_MAE = mean_absolute_error(y_test, pred)
score_r2score = r2_score(y_test, pred)
to_append = ["Ridge_hyper_tuned", round(score_MSE,2), round(score_MAE,2), round(score_r2score,2)]
df_result_scores.loc[len(df_result_scores)] = to_append
df_result_scores

[Image: table of scores including the tuned Ridge model]

Hyperparameter tuning improved on the previous default Ridge result; however, it is still not better than simple linear regression.

Step 7: Choose the best model and predict

# winner model
reg = LinearRegression()
reg.fit(X_train_normal,y_train)
pred = reg.predict(X_test_normal)

plt.figure(figsize=(18,7))
plt.subplot(1, 2, 1) # row 1, col 2 index 1
plt.scatter(range(0,len(X_test_normal)), pred,color="green",label="predicted")
plt.scatter(range(0,len(X_test_normal)), y_test,color="red",label="True value")
plt.legend()

plt.subplot(1, 2, 2) # index 2
plt.plot(range(0,len(X_test_normal)), pred,color="green",label="predicted")
plt.plot(range(0,len(X_test_normal)), y_test,color="red",label="True value")
plt.legend()
plt.show()

[Image: predicted and true weights for the test set, as a scatter plot and a line plot]

MSE, MAE, and R² scores alone do not tell the whole story, so we also visualize the predictions next to the true values. Why? Because with only an MSE or MAE score it is hard to judge how good the model is. In the visualization above, the predictions and the true values are really close to each other, which means that the model works well.

Part 1.2 – Analyze algorithms and methods.

In this part, I will explain all the theoretical parts behind the algorithms, methods, and evaluation above.

What are linear models?

The term linear model means that the model is specified as a linear combination of features. Linear Regression, Lasso Regression, and Ridge Regression are all linear algorithms.
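As a small illustration of "a linear combination of features", the sketch below computes predictions as a weighted sum of the features plus an intercept. The weights, intercept, and feature values are hypothetical numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical weights and intercept, chosen only for illustration
weights = np.array([2.0, 0.5, -1.0])
intercept = 10.0

# Two samples with three features each (made-up values)
X = np.array([
    [1.0, 4.0, 2.0],
    [3.0, 0.0, 1.0],
])

# A linear model predicts with a weighted sum of the features plus an intercept
predictions = X @ weights + intercept
print(predictions)  # [12. 15.]
```

Linear Regression, Lasso, and Ridge all predict in this way; they differ only in how the weights are chosen during training.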

What is Linear Regression?

Linear regression is a supervised ML algorithm that predicts a target variable Y based on independent variables X. Let’s take a simple example to figure out how it works. For example, X is working experience and Y is salary.

image by author

From the visualization we can easily see the correlation between working experience and salary: if a person has more experience, they earn more money.

We have a hypothesis function for linear regression:

h(x) = a + b·x

where a and b are trainable parameters (also referred to as coefficients or weights).

a – the intercept on the y-axis

b – the slope

Goal: find a and b such that we minimize the cost function, which is the same one that Ordinary Least Squares (OLS, as in Sklearn) minimizes.

[Image: the least-squares cost function]

The figure below gives an intuition of how the algorithm works. In the first panel, we have the hypothesis function h(x) = 32, and the OLS cost is calculated. After that, we update the parameters and get the hypothesis function h(x) = 1.6x + 29, and finally it reaches h(x) = 3.6x + 20. With each parameter update, the goal is to reduce the OLS cost.

image by author

How do we update values of a and b to minimize MSE?

We update a and b using the gradient descent algorithm until convergence. But first, what is the gradient?

A gradient measures how much the output of a function changes if you change the inputs a little bit – Lex Fridman (MIT)

In short, the gradient measures how much the cost function changes with respect to each weight. It is the slope of a function: the higher the slope, the faster a model learns. But if the gradient is zero, the weights are not updated anymore, so the algorithm stops learning.

[Image: the gradient descent update rule]

a is updated by subtracting the second term, which is the learning rate multiplied by the gradient. The learning rate has a huge impact on finding the local minimum. When the learning rate is too large, the updates can bounce back and forth around a local minimum of the function, whereas if it is too small, gradient descent will reach the local minimum but will take too much time.
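The update loop can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the way scikit-learn fits LinearRegression: the data is made up so that salary = 20 + 3.6 * years exactly, the cost is taken to be the mean squared error, and the learning rate and number of iterations are arbitrary choices.

```python
import numpy as np

# Made-up experience (years) and salary values that follow salary = 20 + 3.6 * years
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 20.0 + 3.6 * x

a, b = 0.0, 0.0        # intercept and slope, both started at zero
learning_rate = 0.05   # assumed value; much larger values diverge on this data

for _ in range(5000):
    error = (a + b * x) - y            # h(x) - y for every sample
    grad_a = 2.0 * error.mean()        # derivative of the MSE with respect to a
    grad_b = 2.0 * (error * x).mean()  # derivative of the MSE with respect to b
    a -= learning_rate * grad_a        # step against the gradient
    b -= learning_rate * grad_b

mse = ((a + b * x - y) ** 2).mean()
print(round(a, 2), round(b, 2), round(mse, 6))
```

On this toy data the parameters should end up very close to a = 20 and b = 3.6. Raising learning_rate to around 0.1 makes the updates overshoot and grow instead of settling, which is the bouncing behaviour described above.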

Lasso Regression

Lasso is a slight modification of linear regression. It performs L1 regularization, which means that it adds a penalty proportional to the sum of the absolute values of the coefficients. Why do we need it? Because it shrinks large weights toward smaller ones and can push small weights all the way to zero. In theory, this kind of method can be used when we have a lot of features in the training data and we know that not all of them are important. Lasso regression helps to select the few most important features out of many.

Ridge Regression

Ridge Regression is also a slight modification of linear regression, with L2 regularization, which means that it penalizes the model by the sum of the squared values of the weights. This shrinks the absolute values of the coefficients and tends to spread the weight more evenly across them. Why do we need this? If we have a few features and we know that they all might affect the prediction, we can use ridge regression.

[Image]

So linear regression has no regularization (no penalty on the parameters). It can sometimes assign a high weight to some features and overfit on small datasets. That is why Lasso regression (L1 regularization) or Ridge regression (L2 regularization) is used to constrain the weights of the independent variables. In general, if the number of features is much smaller than the number of samples (#features << #rows), simple linear regression is likely to work better. However, if the number of features is not much smaller than the number of samples, linear regression tends to have high variance and low accuracy; in that case, Lasso and Ridge are more likely to work better.
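The difference between the three models is easiest to see on their coefficients. The sketch below uses synthetic data in which only the first of ten features actually drives the target; the data, the random seed, and the alpha values are assumptions chosen to make the effect visible, not settings from the fish experiment.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features, only the first feature matters
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

print("OLS  :", ols.coef_.round(3))
print("Lasso:", lasso.coef_.round(3))
print("Ridge:", ridge.coef_.round(3))

# Count coefficients that were set exactly to zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print(n_zero_lasso, n_zero_ridge)
```

Lasso should set the irrelevant coefficients exactly to zero and keep a shrunken weight on the first feature, while Ridge keeps every coefficient non-zero and only shrinks them relative to plain linear regression.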

Comparison of the Algorithms

image by author

Evaluation

All regression models share the same evaluation methods. The most common ones are Mean Squared Error, Mean Absolute Error, and R² score. Let’s take an example and calculate each score by hand.

[Image]

y_true = [1, 5, 3]

y_pred = [2, 3, 4]

MSE(y_true, y_pred) = 1/3 * [(1 - 2)² + (5 - 3)² + (3 - 4)²] = 2

MAE(y_true, y_pred) = 1/3 * [|1 - 2| + |5 - 3| + |3 - 4|] = 1.33

R2score(y_true, y_pred) = 1 - [(1 - 2)² + (5 - 3)² + (3 - 4)²] / [(1 - 3)² + (5 - 3)² + (3 - 3)²] = 0.25, where the 3 subtracted in the denominator is the mean of y_true.

Our goal is to find a model that has a low MSE or MAE and a high R² score (the best is 1).
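The hand calculation can be checked against scikit-learn's metric functions. The snippet below uses the same two lists; the last two lines are an extra illustration that R², unlike MSE and MAE, depends on which argument is treated as the true values.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [1, 5, 3]
y_pred = [2, 3, 4]

# All three functions take the true values first and the predictions second
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, round(mae, 2), r2)  # 2.0 1.33 0.25

# R² is not symmetric: swapping the arguments changes the result
r2_swapped = r2_score(y_pred, y_true)
print(r2_swapped)  # -2.0
```

With the arguments swapped, the denominator is computed around the mean of the predictions instead of the mean of the true values, so the score drops from 0.25 to -2.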

Conclusion

There is no ready-made formula for deciding which algorithm to choose for each task and dataset. That is why it is still important to try several algorithms and evaluate each of them. However, we need to know the intuition behind each algorithm. As the theory suggests, since we have about 160 data points and only 12 features, linear regression was more likely to work better, which turned out to be the case. However, there was a high linear correlation between the training features, which is why Lasso and Ridge also gave decent results. In a future article, I will try the same approach on different kinds of datasets, and we will see how the scores change.

The code is available in my GitHub repository here.

References

[1] www.geeksforgeeks.org, ML | Linear Regression (2018)

[2] Gaurav Sharma, 5 Regression Algorithms you should know – Introductory Guide! (2021)

[3] Wenwei Xu, What’s the difference between Linear Regression, Lasso, Ridge, and ElasticNet in Sklearn? (2019)

[4] Niklas Donges, Gradient Descent: An Introduction to 1 of Machine Learning’s Most Popular Algorithms (2021)

[5] L.E. Melkumova, S.Ya. Shatskikh, Comparing Ridge and LASSO estimators for data analysis (2017), 3rd International Conference "Information Technology and Nanotechnology"

[6] Jason Brownlee, Hyperparameter Optimization With Random Search and Grid Search (2020), Machine Learning Mastery

[7] Aarshay Jain, A Complete Tutorial on Ridge and Lasso Regression in Python (2016)

[8] scikit-learn.org, 1.1. Linear Models

[9] scikit-learn.org, 3.3.4. Regression metrics
