URL: https://www.analyticsvidhya.com/blog/2022/03/multiple-linear-regression-using-python/

Multiple Linear Regression using Python

Amrutha Last Updated : 21 Oct, 2024
In this article, we will work with multiple linear regression, using a dataset that contains information about 50 startups. The columns are R&D Spend, Administration, Marketing Spend, State, and Profit. Our goal is to build a machine learning model that predicts the profit of a startup.

Let’s get started.

Multiple linear regression is a machine learning algorithm in which we use multiple independent variables to predict a single dependent variable. Simple linear regression, by contrast, uses only one independent variable as input.
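To make the distinction concrete, here is a minimal sketch on made-up numbers (not the startup dataset). It assumes a target built exactly as 3 + 2*x1 + 0.5*x2, and shows that the same LinearRegression class handles both cases; the only difference is how many columns we pass in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only: y = 3 + 2*x1 + 0.5*x2 with no noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 2))
y = 3 + 2 * X[:, 0] + 0.5 * X[:, 1]

# Simple linear regression: a single independent variable (first column only)
simple = LinearRegression().fit(X[:, [0]], y)

# Multiple linear regression: both independent variables
multiple = LinearRegression().fit(X, y)

print("Simple model coefficients:", simple.coef_)
print("Multiple model coefficients:", multiple.coef_)
print("Multiple model intercept:", multiple.intercept_)
```

Because the made-up target depends on both columns, only the multiple regression model can recover the coefficients used to build it.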

You will walk through a worked multiple linear regression example in Python, and the questions at the end cover the multiple linear regression formula and its assumptions. By the end, you should have a solid grasp of this essential statistical technique and how it is applied in data analysis.

This article was published as a part of the Data Science Blogathon.

Working with Dataset

Let’s start by importing some libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

We import train_test_split to split the dataset into training and testing sets, and LinearRegression, the model we will work with, from the scikit-learn library. r2_score computes the R² score (coefficient of determination), which we use to evaluate the model. Matplotlib and seaborn are used for visualizations. Finally, we import warnings and set the filter to "ignore" to suppress the warnings we would otherwise see along the way.

Download the dataset (the file 50_Startups.csv used below) and load it by passing its path to read_csv().

Let us view our data frame.

#import dataset
startup_df=pd.read_csv(r'50_Startups.csv')

print(startup_df.head())

View the shape of the data frame.

shape=startup_df.shape
print("Dataset contains {} rows and {} columns".format(shape[0],shape[1]))

The dataset contains 50 rows and 5 columns.

View all the columns in the data frame.

startup_df.columns

The data frame contains the columns R&D Spend, Administration, Marketing Spend, State, and Profit.

View the statistical description of the dataset. For each numeric column, it shows the count of non-missing values, the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th, and 75th percentiles.

#Statistical Details of the dataset
startup_df.describe()
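Before modelling, it can also help to check for missing values and to see which columns are not numeric. The sketch below uses a hypothetical three-row stand-in called demo_df with made-up values, since the real file is not bundled here; the same calls should work on startup_df.

```python
import pandas as pd

# Hypothetical stand-in for startup_df; the values are made up
demo_df = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 150.0],
    "Administration": [50.0, 60.0, 55.0],
    "Marketing Spend": [300.0, 250.0, 275.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [190.0, 210.0, 200.0],
})

# describe() summarises only the numeric columns by default, so State is left out
summary = demo_df.describe()
print(summary.columns.tolist())

# Count the missing values in each column
missing = demo_df.isnull().sum()
print(missing)

# List the non-numeric columns, which will need encoding later
categorical_cols = demo_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```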

Define X and Y

In this step, we separate the independent variables from the dependent variable.

We have to define x and y for the model. x holds the input features (R&D Spend, Administration, Marketing Spend, and State), which are the independent variables. y holds the output (Profit), the dependent variable that our model will predict.

x=startup_df.iloc[:,:4]
y=startup_df.iloc[:,4]
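Selecting by position works, but selecting by column name is often easier to read and does not break if the column order changes. Here is a small sketch of the same split by name, again on a hypothetical stand-in demo_df with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for startup_df; the values are made up
demo_df = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 150.0],
    "Administration": [50.0, 60.0, 55.0],
    "Marketing Spend": [300.0, 250.0, 275.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [190.0, 210.0, 200.0],
})

# Select the independent variables by name instead of by position
feature_cols = ["R&D Spend", "Administration", "Marketing Spend", "State"]
x_demo = demo_df[feature_cols]

# Select the dependent variable by name
y_demo = demo_df["Profit"]

# Both approaches give the same result for this column order
print(x_demo.equals(demo_df.iloc[:, :4]))
print(y_demo.equals(demo_df.iloc[:, 4]))
```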

Perform One-Hot Encoding

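We use one-hot encoding when there are categorical values in our dataset. Here, the State column is categorical, so we use one-hot encoding to convert it into numeric columns.

So, import OneHotEncoder from the scikit-learn library. We store the result in a separate variable, state_encoded, so that x still holds the original feature columns for the next step.

from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
ohe=OneHotEncoder(sparse_output=False)
state_encoded=ohe.fit_transform(startup_df[['State']])

View the encoded array.

state_encoded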

array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[1., 0., 0.],
[0., 1., 0.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.],
[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.]])

This gives the array shown above. Let us see which three categories its columns represent.

ohe.categories_
[array(['California', 'Florida', 'New York'], dtype=object)]

Here [0., 0., 1.] indicates New York, [0., 1., 0.] indicates Florida, and [1., 0., 0.] indicates California.
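One thing to be aware of: the three one-hot columns always sum to 1, so together with the model's intercept they are perfectly collinear (often called the dummy variable trap). This article keeps all three columns, but if you want interpretable coefficients, one common option is to drop one category. The sketch below assumes a tiny made-up State column and compares the default encoding with drop="first".

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up State values for illustration
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

# Default encoding: one column per category (California, Florida, New York)
full = OneHotEncoder().fit_transform(states).toarray()

# drop="first" removes the first category (California), leaving two columns
reduced = OneHotEncoder(drop="first").fit_transform(states).toarray()

print(full)
print(reduced)
```

With drop="first", California becomes the baseline and is represented by a row of zeros.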

Change Columns using Column Transformer

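For this, import make_column_transformer from the scikit-learn library and pass it the column that we want to transform. The remaining columns are passed through unchanged.

from sklearn.compose import make_column_transformer
col_trans=make_column_transformer(
 (OneHotEncoder(handle_unknown='ignore'),['State']),
 remainder='passthrough')
# Apply it to the four original feature columns, which still include 'State'
x=col_trans.fit_transform(startup_df.iloc[:,:4])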

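Now view x.

It is now an array with six columns: the three one-hot encoded State columns come first, followed by R&D Spend, Administration, and Marketing Spend.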

Split the Dataset into Train Set and Test Set

Now, split your dataset into two parts: 80% of the rows go to the training set, and 20% go to the testing set. You can change this proportion with the test_size argument.

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

View the shapes of the split data.

#shapes of the split data
print("X_train:",x_train.shape)
print("X_test:",x_test.shape)
print("Y_train:",y_train.shape)
print("Y_test:",y_test.shape)

X_train: (40, 6)
X_test: (10, 6)
Y_train: (40,)
Y_test: (10,)
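The random_state argument is what makes the split repeatable. Here is a small sketch on made-up arrays (50 rows, to mirror the dataset size) showing that the same random_state gives the same split, and that each row of x stays paired with its y value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i of X_demo is [2*i, 2*i + 1] and y_demo[i] is i
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Two splits with the same random_state
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)
print("Same split both times:", np.array_equal(a_test, b_test))
print("Rows still aligned:", np.array_equal(a_test[:, 0], 2 * ya_test))
```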

Train the Model

To train the model, we create an instance of the LinearRegression class, which we imported at the beginning. Then we call its fit method and pass in the training sets.

linreg=LinearRegression()
linreg.fit(x_train,y_train)

Predict the Test Results

Use the predict method and pass the independent variables of the test set into it. It returns an array with one predicted value per test row.

y_pred=linreg.predict(x_test)
y_pred
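For a fitted LinearRegression, each prediction is the intercept plus the weighted sum of the feature values, with the weights stored in coef_. The sketch below checks this on made-up data; the coefficients 1.5, -2.0, and 0.7 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: three features and a noisy linear target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 3))
y_demo = 5.0 + X_demo @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=30)

model = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict() by hand: intercept + X . coefficients
manual = model.intercept_ + X_demo @ model.coef_

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Matches predict():", np.allclose(manual, model.predict(X_demo)))
```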

Evaluate the Model

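There are different metrics for evaluating a regression model. Here we use r2_score, which returns the R² score (coefficient of determination): the proportion of the variance in the actual values that the model explains. Strictly speaking it is not an accuracy, but a value close to 1 (100%) means the predictions fit the test data well.

r2=r2_score(y_test,y_pred)*100
print("R2 score of the model is %.2f" %r2)

The R² score of the model is 93.47.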
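To see what this score measures, here is a minimal sketch that computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with r2_score. The four values are made up for illustration. It also computes MAE and RMSE, two other common regression metrics that are expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

# Error metrics in the units of the target
mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("Manual R2:", r2_manual)
print("sklearn R2:", r2_score(y_true, y_hat))
print("MAE:", mae, "RMSE:", rmse)
```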

Plot the Results

We will plot the scatter plot between actual values and predicted values. Use xlabel to label the x-axis and use ylabel to label the y-axis.

plt.scatter(y_test,y_pred);
plt.xlabel('Actual');
plt.ylabel('Predicted');

Regression plot of our model.

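A regression plot is useful for understanding the linear relationship between two variables. It draws a scatter plot of the data points and fits a regression line through them. Here the two variables are the actual and predicted values, so the closer the points lie to the line, the better the predictions.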

sns.regplot(x=y_test,y=y_pred,ci=None,color ='red');

Predicted Values

Let us create a new data frame that contains the actual values, the predicted values, and the differences between them, so that we can see how close the model's predictions are to the actual values.

pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})

View the data frame.

pred_df

Here we can see that the differences between the actual values and the predicted values are not very large. When the values are in the range of lakhs (hundreds of thousands), a difference of a few thousand is small.
We have already seen that the R² score of this model is about 93 percent.
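Expressing each difference as a percentage of the actual value makes "not very large" easier to judge. This is a sketch on a hypothetical pred_demo data frame with made-up numbers; the Percent Error column is an addition for illustration and is not part of the original pred_df.

```python
import pandas as pd

# Hypothetical actual and predicted values; the numbers are made up
pred_demo = pd.DataFrame({
    "Actual Value": [100000.0, 150000.0, 120000.0],
    "Predicted Value": [103000.0, 147000.0, 121200.0],
})

# Same Difference column as in pred_df
pred_demo["Difference"] = pred_demo["Actual Value"] - pred_demo["Predicted Value"]

# Absolute difference as a percentage of the actual value
pred_demo["Percent Error"] = (
    100 * pred_demo["Difference"].abs() / pred_demo["Actual Value"]
)

print(pred_demo)
print("Mean percent error:", pred_demo["Percent Error"].mean())
```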

Conclusion

In this article, we built a linear regression model with multiple independent variables, and we learned how to perform one-hot encoding and when it is needed. We used a column transformer, trained the model, predicted the results, evaluated the model using the r2_score metric, and plotted the results.

Hope you guys found it useful.

Q1. What is the formula for the slope coefficients in multiple linear regression?

A. The formula for the slope coefficients (β) in multiple linear regression is:
β = (X'X)^(-1) X'Y
where X is the design matrix (containing the independent variables), X' is the transpose of X, Y is the vector of the dependent variable, and "^(-1)" denotes the inverse of a matrix.
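As a sketch of this formula in action, the code below solves it with NumPy on made-up data and compares the result with scikit-learn. Two assumptions are worth flagging: the design matrix gets a leading column of ones so that the first entry of β is the intercept, and we solve the linear system (X'X)β = X'Y with np.linalg.solve rather than computing the inverse explicitly, which gives the same result and is usually more stable numerically.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features and a noisy linear target
rng = np.random.default_rng(1)
features = rng.normal(size=(25, 2))
target = 4.0 + features @ np.array([2.0, -1.0]) + rng.normal(scale=0.05, size=25)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(len(features)), features])

# Solve (X'X) beta = X'Y, equivalent to beta = (X'X)^(-1) X'Y
beta = np.linalg.solve(X.T @ X, X.T @ target)

# Compare with scikit-learn's fit
reference = LinearRegression().fit(features, target)
print("Normal equation:", beta)
print("scikit-learn:   ", reference.intercept_, reference.coef_)
```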

Q2. What is the equation for multiple regression?

A. The equation for multiple regression is:
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
where Y is the dependent variable, X1, X2, …, Xk are the independent variables, β0 is the intercept, β1, β2, …, βk are the coefficients of the independent variables, and ε is the error term.

Q3. What are the 5 assumptions of multiple linear regression?

A. The 5 assumptions of multiple linear regression are:
Linearity: Relationship between variables is linear.
Independence: Observations are independent.
Homoscedasticity: Constant variance of residuals.
Normality: Residuals follow normal distribution.
No Multicollinearity: Independent variables are not highly correlated.

Q4. How do you detect multicollinearity?

A. Common ways to detect multicollinearity are:
Correlation Matrix: High correlations between independent variables.
VIF (Variance Inflation Factor): Values greater than 5 (or 10, depending on the rule of thumb used) suggest multicollinearity.
Eigenvalues/Condition Indices: Small eigenvalues or high condition indices suggest problems.
Tolerance: Values close to 0 indicate multicollinearity.
Auxiliary Regression: High R-squared values for regressions of independent variables against each other.
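VIF and auxiliary regression are closely related: the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables, and tolerance is simply 1 / VIF. Here is a minimal sketch using scikit-learn; the vif helper is hypothetical (written for this example) and the data is made up so that the third column is nearly a copy of the first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Hypothetical helper: VIF of each column via auxiliary regressions."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # all columns except column j
        # R^2 of regressing column j on the remaining columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Made-up data: c is almost the same as a, while b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

scores = vif(np.column_stack([a, b, c]))
print("VIF:", scores)
print("Tolerance:", 1.0 / scores)
```

The first and third columns should come out with large VIF values, while the independent second column stays close to 1.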

Connect with me on LinkedIn: https://www.linkedin.com/in/amrutha-k-6335231a6vl/

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.

This is Amrutha, I am pursuing B.Tech in the Computer science Department. I am interested in developing ML Models with python and Data Analysis. And also I have an interest in Web Development. I hope my articles in Analytics Vidhya help you to learn better. Thank you!!
