Linear Regression vs. Logistic Regression: What is the Difference?

The differences in terms of cost functions, Ordinary Least Squares (OLS), Gradient Descent (GD), and Maximum Likelihood Estimation (MLE).

Aaron Zhu

Apr 10, 2022

15 min read

Roadmap (Image by author)

In this article, I’ll cover a roadmap for statistical model development. Specifically, we’ll talk about how the development of linear regression differs from logistic regression. We’ll also discuss why some methods work in one model, but not in another model. The following is an overview of this roadmap.

Step 1: Design a model that explains the relationship between the response variable and the explanatory variables

Regression Model: The response variable is continuous.
Classification Model: The response variable is categorical.

Step 2: Develop a cost function (a.k.a. loss function) to determine how well this model explains a given set of data

Develop a cost function in terms of Ordinary Least Squares (OLS)
Develop a cost function in terms of Maximum Likelihood Estimation (MLE)

Step 3: Find the best parameters to solve the optimization problem with respect to the cost function

Analytical Method: Find a closed-form solution if the optimization problem can be solved analytically
Numerical Approximation Method: Find the solution iteratively using Gradient Descent

How to apply Ordinary Least Squares (OLS) in a Linear Regression?

First, let’s talk about linear regression using the following data. The goal is to develop a model to explain the relationship between the response variable (i.e., Y), which is a continuous variable, and the explanatory variable (i.e., X).

Image by Author

Step 1: Based on the linear relationship between Y and X, a linear equation, shown below, would be an appropriate model.

Equation 1
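
Assuming the standard simple linear regression form that the surrounding text describes (an intercept β0, a slope β1, and an error term ε), Equation 1 can be written for each observation i as follows. The index i and the layout are a reconstruction from the text, not the author's original image.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \dots, n
```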

The parameters (β0, β1) are fixed values that are typically unknown because it is difficult to collect all the data points from the population. Therefore, we use the OLS estimators (β̂0, β̂1) computed from a sample to estimate the population parameters (β0, β1).

The error term, ε, is a random variable that captures the distance between each point and the line. Its values can be positive or negative. Overall, ε needs to have a mean of 0, which can be achieved by including an intercept (i.e., β0). ε is NOT required to have a normal distribution for OLS to produce unbiased estimates.
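
The claim that normality is not needed for unbiasedness can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the "true" parameters, the uniform error distribution, and the helper name `ols_fit` are all made up for this example, and the closed-form formulas inside `ols_fit` are the standard textbook ones for a single explanatory variable.

```python
import random

def ols_fit(xs, ys):
    """Closed-form OLS estimates (beta0_hat, beta1_hat) for one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Hypothetical "population" parameters, chosen only for this illustration.
TRUE_BETA0, TRUE_BETA1 = 2.0, 0.5

rng = random.Random(42)
xs = [0.2 * i for i in range(50)]  # fixed design: 0.0, 0.2, ..., 9.8

n_trials = 2000
sum_b0 = sum_b1 = 0.0
for _ in range(n_trials):
    # Errors are uniform on [-2, 2]: mean 0, but clearly not normal.
    ys = [TRUE_BETA0 + TRUE_BETA1 * x + rng.uniform(-2.0, 2.0) for x in xs]
    b0, b1 = ols_fit(xs, ys)
    sum_b0 += b0
    sum_b1 += b1

# Averages of the estimates across many simulated samples.
mean_b0 = sum_b0 / n_trials
mean_b1 = sum_b1 / n_trials
print(f"average beta0_hat = {mean_b0:.3f} (true {TRUE_BETA0})")
print(f"average beta1_hat = {mean_b1:.3f} (true {TRUE_BETA1})")
```

Across the simulated samples, the averages of β̂0 and β̂1 should land very close to the values used to generate the data, even though the errors are uniform rather than normal.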

However, the normality assumption allows us to perform statistical hypothesis testing. The estimates (e.g., β̂0, β̂1) can be written as linear combinations of ε, so assuming ε is normally distributed implies that the estimates are also normally distributed. The normality assumption therefore gives us reliable confidence intervals for the predictions and estimates.
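
The "linear combination" claim can be made concrete for the slope. A short derivation, assuming the model Yi = β0 + β1Xi + εi and treating the Xi as fixed (the weights w_i are notation introduced here for illustration, not the author's):

```latex
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
  = \beta_1 + \sum_{i=1}^{n} w_i \, \varepsilon_i,
\qquad
w_i = \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2}
```

Because the weights depend only on the X values, β̂1 equals β1 plus a weighted sum of the errors. If each εi has mean 0, the sum has mean 0 and β̂1 is unbiased; if the εi are also independent and normal, the sum is normal as well.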

Step 2: We would like to come up with a cost function that quantifies how well the proposed model fits the data. For a linear regression model, we find the best estimates, β̂1 and β̂0, such that the sum of distances between Yi and Ŷi, where Ŷi = β̂1Xi + β̂0, is minimized. Typically, a quadratic (squared) distance is used in the cost function. We can summarize it in the following equation.

Equation 2
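
Assuming Equation 2 is the usual sum of squared errors, SSE = Σ (Yi − Ŷi)² with Ŷi = β̂1Xi + β̂0, the cost of any candidate line can be computed directly. The sketch below uses made-up data and hypothetical function names (`predict`, `sse`); it only illustrates how the cost separates a good candidate from a poor one.

```python
def predict(beta0, beta1, x):
    """Fitted value Y_hat = beta1 * x + beta0."""
    return beta1 * x + beta0

def sse(beta0, beta1, xs, ys):
    """Sum of squared errors between the observed ys and the fitted values."""
    return sum((y - predict(beta0, beta1, x)) ** 2 for x, y in zip(xs, ys))

# Made-up data lying roughly on the line Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

cost_good = sse(1.0, 2.0, xs, ys)  # candidate line close to the data
cost_bad = sse(0.0, 1.0, xs, ys)   # candidate line far from the data
print(f"SSE for (beta0=1, beta1=2): {cost_good:.2f}")
print(f"SSE for (beta0=0, beta1=1): {cost_bad:.2f}")
```

The candidate that tracks the data has a much smaller cost, which is the sense in which minimizing this function picks the best-fitting line.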

How to apply Maximum Likelihood Estimation (MLE) in a Linear Regression?

Is the above cost function the only function we can design for linear regression? The answer is NO. Alternatively, we can come up with a different version of a cost function in terms of Maximum Likelihood Estimation (MLE).

MLE is a frequentist approach for estimating the parameters of a model from observed data. We find the best parameters by maximizing the likelihood of observing the given data under the assumed statistical model.
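
In symbols, and assuming independent observations with a density f that depends on a parameter vector θ (generic notation introduced here for illustration), the MLE is the maximizer of the likelihood, or equivalently of the log-likelihood:

```latex
\hat{\theta}_{\text{MLE}}
  = \arg\max_{\theta} L(\theta)
  = \arg\max_{\theta} \prod_{i=1}^{n} f(y_i \mid x_i; \theta)
  = \arg\max_{\theta} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta)
```

Taking the logarithm does not move the maximizer because the log is strictly increasing, and it turns the product into a sum that is easier to work with.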

Step 1: We’ll start with the same linear model as OLS (Equation 1). The only modification is that we impose a normality assumption on the error term, because MLE needs an assumed distribution for the data in order to calculate the likelihood.

Equation 3
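
Assuming Equation 3 is Equation 1 with the normality assumption added, as the text describes, it can be written as follows. This is a reconstruction from the text; independence of the errors is the usual accompanying assumption and is stated here explicitly.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)
```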

Step 2: Instead of measuring distances, we can come up with a different version of the cost function that quantifies how likely we are to observe our outcome variable (i.e., Y) given the explanatory variables (e.g., X) under the proposed model. We would like to find the best parameters, β̂1 and β̂0, such that the likelihood of observing our data is maximized.

Let’s first rewrite Equation 3 as a conditional distribution for each observation of X and Y. In other words, for a given Xi, each Yi is drawn from a normal distribution with a mean of β1Xi + β0 and a standard deviation of σ.
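
The conditional distribution just described, Yi | Xi ~ N(β1Xi + β0, σ²), gives each observation a normal density, and the log-likelihood is the sum of the log densities. A minimal sketch, using made-up data, a fixed σ chosen only for simplicity, and hypothetical function names, that evaluates this log-likelihood and checks numerically that the OLS fit scores higher than nearby candidate lines:

```python
import math

def gaussian_log_likelihood(beta0, beta1, sigma, xs, ys):
    """Sum of log N(y_i | beta1 * x_i + beta0, sigma^2) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        mean = beta1 * x + beta0
        total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - (y - mean) ** 2 / (2 * sigma ** 2))
    return total

def ols_fit(xs, ys):
    """Closed-form OLS estimates (beta0_hat, beta1_hat) for one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat

# Made-up data lying roughly on the line Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
sigma = 1.0  # assumed known here, purely to keep the comparison simple

b0, b1 = ols_fit(xs, ys)
ll_ols = gaussian_log_likelihood(b0, b1, sigma, xs, ys)

# Perturb the OLS estimates slightly; each perturbed line should be less likely.
perturbed = [(b0 + d0, b1 + d1)
             for d0 in (-0.2, 0.0, 0.2)
             for d1 in (-0.1, 0.0, 0.1)
             if (d0, d1) != (0.0, 0.0)]
ll_perturbed = [gaussian_log_likelihood(p0, p1, sigma, xs, ys) for p0, p1 in perturbed]
print(f"log-likelihood at OLS fit:       {ll_ols:.4f}")
print(f"best log-likelihood when nudged: {max(ll_perturbed):.4f}")
```

For a fixed σ, the log-likelihood equals a constant minus SSE / (2σ²), so the line that minimizes the squared-error cost is also the line that maximizes the likelihood under the normality assumption.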

URL: https://towardsdatascience.com/linear-regression-vs-logistic-regression-ols-maximum-likelihood-estimation-gradient-descent-bcfac2c7b8e4/
