Linear Regression vs. Logistic Regression: What is the Difference?
The differences in terms of cost functions, Ordinary Least Squares (OLS), Gradient Descent (GD), and Maximum Likelihood Estimation (MLE).
In this article, I’ll cover a roadmap for statistical model development. Specifically, we’ll talk about how the development of linear regression differs from logistic regression. We’ll also discuss why some methods work in one model, but not in another model. The following is an overview of this roadmap.
Step 1: Design a model that explains the relationship between the response variable and the explanatory variables
- Regression Model: The response variable is continuous.
- Classification Model: The response variable is categorical.
Step 2: Develop a cost function (a.k.a. loss function) to determine how well this model explains a given set of data
- Develop a cost function in terms of Ordinary Least Squares (OLS)
- Develop a cost function in terms of Maximum Likelihood Estimation (MLE)
Step 3: Find the best parameters to solve the optimization problem with respect to the cost function
- Analytical Method: Find a closed-form solution if the optimization problem can be solved analytically
- Numerical Approximation Method: Find the solution iteratively using Gradient Descent
How to apply Ordinary Least Squares (OLS) in a Linear Regression?
First, let’s talk about linear regression using the following data. The goal is to develop a model to explain the relationship between the response variable (i.e., Y), which is a continuous variable, and the explanatory variable (i.e., X).
Step 1: Based on the linear relationship between Y and X, a linear equation would be an appropriate model: Y = β0 + β1X + ε (equation 1).
The parameters (β0, β1) are fixed values that are typically unknown because it is difficult to collect all the data points from the population. Therefore, we use the OLS estimators (β̂0, β̂1) computed from a sample to estimate the population parameters (β0, β1).
The error term, ε, is a random variable that captures the vertical distance between each point and the fitted line. Its values can be positive or negative. Overall, ε needs to have a mean of 0, which can be achieved by including an intercept (i.e., β0). ε is NOT required to have a normal distribution for OLS to produce an unbiased model.
However, the normality assumption allows us to perform statistical hypothesis testing. Estimates (e.g., β̂0, β̂1) can be written as a linear combination of ε (see more details here), and assuming ε is normally distributed implies the estimates are also normally distributed. Therefore, the normality assumption would generate reliable confidence intervals for the predictions and estimates.
Step 2: We would like to come up with a cost function that quantifies how well the suggested model fits the data. For a linear regression model, we find the best estimates, β̂1 and β̂0, such that the sum of distances between Yi and Ŷi, where Ŷi = β̂1Xi + β̂0, is minimized. Typically, a quadratic distance is used in the cost function, which gives the sum of squared errors: Σ (Yi − Ŷi)² (equation 2).
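As a minimal sketch of the cost function described above, the snippet below computes the sum of squared errors for two candidate lines. The function name `sum_squared_error` and the toy data are made up for illustration and are not from the original article.

```python
import numpy as np

def sum_squared_error(beta0, beta1, x, y):
    """Sum of squared residuals for the line y_hat = beta1 * x + beta0."""
    y_hat = beta1 * x + beta0
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical toy data (assumption): generated exactly by y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

good = sum_squared_error(1.0, 2.0, x, y)  # the line that generated the data
bad = sum_squared_error(0.0, 1.0, x, y)   # a worse candidate line, y_hat = x

print(good, bad)
```

The better-fitting line has the smaller cost, which is what the optimization in Step 3 exploits.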
How to apply Maximum Likelihood Estimation (MLE) in a Linear Regression?
Is the above cost function the only function we can design for linear regression? The answer is NO. Alternatively, we can come up with a different version of a cost function in terms of Maximum Likelihood Estimation (MLE).
MLE is a frequentist approach to estimate the parameters of a model, given observed data. We can find the best parameters by maximizing the likelihood of observing a given set of data, under the assumed statistical model.
Step 1: We’ll start with the same linear model as OLS (Equation 1). The only modification here is we need to impose a normality assumption on the error term for MLE to work because we need to assume the distribution of the data to calculate the likelihood.
Step 2: With MLE, we come up with a different version of the cost function, one that quantifies how likely we are to observe our outcome variable (i.e., Y) given the explanatory variables (e.g., X) under the suggested model. We would like to find the best parameters, β̂1 and β̂0, such that the likelihood of observing our data is maximized.
Let’s first rewrite equation 3 as a conditional distribution for each observation of X and Y. In other words, each Yi is drawn from a normal distribution with a mean of β1Xi + β0 for a given Xi and a standard deviation of σ.
Next, we define a likelihood function of observing Yi given Xi in terms of the normal distribution.
Each point i is independent and identically distributed (iid), so we can write the likelihood function of observing all points as the product of each individual probability density.
This is the likelihood function. Typically, we would transform it into a log-likelihood function because maximizing the likelihood function is equivalent to maximizing the log-likelihood and log-likelihood is much easier to handle mathematically.
In equation 4, you’ll notice that the circled terms are all constant values and there is a negative sign in front of the summation.
Therefore, maximizing the log-likelihood function is mathematically equivalent to minimizing the cost function of OLS (see equation 2).
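For reference, the chain of steps described in this section can be written out as below. This is a sketch under the stated assumptions (iid observations, normally distributed errors with a constant standard deviation σ). The grouping of terms may differ from the article's own numbered equations.

```latex
Y_i \mid X_i \sim \mathcal{N}\!\left(\beta_0 + \beta_1 X_i,\ \sigma^2\right)

L(\beta_0, \beta_1, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(Y_i - \beta_0 - \beta_1 X_i)^2}{2\sigma^2}\right)

\log L(\beta_0, \beta_1, \sigma) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2
```

For a fixed σ, the first term and the factor 1/(2σ²) do not depend on β0 or β1, so maximizing the log-likelihood over β0 and β1 amounts to minimizing the sum of squared errors.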
How cool is that! We start with totally different ideas with respect to OLS and MLE and end up having the same cost functions for the linear regression model.
Step 3: How to Minimize the Cost Function of Linear Regression Using a Closed-form formula?
The process of finding the values that minimize or maximize a function is called optimization. Any optimization problem in machine learning can be approached either analytically or by numerical approximation.
An analytical approach is deterministic, which means there is a closed-form solution to solve the optimization problem. You can write down the mathematical expression of the exact solution.
However, a closed-form solution is rarely obtainable in statistical modeling. Linear regression is one of the rare cases.
Because the sum of squared errors is quadratic in the parameters, its derivatives are linear in β̂0 and β̂1, so setting them to zero gives a system of linear equations that has an analytical solution. We should be able to derive the formula with a little bit of calculus.
To find the maximum or minimum of a function, we can take its derivative with respect to each parameter and set them equal to 0. Solving the equations would give us the best parameters.
Let’s derive the formula.
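We have our cost function as a sum of squared errors: SSE = Σ (Yi − β̂0 − β̂1Xi)².
We take the derivative with respect to β̂0 and set it equal to 0: −2 Σ (Yi − β̂0 − β̂1Xi) = 0.
Solving the equation above, we get β̂0 = Ȳ − β̂1X̄ (equation 5), where X̄ and Ȳ are the sample means of X and Y.
Now take the derivative with respect to β̂1, replace β̂0 with equation 5, and set it equal to 0: −2 Σ Xi (Yi − Ȳ + β̂1X̄ − β̂1Xi) = 0.
Solving the equation above, we get β̂1 = Σ Xi (Yi − Ȳ) / Σ Xi (Xi − X̄).
Now we can apply some algebra tricks to rewrite the numerator and denominator, which are easy to prove because Σ (Yi − Ȳ) = 0 and Σ (Xi − X̄) = 0: β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)².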
As you can see from the formulas for β̂1 and β̂0, for a given dataset we are likely to have a closed-form solution for a linear model. We can also do all the algebra in matrix form. You can find its derivation here.
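The closed-form solution for a single explanatory variable can be sketched in a few lines. The version below uses the usual textbook form (slope as the ratio of centered cross-products to centered squares, intercept from the sample means), which I am assuming matches the article's result. The function name `fit_simple_ols` and the data are hypothetical.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form OLS estimates for y = beta0 + beta1 * x."""
    x_mean, y_mean = x.mean(), y.mean()
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Hypothetical toy data (assumption), roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta0, beta1 = fit_simple_ols(x, y)

# Cross-check against NumPy's degree-1 least-squares polynomial fit,
# which returns coefficients from highest degree to lowest
slope_np, intercept_np = np.polyfit(x, y, 1)

print(beta0, beta1)
```

No iteration is involved: the estimates come directly from sums over the data.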
Keep in mind that a closed-form solution can sometimes be problematic in multiple linear regression, in which there are several explanatory variables (e.g., X1, X2, X3, etc.). In matrix form the solution requires inverting XᵀX, and that inverse exists only if X has full column rank. If there is perfect multicollinearity, XᵀX is singular and there is no unique closed-form solution. To prevent this, we need to exclude the redundant variables from the linear model.
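To make the full-rank condition concrete, here is a small sketch with a deliberately redundant column. The design matrix and response are invented for illustration. The check uses `np.linalg.matrix_rank`, and the fit uses the normal equations on the reduced matrix.

```python
import numpy as np

# Hypothetical explanatory variables (assumption): x2 is exactly 2 * x1
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones_like(x1), x1, x2])  # intercept, x1, x2

rank = np.linalg.matrix_rank(X)
full_rank = bool(rank == X.shape[1])  # False: 3 columns but rank 2

# Dropping the redundant column restores full column rank,
# so the normal equations (X^T X) beta = X^T y have a unique solution
X_reduced = X[:, :2]
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 1 + 2 * x1
beta = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)

print(rank, full_rank, beta)
```

In practice, near-perfect multicollinearity is more common than exact multicollinearity and makes the inverse numerically unstable rather than undefined.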
Step 3: How to Minimize the Cost Function Using Gradient Descent?
We’ve talked about the analytical solution for linear regression. Next, let’s talk about a numerical approximation approach, called Gradient Descent, to solve the optimization problem.
In mathematics, gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. A typical Gradient Descent follows the steps listed below.
- Initially we can let β1 and β0 be any values (e.g., β1 = 0 and β0 = 0). Let L be the learning rate. L determines the magnitude of changes we would apply to update the parameters. The bigger the learning rate, the faster the cost function would approach the minimum point. If L is too big, the optimizer can overshoot the "curve" and miss the minimum point. Conversely, if L is too small, it would take a very long time to reach the minimum. Hence we need to set the learning rate strategically.
- Calculate the partial derivative of the cost function with respect to each parameter (which I’ve discussed above). Plug in the values of β1, β0, Xi, Yi in each partial derivative and compute their values. The derivative values determine the direction of the changes to these parameters.
- Update parameters using the following formula: β0 ← β0 − L × ∂J/∂β0 and β1 ← β1 − L × ∂J/∂β1, where J denotes the cost function.
- Repeat steps 2 and 3 until parameters stop changing or derivative values get close enough to zero.
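The steps above can be sketched as follows. This is an illustrative implementation, not the article's own code: it divides the gradient by the number of points (i.e., it minimizes the mean rather than the sum of squared errors, which has the same minimizer), and the learning rate, iteration cap, tolerance, and data are all assumptions.

```python
import numpy as np

def gradient_descent_linear(x, y, learning_rate=0.05, n_iters=5000, tol=1e-10):
    """Fit y = beta0 + beta1 * x by gradient descent on the mean squared error."""
    beta0, beta1 = 0.0, 0.0            # step 1: arbitrary starting values
    n = len(x)
    for _ in range(n_iters):
        residual = (beta1 * x + beta0) - y
        # step 2: partial derivatives of the mean squared error
        grad0 = 2.0 * np.sum(residual) / n
        grad1 = 2.0 * np.sum(residual * x) / n
        # step 3: move against the gradient, scaled by the learning rate
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
        # step 4: stop once the derivatives are close enough to zero
        if max(abs(grad0), abs(grad1)) < tol:
            break
    return beta0, beta1

# Hypothetical toy data (assumption): generated exactly by y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

beta0, beta1 = gradient_descent_linear(x, y)
print(beta0, beta1)
```

With a much larger learning rate on the same data, the updates would grow instead of shrink, which is the overshooting behavior described in the first step.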
Keep in mind that an analytical approach would give us the exact solution if it works in a linear regression model. Gradient Descent, however, only gives a solution that is very close to the exact solution. Fortunately, we won’t be able to notice the differences in most cases.
Gradient Descent is a popular tool for solving optimization problems in machine learning. The algorithm is simple and applies to most differentiable cost functions. However, it also has limitations. For example, on a non-convex function it is not guaranteed to find the global minimum, which we’ll discuss in the next section.
Does Linear Regression work in Classification Problems?
The short answer is NO.
We know that the outcome variable in a linear regression is a continuous variable. Can it handle categorical outcome variables, e.g., species of animals (dog vs. cat), object detection (train vs. airplane), or spam detection (spam vs. non-spam emails)? It can’t do so directly, because the predicted outcome from a linear equation can take an infinite number of values. The predicted outcomes for a classification problem can only take a finite number of values, e.g., 2 or a small number of values.
It doesn’t mean linear regression is useless for a classification problem. In fact, we can develop a new model based on linear regression with some twists to handle classification problems: Logistic Regression.
What is a Logistic Regression?
To put it simply, we can break down Logistic Regression into two parts, Regression and Logistic.
Regression: A linear regression model is used to estimate the value of logits (a.k.a. log-odds)
- Let P be the probability of occurrence of a particular event (e.g., an email is spam). Odds is defined as the ratio of "the probability of a particular event" to "the probability of the event not occurring", i.e., P / (1 − P). Log-odds is simply the log value of odds, log(P / (1 − P)).
- We can let log-odds be expressed in terms of a linear model: log(P / (1 − P)) = β0 + β1X
- Then we can solve for P: P = 1 / (1 + e^−(β0 + β1X))
Logistic: We can also think of a logistic regression model as feeding a linear regression model into a logistic function (a.k.a. sigmoid function). The logistic function converts the value of a logit (i.e., βXi), which ranges from −∞ to +∞, to Yi, which ranges between 0 and 1.
Now I think we have an appropriate model for a classification problem. We can set a threshold (e.g., 0.5) for Y to determine the outcome variable. For example, if Y ≥ 0.5, it is a spam email, and if Y < 0.5, it is not a spam email.
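Here is a small sketch of the two parts working together: a linear model produces the logit, the logistic function turns it into a probability, and a threshold turns the probability into a class label. The coefficient values and inputs are hypothetical and chosen only so the logits are easy to read.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued logit to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(x, beta0, beta1, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    p = sigmoid(beta0 + beta1 * x)
    return (p >= threshold).astype(int)

# Hypothetical coefficients and inputs (assumption), giving logits -3, 0, 3
beta0, beta1 = -3.0, 1.0
x = np.array([0.0, 3.0, 6.0])

p = sigmoid(beta0 + beta1 * x)
labels = predict_label(x, beta0, beta1)

# Taking the log-odds of the probabilities recovers the linear logits
log_odds = np.log(p / (1.0 - p))

print(p, labels, log_odds)
```

A logit of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the natural default threshold.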
Why don’t we use Sum Squared Error (SSE) from OLS as a cost function in Logistic Regression?
Once we have a reasonable model to address the classification problem, we need to come up with a cost function. Let’s try the sum squared error function we used in the linear regression model.
Is there an analytical solution to minimize sum squared error in logistic regression? The short answer is NO.
We can quickly conclude that it’s a very difficult math problem. The complexity and non-linearity of the function make it impossible to find a closed-form solution.
Can we use gradient descent to minimize sum squared error in logistic regression? The short answer is also NO, at least not reliably.
For gradient descent to be guaranteed to reach the global minimum, the cost function needs to be convex. If we use the sum squared error as the cost function in logistic regression, it ends up being a non-convex function of the parameters that can have multiple local minima, so gradient descent may get stuck away from the global minimum. Finding the global minimum of a non-convex function is a very difficult math problem by itself.
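The non-convexity can be checked numerically on a deliberately tiny case. The sketch below assumes a one-parameter model P = sigmoid(b) and a single hypothetical observation with Y = 1, then applies the midpoint test: a convex function never lies above the average of its endpoint values. The log-loss on the same point is included for comparison.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sse_loss(b):
    """Squared error for one observation with y = 1 and P = sigmoid(b)."""
    return (1.0 - sigmoid(b)) ** 2

def log_loss(b):
    """Negative log-likelihood for the same observation."""
    return -np.log(sigmoid(b))

# Midpoint convexity test on the interval [-6, -2] (interval is an assumption)
a, c = -6.0, -2.0
mid = (a + c) / 2.0

sse_violates_convexity = bool(sse_loss(mid) > (sse_loss(a) + sse_loss(c)) / 2.0)
log_loss_violates_convexity = bool(log_loss(mid) > (log_loss(a) + log_loss(c)) / 2.0)

print(sse_violates_convexity, log_loss_violates_convexity)
```

This only shows that the squared-error curve has a non-convex region for this one point. It does not by itself count local minima, but the flat, non-convex region is enough to make gradient descent unreliable.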
Logistic Regression with MLE and Cross-entropy
As sum squared error can’t be used as the cost function in logistic regression, can we use the MLE method to find the cost function instead? The short answer is Yes.
With Maximum Likelihood Estimation, we would like to maximize the likelihood of observing Y given X under a logistic regression model.
Let P be the probability of occurrence of a particular event (e.g., an email is spam, denoted by Y = 1) given X. Then 1 − P is the probability of Y = 0, given X.
Therefore, for a given observation (i.e., Yi and Xi), we want to
- maximize P if Y = 1
- maximize 1 − P if Y = 0
If we twist this a little bit, we want to
- maximize YP if Y = 1
- maximize (1 − Y)(1 − P) if Y = 0
Then let’s combine the above two functions. We want to
- maximize YP + (1 − Y)(1 − P)
Let’s apply the log function and turn maximizing the function into minimizing its negative. Because Y is either 0 or 1, exactly one of the two terms is nonzero, so log[YP + (1 − Y)(1 − P)] equals Y log(P) + (1 − Y) log(1 − P). We want to
- minimize −[Y log(P) + (1 − Y) log(1 − P)]
Next we can write the likelihood function of observing all points as the product of the likelihood of each point. Then we can rewrite the log of the product of each likelihood as the sum of the log of each likelihood.
We have a fancy name for the above cost function, Binary Cross-entropy.
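A minimal sketch of the binary cross-entropy cost, summed over all observations as described above. The function name, the clipping constant `eps` (used only to avoid log(0)), and the example labels and probabilities are assumptions for illustration.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Sum over observations of -[y * log(p) + (1 - y) * log(1 - p)]."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical labels and predicted probabilities (assumption)
y = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

low = binary_cross_entropy(y, confident_right)
high = binary_cross_entropy(y, confident_wrong)

print(low, high)
```

Predictions that put high probability on the wrong class are penalized far more heavily than confident correct ones, which is the behavior we want from a classification cost.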
Minimize the Cost Function of Binary Cross-entropy in Logistic Regression
Is there an analytical solution to minimize the binary cross-entropy in logistic regression? The short answer is NO.
Similar to sum square error, the complexity and non-linearity of the binary cross-entropy make it impossible to find a closed-form solution.
Can we use gradient descent to minimize the binary cross-entropy in logistic regression? The short answer is Yes.
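Binary cross-entropy is a convex function of the parameters, so gradient descent with a suitable learning rate is guaranteed to approach the global minimum. The steps are the same as the ones for the linear regression model.
What is cool is that the partial derivatives in the gradient descent formula for logistic regression have the same form as the ones for linear regression: Σ (Ŷi − Yi)Xi, up to a constant factor. The only difference is that Ŷi is now the output of the logistic function rather than of the linear equation.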
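To illustrate, here is a sketch of gradient descent on the binary cross-entropy for a single explanatory variable. The update uses the error term (P − Y), the same shape as in the linear case. The learning rate, iteration count, averaging over the number of points, and the overlapping toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p):
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic_gd(x, y, learning_rate=0.1, n_iters=5000):
    """Fit P = sigmoid(beta0 + beta1 * x) by gradient descent on the mean BCE."""
    beta0, beta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = sigmoid(beta0 + beta1 * x) - y   # (prediction - label)
        beta0 -= learning_rate * np.sum(error) / n
        beta1 -= learning_rate * np.sum(error * x) / n
    return beta0, beta1

# Hypothetical data (assumption): the classes overlap, so a finite optimum exists
x = np.arange(8, dtype=float)
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

beta0, beta1 = fit_logistic_gd(x, y)
initial_loss = bce(y, sigmoid(0.0 * x))            # all probabilities 0.5
final_loss = bce(y, sigmoid(beta0 + beta1 * x))

print(beta0, beta1, initial_loss, final_loss)
```

If the two classes were perfectly separable, the loss could keep decreasing as the coefficients grow without bound, so in that situation a stopping rule or regularization would be needed.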
Going Beyond Vanilla Linear Regression
In a linear regression model, we can change the cost function to build different models that lead to a different set of best parameters in a linear model.
For example, in a ridge regression model, we have a cost function of the sum of squared errors plus an L2 penalty: Σ (Yi − Ŷi)² + λ Σ βj².
In a lasso regression model, we have a cost function of the sum of squared errors plus an L1 penalty: Σ (Yi − Ŷi)² + λ Σ |βj|.
In an elastic-net regression model, we have a cost function that combines both penalties: Σ (Yi − Ŷi)² + λ1 Σ |βj| + λ2 Σ βj².
These three variants of linear regression are associated with Regularization, which is a technique to penalize the flexibility and complexity of a model, so as to reduce the risk of overfitting. Here λ (λ1 and λ2 in elastic net) is a hyper-parameter that determines how heavily the flexibility of a model is penalized. We can tune its value using cross-validation and grid search.
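The three regularized cost functions can be sketched as below. These are common textbook forms and may be scaled differently from the article's own equations. Leaving the intercept unpenalized is a widespread convention that I am assuming here, and all function names and data are hypothetical.

```python
import numpy as np

def sse(beta0, beta, X, y):
    """Sum of squared errors for y_hat = beta0 + X @ beta."""
    return float(np.sum((y - (beta0 + X @ beta)) ** 2))

def ridge_cost(beta0, beta, X, y, lam):
    # L2 penalty on the slopes only (intercept left unpenalized by assumption)
    return sse(beta0, beta, X, y) + lam * float(np.sum(beta ** 2))

def lasso_cost(beta0, beta, X, y, lam):
    # L1 penalty on the slopes only
    return sse(beta0, beta, X, y) + lam * float(np.sum(np.abs(beta)))

def elastic_net_cost(beta0, beta, X, y, lam1, lam2):
    # Combination of the L1 and L2 penalties
    l1 = float(np.sum(np.abs(beta)))
    l2 = float(np.sum(beta ** 2))
    return sse(beta0, beta, X, y) + lam1 * l1 + lam2 * l2

# Hypothetical data (assumption) that the coefficients below fit exactly
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta0, beta = 0.0, np.array([1.0, 2.0])

ridge = ridge_cost(beta0, beta, X, y, lam=0.5)
lasso = lasso_cost(beta0, beta, X, y, lam=0.5)
enet = elastic_net_cost(beta0, beta, X, y, lam1=0.5, lam2=0.5)

print(ridge, lasso, enet)
```

Even though these coefficients fit the data perfectly, each cost is above zero, so the optimizer is pushed toward smaller coefficients than plain OLS would choose.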
Going Beyond Vanilla Logistic Regression
A vanilla logistic regression works best when the two classes have roughly the same number of samples. However, there are many real-world situations where that is not the case.
For example, fraudulent transactions usually make up a very small share of credit card transaction data. This class imbalance makes logistic regression tricky. If only 0.1% of transactions are fraudulent, we can simply predict every transaction as normal and end up with 99.9% accuracy. How great is that! But this kind of model would be useless because it would never flag a fraudulent transaction.
To address this class imbalance issue, we can modify our cost function of binary cross-entropy to be −Σ [α1 Yi log(Pi) + α2 (1 − Yi) log(1 − Pi)].
Here α1 and α2 would determine how severely we would penalize the misclassification of classes. Their values are usually computed as the inverse of their frequency in the data.
Going back to our example, α1 = 1 / 0.1% = 1000 and α2 = 1 / 99.9% ≈ 1. So we would penalize a misclassified fraudulent transaction about 1000 times more than a misclassified normal transaction. This small modification can greatly improve the detection of fraudulent transactions.
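The weighting can be sketched as follows, assuming α1 multiplies the positive (fraud) term and α2 the negative (normal) term of the binary cross-entropy. The function names, the 1,000-transaction dataset, and the predicted probability are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(y):
    """Return (alpha1, alpha2): inverse frequencies of class 1 and class 0."""
    pos_rate = float(np.mean(y))
    return 1.0 / pos_rate, 1.0 / (1.0 - pos_rate)

def weighted_bce(y, p, alpha1, alpha2, eps=1e-12):
    """Class-weighted binary cross-entropy, summed over observations."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.sum(alpha1 * y * np.log(p)
                         + alpha2 * (1 - y) * np.log(1 - p)))

# Hypothetical data (assumption): 1 fraudulent transaction out of 1000 (0.1%)
y = np.zeros(1000)
y[0] = 1.0
alpha1, alpha2 = inverse_frequency_weights(y)

# Cost of giving the single fraud case a 1% predicted probability,
# with and without class weights
fraud_y, fraud_p = y[:1], np.array([0.01])
unweighted = weighted_bce(fraud_y, fraud_p, 1.0, 1.0)
weighted = weighted_bce(fraud_y, fraud_p, alpha1, alpha2)

print(alpha1, alpha2, weighted / unweighted)
```

With these weights, the one missed fraud case contributes roughly as much to the cost as all the normal transactions combined, so the optimizer can no longer ignore it.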
Final Notes
Linear regression and logistic regression are two widely used models for regression and classification problems, respectively. Knowing their basic forms associated with Ordinary Least Squares and Maximum Likelihood Estimation helps us understand the fundamentals and explore their variants to address real-world problems, such as model selection and imbalanced classes.
Knowing the math behind the optimization of the cost function is also important for implementing the models efficiently. Keep in mind that a closed-form solution is available only in special cases, such as a model that is linear in its parameters with a squared-error cost. Otherwise, numerical approximation approaches such as Gradient Descent are the usual way to find the optimal parameters in a machine learning model.
Thank you for reading!