URL: https://towardsdatascience.com/back-to-basics-part-tres-logistic-regression-e309de76bd66/

Back to Basics, Part Tres: Logistic Regression | Towards Data Science


An illustrated guide to everything you need to know about Logistic Regression

An illustrated guide on Logistic Regression, with code

Welcome back to the final installment of our Back to Basics series, where we’ll delve into another fundamental machine learning algorithm: Logistic Regression. In the previous two articles, we helped our friend Mark determine the ideal selling price for his 2400 feet² house using Linear Regression and Gradient Descent.

Today, Mark comes back to us again for help. He lives in a fancy neighborhood where he thinks houses below a certain size don’t sell, and he is worried that his house might not sell either. He asked us to help him determine how likely it is that his house will sell.

This is where Logistic Regression comes into play.

Logistic Regression is an algorithm that predicts the probability of a binary outcome, such as whether a house will sell or not. Unlike Linear Regression, whose predictions can be any real number, Logistic Regression outputs a probability between 0% and 100%. Note the difference between the predictions that a linear regression model and a logistic regression model make:

👁 Image

Let’s delve deeper into how logistic regression works by determining the probability of selling houses with varying sizes.

We start our process again by collecting data about house sizes in Mark’s neighborhood and seeing if they sold or not.

👁 Image

Now let’s plot these points:

👁 Image

Since the quantity we are trying to predict is a probability, it’ll be more informative to put probabilities on the y-axis of the plot rather than a binary sold/not-sold label.

We represent 100% probability as 1 and 0% probability as 0:

👁 Image

In our previous article, we learned about linear regression and its ability to fit a line to our data. But can it work for our problem where the desired output is a probability? Let’s find out by attempting to fit a line using linear regression.

We know that the formula for the best-fitting line is:

👁 Image

By following the steps outlined in linear regression, we can obtain optimal values for β₀ and β₁, which will result in the best-fitting line. Assuming we have done so, let’s take a look at the line that we have obtained:

👁 Image

Based on this line, we can see that a house with a size just below 2700 feet² is predicted to have a 100% probability of being sold:

👁 Image

…and a 2200 feet² house is predicted to have a 0% chance of being sold:

👁 Image

…and a 2300 feet² house is predicted to have about a 20% probability of being sold:

👁 Image

Alright, so far so good. But what if we have a house that is 2800 feet² in size?

👁 Image

Uh.. what does a probability above 100% mean? Would a house of this size be predicted to sell with a probability of 150%??

Weird. What about a house that’s 2100 feet²?

👁 Image

Okay, clearly we have run into a problem as the predicted probability for a house with a size of 2100 feet² appears to be negative. This definitely does not make sense, and it indicates an issue with using a standard linear regression line.

As we know, the range of probabilities is from 0 to 1, and we cannot exceed this range. So we need to find a way to constrain our predicted output to this range.
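
To make the problem concrete, here is a minimal sketch of the straight line described above. It assumes the line passes through the two readings quoted in the text (about 0% at 2200 feet² and about 100% at 2700 feet²); those anchor points are approximations, and the helper name `linear_prediction` is ours, not part of the original article.

```python
# Straight line through the two readings quoted above (approximate values):
# about 0% at 2200 ft² and about 100% at 2700 ft².
SIZE_AT_0, SIZE_AT_1 = 2200, 2700


def linear_prediction(size):
    """Value of the fitted straight line at a given house size (ft²)."""
    return (size - SIZE_AT_0) / (SIZE_AT_1 - SIZE_AT_0)


for size in (2100, 2200, 2300, 2700, 2800):
    p = linear_prediction(size)
    in_range = 0 <= p <= 1
    print(f"{size} ft²: {p:.2f} ({'ok' if in_range else 'not a valid probability'})")
```

Any size outside the 2200 to 2700 feet² window lands below 0 or above 1, which is exactly the problem we need to fix.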

To solve this issue, we can pass our linear regression equation through a super cool machine called a sigmoid function. This machine transforms our predicted values to fall between 0 and 1. We input our z value (where z = β₀ + β₁size) into the machine…

👁 Image

…and out comes a fancy-looking new equation that will fit our probability constraints.

NOTE: The e in the output is Euler’s number, a constant approximately equal to 2.718.

A math-ier way of representing the sigmoid function:

👁 Image

If we plot this, we see that the sigmoid function squeezes the straight line into an s-shaped curve confined between 0 and 1.

👁 Image
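
If you want to try the machine yourself, here is a minimal sketch of the sigmoid function, 1 / (1 + e^(-z)), using only Python’s standard library. The two-branch form is our own choice to avoid overflow for very negative inputs; it is algebraically the same function.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an algebraically equivalent form so that
    # math.exp never receives a large positive argument (which would overflow).
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-5, -1, 0, 1, 5):
    print(f"sigmoid({z}) = {sigmoid(z):.3f}")
```

Large negative inputs are pushed toward 0, large positive inputs toward 1, and z = 0 lands exactly at 0.5.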

Optional note for all my math-heads: You might be wondering why and how we used the sigmoid function to get our desired output. Let’s break it down.

We started with the incorrect assumption that using the linear regression formula will give us our desired probability.

👁 Image

The issue with this assumption is that (β₀ + β₁size) has range (-∞,+∞) and p has a range of [0,1]. So we need a quantity derived from p whose range matches that of (β₀ + β₁size).

To overcome this issue, we can equate the line to the "log odds", log(p / (1 - p)) (watch this video to understand log odds better), because we know that the log odds has a range of (-∞,+∞).

👁 Image

Now that we’ve done that, it’s just a matter of rearranging this equation to solve for p.

👁 Image
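
For anyone who wants to follow the algebra line by line, the rearrangement can be written out roughly as follows. This is a sketch that assumes the natural logarithm and uses the same β₀, β₁, and size as above.

```latex
\begin{align*}
\log\!\left(\frac{p}{1-p}\right) &= \beta_0 + \beta_1\,\text{size} \\
\frac{p}{1-p} &= e^{\beta_0 + \beta_1\,\text{size}} \\
p &= (1-p)\,e^{\beta_0 + \beta_1\,\text{size}} \\
p\left(1 + e^{\beta_0 + \beta_1\,\text{size}}\right) &= e^{\beta_0 + \beta_1\,\text{size}} \\
p &= \frac{e^{\beta_0 + \beta_1\,\text{size}}}{1 + e^{\beta_0 + \beta_1\,\text{size}}}
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{size})}}
\end{align*}
```

The last line is the sigmoid function applied to z = β₀ + β₁size.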

Now that we know how to modify the linear regression line so that it fits our output constraints, we can return to our original problem.

We need to determine the optimal curve for our dataset. To achieve this, we need to identify the optimal values for β₀ and β₁ (because these are the only values in the predicted probability equation that will change the shape of the curve).

Similar to linear regression, we will leverage a cost function and the gradient descent algorithm to obtain suitable values for these coefficients. The key distinction, however, is that we will not be employing the MSE cost function used in linear regression. Instead, we will be using a different cost function called Log Loss, which we will explore in greater detail shortly.

Say we used gradient descent and the Log Loss cost (using these steps) and found that our optimal values are β₀ = -120.6 and β₁ = 0.051. Our predicted probability equation would then be:

👁 Image

And the corresponding optimal curve is:

👁 Image

With this new curve, we can now tackle Mark’s problem. By looking at it, we can see that a house with a size of 2400 feet²…

👁 Image

…has a predicted probability of approximately 78%. Therefore, we can tell Mark not to worry because it looks like his house is pretty likely to sell.
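
If you’d rather compute the number than read it off the curve, here is a minimal sketch that plugs the coefficients quoted above into the sigmoid equation. The function name `predicted_probability` is ours.

```python
import math

# Coefficients as quoted above (rounded values from the article).
BETA_0 = -120.6
BETA_1 = 0.051


def predicted_probability(size):
    """Predicted probability that a house of the given size (ft²) sells."""
    z = BETA_0 + BETA_1 * size
    return 1.0 / (1.0 + math.exp(-z))


print(f"{predicted_probability(2400):.2f}")
```

With the coefficients rounded as quoted, this evaluates to roughly 0.86, a little above the value read from the plot. The gap presumably comes from rounding: at house sizes in the thousands, a change in the fourth decimal place of β₁ shifts z noticeably. Either way, the conclusion for Mark is the same.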

We can further enhance our approach by developing a Classification Algorithm. Classification algorithms are commonly used in machine learning to sort data into categories. In our case, we have two categories: houses that will sell and houses that will not sell.

To develop a classification algorithm, we need to define a threshold probability value. This threshold probability value separates the predicted probabilities into two categories, "yes, the house will sell" and "no, the house will not sell." Typically, 50% (or 0.5) is used as the threshold value.

If the predicted probability for a house size is above 50%, it will be classified as "will sell," and if it’s below 50%, it will be classified as "won’t sell."

👁 Image
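
The thresholding step can be sketched in a few lines. This reuses the rounded coefficients quoted earlier; the function name `will_sell` and the decision to count a probability of exactly 50% as "will sell" are our assumptions, since the text only covers the above and below cases.

```python
import math

BETA_0, BETA_1 = -120.6, 0.051  # rounded coefficients quoted earlier


def will_sell(size, threshold=0.5):
    """Classify a house as 'will sell' (True) or 'won't sell' (False)."""
    z = BETA_0 + BETA_1 * size
    probability = 1.0 / (1.0 + math.exp(-z))
    # Assumption: a probability exactly at the threshold counts as 'will sell'.
    return probability >= threshold


# The probability equals 0.5 exactly where z = 0, i.e. size = -β₀ / β₁.
boundary_size = -BETA_0 / BETA_1
print(f"Decision boundary: about {boundary_size:.0f} ft²")
print(will_sell(2400), will_sell(2300))
```

With a 50% threshold, the classifier boils down to a single cutoff on house size: anything larger than the boundary is labeled "will sell."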

And that’s about it. That’s how we can use logistic regression to solve our problem. Now let’s understand the cost function we used to find optimal values for logistic regression.

Cost Function

In linear regression, the cost is based on how far off the line is from our data points. Similarly, in logistic regression, the cost function depends on how far off our predicted probabilities are from the actual outcomes.

If we used the MSE cost function (like we did in linear regression) in logistic regression, we would end up with a non-convex (fancy term for a not-so-pretty-curve-that-can’t-be-used-effectively-in-gradient-descent) cost function curve that can be difficult to optimize.

👁 Image

And as you may recall from our discussion on gradient descent, it is much easier to optimize a convex curve (one with a single minimum and no other dips where gradient descent can get stuck) like this than a non-convex curve.

👁 Image

To achieve a convex cost function curve, we use a cost function called Log Loss.

To break down the Log Loss cost function, we need to define separate costs for when the house actually sold (y=1) and when it did not (y=0).

If y = 1 and we predicted 1 (i.e., 100% probability it sold), there is no penalty. However, if we predicted 0 (i.e., 0% probability it sold), then we get penalized heavily.

👁 Image

Similarly, if y = 0 and we predicted a high probability of the house selling, we should be penalized heavily, and if we predicted a low probability of the house selling, there should be a lower penalty. The more off we are, the more it costs us.

👁 Image
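
To get a feel for how steeply the penalty grows, here is a minimal sketch of the two per-house costs. It assumes the standard Log Loss definitions with the natural logarithm: -log(p) when the house sold and -log(1 - p) when it did not. The function names are ours.

```python
import math


def cost_when_sold(p):
    """Cost of predicting probability p for a house that actually sold (y = 1)."""
    return -math.log(p)


def cost_when_not_sold(p):
    """Cost of predicting probability p for a house that did not sell (y = 0)."""
    return -math.log(1 - p)


for p in (0.99, 0.5, 0.01):
    print(f"p = {p}: y=1 cost = {cost_when_sold(p):.2f}, y=0 cost = {cost_when_not_sold(p):.2f}")
```

A confident, correct prediction costs almost nothing, while a confident, wrong one is penalized far more than a hesitant 50% guess.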

To compute the cost for all houses in our dataset, we can average the costs of all the individual predictions like this:

👁 Image

By cleverly rewriting the two equations, we can combine them into one to give us our Log Loss cost function.

👁 Image

This works because y is always either 0 or 1, so one of the two terms will always be zero and only the other one will be used.
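
Written out in its standard form, the combined cost looks like this. The symbols are our notation: n is the number of houses, yᵢ is 1 if house i sold and 0 otherwise, and pᵢ is the predicted probability for house i.

```latex
\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \Big]
```

For a house that sold, the factor (1 - yᵢ) wipes out the second term and leaves -log(pᵢ); for a house that did not sell, the factor yᵢ wipes out the first term and leaves -log(1 - pᵢ).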

And the combined cost graph looks like this:

👁 Image

Now that we have a good understanding of the math and intuition behind logistic regression, let’s see how Mark’s house size problem can be implemented in Python.
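
The article’s original code is not reproduced in this text, so what follows is a minimal sketch rather than the author’s implementation. It runs gradient descent on the Log Loss in plain Python. The dataset is made up purely for illustration (it is not Mark’s actual neighborhood data), and the learning rate, step count, and all function names are our assumptions.

```python
import math

# Hypothetical data for illustration only (NOT Mark's actual neighborhood):
# house sizes in ft² and whether each house sold (1) or not (0).
sizes = [2000, 2100, 2200, 2300, 2350, 2400, 2500, 2600, 2700, 2800]
sold = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Standardize sizes so plain gradient descent behaves with a simple learning rate.
mean = sum(sizes) / len(sizes)
std = math.sqrt(sum((s - mean) ** 2 for s in sizes) / len(sizes))
xs = [(s - mean) / std for s in sizes]


def sigmoid(z):
    # Written in two branches so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(b0, b1):
    """Average Log Loss over the dataset for coefficients b0, b1."""
    eps = 1e-12  # keeps log() away from exactly 0
    total = 0.0
    for x, y in zip(xs, sold):
        p = min(max(sigmoid(b0 + b1 * x), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def fit(learning_rate=0.1, steps=5000):
    """Gradient descent on the Log Loss, starting from b0 = b1 = 0."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # (prediction - actual) for every house drives both gradient terms.
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, sold)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


def predict(size, b0, b1):
    """Predicted probability of selling for a house of the given size (ft²)."""
    return sigmoid(b0 + b1 * (size - mean) / std)


initial_loss = log_loss(0.0, 0.0)
b0, b1 = fit()
print(f"Log Loss: {initial_loss:.3f} -> {log_loss(b0, b1):.3f}")
print(f"P(sell | 2400 ft²) = {predict(2400, b0, b1):.2f}")
```

Because the sizes are standardized before fitting, the b0 and b1 found here are on a different scale from the β₀ and β₁ quoted earlier and should not be compared directly. In practice, a library such as scikit-learn’s `LogisticRegression` would usually handle the fitting for you; note that it applies regularization by default, so its coefficients can differ from a plain gradient descent fit.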


And we’re done! You have everything you need to tackle a logistic regression problem of your own now.


And as always, please feel free to reach out to me on LinkedIn or shoot me an email at [email protected].


Written By

Shreya Rao
