Simple Logistic Regression in Python

Step-by-Step Guide from Data Preprocessing to Model Evaluation

Mar 30, 2021

logistic regression python cheatsheet (image by author from www.visual-design.net)

What is Logistic Regression?

Don’t let the name fool you: logistic regression is usually categorized as a classification algorithm rather than a regression algorithm.

Then, what is a classification model? Simply put, the prediction generated by a classification model is a categorical value, e.g. cat or dog, yes or no, true or false. In contrast, a regression model predicts a continuous numeric value.

Logistic regression makes predictions based on the sigmoid function, an S-shaped curve as shown below. Although the function returns probabilities, the final output is a label assigned by comparing the probability with a threshold, which is what makes logistic regression a classification algorithm.

simple illustration of sigmoid function (image by author)
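
To make the thresholding step concrete, here is a minimal sketch of the sigmoid function followed by a label assignment. The scores and the 0.5 threshold are illustrative assumptions, not values from the dataset.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear scores (w·x + b) for five observations
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(scores)

# Assumed threshold of 0.5: probabilities above it become the positive label
threshold = 0.5
labels = (probabilities > threshold).astype(int)

print(probabilities.round(3))
print(labels)
```

Raising the threshold makes the model more conservative about predicting the positive class, which is the trade-off the ROC curve explores later in the article.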

In this article, I will walk through the following steps to build a simple logistic regression model using Python and scikit-learn:

Data Preprocessing
Feature Engineering and EDA
Model Building
Model Evaluation

The data is taken from the public Kaggle dataset "Rain in Australia". The objective is to predict the binary target variable "RainTomorrow" based on existing observations such as temperature, humidity, and wind speed. Feel free to grab the full code at the end of the article on my website.

1. Data Preprocessing

First, let’s load the libraries and the dataset.

import libraries and dataset (image by author)

df.head() (image by author)
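
As a rough sketch of the loading step: the helper name load_weather and the file name "weatherAUS.csv" are assumptions, and the three inline rows are made-up stand-ins so that the example runs without the Kaggle download.

```python
import io

import pandas as pd


def load_weather(source):
    """Read the weather data from a CSV path or a file-like object."""
    return pd.read_csv(source)


# Tiny inline stand-in for the Kaggle file (made-up rows). With the real
# download you would pass its path instead, e.g. "weatherAUS.csv" (assumed name).
sample_csv = io.StringIO(
    "Date,Location,MinTemp,Rainfall,RainToday,RainTomorrow\n"
    "2008-12-01,Albury,13.4,0.6,No,No\n"
    "2008-12-02,Albury,7.4,0.0,No,No\n"
    "2008-12-03,Albury,12.9,,No,Yes\n"
)
df = load_weather(sample_csv)

print(df.head())   # first rows of the frame
print(df.shape)    # (rows, columns)
```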

Use df.describe() to get an overview of the raw data.

df.describe() (image by author)

We cannot expect the data we are given to be ready for analysis; in fact, it rarely is. Data preprocessing is therefore crucial, and handling missing values in particular is an essential step in making the dataset usable. We can use the isnull() function to see the extent of the missing data. The following code snippet calculates the percentage of missing values per column.

missing values percentage (image by author)
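
One way to compute this percentage, sketched on a toy frame with made-up values (the column names follow the dataset, everything else is illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "Evaporation": [np.nan, np.nan, 4.8, np.nan],
    "RainTomorrow": ["No", "No", "Yes", None],
})

# isnull() marks missing cells as True; the mean of those booleans per column
# is the share of missing values, which is then scaled to a percentage
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```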

There are four fields with 38% to 48% of their data missing. I dropped these columns because these values are most probably missing not at random. For example, a large number of evaporation figures are missing, which may be due to the limited capacity of the measuring instruments. Consequently, days with more extreme evaporation may not have been recorded in the first place, so the remaining numbers are already biased. For that reason, retaining these fields may contaminate the input data. If you would like to distinguish the three common types of missing data, you may find the article "How to Address Missing Data" helpful.

column-wise and row-wise deletion (image by author)

After performing the column-wise deletions, I deleted the rows that are missing the label, "RainTomorrow", through dropna(). To build a machine learning model, we need labels to train or test the model, so rows with no labels don’t help much with either process. However, this part of the dataset can be set aside as the prediction set once the model has been built. Handling missing data inevitably changes the shape of the data, so df.shape is a handy attribute for keeping track of the data size. After the data manipulation above, the shape changed from 145460 rows and 23 columns to 142193 rows and 19 columns.
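
The two deletions could look roughly like this. The 30% cutoff used to pick columns is my assumption for illustration (the article only says the dropped fields had 38% to 48% missing), and the data is a toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); Evaporation is mostly missing
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5],
    "Evaporation": [np.nan, np.nan, 4.8, np.nan, 6.0],
    "RainTomorrow": ["No", "No", "Yes", None, "Yes"],
})
print(df.shape)  # shape before any deletion

# Column-wise deletion: drop columns whose missing share exceeds an assumed 30% cutoff
missing_share = df.isnull().mean()
cols_to_drop = missing_share[missing_share > 0.30].index.tolist()
df = df.drop(columns=cols_to_drop)

# Row-wise deletion: drop rows that have no label
df = df.dropna(subset=["RainTomorrow"])
print(df.shape)  # shape after both deletions
```

Passing subset to dropna() restricts the check to the label column, so rows with gaps in other features are kept for imputation.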

For the remaining columns, I imputed the categorical variables and the numerical variables separately. The code below classifies the columns into a categorical list and a numerical list, which will also be helpful later in the EDA process.

separate numerical and categorical variables (image by author)
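
A possible way to build the two lists is select_dtypes(); this is a sketch rather than the original code. The name cat_list appears in the article, while num_list and the toy values are assumptions.

```python
import pandas as pd

# Toy frame (made-up values) mixing text and numeric columns
df = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Perth"],
    "MinTemp": [13.4, 18.1, 15.0],
    "Humidity3pm": [22.0, 60.0, 41.0],
    "RainToday": ["No", "Yes", "No"],
})

# Numeric dtypes go to num_list, everything else to cat_list
num_list = df.select_dtypes(include="number").columns.tolist()
cat_list = df.select_dtypes(exclude="number").columns.tolist()

print(num_list)
print(cat_list)
```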

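Numerical Variables: impute missing values with the mean of the variable. Note that combining df.fillna() and df.mean() is enough to fill only the numerical variables, since means exist only for numeric columns. In pandas 2.0 and later, pass numeric_only=True to df.mean() when the frame also contains non-numeric columns.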

address numerical missing values (image by author)

Categorical Variables: iterate through cat_list and replace missing values with "Unknown".

address categorical missing values (image by author)
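
Both imputation steps, sketched on a toy frame with made-up values. The numeric_only=True argument is my addition so that the mean is computed over numeric columns only on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in every column
df = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 20.0],
    "WindGustDir": ["W", None, "NNW"],
    "RainToday": ["No", "Yes", None],
})
cat_list = ["WindGustDir", "RainToday"]

# Numerical variables: fillna() with a Series of column means fills each
# numeric column with its own mean and leaves the other columns untouched
df = df.fillna(df.mean(numeric_only=True))

# Categorical variables: replace the remaining gaps with the placeholder "Unknown"
for col in cat_list:
    df[col] = df[col].fillna("Unknown")

print(df)
```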

2. Feature Engineering and EDA

Coupling these two processes together is beneficial for choosing the appropriate feature engineering techniques based on the distribution and characteristics of the dataset.

In this example, I do not go in depth into the exploratory data analysis (EDA) process. If you are interested in learning more, feel free to read my more comprehensive EDA guide.

Semi-Automated Exploratory Data Analysis (EDA) in Python

I automated the univariate analysis with a for loop. When a numerical variable is encountered, a histogram is generated to visualize its distribution; for a categorical variable, a bar chart is created instead.

univariate analysis (image by author)
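
A loop of this kind could be sketched with plain matplotlib as follows. The plotting library, the number of bins, and the toy values are my assumptions; the original code may well use a different library.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame (made-up values): one numerical and one categorical variable
df = pd.DataFrame({
    "Rainfall": [0.0, 0.2, 0.0, 5.4, 31.0, 0.0],
    "RainToday": ["No", "No", "No", "Yes", "Yes", "No"],
})

fig, axes = plt.subplots(1, len(df.columns), figsize=(10, 4))
for ax, col in zip(axes, df.columns):
    if pd.api.types.is_numeric_dtype(df[col]):
        # Histogram for numerical variables
        ax.hist(df[col].dropna(), bins=10)
    else:
        # Bar chart of category counts for categorical variables
        counts = df[col].value_counts()
        ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(col)
fig.tight_layout()
```

In a notebook you would drop the matplotlib.use("Agg") line and let the figure render inline, or call plt.show().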

1) Address Outliers

Now that we have a holistic view of the data distribution, it is much easier to spot outliers. For instance, Rainfall has a heavily right-skewed distribution, indicating that there is at least one significantly high record.

To remove the outliers, I used quantile(0.9) to keep only the records that fall within the lowest 90% of Rainfall values. As a result, the upper bound of Rainfall dropped sharply from 350 to 6.

address outlier (image by author)
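
A minimal sketch of quantile-based trimming on ten made-up Rainfall values. Whether the cutoff itself is kept (< versus <=) is an assumption here; the article does not say.

```python
import pandas as pd

# Ten made-up Rainfall values with one extreme record
df = pd.DataFrame(
    {"Rainfall": [0.0, 0.0, 0.2, 0.4, 1.0, 1.2, 2.0, 3.6, 6.0, 350.0]}
)

# quantile(0.9) is the value below which roughly 90% of the records fall
upper = df["Rainfall"].quantile(0.9)

# Keep only the rows below that cutoff (strict comparison is an assumption)
df_trimmed = df[df["Rainfall"] < upper]

print(df["Rainfall"].max(), df_trimmed["Rainfall"].max())
```

Keep in mind that this removes whole rows, so about 10% of the observations are discarded along with the extreme values.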

2) Feature Transformation

The Date variable was transformed into Month. Date has such high cardinality that it is impossible to bring out patterns from it, whereas the month may suggest whether rain is more likely in certain months of the year.

date transformation (image by author)
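
The transformation can be sketched with pandas datetime accessors. The dates are made up, and dropping the original Date column afterwards is my assumption.

```python
import pandas as pd

# Made-up dates in the same YYYY-MM-DD text format
df = pd.DataFrame({"Date": ["2008-12-01", "2009-01-15", "2009-07-30"]})

# Parse the text into datetimes, then keep only the month number (1 to 12)
df["Month"] = pd.to_datetime(df["Date"]).dt.month

# The high-cardinality Date column is no longer needed (assumed step)
df = df.drop(columns="Date")

print(df)
```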

3) Categorical Feature Encoding

Logistic regression only accepts numeric values as input, so the categorical data must be encoded as numbers. The most common techniques are one-hot encoding and label encoding. I found that the following article gives an excellent comparison of the two.

One-Hot Encoding vs. Label Encoding using Scikit-Learn

Take RainToday as an example:

RainToday example

Label encoding: better for ordinal data with high cardinality.

label encoding

One-hot encoding: better for non-ordinal data with low cardinality.

one hot encoding

I chose label encoding even though these columns are not ordinal, because most fields have at least 17 unique values and one-hot encoding would make the dataset too wide.

label encoding code (image by author)
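
To compare the two encodings side by side, here is a small sketch on made-up values. Fitting a fresh LabelEncoder per column is one common pattern and an assumption about how the original loop works.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (made-up values) with two categorical columns
df = pd.DataFrame({
    "WindGustDir": ["W", "NNW", "W", "Unknown"],
    "RainToday": ["No", "Yes", "No", "No"],
})
cat_list = ["WindGustDir", "RainToday"]

# One-hot encoding, for comparison: one new 0/1 column per distinct category
one_hot = pd.get_dummies(df, columns=cat_list)
print(one_hot.shape)

# Label encoding: each category becomes an integer code, assigned in sorted order
encoded = df.copy()
for col in cat_list:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

The one-hot frame grows by one column per category, which is why it becomes wide for fields with many unique values. The trade-off is that label encoding gives a non-ordinal variable an arbitrary numeric order.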

Now all variables are transformed into either integer or float.

df.info() (image by author)

4) Feature Selection

If you would like to know the details of feature selection techniques, you may find this helpful:

Feature Selection and EDA in Python

In this exercise, I use Correlation Analysis mentioned in the article above. Correlation Analysis is a common multivariate EDA method that assists in identifying highly correlated variables.

correlation analysis (image by author)

For example, the following groups of variables are highly correlated:

MinTemp, MaxTemp, Temp9am and Temp3pm
Rainfall and RainToday
Pressure9am and Pressure3pm

correlation matrix (image by author)

Since logistic regression assumes little multicollinearity among the predictors, I tried to keep only one variable from each group of highly correlated variables.

feature selection (image by author)
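
One way to automate this selection is sketched below on synthetic data. The 0.8 cutoff and the rule of dropping the later column of each correlated pair are my assumptions; the article selects the columns by inspecting the correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data: Temp9am is built to track MinTemp closely, Humidity3pm is independent
rng = np.random.default_rng(0)
min_temp = rng.normal(12, 5, 200)
df = pd.DataFrame({
    "MinTemp": min_temp,
    "Temp9am": min_temp + rng.normal(3, 1, 200),
    "Humidity3pm": rng.normal(50, 15, 200),
})

# Absolute pairwise Pearson correlations
corr = df.corr().abs()

# Keep only the upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair above the assumed 0.8 cutoff
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df_selected = df.drop(columns=to_drop)

print(to_drop)
```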

3. Model Building

Previously, I mentioned that the objective of this exercise is to predict RainTomorrow. Therefore, the first task is to separate the input features (independent variables, X) from the label (dependent variable, y). df.iloc[:, :-1] is a handy way to grab all rows and all columns except the last one.

independent and dependent variables (image by author)
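
A small sketch of the split into X and y, assuming the label is the last column of the frame (the values are made up):

```python
import pandas as pd

# Toy frame (made-up values) with the label as the last column
df = pd.DataFrame({
    "Humidity3pm": [22.0, 60.0, 41.0],
    "RainToday": [0, 1, 0],
    "RainTomorrow": [0, 1, 1],
})

X = df.iloc[:, :-1]  # all rows, every column except the last
y = df.iloc[:, -1]   # all rows, only the last column

print(X.columns.tolist())
print(y.name)
```

Selecting by position only works if the label really is the last column; selecting by name, e.g. df["RainTomorrow"], is a more defensive alternative.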

Secondly, both the features and the labels are split into a subset for training and another for testing. As a result, four portions are returned: X_train, X_test, y_train, and y_test. To achieve this, we use the train_test_split function and specify the parameter test_size. In the example below, test_size = 0.33, so roughly 2/3 of the data is used for training and 1/3 for testing.

split into train and test (image by author)
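
A minimal sketch of the split on placeholder arrays. The random_state value is an assumption, added only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 observations with 2 features and a binary label
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# test_size=0.33 holds out about one third of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print(X_train.shape, X_test.shape)
```

For an imbalanced target it can also help to pass stratify=y, so that both subsets keep the same share of positive labels.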

Thanks to scikit-learn, we can avoid the tedious process of implementing all the math and algorithms from scratch. Instead, all we need to do is import LogisticRegression from the sklearn library and fit the model to the training data. There is still the flexibility to change the model by specifying several parameters, e.g. max_iter, solver, penalty. More complicated machine learning models usually involve a hyperparameter tuning process that searches through the possible hyperparameter values to find the optimal combination.

For this beginner-friendly model, I only change the max_iter parameter, raising it so that the solver has enough iterations to converge. Once the solver has converged, additional iterations have no effect; overfitting is controlled mainly by the regularization settings (penalty and C).

logistic regression model (image by author)
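
The fitting step could look roughly like this. The synthetic data from make_classification and the value max_iter=500 are assumptions for illustration, not the settings used on the weather data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the weather features
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# max_iter caps the number of solver iterations (500 is an assumed value)
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# Predicted class labels (0 or 1) for the held-out test set
y_pred = model.predict(X_test)

print(model.n_iter_)  # iterations the solver actually used
```

If scikit-learn emits a ConvergenceWarning, raising max_iter or standardizing the features are the usual remedies.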

4. Model Evaluation

ROC, AUC, the confusion matrix, and accuracy are widely used for evaluating logistic regression models. All of these metrics are based on comparing the model’s predictions for the test set (the predicted labels y_pred or, for ROC and AUC, the predicted probabilities) with the actual values y_test. Comparing a predicted label with the actual one has four possible outcomes:

True Positive: it does rain tomorrow when predicted raining
True Negative: it doesn’t rain tomorrow when predicted not raining
False Positive: it doesn’t rain tomorrow when predicted raining
False Negative: it does rain tomorrow when predicted not raining

Confusion Matrix

confusion matrix

I used plot_confusion_matrix() to provide a visual representation that shows the counts of the four scenarios mentioned above. (This function was removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay.from_estimator() instead.) As shown, there are 33122 true negatives, suggesting that the model is good at predicting no rain tomorrow when it is indeed not going to rain. However, it still needs to improve its true positive rate, that is, how often it successfully predicts rain tomorrow (only 2756 true positives).
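
To read the four counts without a plot, a sketch along these lines can help. The eight labels are made up and have nothing to do with the 33122 and 2756 figures above.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels (1 = rain tomorrow, 0 = no rain)
y_test = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]], so ravel() unpacks
# the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```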

Accuracy

accuracy

Accuracy calculates the ratio of correct predictions to all predictions: (true positive + true negative) / (true positive + true negative + false positive + false negative)
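
As a quick check of the formula, the sketch below computes accuracy by hand on made-up labels and compares it with scikit-learn’s accuracy_score().

```python
from sklearn.metrics import accuracy_score

# Made-up actual and predicted labels (1 = rain tomorrow, 0 = no rain)
y_test = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Count the four outcomes by comparing each actual/predicted pair
pairs = list(zip(y_test, y_pred))
tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

# Correct predictions divided by all predictions
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(manual_accuracy, accuracy_score(y_test, y_pred))
```

On an imbalanced target, a model that always predicts the majority class can still reach a high accuracy, which is one reason to look at the confusion matrix and AUC as well.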

ROC and AUC

ROC, AUC illustration (image by author)

The ROC curve plots the true positive rate against the false positive rate at various thresholds. For example, the point indicates the true positive rate and false positive rate when the threshold is set to 0.7, that is, RainTomorrow = Yes when the predicted probability is greater than 0.7. As the probability threshold drops to 0.4, more cases are predicted as positive (RainTomorrow = Yes), so both the true positive rate and the false positive rate go up. AUC stands for area under the curve; different models have different ROC curves and therefore different AUC scores. In this example, model 2 is the better model: at the same false positive rate it has a higher true positive rate, which gives it a larger AUC than model 1.

Three functions are used to plot ROC and calculate AUC:

predict_proba(): generates the probability score for each instance
roc_curve(): returns the false positive rates, true positive rates, and thresholds, which are needed to plot the curve
roc_auc_score(): calculates the AUC score

ROC, AUC code (image by author)
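
The three functions fit together roughly as follows. The data is synthetic (make_classification), and max_iter=500 and the 0.7 threshold are illustrative assumptions; the plotting call is left out to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the weather features
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
model = LogisticRegression(max_iter=500).fit(X_train, y_train)

# predict_proba() returns one column per class; column 1 is the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# One (fpr, tpr) point per candidate threshold; these are the curve's coordinates
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Area under the ROC curve, computed from the probabilities rather than the labels
auc = roc_auc_score(y_test, y_proba)

# Applying a custom threshold of 0.7 instead of the default 0.5
y_pred_strict = (y_proba > 0.7).astype(int)

print(round(auc, 3))
```

Plotting fpr against tpr, for example with matplotlib’s plt.plot(fpr, tpr), reproduces the curve.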

Take-Home Message

This article covers fundamental steps in a logistic regression model building process:

Data Preprocessing: with the focus on missing value imputation
Feature Engineering and EDA: univariate analysis and multivariate analysis; handling outliers and feature transformation
Model Building: split the dataset and fit a logistic regression model to the training data
Model Evaluation: confusion matrix, accuracy, ROC, and AUC

However, this is just a basic guide that aims to give you a quick grasp of implementing logistic regression. There is ample room to improve the current model by introducing hyperparameter tuning, feature importance analysis, and standardization. As always, let’s keep learning.

URL: https://towardsdatascience.com/simple-logistic-regression-using-python-scikit-learn-86bf984f61f1/