Multicollinearity occurs when two or more independent variables are highly correlated, which leads to unstable coefficient estimates and reduces model reliability. This makes it difficult to identify the individual effect of each predictor on the dependent variable. The Variance Inflation Factor (VIF) is used to detect multicollinearity in regression analysis. In this article, we'll look at what VIF is and how to use it in Python to identify multicollinearity.
VIF shows how much the variance of a regression coefficient increases due to multicollinearity. For each variable, we run a regression where that variable becomes the dependent variable and the remaining variables act as predictors. This gives an R-squared (R²) value that tells how well one variable can be predicted using the others.
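The formula for VIF is:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the R-squared value obtained by regressing the i-th variable on all the remaining variables.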
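Since VIF increases as R² increases, a higher VIF indicates higher multicollinearity. In practice, a VIF of 1 means the variable is not correlated with the other predictors, values between 1 and 5 are commonly read as moderate correlation, and values above 5 (or, under a looser rule of thumb, above 10) are commonly treated as a sign of problematic multicollinearity.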
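To make the definition concrete, the following is a minimal sketch of computing VIF by hand: each column is regressed on the others, the resulting R² is read off, and 1 / (1 - R²) is applied. The helper name `manual_vif` and the synthetic columns `x1`, `x2`, `x3` are illustrative assumptions, not part of the original article.

```python
import numpy as np

def manual_vif(X):
    """Return the VIF of each column of X using auxiliary regressions."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]                            # column j becomes the target
        others = np.delete(X, j, axis=1)       # remaining columns are predictors
        A = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))   # VIF = 1 / (1 - R^2)
    return np.array(vifs)

# Synthetic data (assumed for illustration): x2 is almost a copy of x1,
# while x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

vifs = manual_vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup the first two values should come out large and the third close to 1, since only `x1` and `x2` carry overlapping information.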
Understanding this formula helps us correctly spot multicollinearity and decide if we should remove or combine variables.
To detect multicollinearity in regression analysis, we can compute the Variance Inflation Factor (VIF) with the variance_inflation_factor function from the statsmodels library. Called once per feature, it returns the VIF value of that feature, helping us identify multicollinearity in the dataset.
Syntax :
statsmodels.stats.outliers_influence.variance_inflation_factor(exog, exog_idx)
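Parameters: exog is the design matrix containing all explanatory variables of the regression, one column per variable, and exog_idx is the integer index of the column in exog whose VIF is to be computed.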
Consider a dataset of 500 individuals containing their gender, height, weight and Body Mass Index (BMI). Here, Index (the BMI index) is the dependent variable and Gender, Height and Weight are the independent variables. We will use the Pandas library to load and handle the data.
You can download the dataset from here.
Output:
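Here we use the following approach: convert the categorical Gender column into numeric form, select Gender, Height and Weight as the independent variables, and pass each feature's column index to variance_inflation_factor() to obtain its VIF.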
Output :
High VIF values for Height and Weight show strong multicollinearity between these two variables, which makes sense because a person's height influences their weight. Detecting such relationships helps us understand and improve the stability of our regression models.
Here are several effective strategies to address high VIF values and improve model performance:
1. Removing Highly Correlated Features: Drop one of the correlated features, preferably the one that is less important or has the higher VIF. Removing such features reduces redundancy and improves model interpretability and stability.
2. Combining Variables or Using Dimensionality Reduction Techniques