The enigma of Adjusted R Squared
Why data scientists trust Adjusted R Squared over R Squared in regression analysis
"Truth does not consist in minute accuracy of detail; but in conveying a right impression."
Henry Alford
Introduction
Regression analysis is one of the most fundamental yet powerful machine learning techniques. It is still widely used and has paved the way for many advanced lines of research in the industry. Although several advanced regression techniques (bagging, boosting, support vector regression, and so on) already serve the purpose of predicting continuous variables, linear regression remains the first choice for most researchers when the data follows a linear pattern in a multidimensional space. One of the most widely used evaluation metrics for linear regression models is R Squared, also known as the coefficient of determination. R Squared is a goodness-of-fit metric that usually lies between 0 and 1. A higher value of R Squared is taken to indicate a better fit and a higher predictive ability of the model.
But like most other evaluation metrics in machine learning, R Squared has limitations that can make it misleading: it sometimes assigns an extremely high value to a poor model. In this article, we will discuss how R Squared is calculated, its limitations, and how these limitations can be overcome with a refined evaluation metric called Adjusted R Squared.
Table of contents
- The intuition of R Squared in regression analysis
- Limitations of R Squared
- Importance of Adjusted R Squared
The intuition of R Squared in regression analysis
We will start with an example use case. Consider that we have a machine learning problem to predict the height of a person using his/her weight, father’s height, and mother’s height as independent variables.
We have the following data for research (the ten records are listed in the code further below)-
Here, our target variable is Height and predictor variables are –
- Father’s height (in cm)
- Mother’s height (in cm)
- Weight of the person (in kg)
It is clear from the data that there exists a linear relationship between the predictor variables and the target variable. Hence, multiple linear regression is a natural choice for building the model to serve our prediction purpose.
Let’s assume a train ratio of 0.7: the first 7 records are the train data and the remaining 3 records are the test data.
We have fitted our linear regression model, and now we need to evaluate it to know how close our predictions are to reality.
This is where R Squared comes in, one of the most popular evaluation metrics for measuring how closely the predictions match the observed values.
R Squared = 1- (SSR/SST)
where,
SSR = Sum of Squared Residuals
SST = Sum of Squared Total
Consider that our prediction for the test data is as follows-
Let’s calculate the R Squared for our model using the sklearn library (we will discuss the intuition behind its mathematical derivation after that).
# Import necessary packages and libraries
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create input data as a dictionary
input_dict = {
    "PersonId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Father's height": [136.5, 149.8, 174.07, 168.05, 185.8, 170.45, 180.75, 148.15, 154.46, 158.11],
    "Mother's height": [126.5, 143.8, 167.07, 165.05, 182.8, 160.45, 170.75, 140.25, 148.46, 147.11],
    "Weight": [50, 60, 79, 85, 60, 65, 75, 55, 62, 67],
    "Person's Height": [116.5, 139.8, 184.07, 198.05, 145.8, 160.45, 180.75, 128.15, 144.46, 156.11],
}

# Convert dictionary into a pandas dataframe
data = pd.DataFrame(input_dict)

# Separate the predictors from the target variable (PersonId is only an identifier)
features = ["Father's height", "Mother's height", "Weight"]
target = "Person's Height"

# Split the data: first 7 records for training, last 3 records for testing
X_train, X_test = data[features].head(7), data[features].tail(3)
y_train, y_test = data[target].head(7), data[target].tail(3)

# Perform linear regression using sklearn library
regressor = LinearRegression()
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

# sklearn's built-in method for computing the R Squared of the model on the test data
rsquared = regressor.score(X_test, y_test)

# Predictions for the test data
print(predictions)
# R Squared of the model
print(rsquared)
Here, R Squared = 0.963. As per the characteristics of this metric, this looks like a very good value.
But is it enough to be confident about the predictive ability of this model?
No.
Let’s also check the Adjusted R Squared for this model-
# Adjusted R Squared of the model
n = len(data)              # number of records
p = len(data.columns) - 2  # number of features, i.e. columns excluding the unique ID and the target variable
adjr = 1 - (1 - rsquared) * (n - 1) / (n - p - 1)
print(adjr)
Oops! It is lower than R Squared. There is a drop of around 2 percentage points, from R Squared (0.963) to Adjusted R Squared (0.945).
- Why was there a drop in Adjusted R Squared?
- What does this difference really mean?
- How will it show up in real-world use cases?
- Does R Squared always lie between 0 and 1, or are there exceptional cases that we often miss?
Let’s find the answers…
Limitations of R Squared
R Squared = 1- (SSR/SST)
Here, SST stands for Sum of Squared Total. It measures how much the observed data points vary from the mean of the target variable. The mean plays the role of the simplest possible regression line here.
SST = Sum (Square (Each data point – Mean of the target variable))
Mathematically,
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
where,
n = Number of observations
y_i = Observed value of the target variable for the i-th observation
y̅ = Mean value of the target variable
For example,
If we want to build a regression model to predict the height of a person with weight as the independent variable, then the simplest possible prediction is the mean height of all persons in our sample. The red line in the following diagram shows the mean height of all the persons in our sample.
Now, coming to SSR:
SSR stands for Sum of Squared Residuals. The residuals are calculated from the model we built with our chosen mathematical approach (linear regression, Bayesian regression, polynomial regression, or any other approach). If we use a more sophisticated approach than the naive mean, our accuracy should increase.
SSR = Sum (Square (Each data point – Corresponding predicted point on the regression line))
Mathematically,
$$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where,
n = Number of observations
y_i = Observed value of the target variable for the i-th observation
ŷ_i = Predicted value of the target variable for the i-th observation
In the above diagram, let’s consider that the blue line indicates the predictions from a sophisticated model with a high-level mathematical analysis. We can see that it has higher accuracy than the red line.
Now come to the formula,
R Squared = 1- (SSR/SST)
Here,
- SST will be a large number because the mean is a very poor model (red line).
- SSR will be a small number because it comes from the best model we developed after much mathematical analysis (blue line).
- So, SSR/SST will be a very small number (it gets smaller whenever SSR decreases).
- So, 1- (SSR/SST) will be close to 1.
- So we can infer that the higher the R Squared, the better the model fits the data.
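The chain of reasoning above can be traced with a few lines of NumPy. This is only a minimal sketch, not the code used elsewhere in this article: it takes the ten weight and height values from the example data, fits a one-variable straight line with `np.polyfit`, and evaluates both sums on the same ten points. The variable names are chosen here for illustration.

```python
import numpy as np

# Weight (kg) and height (cm) columns from the article's example data
weight = np.array([50, 60, 79, 85, 60, 65, 75, 55, 62, 67], dtype=float)
height = np.array([116.5, 139.8, 184.07, 198.05, 145.8,
                   160.45, 180.75, 128.15, 144.46, 156.11])

# Baseline "model": always predict the mean height (the red line)
sst = np.sum((height - height.mean()) ** 2)

# Fitted model: least-squares straight line through the points (the blue line)
slope, intercept = np.polyfit(weight, height, deg=1)
predicted = slope * weight + intercept
ssr = np.sum((height - predicted) ** 2)

# R Squared = 1 - (SSR / SST)
r_squared = 1 - ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, R Squared = {r_squared:.3f}")
```

Because the line is fitted and scored on the same points and includes an intercept, SSR cannot exceed SST here, so this R Squared stays between 0 and 1.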
This is the generic case, but the inference does not carry over to many cases where multiple independent variables are present. In the example, we had only one independent variable and one target variable, but in a real case we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of these hundreds of independent variables-
- Some variables will have a very high correlation with the target variable.
- Some variables will have a very small correlation with the target variable.
- Also, some independent variables will not correlate at all.
If there is no correlation, the model will still try to establish a relationship between the dependent and independent variables. It proceeds with the mathematical calculations on the assumption that the researcher has already eliminated the unwanted independent variables.
For example,
For predicting the height of a person, we will have the following independent variables
- Weight ( High correlation )
- Phone number( No correlation )
- Location ( Low correlation )
- Age ( High correlation )
- Gender ( Low correlation )
Here, weight and age alone are enough to build an accurate model, but the model will assume that the phone number also influences the height and will represent it in the multidimensional space. When a regression plane is fitted through these 5 independent variables, its gradient, intercept, cost, and residuals automatically adjust to fit the training data as closely as possible. When the fit improves artificially in this way, R Squared also increases.
In such scenarios, the regression plane passes very close to all the original data points in the multidimensional space. This makes the SSR a very small number, which in turn makes the R Squared very high, but when test data is introduced, such models fail miserably.
That is the reason why a high R Squared value does not guarantee an accurate model.
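The effect described above can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not a result from the article's data set: the data is randomly generated, there is one genuinely useful predictor, and the remaining columns are pure noise that play the role of the phone number. The helper names `r_squared` and `with_intercept` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def r_squared(y_true, y_pred):
    """R Squared = 1 - SSR/SST."""
    ssr = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ssr / sst

def with_intercept(X):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(X)), X])

# Synthetic data (made up): column 0 is useful, the other 15 columns are noise
n_train, n_test, n_noise = 20, 200, 15
X_train = rng.normal(size=(n_train, 1 + n_noise))
X_test = rng.normal(size=(n_test, 1 + n_noise))
y_train = 3 * X_train[:, 0] + rng.normal(size=n_train)
y_test = 3 * X_test[:, 0] + rng.normal(size=n_test)

train_scores, test_scores = [], []
for k in (0, 5, 10, 15):
    cols = slice(0, 1 + k)  # the useful predictor plus the first k noise columns
    coef, *_ = np.linalg.lstsq(with_intercept(X_train[:, cols]), y_train, rcond=None)
    train_scores.append(r_squared(y_train, with_intercept(X_train[:, cols]) @ coef))
    test_scores.append(r_squared(y_test, with_intercept(X_test[:, cols]) @ coef))
    print(f"{k:2d} noise columns: train R2 = {train_scores[-1]:.3f}, "
          f"test R2 = {test_scores[-1]:.3f}")
```

The training R Squared can only stay the same or rise as columns are added, because each larger model contains the smaller one. The test score carries no such guarantee and will usually fall as noise columns accumulate.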
Importance of Adjusted R Squared
To overcome the challenge mentioned above, we have an additional metric called Adjusted R Squared.
Adjusted R Squared = 1 – [ ( (1 – R Squared) × (n-1) ) / (n-p-1) ]
where,
- p = number of independent variables.
- n = number of records in the data set.
For a simple representation, we can rewrite the above formula like this-
Adjusted R Squared = 1 – (A × B)
where,
- A = 1 – R Squared
- B = (n-1) / (n-p-1)
From the above formula, we can draw the following inferences (assuming n > p + 1)-
- When the number of predictor variables (p) increases, the denominator (n-p-1) shrinks, so the value of B increases. B is always at least 1.
- When the value of R Squared increases, the value of A decreases.
- Hence, adding a predictor pulls the product A × B in two directions: A gets smaller if R Squared improves, while B gets larger because p has grown.
- If the new predictor improves R Squared only marginally, the growth in B outweighs the drop in A, the product A × B increases, and Adjusted R Squared goes down. This is the penalty.
- Since B is at least 1, subtracting the product A × B from 1 always gives a value less than or equal to R Squared. The two are equal only when p = 0 or R Squared is exactly 1.
- Not only the difference between R Squared and Adjusted R Squared, but also the value of Adjusted R Squared itself can be used as a goodness-of-fit metric that addresses the limitations of R Squared when evaluating the expected consistency of the model.
In summary, whenever the number of independent variables increases, the formula applies a penalty, so the value comes down unless the new variables genuinely improve the fit. Adjusted R Squared is therefore far less inflated by the mere addition of independent variables. Hence, it indicates the performance of the model more accurately than R Squared.
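To see the penalty in numbers, here is a small sketch that holds R Squared fixed at the 0.963 from the example and only varies the number of predictors. The function name `adjusted_r2` and the guard on n are choices made for this illustration. In the formula, B = (n-1)/(n-p-1) grows as p grows, so the adjusted value falls.

```python
def adjusted_r2(r_squared, n, p):
    """Adjusted R Squared for n records and p independent variables."""
    if n - p - 1 <= 0:
        raise ValueError("Adjusted R Squared needs more than p + 1 records")
    a = 1 - r_squared          # A: share of variance left unexplained
    b = (n - 1) / (n - p - 1)  # B: penalty factor, grows as p grows
    return 1 - a * b

# Same R Squared and same 10 records, increasing number of predictors
for p in (1, 3, 5, 8):
    print(f"p = {p}: Adjusted R Squared = {adjusted_r2(0.963, n=10, p=p):.4f}")
```

The guard matters in practice: with only 3 records and 3 predictors, n-p-1 would be negative and the formula would no longer be meaningful.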
Can R Squared be negative?
Yes. It can also be negative in some rare scenarios.
Recall that R Squared = 1 – ( SSR / SST )
It is calculated on the assumption that the average line of the target, which is a horizontal line (perpendicular to the y-axis), is the worst fit a model can reasonably have. SST is the sum of squared differences between this average line and the original data points. Similarly, SSR is the sum of squared differences between the data points predicted by the model plane and the original data points.
SSR/SST gives a ratio that indicates how large SSR is relative to SST. If your model can build a plane that is at least somewhat better than this baseline, then SSR < SST, which is what happens in the vast majority of cases. Substituting this into the equation makes R Squared positive.
But what if SSR > SST? This means that your regression plane is worse than the mean line. In this case, R Squared will be negative. This is rare, and it typically shows up when the model is scored on data it was not fitted on, such as a test set.
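A quick numerical check makes the negative case concrete. In this sketch the observed values are the three test heights from the example data, while the predictions are deliberately poor, made-up numbers rather than the output of any model in this article.

```python
import numpy as np

# Observed heights of the three test records from the example data
y_true = np.array([128.15, 144.46, 156.11])
# Hypothetical, deliberately poor predictions (made up for illustration)
y_pred = np.array([160.0, 130.0, 140.0])

ssr = np.sum((y_true - y_pred) ** 2)         # error of the "model"
sst = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean line
r_squared = 1 - ssr / sst

print(f"SSR = {ssr:.1f}, SST = {sst:.1f}, R Squared = {r_squared:.2f}")
```

Here SSR is larger than SST, so predicting the mean for every record would have been better than these predictions, and R Squared comes out below zero.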
Conclusion
Despite being a well-known and widely accepted performance evaluation measure, R Squared can deliver misleading inferences under conditions that are outside its scope. However, it has to be accepted that there is no magic wand that can completely represent the inherent quality of a regression model. Adjusted R Squared is a metric that overcomes the limitations of R Squared to a great extent, and that remains the prime reason it is a favorite of data scientists across the globe.
Although it is outside the scope of this article, please have a look at some other performance evaluation metrics that are commonly used in regression and forecasting, such as MAE, MSE, RMSE, and MAPE. They will give you a more rounded perspective on evaluating models for continuous variables, beyond what we have discussed here.
I hope you now have an intuitive understanding of the principle and derivation of R Squared and Adjusted R Squared, and of how to apply them in the right places at the right time.
Thanks for reading!!!