VOOZH about

URL: https://www.analyticsvidhya.com/blog/2013/12/residual-plots-regression-model/

⇱ Residual Diagnostics | Residual Plot Linear Regression


India's Most Futuristic AI Conference Is Back – Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

Diagnosing residual plots in linear regression models

Tavish Srivastava Last Updated : 21 Nov, 2024
5 min read

My first analytics project involved predicting business from each sales agent and coming up with a targeted intervention for each agent.

I built my first linear regression model after devoting a good amount of time on data cleaning and variable preparation. Now was the time to access the predictive power of the model. I got a MAPE of 5%, Gini coefficient of 82% and a high R-square. Gini and MAPE are metrics to gauge the predictive power of linear regression model. Such Gini coefficient and MAPE for an insurance industry sales prediction are considered to be way better than average. To validate the overall prediction we found the aggregate business in an out of time sample. I was shocked to see that the total expected business was not even 80% of the actual business. With such high lift and concordant ratio, I failed to understand what was going wrong. I decided to read more on statistical details of the model. With a better understanding of the model, I started analyzing the model on different dimensions. After a close examination of residual plots, I found that one of the predictor variables had a square relationship with the output variable.

πŸ‘ 3d doctor

Since then, I validate all the assumptions of the model even before reading the predictive power of the model. This article will take you through all the assumptions in a linear regression and how to validate assumptions and diagnose relationship using residual plots.

[stextbox id=”section”]Assumptions of Linear Regression Model :[/stextbox]

There are number of assumptions of a linear regression model. In modeling, we normally check for five of the assumptions. These are as follows :

1. Relationship between the outcomes and the predictors is linear.
2. Error term  has mean almost equal to zero for each value of outcome.
3. Error term  has constant variance.
4. Errors are uncorrelated.
5. Errors are normally distributed or we have an adequate sample size to rely on large sample theory.

The point to be noted here is that none of these assumptions can be validated by R-square chart, F-statistics or any other model accuracy plots. On the other hand, if any of the assumptions are violated, chances are high that accuracy plot can give misleading results.

[stextbox id=”section”]How to use residual for diagnostics :[/stextbox]

Residual analysis is usually done graphically. Following are the two category of graphs we normally look at:

1. Quantile plots : This type of is to assess whether the distribution of the residual is normal or not. The graph is between the actual distribution of residual quantiles and a perfectly normal distribution residuals. If the graph is perfectly overlaying on the diagonal, the residual is normally distributed. Following is an illustrative graph of approximate normally distributed residual.πŸ‘ Quantile_plot

Let’s try to visualize a quantile plot of a biased residual distribution.

πŸ‘ skewed

In the graph above, we see the assumption of the residual normal distribution being clearly violated.

2. Scatter plots:  This type of graph is used to assess model assumptions, such as constant variance and linearity, and to identify potential outliers. Following is a scatter plot of perfect residual distribution

πŸ‘ Scatter_plot

Let’s try to visualize a scatter plot of residual distribution which has unequal variance.

πŸ‘ Scatter_plot_2

In the graph above, we see the assumption of the residual normal distribution being clearly violated.

[stextbox id=”section”]Example :[/stextbox]

For simplicity, I have taken an example of single variable regression model to analyze residual curves. Similar kind of approach is followed for multi-variable as well.

Say, the actual relation of the predictor and the output variable is as follows:

πŸ‘ Actual eq

Ignorant of the type of relationship, we start the analysis with the following equation.

πŸ‘ Example2

Can we diagnose this misfit using residual curves?

After making a comprehensive model, we check all the diagnostic curves. Following is the Q-Q plot for the residual of the final linear equation.

πŸ‘ Example3

Q-Q plot looks slightly deviated from the baseline, but on both the sides of the baseline. This indicated residuals are distributed approximately in a normal fashion.

Following is the scatter plot of the residual :

πŸ‘ Example4

Clearly, we see the mean of residual not restricting its value at zero. We also see a parabolic trend of the residual mean. This indicates the predictor variable is also present in squared form. Now, let’s modify the initial equation to  the following equation :

πŸ‘ Example5

Following is the new scatter plot for the residual of the new equation :

πŸ‘ Example6

We now clearly see a random distribution and a approximate zero residual mean.

[stextbox id=”section”]End Notes:[/stextbox]

Every linear regression model should be validated on all the residual plots . Such regression plots directionaly guides us to the right form of equations to start with. You might also be interested in the previous article on regression ( https://www.analyticsvidhya.com/blog/2013/10/trick-enhance-power-regression-model-2/ )

Do you think this provides a solution to any problem you face? Are there any other techniques you use to detect the right form of relationship between predictor and output variables ? Do let us know your thoughts in the comments below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails or like our facebook page.

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

Login to continue reading and enjoy expert-curated content.

Free Courses

AI Interview Questions & Answers Masterclass

Master AI interview questions with expert answers.

Building and Evaluating RAG System

Learn to build RAG system applications, create AI agents, and deploy.

Build Products 10x Faster with GenAI : Hands On

Master prompt engineering,build AI apps with LangChain & deploy custom GPTs.

Model Deployment using FastAPI; Prepare, Train, and Test FastAPI Application

Deploy a fastapi machine learning model with XGBoost and Docker APIs.

Build Data Pipelines with Apache Airflow

Learn ETL pipeline building and workflow orchestration with Airflow.

Responses From Readers

Tuhin Chattopadhyay

Good one Kunal, though I prefer to use designated tests to verify each of the assumptions.

123 2
Tavish Srivastava

Tuhin, I agree doing designated test to verify each assumption is a more sturdy approach. However, using the approach mentioned in the article will tell you two facts. First, whether assumption is validated or not. Second, if assumptions are not validated, what will be the best starting point for the next iteration. For instance, in the example mentioned in the article, residual plot gave us a better starting equation. Please share your thoughts of how will designated tests help on the second point. Thanks, Tavish

123 456
Kunal Jain

Tuhin, While the designated tests would provide you the output, they do not provide as much details about why things have failed. Also, even if you use designated tests, it makes sense to understand what has failed. Thanks, Kunal

123 456
Tuhin Chattopadhyay

Sorry, it's by Tavish. Liked it Tavish.

Deepak Rai

Hi, How to we check whether Error Terms are independent to each other i.e. they are not correlated with one another (Assumption of Independence of the Error Terms) and Co-linearity of the independent variables? Regards, Deepak Rai

123 1

Deepak, This will be visible from your residual scatter plot with independent variables. You will see that your error will show you autocorrelation i.e. you will that a positive error tend to follow a positive error and vice versa. Hope this helps. Tavish

123 456

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
πŸ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
πŸ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

πŸ‘ Popup Banner
πŸ‘ AI Popup Banner