In statistical modeling, it's important to understand how each piece of data affects the overall picture. Cook's Distance is a way to measure how much each point in a regression analysis influences the final results. Named after the statistician R. Dennis Cook, Cook's Distance helps us pinpoint which data points have a big impact on the analysis. By showing us which points matter most, Cook's Distance helps us make better decisions about our data and our models.
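Cook's Distance $D_i$ for the ith observation in a regression model with p estimated coefficients (the predictors plus the intercept) is calculated using the formula:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot \mathrm{MSE}}$$

Where,
$\hat{y}_j$ is the fitted value for observation j from the model fit on all n observations,
$\hat{y}_{j(i)}$ is the fitted value for observation j from the model refit without observation i,
p is the number of estimated regression coefficients, including the intercept,
MSE is the mean squared error of the full model, that is, the residual sum of squares divided by n - p.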
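To make the definition concrete, Cook's Distance can be computed directly by refitting the model once per deleted observation and comparing the fitted values with those of the full model. The following is a minimal sketch, assuming the built-in mtcars dataset and the mpg ~ wt + hp + disp model used later in this article; the names fit and manual_d are illustrative and not taken from the original.

```r
# Full model on the built-in mtcars data
fit  <- lm(mpg ~ wt + hp + disp, data = mtcars)
n    <- nrow(mtcars)
p    <- length(coef(fit))                 # estimated coefficients, incl. intercept
mse  <- sum(residuals(fit)^2) / (n - p)   # mean squared error of the full model
yhat <- fitted(fit)                       # fitted values from the full model

# Leave-one-out: refit without observation i, then predict for all n observations
manual_d <- sapply(seq_len(n), function(i) {
  fit_i  <- lm(mpg ~ wt + hp + disp, data = mtcars[-i, ])
  yhat_i <- predict(fit_i, newdata = mtcars)
  sum((yhat - yhat_i)^2) / (p * mse)
})
names(manual_d) <- rownames(mtcars)

# Compare against R's built-in implementation
print(all.equal(manual_d, cooks.distance(fit)))
```

If the two computations agree, all.equal() returns TRUE. In practice cooks.distance() is the better choice, since it works from a single fit instead of refitting the model n times.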
Now we will implement the Cook's Distance formula in the R programming language.
The lm() function fits a linear regression model in which mpg is the dependent variable and wt, hp, and disp are the independent variables.
cooks.distance() computes Cook's Distance for each observation based on the fitted model. The resulting vector, cooksd, holds one Cook's Distance value for each observation in the mtcars dataset.
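The code listing for this step is not shown in the text. A minimal sketch consistent with the description above might look like the following; the object name model is an assumption, while cooksd follows the text.

```r
# mtcars ships with R in the datasets package
data(mtcars)

# Regress mpg on weight, horsepower, and displacement
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# One Cook's Distance per observation, named by car
cooksd <- cooks.distance(model)
print(cooksd)
```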
Output:
              Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
           1.152035e-02        4.621112e-03        1.598334e-02        1.283888e-04
      Hornet Sportabout             Valiant          Duster 360           Merc 240D
           1.839055e-03        1.560119e-02        1.053270e-02        1.313511e-02
               Merc 230            Merc 280           Merc 280C          Merc 450SE
           2.525382e-03        3.671067e-03        2.606104e-02        1.551454e-03
             Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
           1.049983e-04        5.648180e-03        7.218880e-05        1.298764e-02
      Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
           3.199707e-01        1.196019e-01        9.092102e-03        1.529771e-01
          Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
           2.215865e-02        4.218196e-02        4.909944e-02        7.181085e-03
       Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
           6.980693e-02        4.163138e-04        1.732523e-06        5.959750e-02
         Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
           7.279943e-03        1.100867e-02        3.402911e-01        8.796726e-03

Here the output shows the Cook's Distance value for each observation in the mtcars dataset.
To visualize Cook's Distance from a linear regression model on the mtcars dataset in R, you can create a plot that highlights influential points.
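The plotting code is not shown in the text, so the following is only a sketch of one way to produce such a plot. It assumes the ggplot2 package is installed. The data frame cooks_df, its column names, the fill color, and the 4/n cutoff used for the red dashed line are all assumptions; the original does not state which threshold it uses.

```r
library(ggplot2)  # assumed to be installed

model  <- lm(mpg ~ wt + hp + disp, data = mtcars)
cooksd <- cooks.distance(model)

# One row per car so ggplot2 can map names and distances to the axes
cooks_df <- data.frame(
  car    = names(cooksd),
  cooksd = as.numeric(cooksd)
)

# Assumed cutoff: the common 4/n rule of thumb (not stated in the text)
threshold <- 4 / nrow(mtcars)

p <- ggplot(cooks_df, aes(x = car, y = cooksd)) +
  geom_col(fill = "steelblue") +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance for the mtcars regression",
       x = "Observation", y = "Cook's Distance") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # keep car names readable

print(p)
```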
Output:
ggplot() is used to create a bar plot of Cook's Distance.
labs() adds titles and labels to the plot, and theme_minimal() gives the plot a clean look.
The plot helps you identify observations with a high Cook's Distance, which may have a significant influence on the fitted model. Observations above the red dashed line are typically considered influential.
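The same check can be done numerically instead of reading it off the plot. The sketch below lists the observations whose Cook's Distance exceeds a cutoff and refits the model without them to see how far the coefficients move. The 4/n cutoff is an assumed rule of thumb, and the names influential and refit are illustrative.

```r
model  <- lm(mpg ~ wt + hp + disp, data = mtcars)
cooksd <- cooks.distance(model)

# Assumed cutoff: 4/n rule of thumb (other conventions exist)
threshold <- 4 / length(cooksd)

# Observations above the cutoff, largest first
influential <- sort(cooksd[cooksd > threshold], decreasing = TRUE)
print(influential)

# Refit without the flagged rows and compare coefficients side by side
keep  <- !(rownames(mtcars) %in% names(influential))
refit <- lm(mpg ~ wt + hp + disp, data = mtcars[keep, ])
print(cbind(full = coef(model), reduced = coef(refit)))
```

A large shift between the two coefficient columns suggests the flagged observations are driving the fit. That is a prompt to inspect them, not an automatic reason to delete them.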
Cook's Distance is a valuable tool for detecting influential observations in regression analysis. While it offers insight into data reliability and model performance, it is essential to consider its limitations and to interpret the results alongside other diagnostic measures. Used carefully, Cook's Distance can improve the quality and validity of regression models and support better-informed decisions across many domains.