Causal Inference with Linear Regression: Endogeneity
Discussion of Endogenous variable, Exogenous variable, Omitted variable, Measurement Error, and Simultaneity Bias
In my previous article, we discussed some common issues when designing a linear regression – Omitting Important Variables and Including Irrelevant Variables. In this article, we’ll discuss Endogeneity in a linear regression model, especially in the context of Causal Inference.
A linear regression model is a popular tool for estimating the causal relationship between the response variable (Y) and the treatment variable (T) while controlling for other covariates (X). The bias (accuracy) and variance (precision) of the estimated treatment effect (i.e., α) are the main concerns of such research.
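One common way to write such a model is sketched below. Only Y, T, X, and α come from the text; the coefficient vector β, the intercept β_0, and the error term ε are notation assumed here for illustration.

```latex
Y = \beta_0 + \alpha T + \beta X + \varepsilon
```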
What is Endogeneity?
Endogeneity refers to situations in which a predictor (e.g., the treatment variable) in a linear regression model is correlated with the error term.
Such a predictor is called an endogenous variable. The coefficient estimate of an endogenous variable is no longer BLUE (Best Linear Unbiased Estimator) because endogeneity violates one of the classical assumptions of Linear Regression: all independent variables are uncorrelated with the error term.
On the other hand, a variable is called an exogenous variable if it is not explained by other variables in the model (e.g., the response variable, other explanatory variables, and the error term). An exogenous variable is determined by factors outside the model.
What are the sources of Endogeneity?
There is a wide range of sources of Endogeneity. The common sources of Endogeneity can be classified as: omitted variables, simultaneity, and measurement error.
Source 1: Omitted Variables
If a variable Z is correlated with both the response variable and the predictors, we call it a Confounding Variable.
If a confounding variable Z is omitted from a linear regression model, the affected predictor (e.g., the treatment variable) becomes endogenous. The "unexplained" variable Z leaks into the error term, so the affected predictor is correlated with the error term.
If there is endogeneity due to an omitted variable, the coefficient estimate of the affected variable (e.g., the treatment variable) is biased (i.e., omitted variable bias). See proof here. It means we have an inaccurate estimate of the causal effect.
If the confounding variable Z is added to the linear regression model, the affected predictor (e.g., the treatment variable) is no longer endogenous because of Z. Therefore, the coefficient estimate for the treatment effect no longer carries this bias.
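A small simulation can make the omitted variable bias concrete. This is a minimal sketch on synthetic data: the coefficients (2.0, 0.8, 1.5), the sample size, and the helper `ols` are all assumptions chosen for illustration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha = 2.0  # true treatment effect (hypothetical value)

# Z is a confounder: it drives both the treatment T and the response Y.
z = rng.normal(size=n)
t = 0.8 * z + rng.normal(size=n)
y = alpha * t + 1.5 * z + rng.normal(size=n)

def ols(y, *columns):
    """OLS coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

alpha_omitted = ols(y, t)[1]        # Z left out, so it leaks into the error term
alpha_controlled = ols(y, t, z)[1]  # Z included as a covariate

print(f"true alpha:               {alpha:.3f}")
print(f"estimate with Z omitted:  {alpha_omitted:.3f}")
print(f"estimate with Z included: {alpha_controlled:.3f}")
```

With these made-up coefficients, the estimate that omits Z should land well above the true value, while the estimate that controls for Z should sit close to it.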
Source 2: Simultaneity
Simultaneity is another common cause of endogeneity. Simultaneity arises when one or more of the predictors (e.g., the treatment variable) is determined by the response variable (Y). In simple terms, X causes Y and Y causes X. For example, we can use education level to explain household income because people with higher education tend to earn more. At the same time, we know that people with higher income can more easily afford higher education.
Typically such relationships can be explained by Simultaneous Equations (also called Structural Equations).
By solving the two equations above, we have a Reduced Form of the model
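As an illustrative sketch, assume a two-equation system in which Y depends on X and X depends on Y. The coefficients α and γ and the second error term e are notation assumed here; u is the error term of the Y equation, as in the text.

```latex
% Structural (simultaneous) equations
Y = \alpha X + u
X = \gamma Y + e

% Reduced form, obtained by substituting one equation into the other
% (requires \alpha\gamma \neq 1)
X = \frac{\gamma u + e}{1 - \alpha\gamma}
Y = \frac{\alpha e + u}{1 - \alpha\gamma}
```

In this form, X contains u directly, so Cov(X, u) = γ Var(u) / (1 − αγ) when u and e are uncorrelated, which is nonzero whenever γ ≠ 0.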
In the context of causal inference, if the treatment variable X is determined by the response variable, then it is easy to see that the treatment variable is correlated with the error term, u in Figure 2.
Therefore, both the treatment variable and the response variable are endogenous variables if we apply OLS in Figure 2. This leads to a biased estimator of the treatment effect (i.e., Simultaneity Bias), so the OLS estimate does not recover the true effect.
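The simultaneity bias can be checked with a short simulation. This is a sketch under assumptions: a hypothetical two-equation system Y = αX + u and X = γY + e with made-up coefficients, with data generated from its reduced form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
alpha, gamma = 0.5, 0.4  # hypothetical structural coefficients

u = rng.normal(size=n)  # error term of the Y equation
e = rng.normal(size=n)  # error term of the X equation

# Generate X and Y from the reduced form of
#   Y = alpha * X + u
#   X = gamma * Y + e
denom = 1.0 - alpha * gamma
x = (gamma * u + e) / denom
y = (alpha * e + u) / denom

# X is correlated with u, which is exactly what OLS assumes away.
cov_xu = np.cov(x, u)[0, 1]

# Simple-regression OLS slope of Y on X.
alpha_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"Cov(X, u):      {cov_xu:.3f}")
print(f"true alpha:     {alpha:.3f}")
print(f"OLS estimate:   {alpha_ols:.3f}")
```

Because X carries part of u, the OLS slope in this setup should overstate α even with a very large sample.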
Source 3: Measurement Error
In a linear regression model, it is assumed that the observations are measured correctly, without any error. In many situations, this assumption is violated. Some variables (e.g., people’s ability and willingness to exercise) may not be measurable, so we use proxy variables (e.g., people’s IQ scores and the number of hours in the gym) to measure the effect. Sometimes, it is hard to record observations exactly. For example, the age variable is usually recorded as an integer, and month and day are typically ignored. In these cases, the true values of the variables are not included in the model. The difference between the observed and the true values of a variable is called a Measurement Error.
Scenario 1: When the measurement error is in the dependent variable Y, it doesn’t cause endogeneity because in this case, the unexplained Measurement Error is an exogenous variable, which is independent of the included explanatory variables. Therefore, the explanatory variables will not be correlated with the error term even if the unexplained Measurement Error leaks to the error term.
Scenario 2: In contrast, when the measurement error is in the explanatory variable, the problem of endogeneity arises.
Let’s say X* is the observed explanatory variable(s) and X is the true value of the variable(s). The relationship between X* and X can be explained as follows,
We set up a linear regression as usual without including the measurement error term, v, because it is not measurable.
Then the model we are actually estimating is
With some math, we can show that X* is correlated with the actual error term, u, so endogeneity occurs.
In Figure 9, Cov(X, v) is 0 because the measurement error is assumed to be independent of the true explanatory variable, X. Cov(X, ε) and Cov(v, ε) are both 0 because ε is assumed to be independent of X and is very unlikely to be correlated with the measurement error.
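The steps can be written out as follows. This is a sketch assuming a single explanatory variable with additive measurement error and no intercept; β and ε denote the true coefficient and the true error term.

```latex
X^{*} = X + v
Y = \beta X + \varepsilon
  = \beta (X^{*} - v) + \varepsilon
  = \beta X^{*} + u, \qquad u = \varepsilon - \beta v

\mathrm{Cov}(X^{*}, u)
  = \mathrm{Cov}(X + v,\; \varepsilon - \beta v)
  = \mathrm{Cov}(X, \varepsilon) - \beta\,\mathrm{Cov}(X, v)
    + \mathrm{Cov}(v, \varepsilon) - \beta\,\mathrm{Var}(v)
  = -\beta\,\mathrm{Var}(v)
```

Since Var(v) > 0 whenever there is measurement error, Cov(X*, u) is nonzero unless β = 0.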
Then in the linear regression with measurement error, the OLS estimator, β_hat, is no longer unbiased. Moreover, under these assumptions the estimate is biased toward zero, so the magnitude of the effect is under-estimated (i.e., Attenuation Bias).
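Both scenarios can be illustrated with a simulation. This is a minimal sketch on synthetic data; the true slope of 2.0 and the noise scales are assumptions. For a single regressor with this kind of measurement error, the OLS slope converges to β · Var(X) / (Var(X) + Var(v)), which is smaller in magnitude than β.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta = 2.0  # true slope (hypothetical value)

x_true = rng.normal(size=n)            # true X, variance 1
v = rng.normal(scale=0.5, size=n)      # measurement error, variance 0.25
x_obs = x_true + v                     # observed X* = X + v
y = beta * x_true + rng.normal(size=n)

def slope(x, y):
    """Simple-regression OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Scenario 2: error in the explanatory variable -> attenuation toward zero.
beta_clean = slope(x_true, y)
beta_noisy_x = slope(x_obs, y)
beta_limit = beta * 1.0 / (1.0 + 0.25)  # large-sample value of the noisy slope

# Scenario 1: error in the dependent variable only -> slope stays unbiased.
y_obs = y + rng.normal(scale=0.5, size=n)
beta_noisy_y = slope(x_true, y_obs)

print(f"true beta:                 {beta:.3f}")
print(f"slope using true X:        {beta_clean:.3f}")
print(f"slope using observed X*:   {beta_noisy_x:.3f} (limit {beta_limit:.3f})")
print(f"slope with noisy Y only:   {beta_noisy_y:.3f}")
```

Noise in Y only makes the estimate less precise, while noise in X shrinks the slope toward zero.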
What is the remedy for Endogeneity?
One of the popular methods to address Endogeneity in a linear regression model is introducing one or more instrumental variables via Two-Stage Least Squares (2SLS).
Let’s define this Instrumental variable Z:
- Z is not correlated with the error term in the model, so it has no direct effect on Y
- Z is meaningfully and strongly correlated with the affected predictor (e.g., the treatment variable), and therefore affects Y indirectly through X
In practice, an Instrumental variable (IV) model can be implemented in two steps (2SLS):
- Step 1: We regress the affected predictor X on the Instrumental variable Z (and the other covariates). Keep in mind that we need a strong correlation between the IV and X. Otherwise, the estimate for the affected predictor may still be biased.
- Step 2: We regress Y on the fitted X from step 1 and the other covariates. The estimator we get from step 2 is consistent and, in large samples, more accurate than the one from Figure 11.
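The two steps above can be sketched with plain NumPy. This is an illustration under assumptions: the data are synthetic, the unobserved confounder `w`, the instrument `z`, the coefficients, and the helper `ols` are all hypothetical, and there are no additional covariates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
alpha = 2.0  # true treatment effect (hypothetical value)

w = rng.normal(size=n)   # unobserved confounder, never given to the model
z = rng.normal(size=n)   # instrument: affects X, independent of w and the noise
x = 1.0 * z + w + rng.normal(size=n)
y = alpha * x + 1.5 * w + rng.normal(size=n)

def ols(y, *columns):
    """OLS coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Naive OLS: X is endogenous because w sits in the error term.
alpha_ols = ols(y, x)[1]

# Step 1: regress X on the instrument Z and keep the fitted values.
a0, a1 = ols(x, z)
x_hat = a0 + a1 * z

# Step 2: regress Y on the fitted X from step 1.
alpha_2sls = ols(y, x_hat)[1]

print(f"true alpha:    {alpha:.3f}")
print(f"naive OLS:     {alpha_ols:.3f}")
print(f"2SLS estimate: {alpha_2sls:.3f}")
```

Running the second stage by hand like this gives the 2SLS point estimate, but the standard errors reported by an ordinary OLS routine for step 2 would not be valid, so a dedicated IV routine is preferable when inference is needed.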
Final Notes
When using a linear regression model to draw a causal inference, endogeneity is an issue we need to address. Otherwise, we would get a biased treatment effect due to omitted variables, simultaneity, or measurement error.
Here are some related posts you can explore if you’re interested in Linear Regression and Causal Inference.
- Causal Inference: Econometric Models vs. A/B Testing
- Linear Regression vs. Logistic Regression: OLS, Maximum Likelihood Estimation, Gradient Descent
- Linear Regression with OLS: Unbiased, Consistent, BLUE, Best (Efficient) Estimator
- Causal Inference with Linear Regression: Omitted variables and Irrelevant variables
- Causal Inference with Linear Regression: Endogeneity
- Linear Regression with OLS: Heteroskedasticity and Autocorrelation