Causal Inference: Econometric Models vs. A/B Testing

Observational Study, Experimental Study, Regression Model, Instrumental Variable, Difference-in-difference Model, Parametric Test and…

Aaron Zhu

Feb 14, 2022

Photo by Bannon Morrissy on Unsplash

As data science practitioners, we are often asked, "Does X drive Y?" Y is the outcome that we care about. X could be a new feature, product, or drug. For example, a website owner might ask whether a new web page design leads to a higher click-through rate or more sales. A clinical researcher might ask whether a new drug improves health.

Unfortunately, a raw correlation between X and Y alone is NOT enough to establish a causal relationship. The complicating factor here is a set of other features, called Confounding Variables, that affect both X and Y. For example, factors such as visitor geolocation, gender, age, and interests could affect both the use of the new feature and the sales revenue. Therefore, we need to isolate the effect of the new web page design (X) on the sales revenue (Y) while controlling for these confounding variables.

Section I: Observational Study and Experimental Study

There are two main categories of designs used to investigate the relationship between two or more variables in a study: Observational and Experimental.

In an observational study, we observe and collect the actual data (e.g., X and Y) relevant in the study without randomly imposing any kind of treatment or restriction on a group.
In an experimental study, we randomly assign a treatment to one group while another group doesn’t receive it, so that we can investigate the causal relationship between the treatment and the outcome variable. Randomization and intervention are what make experimental studies different from observational studies.

Problem: Is this study Observational or Experimental?

Study 1: A study randomly assigned students into one of two groups:

One group was asked to follow a strict exercise schedule.
One group was prohibited from doing any exercise.

The researchers looked at which group tended to receive a higher GPA at the end of a semester.

Study 2: Another study took a random sample of students and examined their exercise habits. Each person was classified as either a light, moderate, or heavy exerciser. The researchers looked at which groups tended to receive higher GPA at the end of a semester.

Study 1 is an Experimental study, while Study 2 is an Observational study, because Study 1 required the researchers to intervene by randomly assigning the treatment, while Study 2 only observed the students’ existing habits.

Differences Between Observational Study and Experimental Study:

Experimental studies are typically more expensive because setting up an experiment requires more resources, while an observational study can be done through surveys and data collection without any intervention from the researchers.
Experimental studies are typically shorter than observational studies because the researchers’ intervention makes it more efficient to collect the relevant data, while data collection for an observational study might take several years.
The evidence provided by experimental studies is considered stronger than the evidence from observational studies. A relationship between variables in an observational study is not necessarily causal. In an experimental study, randomization makes other covariates (e.g., age, gender, geolocation, spending habits) evenly distributed, on average, between the treatment and control groups. When the subjects are also sampled randomly, the sample is representative of the population, so we can investigate the causal effect of the treatment on the outcome more reliably.

It sounds like Experimental studies have more advantages than Observational studies. However, in some research studies, Experimental studies are NOT a feasible option. For example,

You conduct research based on historical data which didn’t have randomized assignment of treatment in the past.
The treatment can only be observed rather than imposed. Some treatments (e.g., gender, level of household income, race, etc.) can’t be manipulated by the researchers.
Imposing a treatment on a group is considered unethical. For example, suppose we would like to investigate how a change in a product’s price affects its sales revenue. If we conducted an experimental study by charging one group of customers a higher price than another group in the same location and time frame, it would be unethical and a PR disaster. In another example, suppose we would like to investigate whether getting a college degree affects income. Again, we can’t force some people to have a college degree. Not only is it impossible, it would also be unethical.

Section II: Observational Study with Econometric Models

Regression Model

Econometric models (a.k.a. Controlled Regression) are a popular method in Observational Studies to estimate how changes in predictor variables (e.g., treatment X and other Covariates) relate to changes in the response variable (Y). Importantly, with a Controlled Regression model, we can isolate the effect of one variable (e.g., treatment X) while holding all of the other predictor variables constant.

Controlled Regression can control not only for Covariates (which affect the response variable) but also for Confounding variables, which affect both treatment X and response variable Y.

For example, an e-commerce website would like to investigate whether a virtual fitting room feature on their website increases their sales revenue. If we just regress sales revenue on this website feature, we can easily see a positive correlation between the virtual fitting room feature and sales revenue. If we stop there, the estimated effect of the feature would be biased (a.k.a. Omitted Variable Bias) because an important Confounding variable, AGE, is not included in the model. Based on additional data analysis, we could find out that younger customers tend to use the virtual fitting room more frequently and that younger customers also tend to spend more on this e-commerce website. Without the AGE variable in the Controlled Regression, part of the age effect is wrongly attributed to the feature.

Image by author

Background: Age is negatively related to both Virtual_Fitting_Room and Sale_Revenue (younger customers use the feature more and spend more).

Controlled Model 1: β1 here is biased.
Sale_Revenue = β1 * Virtual_Fitting_Room + C

Controlled Model 2: β1 here is estimated with less bias and R-squared is higher.
Sale_Revenue = β1 * Virtual_Fitting_Room + β2 * Age + C
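
To see how much omitting Age can move β1, here is a minimal simulation sketch of the two models above. The data-generating process, the coefficient values, and the names `uses_vfr`, `true_effect`, and `ols` are assumptions made up for illustration, not estimates from real data. Younger customers are simulated to use the feature more and to spend more.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: usage probability falls with age, and so does spending.
age = rng.uniform(18, 70, size=n)
p_use = 1 / (1 + np.exp((age - 40) / 8))
uses_vfr = (rng.random(n) < p_use).astype(float)
true_effect = 5.0  # assumed causal effect of the feature on revenue
revenue = 100 + true_effect * uses_vfr - 0.8 * age + rng.normal(0, 10, size=n)


def ols(y, *regressors):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


beta_model1 = ols(revenue, uses_vfr)[1]       # Model 1: Age omitted
beta_model2 = ols(revenue, uses_vfr, age)[1]  # Model 2: Age included

print(f"Model 1 (Age omitted):  beta1 = {beta_model1:.2f}")
print(f"Model 2 (Age included): beta1 = {beta_model2:.2f} (true effect = {true_effect})")
```

In this simulated setup, Model 1 credits the feature with revenue that actually comes from its users being younger, while Model 2 should land close to the assumed true effect.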

Tips to check if there is Omitted Variable Bias in a controlled model

If adding a predictor variable to the regression model meaningfully increases the R-squared and the estimated treatment effect also changes substantially, it is likely that the previous model suffers from Omitted Variable Bias.
If including additional variable(s) in the model doesn’t meaningfully change the treatment effect, then we can be more confident that the estimated treatment effect reflects a true causal effect of the treatment on the response variable.

Address Omitted Variable Bias using Instrumental Variable

Another issue in Observational Studies is that some confounding variables exist conceptually but can NOT be measured or observed, so the estimated treatment effect suffers from omitted variable bias. Using an Instrumental Variable (IV) is a popular method to address this issue.

Let’s define this Instrumental variable Z:

Z is not correlated with the other covariates or with the error term in the model, so it can affect Y only through X
Z is meaningfully and strongly correlated with treatment X, and therefore affects Y indirectly through X

In practice, an Instrumental variable can be implemented in two steps:

Step 1: We regress X on the Instrumental variable and compute the predicted X. Keep in mind that we need a strong correlation between the IV and X. Otherwise, the estimated treatment effect might still be biased.
Step 2: We regress Y on the predicted X from step 1 and the other covariates, which gives a more accurate estimate of the treatment effect.

Application of Instrumental Variable

Case Study: A social media site would like to investigate if having more friends on the same social media site would make a user more likely to return to the site.

First of all, an experimental study is off the table because we can’t randomly assign some people to have more friends than others. Secondly, including all confounding variables is not feasible. Using an instrumental variable seems to be an easier route. A social media site should have an existing strategy to invite more friends to the site, for example, a feature that sends invitations to friends using a user’s contacts.

Image by author

Background: This feature of Send_Invitations_to_friends is related to number_of_friends, but not related to any other covariates for Return.

Step 1 of IV Model: β1 should suggest strong positive correlation between Send_Invitations_to_friends and Number_of_friends
Number_of_friends = β1 * Send_Invitations_to_friends + C

Step 2 of IV Model: β2 can estimate effect of number of friends on Return more precisely.
Return_Flag = β2 * Predicted_Number_of_friends + C
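
The two steps can be sketched on simulated data. Everything below is a hypothetical setup: `sociability` stands in for an unobserved confounder, `invited` plays the role of Send_Invitations_to_friends, and the outcome is kept as a continuous score rather than a 0/1 Return_Flag to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Unobserved confounder: raises both the number of friends and the return propensity.
sociability = rng.normal(size=n)
# Instrument Z: assumed to affect the outcome only through the number of friends.
invited = rng.integers(0, 2, size=n).astype(float)
friends = 10 + 4 * invited + 3 * sociability + rng.normal(size=n)
true_effect = 0.02  # assumed effect of one extra friend on the return score
return_score = (0.3 + true_effect * friends + 0.1 * sociability
                + rng.normal(0, 0.1, size=n))


def ols(y, x):
    """Return (intercept, slope) of a simple least-squares regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


naive_estimate = ols(return_score, friends)[1]  # biased: sociability is omitted

a0, a1 = ols(friends, invited)                  # Step 1: regress X on Z
predicted_friends = a0 + a1 * invited
iv_estimate = ols(return_score, predicted_friends)[1]  # Step 2: regress Y on predicted X

print(f"naive OLS: {naive_estimate:.4f}")
print(f"two-step IV: {iv_estimate:.4f} (true effect = {true_effect})")
```

Running the second stage by hand gives a usable point estimate, but its ordinary OLS standard errors are not valid for inference. A dedicated two-stage least squares routine corrects them.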

Difference-in-Differences Model

When an Instrumental Variable is not feasible, we need an alternative way to control for the effect of unobserved confounding variables. The Difference-in-Differences (DiD) model could be a viable option. It compares how outcomes change from the pre-treatment period to the treatment period in the treatment group versus the control group. The key assumptions required for the DiD model are:

If the treatment were NOT imposed, the outcome variables of the treatment group and the control group would follow parallel trends.
Any covariates (including the omitted variables) would affect the outcome variables of the treatment group and the control group in the same way.

Image by Author

Application of Difference-in-Differences Model

Case Study: A retail store would like to investigate if a price increase on a product would generate more sales revenue.

First of all, an experimental study is off the table because randomly charging different prices for the same product in the same location is considered unethical. Secondly, identifying all confounding variables or finding a valid instrumental variable is a difficult task in itself. In this case, the Difference-in-Differences Model is a better option.

We can pick two retail stores from different cities and these two cities are also comparable in terms of population, income level, and demand for the product (parallel trends assumption).

The store in City 1 (treatment group) charged customers a higher price (the treatment) in the treatment period than in the pre-treatment period, while the store in City 2 (control group) charged customers the same price throughout both periods.

We can compare how revenue changes between the two cities after the price increase using a DiD model.

Background: Trends of sale revenue from both cities are parallel in both the pre-treatment period and treatment period (if the price change is not imposed.)

DiD Model: "Treatment_period" is an indicator of when the price increase is imposed (treatment period: 1; pre-treatment period: 0). "Treatment_group" is an indicator of the treatment and control groups (City_1: 1; City_2: 0). The coefficient on the interaction term between "Treatment_period" and "Treatment_group" (β3) estimates the pure effect of the price increase on sale revenue.

Sale_Revenue = β1 * Treatment_period + β2 * Treatment_group + β3 * Treatment_period * Treatment_group + C

β1: It estimates the effect on the response variable in the treatment period from factors other than our treatment.
β2: It estimates the average difference between treatment and control groups
β3: It estimates the pure treatment effect on the response variable.
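
As a rough sketch of the DiD regression above, the simulation below builds hypothetical daily revenue for the two cities and recovers β3 from the interaction term. The revenue levels, the assumed `true_effect`, and the number of observations are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_cell = 500  # hypothetical observations per city and period

# Four cells: (period, group) = (0, 0), (0, 1), (1, 0), (1, 1)
treatment_period = np.repeat([0.0, 0.0, 1.0, 1.0], n_per_cell)
treatment_group = np.repeat([0.0, 1.0, 0.0, 1.0], n_per_cell)
true_effect = 30.0  # assumed effect of the price increase on revenue

sale_revenue = (1000
                + 50 * treatment_period                      # common time effect
                + 80 * treatment_group                       # fixed gap between cities
                + true_effect * treatment_period * treatment_group
                + rng.normal(0, 20, size=4 * n_per_cell))

X = np.column_stack([np.ones_like(treatment_period),
                     treatment_period,
                     treatment_group,
                     treatment_period * treatment_group])
coef, *_ = np.linalg.lstsq(X, sale_revenue, rcond=None)
intercept, beta1, beta2, beta3 = coef


def cell_mean(period, group):
    """Average revenue for one period/group combination."""
    mask = (treatment_period == period) & (treatment_group == group)
    return sale_revenue[mask].mean()


# The same quantity as a literal "difference in differences" of cell means.
did_by_hand = ((cell_mean(1, 1) - cell_mean(0, 1))
               - (cell_mean(1, 0) - cell_mean(0, 0)))

print(f"beta3 = {beta3:.2f}, difference of differences = {did_by_hand:.2f}")
```

With only these two indicators and their interaction, β3 coincides with the change in City 1 minus the change in City 2, which is where the model gets its name.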

Image by Author

Section III: Experimental Study with A/B Test

In this section, let’s talk about another powerful tool for Causal Inference, the A/B Test.

The A/B test (a.k.a. Randomized Controlled Trial) is perhaps the most reliable tool to investigate causality. By continually identifying new goals in terms of conversion rates and engagement metrics, and testing new features, a website can improve its performance and an app can attract and retain more users. As a result, A/B testing has become common in the tech industry in recent years.

An A/B Test is usually implemented in the following steps: Forming Hypothesis, Sample Size Calculation, Randomization Design, Post-Test Analysis.

Hypothesis

Forming a hypothesis is the first step of every A/B test. A hypothesis is a statement that describes the causal relationship you want to investigate. Let’s give an example of a hypothesis.

Null hypothesis (H0): ABC e-commerce site visitors who receive email coupons will NOT have higher purchase conversion rate compared to visitors who don't receive email coupons.

Alternative hypothesis (H1): ABC e-commerce site visitors who receive email coupons will have higher purchase conversion rate compared to visitors who don't receive email coupons.

Every hypothesis consists of key components: Population, Treatment, Evaluation Metric, Null & Alternative hypothesis

Population: We need to define which subjects are eligible for the experiment (e.g., all users, or users from a certain location) and also how to identify an individual subject (a.k.a. Unit of Diversion). In the above example, the population is all visitors to the ABC e-commerce site and the Unit of Diversion is User ID.

Treatment (Intervention): A treatment can be a new feature or new design. In the above example, the treatment would be receiving email coupons. Keep in mind that a treatment can usually be only ONE intervention. We can’t impose multiple changes on one group. For example, if we send out both email coupons and mail coupons in the experiment, we can’t separate the effects of the two interventions.

Treatment & Control groups: Subjects that receive the treatment belong to the treatment group. Subjects that do NOT receive the treatment are in the control group.

Evaluation Metric (Outcome Variable): The evaluation metric is the outcome that we care about and would be investigated. In the above example, the evaluation metric is purchase conversion rate, which is defined as a ratio between the number of visitors who make purchases and the number of total visitors in the experiment.

There are different types of Evaluation Metrics. For example,

Counts: Engagement metrics, such as Daily Active Users (DAU), Weekly Active Users (WAU), Monthly Active Users (MAU) and User Stickiness (DAU/MAU) are common evaluation metrics
Distribution (e.g., mean, percentiles): Evaluation metric can be a distribution. For example, the average session time on a site or the average number of clicks before conversions.
Probability and Ratio: Evaluation metric can also be a ratio. For example, conversion rate, which is defined as the number of subjects that take the desired action (e.g., click a button, make a purchase) over the number of total subjects in an experiment. Another example is retention rate, which measures the percentage of users returning to your website or app within a period of time. Tracking conversion rates and retention rates allows you to monitor the performance of a website and identify areas for improvement.

Null & Alternative hypothesis: The null hypothesis states that there is no difference in the outcome variable between the treatment group and the control group. In other words, the treatment doesn’t affect the outcome. The alternative hypothesis states that there is a difference in the outcome between the two groups.

Sample Size Calculation

The next step is to calculate the sample size for the experiment. We need to determine a few things before the calculation.

Effect Size: This is the difference in the outcome variable (e.g., change in conversion rate) between the treatment and control groups. Keep in mind that, with a large enough sample, even tiny changes from the experiment will be found statistically significant. So you need to think about the business impact of the changes and their practical importance. You need to ask "what is the minimum effect on the outcome for the intervention to be worthwhile to launch?" while considering the development and opportunity costs. Also, the smaller the effect size, the more data is needed and the longer the test lasts.
Statistical Significance Level & Power: Typically the significance level is set to 0.05 and the power is set to 0.8. The significance level (the Type I error rate) is the acceptable likelihood of falsely detecting an effect when the effect is nonexistent. So the smaller the significance level, the stricter the test is and the more data is needed. Power is the likelihood that a test will detect an effect when the effect is present. So the higher the power, the better the test is and the more data is needed.
Standard Deviation: This measures the variability of the outcome variable. When it is difficult to obtain, we can rely on historical data or knowledge from domain experts to estimate it. The higher the standard deviation, the more data we will need.
Sample Size Calculation: Once you have the information mentioned above, the following formula can calculate the sample size. The Z-values are the standard scores corresponding to the chosen Significance Level and Power, σ is the standard deviation, and µc-µt is the effect size.

Image by author
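
The formula image is not reproduced in this text. As a hedged sketch, one commonly used form that matches the symbols described above is n = 2σ²(Z_{1-α/2} + Z_{1-β})² / (µc-µt)² subjects per group, for a two-sided comparison of two means. It may differ in detail from the formula in the original figure. The function name and the 10% baseline conversion rate below are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma: float, effect_size: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Subjects needed in EACH group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # standard score for the significance level
    z_power = NormalDist().inv_cdf(power)          # standard score for the power
    n = 2 * sigma ** 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    return ceil(n)


# Hypothetical example: baseline conversion rate of 10%, so sigma^2 = p * (1 - p).
baseline_rate = 0.10
sigma = (baseline_rate * (1 - baseline_rate)) ** 0.5

n_large_effect = sample_size_per_group(sigma, effect_size=0.02)
n_small_effect = sample_size_per_group(sigma, effect_size=0.01)

print(f"detect a 2-point lift: {n_large_effect} users per group")
print(f"detect a 1-point lift: {n_small_effect} users per group")
```

Halving the effect size roughly quadruples the required sample, which is why the minimum worthwhile effect should be settled before the test starts.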

Randomization

Once we have a hypothesis and sample size, we then can randomly assign subjects to the treatment and control groups. Randomization is the key to the success of an unbiased A/B test. It needs to meet the following requirements:

The samples in the test need to be representative of the population, so that the conclusion drawn from the sample can be applied to the population.
Covariates need to be evenly distributed between the treatment group and the control group. Any factors (e.g., gender, level of income, location, type of device) that might affect the outcome variable need to be evenly distributed, so that we can isolate the effect of the treatment while keeping other covariates comparable.
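
One possible way to implement the assignment is to hash the Unit of Diversion, so the same user always lands in the same group, and then compare a covariate across groups. This is a minimal sketch: the experiment name, the 50/50 split, and the simulated `is_mobile` covariate are assumptions for illustration.

```python
import hashlib
import random


def assign_group(user_id: str, experiment: str = "email_coupon_test") -> str:
    """Deterministically map a user ID to 'treatment' or 'control' (50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < 50 else "control"


# Hypothetical population: 10,000 users, about 60% of them on mobile devices.
rng = random.Random(0)
users = [{"user_id": f"user_{i}", "is_mobile": rng.random() < 0.6}
         for i in range(10_000)]

groups = {"treatment": [], "control": []}
for user in users:
    groups[assign_group(user["user_id"])].append(user)


def mobile_share(members):
    """Fraction of mobile users in a group."""
    return sum(u["is_mobile"] for u in members) / len(members)


treatment_share = len(groups["treatment"]) / len(users)
balance_gap = abs(mobile_share(groups["treatment"]) - mobile_share(groups["control"]))

print(f"share of users in treatment: {treatment_share:.3f}")
print(f"gap in mobile share between groups: {balance_gap:.3f}")
```

Including the experiment name in the hash keeps assignments in one experiment independent of assignments in another.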

Post-Test Analysis

Before we analyze how the treatment affects our outcome variable in the experiment, we need to run a sanity check on the experiment. The metrics we use for the sanity check are called invariant metrics (e.g., number of cookies), which are not supposed to be affected by the experiment. So there shouldn’t be a meaningful difference in invariant metrics between the control and treatment groups. Otherwise, there are flaws in the experiment setup.
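
For a count-type invariant metric such as the number of cookies per group, one way to run this check is a chi-squared goodness-of-fit test against the intended split. The sketch below assumes a 50/50 design. The counts and the 0.001 alert threshold are hypothetical choices, not values from this article.

```python
from scipy import stats


def sample_ratio_check(n_control: int, n_treatment: int,
                       expected_treatment_share: float = 0.5,
                       threshold: float = 0.001):
    """Return (p_value, flagged) for the observed split versus the designed split."""
    total = n_control + n_treatment
    expected = [total * (1 - expected_treatment_share),
                total * expected_treatment_share]
    result = stats.chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return result.pvalue, bool(result.pvalue < threshold)


# Hypothetical cookie counts from two experiments designed as 50/50 splits.
p_ok, flagged_ok = sample_ratio_check(50_210, 49_790)
p_bad, flagged_bad = sample_ratio_check(50_900, 49_100)

print(f"experiment 1: p = {p_ok:.3f}, flagged = {flagged_ok}")
print(f"experiment 2: p = {p_bad:.2e}, flagged = {flagged_bad}")
```

A flagged result points to a problem in assignment or logging, so the outcome metric should not be trusted until the cause is found.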

Once the experiment passes the sanity check, we can proceed with analyzing the actual data we care about. There are many methods we can use to investigate whether the outcome variable is different between the control and treatment groups.

Parametric Tests: Parametric tests work well when the outcome variable is approximately normally distributed. The following are some of the popular parametric tests:

Student’s t-test: We assume the variances of the outcome variable are the same in the control and treatment groups.
Welch’s t-test: When sample sizes or variances are NOT comparable, Welch’s t-test is more reliable than Student’s t-test.
ANOVA Test: Sometimes there are multiple treatment groups. Before we run multiple t-tests, we can first run the ANOVA test, which uses an F-test, to determine whether the means of three or more groups are different. If the p-value from the F-test is small, we know that at least one group is different from the rest. Then we can spend time running pairwise t-tests to find out which group is different. Keep in mind that when running multiple tests, we need to correct the p-values using the Bonferroni correction or a False Discovery Rate (FDR) procedure.
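
The three tests can be run with `scipy.stats`. The sketch below uses simulated, hypothetical outcome data for a control group and two variants, then applies a Bonferroni correction to the pairwise Welch tests. The group means, spreads, and sizes are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical outcome (e.g., minutes per session) for three groups.
control = rng.normal(10.0, 2.0, size=2000)
variant_a = rng.normal(10.4, 2.0, size=2000)  # same spread as control, higher mean
variant_b = rng.normal(10.0, 3.0, size=1500)  # different spread and sample size

# Student's t-test assumes equal variances. Welch's t-test does not.
student = stats.ttest_ind(control, variant_a, equal_var=True)
welch = stats.ttest_ind(control, variant_b, equal_var=False)

# One-way ANOVA (F-test): is at least one group mean different?
anova = stats.f_oneway(control, variant_a, variant_b)

# Pairwise Welch tests, with Bonferroni correction for the number of comparisons.
pairs = {
    "control vs A": (control, variant_a),
    "control vs B": (control, variant_b),
    "A vs B": (variant_a, variant_b),
}
raw_p = {name: stats.ttest_ind(x, y, equal_var=False).pvalue
         for name, (x, y) in pairs.items()}
bonferroni_p = {name: min(1.0, p * len(pairs)) for name, p in raw_p.items()}

print(f"Student p = {student.pvalue:.4g}, Welch p = {welch.pvalue:.4g}")
print(f"ANOVA p = {anova.pvalue:.4g}")
for name, p in bonferroni_p.items():
    print(f"{name}: Bonferroni-adjusted p = {p:.4g}")
```

Bonferroni is the most conservative of the usual corrections. An FDR procedure trades some of that strictness for power when many comparisons are run.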

Non-parametric Test: Non-parametric tests don’t make assumptions about the distribution of the underlying data. They are viable options when the continuous outcome variable is NOT normally distributed or there are two or more categorical outcomes. The following are some of the popular non-parametric tests:

Chi-Squared Test: This is an independence test that allows you to test whether there is a statistically significant association between the treatment and the outcome variable. It can deal with categorical data with two or more outcome values while a t-test can only handle categorical data with two outcome values.
Fisher’s exact Test: The chi-squared test is only reliable when the sample is reasonably large. A common rule of thumb is that every expected cell count should be at least 5. If this condition is not met, one can use Fisher’s exact test instead.
Mann Whitney U Test (Wilcoxon Rank Sum Test): This test uses ranks instead of actual values. When comparing a continuous variable that is not normally distributed, or when the sample size is small, the Wilcoxon Rank Sum Test is a good option.
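
These tests are also available in `scipy.stats`. The contingency tables and the skewed session-time data below are hypothetical and only illustrate the calls.

```python
import numpy as np
from scipy import stats

# Rows: treatment, control. Columns: converted, did not convert (hypothetical counts).
large_table = np.array([[120, 880],
                        [90, 910]])
# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(large_table)

# With a small sample the expected counts are low, so use Fisher's exact test.
small_table = np.array([[8, 2],
                        [1, 9]])
odds_ratio, fisher_p = stats.fisher_exact(small_table)

# Skewed continuous outcome (e.g., session time): compare ranks instead of means.
rng = np.random.default_rng(4)
control_time = rng.exponential(scale=5.0, size=3000)
treatment_time = rng.exponential(scale=7.0, size=3000)
mwu = stats.mannwhitneyu(treatment_time, control_time, alternative="two-sided")

print(f"chi-squared p = {chi2_p:.4f}, smallest expected count = {expected.min():.0f}")
print(f"Fisher's exact p = {fisher_p:.4f}")
print(f"Mann-Whitney U p = {mwu.pvalue:.4g}")
```

Inspecting the expected counts returned by `chi2_contingency` is a quick way to decide whether the chi-squared approximation is acceptable or Fisher’s exact test is the safer choice.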

Issues in A/B Test and Solutions

Stopping A/B Tests Too Early: After we calculate the sample size, we can estimate the number of days to run the test by dividing the required sample size by the average daily traffic. Even if that number is less than a week, we should run the test for at least 2 weeks. If possible, 1–2 business cycles would be better. People behave differently on a day-by-day basis (e.g., weekday vs weekend) and are affected by external events (e.g., holidays, tax season, summer vs winter). We can get a more robust result with an extended test.
Network Effect: On social media platforms, users’ behaviors are very likely influenced by those of people in their social circles, so the independence assumption about the users doesn’t hold. When we randomly assign each user to the control or treatment group, the treatment effect from the test is usually under-estimated because the treatment effect may spill over to the control group through the treatment group’s social circle. To address this issue, we can use cluster randomization, which puts users in the same social circle in the same group.
Novelty Effect & Primacy Effect: People react to new changes/features of a product differently. Some people might feel excited about any new change and want to try it for the sake of trying something new. This kind of behavior is called the Novelty Effect. On the other hand, some people might resist any change to a product. This is called the Primacy Effect or Change Aversion. If you observe an unusually large or small initial effect, it’s likely due to the novelty or primacy effect. To solve this issue, we can extend the duration of the test since these effects eventually fade away. Alternatively, we can conduct the A/B test on NEW users only because new users have a fresh perspective and shouldn’t be affected by these effects.
Contradicting Results: Sometimes, we see contradicting results from multiple evaluation metrics (e.g., the conversion rate goes up, but the retention rate goes down). To address this issue, we can come up with one OEC (Overall Evaluation Criterion) which accounts for both short-term and long-term goals, and trade-offs between different metrics. However, you should be able to quantify both positive and negative impacts and make sure the negative impact is acceptable.
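
The duration rule from the first item can be written down as a small helper. This is a sketch under stated assumptions: rounding up to whole weeks (so every weekday is covered equally) is an added assumption, and the function name and traffic numbers are hypothetical.

```python
from math import ceil


def planned_duration_days(sample_size_per_group: int, n_groups: int,
                          daily_eligible_users: int, min_days: int = 14) -> int:
    """Days to run the test: enough traffic for the sample, and at least `min_days`."""
    raw_days = ceil(sample_size_per_group * n_groups / daily_eligible_users)
    whole_weeks = ceil(raw_days / 7) * 7  # assumption: cover complete weekly cycles
    return max(min_days, whole_weeks)


# Traffic alone would fill the sample in 4 days, so the 2-week minimum applies.
short_test = planned_duration_days(3_532, n_groups=2, daily_eligible_users=2_000)
# Traffic requires 25 days, rounded up to 4 full weeks.
long_test = planned_duration_days(50_000, n_groups=2, daily_eligible_users=4_000)

print(short_test, long_test)
```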

Thank you for reading !!!

URL: https://towardsdatascience.com/causal-inference-econometric-models-vs-a-b-testing-190781fe82c5/