Bayesian AB Testing

Using and choosing priors in randomized experiments.

Jan 10, 2023

CAUSAL DATA SCIENCE

Cover, image by Author

Randomized experiments, a.k.a. AB tests, are the established industry standard for estimating causal effects. By randomly assigning the treatment (new product, feature, UI, …) to a subset of the population (users, patients, customers, …), we ensure that, on average, the difference in outcomes (revenue, visits, clicks, …) can be attributed to the treatment. Established companies like Booking.com report constantly running thousands of AB tests at the same time. And newer, growing companies like Duolingo attribute a large chunk of their success to their culture of experimentation at scale.

With so many experiments, one question arises naturally: in one specific experiment, can you leverage information from previous tests? How? In this post, I will try to answer these questions by introducing the Bayesian approach to AB testing. The Bayesian framework is well suited for this type of task because it naturally allows for the updating of existing knowledge (the prior) using new data. However, the method is particularly sensitive to functional form assumptions, and apparently innocuous model choices, like the tail thickness of the prior distribution, can translate into very different estimates.

Search and Infinite Scrolling

For the rest of the article, we are going to use a toy example, loosely inspired by Azevedo et al. (2019): a search engine that wants to increase its ad revenue without sacrificing search quality. We are a company with an established experimentation culture and we continuously test new ideas on how to improve our landing page. Suppose that we came up with a brilliant new idea: infinite scrolling! Instead of having a discrete sequence of pages, we allow users to keep scrolling down if they want to see more results.

Image, generated by Author using NightCafé

To understand whether infinite scrolling works, we ran an AB test: we randomized users into a treatment and a control group and implemented infinite scrolling only for users in the treatment group. I import the data-generating process dgp_infinite_scroll() from [src.dgp](https://github.com/matteocourthoud/Blog-Posts/blob/main/notebooks/src/dgp.py). Compared with previous articles, I generated a new DGP parent class that handles randomization and data generation, while its child classes contain specific use cases. I also import some plotting functions and libraries from [src.utils](https://github.com/matteocourthoud/Blog-Posts/blob/main/notebooks/src/utils.py). To include not only code but also data and tables, I use Deepnote, a Jupyter-like web-based collaborative notebook environment.

We have information on 10,000 website visitors, for whom we observe the monthly ad_revenue they generated, whether they were assigned to the treatment group and were using infinite_scroll, and their average monthly past_revenue.

The random treatment assignment makes the difference-in-means estimator unbiased: we expect the treatment and control group to be comparable on average, so we can causally attribute the average observed difference in outcomes to the treatment. We estimate the treatment effect by linear regression and interpret the coefficient of infinite_scroll as the estimated treatment effect.
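As a minimal sketch of why the regression coefficient can be read as the treatment effect, the snippet below simulates a small experiment and checks that the OLS coefficient on the treatment dummy coincides with the difference in means. The data-generating process and its parameter values are hypothetical stand-ins, not the author's dgp_infinite_scroll.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (NOT the author's dgp_infinite_scroll)
infinite_scroll = rng.integers(0, 2, size=n)  # random treatment assignment
ad_revenue = 2.0 + 0.15 * infinite_scroll + rng.normal(0, 1, size=n)

# Difference-in-means estimator
diff_in_means = (
    ad_revenue[infinite_scroll == 1].mean()
    - ad_revenue[infinite_scroll == 0].mean()
)

# OLS of ad_revenue on a constant and the treatment dummy
X = np.column_stack([np.ones(n), infinite_scroll])
coef, *_ = np.linalg.lstsq(X, ad_revenue, rcond=None)

print(f"difference in means: {diff_in_means:.4f}")
print(f"OLS coefficient:     {coef[1]:.4f}")
```

With a single binary regressor and a constant, the two numbers agree up to floating-point error.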

It seems that infinite_scroll was indeed a good idea: it increased the average monthly revenue by $0.1524. Moreover, the effect is significantly different from zero at the 1% significance level.

We could further improve the precision of the estimator by controlling for past_revenue in the regression. We do not expect an appreciable change in the estimated coefficient, but the precision should improve (if you want to know more on control variables, check my other articles on CUPED and DAGs).

Indeed, past_revenue is highly predictive of current ad_revenue, and the standard error of the estimated coefficient for infinite_scroll decreases by one-third.
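The precision gain from a predictive control variable can be illustrated on simulated data. This is only a sketch: the helper ols_se, the data-generating process, and all coefficient values are assumptions made for illustration, so the size of the improvement will differ from the one reported above.

```python
import numpy as np

def ols_se(X, y):
    """Return OLS coefficients and their classical standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical data: past revenue strongly predicts current revenue
past_revenue = rng.normal(2, 1, size=n)
infinite_scroll = rng.integers(0, 2, size=n)
ad_revenue = (
    0.5 + 0.15 * infinite_scroll + 0.8 * past_revenue + rng.normal(0, 1, size=n)
)

const = np.ones(n)
_, se_short = ols_se(np.column_stack([const, infinite_scroll]), ad_revenue)
_, se_long = ols_se(
    np.column_stack([const, infinite_scroll, past_revenue]), ad_revenue
)

print(f"SE of treatment effect, no controls:       {se_short[1]:.4f}")
print(f"SE of treatment effect, with past_revenue: {se_long[1]:.4f}")
```

The control absorbs part of the residual variance, so the standard error of the treatment coefficient falls while its expected value stays the same under randomization.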

So far, everything has been very standard. However, as we said at the beginning, suppose this is not the only experiment we ran trying to improve our search engine (and ultimately ad revenue). Infinite scrolling is just one idea among thousands of others that we have tested in the past. Is there a way to efficiently use this additional information?

Bayesian Statistics

One of the main advantages of Bayesian statistics over the frequentist approach is that it makes it easy to incorporate additional information into a model. The idea follows directly from the theorem behind all of Bayesian statistics: Bayes' theorem. Bayes' theorem allows you to do inference on a model by inverting the inference problem: from the probability of the model given the data to the probability of the data given the model, a much easier object to deal with.

Bayes Theorem, image by Author

We can split the right-hand side of Bayes Theorem into two components: the prior and the likelihood. The likelihood is the information about the model that comes from the data; the prior is any additional information about the model.
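Since the formula itself appears only as an image, here is the standard statement of Bayes' theorem in the notation used in this post, with the likelihood and the prior labeled:

```latex
\Pr(\text{model} \mid \text{data})
  = \frac{\overbrace{\Pr(\text{data} \mid \text{model})}^{\text{likelihood}}
          \;\overbrace{\Pr(\text{model})}^{\text{prior}}}
         {\Pr(\text{data})}
```

The denominator does not depend on the model, so the posterior is proportional to the likelihood times the prior.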

First of all, let’s map Bayes theorem into our context. What is the data, what is the model, and what is our object of interest?

the data, which consists of our outcome variable ad_revenue, y, the treatment infinite_scroll, D, and the other variables, past_revenue and a constant, which we jointly denote as X
the model, which is the distribution of ad_revenue given past_revenue and the infinite_scroll feature, y|D,X
our object of interest, which is the posterior Pr(model | data), in particular the relationship between ad_revenue and infinite_scroll

How do we use prior information in the context of AB testing, potentially including additional covariates?

Bayesian Regression

Let’s use a linear model to make it directly comparable with the frequentist approach:

Conditional distribution of y|x, image by Author

This is a parametric model with two sets of parameters: the linear coefficients β and τ, and the variance of the residuals σ. An equivalent, but more Bayesian, way to write the model is:

Conditional distribution of y|x, image by Author

where the semicolon separates the data from the model parameters. Unlike in the frequentist approach, in Bayesian regression we do not rely on the central limit theorem to approximate the conditional distribution of y; we directly assume it is normal.
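The two formulas are shown only as images, so the following is a plausible reconstruction from the surrounding text rather than the author's exact notation. It assumes that σ denotes the residual standard deviation, so that the variance is σ²:

```latex
% Linear model with normal errors
y \mid D, X \;\sim\; \mathcal{N}\!\left( X\beta + D\tau,\; \sigma^2 \right)

% Same model, with the parameters made explicit after the semicolon
y \mid D, X ;\, \beta, \tau, \sigma \;\sim\; \mathcal{N}\!\left( X\beta + D\tau,\; \sigma^2 \right)
```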

We are interested in doing inference on the model parameters, β, τ, and σ. Another core difference between the frequentist and the Bayesian approach is that the former assumes that the model parameters are fixed and unknown, while the latter allows them to be random variables.

This assumption has a very practical implication: you can easily incorporate previous information about the model parameters in the form of prior distributions. As the name says, priors contain information that was available before looking at the data. This leads to one of the most relevant questions in Bayesian statistics: how do you choose a prior?

Priors

When choosing a prior, one analytically appealing restriction is to have a prior distribution such that the posterior belongs to the same family. These priors are called conjugate priors. For example, before seeing the data, I assume my treatment effect is normally distributed and I would like it to be normally distributed also after incorporating the information contained in the data.

In the case of Bayesian linear regression, the conjugate priors for β, τ, and σ are normally and inverse-gamma distributed. Let’s start by blindly using a standard normal and inverse gamma distribution as prior.

Prior distributions, image by Author

We use the probabilistic programming package PyMC to do inference. First, we need to specify the model: the prior distributions of the different parameters and the likelihood of the data.

PyMC has an extremely nice function that allows us to visualize the model as a graph, model_to_graphviz.

Diagram of the model, image by Author

From the graphical representation, we can see the various model components, their distributions, and how they interact with each other.

We are now ready to compute the model posterior. How does it work? In short, we sample realizations of model parameters, we compute the likelihood of the data given those values and derive the corresponding posterior.
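To make the sampling idea concrete, here is a minimal random-walk Metropolis sketch for a single parameter τ. It is deliberately simplified and is not what PyMC does by default (PyMC uses more sophisticated samplers such as NUTS): the data are summarized by one estimate tau_hat with a standard error se, the value of se is a hypothetical assumption, and the prior is the standard normal.

```python
import math
import random
from statistics import mean

random.seed(0)

tau_hat = 0.157  # experimental estimate, treated as the observed data
se = 0.02        # hypothetical standard error (an assumption)
prior_mu, prior_sd = 0.0, 1.0  # standard normal prior on tau

def log_posterior(tau):
    """Unnormalized log posterior: log prior plus log likelihood."""
    log_prior = -0.5 * ((tau - prior_mu) / prior_sd) ** 2
    log_lik = -0.5 * ((tau_hat - tau) / se) ** 2
    return log_prior + log_lik

samples = []
tau = tau_hat  # starting value
for _ in range(20_000):
    proposal = tau + random.gauss(0, se)  # random-walk proposal
    log_ratio = log_posterior(proposal) - log_posterior(tau)
    # Accept with probability min(1, posterior ratio)
    if random.random() < math.exp(min(0.0, log_ratio)):
        tau = proposal
    samples.append(tau)

posterior_mean = mean(samples[2_000:])  # drop the first draws as burn-in

# Closed-form posterior mean for the normal-normal model, for comparison
exact = (tau_hat / se**2 + prior_mu / prior_sd**2) / (1 / se**2 + 1 / prior_sd**2)

print(f"sampled posterior mean: {posterior_mean:.4f}")
print(f"exact posterior mean:   {exact:.4f}")
```

In this toy case the posterior is available in closed form, which makes it easy to check that the sampler recovers it; in richer models only the sampled version is available.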

The fact that Bayesian inference requires sampling has historically been one of the main bottlenecks of Bayesian statistics, since it makes it noticeably slower than the frequentist approach. However, this is less and less of a problem with the increased computational power of modern computers.

We are now ready to inspect the results. First, with the summary() method, we can print a model summary very similar to those produced by the [statsmodels](https://www.statsmodels.org/dev/index.html) package we used for linear regression.

The estimated parameters are extremely close to the ones we got with the frequentist approach, with an estimated effect of the infinite_scroll equal to 0.157.

While sampling has the disadvantage of being slow, it has the advantage of being very transparent. We can directly plot the distribution of the posterior. Let’s do it for the treatment effect τ. The PyMC function plot_posterior plots the distribution of the posterior, with a black bar for the credible interval, the Bayesian equivalent of a 95% confidence interval.

Posterior distribution of τ̂, image by Author

As expected, since we chose conjugate priors, the posterior distribution looks Gaussian.

So far we have chosen the prior without much guidance. However, suppose we had access to past experiments. How do we incorporate this specific information?

Past Experiments

Suppose that the idea of the infinite scroll was just one among a ton of other ideas that we tried and tested in the past. For each idea, we have the data on the corresponding experiment, with the corresponding estimated coefficient.

We have generated 1000 estimates from past experiments. How do we use this additional information?

Normal Prior

The first idea could be to calibrate our prior to reflect the data distribution in the past. Keeping the normality assumption, we use the estimated average and standard deviations of the estimates from past experiments.

On average, past ideas had practically no effect on ad_revenue, with an average effect of 0.0009.

However, there was considerable variation across experiments, with a standard deviation of 0.029.

Let’s rewrite the model, using the mean and standard deviation of past estimates for the prior distribution of τ.

Let’s sample from the model

and plot the sample posterior distribution of the treatment effect parameter τ.

Posterior distribution of τ̂, image by Author

The estimated coefficient is noticeably smaller: 0.11 instead of the previous estimate of 0.16. Why is that the case?

The fact is that the previous coefficient of 0.16 is extremely unlikely, given our prior. We can compute the probability of getting the same or a more extreme value, given the prior.

The probability of this value is virtually zero. Therefore, the estimated coefficient has moved towards the prior mean of 0.0009.
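Both steps can be sketched with the standard library. The first part computes the upper-tail probability of 0.16 under the N(0.0009, 0.029) prior. The second part uses the textbook normal-normal formula, in which the posterior mean is a precision-weighted average of the estimate and the prior mean. The standard error se is a hypothetical value, and treating the experimental estimate as a single normal observation is a simplification of the full regression model.

```python
from statistics import NormalDist

# Prior calibrated on past experiments (mean and standard deviation from the text)
prior = NormalDist(mu=0.0009, sigma=0.029)
tau_hat = 0.16

# Probability, under the prior, of an effect at least as large as tau_hat
tail_prob = 1 - prior.cdf(tau_hat)
print(f"P(tau >= {tau_hat}) under the prior: {tail_prob:.2e}")

# Normal-normal update: weight on the data grows with the prior variance
se = 0.02  # hypothetical standard error of the experimental estimate
weight_data = prior.variance / (prior.variance + se**2)
posterior_mean = weight_data * tau_hat + (1 - weight_data) * prior.mean
print(f"weight on the data: {weight_data:.2f}")
print(f"posterior mean:     {posterior_mean:.3f}")
```

With a tight prior, a sizeable share of the weight goes to the prior mean, which is why the estimate is pulled towards zero.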

Student-t Prior

So far, we have assumed a normal distribution for all linear coefficients. Is it appropriate? Let’s check it visually (check here for other methods on how to compare distributions), starting from the intercept coefficient β₀.

Distribution of past estimates of the intercept coefficient β₀ (image)

The distribution seems pretty normal. What about the treatment effect parameter τ?

Distribution of past estimates of the treatment effect τ (image)

The distribution is very heavy-tailed! While at the center it looks like a normal distribution, the tails are much "fatter" and we have a couple of very extreme values. Excluding measurement error, this is a setting that happens often in the industry, where most ideas have extremely small or null effects, and very few ideas are breakthroughs.

One way to model this distribution is a Student's t distribution. In particular, we use a Student's t with location 0.0009, scale 0.003, and 1.3 degrees of freedom, chosen to match the empirical distribution of past estimates.

Let’s sample from the model.

And plot the sample posterior distribution of the treatment effect parameter τ.

Posterior distribution of τ̂, image by Author

The estimated coefficient is now again similar to the one we got with the standard normal prior, 0.11. However, the estimate is more precise since the credible interval has shrunk from [0.077, 0.016] to [0.065, 0.015].

What has happened?

Shrinking

The answer lies in the shape of the different prior distributions that we have used:

standard normal, N(0,1)
normal with matched moments, N(0, 0.03)
t-student with matched moments, t₁.₃(0, 0.003)

Let’s plot all of them together.

Different prior distributions, image by Author

As we can see, all distributions are centered on zero, but they have very different shapes. The standard normal distribution is essentially flat over the [-0.15, 0.15] interval. Every value has basically the same probability. The last two instead, even though they have the same mean and variance, have very different shapes.

How does it translate into our estimation? We can plot the implied posterior for different estimates, for each prior distribution.

Effect of priors on experiment estimates, image by Author

As we can see, the different priors transform the experimental estimates in very different ways. The standard normal prior has essentially no effect on estimates in the [-0.15, 0.15] interval. The normal prior with matched moments instead shrinks each estimate to approximately 2/3 of its value (for example, from 0.16 to 0.11). The effect of the Student's t prior is instead non-linear: it shrinks small estimates towards zero, while it keeps large estimates almost as they are. The dotted grey line marks the effects of the different priors for our experimental estimate τ̂.
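The shrinkage pattern can be approximated numerically without any sampling. The sketch below computes the posterior mean of τ on a grid for the three priors, given a single estimate with a normal likelihood. Several things are assumptions: the standard error se is hypothetical, the second argument of each prior is read as a standard deviation or scale, and the grid is truncated to [-1, 1].

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def student_t_pdf(x, mu, scale, nu):
    """Density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / scale
    norm = math.gamma((nu + 1) / 2) / (
        math.gamma(nu / 2) * math.sqrt(nu * math.pi) * scale
    )
    return norm * (1 + z * z / nu) ** (-(nu + 1) / 2)

# Grid of candidate values for tau: [-1, 1] in steps of 0.0001
GRID = [i / 10_000 for i in range(-10_000, 10_001)]

def posterior_mean(tau_hat, se, prior_pdf):
    """Posterior mean of tau given the estimate tau_hat ~ N(tau, se^2)."""
    weights = [prior_pdf(t) * normal_pdf(tau_hat, t, se) for t in GRID]
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

priors = {
    "standard normal": lambda t: normal_pdf(t, 0.0, 1.0),
    "matched normal": lambda t: normal_pdf(t, 0.0, 0.03),
    "Student's t": lambda t: student_t_pdf(t, 0.0, 0.003, 1.3),
}

se = 0.02  # hypothetical standard error of the experimental estimate
results = {
    name: {est: posterior_mean(est, se, pdf) for est in (0.02, 0.157)}
    for name, pdf in priors.items()
}

for name, by_estimate in results.items():
    for est, post in by_estimate.items():
        print(f"{name:>15}: estimate {est:.3f} -> posterior mean {post:.4f}")
```

Under these assumptions, the matched normal prior rescales both estimates by the same factor, while the Student's t prior pulls the small estimate close to zero and leaves the large one nearly untouched.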

Image generated by Author using NightCafé

Conclusion

In this article, we have seen how to extend the analysis of AB tests to incorporate information from past experiments. In particular, we have introduced the Bayesian approach to AB testing and we have seen the importance of choosing a prior distribution. Given the same mean and variance, assuming a prior distribution with "fat tails" (high kurtosis) implies a stronger shrinkage of small effects and a weaker shrinkage of large effects.

The intuition is the following: a prior distribution with "fat tails" is equivalent to assuming that breakthrough ideas are rare but not impossible. This has practical implications after the experiment, as we have seen in this post, but also before it. In fact, as reported by Azevedo et al. (2020), if you think the distribution of the effects of your ideas is more "normal", it is optimal to run a few large experiments to be able to discover smaller effects. If instead you think that your ideas are "breakthrough or nothing", i.e. their effects are fat-tailed, it makes more sense to run many small experiments, since you don’t need a large sample size to detect large effects.

References

E. Azevedo, A. Deng, J. Olea, G. Weyl, Empirical Bayes Estimation of Treatment Effects with Many A/B Tests: An Overview (2019). AEA Papers and Proceedings.
E. Azevedo, A. Deng, J. Olea, J. Rao, G. Weyl, AB Testing with Fat Tails (2020). Journal of Political Economy.
A. Deng, Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments (2015). WWW ’15 Companion.

Code

You can find the original Jupyter Notebook here:

Blog-Posts/bayes_ab.ipynb at main · matteocourthoud/Blog-Posts

Thank you for reading!

I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.

Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!

Written By

Matteo Courthoud

Causal Data Science, Causal Inference, Data Science, Deep Dives, Statistics

URL: https://towardsdatascience.com/bayesian-ab-testing-ed45cc8c964d/