Data Science Fundamentals – A/B Testing

We Use a Simple Example to Explore the Ins and Outs of A/B Testing (a.k.a. Hypothesis Testing)

Jun 24, 2019

14 min read

Now that data science boot camp (Metis) is over, it’s time to study up for interviews. Since I started blogging, I’ve discovered that writing about a concept and attempting to teach it to readers forces me to learn that concept much more deeply.

So in the next few weeks, I will be going one by one over all the core tools that every data scientist and aspiring data scientist (like me) should have in their tool belt so that we can all ace our interviews (fingers crossed)! Now on to today’s topic!

_(If you are interested in the code I used for this analysis, you can find it here on my GitHub)_

Hypothesis Testing in Disguise

Photo by it’s me neosiam from Pexels

If you have a statistics background, at some point you probably wondered, "Is A/B testing the same thing as hypothesis testing?" Yes, it is! So let’s figure out A/B testing by exploring how hypothesis testing works via a simple example.

Imagine that our client, the owner of a highly successful personal finance app, came to us with the following problem:

"Tony, our new app redesign is supposed to help people increase the amount of money that they save. But does it actually work? Please help us figure that out so we can decide whether or not to deploy it."

So our job is to figure out whether people save more due to the new app design. First, we need to figure out whether we have the data we need. We ask, "What data have you already collected that might be helpful?"

It turns out our client has already run an experiment and gathered some data:

Six months ago, our client randomly selected 1,000 newly signed up users and assigned 500 of them to the control group and 500 to the experimental group.
The control group went on to use the current app.
Meanwhile, the experimental group was exposed to the redesigned app.
All users started with a 0% savings rate.
The 1,000 users represent just a small portion of the app’s total users.

After six months, our client records the savings rate of all 1,000 users in the experiment. Savings rate is the percentage of each user’s monthly paycheck that he or she saves. She finds the following:

The control group has an average savings rate of 12%, up from 0%, with a standard deviation of 5%.
The experimental group has an average savings rate of 13%, up from 0%, with a standard deviation of 5%.

The results of our experiment look like this when plotted on a histogram:

Histogram of Control and Experimental Group Savings Rates

The members of the experimental group do appear to have ended up with higher savings rates than those of the control group after six months. So is it enough to just plot this histogram, show it to our client, and call it a day?

No, because we still cannot be sure this increase in savings that we observe is real. By dumb luck, we could have sampled users for our experiment in such a way that the people with the desire to save more all ended up in the experimental group. To address this concern, we need to ask the following question:

How likely are we to get the results that we observed from random chance?

Answering this question is the crux of the hypothesis test (and an A/B test).

The Null Hypothesis

Imagine for a second that in reality the new app design does NOT help users to save more. However, even if the new design is a dud, it is still possible to observe an increase in the savings rate when we conduct our experiment.

Photo by San Fermin Pamplona from Pexels

How could that happen? It can happen because we are sampling. For example, if I picked 100 people at random from a large crowd of thousands and calculated their average height, I might get something like 5 feet 8 inches. Then if I did it a few more times, I might get 5 feet 10 inches the next time and 5 feet 7 inches the time after that.

Because we are calculating our statistics using samples and not the entire population, every sample mean that we calculate will be different.
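
To make the height example concrete, here is a minimal sketch. The population (10,000 people with a mean height of 68 inches and a standard deviation of 3 inches) is made up purely for illustration, as is the seed:

```python
import numpy as np

# Hypothetical population: 10,000 heights in inches (mean 68, sd 3 are made-up values).
rng = np.random.default_rng(seed=0)
population = rng.normal(loc=68.0, scale=3.0, size=10_000)

# Draw three independent samples of 100 people and compare their averages.
sample_means = [
    float(rng.choice(population, size=100, replace=False).mean()) for _ in range(3)
]

print("Population mean:", round(float(population.mean()), 2))
print("Sample means:   ", [round(m, 2) for m in sample_means])
```

Each sample mean should land near the population mean, but no two of them should match exactly. That spread is the sampling variation we need to account for.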

Knowing that sampling causes variation, we can reframe our question above into the following:

If the new app design truly has zero effect on people’s savings, what is the probability of observing as large an increase in savings as we did from random chance?

Stated formally, our null hypothesis is that the increase in savings rates for the control group is equal to the increase in savings rates for the experimental group.

Our job is now to test the null hypothesis. We can do so with a probability thought experiment.

Simulating the Experiment Over and Over Again

Imagine that we can easily and instantly run our experiment again and again. Also, we are still in the parallel world where the new app design is a dud and has zero effect on users’ savings. What would we observe?

For the curious, here is how we simulate this:

Take 500 draws (there are 500 users in our control group and another 500 in our experimental group) each of two normally distributed random variables with the same statistical characteristics as our control group (mean = 12%, standard deviation = 5%). These will be our control and experimental groups (same mean because we are in the world where our new design has zero effect). Real savings rates are bounded and need not be normally distributed, so a different distribution would be technically more correct, but we use normally distributed variables for simplicity.
Record the difference in mean savings between the groups (i.e. we subtract the mean savings rate of the control group from the mean savings rate of the experimental group).
Do this 10,000 times.
Plot a histogram of the differences in mean savings between the groups.
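
The steps above can be sketched roughly as follows. This is not necessarily the exact code from the GitHub repo; the function name, the seed, and the use of NumPy are my own choices for this illustration, and the plotting step is left out:

```python
import numpy as np

N_USERS = 500            # users per group, from the experiment
MEAN, STD = 0.12, 0.05   # control group's mean and standard deviation
N_SIMS = 10_000          # number of simulated experiments

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for reproducibility

def simulate_null_differences(n_sims=N_SIMS):
    """Return the difference in group means for each simulated experiment,
    assuming the redesign has zero effect (both groups share one distribution)."""
    diffs = np.empty(n_sims)
    for i in range(n_sims):
        control = rng.normal(MEAN, STD, N_USERS)
        experimental = rng.normal(MEAN, STD, N_USERS)
        diffs[i] = experimental.mean() - control.mean()
    return diffs

null_diffs = simulate_null_differences()
print("Mean of simulated differences:", round(float(null_diffs.mean()), 5))
print("Std of simulated differences: ", round(float(null_diffs.std(ddof=1)), 5))
```

The array `null_diffs` holds one mean difference per simulated experiment, which is what the histogram is built from.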

When we do this, we get the histogram below. The histogram shows how much the mean savings rate difference between groups varies due to random chance (driven by sampling).

The red vertical line shows the mean savings rate difference we actually observed (1%) when our client ran her experiment. The percentage of observations to the right of the red line in the histogram below is the value we are after – the probability of observing as large an increase in savings as 1% from random chance (we do a one-tailed test here because it is easier to understand and visualize).

Histogram Showing the Difference Between Group Means for 10,000 Simulations (Assuming New Design Has Zero Effect on Savings Rates)

In this case that value is very low – in only nine out of the 10,000 experiments we ran (assuming the new design has zero effect on savings), did we observe a difference in group means of 1% or greater.

This means that there is only a 0.09% chance of observing a value as high as we did due to random chance!
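
Counting the simulations to the right of the red line is a one-liner once we have the simulated differences. Here is a hedged, vectorized sketch (the seed is arbitrary, so the exact count will differ a little from the nine I got):

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed; results vary slightly by seed

# 10,000 simulated experiments x 500 users per group, all drawn from the
# same distribution (mean 12%, sd 5%) because the null hypothesis is assumed true.
control_means = rng.normal(0.12, 0.05, size=(10_000, 500)).mean(axis=1)
experimental_means = rng.normal(0.12, 0.05, size=(10_000, 500)).mean(axis=1)
null_diffs = experimental_means - control_means

observed_diff = 0.01  # the 1% difference the client measured

# One-tailed p-value: share of simulated differences at least as large as observed.
n_extreme = int((null_diffs >= observed_diff).sum())
p_value = n_extreme / len(null_diffs)
print(f"{n_extreme} of {len(null_diffs)} simulations reached 1% -> p = {p_value:.4f}")
```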

This 0.09% chance is our p-value. "Huh? Stop throwing random terms at me!", you say. There is definitely a lot of statistical terminology around hypothesis testing (and A/B testing) and we will leave most of those for Wikipedia to explain.

Our aim, as always, is to build an intuitive understanding of how and why these tools work – so in general we will avoid terminology in favor of simple explanations where we can. However, the p-value is a critical concept that you will run into a lot in the data science world so we must confront it. The p-value (the 0.09% value we calculated above in our simulation) represents:

The probability of observing a result at least as extreme as what we observed if the null hypothesis were true.

Thus, the p-value is the number that we can use to test the null hypothesis. Based on its definition, we want as low a p-value as possible: the lower the p-value, the harder it is to attribute our result to luck alone. In practice, we will set a p-value cutoff (called alpha) below which we will reject the null hypothesis and conclude that the observed effect/impact is most likely real (statistically significant).

Now let’s explore a statistical property that lets us quickly calculate p-values.

The Central Limit Theorem

Now is as good a time as any to talk about one of the foundational concepts of statistics, the Central Limit Theorem. It states that if you add up independent random variables, their normalized sum tends towards a normal distribution as you sum more and more of them. The Central Limit Theorem holds even if the random variables themselves do not come from a normal distribution.

Translation – if we calculate a bunch of sample averages (assuming our observations are independent of each other, like how flips of a coin are independent), the distribution of all those sample averages will be approximately normal, provided each sample is reasonably large.

Q-Q Plot – The Red Line Denotes a Perfectly Normal Distribution

Take a look at the histogram of the mean differences that we calculated earlier. It looks like a normal distribution, right? We can verify normality using a Q-Q plot, which compares the quantiles of our distribution against those of a reference distribution (in this case, the normal distribution). If our distribution is normal, it would adhere closely to the red 45 degree line. And it does, cool!

So when we ran our savings experiment over and over again, that was an example of the Central Limit Theorem in action!
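
To see that the theorem does not depend on the raw data being normal, here is a small sketch that starts from a heavily skewed distribution. The exponential distribution, the sample sizes, and the seed are illustrative assumptions of mine, not part of the client's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Deliberately non-normal raw data: exponential draws are heavily right-skewed.
raw = rng.exponential(scale=1.0, size=(10_000, 500))

# One sample mean per row, i.e. 10,000 repeated "experiments" of 500 observations.
sample_means = raw.mean(axis=1)

# Skewness near 0 suggests a symmetric, bell-like shape.
skew_raw = float(stats.skew(raw.ravel()))
skew_means = float(stats.skew(sample_means))
print("Skewness of raw draws:   ", round(skew_raw, 3))
print("Skewness of sample means:", round(skew_means, 3))

# probplot returns the Q-Q points plus a least-squares fit; r close to 1 means
# the points sit close to a straight line on a Q-Q plot.
(_, _), (slope, intercept, r) = stats.probplot(sample_means, dist="norm")
print("Q-Q correlation r:", round(float(r), 4))
```

The raw draws are strongly skewed, while the sample means should be nearly symmetric and hug the Q-Q line.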

So why does this matter?

Remember how we tested the null hypothesis earlier by running 10,000 experiments? Doesn’t that sound super tiring? In reality, it’s both tiring and costly to repeatedly run experiments. But thanks to the Central Limit Theorem we don’t have to!

We know what the distribution of our repeated experiments will look like (the normal distribution), and we can use this knowledge to statistically infer the distribution of our 10,000 experiments without actually running them!

Let’s Review What We Know So Far:

We observe a difference in mean savings rate of 1% between the control and experimental group. And we want to know whether this is a real difference or just statistical noise.
We know that we need to take the experiment’s results with a grain of salt because we conducted it on only a small sample of the client’s total user base. If we did it again on a new sample, the results would change.
Since we are worried that in reality the new app design has no impact on savings, our null hypothesis is that the difference in means between the control and experimental group is zero.
We know from the Central Limit Theorem that if we were to repeatedly sample and conduct new experiments, the results of those experiments (the observed mean difference between the control and experimental groups) would have a normal distribution.
And from statistics, we know that when we take the difference of two independent random variables, the variance of the result is equal to the sum of the individual variances:

$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
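
If the variance rule feels unintuitive, we can check it numerically. In this sketch the two distributions (standard deviations of 2 and 3) and the seed are arbitrary choices made only for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Two independent variables with different (made-up) standard deviations.
x = rng.normal(loc=0.0, scale=2.0, size=1_000_000)  # Var(X) = 4
y = rng.normal(loc=5.0, scale=3.0, size=1_000_000)  # Var(Y) = 9

var_of_difference = float(np.var(x - y))
sum_of_variances = float(np.var(x) + np.var(y))

print("Var(X - Y):      ", round(var_of_difference, 3))
print("Var(X) + Var(Y): ", round(sum_of_variances, 3))
```

Both numbers should come out close to 13. Note that the variances add even though we are subtracting the variables, and that this only works because the two are independent.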

Completing the Job

Nice! We now have everything that we need to run our hypothesis test. So let’s go ahead and complete the job we received from our client:

Same Histogram as Above (Pasted Again for Reference)

First, before we get biased by looking at the data, we need to choose a cutoff, called alpha (if our calculated p-value is less than alpha, we reject the null hypothesis and conclude that the new design increases savings rates). The alpha value corresponds to our probability of incurring a false positive – rejecting the null hypothesis when it is actually true. 0.05 is pretty standard among statisticians so we will go with that.
Next, we need to calculate the test statistic. The test statistic is the numerical equivalent of the histogram above and tells us how many standard deviations away from the null hypothesis value (in our case zero), the observed value (1%) is. We can calculate it like so:

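$$\text{Test Statistic} = \frac{\text{Observed Value} - \text{Hypothesized Value}}{\text{Standard Error}}$$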

The Standard Error is the standard deviation of the difference between the experimental group’s average savings rate and the control group’s average savings rate. In the plot above, the standard error is represented by the width of the blue histogram. Recall that the variance of the difference of two random variables is equal to the sum of the individual variances (and standard deviation is the square root of variance). We can easily calculate the standard error using information that we already have:

$$\text{Standard Error} = \sqrt{\frac{2 \times \text{Sample Variance}}{N}}$$

Remember that both the control and experimental group’s savings rates had a standard deviation of 5%. So our sample variance is 0.0025 and N is the number of observations in each group so N is equal to 500. Plugging these numbers into the formula, we get a standard error of 0.316%.
In the test statistic formula, Observed Value is 1% and Hypothesized Value is 0% (since our null hypothesis is that there is no effect). Plugging in those values along with the standard error we just calculated into the test statistic formula, we get a test statistic of 0.01/0.00316 = 3.16.
Our observed value of 1% is 3.16 standard deviations away from the hypothesized value of 0%. That’s a lot. We can use the Python code below to calculate the p-value (for a two-tailed test). The p-value we get is 0.0016. Note that we use p-values for a two-tailed test because we can’t automatically assume that the new design is either the same as or better than the current one – the new design could also be worse, and a two-tailed test accounts for that possibility (more on this here).

from scipy.stats import norm

# Two Tailed Test
print('The p-value is: ' + str(round((1 - norm.cdf(3.16))*2,4)))
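
For readers who want the whole calculation in one place, here is a minimal sketch that goes from the summary statistics to the p-value. The variable names are mine; the inputs are the numbers from the client's experiment:

```python
from math import sqrt
from scipy.stats import norm

std_control = std_experimental = 0.05  # both groups: 5% standard deviation
n_per_group = 500
observed_diff = 0.13 - 0.12            # experimental mean minus control mean
hypothesized_diff = 0.0                # null hypothesis: no difference

# Variance of a difference of independent means = sum of each mean's variance.
standard_error = sqrt(std_control**2 / n_per_group + std_experimental**2 / n_per_group)
z = (observed_diff - hypothesized_diff) / standard_error

# norm.sf(z) is the upper-tail area 1 - cdf(z); doubling it gives the two-tailed p-value.
p_two_tailed = 2 * norm.sf(abs(z))

print(f"standard error = {standard_error:.5f}")
print(f"test statistic = {z:.2f}")
print(f"two-tailed p   = {p_two_tailed:.4f}")
```

The printed values should agree with the standard error, test statistic, and p-value worked out in the steps above.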

The p-value of 0.0016 is below our alpha of 0.05 so we reject the null hypothesis and tell our client that yes, it appears that the new app design does indeed help her users save more money. Hurray, victory!

Photo by Rakicevic Nenad from Pexels

Finally, note that the p-value we calculated analytically of 0.0016 is different from the 0.0009 that we simulated earlier. That’s because the simulation we ran was one-tailed (one-tailed tests are easier to understand and visualize). We can reconcile the values by multiplying the simulated p-value by two (to account for the second tail) to get 0.0018 – pretty close to 0.0016.
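
We can also go the other way and compute the analytical one-tailed p-value, which is the number that is directly comparable to the simulation. A quick sketch:

```python
from scipy.stats import norm

z = 3.16  # test statistic calculated earlier

p_one_tailed = norm.sf(z)        # area to the right of z only
p_two_tailed = 2 * p_one_tailed  # both tails, by symmetry of the normal curve

print(f"one-tailed p = {p_one_tailed:.4f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
```

The one-tailed value should come out at roughly 0.0008, in the same ballpark as the 0.0009 from the simulation. The small gap is just simulation noise.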

Conclusion

In the real world, A/B testing won’t be as clear cut as our fictitious example. Most likely our client (or boss) won’t have ready-to-use data for us and we will have to do our own data gathering and cleaning. Here are some additional practical issues to keep in mind when preparing to A/B test:

How much data do you need? Data is time consuming and expensive to gather. A badly run experiment might even end up alienating users. But if you don’t gather enough observations, your tests will not be very reliable. So you will need to carefully balance the benefits of more observations with the incremental costs of gathering them.
What are the costs of falsely rejecting a true null hypothesis (Type 1 Error) versus the costs of failing to reject a false null hypothesis (Type 2 Error)? Going back to our example, a Type 1 Error is equivalent to green-lighting the new app design when it actually has no effect on savings. And a Type 2 Error is the same as sticking with the current design when the new one actually encourages people to save more. We trade off between the risk of Type 1 and 2 errors by picking a reasonable cutoff value, alpha. A higher alpha increases the risk of Type 1 Error and a lower alpha increases the risk of Type 2 Error.
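
Both questions can be explored with a rough back-of-the-envelope calculation. The sketch below uses the standard normal-approximation formulas for a two-tailed comparison of two means. The 80% power target is a common convention that I am assuming here, and I also assume the true effect is the 1% we observed and that the standard deviation is 5%. The function names are my own:

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_group(effect, std, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-tailed z-test
    comparing two means with equal standard deviations."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the Type 1 error rate
    z_power = norm.ppf(power)          # quantile for the desired power (1 - Type 2 rate)
    return ceil(2 * ((z_alpha + z_power) * std / effect) ** 2)

def power_of_test(effect, std, n_per_group, alpha=0.05):
    """Approximate probability of rejecting the null when the true effect is `effect`
    (ignores the negligible chance of rejecting in the wrong direction)."""
    standard_error = std * sqrt(2 / n_per_group)
    return float(norm.sf(norm.ppf(1 - alpha / 2) - effect / standard_error))

print("Users per group for 80% power:", sample_size_per_group(0.01, 0.05))
for alpha in (0.01, 0.05, 0.10):
    power = power_of_test(0.01, 0.05, 500, alpha)
    print(f"alpha = {alpha:.2f} -> Type 2 error risk = {1 - power:.3f}")
```

Under these assumptions, a bit under 400 users per group would be enough, so the client's 500 per group was a reasonable size. The loop also shows the tradeoff directly: as alpha shrinks, the Type 2 error risk grows.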

Hopefully this was informative, cheers!

If you got all the way here, please check out some other pieces by me:

This is my favorite out of all my pieces so far, it’s about neural nets

Why random forests are great

I miss my Metis bootcamp experience and friends already!

A project I worked on while at Metis, investing in Lending Club loans

My first data science post, logistic regression

Written By

Tony Yiu
