Simulated Data, Real Learnings : Power Analysis

Part 2 – Experimental Power Analysis

Mar 26, 2024


image by Robert So from Pexels.com

INTRODUCTION

Simulation is a powerful tool in the data science toolbox. After reading this article, you’ll have a good understanding of how simulation can be used to estimate the power of a designed experiment. This is the second part of a multi-part series that discusses how simulation can be useful in data science and machine learning.

Here are the contents that we’ll cover:

Overview of power analysis
How to calculate power using simulation – example based approach

In this article, I will just give a quick definition of data simulation:

Data simulation is the creation of fictitious data that mimics the properties of the real-world.

In part 1 of this series, I discuss the definition of data simulation much more extensively – you can check it out at the link below:

Simulated Data, Real Learnings : Part 1

OVERVIEW OF POWER ANALYSIS

Experimentation is the gold standard for learning about relationships in the world around us. There are many considerations to take into account when planning an experiment. Even though experimentation is the gold standard, a poorly planned experiment can give useless or misleading results. Power analysis is a crucial component of planning good experiments.

Before we get into the details, let’s answer this value-oriented question:

Question: Why is estimating power before running an experiment important?

Answer: Because without understanding an experiment’s power, we could waste our time and resources running an experiment that cannot detect meaningful results.

In statistical lingo, power is the probability that the null hypothesis is rejected when it is in fact false. While less precise (and specific to experimental design), I like to define the power of experiments as the probability that we will pick up on a relationship, given one exists.

Power in experimentation is the probability that we will pick up on a relationship, given one exists

Power is calculated by estimating two competing forces – (1) signal and (2) noise. The relative size of these two variables determines how well we will be able to pick up trends.


When calculating power, we create two distributions – one distribution has a mean of zero (being interpreted as our experimental variable has no relationship with the response variable) and the other distribution has a non-zero mean (interpreted as our variable has a positive relationship with the response). Note that the first distribution corresponds to the null hypothesis and the second distribution corresponds to the alternative hypothesis. Loosely speaking, the power is inversely related to the level of overlap between these two distributions – i.e. more overlap = less power, less overlap = more power.

I pulled the image below from further down in the article. I’ll give a quick talk through of the image and then move onto what this article is actually about – the simulation! If you still have a lot of questions about power analysis after this section, don’t worry, I’m not the only one on the internet that has written about it!


image by author

The green distribution is our null distribution; it represents the probability distribution of estimated relationship values we could see if there is no relationship between our experimental and response variables. The blue is an alternative distribution, representing possible relationship estimates if the relationship between the experimental and response variables is a positive number. The red line is the cutoff for the top 5% of the null (green) distribution. I’m going to skip a lot of details here and just say that the area in the blue distribution to the right of the red line is the power. If this doesn’t make sense, remember that Google is your friend 😊 We’ve got to get on to talking about simulation in this article about simulation!
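To make the "area to the right of the red line" idea concrete, here is a minimal sketch using idealized normal curves. The means and standard deviations are made-up illustration values, not numbers from the experiment in this article. It computes the 5% cutoff from the null curve, then reports both the overlap and the power for a few alternative means.

```python
from statistics import NormalDist

alpha = 0.05
null_dist = NormalDist(mu=0.0, sigma=1.0)  # "no relationship" curve (assumed shape)

# red line: the value that cuts off the top 5% of the null distribution
cutoff = null_dist.inv_cdf(1 - alpha)

results = []
for alt_mean in (1.0, 2.0, 3.0):  # illustrative effect sizes, chosen arbitrarily
    alt_dist = NormalDist(mu=alt_mean, sigma=1.0)
    overlap = null_dist.overlap(alt_dist)   # shared area of the two curves
    power = 1 - alt_dist.cdf(cutoff)        # alternative area right of the cutoff
    results.append((alt_mean, overlap, power))
    print(f"alt mean={alt_mean:.1f}  overlap={overlap:.3f}  power={power:.3f}")
```

As the alternative curve moves away from the null curve, the overlap shrinks and the power grows, which is the inverse relationship described above.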

HOW TO CALCULATE POWER USING SIMULATIONS

Now that we’ve quickly talked about what power is, let’s finally get to the meat of the article – how simulation can help us calculate power and therefore help us design better experiments!

I’m a big fan of learning through examples, so my primary teaching medium in this section will be walking through a simulation example.

Here’s the scenario: you work for an advertising firm that has tasked you with estimating the power of a specific designed experiment. This experiment is trying to understand the impact of an advertising strategy on sales. The design splits the country into multiple sub-groups and randomly selects one group to get the advertising campaign and another to serve as the control.

Since this isn’t an article on experimental design, we’ll make the simplifying assumption that we’ve decided to employ the simple difference-in-difference (DID) approach for our analysis (I won’t go too much into this approach here – again, google is your friend!). The difference-in-difference method cuts our data into four groups (1) pre/control, (2) post/control, (3) pre/test and (4) post/test. Only the post/test set of data will be impacted by the campaign, the other three data sets will not receive any advertising from the campaign and will serve to control for confounding factors.

Note: while we selected the DID approach for this example, we can modify our code/approach to calculate power for any analysis approach.


Separation of data for difference-in-difference method – image by author
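As a quick worked example of the difference-in-difference arithmetic, the sketch below plugs four made-up average purchase rates (one per group) into the calculation. The numbers are purely illustrative assumptions.

```python
# hypothetical average purchase rates for the four data sets
pre_test, post_test = 0.30, 0.36        # test group, before and after the campaign
pre_control, post_control = 0.40, 0.41  # control group, before and after

# change over time within each group
test_diff = post_test - pre_test            # includes campaign effect + background drift
control_diff = post_control - pre_control   # background drift only

# difference-in-difference: the change in test not explained by the control's drift
did = test_diff - control_diff
print(f"test diff={test_diff:.2f}, control diff={control_diff:.2f}, DID={did:.2f}")
```

Here the test group rose by 6 points and the control group by 1 point, so the DID estimate attributes roughly 5 points to the campaign.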

To estimate the power of our test, we are going to run two simulations. Simulation 1: the program has no impact and Simulation 2: the program has an impact level that we set. The first simulation corresponds to the null hypothesis (campaign has no impact on sales) and the second simulation represents the alternative hypothesis (campaign has a positive impact on sales). Each simulation will provide us one data point (in our specific case, it will be the DID calculation). E.g. if we run simulation 1 once, we calculate one DID value from that simulation.

The next step is to run each of the two simulations multiple times to get two distributions of our DID metrics. These two distributions are all we need to estimate the power. We then make our power calculation based on the overlap between the distributions. With little overlap we have higher power and vice versa.
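The step from "two distributions" to "one power number" can be sketched on its own. The helper below, `power_from_samples`, is a hypothetical name introduced for illustration. It takes any two arrays of simulated metrics, finds the cutoff on the null array, and counts how much of the alternative array lands above it. The normal draws used to exercise it are stand-ins, not DID values from the experiment.

```python
import numpy as np

def power_from_samples(null_samples, alt_samples, alpha=0.05):
    """Estimate one-sided power from two arrays of simulated metrics."""
    null_samples = np.asarray(null_samples, dtype=float)
    alt_samples = np.asarray(alt_samples, dtype=float)
    # cutoff = value marking the top `alpha` share of the null distribution
    cutoff = np.quantile(null_samples, 1 - alpha)
    # power = share of alternative results that exceed the cutoff
    power = np.mean(alt_samples > cutoff)
    return float(power), float(cutoff)

# stand-in data: made-up normal draws, seeded for reproducibility
rng = np.random.default_rng(42)
null_draws = rng.normal(0.00, 0.02, 10_000)
alt_draws = rng.normal(0.05, 0.02, 10_000)

power, cutoff = power_from_samples(null_draws, alt_draws)
print(f"cutoff={cutoff:.4f}  power={power:.3f}")
```

Because the helper only needs two arrays, the same calculation applies whatever metric the simulations produce.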

Below is an illustration of the four datasets that correspond to the two simulations. Note that when we create Simulation 2 data, we set the simulated impact for the test/post data set (labeled as ‘simulate impact >0’) to a specific level e.g. 5% increase in sales.


break down of 2 types of simulations needed for power analysis

Now that we understand how we are going to run our simulations, let’s get some Python code going to actually create the simulated data and calculate the power!

Here’s how the code ties to the set up we just discussed:


import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

def calculate_diff_in_diff(pre_test, pre_control,
                           post_test, post_control):

    '''
    Calculates the diff in diff given the 4 averages from the
    4 data sets used to calculate diff in diff.

    inputs:
        pre_test (float)     : average for the pre/test data
        pre_control (float)  : average for the pre/control data
        post_test (float)    : average for the post/test data
        post_control (float) : average for the post/control data

    output:
        diff_in_diff_ratio (float) : diff in diff calculation
                                     (a difference of differences,
                                     not a ratio, despite the name)

    '''

    # change over time within each group
    test_diff = post_test - pre_test
    control_diff = post_control - pre_control

    # change in test group net of the change in control group
    diff_in_diff_ratio = test_diff - control_diff

    return diff_in_diff_ratio

def simulate_group_data(n_customers,
                        rv_dist_func,
                        rv_inputs,
                        program_impact = 0):

    '''
    Simulates a single dataset for diff in diff simulation

    inputs:
        n_customers (int)      : number of customers to be simulated
                                 in the dataset
        rv_dist_func (func)    : function used to sample random
                                 variables
        rv_inputs (dict)       : keyword inputs to the rv_dist_func
        program_impact (float) : additive impact that program
                                 has on probability of customer
                                 purchasing. Should only be
                                 non-zero for post/test data
                                 default = 0

    output:
        purch_rate (float) : average purchase rate
                             (total purchs / total custs)
                             for the simulated data set

    '''

    # draw a purchase probability for each customer
    purch_prob = rv_dist_func(**rv_inputs, size = n_customers)

    # add impact of treatment if simulating post/test data
    purch_prob += program_impact

    # convert to binary: a customer purchases when their purchase
    # probability is at least as large as a uniform(0, 1) draw
    conv_to_binary_prob = np.random.uniform(0, 1, n_customers)

    purch_binary = np.where(purch_prob >= conv_to_binary_prob, 1, 0)

    purch_rate = np.sum(purch_binary) / len(purch_binary)

    return purch_rate

def simulate_multiple_times(n_sims,
                            pre_test_sim_inputs,
                            post_test_sim_inputs,
                            pre_control_sim_inputs,
                            post_control_sim_inputs,
                            cut_off_decimal = 0.05):

    '''
    Uses functions that run a single simulation
    to run multiple simulations - aggregates the
    results of the multiple simulations

    inputs:
        n_sims (int)                   : number of simulations to run
        pre_test_sim_inputs (dict)     : keyword args for running
                                         a single pre/test simulation
        post_test_sim_inputs (dict)    : keyword args for running
                                         a single post/test simulation
        pre_control_sim_inputs (dict)  : keyword args for running
                                         a single pre/control simulation
        post_control_sim_inputs (dict) : keyword args for running
                                         a single post/control simulation
        cut_off_decimal (float)        : decimal form of significance level
                                         for power analysis, e.g. 5% = 0.05
                                         default = 0.05

    outputs:
        Nothing, but prints power at a significance level set with
        the cut_off_decimal input and creates visualization of
        the null and alternative histograms

    '''

    diff_in_diff_impact = []
    # run with given impact for intervention
    for i in range(n_sims):

        pre_test_sim = simulate_group_data(**pre_test_sim_inputs)
        post_test_sim = simulate_group_data(**post_test_sim_inputs)
        pre_control_sim = simulate_group_data(**pre_control_sim_inputs)
        post_control_sim = simulate_group_data(**post_control_sim_inputs)

        temp_diff_in_diff = calculate_diff_in_diff(pre_test_sim,
                                                   pre_control_sim,
                                                   post_test_sim,
                                                   post_control_sim)

        diff_in_diff_impact.append(temp_diff_in_diff)

    # run w/o any impact for intervention
    # (work on a copy with the impact zeroed out, so the caller's
    #  dict is not modified and the function can be called again)
    post_test_no_impact_inputs = {**post_test_sim_inputs,
                                  'program_impact' : 0}
    diff_in_diff_no_impact = []
    for i in range(n_sims):

        pre_test_sim = simulate_group_data(**pre_test_sim_inputs)
        post_test_sim = simulate_group_data(**post_test_no_impact_inputs)
        pre_control_sim = simulate_group_data(**pre_control_sim_inputs)
        post_control_sim = simulate_group_data(**post_control_sim_inputs)

        temp_diff_in_diff = calculate_diff_in_diff(pre_test_sim,
                                                   pre_control_sim,
                                                   post_test_sim,
                                                   post_control_sim)

        diff_in_diff_no_impact.append(temp_diff_in_diff)

    # find the cutoff: the value that marks the top cut_off_decimal
    # share of the 'no impact' distribution
    cutoff = 1 - cut_off_decimal
    cutoff_index = int(len(diff_in_diff_no_impact) * cutoff)
    sorted_data = sorted(diff_in_diff_no_impact)
    top_5_percent_cutoff = sorted_data[cutoff_index]

    # calculate power: flag simulated-impact results at or below the
    # cutoff (not rejected), then take the remaining share as the power
    not_rejected_ind = np.where(
        np.array(diff_in_diff_impact) <= top_5_percent_cutoff, 1, 0)
    power = 1 - np.sum(not_rejected_ind) / len(not_rejected_ind)
    print(f'power for scenario specifics is {power}')

    # Kernel density estimation (KDE) to estimate the PDFs
    sns.kdeplot(diff_in_diff_impact, color='blue',
                label='simulated impact', fill=True)
    sns.kdeplot(diff_in_diff_no_impact, color='green',
                label='no impact', fill=True)
    plt.axvline(x=top_5_percent_cutoff, color='red', linestyle='--')

    # Plot settings
    plt.xlabel('Difference in Differences')
    plt.ylabel('Probability Density')
    plt.title('No Impact vs. Simulated Impact')
    plt.legend()
    plt.show()

    return

# running simulations

# set up sample sizes
test_n_customers = 1500
control_n_customers = 1500

# set inputs for the four needed data sets
pre_test_sim_inputs = {'n_customers' : test_n_customers,
                       'rv_dist_func' : np.random.uniform,
                       'rv_inputs' : {'low' : 0, 'high' : 0.60}}

# only the post/test data gets the simulated campaign impact
# (0.05 = the 5% impact discussed in the results that follow)
post_test_sim_inputs = {'n_customers' : test_n_customers,
                        'rv_dist_func' : np.random.uniform,
                        'rv_inputs' : {'low' : 0, 'high' : 0.60},
                        'program_impact' : 0.05}

pre_control_sim_inputs = {'n_customers' : control_n_customers,
                          'rv_dist_func' : np.random.uniform,
                          'rv_inputs' : {'low' : 0, 'high' : 0.80}}

post_control_sim_inputs = {'n_customers' : control_n_customers,
                           'rv_dist_func' : np.random.uniform,
                           'rv_inputs' : {'low' : 0, 'high' : 0.80}}

# run multiple simulations
simulate_multiple_times(1000,
                        pre_test_sim_inputs,
                        post_test_sim_inputs,
                        pre_control_sim_inputs,
                        post_control_sim_inputs
                        )

In the code above, we are running each of the two simulations 1000 times. If we simulate an impact of 5% (meaning the program adds 5 percentage points, a program_impact of 0.05, to each customer’s probability of purchase) and a sample size of 1500 customers per group, we get the results shown below.


image by author

The red line represents the cutoff for the top 5% of the green, ‘no impact’ distribution (the distribution that corresponds to simulation 1). At a 5% significance level, with a sample size of 1500 and an impact size of 5%, our power is about 63% – meaning that there is a 63% chance that we would conclude the program has an impact, given the simulation conditions. The 63% is calculated as the percent of the blue distribution that is to the right of the red line.

From here, the magic of simulation for power analysis comes in! A power of 63% is pretty low. Thanks to our simulation, we know that we are likely to have low power in this test. Good thing we checked before implementing the experiment! Using our framework, we can now explore how to fix the low power problem.

There are two non-trivial ways to increase power – (1) increase sample size and (2) increase impact of the test. Let’s look at both!

We’ll increase the sample size from 1500 to 3000. Easy enough to do in the code:

# set up sample sizes for test and control
test_n_customers = 3000
control_n_customers = 3000


image by author

We now have a power of about 91%! We can implement an increase in sample size in our designed experiment by running the experiment for longer or increasing the number of geographical regions that are going to be included in the test. Based on our power analysis, we should definitely do one of those given our low power at the current sample size.
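As a sanity check on the two simulated power figures, we can approximate the same setup analytically. This sketch is not part of the original analysis. It assumes each group's purchase rate is approximately normal with binomial variance, that the test and control base rates are 0.30 and 0.40 (the means of the uniform draws used in the simulation), and that the impact is an additive 0.05. The function name `approx_did_power` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_did_power(n_customers, impact, test_rate=0.30,
                     control_rate=0.40, alpha=0.05):
    """Normal-approximation power for the DID of four purchase rates."""
    def rate_var(p):
        # variance of an average of n Bernoulli(p) purchases
        return p * (1 - p) / n_customers

    # DID is a sum/difference of four independent rates, so variances add
    null_sd = sqrt(2 * rate_var(test_rate) + 2 * rate_var(control_rate))
    alt_sd = sqrt(rate_var(test_rate) + rate_var(test_rate + impact)
                  + 2 * rate_var(control_rate))

    # cutoff from the null curve, power from the alternative curve
    cutoff = NormalDist(0.0, null_sd).inv_cdf(1 - alpha)
    return 1 - NormalDist(impact, alt_sd).cdf(cutoff)

power_1500 = approx_did_power(1500, 0.05)
power_3000 = approx_did_power(3000, 0.05)
print(f"n=1500: {power_1500:.2f}   n=3000: {power_3000:.2f}")
```

Under these assumptions the approximation should land within a few points of the simulated values. A large gap between the two would be a hint to re-check the simulation code. The simulation remains the more flexible tool, since it does not rely on the normal approximation.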

The second approach – increasing the impact size – is less straightforward. We can really only influence this directionally. If we knew exactly how to change the impact size, we wouldn’t need to run a test (because we would understand the relationship between our experimental variable and our response variable)! If we want to increase the impact size, we have to ramp up the treatment level – in this example, that would mean more aggressive advertising campaigns. We know that if a relationship exists between sales and the campaign, increasing the budget for the campaign would likely increase the impact of the program. But again, we don’t know how much larger the impact will be given a more intense campaign – we just expect it to be bigger.

We can get an idea of how our power is changed by an increase in impact by simulating different impact levels. The table below shows the corresponding power levels to different levels of simulated impact. This can be used to help us understand our confidence at various levels. If the campaign has an impact smaller than 5%, we are not very likely to pick up on it. If this is not okay, we can either increase the sample size to reduce noise, or ramp up the campaign to make it more likely that we will observe higher impacts.

We can take a table like this to our stakeholders and make sure everyone understands what levels of impact the test is likely to pick up on. If everyone is okay with a low probability of being able to detect <5% impact, then we can move forward. If not, we can modify the experimental design.


power by simulated impacts – image by author
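A table of this kind can be generated with a short sweep over impact levels. The sketch below is not the original code. It is a vectorized shortcut that draws each group's purchase count directly from a binomial distribution, assuming base purchase rates of 0.30 (test) and 0.40 (control), which are the means of the uniform draws used earlier. The function name, the number of simulations and the list of impact levels are all illustrative choices.

```python
import numpy as np

def simulated_power(n_customers, impact, n_sims=2000, alpha=0.05, seed=0):
    """Simulate null and alternative DID distributions and return the power."""
    rng = np.random.default_rng(seed)
    test_rate, control_rate = 0.30, 0.40  # assumed base purchase rates

    def simulate_did(post_test_rate):
        # one purchase rate per simulation for each of the four data sets
        pre_t = rng.binomial(n_customers, test_rate, n_sims) / n_customers
        post_t = rng.binomial(n_customers, post_test_rate, n_sims) / n_customers
        pre_c = rng.binomial(n_customers, control_rate, n_sims) / n_customers
        post_c = rng.binomial(n_customers, control_rate, n_sims) / n_customers
        return (post_t - pre_t) - (post_c - pre_c)

    null_did = simulate_did(test_rate)           # campaign has no impact
    alt_did = simulate_did(test_rate + impact)   # campaign adds `impact`
    cutoff = np.quantile(null_did, 1 - alpha)
    return float(np.mean(alt_did > cutoff))

impacts = [0.01, 0.03, 0.05, 0.07, 0.10]  # illustrative impact levels
powers = {imp: simulated_power(1500, imp) for imp in impacts}
for imp, p in powers.items():
    print(f"impact={imp:.2f}  power={p:.2f}")
```

Changing the first argument of `simulated_power` repeats the sweep at a different sample size, which gives stakeholders a grid of power by sample size and impact.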

Following this simulation process, our hypothetical selves are now prepared to make an informed experimental design decision! We have the tools to understand how many customers we need to include in the experiment to reach certain levels of power. We also understand, given a specific number of customers, how likely we are to pick up on trends of various sizes. And it is all thanks to simulation!

CONCLUSION

Simulation can be an extremely useful tool for calculating the power of an experiment. It is approach agnostic: it will work for difference-in-difference just as well as for regression or traditional hypothesis testing. We can get very useful insights from performing power analysis with simulation. We can understand how large a sample size we need (imagine the waste if we created a test with a sample size that was too small or much too big!) and we can understand what level of impact is required for us to pick up on it. With the power simulation calculations, we can greatly improve the quality of our experiments and the learnings we get from them!

Written By

Jarom Hulet


Data Science, Machine Learning, Programming, Python, Statistics


URL: https://towardsdatascience.com/simulated-data-real-learnings-power-analysis-652045eeae22/
