Why Bootstrapping Actually Works
A simple layman's explanation of why this popular technique in data science makes sense
We do not always have abundant data for our projects. Often, we only have one sample dataset to work with due to the lack of resources to perform repeated experiments (e.g. for A/B testing).
Fortunately, we have resampling methods to make the most of whatever data we have. Bootstrapping is a resampling technique that provides information otherwise unavailable if we fit our model only once on the original sample.
While we may be familiar with the 'what' and 'how' behind bootstrapping, this article aims to present the 'why' of bootstrapping in layman's terms.
Quick Recap of Bootstrapping
The goal of bootstrapping is to create an estimate (e.g., sample mean x̄) for a population parameter (e.g., population mean θ), and to gauge how much that estimate varies, using multiple data samples obtained from the original sample.
Bootstrapping is done by repeatedly sampling (with replacement) the sample dataset to create many simulated samples. Each simulated bootstrap sample is used to calculate an estimate of the parameter, and these estimates are then combined to form a sampling distribution.
The bootstrap sampling distribution then allows us to draw statistical inferences, such as estimating the standard error of our estimate or building a confidence interval around it.
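The recap above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a definitive implementation: the exponential data, the sample size of 200, and the 5,000 resamples are all assumptions chosen for the example, and the mean is used as the estimate because its standard error also has a textbook formula we can compare against.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical original sample (assumption): 200 skewed observations
sample = rng.exponential(scale=2.0, size=200)
n = sample.size
n_boot = 5000  # number of bootstrap samples (assumption)

# Each row of idx holds the positions for one resample, drawn with replacement
idx = rng.integers(0, n, size=(n_boot, n))

# One estimate (here, the mean) per bootstrap sample
boot_means = sample[idx].mean(axis=1)

# The spread of the bootstrap estimates approximates the standard error
boot_se = boot_means.std(ddof=1)

# For the mean, the usual formula s / sqrt(n) gives a reference value
analytic_se = sample.std(ddof=1) / np.sqrt(n)

# A simple percentile interval from the bootstrap sampling distribution
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"bootstrap SE: {boot_se:.4f}, formula SE: {analytic_se:.4f}")
print(f"95% percentile interval: ({ci_low:.3f}, {ci_high:.3f})")
```

The two standard errors should land close to each other. The appeal of the bootstrap is that the same recipe still applies when the estimate is something like a median or a model coefficient, where no simple formula is at hand.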
Why Does Bootstrapping Work?
You must be wondering: how can repeatedly sampling the same sample dataset allow us to make inferences about the population?
Ideally, we would want to draw multiple independent real-world samples from the true population to understand the population parameters. However, we have earlier established that this might not always be possible.
Therefore, we must work with our sample dataset, which becomes the best (and only) information we have about the population.
The reasonable thing to assume is that most samples (if drawn randomly) will look pretty much like the population from which they originate. With this in mind, it means that our sample data can be treated as a population that we now pretend represents the true population.
With this pretend-population in place, we can draw multiple (bootstrap) random samples from it. It is as if we are now obtaining multiple samples from the true population.
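Note: In reality, the original sample is the only sample we have from the true population.
Because we sample with replacement, each bootstrap sample differs from the original sample: some observations appear more than once while others are left out. This variation from one bootstrap sample to the next is what lets them stand in for fresh random samples from the population.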
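To see what sampling with replacement does to a single bootstrap sample, here is a small sketch. The sample size of 1,000 and the use of row indices as stand-in data are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 1000
original = np.arange(n)  # stand-in for the rows of the original sample

# One bootstrap sample: same size as the original, drawn with replacement
boot = rng.choice(original, size=n, replace=True)

# Fraction of original observations that made it into this bootstrap sample
unique_frac = np.unique(boot).size / n

# Expected fraction: 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632)
expected_frac = 1 - (1 - 1 / n) ** n

print(f"observed: {unique_frac:.3f}, expected: {expected_frac:.3f}")
```

Roughly a third of the original observations are missing from any given bootstrap sample, and their places are taken by repeats of other observations. That is why the estimates differ from one bootstrap sample to the next.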
The aggregated sampled information from these bootstrap samples will ultimately help us get (relatively) accurate estimates of the population parameter, e.g., population mean.
So how effective is bootstrap sampling? The image above compares estimates of a parameter (α) from 1,000 simulated samples from the true population against estimates from 1,000 bootstrap samples.
We can see that the boxplots have similar spreads, indicating that the bootstrap approach can effectively estimate the variability associated with the parameter estimate.
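A comparison in the same spirit can be simulated directly. The sketch below is not the setup behind the image: it assumes a normal population, a sample size of 100, and uses the mean as the estimate in place of α, all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n, n_rep = 100, 1000  # sample size and number of repetitions (assumptions)

def draw_population_sample(size):
    """Draw from a hypothetical 'true' population (assumption: normal)."""
    return rng.normal(loc=10.0, scale=3.0, size=size)

# (a) The ideal case: 1,000 independent samples from the true population
pop_estimates = np.array(
    [draw_population_sample(n).mean() for _ in range(n_rep)]
)

# (b) The realistic case: one observed sample, resampled 1,000 times
observed = draw_population_sample(n)
idx = rng.integers(0, n, size=(n_rep, n))
boot_estimates = observed[idx].mean(axis=1)

pop_sd = pop_estimates.std(ddof=1)
boot_sd = boot_estimates.std(ddof=1)

print(f"spread across true-population samples: {pop_sd:.3f}")
print(f"spread across bootstrap samples:       {boot_sd:.3f}")
```

The two spreads should be of similar size even though (b) only ever sees one sample. Note that the bootstrap estimates are centered on the observed sample's mean rather than the population mean, so it is the variability, not the center, that the bootstrap recovers here.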
Summary
In this article, we explored a simple explanation of the intuition behind bootstrapping. Hopefully, this write-up has given you a better appreciation of bootstrapping and why it works theoretically and in practice.
The key concept is that the original sample is assumed to be representative of the population. By resampling this sample many times, we get a relatively accurate sampling distribution of the sample estimate of the population parameter.
There are, of course, several caveats involved in this. For example, when sampling from the true population under normal circumstances, we would never take a sample the same size as the entire population. In bootstrapping, however, it is common for each bootstrap sample to be the same size as the original dataset.
For more details on the numerous caveats, you can check out this StatsExchange forum thread here. I also look forward to your feedback on this topic.
Before you go
I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun bootstrapping!