
URL: https://towardsdatascience.com/sampling-distribution-sample-mean-fcf69484535e/

Sampling Distribution - sample mean | Towards Data Science


Sampling Distribution - sample mean

with Python simulation and examples


One of the most important concepts discussed in the context of inferential data analysis is the idea of sampling distributions. Understanding sampling distributions helps us better comprehend and interpret results from our descriptive as well as predictive data analysis investigations. Sampling distributions are also frequently used in decision making under uncertainty and hypothesis testing.

Photo by Marcin Jozwiak on Unsplash

What are sampling distributions?

You may already be familiar with the idea of probability distributions. A probability distribution describes the likelihood associated with the values (or ranges of values) that a random variable may assume. A random variable is a quantity whose value (outcome) is determined randomly. Some examples of random variables include the monthly revenue of a retail store, the number of customers arriving at a car wash location on any given day, the number of accidents on a certain highway on any given day, weekly sales volume at a retail store, etc. Although the outcome of a random variable is random, the probability distribution allows us to gain an understanding of how likely different values are to occur. Sampling distributions are probability distributions that we attach to sample statistics.

Sample mean as a sample statistic

A sample statistic (also known simply as a statistic) is a value computed from a sample. Here is an example: suppose you collect the results of a survey filled out by 250 randomly selected individuals who live in a certain neighborhood. Based on the survey results you find that the average annual income of the individuals in this sample is $82,512. This is a sample statistic and is denoted by x̅ = $82,512. The sample mean is also a random variable (denoted by X̅) with a probability distribution. The probability distribution for X̅ is called the sampling distribution of the sample mean. Sampling distributions can be defined for other sample statistics as well, including the sample proportion, sample regression coefficients, the sample correlation coefficient, etc.

You might be wondering why X̅ is a random variable while the sample mean is just a single number! The key to understanding this lies in the idea of sample-to-sample variability. This idea refers to the fact that samples drawn from the same population are not identical. Here’s an example: suppose that, instead of conducting only one survey of 250 individuals living in a particular neighborhood, we conducted 35 surveys of the same size in that neighborhood. If we calculated the sample mean for each of the 35 samples, we would get 35 different values. Now suppose, hypothetically, we conducted a very large number of surveys of the same size in that neighborhood. We would get a very large number of (different) values for the sample mean. The distribution of those sample means is what we call the sampling distribution of the sample mean. Thinking about the sample mean from this perspective, we can see how X̅ (note the capital letter) is the random variable representing sample means, while x̅ (note the small letter) is just one realization of that random variable.

Sampling distribution of the sample mean

Assuming that X represents the data (population), if X has a distribution with average μ and standard deviation σ, and if X is approximately normally distributed or if the sample size n is large,

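$$\bar{X} \sim N\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$$

that is, X̅ is normally distributed with mean μ and standard deviation (standard error) σ/√n.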

The above distribution is only valid if,

  • X is approximately normal or sample size n is large, and,
  • the data (population) standard deviation σ is known.

If X is normal, then X̅ is also normally distributed regardless of the sample size n. Central Limit Theorem tells us that even if X is not normal, if the sample size is large enough (usually greater than 30), then X̅’s distribution is approximately normal (Sharpe, De Veaux, Velleman and Wright, 2020, pp. 318–320). If X̅ is normal, we can easily standardize and convert it to the standard normal distribution Z.
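The standardization mentioned above can be written out explicitly. As a sketch, using the same symbols as before (μ and σ for the population, n for the sample size):

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Probabilities about X̅ can then be read from the standard normal distribution. For example, for any threshold a (a symbol introduced here only for illustration), P(X̅ > a) = P(Z > (a − μ)/(σ/√n)).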

If the population standard deviation σ is not known, we have to estimate it from the sample, and the standardized sample mean can no longer be assumed to follow the standard normal distribution. If certain conditions are satisfied (explained below), then we can transform X̅ to another random variable t such that,

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

where s is the sample standard deviation.

The random variable t is said to follow the t-distribution with n-1 degrees of freedom, where n is the sample size. The t-distribution is bell-shaped and symmetric (just like the normal distribution) but has fatter tails compared to the normal distribution. This means values further away from the mean have a higher likelihood of occurring compared to that in the normal distribution.
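To see the fatter tails in practice, here is a minimal simulation sketch using only Python's standard library. The population parameters (μ = 20, σ = 3), the small sample size n = 5, the cutoff of 2, and the seed are all assumptions chosen for illustration. For each sample it computes a z value (using the known σ) and a t value (using the sample standard deviation s) and counts how often each lands more than 2 units away from zero.

```python
import math
import random
import statistics

rng = random.Random(7)  # arbitrary seed, only for reproducibility
mu, sigma, n, n_reps = 20, 3, 5, 20_000  # a small n makes the difference visible

z_extreme = 0
t_extreme = 0
for _ in range(n_reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    z = (x_bar - mu) / (sigma / math.sqrt(n))  # standardized with the known sigma
    t = (x_bar - mu) / (s / math.sqrt(n))      # standardized with the estimate s
    z_extreme += abs(z) > 2
    t_extreme += abs(t) > 2

normal_tail = 2 * (1 - statistics.NormalDist().cdf(2))  # P(|Z| > 2) for a standard normal
print(f"P(|Z| > 2), theory: {normal_tail:.3f}")
print(f"share of |z| > 2:   {z_extreme / n_reps:.3f}")
print(f"share of |t| > 2:   {t_extreme / n_reps:.3f}")
```

The share of extreme t values should come out noticeably larger than the share of extreme z values, which is the fat-tail behavior described above.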

The conditions to use the t-distribution for the random variable t are as follows (Sharpe et al., 2020, pp. 415–420):

  • If X is normally distributed, the t-distribution can be used even for small sample sizes (n < 15).
  • If the sample size is between 15 and 40, the t-distribution can be used as long as X is unimodal and reasonably symmetric.
  • For sample sizes greater than 40, the t-distribution can be used unless X’s distribution is heavily skewed.

Simulation with Python

Let’s draw a sample of size n=250 from the normal distribution. Here we are assuming that our data is normally distributed with parameters μ = 20 and σ = 3. We collect one sample from this population and compute its mean.
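A minimal sketch of this step, using only Python's standard library (the function name `draw_sample_mean` is a hypothetical helper, not taken from the article):

```python
import random
import statistics

MU, SIGMA, N = 20, 3, 250  # population parameters and sample size from the text

def draw_sample_mean(rng, n=N, mu=MU, sigma=SIGMA):
    """Draw one sample of size n from N(mu, sigma) and return its mean."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(sample)

rng = random.Random()  # unseeded, so every run gives a different realization
x_bar = draw_sample_mean(rng)
print(f"sample mean: {x_bar:.3f}")
```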

Running this code once gives me one instance (or realization) of the random variable X̅. Below are 10 values of x̅ obtained after I ran this code 10 times.

[Image: ten simulated values of x̅]

But if I ran this code 10,000 times, recorded the values of x̅, and plotted their frequency (or density), I would get the following result.

The distribution of the sample mean (image by author).

As you can see, the distribution is approximately symmetric and bell-shaped (just like the normal distribution) with an average of approximately 20 and a standard error that is approximately equal to 3/sqrt(250) = 0.19.
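The numbers quoted above can be checked with a short simulation. This is a sketch using only the standard library, with an arbitrary seed; it repeats the sampling 10,000 times and compares the spread of the simulated sample means with the theoretical standard error σ/√n.

```python
import math
import random
import statistics

rng = random.Random(42)  # arbitrary seed, only for reproducibility
mu, sigma, n, n_reps = 20, 3, 250, 10_000

# One sample mean per repetition: 10,000 realizations of X-bar
sample_means = [
    statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(n_reps)
]

mean_of_means = statistics.fmean(sample_means)
se_simulated = statistics.stdev(sample_means)
se_theory = sigma / math.sqrt(n)

print(f"average of sample means: {mean_of_means:.3f}")
print(f"simulated standard error: {se_simulated:.3f}")
print(f"theoretical standard error: {se_theory:.3f}")
```

Plotting a histogram of `sample_means` with any plotting library would give a bell-shaped picture like the one described above.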

Sampling from the same population with different sample sizes will result in different measures of spread in the outcome distribution. As we expect, increasing the sample size will reduce the standard error and therefore, the distribution will be narrower around its average. Note that the distribution of X̅ is normal even for extremely small sample sizes. This is because X is normally distributed.

The effect of sample size on the standard error of the distribution for the sample mean (image by author).
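The effect of sample size can also be checked numerically. The sketch below assumes the same normal population (μ = 20, σ = 3); the three sample sizes, the number of repetitions, and the helper name `simulated_se` are illustrative choices, not the ones behind the figure.

```python
import math
import random
import statistics

rng = random.Random(0)  # arbitrary seed, only for reproducibility
mu, sigma, n_reps = 20, 3, 2_000

def simulated_se(n):
    """Standard deviation of n_reps simulated sample means of size n."""
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_reps)
    ]
    return statistics.stdev(means)

# Map each sample size to (simulated standard error, sigma / sqrt(n))
results = {n: (simulated_se(n), sigma / math.sqrt(n)) for n in (5, 30, 250)}

for n, (sim, theory) in results.items():
    print(f"n={n:>3}: simulated SE={sim:.3f}, sigma/sqrt(n)={theory:.3f}")
```

The simulated standard error should shrink roughly in proportion to 1/√n as the sample size grows.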

What if the population (data) is not normal?

No worries! Even if your data is not normally distributed, if the sample size is large enough, the distribution of X̅ can still be approximated using the normal distribution (according to the Central Limit Theorem). The following figure shows the distribution of X̅ when X is heavily skewed to the left. As you can see, X̅’s distribution tends to mimic the distribution of X for small sample sizes. However, as the sample size grows, the distribution of X̅ becomes more symmetric and bell-shaped. As mentioned above, if the sample size is large (usually larger than 30), X̅’s distribution is approximately normal regardless of what the distribution of X is.

X̅’s distribution is normal for large sample sizes, even when X has a skewed distribution (image by author).
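This behavior can be reproduced with a small simulation. The population used for the figure is not specified, so the sketch below assumes a hypothetical left-skewed population (10 minus an Exponential(1) draw) and measures the skewness of the simulated sample means for a few sample sizes. The helper names and the seed are illustrative.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed, only for reproducibility
n_reps = 5_000

def draw_population_value():
    """Hypothetical left-skewed population: 10 minus an Exponential(1) draw."""
    return 10 - rng.expovariate(1.0)

def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

skew_by_n = {}
for n in (2, 30, 200):
    means = [
        statistics.fmean(draw_population_value() for _ in range(n))
        for _ in range(n_reps)
    ]
    skew_by_n[n] = skewness(means)

for n, skew in skew_by_n.items():
    print(f"n={n:>3}: skewness of sample means = {skew:.2f}")
```

A skewness near zero indicates a symmetric distribution, so the values should move toward zero as n increases.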

Example and applications

Knowing the distribution of X̅ can help us solve problems where we need to use inferential data analysis to make decisions under uncertainty. Many business problems require decision-making tools that can address the stochastic and probabilistic nature of random events. Hypothesis testing is one of those tools, frequently used in many different business domains including retail operations, marketing, quality assurance, etc.

For example, suppose a retail store has run a major marketing campaign and is interested in investigating the effects of the campaign on the store's average sales. Suppose that the management would like to investigate whether average daily sales are now greater than $8,000. The following hypotheses express this research question:

$$H_0: \mu = 8000 \qquad H_A: \mu > 8000$$

Note that we are conducting a test on the population average sales, hence the μ. To address the test, suppose we record sales volumes over 40 days (a sample with n=40) and calculate the required statistics. Suppose the average and standard deviation of daily sales volumes are calculated as x̅=$8,100 and s=$580, respectively. Since the value of σ is not known, we estimate it with s and, taking μ = $8,000 as the value under the null hypothesis, convert X̅ to the random variable t with n-1=39 degrees of freedom where,

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{8100 - 8000}{580/\sqrt{40}} \approx 1.09$$

To address the test, we need to find the p-value associated with the test. This probability is calculated as,

$$\text{p-value} = P(t_{39} > 1.09) \approx 0.14$$

The probability density function for the random variable t along with the p-value of the test are depicted below.

The p-value for the test is highlighted in the picture (image by author).

The following will find the p-value for the test.
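A minimal sketch of that calculation, using only the standard library: the upper-tail probability of the t-distribution is obtained here by integrating its density numerically with Simpson's rule, which is a stand-in technique chosen so the example has no dependencies. The helper names `t_pdf` and `t_upper_tail` are hypothetical. If SciPy is installed, `scipy.stats.t.sf(t_stat, df)` returns the same tail probability directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t_stat, df, steps=2000):
    """P(T > t_stat): 0.5 minus the Simpson's-rule integral of the density on [0, t_stat]."""
    h = t_stat / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Values from the example: sample mean, hypothesized mean, sample sd, sample size
x_bar, mu_0, s, n = 8100, 8000, 580, 40

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)

print(f"t statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.2f}")
```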

The calculations give a p-value of approximately 0.14. By most standards (significance levels), this is a large p-value, indicating that we fail to reject the null hypothesis. In other words, based on the distribution of X̅ and the sample collected, we cannot conclude that the average daily sales volume at the retail store, μ, is greater than $8,000. This calculation was possible only because we knew what the distribution of X̅ was.

Sampling distributions could be defined for other sample statistics (e.g., sample proportions, regression predictor coefficients, etc.) and are also used in other contexts like confidence and prediction intervals or inferential analysis on regression results.


[1]: Sharpe N. R., De Veaux R. D., Velleman P. F., Wright D. (2020) Business Statistics, Fourth Canadian Edition. Pearson Canada Inc.


Written By

Behrouz Bakhtiari
