An Interactive Guide to Hypothesis Testing in Python

T-Test, ANOVA, Chi-Squared Test with Examples

Apr 15, 2022

10 min read

👁 Statistical Tests in Python Cheatsheet (image from author's website)

Statistical Tests in Python Cheatsheet (image from author’s website)

What is Hypothesis Testing?

Hypothesis testing is an essential part in inferential statistics where we use observed data in a sample to draw conclusions about unobserved data – often the population.

Implication of hypothesis testing:

clinical research: widely used in psychology, biology and healthcare research to examine the effectiveness of clinical trials
A/B testing: can be applied in business context to improve conversions through testing different versions of campaign incentives, website designs etc
feature selection in machine learning: filter-based feature selection methods use different statistical tests to determine the feature importance
college or university: well, if you major in statistics or data science, it is likely to appear in your exams …

4 Steps in Hypothesis testing

(If you prefer a video walk-through, please also check out my YouTube video.)

Step 1. Define null and alternative hypothesis

Null hypothesis (H0) can be stated differently depends on the statistical tests, but generalize as the claim that no difference, no relationship or no dependency exists between two or more variables.

Alternative hypothesis (H1) is contradictory to the null hypothesis and it claims that relationships exist. It is the hypothesis that we would like to prove right. However, a more conservational approach is favored in statistics where we always assume null hypothesis is true and try to find evidence to reject the null hypothesis.

Step 2. Choose the appropriate statistical test

Common types of statistical testing including t-tests, z-tests, anova test and chi-square test

👁 choose the appropriate statistical test (image by author)

choose the appropriate statistical test (image by author)

T-test: compare two groups/categories of numeric variables with small sample size

Z-test: compare two groups/categories of numeric variables with large sample size

ANOVA test: compare the difference between two or more groups/categories of numeric variables

Chi-Squared test: examine the relationship between two categorical variables

Correlation test: examine the relationship between two numeric variables

Step 3. Calculate the p-value

How p value is calculated varies based on statistical tests. Firstly, based on the mean and standard deviation of the observed sample data, we are able to derive the test statistics value (e.g. t-statistics). Then calculate the probability of getting this test statistics given the distribution of the null hypothesis (e.g. student t-distribution), we will find out the p-value. We will use some examples to demonstrate this in more detail.

Step 4. Determine the statistical significance

p value is then compared against the significance level (also noted as alpha value) to determine whether there is sufficient evidence to reject the null hypothesis. The significance level is a predetermined probability threshold – commonly 0.05. If p value is larger than the threshold, it means that the value is likely to occur in the distribution when the null hypothesis is true. On the other hand, if lower than significance level, it means it is very unlikely to occur in the null hypothesis distribution – so reject the null hypothesis.

In addition to significance level, Power Analysis is also commonly used in hypothesis testing, check out my article on statistical power.

Statistical Power in Hypothesis Testing – Visually Explained

Hypothesis Testing with Examples

Kaggle dataset "Customer Personality Analysis" is used in this case study to demonstrate different types of statistical test: T-test, ANOVA and Chi-Square test. They are sensitive to large sample size, and almost certainly will generate very small p-value when sample size is large . Therefore, I took a random sample (size of 100) from the original data:

sampled_df = df.sample(n=100, random_state=100)

T-Test

👁 T Test (image from author's website)

T Test (image from author’s website)

T-test is used when we want to test the relationship between a numeric variable and a categorical variable.There are three main types of t-test.

one sample t-test: test the mean of one group against a constant value
two sample t-test: test the difference of means between two groups
paired sample t-test: test the difference of means between two measurements of the same subject

For example, if I would like to test whether "Recency" (the number of days since customer’s last purchase ) contributes to the prediction of "Response" (whether the customer accepted the offer in the last campaign ), I can use a two sample t-test.

The first sample would be the "Recency" of customers who accepted the offer:

recency_P = sampled_df[sampled_df['Response']==1]['Recency']

The second sample would be the "Recency" of customers who rejected the offer:

recency_N = sampled_df[sampled_df['Response']==0]['Recency']

To compare the "Recency" of these two groups intuitively, we can use histogram (or distplot) to show the distributions.

👁 distplot for t-test (image by author)

distplot for t-test (image by author)

It appears that positive responses have lower Recency compared to negative responses. To quantify the difference, let’s follow the steps in hypothesis testing and carry out a t-test.

Step1. define null and alternative hypothesis

null: there is no difference in Recency between the customers who accept the offer in the last campaign and who don’t accept the offer
alternative: customers who accept the offer has lower Recency compared to customers who don’t accept the offer

Step 2. choose the appropriate test

To test the difference between two independent samples, two-sample t-test is the most appropriate statistical test. The test statistics follows student t-distribution when null hypothesis is true . The shape of t distribution is determined by the degree of freedom, calculated as the sum of two sample size minus 2.

Import the Python library scipy.stats and create the t-distribution as below.

from scipy.stats import t
rv = t(df=100-2)

Step 3. calculate the p-value

There are some handy functions in Python calculate the probability in a distribution. For any x covered in the range of the distribution, pdf(x) is the probability density function of x – which can be represented as the orange line below, and cdf(x) is the cumulative density function of x – which can be seen as the cumulative area. In this example, we are testing the alternative hypothesis that – Recency of positive response minus the Recency of negative response is less than 0. Therefore we should use a one-tail test and compare the t-statistics we get against the lowest value in this distribution – therefore p-value can be calculated as _cdf(t_statistics)_ in this case.

👁 t-statistics and t-distribution (image by author)

t-statistics and t-distribution (image by author)

_ttest_ind()_ is a handy function for independent t-test in python that has done all of these for us automatically. Pass two samples rececency_P and recency_N as the parameters, choose the alternative = "less", and we get the t-statistics and p-value.

👁 t-test in Python (image by author)

t-test in Python (image by author)

Here I use Plotly to visualize the p-value in t-distribution. Hover over the line and see how point probability and p-value changes as the x shifts. The area with filled color highlights the p-value we get for this specific example.

Check out the Code Snippet on my website, if you want to build this yourself.

Step 4. determine the statistical significance

The commonly used significance level threshold is 0.05. Since p-value here (0.012) is smaller than 0.05, we can say that it is statistically significant based on the collected sample. A lower Recency of customer who accepted the offer is likely not occur by chance. This further indicates the feature "Response" may be a strong predictor of the target variable "Recency". And if we would perform feature selection for a machine learning model predicting the "Recency" value, "Response" is likely to have high importance.

ANOVA Test

👁 ANOVA Test (image by author)

ANOVA Test (image by author)

Now that we know t-test is used to compare the mean of one or two sample groups. What if we want to test more than two samples? Use ANOVA test.

ANOVA examines the difference among groups by calculating the ratio of variance across groups vs variance within a group. Larger ratio indicates that the difference across groups is a result of the group difference rather than just random chance.

As an example, I use the feature "Kidhome" for the prediction of "NumWebPurchases". There are three values of "Kidhome" – 0, 1, 2 which naturally forms three groups.

kidhome_0 = sampled_df[sampled_df['Kidhome']==0]['NumWebPurchases']
kidhome_1 = sampled_df[sampled_df['Kidhome']==1]['NumWebPurchases']
kidhome_2 = sampled_df[sampled_df['Kidhome']==2]['NumWebPurchases']

Firstly, visualize the data. I found box plot to be the most aligned visual representation of ANOVA test.

👁 box plot for ANOVA test (image by author)

box plot for ANOVA test (image by author)

It appears there are distinct differences among three groups. So let’s carry out ANOVA test to prove if that’s the case.

1. define hypothesis:

null: there is no difference among three groups
alternative: there is difference between at least two groups

2. choose the appropriate test: ANOVA test is preferred for examining the relationships of numeric values against a categorical value with more than two groups. The test statistics of null hypothesis in ANOVA test also follows a distribution defined by degrees of freedom – which is f-distribution. The degrees of freedom is calculated by number of total samples (n) and the number of groups (k).

dfn = n – 1
dfd = n – k

from scipy.stats import f
dfn = 3-1
dfd = 100-3
rv = f(dfn, dfd)

3. calculate the p-value: To calculate the p-value of the f-statistics, we use the right tail cumulative area of the f-distribution, which is 1-rv.cdf(f_statistics).

👁 f-statistics and f-distribution (image by author)

f-statistics and f-distribution (image by author)

x = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 100000)
y = rv.pdf(x)
pvalue = 1 - rv.cdf(x)

To easily get the f-statistics and p-value using Python, we can use the function stats.f_oneway() which returns p-value: 0.00040.

f_stat, pvalue = stats.f_oneway(kidhome_0, kidhome_1, kidhome_2)

4. determine the statistical significance: Compare the p-value against the significance level 0.05, we can infer that there is strong evidence against the null hypothesis and very likely that there is difference in "NumWebPurchases" between at least two groups.

Chi-Squared Test

👁 Chi-Squared Test (image from author's website)

Chi-Squared Test (image from author’s website)

Chi-Squared test is for testing the relationship between two categorical variables. The underlying principle is that if two categorical variables are independent, then one categorical variable should have similar composition when the other categorical variable change. Let’s look at the example of whether "Education" and "Response" are independent.

First, use stacked bar chart and contingency table to summary the count of each category.

ed_contingency = pd.crosstab(sampled_df['Education'], sampled_df['Response'])

👁 Image

👁 stacked bar chart for Chi-Squared test (image by author)

stacked bar chart for Chi-Squared test (image by author)

If these two variables are completely independent to each other (null hypothesis is true), then the proportion of positive Response and negative Response should be the same across all Education groups. It seems like composition are slightly different, but is it significant enough to say there is dependency – let’s run a Chi-Squared test.

1. define hypothesis:

null: "Education" and "Response" are independent to each other.
alternative: "Education" and "Response" are dependent to each other.

2. choose the appropriate test: Chi-Squared test is chosen for categorical vsl categorical statistical tests. Chi-distribution is determined by the degree of freedom which is calculated as (row – 1) x (column – 1).

from scipy.stats import chi2
r = 5
c = 2
dof = (5-1) * (2-1)
rv = chi2(df= dof)

3. calculate the p-value: p value is calculated as the right tail cumulative area: 1-rv.cdf(chi2_statistics)

👁 chi2-statistics and chi-distribution (image by author)

chi2-statistics and chi-distribution (image by author)

x = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 100000)
y = rv.pdf(x)
pvalue = 1 - rv.cdf(x)

Python also provides a useful function to get the chi statistics and p-value given the contingency table.

chi2_stat, pvalue, dof, exp = chi2_contingency(ed_contingency)

4. determine the statistical significance: the p-value is 0.41, suggesting that it is not statistical significant. Therefore, we cannot reject the null hypothesis that these two categorical variables are independent. This also indicates that "Education" may not be a strong predictor of "Response".

Thanks for reaching so far, we have covered a lot of contents in this article but still have two important hypothesis tests that are worth discussing separately in upcoming posts.

z-test: test the difference between two categories of numeric variables – when sample size is LARGE
correlation: test the relationship between two numeric variables

If you would like to read more articles like this, I would really appreciate your support by signing up Medium membership 🙂

Take-Home Message

In this article, we interactively explore and visualize the difference between three common statistical tests: T-test, ANOVA test and Chi-Squared test. We also use examples to walkthrough essential steps in hypothesis testing:

define the null and alternative hypothesis
choose the appropriate test
calculate the p-value
determine the statistical significance

URL: https://towardsdatascience.com/an-interactive-guide-to-hypothesis-testing-in-python-979f4d62d85/