Statistics in Python - Using ANOVA for Feature Selection
Understand how to use ANOVA to compare a categorical variable with a numerical variable
In my previous article, I talked about using the chi-square statistic to select features from a dataset for machine learning. The chi-square test is used when both your independent and dependent variables are categorical. However, what if your independent variable is categorical and your dependent variable is numerical? In this case, you have to use another statistical test known as ANOVA (Analysis of Variance).
And so in this article, our discussion will revolve around ANOVA and how you use it in machine learning for feature selection. Like all my previous articles, I will use a concrete example to explain the concept.
Before we get started, it is useful to recall the feature selection methods that we have discussed so far: Pearson correlation, Spearman's rank correlation, and chi-square.
If you need a refresher on any of them, I suggest you go and check them out now (see the links below) and come back to this article once you are done. Some of the concepts discussed in this article are similar to those of the chi-square test, and so I recommend you check that out.
Statistics in Python - Using Chi-Square for Feature Selection
Statistics in Python - Collinearity and Multicollinearity
Statistics in Python - Understanding Variance, Covariance, and Correlation
What is ANOVA?
ANOVA is used for testing two variables, where:
- one is a categorical variable
- another is a numerical variable
ANOVA is used when the categorical variable has at least 3 groups (i.e., three different unique values).
If you want to compare just two groups, use the t-test. I will cover the t-test in another article.
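As a brief aside on how the two tests relate: with exactly two groups, a one-way ANOVA and the standard equal-variance t-test lead to the same conclusion, because F equals t squared and the p-values coincide. The sketch below checks this with SciPy, using the water and coke reaction times from the dataset introduced later in this article. The variable names are my own.

```python
from scipy import stats

# Reaction times for two of the drink groups (values from drinks.csv)
water = [14, 25, 23, 27, 28, 21, 26, 30, 31, 34]
coke = [25, 26, 27, 29, 25, 23, 22, 27, 29, 21]

# Independent two-sample t-test (equal variances assumed, SciPy's default)
t_stat, t_p = stats.ttest_ind(water, coke)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(water, coke)

print(t_stat ** 2, f_stat)  # the two values agree
print(t_p, f_p)             # and so do the p-values
```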
ANOVA lets you know if your numerical variable changes according to the level of the categorical variable.
ANOVA uses F-tests to statistically test the equality of means. F-tests are named after their test statistic, F, which was named in honor of Sir Ronald Fisher.
Here are some examples that make it easier to understand when you can use ANOVA.
- You have a dataset containing information of a group of people pertaining to their social media usage and the number of hours they sleep:
You want to find out if the amount of social media usage (categorical variable) has a direct impact on the number of hours of sleep (numerical variable).
- You have a dataset containing three different brands of medication and the number of days for the medication to take effect:
You want to find out if there is a direct relationship between a specific brand and its effectiveness.
ANOVA checks whether the mean of the numerical response is the same across the groups of the categorical feature. It does this by comparing the variance between the group means with the variance within the groups.
If the group means are equal, this feature has no measurable impact on the response, and hence it (the categorical variable) should not be considered for model training.
Performing ANOVA by hand
The best way to understand ANOVA is to use an example. In the following example, I use a fictitious dataset where I recorded the reaction time of a group of people when they are given a specific type of drink.
Sample Dataset
I have a sample dataset named drinks.csv containing the following content:
team,drink_type,reaction_time
1,water,14
2,water,25
3,water,23
4,water,27
5,water,28
6,water,21
7,water,26
8,water,30
9,water,31
10,water,34
1,coke,25
2,coke,26
3,coke,27
4,coke,29
5,coke,25
6,coke,23
7,coke,22
8,coke,27
9,coke,29
10,coke,21
1,coffee,8
2,coffee,20
3,coffee,26
4,coffee,36
5,coffee,39
6,coffee,23
7,coffee,25
8,coffee,28
9,coffee,27
10,coffee,25
There are 10 teams in all, and each team comprises 3 persons. Each person in a team is given a different type of drink: water, coke, or coffee. After consuming the drink, they were asked to perform some activities and their reaction time was recorded. The aim of this experiment is to determine if the drinks have any effect on a person's reaction time.
Let's first load the dataset into a Pandas DataFrame:
import pandas as pd
df = pd.read_csv('drinks.csv')
Record the observation size, which we will make use of later:
observation_size = df.shape[0] # number of observations
Visualizing the dataset
It is useful to visualize the distribution of the data using a box plot:
_ = df.boxplot('reaction_time', by='drink_type')
You can see that the three types of drinks have about the same median reaction time.
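To back up the visual impression with numbers, you can compute the median (and, for comparison, the mean and standard deviation) of each group. This is a minimal sketch that types the values from drinks.csv inline so that it runs on its own. The names `data`, `long_df`, and `summary` are my own.

```python
import pandas as pd

# Same values as drinks.csv, typed inline so the snippet is self-contained
data = {
    'water':  [14, 25, 23, 27, 28, 21, 26, 30, 31, 34],
    'coke':   [25, 26, 27, 29, 25, 23, 22, 27, 29, 21],
    'coffee': [8, 20, 26, 36, 39, 23, 25, 28, 27, 25],
}
long_df = pd.DataFrame(
    [(drink, t) for drink, times in data.items() for t in times],
    columns=['drink_type', 'reaction_time'],
)

# One row per drink type with its median, mean, and standard deviation
summary = long_df.groupby('drink_type')['reaction_time'].agg(
    ['median', 'mean', 'std'])
print(summary)
```

The medians come out at 26.5 for water and 25.5 for both coke and coffee, which matches what the box plot suggests.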
Pivoting the dataframe
To facilitate the calculation for ANOVA, we need to pivot the dataframe:
df = df.pivot(columns='drink_type', index='team')
display(df)
The columns represent the three different types of drinks and the rows represent the 10 teams. We will also use this chance to record the number of items in each group, as well as the number of groups, which we will make use of later:
n = df.shape[0] # 10; number of items in each group
k = df.shape[1] # 3; number of groups
Defining the Hypotheses
You now define your null hypothesis and alternative hypothesis, just like in the chi-square test. They are:
- H₀ (Null hypothesis): there is no difference among the group means.
- H₁ (Alternative hypothesis): at least one group mean differs significantly from the overall mean of the dependent variable.
Step 1 - Calculating the means for all groups
We are now ready to begin our calculations for ANOVA. First, let's find the mean for each group:
df.loc['Group Means'] = df.mean()
df
From here, you can now calculate the overall mean:
overall_mean = df.iloc[-1].mean()
overall_mean # 25.666666666666668
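Note that averaging the group means gives the overall mean here only because every group has the same number of observations. The sketch below illustrates the difference with two made-up groups of unequal size. The numbers are purely illustrative and are not part of drinks.csv.

```python
from statistics import mean

# Hypothetical groups with unequal sizes (not from drinks.csv)
groups = {'a': [10, 12], 'b': [20, 22, 24, 26]}

# Unweighted average of the two group means
mean_of_group_means = mean(mean(values) for values in groups.values())

# Mean over every individual observation
grand_mean = mean(x for values in groups.values() for x in values)

print(mean_of_group_means, grand_mean)  # the two values differ
```

With unequal group sizes, compute the overall mean from the raw observations instead.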
Step 2 - Calculate the Sum of Squares
Now that we have calculated the overall mean, we can proceed to calculate the following:
- Sum of squares of all observations: SS_total
- Sum of squares within: SS_within
- Sum of squares between: SS_between
Sum of squares of all observations - SS_total
The sum of squares of all observations is calculated by subtracting the overall mean from each observation, squaring each difference, and summing the squares:
Programmatically, SS_total is computed as:
SS_total = (((df.iloc[:-1] - overall_mean)**2).sum()).sum()
SS_total # 1002.6666666666667
Sum of squares within - SS_within
The sum of squares within is the sum of squared deviations of scores around their group's mean:
Programmatically, SS_within is computed as:
SS_within = (((df.iloc[:-1] - df.iloc[-1])**2).sum()).sum()
SS_within # 1001.4
Sum of squares between - SS_between
Next we calculate the sum of squared deviations of the group means from the overall mean, weighted by the number of observations in each group:
Programmatically, SS_between is computed as:
SS_between = (n * (df.iloc[-1] - overall_mean)**2).sum()
SS_between # 1.266666666666667
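For reference, the three sums of squares computed above can be written out as follows. The notation is my own and is chosen to match the code: x_ij is the i-th observation in group j, x̄_j is the mean of group j, x̄ is the overall mean, n is the number of observations per group, and k is the number of groups.

```latex
\begin{align}
SS_{total}   &= \sum_{j=1}^{k} \sum_{i=1}^{n} \left( x_{ij} - \bar{x} \right)^2 \\
SS_{within}  &= \sum_{j=1}^{k} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2 \\
SS_{between} &= n \sum_{j=1}^{k} \left( \bar{x}_j - \bar{x} \right)^2
\end{align}
```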
You can verify that:
SS_total = SS_between + SS_within
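The identity above can be checked numerically. The following is a small self-contained sketch using NumPy rather than the pivoted DataFrame. The array `x` holds the values from drinks.csv, and the lowercase variable names are my own.

```python
import numpy as np

# Rows are teams 1-10; columns are water, coke, coffee (from drinks.csv)
x = np.array([
    [14, 25, 8], [25, 26, 20], [23, 27, 26], [27, 29, 36], [28, 25, 39],
    [21, 23, 23], [26, 22, 25], [30, 27, 28], [31, 29, 27], [34, 21, 25],
], dtype=float)

n = x.shape[0]                  # observations per group
group_means = x.mean(axis=0)    # one mean per drink type
overall_mean = x.mean()

ss_total = ((x - overall_mean) ** 2).sum()
ss_within = ((x - group_means) ** 2).sum()
ss_between = (n * (group_means - overall_mean) ** 2).sum()

# The two printed values should match up to floating-point rounding
print(ss_total, ss_between + ss_within)
```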
Creating the ANOVA Table
With all the values computed, you can now complete the ANOVA table. Recall you have the following variables:
You can compute the various degrees of freedom as follows:
df_total = observation_size - 1 # 29
df_within = observation_size - k # 27
df_between = k - 1 # 2
From the above, compute the various mean squared values:
mean_sq_between = SS_between / (k - 1) # 0.6333333333333335
mean_sq_within = SS_within / (observation_size - k) # 37.08888888888889
Finally, you can calculate the F-value, which is the ratio of two variances:
F = mean_sq_between / mean_sq_within # 0.017076093469143204
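The steps so far can be collected into one small function. This is a sketch in plain Python and not the code used elsewhere in this article. The function name `one_way_anova_f` is hypothetical, and unlike the DataFrame version it weights each group by its own size, so it also handles groups of unequal length.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F-value for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares, weighted by each group's size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares around each group's own mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Values from drinks.csv
water = [14, 25, 23, 27, 28, 21, 26, 30, 31, 34]
coke = [25, 26, 27, 29, 25, 23, 22, 27, 29, 21]
coffee = [8, 20, 26, 36, 39, 23, 25, 28, 27, 25]

f_value = one_way_anova_f([water, coke, coffee])
print(f_value)  # approximately 0.0171
```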
Recall earlier that I mentioned ANOVA uses F-tests to statistically test the equality of means.
Once the F-value is obtained, you now have to refer to the f-distribution table (see http://www.socr.ucla.edu/Applets.dir/F_Table.html for one example) to obtain the f-critical value. The f-distribution table is organized based on the α value (usually 0.05). So you need to first locate the table based on α=0.05:
Next, observe that the columns of the f-distribution table are based on df1 while the rows are based on df2. You can get your df1 and df2 from the previous variables that we have created:
df1 = df_between # 2
df2 = df_within # 27
Using the values of df1 and df2, you can now locate the f-critical value by locating the df1 column and df2 row:
From the above figure, you can see that the f-critical value is 3.3541. Using this value, you can now decide whether or not to reject the null hypothesis using the F-distribution curve:
Since the f-value (0.0171, which is what we calculated) is less than the f-critical value in the f-distribution table, we fail to reject the null hypothesis. This means there is no evidence that the group means differ.
For machine learning, this feature, drink_type, should not be included for training as it seems the different types of drinks have no effect on the reaction time.
You should include a feature for training only if you reject the null hypothesis, as this means that the values of the feature (here, the drink types) have an effect on the response (here, the reaction time).
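Instead of reading the critical value off a printed table, you can also compute it. The sketch below uses the F distribution in SciPy (scipy.stats.f) with the degrees of freedom and F-value obtained above. The variable names are my own, and the F-value is typed in as a literal so that the snippet runs on its own.

```python
from scipy import stats

alpha = 0.05
df1, df2 = 2, 27                  # df_between and df_within
f_value = 0.017076093469143204    # F-value computed by hand above

# Critical value: the point that leaves alpha of the distribution to its right
f_critical = stats.f.ppf(1 - alpha, df1, df2)

# p-value: probability of seeing an F at least as large as the observed one
p_value = stats.f.sf(f_value, df1, df2)

print(round(f_critical, 4))   # 3.3541
print(round(p_value, 6))      # 0.983079
print(f_value > f_critical)   # False, so the null hypothesis is not rejected
```

Comparing the f-value with the f-critical value and comparing the p-value with α are two views of the same decision.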
Using the Stats module to calculate f-score
In the previous section, we manually calculated the f-value for our dataset. Actually, there is an easier way: use the f_oneway() function in SciPy's stats module to calculate the f-value and p-value:
import scipy.stats as stats
fvalue, pvalue = stats.f_oneway(
    df.iloc[:-1,0],
    df.iloc[:-1,1],
    df.iloc[:-1,2])
print(fvalue, pvalue) # 0.0170760934691432 0.9830794846682348
The f_oneway() function takes the groups as input and returns the ANOVA F and p-value:
In the above, the f-value is 0.0170760934691432 (identical to the one we calculated manually) and the p-value is 0.9830794846682348.
Observe that the f_oneway() function takes in a variable number of arguments:
If you have many groups, it would be quite tedious to pass in the values of all the groups one by one. So, there is an easier way:
fvalue, pvalue = stats.f_oneway(
    *df.iloc[:-1,0:3].T.values
)
I will leave the above as an exercise for you to understand how it works.
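If you would like a hint for the exercise, the key is Python's `*` unpacking operator: it passes each row of a two-dimensional array as a separate argument. The sketch below uses a NumPy array that stands in for `df.iloc[:-1,0:3].T.values`, with one row per drink type. The row order here is my own choice and does not affect the result.

```python
import numpy as np
from scipy import stats

# One row per group, standing in for df.iloc[:-1, 0:3].T.values
groups = np.array([
    [14, 25, 23, 27, 28, 21, 26, 30, 31, 34],   # water
    [25, 26, 27, 29, 25, 23, 22, 27, 29, 21],   # coke
    [8, 20, 26, 36, 39, 23, 25, 28, 27, 25],    # coffee
])

# *groups expands to f_oneway(groups[0], groups[1], groups[2])
unpacked = stats.f_oneway(*groups)
explicit = stats.f_oneway(groups[0], groups[1], groups[2])

print(unpacked.statistic, explicit.statistic)  # identical F-values
```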
Using the statsmodels module to calculate f-score
Another way to calculate the f-value is to use the statsmodels module. You first build the model using the ols() function, and then call the fit() function on the instance of the model. Finally, you call the anova_lm() function on the fitted model and specify the type of ANOVA test to perform on it:
There are 3 types of ANOVA tests to perform, but their discussion is beyond the scope of this article.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('drinks.csv')
model = ols('reaction_time ~ drink_type', data=df).fit()
sm.stats.anova_lm(model, typ=2)
The above code snippet produces the following result, which is the same as the f-value that we calculated earlier (0.017076):
The anova_lm() function also returns the p-value (0.983079). You can make use of the following rules to determine if the categorical variable has any influence on the numerical variable:
- if p < 0.05, this means that the categorical variable has significant influence on the numerical variable
- if p >= 0.05, this means that the categorical variable has no significant influence on the numerical variable
Since the p-value is 0.983079 (> 0.05), this means that drink_type has no significant influence on reaction_time.
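For feature selection, the rule above can be wrapped in a small helper that works directly on the unpivoted data. This is only a sketch: the function `anova_keep_feature` and its parameters are hypothetical names of my own, and it assumes a DataFrame with one categorical column and one numerical column.

```python
import pandas as pd
from scipy import stats


def anova_keep_feature(df, cat_col, num_col, alpha=0.05):
    """Return (p_value, keep) for one categorical feature (hypothetical helper)."""
    # One array of numerical values per level of the categorical column
    groups = [g[num_col].values for _, g in df.groupby(cat_col)]
    _, p_value = stats.f_oneway(*groups)
    return p_value, p_value < alpha


# Same values as drinks.csv, typed inline so the snippet is self-contained
data = {
    'water':  [14, 25, 23, 27, 28, 21, 26, 30, 31, 34],
    'coke':   [25, 26, 27, 29, 25, 23, 22, 27, 29, 21],
    'coffee': [8, 20, 26, 36, 39, 23, 25, 28, 27, 25],
}
long_df = pd.DataFrame(
    [(drink, t) for drink, times in data.items() for t in times],
    columns=['drink_type', 'reaction_time'],
)

p_value, keep = anova_keep_feature(long_df, 'drink_type', 'reaction_time')
print(p_value, keep)  # p is about 0.983, so the feature is not kept
```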
Summary
In this article, I have explained how ANOVA helps to determine if a categorical variable has influence on a numerical variable. So far the ANOVA test that we have discussed is known as the one-way ANOVA test. There are a few variations of ANOVA:
- One-way ANOVA: used to check how a numerical variable responds to the levels of one independent categorical variable
- Two-way ANOVA: used to check how a numerical variable responds to the levels of two independent categorical variables
- Multi-way ANOVA: used to check how a numerical variable responds to the levels of multiple independent categorical variables
Using a two-way ANOVA or multi-way ANOVA, you can investigate the combined impact of two (or more) independent categorical variables on one dependent numerical variable.
I hope you find this article useful. Stay tuned for the next article!