
URL: https://towardsdatascience.com/phik-k-get-familiar-with-the-latest-correlation-coefficient-9ba0032b37e7/



Phik (𝜙k) – get familiar with the latest correlation coefficient

That is also consistent between categorical, ordinal, and interval variables!

Photo by Ganapathy Kumar on Unsplash

Hands-on Tutorials

Recently, I was doing EDA with pandas-profiling and something piqued my interest. In the correlations tab, I saw many metrics I have known since university: Pearson’s r, Spearman’s ρ, and so on. However, among them I noticed something new: Phik (𝜙k). I had not heard of this metric before, so I decided to dive a bit deeper into it.

Image by author

Fortunately, the report generated by pandas-profiling also has an option to display some more details about the metrics. The following information was provided about Phik:

Phik (𝜙k) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution.

I must say, this sounds really useful, especially when working with datasets containing a mixture of numeric and categorical features. In this article, you will find a quick summary of what the new correlation metric does and how to use it in practice with the phik library.

A short introduction to 𝜙k

In many fields (not only data science), Pearson’s correlation coefficient is the standard approach to measuring the correlation between two variables. However, it has some drawbacks:

  • it works only with continuous variables,
  • it only accounts for a linear relationship between variables,
  • it is sensitive to outliers.
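The second drawback is easy to demonstrate with a few lines of plain Python. The sketch below uses a hand-written pearson_r helper (my own illustrative function, not part of any library) on toy data: a perfectly linear relationship and a perfectly deterministic but symmetric parabola.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = list(range(-5, 6))          # toy data: -5, -4, ..., 5
linear = [2 * x + 1 for x in xs]  # y depends linearly on x
parabola = [x ** 2 for x in xs]   # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_parabola = pearson_r(xs, parabola)

print(f"linear:   r = {r_linear:.2f}")
print(f"parabola: r = {r_parabola:.2f}")
```

Even though the parabola is fully determined by x, Pearson’s r comes out at (approximately) zero, because the positive and negative halves cancel each other out.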

That is where 𝜙k comes into play and offers several improvements over the current go-to measure. The key differentiators of Phik are:

  • it is based on several refinements to Pearson’s χ² (chi-squared) contingency test – a hypothesis test for independence between two (or more) variables,
  • it works consistently between categorical, ordinal, and interval (continuous) variables,
  • it captures non-linear dependencies,
  • it reverts to Pearson’s correlation coefficient in the case of a bivariate normal distribution of the input,
  • the algorithm contains a built-in noise reduction technique against statistical fluctuations.

The most similar metric to 𝜙k is Cramér’s 𝜙 (also known as Cramér’s V), which is a correlation coefficient meant for two categorical variables and is also based on Pearson’s χ² test statistic. Importantly, even though it is designed for categorical variables, it can also be used for ordinal and binned interval variables. However, the value of the coefficient depends heavily on the binning chosen per variable and can therefore be difficult to interpret and compare. That is not the case for 𝜙k. Additionally, Cramér’s 𝜙 is sensitive to outliers, especially for smaller sample sizes.
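To make the link to the χ² statistic concrete, here is a minimal sketch of Cramér’s V (another name for Cramér’s 𝜙) computed by hand from a contingency table, using the standard formula V = sqrt(χ² / (n · min(r − 1, c − 1))). The two tables are toy counts I made up for illustration, and this is plain Cramér’s V, not 𝜙k itself.

```python
from math import sqrt


def chi_squared(table):
    """Pearson's chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramér's V: the chi-squared statistic rescaled to the range [0, 1]."""
    chi2, n = chi_squared(table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Toy counts, made up for illustration.
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

v_dependent = cramers_v(dependent)
v_independent = cramers_v(independent)

print(v_dependent)    # 0.5
print(v_independent)  # 0.0
```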

In the figure below, you can see a few comparisons of three selected correlation metrics. We can see that 𝜙k does a good job of detecting non-linear patterns missed by the other coefficients. As you can see, the values of 𝜙k lie between 0 and 1, so there is no indication of the direction of the relationship.

Source

Naturally, there are also some drawbacks of the new method:

  • the calculation of 𝜙k is computationally expensive (due to calculation of some integrals under the hood),
  • no closed-form formula,
  • no indication of direction,
  • when working with numeric-only variables, other correlation coefficients will be more precise, especially for small samples.

In this article, I do not want to go into much detail about how 𝜙k is actually calculated. The main reason is that the process is more complex than explaining a single formula, as is the case with Pearson’s r. That is why I prefer to focus on the practical part ahead, and refer all interested readers to the source paper (which is written in a nice and easy-to-understand way).

Hands-on example in Python

For a while, I was wondering what would be a good dataset for testing out the new correlation coefficient. And inspiration came unexpectedly while browsing some video game news – a dataset containing all the Pokémon will be perfect for the analysis, as it combines categorical and numerical features. There are no ordinal features in the dataset, but that will not be a problem for presenting how to work with phik. You can find the data here.

Photo by Don H on Unsplash

As always, the first step is to load the libraries.

Then, we load and prepare the data. We only keep the relevant columns (battle statistics, generation, type and boolean flags indicating whether a Pokémon is legendary or not), as many of the other ones are related to evolutions and other forms.

Image by author

Now we are ready for exploring the data using the 𝜙k coefficient.

𝜙k correlation matrix

Getting the correlation matrix containing the pair-wise 𝜙k coefficients is as easy as calling the phik_matrix method on the DataFrame. By default, the library drops the NaNs from the data before calculating the correlation coefficient. Additionally, we round the results to two decimals for improved readability.

phik_overview = df.phik_matrix()
phik_overview.round(2)
Image by author

When we do not provide a list of the interval columns as an argument, the library selects them based on an educated guess. In this case, is_legendary and the related flag columns are not interval columns, so we create an appropriate list and pass it as an argument.

Also, we can manually specify the bins we would like to use for the interval variables. We do not do so here, so the bins are determined automatically.

To make the analysis of the table easier, we can use the plot_correlation_matrix function to plot the results as a heatmap.

Image by author

We can see that there is some correlation between variables such as defense and hp, or hp and attack. What is more, we can see that there is no correlation between special attack and generation.

Significance of the correlations

When assessing correlations, we should look not only at the coefficients but also at their statistical significance, because a large correlation may be statistically insignificant, and vice versa.

Image by author

The heatmap above presents the significance matrix. The color scale indicates the level of significance and it saturates at +/- 5 standard deviations. The relatively high values of the correlation coefficient for the battle stats we mentioned above are statistically significant, while the correlation of special attack versus generation is not.

For more details on how to calculate the statistical significance and what corrections to the "standard" p-value calculation are taken into account, please refer to the original paper.

Global correlation

The global correlation coefficient is a useful measure expressing the total correlation of one variable to all other variables in the dataset. This gives us an indication of how well one variable can be modeled using the other variables.

Image by author

All variables have pretty high values of the global correlation metric, with the highest one going to defense. We have seen before that there was a strong correlation between defense and some other battle stats, hence the highest score here.

Outlier significance

While Pearson’s correlation between two continuous variables is easy to interpret, that is not the case for 𝜙k between two variables of mixed types, especially when categorical variables are involved. That is why the authors provided additional functionality to look at the outliers: excesses and deficits relative to the expected frequencies in the contingency table of two variables.

We first take a look at a continuous vs. categorical feature. For that example, we selected secondary_type and defense.

Running the code generates the following heatmap:

Image by author

One conclusion we can draw from the plot above is that rock and steel Pokémon (as secondary type) have significantly higher defense, while the opposite is true for the poison, fairy, and flying ones.

Then, we do a similar analysis for two categorical variables – primary and secondary types.
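The excesses and deficits in such a table are measured against the frequencies we would expect if the two variables were independent. The sketch below computes those expected counts in plain Python for a tiny table of made-up counts (not the real Pokémon numbers). Note that phik goes a step further and converts each difference into a significance (Z) value, which this simplification does not do.

```python
# Toy contingency table with made-up counts (rows: primary type,
# columns: secondary type). These are NOT the real Pokémon numbers.
primary_types = ["normal", "grass"]
secondary_types = ["flying", "poison"]
observed = [
    [25, 5],   # normal-flying, normal-poison
    [5, 15],   # grass-flying,  grass-poison
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count per cell under independence: row total * column total / n.
expected = [
    [row_total * col_total / n for col_total in col_totals]
    for row_total in row_totals
]

for i, primary in enumerate(primary_types):
    for j, secondary in enumerate(secondary_types):
        diff = observed[i][j] - expected[i][j]
        kind = "excess" if diff > 0 else "deficit"
        print(f"{primary}-{secondary}: observed {observed[i][j]}, "
              f"expected {expected[i][j]:.0f} ({kind} of {abs(diff):.0f})")
```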

Image by author

Analyzing this table indicates that there are significantly more normal-flying and grass-poison Pokémon than expected, and significantly fewer normal-poison and dragon-bug ones. From my knowledge of Pokémon, that is indeed the case, and the conclusions from this table reflect the actual types of Pokémon in the games.

Correlation report

Above, we have seen four different things we can investigate with the phik library. There is also a convenience function that allows us to generate all of the above with a single line of code.

The function generates a report by pairwise evaluation of all correlations, their significances and outlier significances.

Note: The number of plots can easily explode for a larger dataset. That is why we can tune the correlation and significance thresholds to only plot the relevant variables.

Takeaways

  • 𝜙k is a new correlation coefficient, especially suitable for working with mixed-type variables.
  • Using the coefficient, we can find variable pairs that have (un)expected correlations and evaluate their statistical significance. We can also interpret the dependencies between each pair of variables.

You can find the code used for this article on my GitHub. Also, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.




Written By

Eryk Lewinson
