URL: https://towardsdatascience.com/correlation-between-happiness-internet-usage-and-mathematics-3f23e539b5fb/

Correlation between Happiness, Internet Usage, and Mathematics

A surprising correlation between Happiness, Internet Usage and Mathematics!

A step by step guide to differentiate between a spurious and a real correlation

Photo by Denise Jones on Unsplash

A correlation between a few bizarre things often piques our curiosity and makes us ask whether there is any causality behind it! To name a few correlations:

  1. A shortage of pirates and an increase in global warming.
  2. An increase in lemon imports and a decrease in highway fatalities in the US.
  3. Worldwide non-commercial space launches and sociology doctorates awarded.
  4. A good monsoon and an increase in the equity prices of tractor companies.
  5. An increase in the price of precious metals and a decrease in interest rates.
  6. An increase in the oil price and a decrease in tourism.

At first sight, one can see a distinction between the first three correlations and the last three. The first three are what one would call farcical correlations. They could be the result of sheer coincidence, reminding us of the old adage, "Correlation does not imply causation". However, the last three do seem to have a causal relationship behind them. For example, a good monsoon implies a good agricultural yield, which in turn implies an increase in the purchase of agriculture-related products such as tractors, pesticides, manure, etc.

Now, how can one determine whether two variables merely have a spurious relationship, as opposed to an actual cause-and-effect relationship? This question often requires human intuition along with subject expertise. In this article, we attempt to formalize this intuitive approach into a few more concrete steps using happiness data, internet usage per population, and mathematics proficiency among 15-year-old students.

Data

  1. The World Happiness Report 2020: This report covers 156 countries and encompasses factors contributing to happiness such as social support, healthy life expectancy, freedom to make life choices, etc.
  2. Internet Usage 2018: Internet user data for 216 countries, ranked by absolute number of users and by users per population. Here, we use the ranking by users per population.
  3. Pisa Worldwide Ranking 2018: It contains data on the scholastic performance of 15-year-old students in mathematics for 79 countries.

Data Visualization

The data above consists of continuous variables for a large number of countries, the smallest dataset covering 79. An effective way to understand this data is a choropleth plot, where the data is converted into categories. This is achieved via the mapclassify package in Python, which implements a family of classification schemes for choropleth maps.

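import matplotlib.pyplot as plt  # geopandas and mapclassify are also needed for scheme="User_Defined"

# Assumption: `merged` is a GeoDataFrame of country shapes joined with the Pisa scores,
# and the axis limits below were derived from its bounding box.
minx, miny, maxx, maxy = merged.total_bounds
plt.rc('legend', fontsize=20)  # set before plotting so that it applies to this legend

fig, ax = plt.subplots(1, figsize=(15, 10))
ax.axis('off')
ax.set_title('Pisa Mathematics Ranking', fontdict={'fontsize': '35', 'fontweight': '3'})
# Bin the average scores into user-defined classes for the choropleth.
merged.plot(column='Avg_score', cmap='RdBu', scheme="User_Defined",
            legend=True, edgecolor='0.8',
            classification_kwds=dict(bins=[300, 350, 400, 450, 500, 550, 600]),
            ax=ax, label='Ranking')
ax.set_xlim(minx + 10, maxx - 10)  # trim the map margins
ax.set_ylim(miny + 10, maxy - 10)
leg = ax.get_legend()
leg.set_bbox_to_anchor((0.85, 0.6, 0.2, 0.2))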
Image by author.
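
To make the binning step concrete, here is a minimal sketch of how user-defined bins turn a continuous score into a class index. It uses only the standard library rather than mapclassify, and it assumes each bin is identified by its inclusive upper bound; the helper name `classify` is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the classes, copied from the plotting call above.
BINS = [300, 350, 400, 450, 500, 550, 600]


def classify(score, bins=BINS):
    """Return the index of the first bin whose upper bound is >= score.

    Assumption: bins are upper-bound inclusive, so a score of exactly 300
    falls in the first class.
    """
    return bisect_left(bins, score)


# Scores quoted in the text: 300 (placeholder for countries without data),
# 325 (lowest real score), 569 (Singapore) and 591 (China).
for score in (300, 325, 569, 591):
    print(score, "-> class", classify(score))
```

With these bins, the placeholder score of 300 gets a class of its own, which is why countries without data appear in a separate colour on the map.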

In this choropleth map, the countries that were not part of the Pisa ranking are given an arbitrary score of 300, which is just below the lowest score of 325. It is evident that most African and Asian countries are not part of this ranking. A possible reason for this could be the price tag of the test, which is EUR 205,000 per participating country. Only two administrative regions, Macau and Hong Kong, were removed from this map (due to plotting difficulties), and are shown as part of China itself. China and Singapore are the only two countries that scored above 550: China scored a whopping 591, and Singapore scored 569. In general, developed Asian, European, and American countries performed better in this ranking. However, this is not indicative of any relationship between a country’s performance and its economic status, as the test is yet to percolate into developing nations in Africa and Asia.

Likewise, the choropleth map of world happiness index is shown below.

Image by author.

The colour scheme here is reversed in comparison to the above plot for better visualisation. Countries with no happiness index are given a value of zero. Clearly, this is a more extensive report, covering 156 countries in total. Finland, Denmark and Switzerland are the top three countries, in that order, and the top 10 is mostly occupied by the Nordic countries. In addition, there is an apparent positive correlation between economy and happiness in this map.

Similarly, a map representing percentage of internet users per population:

Image by author.

The above map suggests a correlation between a country’s level of development and its percentage of internet users per population. This map contains data for 216 countries, and countries with no data are represented with zero percent. In the following sections, we calculate the Pearson’s correlation between these three rankings.

Correlation

The Pearson’s correlation coefficient captures only the linear relationship between two variables. It measures the strength and direction of that relationship, but not its slope. It can also be zero for variables that are strongly but purely non-linearly related, e.g. points on a circle or on a parabola that is symmetric about the mean of x. These three properties are illustrated by the rows of the figure below, respectively.

By DenisBoigelot, original uploader was Imagecreator – CC0,
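
The same three properties can be checked numerically. The sketch below is a plain-Python implementation of the coefficient on made-up toy data (the helper `pearson` and all values are illustrative, not taken from the article's datasets): lines with different slopes still give a correlation of magnitude one, while a symmetric parabola gives zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of the deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-3, -2, -1, 0, 1, 2, 3]          # toy data, symmetric about zero
shallow = [0.5 * x for x in xs]        # gentle positive slope
steep = [5 * x for x in xs]            # steep positive slope
falling = [-2 * x + 1 for x in xs]     # negative slope
parabola = [x * x for x in xs]         # purely non-linear, symmetric

r_shallow = pearson(xs, shallow)
r_steep = pearson(xs, steep)
r_falling = pearson(xs, falling)
r_parabola = pearson(xs, parabola)

print(round(r_shallow, 6), round(r_steep, 6), round(r_falling, 6), round(r_parabola, 6))
```

The shallow and steep lines are indistinguishable to the coefficient, which is what "it does not measure the slope" means in practice.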

Advantages of Correlation:

  1. It is easy to compute between two variables.
  2. It helps in reducing the number of features required for modelling when two variables are highly correlated.
  3. Knowing the correlation helps mitigate risk, e.g. an increase in the correlation between different equities and tranches meant that a loss in one caused losses in the others, a scenario that led to the 2007 financial crisis!

Disadvantages of Correlation:

  1. It can give spurious correlations that do not have any cause-effect relationship.
  2. The correlation changes with non-linear transformations of the data (e.g. taking a logarithm). Hence, it needs to be recomputed on a continuous basis during data processing.
  3. One can never get the exact relationship between two variables using correlation alone.

Correlation between Happiness, Internet Usage and Mathematics

Fig. 1 (Image by author.)

There is a surprisingly high positive correlation between the percentage of internet users per population and the happiness index. In addition, there is also a high positive correlation between scholastic performance in mathematics and the percentage of internet users per population. There is a small positive correlation between mathematics and happiness.

In the above plot, the correlation is calculated after merging all three DataFrames (happiness index, Pisa ranking and internet usage) on the country names. Hence, the number of countries in the merged DataFrame is at most that of the smallest DataFrame, which is the Pisa DataFrame with 79 countries. The merged DataFrame here has 77 countries, the two exceptions being the administrative regions Macau and Hong Kong.
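
The effect of merging on country names can be sketched without pandas. The example below uses made-up tables keyed by placeholder country names ("A" to "F" and all values are hypothetical) and keeps only the names present in every table, which is what an inner merge does.

```python
# Toy stand-ins for the three tables, keyed by country name.
# All names and values are placeholders, not real data.
happiness = {"A": 7.1, "B": 6.3, "C": 5.0, "D": 4.2}
internet = {"A": 90, "B": 75, "C": 40, "D": 20, "E": 55}
pisa = {"A": 510, "B": 480, "F": 530}

# Merging all three keeps only the countries found in every table.
common = sorted(set(happiness) & set(internet) & set(pisa))
merged_rows = [(c, happiness[c], internet[c], pisa[c]) for c in common]

# Merging only two tables keeps more rows.
pairwise = sorted(set(happiness) & set(internet))

print(len(common), "countries in the three-way merge:", common)
print(len(pairwise), "countries in the happiness/internet merge:", pairwise)
```

Country "F" plays the role of an entry that exists in the smallest table but has no match elsewhere, so the three-way merge ends up smaller than the smallest table.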

To take advantage of the other two DataFrames, which have more data, we compute the pairwise correlations separately.

Fig. 2 (Image by author.)

Evidently, happiness and internet usage appear more strongly correlated with more data, as there are 150 countries in this merged DataFrame.

Similarly, the merged DataFrame of only mathematics and internet usage, with 78 countries, yields a correlation that is slightly lower than that of the DataFrame merging all three.

Fig. 3 (Image by author.)

One can see that correlation is quite sensitive to the number of data points. This is because the numerator of the Pearson’s correlation coefficient computes the sum of the products of the differences of x and y from their respective means (Gulp!). That is:

by GeeksforGeeks
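
For reference, the mean-centred form of the coefficient that matches the description above is given below, where x̄ and ȳ are the sample means and n is the number of data points. The original figure may show an algebraically equivalent form.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```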

Therefore, even a single data point (x, y) with large noise can contribute significantly, even after normalization by the denominator.
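
A small experiment illustrates this sensitivity. The sketch below uses made-up data (ten points on a perfect line plus one hypothetical outlier) and a plain-Python `pearson` helper written for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Ten toy points lying exactly on the line y = 2x.
xs = list(range(1, 11))
ys = [2 * x for x in xs]
r_clean = pearson(xs, ys)

# The same data with a single, very noisy extra point appended.
xs_noisy = xs + [11]
ys_noisy = ys + [-20]
r_noisy = pearson(xs_noisy, ys_noisy)

print("without outlier:", round(r_clean, 3))
print("with one outlier:", round(r_noisy, 3))
```

Here the single outlier is enough to pull the coefficient from one down to nearly zero, even though the other ten points are unchanged.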

The correlation between only mathematics and happiness shows minimal change compared to Fig. 1, and has a total of 77 countries.

Fig. 4 (Image by author.)

Is it a Spurious Correlation?

At first sight, there seems to be no apparent relationship between happiness & internet, or between mathematics & internet, the two combinations that have high positive correlations of 0.77 and 0.51, respectively. However, on deeper inspection one can see that:

  1. The number of internet users per population is higher among developed countries.
  2. The GDP of a country can certainly contribute towards the happiness of its people.
  3. In addition, mathematics proficiency is likely correlated with the GDP of the country, meaning a developed country can spare more resources for the quality education of its students.

All this might seem a wild conjecture at this point. However, there is an easy way to check that this is not a spurious correlation, and that the above reasoning is correct: plot the correlation of the country’s GDP along with happiness, internet and mathematics. If our reasoning is correct, it should show that happiness, internet and mathematics are far more correlated with GDP than with each other!

Using the logged GDP per capita from the happiness data, we get:

Fig. 5 (Image by author.)

Fig. 5 shows that the correlations of logged GDP per capita with happiness, mathematics and internet are higher than the others. This supports our conjecture that the economic development of a country is closely related to happiness, mathematics proficiency and internet usage, and hence establishes a sensible causal link between these apparently different features. Therefore, this is not a spurious correlation!
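
A complementary check, which is not part of the original analysis, is the partial correlation: the correlation between two variables after removing what each shares with a third. The sketch below uses tiny synthetic vectors (`driver`, `a` and `b` are hypothetical stand-ins for GDP and two outcomes) built so that the two outcomes are related only through the common driver.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def partial_correlation(xs, ys, zs):
    """First-order partial correlation of xs and ys, controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data: a common driver plus two noise terms that are
# uncorrelated with the driver and with each other by construction.
driver = [1, 1, -1, -1]
noise_a = [1, -1, 1, -1]
noise_b = [1, -1, -1, 1]
a = [d + e for d, e in zip(driver, noise_a)]
b = [d + e for d, e in zip(driver, noise_b)]

r_raw = pearson(a, b)
r_partial = partial_correlation(a, b, driver)

print("raw correlation of a and b:", round(r_raw, 3))
print("after controlling for the driver:", round(abs(r_partial), 3))
```

In this toy setting the raw correlation of 0.5 vanishes once the driver is controlled for. If GDP plays the same role for the real data, one would expect the partial correlations to be much smaller than the raw ones.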

Furthermore, as mentioned above, a transformation of the data can change the correlation. In particular, taking the natural log of GDP per capita can turn a non-linear (e.g. logarithmic or power-law) relationship into a roughly linear one, thus increasing the correlation coefficient! Therefore, to see the relationship with the untransformed data, we also compute the correlation using the raw GDP per capita.

Fig. 6 (Image by author.)

Evidently, the above plot with GDP per capita shows lower correlations compared to Fig. 5 with logged GDP per capita.
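
The effect of the log transform can be reproduced on synthetic numbers. In the sketch below, the `gdp` and `score` values are made up so that the score depends linearly on the logarithm of GDP. The correlation is then perfect against the logged values and noticeably lower against the raw ones.

```python
from math import log, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic GDP per capita values that double at each step.
gdp = [1000 * 2 ** k for k in range(8)]
# Synthetic outcome that grows linearly with log(GDP).
score = [10 * log(g) for g in gdp]

r_logged = pearson([log(g) for g in gdp], score)
r_raw = pearson(gdp, score)

print("correlation with log(GDP):", round(r_logged, 3))
print("correlation with raw GDP: ", round(r_raw, 3))
```

The underlying relationship is identical in both cases; only the scale on which GDP is measured has changed.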

Now, plotting only internet and GDP per capita to take advantage of the larger overlap between these two DataFrames, we get:

Fig. 7 (Image by author.)

Take home messages

  1. Correlations are important in feature engineering. One can use them to reduce the number of features if the features are highly correlated with each other.
  2. Pearson’s correlation coefficient cannot express a purely non-linear relationship, e.g. two variables related by the equation of a circle have zero correlation.
  3. Correlation is quite sensitive to the number of data points used. Even a few widely scattered data points will be reflected in the Pearson’s correlation coefficient.
  4. Happiness & Internet Usage and Mathematics & Internet Usage have significant positive correlations of 0.77 and 0.51, respectively (see Fig. 2 & Fig. 3)! The disparity between these correlations and those in Fig. 6 is due to more data in Fig. 2 and Fig. 3.
  5. The reason for this is an even higher positive correlation between internet usage and the GDP of the country, which is 0.81 (see Fig. 7). Hence, GDP establishes the cause-and-effect relationship behind the observed correlations of Happiness & Internet Usage and Mathematics & Internet Usage.

Refer here for the entire notebook along with the datasets.


Written By

Vasanth Bs
