The Difference Between Correlation and Regression
A clarifying article on these topics
When approaching Machine Learning, there are lots of topics to understand before putting your fingers on the keyboard and starting to program. These topics are not only about the available algorithms; many of them are math concepts (but, as I have said before, if you do not know the math, you can learn it when you need it).
When approaching Data Science and Machine Learning there are topics in statistics to understand; a couple of these topics are correlation and regression. In this article, I will explain the difference between these two topics with examples, and we’ll even cover the evergreen: "correlation is not causation!".
1. Correlation
Correlation is a statistical measure that expresses the linear relation between two variables.
It is as simple as that. But, you know, definitions have to be put into practice (also to better understand the topics we are studying).
Going deeper, we can say that two variables are correlated if each value of the first variable corresponds to a value of the second variable, following a certain regularity (or, if you want, a certain path); if the two variables are highly correlated, that path is close to a line, since correlation describes the linear relation between the variables.
This means that correlation expresses a relation between variables, not a cause-effect relationship! If the independent variable increases in value and the dependent variable also increases, it does not mean that the first variable causes the increase in the second variable!
Let’s look at an example:
It’s summer, and it’s hot; you don’t like the high temperatures of your city, so you decide to go to the mountains. Luckily for you, when you get to a mountain top and measure the temperature, you find it’s lower than in your city. You get a little curious (also because you are not satisfied with the decrease in temperature) and you decide to go to a higher mountain, finding that the temperature is even lower than on the previous one.
You try mountains with different heights, measure the temperature and plot a graph; you find that as the height of the mountain increases, the temperature decreases, and you can see a linear trend. What does it mean? It means that the temperature is related to the height; it doesn’t mean the height of the mountain caused the decrease in temperature (what temperature would you measure if you got to the same height, at the same latitude, with a hot air balloon? 🙂 ).
So, since in the physical world we need definitions to measure things, a good question would be: how do we measure correlation?
The typical method to measure correlation is to use the correlation coefficient (also known as the Pearson correlation coefficient or the linear correlation coefficient). I don’t want to go into the math, since the purpose of this article is to be informative and educational without formulas: I just want you to grasp the concepts.
The correlation coefficient exploits the statistical concept of covariance, which is a numerical way to define how two variables vary together. Leaving the math and just talking about the concepts, the correlation coefficient is a numerical value that varies between -1 and +1. If the correlation coefficient is -1, the two variables will have a perfect negative linear correlation; if the correlation coefficient is +1, the two variables will have a perfect positive linear correlation; if the correlation coefficient is 0, it means that there is no linear correlation between the two variables.
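To make the link with covariance concrete, here is a minimal sketch, assuming NumPy is available. The helper name `pearson_r` and the toy data are mine, made up for illustration only; the function divides the covariance of the two variables by the product of their standard deviations.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson coefficient: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sample covariance between x and y (off-diagonal entry of the 2x2 covariance matrix)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Hypothetical toy data, not taken from the article
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos = pearson_r(x, 2 * x + 1)    # points on an increasing line
r_neg = pearson_r(x, -3 * x + 7)   # points on a decreasing line
print(r_pos, r_neg)
```

Up to floating-point rounding, the two printed values are +1 and -1, the two extremes described above. In practice you would probably call `np.corrcoef` or `DataFrame.corr()` instead of writing this by hand.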
I said we would leave the math on its own, but I didn’t say we would leave the code, since we are in Data Science. So, how do we calculate the correlation coefficient in Python? Well, we generally calculate the correlation matrix. Supposing we have two variables, "Variable_1" and "Variable_2", stored in a data frame called "df", we can plot the correlation matrix, for example with seaborn:
import seaborn as sns
# heat map of the correlation matrix of df, annotated with two decimal places
sns.heatmap(df.corr(), annot=True, fmt=".2f")
And we get:
The above image shows that the two variables we have taken into account are highly correlated, since their correlation coefficient is 0.96. We then expect to describe their relation graphically with a line with a positive slope. And here we come to the next concept: regression.
But before moving on to the concept of regression, I want to say one last thing. Throughout this article, I’ve stressed the fact that correlation has to do with a linear relationship between the variables. Let’s take two variables we know for sure are not linearly related; a parabola, for example:
In this case, if we calculate the correlation coefficient we get 0:
The fact that the variables are uncorrelated just tells us that there is no line that can describe the relationship between them: it doesn’t mean that the variables are not related at all! It just means that the relationship is not linear (and can be anything!).
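The parabola case can be checked in a few lines. This is a minimal sketch, assuming NumPy; the data are hypothetical (x values placed symmetrically around zero, y exactly equal to x squared), so y is completely determined by x and yet the linear correlation vanishes.

```python
import numpy as np

# Hypothetical data: x symmetric around zero, y an exact parabola of x
x = np.linspace(-5, 5, 101)
y = x ** 2

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient
r = np.corrcoef(x, y)[0, 1]
print(r)
```

The printed value is zero up to floating-point rounding, even though knowing x tells you y exactly.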
2. Regression
Regression analysis is a mathematical technique used to analyze data consisting of a dependent variable and one (or more) independent variables, with the aim of finding a possible functional relationship between the dependent variable and the independent ones.
The aim of regression analysis is to find an estimate (a good one!) of the relationship between the dependent and the independent variable(s). Mathematically speaking, the aim of the regression is to find the curve that best fits the data.
Of course, the curve that best fits the data can be a line; but it can be any curve, depending on the relationship!
So, what we usually do is calculate the correlation coefficient, and if its value is near +1 or -1 we can expect a line when studying the regression; otherwise…we have to try polynomial regression (or something else, like an exponential curve)!
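As a rough illustration of this workflow, here is a minimal sketch, assuming NumPy. The data are a hypothetical noisy parabola generated on the spot, and the helper `r_squared` (the coefficient of determination, which this article does not otherwise cover) is my own addition, used only to compare how well a line and a second-degree polynomial fit the same points.

```python
import numpy as np

# Hypothetical noisy parabola, generated for illustration only
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 101)
y = x ** 2 + rng.normal(scale=1.0, size=x.size)

def r_squared(y_true, y_pred):
    """Share of the variance of y_true explained by the fitted curve."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Least-squares fits of degree 1 (a line) and degree 2 (a parabola)
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)
print(r2_linear, r2_quadratic)
```

On data like these the line explains almost none of the variation, while the second-degree polynomial explains most of it, which is the signal to move away from a straight line.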
In fact, if we calculate the regression line for the data seen before ("Variable_1" and "Variable_2", with a 0.96 correlation coefficient) we get:
import seaborn as sns
import matplotlib.pyplot as plt
# scatter plot of the two variables with a red regression line
sns.regplot(data=df, x="Variable_1", y="Variable_2", line_kws={"color": "red"})
plt.xlabel('Variable 1', size=14)
plt.ylabel('Variable 2', size=14)
plt.title('(LINEAR) REGRESSION BETWEEN TWO VARIABLES')
As expected, since the correlation coefficient is 0.96, we get a line with a positive slope as the curve that best fits the data.
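The sign of the slope follows the sign of the correlation coefficient because, for a simple linear regression, the fitted slope equals the correlation coefficient multiplied by the ratio of the two standard deviations. Here is a minimal sketch of that check, assuming NumPy; the data are hypothetical and are not the "df" used above.

```python
import numpy as np

# Hypothetical data with a strong positive linear trend (illustration only)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Slope rebuilt from the correlation coefficient and the standard deviations
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# Slope and intercept of the least-squares line, for comparison
slope_fit, intercept_fit = np.polyfit(x, y, deg=1)
print(r, slope_from_r, slope_fit)
```

The two slopes agree up to rounding, so a positive coefficient always goes with a rising regression line and a negative one with a falling line.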
Finally, I want to say that there are numerous techniques to find the curve that best fits the data; one of the most used is the "Ordinary Least Squares" method, but, as I said, I’m not going into the math: just trust me, because the aim of this article is to explain the concepts.
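For the curious, this is roughly what Ordinary Least Squares does for a single independent variable. It is a minimal sketch, assuming NumPy; the function name `ols_line` and the toy data are hypothetical. It uses the standard closed-form solution that minimizes the sum of squared vertical distances between the points and the line.

```python
import numpy as np

def ols_line(x, y):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form OLS solution for one independent variable
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
slope, intercept = ols_line(x, y)
print(slope, intercept)
```

Because the points lie exactly on a line, the function recovers a slope of 2 and an intercept of 1. With real, noisy data it returns the line that is closest to the points in the least-squares sense.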
Thanks for reading. I hope you have understood the concepts if you didn’t know them (if not, let me know with a comment!); and if you did know them…I hope I didn’t make any mistakes!