Understanding PCA
We Discover How Principal Components Analysis Helps Us Uncover the Underlying Trends in Our Data
In data science and finance (and pretty much any quantitative discipline), we are always sifting through a lot of noise in search of signal. Now, if only there were an algorithm that could do that for us…
There is! Today we will explore how PCA (Principal Components Analysis) helps us uncover the underlying drivers hidden in our data – a super useful feature as it allows us to summarize huge feature sets using just a few principal components.
If you are interested in the code that I used to generate the charts below, you can find it on my GitHub here.
Variance is Both Our Enemy and Our Friend
If you have spent some time reading statistics or data science textbooks, you will notice that the main reason we go through all the trouble of building models is to explain variance.
But what does that really mean? Let’s unpack this step by step. First, what do we actually mean by variance? Imagine that our data looks like this:
You are thinking, "Tony, why are you showing me a flat line? This is so boring." And that is exactly the point. If our data were just a flat line, it would be very easy to predict (just predict five all the time) but also completely uninteresting. A flat line is an example of data with zero variance – there is absolutely no vertical variation in the data.
What is an example of zero variance data in real life? It sounds ridiculous, but let’s pretend your boss told you to predict the number of floors in a five-story building. So every day for 100 days you measure the number of floors of the building in question and at the end you get the chart above, a straight line. When your boss comes back and asks for your prediction, you say with confidence, "I predict that tomorrow the building will still be five floors tall!" Rocket science, right?
Barring disaster, there is a 100% chance that you will be right. But there was no point to what you did – data with no variance has no uncertainty, so there is nothing for us to predict or explain, it just is. And if we were trying to use variables with zero variance as X variables in a model (to predict a target variable with nonzero variance), we would become super frustrated by the absolute lack of signal in our feature set.
So what does data that has variance look like? Here are 100 days of daily price returns (as a percentage) for Apple stock:
Now we have something to work with. As expected, there’s a lot of variance in our Apple stock returns data. Variance measures how far the data spreads around its average, or more informally, how much it bounces up and down. The more bouncy it is, the harder it is to predict or explain. Apple’s daily returns bounce around significantly, going back and forth between positive and negative values, including some large spikes.
But amidst all that noise, bouncy data also contains signal (a.k.a. information).
Compared to data with no variance, bouncy data is both infinitely more interesting as the target variable of your model and infinitely more useful as a feature variable inside your model.
That’s why I say variance is both our enemy and our friend:
- It is our enemy because more variance in our target variable creates uncertainty and makes the target harder to predict or explain.
- It is also our friend though because, like we saw above, features with no variance are completely uninteresting and contain no signal at all. So to have a chance of building a good model, we need features with signal, or in other words features with variance (actually, more specifically, we want features that have nonzero covariances with our target).
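To make the two bullets above concrete, here is a minimal sketch using NumPy. The data is made up for illustration (it is not the Apple series from the chart), and the names `floors`, `returns`, and `target` are my own assumptions. It shows that a flat series has zero variance and zero covariance with any target, while a bouncy feature can co-move with the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-variance series: the five-story building measured for 100 days.
floors = np.full(100, 5.0)

# Synthetic "bouncy" series standing in for daily returns in percent
# (made-up numbers, not real Apple data).
returns = rng.normal(loc=0.0, scale=1.5, size=100)

var_floors = np.var(floors)    # exactly zero: nothing ever changes
var_returns = np.var(returns)  # positive: the series bounces around

# A hypothetical target that partly depends on the bouncy feature.
target = 0.8 * returns + rng.normal(scale=0.5, size=100)

# Covariance with the target is what makes a feature useful.
cov_useful = np.cov(returns, target)[0, 1]
cov_flat = np.cov(floors, target)[0, 1]

print(var_floors, var_returns)
print(cov_flat, cov_useful)
```

The flat feature cannot covary with anything, which is why it carries no signal no matter what the target looks like.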
Capturing Signal with Principal Components
We live in a world of too much information, not too little. The same holds true in data science – there is almost always a huge set of potential features we can use to make our prediction.
But as much as we would like to, we can’t use them all. We just don’t have enough observations – using 10,000 features to fit a model when we only have 5,000 observations of the target variable would be a terrible idea. We would end up with a massively overfit model that would break once we tried to run the model in the real world (on truly out of sample data).
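Here is a scaled-down sketch of that overfitting problem, assuming 200 features and only 50 observations instead of 10,000 and 5,000. Everything is random noise, so there is no real signal to learn. The helper `r_squared` and all variable names are illustrative assumptions, and plain least squares stands in for whatever model you would actually fit.

```python
import numpy as np

rng = np.random.default_rng(42)

n_train, n_test, n_features = 50, 50, 200

# Pure-noise features and a pure-noise target: nothing real to find.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Least-squares fit. With more features than observations, the fit can
# reproduce the training targets essentially exactly.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_in = r_squared(y_train, X_train @ coef)
r2_out = r_squared(y_test, X_test @ coef)

print("in-sample R^2:", r2_in)
print("out-of-sample R^2:", r2_out)
```

The in-sample fit looks perfect even though the features are noise, and the out-of-sample score collapses. That gap is the overfitting described above.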
But without domain expertise and modeling experience (and even with it sometimes), it can be hard to decide which features are truly worth keeping in our model. Now if only there were an algorithm that could transform our 10,000 features into an ideal set of features.
The Ideal Set of Features
If we could create from scratch an ideal set of features, what properties would this set have? I propose that it would have the following three properties:
- High Variance: features with a lot of variance contain a lot of potential signal – signal (a.k.a. useful information) is a basic requirement for building a good model.
- Uncorrelated: features that are highly correlated with each other are less useful and in certain cases downright harmful (when the correlation is so high as to cause multicollinearity). To see why this is so, pretend that you employ a roomful of talented stock traders. Would you prefer them to all invest in a similar manner or differently? If it were me, I would prefer them to be as different from each other as possible – this creates a diversifying effect where the traders protect each other from their errors and you almost never have a situation where everyone is simultaneously wrong.
- Not That Many: We want to have a low number of features relative to our number of target variable observations. Too many features relative to observations would result in an overfit model that performs poorly out of sample.
PCA (Principal Components Analysis) gives us our ideal set of features. It creates a set of principal components that are rank ordered by variance (the first component has higher variance than the second, the second has higher variance than the third, and so on), uncorrelated, and low in number (we can throw away the lower ranked components as they contain little signal).
How Does PCA Do That?
So how does it work such magic? Well there is either the linear algebra explanation or the intuitive one. We will opt for the intuitive one here, but if you would like to check out the math, this blog post is great.
PCA works its magic by repeatedly asking and answering the following questions:
- At the very start of the process, PCA asks what is the strongest underlying trend in the feature set (we will call this component 1)? We will visualize this multiple ways later, so don’t worry if this is unclear now.
- Next PCA asks what is the second strongest underlying trend in the feature set that also happens to be uncorrelated with component 1 (we will call it component 2)?
- Then PCA asks what is the third strongest underlying trend in the feature set that also happens to be uncorrelated with both components 1 and 2 (we will call it component 3)?
- And so on…
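For readers who want to see the questions above in action, here is a minimal sketch that uses one standard route to the principal components: an eigendecomposition of the feature covariance matrix with NumPy. The data is synthetic and the variable names are my own. This is not necessarily how any particular library implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic feature set: 500 observations of 5 correlated features
# (made-up data driven by 2 hidden trends plus noise).
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.3 * rng.normal(size=(500, 5))

# Center each feature, then compute the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder them to rank
# the components from strongest trend to weakest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each component is a weighted sum of the original (centered) features;
# the weights are the entries of the corresponding eigenvector.
scores = Xc @ eigvecs

component_variances = scores.var(axis=0, ddof=1)
component_corr = np.corrcoef(scores, rowvar=False)

print(component_variances)        # decreasing from component 1 onward
print(np.round(component_corr, 6))  # identity matrix up to rounding
```

The variances come out in descending order and the correlation matrix of the component scores is the identity, which matches the "strongest trend first, uncorrelated with the earlier ones" description.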
How does it find these underlying trends? If you have ever Googled PCA, you have probably seen something similar to the following picture:
In the picture, our data is the black dots. So what is the strongest underlying trend? We can approach this as if it were a linear regression problem – the strongest trend is the line of best fit (the blue line). Strictly speaking, PCA picks the line that minimizes the perpendicular distances from the points to the line, whereas ordinary linear regression minimizes the vertical distances, but the intuition carries over. So the blue line is component 1. You might ask why is the blue line component 1 and not the red line? Remember that component 1 is the principal component with the highest variance (since highest variance equates to highest potential signal).
The linear regression connection is useful because it helps us realize that each principal component is a linear combination of the individual features. So much like how a linear regression model is the weighted sum of our features that adheres most closely to our target variable, the principal components are also weighted sums of our features. Except in this case, they are the weighted sums that best express the underlying trends in our feature set.
Going back to our example, we can visually see that the blue line captures more variance than the red line because the distance between the blue ticked lines is longer than the distance between the red ticked lines. The distance between the ticked lines is an approximation of the variance captured by our principal component – the more the black dots, our data, vary along the principal component’s axis, the more variance it captures.
Now for component 2, we want to find the second strongest underlying trend with the added condition that it is uncorrelated with component 1. PCA enforces this by requiring the components to be orthogonal (a.k.a. perpendicular) to each other; for mean-centered data, series that are orthogonal to each other are uncorrelated.
Check out the plot to the left. I have plotted two features, one in blue and the second in red. As you can see, they are orthogonal to each other. All of the variation in the blue feature is horizontal and all the variation in the red one is vertical. Thus, as the blue feature changes (horizontally), the red feature stays completely constant as it can only change vertically.
Cool, so in order to find component 2, we just need to look for a component with as much variance as possible that is also orthogonal to component 1. Since our earlier PCA example was a very simple one with just two dimensions, we have only one option for component 2, the red line. In reality, we probably have tons of features so we would need to consider many dimensions when we search for our components but even then, the process is the same.
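The two-dimensional picture can be reproduced with a brute-force sketch. It assumes a made-up cloud of points and simply tries many directions, keeping the one along which the data varies the most. Real PCA implementations do not search this way, and the function `variance_along` is a hypothetical helper. The sweep just makes the "direction of highest variance" idea tangible.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic 2D cloud stretched along a diagonal (made-up data).
x = rng.normal(size=300)
y = 0.6 * x + 0.3 * rng.normal(size=300)
data = np.column_stack([x, y])
data = data - data.mean(axis=0)  # center the cloud


def variance_along(angle):
    """Variance of the data projected onto the unit vector at this angle."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    return np.var(data @ direction, ddof=1)


# Sweep candidate directions over half a circle and keep the one with
# the most variance: that direction plays the role of the blue line.
angles = np.linspace(0.0, np.pi, 1801)
variances = np.array([variance_along(a) for a in angles])
best_angle = angles[np.argmax(variances)]

# In two dimensions the only orthogonal choice left for component 2
# is a 90 degree turn (the red line).
pc1 = np.array([np.cos(best_angle), np.sin(best_angle)])
pc2 = np.array([-np.sin(best_angle), np.cos(best_angle)])

scores1 = data @ pc1
scores2 = data @ pc2
corr_12 = np.corrcoef(scores1, scores2)[0, 1]

print("variance along component 1:", scores1.var(ddof=1))
print("variance along component 2:", scores2.var(ddof=1))
print("correlation between them:", corr_12)
```

The correlation is only approximately zero here because the sweep uses a finite grid of angles; the finer the grid, the closer it gets.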
An Example to Tie it All Together
Let’s go back to our earlier example with stocks. Instead of just Apple stock, I’ve downloaded data for 30 different stocks representing multiple industries. If we plot all their daily returns (for 100 days, same as above), we get the following mess of a chart:
Every stock is sort of doing its own thing and there is not much to glean from this chart besides that daily stock returns are noisy and volatile. Let’s use scikit-learn to calculate principal component 1 and then plot it (PCA is sensitive to the relative scale of your features – since all my features are daily stock returns I did not scale the data but in practice, you should consider using StandardScaler or MinMaxScaler). The black line in the figure below is component 1:
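The note on scaling is easy to demonstrate. The sketch below is not the code behind the charts in this post. It uses made-up features on very different scales to show how an unscaled large feature can hijack component 1, and how `StandardScaler` changes the picture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Made-up features: two small-scale features that move together and one
# unrelated feature measured on a much larger scale.
shared = rng.normal(size=400)
small_a = shared + 0.2 * rng.normal(size=400)
small_b = shared + 0.2 * rng.normal(size=400)
large = 1000.0 * rng.normal(size=400)
X = np.column_stack([small_a, small_b, large])

# Without scaling, the large-scale feature dominates component 1.
raw_pca = PCA(n_components=1).fit(X)
raw_weights = np.abs(raw_pca.components_[0])

# After standardizing, component 1 picks up the shared movement instead.
X_scaled = StandardScaler().fit_transform(X)
scaled_pca = PCA(n_components=1).fit(X_scaled)
scaled_weights = np.abs(scaled_pca.components_[0])

print("weights without scaling:", np.round(raw_weights, 3))
print("weights with scaling:   ", np.round(scaled_weights, 3))
```

When all features share the same units and similar magnitudes, as with daily returns, skipping the scaler is a reasonable choice. Otherwise the component weights mostly reflect units rather than trends.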
So the black line represents the strongest underlying trend in our stock returns. "What is it, though?" you ask. Good question and unfortunately, without some domain expertise, we don’t know. This loss of interpretation is the key drawback of using something like PCA to reduce our much larger feature set into a smaller set of key underlying drivers. Unless we are lucky or just plain experts in the data, we would not know what each of the PCA components means.
In this case, I would guess that component 1 is the S&P 500 – the strongest underlying trend in all our stock returns data is probably the overall market, whose ebbs and flows impact the prices of each individual stock. Let’s check this by plotting the S&P 500’s daily returns against component 1 (below). Almost a perfect fit considering how noisy the data is! The correlation between the S&P 500’s daily returns and principal component 1 is 0.92.
So like we guessed, the most important underlying trend in our stock data is the stock market. The scikit-learn implementation of PCA also tells us how much variance each component explains – component 1 explains 38% of the total variance in our feature set.
Let’s take a look at another principal component. Below, I have plotted components 1 (in black) and 3 (in green). As expected, they have a low correlation with each other (0.08). Unlike component 1, component 3 only explains 9% of the variance in our feature set, much lower than component 1’s 38%. And unfortunately, I have no idea what component 3 represents – this is where PCA’s lack of interpretation comes to bite us.
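The checks described in this section can be sketched with scikit-learn on synthetic data. The returns below are simulated from one shared "market" factor plus stock-specific noise, so the numbers will not match the 0.92, 38%, or 9% reported above, and the names `market` and `betas` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
n_days, n_stocks = 100, 30

# Synthetic daily returns in percent: one shared market factor plus
# stock-specific noise. Made-up data, not real prices.
market = rng.normal(scale=1.0, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = np.outer(market, betas) + rng.normal(scale=1.0, size=(n_days, n_stocks))

pca = PCA(n_components=3)
components = pca.fit_transform(returns)  # shape (n_days, 3), one column per component

# The sign of a principal component is arbitrary, so compare the
# absolute correlation with the market factor.
corr_with_market = abs(np.corrcoef(components[:, 0], market)[0, 1])

# Share of total variance explained by each component, largest first.
explained = pca.explained_variance_ratio_

# Correlation between the scores of component 1 and component 3.
corr_1_vs_3 = np.corrcoef(components[:, 0], components[:, 2])[0, 1]

print("correlation of component 1 with market:", corr_with_market)
print("explained variance ratios:", explained)
print("correlation of components 1 and 3:", corr_1_vs_3)
```

In this sketch the component scores are uncorrelated up to floating point error on the data used for fitting. A small nonzero value can show up if the components are plotted or compared over a different sample or after further processing.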
Conclusion
In this post we saw how PCA can help us uncover the underlying trends in our data – a super useful ability in today’s big data world. PCA is great because:
- It isolates the potential signal in our feature set so that we can use it in our model.
- It reduces a large number of features into a smaller set of key underlying trends.
However, the drawback is that when we run our features through PCA, we lose a lot of interpretability. Without domain expertise and a lot of guessing, we probably wouldn’t know what any of the components beyond the top one or two represents.
But generally this is not a deal breaker. If you are convinced that there is ample signal in your large set of features, then PCA remains a useful algorithm that allows you to extract most of that signal to use in your model without having to stress about overfitting.