Independent Component Analysis (ICA)
Finding hidden factors in data
This is the second and final post in a two-part series on Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Although the two techniques look similar, they take different approaches and perform different tasks. In this post, I will provide a high-level introduction to ICA, compare it to PCA, and give an example of using ICA to remove blink artifacts from EEG data.
ICA
The standard problem used to describe ICA is the "Cocktail Party Problem". In its simplest form, imagine two people having a conversation at a cocktail party (like the red and blue speakers above). For whatever reason, you have two microphones placed near the two party-goers (like the purple and pink microphones above). Both voices are heard by both microphones, at different volumes depending on the distance between each person and each microphone. In other words, we record two files, each containing audio from the two party-goers mixed together. The problem then is, how can we separate the two voices in each file to obtain isolated recordings of each speaker?
This problem can be solved with Independent Component Analysis (ICA), which transforms a set of vectors into a maximally independent set. Returning to our "Cocktail Party Problem," ICA will convert the two mixed audio recordings (represented by purple and pink waveforms below) into two unmixed recordings of each individual speaker (represented by blue and red waveforms below). Notice that the number of inputs and the number of outputs are the same. Because the outputs are mutually independent and come with no natural ranking, there is no obvious way to drop components as in Principal Component Analysis (PCA).
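To make the setup concrete, here is a minimal Python sketch of the mixing step. The two source signals, the mixing matrix A, and all variable names are made up for illustration; they are not taken from the original audio example. With A known, unmixing is just a matrix inversion. ICA has to estimate that inverse from the recordings alone.

```python
import numpy as np

n_samples = 2000
t = np.linspace(0, 8, n_samples)

# Two hypothetical "speakers": a sine wave and a square wave.
s1 = np.sin(2 * np.pi * 1.0 * t)
s2 = np.sign(np.sin(2 * np.pi * 0.7 * t))
S = np.column_stack([s1, s2])            # shape (n_samples, 2)

# Hypothetical mixing matrix: row i holds the volume at which
# microphone i picks up speaker 1 and speaker 2.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T                              # the two microphone recordings

# If A were known, recovering the speakers would be a plain inversion.
S_recovered = X @ np.linalg.inv(A).T
print(np.allclose(S_recovered, S))       # True
```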
How it works
There are two key assumptions made in ICA. The hidden independent components we are trying to uncover must be (1) statistically independent and (2) non-Gaussian. By independent, I mean that information about x gives you no information about y, and vice versa. Mathematically, this translates to
p(x, y) = p(x) p(y)
where p(x) represents the probability distribution of x and p(x, y) represents the joint distribution of x and y. The non-Gaussian assumption simply means the independent components have distributions that are not Gaussian, meaning they do not look like a bell curve.
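The independence condition, that the joint distribution factorizes into the product of the two marginals, can be checked empirically. The sketch below is an illustration in Python with made-up binary variables: it compares the joint frequency of each outcome with the product of the marginal frequencies. The helper name max_factorization_gap is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 2, size=n)          # fair coin
y_indep = rng.integers(0, 2, size=n)    # a second, unrelated coin
y_dep = x.copy()                        # fully determined by x

def max_factorization_gap(x, y):
    """Largest |p(x, y) - p(x) p(y)| over the four possible outcomes."""
    gap = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_joint = np.mean((x == a) & (y == b))
            p_product = np.mean(x == a) * np.mean(y == b)
            gap = max(gap, abs(p_joint - p_product))
    return gap

gap_indep = max_factorization_gap(x, y_indep)   # close to 0
gap_dep = max_factorization_gap(x, y_dep)       # close to 0.25
print(gap_indep, gap_dep)
```

For the independent pair the gap is only sampling noise, while for the dependent pair knowing x tells you y exactly and the factorization fails.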
The first assumption is the starting point of ICA. We want to disentangle information to derive a set of independent factors. If there are not multiple independent generators of information to uncover, there really isn’t a need for ICA. For example, imagine using ICA for the "Cocktail Party Problem", but with only one partygoer, what one could call the COVID birthday party problem. It wouldn’t make much sense.
The need for the second assumption lies in the mathematics. ICA uses the idea of non-Gaussianity to uncover independent components. Non-Gaussianity quantifies how far the distribution of a random variable is from being Gaussian. Example measures of non-Gaussianity are kurtosis and negentropy. Such a measure is helpful because of a consequence of the Central Limit Theorem: the sum of two independent random variables tends to have a distribution that is closer to Gaussian than either of the original variables. ICA combines this idea, non-Gaussianity measures, and the non-Gaussian assumption to uncover independent components hidden in data.
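A quick numerical illustration of this point, as a Python sketch with synthetic data: excess kurtosis is zero for a Gaussian, and the sum of two independent uniform variables sits closer to zero than either variable alone. The function name excess_kurtosis is my own choice.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3, so a Gaussian scores 0."""
    v = np.asarray(v, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(2)
n = 200_000
u1 = rng.uniform(-1, 1, n)
u2 = rng.uniform(-1, 1, n)
g = rng.normal(size=n)

k_single = excess_kurtosis(u1)       # theory: -1.2 for a uniform variable
k_sum = excess_kurtosis(u1 + u2)     # theory: -0.6 for the sum of two
k_gauss = excess_kurtosis(g)         # theory: 0 for a Gaussian
print(k_single, k_sum, k_gauss)
```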
To illustrate this, consider a dataset with two variables x_1 and x_2. These variables serve as a basis that defines a space, i.e. we can use them to plot points in 2 dimensions. Suppose we know the two independent components underlying the data, s_1 and s_2. These two components serve as an alternative basis to describe the same space. Therefore, any point y in this space can be written as a linear combination of either the variables x_1 and x_2 or the components s_1 and s_2:
y = w_1 x_1 + w_2 x_2 = a_1 s_1 + a_2 s_2
Going back to the Central Limit Theorem, the distribution of the sum of two independent random variables will be more Gaussian than either individual variable. Thus, when the coefficients a_1 and a_2 on s_1 and s_2 are both non-zero, the distribution of y will be more Gaussian than either s_1 or s_2. Conversely, if exactly one of a_1 or a_2 is zero, then the distribution of y will be less Gaussian than in the former case. And, if the non-Gaussian assumption of s_1 and s_2 holds, it will not be Gaussian at all, since y will be exactly proportional to one of the independent components!
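The effect of the two coefficients can be seen by sweeping them. In the Python sketch below (synthetic unit-variance uniform sources, my own choice), the coefficients are set to a_1 = cos(theta) and a_2 = sin(theta) so that y keeps unit variance, and the absolute excess kurtosis of y is computed at each angle.

```python
import numpy as np

def excess_kurtosis(v):
    z = (v - v.mean()) / v.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(3)
n = 200_000
half_width = np.sqrt(3.0)   # uniform on [-sqrt(3), sqrt(3)] has unit variance
s1 = rng.uniform(-half_width, half_width, n)
s2 = rng.uniform(-half_width, half_width, n)

thetas = np.linspace(0, np.pi / 2, 91)   # one-degree steps from 0 to 90
abs_kurt = np.array([
    abs(excess_kurtosis(np.cos(th) * s1 + np.sin(th) * s2)) for th in thetas
])

# theta = 0 and theta = 90 degrees give y = s1 and y = s2 respectively;
# theta = 45 degrees is an equal mix of both.
print(abs_kurt[0], abs_kurt[45], abs_kurt[90])
```

Non-Gaussianity is largest at the two ends of the sweep, where y is a single source, and smallest near the equal mix in the middle.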
In other words, the non-Gaussianity of y is maximized when it is directly proportional to one of the independent components. This allows us to frame ICA as an optimization problem. For example,
maximize |kurt(w_1 x_1 + w_2 x_2)| over w_1 and w_2
where we want to find the values of w_1 and w_2 that maximize the kurtosis (in absolute value, since kurtosis can be negative) of a linear combination of our known input variables, with the combination held at a fixed variance so the weights cannot simply grow. These optimal values of w_1 and w_2 will define an independent component.
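For two variables, this optimization can be sketched with a brute-force search. The Python code below is a toy illustration under my own assumptions, not the algorithm used by any particular ICA library: it whitens the data (which fixes the variance of every unit-length combination) and then grid-searches the rotation angle whose two directions have the largest total absolute kurtosis. The sources, mixing matrix, and function names are hypothetical.

```python
import numpy as np

def excess_kurtosis(v):
    z = (v - v.mean()) / v.std()
    return np.mean(z**4) - 3.0

def whiten(X):
    """Center X and rescale it so its covariance is the identity."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ eigvecs) / np.sqrt(eigvals)

def ica_2d(X, n_angles=360):
    """Grid-search the rotation of whitened 2-D data that maximizes |kurtosis|."""
    Z = whiten(X)
    best_R, best_score = None, -np.inf
    # Rotations 90 degrees apart only swap or flip the components,
    # so searching [0, 90) degrees is enough.
    for theta in np.linspace(0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, s], [-s, c]])   # rows are the two candidate weight vectors
        Y = Z @ R.T
        score = abs(excess_kurtosis(Y[:, 0])) + abs(excess_kurtosis(Y[:, 1]))
        if score > best_score:
            best_R, best_score = R, score
    return Z @ best_R.T

# Hypothetical non-Gaussian sources and mixing matrix.
rng = np.random.default_rng(4)
n = 20_000
S = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

S_est = ica_2d(X)
# Components come back in arbitrary order, sign, and scale, so compare
# them to the true sources by absolute correlation.
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```

Each row of corr should contain one entry near 1, which means each true source is matched by one recovered component.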
More generally, we can solve for the matrix of weights, W, that maximizes the non-Gaussianity of the product of W and a data matrix, X.
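One widely used way to solve for a full weight matrix is the FastICA fixed-point iteration described by Hyvärinen and Oja. The Python sketch below is a simplified version under my own assumptions (tanh nonlinearity, symmetric decorrelation, synthetic sources, a made-up mixing matrix); it is meant to show the shape of the computation, not to replace a library implementation or reproduce the code behind this post.

```python
import numpy as np

def whiten(X):
    """Center X and rescale it so its covariance is the identity."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ eigvecs) / np.sqrt(eigvals)

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ W

def fastica(X, n_iter=200, tol=1e-6, seed=0):
    """Estimate independent components of X (samples x variables)."""
    Z = whiten(X)
    n, k = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(k, k)))
    for _ in range(n_iter):
        Y = Z @ W.T                          # current component estimates
        G = np.tanh(Y)                       # nonlinearity g
        G_prime = 1.0 - G**2                 # its derivative g'
        # Fixed-point update per row: E[z g(w.z)] - E[g'(w.z)] w
        W_new = (G.T @ Z) / n - G_prime.mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when every row points the same way as before (up to sign).
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return Z @ W.T

# Hypothetical sources and mixing matrix for a three-variable example.
rng = np.random.default_rng(6)
n = 20_000
S = np.column_stack([rng.uniform(-1, 1, n),
                     rng.laplace(size=n),
                     rng.exponential(size=n)])
A = np.array([[1.0, 0.5, 0.3],
              [0.4, 1.0, 0.2],
              [0.3, 0.6, 1.0]])
X = S @ A.T

S_est = fastica(X)
corr = np.abs(np.corrcoef(S.T, S_est.T)[:3, 3:])
print(corr.round(2))
```

As in the two-variable case, the recovered components are only defined up to order, sign, and scale, which is why the comparison uses absolute correlations.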
Key Points
I may have (once again) gone too far into the mathematical weeds. As a takeaway, I will highlight three key points of ICA:
- The number of inputs equals the number of outputs
- The hidden components are assumed to be statistically independent
- The hidden components are assumed to be non-Gaussian
PCA vs ICA
Before moving on to an example, I will briefly compare PCA and ICA. Although the two approaches seem related, they perform different tasks. Specifically, PCA is often used to compress information, i.e. dimensionality reduction. In contrast, ICA aims to separate information by transforming the input space into a maximally independent basis. A commonality is that both approaches typically expect the input data to be autoscaled, i.e. each column has its mean subtracted and is then divided by its standard deviation. PCA also decorrelates the data and can reduce the number of columns, which is one reason why it is usually a good thing to do before performing ICA.
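Autoscaling itself is a one-liner. Here is a small Python sketch on made-up data whose columns sit on very different scales; the function name autoscale is my own.

```python
import numpy as np

def autoscale(X):
    """Subtract each column's mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical data: three columns with very different means and spreads.
rng = np.random.default_rng(5)
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.01], size=(1000, 3))

Z = autoscale(X)
print(Z.mean(axis=0).round(6))          # each column mean is ~0
print(Z.std(axis=0, ddof=1).round(6))   # each column standard deviation is 1
```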
Example: Blink Removal from EEG
As always, I will close with a concrete, practical example: using ICA to remove blink artifacts from EEG data. Code is available in the GitHub repository.
Electroencephalography (EEG) is a technique that measures electrical activity produced by the brain. A major disadvantage of EEG is its sensitivity to motion and other non-brain artifacts. One such artifact occurs whenever participants blink. In the figure below, blink artifacts can plainly be seen as spikes in the voltage vs time plot of the Fp1 electrode (near the front of the head).
A good first step when using ICA is to perform PCA on the dataset. In Matlab, this is easily done with the function pca(). I will note here that it is critical to autoscale the data. By default, pca() centers each column automatically, but it does not divide by the standard deviation, so if the columns are on different scales, standardize them first (for example with zscore()). Here, we start with 64 columns corresponding to 64 EEG electrode voltages measured over time. After PCA, we are left with 21 columns corresponding to 21 score vectors, i.e. principal components.
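For readers without Matlab, the PCA step can be sketched in Python with a singular value decomposition. The data below is a synthetic stand-in with 64 columns, not real EEG, and the function name pca_scores is my own. The choice of 21 components simply mirrors the number quoted above; how that number was chosen is not shown here.

```python
import numpy as np

def pca_scores(X, n_components):
    """Autoscale X (samples x channels), then return PCA scores and loadings."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the autoscaled data: Z = U @ diag(svals) @ Vt
    U, svals, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * svals[:n_components]
    loadings = Vt[:n_components].T
    explained = svals**2 / np.sum(svals**2)
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for a recording: 64 channels driven by a few latent
# signals plus a little noise. This is NOT real EEG data.
rng = np.random.default_rng(7)
n_samples, n_channels, n_latent = 5000, 64, 10
latent = rng.laplace(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_channels))
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_channels))

scores, loadings, explained = pca_scores(X, n_components=21)
print(scores.shape)        # (5000, 21)
print(explained.sum())     # fraction of total variance kept
```

The score columns are mutually uncorrelated, which is a convenient starting point for ICA.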
Next, we can train an ICA model and apply it to the PCA score matrix.
We can plot the independent components to inspect which ones correspond to blinking artifacts.
I use a lazy heuristic to pick out the independent components that carry blink information. Namely, I pick the components whose squared values show 4 prominent peaks. The remaining components can then be used to reconstruct the original dataset without the information from these blink components.
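The selection and reconstruction steps might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the peak counter is a simple threshold rule of my own (the peak detection actually used is not shown in this post), the components and mixing matrix are synthetic, and they are taken as given rather than estimated by ICA.

```python
import numpy as np

def count_prominent_peaks(signal, rel_height=0.5):
    """Count separate runs where signal**2 exceeds rel_height times its maximum."""
    power = np.asarray(signal, dtype=float) ** 2
    above = power > rel_height * power.max()
    # A peak starts wherever `above` switches from False to True.
    starts = above & ~np.concatenate(([False], above[:-1]))
    return int(starts.sum())

def remove_components(components, mixing, drop):
    """Rebuild the data from components (samples x k) with some of them zeroed."""
    kept = components.copy()
    kept[:, drop] = 0.0
    return kept @ mixing

# Synthetic components: one "blink" trace with four bumps, two oscillations.
t = np.linspace(0, 10, 1000, endpoint=False)
blink = sum(np.exp(-0.5 * ((t - c) / 0.1) ** 2) for c in (1.5, 4.0, 6.5, 9.0))
components = np.column_stack([
    blink,
    np.sin(2 * np.pi * 10 * t),
    np.sin(2 * np.pi * 6 * t + 0.3),
])

# Hypothetical mixing matrix mapping 3 components to 5 channels.
rng = np.random.default_rng(8)
mixing = rng.normal(size=(3, 5))
X = components @ mixing

# Flag components whose squared trace has exactly 4 prominent peaks.
blink_idx = [i for i in range(components.shape[1])
             if count_prominent_peaks(components[:, i]) == 4]
X_clean = remove_components(components, mixing, blink_idx)
print(blink_idx)   # [0]
```

A fixed peak count is fragile on real recordings, so in practice it is worth plotting the flagged components before removing them.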
Finally, we plot the original and cleaned voltage vs time traces for the Fp1 electrode.
Conclusion
Independent Component Analysis (ICA) extracts hidden factors within data by transforming a set of variables into a new set that is maximally independent. ICA relies on a measure of non-Gaussianity to accomplish this task. Principal Component Analysis (PCA) and ICA have different goals: the former compresses information, and the latter separates information. Despite their differences, using PCA as a preprocessing step for ICA is often helpful. This combination of techniques has applications in financial analysis and neuroscience.
Resources
[1] Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000;13(4–5):411–430. doi:10.1016/s0893-6080(00)00026-5