
Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction techniques. It is valuable when we need to reduce the number of dimensions in a dataset while retaining as much of its information (variance) as possible.

In this article, we will learn why PCA is needed, how it works, which preprocessing steps are required before applying it, and how to interpret the principal components.

Why do we need PCA?

PCA is not required unless your dataset has a large number of attributes. Real-world data, however, is often huge and messy, with many attributes.

Applying a Machine Learning model to such a dataset without reducing its dimensions is computationally expensive.

Our objective is to deliver accurate ML models with lower time and space complexity, so we use PCA to reduce the dimensions while retaining as much information as possible.

PCA is needed when a dataset has a large number of attributes. We can skip PCA for smaller datasets.

Is there any preprocessing step required before applying PCA?

We need to keep the following points in mind before applying PCA:

PCA cannot be applied to a dataset with null values, so you need to treat null values before proceeding. There are different ways of treating them, such as dropping the affected variables or imputing the missing data with the mean or median.

We should not apply PCA to a dataset whose attributes are on different scales. We need to standardize the variables before applying PCA.
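The first point, treating null values, can be sketched as follows. This is only an illustration on a hypothetical toy frame (the column names `like` and `share` and their values are made up), not the treatment applied to the Facebook data in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
toy = pd.DataFrame({
    "like": [10.0, np.nan, 30.0, 50.0],
    "share": [1.0, 2.0, np.nan, 5.0],
})

# Count the missing values per column before deciding how to treat them
null_counts = toy.isnull().sum()

# Option 1: drop every row that contains at least one null
dropped = toy.dropna()

# Option 2: fill each null with the median of its own column
imputed = toy.fillna(toy.median())

print(null_counts)
print(imputed)
```

Dropping rows is the simplest option but loses data; median imputation keeps every row and is less sensitive to outliers than the mean.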

Let us take the Facebook metrics dataset as an example.

This dataset has 19 columns (or dimensions), and we will try to reduce its dimensions using PCA. Below you will find the Python code and its output. We have dropped one categorical column to keep the analysis simple.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the semicolon-separated Facebook metrics file
data = pd.read_csv(r"C:\Users\Himanshu\Downloads\Facebook_metrics\dataset_Facebook.csv", sep=';')
# Drop the categorical column so that all remaining data is numerical
data.drop(columns='Type', inplace=True)
data.head()

Figure: Input data (first rows of the Facebook metrics dataset)

We will check the statistical summary of our dataset to find the scale of each attribute. Below we can see that the attributes are on very different scales. Therefore, we cannot jump to PCA directly without rescaling the attributes.
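A summary like this can be produced with pandas' `describe()`. The sketch below uses a hypothetical toy frame whose two column names mirror the dataset but whose values are made up; on the real data the call would presumably be `data.describe()`.

```python
import pandas as pd

# Hypothetical toy values; only the column names come from the dataset
toy = pd.DataFrame({
    "Post Weekday": [1, 3, 5, 7, 2],
    "Lifetime Post Total Reach": [2752, 10460, 2413, 50128, 7244],
})

# describe() reports count, mean, std, min, quartiles and max per column;
# transposing puts one attribute per row, which is easier to scan
summary = toy.describe().T
print(summary[["mean", "std", "min", "max"]])
```

Comparing the `std`, `min` and `max` columns row by row is a quick way to spot attributes that live on very different scales.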

Figure: Statistical summary of data

We see that the column “Post Weekday” has low variance, while the column “Lifetime Post Total Reach” has comparatively high variance.

Therefore, if we apply PCA without standardizing the data, the “Lifetime Post Total Reach” column will carry far more weight when the eigenvectors and eigenvalues are calculated, and we will get biased principal components.
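The effect of scale can be seen on a small synthetic example. The sketch below does not use the Facebook data: it assumes two independent, randomly generated columns, one on a weekday-like scale and one on a reach-like scale, and uses `StandardScaler` purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, independent columns: one small-scale, one large-scale
small = rng.normal(loc=4, scale=2, size=500)
large = rng.normal(loc=14000, scale=20000, size=500)
X = np.column_stack([small, large])

# PCA on raw data: the first component points almost entirely at the large column
raw_pca = PCA().fit(X)
raw_weights = np.abs(raw_pca.components_[0])

# PCA after standardization: both columns contribute on an equal footing
std_pca = PCA().fit(StandardScaler().fit_transform(X))
std_ratio = std_pca.explained_variance_ratio_

print(raw_weights)
print(raw_pca.explained_variance_ratio_)
print(std_ratio)
```

On the raw data the first component is essentially just the large-scale column, even though neither column is more informative than the other by construction.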

Now we will scale the dataset using RobustScaler from the sklearn library, which centers each attribute on its median and scales it by its interquartile range. sklearn provides other scalers as well, such as StandardScaler and MinMaxScaler, which can be chosen as per the requirement.

from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
scaled = pd.DataFrame(rs.fit_transform(data), columns=data.columns)
scaled.head()
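To make clear what RobustScaler does with its default settings, the sketch below reproduces it by hand on a hypothetical single column containing one extreme value: subtract the median, then divide by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single column with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler defaults: subtract the median, divide by the interquartile range
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

scaled_x = RobustScaler().fit_transform(x)
print(scaled_x.ravel())
print(np.allclose(manual, scaled_x))
```

Because the median and the interquartile range barely react to the extreme value, the bulk of the data ends up on a sensible scale. The extreme value itself is not clipped; it simply stays far from the rest.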

Who decides the number of principal components?

Unless specified, the number of principal components will be equal to the number of attributes (more precisely, sklearn keeps the smaller of the number of rows and the number of attributes).

Our dataset has 18 attributes, hence we get 18 principal components. These components are new variables, each of which is a linear combination of the input variables.

Once we know the amount of variance explained by each principal component, we can decide how many components we need for our model based on the amount of information we want to retain.

Principal components are uncorrelated with each other. The directions of the principal components are the eigenvectors of the data's covariance matrix, and the variance explained by each eigenvector is its eigenvalue.

Below we have applied PCA to the scaled dataset. If we want a predefined number of components, we can request it using PCA(n_components=k).
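The link between PCA and eigenvectors can be checked numerically. The sketch below uses synthetic correlated data (the mixing matrix is an arbitrary assumption) and compares sklearn's PCA with a direct eigen-decomposition of the sample covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic correlated data: 200 rows, 3 columns
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

pca_demo = PCA().fit(X)

# Eigen-decomposition of the sample covariance matrix.
# eigh returns eigenvalues in ascending order, so reverse both outputs.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# Eigenvalues match the variance PCA reports for each component
print(np.allclose(eigvals, pca_demo.explained_variance_))

# Eigenvectors match the component directions up to sign
alignment = np.abs(np.sum(eigvecs.T * pca_demo.components_, axis=1))
print(alignment)

# The transformed columns (component scores) are mutually uncorrelated
scores = pca_demo.transform(X)
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))
print(np.abs(off_diag).max())
```

The sign of an eigenvector is arbitrary, which is why the comparison uses the absolute value of the dot product.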

from sklearn.decomposition import PCA

# Drop any rows that still contain null values
scaled_data = scaled.dropna()
# Keep all components; set n_components to an integer for a predefined number
pca = PCA()
pca.fit_transform(scaled_data)
print(pca.explained_variance_ratio_)

Here the output is the fraction of variance explained by each principal component. We have 18 attributes in our dataset and hence we get 18 principal components.

Remember that the first principal component always holds the maximum variance.

You can observe this in the output: the first principal component holds the most variance, and each subsequent component holds no more than the one before it.
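Given such an array of ratios, the number of components needed for a target share of variance follows from a cumulative sum. The ratios below are hypothetical, not the output for the Facebook data; with a fitted model one would presumably pass `pca.explained_variance_ratio_` instead.

```python
import numpy as np

# Hypothetical explained-variance ratios, in the descending order PCA returns them
ratios = np.array([0.45, 0.25, 0.15, 0.08, 0.04, 0.03])

# Running total of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(ratios)

# Smallest number of leading components whose cumulative share reaches the target
target = 0.95
n_keep = int(np.searchsorted(cumulative, target) + 1)

print(cumulative)
print(n_keep)
```

`np.searchsorted` works here because the cumulative sum is non-decreasing; it returns the zero-based position of the first entry that reaches the target, hence the `+ 1`.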

Interpretation of Principal Component

Now we have 18 principal components and we will try to find out how these components are influenced by each attribute.

We can check the three attributes with the largest positive weights and the three with the largest negative weights for the first principal component.

Below is the Python code to fetch the influence of attributes on a principal component. You can change the component number and the number of features to inspect.

def feature_weight(pca, n_comp, n_feat):
    """Plot and return the n_feat highest and n_feat lowest weights of component n_comp (1-based)."""
    # Each row of pca.components_ holds the attribute weights of one component
    comp = pd.DataFrame(np.round(pca.components_, 2), columns=scaled_data.columns).iloc[n_comp - 1]
    comp = comp.sort_values(ascending=False)
    comp = pd.concat([comp.head(n_feat), comp.tail(n_feat)])
    comp.plot(kind='bar', title='Top {} weighted attributes for PCA component {}'.format(n_feat, n_comp))
    plt.show()
    return comp

# n_comp is 1-based: 1 selects the first principal component (0 would select the last one)
feature_weight(pca, 1, 3)

Figure: Top weighted features of the first principal component

We can interpret here that our first principal component is mostly influenced by engagement with the post (likes, comments, impressions, and reach).

Likewise, we can interpret other principal components as per the understanding of data using the above plot.

Plot to visualize variance by each principal component: Scree Plot

Below you can see a scree plot that depicts the variance explained by each principal component.

Here we can see that the top 8 components account for more than 95% of the variance. We can use these 8 principal components for modelling.

def screeplot(pca):
    """Bar chart of the variance ratio per component, with the cumulative ratio as a line."""
    var_pca = pca.explained_variance_ratio_
    indx = np.arange(1, len(var_pca) + 1)  # number the components from 1
    cum_var = np.cumsum(var_pca)
    plt.figure(figsize=(14, 8))
    ax = plt.subplot()
    ax.bar(indx, var_pca)
    ax.plot(indx, cum_var)
    ax.set_xlabel("Principal Components")
    ax.set_ylabel("Fraction of Variance Explained")
    plt.title('Cumulative Variance Explained by Principal Components')
    plt.show()

screeplot(pca)

Figure: Scree plot

Finally, we reduce the number of attributes from the initial 18 to 8 while retaining about 95% of the variance in our dataset. Voila!! 🙂
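The reduction itself is a single `fit_transform` call. The sketch below is an assumption-laden stand-in: it uses synthetic data (10 columns driven by 3 hidden factors) rather than the Facebook data, `StandardScaler` rather than `RobustScaler`, and lets sklearn pick the number of components by passing a float to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in: 300 rows, 10 columns driven by 3 hidden factors plus noise
factors = rng.normal(size=(300, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))
X_scaled = StandardScaler().fit_transform(X)

# A float between 0 and 1 asks sklearn to keep the fewest components whose
# cumulative explained variance exceeds that fraction (requires the full SVD solver)
pca_95 = PCA(n_components=0.95, svd_solver="full")
reduced = pca_95.fit_transform(X_scaled)

print(reduced.shape)
print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())
```

Passing an integer instead, for example `PCA(n_components=8)`, keeps exactly that many components; the returned array is what would be fed to a downstream model.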

Below is the scree plot for the unscaled data, to check how different the principal components are from those of the scaled version. We can see a huge difference in the principal components and in the amount of variance they explain. Here, the first component alone explains around 85% of the variance.

Figure: Scree plot for unscaled data

Similarly, you can check how each principal component of the unscaled data is influenced by the attributes.

I hope this article helps you understand the basics of PCA.

If you like this article, I will share another one covering the basic mathematics behind PCA.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Himanshu


URL: https://www.analyticsvidhya.com/blog/2021/05/demystifying-the-working-of-principal-component-analysis/

Principal Component Analysis Demystified | Analytics Vidhya
