Curse of Dimensionality in Machine Learning

Last Updated : 23 Jul, 2025

The curse of dimensionality arises in machine learning when working with high-dimensional data. It leads to increased computational complexity, overfitting, and spurious correlations.

Techniques such as dimensionality reduction, feature selection, and careful model design help mitigate these effects and improve algorithm performance. Addressing the problem is essential for building robust machine-learning solutions on high-dimensional datasets.

What is Curse of Dimensionality?

The curse of dimensionality refers to the problems that appear when data has a large number of features. In high-dimensional spaces, data points become sparse: the volume of the space grows exponentially with the number of dimensions, so the amount of data needed to sample it adequately grows just as fast, and meaningful patterns or relationships become hard to discern.
The curse of dimensionality affects machine learning algorithms in several ways. It increases computational complexity, training time, and resource requirements. It also raises the risk of overfitting and spurious correlations, which hinders an algorithm's ability to generalize to unseen data.
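
The sparsity described above can be illustrated with a small experiment. The sketch below is not from the original article: it assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, distance_contrast, to measure how much the nearest and farthest neighbours of a query point differ. As the number of dimensions grows, the contrast shrinks, which is one reason distance-based methods struggle in high dimensions.

```python
import numpy as np


def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min over the Euclidean distances from one random
    query point to n_points random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in [0, 1)^n_dims
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


if __name__ == "__main__":
    # The contrast falls sharply as the dimensionality increases.
    for d in (2, 10, 100, 1000):
        print(f"dims={d:5d}  contrast={distance_contrast(500, d):.3f}")
```

A low contrast means the nearest and farthest points are almost equally far away, so "nearest neighbour" carries little information.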

How to Overcome the Curse of Dimensionality?

To overcome the curse of dimensionality, you can consider the following strategies:

1. Dimensionality Reduction Techniques:

Feature Selection: Identify and select the most relevant features from the original dataset while discarding irrelevant or redundant ones. This reduces the dimensionality of the data, simplifying the model and improving its efficiency.
Feature Extraction: Transform the original high-dimensional data into a lower-dimensional space by creating new features that capture the essential information. Principal Component Analysis (PCA) is commonly used for this purpose. t-distributed Stochastic Neighbor Embedding (t-SNE) also produces low-dimensional embeddings, although it is used mainly for visualization rather than as input to a downstream model.

2. Data Preprocessing:

Normalization: Scale the features to a similar range to prevent certain features from dominating others, especially in distance-based algorithms.
Handling Missing Values: Address missing data appropriately through imputation or deletion to ensure robustness in the model training process.
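
To see why normalization matters for distance-based algorithms, consider the following minimal sketch. The three data points are hypothetical toy values chosen so that the first feature is on a much larger scale than the second; StandardScaler is one of several possible scaling choices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 is below 1.
X = np.array([
    [1000.0, 0.1],
    [1010.0, 0.9],
    [2000.0, 0.1],
])


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Before scaling, column 0 dominates every distance.
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# After scaling, each column has zero mean and unit variance,
# so both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw ratio:    {raw_02 / raw_01:.2f}")
print(f"scaled ratio: {scaled_02 / scaled_01:.2f}")
```

On the raw data, the large difference in the second feature between the first two points is almost invisible next to the first feature; after scaling it counts as much as the difference in the first feature.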

Implementation: Mitigating Curse Of Dimensionality

This example uses the uci-secom dataset, which contains semiconductor manufacturing process data from the UCI SECOM collection.

Import Necessary Libraries

Import required libraries including scikit-learn modules for dataset loading, model training, data preprocessing, dimensionality reduction, and evaluation.

Loading the dataset

The dataset is stored in a CSV file named 'your_dataset.csv'. It has a timestamp column named 'Time' and a target variable column named 'Pass/Fail'.
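
The original loading code is not reproduced here, so the following is a minimal sketch of what this step might look like. The helper name load_dataset is hypothetical; the column names 'Time' and 'Pass/Fail' and the file name 'your_dataset.csv' come from the description above.

```python
import pandas as pd


def load_dataset(csv_path, time_col="Time", target_col="Pass/Fail"):
    """Read the CSV and split it into a feature frame X and a target series y.

    The timestamp column is dropped because it is not used as a feature.
    """
    df = pd.read_csv(csv_path)
    X = df.drop(columns=[time_col, target_col])
    y = df[target_col]
    return X, y


# Usage, assuming the file sits next to the script:
# X, y = load_dataset("your_dataset.csv")
```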

Remove Constant Features

We are using VarianceThreshold to remove constant features and SimpleImputer to impute missing values with the mean.
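
A minimal sketch of this step on a hypothetical toy array follows. The order used here (impute first, then drop constant columns) is an assumption made so that the variance is computed on complete data; the original code may apply the two steps in the other order.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column 1 is constant, columns 0 and 2 have gaps.
X = np.array([
    [1.0, 5.0, np.nan],
    [2.0, 5.0, 0.3],
    [np.nan, 5.0, 0.7],
    [4.0, 5.0, 0.5],
])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# threshold=0.0 removes only features whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_imputed)

kept = selector.get_support(indices=True)
print("kept column indices:", kept.tolist())  # column 1 is constant and is dropped
print("shape after cleaning:", X_reduced.shape)
```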

Splitting the data and standardizing
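
The split and scaling step can be sketched as follows. Random synthetic data stands in for the real features, and the 80/20 split, the stratification, and random_state=42 are assumptions rather than values confirmed by the original code. The scaler is fitted on the training set only, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature matrix and binary target.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 5))
y = rng.integers(0, 2, size=200)

# Hold out 20% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```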

Feature Selection and Dimensionality Reduction

Feature Selection: SelectKBest is used to select the top k features based on a specified scoring function (f_classif in this case). It selects the features that are most likely to be related to the target variable.
Dimensionality Reduction: PCA (Principal Component Analysis) is then used to further reduce the dimensionality of the selected features. It transforms the data into a lower-dimensional space while retaining as much variance as possible.
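
The two stages above can be sketched on synthetic data as shown below. The values k=20 and n_components=10 are illustrative assumptions, not the settings of the original code, and make_classification stands in for the real dataset. In a full workflow both transformers would be fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 50 features, only a few of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=8, n_redundant=4, random_state=0
)

# Stage 1: keep the k features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

# Stage 2: project the selected features onto their principal components.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_selected)

explained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_selected.shape, "->", X_pca.shape)
print(f"variance retained by PCA: {explained:.3f}")
```

Checking explained_variance_ratio_ is a simple way to judge whether the chosen number of components discards too much information.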

Training the classifiers

Training Before Dimensionality Reduction: Train a Random Forest classifier (clf_before) on the original scaled features (X_train_scaled) without dimensionality reduction.
Evaluation Before Dimensionality Reduction: Make predictions (y_pred_before) on the test set (X_test_scaled) using the classifier trained before dimensionality reduction, and calculate the accuracy (accuracy_before) of the model.
Training After Dimensionality Reduction: Train a new Random Forest classifier (clf_after) on the reduced feature set (X_train_pca) after dimensionality reduction.
Evaluation After Dimensionality Reduction: Make predictions (y_pred_after) on the test set (X_test_pca) using the classifier trained after dimensionality reduction, and calculate the accuracy (accuracy_after) of the model.
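
The four steps above can be sketched as follows. This is an illustration on synthetic data, not the article's original code, so its accuracies will differ from the figures reported for the SECOM data. The helper fit_and_score, the number of trees, and the values of k and n_components are assumptions; the variable names follow the descriptions above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_score(X_train, X_test, y_train, y_test, seed=42):
    """Train a Random Forest and return it with its test-set accuracy."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return clf, accuracy_score(y_test, y_pred)


# Synthetic stand-in for the cleaned dataset.
X, y = make_classification(
    n_samples=400, n_features=100, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection followed by PCA, fitted on the training split only.
reducer = make_pipeline(SelectKBest(f_classif, k=20), PCA(n_components=10))
X_train_pca = reducer.fit_transform(X_train_scaled, y_train)
X_test_pca = reducer.transform(X_test_scaled)

clf_before, accuracy_before = fit_and_score(X_train_scaled, X_test_scaled, y_train, y_test)
clf_after, accuracy_after = fit_and_score(X_train_pca, X_test_pca, y_train, y_test)

print("Accuracy before dimensionality reduction:", accuracy_before)
print("Accuracy after dimensionality reduction:", accuracy_after)
```

Whether the reduced model scores higher depends on the data and the chosen settings, so the comparison is worth repeating over several random splits.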

Complete Code

Output:

Accuracy before dimensionality reduction: 0.8745
Accuracy after dimensionality reduction: 0.9235668789808917

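In this run, accuracy rose from 0.8745 before dimensionality reduction to 0.9236 after it. This suggests that feature selection followed by PCA helped the Random Forest generalize better to unseen data, although a single train/test split is limited evidence. Because the SECOM data is heavily imbalanced toward the passing class, accuracy alone can overstate performance, so metrics such as precision, recall, or F1 on the failing class are worth checking as well.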

URL: https://www.geeksforgeeks.org/machine-learning/curse-of-dimensionality-in-machine-learning/