How to Perform Feature Selection for Regression Data

Last Updated : 27 Jul, 2025

Feature selection is a crucial step in the data preprocessing pipeline for regression tasks. It involves identifying and selecting the most relevant features (or variables) that contribute to the prediction of the target variable. This process helps in reducing the complexity of the model, improving its performance, and making it more interpretable.

In this article, we will explore several techniques for performing feature selection on regression data, so that you can build efficient and accurate models.

Why Is Feature Selection Important in Regression?

Feature selection is vital because not all features in a dataset are equally important. Some features may be irrelevant or redundant, leading to overfitting and poor model performance. By selecting only the most relevant features, you can:

Reduce model complexity: Fewer features mean a simpler model, which is easier to interpret and faster to train.
Improve model performance: Removing irrelevant features can enhance the model's predictive accuracy.
Prevent overfitting: With fewer features, the model is less likely to learn noise from the training data.

Techniques for Feature Selection in Regression

1. Correlation Analysis

Correlation analysis helps identify linear relationships between features and the target variable. Features whose correlation with the target is large in absolute value are typically considered more important for the regression model, although a low correlation does not rule out a nonlinear relationship. Similarly, pairs of features that are highly correlated with each other may be redundant, in which case keeping one of them is often enough.

If you're predicting house prices, features like the number of bedrooms or square footage might have a high positive correlation with the price, making them important features to include in your model.

How to Use?

Calculate the Pearson correlation coefficient between each feature and the target variable. A coefficient close to +1 or -1 indicates a strong linear relationship. You can visualize this with a correlation matrix heatmap.

In Python, you can use pandas to calculate correlation:
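A minimal sketch of that calculation, assuming a small synthetic table in place of a real housing dataset (the column names `sqft`, `bedrooms`, `noise_col`, and `price` are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a housing table; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
sqft = rng.normal(1500, 300, n)
bedrooms = np.round(sqft / 500 + rng.normal(0, 0.5, n))
noise_col = rng.normal(0, 1, n)  # unrelated to price by construction
price = 100 * sqft + 5000 * bedrooms + rng.normal(0, 20000, n)

df = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms,
                   "noise_col": noise_col, "price": price})

# Pearson correlation of each feature with the target, strongest first.
target_corr = (df.corr(method="pearson")["price"]
                 .drop("price")
                 .sort_values(key=lambda s: s.abs(), ascending=False))
print(target_corr)

# Feature-to-feature correlations help spot redundant pairs.
feature_corr = df.drop(columns="price").corr()
print(feature_corr.round(2))
```

The `feature_corr` matrix can be passed to a plotting library to draw the heatmap mentioned above.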

2. Univariate Selection

Univariate feature selection involves selecting features based on their individual relationship with the target variable. This method uses statistical tests to determine the significance of each feature.

For predicting exam scores, individual features like study hours or past grades can be evaluated to determine their impact on the prediction.

How to Use?

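You can apply a univariate test such as the F-test for regression (f_regression) or a mutual information score (mutual_info_regression) to rank features by relevance. The Chi-Square test is meant for classification targets with non-negative features, so it is not appropriate for a continuous target.

In Python, the SelectKBest class from sklearn.feature_selection is commonly used for this purpose.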
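As a rough illustration, the sketch below scores features with `f_regression`, the regression F-test in scikit-learn. The synthetic data from `make_regression` and the choice of `k=3` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 10 features, only 3 of which drive the target.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Score each feature on its own with the F-test and keep the top k.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print("F-scores:", selector.scores_.round(1))
print("Selected column indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)
```

Swapping in `mutual_info_regression` as the `score_func` would also capture nonlinear dependence, at a higher computational cost.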

3. Recursive Feature Elimination (RFE)

RFE is a recursive method that eliminates less important features in a step-by-step manner. It works by fitting a model and removing the weakest feature(s) until the desired number of features is reached.

When building a model to predict car prices, RFE might eliminate less significant features like color or brand while retaining more influential features like engine size and mileage.

How to Use?

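RFE can be implemented using sklearn's RFE class, where you specify the estimator and the number of features to select (n_features_to_select). At each round the estimator is refitted, the features are ranked by its coef_ or feature_importances_ values, and the lowest-ranked ones are dropped. With a linear estimator, scale the features first so that coefficient magnitudes are comparable.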
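A short sketch of the RFE class in use, assuming synthetic data and a target of 3 features. The features from `make_regression` are already on a similar scale; real data would usually need scaling before a linear estimator is used for ranking.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Drop one feature per round (step=1) until 3 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)

print("Kept mask:", rfe.support_)
print("Ranking (1 = kept):", rfe.ranking_)
```

In `ranking_`, every kept feature has rank 1, and higher numbers mark features that were eliminated earlier.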

4. Lasso Regression (L1 Regularization)

Lasso regression adds a penalty proportional to the sum of the absolute values of the coefficients to the loss function. This shrinks all coefficients and drives some of them to exactly zero, so the corresponding features are effectively removed from the model.
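For reference, the objective can be written as follows, using the scaling that scikit-learn's Lasso documents. The symbols are introduced here for illustration: n samples, p features, weights w, intercept b, and regularization strength α.

```latex
\min_{w,\,b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - b - x_i^\top w \right)^2 \;+\; \alpha \sum_{j=1}^{p} |w_j|
```

The first term is the usual squared-error loss; the second is the L1 penalty that pushes individual weights to zero as α grows.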

For a dataset predicting diabetes progression, Lasso might reduce the number of features by setting the coefficients of less important features to zero, focusing only on those with substantial predictive power.

How to Use?

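Implement Lasso regression using the Lasso class in sklearn. The regularization parameter (alpha) controls how many features survive: larger values push more coefficients to exactly zero. Standardize the features beforehand, because the penalty depends on the scale of each coefficient.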
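The effect of alpha can be sketched as follows. The synthetic data and the three alpha values are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit Lasso at increasing penalty strengths and count surviving features.
kept_counts = {}
for alpha in (0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y)
    kept = np.flatnonzero(model.coef_)  # indices of non-zero coefficients
    kept_counts[alpha] = len(kept)
    print(f"alpha={alpha}: kept {len(kept)} features -> {kept}")
```

In practice alpha is usually chosen by cross-validation, for example with scikit-learn's `LassoCV`.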

5. Feature Importance from Tree-based Models

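Tree-based models like Random Forests or Gradient Boosting can compute feature importance scores based on how much each feature reduces impurity across the splits of the decision trees. For regression trees the impurity is usually the variance (mean squared error) of the target; the Gini index applies to classification.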

In predicting a continuous target such as a customer's monthly spend, a Random Forest model might show that features like customer tenure or usage patterns are the most important predictors.

How to Use?

Train a tree-based model and extract the feature importance scores. Features with higher importance scores are more relevant for predicting the target variable.
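A minimal sketch with a Random Forest regressor, assuming synthetic data and placeholder feature names (`f0` to `f5`):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 6 features, 2 of them informative.
X, y = make_regression(n_samples=300, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = (pd.Series(forest.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances)
```

Impurity-based scores can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.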

6. Dimensionality Reduction Techniques (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms features into a set of uncorrelated components, ranked by the amount of variance they explain in the data. PCA does not select original features, and because it ignores the target variable, a high-variance component is not necessarily a predictive one. It still reduces the feature space, which makes it a useful preprocessing tool.

When working with high-dimensional datasets like gene expression data, PCA can reduce the number of features while retaining the essential information needed for accurate predictions.

How to Use?

Implement PCA using sklearn's PCA class. Standardize the features first, since PCA is sensitive to scale, then decide on the number of principal components to retain based on the explained variance ratio.
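The sketch below shows one way to pick the number of components from the explained variance ratio. The synthetic data (10 correlated features generated from 3 latent factors) and the 95% threshold are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 correlated features built from 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) keeps just enough components to reach that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:",
      np.cumsum(pca.explained_variance_ratio_).round(3))
```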

Implementing Feature Selection for Regression Data Using RFE and Linear Regression

Step 1: Importing Libraries and Loading the Dataset

In this step, we import the necessary libraries and load the California housing dataset, which will be used for the regression task.

Step 2: Splitting the Dataset into Training and Testing Sets

The dataset is split into training and testing sets using an 80-20 split. This helps in evaluating the model's performance on unseen data.

Step 3: Initializing the Linear Regression Model

We initialize a linear regression model that will be used in the Recursive Feature Elimination (RFE) process to rank the features.

Step 4: Performing Recursive Feature Elimination (RFE)

RFE is applied to the training data to select the top 5 most important features. The model is trained iteratively, and the least important features are removed in each step.

Step 5: Identifying and Displaying the Selected Features

The selected features from the RFE process are identified and printed. These features are considered the most relevant for predicting the target variable.

Output:

Selected Features: Index(['MedInc', 'AveRooms', 'AveBedrms', 'Latitude', 'Longitude'], dtype='object')

Step 6: Training the Model with Selected Features

A new model is trained on the training set using only the selected features, and the test set is reduced to the same columns. This ensures that the final model sees only the features chosen by RFE.

Step 7: Making Predictions and Evaluating the Model

Predictions are made on the test set using the trained model. The model's performance is evaluated with the Mean Squared Error (MSE), the average squared difference between predicted and actual values; lower values indicate a better fit.

Output:

Mean Squared Error: 0.5667695170781499
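The seven steps above can be sketched as a single function. This is an illustrative reconstruction, not the article's original code: the function name `rfe_linear_pipeline`, its defaults (including `random_state=42`), and the synthetic demo data are all assumptions, and the demo avoids downloading the California housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rfe_linear_pipeline(X, y, n_features=5, test_size=0.2, random_state=42):
    """Split, select features with RFE, refit on the selection, and score.

    X is a pandas DataFrame; y is a Series or 1-D array.
    The name and defaults are hypothetical, chosen for this sketch.
    """
    # Step 2: 80-20 train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    # Steps 3-4: RFE around a linear model, fitted on the training split only.
    rfe = RFE(estimator=LinearRegression(), n_features_to_select=n_features)
    rfe.fit(X_train, y_train)

    # Step 5: names of the columns RFE kept.
    selected = X_train.columns[rfe.support_]

    # Step 6: fit a fresh model on the selected training columns.
    model = LinearRegression().fit(X_train[selected], y_train)

    # Step 7: evaluate on the held-out test split.
    mse = mean_squared_error(y_test, model.predict(X_test[selected]))
    return selected, mse


# Demo on synthetic data (placeholder columns f0..f7; only f0 and f3 matter).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 8)),
                      columns=[f"f{i}" for i in range(8)])
y_demo = 3 * X_demo["f0"] - 2 * X_demo["f3"] + rng.normal(scale=0.5, size=500)

selected, mse = rfe_linear_pipeline(X_demo, y_demo, n_features=5)
print("Selected Features:", list(selected))
print("Mean Squared Error:", mse)
```

For the California housing data used in this walkthrough, X and y would come from the `.data` and `.target` attributes of `sklearn.datasets.fetch_california_housing(as_frame=True)`, which downloads the dataset on first use. Whether that reproduces the exact outputs shown above depends on the split seed, which the article does not state.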

Conclusion

Feature selection is an essential process in building efficient regression models. By carefully selecting the most relevant features, you can improve model performance, reduce complexity, and enhance interpretability. Techniques such as correlation analysis, univariate selection, RFE, Lasso regression, and feature importance from tree-based models provide powerful tools to identify the features that matter most. Implementing these methods will help you create more robust and accurate regression models, ultimately leading to better insights and predictions.

URL: https://www.geeksforgeeks.org/machine-learning/how-to-perform-feature-selection-for-regression-data/