Fitting Different Inputs into an Sklearn Pipeline

Last Updated : 23 Jul, 2025

Scikit-learn's Pipeline class chains together multiple steps, such as data preprocessing, feature engineering, and model training, to simplify and streamline the machine learning workflow. Each step is applied in sequence, which guarantees that the same transformations are used during training and testing. Pipelines are especially helpful because they enforce best practices, such as making sure transformations are learned only from training data to prevent data leakage.

Understanding the Basics of sklearn Pipelines

Before diving into the specifics of handling multiple inputs, it's essential to understand the fundamental structure of an sklearn pipeline. A pipeline is a sequence of named steps. Every step except the last must be a transformer (an object that implements fit and transform), while the final step can be any estimator, typically a model. The steps run in order, with the output of one step serving as the input to the next.
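As a minimal sketch of this structure, the snippet below chains a scaler and a classifier. The Iris dataset, the LogisticRegression model, and the step names are assumptions chosen for illustration; they do not come from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; the last step is the model.
basic_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler, transforms X, then trains the model on the result.
basic_pipe.fit(X, y)

# predict() reuses the already fitted scaler before calling the model.
predictions = basic_pipe.predict(X[:5])
print(predictions)
```

The step names are arbitrary labels, but they matter later: they are how individual steps are accessed and how their hyperparameters are addressed.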

Key benefits of pipeline use:

Code organization: Keep data processing and model training in clearly separated, named steps so that your code stays modular and manageable.
Consistency: Make sure the training and testing sets undergo exactly the same transformations.
Data leak prevention: By making sure that transformations (like scaling) are fitted only on training data and never on test data, pipelines help to avoid a common mistake.
Grid search integration: Pipelines work directly with tools like GridSearchCV and RandomizedSearchCV, so hyperparameters of any step in the workflow can be tuned together.
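The leak prevention point can be checked directly. The sketch below, again assuming the Iris dataset and a LogisticRegression model purely for illustration, fits a pipeline on a training split and confirms that the scaler's learned means come from the training rows alone.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

leak_free_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Only the training split is passed to fit(), so the scaler never sees X_test.
leak_free_pipe.fit(X_train, y_train)

# The scaler's statistics match the training data, not the full dataset.
fitted_scaler = leak_free_pipe.named_steps["scaler"]
scaler_uses_train_only = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
print(scaler_uses_train_only)

# score() applies the fitted scaler to X_test and then evaluates the model.
test_accuracy = leak_free_pipe.score(X_test, y_test)
print(test_accuracy)
```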

Understanding Different Types of Inputs

In the field of machine learning, datasets frequently comprise diverse input feature types, necessitating distinct preprocessing procedures prior to model training. Among these input kinds are:

Numerical inputs: These can be discrete or continuous values (temperature, income, age, etc.). They frequently call for scaling (for example with StandardScaler or MinMaxScaler) and occasionally feature engineering (such as binning or polynomial features).
Categorical inputs: These are features that represent categories or labels (e.g., "Male/Female", "Yes/No"). They are usually encoded into numerical representations as part of preprocessing. Common encoders are OneHotEncoder, which converts categories into binary vectors, and OrdinalEncoder, which converts categories into integers. LabelEncoder also maps categories to integers, but it is intended for the target variable rather than for input features.
Text inputs: Unprocessed textual data, such as comments and product reviews. This kind of data needs a vectorization technique like CountVectorizer or TfidfVectorizer to transform text into numerical feature vectors that machine learning models can use as input.

Each of these input kinds requires a distinct approach, and to address these varied input kinds, sklearn offers a large range of preprocessing tools that can be integrated into a pipeline.
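To make the three input kinds concrete, here is a small sketch that routes one numerical, one categorical, and one text column to different preprocessors. The DataFrame and its column names (price, category, review) are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
reviews = pd.DataFrame({
    "price": [10.0, 25.0, 7.5, 40.0],
    "category": ["book", "toy", "book", "game"],
    "review": ["great read", "fun toy", "dull read", "great game"],
})

mixed_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    # Text vectorizers expect a 1-D sequence of strings, so the column is
    # given as a plain string rather than as a list of column names.
    ("text", TfidfVectorizer(), "review"),
])

features = mixed_preprocessor.fit_transform(reviews)

# 1 scaled column + 3 one-hot columns + 6 vocabulary terms = 10 features.
print(features.shape)
```

Passing the text column as a string instead of a one-element list is the detail that most often trips people up: a list selects a 2-D block, which the vectorizer cannot tokenize.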

Creating Custom Transformers for Different Inputs

Sometimes, the built-in transformers may not fully meet the requirements of your data. In such cases, you can create a custom transformer by subclassing TransformerMixin and BaseEstimator. For example, we might want a custom transformer that applies specific preprocessing for different input types.
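The original listing for this section is not available, so the following is a reconstruction under stated assumptions rather than the author's code. The class name AddConstant and its add_value parameter are hypothetical; adding 5 to the small array used here yields the values shown in the output that follows.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds a fixed value to every feature."""

    def __init__(self, add_value=1):
        self.add_value = add_value

    def fit(self, X, y=None):
        # Nothing is learned from the data; returning self lets the step chain.
        return self

    def transform(self, X):
        return np.asarray(X) + self.add_value


data = np.array([[1, 2], [3, 4], [5, 6]])

# fit_transform() is provided by TransformerMixin.
transformed = AddConstant(add_value=5).fit_transform(data)
print(transformed)
```

Inheriting from BaseEstimator supplies get_params and set_params, which is what allows a custom step to take part in a grid search.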

Output:

[[ 6 7]
 [ 8 9]
 [10 11]]

Managing Multiple Input Types with ColumnTransformer

The ColumnTransformer is a powerful tool in sklearn that allows you to apply different preprocessing steps to different columns in your dataset. This is particularly helpful when dealing with datasets that contain both numerical and categorical data.

For instance, you can scale numerical features and one-hot encode categorical features in the same pipeline using ColumnTransformer.
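A minimal sketch of that idea is shown below. The four-row DataFrame is an assumption chosen to be consistent with the output that follows; the column names numerical and categorical match those that appear in the pipeline representations later in the article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data: one numerical and one categorical column.
df = pd.DataFrame({
    "numerical": [1, 2, 3, 4],
    "categorical": ["A", "B", "A", "B"],
})

# Each entry is (name, transformer, columns the transformer applies to).
column_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["numerical"]),
    ("cat", OneHotEncoder(), ["categorical"]),
])

# The result has one scaled column followed by two one-hot columns.
processed = column_preprocessor.fit_transform(df)
print(processed)
```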

Output:

[[-1.34164079  1.          0.        ]
 [-0.4472136   0.          1.        ]
 [ 0.4472136   1.          0.        ]
 [ 1.34164079  0.          1.        ]]

This ensures that each type of feature receives the appropriate preprocessing.

Fitting Categorical and Numerical Features in the Same Pipeline

Combining both categorical and numerical features into a single pipeline is straightforward with ColumnTransformer. Let’s walk through an example where we preprocess numerical and categorical features and train a RandomForestClassifier.
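The sketch below shows one way such a pipeline could look. The six-row dataset and its labels are assumptions made up for this illustration; the step names mirror the representation printed in the output that follows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data and labels, for illustration only.
X = pd.DataFrame({
    "numerical": [1, 2, 3, 4, 5, 6],
    "categorical": ["A", "B", "A", "B", "A", "B"],
})
y = [0, 1, 0, 1, 0, 1]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["numerical"]),
    ("cat", OneHotEncoder(), ["categorical"]),
])

# The ColumnTransformer is simply the first step of a larger pipeline.
clf_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier()),
])

clf_pipeline.fit(X, y)
train_predictions = clf_pipeline.predict(X)
print(train_predictions)
```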

Output:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['numerical']),
                                                 ('cat', OneHotEncoder(),
                                                  ['categorical'])])),
                ('classifier', RandomForestClassifier())])

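Calling fit() preprocesses the input and trains the RandomForestClassifier on it. The method returns the fitted pipeline, which is why a notebook displays the representation above; a plain script prints nothing unless you print the pipeline explicitly.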

Combining Feature Engineering and Model Training in a Pipeline

Scikit-learn pipelines also support combining feature engineering steps with model training. For example, you can add polynomial features or perform feature selection within a pipeline. In this case, the pipeline will:

Preprocess the data
Add polynomial features
Select the top 3 features
Train the classifier
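The four steps above can be sketched as follows. The eight-row dataset is an assumption for illustration, and SelectKBest is left on its default scoring function (f_classif). On data this small, some polynomial terms are constant, so scikit-learn may emit warnings while scoring them; the pipeline still fits.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Assumed toy data and labels, for illustration only.
X = pd.DataFrame({
    "numerical": [1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 7.0, 8.5],
    "categorical": ["A", "B", "A", "B", "B", "A", "B", "A"],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["numerical"]),
    ("cat", OneHotEncoder(), ["categorical"]),
])

fe_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),          # 1. preprocess the data
    ("poly_features", PolynomialFeatures()), # 2. add degree-2 polynomial terms
    ("select", SelectKBest(k=3)),            # 3. keep the 3 highest-scoring features
    ("classifier", RandomForestClassifier()),# 4. train the classifier
])

fe_pipeline.fit(X, y)

# Boolean mask over the polynomial features; exactly 3 entries are True.
selected_mask = fe_pipeline.named_steps["select"].get_support()
print(selected_mask.sum())
```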

Output:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['numerical']),
                                                 ('cat', OneHotEncoder(),
                                                  ['categorical'])])),
                ('poly_features', PolynomialFeatures()),
                ('select', SelectKBest(k=3)),
                ('classifier', RandomForestClassifier())])

In this example, the model is fitted after polynomial features are added and the top three features are chosen.

Handling Missing Values in a Pipeline with Different Inputs

Managing missing data in both numerical and categorical columns is essential in any preprocessing workflow. You can handle missing values using sklearn's SimpleImputer in combination with ColumnTransformer.
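A sketch of this combination is given below. The dataset with deliberately missing entries is an assumption for illustration; the structure (a plain SimpleImputer for the numerical column, and a nested Pipeline of imputer plus encoder for the categorical column) follows the representation shown in the output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed toy data with missing entries in both columns.
X = pd.DataFrame({
    "numerical": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
    "categorical": pd.Series(["A", "B", np.nan, "B", "A", "B"], dtype=object),
})
y = [0, 1, 0, 1, 0, 1]

# Categorical columns need two steps, so they get their own small pipeline.
categorical_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(), ["numerical"]),  # default strategy is the mean
    ("cat", categorical_steps, ["categorical"]),
])

missing_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier()),
])
missing_pipeline.fit(X, y)

# The fitted numerical imputer stores the mean of the observed values.
numeric_fill = missing_pipeline.named_steps["preprocessor"].named_transformers_["num"].statistics_[0]
print(numeric_fill)
```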

Output:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', SimpleImputer(),
                                                  ['numerical']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['categorical'])])),
                ('classifier', RandomForestClassifier())])

Explanation:

Imputation: The mean is used to fill in missing numerical values, and the most common category is used to fill in missing categorical values.
OneHotEncoding: OneHotEncoder transforms the categorical features into a binary vector format.
RandomForestClassifier: The RandomForestClassifier receives the preprocessed categorical and numerical features and is trained on them.

These examples show how sklearn.pipeline.Pipeline can streamline your workflow by bringing together preprocessing, feature engineering, and model training in a single, consistent object.

Advanced Pipeline Techniques in Sklearn

While the basic use of pipelines streamlines machine learning workflows, there are more advanced techniques to enhance flexibility and performance, especially when dealing with complex datasets and feature engineering processes. Below, we explore two of them: FeatureUnion and extracting information from pipelines.

1. FeatureUnion

When dealing with the same dataset, you might want to apply multiple transformations and combine the results. This is where FeatureUnion comes into play. It allows you to concatenate the output of multiple transformers into a single dataset, which is useful when you're extracting different types of features.

Example usage:
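The original listing is not available, so the following is a sketch under assumptions. It uses the Iris dataset (an assumption; the article does not name the data) with the step names and settings that appear in the output later in this section.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Both transformers receive the same input; their outputs are concatenated
# column-wise into a single feature matrix.
combined_features = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=2)),
    ("select_best", SelectKBest(k=2)),
])

union_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_union", combined_features),
    ("classifier", RandomForestClassifier()),
])
union_pipeline.fit(X, y)

# Slicing off the classifier shows what the model actually receives:
# 2 PCA components + 2 selected original features = 4 columns.
combined_shape = union_pipeline[:-1].transform(X).shape
print(combined_shape)
```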

In this example, PCA is used to reduce dimensionality, while SelectKBest selects the best features based on a scoring function. These are combined using FeatureUnion and followed by model training.

2. Extracting Information from a Pipeline

Once a pipeline is built, you may want to access or inspect specific components. This can be useful for retrieving feature importances or modifying a certain step of the pipeline.

Each step of the pipeline can be accessed using the named_steps attribute. For example, to retrieve the scaler and classifier:
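As a sketch, the snippet below rebuilds a scaler, feature union, and classifier pipeline so that it runs on its own, then looks up each step by name. The Iris dataset is an assumption, and because the random forest is not seeded, the feature importances will differ from run to run and from the values shown in the output.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

inspect_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_union", FeatureUnion(transformer_list=[
        ("pca", PCA(n_components=2)),
        ("select_best", SelectKBest(k=2)),
    ])),
    ("classifier", RandomForestClassifier()),
])
inspect_pipeline.fit(X, y)

# named_steps maps each step name to its fitted estimator.
scaler = inspect_pipeline.named_steps["scaler"]
feature_union = inspect_pipeline.named_steps["feature_union"]
classifier = inspect_pipeline.named_steps["classifier"]

print("Scaler used in pipeline:", scaler)
print("Feature union used in pipeline:", feature_union)
print("Classifier used in pipeline:", classifier)

# One importance per column produced by the feature union (4 in total).
importances = classifier.feature_importances_
print("Feature importances:", importances)
```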

Output:

Scaler used in pipeline: StandardScaler()
Feature union used in pipeline: FeatureUnion(transformer_list=[('pca', PCA(n_components=2)),
                               ('select_best', SelectKBest(k=2))])
Classifier used in pipeline: RandomForestClassifier()
Feature importances: [0.32040734 0.02322668 0.34747261 0.30889336]

Best Practices for Working with Pipelines

Here are some best practices for efficiently using pipelines with multiple input types:

Use ColumnTransformer for Multiple Inputs: Apply appropriate transformations for each input type using ColumnTransformer to ensure that preprocessing is done correctly.
Handle Missing Values: Use SimpleImputer to handle missing data for both numerical and categorical columns.
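One-Hot Encode Categorical Variables: Encode categorical variables (for example with OneHotEncoder) so that models that require numerical input can use them, and set handle_unknown='ignore' if new categories may appear at prediction time.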
Leverage GridSearchCV for Hyperparameter Tuning: Combine GridSearchCV with pipelines to optimize preprocessing and model parameters together.
Avoid Data Leakage: Fit transformations on the training data only, then apply the already fitted transformations to the test data during evaluation, so that the test data remains unseen during the training phase.
Modularity: Keep pipelines modular to make them easier to maintain and update.
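The GridSearchCV practice above relies on a naming convention: a parameter of a pipeline step is addressed as the step name, two underscores, and the parameter name. The sketch below illustrates this; the dataset, the pipeline, and the parameter grid are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

search_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Keys follow the pattern <step name>__<parameter name>, so preprocessing
# and model settings are tuned in the same search.
param_grid = {
    "scaler__with_mean": [True, False],
    "classifier__n_estimators": [10, 50],
    "classifier__max_depth": [None, 3],
}

# The whole pipeline is refitted on each training fold, so the scaler is
# never fitted on the validation fold.
search = GridSearchCV(search_pipeline, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```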

Conclusion

Scikit-learn pipelines are an effective tool for handling complex, multi-step machine learning workflows that involve data preprocessing, feature engineering, and model training. By chaining these stages together, pipelines simplify the code and reduce the risk of data leakage, because data transformations are executed consistently across training and testing sets.

They are especially useful because they make it possible to integrate diverse preprocessing approaches seamlessly when working with datasets that have different input kinds (text, numerical, and categorical, for example).
In the end, pipelines improve the reproducibility, efficiency, and ease of maintenance of your machine learning process.

URL: https://www.geeksforgeeks.org/machine-learning/fitting-different-inputs-into-an-sklearn-pipeline/