Scikit-learn's Pipeline class is a tool that links together many processes, including data preprocessing, feature engineering, and model training, to simplify and streamline the machine learning workflow. Each pipeline step is applied in sequence, which guarantees consistent data transformation throughout training and testing. Pipelines are especially helpful because they enforce best practices, such as making sure transformations are learned only from training data to prevent data leakage.
Before diving into the specifics of handling multiple inputs, it's essential to understand the fundamental structure of an sklearn pipeline. A pipeline in sklearn is a sequence of data processing steps, each of which is an instance of a Transformer or an Estimator. These steps are executed in a linear sequence, with the output of one step serving as the input to the next.
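As a minimal sketch of that linear structure, the snippet below chains one transformer and one estimator. The toy data, the step names, and the choice of LogisticRegression are assumptions made for illustration only; they do not come from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 6 samples, 2 features, binary labels.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) pair; every step except the last
# must be a transformer, and the last can be any estimator.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits and applies the scaler, then fits the model on the scaled data.
pipe.fit(X, y)

# predict() reuses the scaling learned during fit before calling the model.
predictions = pipe.predict(X)
```

Because the scaler's statistics are learned inside fit, the same fitted scaling is reused at prediction time without any extra bookkeeping.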
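Key benefits of pipeline use include consistent transformations across training and test data, protection against data leakage, and simpler code.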
In machine learning, datasets frequently contain several types of input features, each needing its own preprocessing before model training. The input kinds covered here are numerical features and categorical features.
Each of these kinds requires a distinct approach, and sklearn offers a wide range of preprocessing tools that can be integrated into a pipeline to handle them.
Sometimes, the built-in transformers may not fully meet the requirements of your data. In such cases, you can create a custom transformer by subclassing TransformerMixin and BaseEstimator. For example, we might want a custom transformer that applies specific preprocessing for different input types.
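The original listing is not preserved here, so the following is only a sketch of a custom transformer. The class name AddConstantTransformer, its constant parameter, and the input array are hypothetical, chosen to illustrate the subclassing pattern described above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstantTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds a fixed constant to every value."""

    def __init__(self, constant=5):
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so the step can be chained.
        return self

    def transform(self, X):
        return np.asarray(X) + self.constant


# Assumed toy input, not taken from the article.
X = np.array([[1, 2], [3, 4], [5, 6]])

# fit_transform is provided by TransformerMixin.
transformed = AddConstantTransformer(constant=5).fit_transform(X)
print(transformed)
```

BaseEstimator supplies get_params and set_params, which is what lets a custom step take part in parameter searches alongside built-in transformers.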
Output:
[[ 6 7]
[ 8 9]
[10 11]]
The ColumnTransformer is a powerful tool in sklearn that allows you to apply different preprocessing steps to different columns in your dataset. This is particularly helpful when dealing with datasets that contain both numerical and categorical data.
For instance, you can scale numerical features and one-hot encode categorical features in the same pipeline using ColumnTransformer.
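A sketch of that setup follows. The column names numerical and categorical match the printed results later in the article, while the four-row DataFrame itself is an assumed toy dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data: one numerical and one categorical column.
data = pd.DataFrame({
    "numerical": [1, 2, 3, 4],
    "categorical": ["A", "B", "A", "B"],
})

# Each entry is (name, transformer, columns the transformer applies to).
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["numerical"]),
    ("cat", OneHotEncoder(), ["categorical"]),
])

# The scaled column and the one-hot columns are concatenated side by side.
transformed = preprocessor.fit_transform(data)
print(transformed)
```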
Output:
[[-1.34164079 1. 0. ]
[-0.4472136 0. 1. ]
[ 0.4472136 1. 0. ]
[ 1.34164079 0. 1. ]]
This ensures that each type of feature receives the appropriate preprocessing.
Combining both categorical and numerical features into a single pipeline is straightforward with ColumnTransformer. Let's walk through an example where we preprocess numerical and categorical features and train a RandomForestClassifier.
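The sketch below shows one way this could look. The step names preprocessor, num, cat, and classifier follow the pipeline representation printed in the article; the six-row dataset and the target column are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data with a binary target.
data = pd.DataFrame({
    "numerical": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "categorical": ["A", "B", "A", "B", "A", "B"],
    "target": [0, 0, 0, 1, 1, 1],
})
X = data[["numerical", "categorical"]]
y = data["target"]

# Scale the numerical column and one-hot encode the categorical column.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["numerical"]),
    ("cat", OneHotEncoder(), ["categorical"]),
])

# Preprocessing and the model are chained into a single estimator.
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier()),
])

# fit() returns the fitted pipeline itself.
pipeline.fit(X, y)
```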
Output:
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', StandardScaler(),
['numerical']),
('cat', OneHotEncoder(),
['categorical'])])),
('classifier', RandomForestClassifier())])
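Calling fit preprocesses the input and trains the RandomForestClassifier on it. In a notebook, the fitted pipeline returned by fit is displayed as shown above; in a script, nothing is printed unless you print the pipeline explicitly.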
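Scikit-learn pipelines also support combining feature engineering steps with model training. For example, you can add polynomial features or perform feature selection within a pipeline. In this case, the pipeline will scale the numerical column and one-hot encode the categorical column, generate polynomial features from the result, keep the three highest-scoring features with SelectKBest, and train a RandomForestClassifier on them.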
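A sketch of such a pipeline is given below, using the step names from the printed pipeline representation (poly_features, select, classifier). The dataset is an assumed toy example, and SelectKBest is left on its default scoring function.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Assumed toy data with a binary target.
data = pd.DataFrame({
    "numerical": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "categorical": ["A", "B", "A", "B", "A", "B"],
    "target": [0, 0, 0, 1, 1, 1],
})
X = data[["numerical", "categorical"]]
y = data["target"]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["numerical"]),
    ("cat", OneHotEncoder(), ["categorical"]),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    # Degree-2 polynomial terms (including a bias column) of the 3 preprocessed columns.
    ("poly_features", PolynomialFeatures()),
    # Keep the 3 highest-scoring features; the default scorer may warn
    # about constant columns such as the bias term.
    ("select", SelectKBest(k=3)),
    ("classifier", RandomForestClassifier()),
])

pipeline.fit(X, y)

# Boolean mask over the polynomial features showing which ones were kept.
selected_mask = pipeline.named_steps["select"].get_support()
```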
Output:
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', StandardScaler(),
['numerical']),
('cat', OneHotEncoder(),
['categorical'])])),
('poly_features', PolynomialFeatures()),
('select', SelectKBest(k=3)),
('classifier', RandomForestClassifier())])
In this example, the model is fitted after polynomial features are added and the top three features are chosen.
Managing missing data in both numerical and categorical columns is essential in any preprocessing workflow. You can handle missing values using sklearn's SimpleImputer in combination with ColumnTransformer.
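One possible arrangement is sketched below. The nested step names imputer and encoder follow the printed pipeline representation; the dataset, including where the missing values fall, is assumed for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed toy data with missing values in both columns.
data = pd.DataFrame({
    "numerical": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "categorical": pd.Series(["A", "B", np.nan, "B", "A", "B"], dtype=object),
    "target": [0, 0, 0, 1, 1, 1],
})
X = data[["numerical", "categorical"]]
y = data["target"]

# Categorical branch: fill gaps with the most frequent value, then encode.
categorical_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    # SimpleImputer defaults to filling gaps with the column mean.
    ("num", SimpleImputer(), ["numerical"]),
    ("cat", categorical_steps, ["categorical"]),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier()),
])
pipeline.fit(X, y)

# Inspect the preprocessed matrix to confirm no missing values remain.
preprocessed = pipeline.named_steps["preprocessor"].transform(X)
```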
Output:
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', SimpleImputer(),
['numerical']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('encoder',
OneHotEncoder(handle_unknown='ignore'))]),
['categorical'])])),
('classifier', RandomForestClassifier())])
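Explanation: the numerical column is imputed with SimpleImputer's default strategy (the column mean), the categorical column is imputed with its most frequent value and then one-hot encoded with handle_unknown='ignore', and the classifier is trained on the combined result.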
These examples show how sklearn's Pipeline can streamline your workflow by bringing together preprocessing, feature engineering, and model training in a unified, effective way.
While the basic use of pipelines streamlines machine learning workflows, there are more advanced techniques to enhance flexibility and performance, especially when dealing with complex datasets and feature engineering processes. Below, we explore advanced methods such as FeatureUnion and extracting information from pipelines.
When dealing with the same dataset, you might want to apply multiple transformations and combine the results. This is where FeatureUnion comes into play. It allows you to concatenate the output of multiple transformers into a single dataset, which is useful when you're extracting different types of features.
Example usage:
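The original listing is not preserved, so this is a sketch under stated assumptions: the iris dataset and the step names scaler, feature_union, and classifier are chosen for illustration, while the pca and select_best transformers mirror the output printed later in the article.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed dataset: iris has 150 samples and 4 numerical features.
X, y = load_iris(return_X_y=True)

# Both transformers see the same input; their outputs are concatenated column-wise.
feature_union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=2)),
    ("select_best", SelectKBest(k=2)),
])

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_union", feature_union),
    ("classifier", RandomForestClassifier()),
])
pipeline.fit(X, y)

# Slicing off the classifier exposes the combined feature matrix:
# 2 PCA components plus 2 selected features.
combined = pipeline[:-1].transform(X)
print(combined.shape)  # (150, 4)
```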
In this example, PCA is used to reduce dimensionality, while SelectKBest selects the best features based on a scoring function. These are combined using FeatureUnion and followed by model training.
Once a pipeline is built, you may want to access or inspect specific components. This can be useful for retrieving feature importances or modifying a certain step of the pipeline.
Each step of the pipeline can be accessed through the named_steps attribute. For example, to retrieve the scaler, the feature union, and the classifier, along with the classifier's feature importances:
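A sketch of that inspection is shown below. It rebuilds a FeatureUnion pipeline on the iris dataset so the snippet is self-contained; the dataset and the step names scaler, feature_union, and classifier are assumptions. Feature importances vary from run to run because the random forest is not seeded.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_union", FeatureUnion(transformer_list=[
        ("pca", PCA(n_components=2)),
        ("select_best", SelectKBest(k=2)),
    ])),
    ("classifier", RandomForestClassifier()),
])
pipeline.fit(X, y)

# named_steps maps each step name to its fitted estimator.
scaler = pipeline.named_steps["scaler"]
feature_union = pipeline.named_steps["feature_union"]
classifier = pipeline.named_steps["classifier"]

print("Scaler used in pipeline:", scaler)
print("Feature union used in pipeline:", feature_union)
print("Classifier used in pipeline:", classifier)

# One importance per column produced by the feature union (2 PCA + 2 selected).
importances = classifier.feature_importances_
print("Feature importances:", importances)
```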
Output:
Scaler used in pipeline: StandardScaler()
Feature union used in pipeline: FeatureUnion(transformer_list=[('pca', PCA(n_components=2)),
('select_best', SelectKBest(k=2))])
Classifier used in pipeline: RandomForestClassifier()
Feature importances: [0.32040734 0.02322668 0.34747261 0.30889336]
Here are some best practices for efficiently using pipelines with multiple input types:
Use ColumnTransformer to ensure that preprocessing is done correctly.
Use SimpleImputer to handle missing data for both numerical and categorical columns.
Use GridSearchCV with pipelines to optimize preprocessing and model parameters together.
Scikit-learn pipelines are an effective tool for handling multi-step, complex machine learning workflows that involve data preprocessing, feature engineering, and model training. By chaining these stages together, pipelines ensure that data transformations are applied consistently across training and testing sets, which reduces the risk of data leakage and simplifies the code.
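The GridSearchCV practice mentioned above can be sketched as follows. The dataset, the parameter grid, and the choice of three folds are assumptions for illustration; parameter names are built by joining step names with double underscores.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data: 12 rows so that 3-fold stratified CV has both classes in each fold.
data = pd.DataFrame({
    "numerical": [float(i) for i in range(1, 13)],
    "categorical": ["A", "B"] * 6,
    "target": [0] * 6 + [1] * 6,
})
X = data[["numerical", "categorical"]]
y = data["target"]

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["numerical"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["categorical"]),
    ])),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# One preprocessing parameter and one model parameter are searched together.
param_grid = {
    "preprocessor__num__with_mean": [True, False],
    "classifier__n_estimators": [10, 50],
}

# The whole pipeline is refit inside each fold, so preprocessing
# never sees the held-out data.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```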