This tutorial will teach you how and when to use all the advanced tools from the Sklearn Pipelines ecosystem to build custom, scalable, and modular machine learning models that can easily be deployed in production.
Plenty of material covers the individual components of the Sklearn Pipelines toolbox in isolation. I am writing this tutorial because it is valuable to see how all those components work together in a single, more complex system.
I will use a concrete example and show you how and when to use the following components: TransformerMixin, BaseEstimator, FunctionTransformer, ColumnTransformer, FeatureUnion, and TransformedTargetRegressor.
Knowing how to use them individually is easy. That is why this tutorial emphasizes when to use them and how to combine them in a complex system.
Goal
We will build a forecasting model to predict the following year’s global mean wheat yield.
The main focus will be on the advanced concepts of the Sklearn Pipeline components. Therefore, we won’t spend much time on other data science principles.
Table of Contents
Dataset
Summary of Pipeline Fundamentals
Configuration
Data Preparation
Building the Pipeline
Global Pipeline. Let’s Put Things Together.
How to Use the Global Pipeline
NOTE: If you are interested only in the advanced topics of Sklearn Pipelines, skip directly to Building the Pipeline.
Dataset
We are using a publicly available dataset [1] provided by Pangaea, which tracks global historical yearly yields for various plants from 1981 to 2016.
We found the dataset using this GitHub Repository. Check it out for more awesome publicly available datasets.
The dataset provides multiple types of crops, but for this example, we will use only wheat.
Summary of Pipeline Fundamentals
Here is a short reminder of the main principles used by the Sklearn Pipelines ecosystem.
Everything revolves around the Pipeline object.
A Pipeline contains multiple Estimators.
An Estimator can have the following properties:
learns from the data → using the fit() method
transforms the data → using the transform() method. Also known as a Transformer (no, not the robots, it is a subclass of an Estimator).
predicts from new data → using the predict() method. Also known as a Predictor.
NOTE 1: We can have Transformers whose fit() method learns nothing from the data. Such classes are not parameterized by the data and follow the principles of a pure function. Usually, those types of transformers are helpful when doing feature engineering (e.g., we can multiply two different columns without first learning anything in fit()).
NOTE 2: The Pipeline object inherits the methods from the last Estimator within the Pipeline.
NOTE 3: If you want to add a model to your Pipeline, it must be the last element. I will show you a trick on how to perform postprocessing operations on the model’s predictions using TransformedTargetRegressor.
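To make these fundamentals concrete, here is a minimal sketch of a two-step Pipeline. The toy data and the step names ("scaler", "model") are assumptions made for illustration and are not part of the wheat yield example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration): y is an exact linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = 2.0 * X[:, 0] + 1.0

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),   # Transformer: fit() + transform()
        ("model", LinearRegression()),  # Predictor: fit() + predict(), placed last
    ]
)

# fit() fits the scaler, transforms X, then fits the model on the scaled data.
pipeline.fit(X, y)

# predict() reuses the fitted scaler before calling the model's predict().
predictions = pipeline.predict(X)
print(predictions.shape)
```

Because the last step is a Predictor, the Pipeline itself exposes predict(), which is the behavior described in NOTE 2.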
Configuration
Render Pipelines as Diagrams
By setting the display option of the Sklearn configuration to "diagram", we can quickly visualize a Pipeline as a diagram.
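A minimal sketch of that setting, assuming a notebook environment where estimators are rendered as rich output:

```python
from sklearn import get_config, set_config

# Render estimators and Pipelines as interactive diagrams in notebooks.
set_config(display="diagram")

# The active configuration can be inspected at any time.
print(get_config()["display"])
```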
Below we will define a few constants that we will use across the code.
Data Preparation
Pick Ground Truth
Time series forecasting derives both the features and the labels from the same series (often described as self-supervised rather than unsupervised learning): we have to predict a data point at Tₙ using information from the past. Therefore, in the beginning, we will take the features and labels as the same time series. But during the preprocessing steps, we will use the features as past data points and the label as the data point we want to predict.
This is not a time series forecasting post. Therefore, don’t overthink this step. Also, you don’t have to read the code line by line.
Just focus on the big picture and on the Pipeline steps. It is enough to understand the end goal of how and when to use specific Sklearn components.
X, y = yields.copy(), yields.copy()
Split Data: Train & Test
Now we will split the data into train and test sets. You will see how easy it is to use your model on new data when Sklearn Pipelines are used correctly.
Let’s see how the train-test splits look:
(Starting year of the split, Ending year of the split, number of years within the split)
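A chronological split of this kind could be sketched as follows. The synthetic yields DataFrame, the column name "mean_yield", the cut-off year in TEST_START_YEAR, and the describe() helper are all assumptions made for illustration, not the values used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wheat yields: one value per year, 1981-2016.
years = np.arange(1981, 2017)
yields = pd.DataFrame(
    {"mean_yield": np.linspace(2.5, 4.0, len(years))}, index=years
)

X, y = yields.copy(), yields.copy()

TEST_START_YEAR = 2010  # hypothetical cut-off year

# Time series must be split chronologically, never shuffled.
X_train, X_test = X.loc[X.index < TEST_START_YEAR], X.loc[X.index >= TEST_START_YEAR]
y_train, y_test = y.loc[y.index < TEST_START_YEAR], y.loc[y.index >= TEST_START_YEAR]


def describe(split: pd.DataFrame) -> tuple:
    """Return (starting year, ending year, number of years) of a split."""
    return int(split.index.min()), int(split.index.max()), len(split)


print(describe(X_train))  # (1981, 2009, 29)
print(describe(X_test))   # (2010, 2016, 7)
```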
Building the Pipeline
Enough talking. Let’s start implementing the actual Pipeline.
The global Pipeline is divided into the following subcomponents:
stationarity pipeline (used both on the features and targets)
feature engineering pipeline
regressor pipeline
target pipeline
1. Stationarity Pipeline
The Pipeline is used on a time series to make it stationary. More concretely, it will remove periodicity and standardize the mean and variance across time. Here you can read more about this.
In this step, we will show you two ways to implement a custom transformation:
1️⃣ By inheriting from BaseEstimator + TransformerMixin. Using this approach, the pipeline unit can learn from the data, transform it, and reverse the transformation. Here is a short description of the supported interface:
fit(X, y) – used to learn from the data
transform(X) – used to transform the data
fit_transform(X) – learns from and transforms the data in one call. This method is inherited from TransformerMixin
inverse_transform(X) – used to reverse the transformation
Note 1: This round trip should always hold: x == inverse_transform(transform(x)), with a small tolerance accepted.
Note 2: The targets (e.g., y) are passed only to the fit() method. The transform() and inverse_transform() methods receive only the features (e.g., X) as input. Also, we can’t force the Pipeline to pass any other arguments.
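The interface above can be sketched with a small stateful transformer. The class name MeanCenterTransformer and its logic are hypothetical and chosen only to show the four methods; this is not one of the transformers used in the tutorial.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: subtract the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learn the state from the data; y is accepted but unused.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

    def inverse_transform(self, X):
        return np.asarray(X, dtype=float) + self.mean_


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

transformer = MeanCenterTransformer()
X_centered = transformer.fit_transform(X)  # provided by TransformerMixin
X_restored = transformer.inverse_transform(X_centered)

# The round trip from Note 1 holds up to floating point tolerance.
print(np.allclose(X, X_restored))
```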
2️⃣ Writing a pure function that is ultimately wrapped by FunctionTransformer.
This approach is practical when the transformation does not need any state (there is nothing to learn in fit()) and you don’t need to write your own inverse_transform() method. Therefore, it is useful when you need to implement just the transform() logic.
This must be a pure function (for input A, you always get output B → it doesn’t depend on the external context). Otherwise, you will encounter strange behavior when your Pipeline gets bigger.
Note: As we implement only a transformation, we have access only to the features (e.g., X). We can’t access any other attributes.
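As a minimal sketch of this second approach, the pure function below multiplies two columns and is wrapped by FunctionTransformer. The function name multiply_columns and the column names "a" and "b" are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def multiply_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Pure function: the output depends only on the input, which is left untouched."""
    X = X.copy()  # avoid side effects on the caller's DataFrame
    X["a_times_b"] = X["a"] * X["b"]
    return X


transformer = FunctionTransformer(multiply_columns)

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
X_new = transformer.fit_transform(X)  # fit() learns nothing here

print(X_new["a_times_b"].tolist())  # [10.0, 40.0, 90.0]
```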
As an observation, we chose to write a class for the transformation even though it didn’t need to implement the fit() method (e.g., LogTransformer). It is good software practice to pack the transformation and its inverse into the same structure.
We leveraged the partial function from the Python functools module to configure the transformations.
As stated earlier, the function given to FunctionTransformer should take only one input and return only one output. Also, since it is not a class, we cannot configure it through a constructor. Using partial, we can fix a subset of a function’s parameters: it wraps our initial function and returns another function that, at the next call, needs as input only the parameters not specified in partial.
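Here is a minimal sketch of that pattern. The moving_average function, its window parameter, and the column name are assumptions made for illustration.

```python
from functools import partial

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def moving_average(X: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling mean over the last `window` rows (fewer rows at the start)."""
    return X.rolling(window=window, min_periods=1).mean()


# partial() fixes `window`, leaving a function that takes only X.
moving_average_3 = FunctionTransformer(partial(moving_average, window=3))

X = pd.DataFrame({"mean_yield": [1.0, 2.0, 3.0, 4.0]})
X_smoothed = moving_average_3.fit_transform(X)

print(X_smoothed["mean_yield"].tolist())  # [1.0, 1.5, 2.0, 3.0]
```

FunctionTransformer also accepts a kw_args dictionary that is forwarded to the function, which can serve the same purpose as partial.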
Finally, let’s build the Pipeline. We have used the make_pipeline() utility function that automatically names every pipeline step.
Note that you don’t have to understand the implementation of every function. Focus on the structure.
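To show only the structure and the automatic step naming, here is a minimal sketch. The chosen steps are placeholders and not the stationarity transformations used in this tutorial.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Placeholder steps (assumptions): two stateless functions and one scaler.
pipeline = make_pipeline(
    FunctionTransformer(np.log1p),
    FunctionTransformer(np.sqrt),
    StandardScaler(),
)

# make_pipeline() names each step after its lowercased class name,
# adding a numeric suffix when the same class appears more than once.
step_names = [name for name, _ in pipeline.steps]
print(step_names)
# ['functiontransformer-1', 'functiontransformer-2', 'standardscaler']

X = np.array([[1.0], [4.0], [9.0], [16.0]])
X_transformed = pipeline.fit_transform(X)
```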
For most of the transformations, we used pure functions + FunctionTransformer. We used this approach because we are not interested in implementing fit() or inverse_transform(). Therefore, using this method, our code is slimmer and cleaner.
Only DropRowsTransformer is implemented with a class because it needs both fit() and inverse_transform().
From my experience, when implementing data & feature engineering pipelines, I usually find FunctionTransformers more useful and cleaner. I don’t think it is good practice to inherit classes and leave most methods empty.
2. Feature Engineering Pipeline
Now let’s get to the sweet part, where we will use make_column_transformer and make_union:
Using make_column_transformer, we can run different operations/pipelines on subsets of the columns. In this concrete example, we run different transformations on the "mean_yield" and the "locations" columns. Another sweet thing about this operator is that it can run the operations for every set of columns in parallel (controlled by its n_jobs parameter). Therefore, the features for "mean_yield" and "locations" can be computed in parallel.
Using make_union, we can compute multiple features in parallel (again controlled by n_jobs). In this example, for "mean_yield", we are calculating four different features:
past observations
moving average
moving standard deviation
moving median
NOTE 1: The same principle applies to the "locations" feature.
NOTE 2: I recommend using the make_[x] utility function. It will make your life easier.
In the snippet below, you can see those components in action. Following the Sklearn Pipelines paradigm, look how nicely we reused most of the functionality across the Pipeline.
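As a rough illustration of how these two utilities compose, here is a minimal sketch. The helper functions, the window and lag values, and the toy data are hypothetical and do not reproduce the exact implementation from this tutorial.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer


def past_observations(X: pd.DataFrame, lag: int) -> pd.DataFrame:
    return X.shift(lag).bfill()  # back-fill the first rows that have no past


def moving_average(X: pd.DataFrame, window: int) -> pd.DataFrame:
    return X.rolling(window, min_periods=1).mean()


def moving_std(X: pd.DataFrame, window: int) -> pd.DataFrame:
    return X.rolling(window, min_periods=1).std().fillna(0.0)


def moving_median(X: pd.DataFrame, window: int) -> pd.DataFrame:
    return X.rolling(window, min_periods=1).median()


def build_feature_union(window: int = 3):
    """Four features per column, concatenated side by side."""
    return make_union(
        FunctionTransformer(partial(past_observations, lag=1)),
        FunctionTransformer(partial(moving_average, window=window)),
        FunctionTransformer(partial(moving_std, window=window)),
        FunctionTransformer(partial(moving_median, window=window)),
        # n_jobs=... could be passed here to compute the features in parallel
    )


# The same union is reused for both column subsets.
feature_engineer_pipeline = make_column_transformer(
    (build_feature_union(), ["mean_yield"]),
    (build_feature_union(), ["locations"]),
)

X = pd.DataFrame(
    {
        "mean_yield": [2.9, 3.0, 3.1, 3.3, 3.2, 3.4],
        "locations": [100.0, 110.0, 105.0, 120.0, 125.0, 130.0],
    }
)
features = np.asarray(feature_engineer_pipeline.fit_transform(X))
print(features.shape)  # (6, 8): 4 features for each of the 2 columns
```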
Another essential element is the memory="cache" argument. With it, the fitted transformer steps are cached on the local disk. Therefore, on new runs, if a step’s output is already cached, it is read automatically from the cache. The cache is also invalidated when the step’s parameters or its input data change.
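A minimal sketch of the caching mechanism, assuming a temporary directory instead of the "cache" folder and placeholder steps instead of the real feature pipeline:

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical cache location; any writable directory path works.
cache_dir = tempfile.mkdtemp()

pipeline = make_pipeline(StandardScaler(), LinearRegression(), memory=cache_dir)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

pipeline.fit(X, y)  # fits the scaler and stores the result on disk
pipeline.fit(X, y)  # same data and parameters: the fitted scaler is read from the cache

cached_entries = os.listdir(cache_dir)
print(len(cached_entries) > 0)
```

Only the transformer steps are cached; the final estimator is fitted on every call.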
Now, with minimum effort, by running all the transformations in parallel and caching the outputs of your pipeline, your machine learning pipeline will be blazing fast.
3. Regressor Pipeline
Below you can see how to build the final regressor. On top of the feature_engineer_pipeline presented above, we stacked a scaler and a Random Forest model.
You can observe that the model is the last unit added to the pipeline.
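The structure could be sketched like this. The identity FunctionTransformer stands in for the real feature engineering pipeline, and the Random Forest settings and toy data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Placeholder (assumption): an identity transform instead of the real feature pipeline.
feature_engineer_pipeline = FunctionTransformer()

regressor_pipeline = make_pipeline(
    feature_engineer_pipeline,
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),  # the model goes last
)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] + 0.1 * rng.normal(size=40)

regressor_pipeline.fit(X, y)
predictions = regressor_pipeline.predict(X)
print(predictions.shape)
```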
4. Target Pipeline
The target_pipeline is used to preprocess the labels and postprocess the model’s predictions.
Global Pipeline. Let’s Put Things Together.
The TransformedTargetRegressor class takes the following arguments:
regressor: Takes as input the regressor_pipeline defined above. This is the kind of pipeline we are used to seeing in most projects.
transformer: Takes as input the target_pipeline, which preprocesses the labels the model uses as ground truth, using the fit() and transform() methods defined within that Pipeline. Also, when making predictions, it postprocesses the output of the model by calling inverse_transform(). How awesome is that? Finally, a method to pack all the steps into a single logical unit.
How to Use the Global Pipeline
The excellent part is that making predictions is also a one-liner. Call pipeline.predict(X), and that’s it. You have your predictions.
Using TransformedTargetRegressor, the predictions are already transformed back to their initial scale when calling predict. Therefore, the model/pipeline is highly compact and easy to deploy in various scenarios: batch, API, streaming, embedded, etc.
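Here is a minimal sketch of the whole idea on toy data. The log/exp target transformation and the linear model are assumptions chosen to keep the example small; they are not the target_pipeline or the regressor from this tutorial.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Stand-ins (assumptions) for the regressor and target pipelines.
regressor_pipeline = make_pipeline(StandardScaler(), LinearRegression())
target_pipeline = make_pipeline(
    FunctionTransformer(np.log, inverse_func=np.exp),
)

pipeline = TransformedTargetRegressor(
    regressor=regressor_pipeline,
    transformer=target_pipeline,
)

# Toy data: the target is strictly positive and linear in log space.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.exp(0.5 * X[:, 0] - 0.2 * X[:, 1] + 1.0)

pipeline.fit(X, y)                 # y is log-transformed before the regressor sees it
predictions = pipeline.predict(X)  # predictions are mapped back with exp

print(np.allclose(predictions, y))
```

The caller never touches the transformed target: both the preprocessing of the labels and the postprocessing of the predictions happen inside fit() and predict().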
Another helpful feature is that you can quickly use GridSearch (or other techniques) on your features and model. Now you can consider different configurations from your data pipeline as hyper-parameters. Therefore, you can quickly experiment with various features with only a few lines of code.
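As a sketch of this idea, the grid below searches over one preprocessing parameter and one model parameter at the same time. The steps, the parameter values, and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=60)

pipeline = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "polynomialfeatures__degree": [1, 2],  # data pipeline "hyper-parameter"
    "ridge__alpha": [0.1, 1.0],            # model hyper-parameter
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # keeps the folds in chronological order
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

When the pipeline is wrapped in a TransformedTargetRegressor, the same naming scheme applies with an extra regressor__ prefix.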
We can observe that it is doing decent work using a simple model and without any fine-tuning at all. An RMSE of ~0.13 on a scale of ~4.0 is pretty good.
But as stated a few times, this tutorial was about leveraging Sklearn Pipelines, not building an accurate model.
Conclusion
If you’ve gotten this far, you are fantastic. Now you know how to write professional Sklearn Pipelines. Thank you for reading my article!
Using a concrete example, we showed how powerful it is to leverage Sklearn Pipelines and their entire stack of components: TransformerMixin, BaseEstimator, FunctionTransformer, ColumnTransformer, FeatureUnion, TransformedTargetRegressor.
Using this approach, we built a flexible machine learning pipeline where we can:
Easily reuse the transformations and compose them in various ways (modular code).
Write clean and scalable classes.
Write blazing-fast code that computes all the features in parallel and caches intermediate checkpoints across the Pipeline.
Directly deploy the model as a simple class without further preprocessing/postprocessing steps.
Quickly perform hyper-parameter tuning on the feature/data pipeline and the model itself.
What other hidden gems do you know about Sklearn Pipelines?
You can find the full implementation of the example used within the tutorial on my GitHub. In that repository, I will constantly push all the examples I will use to write various articles.