
URL: https://towardsdatascience.com/how-to-quickly-design-advanced-sklearn-pipelines-3cc97b59ce16/

How to Quickly Design Advanced Sklearn Pipelines | Towards Data Science



Compose all the components from Scikit-Learn Pipelines to build custom production-ready models.


Photo by Clint Patterson on Unsplash

This tutorial will teach you how and when to use all the advanced tools from the Sklearn Pipelines ecosystem to build custom, scalable, and modular machine learning models that can easily be deployed in production.

Plenty of content covers the individual components of the Sklearn Pipelines toolbox in isolation. I am writing this tutorial because it is valuable to see how all those components work together in a single, more complex system.

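I will use a concrete example and show you how and when to use the following components:

  • TransformerMixin
  • BaseEstimator
  • FunctionTransformer
  • ColumnTransformer
  • FeatureUnion
  • TransformedTargetRegressor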

Using each of them individually is easy. That is why this tutorial emphasizes when to use them and how to combine them in a complex system.

Goal

We will build a forecasting model to predict the following year’s global mean wheat yield.

The main focus will be on the advanced concepts of the Sklearn Pipeline components. Therefore, we won’t spend much time on other data science principles.


Table of Contents

  • Dataset
  • Summary of Pipeline Fundamentals
  • Configuration
  • Data Preparation
  • Building the Pipeline
  • Global Pipeline. Let’s Put Things Together.
  • How to Use the Global Pipeline

NOTE: If you are interested only in the advanced topics of Sklearn Pipelines, skip directly to Building the Pipeline.


Dataset

We are using a publicly available dataset [1] provided by Pangaea, which tracks global historical yearly yields for various plants from 1981 to 2016.

We found the dataset using this GitHub Repository. Check it out for more awesome publicly available datasets.

The dataset covers multiple types of crops, but for this example, we will use only wheat.

Global yearly wheat yield and the number of locations provided within the Pangaea dataset [Image by the Author].

Summary of Pipeline Fundamentals

Here is a short reminder of the main principles used by the Sklearn Pipelines ecosystem.

Everything revolves around the Pipeline object.

A Pipeline contains multiple Estimators.

An Estimator can have the following properties:

  • learns from the data → using the fit() method
  • transforms the data → using the transform() method. Also known as a Transformer (no, not the robots, it is a subclass of an Estimator).
  • predicts from new data → using the predict() method. Also known as a Predictor.

NOTE 1: We can have Transformers that do not need to learn anything in fit(). Those classes are stateless and follow the principles of a pure function. Usually, those types of transformers are helpful when doing feature engineering (e.g., we can multiply two different columns without first learning anything from the data).

NOTE 2: The Pipeline object exposes the methods of the last Estimator within the Pipeline (e.g., predict() if the last step is a Predictor).

NOTE 3: If you want to add a model to your Pipeline, it must be the last element. I will show you a trick on how to perform postprocessing operations on the model’s predictions using TransformedTargetRegressor.
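As a reminder of how these pieces fit together, here is a minimal sketch of a Pipeline with one Transformer and one Predictor. The toy data and the names toy_pipeline and toy_predictions are illustrative assumptions and are not part of the wheat example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data following y = 2 * x + 1 (illustrative, unrelated to the wheat dataset).
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

toy_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),   # Transformer: fit() + transform()
        ("model", LinearRegression()),  # Predictor: fit() + predict(), placed last
    ]
)

# fit() fits every step in order; predict() runs the transformers and
# then calls predict() on the last step.
toy_pipeline.fit(X, y)
toy_predictions = toy_pipeline.predict(X)
```

Because the last step is a Predictor, the Pipeline as a whole can be used like a model.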


Configuration

Render Pipelines as Diagrams

By setting the display of the Sklearn configuration to "diagram," we can quickly visualize the Pipeline as diagrams.
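A minimal sketch of that configuration, using the set_config() function from Scikit-Learn (recent Scikit-Learn versions may already use the diagram display by default):

```python
from sklearn import get_config, set_config

# Render estimators as HTML diagrams in notebooks instead of plain text.
set_config(display="diagram")

# get_config() returns the current global configuration as a dict.
current_display = get_config()["display"]
```

After this call, evaluating a Pipeline object in a notebook cell renders it as a diagram.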

Example of a visualization of a Sklearn Pipeline diagram [Image by the Author].

Constants

Below we will define a few constants that we will use across the code.


Data Preparation

Pick Ground Truth

Time series forecasting is usually framed as supervised learning in which the labels come from the series itself: we predict the data point at Tₙ using information from the past. Therefore, in the beginning, we take the features and labels from the same time series. During the preprocessing steps, the features become the past data points and the label becomes the data point we want to predict.

This is not a time series forecasting post. Therefore, don’t overthink this step. Also, you don’t have to read the code line by line.

Just focus on the big picture and on the Pipeline steps. It is enough to understand the end goal of how and when to use specific Sklearn components.

X, y = yields.copy(), yields.copy()

Split Data: Train & Test

Now we will split the data into train and test sets. You will see how easy it is to use your model on new data when Sklearn Pipelines are used correctly.

Let’s see how the train-test splits look:

(Starting year of the split, Ending year of the split, number of years within the split)
X_train.index.min(), X_train.index.max(), X_train.index.max() - X_train.index.min() + 1
(1981, 2012, 32)
y_train.index.min(), y_train.index.max(), y_train.index.max() - y_train.index.min() + 1
(1986, 2012, 27)
X_test.index.min(), X_test.index.max(), X_test.index.max() - X_test.index.min() + 1
(2007, 2016, 10)
y_test.index.min(), y_test.index.max(), y_test.index.max() - y_test.index.min() + 1
(2012, 2016, 5)

Building the Pipeline

Enough talking. Let’s start implementing the actual Pipeline.

The global Pipeline is divided into the following subcomponents:

  1. stationarity pipeline (used both on the features and targets)
  2. feature engineering pipeline
  3. regressor pipeline
  4. target pipeline

1. Stationarity Pipeline

The Pipeline is used on a time series to make it stationary. More concretely, it will remove periodicity and standardize the mean and variance across time. Here you can read more about this.

In this step, we will show you how to use BaseEstimator, TransformerMixin, FunctionTransformer, and make_pipeline().

We can build a pipeline estimator in two ways:

1️⃣ By inheriting from BaseEstimator + TransformerMixin. Using this approach, the pipeline unit can learn from the data, transform it, and reverse the transformation. Here is a short description of the supported interface:

  • fit(X, y) – used to learn from the data
  • transform(X) – used to transform the data
  • fit_transform(X) – learns from the data and transforms it. This method is inherited from TransformerMixin
  • inverse_transform(X) – used to reverse the transformation

Note 1: This statement should always hold: x == inverse_transform(transform(x)), within a small tolerance.

Note 2: The targets (e.g., y) are passed only in the fit() method. As for the transform() and inverse_transform(), only the features (e.g., X) are given as input. Also, we can’t force the Pipeline to access any other attributes.
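The interface described above can be sketched with a small stateful transformer. The class MeanCenterSketch is a hypothetical example written for illustration; it is not one of the transformers from the original implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterSketch(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that learns column means and removes them."""

    def fit(self, X, y=None):
        # Learn the state from the data; y is accepted but unused here.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

    def inverse_transform(self, X):
        return np.asarray(X, dtype=float) + self.mean_


data = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = MeanCenterSketch()

# fit_transform() comes for free from TransformerMixin.
centered = centerer.fit_transform(data)
restored = centerer.inverse_transform(centered)
```

Keeping transform() and inverse_transform() in the same class makes it easy to verify that one undoes the other.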

2️⃣ Writing a pure function that is ultimately wrapped by FunctionTransformer.

This approach is practical when the transformation does not need a state (it learns nothing in fit()) and does not need to be reversed (it has no use for inverse_transform()). Therefore, it is useful when you need to implement just the transform() method.

This must be a pure function (for input A, you always get output B → it doesn’t depend on the external context). Otherwise, you will encounter strange behavior when your Pipeline gets bigger.

Note: As we implement only a transformation, we have access only to the features (e.g., X). We can’t access any other attributes.

As an observation, we chose to write a class for some transformations even though they didn’t need to implement the fit() method (e.g., LogTransformer), because it is good software practice to pack a transformation and its inverse into the same structure.

We leveraged the partial function from the Python functools module to configure the transformations.

As stated earlier, the function given to FunctionTransformer should take the data as its only input and return the transformed data. Also, because it is a function and not a class, we cannot configure it within a constructor. Using partial, we can fix a subset of a function’s parameters: it wraps our initial function and returns another function that, at the next call, needs only the parameters not specified in partial.
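Here is a minimal sketch of the partial + FunctionTransformer combination. The function log_with_offset and its offset parameter are hypothetical and chosen only for illustration.

```python
from functools import partial

import numpy as np
from sklearn.preprocessing import FunctionTransformer


def log_with_offset(X, offset):
    """Pure function: the output depends only on the inputs X and offset."""
    return np.log(np.asarray(X, dtype=float) + offset)


# partial() fixes `offset`, leaving a function of X alone,
# which is the shape FunctionTransformer expects.
log_offset_transformer = FunctionTransformer(partial(log_with_offset, offset=1.0))

example_input = np.array([[0.0], [np.e - 1.0]])
example_output = log_offset_transformer.fit_transform(example_input)
```

FunctionTransformer also accepts a kw_args dictionary that serves the same purpose, so partial is a design choice rather than a requirement.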

Finally, let’s build the Pipeline. We have used the make_pipeline() utility function that automatically names every pipeline step.

Diagram of the stationary_pipeline [Image by the Author].

Here is how you can quickly check that your transformation and inverse_transformation are performing correctly:

Using np.allclose(), we check the equality by accepting a small error.
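A minimal sketch of such a round-trip check, assuming a simplified stand-in for the stationarity pipeline (a log transform followed by standardization) and random positive data instead of the wheat yields:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic positive series standing in for the yields (an assumption).
rng = np.random.default_rng(seed=0)
series = rng.uniform(low=1.0, high=5.0, size=(36, 1))

# make_pipeline() names every step after its lowercased class name.
roundtrip_pipeline = make_pipeline(
    FunctionTransformer(np.log1p, inverse_func=np.expm1),
    StandardScaler(),
)
step_names = [name for name, _ in roundtrip_pipeline.steps]

transformed = roundtrip_pipeline.fit_transform(series)
# inverse_transform() applies the inverse of every step in reverse order.
restored = roundtrip_pipeline.inverse_transform(transformed)

is_reversible = np.allclose(series, restored)
```

If is_reversible is False, one of the steps has a transform() and inverse_transform() pair that do not match.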

Example of how the time series are looking after being processed by the stationary Pipeline [Image by the Author].

2. Feature Engineering Pipeline

Now let’s do some feature engineering.

Note that you don’t have to understand the implementation of every function. Focus on the structure.

For most of the transformations, we used pure functions + FunctionTransformer. We used this approach because we are not interested in implementing fit() or inverse_transform(). Therefore, using this method, our code is slimmer and cleaner.

Only DropRowsTransformer is implemented with a class because it needs both fit() and inverse_transform().

From my experience, when implementing data & feature engineering pipelines, I usually find FunctionTransformers more useful and cleaner. I don’t think it is good practice to inherit classes and leave most methods empty.

Now let’s get to the sweet part, where we will use make_column_transformer and make_union.

Using make_column_transformer, we can run different operations/pipelines on subsets of the columns. In this concrete example, we ran different transformations on the "mean_yield" and the "locations" columns. Another sweet thing about this operator is that it can run the operations for every set of columns in parallel when its n_jobs parameter is set. In that case, the features for "mean_yield" and "locations" are computed in parallel.

Using make_union, we can compute multiple features from the same input and concatenate them, again in parallel when n_jobs is set. In this example, for "mean_yield," we are calculating four different features:

  • past observations
  • moving average
  • moving standard deviation
  • moving median

NOTE 1: The same principle applies to the "locations" feature.

NOTE 2: I recommend using the make_[x] utility functions (make_pipeline, make_union, make_column_transformer). They will make your life easier.
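To make the structure concrete, here is a reduced sketch with only two features for "mean_yield" and one for "locations". The helper functions rolling_mean and rolling_median, the window size, and the toy DataFrame are assumptions made for illustration; the original pipeline computes more features.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer


def rolling_mean(X, window):
    return X.rolling(window=window, min_periods=1).mean()


def rolling_median(X, window):
    return X.rolling(window=window, min_periods=1).median()


toy_yields = pd.DataFrame(
    {"mean_yield": [1.0, 2.0, 6.0, 4.0], "locations": [10.0, 20.0, 30.0, 40.0]}
)

# make_union: several features computed from the same input, side by side.
mean_yield_features = make_union(
    FunctionTransformer(partial(rolling_mean, window=3)),
    FunctionTransformer(partial(rolling_median, window=3)),
)

# make_column_transformer: a different transformer for every column subset.
# Both helpers accept n_jobs; it is left at its default in this sketch.
feature_sketch = make_column_transformer(
    (mean_yield_features, ["mean_yield"]),
    (FunctionTransformer(partial(rolling_mean, window=3)), ["locations"]),
)

# Output columns: [mean_yield mean, mean_yield median, locations mean].
features = np.asarray(feature_sketch.fit_transform(toy_yields))
```

The same rolling_mean function is reused for both columns, which is the kind of reuse the Pipeline components encourage.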

In the snippet below, you can see those components in action. Following the Sklearn Pipelines paradigm, look how nicely we reused most of the functionality across the Pipeline.

Another essential element is the memory="cache" parameter. With it, the fitted transformers of the Pipeline are cached on the local disk in the "cache" directory. Therefore, on new runs with the same data and parameters, the Pipeline reads the results from the cache instead of recomputing them. It also knows to invalidate the cache when the data or the parameters change.
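A minimal sketch of the caching mechanism, assuming a temporary directory as the cache location and a toy two-step Pipeline rather than the feature engineering pipeline of this tutorial:

```python
import shutil
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() - 2.0

# Hypothetical cache location for this sketch; the tutorial uses "cache".
cache_dir = tempfile.mkdtemp()
try:
    cached_pipeline = Pipeline(
        steps=[("scaler", StandardScaler()), ("model", LinearRegression())],
        memory=cache_dir,
    )
    # First fit: the fitted scaler is computed and written to the cache.
    cached_pipeline.fit(X, y)
    # Same data and parameters: the fitted scaler is loaded from the cache.
    cached_pipeline.fit(X, y)
    cached_predictions = cached_pipeline.predict(X)
finally:
    # Remove the temporary cache directory once the sketch is done.
    shutil.rmtree(cache_dir, ignore_errors=True)
```

Only the transformers are cached; the last step of the Pipeline is always fitted again.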

Now, with minimum effort, by running all the transformations in parallel and caching the outputs of your pipeline, your machine learning pipeline will be blazing fast.

Diagram of the feature_engineering_pipeline. Note that in the notebook, the diagram is interactive. You have a dropdown showing you more details about every pipeline unit [Image by the Author].

Let’s run the feature_engineering_pipeline on our training features:

feature_engineered_yields = feature_engineering_pipeline.fit_transform(X_train.copy())
Train features computed by the feature_engineering_pipeline [Image by Author].

Now let’s run it on our testing features:

feature_engineering_pipeline.transform(X_test.copy())

3. Regressor Pipeline

Below you can see how to build the final regressor. On top of the feature_engineering_pipeline presented above, we stacked a scaler and a Random Forest model.

You can observe that the model is the last unit added to the pipeline.

4. Target Pipeline

The target_pipeline is used to preprocess the labels and postprocess the model’s predictions.

Everything will make sense in just a second.


Global Pipeline. Let’s Put Things Together.

Here we will show you how to use the TransformedTargetRegressor component.

The TransformedTargetRegressor class takes the following arguments:

  • regressor: Takes as input the regressor_pipeline defined above. This is what we are used to using in most pipelines.
  • transformer: Takes as input the target_pipeline, which preprocesses the labels that the model uses as ground truth when fit() is called. Also, when making predictions, it postprocesses the output of the model by calling inverse_transform(). How awesome is that? Finally, a method to pack all the steps into a single logical unit.

    Diagram of the entire Pipeline. As the Pipeline gets more complex, a good visualization will always be your friend [Image by the Author].
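Here is a reduced sketch of how the two arguments work together. The names regressor_pipeline_sketch, target_pipeline_sketch, and pipeline_sketch, the log-based target transformation, and the synthetic data are assumptions made for illustration; they are simplified stand-ins for the pipelines built in this tutorial.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data with a strictly positive target (an assumption).
rng = np.random.default_rng(seed=0)
X = rng.uniform(low=1.0, high=5.0, size=(60, 2))
y = np.exp(X[:, 0])

# Features -> scaler -> model; the model is the last step.
regressor_pipeline_sketch = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)

# Applied to y before fitting and inverted on the predictions.
target_pipeline_sketch = make_pipeline(
    FunctionTransformer(np.log, inverse_func=np.exp),
    StandardScaler(),
)

pipeline_sketch = TransformedTargetRegressor(
    regressor=regressor_pipeline_sketch,
    transformer=target_pipeline_sketch,
)

pipeline_sketch.fit(X, y)
# predict() already returns values on the original scale of y.
y_pred_sketch = pipeline_sketch.predict(X)
```

The model itself only ever sees the transformed target, while the caller only ever sees values on the original scale.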

How to Use the Global Pipeline

Train

Now the training step is just a one-liner.

pipeline.fit(X=X_train.copy(), y=y_train.copy())

Make Predictions

The excellent part is that making predictions is also a one-liner. Call "pipeline.predict(X)" and that’s it. You have your predictions.

Using TransformedTargetRegressor, the predictions are already transformed back to their initial scale when calling predict. Therefore, the model/pipeline is highly compact and easy to deploy in various scenarios: batch, API, streaming, embedded, etc.

Another helpful feature is that you can quickly run a grid search (e.g., GridSearchCV, or other techniques) over both your features and your model. You can now treat different configurations of your data pipeline as hyper-parameters. Therefore, you can quickly experiment with various features with only a few lines of code.
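A minimal sketch of such a search, assuming a toy pipeline with hypothetical step names ("features", "scaler", "model") and synthetic data. Parameters of nested steps are addressed with the double-underscore syntax, prefixed by regressor__ because the Pipeline sits inside a TransformedTargetRegressor.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(seed=0)
X = rng.uniform(low=0.0, high=2.0, size=(40, 1))
y = 1.0 + X.ravel() ** 2 + rng.normal(scale=0.05, size=40)

model = TransformedTargetRegressor(
    regressor=Pipeline(
        steps=[
            ("features", PolynomialFeatures(include_bias=False)),
            ("scaler", StandardScaler()),
            ("model", Ridge()),
        ]
    ),
    transformer=StandardScaler(),
)

# A feature-engineering setting and a model setting, searched together.
param_grid = {
    "regressor__features__degree": [1, 2],
    "regressor__model__alpha": [0.1, 1.0],
}

search = GridSearchCV(
    model,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # keeps the temporal order within folds
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_params = search.best_params_
```

TimeSeriesSplit is used here because shuffled folds would leak future observations into the training data.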

y_pred = pipeline.predict(X_test.copy())
y_pred
26 4.026584
0 4.122576
1 4.080378
2 4.174781
3 4.380293
Name: 0, dtype: float64

Test

For fun, let’s evaluate the model using the good old RMSE and MAPE metrics.

evaluate(y_test, y_pred)
INFO:sklearn-pipelines:RMSE: 0.147044
INFO:sklearn-pipelines:MAPE: 0.030220
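The implementation of evaluate() is not shown here. A plausible sketch, assuming it computes both metrics with NumPy and logs them under the "sklearn-pipelines" logger name seen in the output, could look like this (evaluate_sketch is a hypothetical name):

```python
import logging

import numpy as np

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sklearn-pipelines")


def evaluate_sketch(y_true, y_pred):
    """Log and return RMSE and MAPE (MAPE as a fraction, not a percentage)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mape = float(np.mean(np.abs((y_true - y_pred) / y_true)))

    logger.info("RMSE: %f", rmse)
    logger.info("MAPE: %f", mape)
    return rmse, mape


# Toy values: both errors have magnitude 1 on a true value of 4.
toy_rmse, toy_mape = evaluate_sketch([4.0, 4.0], [3.0, 5.0])
```

With MAPE expressed as a fraction, a value such as 0.03 corresponds to an average error of about 3%.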

We can observe that it does a decent job using a simple model and without any fine-tuning at all. An RMSE of ~0.15 on a scale of ~4.0 is pretty good.

But as stated a few times, this tutorial was about leveraging Sklearn Pipelines, not building an accurate model.


Conclusion

If you’ve gotten this far, you are fantastic. Now you know how to write professional Scikit-Learn Pipelines. Thank you for reading my article!

Using a concrete example, we showed how powerful it is to leverage Sklearn Pipelines and their entire stack of components: TransformerMixin, BaseEstimator, FunctionTransformer, ColumnTransformer, FeatureUnion, TransformedTargetRegressor.

Using this approach, we built a flexible machine learning pipeline where we can:

  • Easily reuse the transformations and compose them in various ways (modular code).
  • Write clean and scalable classes.
  • Write blazing-fast code that computes all the features in parallel and caches intermediate checkpoints across the Pipeline.
  • Directly deploy the model as a simple class without further preprocessing/postprocessing steps.
  • Quickly perform hyper-parameter tuning on the feature/data pipeline and the model itself.

What other hidden gems do you know about Sklearn Pipelines?

You can find the full implementation of the example used within the tutorial on my GitHub. In that repository, I will constantly push all the examples I will use to write various articles.

References

[1] Iizumi, Toshichika, Global dataset of historical yields v1.2 and v1.3 aligned version. PANGAEA (2019), Supplement to: Iizumi, Toshichika; Sakai, T, The global dataset of historical yields for major crops 1981–2016 (2020), Scientific Data


💡 My goal is to help machine learning engineers level up in designing and productionizing ML systems. Follow me on LinkedIn or subscribe to my weekly newsletter for more insights!



Written By

Paul Iusztin
