URL: https://towardsdatascience.com/4-scikit-learn-tools-every-data-scientist-should-use-4ee942958d9e/

4 Scikit-Learn Tools Every Data Scientist Should Use | Towards Data Science


4 Scikit-Learn Tools Every Data Scientist Should Use

Composite Estimators and Transformers

6 min read

Written By: Amal Hasni & Dhia Hmila

Photo by Sandy Millar on Unsplash

Data science projects tend to involve a lot of back-and-forth between preprocessing, feature engineering, feature selection, training, and testing. Juggling all of these steps while trying multiple options, or while running in a production environment, can get messy very fast. Fortunately, Scikit-Learn provides tools that let us chain multiple estimators into one. In other words, an action like fit or predict needs to be called only once on the whole sequence of estimators. In this article, we share four of these tools, with example use cases drawn from a concrete project.

Table Of Contents

  • 1 – Pipelines
  • 2 – Function Transformer
  • 3 – Column Transformer
  • 4 – Feature Union
  • Alternative Syntax: make_[composite_estimator]
  • Bonus: Visualizing Your Pipeline


Warming-up

Before we start exploring scikit-learn’s tools, let’s start by getting a dataset we can play with.

We should mention that this is just for the sake of example. So you don’t necessarily need to download it (unless you want to try the code yourself).

We actually stumbled on a nice Python package called datasets that lets you easily download more than 500 datasets.

We’re going to use Amazon US Reviews. It contains numerical and textual features (e.g. the review text and the number of helpful votes each review received), and the target is the number of stars the reviewer gave.
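The download step itself is not reproduced here. As a stand-in, the sketch below builds a tiny hand-written DataFrame with the same kinds of features (text, vote counts, a date, and a star rating). The column names and every value in it are assumptions made for illustration, not actual rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the reviews data.
# Column names and values are illustrative assumptions, not real records.
reviews = pd.DataFrame(
    {
        "review_body": [
            "Works great and was easy to install",
            "Crashes every time I open it",
            "Good value but the interface is dated",
            "Would not recommend, support never answered",
            "Excellent software, does exactly what I need",
            "Stopped working after the first update",
        ],
        "helpful_votes": [3, 0, 12, 1, 7, 2],
        "total_votes": [4, 2, 15, 1, 8, 5],
        "review_date": [
            "2015-08-31", "2014-03-09", "2013-12-25",
            "2015-01-17", "2014-07-04", "2013-05-20",
        ],
        "star_rating": [5, 1, 4, 2, 5, 1],
    }
)

X = reviews.drop(columns="star_rating")  # features
y = reviews["star_rating"]               # target: number of stars

print(X.dtypes)
```

A frame shaped like this is enough to try each of the tools discussed in the rest of the article.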


1 – Pipelines

What is a Pipeline?

A Pipeline is a tool that wraps a sequence of estimators into a single estimator, so the whole sequence can be handled as one object.

What makes it useful?

Pipelines are convenient for multiple reasons:

  • Compactness: You can avoid writing multiple lines of code and worrying about the order of your estimators.
  • Clarity: Easier to read and visualize.
  • Ease of handling: A single call applies fit or predict over the whole sequence of estimators.
  • Joint parameter selection: Using pipelines allows you to optimize the parameters of all estimators at once via grid search.

How can I use it in practice?

In practice, a pipeline is a sequence of transformers followed by a final estimator.

If you don’t know what a transformer is, it’s basically any object that implements fit and transform methods.

In our example, we’re going to transform the reviews from text to numeric data using TfidfVectorizer and then attempt a prediction using RandomForestClassifier:
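A minimal sketch of such a pipeline could look like the following. The step names ("tfidf", "clf"), the toy reviews, and the hyperparameters are assumptions chosen for illustration; the real project would fit on the review text and star ratings of the dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy reviews and star ratings (hypothetical values, for illustration only).
texts = [
    "works great and easy to install",
    "crashes every time I open it",
    "excellent software, works as expected",
    "stopped working, crashes after update",
]
stars = [5, 1, 5, 1]

# Each step is a (name, estimator) pair; the last step is the final estimator.
pipe = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),
        ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)

# One call fits the vectorizer, transforms the text, and fits the classifier.
pipe.fit(texts, stars)

# One call transforms new text with the fitted vectorizer and predicts.
predictions = pipe.predict(["works great", "crashes all the time"])
print(predictions)
```

Because the vectorizer is fitted inside the pipeline, it only ever sees the data passed to fit, which is what makes pipelines helpful against data leakage.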

Image By Author

2 – Function Transformer

What is a Function Transformer?

As we mentioned in the previous section, a Transformer needs to include fit and transform methods.

A FunctionTransformer is a stateless transformer constructed from a callable (aka function) you’ve created.

What makes it useful?

In some cases, you need to perform a transformation that doesn’t require fitting any parameters. In that scenario, writing a fit method would be pointless, and this is where FunctionTransformer is most useful.

How can I use it in practice?

Our dataset includes a date feature. One example of a transformation we can apply to dates is extracting the year.

This is how to do this with a FunctionTransformer:
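Here is one way this might be written. The column name review_date, the helper name extract_year, and the sample dates are assumptions made for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def extract_year(df):
    """Return the year of each review date as a one-column DataFrame."""
    years = pd.to_datetime(df["review_date"]).dt.year
    return years.to_frame(name="review_year")


# Wrap the plain function so it exposes fit and transform.
year_transformer = FunctionTransformer(extract_year)

# Hypothetical sample dates, for illustration only.
dates = pd.DataFrame({"review_date": ["2014-03-09", "2015-08-31", "2013-12-25"]})

years = year_transformer.fit_transform(dates)
print(years["review_year"].tolist())  # [2014, 2015, 2013]
```

Using a named function rather than a lambda keeps the code readable and sidesteps the pickling caveat described in the note that follows.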

Note: If a lambda function is used with FunctionTransformer, the resulting transformer will not be pickleable. A nice thing to know is that you can work around this by using the cloudpickle package.

3 – Column Transformer

What is a Column Transformer?

Depending on your dataset, you might need to apply distinct transformations to different columns of an array or pandas DataFrame.

ColumnTransformer applies a separate transformation to each subset of columns and then concatenates the resulting features.

What makes it useful?

This estimator is particularly useful for heterogeneous data, where you need to tailor feature extraction or transformations to the data type of each column or subset of columns.

How can I use it in practice?

In our example, we have both textual and numerical data. Here’s how we use Column Transformer to apply separate transformations depending on the data type:
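The sketch below shows one possible setup: TF-IDF capped at 100 features on the text column and a Normalizer on two numeric columns. The column names, step names, and toy rows are assumptions. Note that this tiny corpus has far fewer than 100 distinct terms, so the output here is narrower than it would be on the full dataset; its width is simply the TF-IDF vocabulary size plus 2.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical toy rows; column names are illustrative assumptions.
X = pd.DataFrame(
    {
        "review_body": [
            "works great and easy to install",
            "crashes every time I open it",
            "good value but dated interface",
            "stopped working after the update",
        ],
        "helpful_votes": [3, 0, 12, 1],
        "total_votes": [4, 2, 15, 1],
    }
)

preprocessor = ColumnTransformer(
    transformers=[
        # A string selector passes a 1-D column, which text vectorizers expect.
        ("tfidf", TfidfVectorizer(max_features=100), "review_body"),
        # A list selector passes a 2-D block of columns.
        ("norm", Normalizer(), ["helpful_votes", "total_votes"]),
    ]
)

features = preprocessor.fit_transform(X)

# Width = number of TF-IDF terms actually learned + 2 normalized columns.
n_tfidf = len(preprocessor.named_transformers_["tfidf"].vocabulary_)
print(features.shape, n_tfidf)
```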

As expected, the output of the ColumnTransformer has 102 columns: 100 from the TF-IDF transformer and 2 from the Normalizer.

Image By Author

4 – Feature Union

What is a Feature Union?

Similar to ColumnTransformer, a FeatureUnion concatenates the results of multiple transformers. The difference is that each transformer receives the whole dataset as its input, instead of a subset of columns as in ColumnTransformer.

The two are largely equivalent in terms of what you can do with them, but depending on the situation, one may be more appropriate and require fewer lines of code.

What makes it useful?

FeatureUnion allows you to combine different feature extraction transformers into one Transformer.

How can I use it in practice?

We’re going to construct a FeatureUnion that performs the same transformations as the previously implemented ColumnTransformer.

To do that, we’re going to encapsulate two pipelines into a FeatureUnion. Each pipeline chains a ColumnSelector and a given transformer.

There is no pre-implemented ColumnSelector in scikit-learn, so we’re going to build our own, using FunctionTransformer:
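One way to sketch this is shown below. The helper names select_columns and column_selector, the step names, and the toy rows are hypothetical; only FeatureUnion, Pipeline, FunctionTransformer, TfidfVectorizer, and Normalizer come from scikit-learn.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer


def select_columns(df, columns):
    """Keep only the requested column(s) of a DataFrame."""
    return df[columns]


def column_selector(columns):
    """Hypothetical helper: a stateless transformer that selects `columns`."""
    return FunctionTransformer(select_columns, kw_args={"columns": columns})


# Hypothetical toy rows; column names are illustrative assumptions.
X = pd.DataFrame(
    {
        "review_body": [
            "works great and easy to install",
            "crashes every time I open it",
            "good value but dated interface",
            "stopped working after the update",
        ],
        "helpful_votes": [3, 0, 12, 1],
        "total_votes": [4, 2, 15, 1],
    }
)

# Every branch receives the whole DataFrame, so each one selects its own columns.
text_branch = Pipeline(
    [
        ("select", column_selector("review_body")),
        ("tfidf", TfidfVectorizer(max_features=100)),
    ]
)
votes_branch = Pipeline(
    [
        ("select", column_selector(["helpful_votes", "total_votes"])),
        ("norm", Normalizer()),
    ]
)

union = FeatureUnion(transformer_list=[("text", text_branch), ("votes", votes_branch)])

features = union.fit_transform(X)
print(features.shape)
```

Passing the column names through kw_args instead of a lambda keeps the selector built from a named function.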

Image By Author

Alternative Syntax: make_[composite_estimator]

There are alternatives to the previously mentioned constructors (except FunctionTransformer) that have a slightly different syntax.

The functions in question are make_pipeline, make_column_transformer, and make_union.

These are shorthands for the previous constructors. The main difference is that they do not require, and do not permit, naming the estimators. Instead, the component names will automatically be set to the lowercase of their types.

In most cases, it is simpler, cleaner, and more easily readable to use the shorthand versions. However, you might need to customize your estimators’ names if you need to perform a grid search for example. In this case, assigning short distinguishable names can be useful for clarity and compactness.
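To make the naming point concrete, the sketch below compares the parameter keys exposed by a pipeline with custom step names and by its shorthand equivalent. Grid-search parameter keys follow the pattern step name, double underscore, parameter name; the step names "tfidf" and "clf" are assumptions chosen for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, make_pipeline

# Custom, short step names.
named = Pipeline([("tfidf", TfidfVectorizer()), ("clf", RandomForestClassifier())])

# Auto-generated names: the lowercased class names.
auto = make_pipeline(TfidfVectorizer(), RandomForestClassifier())

# The same hyperparameter is addressed under two different keys.
print("clf__n_estimators" in named.get_params())                    # True
print("randomforestclassifier__n_estimators" in auto.get_params())  # True

# A parameter grid written against the short names stays compact.
param_grid = {
    "tfidf__max_features": [100, 500],
    "clf__n_estimators": [50, 100],
}
```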

If you’re not familiar with parameter optimization using grid search, we will be writing an article about it soon.

You can see the difference between the two versions applied to the Pipeline case below:

  • Pipeline syntax:
  • Compared to make_pipeline:
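Side by side, the two forms might look like the following sketch. The step names "tfidf" and "clf" in the first form are arbitrary choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, make_pipeline

# Pipeline syntax: every step is an explicit (name, estimator) pair.
pipe = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),
        ("clf", RandomForestClassifier()),
    ]
)

# make_pipeline syntax: estimators only, names are generated automatically.
auto_pipe = make_pipeline(TfidfVectorizer(), RandomForestClassifier())

print([name for name, _ in pipe.steps])       # ['tfidf', 'clf']
print([name for name, _ in auto_pipe.steps])  # ['tfidfvectorizer', 'randomforestclassifier']
```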

Bonus: Visualizing Your Pipeline

Scikit-Learn has a neat way of visualizing the composite estimators you create using the following lines of code:
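A minimal sketch, assuming a simple two-step pipeline and an arbitrary output location in the system's temporary directory, could be:

```python
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.utils import estimator_html_repr

pipe = make_pipeline(TfidfVectorizer(), RandomForestClassifier())

# Build the HTML representation of the (possibly unfitted) estimator.
html = estimator_html_repr(pipe)

# The file name and location are arbitrary choices for this sketch.
output_path = Path(tempfile.gettempdir()) / "pipeline.html"
output_path.write_text(html, encoding="utf-8")
print(output_path)
```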

This will effectively create an interactive HTML file representing your pipeline in a clear way.

An alternative, if you’re using notebooks, is to use this code instead:
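In a notebook, a sketch along these lines should be enough; the pipeline itself is an arbitrary example. Recent scikit-learn versions already use the diagram display by default, in which case the set_config call is harmless.

```python
from sklearn import get_config, set_config
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Ask scikit-learn to render estimators as diagrams in rich front ends.
set_config(display="diagram")

pipe = make_pipeline(TfidfVectorizer(), RandomForestClassifier())

# In a notebook, leaving the estimator as the last expression of a cell
# displays the interactive diagram.
pipe

print(get_config()["display"])
```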


Final Thoughts

Pipelines and composite estimators are powerful tools for data science projects, especially those meant to run in a production environment. Their added value is not only clarity and convenience but also safety and the prevention of data leakage.

If you want to see how we used the different tools mentioned in this article in a hands-on project, don’t hesitate to check out our previous article:

How I Built a Classification Model for Source Code Languages

Finally, if you want to see how to make use of these composite estimators to optimize hyperparameters over the whole pipeline or to compare different algorithm performances, stay tuned for our future article about grid searches.

Thank you for sticking with us this far. We hope you liked the content. Stay safe, and we will see you in our next article 😊

