A Simple Example of Pipeline in Machine Learning with Scikit-learn
Today’s post is short and crisp: I will walk you through an example of using Pipeline in machine learning with Python. I will also use some other important tools, such as GridSearchCV, to demonstrate the implementation of a pipeline, and finally explain why a pipeline is indeed necessary in some cases. Let’s begin.
The scikit-learn documentation defines the Pipeline class as follows:
Sequentially apply a list of transforms and a final estimator. Intermediate steps of pipeline must implement fit and transform methods and the final estimator only needs to implement fit.
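To make the definition concrete, here is a minimal sketch on a tiny made-up array (the toy data and the choice of LogisticRegression as the final estimator are my assumptions, not part of the wine example). It writes out by hand what a two-step pipeline does, then checks that the Pipeline object gives the same predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: two features on very different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0],
                  [4.0, 600.0], [5.0, 500.0], [6.0, 400.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# The steps written out by hand.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)           # intermediate step: fit and transform
clf = LogisticRegression().fit(X_scaled, y_toy)  # final step: fit only
manual_pred = clf.predict(scaler.transform(X_toy))

# The same two steps wrapped in a single Pipeline estimator.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe_pred = pipe.fit(X_toy, y_toy).predict(X_toy)

print((manual_pred == pipe_pred).all())
```

The intermediate step needs both fit and transform because its output feeds the next step, while the last step only has to learn from whatever it receives.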
The above statements will be more meaningful once we start to implement a pipeline on a simple data-set. Here I’m using the red-wine data-set, where the ‘label’ is the quality of the wine, ranging from 0 to 10. In terms of data pre-processing it is a rather simple data-set, as it has no missing values.
import pandas as pd
winedf = pd.read_csv('winequality-red.csv',sep=';')
# print(winedf.isnull().sum())  # check for missing data
print(winedf.head(3))
>>> fixed ac. volat. ac. citric ac. res. sugar chlorides
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
free sulfur diox. tot. sulfur diox. dens. pH sulphates
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
We can always check the correlation plots with seaborn, or we can plot some of the features using a scatter plot; below are two such plots.
As expected, acidity and pH have a strong negative correlation, in contrast to residual sugar and acidity. Once we are familiar with the data-set and have played around with it enough, let’s discuss and implement a pipeline.
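If you prefer numbers to plots, the correlation matrix can be computed directly with pandas. The sketch below uses a small synthetic stand-in for winedf (the values are randomly generated under my own assumptions, with pH built to fall as acidity rises), so it only illustrates the calls; on the real data you would call winedf.corr() instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for winedf: NOT the real wine data.
rng = np.random.default_rng(0)
n = 200
fixed_acidity = rng.normal(8.3, 1.7, n)
demo_df = pd.DataFrame({
    "fixed acidity": fixed_acidity,
    # pH is constructed to decrease with acidity (assumption, for illustration).
    "pH": 3.9 - 0.07 * fixed_acidity + rng.normal(0, 0.1, n),
    # Residual sugar is generated independently of acidity here.
    "residual sugar": rng.normal(2.5, 1.4, n),
})

corr = demo_df.corr()  # pairwise Pearson correlation between columns
print(corr.round(2))

# With seaborn installed, the same matrix can be drawn as a heatmap:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```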
As the name suggests, the Pipeline class allows chaining multiple processing steps into a single scikit-learn estimator. A Pipeline has fit, predict and score methods just like any other estimator (e.g. LinearRegression).
To implement a pipeline, we first separate the features and labels of the data-set, as usual.
X=winedf.drop(['quality'],axis=1)
Y=winedf['quality']
If you have looked at the output of winedf.head(3), you can see that the features of the data-set vary over a wide range. As I have explained before, some learning algorithms, like principal component analysis, need scaled features; here I will use one such algorithm, known as SVM (Support Vector Machine). For more on the theory of SVM, you can check my other post.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
Here we are using StandardScaler, which subtracts the mean from each feature and then scales it to unit variance.
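As a quick check of what that means, the sketch below standardizes two columns taken from the head(3) output above (fixed acidity and total sulfur dioxide) and compares the result with the formula (x - mean) / std computed by hand. The variable names are mine, chosen only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three rows of two columns from the head(3) output:
# fixed acidity and total sulfur dioxide.
X_head = np.array([[7.4, 34.0], [7.8, 67.0], [7.8, 54.0]])

scaler = StandardScaler().fit(X_head)   # learns the per-column mean and std
X_std = scaler.transform(X_head)

# Same computation by hand; np.std uses the population std (ddof=0),
# which is also what StandardScaler uses.
manual = (X_head - X_head.mean(axis=0)) / X_head.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

After the transformation both columns are on the same scale, even though the raw values differ by roughly a factor of ten.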
Now we are ready to create a pipeline object by providing it with a list of steps. Our steps are a standard scaler and a support vector machine. The steps are given as a list of tuples, each consisting of a name and an instance of the transformer or estimator. The piece of code below makes this clear:
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps) # define the pipeline object.
The strings (‘scaler’, ‘SVM’) can be anything, as these are just names that clearly identify the transformer or estimator. We can use make_pipeline instead of Pipeline to avoid naming the estimator or transformer. The final step in this list of tuples has to be an estimator.
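For comparison, here is a short sketch of the make_pipeline variant. The step names are generated automatically from the lowercased class names, so any parameter grid would then have to use those generated names (for example svc__C) rather than the hand-picked ‘SVM’.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# No names given: make_pipeline derives them from the class names.
auto_pipe = make_pipeline(StandardScaler(), SVC())

step_names = [name for name, _ in auto_pipe.steps]
print(step_names)
```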
We divide the data-set into training and test-set with a random_state=30 .
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=30, stratify=Y)
It’s necessary to use stratify because the labels are imbalanced: most of the wine quality values fall in the range 5 to 6. You can check this using pandas value_counts(), which returns a Series containing the counts of unique values.
print(winedf['quality'].value_counts())
>>> 5 681
6 638
7 199
4 53
8 18
3 10
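To see what stratify buys us, the sketch below rebuilds a label column with exactly the class counts printed above (the features are a placeholder array of zeros, an assumption made so the example runs without the CSV file) and compares the class proportions in the full set, the training set and the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output above.
counts = {5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}
y_all = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))
X_placeholder = np.zeros((len(y_all), 1))  # placeholder features, not real data

_, _, y_tr, y_te = train_test_split(
    X_placeholder, y_all, test_size=0.2, random_state=30, stratify=y_all)

# Fraction of each quality value in the full, training and test labels.
comparison = pd.DataFrame({
    "full": y_all.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(comparison.round(3))
```

With stratification even the rarest quality values appear in both splits in roughly the original proportion, which a purely random split does not guarantee.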
SVM is usually tuned using two parameters, gamma and C. I have discussed the effect of these parameters in another post, but for now let’s define a parameter grid that we will use in GridSearchCV.
parameteres = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}
Now we instantiate the GridSearchCV object with the pipeline and the parameter space, using 5-fold cross-validation.
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
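The keys of the parameter grid follow the pattern step name, double underscore, parameter name. A small sketch of how to list the names a pipeline accepts, and how the same naming works with set_params (the values 10 and 0.01 are arbitrary picks for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])

# Every tunable parameter of the SVM step is exposed as 'SVM__<parameter>'.
svm_params = sorted(k for k in pipeline.get_params() if k.startswith('SVM__'))
print(svm_params)

# The same names work for setting values directly on the pipeline.
pipeline.set_params(SVM__C=10, SVM__gamma=0.01)
print(pipeline.named_steps['SVM'].C, pipeline.named_steps['SVM'].gamma)
```

If a key in the grid does not match one of these names, GridSearchCV raises an error when fitting, so printing the list is a cheap way to catch typos.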
We can use this to fit on the training data-set and test the algorithm on the test-data set. Also we can find the best fit parameters for the SVM as below
grid.fit(X_train, y_train)
print("score = %3.2f" % grid.score(X_test, y_test))
print(grid.best_params_)
>>> score = 0.60
{'SVM__C': 100, 'SVM__gamma': 0.1}
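Besides best_params_, a fitted grid search keeps the cross-validation score of every parameter combination and a refitted copy of the best pipeline. The sketch below shows how to look at them; it runs on synthetic data from make_classification with a reduced grid (both are my assumptions so that the example is self-contained), so its numbers say nothing about the wine data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, NOT the wine data-set.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=30, stratify=y_syn)

pipe = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])
param_grid = {'SVM__C': [0.1, 10, 100], 'SVM__gamma': [0.1, 0.01]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5).fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best parameter combination.
print("best CV score = %3.2f" % search.best_score_)

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_)[
    ['param_SVM__C', 'param_SVM__gamma', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))

# With the default refit=True, best_estimator_ is the whole pipeline
# refitted on all of X_tr using the best parameters.
best_pipe = search.best_estimator_
print("test score = %3.2f" % best_pipe.score(X_te, y_te))
```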
With this we have seen an example of effectively using a pipeline with grid search to test the support vector machine algorithm.
In a separate post, I have discussed in great detail how to apply pipeline and GridSearchCV and how to draw the decision function for SVM. You can use any other algorithm, such as logistic regression, instead of SVM to test which learning algorithm works best for the red-wine data-set. For an application of the Decision Tree algorithm in a pipeline including GridSearchCV on a more realistic data-set, you can check this post.
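Swapping the final estimator only means changing the last step of the pipeline. A minimal sketch of such a comparison, again on synthetic data from make_classification and with default hyper-parameters (both assumptions of mine, so the scores are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, NOT the wine data-set.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'SVM': SVC(),
    'logistic regression': LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Same scaling step for every candidate; only the last step changes.
    pipe = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
    mean_scores[name] = cross_val_score(pipe, X_syn, y_syn, cv=5).mean()
    print("%s: %3.2f" % (name, mean_scores[name]))
```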
Why Pipeline?
I will finish this post with a simple, intuitive explanation of why Pipeline can be necessary at times. It helps to enforce the desired order of application steps and creates a convenient work-flow, which makes the work reproducible. But there is something more to pipeline, and since we have used grid-search cross-validation, we can understand it better.
The pipeline object in the example above was created with StandardScaler and SVM. If, instead of using a pipeline, they were applied separately, then for StandardScaler one could proceed as below
scale = StandardScaler().fit(X_train)
X_train_scaled = scale.transform(X_train)
grid = GridSearchCV(SVC(), param_grid=parameteres, cv=5)
grid.fit(X_train_scaled, y_train)
Here we see the intrinsic problem of applying a transformer and an estimator separately when the parameters of the estimator (SVM) are determined using GridSearchCV. The scaled features used for cross-validation are split into training and test folds, but the scaling itself was computed from the whole training set (X_train), so every training fold already carries information about its test fold. Put simply, when SVC.fit() is called during cross-validation, the features already include information from the test fold, because StandardScaler.fit() was run on the whole training set.
One can avoid this leakage by using a pipeline. With a pipeline we glue together StandardScaler() and SVC(), and this ensures that during cross-validation the StandardScaler is fitted only to the training fold, exactly the same fold that is used for SVC.fit(). A fantastic pictorial representation of the above description is given in Andreas Müller’s book¹.
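The claim can be checked with a short sketch that mimics, roughly, what GridSearchCV does in each fold: clone the pipeline, fit it on the training fold only, and look at the mean the scaler has learned. The data are synthetic (make_classification) and the manual fold loop is my simplification, not the library's internal code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, NOT the wine data-set.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])

fold_checks = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_syn, y_syn):
    # A fresh, unfitted copy of the pipeline is fitted on the training fold only.
    fold_pipe = clone(pipe).fit(X_syn[train_idx], y_syn[train_idx])
    fitted_mean = fold_pipe.named_steps['scaler'].mean_

    # The scaler's mean equals the training-fold mean ...
    uses_train_fold_only = np.allclose(fitted_mean, X_syn[train_idx].mean(axis=0))
    # ... and differs from the mean over all rows (which includes the test fold).
    matches_full_data = np.allclose(fitted_mean, X_syn.mean(axis=0))
    fold_checks.append((uses_train_fold_only, matches_full_data))

print(fold_checks)
```

In the separate-steps version, by contrast, the scaler's mean would be the full-data mean in every fold, which is exactly the leakage described above.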
[1] Andreas Müller, Sarah Guido; Introduction to Machine Learning with Python; pp. 305–320; First Edition; O’Reilly Media.
You can find the complete code on GitHub.
Cheers ! Stay strong !!