Pipelines and ColumnTransformer in scikit-learn
What is a pipeline?
A pipeline is a way to organize the repetitive steps in data science projects, such as data cleaning, data transformation, and modeling. It makes your code clean and readable, and it eases deployment to a production environment.
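To make the idea concrete before the real example, here is a minimal toy sketch (the data, steps, and estimators below are illustrative and not part of the Airbnb walkthrough): calling fit on the pipeline runs every step in order, and predict reuses the same fitted transformations.

# A minimal pipeline sketch: scale the features, then fit a linear model.
# Toy data; the step names ('scaler', 'model') are arbitrary labels.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

toy_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

toy_pipe.fit(X_toy, y_toy)        # runs the scaler's fit_transform, then fits the model
print(toy_pipe.predict([[5.0]]))  # applies the same scaling, then predicts (~[10.], since y = 2x)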
I recently came across the combination of ColumnTransformer and Pipeline, and I found them extremely efficient for organizing and automating code. So let's jump into the code to see how we can apply them together.
For this example, I am using an Airbnb dataset from Kaggle.
First, load the essential libraries:
# load libraries
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
Loading the data:
# read data
airbnb = pd.read_csv('data/AB_NYC_2019.csv')
This is the New York City Airbnb dataset for 2019; it includes the information needed to learn more about hosts, geographical availability, neighbourhoods, etc. Each listing has a price per night, which is our target variable.
airbnb.head()
|   | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|----|------|---------|-----------|---------------------|---------------|----------|-----------|-----------|-------|----------------|-------------------|-------------|-------------------|--------------------------------|------------------|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
Drop any duplicate rows and remove unnecessary columns:
airbnb.drop_duplicates(inplace=True)
airbnb.drop(['name','id','host_name','last_review','host_id','neighbourhood'],
            axis=1, inplace=True)
Specify the numeric and categorical variables; this will help us in the transformation steps:
numeric_features = [cls for cls in airbnb.columns
                    if airbnb[cls].dtype == 'float64' or airbnb[cls].dtype == 'int64']
print(numeric_features)

cat_features = [cls for cls in airbnb.columns
                if airbnb[cls].dtype == 'object' and airbnb[cls].nunique() < 10]
print(cat_features)
['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
['neighbourhood_group', 'room_type']
Let's remove the target variable from the list of numeric columns, since I want to apply the transformation steps to the feature columns and don't want to change the target variable.
numeric_features.pop(2)  # removes 'price' (at index 2) from the feature list
'price'
Let's check the missing values in the dataset:
airbnb.isnull().sum()
neighbourhood_group 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
reviews_per_month 10052
calculated_host_listings_count 0
availability_365 0
dtype: int64
It seems reviews_per_month contains missing values. Let's address them by filling the null records with the column mean, so here I define an imputer whose strategy is to replace null values with the mean:
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
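As a quick sanity check of what this imputer does (a toy example with made-up values, not the Airbnb data): fit learns the column mean, and transform substitutes it for every NaN.

# Toy check: the NaN is replaced by the mean of the observed values, (1 + 3) / 2 = 2.
demo = np.array([[1.0], [np.nan], [3.0]])
print(imputer.fit_transform(demo))
# [[1.]
#  [2.]
#  [3.]]

Fitting it here is harmless: ColumnTransformer clones its transformers before fitting them on the real data.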
For the categorical variables, I define a pipeline that contains a one-hot encoding step to convert the categorical columns:
cat_imputer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
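The handle_unknown='ignore' option matters at prediction time: a category never seen during training is encoded as a row of zeros instead of raising an error. A quick toy illustration (made-up single-column data, not the Airbnb columns):

# 'Queens' was not seen during fit, so it maps to all-zero indicator columns.
demo_train = pd.DataFrame({'borough': ['Brooklyn', 'Manhattan']})
demo_test = pd.DataFrame({'borough': ['Queens']})
cat_imputer.fit(demo_train)
print(cat_imputer.transform(demo_test).toarray())  # [[0. 0.]]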
Here I am defining a ColumnTransformer, which takes these parameters:
- remainder: by specifying remainder='passthrough', all remaining columns that were not listed in transformers are automatically passed through unchanged.
- transformers: a list of tuples specifying the transformer objects to be applied to subsets of the columns. In the lines above, I defined two transformers: one for the numeric variables, which replaces missing values with the mean, and one for converting the categorical variables with one-hot encoding. This is where both transformers are bound together.
pre_process = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('num', imputer, numeric_features),
        ('cat', cat_imputer, cat_features)
    ])
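To see how remainder='passthrough' behaves, here is a small self-contained example (toy columns 'a', 'b', 'c', separate from the Airbnb pre_process above): column 'c' is not assigned to any transformer, so it is appended to the output unchanged, after the transformed columns.

# Toy ColumnTransformer: impute 'a', one-hot encode 'b', pass 'c' through.
demo = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y'], 'c': [10, 20]})
demo_ct = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('num', SimpleImputer(strategy='mean'), ['a']),
        ('cat', OneHotEncoder(), ['b'])
    ])
print(demo_ct.fit_transform(demo))
# [[ 1.  1.  0. 10.]
#  [ 1.  0.  1. 20.]]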
Here, I am defining two models to apply to the data: one is an XGBRegressor and the other is a decision tree regressor.
XGB_model = XGBRegressor(n_estimators=1000, learning_rate=0.05,
                         objective='reg:squarederror')
DT_model = DecisionTreeRegressor(max_depth=10)
Let's separate the target variable and split the data into training and test sets:
y = airbnb.price
X = airbnb.drop(['price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8,
                                                    test_size=0.2, random_state=0)
X_train.columns
Index(['neighbourhood_group', 'latitude', 'longitude', 'room_type',
'minimum_nights', 'number_of_reviews', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365'],
dtype='object')
Here I am defining two pipelines. The first one includes:
- Preprocessing step
- Modeling step (XGB regressor)
and the second one:
- Preprocessing step (Same as above)
- Modeling step (Decision tree regressor)
pipe_xgb = Pipeline(steps=[
    ('preprocessor', pre_process),
    ('model', XGB_model)
])
pipe_dt = Pipeline(steps=[
    ('preprocessor', pre_process),
    ('model', DT_model)
])
pipe_line = [pipe_xgb, pipe_dt]
mdl_name = ['XGB', 'Decision Tree']
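A side note on the step names (my addition, not part of the original walkthrough): naming the steps makes each one addressable later, and hyperparameters can be referenced with the '<step>__<param>' convention, which is what lets a whole pipeline plug into tools such as GridSearchCV.

# Steps are accessible by name, and parameters via '<step>__<param>'.
print(pipe_dt.named_steps['model'])                  # the DecisionTreeRegressor
print(pipe_xgb.get_params()['model__n_estimators'])  # 1000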
For each pipeline object, fit the model on the training data and predict on the test data:
for name, pipe in zip(mdl_name, pipe_line):
    pipe.fit(X_train, y_train)
    predictions = pipe.predict(X_test)
    print("Model is " + name
          + ", MAE: " + str(np.round(mean_absolute_error(predictions, y_test)))
          + ", RMSE: " + str(np.round(np.sqrt(mean_squared_error(predictions, y_test)), 3)))
Model is XGB, MAE: 67.0, RMSE: 232.755
Model is Decision Tree, MAE: 70.0, RMSE: 294.749
Comparing the MAE and RMSE of the two models, we can see that the XGB regressor performs better than the decision tree. In any case, the objective of this example was not to train the best possible model, but to show how easily we can automate and organize our data modeling steps using pipelines and column transformers.
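One practical payoff of the production point made at the start: a fitted pipeline bundles the preprocessing and the model into a single object, so it can be persisted and reloaded as one artifact. A minimal sketch, assuming joblib is available (the file name is arbitrary):

import joblib

# Persist the fitted pipeline (preprocessing + model) as one artifact...
joblib.dump(pipe_xgb, 'airbnb_price_pipeline.joblib')

# ...and reload it later to score new data with the exact same transformations.
loaded_pipe = joblib.load('airbnb_price_pipeline.joblib')
print(loaded_pipe.predict(X_test[:5]))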
Good references to learn more about the topic:
- scikit-learn documentation on Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- scikit-learn documentation on ColumnTransformer: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
- A good Kaggle notebook about pipelines: https://www.kaggle.com/alexisbcook/pipelines
- Automate Machine Learning Workflows with Pipelines in Python and scikit-learn: https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/