Pipelines and ColumnTransformer in scikit-learn

What is a Pipeline?

A Pipeline is a way to organize the repetitive steps of a data science project, such as data cleaning, data transformation, and modeling. It makes your code clean and readable, and it facilitates deployment in a production environment.
I recently came across the combination of ColumnTransformer and Pipeline, and I found them extremely efficient for ordering and automating my code. So let's jump into the code to see how we can apply them together.
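
To make the idea concrete before touching the dataset, here is a minimal sketch (a toy example with made-up numbers, not part of the Airbnb analysis): a Pipeline chains preprocessing and modeling steps behind a single fit/predict interface.

#minimal toy Pipeline: imputation followed by a model, one fit/predict call
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor

X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

toy_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),    # fill the NaN first
    ('model', DecisionTreeRegressor(max_depth=2))  # then fit the model
])
toy_pipe.fit(X_toy, y_toy)   # every step runs in order
toy_pipe.predict([[3.0]])    # preprocessing is applied automatically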

For this example, I am using the Airbnb dataset from Kaggle.

First, load the essential libraries:

#load libraries
import pandas as pd
import numpy as np

from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,mean_squared_error

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

Loading the data:

#read data
airbnb = pd.read_csv('data/AB_NYC_2019.csv')

This is the 2019 Airbnb New York City dataset, and it includes everything needed to learn more about hosts, geographical availability, neighbourhoods, etc. Each listing has a price per night, which is our target variable.

airbnb.head()
id   | name                                             | host_id | host_name   | neighbourhood_group | neighbourhood | latitude | longitude | room_type       | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
2539 | Clean & quiet apt home by the park               | 2787    | John        | Brooklyn            | Kensington    | 40.64749 | -73.97237 | Private room    | 149   | 1              | 9                 | 2018-10-19  | 0.21              | 6                              | 365
2595 | Skylit Midtown Castle                            | 2845    | Jennifer    | Manhattan           | Midtown       | 40.75362 | -73.98377 | Entire home/apt | 225   | 1              | 45                | 2019-05-21  | 0.38              | 2                              | 355
3647 | THE VILLAGE OF HARLEM....NEW YORK !              | 4632    | Elisabeth   | Manhattan           | Harlem        | 40.80902 | -73.94190 | Private room    | 150   | 3              | 0                 | NaN         | NaN               | 1                              | 365
3831 | Cozy Entire Floor of Brownstone                  | 4869    | LisaRoxanne | Brooklyn            | Clinton Hill  | 40.68514 | -73.95976 | Entire home/apt | 89    | 1              | 270               | 2019-07-05  | 4.64              | 1                              | 194
5022 | Entire Apt: Spacious Studio/Loft by central park | 7192    | Laura       | Manhattan           | East Harlem   | 40.79851 | -73.94399 | Entire home/apt | 80    | 10             | 9                 | 2018-11-19  | 0.10              | 1                              | 0

Check whether there are any duplicated rows, and drop the unnecessary columns:

airbnb.drop_duplicates(inplace=True)
airbnb.drop(['name','id','host_name','last_review','host_id','neighbourhood'], axis=1, inplace=True)

Specify the numeric and categorical variables; this will help us in the transformation steps:

#numeric columns: floats and ints
numeric_features = [col for col in airbnb.columns
                    if airbnb[col].dtype == 'float64' or airbnb[col].dtype == 'int64']
print(numeric_features)
#categorical columns: low-cardinality object columns
cat_features = [col for col in airbnb.columns
                if airbnb[col].dtype == 'object' and airbnb[col].nunique() < 10]
print(cat_features)
['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
['neighbourhood_group', 'room_type']

Let's remove the target variable from the feature list (since I want to apply the transformation steps to the feature columns and don't want to change my target variable).

numeric_features.pop(2)
'price'

Let's check the missing-value situation in the dataset:

airbnb.isnull().sum()
neighbourhood_group                   0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

It seems reviews_per_month contains missing values. Let's address them by replacing the null records with the column mean, so here I define an imputer whose strategy is to substitute the mean for null values:

imputer = SimpleImputer(strategy ='mean', missing_values = np.nan)
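
As a toy illustration (hypothetical values, not taken from the dataset), the imputer learns the column mean during fit and substitutes it for every NaN:

#toy demo of mean imputation (hypothetical values)
demo = pd.DataFrame({'reviews_per_month': [1.0, np.nan, 3.0]})
demo_imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
demo_imputer.fit_transform(demo)   # the NaN becomes (1.0 + 3.0) / 2 = 2.0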

For my categorical variables, I define a pipeline with a one-hot encoding step that converts the categorical columns:

cat_imputer =  Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
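
The handle_unknown='ignore' setting deserves a quick illustration (toy categories, hypothetical): a category seen only at prediction time is encoded as all zeros instead of raising an error.

#toy demo of handle_unknown='ignore' (hypothetical categories)
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(pd.DataFrame({'room_type': ['Private room', 'Entire home/apt']}))
enc.transform(pd.DataFrame({'room_type': ['Shared room']})).toarray()   # -> [[0., 0.]]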

Here I am defining a ColumnTransformer, which takes these parameters:

  • remainder: By specifying remainder='passthrough', all remaining columns that were not covered by the transformers are automatically passed through.
  • transformers: A list of tuples specifying the transformer objects to apply to subsets of the data. In the lines above, I defined two transformers: one for the numerical variables, which replaces missing values with the mean, and one for the categorical variables, which converts them with one-hot encoding. This is the place to bind both transformers together.

pre_process = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('num', imputer, numeric_features),
        ('cat', cat_imputer, cat_features)
    ])
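
Before wiring the preprocessor into a model, we can sanity-check what it produces (a hypothetical quick check, not part of the original flow; get_feature_names_out needs a reasonably recent scikit-learn, roughly 1.1+, for names to flow through nested pipelines):

#quick sanity check of the preprocessor output (hypothetical)
X_check = airbnb.drop(['price'], axis=1)        # features only
Xt = pre_process.fit_transform(X_check)
print(Xt.shape)                                 # rows x (numeric + one-hot) columns
print(pre_process.get_feature_names_out())      # names like 'num__latitude', 'cat__room_type_...'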

Here I am defining two models to apply to the data: one is an XGBRegressor and the other is a DecisionTreeRegressor.

XGB_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, objective='reg:squarederror')
DT_model = DecisionTreeRegressor(max_depth=10)

Let's separate the target variable and split the data into training and testing sets:

y = airbnb.price              
X = airbnb.drop(['price'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                    random_state=0)

X_train.columns
Index(['neighbourhood_group', 'latitude', 'longitude', 'room_type',
       'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')

Here I am defining two pipelines. The first one includes:

  • Preprocessing step
  • Modeling step (XGB regressor)

and the second one:

  • Preprocessing step (Same as above)
  • Modeling step (Decision tree regressor)

pipe_xgb = Pipeline(steps=[
    ('preprocessor', pre_process),
    ('model', XGB_model)
])

pipe_dt = Pipeline(steps=[
    ('preprocessor', pre_process),
    ('model', DT_model)
])

pipe_line = [ pipe_xgb, pipe_dt]
mdl_name = ['XGB','Decision Tree']

For each pipeline, fit the model on the training data and predict on the test data:

for name, pipe in zip(mdl_name, pipe_line):
    pipe.fit(X_train, y_train)            # preprocessing and model fitting in one call
    predictions = pipe.predict(X_test)
    mae = np.round(mean_absolute_error(y_test, predictions))
    rmse = np.round(np.sqrt(mean_squared_error(y_test, predictions)), 3)
    print("Model is " + name + ", MAE: " + str(mae) + ", RMSE: " + str(rmse))
Model is XGB, MAE: 67.0, RMSE: 232.755
Model is Decision Tree, MAE: 70.0, RMSE: 294.749
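
A nice side effect of bundling preprocessing and model into one estimator (a sketch, not part of the original run; the exact scores will vary): the whole pipeline can be handed to cross_val_score, so the imputer is re-fit inside every fold and no information leaks from the validation rows.

#cross-validate the entire pipeline (sketch)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe_dt, X, y, cv=5, scoring='neg_mean_absolute_error')
print(-scores.mean())   # average MAE across the 5 folds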

Comparing the MAE and RMSE of the two runs, we can see that the XGB regressor performs better than the decision tree. In any case, the objective of this example was not to train the best model for our data, but to show how easily we can automate and organize our data modeling steps using Pipeline and ColumnTransformer.
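
And because the fitted pipeline is a single object, moving it toward the production environment mentioned at the start can be as simple as persisting it (a minimal sketch using joblib; the file name is arbitrary):

#persist the fitted pipeline as one artifact (sketch)
import joblib
joblib.dump(pipe_xgb, 'airbnb_price_pipeline.joblib')
loaded = joblib.load('airbnb_price_pipeline.joblib')
loaded.predict(X_test[:5])   # preprocessing and model travel together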

Author: Pari
