Here are some examples of common preprocessing steps on different types of features:
- Imputation: Replacing missing values with the mean or median.
- Log transform: Taking the log of right-skewed numeric features, or applying its more general cousin, the Box-Cox transform (a quick sketch of both follows this list).
- Scaling: Standardizing numeric features so values are on similar scales.
- Encoding: Transforming categorical variables to numeric features.
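For example, a log transform and its Box-Cox generalization might look like this (a minimal sketch on synthetic data; PowerTransformer is Scikit-learn's implementation of the Box-Cox family):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# A right-skewed, strictly positive feature
x = rng.lognormal(mean=3, sigma=1, size=1000).reshape(-1, 1)

# Plain log transform
x_logged = np.log(x)

# Box-Cox: Scikit-learn estimates the transform's lambda parameter from the data
# (Box-Cox requires strictly positive values)
x_boxcox = PowerTransformer(method='box-cox').fit_transform(x)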
Some of these methods, such as taking a log transform, can be done on the full dataset, as they don't require estimating any parameters from the data. Others cannot: standard scaling requires an estimate of the mean and variance, and imputation requires an estimate of the mean or median. Technically, these steps should not be fit on the full dataset. It is easy enough to fit on the training data and then transform the test data. Slightly more subtle is the issue of cross-validation: these steps should really be fit separately on the training folds, and used to transform the corresponding validation fold. Otherwise we are leaking information from the validation set into our preprocessing.
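For example, with a StandardScaler (a minimal sketch with made-up arrays):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_toy = np.array([[2.5], [10.0]])

scaler = StandardScaler()
# Estimate the mean and variance from the training data only...
X_train_toy_scaled = scaler.fit_transform(X_train_toy)
# ...then reuse those same estimates on the test data (no refitting)
X_test_toy_scaled = scaler.transform(X_test_toy)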
As a practical matter, at least with standard scaling and imputation, the validation set shouldn't have a huge effect on the mean, median, or variance. More importantly, if you want to scale some variables and log transform others (for example), it is tedious to select the features you want to transform, do the transforms, and then paste the transformed features back together.
A recent addition to Scikit-learn is the ColumnTransformer, which allows us to specify different transformations per column, making preprocessing a lot less tedious (and a lot easier to read!). As a bonus, it is easy to put a ColumnTransformer into a pipeline and do the scaling and imputation correctly. There is a nice example of doing this in the documentation; this article is a more detailed walkthrough of that example.
In this article, we will first walk through a fairly simple example to see how to use a ColumnTransformer, and then a more involved example to really see how valuable it is.
Toy example: predict gender from height, weight, and colorblindness¶
In this example, we are going to make a simple predictor for predicting gender based on someone's weight (in pounds), height (in inches), and whether or not they are colorblind. It is difficult to think of an application of this model where we know someone's weight, height, and information about their vision but don't know their gender, but it is a simple example with only a few features.
This dataset has 10k rows; let's look at the first few:
We see we have some missing values that we will need to fill. We will use logistic regression, which is a regularized model. So we have two things we need to do:
- Impute missing values
- Standard scale the continuous features, height and weight
Let's look at the number of missing values that we have:
The good news is that we don't have any missing weights! But in an ideal world, our imputation strategy would be different for Height and Colorblind:
- For Height, it is natural to choose the mean or median.
- For Colorblind, it is natural to choose the mode (choosing the mean would give a number that is not 0 or 1!).
We could do both in "one shot" by picking the median, because for the special case of a binary feature the median is also the mode. Instead, let's replace the missing Height values with the mean, to capture some of the difficulties that would be present in more complicated examples.
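Before moving on, here is a quick illustration of the two imputation strategies (a minimal sketch on made-up arrays, not our dataset):

import numpy as np
from sklearn.impute import SimpleImputer

heights = np.array([[60.0], [66.0], [np.nan], [72.0]])
colorblind = np.array([[0.0], [0.0], [np.nan], [1.0]])

# Mean imputation suits a continuous feature like height
SimpleImputer(strategy='mean').fit_transform(heights)
# -> the missing height becomes 66.0, the mean of 60, 66, and 72

# Mode imputation keeps a binary feature binary
SimpleImputer(strategy='most_frequent').fit_transform(colorblind)
# -> the missing value becomes 0.0, the most frequent value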
Attempt 1: Apply same transformations to all columns¶
One way to approach this problem is to just do mean imputation and standard scaling on all the columns, like so:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline_steps = [
    ('mean_imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
]
preprocess_pipeline = Pipeline(pipeline_steps)

X_train_scaled = preprocess_pipeline.fit_transform(X_train)

# Look at the first 4 rows
X_train_scaled[:4]
One of the advantages of this approach is that the pipeline makes it easy to apply these steps to the test data. We can also put our pipeline into GridSearchCV, and it will automatically do imputation and scaling using only the training folds (i.e. no data leakage from the validation fold into the preprocessing).
Some of the downsides to this approach:
- We were lucky that we could get away with standard scaling all the columns in this case.
- Once we scaled our examples, we lost the binary nature of the colorblind feature. This is going to make it harder to interpret the last column.
- Because we did mean imputation, the missing colorblind values were replaced with the mean (i.e. not 0 or 1) before scaling. This means there are three distinct values in this column: actually colorblind, imputed colorblind, and not colorblind. Looking at the values in this column confirms it.
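One way to check (a quick sketch; we are assuming the Colorblind feature ends up as the last column of the transformed array, which depends on the column order of X_train):

import numpy as np

# Distinct values in the scaled colorblind column. We expect exactly three:
# "not colorblind", "imputed" (the scaled mean), and "colorblind".
np.unique(X_train_scaled[:, -1])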
We can still construct a Logistic Regression model from this, even if we would have preferred not to do mean imputation on the colorblind values. We can compose the preprocessing pipeline with the model and do a grid search for the appropriate amount of regularization:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Do preprocessing and modeling in one step. Pipelines can be attached to
# one another!
model_pipeline = Pipeline([
    ('preprocessing', preprocess_pipeline),
    ('classifier', LogisticRegression(solver='lbfgs'))
])

param = {
    'classifier__C': [1e-4, 1e-2, 1e-1, 1, 3, 10, 30, 100]
}

grid = GridSearchCV(model_pipeline, param_grid=param, cv=5).fit(X_train, y_train)

# How well do we do on the validation set vs the test set?
# Note we don't need to separately scale / impute the test set
test_score = grid.score(X_test, y_test)
Attempt 2: Using the column transformer¶
Let's see how we can use the ColumnTransformer to address the same problem. We will look at two types of processing steps:
- For the heights and weights, we will do mean imputation, followed by standard scaling.
- For the colorblindness, we will only do mode imputation.
Technically we don't need any imputation on the weights for this dataset, but it is nice to include it in case we apply the pipeline to future data where the weights are missing.
The ColumnTransformer works in a similar way to a pipeline: you feed it a list of tuples. Each tuple contains the name of the step, the transformation you want to apply, and a list of columns you want to apply it to. It is this last element that makes it different from an ordinary pipeline. Let's see it in action:
from sklearn.compose import ColumnTransformer

cnts_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

colorblind_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent'))
])

preprocess_pipeline = ColumnTransformer([
    ('continuous', cnts_pipeline, ['Height', 'Weight']),
    ('colorblind', colorblind_pipeline, ['Colorblind'])
])

X_train_processed = preprocess_pipeline.fit_transform(X_train)
X_train_processed[:4]
The ColumnTransformer has applied the cnts_pipeline to the Height and Weight columns, and the colorblind_pipeline to the Colorblind column. Note that the output has the same scaled heights and weights as before, while preserving the binary nature of the colorblind column.
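If you want to double-check what went where, the fitted ColumnTransformer exposes its sub-pipelines (a quick sketch using the names we gave above):

# Peek at what each sub-pipeline learned when it was fit
fitted = preprocess_pipeline.named_transformers_

# Means learned for Height and Weight by the continuous pipeline's imputer
fitted['continuous'].named_steps['impute'].statistics_

# Mode learned for Colorblind by its imputer
fitted['colorblind'].named_steps['impute'].statistics_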
This preprocessing is a little longer than the previous version, but we have been able to specify exactly what we want done to each column, rather than compromising for convenience. We can now use the exact same code as above to make our model:
from sklearn.linear_model import LogisticRegression

# Do preprocessing and modeling in one step. Pipelines can be attached to
# one another!
model_pipeline = Pipeline([
    ('preprocessing', preprocess_pipeline),
    ('classifier', LogisticRegression(solver='lbfgs'))
])

param = {
    'classifier__C': [1e-4, 1e-2, 1e-1, 1, 3, 10, 30, 100]
}

grid = GridSearchCV(model_pipeline, param_grid=param, cv=5).fit(X_train, y_train)

# How well do we do on the validation set vs the test set?
# Note we don't need to separately scale / impute the test set
test_score = grid.score(X_test, y_test)
Let's see how we can use the ColumnTransformer on a less trivial example.
Loan data¶
We will try to predict whether people pay their loans back using a Logistic Regression model. Unlike tree-based models, logistic regression requires us to scale the numerical features. This was originally LendingClub data, but to keep the problem simple, some of the features have been binarized, and the categorical features have been dropped. The dataset we will be using is available here. Let's start by looking at the data:
This data doesn't have any missing values, but it has dollar amounts, which tend to be heavily right-skewed (i.e. most of the values are bunched on the left, with a long tail to the right). It isn't obvious what some of these columns are, so it makes sense to get the definition of each of the features, as well as a measure of how skewed they are:
For our processing, we want to:
- Log the highly skewed values
- Standard scale the numerical values we don't log (logs usually compress the range of a variable enough that we don't need to scale the logged values; see the quick sketch after this list)
- Keep the boolean values untouched
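To see why the logged values usually don't need scaling, here is a small sketch on synthetic right-skewed data (not the loan dataset):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A synthetic right-skewed "dollar amount" feature
amounts = pd.Series(rng.lognormal(mean=9, sigma=1, size=10_000))

print(amounts.skew(), amounts.min(), amounts.max())
# Heavily skewed, with values spanning several orders of magnitude

logged = np.log1p(amounts)
print(logged.skew(), logged.min(), logged.max())
# Roughly symmetric, with a range of only a few units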
To do this, we will make a ColumnTransformer with two sets of transformations: one for the highly skewed features, and another for the remaining numerical (but non-boolean) features. First let's store the skews in a Series:
feature_skew = X_train.select_dtypes(include=[np.number]).skew()
feature_skew
Now we want to break our features into groups.
log_features = feature_skew[abs(feature_skew) > 0.9].index
scale_features = [name for name in feature_skew.index if name not in log_features]
boolean_features = X_train.select_dtypes(exclude=[np.number]).columns
We could also have manually specified which columns we want in each group; that would be equivalent.
We have one more obstacle before using our ColumnTransformer, which is how to take the log of the log_features. We are familiar with StandardScaler, but there isn't a built-in transformer for taking the log of a column. There is a function (namely np.log), but this isn't a transformer -- it doesn't have a fit or transform method.
The problem of wanting to apply a function that isn't technically a transformer to a column is common, so Scikit-learn introduced the FunctionTransformer. A FunctionTransformer takes a function (such as np.log) and makes a transformer that does nothing when fit is called, but calls the function when transform is called. It is easy to use.
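For instance (a minimal sketch on a toy array, not part of the article's pipeline), wrapping np.log1p gives us a transformer we can drop into any pipeline:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, validate=False)

X_toy = np.array([[0.0], [9.0], [99.0]])
# fit learns nothing; transform just applies np.log1p element-wise
log_transformer.fit_transform(X_toy)
# -> [[0.], [2.302...], [4.605...]]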
Let's see how it gets used in our example. So that we don't have to worry about taking the log of 0, we will take np.log1p of the log_features rather than np.log:
from sklearn.preprocessing import FunctionTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('log', FunctionTransformer(np.log1p, validate=False), log_features),
        ('scale', StandardScaler(), scale_features)
    ],
    remainder='passthrough'
)
That's all there is to it! It took far longer to describe in words than to code. By default, a ColumnTransformer will drop any columns that were not transformed, but this can be overridden (as it was here) with the remainder='passthrough' argument, which passes the untouched columns (here, the boolean features) through unchanged.
Now let's train our model:
log_reg = Pipeline(steps=[
('preprocess', preprocessor),
('logisitic', LogisticRegression(C=0.02, solver='lbfgs', max_iter=10000, class_weight='balanced'))
]).fit(X_train, y_train);
The model's performance isn't that great: looking at accuracy and recall shows pretty poor results (note that the default rate is about 16%, but we did use class weights to balance the classes).
import pandas as pd
from sklearn.metrics import recall_score

pd.DataFrame([
    [log_reg.score(X_train, y_train), recall_score(y_train, log_reg.predict(X_train))],
    [log_reg.score(X_test, y_test), recall_score(y_test, log_reg.predict(X_test))],
], columns=['Accuracy', 'Recall'], index=['Train', 'Test'])
The poor results are due to the number of informative features we threw away in order to keep this example self-contained (as well as our use of Logistic Regression instead of a tree-based model, which would require less preprocessing). That is okay; our goal in this article was to show how to use the ColumnTransformer on non-trivial problems, rather than to build a good classifier.
We will revisit this dataset in a later article, where we will use CategoryEncoders along with the ColumnTransformer to build a much better model.
Summary¶
- A ColumnTransformer is written a lot like a pipeline, except that when specifying the steps we also specify which columns each step should be applied to.
- This allows us to scale some features while leaving the one-hot encoded or binary features alone (which makes interpreting the final coefficients of those features simpler: the values are still 0 or 1, so the coefficient is the contribution of the feature being present vs absent).
- It also allows you to choose different imputation methods on a column-by-column basis.
- By placing the ColumnTransformer in a pipeline with your model, you can easily do your preprocessing inside GridSearchCV and not worry about data leakage. The alternatives are either transforming the entire training set (which leaks data into your validation folds, making CV scores too optimistic) or doing your cross-validation manually. For most types of preprocessing this isn't a huge issue (except when upsampling, as discussed here), but it is nice that the ColumnTransformer makes it easy to do the right thing.
Of course, categorical variables have their own set of challenges and transformations. To tackle categorical variables, we will introduce the CategoryEncoders package.
Related articles¶
On my blog¶
- An introduction to pipelines: What they are, and why you should be using them (to come)
- How to do cross-validation when upsampling: An important example of data leakage in cross-validation.
- Why pd.get_dummies is Evil, use CategoryEncoders instead: The problems of using get_dummies and how the new CategoryEncoders package addresses these issues (to come)
- Imputation by Group: How to do imputation on groups (to come)