Here are some examples of common preprocessing steps on different types of features:
- Imputation: Replacing missing values with the mean or median.
- Log transform: Taking the log of right-skewed numeric features, or applying its more general cousin, the Box-Cox transform (a quick sketch of both follows this list).
- Scaling: Standardizing numeric features so values are on similar scales.
- Encoding: Transforming categorical variables to numeric features.
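For example, a log transform and its Box-Cox generalization might look like this (a minimal sketch on synthetic data; PowerTransformer is Scikit-learn's implementation of the Box-Cox family):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# A right-skewed, strictly positive feature
x = rng.lognormal(mean=3, sigma=1, size=1000).reshape(-1, 1)

# Plain log transform
x_logged = np.log(x)

# Box-Cox: Scikit-learn estimates the transform's lambda parameter from the data
# (Box-Cox requires strictly positive values)
x_boxcox = PowerTransformer(method='box-cox').fit_transform(x)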
Some of these methods, such as taking a log transform, can be done on the full dataset, as they don't require estimating any parameters from the data. Others cannot: standard scaling requires an estimate of the mean and variance, and imputation requires an estimate of the mean or median. Technically, these steps should not be fit on the full dataset. It is easy enough to fit on the training data and then transform the test data. Slightly more subtle is the issue of cross-validation: these steps should really be fit separately on the training folds, and used to transform the corresponding validation fold. Otherwise we are leaking information from the validation set into our preprocessing.
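For example, with a StandardScaler (a minimal sketch with made-up arrays):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_toy = np.array([[2.5], [10.0]])

scaler = StandardScaler()
# Estimate the mean and variance from the training data only...
X_train_toy_scaled = scaler.fit_transform(X_train_toy)
# ...then reuse those same estimates on the test data (no refitting)
X_test_toy_scaled = scaler.transform(X_test_toy)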
As a practical matter, at least with standard scaling and imputation, the validation set shouldn't have a huge effect on the mean, median, or variance. More importantly, if you want to scale some variables and log transform others (for example), it is tedious to select the features you want to transform, do the transforms, and then paste the transformed features back together.
A recent addition to Scikit-learn is the ColumnTransformer, which allows us to specify different transformations per column, making preprocessing a lot less tedious (and a lot easier to read!). As a bonus, it is easy to put a ColumnTransformer into a pipeline and do the scaling and imputation correctly. There is a nice example of doing this in the documentation; this article is a more detailed walkthrough of that example.
In this article, we will first walk through a fairly simple example to see how to use a ColumnTransformer, and then a more involved example to really see how valuable it is.
Toy example: predict gender from height, weight, and colorblindness¶
In this example, we are going to make a simple predictor for predicting gender based on someone's weight (in pounds), height (in inches), and whether or not they are colorblind. It is difficult to think of an application of this model where we know someone's weight, height, and information about their vision but don't know their gender, but it is a simple example with only a few features.
This dataset has 10k rows; let's look at the first few:
We see we have some missing values that we will need to fill. We will use logistic regression, which is a regularized model. So we have two things we need to do:
- Impute missing values
- Standard scale the continuous features, height and weight
Let's look at the number of missing values that we have:
The good news is that we don't have any missing weights! But in an ideal world, our imputation strategy would be different for Height and Colorblind:
- For Height, it is natural to choose the mean or median.
- For Colorblind, it is natural to choose the mode (choosing the mean would give a number that is not 0 or 1!).
We could do both in "one shot" by picking the median, because for the special case of a binary feature the median is also the mode. Instead, let's replace the missing Height values with the mean, to capture some of the difficulties that would be present in more complicated examples.
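Before moving on, here is a quick illustration of the two imputation strategies (a minimal sketch on made-up arrays, not our dataset):

import numpy as np
from sklearn.impute import SimpleImputer

heights = np.array([[60.0], [66.0], [np.nan], [72.0]])
colorblind = np.array([[0.0], [0.0], [np.nan], [1.0]])

# Mean imputation suits a continuous feature like height
SimpleImputer(strategy='mean').fit_transform(heights)
# -> the missing height becomes 66.0, the mean of 60, 66, and 72

# Mode imputation keeps a binary feature binary
SimpleImputer(strategy='most_frequent').fit_transform(colorblind)
# -> the missing value becomes 0.0, the most frequent value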
Attempt 1: Apply same transformations to all columns¶
One way to approach this problem is to just do mean imputation and standard scaling on all the columns, like so:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline_steps = [
    ('mean_imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
]
preprocess_pipeline = Pipeline(pipeline_steps)

X_train_scaled = preprocess_pipeline.fit_transform(X_train)

# Look at the first 4 rows
X_train_scaled[:4]
One of the advantages of this approach is that the pipeline makes it easy to apply these steps to the test data. We can also put our pipeline into GridSearchCV, and it will automatically do imputation and scaling using only the training folds (i.e. no data leakage from the validation fold into the preprocessing).
Some of the downsides to this approach:
- We were lucky that we could get away with standard scaling all the columns in this case.
- Once we scaled our examples, we lost the binary nature of the colorblind feature. This is going to make it harder to interpret the last column.
- Because we did mean imputation, the missing colorblind values were replaced with the mean (i.e. not 0 or 1) before scaling. This means there are three distinct values in this column: actually colorblind, imputed colorblind, and not colorblind. Looking at the values in this column confirms it.
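One way to check (a quick sketch; we are assuming the Colorblind feature ends up as the last column of the transformed array, which depends on the column order of X_train):

import numpy as np

# Distinct values in the scaled colorblind column. We expect exactly three:
# "not colorblind", "imputed" (the scaled mean), and "colorblind".
np.unique(X_train_scaled[:, -1])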
We can still construct a Logistic Regression model from this, even if we would have preferred not to do mean imputation on the colorblind values. We can compose the preprocessing pipeline with the model and do a grid search for the appropriate amount of regularization:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Do preprocessing and modeling in one step. Pipelines can be attached to
# one another!
model_pipeline = Pipeline([
    ('preprocessing', preprocess_pipeline),
    ('classifier', LogisticRegression(solver='lbfgs'))
])

param = {
    'classifier__C': [1e-4, 1e-2, 1e-1, 1, 3, 10, 30, 100]
}

grid = GridSearchCV(model_pipeline, param_grid=param, cv=5).fit(X_train, y_train)

# How well do we do on the validation set vs the test set?
# Note we don't need to separately scale / impute the test set
test_score = grid.score(X_test, y_test)
Attempt 2: Using the column transformer¶
Let's see how we can use the ColumnTransformer to address the same problem. We will look at two types of processing steps:
- For the heights and weights, we will do mean imputation, followed by standard scaling.
- For the colorblindness, we will only do mode imputation.
Technically we don't need any imputation on the weights for this dataset, but it is nice to include it in case we apply the pipeline to future data where the weights are missing.
The ColumnTransformer works in a similar way to a pipeline: you feed it a list of tuples. Each tuple contains the name of the step, the transformation you want to apply, and a list of columns you want to apply it to. It is this last element that makes it different from an ordinary pipeline. Let's see it in action:
from sklearn.compose import ColumnTransformer

cnts_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

colorblind_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent'))
])

preprocess_pipeline = ColumnTransformer([
    ('continuous', cnts_pipeline, ['Height', 'Weight']),
    ('colorblind', colorblind_pipeline, ['Colorblind'])
])

X_train_processed = preprocess_pipeline.fit_transform(X_train)
X_train_processed[:4]
The ColumnTransformer has applied the cnts_pipeline to the Height and Weight columns, and the colorblind_pipeline to the Colorblind column. Note that the output has the same scaled heights and weights as before, while preserving the binary nature of the colorblind column.
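If you want to double-check what went where, the fitted ColumnTransformer exposes its sub-pipelines (a quick sketch using the names we gave above):

# Peek at what each sub-pipeline learned when it was fit
fitted = preprocess_pipeline.named_transformers_

# Means learned for Height and Weight by the continuous pipeline's imputer
fitted['continuous'].named_steps['impute'].statistics_

# Mode learned for Colorblind by its imputer
fitted['colorblind'].named_steps['impute'].statistics_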
This preprocessing is a little longer than the previous version, but we have been able to specify exactly what we want done to each column, rather than compromising for convenience. We can now use the exact same code as above to make our model:
from sklearn.linear_model import LogisticRegression

# Do preprocessing and modeling in one step. Pipelines can be attached to
# one another!
model_pipeline = Pipeline([
    ('preprocessing', preprocess_pipeline),
    ('classifier', LogisticRegression(solver='lbfgs'))
])

param = {
    'classifier__C': [1e-4, 1e-2, 1e-1, 1, 3, 10, 30, 100]
}

grid = GridSearchCV(model_pipeline, param_grid=param, cv=5).fit(X_train, y_train)

# How well do we do on the validation set vs the test set?
# Note we don't need to separately scale / impute the test set
test_score = grid.score(X_test, y_test)
Let's see how we can use the ColumnTransformer on a less trivial example.
Loan data¶
We will try to predict whether people pay their loans back using a Logistic Regression model. Unlike tree-based models, logistic regression requires us to scale the numerical features. This was originally LendingClub data, but to keep the problem simple, some of the features have been binarized, and the categorical features have been dropped. The dataset we will be using is available here. Let's start by looking at the data:
This data doesn't have any missing values, but it has dollar amounts, which tend to be heavily right-skewed (i.e. most of the values are bunched on the left, with a long tail to the right). It isn't obvious what some of these columns are, so it makes sense to get the definition of each of the features, as well as a measure of how skewed they are:
For our processing, we want to:
- Log the highly skewed values
- Standard scale the numerical values we don't log (logs usually compress the range of a variable enough that we don't need to scale the logged values; see the quick sketch after this list)
- Keep the boolean values untouched
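To see why the logged values usually don't need scaling, here is a small sketch on synthetic right-skewed data (not the loan dataset):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A synthetic right-skewed "dollar amount" feature
amounts = pd.Series(rng.lognormal(mean=9, sigma=1, size=10_000))

print(amounts.skew(), amounts.min(), amounts.max())
# Heavily skewed, with values spanning several orders of magnitude

logged = np.log1p(amounts)
print(logged.skew(), logged.min(), logged.max())
# Roughly symmetric, with a range of only a few units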
To do this, we will make a ColumnTransformer with two sets of transformations: one for the highly skewed features, and another for the remaining numerical (but non-boolean) features. First let's store the skews in a Series:
feature_skew = X_train.select_dtypes(include=[np.number]).skew()
feature_skew
Now we want to break our features into groups.
log_features = feature_skew[abs(feature_skew) > 0.9].index
scale_features = [name for name in feature_skew.index if name not in log_features]
boolean_features = X_train.select_dtypes(exclude=[np.number]).columns
We could also have manually specified which columns we want in each group; that would be equivalent.
We have one more obstacle before using our ColumnTransformer, which is how to take the log of the log_features. We are familiar with StandardScaler, but there isn't a built-in transformer for taking the log of a column. There is a function (namely np.log), but this isn't a transformer -- it doesn't have a fit or transform method.
The problem of wanting to apply a function that isn't technically a transformer to a column is common, so Scikit-learn introduced the FunctionTransformer. A FunctionTransformer takes a function (such as np.log) and makes a transformer that does nothing when fit is called, but calls the function when transform is called. It is easy to use.
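For instance (a minimal sketch on a toy array, not part of the article's pipeline), wrapping np.log1p gives us a transformer we can drop into any pipeline:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, validate=False)

X_toy = np.array([[0.0], [9.0], [99.0]])
# fit learns nothing; transform just applies np.log1p element-wise
log_transformer.fit_transform(X_toy)
# -> [[0.], [2.302...], [4.605...]]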
Let's see how it gets used in our example. So that we don't have to worry about taking the log of 0, we will take np.log1p of the log_features rather than np.log:
from sklearn.preprocessing import FunctionTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('log', FunctionTransformer(np.log1p, validate=False), log_features),
        ('scale', StandardScaler(), scale_features)
    ],
    remainder='passthrough'
)
That's all there is to it! It took far longer to describe in words than to code. By default, a ColumnTransformer will drop any columns that were not transformed, but this can be overridden (as it was here) with the remainder='passthrough' argument, which passes the untouched columns (here, the boolean features) through unchanged.
Now let's train our model:
log_reg = Pipeline(steps=[
('preprocess', preprocessor),
('logisitic', LogisticRegression(C=0.02, solver='lbfgs', max_iter=10000, class_weight='balanced'))
]).fit(X_train, y_train);
The model's performance isn't that great: looking at accuracy and recall shows pretty poor results (note that the default rate is about 16%, but we did use class weights to balance the classes).
import pandas as pd
from sklearn.metrics import recall_score

pd.DataFrame([
    [log_reg.score(X_train, y_train), recall_score(y_train, log_reg.predict(X_train))],
    [log_reg.score(X_test, y_test), recall_score(y_test, log_reg.predict(X_test))],
], columns=['Accuracy', 'Recall'], index=['Train', 'Test'])
The poor results are due to the number of informative features we threw away in order to keep this example self-contained (as well as our use of Logistic Regression instead of a tree-based model, which would require less preprocessing). That is okay; our goal in this article was to show how to use the ColumnTransformer on non-trivial problems, rather than to build a good classifier.
We will revisit this dataset in a later article, where we will use CategoryEncoders along with the ColumnTransformer to build a much better model.
Summary¶
- A ColumnTransformer is written a lot like a pipeline, except that when specifying the steps we also specify which columns each step should be applied to.
- This allows us to scale some features while leaving the one-hot encoded or binary features alone (which makes interpreting the final coefficients of those features simpler: the values are still 0 or 1, so the coefficient is the contribution of the feature being present vs absent).
- It also allows you to choose different imputation methods on a column-by-column basis.
- By placing the ColumnTransformer in a pipeline with your model, you can easily do your preprocessing inside GridSearchCV and not worry about data leakage. The alternatives are either transforming the entire training set (which leaks data into your validation folds, making CV scores too optimistic) or doing your cross-validation manually. For most types of preprocessing this isn't a huge issue (except when upsampling, as discussed here), but it is nice that the ColumnTransformer makes it easy to do the right thing.
Of course, categorical variables have their own set of challenges and transformations. To tackle categorical variables, we will introduce the CategoryEncoders package.
Related articles¶
On my blog¶
- An introduction to pipelines: What they are, and why you should be using them (to come)
- How to do cross-validation when upsampling: An important example of data leakage in cross-validation.
- Why pd.get_dummies is Evil, use CategoryEncoders instead: The problems of using get_dummies and how the new CategoryEncoders package addresses these issues (to come)
- Imputation by Group: How to do imputation on groups (to come)