How we convert categorical features (non-numeric features without an inherent order) into numbers can affect the performance of our machine learning models.
The article "Encoding Categorical Features" discusses different encoding techniques and the factors that influence your choice of encoding. This article looks at one particularly effective scheme, known variously as target, impact, or James-Stein encoding.
The basic idea
The rough idea for this family of encoders is (a short pandas sketch follows the list):
- For regression: replace category value \(X_i\) with the average value of the target over that group.
- For binary classification: replace category value \(X_i\) with the proportion of instances with that value that belong to the positive class.
- For multi-class classification: replace category value \(X_i\) with one value per class, namely the proportion of instances with that value that belong to each of the classes.
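To make this concrete before the worked example below, here is a minimal pandas sketch of the regression and multi-class cases (the data and column names here are made up purely for illustration):
import pandas as pd
# Toy data: one categorical feature with a numeric target (for regression)
# and a multi-class target (made-up values)
df = pd.DataFrame({
    'city': ['NY', 'NY', 'LA', 'LA', 'SF'],
    'price': [300, 340, 500, 520, 700],      # regression target
    'segment': ['a', 'b', 'a', 'a', 'b'],    # multi-class target
})
# Regression: replace each city with the mean price for that city
df['city_encoded'] = df.groupby('city')['price'].transform('mean')
# Multi-class: one new column per class, holding the proportion of each
# class within each city
proportions = pd.crosstab(df['city'], df['segment'], normalize='index')
df = df.join(proportions.add_prefix('city_segment_'), on='city')
print(df)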
Let's see this with an example. Suppose we had a dataset with 10k people, and the data took the form
Gender | Race | Age | Height (cm) | Colorblind |
---|---|---|---|---|
Male | White | 26 | 167 | False |
Female | White | 22 | 158 | False |
Female | Asian | 29 | 162 | False |
Male | Black | 23 | 172 | True |
Male | White | 24 | 169 | True |
We find the following percentages of people are colorblind, broken down by gender:
Gender | % of training set that is colorblind |
---|---|
Male | 8.0% |
Female | 0.5% |
and by race:
Race | % of training set that is colorblind |
---|---|
Asian | 2.5% |
Black | 2.0% |
Hispanic | 4.0% |
White | 4.0% |
Target encoding would replace each categorical variable with the percentage of colorblind people in that group. After transformation, we would get
Gender | Race | Age | Height (cm) | Colorblind |
---|---|---|---|---|
0.080 | 0.040 | 26 | 167 | False |
0.005 | 0.040 | 22 | 158 | False |
0.005 | 0.025 | 29 | 162 | False |
0.080 | 0.020 | 23 | 172 | True |
0.080 | 0.040 | 24 | 169 | True |
Note that we are not looking at the combined categories (i.e. we are not looking at the percentage of Asian females that are colorblind, or of Hispanic men). Each feature is encoded separately. This is similar in spirit to Naive Bayes.
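For the binary case we just walked through, the encoding is nothing more than a group-wise mean of the 0/1 target, computed separately per column. Here is a minimal sketch using a few rows shaped like the table above (with so little data the numbers will of course not match the 10k-person percentages):
import pandas as pd
# A handful of rows in the shape of the table above
colorblind = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Race': ['White', 'White', 'Asian', 'Black', 'White'],
    'Colorblind': [False, False, False, True, True],
})
# Plain target encoding: replace each categorical column, independently,
# with the mean of the 0/1 target within each category
target = colorblind['Colorblind'].astype(float)
for col in ['Gender', 'Race']:
    colorblind[col] = target.groupby(colorblind[col]).transform('mean')
print(colorblind)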
Being careful with the encoder
Unlike other encoding methods (such as one-hot encoding), this method uses knowledge of the target. If you use it, it is important to fit the encoder only on the training set, after splitting into training and testing sets, and to refit the encoding within each fold of cross-validation. We will demonstrate how to perform target encoding in a cross-validation-safe way below.
Difference between James-Stein and target encoding
There is a big difference between knowing that 4% of a population of 4000 is colorblind versus 4% of a population of 50. In the former case, we are reasonably confident the proportion is close to 4%. In the latter case, only two people out of 50 are colorblind, and the estimate is very susceptible to random noise.
The James-Stein encoder shrinks the group average toward the overall average. If \(p_i\) is the proportion of colorblind people in group \(i\), and \(p_{\text{all}}\) is the overall proportion of people that are colorblind in our sample set, the encoded value is
\[ \text{encoding}_i = (1 - B)\, p_i + B\, p_{\text{all}} \]
where \(B\) is the weight of the population mean, and \(1-B\) is the weight of the group mean (with the total weight being 1).
There are different methods for calculating \(B\), as discussed in the documentation, but the default one in category encoders is called the "independent model". For each category \(i\) we have
\[ B_i = \frac{\operatorname{var}(\text{group } i)}{\operatorname{var}(\text{group } i) + \operatorname{var}(\text{population})} \]
where each variance is the squared standard error of the corresponding mean. When we are uncertain about a group's value (i.e. the group variance is high compared to the population variance), then \(B \approx 1\), and we are heavily biased toward the population value. When the group variance is much lower than the population variance, \(B \approx 0\) and we use the value for the group instead.
For a proportion problem, the explicit formula for \(B\) is obtained using the variance (squared standard error) of the group and population proportions:
\[ B_i = \frac{p_i(1 - p_i)/N_i}{p_i(1 - p_i)/N_i + p_{\text{all}}(1 - p_{\text{all}})/N_{\text{all}}} \]
where \(p_i\) is the proportion in the group, \(N_i\) is the number in the group, and \(N_{\text{all}}\) is the number in the population.
Let's see how this plays out with our colorblind example. Encoding the race variable, let's say we had
Race | Number in sample | Number colorblind | Proportion |
---|---|---|---|
Asian | 4000 | 100 | 2.5% |
Black | 1950 | 39 | 2.0% |
Hispanic | 50 | 2 | 4.0% |
White | 4000 | 160 | 4.0% |
Total | 10000 | 301 | 301/10000 ≈ 3.0% |
These are the same percentages we saw earlier, but now we are including the sample sizes as well. The squared standard error for the entire population is
\[ \frac{p_{\text{all}}(1 - p_{\text{all}})}{N_{\text{all}}} = \frac{0.03 \times 0.97}{10000} \approx 2.9 \times 10^{-6} \]
Each group has its own \(B\) (the weight of the overall mean), which we calculate below. We put everything in units of \(10^{-6}\) to allow easy comparison:
Race | Squared standard error | \(B\) |
---|---|---|
Asian | \((0.025)(1 - 0.025)/4000 \approx 6.1 \times 10^{-6}\) | 0.6778 |
Black | \(10.1 \times 10^{-6}\) | 0.7769 |
Hispanic | \(768.0 \times 10^{-6}\) | 0.9962 |
White | \(9.6 \times 10^{-6}\) | 0.7680 |
We see that, in this data set, the overall mean would have the greatest effect on the Hispanic encoding and the least effect on the Asian encoding.
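As a quick sanity check on the arithmetic, here is a short Python sketch that computes the squared standard errors, the \(B\) values, and the resulting encodings from the counts in the table (small differences from the table above are just rounding):
# Counts from the table: (number in sample, number colorblind)
groups = {'Asian': (4000, 100), 'Black': (1950, 39),
          'Hispanic': (50, 2), 'White': (4000, 160)}
n_all = sum(n for n, _ in groups.values())            # 10000
p_all = sum(k for _, k in groups.values()) / n_all    # 301 / 10000
var_all = p_all * (1 - p_all) / n_all                 # squared std error of the overall proportion
for race, (n, k) in groups.items():
    p = k / n                                         # group proportion
    var_group = p * (1 - p) / n                       # squared std error of the group proportion
    B = var_group / (var_group + var_all)             # weight given to the overall mean
    encoding = (1 - B) * p + B * p_all                # James-Stein encoded value
    print(f'{race:>8}: var = {var_group:.1e}, B = {B:.4f}, encoding = {encoding:.4f}')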
Doing it in code
To use this encoder, we first need to install the category_encoders package:
conda install -c conda-forge category_encoders
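If you prefer pip, the same package is also available on PyPI:
pip install category_encoders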
No train-test split
Let's make a dataframe and encode it:
import category_encoders as ce
import pandas as pd
# Some fake data loaded from GitHub
colorblind = pd.read_csv('https://........')
# Build the encoder
encoder = ce.JamesSteinEncoder(cols=['Gender', 'Race'])
# Encode the frame and view it; the target column is passed so the
# encoder can learn the per-group statistics
colorblind_transformed = encoder.fit_transform(colorblind, colorblind['Colorblind'])
# Look at the first few rows
colorblind_transformed.head()
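Because the encoder follows the scikit-learn transformer API, rows that arrive later can be encoded with the statistics learned above by calling transform (here new_rows is a hypothetical DataFrame with the same Gender and Race columns):
new_rows_transformed = encoder.transform(new_rows)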
With train-test split
Let's train a simple RandomForest model, just to show how to use the encoder with cross-validation. We will put our encoder in a pipeline with our random forest:
import category_encoders as ce
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import recall_score
from sklearn.pipeline import Pipeline
# Some fake data loaded from GitHub
colorblind = pd.read_csv('https://........')
# Do the train test split
X_train, X_test, y_train, y_test = train_test_split(colorblind.drop('Colorblind', axis=1), colorblind['Colorblind'])
# Build the encoder
encoder = ce.JamesSteinEncoder(cols=['Gender', 'Race'])
# Build the model, including the encoder
model = Pipeline([
('encode_categorical', encoder),
('classifier', RandomForestClassifier())
])
# Here are the parameters we want to search over
# Review pipelines to see how to access the different
# stages
params = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [4, 6, 8]
}
# build a grid search
grid = GridSearchCV(model, param_grid=params, cv=5).fit(X_train, y_train)
# How well did we do on the test set?
# Note that we don't need to explicitly transform the test
# set!
predict_test = grid.predict(X_test)
print(f"Recall on the test set is {recall_score(y_train, predict_test)}")
By putting our encoder in a pipeline, cross-validation was handled correctly (i.e. within each split, the encoder was fit on the 4 training folds and evaluated on the one held-out fold). See the article on pipelines for more detail.
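Once the grid search has finished, the winning hyperparameters and the corresponding cross-validated score are available through the standard scikit-learn attributes:
print(grid.best_params_)
print(grid.best_score_)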