How to do cross-validation when upsampling data

One of my colleagues, Sophie Searcy, recently wrote a blog post that dealt with imbalanced classes. She looked at ways to address an imbalanced learning problem, as well as the pros and cons of the different approaches. One big takeaway from that article (which you should read!) was to consider carefully whether you should address the imbalance by oversampling at all, or whether you should look at some of the alternatives: adjusting the class weights, or checking whether your model handles imbalanced data naturally.

This article is about how to do cross-validation once you have decided that oversampling is the right approach for your problem. If you want to play with the process yourself, there is a notebook on Github with all the steps included; here we highlight the main ones.

For this article, we will be going through the following steps:

  1. Getting a baseline
  2. Oversampling the wrong way
    Do a train-test split, oversample the training set, then cross-validate on the oversampled data. Sounds fine, but the results are overly optimistic.
  3. Oversampling the right way
    1. Manual oversampling
    2. Using `imblearn`'s pipelines (for those in a hurry, this is the best solution)

If cross-validation is done on already-upsampled data, the scores won't generalize to new data. In a real problem you should use the test set only ONCE; we reuse it here to show that cross-validating on already-upsampled data produces overly optimistic results that don't carry over to new data (or the test set).

The dataset

We will be using a thyroid dataset, in which cases of thyroid disease make up about 6% of the data (i.e. about 1 in 16 patients has a thyroid issue). The dataset is available in imbalanced-learn's datasets module. Our goal will be to find a classifier with good recall (i.e. we want our classifier to find as many of the positive cases as it can). Be aware that there is a danger in using this metric: simply predicting that everyone has a thyroid issue would give 100% recall.
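The exact loading and preprocessing steps are in the notebook. As a rough sketch (assuming the data comes from imbalanced-learn's fetch_datasets and its thyroid_sick entry, which matches the roughly 6% positive rate described here), loading might look like this:

import pandas as pd
from imblearn.datasets import fetch_datasets

# Assumed loading step: fetch imbalanced-learn's benchmark datasets and pick the
# thyroid one ('thyroid_sick' is assumed here based on the ~6% positive rate).
thyroid = fetch_datasets()['thyroid_sick']
X = pd.DataFrame(thyroid.data)
y = (thyroid.target == 1).astype(int)  # 1 = thyroid issue (minority class), 0 = healthy
y.mean()                               # roughly 0.06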

We are going to ensure that we have the same splits of the data every time. We can ensure this by creating a KFold object, kf, and passing cv=kf instead of the more common cv=5.

In [3]:
kf = KFold(n_splits=5, shuffle=False)  # random_state only has an effect when shuffle=True

1. Baseline (no oversampling)

Let's get a baseline result by picking a random forest.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)
rf = RandomForestClassifier(n_estimators=100, random_state=13)
cross_val_score(rf, X_train, y_train, cv=kf, scoring='recall')
Out[4]:
array([0.81081081, 0.73684211, 0.875     , 0.7037037 , 0.7804878 ])

These are decent results, and we haven't even optimized the model! Let's do some hyperparameter tuning:

In [5]:
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 10, 12],
    'random_state': [13]
}

grid_no_up = GridSearchCV(rf, param_grid=params, cv=kf, 
                          scoring='recall').fit(X_train, y_train)

grid_no_up.best_score_
Out[5]:
0.7803820054409211

We get about 78% recall from the best of these models, before we have even tried oversampling. This is the number to beat.

Normally we would wait until we had finished our modeling to look at the test set, but an important part of this article is to see how oversampling, done incorrectly, can make us too confident in our ability to generalize based on cross-validation. We haven't oversampled yet, so let's just check that the test score is in line with what we expect from the CV scores above (i.e. about 78%):

In [6]:
recall_score(y_test, grid_no_up.predict(X_test))
Out[6]:
0.8035714285714286

This looks like it is (roughly) consistent with the CV results.

2. Oversampling the wrong way

Let's just oversample the training data (we are smart enough not to oversample the test data), and check that this gives us an even split of the two classes:

In [7]:
X_train_upsample, y_train_upsample = SMOTE(random_state=42).fit_resample(X_train, y_train)
y_train_upsample.mean()
Out[7]:
0.5

Now let's cross-validate using grid search. Remember that the training set has already been upsampled; the upsampling is not being done as part of the grid search:

In [8]:
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 10, 12],
    'random_state': [13]
}

grid_naive_up = GridSearchCV(rf, param_grid=params, cv=kf, 
                             scoring='recall').fit(X_train_upsample, 
                                                   y_train_upsample)
grid_naive_up.best_score_
Out[8]:
0.9843160927198451

This is an amazing recall! If we look at the validation scores, they all look pretty good:

In [9]:
grid_naive_up.cv_results_['mean_test_score']
Out[9]:
array([0.93360792, 0.9345499 , 0.93337591, 0.94714925, 0.94736138,
       0.94273667, 0.97585677, 0.98218414, 0.97864618, 0.98237253,
       0.98187974, 0.98431609])

Here is the model that made these results:

In [10]:
grid_naive_up.best_params_
Out[10]:
{'max_depth': 12, 'n_estimators': 200, 'random_state': 13}

Ok, let's look at how it does on the original (non-upsampled) training set as a whole:

In [11]:
recall_score(y_train, grid_naive_up.predict(X_train))
Out[11]:
1.0

Ok, what about the test set?

In [12]:
# But wait ... uh-oh, spaghetti-os!
recall_score(y_test, grid_naive_up.predict(X_test))
Out[12]:
0.9107142857142857

Ok, time for some good news/bad news:

  • good: the recall on the test set is 91%, better than the 80% we got without upsampling
  • bad: our confidence in the cross-validation results went down. With no upsampling, the validation recall was 78%, which was a good estimate of the 80% test recall. With upsampling, the validation recall was 98% (and the recall on the training set was 100%), which isn't a good estimate of the test recall (91%)

3. Let's make SMOTE-ing part of our cross validation!

The issue is that we

  • oversample
  • then split into cross-validation folds

To see why this is an issue, consider the simplest method of over-sampling (namely, copying the data point). Let's say every data point from the minority class is copied 6 times before making the splits. If we did a 3-fold validation, each fold has (on average) 2 copies of each point! If our classifier overfits by memorizing its training set, it should be able to get a perfect score on the validation set! Our cross-validation will choose the model that overfits the most. We see that CV chose the deepest trees it could!
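To make the leakage concrete, here is a small sketch with made-up data (not the thyroid set): copy each "minority" row six times, split into three folds, and count how many distinct validation rows also show up in the training fold.

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X_minority = rng.rand(10, 3)              # 10 made-up "minority" rows
X_dup = np.repeat(X_minority, 6, axis=0)  # naive oversampling: each row copied 6 times

for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X_dup):
    train_rows = {tuple(row) for row in X_dup[train_idx]}
    val_rows = {tuple(row) for row in X_dup[val_idx]}
    # typically every distinct row in the validation fold also appears in training
    print(len(val_rows & train_rows), "of", len(val_rows), "distinct validation rows leak from training")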

Instead, we should split into training and validation folds. Then, on each fold, we should

  1. Oversample the minority class
  2. Train the classifier on the training folds
  3. Validate the classifier on the remaining fold

Let's see this in detail by doing it manually:

3A. Manual upsampling within folds

In [13]:
example_params = {
        'n_estimators': 100,
        'max_depth': 5,
        'random_state': 13
    }

def score_model(model, params, cv=None):
    """
    Creates folds manually, and upsamples within each fold.
    Returns an array of validation (recall) scores
    """
    if cv is None:
        cv = KFold(n_splits=5, shuffle=False)  # random_state only matters when shuffle=True

    smoter = SMOTE(random_state=42)
    
    scores = []

    for train_fold_index, val_fold_index in cv.split(X_train, y_train):
        # Get the training data
        X_train_fold, y_train_fold = X_train.iloc[train_fold_index], y_train[train_fold_index]
        # Get the validation data
        X_val_fold, y_val_fold = X_train.iloc[val_fold_index], y_train[val_fold_index]

        # Upsample only the data in the training section
        X_train_fold_upsample, y_train_fold_upsample = smoter.fit_resample(X_train_fold,
                                                                           y_train_fold)
        # Fit the model on the upsampled training data
        model_obj = model(**params).fit(X_train_fold_upsample, y_train_fold_upsample)
        # Score the model on the (non-upsampled) validation data
        score = recall_score(y_val_fold, model_obj.predict(X_val_fold))
        scores.append(score)
    return np.array(scores)

# Example of the model in action
score_model(RandomForestClassifier, example_params, cv=kf)
Out[13]:
array([0.78378378, 0.76315789, 0.96875   , 0.81481481, 0.90243902])

We can even do grid search this way by looping over the parameters. As a reminder, the parameter combinations we tried earlier were

In [14]:
params
Out[14]:
{'n_estimators': [50, 100, 200],
 'max_depth': [4, 6, 10, 12],
 'random_state': [13]}

This loop tries all combinations, and stores the average recall score on the validation sets:

In [15]:
score_tracker = []
for n_estimators in params['n_estimators']:
    for max_depth in params['max_depth']:
        example_params = {
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'random_state': 13
        }
        example_params['recall'] = score_model(RandomForestClassifier, 
                                               example_params, cv=kf).mean()
        score_tracker.append(example_params)
        
# What's the best model?
sorted(score_tracker, key=lambda x: x['recall'], reverse=True)[0]
Out[15]:
{'n_estimators': 50,
 'max_depth': 4,
 'random_state': 13,
 'recall': 0.8486884268736002}

The best estimator has a recall score of 85% on the validation set. Let's see how this compares with the test score:

In [16]:
rf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=13)
rf.fit(X_train_upsample, y_train_upsample)
recall_score(y_test, rf.predict(X_test))
Out[16]:
0.8392857142857143

Note that this is roughly consistent (84% on the test set vs 85% in validation).

3B. Using the imblearn pipeline

The imbalanced-learn package extends scikit-learn's built-in pipelines. Specifically, with scikit-learn you can import

from sklearn.pipeline import Pipeline, make_pipeline

which allows you to chain multiple steps into a single estimator. When you fit the pipeline, every step (such as a scaler and the model) is fit in sequence; when you predict, the preprocessing steps only transform the data before the final model predicts.
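As an illustrative sketch (not part of the original notebook), a plain scikit-learn pipeline behaves like this:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=13))
pipe.fit(X_train, y_train)    # fits the scaler, then fits the forest on the scaled training data
pipe.predict(X_test)          # only transforms X_test with the already-fit scaler, then predicts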

There is a restriction, though. It comes partly from the naming of the methods (transform vs resample), but one way of thinking about it is that a scikit-learn pipeline step can only transform one input row into one output row (perhaps with different or added features). To upsample, we need to increase the number of rows. Imbalanced-learn generalizes the pipeline to allow resampling steps (which only act during fit), while keeping the syntax and function names the same:

from imblearn.pipeline import Pipeline, make_pipeline

Let's see it in action:

In [17]:
imba_pipeline = make_pipeline(SMOTE(random_state=42), 
                              RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
Out[17]:
array([0.75675676, 0.78947368, 0.90625   , 0.77777778, 0.7804878 ])

This is much nicer than using our manual score function! Notice that the recall scores are similar to when we did this manually.

Even nicer, the pipelines play well with GridSearchCV, so we don't have to loop over parameters manually:

In [18]:
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                        return_train_score=True)
grid_imba.fit(X_train, y_train);

We can see that the best estimator selected by grid search with the pipeline matches the one we found manually:

In [19]:
grid_imba.best_params_
Out[19]:
{'randomforestclassifier__max_depth': 4,
 'randomforestclassifier__n_estimators': 50,
 'randomforestclassifier__random_state': 13}

How well do we do on our validation set?

In [20]:
grid_imba.best_score_
Out[20]:
0.8486780485230826

Let's compare this to the test set:

In [21]:
y_test_predict = grid_imba.predict(X_test)
recall_score(y_test, y_test_predict)
Out[21]:
0.8392857142857143

This gives us some confidence that the pipeline is doing what we want: when we did the cross-validation manually, we also saw validation recall scores of about 85% (vs 84% recall on the test set).

When predicting, the SMOTE step doesn't do anything (it just passes the values through). We can check this explicitly by predicting directly with the randomforestclassifier step and seeing that we get the same result:

In [22]:
y_test_predict = grid_imba.best_estimator_.named_steps['randomforestclassifier'].predict(X_test)
recall_score(y_test, y_test_predict)
Out[22]:
0.8392857142857143

Summary

Here is a summary of the different approaches we took:

Method                              Recall (validation)   Recall (test)
No upsampling (baseline)            78.0%                 80.4%
Upsample training set before CV     98.4%                 91.1%
Upsample as part of CV (manual)     84.9%                 83.9%
Upsample as part of CV (pipeline)   84.9%                 83.9%

The last two lines should be (and are) the same. The difference is simply that the pipeline is easier to manage and leads to cleaner code, but it is good to see the explicit process at least once. The high-level takeaways:

  • For each case, except when we upsampled the training set before the CV, the validation set recall was a good estimate of the test set recall.
  • When we upsampled the training set before cross-validation, there was a difference of about 7 percentage points between the CV recall (98%) and the recall on the test set (91%).
  • When upsampling before cross-validation, you end up picking the model that overfits the most, because the oversampling allows data to leak from the validation folds into the training folds.
  • In this example, doing the upsampling incorrectly led to the best recall overall (91%). This won't generally happen! Our metric (recall) could just as easily have been worse. The important point is that the CV scores are the main tool we have for telling whether we are doing well.
  • The test set should only be used ONCE. In this article, we used it multiple times to show how the different upsampling approaches affected our ability to trust the cross-validated scores.

In your own problems, you should fit a baseline model and the (correctly) upsampled models, and use the CV scores to make your modeling decisions. The test set's role is to tell you how well your final model generalizes, after all of those modeling decisions have been made.

References

There are some nice articles on this topic around the web; here is a collection I found particularly useful: