Are you sure that's a probability?

A hard classification is when we take an observation and predict which class it belongs to. For example, a bank could use someone's financial history (the observation) to predict whether or not they would pay back a loan. A doctor could take measurements of a tumor (the observation) to predict whether the tumor was malignant and needed to be removed.

In contrast, a soft classification indicates the confidence the model has in its prediction. In giving a loan, a model might make the hard classification that both clients would pay back a loan, but the soft classification that client 1 has an 80% chance of paying the loan back, while client 2 has a 95% chance of paying the loan back.

If we have a soft classifier, we can make a hard classifier by picking a threshold: simply classify all observations with soft predictions above the threshold as the positive class, and those with soft predictions below the threshold as the negative class. For scikit-learn classifiers that calculate probabilities, this is how they work with a default threshold of 50%. Using probability as a threshold helps make your model more explainable as well -- you might decide that for a loan classifier, you only want to accept people that you are really confident about (e.g. those that have an 85% chance of higher of paying back a loan). For the tumor case, you might decide to have a doctor look at any tumor that has more than a 5% chance of being malignant.

While a probability is a convenient way to make a soft classification, all we really need is a score. This is easiest to explain for a binary classifier. Calling one of the cases the positive class, the only thing we need from the score is

A higher score means the model is more confident that the observation belongs to the positive class

A well-known example of a score that isn't a probability is a credit score. Usually between 300 and 850, the higher the credit score a person has, the more confidence a bank has that they will pay back a loan. We don't directly get an answer to the question "the probability that someone with a credit score of 620 pays back a loan is ....." (which presumably depeneds on things like size of the loan), but it does allow a bank to rank people it would like to give loans to.

If we have access to a probability of belonging to the positive class, it is a great candidate for a score. Not only does a higher score mean the model is more confident in the observation, but we are also able to use the probabilities to determine things like our expected values. If we use our probabilities this way we have to be careful they are actually probabilities. Scikit-learn's predict_proba method promises probabilities, but often these are really just scores between 0 and 1, that sum to one!

Beware of bad names

For many models in Scikit-learn, we have a predict_proba method. For each observation, predict_proba returns an array with the same number of columns as there are classes; and the "probability" that an observation is in class $i$ is found by looking at the $i$-th column. However, for many models, these are not really probabilities. Instead, they are scores that satisfy the following conditions:

  1. Each "probability" is between 0 and 1,
  2. When you sum the probabilities of an observation being in any particular class, they sum to 1.

What makes something a probability?

Probability can be difficult to define. For example, if we make a rain predictor, and it tells us there is a 20% chance of rain today, what does that mean? How can we tell if it is correct or not? When checking the prediction, it will either rain (or it won't). The standard frequentist argument is if we take days that look similar to today, then on 20% of them it should rain, but then we get into arguments about how similar different days have to be in order to be grouped together. In even more extreme, such as predicting the outcome of elections, it is difficult to see how we could "repeat" the experiment to give meaning to the result.

In The Signal and the Noise, Nate Silver talks about this problem for weather forecasts in chapter 4. Instead of worrying about what makes days similar to one another, let's take a different approach. Look historically at all the days where our model made the prediction that it would rain 20% of the time. Out of those days, what percentage of the time did it actually rain? If the answer is wildly different from 20%, then the model's probabilities are not doing a great job at actually estimating the probability.

In practice, we bin the probabilities, as there might only be one day where there is a probability of exactly 0.2000 of rain. If we take all the predictions between 17.5% and 22.5% and group them together in one bin, we would expect it to actually rain between 17.5% and 22.5% of the time.

This process of binning the prediction probabilities and comparing to the actual proportion of times it happened is called calibration. A calibration curve plots the predicted probabilities against the actual rate of occurance.

The weather example

Let's look at this Kaggle dataset of whether or not it rained in Seattle. We will load the data into a dataframe:

In [3]:
rain = pd.read_csv('seattleWeather_1948-2017.csv').dropna()
0 1948-01-01 0.47 51 42 True
1 1948-01-02 0.59 45 36 True
2 1948-01-03 0.42 45 35 True
3 1948-01-04 0.31 45 34 True
4 1948-01-05 0.17 45 32 True

The precipitation column (PRCP) tells us how many inches of rain fell, and the RAIN column tells us if any rain at all fell. We shouldn't use "how much rain fell" to determine if any rain fell, so we are just going to use the temperatures to predict whether or not there was rain that day.

In [4]:
features = rain[['TMIN', 'TMAX']]
target = rain.RAIN.astype(bool)
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)


We can also see that the data set is reasonably balanced (it rained on 42.8% of the days in the training set).

Let's use a Naive Bayes Classifier and make hard predictions:

In [5]:
nb_rain = GaussianNB().fit(X_train, y_train)
accuracy = nb_rain.score(X_train, y_train)
print(f'The hard predictions were right {100*accuracy:5.2f}% of the time')
The hard predictions were right 67.17% of the time

What about the soft predictions? We can use the predict_proba method to get them. We are only interested in the probability we predict the positive case, so we are going to grab column 1:

In [6]:
predictions = nb_rain.predict_proba(X_train)[:, 1]

Plotting the distribution of probabilities:

Here we see something interesting: even though our Naive Bayes model will make predictions greater than 50% (which means nb_rain.predict(...) would predict rain on those days), the probability never exceeds 70%. We can call calibration_curve to "bin" similar predicted probabilities together, and calculate what percentage of the time it actually rains in that bin:

In [8]:
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

predictions = nb_rain.predict_proba(X_train)[:, 1]
binned_true_p, binned_predict_p = calibration_curve(y_train, predictions, n_bins=10)

We can now plot the predicted probabilities against the actual probabilities:

The highlighted point in the calibration curve tells us that when our classifier claimed the probability of raining was approximately 25%, it actually rained 33% of the time! Someone told the "probability" of rain was only 25% thinks they will get one time in four, but will actually get wet one time in three! The region shown in blue is the region where a "hard" classification (i.e. nb_rain.predict) would predict rain.

If our classifier was giving us good probabilities, the points would line up close to the red line. As an example, a random forest on the same dataset gives the following result:

In [10]:
rfc_rain = RandomForestClassifier(n_estimators=100, random_state=13).fit(X_train, y_train)
rfc_predict = rfc_rain.predict_proba(X_train)[:, 1]

In this case, we see that our Random Forest's estimation of the probabilities are very reasonable!

For this example, Naive Bayes gave scores that looked like probabilities, and were generated by predict_proba so they should be probabilities, but when we looked at our calibration curve we found that they were really just scores. The random forest predict_proba in this case gave reasonable probabilities.

Recalibration and Platt scaling

When a model's predict_proba method gives scores rather than probabilities, we can recalibrate the scores to make them closer to a probability. This process is called Platt scaling, and is implemented in Scikit-learn as CalibratedClassifierCV. In the weather example, we could apply Platt scaling to our Naive Bayes Classifier:

In [12]:
calibrated_nb_rain = CalibratedClassifierCV(nb_rain, cv=5, method='isotonic')
# Note we need to refit!, y_train);

We can now use calibrated_nb_rain.predict_proba to access the probabilities after recalibration. Note that we are doing this on the test set, so this is data that hasn't been seen on training or calibration:

In [13]:
calibrated_probs = calibrated_nb_rain.predict_proba(X_test)[:, 1]
binned_true_p, binned_predict_p = calibration_curve(y_test, calibrated_probs, n_bins=10)

These corrected probabilities are looking pretty good! Let's compare the no calibration and post-calibration models side-by-side on the test set:

One thing to be aware of is that (by default) CalibratedClassifierCV actually refits your model while calibrating. You can override this by setting cv="prefit", but if you do this you need to make sure that you fit CalibratedClassifierCV on a different set of data than the set you trained your model on, i.e. you should split your data into training, calibration, and test sets.

Limitations of recalibration

The examples above show a really good probabilities out of the box (with Random Forest) and a huge improvement on probabilities for Naive Bayes. Random Forest doesn't always perform this well, and scaling cannot always make the probabilities "good". Let's look at a slightly more complicated problem where we would want accurate probabilities to calculate expected values: a loan payback dataset.

In [16]:
loans = pd.read_csv('lending_club_clean.csv')
loan_amnt annual_inc open_acc dti delinq_2yrs inq_last_6mths mths_since_last_delinq pub_rec revol_bal revol_util total_acc pub_rec_bankruptcies term grade emp_length home_ownership purpose defaulted
0 10000.0 49200.0 10.0 20.00 0.0 1.0 35.0 0.0 5598.0 21.00 37.0 0.0 36 C 10 RENT other False
1 3000.0 80000.0 15.0 17.94 0.0 0.0 38.0 0.0 27783.0 53.90 38.0 0.0 60 B 1 RENT other False
2 6000.0 84000.0 4.0 18.44 2.0 0.0 8.0 0.0 0.0 37.73 14.0 0.0 36 B 1 MORTGAGE medical False
3 5000.0 50004.0 14.0 13.97 3.0 0.0 20.0 0.0 4345.0 59.50 22.0 0.0 60 D 2 RENT other True
4 5000.0 24044.0 8.0 11.93 0.0 0.0 45.0 0.0 2224.0 29.30 16.0 0.0 36 A 2 RENT debt_consolidation False

For this problem, we are interested in predicting the default rate (i.e. the defaulted column). Because the goal is just to look at the scaling, we simplify our features by making the following transformations:

  • Drop all categorical features (home_ownership, purpose, grade)
  • Binarize the terms (only 3 year and 5 year terms)
  • Binarize the banckruptcies (only 2 people have more than 1 bankruptcy)

This leads to a model where we can just apply a random forest. We search over a few models until we find one that does a good job separating the defaulter's from those that pay back their loans:

In [24]:
params = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_split': [2, 5, 10, 12],
    'class_weight': [None, {0: 1, 1: 5}],
    'n_estimators': [100],

grid = GridSearchCV(RandomForestClassifier(), param_grid=params, cv=5, scoring='roc_auc'), y_train)
{'class_weight': None,
 'max_depth': 8,
 'min_samples_split': 10,
 'n_estimators': 100}

We know that we can change the threshold to adjust the precision and recall (as described in this article on ROC curves). How well did the Random Forest "probabilities" do at separating the two cases?

In [45]:
# Compare to the AUC-ROC:
predict_prob = grid.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, predict_prob)

Not too bad! Let's also create the calibrated version and check the ROC score there as well. We see that it is almost exactly the same! The calibration doesn't change the order of the scores, it just scales them to make a better match, and the ROC score only cares about the ordering of the scores.

In [26]:
calibrated_rfc = CalibratedClassifierCV(grid.best_estimator_, cv=5, method='isotonic').fit(X_train, y_train)

# Compare to the AUC-ROC:
predict_prob = calibrated_rfc.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, predict_prob)

You may wonder why there is any difference in the two scores if the recalibration doesn't change the ordering. The difference comes about because we refit the model when doing the CalibratedClassifierCV.

Let's compare the output for the probabilities before and after the recalibration:

We see that the recalibration helps, but the results are not perfect. In this case, whenever the model makes a prediction of 40% or higher (even after calibration), we are seeing that 100% of the cases are defaults!

The Platt scaling basically fits a logistic regression on the original model's predict_proba. The closer the calibration curve is to a sigmoid, the more effective the scaling will be in correcting the model. You can read more about how it works, and the difference between "sigmoid" and "isotonic" methods in the official user guide.

Different model performances

So which models have "good" probabilities, and which ones are just "scores". Logistic regression optimizes for the log-loss directly, so it's probabilities are generally well calibrated. For the other models:

Model Bias
Naive Bayes Tends to push probabilties toward the extremes. In this case, the extremes are not 0 or 1, but there is a lot of "pile up" at 0 and 0.7
Ensembles (e.g. random forest) When voting, it is hard to get a probability close to zero or close to 1. A probability of zero would mean each member of the ensemble voted "no", which is unlikely in a large number of ensembles.
LinearSVC Support Vector Classifiers generally don't support predict_proba in Scikit-learn. The exception is the LinearSVC (i.e. specializing a SVC to a linear kernel). According to the user guide (and references therein) maximum margin methods tend to have probabilties cluster around the center.
kNN Varies a lot depending on $k$. When $k$ is small, the probabilities tend to be close to 0 or 1. As $k$ increases, the probabilites tend to cluster around the middle.


  • Soft classification gives scores, where a higher score for that class means the model is more confident in its prediction.
  • For binary classification, you only need one score (generally the score of the positive class).
  • A probability can be used as a score (even in the multiclass case).
  • Many Scikit-learn models, such as Tree-based methods, ensemble methods, kNN, and Naive Bayes have a predict_proba method; but these should really be thought of as giving scores rather than "true" probabilities.
  • A calibration curve is a good way to check if your probabilities are accurate.
  • Platt Scaling is a way of trying to recalibrate your model's probabilities to be probabilities (rather than scores), but you should check calibration after fitting.

The predict_proba is a bad name, but as Phil Karlton (a designer for Netscape) once told us:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton


I'd like to thank Chad Scherrer for helpful feedback on the example I used to illustrate calibration, and how it doesn't always do as good a job as we want.