Stylish Pandas

As the Zen of Python states, "readability counts". With a few simple tips and tricks, we can make our Pandas dataframes a lot more readable.
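As a minimal sketch of the kind of styling the article covers, pandas exposes a Styler object via DataFrame.style; the column names and data below are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data for illustration
df = pd.DataFrame({
    "region": ["North", "South", "East"],
    "revenue": [125000.5, 98000.25, 143250.0],
    "growth": [0.12, -0.03, 0.08],
})

# DataFrame.style returns a Styler; format() controls number display
styled = df.style.format({
    "revenue": "${:,.0f}",   # thousands separators, no cents
    "growth": "{:+.1%}",     # signed percentage
}).highlight_max(subset=["revenue"], color="lightgreen")

# In a Jupyter notebook, `styled` renders as formatted HTML
```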

Goodhart's Law

Goodhart's law states that "when a measure becomes a target, it ceases to be a good measure". This article explores how bad metrics can create perverse incentives, and how cross-validation fails to catch our errors.

Are You Getting Burned By One-Hot Encoding?

A common technique for transforming categorical variables into a form suitable for machine learning is called "one-hot encoding" or "dummy encoding". This article discusses some of the limitations and folklore around this method (such as the...
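A minimal sketch of the encoding itself, using scikit-learn's OneHotEncoder on a made-up column; the drop="first" option shown is one way to avoid the collinearity that a full set of dummies introduces.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical feature
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# sparse_output=False returns a dense array (sparse_output is the
# scikit-learn >= 1.2 name; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, drop="first")
encoded = encoder.fit_transform(df[["color"]])

print(encoder.get_feature_names_out())  # e.g. ['color_green' 'color_red']
print(encoded)
```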

Encoding categorical variables

Non-numeric features generally have to be encoded into one or more numeric features before applying machine learning models. This article covers some of the different encoding techniques, the category_encoders package, and some of the pros and...
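As one small example of the package mentioned above, category_encoders exposes scikit-learn-compatible encoders; the column and target below are invented for illustration.

```python
import pandas as pd
import category_encoders as ce

# Invented example: city as a categorical feature, price as the target
X = pd.DataFrame({"city": ["NY", "SF", "NY", "LA", "SF"]})
y = pd.Series([10, 20, 12, 8, 22])

# Target encoding replaces each category with a (smoothed) mean of y
encoder = ce.TargetEncoder(cols=["city"])
X_encoded = encoder.fit_transform(X, y)
print(X_encoded)
```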

Custom Loss vs Custom Scoring

Scikit-learn's grid search functions include a scoring parameter. Scorers let us compare different trained models; the models themselves minimize a loss function during training. Custom scoring is straightforward, but custom losses are not.
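A minimal sketch of the straightforward half, a custom scorer plugged into grid search; the scoring rule itself is an arbitrary example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# An arbitrary custom scoring rule: penalize false positives twice
# as heavily as false negatives (higher is better for scorers)
def weighted_accuracy(y_true, y_pred):
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return 1 - (2 * fp + fn) / len(y_true)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring=make_scorer(weighted_accuracy),  # scorer compares trained models
)
grid.fit(X, y)
print(grid.best_params_)
```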

The Bad Names In Classification Problems

There is a proliferation of different metrics in classification problems: accuracy, precision, recall, and more! Many of these metrics are defined in terms of True Positives, True Negatives, False Positives, and False Negatives. Here we give...
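As a small illustration of how these metrics reduce to the four counts, here is precision and recall computed by hand from a confusion matrix, using invented labels.

```python
from sklearn.metrics import confusion_matrix

# Invented predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")
```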

Introducing the column transformer

The ColumnTransformer allows us to easily apply different transformations to different features. For example, now we can scale some numerical features, while leaving binary flags alone! This article walks through two examples using...
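A minimal sketch of the scale-some-columns-and-leave-the-flags-alone pattern; the column names are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame: two numeric features plus a binary flag
df = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [40000, 85000, 62000],
    "is_member": [0, 1, 1],
})

# Scale the numeric columns; pass the flag through untouched
transformer = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["age", "income"])],
    remainder="passthrough",  # leaves is_member as-is
)
print(transformer.fit_transform(df))
```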

Are you sure that's a probability?

Many of the classifiers in sklearn support a predict_proba method for calculating probabilities. Often, these "probabilities" are really just a score from 0 to 1, where a higher score means the model is more confident in the prediction, but it...
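A sketch of one way to turn such scores into better-behaved probabilities, using scikit-learn's CalibratedClassifierCV on a synthetic dataset; the SVC base model is chosen here only because its raw scores are plainly not probabilities.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVC decision scores are not probabilities; wrapping the model in
# CalibratedClassifierCV fits a calibration map so that predict_proba
# output is closer to a true probability
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

print(calibrated.predict_proba(X_test)[:3])  # rows sum to 1
```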

Pet Peeve - Art vs Science

The expression "Data science is more art than science" makes my skin crawl. Data science, like all sciences, requires both strict methodology and a lot of creativity.

The James-Stein Encoder

One technique, sometimes called "target" or "impact" encoding, uses the average value of the target variable per category to encode. The James-Stein encoder is a twist that "shrinks" the target value back to the global average to stop statistical...
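The category_encoders package ships a JamesSteinEncoder implementing this shrinkage; here is a minimal sketch with invented data.

```python
import pandas as pd
import category_encoders as ce

# Invented data: a rare category gets shrunk toward the global mean
X = pd.DataFrame({"store": ["A", "A", "A", "B", "B", "C"]})
y = pd.Series([100, 110, 105, 200, 210, 500])

encoder = ce.JamesSteinEncoder(cols=["store"])
X_encoded = encoder.fit_transform(X, y)

# "C" has a single observation, so its encoding is pulled back
# toward the overall target mean rather than sitting at 500
print(X_encoded)
```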

Custom scoring in cross-validation

The loss functions used in our models are often baked in (such as cross-entropy in Logistic Regression). We do get some choices when cross-validating, however. For example, we can pick the regularization parameter by using the ROC area...
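A minimal sketch of picking the regularization strength by ROC AUC during cross-validation; the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# The model still minimizes cross-entropy internally; scoring only
# controls how cross-validation compares the candidate C values
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```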