Are You Getting Burned By One-Hot Encoding?

A common technique for transforming categorical variables into a form suitable for machine learning is called "one-hot encoding" or "dummy encoding". This article discusses some of the limitations and folklore around this method (such as the...

Encoding categorical variables

Non-numeric features generally have to be encoded into one or more numeric features before applying machine learning models. This article covers some of the different encoding techniques, the category_encoders package, and some of the pros and...

Custom Loss vs Custom Scoring

Scikit-learn's grid search functions include a scoring parameter. Scorers allow us to compare different trained models. Models try to minimize a loss function. While custom scoring is straightforward, custom losses are not.
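
As a rough illustration of the scoring side of that distinction, here is a minimal sketch that passes a custom scorer to GridSearchCV. The synthetic dataset, the choice of an F2 score, and the grid of C values are assumptions made for this example, not taken from the article. The model still minimizes its own built-in loss while fitting; the scorer is only used to compare the fitted candidates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Wrap a metric as a scorer; F2 weights recall more heavily than precision.
f2_scorer = make_scorer(fbeta_score, beta=2)

# The scorer ranks the candidates; it does not change the loss each one minimizes.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=f2_scorer,
    cv=3,
)
search.fit(X, y)

best_c = search.best_params_["C"]
best_score = search.best_score_
print(best_c, best_score)
```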

Globals Are Bad

Jupyter's use for quick experimentation encourages the use of global variables, as we may only have one connection to a database, or one dataframe used by all functions. The globals can lead to subtle, hard-to-debug problems. This article shows...

Keeping Notebooks Clean

Jupyter notebooks allow for quick experimentation and exploration, but can encourage some bad habits. One subtle error is the usage of global variables in a Jupyter notebook. This is a quick post to show the error, and some steps you can take to avoid it.

The Bad Names In Classification Problems

There is a proliferation of different metrics in classification problems: accuracy, precision, recall, and more! Many of these metrics are defined in terms of True Positives, True Negatives, False Positives, and False Negatives. Here we give...

Introducing the column transformer

The ColumnTransformer allows us to easily apply different transformations to different features. For example, now we can scale some numerical features, while leaving binary flags alone! This article walks through two examples using...
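
A minimal sketch of the idea described above, scaling numerical columns while passing a binary flag through untouched. The column names and values are made up for illustration and are not the examples used in the article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numerical features and one binary flag.
df = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0],
        "income": [10.0, 20.0, 30.0],
        "is_member": [0, 1, 1],
    }
)

# Scale only the numerical columns; remainder="passthrough" keeps the
# flag as-is and appends it after the transformed columns.
ct = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "income"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)
print(out)
```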

Are you sure that's a probability?

Many of the classifiers in sklearn support a predict_proba method for calculating probabilities. Often, these "probabilities" are really just a score from 0 to 1, where a higher score means the model is more confident in the prediction, but it...

Fixing a broken Postgres on Ubuntu (and AWS EC2)

If your Ubuntu server is shut down (for example, by your AWS instance rebooting), you may leave Postgres in an inconsistent state. This post walks through the steps of locating the lockfiles and getting Postgres up and running again.

Prevent big commits

Instead of learning how to undo accidentally committing a large file, what if we could prevent the commit in the first place? This article shows how to use git hooks to check commits automatically for validity before actually doing the commit.

Making a Python Package VIII - summary

This is the eighth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we summarize the steps needed to make and deploy a Python package.

Making a Python Package VII - deploying

This is the seventh in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we show how to deploy to TestPyPI.

Making a Python Package VI - including data files

This is the sixth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we show how to include a CSV file in your package. This should be...

Making a Python Package V - Testing with Tox

This is the fifth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we use the tox package to automate some of the deployment steps.

Making a Python Package IV - writing unit tests

This is the fourth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we use pytest to write unit tests for the roman numeral package.

Making a Python Package II - writing docstrings

This is the second in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we add docstrings so that our users can understand what our package does.

Making a Python Package

This is the first in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we create a Roman Numerals function, and make it into a Python module.

Pet Peeve - Art vs Science

The expression "Data science is more art than science" makes my skin crawl. Data science, like all sciences, requires both strict methodology and a lot of creativity.

Data Lakes, Data Warehouses and Databases - Oh My!

What is the difference between a production database and a data warehouse? How does that differ from a data lake? Why would I use one over the other? With the volume of data around, there are more and more use cases for data storage. This article...

Using API calls via the Network Panel

Second article in the advanced web-scraping series. Clarifies the difference between static and dynamic pages. Shows how to use Chrome's Network Panel to intercept JavaScript and AJAX calls.

Name to Age

How much does your name say about your age? We use the database of names from the Social Security Administration, as well as age distribution data from the US Census, to find out! See what your own name's age distribution looks like here.

Long vs Wide Data

What does it mean for data to be in long form vs wide form, and when would you use each? In Pandas, how do you convert from one form to another?
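
As a small sketch of the conversion, assuming a made-up table of yearly values per city (the column names and numbers are illustrative only): melt goes from wide to long, and pivot goes back again.

```python
import pandas as pd

# Wide form: one row per city, one column per year (hypothetical data).
wide = pd.DataFrame(
    {"city": ["Austin", "Boston"], "2019": [10, 20], "2020": [11, 21]}
)

# Wide -> long: one row per (city, year) pair.
long_df = wide.melt(id_vars="city", var_name="year", value_name="value")

# Long -> wide: spread the years back out into columns.
wide_again = long_df.pivot(
    index="city", columns="year", values="value"
).reset_index()

print(long_df)
print(wide_again)
```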

The James-Stein Encoder

One technique, sometimes called "target" or "impact" encoding, uses the average value of the target variable per value to encode. The James-Stein encoder is a twist that "shrinks" the target value back to the global average to stop statistical...

Custom scoring in cross-validation

The scoring functions used in our models are often baked in (such as using cross-entropy in Logistic Regression). We do get some choices when cross-validating, however. For example, we can pick the regularization parameter by using the ROC area...

A/B Test simulator

Determine the sample size needed to discover differences between two treatments, given your tolerance for false acceptances of inferior treatments, and false rejection of good treatments. Also includes a simulation of a trial, so that you can see...