Stacked Turtles

How can reducing page memory size increase download times?

Damien martin | Sat 06 June 2020 | Category Data Science

In 2009, YouTube made improvements which reduced the video player from 1.2MB to 98kB. However, the time taken to download the page seemed to increase.

Pathological demand with flights - using instrumental variables

Damien martin | Mon 01 June 2020 | Category Data Science

Many predictive models learn correlations between your features and your target, and apply those to make predictions. If you change your strategies, you risk changing these correlations. We look at an example where increasing prices leads to...

Pathological demand in ridesharing - confounding with demand

Damien martin | Sun 31 May 2020 | Category Data Science

Many predictive models learn correlations between your features and your target, and apply those to make predictions. If you change your strategies, you risk changing these correlations. We look at an example where increasing prices leads to...

Setting up Jupyter on the Cloud

Damien martin | Thu 23 April 2020 | Category Tools

This article shows how you can run Jupyter on a remote server, connect to it, and have Jupyter continue to run - even if you get disconnected.

Normalization (z-score for features, Cohen's D for results)

Damien martin | Sat 18 January 2020 | Category Data Science

p-values are commonly used to determine if an effect is statistically significant. Cohen's D gives a measure of how important an effect is. It is possible to see a statistically significant difference (small p-value) even if the effect isn't...

How to save Jupyter's environment (and kernels)

Damien martin | Wed 13 November 2019 | Category Tools

An earlier article, "Save the environment with conda", showed how to make a new environment and use it with Jupyter. This article walks through how to fix Jupyter if it isn't using the correct environment.

Stylish Pandas

Damien martin | Tue 05 November 2019 | Category Data Science

As the Zen of Python states, "readability counts". With a few simple tips and tricks, we can make our Pandas dataframes a lot more readable.

Goodhart's Law

Damien martin | Wed 25 September 2019 | Category Data Science

Goodhart's law claims "When a measure becomes a target, it ceases to be a good measure". This article explores how bad metrics can create perverse incentives, and how cross-validation fails to catch our errors.

Are You Getting Burned By One-Hot Encoding?

Damien martin | Sat 31 August 2019 | Category Data Science

A common technique for transforming categorical variables into a form suitable for machine learning is called "one-hot encoding" or "dummy encoding". This article discusses some of the limitations and folklore around this method (such as the...

Encoding categorical variables

Damien martin | Sun 25 August 2019 | Category Data Science

Non-numeric features generally have to be encoded into one or more numeric features before applying machine learning models. This article covers some of the different encoding techniques, the category_encoders package, and some of the pros and...

Custom Loss vs Custom Scoring

Damien martin | Sun 28 July 2019 | Category Data Science

Scikit-learn's grid search functions include a scoring parameter. Scorers allow us to compare different trained models, while the models themselves try to minimize a loss function. Custom scoring is straightforward; custom losses are not.

Pros and Cons of Changing Definitions

Damien martin | Sun 21 July 2019 | Category Data Science

A definition cannot be wrong, but it can fail to be useful. Can you repurpose a definition, or should you start from scratch?

Pet Peeve - Single Source of Truth

Damien martin | Sun 21 July 2019 | Category Data Science

In software engineering, it is important to have a single source of truth. In data science, it is a little more complicated.

Globals Are Bad

Damien martin | Thu 13 June 2019 | Category Tools

Jupyter's use for quick experimentation encourages the use of global variables, as we may only have one connection to a database, or one dataframe used by all functions. Globals can lead to subtle, hard-to-debug problems. This article shows...

Keeping Notebooks Clean

Damien martin | Thu 13 June 2019 | Category Tools

Jupyter notebooks allow for quick experimentation and exploration, but can encourage some bad habits. One subtle error is the use of global variables in a Jupyter notebook. This is a quick post to show the error, and some steps you can take to avoid it.

Interview Practice with Precision and Recall

Damien martin | Sat 01 June 2019 | Category Data Science

How to prepare for those annoying questions about precision and recall in interviews.

The Bad Names In Classification Problems

Damien martin | Sat 01 June 2019 | Category Data Science

There is a proliferation of different metrics in classification problems: accuracy, precision, recall, and more! Many of these metrics are defined in terms of True Positives, True Negatives, False Positives, and False Negatives. Here we give...

Introducing the column transformer

Damien martin | Sun 26 May 2019 | Category Data Science

The ColumnTransformer allows us to easily apply different transformations to different features. For example, now we can scale some numerical features, while leaving binary flags alone! This article walks through two examples using...

Are you sure that's a probability?

Damien martin | Thu 23 May 2019 | Category Data Science

Many of the classifiers in sklearn support a predict_proba method for calculating probabilities. Often, these "probabilities" are really just a score from 0 to 1, where a higher score means the model is more confident in the prediction, but it...

How to do cross-validation when upsampling data

Damien martin | Mon 20 May 2019 | Category Data Science

We know to split our data into a training and a testing set before we do our preprocessing, let alone our modeling. Often we are not as careful when doing cross-validation; we should really do things like scale our data within cross-validation...

Fixing a broken Postgres on Ubuntu (and AWS EC2)

Damien martin | Wed 13 March 2019 | Category Tools

If your Ubuntu server is shut down (for example, by your AWS instance rebooting), you may leave Postgres in an inconsistent state. This post walks through the steps of locating the lockfiles and getting Postgres up and running again.

What is a ROC Curve? A visualization with credit scores.

Damien martin | Sun 03 March 2019 | Category Tools

ROC (Receiver Operating Characteristic) curves are a great way to measure the performance of binary classifiers. They show how well a classifier's score (where a higher score means more likely to be in the "positive" class) does at separating...

Prevent big commits

Damien martin | Fri 01 February 2019 | Category Github

Instead of learning how to undo accidentally committing a large file, what if we could prevent the commit in the first place? This article shows how to use git hooks to automatically check commits for validity before they are made.

Save the environment with conda (and how to let others run your programs)

Damien martin | Tue 22 January 2019 | Category Tools

Environments allow you to distribute software to other users when you don't know what packages they have installed. This is a better solution than using requirements.txt, as the packages you install won't interfere with the user's system.

Making a Python Package VIII - summary

Damien martin | Sat 05 January 2019 | Category Tools

This is the eighth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we summarize the steps needed to make and deploy a Python package.

Making a Python Package VII - deploying

Damien martin | Fri 04 January 2019 | Category Tools

This is the seventh in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we show how to deploy to TestPyPI.

Making a Python Package VI - including data files

Damien martin | Thu 03 January 2019 | Category Tools

This is the sixth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we show how to include a CSV file in your package. This should be...

Making a Python Package V - Testing with Tox

Damien martin | Thu 03 January 2019 | Category Tools

This is the fifth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we use the tox package to automate some of the deployment steps.

Making a Python Package IV - writing unit tests

Damien martin | Wed 02 January 2019 | Category Tools

This is the fourth in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we use pytest to write unit tests for the roman numeral package.

Making a Python Package III - making an installable package

Damien martin | Wed 02 January 2019 | Category Tools

This is the third in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we use setuptools to allow people to install our package on their system.

Making a Python Package II - writing docstrings

Damien martin | Tue 01 January 2019 | Category Tools

This is the second in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we add docstrings so that our users can understand what our package does.

Making a Python Package

Damien martin | Tue 01 January 2019 | Category Tools

This is the first in a series of blog posts where we go through the process of taking a collection of functions and turning them into a deployable Python package. In this post, we create a Roman Numerals function, and make it into a Python module.

Derivations and Conjugate Priors (proportions)

Damien martin | Fri 28 December 2018 | Category Data Science

This article contains derivations when applying the shrinkage methods of empirical Bayes to proportion problems.

Derivations and Conjugate Priors (average ratings)

Damien martin | Fri 28 December 2018 | Category Data Science

This article contains derivations when applying the shrinkage methods of empirical Bayes to average rating problems.

Empirical Bayes with regression

Damien martin | Fri 28 December 2018 | Category Data Science

Pet Peeve - Art vs Science

Damien martin | Wed 26 December 2018 | Category Data Science

The expression "Data science is more art than science" makes my skin crawl. Data science, like all sciences, requires both strict methodology and a lot of creativity.

Shrinkage and Empirical Bayes to improve inference

Damien martin | Wed 26 December 2018 | Category Data Science

The highest and lowest rated books, films, and music are those that have very few ratings. This is because for small samples, it is easier for small fluctuations to dominate. Shrinkage is the technique for moving the average for a particular item...

An introduction to SimpleProphet

Damien martin | Fri 23 November 2018 | Category Data Science

Introduces SimpleProphet, a less automated version of Facebook's time series analysis package Prophet. Compares the approach of Prophet to other standard approaches: ARIMA and LSTMs.

Data Lakes, Data Warehouses and Databases - Oh My!

Damien martin | Fri 02 November 2018 | Category Tools

What is the difference between a production database and a data warehouse? How does that differ from a data lake? Why would I use one over the other? With the volume of data around, there are more and more use cases for data storage. This article...

An Introduction to ARIMA

Damien martin | Mon 22 October 2018 | Category Data Science

An article that outlines the standard approach to time series.

Prepping for the interview - SQL

Damien martin | Sun 21 October 2018 | Category Interview

Links to a couple of useful resources for preparing for SQL interview questions, whether for a data science or data analyst position.

Using API calls via the Network Panel

Damien martin | Tue 16 October 2018 | Category Web

Second article in the advanced web-scraping series. Clarifies the difference between static and dynamic pages. Shows how to use Chrome's Network Panel to intercept JavaScript and AJAX calls.

Webscraping beyond BeautifulSoup and Selenium

Damien martin | Mon 15 October 2018 | Category Web

First article in the advanced web-scraping series. Clarifies the difference between static and dynamic pages. Outlines different approaches for getting data from pages generated with JavaScript and AJAX.

Getting data with OAuth

Damien martin | Sun 14 October 2018 | Category Web

Shows how to use OAuth 2.0 to access an API with Python's requests module, using Spotify as an example.

Name to Age

Damien martin | Mon 01 October 2018 | Category Pandas

How much does your name say about your age? We use the database of names from the Social Security Administration, as well as age distribution data from the US Census, to find out! See what your own name's age distribution looks like here.

What is tidy data?

Damien martin | Fri 28 September 2018 | Category Pandas

The principles of Hadley Wickham's tidy data, and how it relates to long and wide form data.

Munging with MultiIndices: election data

Damien martin | Thu 27 September 2018 | Category Pandas

We show how to take an Excel spreadsheet, with merged column headings, and process it for further analysis.

Long vs Wide Data

Damien martin | Wed 19 September 2018 | Category Pandas

What does it mean for data to be in long form vs wide form, and when would you use each? In Pandas, how do you convert from one form to another?

Undo in GitHub (aka the elephants in the room)

Damien martin | Wed 12 September 2018 | Category Github

How to roll back in GitHub.

Big commits in GitHub

Damien martin | Mon 10 September 2018 | Category Github

What do you do when you have committed a large file to GitHub?

The James-Stein Encoder

Damien martin | Mon 10 September 2018 | Category Data Science

One technique, sometimes called "target" or "impact" encoding, uses the average value of the target variable per value to encode. The James-Stein encoder is a twist that "shrinks" the target value back to the global average to stop statistical...

Custom scoring in cross-validation

Damien martin | Thu 03 May 2018 | Category Data Science

The scoring functions used in our models are often baked in (such as using cross-entropy in Logistic Regression). We do get some choices when cross-validating, however. For example, we can pick the regularization parameter by using the ROC area...

Using Folium: What is the furthest you can get from Starbucks in Seattle?

Damien martin | Sat 08 April 2017 | Category Pandas

It seems that Starbucks is ubiquitous in Seattle. Where in Seattle is furthest from a Starbucks store? In order to work this out, we need a list of all the stores in Seattle. The open data project Socrata makes it easy to find out - you can pull...

A/B Test simulator

Damien martin | Fri 10 March 2017 | Category Calculators

Determine the sample size needed to discover differences between two treatments, given your tolerance for false acceptances of inferior treatments, and false rejection of good treatments. Also includes a simulation of a trial, so that you can see...

Snake on a cube with ReactJS

Damien martin | Fri 17 February 2017 | Category Tools

Gauge test

Damien martin | Sun 24 November 2013 | Category Tools