What is tidy data?

Summary: The principles of Hadley Wickham's tidy data, and how it relates to long and wide form data.

Tidy data

In the blog post long_vs_wide, we looked at how to turn data between long and wide formats in Pandas using the commands DataFrame.pivot, DataFrame.pivot_table and DataFrame.melt. As a quick reminder, we can look at an example of sales at a store.

In wide form, we might list out the sales for each quarter:

Store_Id Year Q1_Sales Q2_Sales Q3_Sales Q4_Sales
A001 2018 55 M 45 M 22 M 50 M
A002 2018 98 M 70 M 60 M 60 M

In long form, we would not use the column to label the quarter. Instead, we would add a separate column for the quarter, like so:

Store_Id Year Quarter Sales
A001 2018 1 55 M
A001 2018 2 45 M
A001 2018 3 22 M
A001 2018 4 50 M
A002 2018 1 55 M
.... .... ... ...

In the long vs wide post, we concentrated on how to convert from one format to the other. This post will concentrate on when we would want to put the data into "tidy" format, and why.

Tidy format

Hadley Wickham defines "tidy data" is data stored in "3rd normal form". Restated in language used by statisticans and data scienctists, we have:

  1. Each variable forms a column, and that column contains one "type" of data
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Looking at the long and wide forms of our sales data, it can be difficult to apply these rules to say which one is "tidy". For example, if we are trying to observe quarterly sales, the long form would be tidy. If we are trying to observe annual sales, there is an argument that the wide form with all the data about a year in a single row, is tidy.

An easier way for me to think about whether a dataset is "tidy" or not is to consider the multiple ways a dataset might become untidy. From Wickham's paper, these are

  1. Column headers contain values, rather than names
  2. Multiple variables are stored in a single column
  3. Variables are stored in both rows and columns
  4. Multiple observational types are stored in a single table
  5. A single observational unit is stored in multiple tables.

The last two are more about how to split your dataframe into multiple dataframes. I will concentrate on the first three, which tell us about the shape that our dataframe should have to be considered tidy. When thought of this way, it becomes clearer that the long form of our dataset is the "tidy" one, as the column names Qn_sales are secretly storing a value (namely n, the quarter number).

Other examples

Medical data: age, weight, and gender

Consider the following example, where we tabulate someone's age, weight, name and gender. Here are two ways of presenting the data. First in wide format:

name age weight gender
Alice 34 115lb Female
Bob 35 160lb Male
Christine 38 125 lb Female

The same data can be presented in a long format:

name variable value
Alice age 34
Alice gender Female
Alice weight 115lb
Bob age 35
Bob weight 160lb
Bob gender Male
... ... ...

In this case, the wide dataset is the tidy one. Each row in the wide dataset is relevant to the same person, so each row data about an "observation" or "individual sample" of our population. In the wide dataset, we have split this observation over three rows per person! In the long dataset, the value column doesn't contain the same type of information - sometimes it is a string ("Male"/"Female"), sometimes it is a weight, and sometimes it is a weight. We would not be able to perform a "standard scaling" on this column! Finally notice that going to the long form means that we could accidentally leave off an age, weight, or gender for one of our patients.

Tidy version: the wide dataset.

Medical trial data

Let's take a dataset from the Wikipedia page on Simpson's paradox which looks at the effect of two different treatments for kidney stones. Here is the original dataset:

Stone size Treatment A - recovered Treatment A - failed Treatment B - recovered Treatment B - failed
Small 81 6 234 36
Large 192 71 55 25

Here are two contenders for a "tidy" format. In both cases, it is easy to note that the columns have two pieces of information: the treatment and the outcome. Let's separate these out:

Stone size Treatment Recovered Failed
Small A 81 6
Small B 234 36
Large A 192 71
Large B 55 25

The question is whether the outcome should be a separate column or not. An alternative is:

Stone size Treatment Outcome Count
Small A Recovered 81
Small A Failed 6
Small B Recovered 234
Small B Failed 36
Large A Recovered 192
Large A Failed 71
Large B Recovered 55
Large B Failed 25

In this case, I would argue the version with the columns Stone Size, Treatment, Recovered, Failed would be the "tidy" version of this dataset. The last version splits the results of a trial of a particular treatment on a particular stone size into two different rows.

Box office revenue

We can go to BoxOffice Mojo and get information on ticket sales for the first 6 weeks (rounded to the nearest million)

Movie Release date Wk1 Wk2 Wk3 Wk4 Wk5 Wk6
The Lego Movie 2014-02-07 80M 71M 37M 26M 15M 10M
Big Hero 6 2014-11-07 76M 40M 33M 21M 10M 8M
Incredibles 2 2018-06-15 269M 125M 81M 44M 26M 20M

In this case, the "tidy" data set would take the form

Movie Date Week Revenue
The Lego Movie 2014-02-07 1 80M
The Lego Movie 2014-02-14 2 71M
The Lego Movie 2014-02-21 3 37M
... ... ... ...

There is even a case for saying that "Date" and "Week" contain overlapping information, and that a better way of approaching the problem might be to separate off the movie's opening date into a separate dataframe, and keep only the "Date" or "Week" column in this dataframe.

This case is interesting, because even though the form above is the "tidy" form, it might not be the form we want our data in. For example, we could try and build a model that takes the last two weeks of the box office revenue and uses it to predict the current week's revenue. For sklearn to be able to model this, we would need to included the previous two weeks as "features" in our dataframe, so we would try and make our dataframe look like the following:

Movie Date Week Revenue_lag2 Revenue_lag1 Revenue (target)
The Lego Movie 2014-02-07 1 NaN NaN 80M
The Lego Movie 2014-02-14 2 NaN 80M 71M
The Lego Movie 2014-02-21 3 80M 71M 37M
The Lego Movie 2014-02-21 4 71M 37M 26M
... ... ... ...

This is the form we would need for our model, even though it isn't "tidy".

When to be tidy?

Hadley Wickham works as one of the primary developers of R, and has constructed the "TidyVerse" set of functions. These are functions that are designed to take in tidy data, and if they output a dataframe, the output should also be tidy. The output could also be graphs. So the R approach to the problem of "we need to include lags" is to write the machine learning code so that it can take a tidy dataframe and do the lagging for you. Python has something similar with statsmodels and the timeseries packages contained within.

However, Python is unlikely to ever have something equivalent or as mature as the TidyVerse, simply because Python is designed around different constraints. Python and sklearn are designed so that each row can be run through the machine learning models independently. This becomes really important when using tools like Dask or PySpark, where the different parts of your dataframe might be sent to different processors or even different machines. The R solution of transforming the rows on the fly doesn't play nicely when you have to distribute your computation!

Even if working in Python, we will not manage to get the level of consistence that the Tidyverse provides R users, it is still useful to have a standardized format, even if we choose not to use it. Knowing about the TidyVerse gives us a set of best practices, and we break them only out of the practical needs of our models.

As Wickham acknowledges at the end of the paper, a "tidy" or "standard" data format is useful only so far as it makes future analysis less painful. It should not be considered an end onto itself. The current "tidy" format isn't even necessarily the best one, but simply one that we have converged on as a community.