Summary: The principles of Hadley Wickham's tidy data, and how they relate to long and wide form data.
In the blog post long_vs_wide, we looked at how to convert data between long and wide formats in Pandas using commands such as DataFrame.melt. As a quick reminder, we can look at an example of sales at a store.
In wide form, we might list out the sales for each quarter:
| Store | Year | Q1_sales | Q2_sales | Q3_sales | Q4_sales |
|-------|------|----------|----------|----------|----------|
| A001  | 2018 | 55 M     | 45 M     | 22 M     | 50 M     |
| A002  | 2018 | 98 M     | 70 M     | 60 M     | 60 M     |
In long form, we would not use the column name to label the quarter. Instead, we would add a separate column for the quarter, like so:

| Store | Year | Quarter | Sales |
|-------|------|---------|-------|
| A001  | 2018 | 1       | 55 M  |
| A001  | 2018 | 2       | 45 M  |
| A001  | 2018 | 3       | 22 M  |
| A001  | 2018 | 4       | 50 M  |
| A002  | 2018 | 1       | 98 M  |
| A002  | 2018 | 2       | 70 M  |
| A002  | 2018 | 3       | 60 M  |
| A002  | 2018 | 4       | 60 M  |
In the long vs wide post, we concentrated on how to convert from one format to the other. This post will concentrate on when we would want to put the data into "tidy" format, and why.
Hadley Wickham defines "tidy data" as data stored in "third normal form". Restated in the language used by statisticians and data scientists, we have:
- Each variable forms a column, and that column contains one "type" of data
- Each observation forms a row
- Each type of observational unit forms a table
Looking at the long and wide forms of our sales data, it can be difficult to apply these rules to say which one is "tidy". For example, if we are trying to observe quarterly sales, the long form would be tidy. If we are trying to observe annual sales, there is an argument that the wide form, with all the data about a year in a single row, is tidy.
An easier way for me to think about whether a dataset is "tidy" or not is to consider the multiple ways a dataset might become untidy. From Wickham's paper, these are
- Column headers contain values, rather than names
- Multiple variables are stored in a single column
- Variables are stored in both rows and columns
- Multiple observational types are stored in a single table
- A single observational unit is stored in multiple tables.
The last two are more about how to split your dataframe into multiple dataframes. I will concentrate on the first three, which tell us about the shape our dataframe should have to be considered tidy. When thought of this way, it becomes clearer that the long form of our dataset is the "tidy" one, as the column names Q1_sales through Q4_sales are secretly storing a value (namely the quarter number).
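As a sketch of the conversion (the column names follow the tables above), DataFrame.melt moves the quarter labels out of the headers and into their own column:

```python
import pandas as pd

# Wide form: the quarter number is hidden in the column names
wide = pd.DataFrame({
    "Store": ["A001", "A002"],
    "Year": [2018, 2018],
    "Q1_sales": [55, 98],
    "Q2_sales": [45, 70],
    "Q3_sales": [22, 60],
    "Q4_sales": [50, 60],
})

# Tidy long form: the quarter becomes a variable in its own column
long = wide.melt(id_vars=["Store", "Year"],
                 var_name="Quarter", value_name="Sales")

# Extract the quarter number from labels like "Q1_sales"
long["Quarter"] = long["Quarter"].str.extract(r"Q(\d)", expand=False).astype(int)
print(long.sort_values(["Store", "Quarter"]))
```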
Medical data: age, weight, and gender
Consider the following example, where we tabulate someone's age, weight, and gender. Here are two ways of presenting the data. First, in wide format:
The same data can be presented in a long format:
In this case, the wide dataset is the tidy one. Each row in the wide dataset is relevant to the same person, so each row is data about an "observation" or "individual sample" of our population. In the long dataset, we have split this observation over three rows per person! In the long dataset, the value column doesn't contain the same type of information - sometimes it is a string ("Male"/"Female"), sometimes it is a weight, and sometimes it is an age. We would not be able to perform a "standard scaling" on this column! Finally, notice that going to the long form means we could accidentally leave off a value, such as the gender for one of our patients.
Tidy version: the wide dataset.
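We can see the mixed-type problem directly in pandas. The patient names and values below are made up for illustration:

```python
import pandas as pd

# Long form: one row per (person, variable) pair; values are hypothetical
long = pd.DataFrame({
    "Name": ["Alice"] * 3 + ["Bob"] * 3,
    "Variable": ["Age", "Weight", "Gender"] * 2,
    "Value": [34, 60, "Female", 41, 85, "Male"],
})

# The Value column mixes numbers and strings, so pandas falls back to
# the generic object dtype -- no standard scaling is possible here.
print(long["Value"].dtype)  # object

# Pivoting back to wide gives one row per person, one column per variable
wide = long.pivot(index="Name", columns="Variable", values="Value")
print(wide)
```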
Medical trial data
Let's take a dataset from the Wikipedia page on Simpson's paradox which looks at the effect of two different treatments for kidney stones. Here is the original dataset:
| Stone size   | Treatment A - recovered | Treatment A - failed | Treatment B - recovered | Treatment B - failed |
|--------------|-------------------------|----------------------|-------------------------|----------------------|
| Small stones | 81                      | 6                    | 234                     | 36                   |
| Large stones | 192                     | 71                   | 55                      | 25                   |
Here are two contenders for a "tidy" format. In both cases, it is easy to see that the original column names contain two pieces of information: the treatment and the outcome. Let's separate these out:
The question is whether the outcome should be a separate column or not. An alternative is:
In this case, I would argue the version with the columns Recovered and Failed would be the "tidy" version of this dataset. The last version splits the results of a trial of a particular treatment on a particular stone size into two different rows.
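As a sketch, here is how pandas can move between the two contenders; the counts are taken from the Wikipedia kidney stone table, and the column names are my own:

```python
import pandas as pd

# The version with a separate Outcome column (counts from Wikipedia)
alternative = pd.DataFrame({
    "Stone size": ["Small"] * 4 + ["Large"] * 4,
    "Treatment": ["A", "A", "B", "B"] * 2,
    "Outcome": ["Recovered", "Failed"] * 4,
    "Count": [81, 6, 234, 36, 192, 71, 55, 25],
})

# Move the outcome back into columns, so each trial arm is a single row
tidy = (alternative
        .pivot_table(index=["Stone size", "Treatment"],
                     columns="Outcome", values="Count")
        .reset_index())
print(tidy)
```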
Box office revenue
We can go to Box Office Mojo and get information on weekly ticket sales for each movie's first 6 weeks (rounded to the nearest million):
| Title          | Release date | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 |
|----------------|--------------|--------|--------|--------|--------|--------|--------|
| The Lego Movie | 2014-02-07   | 80M    | 71M    | 37M    | 26M    | 15M    | 10M    |
| Big Hero 6     | 2014-11-07   | 76M    | 40M    | 33M    | 21M    | 10M    | 8M     |
In this case, the "tidy" data set would take the form:

| Title          | Date       | Week | Revenue |
|----------------|------------|------|---------|
| The Lego Movie | 2014-02-07 | 1    | 80M     |
| The Lego Movie | 2014-02-14 | 2    | 71M     |
| The Lego Movie | 2014-02-21 | 3    | 37M     |
There is even a case for saying that "Date" and "Week" contain overlapping information, and that a better way of approaching the problem might be to separate off the movie's opening date into a separate dataframe, and keep only the "Date" or "Week" column in this dataframe.
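A minimal sketch of that split, with hypothetical table and column names: one dataframe per observational unit, joined back together whenever we need the calendar date:

```python
import pandas as pd

# One table per observational unit: movies, and weekly revenues
movies = pd.DataFrame({
    "Title": ["The Lego Movie", "Big Hero 6"],
    "Release date": pd.to_datetime(["2014-02-07", "2014-11-07"]),
})

revenue = pd.DataFrame({
    "Title": ["The Lego Movie"] * 3,
    "Week": [1, 2, 3],
    "Revenue": ["80M", "71M", "37M"],
})

# The calendar date is recoverable by joining the two tables
joined = revenue.merge(movies, on="Title")
joined["Date"] = joined["Release date"] + pd.to_timedelta(7 * (joined["Week"] - 1), unit="D")
print(joined[["Title", "Date", "Week", "Revenue"]])
```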
This case is interesting, because even though the form above is the "tidy" form, it might not be the form we want our data in. For example, we could try to build a model that takes the last two weeks of box office revenue and uses it to predict the current week's revenue. For sklearn to be able to model this, we would need to include the previous two weeks as "features" in our dataframe, so we would try to make our dataframe look like the following:
| Title          | Date       | Week | Revenue_lag2 | Revenue_lag1 | Revenue |
|----------------|------------|------|--------------|--------------|---------|
| The Lego Movie | 2014-02-07 | 1    | NaN          | NaN          | 80M     |
| The Lego Movie | 2014-02-14 | 2    | NaN          | 80M          | 71M     |
| The Lego Movie | 2014-02-21 | 3    | 80M          | 71M          | 37M     |
| The Lego Movie | 2014-02-28 | 4    | 71M          | 37M          | 26M     |
This is the form we would need for our model, even though it isn't "tidy".
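As a sketch, these lag features can be built from the tidy form with groupby and shift (the lag column names are my own, and revenues are in millions):

```python
import pandas as pd

# Tidy weekly revenue, per the tables above
df = pd.DataFrame({
    "Title": ["The Lego Movie"] * 4 + ["Big Hero 6"] * 4,
    "Week": [1, 2, 3, 4] * 2,
    "Revenue": [80, 71, 37, 26, 76, 40, 33, 21],
})

# Shift within each movie, so one film's weeks never leak into another's
for lag in (1, 2):
    df[f"Revenue_lag{lag}"] = df.groupby("Title")["Revenue"].shift(lag)

print(df)
```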
When to be tidy?
Hadley Wickham is one of the most prominent R developers, and has constructed the "Tidyverse" collection of packages. These are functions designed to take in tidy data, and if they output a dataframe, the output is also tidy. The output could also be a graph. So the R approach to the problem of "we need to include lags" is to write the modeling code so that it can take a tidy dataframe and do the lagging for you. Python has something similar with statsmodels and the time series tools it contains.
However, Python is unlikely to ever have anything as comprehensive or as mature as the Tidyverse, simply because Python is designed around different constraints. Python and sklearn are designed so that each row can be run through a machine learning model independently. This becomes really important when using tools like PySpark, where different parts of your dataframe might be sent to different processors or even different machines. The R solution of transforming rows on the fly doesn't play nicely when you have to distribute your computation!
Even though, working in Python, we will not get the level of consistency that the Tidyverse provides R users, it is still useful to have a standardized format, even if we sometimes choose not to use it. Knowing about the Tidyverse gives us a set of best practices, and we break them only out of the practical needs of our models.
As Wickham acknowledges at the end of his paper, a "tidy" or "standard" data format is useful only insofar as it makes future analysis less painful. It should not be considered an end in itself. The current "tidy" format isn't even necessarily the best one, but simply one that we have converged on as a community.