Summary: The principles of Hadley Wickham's tidy data, and how they relate to long and wide form data.
In the blog post long_vs_wide, we looked at how to convert data between long and wide formats in Pandas using commands such as DataFrame.melt. As a quick reminder, we can look at an example of sales at a store.
In wide form, we might list out the sales for each quarter:
| Store | Year | Q1_sales | Q2_sales | Q3_sales | Q4_sales |
|-------|------|----------|----------|----------|----------|
| A001  | 2018 | 55 M     | 45 M     | 22 M     | 50 M     |
| A002  | 2018 | 98 M     | 70 M     | 60 M     | 60 M     |
In long form, we would not use the column name to label the quarter. Instead, we would add a separate column for the quarter, like so:

| Store | Year | Quarter | Sales |
|-------|------|---------|-------|
| A001  | 2018 | 1       | 55 M  |
| A001  | 2018 | 2       | 45 M  |
| A001  | 2018 | 3       | 22 M  |
| A001  | 2018 | 4       | 50 M  |
| A002  | 2018 | 1       | 98 M  |
| A002  | 2018 | 2       | 70 M  |
| A002  | 2018 | 3       | 60 M  |
| A002  | 2018 | 4       | 60 M  |
In the long vs wide post, we concentrated on how to convert from one format to the other. This post will concentrate on when we would want to put the data into "tidy" format, and why.
Hadley Wickham defines "tidy data" as data stored in "third normal form". Restated in the language used by statisticians and data scientists, we have:
- Each variable forms a column, and that column contains one "type" of data
- Each observation forms a row
- Each type of observational unit forms a table
Looking at the long and wide forms of our sales data, it can be difficult to apply these rules to say which one is "tidy". For example, if we are trying to observe quarterly sales, the long form would be tidy. If we are trying to observe annual sales, there is an argument that the wide form, with all the data about a year in a single row, is tidy.
An easier way for me to think about whether a dataset is "tidy" or not is to consider the multiple ways a dataset might become untidy. From Wickham's paper, these are
- Column headers contain values, rather than names
- Multiple variables are stored in a single column
- Variables are stored in both rows and columns
- Multiple observational types are stored in a single table
- A single observational unit is stored in multiple tables.
The last two are more about how to split your dataframe into multiple dataframes. I will concentrate on the first three, which tell us about the shape our dataframe should have to be considered tidy. When thought of this way, it becomes clearer that the long form of our dataset is the "tidy" one, as the column names Q1_sales through Q4_sales are secretly storing a value (namely the quarter number).
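As a sketch of the conversion (the column names follow the tables above), DataFrame.melt moves the quarter labels out of the headers and into their own column:

```python
import pandas as pd

# Wide form: the quarter number is hidden in the column names
wide = pd.DataFrame({
    "Store": ["A001", "A002"],
    "Year": [2018, 2018],
    "Q1_sales": [55, 98],
    "Q2_sales": [45, 70],
    "Q3_sales": [22, 60],
    "Q4_sales": [50, 60],
})

# Tidy long form: the quarter becomes a variable in its own column
long = wide.melt(id_vars=["Store", "Year"],
                 var_name="Quarter", value_name="Sales")

# Extract the quarter number from labels like "Q1_sales"
long["Quarter"] = long["Quarter"].str.extract(r"Q(\d)", expand=False).astype(int)
print(long.sort_values(["Store", "Quarter"]))
```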
Medical data: age, weight, and gender
Consider the following example, where we tabulate someone's age, weight, and gender. Here are two ways of presenting the data. First, in wide format:
The same data can be presented in a long format:
In this case, the wide dataset is the tidy one. Each row in the wide dataset is relevant to the same person, so each row is data about an "observation" or "individual sample" of our population. In the long dataset, we have split this observation over three rows per person! In the long dataset, the value column doesn't contain the same type of information - sometimes it is a string ("Male"/"Female"), sometimes it is a weight, and sometimes it is an age. We would not be able to perform a "standard scaling" on this column! Finally, notice that going to the long form means we could accidentally leave off a value, such as the gender for one of our patients.
Tidy version: the wide dataset.
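We can see the mixed-type problem directly in pandas. The patient names and values below are made up for illustration:

```python
import pandas as pd

# Long form: one row per (person, variable) pair; values are hypothetical
long = pd.DataFrame({
    "Name": ["Alice"] * 3 + ["Bob"] * 3,
    "Variable": ["Age", "Weight", "Gender"] * 2,
    "Value": [34, 60, "Female", 41, 85, "Male"],
})

# The Value column mixes numbers and strings, so pandas falls back to
# the generic object dtype -- no standard scaling is possible here.
print(long["Value"].dtype)  # object

# Pivoting back to wide gives one row per person, one column per variable
wide = long.pivot(index="Name", columns="Variable", values="Value")
print(wide)
```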
Medical trial data
Let's take a dataset from the Wikipedia page on Simpson's paradox which looks at the effect of two different treatments for kidney stones. Here is the original dataset:
| Stone size   | Treatment A - recovered | Treatment A - failed | Treatment B - recovered | Treatment B - failed |
|--------------|-------------------------|----------------------|-------------------------|----------------------|
| Small stones | 81                      | 6                    | 234                     | 36                   |
| Large stones | 192                     | 71                   | 55                      | 25                   |
Here are two contenders for a "tidy" format. In both cases, it is easy to see that the original column names contain two pieces of information: the treatment and the outcome. Let's separate these out:
The question is whether the outcome should be a separate column or not. An alternative is:
In this case, I would argue the version with the columns Recovered and Failed would be the "tidy" version of this dataset. The last version splits the results of a trial of a particular treatment on a particular stone size into two different rows.
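As a sketch, here is how pandas can move between the two contenders; the counts are taken from the Wikipedia kidney stone table, and the column names are my own:

```python
import pandas as pd

# The version with a separate Outcome column (counts from Wikipedia)
alternative = pd.DataFrame({
    "Stone size": ["Small"] * 4 + ["Large"] * 4,
    "Treatment": ["A", "A", "B", "B"] * 2,
    "Outcome": ["Recovered", "Failed"] * 4,
    "Count": [81, 6, 234, 36, 192, 71, 55, 25],
})

# Move the outcome back into columns, so each trial arm is a single row
tidy = (alternative
        .pivot_table(index=["Stone size", "Treatment"],
                     columns="Outcome", values="Count")
        .reset_index())
print(tidy)
```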
Box office revenue
We can go to Box Office Mojo and get information on weekly ticket sales for each movie's first 6 weeks (rounded to the nearest million):
| Title          | Release date | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 |
|----------------|--------------|--------|--------|--------|--------|--------|--------|
| The Lego Movie | 2014-02-07   | 80M    | 71M    | 37M    | 26M    | 15M    | 10M    |
| Big Hero 6     | 2014-11-07   | 76M    | 40M    | 33M    | 21M    | 10M    | 8M     |
In this case, the "tidy" data set would take the form:

| Title          | Date       | Week | Revenue |
|----------------|------------|------|---------|
| The Lego Movie | 2014-02-07 | 1    | 80M     |
| The Lego Movie | 2014-02-14 | 2    | 71M     |
| The Lego Movie | 2014-02-21 | 3    | 37M     |
There is even a case for saying that "Date" and "Week" contain overlapping information, and that a better way of approaching the problem might be to separate off the movie's opening date into a separate dataframe, and keep only the "Date" or "Week" column in this dataframe.
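A minimal sketch of that split, with hypothetical table and column names: one dataframe per observational unit, joined back together whenever we need the calendar date:

```python
import pandas as pd

# One table per observational unit: movies, and weekly revenues
movies = pd.DataFrame({
    "Title": ["The Lego Movie", "Big Hero 6"],
    "Release date": pd.to_datetime(["2014-02-07", "2014-11-07"]),
})

revenue = pd.DataFrame({
    "Title": ["The Lego Movie"] * 3,
    "Week": [1, 2, 3],
    "Revenue": ["80M", "71M", "37M"],
})

# The calendar date is recoverable by joining the two tables
joined = revenue.merge(movies, on="Title")
joined["Date"] = joined["Release date"] + pd.to_timedelta(7 * (joined["Week"] - 1), unit="D")
print(joined[["Title", "Date", "Week", "Revenue"]])
```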
This case is interesting, because even though the form above is the "tidy" form, it might not be the form we want our data in. For example, we could try to build a model that takes the last two weeks of box office revenue and uses it to predict the current week's revenue. For sklearn to be able to model this, we would need to include the previous two weeks as "features" in our dataframe, so we would try to make our dataframe look like the following:
| Title          | Date       | Week | Revenue_lag2 | Revenue_lag1 | Revenue |
|----------------|------------|------|--------------|--------------|---------|
| The Lego Movie | 2014-02-07 | 1    | NaN          | NaN          | 80M     |
| The Lego Movie | 2014-02-14 | 2    | NaN          | 80M          | 71M     |
| The Lego Movie | 2014-02-21 | 3    | 80M          | 71M          | 37M     |
| The Lego Movie | 2014-02-28 | 4    | 71M          | 37M          | 26M     |
This is the form we would need for our model, even though it isn't "tidy".
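As a sketch, these lag features can be built from the tidy form with groupby and shift (the lag column names are my own, and revenues are in millions):

```python
import pandas as pd

# Tidy weekly revenue, per the tables above
df = pd.DataFrame({
    "Title": ["The Lego Movie"] * 4 + ["Big Hero 6"] * 4,
    "Week": [1, 2, 3, 4] * 2,
    "Revenue": [80, 71, 37, 26, 76, 40, 33, 21],
})

# Shift within each movie, so one film's weeks never leak into another's
for lag in (1, 2):
    df[f"Revenue_lag{lag}"] = df.groupby("Title")["Revenue"].shift(lag)

print(df)
```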
When to be tidy?
Hadley Wickham is one of the most prominent R developers, and has constructed the "Tidyverse" collection of packages. These are functions designed to take in tidy data, and if they output a dataframe, the output is also tidy. The output could also be a graph. So the R approach to the problem of "we need to include lags" is to write the modeling code so that it can take a tidy dataframe and do the lagging for you. Python has something similar with statsmodels and the time series tools it contains.
However, Python is unlikely to ever have anything as comprehensive or as mature as the Tidyverse, simply because Python is designed around different constraints. Python and sklearn are designed so that each row can be run through a machine learning model independently. This becomes really important when using tools like PySpark, where different parts of your dataframe might be sent to different processors or even different machines. The R solution of transforming rows on the fly doesn't play nicely when you have to distribute your computation!
Even though, working in Python, we will not get the level of consistency that the Tidyverse provides R users, it is still useful to have a standardized format, even if we sometimes choose not to use it. Knowing about the Tidyverse gives us a set of best practices, and we break them only out of the practical needs of our models.
As Wickham acknowledges at the end of his paper, a "tidy" or "standard" data format is useful only insofar as it makes future analysis less painful. It should not be considered an end in itself. The current "tidy" format isn't even necessarily the best one, but simply one that we have converged on as a community.