The way we store data is often different from the way it is used to create visualizations, or how it is fed into models. Often the data stored in a database is in tidy format (as described in this paper by Hadley Wickham), and we have to transform it into a form appropriate for our analysis.
The two forms of data we will talk about are long and wide.
- Long form: This is very close to the tidy format. It typically makes the data easy to store, and allows easy transformations to other types.
- Wide form: Useful when looking at multiple lines / series on a graph, or when making tables for quick comparison.
Which format is more useful for a machine learning model depends on the details of the model. These descriptions can seem a little abstract without some explicit examples.
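As a quick orientation before the real datasets, here is a minimal sketch of the same four observations stored both ways. The column names and numbers are made up purely for illustration and are not taken from any of the datasets below.

```python
import pandas as pd

# Long form: one row per (day, series) observation. Values are made up.
long_df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "series": ["a", "b", "a", "b"],
    "value": [1.0, 10.0, 2.0, 20.0],
})

# Wide form: one row per day, one column per series, same four numbers.
wide_df = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [10.0, 20.0]},
    index=pd.Index(["Mon", "Tue"], name="day"),
)

print(long_df)
print(wide_df)
```

Both frames hold the same information; the long one names the series in a column, while the wide one names it in the column headers.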
Stock price: long to wide via `pivot`
Let's look at the stock prices for Apple (AAPL), Amazon (AMZN), and Google (GOOGL). We can use quandl, or simply scrape the Nasdaq pages, to get some information on how these stocks are doing. Here is a snapshot of the dataframe `stock_frame_long` in "long form"/"tidy form":
Here every row tells us about a specific stock on a specific day. If we wanted to plot the performance of the stock over time, we could use seaborn (which works well with long data sets):
import seaborn as sns
sns.lineplot(x='Date', y='Open', hue='Symbol', data=stock_frame_long)
INSERT GRAPH HERE
With `matplotlib`, long form isn't ideal for making this plot. We can use `groupby` (or filter) to make the plots:
import matplotlib.pyplot as plt

# Draw one line per stock symbol
for group_index, group_frame in stock_frame_long.groupby('Symbol'):
    plt.plot(group_frame['Date'], group_frame['Open'], label=group_index)
plt.ylabel('Open price')
plt.legend()
INSERT GRAPH HERE
An alternative is to use a wide dataframe. We can make one using the `pivot` method:
stock_frame_wide = stock_frame_long.pivot(index='Date', columns='Symbol', values='Open')
which produces the following dataframe:
Note that the dates have been moved into the index, which makes plotting relatively simple:
for col in stock_frame_wide.columns:
    stock_frame_wide[col].plot()
plt.ylabel('Open price')
plt.legend()
The advantage of the wide format in this case is that it is a lot easier to present the information to people, and it is slightly more natural to use with plotting. The disadvantage of the wide form is that it becomes cumbersome to add or remove columns. For example, if a company goes bankrupt, you have to decide whether to fill its later rows with blanks or drop the column. Likewise, if a new company starts trading, we have missing values for the dates before that company opened.
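The missing-value issue is easy to reproduce on a small scale. The sketch below uses hypothetical symbols and prices (not real market data), with one company that only starts trading on the second day.

```python
import pandas as pd

# Hypothetical prices: "NEW" has no row for the first day.
prices_long = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-02",
                            "2020-01-03", "2020-01-03", "2020-01-03"]),
    "Symbol": ["AAA", "BBB", "AAA", "BBB", "NEW"],
    "Open": [10.0, 20.0, 11.0, 19.0, 5.0],
})

# Each (Date, Symbol) pair appears at most once, so pivot can place every
# value in exactly one cell of the wide table.
prices_wide = prices_long.pivot(index="Date", columns="Symbol", values="Open")
print(prices_wide)

# The long table simply omits the missing observation;
# the wide table has to hold a NaN in its place.
print(prices_wide["NEW"].isna().sum())
```

Whether to keep that NaN, fill it, or drop the column is a decision the long form never forces on you.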
Olympic Medals: long vs wide with `pivot_table`
Kaggle has an easy-to-read data set of Olympic medal winners. Loading the data from the Summer Olympics, we see:
| ||Year||City||Sport||Discipline||Athlete||Country||Gender||Event||Medal|
|0||1896||Athens||Aquatics||Swimming||HAJOS, Alfred||HUN||Men||100M Freestyle||Gold|
|1||1896||Athens||Aquatics||Swimming||HERSCHMANN, Otto||AUT||Men||100M Freestyle||Silver|
|2||1896||Athens||Aquatics||Swimming||DRIVAS, Dimitrios||GRE||Men||100M Freestyle For Sailors||Bronze|
|3||1896||Athens||Aquatics||Swimming||MALOKINIS, Ioannis||GRE||Men||100M Freestyle For Sailors||Gold|
|4||1896||Athens||Aquatics||Swimming||CHASAPIS, Spiridon||GRE||Men||100M Freestyle For Sailors||Silver|
|5||1896||Athens||Aquatics||Swimming||CHOROPHAS, Efstathios||GRE||Men||1200M Freestyle||Bronze|
|6||1896||Athens||Aquatics||Swimming||HAJOS, Alfred||HUN||Men||1200M Freestyle||Gold|
This information is already in long form, which is convenient for storage. The wide form is better for:
- Visualization: The long data is too detailed and not well organized to allow quick visual comparisons.
- Point summarization: If we wanted to calculate a weighted medal score (for example, 5 points for gold, 3 for silver, and 1 for bronze), the long format isn't great for analysis.
- Some machine learning models: We might use this data to see if performance in previous Olympic games can help predict the spending of that country in the next Olympic games. In this case, we want to feed the model all the information about a country's performance in a given set of games on a single row to allow it to make predictions.
Let's convert this to a "wide form" with the count of medals, broken down by year, country, gender, and medal type.
Last time we converted from long form to wide form, we used `DataFrame.pivot`. In that case, each element in the wide table came from a single row. For this problem, we want to aggregate many rows by counting how many occurred, which `pivot` cannot do. Instead, we are going to use `DataFrame.pivot_table` in the following way:
import pandas as pd

summer = pd.read_csv('summer.csv')
# Count rows for every (Year, Country, Gender, Medal) combination
summer_wide = (summer.pivot_table(index='Year', columns=['Country', 'Gender', 'Medal'], aggfunc='count')
               .fillna(0).astype(int)
               .loc[:, ('Athlete')]
              )
All the columns in `summer` that are not used in the `index` or `columns` arguments to `pivot_table` (such as "Athlete", "Discipline", and "City") are aggregated separately, and each gets its own copy of the table at the top level of the column index. The call `.loc[:, ('Athlete')]` selects just the copy for "Athlete".
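To see both behaviours on a small scale, here is a sketch with a handful of made-up medal records (the athletes and results are invented). It shows `pivot` refusing to reshape when two rows land in the same cell, and `pivot_table` producing one top-level copy per unused column.

```python
import pandas as pd

# Made-up records in the spirit of the Olympic table.
medals = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2004],
    "Country": ["USA", "USA", "CAN", "USA"],
    "Medal": ["Gold", "Gold", "Silver", "Gold"],
    "Athlete": ["A", "B", "C", "D"],
    "City": ["Sydney", "Sydney", "Sydney", "Athens"],
})

# pivot cannot reshape this: two rows share Year=2000 and Country='USA'.
try:
    medals.pivot(index="Year", columns="Country", values="Athlete")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates. With no `values` argument, every
# remaining column (Medal, Athlete, City) gets its own block of counts.
counts = medals.pivot_table(index="Year", columns="Country", aggfunc="count")
print(counts.columns.get_level_values(0).unique())

# Keep just the block that counted the Athlete column.
athlete_counts = counts.fillna(0).astype(int).loc[:, "Athlete"]
print(athlete_counts)
```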
The `summer_wide` dataframe has up to 6 columns per country (3 medal types for each gender), with a total of 558 columns, which is still hard to visualize. We can focus on just American medals, which take up only 6 columns, and keep the number of rows reasonable by looking only at Olympic games from 1984 onward:
|('Men', 'Bronze')||('Men', 'Gold')||('Men', 'Silver')||('Women', 'Bronze')||('Women', 'Gold')||('Women', 'Silver')|
To show selection using the multi-index, we could also look at women's medals from the US and Canada, starting in 1984:
summer_women = summer_wide.loc[1984:, (['USA', 'CAN'], 'Women',['Gold', 'Silver', 'Bronze'])]
|('CAN', 'Women', 'Bronze')||('CAN', 'Women', 'Gold')||('CAN', 'Women', 'Silver')||('USA', 'Women', 'Bronze')||('USA', 'Women', 'Gold')||('USA', 'Women', 'Silver')|
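The two selections above can be tried on a toy frame. This is only a sketch: the counts are made-up integers, and the single-label form `.loc[1984:, 'USA']` is an assumption about how the American subset could be taken, not necessarily the call used for the table above.

```python
import pandas as pd

# Toy frame with (Country, Gender, Medal) columns, shaped like summer_wide.
columns = pd.MultiIndex.from_product(
    [["CAN", "USA"], ["Men", "Women"], ["Bronze", "Gold", "Silver"]],
    names=["Country", "Gender", "Medal"],
)
years = pd.Index([1980, 1984, 1988], name="Year")
# Cell value is row number + column number, so every cell is easy to check.
toy_wide = pd.DataFrame(
    [[i + j for j in range(len(columns))] for i in range(len(years))],
    index=years, columns=columns,
)

# A single top-level label picks one country and drops that level,
# leaving (Gender, Medal) columns.
usa = toy_wide.loc[1984:, "USA"]

# A tuple with one selector per level picks across several countries.
women = toy_wide.loc[1984:, (["USA", "CAN"], "Women", ["Gold", "Silver", "Bronze"])]

print(usa)
print(women)
```

Label slices such as `1984:` include the starting label, so both results keep the 1984 and 1988 rows.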
Even so, we are left with a lot of different countries, and a very wide table. Still focusing on games from 1984 onward, let's select the countries that have won the most medals:
summer_countries = (summer.pivot_table(index='Year', columns=['Country'], aggfunc='count')
                    .fillna(0).astype(int)
                    .loc[:, ('Athlete')]
                   )
medal_totals = summer_countries.sum(axis=0)
country_mask = (medal_totals.rank(ascending=False) <= 10)
summer_countries.loc[1984:, country_mask]
We have reduced the dataset down enough that someone would be able to look at it and discern patterns in the data.
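The weighted medal score mentioned earlier is also straightforward once the counts are in wide form. The sketch below uses made-up records and the 5/3/1 weights from the example; the variable names are illustrative only.

```python
import pandas as pd

# Made-up medal records.
medals = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "CAN", "CAN", "GRE"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold", "Bronze"],
    "Athlete": ["A", "B", "C", "D", "E", "F"],
})
weights = pd.Series({"Gold": 5, "Silver": 3, "Bronze": 1})

# Wide table of counts: one row per country, one column per medal type.
medal_counts = (medals.pivot_table(index="Country", columns="Medal",
                                   values="Athlete", aggfunc="count")
                .fillna(0).astype(int))

# Multiply each medal column by its weight, then sum across each row.
scores = medal_counts.mul(weights, axis="columns").sum(axis="columns")

# Keep the two highest-scoring countries with a rank-based mask.
top_mask = scores.rank(ascending=False) <= 2
top_countries = scores[top_mask]
print(top_countries)
```

Doing the same calculation directly on the long table would need a lookup of the weight for every row followed by a `groupby`, which works but is harder to read at a glance.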
Demographic data: wide to long with `melt`
The Wikipedia page for Seattle presents the following demographic information:
| ||Race||2010||1990||1970||1940|
|3||Black or African American||7.9%||10.1%||7.1%||1.0%|
|4||Hispanic or Latino (of any race)||6.6%||3.6%||2.0%||NaN|
|7||Two or more races||5.1%||NaN||NaN||NaN|
The numbers in brackets are the classic Wikipedia citations. Here we see some of the problems with wide format: the early questionnaires didn't ask about "other", "two or more races" or "non-hispanic" as categories, so we are forced to use `NaN`s instead. In long format, we simply wouldn't store this data.
The long format should include the race, the year, and the percentage of population. We will also have to clean the data a little (for example, eliminating the percentage signs and the citation brackets).
We can grab the table with a little experimentation:
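# Grab all the tables on the Wikipedia page
seattle_tables = pd.read_html('https://en.wikipedia.org/wiki/Seattle')
# read_html returns a list of dataframes. table_index is a placeholder for the
# position of the demographics table, which has to be found by inspection.
demographic_wide = seattle_tables[table_index]
# Convert to long form:
demographic_long = demographic_wide.melt(id_vars='Race', value_vars=[2010, 1990, 1970, 1940],
                                         var_name='Year', value_name='fraction')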
This command changes the dataframe to have two new columns: the "variable" column called `Year` and the "value" column called `fraction`. Each entry in the columns `2010`, `1990`, `1970`, and `1940` gets copied onto its own row, where the column name is entered for the `Year`, and the entry value is used for the `fraction`:
| ||Race||Year||fraction|
|2||Black or African American||2010||7.9%|
|3||Hispanic or Latino (of any race)||2010||6.6%|
|6||Two or more races||2010||5.1%|
|9||Black or African American||1990||10.1%|
|10||Hispanic or Latino (of any race)||1990||3.6%|
|13||Two or more races||1990||nan|
|16||Black or African American||1970||7.1%|
|17||Hispanic or Latino (of any race)||1970||2.0%|
|20||Two or more races||1970||nan|
|23||Black or African American||1940||1.0%|
|24||Hispanic or Latino (of any race)||1940||nan|
|27||Two or more races||1940||nan|
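The same reshaping can be checked on a two-row table. This sketch uses made-up group names and percentages rather than the Seattle figures, so that every output row can be traced back to one input cell.

```python
import pandas as pd

# Made-up wide table: one row per group, one column per year.
toy_wide = pd.DataFrame({
    "Race": ["Group A", "Group B"],
    2010: ["60.0%", "5.0%"],
    1990: ["70.0%", None],
})

# Each cell under a year column becomes its own row; the column name goes
# into "Year" and the cell contents go into "fraction".
toy_long = toy_wide.melt(id_vars="Race", value_vars=[2010, 1990],
                         var_name="Year", value_name="fraction")
print(toy_long)
```

Two groups times two year columns gives four rows, including one row for the missing 1990 value, which still has to be dropped afterwards.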
We still have a little cleaning to do in the `fraction` column:
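def clean_fractions(series):
    # Strip the '%' sign and any trailing citation such as '[12]', then rescale to 0-1
    return series.replace(r'%(\s*\[\d*\])?', '', regex=True).astype(float)/100

demographic_long['fraction'] = clean_fractions(demographic_long['fraction'])
# Rows with no data are simply not stored in long form
demographic_long.dropna(inplace=True)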
Now we have our long form "tidy" dataset:
| ||Race||Year||fraction|
|2||Black or African American||2010||0.079|
|3||Hispanic or Latino (of any race)||2010||0.066|
|6||Two or more races||2010||0.051|
|9||Black or African American||1990||0.101|
|10||Hispanic or Latino (of any race)||1990||0.036|
|16||Black or African American||1970||0.071|
|17||Hispanic or Latino (of any race)||1970||0.02|
|23||Black or African American||1940||0.01|
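To check what the cleaning regex does with citation brackets, here is a small sketch that runs `clean_fractions` on a few made-up strings. The bracketed citation numbers are invented for the example.

```python
import pandas as pd

def clean_fractions(series):
    # Remove a '%' sign plus an optional citation like '[5]' or ' [12]',
    # then convert to float and rescale from percent to a 0-1 fraction.
    return series.replace(r'%(\s*\[\d*\])?', '', regex=True).astype(float) / 100

# Made-up inputs: plain, with citation, with space before citation, missing.
raw = pd.Series(["7.9%", "65.7%[5]", "10.1% [12]", float("nan")])
cleaned = clean_fractions(raw)
print(cleaned)

# Missing entries stay NaN through the conversion and can then be dropped.
print(cleaned.dropna())
```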
Summary and next up
- Going long to wide, where each element in the wide table comes from a single row.
  - Reasonably straightforward.
  - Uses the `pivot` method.
- Going long to wide, where each element in the wide table is an aggregation of multiple rows.
  - Uses the `pivot_table` method.
  - Will generate a copy of the aggregation for each unused variable. This could strain memory if the number of columns is large. Possible workarounds are to `groupby` first, or select only the columns you are interested in.
- Going from wide to long:
  - Usually done to store the data in tidy format.
  - Can be done partially (i.e. convert some variables from wide format to long).
  - Uses the `melt` method.
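The two workarounds for the extra copies can be sketched as follows. The records are made up, and this is one possible way to apply each idea rather than the only one.

```python
import pandas as pd

# Made-up records with several columns that are not needed for the count.
medals = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2004],
    "Country": ["USA", "USA", "CAN", "USA"],
    "Athlete": ["A", "B", "C", "D"],
    "City": ["Sydney", "Sydney", "Sydney", "Athens"],
    "Event": ["E1", "E2", "E1", "E1"],
})

# Option 1: name the single column to aggregate, so pivot_table does not
# build a block of counts for City and Event as well.
by_values = (medals.pivot_table(index="Year", columns="Country",
                                values="Athlete", aggfunc="count")
             .fillna(0).astype(int))

# Option 2: group first, count the rows in each group, then move Country
# from the index into the columns.
by_groupby = medals.groupby(["Year", "Country"]).size().unstack(fill_value=0)

print(by_values)
print(by_groupby)
```

Note that `size` counts rows while `count` counts non-null values, so the two options only agree when the counted column has no missing entries, as is the case here.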
Other articles on reshaping data you might be interested in: