The other election years can typically be found in the form of PDFs, and have to be manually converted. Finding a comprehensive list of election results for multiple years is challenging. Stephen Wolf of the Daily Kos has created an Excel spreadsheet from Dave Leip's Atlas, which gives the election results back as far as 1828.
The trouble is, the Excel sheet is difficult to import into Pandas.
Note the data is wide, with a multiple headers. The
Year spans multiple columns, which is then broken down by candidate. Also note that some years have two candidates, while other years have three or more.
The data shown in the screenshot are shown by percentages of the vote. If we scroll across to column HG, the votes are reported again in raw numbers.
Our goal is to read the data into the following format:
year,state,votes,candidate,party,short_state 2016,Alabama,729547,"Clinton, Hillary",Democratic,AL 2016,Alaska,116454,"Clinton, Hillary",Democratic,AK 2016,Arizona,1161167,"Clinton, Hillary",Democratic,AZ 2016,Arkansas,380494,"Clinton, Hillary",Democratic,AR 2016,California,8753788,"Clinton, Hillary",Democratic,CA 2016,Colorado,1338870,"Clinton, Hillary",Democratic,CO 2016,Connecticut,897572,"Clinton, Hillary",Democratic,CT
In this post we will walk through the main steps, but a repository including the full code for processing this data set is available here.
Dealing with a MultiIndex
Let's start with a simplified excel spreadsheet with the following format:
|State||Total||Obama, Barak - Democratic||Romney, Mitt - Republican||Total||Obama, Barak - Democratic||McCain, John - Republican|
Note that the sum of the "major" candidates don't sum to the totals. Presumably there are some write-ins that are not counted.
We have also skipped the percentages that occurred at the beginning of the real spreadsheet. Our goal here is to be able to unpack the multiindex.
Let's start by reading in the dataset. We don't try and name the headers.
import pandas as pd df = pd.read_excel(filename, header=None)
Our dataframe now looks like
|State||Total||Obama, Barak - Democratic||Romney, Mitt - Republican||Total||Obama, Barack - Democratic||McCain, John - Republican|
The merged cells have been split, and the year only appears in the first one. The other cells have been replaced by NaNs.
Now let's make a MultiIndex using the first two rows (year and candidate). Let's start by replacing the NAs using
ffill (foward fill). This moves the last non-null value "forward" over null values:
multi_index_values = df.ffill(axis=1).values[:2, 1:]
Note that when we only kept the first couple of columsn, and dropped the first row. Now
|Total||Obama, Barak - Democratic||Romney, Mitt - Republican||Total||Obama, Barack - Democratic||McCain, John - Republican|
Let's make a multi-index, and make a dataframe out of it. Recall the first two rows included the headers, so we are copying the index from row 2 down. We are making the state (i.e. column 0) the row index.
multi_index = pd.MultiIndex.from_arrays(multi_index_values, names=['year', 'CANDIDATES']) multi_df = pd.DataFrame(df.values[2:, 1:], index=df.values[2:, 0], columns=multi_index)
Now our dataframe in pandas is almost in the same layout as the original worksheet.
|CANDIDATES||Total||Obama, Barak - Democratic||Romney, Mitt - Republican||Total||Obama, Barak - Democratic||McCain, John - Republican|
We have the problem that our columns are actually names of variables. At the top level of the column index, we have the years. At the lower level, we have the candidates. We can undo this with the
unstacked_df = multi_df.unstack()
unstacked_df has the form
2012 Obama, Barack - Democratic Alabama 795696 Alaska 122640 Romney, Mitt - Republican Alabama 1255925 Alaska 164676 2008 Obama, Barack - Democratic Alabama 813479 Alaska 123594 McCain, John - Republican Alabama 1266546 Alaska 193841
unstacked_df.reset_index() fills out the columns:
year CANDIDATES level_2 0 2012 Obama, Barack - Democratic Alabama 795696 2012 Obama, Barack - Democratic Alaska 122640 2012 Romney, Mitt - Republican Alabama 1255925 2012 Romney, Mitt - Republican Alaska 164676 2008 Obama, Barack - Democratic Alabama 813479 2008 Obama, Barack - Democratic Alaska 123594 2008 McCain, John - Republican Alabama 1266546 2008 McCain, John - Republican Alaska 193841
We can separate the candidate and party information with the following lines:
unstacked_df['candidate'] = unstacked_df.CANDIDATES.str.split('-').str.str.strip() unstacked_df['party'] = unstacked_df.CANDIDATES.str.split('-').str.str.strip() unstacked_df.drop('CANDIDATES', axis=1, inplace=True)
The final product
Once we have this, a rename on the columns
0 give us
year state votes candidate party 0 2012 Alabama 795696 Obama, Barack Democratic 1 2012 Alaska 122640 Obama, Barack Democratic 2 2012 Alabama 1255925 Romney, Mitt Republican 3 2012 Alaska 164676 Romney, Mitt Republican 4 2008 Alabama 813479 Obama, Barack Democratic 5 2008 Alaska 123594 Obama, Barack Democratic 6 2008 Alabama 1266546 McCain, John Republican 7 2008 Alaska 193841 McCain, John Republican
This is a simplified process. We didn't show the join to create the "short state code", or show how to skip the columns in the spreadsheet that contain the percentages. A complete description of the process can be found in this github repo. If you are just interested in the results, you can find them in this gist instead.