Election analysis contest entry part 2 - building the nzelect R package
At a glance:
I explain the structure and techniques behind building the nzelect R package, which has New Zealand election results, in case anyone is interested or wants to adapt the process for other packages that rely on preparatory data munging.
04 Apr 2016
Motivation
This post is the second in a series that makes up my entry in Ari Lamstein's R Election Analysis Contest. Yesterday I introduced the nzelect R package from a user perspective. Today I'm writing about how the build of that package works. This might be of interest to anyone planning to do something similar, or wanting to contribute to nzelect.
Here are today's themes:
Structure - separation of the preparation from the package
Modular code
Specific techniques - combining multiple CSVs, and overlaying spatial points and shapefiles
Structure
Here’s the directory structure within the Git repo that builds this package, with the individual files in the key ./prep/ folder shown on the right:
If you're interested, you can view, fork and clone the whole project from GitHub. The code excerpts in this blog post won't run if you just paste them into R; they need the specific folder structure that comes from running them in a clone of the main project.
Conceptually, there are three main parts of this project:
Downloads and data munging that need to be done separately from the published R package, including products like data-tidying scripts and downloaded raw data that must not be included in the package itself. This comprises the prep and downloads folders.
The R package itself, which is in the pkg folder and includes subdirectories for data, man, R and tests.
Secondary material that depends on the R package existing and being installed, such as the README for the GitHub repo and extended examples in the examples folder, including a Shiny app that will be the subject of the next post.
There’s also various administrative stuff, such as the .git folder holding the version control database, .Rproj.user holding information on the RStudio project, the travis.yml file that sets up hooks from GitHub to Travis Continuous Integration, and nzelect.Rcheck holding the latest version of the R package build checks.
This is a typical structure for me. I generally don’t like an R package based in the root folder of a project. I nearly always have a bunch of stuff I want to do - prep and extended examples - that I don’t want in the built and published version of the package for end users, but I do want kept together with the code building the package in a single RStudio project and Git repository.
The project is held together by a script entitled build.R. In other projects this would make sense as a makefile but my workflow with nzelect is quite interactive (I pick and choose which files to run during development) and I’m more comfortable doing that with an R script. In principle it should work if run end to end with a fresh clone of the repository, and it looks like this:
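(The real build.R is in the repository root; the sketch below shows its general shape, with the individual prep script names being assumptions rather than exact copies.)

```r
# Indicative sketch of build.R, the orchestrating script.
# The prep script names below are illustrative, not the exact file names.

library(devtools)

# one-off downloads of raw data (results are .gitignore-d, not tracked by Git)
source("prep/download_election_results.R")
source("prep/download_map_shapefiles.R")

# munging: turn the raw downloads into tidy data frames, which are then
# saved into pkg/data/ for shipping with the package
source("prep/tidy_voting_place_results.R")
source("prep/match_voting_places_to_areas.R")

# document, check and build the package that lives in ./pkg
document("pkg")
check("pkg")
build("pkg")
```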
Modular code
Looking at that build.R script introduces my second theme for today - modularity. I have separate scripts for different tasks such as "download election results", "download shapefiles" and "tidy voting place results". This makes development easier to keep track of, and means a whole script can be run and debugged at once without repeating expensive processes like downloads. Note that the downloaded data isn't tracked by Git - that would bloat the repository too much - so all downloads are covered by the .gitignore instead.
Some of those scripts are very short. For example, download_map_shapefiles.R is only 8 lines long including blanks:
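(Rather than reproduce the file here, this is a sketch of the sort of thing it does; the URL is a placeholder for the Statistics NZ digital boundary files link, and the destination paths are indicative.)

```r
# Sketch of a minimal download script like download_map_shapefiles.R.
# The URL below is a placeholder, not the real Statistics NZ address.

url <- "http://www.stats.govt.nz/placeholder/2014_digital_boundaries.zip"

# downloads/ is listed in .gitignore, so the raw data never enters the repo
download.file(url, destfile = "downloads/shapefiles_2014.zip", mode = "wb")
unzip("downloads/shapefiles_2014.zip", exdir = "downloads/shapefiles")
```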
This sort of thing is key for maintainability. If multiple people are working on a project, it also stops them treading on each other's toes by working on the same files at once.
Specific techniques
I won’t go through the whole project - that’s easiest done for yourself if interested from a clone of it - but will highlight three more things.
Munging the Electoral Commission CSVs
The 2014 general election results are published on the web in the form of 71 CSV files for candidate vote (one per electorate) and 71 CSV files for party vote. Each CSV has data on each voting place used by voters enrolled in the electorate (voting places are not necessarily physically in the electorate). Here’s a screenshot of a typical CSV:
As can be seen, they go some way towards being nicely machine readable, but are not yet in tidy shape.
Some features of these CSVs include:
the name of the electorate is in cell A2
the main body of data starts in row 3
column A of the main data represents suburb, and a blank represents “same as previous entry”
the final row of the main data (not shown) is always a Total, followed by a row that is blank for columns A:E
below the main data rectangle there is a secondary rectangle of data (not shown) with information on candidates that isn't otherwise presented (i.e. which party each candidate represents), as well as sum totals of their votes
Luckily, each CSV has an identical pattern - not identical numbers of rows and columns, but enough pattern that it is possible to programmatically identify the main data rectangle. For example, after skipping the first three rows, the next empty cell in column B indicates the main data rectangle is finished, and the row above that cell is the sum total (which needs to be excluded). Here’s the part of the tidying script that deals with those candidate CSVs:
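(The sketch below illustrates the approach rather than reproducing the script itself; the file locations, file-name pattern and column handling are assumptions based on the description above.)

```r
# Sketch of tidying one candidate-vote CSV, following the pattern described
# above. Paths, the file-name pattern and output columns are illustrative.

tidy_candidate_csv <- function(filename){
  # read everything as character with no header so nothing gets mangled
  raw <- read.csv(filename, header = FALSE, stringsAsFactors = FALSE,
                  colClasses = "character")

  electorate <- raw[2, 1]              # the electorate name sits in cell A2

  body <- raw[-(1:3), ]                # skip the rows above the data rectangle
  end  <- which(body[, 2] == "")[1]    # first blank in column B marks the end
  body <- body[1:(end - 2), ]          # drop that blank row and the Total row above it

  # a blank in column A means "same suburb as the previous row", so fill down
  body[body[, 1] == "", 1] <- NA
  body <- tidyr::fill(body, V1)

  body$electorate <- electorate
  body
}

# apply to all 71 candidate-vote CSVs and bind into one long data frame
files <- list.files("downloads/results", pattern = "cand.*\\.csv$", full.names = TRUE)
candidates <- dplyr::bind_rows(lapply(files, tidy_candidate_csv))
```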
Overlaying shapefiles
One important task was to take the point locations (in NZTM coordinates) of the 2,500 or so voting places and determine which meshblock, area unit, Territorial Authority and Regional Council each was in. Luckily this is a piece of cake with the over() function from the sp package. The process is as follows (a sketch in code comes after the list):
import a shapefile of the boundaries we want into R as an sp::SpatialPolygonsDataFrame object (the originals are at http://www.stats.govt.nz/browse_for_stats/Maps_and_geography/Geographic-areas/digital-boundary-files.aspx and a script in the project downloads the 2014 versions)
use the proj4string from that shapefile to help convert the NZTM coordinates into an sp::SpatialPoints object
use sp::over() to map the SpatialPoints to the SpatialPolygonsDataFrame.
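Here's a sketch of that process for the Regional Council boundaries, assuming a data frame voting_places with easting and northing columns; the layer name, coordinate column names and attribute column are placeholders rather than the exact names used in the project.

```r
# Sketch of the point-in-polygon overlay; names and paths are placeholders.
library(sp)
library(rgdal)

# 1. import the downloaded boundaries as a SpatialPolygonsDataFrame
regions <- readOGR(dsn = "downloads/shapefiles", layer = "REGC2014")

# 2. turn the NZTM coordinates of the voting places into a SpatialPoints
#    object, borrowing the coordinate reference system from the shapefile
pts <- SpatialPoints(voting_places[, c("easting", "northing")],
                     proj4string = CRS(proj4string(regions)))

# 3. for each point, over() returns the attributes of the polygon it falls in
matched <- over(pts, regions)
voting_places$RegionName <- matched$REGC_NAME   # attribute column name assumed
```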
Unit tests
Any serious coding project needs unit tests so you can be sure that things, once fixed, continue to work as you make changes to other parts of the project. This applies as much to analytical projects as to software development. Hadley Wickham's testthat package gives a nice structure for including unit tests in the actual check and build of an R package, and I use that in this project. Here are some of the tests I'm using, in the ./pkg/tests/testthat/ directory:
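(The tests below are indicative of the style rather than copied verbatim; the data object and column names are placeholders for the ones actually shipped in the package.)

```r
# Indicative testthat tests; object and column names are placeholders.
library(testthat)
library(nzelect)

test_that("every voting place was matched to a Regional Council", {
  expect_false(any(is.na(Locations2014$RegionName)))
})

test_that("vote counts are non-negative whole numbers", {
  expect_true(all(GE2014$Votes >= 0, na.rm = TRUE))
  expect_true(all(GE2014$Votes == round(GE2014$Votes), na.rm = TRUE))
})
```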
Conclusion
The three themes for today:
structure - separating prep, package, and extended use
modularity
a few specific techniques of data cleaning
That’s enough for now. Stay tuned for the next post in the series, which looks at the Shiny app I built as an extended example of use of the nzelect package.