I explore half a million rows of disaggregated crash data for New Zealand, and along the way illustrate geo-spatial projections, maps, forecasting with ensembles of methods, a state space model for change over time, and a generalized linear model for understanding interactions in a three-way cross tab.
Traffic, and crashes in particular, are characterised by large volumes of high quality data. Sam Warburton’s workings can be supplemented with a range of data from the NZTA. I know little about the subject and thought it seemed a good one to explore. This is a long blog post but I still feel I’ve barely touched the surface. I nearly split this into a series of posts but decided I won’t get back to this for months at least, so best to get down what I fiddled around with straight away.
I always like to see as long term a picture as I can in a new area. In one of the tabs in Sam Warburton’s Excel book is a table with annual data back to 1950. Check this out:
Here’s the R code that draws that graph, plus what’s needed in terms of R packages for the rest of the post:
There have been a few charts going around of crashes and fatalities over time aggregated by region, but I was curious to see a more granular picture. It turns out the NZTA publish a CSV of all half a million recorded crashes (fatal and non-fatal) back to 2000. The data is cut back a bit - it doesn’t have comments on what may have caused the crash for example, or the exact date - but it does have the exact NZTM coordinates of the crash location, which we can convert to latitude and longitude and use for maps.
So I started as always with the big picture - where have all the fatal crashes been since 2000?
Unsurprisingly, they are heavily concentrated in the population centres and along roads. What happens when we look at this view over time?
So, turns out a map isn’t a great way to see subtle trends. In fact, I doubt there’s an effective simple visualisation that would show changing spatial patterns here; it’s definitely a job for a sophisticated model first, followed by a visualisation of its results. Outside today’s scope.
A map is a good way to just get to understand what is in some data though. Here are another couple of variables the NZTA provide. First, whether the accidents happened on a public holiday:
… and combination of vehicles they involved:
Here’s the code for downloading the individual crash data, converting coordinates from NZTM to latitude and longitude, and building those maps. This is a nice example of the power of the ggplot2 universe. Once I’ve defined my basic map and its theme, redrawing it with new faceting variables is just one line of code in each case.
Closer up maps
I live in New Zealand’s capital, Wellington, and was interested in a closer view of accidents in there. ggmap makes it easy to do this with a street map in the background:
But clearly there’s a limit to how many small, static maps I’ll want to make. So I made an interactive version covering the whole country:
Much of the public discussion is about whether there is significant evidence of the recent uptick in deaths being more than the usual year-on-year variation. The most thorough discussion was in Thomas Lumley’s StatsChat. He fit a state space model to the fatality counts, allowing latent tendency for fatalities per kilometres travelled to change at random over time as a Markov chain, with extra randomness each year on top of the randomness coming from a Poisson distribution of the actual count of deaths. He concluded that there is strong evidence of a minimum some time around 2013 or 2014, and growth since then.
To keep the Bayesian modelling part of my brain in practice I set out to repeat and expand this, using six monthly data rather than annual. The disaggregated crash data published by NZTA has both calendar year and financial year for each crash, which we can use to deduce the six month period it was in. The vehicle kilometres travelled in Sam Warburton’s data is at a quarterly grain, although it looks like the quarters are rolling annual averages.
First, to get that data into shape and check out the vehicle kilometres travelled data. I decided to divide the original numbers by four on the assumption they are rolling annual averages to convert them to a quarterly grain. I might be missing something, but it looked to me that the kilometres travelled data stop in first quarter of 2016, so I forecast forward another couple of years using the forecastHybrid R package as a simple way to get an average of ARIMA and exponential state space smoothing models. The forecast looks like this:
Then there’s a bit of mucking about to get the data in a six monthly state and ready for modelling but it wasn’t too fiddly. I adapted Professor Lumley’s model to Stan (which I’m more familiar with than JAGS which he used) to a six monthly observations. Here’s the end result
Consistent with Professor Lumley, I find that 88% of simulations show the first half of 2017 (where my data stops) having a higher latent crash rate per km travelled than the second half of 2013.
Fitting a model like this takes three steps:
data management in R
model fitting in Stan
presentation of results in R
First, here’s the code for the two R stages. The trick here is that I make one column with year (eg 2005) and one called six_month_period which is a for the first six months of the year, and b for the rest. I do this separately for the vehicle kilometres travelled (which needs to be aggregated up from quarterly) and the deaths (which needs to be aggregated up from individual, taking advantage of the fact that we know both calendar and financial years in which it occurred).
…and here’s the model specification in Stan:
Objects hit during crashes
Finally, there were a few things I spotted along the way I wanted to explore.
Here is a graphic of the objects (excluding other vehicles and pedestrians) hit during crashes.
Regional relationship with holidays
I expected there to be a relationship between region and holidays - some regions would see a higher proportion of their accidents in particular periods than others. I was only able to scratch the surface of this, as everything else, of course.
I started by creating a three-way cross tab of year, region and holiday (four holidays plus “none”). I used the frequency count in each cell of this cross tab as the response in a generalized linear model and tested for an interaction effect between region and holiday. It’s there but it’s marginal; a Chi square test shows up as definitely significant (p value < 0.01), but on other criteria the simpler model could easily be the better one.
I extracted the interaction effects and presented them in this chart:
So we can see, for example, that the West Coast (on the South Island) has a particularly strong Christmas time effect (expected); Gisborne gets disproportionate rate of its fatal accidents on Labour Day Weekend (not expected, by me anyway); and so on. The horizontal lines indicate 95% confidence intervals.
To check this wasn’t out to lunch, I also did the simpler calculation of just looking at what percentage of each region’s fatal accidents were on each of the four holidays:
This is much easier to interpret and explain!
Here’s the code for that region-holiday analysis:
That will do for now. Interesting data, very unsatisfactory state of the world.
My day job is Chief Data Scientist at Nous Group, an international management consultancy with over 400 people working across Australia, the UK and Canada. Contact me if you are interested working with us on a grand challenge or broad agenda.
I'm pleased to be aggregated at R-bloggers, the one-stop shop for blog posts featuring R.