Bayesian state space modelling of the Australian 2019 election

At a glance:

I tidy up Australian polling data back to 2007 and produce a statistical model of two-party-preferred vote for the coming election.

02 Mar 2019

So I’ve been back in Australia for five months now. While things have been very busy in my new role at Nous Group, it’s not so busy that I’ve failed to notice there’s a Federal election due some time by November this year. I’m keen to apply some of the techniques I used in New Zealand in the richer data landscape (more polls, for one) and complex environment of Australian voting systems.

Polling data

The Australian Electoral Commission has wonderful, highly detailed data on actual results, which I’ll doubtless be coming to at some point. However, I thought for today I’d start with the currency and perpetual conversation-making power (at least in the media) of polling data.

There’s no convenient analysis-ready collection of Australian polling data that I’m aware of. I used similar methods to what’s behind my nzelect package to grab these survey results from Wikipedia where it is compiled by some anonymous benefactor, from the time of the 2007 election campaign until today.

Thanks are owed to Emily Kothe who did a bunch of this scraping herself for 2010 and onwards and put the results on GitHub (and on the way motivated me to develop a date parser for the horror that is Wikipedia’s dates), but in the end I started from scratch so I had all my own code convenient for doing updates, as I’m sure I’ll be wanting.

All the code behind this post is in its own GitHub repository. It covers grabbing the data, fitting the model I’ll be talking about soon, and the graphics for this post. That repo is likely to grow as I do more things with Australian voting behaviopur data.

Here’s how that polling data looks when you put it together:

Notes on the abbreviations of Australian political parties in that chart:

  • ONP ~ “Pauline Hanson’s One Nation” - nationalist, socially conservative, right-wing populism
  • Oth ~ Other parties
  • Grn ~ “Australian Greens” ~ left wing, environment and social justice focus
  • ALP ~ “Australian Labor Party” ~ historically the party of the working person, now the general party of the centre left
  • Lib/Nat ~ “Liberal Party” or “National Party” ~ centre and further right wing, long history of governing in coalition (and often conflated in opinion polling, hence the aggregation into one in this chart)

I’m a huge believer in looking at polling data in the longer term, not just focusing on the current term of government and certainly not just today’s survey release. The chart above certainly tells some of the story of the last decade or so; even a casual observer of Australian politics will recognise some of the key events, and particularly the comings and goings of Prime Ministers, in this chart.

Prior to 2007 there’s polling data available in Simon Jackman’s pscl package which has functionality and data relating to political science, but it only covers the first preference of voters so I haven’t incorporated it into my cleaned up data. I need both the first preference and the estimated two-party-preference of voters.

(Note to non-Australian readers - Australia has a Westminster-based political system, with government recommended to the Governor General by whomever has the confidence of the lower house, the House of Representatives; which is electorate based with a single-transferrable-vote aka “Australian vote” system. And if the USA could just adopt something as sensible as some kind of preferential voting system, half my Twitter feed would probably go quiet).

Two-party-preferred vote

For my introduction today to analysis with this polling data, I decided to focus on the minimal simple variable for which a forecast could be credibly seen as a forecast of the outcome on election day, whenever it is. I chose the two-party-preferred voting intention for the Australian Labor Party or ALP. We can see that this is pretty closely related to how many seats they win in Parliament:

The vertical and horizontal blue lines mark 50% of the vote and of the seats respectively.

US-style gerrymanders generally don’t occur in Australia any more, because of the existence of an independent electoral commission that draws the boundaries. So winning on the two-party-preferred national vote generally means gaining a majority in the House of Representatives.

Of course there are no guarantees; and with a electoral preference that is generally balanced between the two main parties even a few accidents of voter concentration in the key electorates can make a difference. This possibility is enhanced in recent years with a few more seats captured by smaller parties and independents:

All very interesting context.

State space modelling

My preferred model of the two I used for the last New Zealand election was a Bayesian state space model. These are a standard tool in political science now, and I’ve written about them in both the Australian and New Zealand context.

To my knowledge, the seminal paper on state space modelling of voting intention based on an observed variety of polling data is Jackman’s “Pooling the Polls Over an Election Campaign”. I may be wrong; happy to be corrected. I’ve made a couple of blog posts out of replicating some of Jackman’s work with first preference intention for the ALP in the 2007 election. In fact, this was one of my self-imposed homework tasks in learning to use Stan, the wonderfully expressive statistcal modelling and high-performance statistical computation tool and probability programming language.

My state space model of the New Zealand electorate was considerably more complex than I need today, because in New Zealand I needed to model (under proportional representation) the voting intention for multiple parties at once. Whereas today I can focus on just two-party-preferred vote for either of the main blocs. Obviously a better model is possible, but not today!

The essence of this modelling approach is that we theorise the existence of an unobserved latent voting intention, which is measured imperfectly and irregularly by opinion poll surveys. These surveys have sampling error and other sources of “total survey error”, including “house effects” or statistical tendencies to over- or under-estimate vote in particular ways. Every few years, the true voting intention manifests itself in an actual election.

Using modern computational estimation methods we can estimate the daily latent voting intention of the public based on our imperfect observations, and also model the process of change in that voting intention over time and get a sense of the plausibility of different outcomes in the future. Here’s what it looks like for the 2019 election:

This all seems plausible and I’m pretty happy with the way the model works. The model specification written in Stan and the data management in R are both available on GitHub.

An important use for a statistical model in my opinion is to reinforce how uncertain we should be about the world. I like the representation above because it makes clear, in the final few months of modelled voting intention out to October or November 2019, how much change is plausible and consistent with past behaviour. So anyone who feels certain of the election outcome should have a look at the widening cone of uncertainty on this chart and have another think.

A particularly useful side effect of this type of model is statistical estimates of the over- or under-estimation of different survey types or sources. Because I’ve confronted the data with four successive elections we can get a real sense of what is going on here. This is nicely shown in this chart:

We see the tendency of Roy Morgan polls to overestimate the ALP vote by one or two percentage points, and of YouGov to underestimate it. These are interesting and important findings (not new to this blog post though). Simple aggregations of polls can’t incorporate feedback from election results in this way (although of course experienced people routinely make more ad hoc adjustments).

A more sophisticated model would factor in change over time in polling firms methods and results, but again that would take me well beyond the scope of this blog post.

Looking forward to some more analysis of election issues, including of other data sources and of other aspects, over the next few months.

Here’s a list of the contributors to R that made today’s analysis possible:

thankr::shoulders() %>% knitr::kable() %>% clipr::write_clip()
maintainer no_packages packages
Hadley Wickham 16 assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, tidyr, tidyverse
R Core Team 12 base, compiler, datasets, graphics, grDevices, grid, methods, parallel, stats, stats4, tools, utils
Gábor Csárdi 6 callr, cli, crayon, pkgconfig, processx, ps
Winston Chang 4 extrafont, extrafontdb, R6, Rttf2pt1
Yihui Xie 4 evaluate, knitr, rmarkdown, xfun
Kirill Müller <> 4 DBI, hms, pillar, tibble
Dirk Eddelbuettel 3 digest, inline, Rcpp
Lionel Henry 3 purrr, rlang, tidyselect
Jeroen Ooms 2 curl, jsonlite
Jim Hester 2 pkgbuild, readr
Ben Goodrich 2 rstan, StanHeaders
Jim Hester 2 glue, withr
Vitalie Spinu 1 lubridate
Deepayan Sarkar 1 lattice
Gabor Csardi 1 prettyunits
Patrick O. Perry 1 utf8
Jennifer Bryan 1 cellranger
Michel Lang 1 backports
Simon Jackman 1 pscl
Jennifer Bryan 1 readxl
Kevin Ushey 1 rstudioapi
Justin Talbot 1 labeling
Simon Potter 1 selectr
Jonah Gabry 1 loo
Charlotte Wickham 1 munsell
Alex Hayes 1 broom
Joe Cheng 1 htmltools
Baptiste Auguie 1 gridExtra
Luke Tierney 1 codetools
Henrik Bengtsson 1 matrixStats
Peter Ellis 1 frs
Simon Garnier 1 viridisLite
Brodie Gaslam 1 fansi
Brian Ripley 1 MASS
R-core 1 nlme
Stefan Milton Bache 1 magrittr
Marek Gagolewski 1 stringi
James Hester 1 xml2
Max Kuhn 1 generics
Simon Urbanek 1 Cairo
Jeremy Stephens 1 yaml
Achim Zeileis 1 colorspace

← Previous post

Next post →