free range statistics

I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.

Recent posts


How to make that crazy Fox News y axis chart with ggplot2 and scales

06 April 2020

I demonstrate the power of the transformation functionality in the scales R package by re-creating an eccentric Fox News chart.


Impact of a country's age breakdown on COVID-19 case fatality rate

21 March 2020

I have a go at quantifying how important different demographic profiles will be for country-average case fatality rates for COVID-19.


COVID-19 cumulative observed case fatality rate over time

17 March 2020

I have a quick look at how the observed case fatality rate of COVID-19 has evolved over time so far.


New Zealand Election Study webtool

07 March 2020

I release an improved and updated version of my crosstab webtool for exploring the New Zealand Election Study data, now covering 2017 as well as 2014, and letting the user explore the relationship between party vote and a range of attitudes, experiences and demographics.


Log transform or log link? And confounding variables.

01 March 2020

I check the robustness of last week's analysis of height -> weight by trying a different method of specifying and fitting the model, and checking to see if socioeconomic status is acting as a confounder (because better-off people are both taller and healthier).


Body Mass Index

23 February 2020

I test the traditional BMI calculation against the actual distribution of height and weight in USA adults in 2018. I decide BMI is quite a good metric. I find that one prominent critique of the BMI gets the direction wrong about whose weight is exaggerated by BMI.


Lewis Carroll's proposed rules for tennis tournaments

01 February 2020

I put to the test a method of running a tennis tournament suggested by Lewis Carroll. It performs OK in allocating prizes fairly, although it takes about twice as many matches as a standard modern single-elimination tournament. When there is realistic randomness in results, it doesn't perform as well as Carroll argued it would on the unrealistic basis of deterministic match outcomes.


Analysing the effectiveness of tennis tournament seeding

26 January 2020

I have a go at quantifying how much giving a special draw to the top 32 seeds in a tennis tournament impacts on who makes it to the finals and who wins, based on simulations of a hypothetical matchup of the 128 top women players in 1990.


Analysing large data on your laptop with a database and R

22 December 2019

SQL Server and R work fine together for analysing 200 GB of the New York City taxi data. A lot of effort is needed to prepare even relatively tidy data for analysis. Also, you can't analyse big data without aggregating and summarising it somehow.


Cost-benefit analysis in R

24 November 2019

I try to show that cost-benefit analysis is easy to perform in R, and that R lets you build in uncertainty in a much clearer way than is generally done. I also try to demystify the internal rate of return.