free range statistics

I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.

Recent posts

Body Mass Index

23 February 2020

I test the traditional BMI calculation against the actual distribution of height and weight in US adults in 2018. I decide BMI is quite a good metric. I find that one prominent critique of the BMI gets the direction wrong about who has their weight exaggerated by BMI.
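The traditional calculation being tested is weight in kilograms divided by the square of height in metres. A minimal sketch (in Python; the post itself uses R):

```python
def bmi(weight_kg, height_m):
    """Traditional body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# A 1.75 m adult weighing 70 kg:
print(round(bmi(70, 1.75), 1))  # 22.9
```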

Lewis Carroll's proposed rules for tennis tournaments

01 February 2020

I put to the test a method of running a tennis tournament suggested by Lewis Carroll. It performs OK at allocating prizes fairly, although it takes about twice as many matches as a standard modern single-elimination tournament. When there is realistic randomness in results, it doesn't perform as well as Carroll argued it would; his argument rested on the unrealistic assumption of deterministic match outcomes.

Analysing the effectiveness of tennis tournament seeding

26 January 2020

I have a go at quantifying how much giving a special draw to the top 32 seeds in a tennis tournament affects who makes it to the finals and who wins, based on simulations of a hypothetical matchup of the top 128 women players in 1990.

Analysing large data on your laptop with a database and R

22 December 2019

SQL Server and R work fine together for analysing 200 GB of New York City taxi data. Even relatively tidy data takes a lot of effort to prepare for analysis. Also, you can't analyse big data without aggregating and summarising it somehow.
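The aggregate-in-the-database pattern the post relies on can be sketched with Python's built-in sqlite3 as a stand-in for SQL Server (the table and column names here are hypothetical, not from the original post):

```python
import sqlite3

# In-memory stand-in for a large on-disk database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (pickup_hour INTEGER, fare REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?)",
                [(8, 12.5), (8, 9.0), (17, 20.0), (17, 18.0), (17, 22.0)])

# Push the aggregation to the database and pull back only the small summary.
summary = con.execute("""
    SELECT pickup_hour, COUNT(*) AS n, AVG(fare) AS mean_fare
    FROM trips
    GROUP BY pickup_hour
    ORDER BY pickup_hour
""").fetchall()
print(summary)  # [(8, 2, 10.75), (17, 3, 20.0)]
```

The point is that only the grouped summary crosses the database boundary; the raw rows never have to fit in memory on the analysis side.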

Cost-benefit analysis in R

24 November 2019

I try to show that cost-benefit analysis is easy to perform in R and that R lets you build in uncertainty in a much clearer way than is generally done; I also try to demystify the internal rate of return.
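The internal rate of return is just the discount rate at which the net present value of a cashflow stream equals zero. A minimal sketch (in Python rather than the post's R), finding it by bisection:

```python
def npv(rate, cashflows):
    """Net present value of cashflows, one per period, the first at time 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    """Internal rate of return: the rate at which NPV is zero, found by
    bisection (assumes exactly one sign change of NPV on [lo, hi])."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(lo, cashflows) * npv(mid, cashflows) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# Pay 100 now, receive 110 in one period: the IRR is 10%.
print(round(irr([-100, 110]), 4))  # 0.1
```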

A small simple random sample will often be better than a huge not-so-random one

09 November 2019

A small random sample will give better results than a much larger non-random sample, under certain conditions; but more importantly, it is reliable and controls for risk.
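The claim can be illustrated with a toy simulation (a Python sketch with an assumed selection mechanism, not the analysis from the post): a large sample whose inclusion probability depends on the value being measured stays badly biased no matter its size, while a small simple random sample carries only modest sampling error.

```python
import random
import statistics

random.seed(42)

# Toy population of 100,000 normally distributed values (illustrative only).
population = [random.gauss(50, 15) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Large but biased sample: units with higher values are far more likely to
# be included - a simple selection mechanism, purely for illustration.
biased = [x for x in population if random.random() < (0.9 if x > 55 else 0.05)]

# Small simple random sample of 100.
srs = random.sample(population, 100)

print(len(biased), round(statistics.mean(biased) - true_mean, 1))  # huge n, large bias
print(len(srs), round(statistics.mean(srs) - true_mean, 1))        # tiny n, small error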

Re-creating survey microdata from marginal totals

03 November 2019

I play around with creating my own synthetic unit record file of survey microdata to match published marginal totals. I conclude that making synthetic data for development or education purposes can be useful, but the effort to try to exactly match marginal totals from this sort of survey is unlikely to be useful for either bona fide researchers or ill-intentioned data snoopers.
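One standard tool for matching marginal totals is iterative proportional fitting (raking), which rescales a seed table until its row and column sums hit the published marginals. A toy Python sketch on a 2x2 table (not the survey from the post):

```python
def ipf(seed, row_targets, col_targets, iters=100):
    """Iterative proportional fitting: alternately scale rows and columns
    of the seed table towards the target marginal totals."""
    table = [row[:] for row in seed]
    for _ in range(iters):
        for i, target in enumerate(row_targets):       # scale each row
            s = sum(table[i])
            table[i] = [x * target / s for x in table[i]]
        for j, target in enumerate(col_targets):       # scale each column
            s = sum(row[j] for row in table)
            for i in range(len(table)):
                table[i][j] *= target / s
    return table

# Uniform seed, row totals 60/40, column totals 30/70.
fitted = ipf([[1, 1], [1, 1]], row_targets=[60, 40], col_targets=[30, 70])
print([[round(x, 1) for x in row] for row in fitted])  # [[18.0, 42.0], [12.0, 28.0]]
```

With a uniform seed the result is just the independence table implied by the marginals, which is exactly why matching marginals alone reveals so little about the true joint microdata.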

Poisson point processes, mass shootings and clumping

07 September 2019

I annotate and explain an example use of Poisson process modelling to test an important hypothesis about the frequency of mass shootings in Australia over time.
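As a sketch of the kind of check involved (Python, with made-up counts, not the Australian data): under a homogeneous Poisson process the counts in equal time windows are Poisson-distributed, so their variance should be close to their mean, and a ratio well above 1 points to clumping.

```python
import statistics

# Hypothetical yearly event counts - illustrative, not the data from the post.
counts = [2, 0, 1, 3, 0, 0, 5, 1, 0, 4]

# Under a homogeneous Poisson process, counts in equal windows have
# variance equal to their mean; overdispersion suggests clumping.
dispersion = statistics.variance(counts) / statistics.mean(counts)
print(round(dispersion, 2))  # 2.11
```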

Inferring a continuous distribution from binned data

25 August 2019

I show how modelling the distribution of an underlying continuous variable that has been clouded by binning is a much better way of understanding the data than crude methods that deal directly with the binned counts.
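The idea can be sketched as follows (Python, with hypothetical bins and counts; the post itself works in R): treat the bin counts as multinomial, with cell probabilities given by the mass a candidate continuous distribution puts in each bin, and maximise that likelihood.

```python
import math

def norm_cdf(x, mu, sigma):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Hypothetical binned data: a continuous variable observed only in bins.
edges = [0, 10, 20, 30, 40]   # bin boundaries
counts = [5, 30, 45, 20]      # observed count in each bin

def log_lik(mu, sigma):
    """Log-likelihood of the bin counts under N(mu, sigma), using the
    probability mass the normal distribution places in each bin."""
    ll = 0.0
    for (a, b), n in zip(zip(edges, edges[1:]), counts):
        p = max(norm_cdf(b, mu, sigma) - norm_cdf(a, mu, sigma), 1e-300)
        ll += n * math.log(p)
    return ll

# Crude grid search over (mu, sigma); a real analysis would use an optimiser.
best = max(((mu / 10, s / 10) for mu in range(150, 300) for s in range(50, 150)),
           key=lambda ms: log_lik(*ms))
print(best)
```

The fitted mean and standard deviation land close to what the bin midpoints suggest, but the likelihood approach generalises cleanly to skewed distributions and unequal bins, where midpoint methods break down.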

Forecasting unemployment

28 July 2019

Forecasting unemployment is hard, with lots of complex bi-directional causality. Also, while AIC is asymptotically equivalent to cross-validation, it's probably better to check. It turns out that neither interest rates nor stock prices carry any useful information for nowcasting unemployment.