I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.
I test the traditional BMI calculation against the actual distribution of height and weight in US adults in 2018. I decide BMI is quite a good metric, and I find that one prominent critique of BMI gets the direction wrong about who has their weight exaggerated by it.
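For reference, BMI is simply weight in kilograms divided by the square of height in metres. A minimal sketch (the function name and example values are my own illustration, not from the post, which works with the full US height/weight distribution):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Example: 70 kg at 1.75 m
print(round(bmi(70, 1.75), 1))  # -> 22.9
```

Because height enters squared rather than cubed, BMI does not scale neutrally with height, which is the root of the critiques the post examines.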
I put to the test a method of running a tennis tournament suggested by Lewis Carroll. It performs OK at allocating prizes fairly, although it takes about twice as many matches as a standard modern single-elimination tournament. When there is realistic randomness in match results, it does not perform as well as Carroll argued it would on the unrealistic assumption of deterministic outcomes.
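The comparison baseline can be sketched as a single-elimination bracket with stochastic match outcomes (this is not Carroll's method itself; the Elo-style win probability and the ratings are my own illustrative assumptions, and the post's simulations are in R):

```python
import random

def win_prob(rating_a, rating_b):
    # Elo-style logistic win probability (illustrative assumption)
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def single_elim(players, ratings, rng):
    """Run one single-elimination tournament; return the champion."""
    field = list(players)
    while len(field) > 1:
        nxt = []
        for a, b in zip(field[::2], field[1::2]):
            winner = a if rng.random() < win_prob(ratings[a], ratings[b]) else b
            nxt.append(winner)
        field = nxt
    return field[0]

rng = random.Random(42)
players = list(range(8))
ratings = {p: 1500 + 50 * (7 - p) for p in players}  # player 0 strongest
wins = sum(single_elim(players, ratings, rng) == 0 for _ in range(2000))
print(wins / 2000)  # proportion of tournaments won by the best player
```

With random outcomes the best player wins well under 100% of the time, which is exactly why a method judged on deterministic results can disappoint in practice. Note a single-elimination bracket of n players always takes n - 1 matches.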
I have a go at quantifying how much giving a special draw to the top 32 seeds in a tennis tournament affects who makes the final and who wins, based on simulations of a hypothetical matchup of the 128 top women players in 1990.
SQL Server and R work fine together for analysing 200 GB of New York City taxi data. A lot of effort is needed to prepare even relatively tidy data for analysis. Also, you can't analyse big data without aggregating and summarising it somehow.
I try to show that cost-benefit analysis is easy to perform in R, that R lets you build in uncertainty much more clearly than is generally done, and to demystify the internal rate of return.
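On the demystification point: the internal rate of return is just the discount rate at which a cash flow's net present value crosses zero, so any root-finder will do. A minimal sketch (the post uses R; this Python translation and its example cash flows are my own):

```python
def npv(rate, cashflows):
    """Net present value of cash flows at annual periods 0, 1, 2, ..."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    """Bisection for the rate where NPV = 0 (assumes a single sign change)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(lo, cashflows) * npv(mid, cashflows) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Invest 100 now, receive 60 at the end of each of the next two years
print(round(irr([-100, 60, 60]), 4))
```

Seeing IRR as "the root of the NPV curve" also makes clear why cash flows with multiple sign changes can have multiple IRRs.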
A small random sample will give better results than a much larger non-random sample, under certain conditions; but more importantly, it is reliable and controls for risk.
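The point can be illustrated with a quick simulation: a small simple random sample versus a much larger sample with a selection bias, both estimating a population mean (the population, bias mechanism, and sample sizes here are invented for illustration, not from the post):

```python
import random

rng = random.Random(123)
population = [rng.gauss(0, 1) for _ in range(100_000)]  # true mean ~ 0
true_mean = sum(population) / len(population)

# Large but biased sample: units below -0.5 are never observed
biased = [x for x in population if x > -0.5][:10_000]

# Small simple random sample
srs = rng.sample(population, 100)

mean = lambda xs: sum(xs) / len(xs)
print("biased (n=10,000) error:", round(mean(biased) - true_mean, 3))
print("random (n=100) error:   ", round(mean(srs) - true_mean, 3))
```

The biased sample's error is a fixed bias that no amount of extra data shrinks, while the small random sample's error is sampling noise that is both modest and quantifiable in advance — which is the "controls for risk" part.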
I play around with creating my own synthetic unit record file of survey microdata to match published marginal totals. I conclude that making synthetic data for development or education purposes can be useful, but the effort to try to exactly match marginal totals from this sort of survey is unlikely to be useful for either bona fide researchers or ill-intentioned data snoopers.
I annotate and explain an example use of Poisson process modelling to test an important hypothesis about the frequency of mass shootings in Australia over time.
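One way to frame that kind of test: under a homogeneous Poisson process the count of events in a window is Poisson-distributed, so you can ask how surprising an observed count is given a historical rate. A stdlib-only sketch with invented numbers (not the actual shootings data or the post's exact method):

```python
from math import exp

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), via the pmf recurrence."""
    term, total = exp(-mu), exp(-mu)
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

# Illustrative: a historical rate of 0.65 events/year implies mu = 13
# expected events over a 20-year window; how surprising is observing only 2?
mu = 0.65 * 20
p_value = poisson_cdf(2, mu)
print(p_value)
```

A small p-value here is evidence against the "rate unchanged" hypothesis; the real analysis has to handle the choice of window and rate estimate much more carefully.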
I show how modelling the distribution of an underlying continuous variable that has been obscured by binning is a much better way of understanding the data than crude methods that deal directly with the binned counts.
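The core idea: treat the bin counts as a multinomial draw whose cell probabilities come from an assumed continuous distribution, then maximise that likelihood over the distribution's parameters. A stdlib-only sketch fitting a normal to binned counts by grid search (the bin edges, counts, and grid are invented; the post works in R with real survey data):

```python
from math import erf, log, sqrt

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def binned_loglik(mu, sigma, edges, counts):
    """Multinomial log-likelihood of observed bin counts under N(mu, sigma)."""
    ll = 0.0
    for (lo, hi), n in zip(zip(edges[:-1], edges[1:]), counts):
        p = norm_cdf(hi, mu, sigma) - norm_cdf(lo, mu, sigma)
        ll += n * log(max(p, 1e-300))  # guard against log(0)
    return ll

# Invented binned data roughly consistent with N(50, 10)
edges = [20, 30, 40, 50, 60, 70, 80]
counts = [20, 140, 330, 340, 140, 30]

best = max(
    ((mu, sigma) for mu in range(40, 61) for sigma in range(5, 16)),
    key=lambda ps: binned_loglik(ps[0], ps[1], edges, counts),
)
print(best)  # (mu, sigma) maximising the binned likelihood
```

This recovers parameters of the underlying continuous variable directly, rather than pretending every observation sits at its bin midpoint.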
Forecasting unemployment is hard, with lots of complex bi-directional causality. Also, while AIC is asymptotically equivalent to cross-validation, it's probably better to check. It turns out that neither interest rates nor stock prices carry any useful information for nowcasting unemployment.