Ten year anniversary of Free Range Statistics

Ten years of blog posts

A few months ago—26 July 2025 to be precise—was the tenth anniversary of my first blog post. Over that time it turns out I’ve written about 225 blog posts, and an astonishing (to me) 350,000 words. That’s after you take out the code.

Free Range Statistics is an old-fashioned blog, with a single author and very much representing the ideal of a “web log” just recording things of interest to me. It’s not a comprehensive personal blog (I never have posts just about my travel, family life, etc.), but focused on issues that somehow relate to statistics—ranging from the abstract and methodological, through to specific applications of the type “here’s a fun chart of some interesting historical or current data I saw”. It’s strictly non-monetised; open to the world to read for free, and will never make paid endorsements. I’ll go a bit into what’s kept me motivated later, but the spoiler is that, like art, blogging is in my opinion something best done primarily for your own interests and needs, and if anyone else likes it that’s a bonus.

The ten years of blog history haven’t been even; there have been some ebbs and flows. We can see this in this chart of the number of blog posts per month over time.

Code for these charts is at the bottom of the post. Two things worth noting about this one are how I’ve turned the months with zero posts into hollow circles to de-emphasise them visually while still including the zeroes in the modelling; and used for the trend line a continuous single model over all years instead of a separate model fit to each year-facet, which would be the easy default but does not really make sense given how time is continuous and all.
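As a toy illustration of that second point, here is a minimal sketch using made-up monthly counts (not the blog's data). The object names and `span` values are arbitrary assumptions chosen only for the example. It fits one loess curve over a continuous time index, then fits a separate curve per year, which is roughly what a smoother drawn inside each facet would do, and compares how far each fitted line jumps between December and the following January.

```r
# Toy data: three years of made-up monthly counts (NOT the blog's real data)
set.seed(42)
toy <- expand.grid(month = 1:12, year = 2020:2022)
toy$posts <- rpois(nrow(toy), lambda = 2)
toy$time  <- toy$year + (toy$month - 1) / 12   # continuous time index

# Option A: a single smooth fitted to the whole series
fit_all <- loess(posts ~ time, data = toy, span = 0.5)
toy$fitted_single <- predict(fit_all)

# Option B: a separate smooth for each year, similar in spirit to
# what a per-facet smoother does
toy$fitted_by_year <- NA_real_
for (y in unique(toy$year)) {
  rows <- toy$year == y
  fit_y <- loess(posts ~ month, data = toy[rows, ], span = 0.9)
  toy$fitted_by_year[rows] <- predict(fit_y)
}

# Size of the jump in a fitted line from each December to the next January
dec_to_jan <- function(fitted) {
  jan <- fitted[toy$month == 1]
  dec <- fitted[toy$month == 12]
  abs(tail(jan, -1) - head(dec, -1))
}

dec_to_jan(toy$fitted_single)
dec_to_jan(toy$fitted_by_year)
```

The per-year fits know nothing about their neighbours, so nothing constrains them to meet at the year boundary. The single fit treats December and January as adjacent points on one curve.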

The low point of post frequency was 2021 and 2022, when life events got in the way. I was very busy in my day job as Chief Data Scientist for Nous Group, and that job was itself fairly hands-on and technical, which reduced my motivation to write code out of hours to relax. I was also playing a lot of Elite Dangerous in this period, right up until 2024 (when the civil unrest in Noumea led me to drop that cold turkey). Mid-2018 and mid-2022 both saw me change jobs and countries. In 2025 I’ve had health challenges, but these seem to be under control and I’m getting into a better modus vivendi with them.

The past couple of years have seen a subtle but material uptick in my posting frequency, and I think this is going to continue. I’ve got quite a backlog of half-finished posts to finish off. These are on topics ranging from synthetic controls, to power and p-values, to lots of empirical stuff on the Pacific.

One thing that’s happened over time is that the posts have gotten longer and, perhaps, more thorough. Certainly they are much more likely to be crafted over weeks or even months (or years in some cases), rather than knocked out in a single Saturday morning as used to be the case. Back when I wrote 45 posts in 2016 (nearly one a week) they were short, very much single-topic, and in no great level of detail. More recently I am more inclined to try to thoroughly tease something out, particularly when I am learning for myself or trying to consolidate my understanding of something. A good example would be my recent set of posts on modelling fertility rates, which I had to split into two, one on the substance and one on the grab bag of things I learned on the way.

Here’s a connected scatter plot that lets us see both word count and posting frequency together, with some very crude labels for the themes I was writing about at the time:

While one does one’s art for its own sake, there’s no denying it’s interesting to see what other people read in my blog too. I get a modest but steady trickle of around 60 unique visitors and 80-100 pages read a day. That is modest compared to, say, Heather Armstrong’s roughly 300,000 visitors a day at the peak of mummy blogging, but quite a few more than I thought I’d get when I set out (which would have been, to be honest, in round numbers, around zero).

At its high point, back when Twitter more or less worked and I was writing more frequently and doing election forecasts, I think I got about 70% more traffic than now. But it’s hard to tell, given changing approaches to tracking visitors.

I used to have an automated “most popular” listing but changes in analytics services over the blog’s lifetime degraded this and I’ve pulled it. But from a more ad hoc examination using partial data from some mixed sources (too complicated to talk about here), here are some posts that have been most read recently:

Stepwise selection of variables in regression is Evil (2025)
Skill v luck in determining backgammon winners (2016)
Relative risk ratios and odds ratios (2018)
Weighted versus unweighted percentiles (2023)
Log transforms, geometric means, and estimating population totals (2023)
Dual axis time series plots may be ok sometimes after all (2016)
Extrapolation is tough for trees! (2016)
Snakes and ladders (2024)

This is interesting and I think is probably showing some external searches are turning up my blog on basic methodological questions. This must be dominating over social media or RSS feeds pulling in visitors when I publish a new post. I’m pleased to say each of these posts above does indeed have something useful in it—roughly defined as meaning I sometimes go back to them myself to see what I thought. So I hope other people are finding them of some use at the end of their random web search too.

If I had a longer series of analytics data I’m sure my various election-related posts, and time series modelling posts, would be in the genuine top hits. At one point some of my comparisons of forecasting methods were getting so many hits that it looked like they were required reading for some courses.

Blog benchmarks

I did some cursory internet research into blog longevity, to see how my 10 years stands up in comparison. ChatGPT¹ first assured me that research said 60-70% of blogs are abandoned after one year (attributed to Herring et al) and that the median life was four months (Mishne and de Rijke) or that 50% stopped after one month (same alleged authors).

These all sound plausible! And maybe these authors did find that. But I can’t (with limited time and access, admittedly) find them doing so. Application of intensive interrogation techniques to ChatGPT revealed that these were things that it thought sounded plausible as things these people might have written, rather than numbers it could actually find in real, published papers.

Truly, ChatGPT is like an enthusiastic, immensely well-read but very unreliable research assistant who has had a couple of drinks, whose outputs should all be prefaced with “I seem to remember reading or hearing somewhere….” and treated with a heap of scepticism.

In terms of real findings I can actually source, some research from back when blogs were cool and before short-form social media really took off found that a quarter of blogs only last one post. Back in 2003, apparently, “the typical blog is written by a teenage girl who uses it twice a month to update her friends and classmates on happenings in her life.” These days, I do not think such people write blogs or even micro-blogs, but post videos on TikTok or equivalent.

A 2012 study of research blogs—closer in form and motivation to my own than the more personal blogs that make (or made) up the bulk of the blogosphere—found 84% of research blogs published under the author’s own name; 86% in English; and 72% by one or two male authors. So I’m in the majority in those respects.

At around 1,500 words each, my blog posts are much longer than the average of 200-300 words found by Susan Herring and others in a 2004 study.

Much of the research above is dated. Effectively it precedes the rise of video-based influencers. Short-form video (TikTok etc), podcasts, general video, and short-form text (X, Bluesky, LinkedIn etc) seem to dominate over written blogs these days. I have no interest in producing any of these except short-form text on social media sites.

There are still apparently a million or so active blogs, many of them forming a more stable piece of infrastructure underneath the froth of these more modern forms. This is basically how I engage with Bluesky, Mastodon and LinkedIn too, in terms of the relationship with my blog. I write in the blog, and use the social media to publicise that writing.

Why I write my blog

Ten years is a success, I guess. While I couldn’t find a citable source, I’m well prepared to believe that most blogs are abandoned after a few months. So what kept me motivated to keep writing for ten years?

My motivations have certainly evolved over time as I settled into a rhythm of writing and publishing posts. Compared to when I set out, I can give a much more accurate picture of why I’m really doing this:

It helps me exercise and extend my hands-on technical craft—something that doesn’t happen naturally in the managerial roles in my day job, but is still useful for executing those roles even in a directorial and decision-making rather than hands-on capacity.
I can learn things, with the motivation for extra discipline (I really want to be confident I’m getting some unfamiliar thing right if I’m going to post it) that comes from doing so in public. A number of times I have had my course corrected by positive engagements after posting a blog, either on social media or in the comments section.
Sometimes it’s just fun, and relaxing, to play around with data and code. Particularly when I read something interesting and want to check “wait, is that for real?”. Or when I just want to make a cool animation.
I can try out stuff we might (or might not) want to use at work but, for whatever reason, needs me to give it a go myself in a way that doesn’t fit in with my normal work responsibilities yet can be drawn on if helpful.
Sometimes (but not very often) I actually want to make an intervention in the public sphere and communicate some facts and ideas. How important this motivation is has varied over time, and it’s never been particularly important. There were periods when I published election forecasts for New Zealand that had no other equivalent at the time, and some Covid modelling in Australia, where communicating actual issues was the most important thing for my blog. But these didn’t (and almost certainly, couldn’t possibly, given my energy and interest levels) last. Perhaps the high point blog post that I really wanted people to read was my exposure of Surgisphere, which made me Twitter-famous for a few days and was an important contribution to an investigation by the Guardian and then retraction of an article in the Lancet (surprisingly but gratifyingly rapidly).
My day job is helped by networking, and my interest and skills in data and code are one tool I can use in a small way to do that. I’m certainly not into blogging for fame (or I hope I would do differently and better than I am), but I do seek to use my posts in a certain way to broaden and strengthen my professional networks. I publicise my posts on Bluesky and LinkedIn, sometimes Facebook (and until 2024 on Twitter). They are a way of getting myself known to niche audiences, and very occasionally a way to achieve an objective for my day job by publicising something cool we’re doing, positions we’re recruiting for, or an issue we’re concerned about.

Technical stuff about the blog

When I set up my blog I really, really hated the non-data technical stuff about getting it to work, having the fonts right, working out how domain names work, deciding on layout, etc. I had to read quite a few blogs on how to set up blogs, and vowed to myself not to become one of them. So I have relatively few posts on the back end of my blog. But ten years on, there is some (small) possible interest in what works for me, so here is how my blog works under the hood:

It is hosted on GitHub pages but has its own domain name. This (the GitHub part) is free, and gives me plenty of control over formatting, and works well with Jekyll.
I use Jekyll and resisted switching to Hugo when it came along. In things like this, “there’s a time for change, which is when it can no longer be resisted”. If it ain’t broke, don’t fix it.
It is a Git repository within a repository. The source code repository is the important one that I work on; it has a _working folder with all the R and other technical scripts, and a _posts folder with Markdown or HTML files for the actual posts.
When I build the site it appears in the _site folder of the source code repository. _site is also a Git repository and, when it is all good to go, I push that to the https://github.com/ellisp/ellisp.github.io repository on GitHub, which is automatically published on GitHub pages.
I write all the Markdown or HTML pages by hand. I use HTML when things get too complicated layout-wise for Markdown (not very often).
I don’t use RMarkdown or similar for this blog (knitting results in with the code and text) because I prefer to have complete, manual control of where I put a code chunk, plot or table. And my creative process is very much “work on the analysis” and then “write it up”, which is well supported by having a separate R script with the analysis and a Markdown or HTML file with the write-up.
I created and use the frs R package with a few supporting functions, most important of which is the svg_png() function. It uses the method described in this post. It helps SVG files look good with Google fonts and working across platforms. It also saves near-identical PNG and SVG versions of images, so I can have PNG fall-backs for browsers that don’t show SVGs (this was a real issue 10 years ago, I don’t know about now).
There are some things, like syntax highlighting, the domain name, and the link to Disqus for the comments section, that involved a bunch of mucking around; I’m pleased to say I’ve completely forgotten what I had to do.

Yeah, blog to live, don’t live to blog. That’s true in general, but never more so than in thinking about the stuff that makes it possible to blog.
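To make the dual SVG/PNG idea mentioned above a little more concrete, here is a minimal sketch in base R. This is not the implementation of svg_png() in the frs package and does nothing about Google fonts. The helper name `save_svg_and_png()` and its arguments are hypothetical, invented for this illustration. It simply draws the same plot twice, once to each graphics device, with matching dimensions. It assumes the local R installation supports the `svg()` and `png()` devices.

```r
# Hypothetical helper, NOT the frs::svg_png() implementation: it only shows
# the idea of drawing the same plot twice, once per graphics device.
save_svg_and_png <- function(draw, file_stem, width = 8, height = 5, res = 300) {
  svg_file <- paste0(file_stem, ".svg")
  png_file <- paste0(file_stem, ".png")

  # Vector version (width and height are in inches)
  svg(svg_file, width = width, height = height)
  draw()
  dev.off()

  # Raster fall-back with the same dimensions in inches
  png(png_file, width = width, height = height, units = "in", res = res)
  draw()
  dev.off()

  invisible(c(svg = svg_file, png = png_file))
}

# Usage with a throwaway base graphics plot, written to a temporary folder
files <- save_svg_and_png(
  draw = function() plot(1:10, main = "Example plot"),
  file_stem = file.path(tempdir(), "example-plot")
)
file.exists(files)
```

For a ggplot2 object `p`, the `draw` argument could be `function() print(p)`. A page can then point at the SVG and fall back to the PNG where needed.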

Word count code

Here’s the code that produced the charts shown earlier in this post:

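library(tidyverse) # also attaches lubridate (tidyverse 2.0.0+), used for month() and year()
library(scales) # for comma()
library(stylo) # for delete.markup
library(glue)
library(ggtext)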

#---------------Import and process blog posts-------------
blog_names <- list.files("../_posts", full.names = TRUE)
blogs <- tibble()

for(i in seq_along(blog_names)){
  blogs[i, "full"] <- paste(readLines(blog_names[i]), collapse = " ")
  blogs[i, "filename"] <- gsub("../_posts/", "", blog_names[i], fixed = TRUE)
}


blogs <- blogs |> 
  mutate(no_jekyll = gsub("\\{\\% highlight R.*?\\%\\}.*?\\{\\% endhighlight \\%\\}", " ", full),
         txt = "")

# delete markup only works on one string at a time, seems easiest to do it in a loop:
for(i in seq_len(nrow(blogs))){
  blogs[i, ]$txt <- delete.markup(blogs[i, ]$no_jekyll, markup.type = "html")
}

# a few more basic stats per blog post:
blogs <- blogs |> 
  mutate(word_count = stringi::stri_count_words(txt),
          word_count_with_tags = stringi::stri_count_words(no_jekyll),
          date = as.Date(str_extract(filename, "^[0-9]*-[0-9]*-[0-9]*")),
          month = month(date),
          year = year(date))

#---------------Minimal analysis----------------

# Summary aggregates
blog_sum <- blogs |> 
  summarise(number_blogs = n(), 
            words_with_tags = sum(word_count_with_tags),
            total_words = sum(word_count),
            mean_words = mean(word_count),
            median_words = median(word_count),
            max_words = max(word_count),
            min_words = min(word_count))


# Shortest blog (turns out to be one just announcing a work shiny app):
blogs |> 
  arrange(word_count) |> 
  slice(1) |> 
  pull(txt)

#------------------Graphics for use in blog-------------------------

the_caption <- "Source: https://freerangestats.info"

# Time series plot showing number of posts by month:
d1 <- blogs |> 
  group_by(year, month) |> 
  summarise(number_blogs = n()) |> 
  ungroup() |> 
  complete(year, month, fill = list(number_blogs = 0)) |> 
  # remove October, November, December in 2025 (as time of writing is September 2025):
  filter(!(year == 2025 & month %in% 10:12)) |> 
  # remove months blog did not exist:
  filter(!(year == 2015 & month %in% 1:6)) |> 
  group_by(year) |> 
  mutate(year_lab = glue("{year}: {sum(number_blogs)} posts"),
         is_zero = ifelse(number_blogs == 0, "Zero", "NotZero")) 

# fit a smooth curve to the whole data set (don't want
# to do this with geom_smooth in the plot, as then it would
# break at every year facet):
mod <- loess(number_blogs ~ I(year + month / 12), data = d1, span = 0.15)
d1$fitted <- predict(mod)

# draw time series plot of number of blogs:
d1 |> 
  ggplot(aes(x = month, y = number_blogs)) +
  facet_wrap(~year_lab) +
  geom_line(aes(y = fitted), colour = "grey80") +
  geom_point(colour = "steelblue", size = 2.5, aes(shape = is_zero)) +
  expand_limits(y = 0) +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  scale_shape_manual(values = c("Zero" = 1, "NotZero" = 19)) +
  theme(panel.grid.minor = element_blank(),
       axis.text.x = element_text(angle = 45, hjust = 1),
       legend.position = "none") +
  labs(x = "",
       y = "Number of blog posts",
       title = "Ten years of Free Range Statistics blogging",
       subtitle = glue("{nrow(blogs)} posts and {comma(blog_sum$total_words)} words, in just over ten years."),
      caption = the_caption)

# Connected scatter plot comparing average word count to number of posts:
blogs |> 
  mutate(number_months = case_when(
            year == 2015 ~ 6,
            year == 2025 ~ 8.5,
            TRUE ~ 12
          )) |> 
  group_by(year, number_months) |> 
  summarise(avg_word_count = mean(word_count, trim = 0.1), # 10% trimmed mean
            number_blogs = n()) |> 
  ungroup() |> 
  mutate(blogs_per_month = number_blogs / number_months) |> 
  ggplot(aes(x = blogs_per_month, y = avg_word_count, label = year)) +
  geom_path(colour = "grey80") +
  geom_text(colour = "grey50") +
  scale_y_continuous(label = comma) +
  expand_limits(x = 4.5) +
  annotate("text", fontface = "italic", hjust = 0, colour = "darkblue",
            x = c(4, 3.4, 2.1), 
            y = c(1165, 1350, 1880),
            label = c("Time series", "Elections", "Covid") 
            )  +
  # add day jobs
  annotate("text", fontface = "italic", hjust = 0, colour = "brown",
            x = c(3.1, 2.5, 0, 1.1), 
            y = c(1130, 1675, 1420, 1330),
            label = c("NZ economics", "Consultant", "Chief Data Scientist", "Pacific") 
            )  +
  labs(x = "Blog posts per month",
       y = "Average words per blog post",
       title = "Ten years of Free Range Statistics blogging",
       subtitle = "Annotated with important (but not necessarily dominant) <span style = 'color:darkblue'>themes</span> and <span style = 'color:brown'>day-jobs</span> for different phases.",
      caption = the_caption) +
  theme(plot.subtitle = element_markdown())

¹ This is my first use of a large language model for any purpose with this blog. I can categorically say Free Range Statistics will never use generative AI to produce either words or code.