Counting digits appearing in page numbers
The other day in a training session, the facilitators warmed people up for intellectual work with this group exercise:
- Count the number of times the digit “1” appears in the page numbers of a 90 page book.
- Then, count the number of times it appears in the page numbers of a 2,000 page book.
(The actual numbers here differ slightly from those used in the session, so I don’t just give away the answers.)
The first task is easy enough; you can pretty much count them in your head (nine 1s in the units position, from 1 up to 81, plus ten more in the tens position for pages 10 to 19, makes 19). The second task is obviously harder, and the aim was to illustrate the need to break more complicated problems down into a structure for a solution.
But the alternative message that occurred to me, of course, is “this is why we invented computers, isn’t it?”. Here’s the one-liner in R that answers the second question, about the 2,000 page book:
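Something like this sketch, using `str_extract_all` from the stringr package (base R would do the job too, and the original may differ in the details):

```r
library(stringr)

# pages 1 to 2000 -> one long string -> pull out every "1" -> count them
length(str_extract_all(paste(1:2000, collapse = ""), "1")[[1]])
```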
Basically, starting from the middle, this tells the computer:
- make a vector of all the numbers from 1 to 2,000
- collapse them all into a single big character string
- extract all the digits “1” from that string into a new vector in which each element is just an extracted “1”
- count the length of that vector
The answer is 1,600. (As a sanity check: over pages 1 to 2,000 each of the units, tens and hundreds columns contributes 200 ones, and the thousands column, covering pages 1,000 to 1,999, contributes another 1,000.)
What about if we wanted to generalise this, though, for more page numbers and different digits? After the training session was over, I had a go at this. For the sake of efficiency and speed, I treated this as a slightly different problem and solved it with the help of the `count_digits` function defined below, which counts the number of times a digit appears in a given page number; then cumulative sums give the eventual result for books of different lengths with minimal duplication of calculations.
Here’s the code to do this, and its application to a book of 2,500 pages, for all digits.
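A sketch along these lines, assuming the tidyverse; the body of `count_digits` and the plotting details are illustrative rather than definitive:

```r
library(tidyverse)

# count how many times a single digit appears in a single page number
count_digits <- function(page, digit) {
  sum(strsplit(as.character(page), "")[[1]] == as.character(digit))
}

# per-page counts for every digit; cumulative sums then give the totals
# for every book length up to 2,500 pages in one pass
results <- expand_grid(page = 1:2500, digit = 0:9) |>
  mutate(count = map2_dbl(page, digit, count_digits)) |>
  group_by(digit) |>
  mutate(cumulative_count = cumsum(count)) |>
  ungroup()

ggplot(results, aes(x = page, y = cumulative_count, colour = as.factor(digit))) +
  geom_line() +
  labs(x = "Pages in book",
       y = "Number of appearances of digit so far",
       colour = "Digit")
```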
It’s quite a pretty pattern:
When we get a book of 75,000 pages it gets even prettier:
(Code not shown as it’s basically identical to the 2,500 page case).
Funnily enough, after doing the thinking above, I found out that one of my work colleagues who had done the same training session a few months earlier had reacted almost identically, using a computer to count the results. If you’re at all interested in counting the digits appearing in page numbers, I thoroughly recommend Paul Geil’s DigitCounter Shiny app that he built in response.
All this digit counting reminded me that I’d wanted to look into “Benford’s ‘law’” at some point, and prompted me to make today the day. Benford observed that the frequency distribution of the leading digit in many real-life sets of numerical data follows a pattern: 1 appears as the first digit about 30.1% of the time, 2 about 17.6% of the time, and so on.
If Benford’s law applies to a series of numbers, the probability of a number beginning with the digit d is:
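$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d \in \{1, 2, \ldots, 9\}$$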
Benford wasn’t the first person to notice this, although he did go into it particularly systematically. The reasons why it applies are interesting but I won’t go into them; they’re nicely explained on the Wikipedia page. Obviously it is not always going to apply; for instance, the leading digits of the page numbers of a book aren’t going to follow this distribution. It does apply often enough, though, to be useful in detecting fraud (although in practice, automated fraud detection should use much more than this).
Benford’s law comes closer to applying in distributions where the data cover at least two orders of magnitude. Take a look at these examples:
The normal distribution isn’t a bad fit because it covers a couple of orders of magnitude (lots of data in 0-1, some in the 1-10 range), but the best fits are the distributions with fatter tails in at least one direction, like the log-normal and the Cauchy.
Those simulations were measured with the `count_first_digit` function below:
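A sketch of what that function can look like, returning the observed first-digit proportions alongside Benford’s predictions (the exact return format here is an assumption on my part):

```r
library(tidyverse)

# tabulate the leading digit of each number in a vector and compare the
# observed proportions with Benford's predicted log10(1 + 1/d)
count_first_digit <- function(x) {
  first <- as.numeric(str_extract(as.character(abs(x)), "[1-9]"))
  tibble(digit    = 1:9,
         observed = as.numeric(table(factor(first, levels = 1:9))) / length(first),
         benford  = log10(1 + 1 / (1:9)))
}

# for example, applied to simulations like those in the chart:
count_first_digit(abs(rnorm(10000)))
count_first_digit(rlnorm(10000))
count_first_digit(abs(rcauchy(10000)))
```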
Having built that function, I tried out Benford’s law on some famous datasets built into R. The results are illuminating, I think:
Only the closing prices of stock indices come close to following the predicted pattern. The failure of the law to apply is partly due to smallish sample sizes, but mostly because these datasets all cover too small a range of values for the first-digit law to hold. For example, the monthly road casualty dataset is mostly in the low 100s, with an occasional value in the 90s or 80s, so the leading digit is nearly always 1, 8 or 9. But I’m pretty sure that if we compared the monthly road casualties of many different countries, it would be a much closer fit.
When the data come from multiple diverse units of observation and hence cover several orders of magnitude, the law is much more likely to apply. For example, consider the `population` data set in the `tidyr` package, a subset of World Health Organization tuberculosis data:
We see quite a close match.
The take-home story is not to expect the first-digit law, or Benford’s law, to apply unless you’ve got the right sort of data for it: data that range across several orders of magnitude.
Here’s the code that produced those real-life tests of Benford’s law.
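A sketch of it, reusing `count_first_digit` from above. The particular built-in datasets are assumptions here (EuStockMarkets standing in for the stock index closing prices and Seatbelts for the monthly road casualties), apart from tidyr’s `population`:

```r
library(tidyverse)

# apply count_first_digit to a handful of built-in datasets and compare
# the observed first-digit proportions with Benford's predictions
tests <- list(
  "EuStockMarkets closing prices"      = as.numeric(EuStockMarkets),
  "Seatbelts: drivers killed, monthly" = as.numeric(Seatbelts[, "DriversKilled"]),
  "islands: land mass areas"           = as.numeric(islands),
  "tidyr::population (WHO data)"       = tidyr::population$population
)

results <- imap_dfr(tests, \(x, nm) count_first_digit(x) |> mutate(dataset = nm))

ggplot(results, aes(x = digit)) +
  geom_col(aes(y = observed)) +
  geom_point(aes(y = benford), colour = "red") +
  facet_wrap(~dataset) +
  scale_x_continuous(breaks = 1:9) +
  labs(x = "First digit", y = "Proportion",
       title = "Observed first digits compared to Benford's law (red points)")
```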