I last blogged about books in September 2016 and it’s still in my 10 most popular posts. I thought I’d update it with some of the data-relevant books I read in 2017. Mostly these are books that were published before 2017 - it’s just that it was last year that I got around to reading them. Well, this is a “web log” (ie “blog”) after all, so what could be more pertinent?
I read a lot of books, and most of them aren’t directly relevant for here (although I always look for interesting topics eg in history that might lead to a post). So I’m only going to talk about those which are somehow related to statistics or data science, and which I rated 3 stars or more. I try to calibrate my ratings to the Goodreads star system, so 3 stars means “I liked it” and I’d be happy recommending people to read it. Whereas 5 stars means something that really made a particular impact on me, and was enjoyable to read to boot. In the data context, it means I’m definitely doing things differently as a result of that book - new techniques, better practice, new ideas, or all three.
So here’s some books I read and liked in 2017.
Kimball and Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling
In 2017 I found myself much deeper and more hands on in the world of data warehousing than I had been, and I did some reading to get my formal understanding up to speed with the very ambitious project I was driving. I knew a bit about data modelling for relational databases, and was familiar with thinking in terms of “facts” and “dimensions”, but I had some definite gaps in the warehouse-specific domain of knowledge. This book was a complete eye-opener, and page after page I found myself saying “yes that makes sense!” “I should have thought of that” and “I wish I’d known that before”.
I can’t recommend this book enough; for a statistician or data scientist who wants to understand how databases can be used to support analytics, this is the one book I’d say is a “must read”. There’s key terminology and process to help you talk with IT; established good practice; and lots of good thoughts on structuring data, handling change, and basically on organising data. It’s based on years of learning the hard way in building datamarts and combining them into warehouses. You’ll find it useful even if you’re not designing and building a database. In fact, I’ll probably write a blog post just on this some time in the next few months.
This is also one book I wish everyone in IT departments reads, or re-reads if necessary. Data warehousing is different to other database design. It has particular problems, and it has standard, tried-and-true solutions. Those standard solutions are neglected at an organisation’s peril. In fact, many of the concepts set out here are ones that I think even non-technical senior managers, involved in data governance for instance, need to understand.
2017 was the year I determined to update my understanding of practical modern Bayesianism, and Statistical Rethinking was one of a half dozen books I read or at least started on the subject. (A few of them I haven’t finished yet, particularly two great Andrew Gelman numbers which won’t be covered today.)
Statistical Rethinking is very well written, engaging, humourous and very very eye-opening. I now recommend it to all sorts of people. I think it’s particularly perfect for people who learnt stats the bad way (ie cookbooks) and need a jolt to point them in the much more sustainable and scalable direction of thinking in terms of principles, how we know what we know and philosophy of science. But it’s also good for anyone in the position I was - wanting a friendly introduction into the modern Bayesian world, from whatever level of familiarity with other aspects of data.
While I read this book end to end without doing the exercises because that’s my learning style, they also look really useful for people who like that sort of thing. I also don’t use the R package that came with it as an intermediary between R and Stan but went straight to writing my own Stan programs; that too will be a matter of preference. The book was great as an introduction to Stan when it came to that part, but for me it was its conceptual clarity and communicative teaching style that was really valuable.
Stan Development Team, Stan Modeling Language User’s Guide and Reference Manual
Yes, 2017 was indeed the year I began to take Bayesianism seriously. Discovering Stan was a big part of that. I incorporated Stan models into my election forecasts and other analysis on this blog like this post on traffic accidents, as part of my self-learning voyage.
The self-learning was possible because Stan is very well documented, and the heart of that is this User’s Guide and Reference Manual. Sure, you’d want to supplement it with more theoretical Bayesian texts and with examples and blog posts, but the guts of the actual language is the manual, as it should be. So even though it’s not a “book” one purchases from a book-seller, I’m including this invaluable document in my five star list for 2017.
Duncan, Elliot and Juan Jose Salazar, Statistical Confidentiality: Principles and Practice
This was another work-related one for me. A readable and authoritative text on the latest techniques and challenges of statistical disclosure control. Very much a book for professional statisticians who have to deal with the intriguing and highly problematic area of outthinking the snoopers. But anyone who has ever thought “don’t you just need to delete the names and addresses and then you can release the microdata?” should probably at least skim through this book to get an idea of why national statistical offices always say “I think you’ll find it’s more complicated than that.”
One of the few books on this list that was actually published for the first time in 2017, this is a great introduction to the specific techniques of treating a broad empirical distribution as a prior expectation. You don’t need to be a baseball fan. For those who don’t know, baseball is a ball game, a bit like the rounders game we used to play as kids when we couldn’t get cricket; I understand it’s popular in America where perhaps the turf for cricket pitches can’t grow for climatic reasons.
A read that is both entertaining and edifying, he introduces “the empirical Bayesian approach to estimation, credible intervals, A/B testing, mixture models, and other methods, all through the example of baseball batting averages.” With simple to follow R code to make it all real.
Inkscape is a vector graphics creation and editing software, the open source competitor for Adobe Illustrator. Apart from its lack of support for printer-ready CMYK colours, it’s nearly there as a viable professional alternative. It’s definitely good enough for what I need it for, which is touching up the odd SVG graphic, and building the occasional icon or other vector image. It can even be run in batch mode from the command line which is always useful.
Not being a specialist in the field, I don’t know if this book really is the best one, but it certainly is a good book - very comprehensive introduction to the tool.
It’s worth taking this opportunity to list here three critical bits of graphics software that I think repay familiarity for data science types:
- Inkscape (vector graphics editor equivalent of Illustrator)
- Scribus (publishing / layout tool, equivalent of InDesign)
- Gimp (raster graphics editor, equivalent of PhotoShop)
They’re all very powerful, which comes with a learning curve; but it’s worth being aware of what they can do and considering if an investment in more familiarity may be worth while.
McGrayne, The Theory That Would Not Die
Subtitled “How Bayes’ Rule Cracked The Enigma Code, Hunted Down Russian Submarines, And Emerged Triumphant From Two Centuries Of Controversy”, which pretty much sums it up. A fun and edifying bit of history of the ups and downs in the credibility of Bayes’ rule as a cornerstone of inference about an uncertain world.
One of the famous Machine-defeats-Human images is of course Kasparov in deep thought during his losing match with Deep Blue in 1997. This book gives a brilliant and enthralling human angle on the lead up to that point and Kasparov’s own abiding interest in chess-playing artificial intelligence. He argues convincingly that the particular outcome on that particular day was in fact somewhat unfair (not IBM’s finest hour), while still making clear that this is really not the point, with the incredible developments in statistics since then.
Kasparov is an engaging writer and genuine big picture thinker, so this becomes much more than a personal account of a chess match and gives a nuanced view of what it means to be both or either “creative” and “intelligent”. A good entry point into thinking about machine intelligence. It also has some revealing insights into how high level competitive chess works, but those are bonuses.
Hendy, Silencing Science
I’ve long had an interest in the politics of public debate by government-funded actors (such as this piece in the overseas aid context). In fact, I’m aware of the boundaries of public comment all the time while working on this blog. In this book based in the New Zealand experience, Hendy argues that there are too many instances of scientists being silenced - either through explicit suppression or institutionally facilitated self-censorship. “Few scientific institutions … feel secure enough to criticise the government of the day.”
It’s a shame (to say the least) that the people who know most about any particular topic are often precluded from contributing to the public debate on it. This issue extends beyond scientists of course - with public servants in particular being subject to ethical and professional constraints of some complexity. A good read.
Another read as part of my self-imposed Bayesian re-education program, and a particularly interesting one for me. This was a good introduction for me to a range of Bayesian methods across social sciences, but was particularly useful for me in its clear explanation of state space modelling of irregularly and imperfectly observed time series processes, using an Australian election as an example. I reworked Jackman’s example, with his data but rewritten in Stan, in a couple of blog posts. I subsequently successfully adapted this method for the multi-party New Zealand election.
Good introduction to analysis of (pre-2009) US politics by the master statistician.
Well, this has gotten long hasn’t it… these last few books, all of them worth reading, I’ll leave just as titles and links:
- Tsay, Multivariate Time Series Analysis
- Nin and Herranz, Privacy and Anonymity in Information Management Systems: New Techniques for New Practical Problems (Advanced Information and Knowledge Processing)
- Rosenthal, Struck by Lightning: The Curious World of Probabilities
- Sagers and Hosack, Information Technology Security Fundamentals
- Saini, Inferior: How Science Got Women Wrong—and the New Research That’s Rewriting The Story