Bootstrap and cross-validation for evaluating modelling strategies
At a glance:
I compare 'simple' bootstrap, 'enhanced' (optimism-correcting) bootstrap, and repeated k-fold cross-validation as methods for estimating fit of three example modelling strategies.
05 Jun 2016
I’ve been re-reading Frank Harrell’s Regression Modelling Strategies, a must read for anyone who ever fits a regression model, although be prepared - depending on your background, you might get 30 pages in and suddenly become convinced you’ve been doing nearly everything wrong before, which can be disturbing.
I wanted to evaluate three simple modelling strategies in dealing with data with many variables. Using data with 54 variables on 1,785 area units from New Zealand’s 2013 census, I’m looking to predict median income on the basis of the other 53 variables. The features are all continuous and are variables like “mean number of bedrooms”, “proportion of individuals with no religion” and “proportion of individuals who are smokers”. Restricting myself to traditional linear regression with a normally distributed response, my three alternative strategies were:
use all 53 variables;
eliminate the variables that can be predicted easily from the other variables (defined by having a variance inflation factor greater than ten), one by one until the main collinearity problems are gone; or
eliminate variables one at a time from the full model on the basis of comparing Akaike’s Information Criterion of models with and without each variable.
None of these is exactly what I would use for real, but they serve the purpose of setting up a competition of strategies that I can test with a variety of model validation techniques.
The main purpose of the exercise was actually to ensure I had my head around different ways of estimating the validity of a model, loosely definable as how well it would perform at predicting new data. As there is no possibility of new areas in New Zealand from 2013 that need to have their income predicted, the “prediction” is a thought-exercise which we need to find a plausible way of simulating. Confidence in hypothetical predictions gives us confidence in the insights the model gives into relationships between variables.
There are many methods of validating models, although I think k-fold cross-validation has market dominance (not with Harrell though, who prefers varieties of the bootstrap). The three validation methods I’ve used for this post are:
‘simple’ bootstrap. This involves creating resamples with replacement from the original data, of the same size; applying the modelling strategy to the resample; using the model to predict the values of the full set of original data and calculating a goodness of fit statistic (eg either R-squared or root mean squared error) comparing the predicted value to the actual value. Note - Following Efron, Harrell calls this the “simple bootstrap”, but other authors and the useful caret package use “simple bootstrap” to mean the resample model is used to predict the out-of-bag values at each resample point, rather than the full original sample.
‘enhanced’ bootstrap. This is a little more involved and is basically a method of estimating the ‘optimism’ of the goodness of fit statistic. There’s a nice step by step explanation by thestatsgeek which I won’t try to improve on.
repeated 10-fold cross-validation. 10-fold cross-validation involves dividing your data into ten parts, then taking turns to fit the model on 90% of the data and using that model to predict the remaining 10%. The average of the 10 goodness of fit statistics becomes your estimate of the actual goodness of fit. One of the problems with k-fold cross-validation is that it has a high variance ie doing it different times you get different results based on the luck of you k-way split; so repeated k-fold cross-validation addresses this by performing the whole process a number of times and taking the average.
As the sample sizes get bigger relative to the number of variables in the model the methods should converge. The bootstrap methods can give over-optimistic estimates of model validity compared to cross-validation; there are various other methods available to address this issue although none seem to me to provide all-purpose solution.
It’s critical that the re-sampling in the process envelopes the entire model-building strategy, not just the final fit. In particular, if the strategy involves variable selection (as two of my candidate strategies do), you have to automate that selection process and run it on each different resample. That’s because one of the highest risk parts of the modelling process is that variable selection. Running cross-validation or the bootstrap on a final model after you’ve eliminated a bunch of variables is missing the point, and will give materially misleading statistics (biased towards things being more “significant” than there really is evidence for). Of course, that doesn’t stop this being common misguided practice.
One nice feature of statistics since the revolution of the 1980s is that the bootstrap helps you conceptualise what might have happened but didn’t. Here’s the root mean squared error from the 100 different bootstrap resamples when the three different modelling strategies (including variable selection) were applied:
Notice anything? Not only does it seem to be generally a bad idea to drop variables just because they are collinear with others, but occasionally it turns out to be a really bad idea - like in resamples #4, #6 and around thirty others. Those thirty or so spikes are in resamples where random chance led to one of the more important variables being dumped before it had a chance to contribute to the model.
The thing that surprised me here was that the generally maligned step-wise selection strategy performed nearly as well as the full model, judged by the simple bootstrap. That result comes through for the other two validation methods as well:
In all three validation methods there’s really nothing substantive to choose between the “full model” and “stepwise” strategies, based purely on results.
The full model is much easier to fit, interpret, estimate confidence intervals and perform tests on than stepwise. All the standard statistics for a final model chosen by stepwise methods are misleading and careful recalculations are needed based on elaborate bootstrapping. So the full model wins hands-down as a general strategy in this case.
With this data, we have a bit of freedom from the generous sample size. If approaching this for real I wouldn’t eliminate any variables unless there were theoretical / subject matter reasons to do so. I have made the mistake of eliminating the co-linear variables before from this dataset but will try not to do it again. The rule of thumb is to have 20 observations for each parameter (this is one of the most asked and most dodged questions in statistics education; see Table 4.1 of Regression Modelling Strategies for this particular answer), which suggests we can have up to 80 parameters with a bit to spare. This gives us 30 parameters to use for non-linear relationships and/or interactions, which is the direction I might go in a subsequent post. Bearing that in mind, I’m not bothering to report here the actual substantive results (eg which factors are related to income and how); that can wait for a better model down the track.
Data and computing
The census data are ultimately from Statistics New Zealand of course, but are tidied up and available in my nzelect R package, which is still very much under development and may change without notice. It’s only available from GitHub at the moment (installation code below).
I do the bootstrapping with the aid of the boot package, which is generally the recommended approach in R. For repeated cross-validation of the two straightforward strategies (full model and stepwise variable selection) I use the caret package, in combination with stepAIC which is in the Venables and Ripley MASS package. For the more complex strategy that involved dropping variables with high variance inflation factors I found it easiest to do the repeated cross-validation old-school with my own for loops.
This exercise was a bit complex and I won’t be astonished if someone points out an error. If you see a problem, or have any suggestions or questions, please leave a comment.
My day job is Director of the Statistics for Development Division at the Pacific Community, the principal scientific and technical organisation in the Pacific region, proudly supporting development since 1947. We are an international development organisation owned and governed by our 27 country and territory members. This blog is not part of my role there and contains my personal views only.
I'm pleased to be aggregated at R-bloggers, the one-stop shop for blog posts featuring R.