I’ve recently noticed that stepwise regression is still fairly popular, despite being well and truly frowned upon by well-informed statisticians. By stepwise regression, I mean any modelling strategy that involves adding or subtracting variables from a regression model on the basis that they are “significant”, reduce the Akaike Information Criterion, or increase adjusted R-squared, or in fact any other data-driven statistics.
This might be an automated variable selection procedure, or it might be a matter of eyeballing the results of the first model you fit and saying (for example) “Let’s take literacy out; its p-value is not significant, and we’ll have a more parsimonious model once we do that.”
And then people produce and report t tests, F tests, and so on, as though the end model was the one they always intended to run.
Let me be clear about this. This is wrong. It’s not as disastrously wrong as, say, sorting the data separately one column at a time before you fit your model, but it’s still objectively bad. As my professor once told our class:
“If you choose the variables in your model based on the data and then run tests on them, you are Evil; and you will go to Hell.”
Why is it wrong? Here are the seven reasons given by Frank Harrell in his must-read classic, Regression Modeling Strategies:
- The R-squared or even adjusted R-squared values of the end model are biased high.
- The F and Chi-square test statistics of the final model do not have the claimed distribution.
- The standard errors of coefficient estimates are biased low and confidence intervals for effects and predictions are falsely narrow.
- The p values are too small (there are severe multiple comparison problems, in addition to the two problems above) and do not have the proper meaning, and it is difficult to correct for this.
- The regression coefficients are biased high in absolute value and need shrinkage but this is rarely done.
- Variable selection is made arbitrary by collinearity.
- It allows us to not think about the problem.
At the core of the problem is taking statistical inference methods like p values, confidence intervals and ANOVA F tests that were designed for, and are only valid with, a pre-specified model, and applying them instead to a model whose structure we have chosen based on the data. The variables are selected partly by chance, and we give ourselves a sneaky head start towards finding a variable “significant”.
Basically, this is the sort of thing that leads to the reproducibility crisis in science.
Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients. But most of the time that I see the method used (including recent examples being distributed by so-called experts as part of their online teaching), the end model is indeed used for interpretation, and I have no doubt this is also the case with much published science. Further, even when the goal is only prediction, there are better methods, like the lasso, for dealing with a large number of candidate variables.
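To give a flavour of what I mean by the lasso, here is a minimal sketch with the glmnet package; it assumes X is already a numeric matrix of candidate predictors and y the response, so the names here are illustrative rather than anything from a specific analysis.

```r
library(glmnet)

# lasso (alpha = 1), with the penalty lambda chosen by cross-validation
cv_fit <- cv.glmnet(X, y, alpha = 1)

# coefficients at the cross-validated lambda; many are shrunk to exactly zero,
# and no p values are reported as though the model had been pre-specified
coef(cv_fit, s = "lambda.min")

# predictions from the penalised model
predict(cv_fit, newx = X, s = "lambda.min")
```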
Let’s look at a couple of simulations to show how this is a problem.
Increases the false positive rate even with white noise
First, let’s take a case where we simulate data that is known to have no relation at all to the response variable. In the code below I simulate 1,000 observations with 100 explanatory X variables and 1 response variable y. All of these variables are unrelated to each other and are just normally distributed with a mean of zero and standard deviation of 1.
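Something along these lines does the job (a minimal sketch; the seed and object names are illustrative rather than the exact ones behind the results quoted below):

```r
set.seed(123)
n <- 1000   # observations
p <- 100    # candidate explanatory variables

# pure noise: neither the X variables nor y contain any real signal
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("V", 1:p)))
d <- as.data.frame(X)
d$y <- rnorm(n)

# fit the full model with all 100 explanatory variables
full_model <- lm(y ~ ., data = d)

# which coefficients look 'significant' at the conventional 0.05 cut-off?
coefs <- summary(full_model)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05 & rownames(coefs) != "(Intercept)", ]
```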
When I fit a regression model of y ~ X, I should expect about five of the columns to appear ‘significant’, with a p value of 0.05 or less by the conventional standard - because that’s more or less the definition of that critical cut-off value. That is, we tolerate a 1 in 20 false positive rate. In this case we have six variables below the cut-off, about what we’d expect:
Estimate Std. Error t value Pr(>|t|)
V2 -0.06606672 0.03213918 -2.055644 0.04010514
V38 0.06881778 0.03406473 2.020206 0.04365814
V62 0.06263414 0.03137298 1.996436 0.04618768
V91 0.06826250 0.03302463 2.067018 0.03901824
V94 -0.07923079 0.03423568 -2.314275 0.02087724
V96 -0.07962012 0.03290373 -2.419790 0.01572689
Now let’s use stepwise selection, “both” directions (so we can remove variables from the model or add them), using the Akaike Information Criterion to choose a ‘better’ model at each step. This is better than just using p values, and much better than using p values and a low cut-off like 0.05, so I’m giving the stepwise method a fair go here.
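In base R the stepwise search can be done with the step function; as a sketch, starting from the full_model object fitted in the sketch above:

```r
# stepwise selection in both directions, comparing models by AIC at each step
stepped_model <- step(full_model, direction = "both", trace = FALSE)

# 'significant' coefficients of the model the algorithm settles on
coefs <- summary(stepped_model)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05 & rownames(coefs) != "(Intercept)", ]
```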
That gets us these variables showing up as ‘significant’:
Estimate Std. Error t value Pr(>|t|)
V2 -0.06312445 0.03046917 -2.071748 0.038550346
V4 -0.07353379 0.03222313 -2.282019 0.022702106
V32 -0.06120508 0.03094750 -1.977707 0.048241917
V38 0.06383031 0.03227612 1.977633 0.048250238
V90 0.06288076 0.03094938 2.031729 0.042450732
V91 0.07450724 0.03105172 2.399456 0.016605492
V94 -0.06617689 0.03208892 -2.062297 0.039442821
V96 -0.08052606 0.03073565 -2.619957 0.008930218
So the net impact of this fancy-looking automated procedure is to worsen our false positive rate from 6% to 8%.
OK, that’s just one dataset. Let’s try it with a range of others, of different sample sizes, and to make things more interesting let’s let the X variables sometimes be correlated with each other. The stepwise selection process can be a bit slow, so I spread the 700 runs of the simulation below over seven parallel processes:
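A sketch of one way to set that up with base R’s parallel package follows; the helper function, the way correlation is induced in the X variables, and the grid of sample sizes and correlations are all illustrative assumptions rather than the exact setup behind the results quoted below.

```r
library(parallel)

# Illustrative helper: generates one pure-noise data set with the given sample
# size and correlation between the X variables, fits the full and stepwise
# models, and returns the false positive rate of each.
sim_false_positives <- function(n, rho, p = 100) {
  # equicorrelated X via a shared latent variable (one simple way to do it)
  latent <- rnorm(n)
  X <- sqrt(rho) * matrix(latent, n, p) + sqrt(1 - rho) * matrix(rnorm(n * p), n, p)
  d <- as.data.frame(X)
  names(d) <- paste0("V", 1:p)
  d$y <- rnorm(n)

  full_model <- lm(y ~ ., data = d)
  stepped_model <- step(full_model, direction = "both", trace = FALSE)

  fp_rate <- function(model) {
    p_values <- summary(model)$coefficients[-1, "Pr(>|t|)"]
    sum(p_values < 0.05) / p   # false positives as a share of the p candidates
  }

  data.frame(n = n, rho = rho,
             full = fp_rate(full_model), stepwise = fp_rate(stepped_model))
}

# an illustrative grid: 7 sample sizes x 4 correlation levels x 25 repeats = 700 runs
grid <- expand.grid(n = c(200, 500, 1000, 2000, 4000, 8000, 16000),
                    rho = c(0, 0.2, 0.5, 0.8),
                    rep = 1:25)

cluster <- makeCluster(7)
clusterSetRNGStream(cluster, 123)
clusterExport(cluster, "sim_false_positives")

results <- do.call(rbind, parLapply(cluster, seq_len(nrow(grid)), function(i, grid) {
  sim_false_positives(grid$n[i], grid$rho[i])
}, grid = grid))

stopCluster(cluster)
```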
The average false positive rate of the full model is 5.1%; for the stepwise variable selection it is 9.5%. In the chart below we can see that sample size relative to the number of variables in X matters a lot here:
For example, the case with the sample size of only 200 observations gets a false positive rate above 15% from the stepwise method. But even with larger samples, we take a hit in false positives from the stepwise approach. The degree of multicollinearity in the X doesn’t seem to make much difference.
It might seem unfair to have a model with 100 explanatory variables and only 200 observations, but out there on the internet (I’m not going to link) there are guides telling you it is ok to do this procedure even when you have more variables than observations. In fact I have a horrible fear that this practice might be common in some parts of science. You can imagine how doing that is basically a machine for generating false, non-reproducible findings.
Even the correctly-retained variables’ coefficients are biased upwards
The above simulation was pure noise so everything was a false positive. What does stepwise variable selection do in a more realistic case where some of the variables are correctly in the model and are related to y?
To explore this I wrote a function (code a little way further down the blog) to simulate data with 15 correlated X variables and 1 y variable. The true model is:
y = V1 + V2 + V3 + V4 + V5 + V6 + V7 + 0.1 (V8 + V9 + V10) + e
That is, the true regression coefficients for variables V1 to V7 are 1; for V8, V9 and V10 they are 0.1; and for the remaining 5 variables there is no structural relationship to y.
When we simulate 50 data sets of this sort and use stepwise variable selection to regress y on X, here are the coefficients we get. Each point represents the coefficient for one variable from one of those runs.
The large dots on zero indicate the multiple runs in which that particular variable was not included in the final model. We see:
- On many occasions, variables V11 to V15 were rightly excluded, but on a smattering of occasions they were wrongly included in the model.
- There are a lot of false negatives: variables V1 to V10 that should be found in the model and aren’t.
- Worse, when one of variables V1 to V10 is correctly included in the final model, the coefficient estimated for it is always (in these simulations) larger than the true coefficient (which, remember, should be 1 or 0.1 - the correct values shown by the red crosses).
For comparison, here is an equivalent chart for when we fit the full model to these data. There’s a lot of variation in the coefficient estimates, but at least they’re not biased (that is, on average they are correct, their expected value is the true value):
Here’s the code for the function simulating that data and drawing the plots:
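A sketch along these lines captures the logic; the function name, default sample size, error standard deviation and correlation between the X variables are illustrative assumptions you can change, rather than the exact values behind the charts above.

```r
library(MASS)      # for mvrnorm
library(ggplot2)

# Illustrative sketch: simulates `reps` data sets of 15 correlated X variables,
# builds y from the true model y = V1 + ... + V7 + 0.1 * (V8 + V9 + V10) + e,
# fits either the full or the stepwise-selected model to each data set, and
# returns the estimated coefficient for every variable (0 if it was dropped).
sim_coefs <- function(reps = 50, n = 100, rho = 0.3, sigma_y = 2,
                      method = c("stepwise", "full")) {
  method <- match.arg(method)
  p <- 15
  true_coefs <- c(rep(1, 7), rep(0.1, 3), rep(0, 5))
  Sigma <- matrix(rho, p, p)
  diag(Sigma) <- 1

  out <- lapply(1:reps, function(i) {
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    colnames(X) <- paste0("V", 1:p)
    d <- as.data.frame(X)
    d$y <- as.vector(X %*% true_coefs) + rnorm(n, sd = sigma_y)

    model <- lm(y ~ ., data = d)
    if (method == "stepwise") {
      model <- step(model, direction = "both", trace = FALSE)
    }

    # record the estimate for every variable; 0 if it was dropped from the model
    est <- setNames(rep(0, p), paste0("V", 1:p))
    est[names(coef(model))[-1]] <- coef(model)[-1]
    data.frame(run = i,
               variable = factor(names(est), levels = paste0("V", 1:p)),
               estimate = est,
               true = true_coefs)
  })
  do.call(rbind, out)
}

# one possible way to draw the dot plot described above
res <- sim_coefs(method = "stepwise")
ggplot(res, aes(x = variable, y = estimate)) +
  geom_point(alpha = 0.5) +
  geom_point(aes(y = true), colour = "red", shape = 4, size = 3) +
  labs(title = "Coefficient estimates after stepwise selection",
       y = "Estimated coefficient (0 means the variable was dropped)")
```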
Now, you can see that there are a few arbitrary aspects to that simulation - in particular the sample size, the variance of the y relative to the X, and the multicollinearity of the X. The idea of having it as a function is that you can play around with these and see the impacts; for example, if you make the variance of y smaller compared to that of the X, the model gets better at explaining the variance and the stepwise algorithm is less prone to getting things wrong. Rather than include a whole bunch of individual cases, I ran some more simulations covering a range of such values, so we can see how the average bias in the estimated coefficients of the variables remaining in the model relates to those parameters.
So here is the relationship of that bias to the R-squared of the model, at various levels of correlation between the X variables.
And here is the relationship of the bias to sample size, standard deviation of Y, and correlation between the X all in one chart:
Of the two visualisations I probably prefer that last one, the heat map. First, it dramatically shows (all that white) that the regression estimates from the true model aren’t biased at all. Secondly, it nicely shows that the bias in the estimates returned by stepwise regression is worse
- for smaller samples
- with higher correlation between the X variables
- and with high variance of the y variable
So, finally, here’s the code for those simulations:
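Again as a sketch, reusing the hypothetical sim_coefs() function above, with an illustrative grid of parameter values rather than the exact one behind the charts:

```r
# average bias of the estimated coefficients for the variables genuinely in the
# model (V1 to V10), counting only the runs in which they were retained
params <- expand.grid(n = c(50, 100, 200, 400),
                      rho = c(0, 0.3, 0.6, 0.9),
                      sigma_y = c(1, 2, 4),
                      method = c("full", "stepwise"),
                      stringsAsFactors = FALSE)

bias_results <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  res <- sim_coefs(reps = 50,
                   n = params$n[i],
                   rho = params$rho[i],
                   sigma_y = params$sigma_y[i],
                   method = params$method[i])
  res <- subset(res, true > 0 & estimate != 0)
  data.frame(params[i, ], bias = mean(res$estimate - res$true))
}))

# heat map of average bias by sample size, correlation between the X variables,
# and standard deviation of the error in y, for each method
ggplot(bias_results, aes(x = factor(n), y = factor(rho), fill = bias)) +
  geom_tile() +
  facet_grid(method ~ sigma_y) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(x = "Sample size", y = "Correlation between the X variables",
       fill = "Average bias")
```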
That’s all folks. Just don’t use stepwise selection of variables. Don’t use automated addition and deletion of variables, and don’t take them out yourself by hand “because it’s not significant”. Use theory-driven model selection if it’s explanation you’re after; Bayesian methods are a good complement to that too, not least because they force you to think about the problem; and for regression-based prediction, use the lasso or elastic net regularization.