Stepwise selection of variables in regression is Evil.

At a glance:

Stepwise variable selection is bad and dangerous, and you shouldn't do it. It increases false positives. It drops variables that should be in the model. It gives biased estimates of the regression coefficients. The problems are worse with smaller samples, with higher correlation between the X variables, and in models with weaker explanatory power for y (i.e. lower R-squared).

14 Sep 2024


I’ve recently noticed that stepwise regression is still fairly popular, despite being well and truly frowned upon by well-informed statisticians. By stepwise regression, I mean any modelling strategy that involves adding or subtracting variables from a regression model on the basis that they are “significant”, reduce the Akaike Information Criterion, increase adjusted R-squared, or improve any other data-driven statistic.

This might be an automated variable selection procedure, or it might be a matter of eyeballing the results of the first model you fit and saying (for example) “Let’s take literacy out, its p-value isn’t significant, and it will be a more parsimonious model once we do that.”

And then people produce and report t tests, F tests, and so on, as though the end model was the one they always intended to run.

Let me be clear about this. This is wrong. It’s not as disastrously wrong as, say, sorting the data separately one column at a time before you fit your model, but it’s still objectively bad. As my professor once told our class:

“If you choose the variables in your model based on the data and then run tests on them, you are Evil; and you will go to Hell.”

Why is it wrong? Here are the seven reasons given by Frank Harrell in his must-read classic, Regression Modeling Strategies:

  1. The R-squared or even adjusted R-squared values of the end model are biased high.
  2. The F and Chi-square test statistics of the final model do not have the claimed distribution.
  3. The standard errors of coefficient estimates are biased low and confidence intervals for effects and predictions are falsely narrow.
  4. The p values are too small (there are severe multiple comparison problems in addition to problems 2. and 3.) and do not have the proper meaning, and it is difficult to correct for this.
  5. The regression coefficients are biased high in absolute value and need shrinkage but this is rarely done.
  6. Variable selection is made arbitrary by collinearity.
  7. It allows us to not think about the problem.

At the core of the problem is taking statistical inference methods like p values, confidence intervals and ANOVA F tests, which were designed for and are valid with a pre-specified model, and applying them instead to a model we have structured based on the data. The variables are selected partly based on chance, and we are giving ourselves a sneaky head start towards finding a variable ‘significant’.

Basically, this is the sort of thing that leads to the reproducibility crisis in science.

Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients. But most of the time that I see the method used (including recent examples being distributed by so-called experts as part of their online teaching), the end model is indeed used for interpretation, and I have no doubt this is also the case with much published science. Further, even when the goal is only prediction, there are better methods, such as the lasso, for dealing with a large number of candidate variables.

Let’s look at a couple of simulations to show how this is a problem.

Increases the false positive rate even with white noise

First, let’s take a case where we simulate data that is known to have no relation at all to the response variable. In the code below I simulate 1,000 observations with 100 explanatory X variables and one response variable y. All of these variables are unrelated to each other and are just normally distributed with a mean of zero and standard deviation of 1.

library(MASS)
library(tidyverse)
library(glue)
library(foreach)
library(doParallel)

#--------------------X not related to y--------------------

set.seed(42)
k <- 100
n <- 1000
Sigma <- matrix(0, nrow = k, ncol = k)
diag(Sigma) <- 1

noise <- mvrnorm(n = n, mu = rep(0, k), Sigma = Sigma) |>
  as.data.frame() |>
  as_tibble() |>
  mutate(y = rnorm(n))

full_model <- lm(y ~ ., data = noise)

# we get 6 variables that look 'significant' - about what
# we'd expect, about 5% false positives:
summary(full_model)$coefficients |>
  as.data.frame() |>
  filter(`Pr(>|t|)` < 0.05)

When I fit a regression model of y ~ X, I should expect about five of the columns to appear ‘significant’ at the conventional p value cut-off of 0.05 - because that’s more or less the definition of that critical value. That is, we tolerate a 1 in 20 false positive rate. In this case we have six variables below the cut-off, about what we’d expect:

       Estimate Std. Error   t value   Pr(>|t|)
V2  -0.06606672 0.03213918 -2.055644 0.04010514
V38  0.06881778 0.03406473  2.020206 0.04365814
V62  0.06263414 0.03137298  1.996436 0.04618768
V91  0.06826250 0.03302463  2.067018 0.03901824
V94 -0.07923079 0.03423568 -2.314275 0.02087724
V96 -0.07962012 0.03290373 -2.419790 0.01572689
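
Just to spell out that expectation with a quick calculation of my own (not code from the original post): with 100 roughly independent true-null tests at the 0.05 level we expect about five 'hits', and seeing six is entirely unremarkable.

# expected number of false positives from 100 true-null tests at p < 0.05
k * 0.05

# probability of seeing six or more false positives by chance alone - roughly 0.4
pbinom(5, size = k, prob = 0.05, lower.tail = FALSE)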

Now let’s use stepwise selection, “both” directions (so we can remove variables from the model or add them), using the Akaike Information Criterion to choose a ‘better’ model at each step. This is better than just using p values, and much better than using p values and a low cut-off like 0.05, so I’m giving the stepwise method a fair go here.

stepped <- step(full_model)

summary(stepped)$coefficients |>
  as.data.frame() |>
  filter(`Pr(>|t|)` < 0.05)

That gets us these variables showing up as ‘significant’:

       Estimate Std. Error   t value    Pr(>|t|)
V2  -0.06312445 0.03046917 -2.071748 0.038550346
V4  -0.07353379 0.03222313 -2.282019 0.022702106
V32 -0.06120508 0.03094750 -1.977707 0.048241917
V38  0.06383031 0.03227612  1.977633 0.048250238
V90  0.06288076 0.03094938  2.031729 0.042450732
V91  0.07450724 0.03105172  2.399456 0.016605492
V94 -0.06617689 0.03208892 -2.062297 0.039442821
V96 -0.08052606 0.03073565 -2.619957 0.008930218

So the net impact of this fancy-looking automated procedure is to worsen our false positive rate from 6% to 8%.

OK, that’s just one dataset. Let’s try it with a range of others, of different sample sizes, and to make things more interesting let’s let the X variables sometimes be correlated with each other. The stepwise selection process can be a bit slow, so I spread the 700 runs of the simulation below over seven parallel processes:

# set up parallel processing cluster
cluster <- makeCluster(7) # only any good if you have at least 7 processors :)
registerDoParallel(cluster)

clusterEvalQ(cluster, {
  library(foreach)
  library(tidyverse)
  library(MASS)
  library(glue)
  library(scales)
})

noise_results <- foreach(i = 1:700, .combine = rbind) %dopar% {
  set.seed(i)
  
  k <- 100
  n <- sample(c(200, 400, 800, 1600), size = 1)
  
  r <- runif(1, 0, 0.7)
  Sigma <- matrix(r, nrow = k, ncol = k)
  diag(Sigma) <- 1
  
  noise <- mvrnorm(n = n, mu = rep(0, k), Sigma = Sigma) |>
    as.data.frame() |>
    as_tibble() |>
    mutate(y = rnorm(n))
  
  full <- lm(y ~ ., data = noise)
  stepped <- stepAIC(full, trace = FALSE)
  
  # count the false positives
  false_pos1 <- summary(full)$coefficients |>
    as.data.frame() |>
    filter(`Pr(>|t|)` < 0.05) |>
    nrow()
  
  false_pos2 <- summary(stepped)$coefficients |>
    as.data.frame() |>
    filter(`Pr(>|t|)` < 0.05) |>
    nrow()
  
  tibble(seed = i, full = false_pos1, stepped = false_pos2, n = n, r = r)
}

summary(noise_results)

noise_results |>
  rename(`All variables included` = full,
         `Stepwise selection of variables` = stepped) |>
  mutate(n2 = glue("Sample size: {n}"),
         n2 = fct_reorder(n2, n)) |>
  gather(method, value, -seed, -n, -n2, -r) |>
  ggplot(aes(x = r, y = value, colour = method)) +
  facet_wrap(~n2) +
  geom_hline(yintercept = 5) +
  geom_smooth(se = FALSE, method = "gam", formula = y ~ s(x)) +
  geom_point() +
  labs(y = "Number of false positives -\nvariables returned as 'significant'",
       x = "Correlation of the X predictor variables",
       colour = "",
       title = "False positive rates when using stepwise variable selection",
       subtitle = "Models with 100 X explanatory variables that are in truth unrelated to Y; expecting 5 falsely 'significant' variables. 
Small sample sizes make the false positive problem for stepwise selection of variables; multicollinearity in the X when no relation to the Y doesn't matter.")

The average false positive rate of the full model is 5.1%; for the stepwise variable selection it is 9.5%. In the chart below we can see that sample size relative to the number of variables in X matters a lot here:

For example, the case with the sample size of only 200 observations gets a false positive rate above 15% from the stepwise method. But even with larger samples, we take a hit in false positives from the stepwise approach. The degree of multicollinearity in the X doesn’t seem to make much difference.
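
For the record, those overall averages come straight out of the simulation results; this summary line is my own addition rather than code from the post:

# mean count of false positives out of 100 candidate variables, so effectively
# a percentage: roughly 5.1 for the full model and 9.5 for stepwise selection
noise_results |>
  summarise(across(c(full, stepped), mean))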

It might seem unfair to have a model with 100 explanatory variables and only 200 observations, but out there on the internet (I’m not going to link) there are guides telling you it is ok to do this procedure even when you have more variables than observations. In fact I have a horrible fear that this practice might be common in some parts of science. You can imagine how doing that is basically a machine for generating false, non-reproducible findings.
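
Here is a minimal sketch (my own, not from the post) of what that situation looks like: 100 pure-noise columns but only 50 observations. The full model cannot even be estimated, so any stepwise path through that space is steered entirely by noise.

# 100 noise predictors (k is still 100 from the first simulation) but only 50 observations
set.seed(123)
n_small <- 50
noise_wide <- mvrnorm(n = n_small, mu = rep(0, k), Sigma = diag(k)) |>
  as.data.frame() |>
  as_tibble() |>
  mutate(y = rnorm(n_small))

# the full model is not even estimable:
full_wide <- lm(y ~ ., data = noise_wide)
sum(is.na(coef(full_wide)))  # 51 of the 101 coefficients cannot be estimated (returned as NA)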

Even the correctly-retained variables’ coefficients are biased away from zero

The above simulation was pure noise so everything was a false positive. What does stepwise variable selection do in a more realistic case where some of the variables are correctly in the model and are related to y?

To explore this I wrote a function (code a little way further down the blog) to simulate data with 15 correlated X variables and one y variable. The true model is:

y = V1 + V2 + V3 + V4 + V5 + V6 + V7 + 0.1 (V8 + V9 + V10) + e

That is, the true regression coefficients for variables V1 to V7 are 1; for V8, V9 and V10 they are 0.1; and for the remaining five variables there is no structural relationship to y.

When we simulate 50 data sets of this sort and use stepwise variable selection to regress y on X, here are the coefficients we get. Each point represents the coefficient for one variable from one of those runs.

The large dots on zero indicate the multiple runs in which that particular variable was not included in the final model. We see:

  • On many occasions, variables V11 to V15 were rightly excluded, but on a smattering of occasions they do get included in the model.
  • There are a lot of false negatives - variables V1 to V10 that should be found in the model and aren’t.
  • Worse, when one of variables V1 to V10 is correctly included in the final model, the coefficient estimated for it is always (in these simulations) larger than the true coefficient (which, remember, should be 1 or 0.1 - the correct values shown by the red squares).

For comparison, here is an equivalent chart for when we fit the full model to these data. There’s a lot of variation in the coefficient estimates, but at least they’re not biased (that is, on average they are correct, their expected value is the true value):

Here’s the code for the function simulating that data and drawing the plots:

#---------------------when X is related to y--------------

#' @param xcm correlation of the X variables with each other (all X variables
#'   have the same variance / sd of 1, so this is also their covariance)
#' @param ysdm standard deviation of the random noise added to y, relative to
#'   the X variables (which all have a standard deviation of 1)
#' @param n sample size
#' @param k number of columns in X. Only currently works if this is 15 (because of the hard-coded true_coef)
#' @param runs number of simulations to run
#' @param seed random seed for reproducibility
sim_steps <- function(xcm = 0.4, ysdm = 9, n = 200, k = 15, runs = 50, seed = 321){
  
  set.seed(seed)
  
  results <- tibble()
  
  true_coef <- c(rep(1, 7), rep(0.1, 3), rep(0, 5)) # 1 for V1-V7, 0.1 for V8-V10, 0 for V11-V15
  
  for(i in 1:runs){
    Sigma <- matrix(xcm, nrow = k, ncol = k)
    diag(Sigma) <- 1
    
    # Sigma, not sigma squared
    m <- mvrnorm(n = n, mu = rep(0, k), Sigma = Sigma) 
    
    d <- m |>
      as.data.frame() |>
      mutate(y = m %*% true_coef + rnorm(n(), 0, ysdm))
    
    mod <- lm(y ~ ., data = d)
    
    step_mod <- stepAIC(mod, trace = FALSE)
    
    cm <- coef(mod)[-1]  
    csm <- coef(step_mod)[-1]
    
    results <- rbind(results,
                       tibble(
                          variable = names(csm),
                          coefficient = csm,
                          run = i,
                          model = "Stepwise"
                       ),
                       tibble(
                         variable = names(cm),
                         coefficient = cm,
                         run = i,
                         model = "Full model"
                       )
                     )
  }
  
  true_coef_df <- tibble(
    variable = factor(names(d)[1:k], levels = names(d)[1:k]),
    coefficient = true_coef
  )
  
  results <- results |>
    complete(run, variable, model, fill = list(coefficient = 0)) |>
    mutate(variable = factor(variable, levels = true_coef_df$variable)) 
  
  biases_df <- results |>
    filter(coefficient != 0) |>
    left_join(true_coef_df, by = "variable") |>
    group_by(model) |>
    summarise(bias = round(mean(coefficient.x - coefficient.y), 2)) |>
    mutate(r2 = c(summary(mod)$r.squared, summary(step_mod)$r.squared),
           xcm = xcm,
           ysdm = ysdm,
           n = n)
  
  biases <- pull(biases_df, bias)
  names(biases) <- pull(biases_df, model)
  
  mclabel <- case_when(
    abs(xcm) < 0.05 ~ "negligible",
    abs(xcm) < 0.19 ~ "mild",
    abs(xcm) < 0.39 ~ "medium",
    TRUE            ~ "strong"
  ) |>
    paste("multicollinearity")
  
  p1 <- results |>
    filter(model == "Stepwise") |>
    count(variable, coefficient) |>
    ggplot(aes(x = coefficient, y = variable)) +
    geom_point(aes(size = n)) +
    geom_point(data = true_coef_df, colour = "red", size = 4, shape = 12) +
    scale_size_area(breaks = c(1, 1:4 * 10)) +
    labs(x = "Coefficient value",
         y = "Variable",
         size = "Number of observations:\n(usually only one, except when coefficient dropped altogether)",
         title = "Stepwise regression returns coefficient estimates biased away from zero",
         subtitle = glue("Black dots show coefficient estimates from one run of a stepwise (AIC-based) model fitting. Red squares show correct values.
Coefficients of variables left in the model are on average {biases['Stepwise']} too large (compared to real value of 0, 0.1 or 1).
Also, real explanatory variables are often dropped. Fake ones are often included."),
         caption = glue("Simulated data with a model that explains about {percent(summary(step_mod)$r.squared)} of variation in response variable, with {mclabel}, by https://freerangestats.info"))
  
  
  
  p2 <- results |>
    filter(model != "Stepwise") |>
    count(variable, coefficient) |>
    ggplot(aes(x = coefficient, y = variable)) +
    geom_point() +
    geom_point(data = true_coef_df, colour = "red", size = 4, shape = 12) +
    labs(x = "Coefficient value",
         y = "Variable",
         title = "Using the full model returns unbiased coefficient estimates",
         subtitle = glue("Black dots show coefficient estimates from one run of a all-variables-in model fitting. Red squares show correct values.
Coefficients of variables left in the model are on average {biases['Full model']} too large (compared to real value of 0, 0.1 or 1)."),
         caption = glue("Simulated data with a model that explains about {percent(summary(mod)$r.squared)} of variation in response variable, with mild multicollinearity, by https://freerangestats.info"))
  
  return(list(
    results = results,
    p1 = p1,
    p2 = p2,
    biases_df = biases_df
  ))
}

my_sim <- sim_steps()
my_sim$p1 # plot results for stepwise selection
my_sim$p2 # plot results for full model

Now, you can see that there are a few arbitrary aspects to that simulation - in particular the sample size, the variance of the y relative to the X, and the multicollinearity of the X. The idea of having it as a function is that you can play around with these and see the impacts; for example, if you make the variance of y smaller compared to that of the X, the model gets better at explaining the variance and the stepwise algorithm is less prone to getting things wrong (see the quick example below). Rather than include a whole bunch of individual cases, I ran some more simulations covering a range of such values, so we can see how the average bias in the estimated coefficients of the variables remaining in the model relates to those parameters.
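
As one quick illustration of that sort of experimentation (this particular call is mine, not one from the post):

# a less noisy response than the default ysdm = 9, so the model explains more of
# the variance in y; the average bias of the stepwise coefficients should come
# out noticeably smaller than in the default run
easier_sim <- sim_steps(ysdm = 3)
easier_sim$biases_df
easier_sim$p1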

So here is the relationship of that bias to the R-squared of the model, at various levels of correlation between the X variables.

And here is the relationship of the bias to sample size, standard deviation of Y, and correlation between the X all in one chart:

Of the two visualisations I probably prefer that last one, the heat map. First, it dramatically shows (all that white) that the regression estimates from the full model aren’t biased at all. Secondly, it nicely shows that the bias in the estimates returned by stepwise regression is worse:

  • for smaller samples
  • with higher correlation between the X variables
  • and with high variance of the y variable

So, finally, here’s the code for those simulations:

# parameters to run this for:
var_params <- expand_grid(xcm = 0:9 / 10, ysdm = 1:9 * 3, n = c(200, 2000))

# export onto the cluster some objects we need to use:
clusterExport(cluster, c("sim_steps", "var_params"))

# run all the simulations
many_params <- foreach(i = 1:nrow(var_params), .combine = rbind) %dopar% {
  res <- sim_steps(xcm = var_params[i, ]$xcm, 
                   ysdm = var_params[i, ]$ysdm,
                   n = var_params[i, ]$n,
                   runs = 200)
  res$biases_df
}

# linechart plot:
many_params |>
  mutate(n = glue("n = {n}")) |>
  ggplot(aes(x = r2, y = bias, colour = model)) +
  facet_grid(n ~ as.ordered(xcm)) +
  geom_smooth(se = FALSE, method = "loess", formula = y ~ x, span = 0.8) +
  geom_point() +
  scale_x_continuous(breaks = c(0.2, 0.8)) +
  labs(x = "R-squared (proportion of Y's variance explained by model)",
       y = "Average bias of estimated variable coefficients",
       title = "Bias in regression coefficients after stepwise selection of variables",
       subtitle = "Bias is worst with small samples, models with low R-squared, and correlation in the explanatory variables (shown from 0.1 to 0.9).
Model with 15 explanatory 'X' variables. Correct values of coefficients are 0, 0.1 or 1;  so a bias of +1 is very serious.",
colour = "Model fitting procedure")

# heatmap plot:
many_params |>
  mutate(n = glue("n = {n}")) |>
  ggplot(aes(x = xcm, y = ysdm, fill = bias)) +
  facet_grid(model~n) +
  geom_tile() +
  scale_fill_gradientn(colours = c("white", "steelblue", "darkred")) +
  labs(x = "Correlation between the X variables",
       y = "Standard deviation of the Y variable",
       fill = "Average bias of estimated variable coefficients:",
       title = "Bias in regression coefficients after stepwise selection of variables",
       subtitle = "Bias is worst with small samples, high variance response, and correlation in the explanatory variables.
Model with 15 explanatory 'X' variables. Correct values of coefficients are 0, 0.1 or 1;  so a bias of +1 is very serious.")

That’s all folks. Just don’t use stepwise selection of variables. Don’t use automated addition and deletion of variables, and don’t take them out yourself by hand “because it’s not significant”. Use theory-driven model selection if it’s explanation you’re after; Bayesian methods are a good complement to that, too, and force you to think about the problem. For regression-based prediction, use the lasso or elastic net regularization.
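
To make that last suggestion concrete, here is a minimal sketch of the lasso on the pure-noise data from the first simulation, using glmnet (my own illustration, not code from the original post): the penalty shrinks nearly every coefficient to exactly zero rather than inventing 'significant' variables.

library(glmnet)

# model matrix and response from the first pure-noise dataset above
X <- as.matrix(dplyr::select(noise, -y))
cv_fit <- cv.glmnet(X, noise$y, alpha = 1)  # alpha = 1 is the lasso; 0 < alpha < 1 gives elastic net

# coefficients at the cross-validated lambda; expect (almost) all to be exactly zero
coef(cv_fit, s = "lambda.1se")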
