Simulations to explore excessive lagged X variables in time series modelling

I was once in a meeting discussing a time series modelling and forecasting challenge where it was suggested that “the beauty of regression is you just add in more variables and more lags of variables and try the combinations until you get something that fits really well”. Well, no, it doesn’t work like that; at least not in many cases with realistically small social science or public policy datasets. Today’s post looks at one particular trap - a modelling strategy that tries more lags of explanatory variables than are justified given the amount of data available.

The data

Imagine you have a time series y and two candidate explanatory variables x1 and x2. We have values of x1 and x2 for the future - perhaps because they are genuinely known in advance, perhaps generated by independent forecasts. Let’s say there is good theoretical or subject matter reason for thinking there is a relationship between them, but it’s clearly not a simple one. It’s possible the relationship is lagged - perhaps x1 and x2 are leading economic indicators and y is the result of decisions people take based on the history of the xs. The job is to forecast y conditional on X, and if possible generate some insight about the structural relationship of the xs and y, and be prepared for some scenarios where x1 or x2 might change and we want to see the impact on y.

Here’s an example of something we might see:

Those series were generated by me. We have 110 historical observations, and a forecast period of 12 for which there are observations for x1 and x2 but not y. Each series is in reality completely independent of the others, something I’ve done deliberately to illustrate the point I’m about to make. For those with an interest, they are all ARIMA(2,1,1) models with the same parameterisation but different random sequences behind them.

Basic forecasting models

Univariate

A simple strategy for forecasting y would be to find the best ARIMA model that fits the data using Rob Hyndman’s auto.arima algorithm, which has been shown to perform well. We can do this with or without considering the x variables at all. If we ignore the x variables, we get a forecast that looks like this:

Series: y 
ARIMA(0,1,2)                    

Coefficients:
         ma1     ma2
      1.4250  0.4415
s.e.  0.0865  0.0867

sigma^2 estimated as 0.7736:  log likelihood=-141.77
AIC=289.54   AICc=289.77   BIC=297.62

In fact, because we secretly know how the data was generated, we know that this is a good model of this particular data. The auto.arima algorithm picked up that the data was non-stationary and needed to be differenced once before modelling; but it got the “wrong” specification of the number of auto-regressive and moving average terms to include (this isn’t unusual, nor is it necessarily a problem). By default it uses a stepwise method to identify the best combination of up to five auto-regressive parameters and five moving average parameters.

Linear regression with time series error terms

A second, nearly as simple method of forecasting would be to fit a linear regression with an ARIMA time series error term. The auto.arima method will do this, and look after all the tests of non-stationarity and difference in response that are required. It gives easily interpretable coefficients for the relationship between y and x1 and x2. Here’s what we see in our case:

Series: y 
Regression with ARIMA(0,1,2) errors 

Coefficients:
         ma1     ma2      x1       x2
      1.4202  0.4349  0.0371  -0.0364
s.e.  0.0870  0.0878  0.0867   0.0796

sigma^2 estimated as 0.785:  log likelihood=-141.59
AIC=293.18   AICc=293.77   BIC=306.64

Basically, the model has correctly identified there is a weak or no relationship between the xs and y. The momentum of y itself is the main predictor of future y, and this shows in the forecasts which are only marginally different from our first univariate case. Using the Akaike Information Criterion, the model with x regressors is inferior to the simpler model.

More complex models

A problem with above approaches is that they don’t take into account the (wrongly) suspected lagged relationship between the xs and y, other than through the current/simultaneous in time relationship. There’s an obvious solution - introduce lagged terms of x1 and x2 into the regression. There are a variety of ways of doing this and specialist R packages like dynlm, dyn and dse that I’m not going to introduce because I want things from here on to be simple and explicit. See the CRAN time series task view for a starting point to explore further.

I want to focus on how problematic this can be. How many parameters in our model are we prepared to estimate, given our 110 historical data points? Frank Harrell’s advice in Regression Modeling Strategies is:

“Studies in which models are validated on independent datasets have shown that in many situations a fitted regression model is likely to be reliable when the number of predictors (or candidate predictors if using variable selection) p is less than m/10 or m/20, where m is the limiting sample size”

Frank Harrell

Our “limiting sample size” is 110, and our 110 time series observations are actually worth less than 110 independent ones would be because each observation contains only marginal new information for us; knowing the value at time t gives us a good indication of where it will be at time t+1, something not the case in more independent sampling situations.

This suggests we should be considering five or six, or eleven at the very most, candidate variables for inclusion. Just by using auto.arima we’ve looked at ten parameters (five autoregression lags, and five moving average parameters), and by adding in x1 and x2 we’ve definitely reached our maximum. Ouch. As we’re usually not interested in individual AR and MA parameters we can perhaps get away with stretching a point, but there’s no get out of jail card here that means we can meaningfully keep adding in more explanatory variables.

So what happens if we proceed anyway? To check this out, I tried three strategies:

Created variables for up to 20 lag periods for each of the three original variables, leaving us with 62 candidate predictors, and fit a model by ordinary least squares.
Take the full model from above and use stepwise selection to knock out individual variables with p values < 0.05 (the conventional cut-off point for a variable being “not significant”). I had to torture R’s step function to do this, because quite rightly it prefers to choose models based on AIC rather than individual p-values; I wanted to mimic the p-value method for demonstration purposes.
Take the full model and use lasso-style shrinkage/regularisation on the 62 predictors’ coefficients.

Of those methods, I think #1 and #2 are both potentially disastrous and will lead to overfitting (which means that the model will perform badly when confronted with new data ie in predicting our twelve future points). Stepwise variable selection is particularly pernicious because it can create the illusion that “significant” relationships between the variables have been found. At the end of a stepwise procedure, software can provide you with F tests, t tests and p-values but they are invalid because you have chosen the variables remaining in the model precisely because they score well on those tests. In today’s case, here is what we end up with.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.23003    0.10375   2.217 0.029642 *  
y_d_lag_1    1.15802    0.09837  11.772  < 2e-16 ***
x1_d_lag_1   0.27902    0.07908   3.528 0.000718 ***
y_d_lag_2   -1.04546    0.14885  -7.023 8.39e-10 ***
y_d_lag_3    0.78761    0.16493   4.775 8.68e-06 ***
y_d_lag_4   -0.44729    0.14928  -2.996 0.003704 ** 
x1_d_lag_4   0.24485    0.07325   3.342 0.001298 ** 
y_d_lag_5    0.33654    0.10118   3.326 0.001366 ** 
x1_d_lag_6   0.19763    0.06565   3.010 0.003555 ** 
x2_d_lag_8  -0.11246    0.05336  -2.107 0.038422 *  
x2_d_lag_16 -0.17330    0.06347  -2.731 0.007876 ** 
x1_d_lag_17  0.22910    0.07073   3.239 0.001789 ** 
x1_d_lag_18 -0.18724    0.07147  -2.620 0.010643 *  
x2_d_lag_19 -0.11310    0.05536  -2.043 0.044546 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8418 on 75 degrees of freedom
Multiple R-squared:  0.7594,	Adjusted R-squared:  0.7177 
F-statistic: 18.21 on 13 and 75 DF,  p-value: < 2.2e-16

A high R-squared, even a high “Adjusted R-squared” and many variables with p values well below the 0.05 cut-off. So (for example) the unwary analyst would conclude we have strong evidence that y is impacted by x2 16 periods ago and by x1 17 periods ago, but mysteriously not by anything from lags 9 through 15.

Remember that the true data generation process had zero influence from x1 and x2 on y or vice versa.

Working with lags in a time series situation does not give any wild cards that make stepwise selection valid. The usual problems with stepwise selection within a model building strategy apply:

\(R^2\) values are biased high
The usual F and \(\chi^2\) test statistics of the final model do not have the claimed distribution
The standard errors of coefficients are biased low and confidence intervals of effects are too narrow
p-values are far too small, do not have the proper meaning, and correction is very difficult
The coefficients of effects that remain in the model are strongly biased away from zero
Variable selection is made arbitrary
The whole process allows us to escape thinking about the problem

From Harrell’s Regression Modeling Strategies.

In the case of the full model, with 62 variables, the exact problems above do not apply. The p-values for example will be basically ok (and in fact in this case none of the x1 and x2 lags show up as significant until we start knocking out the variables one at a time). The problem is that with 64 parameters (including the intercept and variance) estimated from only 109 data points (after we have differenced the series to make them stationary), we have extremely high variance and instability in the model.

Shrinkage of the estimated coefficients via the lasso is one way of fixing this, and indeed the lasso is generally used precisely in this situation of excessive candidate explanatory variables when predictive power is important. We trade off some bias in the coefficients for better stability and predictive power. Using the lasso in this case reduces nearly all the x coefficients to zero, and lags 1 and 2 of the y variable staying in the model as they should.

Comparison of results

Here’s the actual forecasts of the five methods we’ve tried so far:

What we see is that the two methods I’ve labelled “bad” give very similar results - a sharp decline. As a result of their overfitting they are both picking up structure in the xs that does not really impact on y.

The three “ok” methods give results that are more in tune with the secret data generating process. The most encouraging thing for me here is the success of the lasso in avoiding the pitfalls of the stepwise method. If there really is a complex lagged relationship between the xs and y, this method of looking at all 62 variables and forcing a relatively stable estimation process at least gives the analyst a sporting chance of finding it. The orderedLasso R package - which I haven’t tried using yet - looks to take this idea further in implementing the idea of an “Ordered Lasso and Sparse Time-lagged Regression”, and I’m pretty confident is worth a look.

Final word - this is an illustration rather than a comprehensive demonstration. Obviously, the thing to be done to really prove the issues would be to generate many such cases, including “real” future values of y for test purposes, and test the error rates of the forecasts from the different methods. I don’t particularly feel the need to do that, but maybe one day if I run out of things to do will give it a go.

Code

library(tidyverse)
library(scales)
library(forecast)
library(glmnet)
library(forcats)
library(directlabels)

#===========generate data============
n <- 110
h <- 12

set.seed(123)
y <- ts(cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n) + 0.2))
x1 <- cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n + h) - 0.2)
x2 <- cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n + h))

orig_data <- data_frame(Value = c(y, x1, x2),
           Variable = rep(c("y", "x1", "x2"), times = c(n, n + h, n + h)),
           TimePeriod = c(1:n, 1:(n+h), 1:(n+h))) 

orig_data %>%
   ggplot(aes(x = TimePeriod, y = Value)) +
   geom_line() +
   facet_wrap(~Variable) +
   ggtitle("Original data",
           "Three unrelated (but that is unknown to us) univariate time series")

X_full <- cbind(x1, x2)
X_historical <- X_full[1:n, ]
X_future <- X_full[-(1:n), ]


# A version that has been differenced once....
YX_diff_lags <- data_frame(
   y_d = c(diff(y), rep(NA, h)),
   x1_d = diff(x1),
   x2_d = diff(x2))

# And has lagged variables up to lag period of 20 for each variable:
lagnumber <- 20

for (i in 1:lagnumber){
   YX_diff_lags[, ncol(YX_diff_lags) + 1] <- lag(YX_diff_lags$y_d, i)
   YX_diff_lags[, ncol(YX_diff_lags) + 1] <- lag(YX_diff_lags$x1_d, i)
   YX_diff_lags[, ncol(YX_diff_lags) + 1] <- lag(YX_diff_lags$x2_d,i)
}

names(YX_diff_lags)[-(1:3)] <- 
   paste0(names(YX_diff_lags)[1:3], "_lag_", rep(1:lagnumber, times = rep(3, lagnumber)))
       
if(ncol(YX_diff_lags) != 3 + 3 * lagnumber){
   stop("Wrong number of columns; something went wrong")
}
#===================Modelling options========

#-------------univariate-----------------
mod1 <- auto.arima(y)
fc1 <- forecast(mod1, h = h)

autoplot(fc1)

#------------xreg, no lags----------------
mod2 <- auto.arima(y, xreg = X_historical)
fc2 <- forecast(mod2, xreg = X_future)

autoplot(fc2)

#-----------xreg, lags----------------
YX_hist <- YX_diff_lags[complete.cases(YX_diff_lags), ]

mod3 <- lm(y_d ~ ., data = YX_hist)

#' Forecast from a regression model with lags one row at a time
#' Assumes the existence of: YX_diff_lags & y.
#' Has to forecast one row at a time, then populate the explanatory data frame
#' with the lagged values possible after that forecast.
onerow_forecast <- function(model){
   prediction_data <- YX_diff_lags[is.na(YX_diff_lags$y_d), ]
   y_names <- names(YX_diff_lags)[grepl("y_d", names(YX_diff_lags))]
   
   for(i in 1:nrow(prediction_data)){
      for(j in 2:nrow(prediction_data)){
         for(k in 2:length(y_names)){
            prediction_data[j , y_names[k]] <- prediction_data[j-1 , y_names[k - 1]]
         }
      }
      
      if(class(model) == "cv.glmnet"){
         new_y <- predict(model, newx = as.matrix(prediction_data[i, -1]))
      } else {
         new_y <- predict(model, newdata = prediction_data[i, -1])   
      }
      
      prediction_data[i , "y_d"] <- new_y
   }
   
   # de-diff ie return to the original, non-differenced scale
   fc_y <- cumsum(c(as.numeric(tail(y, 1)), prediction_data$y_d))
   return(fc_y[-1])
}

fc3 <- onerow_forecast(mod3)

#-------xreg, lags, stepwise----------------
# high value of k to mimic stepwise selection based on p values:
mod4 <- step(mod3, k = 4.7, trace = 0)
fc4 <- onerow_forecast(mod4)

#-----------xreg, lags, lasso---------------
# Use cross-validation to determine best value of lambda, for how much
# shrinkage to force:
mod5 <- cv.glmnet(y = YX_hist$y_d, x = as.matrix(YX_hist[ , -1]))
fc5 <- onerow_forecast(mod5)     

#====================compare results==============
p <- data_frame(Univariate = fc1$mean, 
           SimpleXreg = fc2$mean, 
           AllLags = fc3, 
           StepLags = fc4, 
           LassoLags = fc5,
           TimePeriod = 1:h + n) %>%
   gather(Variable, Value, -TimePeriod) %>%
   mutate(Judge = ifelse(grepl("Lags", Variable) & !grepl("Lasso", Variable),
                         "Bad", "OK")) %>%
   mutate(Variable = fct_relevel(Variable, c("AllLags", "StepLags"))) %>%
   ggplot(aes(x = TimePeriod, y = Value, colour = Variable)) +
   facet_wrap(~Judge, ncol = 1) +
   geom_line(size = 1.2) +
   scale_colour_brewer(palette = "Set1") +
   theme(legend.position = "right") +
   geom_line(data = filter(orig_data, Variable == "y"), colour = "black", linetype = 3) +
   xlim(c(0, 130)) +
   annotate("text", x = 25, y = 15, label = "Historical data", 
            colour = "grey40", family = "myfont") +
   ggtitle("Excessive inclusion of lagged explanatory variables leads to overfitting",
           "Regularization by lasso, or using a simpler model in the first place, gives much better results.") +
   labs(caption = "Simulations from forecasting one time series based on two unrelated time series.")

direct.label(p)