Why time series forecasts prediction intervals aren't as good as we'd hope

At a glance:

A quick demonstration of the impact of inevitably random estimates of the parameters and meta-parameters in ARIMA time series modelling

07 Dec 2016

Five different sources of error

When it comes to time series forecasts from a statistical model we have five sources of error:

Random individual errors
Random estimates of parameters (eg the coefficients for each autoregressive term)
Uncertain meta-parameters (eg number of autoregressive terms)
Unsure if the model was right for the historical data
Even given #4, unsure if the model will continue to be right

A confidence interval is an estimate of the statistical uncertainty of the estimated parameters in the model. It usually estimates the uncertainty source #2 above, not interested in #1 and conditional on the uncertainty of sources #3, #4 and #5 all being taken out of the picture. A prediction interval should ideally take all five sources into account (see Rob Hyndman for more on the distinction between prediction and confidence intervals).

Unfortunately, the standard ways of providing time series prediction intervals typically only take source #1 into account - random individual errors. This differs from standard prediction intervals from more straightforward regression and generalized linear models, which at least usually factor in uncertainty of the estimates of parameters.

The problem is that for all but the most trivial time series forecasting method there is no simple way of estimating the uncertainty that comes from having estimated the parameters from the data, and much less so the values of meta-parameters like the amount of differencing needed, how many autoregressive terms, how many moving average terms, etc (those example meta-parameters come from the Box-Jenkins ARIMA approach, but other forecasting methods have their own meta-parameters to estimate too).

Demonstration of the cost of estimating meta-parameters

Here’s a simple simulation to show the cost of estimating the meta-parameters, even when sources of error #4 and #5 can be discounted. I generated 10,000 time series, each 100 observations long, from an ARIMA(1, 1, 1) process and split them into a 90 observation training set and 10 observation test set. This creates data that looks like this (just four examples shown):

examples

For each one of those datasets, I fit an ARIMA model to the 90 observation test set two ways:

using forecast::Arima and advising the algorithm that the model to use is (1,1,1)
using forecast::auto.arima and forcing the algorithm to estimate from the data the correct level of differencing, number of autoregression terms, and number of moving average terms.

Then I compared the resulting forecast prediction intervals (at 80% and 95% confidence levels) of the last 10 observations to see what percentage of forecasts actually had the true values in their range. Here’s the result:

result

As we can see, the Arima model which knew in advance the correct meta-parameter specification didn’t do too badly. Its prediction intervals contain the correct values only just below the promised 80% and 95% of times. But the auto.arima method, which had to estimate the meta-parameters itself, did noticeably worse. Not only that, as the forecast horizon increases, its prediction intervals got increasingly over-optimistic. Having the wrong meta-parameters for your model becomes more of a problem the further you go out (I suspect this is particularly the case for the amount of differencing that needs to be done - an unlucky choice here would have big implications for estimating the trend).

An important thing to remember is that out in the wild, results will generally be worse than this. In this occasion, both methods had the luxury of fitting the same family of models that had generated the data, and knowing that the final ten observations were generated the same way. No outliers, black swans, or changes in data generating process to worry about. One way of partially dealing with this problem with careful use of the method implemented in the forecastHybrid R package by David Shaub and myself was the subject of my talk yesterday at the Australian Statistical Conference.

Code

Here’s the code that does the simulation above:

library(forecast)
library(foreach)
library(doParallel)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(stringr)
library(grid)
library(colorspace)


#=============prediction interval functions===================
#' takes a forecast object, and the actual results, and returns a data frame
#' binary indicator of success as well as the horizon for each row and an optional
#' "type" column.
accuracy_pi <- function(fc, actual, type = NULL, labs = c("80%", "95%")){
   if(nrow(fc$upper) != length(actual)){
      stop("`fc` must be a forecast object with `upper` and `lower` objects that are matrices with number of rows equal to length of `actual`")
   }
   actual <- as.vector(actual)
   h <- length(actual)
   # take advantage of actual being a vector that will recycle, going down each
   # column of the two matrices of logical conditions we superimpose here:
   tmp <- as.data.frame(fc$lower < actual & fc$upper > actual)
   names(tmp) <- labs
   tmp <- cbind(tmp, 
                "h" = 1:h,
                "type" = rep(type, h))
   return(tmp)
}

#===================simulations=======================
n <- 100 # size of full series
h <- 10  # how much of the series to forecast forward
R <- 10000 # how many datasets to try this with
the_model <- list(ar = 0.5, ma = 0.5)

cluster <- makeCluster(7) # only any good if you have at least 7 processors :)
registerDoParallel(cluster)

clusterEvalQ(cluster, {
   library(forecast)
   
})

results <- foreach(1:R, .combine = rbind) %dopar% {
   sim_data <- ts(cumsum(arima.sim(model = the_model, n = n)))
   
   x <- sim_data[1:(n - h)]
   actual <- sim_data[(n - h + 1):n]
   mod1 <- auto.arima(x)
   fc1 <- forecast(mod1, h = h)
   
   mod2 <- Arima(x, order = c(1, 1, 1))
   fc2 <- forecast(mod2, h = h)
   rbind(
      accuracy_pi(fc1, actual = actual, type = "auto"),
      accuracy_pi(fc2, actual = actual, type = "pre-knowledge")
   )
}


stopCluster(cluster)

#==========present results==================
pal <- rainbow_hcl(2)

targetlines <- data.frame(
   limit = c("80%", "95%"),
   yintercept = c(0.8, 0.95)
)

results %>%
   gather(limit, success, -h, -type) %>%
   group_by(h, type, limit) %>%
   summarise(success = mean(success)) %>%
   ggplot(aes(x = h, y = success, colour = type)) +
   facet_wrap(~limit) +
   geom_line() +
   scale_y_continuous(label = percent) +
   ggtitle("Prediction interval coverage of two modelling methods",
           subtitle = "Comparing auto.arima with pre-knowledge of the true data generating process") +
   labs(x = "Number of periods forecast forward",
        caption = "Data generated with an ARIMA(1,1,1) process, which is much more regular than real life data.",
        colour = "Modelling approach:") +
   scale_colour_manual(values = pal) +
   scale_x_continuous(breaks = 0:5 * 2) +
   geom_hline(data = targetlines, aes(yintercept = yintercept), colour = "violet", linetype = 2, size = 1.1)

grid.text(0.3, y= 0.65,
          label = str_wrap("When the true model family and the true meta-parameters p, d, q are known, coverage is close to the 80% and 95% promised.", 33),
          gp = gpar(family = "myfont", col = pal[2], cex = 0.8))
grid.text(0.8, y= 0.45,
          label = str_wrap("When the meta parameters p, d and q are estimated from the data, coverage is materially less.", 33),
          gp = gpar(family = "myfont", col = pal[1], cex = 0.8))

grid.text(0.9, y = 0.779, label = "Gap from estimating\nmeta-parameters",
          gp = gpar(family = "myfont", col = "grey50", cex = 0.7))

grid.text(0.85, y = 0.840, label = "Gap from estimating parameters",
          gp = gpar(family = "myfont", col = "grey50", cex = 0.7))
 
#=================demo data============
# four example datasets
set.seed(123) # for reproducibility
par(mfrow = c(2, 2), bty = "l", font.main = 1)
for(i in 1:4){
   
   sim_data <- ts(cumsum(arima.sim(model = the_model, n = n)))
   
   x <- sim_data[1:(n - h)]
   actual <- sim_data[(n - h + 1):n]
   
   mod2 <- Arima(x, order = c(1, 1, 1))
   fc2 <- forecast(mod2, h = h)   
   plot(fc2, main = paste0("Simulated ARIMA(1,1,1) data: ", i))
   lines((n - h + 1):n, actual, col = "red")
}

← Previous post

Error, trend, seasonality - ets and its forecast model friends

Extrapolation is tough for trees!

My day job is Director of the Statistics for Development Division at the Pacific Community, the principal scientific and technical organisation in the Pacific region, proudly supporting development since 1947. We are an international development organisation owned and governed by our 27 country and territory members. This blog is not part of my role there and contains my personal views only.

Follow this blog with RSS.

Find me on Bluesky or Mastodon.

I'm pleased to be aggregated at R-bloggers, the one-stop shop for blog posts featuring R.

I've never been an academic but have written a few small things and hence have a Google Scholar profile

I read a lot of books. In the unlikely event you want to see what I think about them or what I'm reading at the moment, follow me on Goodreads.

free range statistics by Peter Ellis is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Why time series forecasts prediction intervals aren't as good as we'd hope

At a glance:

Five different sources of error

Demonstration of the cost of estimating meta-parameters

Code

Bluesky feed: