Prediction intervals for ensemble time series forecasts

Peter Ellis
December 2016

The only credible test...

M1

Period      DEMOGR  INDUST  INDUSTRIAL  MACRO1  MACRO2  MICRO1  MICRO2  MICRO3
MONTHLY         75     183           0      64      92      10      89     104
QUARTERLY       39      17           1      45      59       5      21      16
YEARLY          30      35           0      30      29      16      29      12

Makridakis et al, 1982
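These counts come from the M1 object in the Mcomp R package. A sketch of how a table like this can be reproduced, assuming Mcomp's period and type fields:

library(Mcomp)   # M1 and M3 competition data

# Series counts by period and type, as tabulated above
table(sapply(M1, function(s) s$period),
      sapply(M1, function(s) s$type))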

The only credible test...

M3

Period      DEMOGRAPHIC  FINANCE  INDUSTRY  MACRO  MICRO  OTHER
MONTHLY             111      145       334    312    474     52
OTHER                 0       29         0      0      4    141
QUARTERLY            57       76        83    336    204      0
YEARLY              245       58       102     83    146     11

Makridakis et al, 2000

The only credible test...

Tourism

Period      TOURISM
MONTHLY         366
QUARTERLY       427
YEARLY          518

Athanasopoulos et al, 2011

UNITED NATIONS COPPER ORE PRODUCTION CANADA

library(Mcomp)   # the M1 competition data lives in the Mcomp package
forecast_comp(M1[[650]], plot = TRUE)   # compare several standard models' forecasts for this series

[plot: forecasts compared for M1 series 650, United Nations copper ore production, Canada]

RATIO CIVILIAN EMPLOYMENT TO TOTAL WORKING AGE POPULATION

forecast_comp(M1[[1000]], plot = TRUE)   # the same comparison for M1 series 1000

[plot: forecasts compared for M1 series 1000, ratio of civilian employment to total working age population]

Ensemble time series forecasts work better than individual models

For example:

Model                two   four    six  eight
Theta               0.77   1.06   1.35   1.62
ARIMA-ETS average   0.72   1.07   1.38   1.75
ARIMA               0.75   1.12   1.43   1.79
ETS                 0.75   1.11   1.44   1.82
Naive               1.08   1.11   1.74   1.87

Mean absolute scaled error (MASE) of forecasts for 756 quarterly series from the M3 competition, with forecast horizons ranging from two to eight quarters.
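As a rough sketch of how such a comparison can be run (not the exact code behind the table), here is the calculation for a single quarterly M3 series, using the Mcomp and forecast packages. The MASE here is scaled by the in-sample seasonal naive MAE, one common convention; the original calculation may have used a different scaling:

library(Mcomp)
library(forecast)

series <- subset(M3, "quarterly")[[1]]
h <- length(series$xx)                        # length of the held-out test set

fc_arima <- forecast(auto.arima(series$x), h = h)$mean
fc_ets   <- forecast(ets(series$x), h = h)$mean
fc_theta <- thetaf(series$x, h = h)$mean
fc_avg   <- (fc_arima + fc_ets) / 2           # equal-weight ARIMA-ETS average

# MASE: mean absolute error scaled by the in-sample seasonal naive MAE
scale <- mean(abs(diff(series$x, lag = 4)))
mase  <- function(fc) mean(abs(series$xx - fc)) / scale
round(c(Theta = mase(fc_theta), Average = mase(fc_avg),
        ARIMA = mase(fc_arima), ETS = mase(fc_ets)), 2)

The full exercise repeats this over all 756 quarterly series and averages by horizon.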


How to estimate prediction intervals?

  • Usually presumed to be some kind of weighted average of the components' intervals
  • Weights might be estimated from in-sample errors, as in the sketch below
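For instance, one hypothetical version of this idea (a sketch for illustration, not the scheme evaluated here) weights each component's interval inversely to its in-sample mean absolute error:

library(forecast)

y <- AirPassengers
fit_ets   <- ets(y)
fit_arima <- auto.arima(y)

# Inverse in-sample MAE weights, normalised to sum to one
mae <- c(ets   = mean(abs(y - fitted(fit_ets))),
         arima = mean(abs(y - fitted(fit_arima))))
w <- (1 / mae) / sum(1 / mae)

h <- 12
fc_ets   <- forecast(fit_ets,   h = h, level = 80)
fc_arima <- forecast(fit_arima, h = h, level = 80)

# Weighted average of the components' 80% interval bounds
lower <- w["ets"] * fc_ets$lower[, 1] + w["arima"] * fc_arima$lower[, 1]
upper <- w["ets"] * fc_ets$upper[, 1] + w["arima"] * fc_arima$upper[, 1]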

But the components have poor coverage.

[plot: actual coverage of individual models' prediction intervals, M1 collection]

Standard estimates for prediction intervals are conditional on the model being correct, despite the obvious randomness in model selection.

A conservative alternative

  • Take the extremes of the component models' prediction intervals - the lowest lower bound and the highest upper bound at each horizon (see the sketch below) - but:

“Those prediction intervals look dodgy because they are way too conservative. The package is taking the widest possible intervals that includes all the intervals produced by the individual models. So you only need one bad model, and the prediction intervals are screwed.”

Definitely too wide...

[plot: example of overly wide, conservative prediction intervals]

This particular example is a combination of five forecast methods.
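In code, the conservative combination is just the pointwise extremes of the component intervals. A minimal sketch with only two components (ETS and auto.arima; the series, horizon, and level are arbitrary choices for illustration):

library(forecast)

y <- AirPassengers
h <- 24

fc_ets   <- forecast(ets(y), h = h, level = 80)
fc_arima <- forecast(auto.arima(y), h = h, level = 80)

# Conservative combination: the lowest lower bound and the highest
# upper bound of the components at each horizon
lower <- pmin(fc_ets$lower[, 1], fc_arima$lower[, 1])
upper <- pmax(fc_ets$upper[, 1], fc_arima$upper[, 1])
cbind(lower, upper)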

Let's test against a larger set of data

We take into account past findings that prediction intervals for lower frequency data have an increased tendency to deliver less than their advertised coverage:

  • Makridakis et al (1987)
  • Athanasopoulos et al (2011)

[plot: prediction interval coverage, M1 collection]

[plot: prediction interval coverage, M3 collection]

[plot: prediction interval coverage, Tourism collection]

When it works

[plots: four example series where the conservative intervals work well]

Some examples of when it goes all wrong

[plots: three example series where the conservative intervals go badly wrong]

Summary of performance

[plot: summary of prediction interval coverage across all three collections]

Conclusions

  • Confirmed that higher frequency data delivers actual coverage closer to that advertised
  • Domain makes a real difference: ETS prediction intervals achieve advertised coverage or better for all seasonal data in the Tourism collection, but only for monthly data in M3, and never in M1
  • For 80% prediction intervals and seasonal data, the trial method's coverage is too high (too conservative) in the Tourism and M3 competitions, but not in M1
  • For non-seasonal data, even the trial method isn't conservative enough, in all three competitions, for 80% or 95% intervals
  • The trial conservative method gives good results for 95% prediction intervals of seasonal data - better than the individual components

Practical implications

  • OK (i.e. better than the alternatives) to use this method for 95% prediction intervals…
  • … and for 80% prediction intervals for quarterly and yearly data
  • … but too conservative for 80% prediction intervals for monthly or higher frequency data

Not considered today

  • implications of having more than two models in the combination
  • Box-Cox transformations
  • what happens when seasonal data is aggregated up to lower frequency
  • implications of coordination by hierarchy (hts) or temporal aggregation (thief)

Today's key messages:

  • The only credible way to seriously test time series forecasts is against large collections of real-life datasets
  • The average of an ensemble of forecasts often outperforms the individual models on point accuracy
  • The actual coverage of prediction intervals should be a key part of assessing performance but is often neglected
  • Prediction intervals from individual models often have very poor (less than advertised) actual coverage
  • Better prediction interval performance is possible in some situations by a conservative combination of the prediction intervals from component models
  • The forecastHybrid R package facilitates this (see the sketch below)
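A minimal illustrative sketch with forecastHybrid; the model letters "a" and "e" select auto.arima and ets:

library(forecastHybrid)

mod <- hybridModel(AirPassengers, models = "ae")   # ARIMA-ETS ensemble
fc  <- forecast(mod, h = 24)   # intervals combined conservatively, as discussed above
plot(fc)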
