Prediction intervals for ensemble time series forecasts

Peter Ellis
December 2016

The only credible test...

M1

Period      DEMOGR  INDUST  INDUSTRIAL  MACRO1  MACRO2  MICRO1  MICRO2  MICRO3
MONTHLY         75     183           0      64      92      10      89     104
QUARTERLY       39      17           1      45      59       5      21      16
YEARLY          30      35           0      30      29      16      29      12

Makridakis et al, 1982
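These counts come from the M1 object in the Mcomp R package. A sketch of how a table like this can be reproduced, assuming Mcomp's period and type fields:

library(Mcomp)   # M1 and M3 competition data

# Series counts by period and type, as tabulated above
table(sapply(M1, function(s) s$period),
      sapply(M1, function(s) s$type))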

The only credible test...

M3

Period      DEMOGRAPHIC  FINANCE  INDUSTRY  MACRO  MICRO  OTHER
MONTHLY             111      145       334    312    474     52
OTHER                 0       29         0      0      4    141
QUARTERLY            57       76        83    336    204      0
YEARLY              245       58       102     83    146     11

Makridakis et al, 2000

The only credible test...

Tourism

Period      TOURISM
MONTHLY         366
QUARTERLY       427
YEARLY          518

Athanasopoulos et al, 2011

UNITED NATIONS COPPER ORE PRODUCTION CANADA

library(Mcomp)   # the M1 competition data lives in the Mcomp package
forecast_comp(M1[[650]], plot = TRUE)   # compare several standard models' forecasts for this series

[plot: forecasts compared for M1 series 650, United Nations copper ore production, Canada]

RATIO CIVILIAN EMPLOYMENT TO TOTAL WORKING AGE POPULATION

forecast_comp(M1[[1000]], plot = TRUE)   # the same comparison for M1 series 1000

[plot: forecasts compared for M1 series 1000, ratio of civilian employment to total working age population]

Ensemble time series forecasts work better than individual models

For example:

Model                two   four    six  eight
Theta               0.77   1.06   1.35   1.62
ARIMA-ETS average   0.72   1.07   1.38   1.75
ARIMA               0.75   1.12   1.43   1.79
ETS                 0.75   1.11   1.44   1.82
Naive               1.08   1.11   1.74   1.87

Mean absolute scaled error (MASE) of forecasts for 756 quarterly series from the M3 competition, with forecast horizons ranging from two to eight quarters.
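As a rough sketch of how such a comparison can be run (not the exact code behind the table), here is the calculation for a single quarterly M3 series, using the Mcomp and forecast packages. The MASE here is scaled by the in-sample seasonal naive MAE, one common convention; the original calculation may have used a different scaling:

library(Mcomp)
library(forecast)

series <- subset(M3, "quarterly")[[1]]
h <- length(series$xx)                        # length of the held-out test set

fc_arima <- forecast(auto.arima(series$x), h = h)$mean
fc_ets   <- forecast(ets(series$x), h = h)$mean
fc_theta <- thetaf(series$x, h = h)$mean
fc_avg   <- (fc_arima + fc_ets) / 2           # equal-weight ARIMA-ETS average

# MASE: mean absolute error scaled by the in-sample seasonal naive MAE
scale <- mean(abs(diff(series$x, lag = 4)))
mase  <- function(fc) mean(abs(series$xx - fc)) / scale
round(c(Theta = mase(fc_theta), Average = mase(fc_avg),
        ARIMA = mase(fc_arima), ETS = mase(fc_ets)), 2)

The full exercise repeats this over all 756 quarterly series and averages by horizon.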


How to estimate prediction intervals?

  • Usually presumed to be some kind of weighted average of the components' intervals
  • Weights might be estimated from in-sample errors, as in the sketch below
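For instance, one hypothetical version of this idea (a sketch for illustration, not the scheme evaluated here) weights each component's interval inversely to its in-sample mean absolute error:

library(forecast)

y <- AirPassengers
fit_ets   <- ets(y)
fit_arima <- auto.arima(y)

# Inverse in-sample MAE weights, normalised to sum to one
mae <- c(ets   = mean(abs(y - fitted(fit_ets))),
         arima = mean(abs(y - fitted(fit_arima))))
w <- (1 / mae) / sum(1 / mae)

h <- 12
fc_ets   <- forecast(fit_ets,   h = h, level = 80)
fc_arima <- forecast(fit_arima, h = h, level = 80)

# Weighted average of the components' 80% interval bounds
lower <- w["ets"] * fc_ets$lower[, 1] + w["arima"] * fc_arima$lower[, 1]
upper <- w["ets"] * fc_ets$upper[, 1] + w["arima"] * fc_arima$upper[, 1]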

But the components have poor coverage.

[plot: actual coverage of individual models' prediction intervals, M1 collection]

Standard estimates for prediction intervals are conditional on the model being correct, despite the obvious randomness in model selection.

A conservative alternative

  • Take the extremes of the component models' prediction intervals - the lowest lower bound and the highest upper bound at each horizon (see the sketch below) - but:

“Those prediction intervals look dodgy because they are way too conservative. The package is taking the widest possible intervals that includes all the intervals produced by the individual models. So you only need one bad model, and the prediction intervals are screwed.”

Definitely too wide...

[plot: example of overly wide, conservative prediction intervals]

This particular example is a combination of five forecast methods.
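In code, the conservative combination is just the pointwise extremes of the component intervals. A minimal sketch with only two components (ETS and auto.arima; the series, horizon, and level are arbitrary choices for illustration):

library(forecast)

y <- AirPassengers
h <- 24

fc_ets   <- forecast(ets(y), h = h, level = 80)
fc_arima <- forecast(auto.arima(y), h = h, level = 80)

# Conservative combination: the lowest lower bound and the highest
# upper bound of the components at each horizon
lower <- pmin(fc_ets$lower[, 1], fc_arima$lower[, 1])
upper <- pmax(fc_ets$upper[, 1], fc_arima$upper[, 1])
cbind(lower, upper)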

Let's test against a larger set of data

We take into account past findings that prediction intervals for lower frequency data have an increased tendency to deliver less than their advertised coverage:

  • Makridakis et al (1987)
  • Athanasopoulos et al (2011)

[plot: prediction interval coverage, M1 collection]

[plot: prediction interval coverage, M3 collection]

[plot: prediction interval coverage, Tourism collection]

When it works

[plots: four example series where the conservative intervals work well]

Some examples of when it goes all wrong

[plots: three example series where the conservative intervals go badly wrong]

Summary of performance

[plot: summary of prediction interval coverage across all three collections]

Conclusions

  • Confirmed that higher frequency data delivers actual coverage closer to that advertised
  • Domain makes a real difference: ETS prediction intervals achieve advertised coverage or better for all seasonal data in the Tourism collection, but only for monthly data in M3, and never in M1
  • For 80% prediction intervals and seasonal data, the trial method's coverage is too high (too conservative) in the Tourism and M3 competitions, but not in M1
  • For non-seasonal data, even the trial method isn't conservative enough, in all three competitions, for 80% or 95% intervals
  • The trial conservative method gives good results for 95% prediction intervals of seasonal data - better than the individual components

Practical implications

  • OK (i.e. better than the alternatives) to use this method for 95% prediction intervals…
  • … and for 80% prediction intervals for quarterly and yearly data
  • … but too conservative for 80% prediction intervals for monthly or higher frequency data

Not considered today

  • implications of having more than two models in the combination
  • Box-Cox transformations
  • what happens when seasonal data is aggregated up to lower frequency
  • implications of coordination by hierarchy (hts) or temporal aggregation (thief)

Today's key messages:

  • The only credible way to seriously test time series forecasts is against large collections of real-life datasets
  • The average of an ensemble of forecasts often outperforms the individual models on point accuracy
  • The actual coverage of prediction intervals should be a key part of assessing performance but is often neglected
  • Prediction intervals from individual models often have very poor (less than advertised) actual coverage
  • Better prediction interval performance is possible in some situations by a conservative combination of the prediction intervals from component models
  • The forecastHybrid R package facilitates this (see the sketch below)
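A minimal illustrative sketch with forecastHybrid; the model letters "a" and "e" select auto.arima and ets:

library(forecastHybrid)

mod <- hybridModel(AirPassengers, models = "ae")   # ARIMA-ETS ensemble
fc  <- forecast(mod, h = 24)   # intervals combined conservatively, as discussed above
plot(fc)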
