The M3 forecasting competition
The M3 forecasting competition in 2000, organized by Spyros Makridakis and Michele Hibon, tested a variety of methods against 3,003 time series, with forecasts compared to held out test sets. The data are conveniently available for R users in the
Mcomp package and Rob Hyndman has published example code benchmarking the
auto.arima() functions from his forecast package against the results in 2000.
Hyndman has a good outline of the various various measures of forecast accuracy. I buy the argument that Mean Absolute Scaled Error (MASE) - not one used in the original M3 competition - is a better general indicator of forecast accuracy than any of the variants on percentage error, which fall apart when the series has negative values or values that go near to zero.
I wanted to:
- confirm Hyndman’s results in his “show me the evidence” blog post, showing that a combination of
ets()gives very good out-of-the-box forecasting results
- see how X13-SEATS-ARIMA, when run with its default values, compares to the methods benchmarked by Hyndman
- confirm that hybrid forecasts (average of two or more methods) perform better than their constituent forecasts
To do this I modified Hyndman’s benchmarking code to add two additional forecasting methods:
- hybrid of X13 and
- hybrid of X13,
This exercise ended up more complex than I’d envisaged, and there’s too much code to integrate into the text, so the full set of code can be found in the <source> branch of my blog repository.
X13-SEATS-ARIMA is the industry standard tool for seasonal adjustment, particularly for official statistics published by national statistics offices. It gives a choice of the SEATS algorithm (Bank of Spain) and X11 (Statistics Canada and US Census Bureau). X13-SEATS-ARIMA combines these into a single application, available for download from the US Census Bureau. Christoph Sax’s excellent
seasonal R package gives easy access from R.
To seasonally adjust historical data, X13 can automatically select a seasonal ARIMA model, and it can forecast future values. This made me wonder how it goes as a automated forecasting tool. I couldn’t find (on an admittedly very cursory look) anyone who had benchmarked it against a large number of time series, so I thought I’d try it against the M3 competiton set. I thought it might do well because of its default settings to automatically handle outliers, level shifts, transformations, and moving holidays like Easter.
In testing X13 on the M3 data, I had to give it different settings for annual data from seasonal, as otherwise it trips over in trying to find effects from Easter, number of trading days, etc.
I was surprised to find that X13-SEATS-ARIMA, run with the defaults, doesn’t do particularly well.
The main obstacle is that for 412 series X13-SEATS-ARIMA cannot choose a model to fit without human intervention. In the remaining cases it is mediocre on average, coming out a bit worse than
auto.arima on the Mean Absolute Scaled Measure rating (but a bit better on the various percentage error measures). The clever model selection in
auto.arima - which is successful in fitting all 3,003 series - clearly more than compensates for not considering transforms, outliers and moving holidays.
To get some kind of overall result at all for X13 I replaced all the cases where it couldn’t fit a model with the forecasts from
auto.arima(). This is obviously unfair as a statement of the overall strength of X13 (fairer might have been to replace with a naive model - final value in the test set repeats itself over the forecast period), but it does mean that I can at least make crude comparisons.
Hybrid methods perform well, as is well known in the forecasting community. Taking an average of X13-SEATS-ARIMA and
auto.arima() does better on all measures than either of them by themselves; and an average of X13,
ets() has a case to make to be the best method easily available to the regular R user.
Why does X13 sometimes fail?
There are five types of error that lead to X13 failing to fit a model on this collection of data series. Each error message is shown below with a minimal example to produce it.
Struggles to handle high degree of integration
Basically, this is a failure of the model selection process in X13-SEATS-ARIMA to correctly identify the degree of integration - ie how many times the series must be “differenced” to turn it into a stationary series.
auto.arima() uses an apparently superior method that gives more robust results. The problem could be resolved by using
auto.arima() to determine the order of a model and then fitting it with X13-SEATS-ARIMA - in the case of annual data without outliers like series 434, X13 and
auto.arima then give identical results because there is no Easter or similar impacts that are handled differently by the two.
Note - actually, at the time of writing if you use the version of
seasonal on CRAN you won’t get even this error message, but another one that is caused by problems parsing the X13 error message into R. Christoph Sax kindly patched
seasonal with a hack around this which will make it onto CRAN eventually.
Matrix singularity because of trading days
This problem was relatively rare and happened with some short seasonal series. It’s easy to deal with by specifying that number of trading days should not be included as a regression variable.
Runs but produces no data
This is a mysterious error that I haven’t been able to get to the bottom of. It might be something to do with the R - X13 interface, or something purely with X13.
Cannot process spec file for an unknown reason
Again, I couldn’t work out what is ultimately causing this error.
Trouble-shooting these last two types of problematic series would mean taking them out of R and creating input files for X13 by hand, not something I’ve got time and inclination to do.
Can’t handle phony date/times
A number of the series in the Mcomp package do not have actual times and dates specified in their R format, most frequently because they are daily financial series that are not regularly spaced time series if their date information is included. Because of the way it treats Easter and trading day variables, X13-SEATS-ARIMA needs data that is from the year 1000 or later. There’s no obvious solution for this; I don’t think X13-SEATS-ARIMA can handle irregular daily time series itself (let me know in the comments if I’m wrong).
When it succeeds, where is X13 slightly worse than auto.arima?
Examples of auto.arima() out-performing X13-SEATS-ARIMA
Looking at individual forecasts is a mug’s way of assessing the overall strength of forecasting methods, but it at least gets the brain moving. Here are some examples of where the model selection of X13 wasn’t as good as that of
Examples of X13-SEATS-ARIMA out-performing auto.arima()
…and to be fair, here are some examples in the other direction. X13 has an almost spooky ability to pick the history of unpredictability in the US stock exchange and effectively predicted the big stock market crash when it is fed data up to 1987.
- Because of the fragility of its model selection when faced with a large number of diverse time series, X13-SEATS-ARIMA is not a particularly good tool for automated, no-hands forecasting. This isn’t particularly surprising as I don’t think it’s ever been spruiked as such. It’s a tool for careful specialists who correctly check diagnostic plots and tests, and who can handle a potentially 15% failure rate of its automated model selection when used outside its native territory.
- As expected, hybrid forecasts - averaging several methods - work well. There may be a place for X13-SEATS-ARIMA as part of a hybrid automated forecaster.
Just to be clear - X13-SEATS-ARIMA is a fantastic tool, and I think it is the best out there in its role as a tool for careful analysis of quarterly and monthly seasonal time series, using real calendar dates, as part of an official statistics process. In a big majority of cases, particularly with seasonal data, its defaults and its automatic model selection give you good results, and will deliver very good seasonal adjustment. However, its automatic modelling under default conditions isn’t as robust (ie guaranteed to get a working model) as other methods, and the resulting models don’t always give optimal results for forecasting.