## A function for testing auto.arima()

This is the third in a series of posts on fitting time series models and comparing time series to each other, and the last from me on this particular topic for a while. The series was originally inspired by a question on the Cross Validated Q&A site. In this week's post I'm exploring and testing an almost throw-away comment in my original answer - that looking at the parameterisation of an automatically chosen ARIMA (autoregressive integrated moving average) time series model is a bad idea because very similar data generating models may quite legitimately be fit by different-looking ARIMA models. I want to look at a corollary to that:

"How successful will an automated ARIMA selection be at returning the order of the original data generating process for a simulated time series?"

An ARMA process consists of an "autoregressive" AR part (basically, a linear combination of the lagged values of the series) plus a "moving average" MA part (a linear combination of current and lagged values of a white noise series). By "order", I mean the pair of numbers giving the number of lags in the AR part and the number of lagged white noise values in the MA part. An AR(1) model is just a coefficient times the lagged value, plus white noise; an MA(1) model is just a coefficient times the lagged value of the white noise, plus white noise; an ARMA(1,1) is a combination of the two.
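Written out, the three cases look like this. The symbols are the conventional ones and are assumed here rather than taken from the original answer: $\phi$ for AR coefficients, $\theta$ for MA coefficients, $\varepsilon_t$ for white noise, and no constant term.

$$
\begin{aligned}
\text{AR(1):}\quad & y_t = \phi_1 y_{t-1} + \varepsilon_t \\
\text{MA(1):}\quad & y_t = \theta_1 \varepsilon_{t-1} + \varepsilon_t \\
\text{ARMA(1,1):}\quad & y_t = \phi_1 y_{t-1} + \theta_1 \varepsilon_{t-1} + \varepsilon_t
\end{aligned}
$$

More generally, an ARMA(p,q) process is

$$
y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t
$$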

In particular, I wanted to explore Rob Hyndman's excellent auto.arima() function from his {forecast} package, which by default uses the small-sample corrected Akaike Information Criterion (AICc) for model selection. We use it a lot at my work and I wanted to see how it performs when fitting a dataset that is known to have been created (because we did it ourselves) by an ARMA process of a particular order. To explore this, I created a function that:

- takes as an argument a model specification (eg AR(1) for autoregression of order 1 ie a time series created by a coefficient times its lagged value plus white noise)
- simulates a large number (user controlled) of replications of that series, of a variety of different sample lengths (defaulting to 20, 40, 80, ..., 10240)
- uses the auto.arima() function to identify the best ARIMA model for each replication of the series
- stores the order of the final model and compares it against the correct order
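The steps above can be sketched for a single replication as follows. This is a simplified illustration under assumed values (an AR(1) coefficient of 0.5 and a series length of 80, both picked arbitrarily), not the arima_fit_sim() function used for the results in this post. It needs the {forecast} package to be installed.

```r
library(forecast)  # provides auto.arima()

set.seed(123)

# Assumed example specification: AR(1) with an arbitrary coefficient
true_model  <- list(ar = 0.5)
true_params <- "ar1"   # coefficient names a correctly identified fit would have

# Simulate one series of length 80 from that specification
x <- arima.sim(model = true_model, n = 80)

# Let auto.arima() choose a model (it selects on AICc by default)
fit <- auto.arima(x)

# Record the coefficient names of the chosen model and compare with the truth
fitted_params <- names(coef(fit))
correct <- identical(fitted_params, true_params)
```

Repeating this over many replications and several series lengths, and storing `fitted_params` each time, would give the kind of counts shown in the results below.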

Here's the code for that function:

There are flaws with this function that I didn't have time to fix. Most notably, there's a big risk of error because the call to the function needs to specify a vector of parameter names, eg c("ar1", "ma1") for ARMA(1,1), that has to match the model provided. A better function would deduce the vector of parameter names from the model structure; if someone knows how to do this simply, let me know.
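One possible way to deduce the names, sketched in base R. The helper name `param_names` is hypothetical, and it assumes the model is given as a list with optional `ar` and `ma` elements, as arima.sim() accepts. It only covers AR and MA terms, so an intercept or drift term in a fitted model would still need separate handling.

```r
# Hypothetical helper: build names such as "ar1", "ma1" from a model list
# of the form accepted by arima.sim(), eg list(ar = 0.5, ma = 0.3)
param_names <- function(model) {
  c(
    if (length(model$ar) > 0) paste0("ar", seq_along(model$ar)),
    if (length(model$ma) > 0) paste0("ma", seq_along(model$ma))
  )
}

ar1_names    <- param_names(list(ar = 0.5))
arma22_names <- param_names(list(ar = c(0.5, -0.2), ma = c(0.4, 0.3)))

print(ar1_names)     # "ar1"
print(arma22_names)  # "ar1" "ar2" "ma1" "ma2"
```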

## The test

For my actual test, I chose to try out two simple simulations - AR(1) and MA(1) processes - and two more complex ones - ARMA(2,2) and ARMA(3,3). There's a sticking point, which is the choice of the actual autoregressive and moving average parameters. An AR(1) process with an autoregression coefficient of 0.05, for example, will be almost indistinguishable from white noise, whereas as the coefficient gets closer to 1 it becomes much easier for the model selection algorithm to notice. To do a comprehensive test, I should actually compare models across the whole parameter space. As this is just a blog post, I arbitrarily chose some values.
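To see why the size of the coefficient matters, here is a small base R sketch that compares the lag-1 sample autocorrelation of a weak and a strong AR(1) series against the approximate 95 percent band expected for white noise. The coefficient 0.9, the series length of 200 and the seed are assumptions chosen for illustration only.

```r
set.seed(42)
n <- 200

# Two AR(1) series: one barely autocorrelated, one strongly autocorrelated
weak   <- arima.sim(model = list(ar = 0.05), n = n)
strong <- arima.sim(model = list(ar = 0.90), n = n)

# Lag-1 sample autocorrelation (element 1 of $acf is lag 0, element 2 is lag 1)
r_weak   <- acf(weak,   lag.max = 1, plot = FALSE)$acf[2]
r_strong <- acf(strong, lag.max = 1, plot = FALSE)$acf[2]

# Approximate 95 percent band for the sample autocorrelation of white noise
band <- 1.96 / sqrt(n)

print(c(weak = r_weak, strong = r_strong, band = band))
```

With a series this short, the weak series will often sit inside the white noise band while the strong one sits far outside it, which is the difficulty facing any automated selection.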

Here's the code performing the actual tests on those four processes, using my new arima_fit_sim() function with 500 repetitions of each combination of model and sample length. This took about eight hours to run on my (nothing special) laptop:

## Results

My function stores the order (ie how many autoregressive and how many moving average parameters) of the 5000 models it has fit for each simulation. Here are the ten most common models that were selected in the case of the AR(1) model:

| fitted | count |
|---|---|
| ar1 | 1752 |
| ar1 ar2 ar3 ma1 ma2 | 316 |
| ar1 intercept | 278 |
| ma1 | 238 |
| ar1 ar2 | 224 |
| ar1 ma1 | 210 |
| ar1 ar2 ma1 | 185 |
| ma1 ma2 | 140 |
| ar1 ar2 ar3 | 124 |
| ma1 ma2 ma3 | 116 |

As a percentage correct for different sample lengths, here's how they end up:

| n | correct (%) |
|---|---|
| 20 | 29.0 |
| 40 | 35.0 |
| 80 | 38.6 |
| 160 | 31.2 |
| 320 | 29.8 |
| 640 | 35.4 |
| 1280 | 37.2 |
| 2560 | 37.4 |
| 5120 | 36.0 |
| 10240 | 40.8 |
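Summaries like the two tables above can be produced with a few lines of base R. The sketch below uses a tiny made-up data frame, with hypothetical column names `n` and `fitted`, as a stand-in for the stored results; the values are not from the real simulation.

```r
# Toy stand-in for the stored results: one row per fitted model (made-up values)
results <- data.frame(
  n      = c(20, 20, 20, 40, 40, 40),
  fitted = c("ar1", "ma1", "ar1 ar2", "ar1", "ar1", "ar1 ma1"),
  stringsAsFactors = FALSE
)

# A fit counts as correct when its parameter names match the true AR(1) model
results$correct <- results$fitted == "ar1"

# Counts of each fitted model, most common first
counts <- sort(table(results$fitted), decreasing = TRUE)

# Percentage correct for each sample length
pct <- aggregate(correct ~ n, data = results, FUN = function(z) 100 * mean(z))

print(counts)
print(pct)
```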

In this simple case, auto.arima() is moderately successful. It correctly picked an AR(1) model as the best 1752 times out of 5000, much more often than the second most popular model, which was a very complex ARMA(3,2). Even when the time series is only 20 observations long it gets it right 29 percent of the time, and this goes up to 41 percent by the time there are around 10,000 observations. The success rate seems to improve as the series gets longer, though not steadily (it dips to around 30 percent at 160 and 320 observations), and I'd (naively) expected better than this: even for a longish time series the true model is recovered only 41 percent of the time. It certainly means, for example, that when your automated model selection tells you that you have an ARMA(3,2) model you should not dismiss the possibility that an AR(1) model might be the true generating process (other information might help you dismiss that of course, including the actual parameters).

Here's a plot summarising the overall results:

Here it is separated out by facets so we can see that pattern more closely:

I was a bit astonished at the poor success rate of picking up the ARMA(3,3) process, so I've tabulated here the counts of every final model that was actually selected:

| fitted | count |
|---|---|
| ar1 ma1 ma2 ma3 ma4 | 1446 |
| ma1 ma2 ma3 ma4 ma5 | 500 |
| ma1 ma2 ma3 ma4 | 444 |
| ar1 ma1 ma2 ma3 | 339 |
| ar1 ma1 ma2 ma3 ma4 intercept | 276 |
| ar1 ar2 ar3 ma1 ma2 | 248 |
| ar1 ar2 drift | 229 |
| ar1 ar2 ar3 | 168 |
| ar1 ar2 ma1 ma2 ma3 | 157 |
| ma1 ma2 ma3 ma4 intercept | 151 |
| intercept | 115 |
| ar1 ma1 ma2 ma3 intercept | 103 |
| (no parameters) | 88 |
| ma1 ma2 ma3 | 61 |
| ma1 | 54 |
| ar1 ar2 ma1 ma2 drift | 50 |
| ar1 ar2 ma1 ma2 ma3 intercept | 43 |
| ar1 ar2 ar3 ar4 ar5 | 42 |
| ma1 ma2 ma3 intercept | 41 |
| ar1 ar2 ar3 ar4 | 32 |
| ar1 ar2 ar3 ar4 ma1 | 31 |
| ar1 ar2 ma1 ma2 | 29 |
| ma1 ma2 ma3 ma4 drift | 28 |
| ma1 drift | 27 |
| ar1 ar2 ar3 drift | 25 |
| ar1 ar2 ar3 ma1 ma2 intercept | 24 |
| ar1 ar2 intercept | 24 |
| ar1 | 21 |
| ar1 ar2 | 19 |
| ma1 intercept | 16 |
| ar1 ar2 ar3 ar4 drift | 14 |
| drift | 13 |
| ar1 ar2 ar3 ma1 | 12 |
| ar1 ar2 ma1 drift | 12 |
| ma1 ma2 ma3 ma4 ma5 drift | 12 |
| ma1 ma2 ma3 ma4 ma5 intercept | 12 |
| ar1 ar2 ar3 intercept | 11 |
| ar1 ma1 ma2 | 11 |
| ar1 ar2 ar3 ar4 ar5 intercept | 8 |
| ma1 ma2 ma3 drift | 8 |
| ma1 ma2 drift | 7 |
| ar1 ar2 ar3 ar4 ar5 drift | 5 |
| ar1 ar2 ar3 ar4 ma1 intercept | 5 |
| ar1 ar2 ma1 ma2 ma3 drift | 5 |
| ar1 intercept | 5 |
| ma1 ma2 | 5 |
| ma1 ma2 intercept | 5 |
| ar1 ar2 ar3 ma1 ma2 drift | 4 |
| ar1 ar2 ma1 intercept | 4 |
| ar1 ar2 ar3 ar4 intercept | 3 |
| ar1 drift | 2 |
| ar1 ma1 | 2 |
| ar1 ar2 ar3 ar4 ma1 drift | 1 |
| ar1 ar2 ar3 ma1 intercept | 1 |
| ar1 ar2 ma1 | 1 |
| ar1 ma1 intercept | 1 |

Yes, that's right: a great many models are picked, but not a single instance matches the "ar1 ar2 ar3 ma1 ma2 ma3" that was the original data generating process. So the take-home message is even stronger than I thought - don't pay much attention to the order of your fitted ARIMA models! This doesn't mean there's a problem with using them - they give good fits and work well in many forecasting situations. But we shouldn't confuse them with some essential reality of an underlying process. It's a striking example of the famous (well, in some quarters) quote by George E. P. Box:

> All models are wrong, but some are useful.

I found all this a bit surprising. I'd expected a material amount of variation in what models were returned, but not this much. And I'd expected the model selection process to be more consistent (to use a term loosely), ie to converge to the true model more noticeably as sample length got bigger. I presume there's a literature on all this but didn't have time to research it, so any pointers are welcomed in the comments or on Twitter.