What do auto-correlated residuals do to your linear model?
For training purposes I wanted to illustrate the dangers of ignoring the time series characteristics of the random part of a classical linear regression, so I came up with this animation to do it:
I like this, because it shows how easy it is to fit something that looks like a good fit but actually misses important parts of reality. The red lines show the fitted model, based on a growing window of the data - from 5 to 200 points. The black line shows the true data generating process. From very early on, the model fit to the simple cross-sectional data has converged to pretty close to the black line. The model fit to the data with time series errors, however, spends a long time greatly overestimating one of the parameters in the model, and does not converge to anywhere near the true process until there are around 120 observations.
At the very least, it shows that you need many more observations from a time series - four times as many in this case, though unfortunately that's not a magic number that will always work - to reliably estimate the structural part of a model. Even if we'd explicitly modelled the time series part of the data on the right of the animation, we'd still have that problem.
By including the residual plots below the scatter plots, we get a nice picture of the warning sign in this basic (and what should be a fundamental and universal) diagnostic plot. In this particular case the pattern is obvious; when working with real data you should check partial autocorrelation function plots too.
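A minimal sketch in base R of the kind of residual check described above; the simulated data and the AR(1) coefficient of 0.9 are my own assumptions, standing in for "highly autocorrelated errors":

```r
# Simulate a regression with AR(1) errors, fit a naive lm, and
# inspect the residuals for autocorrelation.
set.seed(42)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.9), n = n))  # assumed AR(1) errors
y <- 1 + 0.3 * x + e
fit <- lm(y ~ x)

plot(residuals(fit), type = "l")  # residuals in time order - look for runs
acf(residuals(fit))               # autocorrelation function
pacf(residuals(fit))              # partial autocorrelation function
Box.test(residuals(fit), type = "Ljung-Box")  # a formal test, for good measure
```

With errors this strongly autocorrelated, the residual series shows long runs above and below zero rather than random scatter.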
The animation illustrates the results of simulating and contrasting two fairly extreme cases:
- cross-sectional data, generated exactly from a model of y = a + b.x + e, e ~ N(0, 1). This is the textbook case introduced in any basic statistics course;
- time series data, generated from exactly the same model except that the error term, in addition to being normally distributed with mean of zero and standard deviation of 1, has a high autocorrelation.
I chose to make the intercept of my model (a in the above formulation) 1, and the slope (b) equal to 0.3. Here's what the first 200 observations of the response variable look like:
In fact, I've over-simplified things by leaving x in both datasets as independent and identically distributed white noise. In reality, if y has a time series random component, x probably will too. But I wanted to illustrate how a single violation of our assumptions can lead to problems, rather than create a fully realistic case (which would obviously show up even more problems).
The data were generated as follows. To illustrate a point and make it a realistic test, I generate a much larger "population" time series, and the mean of zero and standard deviation of 1 apply only to that larger population. The first 200 points are all we see.
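A sketch of that approach; the population size of 10,000 and the AR(1) coefficient of 0.9 are my assumptions, not values from the post:

```r
set.seed(123)
N   <- 10000  # assumed population size
rho <- 0.9    # assumed AR(1) coefficient

# Generate the full "population" of AR(1) errors, then standardise so
# mean 0 and sd 1 hold for the population, not the observed window.
e_pop <- as.numeric(arima.sim(list(ar = rho), n = N))
e_pop <- (e_pop - mean(e_pop)) / sd(e_pop)

x_pop <- rnorm(N)
y_pop <- 1 + 0.3 * x_pop + e_pop

# The first 200 points are all we see
observed <- data.frame(x = x_pop, y = y_pop)[1:200, ]
```

Standardising over the whole population means the observed window's mean and standard deviation can drift well away from 0 and 1, which is exactly the point: a short stretch of a highly autocorrelated series is not representative.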
Creating the animation is straightforward graphics work. I make use of ggplot2's faceting feature to cut down on some code, drawing the top two connected scatterplots with one chunk and the bottom two residual plots with another. Each frame is saved as an individual PNG image, and ImageMagick ties them all together into an animated GIF as easily as usual.
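A minimal sketch of the frame loop (scatterplots only, without the residual panels); the file names, the AR(1) coefficient of 0.9, and the ImageMagick `convert` invocation are my assumptions:

```r
library(ggplot2)

set.seed(123)
n <- 200
make_data <- function(type, e) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + 0.3 * x + e, type = type)
}
alldata <- rbind(
  make_data("cross section", rnorm(n)),
  make_data("time series",   as.numeric(arima.sim(list(ar = 0.9), n = n)))
)

dir.create("frames", showWarnings = FALSE)
for (i in 5:n) {
  # first i points of each series; rows 1:n are cross section, (n+1):2n time series
  window <- alldata[c(1:i, n + 1:i), ]
  p <- ggplot(window, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, colour = "red") +  # fitted model so far
    geom_abline(intercept = 1, slope = 0.3) +                 # true process in black
    facet_wrap(~type)
  png(sprintf("frames/frame_%03d.png", i), width = 600, height = 400)
  print(p)
  dev.off()
}

# Stitch the frames into a GIF with ImageMagick (assumes convert is on the PATH)
system("convert -delay 10 frames/frame_*.png animation.gif")
```

Zero-padding the frame numbers (`%03d`) keeps the files in the right order when ImageMagick globs them.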