## Out-of-sample extrapolation

This post is an offshoot of some simple experiments I made to help clarify my thinking about some machine learning methods. In this experiment I fit four kinds of model to a super-simple artificial dataset with two columns, x and y, and then try to predict new values of y based on values of x that are outside the original range of x. Here’s the end result:

An obvious limitation of the extreme gradient boosting and random forest methods leaps out of this graph - when predicting y based on values of x that are outside the range of the original training set, they presume y will just be around the highest value of y in the original set. These tree-based methods (more detail below) basically can’t extrapolate the way we’d find most intuitive, whereas linear regression and the neural net do ok in this regard.

## Data and set up

The data was generated by this:
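As a minimal sketch of how a dataset of this shape could be generated, assuming a noisy straight-line relationship: the seed, sample size, coefficients and the object names `the_data` and `extrap` are illustrative assumptions, not necessarily what produced the graphic above.

```r
# Illustrative assumptions: seed, sample size, coefficients and object names
# are made up for this sketch.
set.seed(123)
x <- 1:100 + rnorm(100)            # training x, roughly 0 to 100
y <- 3 + 0.3 * x + rnorm(100)      # linear signal plus unit-variance noise
the_data <- data.frame(x = x, y = y)

# New x values lying below and above the training range
extrap <- data.frame(x = c(seq(-20, -5, by = 5), seq(105, 130, by = 5)))

range(the_data$x)
range(extrap$x)
```

Keeping the extrapolation points in their own data frame makes it easy to pass the same new values to each model's `predict` method.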

## The four different modelling methods

The four methods I’ve used are:

- linear regression estimated with ordinary least squares
- single layer artificial neural network with the `nnet` R package
- extreme gradient boosting with the `xgboost` R package
- random forests with the `ranger` R package (faster and more efficient than the older `randomForest` package, not that it matters with this toy dataset)

All four of these methods are now a standard part of the toolkit for predictive modelling. Linear regression, \(E(\textbf{y}) = \textbf{X}\beta\), is the oldest and arguably the most fundamental statistical model of this sort around. The other three can be characterised as black box methods in that they don’t return a parameterised model that can be expressed as a simple equation.

Fitting the *linear model* in R is as simple as:
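A minimal sketch of that fit, assuming a toy data frame called `the_data` with columns `x` and `y` and a data frame `extrap` of out-of-range x values (both names and the simulated values are assumptions made for this example):

```r
# Assumed toy data: names and coefficients are illustrative only
set.seed(123)
the_data <- data.frame(x = 1:100 + rnorm(100))
the_data$y <- 3 + 0.3 * the_data$x + rnorm(100)
extrap <- data.frame(x = c(-20, -10, 110, 120))

# Ordinary least squares via the formula interface
mod_lm <- lm(y ~ x, data = the_data)
coef(mod_lm)

# Predictions follow the fitted line beyond the training range
pred_lm <- predict(mod_lm, newdata = extrap)
pred_lm
```

Because the fitted model is just an intercept and a slope, the prediction at x = 120 keeps rising past the largest y seen in training.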

*Neural networks* create one or more hidden layers of machines (one in this case) that transform inputs to outputs. Each machine could in principle be a miniature parameterised model but the net effect is a very flexible and non-linear transformation of the inputs to the outputs. This is conceptually advanced, but again simple to fit in R with a single line of code. Note the meta-parameter `size` of the hidden layer, which I’ve set to 8 after some experimentation (with real life data I’d use cross-validation to test out the effectiveness of different values).
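A sketch of what that single line might look like with `nnet`, again on an assumed toy dataset (`the_data` and `extrap` are illustrative names). `linout = TRUE` is used here because `nnet` defaults to logistic output units, which would not suit a continuous response; `trace = FALSE` only suppresses the fitting log.

```r
library(nnet)

# Assumed toy data: names and coefficients are illustrative only
set.seed(123)
the_data <- data.frame(x = 1:100 + rnorm(100))
the_data$y <- 3 + 0.3 * the_data$x + rnorm(100)
extrap <- data.frame(x = c(-20, -10, 110, 120))

# One hidden layer with 8 units and a linear output unit for regression
mod_nn <- nnet(y ~ x, data = the_data, size = 8, linout = TRUE, trace = FALSE)

# predict() returns a one-column matrix of fitted values
pred_nn <- predict(mod_nn, newdata = extrap)
pred_nn
```

The fit depends on random starting weights, so results will vary with the seed and with the choice of `size`.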

*xgboost* fits a shallow regression tree to the data, and then additional trees to the residuals, repeating this process for a number of rounds set in advance by the analyst. To avoid over-fitting we use cross-validation to determine the best number of rounds. This is a little more involved, but not much:
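A hedged sketch of that two-step process, on an assumed toy dataset. The parameter values (`max_depth = 2`, up to 40 rounds, 10 folds) are illustrative choices, and the calls assume a reasonably recent `xgboost` release; argument names have changed between versions, so check them against the installed package.

```r
library(xgboost)

# Assumed toy data: names and coefficients are illustrative only
set.seed(123)
the_data <- data.frame(x = 1:100 + rnorm(100))
the_data$y <- 3 + 0.3 * the_data$x + rnorm(100)
extrap <- data.frame(x = c(-20, -10, 110, 120))

# xgboost wants a feature matrix plus a vector of labels, not a formula
dtrain <- xgb.DMatrix(data = as.matrix(the_data["x"]), label = the_data$y)
xg_params <- list(objective = "reg:squarederror", max_depth = 2)

# Cross-validate to choose the number of boosting rounds
cv <- xgb.cv(params = xg_params, data = dtrain, nrounds = 40, nfold = 10,
             verbose = 0)
best_nrounds <- which.min(cv$evaluation_log$test_rmse_mean)

# Refit on all the data with the chosen number of rounds
mod_xg <- xgb.train(params = xg_params, data = dtrain, nrounds = best_nrounds)
pred_xg <- predict(mod_xg, newdata = xgb.DMatrix(as.matrix(extrap["x"])))
pred_xg
```

The predictions at x = 110 and x = 120 come out identical, since every split in every tree lies inside the training range.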

Then there’s the *random forest*. This is another tree-based method. It fits multiple regression trees to different row and column subsets of the data (of course, with only one column of explanatory features in our toy dataset, it doesn’t need to create different column subsets!), and takes their average. Doing this with the defaults in `ranger` is simple again (noting that `lm`, `nnet` and `ranger` all use the standard R formula interface, whereas `xgboost` needs the input as a matrix of explanatory features and a vector of ‘labels’ ie the response variable).
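A minimal sketch of the random forest fit with `ranger` defaults, on an assumed toy dataset (`the_data` and `extrap` are illustrative names). Note that `predict` for a `ranger` model takes the new data in an argument called `data` and returns an object whose `predictions` element holds the values.

```r
library(ranger)

# Assumed toy data: names and coefficients are illustrative only
set.seed(123)
the_data <- data.frame(x = 1:100 + rnorm(100))
the_data$y <- 3 + 0.3 * the_data$x + rnorm(100)
extrap <- data.frame(x = c(-20, -10, 110, 120))

# Random forest with default settings via the formula interface
mod_rf <- ranger(y ~ x, data = the_data)

# Predictions are averages of leaf means, so they stay inside the range of y
pred_rf <- predict(mod_rf, data = extrap)$predictions
pred_rf
range(the_data$y)
```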

Finally, to create the graphic from the beginning of the post with the predictions of each of these models using the extrapolation dataset, I create a function to draw the basic graph of the real data (I’ll be doing this four times, so it is worth encapsulating in a function to avoid repetitive code). I call this function once for each graphic, and superimpose the predicted points over the top.
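The pattern could be sketched with base graphics as follows. The function name `draw_base`, the colours and the toy data are assumptions for illustration, and only two models that ship with R (`lm` and a single `rpart` tree) are drawn to keep the example self-contained; the other models would slot into the list in the same way.

```r
library(rpart)

# Assumed toy data: names and coefficients are illustrative only
set.seed(123)
the_data <- data.frame(x = 1:100 + rnorm(100))
the_data$y <- 3 + 0.3 * the_data$x + rnorm(100)
extrap <- data.frame(x = c(seq(-20, -5, by = 5), seq(105, 130, by = 5)))

# One vector of out-of-range predictions per model
preds <- list(
  "Linear regression" = predict(lm(y ~ x, data = the_data), newdata = extrap),
  "Single regression tree" = predict(rpart(y ~ x, data = the_data),
                                     newdata = extrap)
)

# Hypothetical helper: draws the training data with room for the new points
draw_base <- function(title, ylim) {
  plot(the_data$x, the_data$y, xlim = range(c(the_data$x, extrap$x)),
       ylim = ylim, pch = 19, col = "grey60",
       xlab = "x", ylab = "y", main = title)
}

# Call the helper once per panel and superimpose that model's predictions
ylim_all <- range(c(the_data$y, unlist(preds)))
par(mfrow = c(1, length(preds)))
for (model_name in names(preds)) {
  draw_base(model_name, ylim_all)
  points(extrap$x, preds[[model_name]], col = "red", pch = 17)
}
```

Sharing one y-axis range across panels makes the flat tree predictions easy to compare with the straight-line ones.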

## Tree-based limitations with extrapolation

The limitation of the tree-based methods in extrapolating to an out-of-sample range is obvious when we look at a single tree. Here’s a single regression tree fit to this data with the standard `rpart` R package. This isn’t exactly the sort of tree used by either `xgboost` or `ranger` but illustrates the basic approach. The tree algorithm uses the values of x to partition the data and allocate an appropriate value of y (this isn’t usually done with only one explanatory variable of course, but it makes it simple to see what is going on). So if x is less than 11, y is predicted to be 4; if x is between 11 and 28 y is 9; etc. If x is greater than 84, then y is 31.

What happens in the single tree is basically repeated by the more sophisticated random forest and the extreme gradient boosting models. Hence no matter how high a value of x we give them, they predict y to be around 31.
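One way to see this, as a sketch in generic notation (the symbols \(R_m\), \(c_m\), \(s_{\max}\), \(T_b\), \(f_0\) and \(\eta\) are introduced here for illustration): a single regression tree with \(M\) leaves predicts

\[
\hat{f}(x) = \sum_{m=1}^{M} c_m \, \mathbf{1}(x \in R_m),
\]

where the \(R_m\) are the intervals of x defined by the splits and \(c_m\) is the average training y in \(R_m\). If \(s_{\max}\) is the largest split point, then \(\hat{f}(x) = c_M\) for every \(x > s_{\max}\), however large x is. A random forest averages \(B\) such trees and boosting adds them up,

\[
\hat{f}_{\text{RF}}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x), \qquad
\hat{f}_{\text{boost}}(x) = f_0 + \eta \sum_{b=1}^{B} T_b(x),
\]

and an average or sum of functions that are each constant beyond their last split is itself constant beyond the largest split of any tree. Since splits are only placed inside the training range of x, both ensembles flatten out as soon as x leaves that range.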

The implication? Just bear in mind this limitation of tree-based machine learning methods - they won’t handle new data well if it is outside the range of the original training data.

Here’s the code for fitting and drawing the individual regression tree.
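As a minimal sketch, assuming a toy dataset like the one described above (the names `the_data` and `tree` and the simulated values are illustrative) and using only the plotting methods that ship with `rpart` rather than any add-on plotting package:

```r
library(rpart)

# Assumed toy data: names and coefficients are illustrative only
set.seed(123)
the_data <- data.frame(x = 1:100 + rnorm(100))
the_data$y <- 3 + 0.3 * the_data$x + rnorm(100)

# Fit a single regression tree with rpart defaults
tree <- rpart(y ~ x, data = the_data)

# Draw the tree: split rules on the branches, mean y at each leaf
plot(tree, margin = 0.1)
text(tree, digits = 2)

# Every x beyond the last split lands in the same leaf as the largest training x
pred_tree <- predict(tree,
                     newdata = data.frame(x = c(max(the_data$x), 150, 1000)))
pred_tree
```

The three predictions are identical, which is the single-tree version of the plateau seen in the random forest and boosted models.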