Dealing with feature selection and collinearity
Recently I’ve been making more use of elastic net regularization as a way of fitting linear models to data when I have more candidate explanatory variables than I know what to do with, and some of them are collinear - ie their information doubles up on what is in other variables.
Elastic net regularization is a more general form of the ridge regression and lasso methods. Ridge regression was primarily a method for dealing with the instability of estimated coefficients in linear models when the explanatory variables are collinear; the lasso is a method of feature selection that forces the coefficients of some or many explanatory variables to zero and provides good estimates of the coefficients of the remaining features. Both are preferable to stepwise selection, which gives estimates of effect sizes that are biased upwards in absolute magnitude. I deliberately use the term “was” for ridge regression because I don’t see any reason for using either of these methods any more; they are superseded by the more general elastic net. A hyperparameter `alpha` makes the method equivalent to the lasso when `alpha == 1` and to ridge regression when `alpha == 0`. We can use cross-validation, assessing the fit of the best models at different values of `alpha`, to work out the best value of `alpha`.
The method works as an extension of ordinary least squares (OLS), the most common and basic (ie Stats 101) method of estimating the coefficients in a linear model. Instead of minimising the sum of the squares of the differences between predicted and actual values of the response variable (the OLS method), elastic net regularization minimises that sum of squares plus a penalty based on the size of the estimated coefficients (on a standardised scale). I like to think of the impact of this as creating an increased burden of proof on the data before a variable’s effect is included in the model; the estimation process drags coefficients towards zero in a spirit of conservatism. The `alpha` hyperparameter controls the balance between the ridge and lasso penalties. A second hyperparameter, `lambda`, sets the total weight of the penalty, and can also be set at its most effective level by trying different values with cross-validation.
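Written out the way the `glmnet` documentation presents it (so the exact scaling constants below are that package’s convention, not something from this post), the quantity being minimised is roughly:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{T}\beta\right)^2 + \lambda\left(\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right)$$

The first term is the usual OLS criterion; the second is the penalty, with `alpha` mixing the ridge (squared) and lasso (absolute value) components and `lambda` controlling the overall weight of the penalty.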
While ridge regression has been around for decades and the lasso since the 1990s, the generalisation to elastic net regularization was only properly elaborated in 2005, and software to implement it is fairly new. The `glmnet` package in R gives very fast estimation and good cross-validation methods, and scales up well to large and wide datasets.
A modest sized dataset with collinearity
To build more familiarity with this tool I grabbed a small dataset of 30 observations from my fitbit tracker. I originally downloaded the data from the web browser interface in accordance with the fitbit instructions. For reproducibility I’ve made it available to others on the web.
My aim with this data is to reverse engineer the way fitbit estimates calories burned in a day. I know from their website that this is done based on an estimate of my basal metabolic rate and the activities my tracker records (plus any activities manually logged by me - none in my case).
Basal metabolic rate depends on my gender, age, height and weight, none of which changed in this thirty-day period, so I would expect that component of the calories burned to be constant over time. As well as the calories burned per day, the fitbit provides information on my activities in these variables:
- steps
- distance (estimated from steps)
- floors (ie of stairs and stair-like equivalents - some of this period I walked up hills in Western Australia’s Stirling Ranges)
- sedentary minutes
- lightly active minutes
- fairly active minutes
- very active minutes
Here’s how those seven candidate explanatory variables relate to each other:
From this chart we see a few things of interest:
- `Distance` can be calculated almost exactly, directly from `Steps`
- The number of sedentary minutes in a day is inversely related to the other minutes measured - lightly active, fairly active, and very active
I wondered if the four minutes measures would be perfectly collinear - there’s a set number of minutes each day after all, so if I know how long I’ve been lightly active, fairly active and very active couldn’t I precisely say that the remaining minutes were sedentary? A bit of exploration shows that the four “minutes” measures tend to add up to 15 or 16 hours a day, suggesting that sedentary minutes actually means “sedentary but awake minutes”. So they are not completely collinear because the amount I sleep varies from day to day.
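A quick check along those lines, assuming the data is in a data frame called `fitbit` with column names like those in the downloaded CSV (the exact names here are my assumption), might be:

```r
# Add up the four "minutes" measures for each day and convert to hours.
# The data frame and column names are assumptions, not the original code.
awake_hours <- with(fitbit,
  (Minutes.Sedentary + Minutes.Lightly.Active +
   Minutes.Fairly.Active + Minutes.Very.Active) / 60)

summary(awake_hours)   # tends to sit around 15-16 hours rather than 24
```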
For those who want to follow along in R, here’s how I did that chart with the correlation coefficients in the lower triangle of panels. This adapts an example from the helpfile for the `pairs()` function in base graphics.
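The original code isn’t reproduced here, but a sketch along the same lines, adapting the `panel.cor` example from `?pairs` and assuming the fitbit data frame and column names used above, would be:

```r
# Print the correlation coefficient in the lower triangle of panels,
# adapted from the example in the pairs() helpfile
panel.cor <- function(x, y, digits = 2, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y)
  text(0.5, 0.5, format(r, digits = digits), cex = 1.5)
}

# Column names are assumptions based on the fitbit export
candidate_vars <- c("Steps", "Distance", "Floors", "Minutes.Sedentary",
                    "Minutes.Lightly.Active", "Minutes.Fairly.Active",
                    "Minutes.Very.Active")

pairs(fitbit[, candidate_vars], lower.panel = panel.cor,
      upper.panel = panel.smooth)
```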
The close relationship between steps and distance (a correlation coefficient of 1.00, which means that knowing one you can almost exactly predict the other) isn’t surprising to me, because fitbit’s website makes clear it estimates distance based on steps and stride length. It applies one stride length to convert steps to distance for walking, and another for running (which it can detect, presumably, from the pace and violence of motion), but I know I did virtually no running in this period, so it is only converting walking steps to distance. Here’s the distribution of fitbit’s estimated distance divided by steps:
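That distribution was a chart in the original post; something similar can be drawn along these lines, again assuming a `fitbit` data frame with `Distance` in kilometres and a `Steps` column:

```r
# Metres per step, from fitbit's estimated distance divided by steps.
# Column names are assumptions based on the fitbit export.
metres_per_step <- with(fitbit, Distance * 1000 / Steps)
hist(metres_per_step, breaks = 20,
     main = "Fitbit's implied stride length",
     xlab = "Estimated distance / steps (metres)")
```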
69 calories per 1,000 steps on average
To get a sense of the data I was interested in a super-simple model with a single explanatory variable. “What’s the relationship between the number of steps walked and calories burned, ignoring height gains and intensity of heart rate?” I asked myself. Judging from various sites like this and this, the answer might be about 40-50 calories per thousand steps for walking. Fitbit would use a more sophisticated model than this, taking into account the fact that some of my walking was on tougher terrain and/or up stairs or slopes, which would show up as increased heart rate, so it will be interesting to see its estimate of calories per thousand steps compared to the simple website figures.
The way to do this is just an ordinary least squares estimation of a model with steps as the sole explanatory variable and calories burned as the response. This gives an estimated base rate of 1,926 calories per day, plus an additional 69 calories burned per thousand steps. That figure of 69 is more than the “moderate walking” rates on the various websites, presumably because of the terrain and slope issues mentioned above.
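In R that single-variable model is a one-liner; the call below is a sketch with assumed column names rather than the original code:

```r
# OLS with steps (in thousands) as the only explanatory variable.
# fitbit, Calories.Burned and Steps are assumed names from the downloaded CSV.
mod_simple <- lm(Calories.Burned ~ I(Steps / 1000), data = fitbit)
summary(mod_simple)   # intercept ~ base daily burn; slope ~ calories per 1,000 steps
```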
That figure of 1,926 per day as my “no steps” base calorie burn is also somewhat more than the 1,629 that this website says should be my basal metabolic rate based on the Harris-Benedict equation. I’m not sure what’s going on here - either fitbit uses a different calculation, or perhaps it’s inferring various activities of mine from my heart rate that don’t show up in steps, or something else. Fitbit doesn’t know my real calorie burn rates, so it can’t be anything mysterious to do with my metabolism; it has no way of knowing that, and just estimates based on steps, heart rate, weight, height and so on. Anyway, the simple model above isn’t too bad an estimate of how many calories fitbit says I use; for the record, here’s a comparison of the predicted calories with the residuals from that model:
And just to put one issue to rest, here’s the partial autocorrelation function of the residuals, showing that there’s no particular pattern over time not captured in the model. This is not a particularly interesting chart but I’m including it because it should be standard practice to check for such patterns (which would be shown by one of the vertical bars crossing the horizontal blue lines) when working with data that was captured over a time period.
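That check is also a one-liner, reusing the `mod_simple` object from the sketch above:

```r
# Partial autocorrelation of the residuals - looking for leftover
# day-to-day structure the model hasn't captured
pacf(residuals(mod_simple), main = "Residuals from the steps-only model")
```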
Choose alpha by cross-validation over a grid of candidate values
Now I’m ready to move to my full model, predicting daily calories from all seven available explanatory variables.
The first step is choosing an appropriate alpha - the balance between the extremes of ridge regression and lasso estimation. I chose repeated cross-validation to do this. Basically this means fitting the model many times to slightly different re-samples, at different values of alpha, and using each fitted model to predict the “out of bag” points from the original sample that weren’t in the re-sample. The results show only a weak impact of the choice of alpha:
Here’s the code that did this.
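The original code isn’t shown here; a sketch of the approach with `glmnet`, holding the fold allocation fixed within each repeat so the values of `alpha` are compared on the same splits (the data objects and column names are my assumptions), looks like this:

```r
library(glmnet)

# Response and matrix of the seven candidate explanatory variables;
# `fitbit` and the column names are assumptions based on the export
y <- fitbit$Calories.Burned
X <- as.matrix(fitbit[, c("Steps", "Distance", "Floors",
                          "Minutes.Sedentary", "Minutes.Lightly.Active",
                          "Minutes.Fairly.Active", "Minutes.Very.Active")])

alphas <- seq(0, 1, length.out = 21)
n_repeats <- 30
set.seed(123)

# For each repeat, use the same folds for every alpha so differences between
# alphas aren't confounded with the random fold allocation
cv_errors <- replicate(n_repeats, {
  foldid <- sample(rep(1:10, length.out = nrow(X)))
  sapply(alphas, function(a) {
    min(cv.glmnet(X, y, alpha = a, foldid = foldid)$cvm)
  })
})

plot(alphas, rowMeans(cv_errors), type = "l",
     xlab = "alpha", ylab = "Mean cross-validated error (MSE)")
```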
Elastic net gives more sensible results in the presence of collinearity
Now that I have a preferred value of alpha I can use the elastic net to get estimated values of the eight coefficients (seven explanatory variables plus an intercept), and compare them to the ordinary least squares equivalents. Remember that there’s a second hyperparameter, `lambda`, to estimate too - again I choose this by cross-validation with the chosen value of `alpha`, and following the implied suggestion in the `glmnet` package I try two versions of `lambda` - the level of shrinkage that gives the smallest errors in cross-validation, and the higher level of shrinkage that gets acceptably close (within one standard deviation in the cross-validation process) to that smallest error. This practice is predicated on the idea that a model with a higher degree of shrinkage may be cheaper (in data management or interpretation) than one with less shrinkage, so it is worth taking a small hit to accuracy for it.
Here are the results from the three estimation methods:
As experienced modellers would have expected, the “original” (ie ordinary least squares) method gives non-sensible estimates of the coefficients of the two highly collinear variables, `Steps` and `Distance`. It clearly makes no sense to say that my calorie burn goes down by 67 for each additional 1,000 steps but up by 116 for each kilometre. The elastic net method also struggles with these two, but returns values that are better for interpretation.
Here’s the code for obtaining those coefficient estimates:
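The original code isn’t reproduced here; a sketch of how that comparison table could be built, reusing the `X` and `y` objects from the earlier sketch (the value of `chosen_alpha` below is a placeholder, not the value chosen in the post), follows:

```r
chosen_alpha <- 0.5   # placeholder - use the value suggested by the alpha search
cv_fit <- cv.glmnet(X, y, alpha = chosen_alpha)

# Compare OLS with the elastic net at the two conventional choices of lambda
coef_comparison <- round(data.frame(
  OLS        = as.numeric(coef(lm(y ~ X))),
  lambda.min = as.numeric(as.matrix(coef(cv_fit, s = "lambda.min"))),
  lambda.1se = as.numeric(as.matrix(coef(cv_fit, s = "lambda.1se"))),
  row.names  = c("Intercept", colnames(X))
), 2)
coef_comparison
```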
Even with `lambda = 0`, elastic net gives a different result to OLS due to numerical calculation issues
I thought I would check what happens if `lambda` is set to zero so there is no penalty for the total size of coefficients. In theory, the result should be the same as the ordinary least squares estimate, but in the presence of severe collinearity like we have here (between `Steps` and `Distance`) this turns out not to be the case.
This comes about from numerical issues to do with the estimation method under the hood in `glmnet`, and I won’t try to understand or explain exactly why. If we remove the severe collinearity (which, in a real-life situation, I would have done as the very first step, once I realised `Distance` contains no real information additional to `Steps`), the estimated coefficients are almost identical, as they should be:
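A sketch of that check, dropping `Distance` from the assumed `X` matrix and comparing `glmnet` with `lambda = 0` against `lm`:

```r
# Remove the Distance column, the source of the severe collinearity
X2 <- X[, colnames(X) != "Distance"]

# glmnet with lambda = 0, ie no penalty at all
fit_unpenalised <- glmnet(X2, y, alpha = chosen_alpha, lambda = 0)

round(cbind(
  glmnet_lambda_0 = as.numeric(as.matrix(coef(fit_unpenalised))),
  OLS             = as.numeric(coef(lm(y ~ X2)))
), 2)
```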
In this particular dataset, predictive power is similar in OLS and after elastic net regularisation
To compare the predictive strength of these different models I used the simple bootstrap, where the modelling approach is applied to bootstrap resamples of the data and the estimated model is then used to predict the full original dataset. The elastic net regularized model slightly out-performs the ordinary least squares model with all variables; both significantly out-perform the super-simple one-variable model with only `Steps` on the explanatory side of the formula. Here are the root mean square errors of the three different estimation methods:
Code for the validation process:
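The original validation code isn’t reproduced here; a sketch of the simple bootstrap comparison, reusing the assumed `X`, `y` and `chosen_alpha` objects from the earlier sketches, could look like this:

```r
set.seed(321)
n <- nrow(X)
reps <- 200
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

boot_rmse <- t(replicate(reps, {
  i <- sample(1:n, n, replace = TRUE)        # bootstrap resample

  # OLS with all seven variables, fitted to the resample, predicting all the data
  ols_fit  <- lm(y[i] ~ X[i, ])
  ols_pred <- cbind(1, X) %*% coef(ols_fit)

  # Elastic net, with lambda chosen by cross-validation within the resample
  enet_fit  <- cv.glmnet(X[i, ], y[i], alpha = chosen_alpha)
  enet_pred <- predict(enet_fit, newx = X, s = "lambda.min")

  # Super-simple steps-only OLS model
  simple_fit  <- lm(y[i] ~ X[i, "Steps"])
  simple_pred <- cbind(1, X[, "Steps"]) %*% coef(simple_fit)

  c(OLS         = rmse(y, ols_pred),
    elastic_net = rmse(y, enet_pred),
    steps_only  = rmse(y, simple_pred))
}))

round(colMeans(boot_rmse), 1)
```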
“Floors” not needed for a good prediction of calories
One substantive issue of interest to me that I haven’t mentioned is the relationship of “floors climbed” to total calories. This is of interest because it’s the sort of thing it’s easy to set a personal target for (“steps” is similar). However, none of the estimation methods showed significant evidence for `Floors` as an explanatory variable. Basically, the “minutes active” variables (at various levels of intensity) contain all the information needed by fitbit to make its estimates of calories. Walking up stairs contributes to more active minutes, as shown by heart rate, but other than through that it is not directly informative in itself.
This can be illustrated by what has become one of the standard plots for this sort of estimation method - a graphic showing how the size of the coefficient for each variable increases as the penalty against coefficients is relaxed:
We see from this that even the fullest model leaves a zero coefficient for variable #3, `Floors` - the horizontal green line. `Minutes.Very.Active` is the second-last variable to be allowed to increase from zero - it contains relatively little information after all the other “minutes” variables are included.
Code for that graphic:
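The original code isn’t shown; a sketch using `glmnet`’s built-in plot method, with the assumed `X`, `y` and `chosen_alpha` objects from above, would be:

```r
# Fit the full regularization path and plot coefficient sizes against the
# (log) penalty; label = TRUE prints the variable numbers at the margin
fit_path <- glmnet(X, y, alpha = chosen_alpha)
plot(fit_path, xvar = "lambda", label = TRUE)
```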