Individual-level estimates from survey data
I was motivated by web apps like the British Office for National Statistics’ How well do you know your area? and How well does your job pay? to see if I could turn the New Zealand Income Survey into an individual-oriented estimate of income given age group, qualification, occupation, ethnicity, region and hours worked. My tentative go at this is embedded below, and there’s also a full screen version available.
The job’s a tricky one because the survey data available doesn’t go to anywhere near that level of granularity. It could be done with census data of course, but any such effort to publish would come up against confidentiality problems - there are just too few people in any particular combination of categories to release real data there. So some kind of modelling is required that can smooth over the actual data but still give a plausible and realistic estimate.
I also wanted to emphasise the distribution of income, not just a single measure like mean or median - something I think that we statisticians should do much more than we do, with all sorts of variables. And in particular I wanted to find a good way of dealing with the significant number of people in many categories (particularly but not only “no occupation”) who have zero income; and also the people who have negative income in any given week.
My data source is the New Zealand Income Survey 2011 simulated record file published by Statistics New Zealand. An earlier post by me describes how I accessed this, normalised it and put it into a database. I’ve also written several posts about dealing with the tricky distribution of individual incomes, listed here under the “NZIS2011” heading.
This is a longer post than usual, with a digression into the use of Random Forests (tm) to predict continuous variables, an attempt at producing a more polished plot of a regression tree than usually available, and some reflections on strengths and weaknesses of several different approaches to estimating distributions.
Data import and shape
I begin by setting up the environment and importing the data I’d placed in the database in that earlier post. There’s a big chunk of R packages needed for all the things I’m doing here. I also re-create some helper functions for transforming skewed continuous variables that include zero and negative values, which I first created in another post back in September 2015.
Importing the data is a straightforward SQL query, with some reshaping required because survey respondents were allowed to specify either one or two ethnicities. This means I need an indicator column for each individual ethnicity if I’m going to include ethnicity in any meaningful way (for example, an “Asian” column with “Yes” or “No” for each survey respondent). Wickham’s dplyr and tidyr packages handle this sort of thing easily.
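To make the reshaping concrete, here is a minimal sketch of turning one-row-per-ethnicity records into one indicator column per ethnicity. It is written in plain Python rather than the dplyr and tidyr code used for the post, and the respondent ids, ethnicity labels and the function name `ethnicity_indicators` are all made up for illustration.

```python
# Hypothetical long-format records: one row per (respondent, ethnicity) pair.
# A respondent who gave two ethnicities appears twice.
long_rows = [
    (1, "European"),
    (2, "Maori"),
    (2, "European"),
    (3, "Asian"),
]


def ethnicity_indicators(rows):
    """Return {respondent_id: {ethnicity: "Yes" or "No"}}, one key per ethnicity."""
    ethnicities = sorted({eth for _, eth in rows})
    wide = {}
    for respondent_id, eth in rows:
        # Start every respondent with "No" for all ethnicities, then flip to "Yes".
        flags = wide.setdefault(respondent_id, {e: "No" for e in ethnicities})
        flags[eth] = "Yes"
    return wide


wide = ethnicity_indicators(long_rows)
```

Each respondent ends up with exactly one row, however many ethnicities they reported, which is the shape a regression or tree model needs.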
After reshaping ethnicity and transforming the income data into something a little less skewed (so measures of prediction accuracy like root mean square error are not going to be dominated by the high values), I split my data into training and test sets, with 80 percent of the sample in the training set.
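As a rough illustration of those two steps, the sketch below applies a signed power transform and then makes an 80/20 split. It is plain Python, not the R helper functions from the earlier post; the exact form of the transform (a signed square root with `lam=0.5`), the seed and the function names are assumptions made for the example.

```python
import math
import random


def signed_power(x, lam=0.5):
    """Shrink large magnitudes while keeping the sign; zero stays at zero."""
    return math.copysign(abs(x) ** lam, x)


def signed_power_inverse(y, lam=0.5):
    """Reverse signed_power, returning to the original (dollar) scale."""
    return math.copysign(abs(y) ** (1 / lam), y)


def train_test_split(rows, train_fraction=0.8, seed=123):
    """Shuffle a copy of rows and cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Made-up weekly incomes, including a negative and a zero value.
incomes = [-200.0, 0.0, 150.0, 507.0, 994.0, 4000.0]
transformed = [signed_power(x) for x in incomes]

# Stand-in for survey rows: 100 row ids split 80/20.
train, test = train_test_split(list(range(100)))
```

The point of a transform like this is that a $4,000 week no longer counts for vastly more than a $400 week when squared errors are added up.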
Modelling income
The first job is to get a model that can estimate income for any arbitrary combination of the explanatory variables hours worked, occupation, qualification, age group, ethnicity x 7 and region. I worked through five or six different ways of doing this before eventually settling on Random Forests which had the right combination of convenience and accuracy.
Regression tree
My first crude baseline is a single regression tree. I didn’t seriously expect this to work particularly well, but treated it as an interim measure before moving to a random forest. I use the train() function from the caret package to determine the best value for the complexity parameter cp - the minimum improvement in overall R-squared needed before a split is made. The best single tree is shown below.
One nice feature of regression trees - so long as they aren’t too large to see all at once - is usually their easy interpretability. Unfortunately this goes a bit by the wayside because I’m using a transformed version of income, and the tree is returning the mean of that transformed version. When I reverse the transform back into dollars I get a dollar number that is in effect the squared mean of the square root of the original income in a particular category; which happens to generally be close to the median, hence the somewhat obscure footnote in the bottom right corner of the plot above. It’s a reasonable measure of the centre in any particular group, but not one I’d relish explaining to a client.
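A small worked example may help with that "squared mean of the square root" measure. The incomes below are invented, and the sketch is in Python rather than R; it simply compares the back-transformed centre with the ordinary mean and median for one right-skewed group.

```python
import statistics

# Invented weekly incomes for one group, right-skewed with a zero.
weekly_income = [0, 50, 200, 400, 500, 650, 900, 1500, 4000]

mean_income = statistics.mean(weekly_income)
median_income = statistics.median(weekly_income)

# Mean on the square-root scale, squared to get back to dollars.
back_transformed = statistics.mean(x ** 0.5 for x in weekly_income) ** 2
```

For these numbers the back-transformed value sits between the median and the mean, and nearer the median. It can never exceed the ordinary mean, because the mean of square roots is at most the square root of the mean.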
Following the tree through, we see that
- the overall centre of the data is $507 income per week
- for people who work less than 23 hours, it goes down to $241; and those who work 23 or more hours receive $994.
- of those who work few hours, if they are a community and personal service worker, labourer, no occupation, or residual category occupation their average income is $169, and for all other occupations it is $477.
- of those people who work few hours and are in the low paying occupations (including no occupation), those aged 15 - 19 receive $28 per week and those in other categories $214 per week.
- and so on.
It takes a bit of effort to look at this plot and work out what is going on (and the abbreviated occupation labels don’t help, sorry), but it’s possible once you’ve got the hang of it. Leftwards branches always receive less income than rightwards branches; the split is always done on only one variable at a time, and the leftwards split label is slightly higher on the page than the rightwards split label.
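To show what "one variable at a time" means mechanically, here is a minimal Python sketch of how a regression tree might choose a single split on a numeric variable by minimising the sum of squared errors. It is not how rpart is implemented internally, and the hours and income figures are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after_split) for the best binary split on x.

    Observations with x < threshold go left, the rest go right.
    """
    best = (None, sse(y))
    for threshold in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up data: transformed income jumps once hours pass the low twenties.
hours = [0, 5, 10, 20, 25, 30, 40, 45]
income = [10, 12, 14, 15, 30, 31, 33, 35]
threshold, error = best_split(hours, income)
```

A full tree repeats this search over every candidate variable within each resulting group, stopping when the improvement falls below the complexity parameter.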
Trees are a nice tool for this sort of data because they can capture fairly complex interactions in a very flexible way. Where they’re weaker is in dealing with relationships between continuous variables that can be smoothly modelled by simple arithmetic - that’s when more traditional regression methods, or model-tree combinations, prove useful.
The code that fitted and plotted this tree (using the wonderful and not-used-enough prp() function that allows considerable control and polish of rpart trees) is below.
(Note - in working on this post I was using at different times several different machines, including some of it on a Linux server which is much easier than Windows for parallel processing. I’ve commented out the Linux-only bits of code so it should all be fully portable.)
The success rates of the various modelling methods in predicting income in the test data I put aside will be shown all in one part of this post, later.
A home-made random spinney (not forest…)
Regression trees have high variance. Basically, they are unstable, and vulnerable to influential small pockets of data changing them quite radically. The solution to this problem is to generate an ensemble of different trees and take the average prediction. The two most commonly used methods are:
- “bagging” or bootstrap aggregation, which involves resampling from the data and fitting trees to the resamples
- Random Forests (trademark of Breiman and Cutler), which resample rows from the data and also restrict each split to a different random subset of the variables.
Gradient boosting can also be seen as a variant in this class of solutions but I think takes a sufficiently different approach for me to leave it to further down the post.
Bagging is probably an appropriate method here given the relatively small number of explanatory variables, but to save space in an already grossly over-long post I’ve left it out.
Random Forests (tm) are a subset of the broader group of ensemble tree techniques known as “random decision forests”, and I set out to explore one variant of random decision forests visually (I’m a very visual person - if I can’t make a picture or movie of something happening I can’t understand it). The animation below shows an ensemble of 50 differing trees, where each tree was fitted to a set of data sampled with replacement from the original data, and each tree was also restricted to just three randomly chosen variables. Note that this differs from a Random Forest, where the restriction differs for each split within a tree, rather than being a restriction for the tree as a whole.
Here’s how I generated my spinney of regression trees. Some of this code depends on a particular folder structure. The basic strategy is to
- work out which variables have the most crude explanatory power
- subset the data
- subset the variables, choosing those with good explanatory power more often than the weaker ones
- use cross-validation to work out the best tuning for the complexity parameter
- fit the best tree possible with our subset of data and variables
- draw an image, with appropriate bits of commentary and labelling added to it, and save it for later
- repeat the above 50 times, and then knit all the images into an animated GIF using ImageMagick.
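The resampling part of that strategy can be sketched as follows. This is an illustrative Python version, not the R code behind the animation: the variable names, the weights standing in for "crude explanatory power" and both function names are hypothetical, and the tree fitting and plotting steps are left out.

```python
import random


def weighted_sample_without_replacement(items, weights, k, rng):
    """Pick k distinct items, favouring those with larger weights."""
    pool = list(zip(items, weights))
    chosen = []
    for _ in range(k):
        total = sum(w for _, w in pool)
        pick = rng.uniform(0, total)
        cumulative = 0.0
        for i, (item, w) in enumerate(pool):
            cumulative += w
            if pick <= cumulative:
                chosen.append(item)
                del pool[i]
                break
        else:
            # Guard against floating point edge cases: take the last item.
            chosen.append(pool.pop()[0])
    return chosen


def spinney_resamples(n_rows, variables, weights, n_trees=50, n_vars=3, seed=42):
    """For each tree, return (bootstrap row indices, chosen variables)."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_trees):
        # Rows are sampled with replacement; variables are fixed per tree.
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        chosen = weighted_sample_without_replacement(variables, weights, n_vars, rng)
        plans.append((rows, chosen))
    return plans


variables = ["hours", "occupation", "qualification", "age", "region"]
weights = [5.0, 4.0, 2.0, 2.0, 1.0]  # hypothetical crude explanatory power
plans = spinney_resamples(n_rows=20, variables=variables, weights=weights)
```

Because the three variables are chosen once per tree, some trees will never see the strongest predictors at all, which is the key difference from a genuine Random Forest.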
Random Forest
Next model to try is a genuine Random Forest (tm). As mentioned above, a Random Forest is an ensemble of regression trees, where each tree is fitted to a resample with replacement (variations are possible) of the original data, and each split in the tree is only allowed to choose from a subset of the variables available. To do this I used the randomForest R package, but it’s not efficiently written and is really pushing its limits with data of this size on modest hardware like mine. For classification problems the amazing open source H2O (written in Java but binding nicely with R) gives super-efficient and scalable implementations of Random Forests and of deep learning neural networks, but it doesn’t work with a continuous response variable.
Training a Random Forest requires you to specify how many explanatory variables to make available at each split, and the best way to decide this is via cross-validation.
Cross-validation is all about splitting the data into a number of different training and testing sets, to get around the problem of using a single hold-out test set for multiple purposes. It’s better to give each bit of the data a turn as the hold-out test set. In the tuning exercise below, I divide the data into ten so I can try different values of the mtry parameter in my randomForest fitting and see the average Root Mean Square Error for the ten fits for each value of mtry. mtry defines the number of variables the tree building algorithm has available to it at each split of the tree. For forests with a continuous response variable like mine, the default value is the number of variables divided by three and I have 10 variables, so I try a range of options from 2 to 6 as the subset of variables for the tree to choose from at each split. It turns out the conventional default value of mtry = 3 is in fact the best.
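The general shape of a home-made ten-fold cross-validation looks something like the Python sketch below. To keep it self-contained, a tiny nearest-neighbour averager stands in for the random forest and its number of neighbours stands in for mtry; the data, seeds and function names are all invented for illustration.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean square error of predictions against actual values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def k_fold_indices(n, k=10, seed=1):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, fit, candidates, k=10):
    """Average hold-out RMSE over k folds for each candidate tuning value."""
    folds = k_fold_indices(len(y), k)
    scores = {}
    for value in candidates:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train_x = [x[i] for i in range(len(y)) if i not in held]
            train_y = [y[i] for i in range(len(y)) if i not in held]
            predict = fit(train_x, train_y, value)
            preds = [predict(x[i]) for i in held_out]
            fold_errors.append(rmse([y[i] for i in held_out], preds))
        scores[value] = sum(fold_errors) / k
    return scores


def fit_knn(train_x, train_y, n_neighbours):
    """Stand-in model (NOT a random forest): mean of the nearest neighbours."""
    pairs = list(zip(train_x, train_y))

    def predict(x0):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x0))[:n_neighbours]
        return sum(p[1] for p in nearest) / len(nearest)

    return predict


rng = random.Random(0)
x = [float(i) for i in range(60)]
y = [2 * xi + rng.gauss(0, 3) for xi in x]
candidates = [1, 3, 5, 15]
scores = cross_validate(x, y, fit_knn, candidates)
best = min(scores, key=scores.get)
```

Every row is held out exactly once for each candidate value, so the averaged error is a fairer guide to the tuning parameter than a single train and test split would be.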
Here’s the code for this home-made cross-validation of randomForest:
Having determined a value for mtry of three variables to use at each split, we re-fit the Random Forest with the full training dataset. It’s interesting to see the “importance” of the different variables - which ones make the most contribution to the most trees in the forest. This is the best way of relating a Random Forest to a theoretical question; otherwise their black box nature makes them harder to interpret than a more traditional regression with its t tests and confidence intervals for each explanatory variable.
It’s also good to note that after the first 300 or so trees, increasing the size of the forest seems to have little impact.
Here’s the code that fits this forest to the training data and draws those plots:
Extreme gradient boosting
I wanted to check out extreme gradient boosting as an alternative prediction method. Like Random Forests, this method is based on a forest of many regression trees, but in the case of boosting each tree is relatively shallow (not many layers of branch divisions), and the trees are not independent of each other. Instead, successive trees are built specifically to explain the observations poorly explained by previous trees - this is done by giving extra weight to outliers from the prediction to date.
Boosting is prone to over-fitting and if you let it run long enough it will memorize the entire training set (and be useless for new data), so it’s important to use cross-validation to work out how many iterations are worth using and at what point the model stops picking up general patterns and starts fitting just the idiosyncrasies of the training sample data. The excellent xgboost R package by Tianqi Chen, Tong He and Michael Benesty applies gradient boosting algorithms super-efficiently and comes with built in cross-validation functionality. In this case it becomes clear that 15 or 16 rounds is the maximum boosting before overfitting takes place, so my final boosting model is fit to the full training data set with that number of rounds.
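For intuition about rounds and over-fitting, here is a toy Python sketch of gradient boosting for squared error, where each round fits a one-split tree (a "stump") to the residuals left by the previous rounds. It is a simplified stand-in and not xgboost's algorithm or API; the data, learning rate and a single validation set (instead of full cross-validation) are assumptions made to keep it short.

```python
import math
import random


def fit_stump(x, residuals):
    """Best one-split tree on a single numeric variable, by squared error."""
    best = None
    for threshold in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < threshold]
        right = [r for xi, r in zip(x, residuals) if xi >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x0: left_mean if x0 < threshold else right_mean


def boost(x, y, n_rounds, learning_rate=0.3):
    """Fit n_rounds stumps, each to the residuals of the ensemble so far."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]

    def predict(x0, rounds=None):
        used = stumps if rounds is None else stumps[:rounds]
        return base + learning_rate * sum(s(x0) for s in used)

    return predict


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Invented data: a step plus a gentle slope plus noise.
rng = random.Random(3)
x_all = list(range(40))
y_all = [(10 if xi < 20 else 30) + 0.2 * xi + rng.gauss(0, 2) for xi in x_all]
x_train, y_train = x_all[0::2], y_all[0::2]
x_valid, y_valid = x_all[1::2], y_all[1::2]

max_rounds = 40
predict = boost(x_train, y_train, max_rounds)
training_rmse = {r: rmse(y_train, [predict(xi, r) for xi in x_train])
                 for r in range(1, max_rounds + 1)}
validation_rmse = {r: rmse(y_valid, [predict(xi, r) for xi in x_valid])
                   for r in range(1, max_rounds + 1)}
best_rounds = min(validation_rmse, key=validation_rmse.get)
```

Training error can only fall as rounds are added, so it is the held-out error that tells you when to stop.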
Two stage Random Forests
My final serious candidate for a predictive model is a two stage Random Forest. One of my problems with this data is the big spike at $0 income per week, and this suggests modelling it in two steps:
- first, fit a classification model to predict the probability of an individual, based on their characteristics, having any income at all
- second, fit a regression model, conditional on them getting any income and trained only on those observations with non-zero income, to predict the size of their income (which may be positive or negative).
The individual models could be chosen from many options but I’ve opted for Random Forests in both cases. Because the first stage is a classification problem, I can use the more efficient H2O platform to fit it - much faster.
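The way the two stages combine can be shown with a deliberately simple Python sketch. Here plain group proportions and group means stand in for the two Random Forests, and the groups and incomes are invented; only the structure (a probability of non-zero income, then a conditional size) mirrors the approach described above.

```python
from collections import defaultdict


def fit_two_stage(groups, incomes):
    """Return {group: (P(income != 0), mean income given income != 0)}."""
    counts = defaultdict(int)
    nonzero = defaultdict(list)
    for g, inc in zip(groups, incomes):
        counts[g] += 1
        if inc != 0:
            nonzero[g].append(inc)
    model = {}
    for g, n in counts.items():
        p = len(nonzero[g]) / n
        cond_mean = sum(nonzero[g]) / len(nonzero[g]) if nonzero[g] else 0.0
        model[g] = (p, cond_mean)
    return model


def expected_income(model, group):
    """Overall expectation: probability of any income times its conditional mean."""
    p, cond_mean = model[group]
    return p * cond_mean


# Invented data: weekly income can be zero, positive or negative.
groups = ["no occupation"] * 4 + ["manager"] * 4
incomes = [0, 0, 0, 200, 0, 900, 1100, -100]
model = fit_two_stage(groups, incomes)
```

Keeping the two pieces separate is what later makes it possible to simulate a spike at exactly zero rather than only an average.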
Traditional regression methods
As a baseline, I also fit three more traditional linear regression models:
- one with all variables
- one with all variables and many of the obvious two way interactions
- a stepwise selection model.
I’m not a big fan of stepwise selection for all sorts of reasons but if it is done carefully, and you refrain from interpreting the final model as though it was specified in advance (which virtually everyone gets wrong), it has its place. It’s certainly a worthwhile comparison point as stepwise selection still prevails in many fields despite development in recent decades of much better methods of model building.
Here’s the code that fit those ones:
Results - predictive power
I used root mean square error of the predictions of (transformed) income in the hold-out test set - which had not been touched so far in the model-fitting - to get an assessment of how well the various methods perform. The results are shown in the plot below. Extreme gradient boosting and my two stage Random Forest approaches are neck and neck, followed by the single tree and the random decision forest, with the traditional linear regressions making up the “also rans”.
I was surprised to see that a humble single regression tree out-performed my home made random decision forest, but concluded that this is probably something to do with the relatively small number of explanatory variables to choose from, and the high performance of “hours worked” and “occupation” in predicting income. A forest (or spinney…) that excludes those variables from whole trees at a time will be dragged down by trees with very little predictive power. In contrast, Random Forests choose from a random subset of variables at each split, so excluding hours from the choice in one split doesn’t deny it to future splits in the tree, and the tree as a whole still makes a good contribution.
It’s useful to compare at a glance the individual-level predictions of all these different models on some of the hold-out set, and I do this in the scatterplot matrix below. The predictions from different models are highly correlated with each other (correlation of well over 0.9 in all cases), and less strongly correlated with the actual income. This difference is caused by the fact that the observed income includes individual level random variance, whereas all the models are predicting some kind of centre value for income given the various demographic values. This is something I come back to in the next stage, when I want to predict a full distribution.
Here’s the code that produces the predicted values of all the models on the test set and produces those summary plots:
Building the Shiny app
There are a few small preparatory steps now before I can put the results of my model into an interactive web app, which will be built with Shiny.
I opt for the two stage Random Forest model as the best way of re-creating the income distribution. It will let me create simulated data with a spike at zero dollars of income in a way none of the other models (which focus just on averages) will do; plus it is equal best (with extreme gradient boosting) in overall predictive power.
Adding back in individual level variation
After refitting my final model to the full dataset, my first substantive problem is to recreate the full distribution, with individual level randomness, not just a predicted value at each point. On my transformed scale for income, the residuals from the models are fairly homoskedastic, so I decided that the Shiny app will simulate a population at any point by sampling with replacement from the residuals of the second stage model.
I save the models, the residuals, and the various dimension variables for my Shiny app.
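A minimal Python sketch of that simulation idea follows. The probability, the predicted value, the residuals and the signed-square back-transform are all assumptions for illustration; the app itself is written in R with Shiny and uses the fitted models.

```python
import math
import random


def simulate_population(p_nonzero, predicted_transformed, residuals, n=1000, seed=7):
    """Simulate n weekly incomes for one combination of demographics.

    p_nonzero: stage-one probability of having any income at all.
    predicted_transformed: stage-two prediction on the transformed scale.
    residuals: stage-two residuals (transformed scale), resampled with replacement.
    """
    rng = random.Random(seed)
    simulated = []
    for _ in range(n):
        if rng.random() >= p_nonzero:
            simulated.append(0.0)  # the spike at zero income
            continue
        value = predicted_transformed + rng.choice(residuals)
        # Assumed back-transform: signed square, reversing a signed square root.
        simulated.append(math.copysign(value ** 2, value))
    return simulated


# Hypothetical inputs for a single person type.
sim = simulate_population(
    p_nonzero=0.8,
    predicted_transformed=25.0,
    residuals=[-30.0, -5.0, 0.0, 5.0, 10.0],
    n=2000,
)
zero_share = sum(1 for v in sim if v == 0.0) / len(sim)
```

Resampling residuals only makes sense because they look roughly homoskedastic on the transformed scale; if their spread varied a lot with the predicted value, the same pool of residuals would not suit every person type.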
Contextual information - how many people are like “that” anyway?
After my first iteration of the web app, I realised that it could be badly misleading by giving a full distribution for a non-existent combination of demographic variables. For example, Maori female managers aged 15-19 with Bachelor or Higher qualification and living in Southland (predicted to have median weekly income of $932 for what it’s worth).
I realised that for meaningful context I needed a model that estimated the number of people in New Zealand with the particular combination of demographics selected. This is something that traditional survey estimation methods don’t provide, because individuals in the sample are weighted to represent a discrete number of exactly similar people in the population; there’s no “smoothing” impact allowing you to widen inferences to similar but not-identical people.
Fortunately this problem is simpler than the income modelling problem above and I use a straightforward generalized linear model with a Poisson response to create the seeds of such a model, with smoothed estimates of the number of people for each combination of demographics. I then can use iterative proportional fitting to force the marginal totals for each explanatory variable to match the population totals that were used to weight the original New Zealand Income Survey. Explaining this probably deserves a post of its own, but no time for that now.
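Until that post exists, here is a small Python sketch of iterative proportional fitting on a two-way table. The seed counts and marginal totals are made up, and the real problem has many more dimensions than two; the sketch only shows the alternating row and column scaling.

```python
def ipf(seed_table, row_targets, col_targets, iterations=50):
    """Scale a two-way table until its margins match the target totals.

    seed_table: smoothed starting estimates (for example, from a Poisson GLM).
    row_targets, col_targets: known population totals with equal grand totals.
    """
    table = [row[:] for row in seed_table]
    for _ in range(iterations):
        # Scale each row to hit its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            table[i] = [cell * target / total for cell in table[i]]
        # Then scale each column to hit its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
    return table


# Hypothetical seed estimates and population margins.
seed = [[10.0, 20.0], [30.0, 40.0]]
row_targets = [40.0, 60.0]
col_targets = [45.0, 55.0]
fitted = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the pattern of association in the seed estimates while its row and column totals agree with the population totals.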
The final shiny app
- The full screen version of the web app
- The source code