In this post I simulate a population and a complex survey of it. The survey has stratification, two-stage sampling and post-collection calibration to marginal population totals. The original idea was to follow up on last week’s post on relative risk ratios and odds ratios, and in particular to explore the use of the quasipoisson(log) family compared to the quasibinomial(log) family when modelling a binary outcome with categorical data.
In the end, the more interesting thing (for me) was the general challenge of simulating a population and complex survey process and doing some basic comparison of the success of different estimation strategies. In particular, I was interested in how different the results are if we treat the sample with appropriate complex survey methods (ie Thomas Lumley’s survey R package) in contrast to naively fitting the same model while ignoring the cluster sampling, stratification, post-stratification, etc.
To cut to the chase, this chart compares those two approaches fit to 100 different samples of 2,100 individuals:
What we see is that there isn’t much to choose between them when it comes to point estimates. The mean squared errors of the point estimates of effect sizes from both methods, compared to fitting the model to the full population of 1 million people, are similar. However, the complex survey methods do better in terms of confidence intervals. The 95% confidence intervals delivered by survey methods contained the true values 94% of the time, nearly as good as advertised; whereas the naive method’s confidence intervals contained the true values only 89% of the time.
Simulating data
I wanted to simulate data that resembled real-life survey data that I often work with - mostly categorical values, some unknown variables, some reasonable models of behaviour available but certainly at least partly mis-specified. The one complication I left out for now was missing data.
I created data in 14 regions, with 100 neighbourhoods or PSUs (primary sampling units) in each region that were going to be used for cluster sampling. I made variables for people’s “shape” (circle, square, triangle or hexagon), “colour” (red, blue, etc), a mystery unobserved variable, an unobserved latent continuous variable that depends on all the other variables mentioned so far, and a single target or response variable of interest called y, which is TRUE or FALSE with probability depending on that latent variable.
In the end we are going to use these variables in these ways:
- We’ll use region as strata and psu as the primary sampling unit in a cluster sampling approach
- We’ll use the marginal population counts by region, shape and colour in a weighting scheme
- We’ll use region, shape and colour as explanatory variables in a model with y. This model is slightly mis-specified because it doesn’t include the “mystery variable”, which is unobserved, and the relationship of the variables to y isn’t exactly how such variables normally work together in a logistic regression.
Once I’ve generated the data, here is how it looks for the full population of one million:
Or here is the correlation matrix, showing the relationships of the underlying unobserved continuous variables (rather than their categorical manifestations) with each other and with y:
|             | mystery_var | region_y | psu_y | shape_y | colour_y | latent | y    |
|-------------|-------------|----------|-------|---------|----------|--------|------|
| mystery_var | 1.00        | 0.00     | 0.02  | 0.00    | 0.01     | 0.09   | 0.01 |
| region_y    | 0.00        | 1.00     | 0.00  | 0.00    | -0.01    | 0.31   | 0.07 |
| psu_y       | 0.02        | 0.00     | 1.00  | 0.01    | -0.01    | 0.73   | 0.18 |
| shape_y     | 0.00        | 0.00     | 0.01  | 1.00    | 0.07     | 0.55   | 0.11 |
| colour_y    | 0.01        | -0.01    | -0.01 | 0.07    | 1.00     | 0.23   | 0.05 |
| latent      | 0.09        | 0.31     | 0.73  | 0.55    | 0.23     | 1.00   | 0.22 |
| y           | 0.01        | 0.07     | 0.18  | 0.11    | 0.05     | 0.22   | 1.00 |
For example, region_y in the above is a variable that takes a given continuous value for each of regions A, B, C, etc. The value of region_y isn’t observed; I just generate it as part of the simulation process. The underlying value of region_y has a correlation of 0.31 with the unobserved underlying value of latent. latent itself is the direct driver of the probability of y being TRUE, but y is a random variable, so the correlation of region_y with y is 0.07 (ie less than 0.31) - still more than zero, but not as strong as region_y’s correlation with the unobserved probability of y.
psu_y has the highest correlation with y, and I did this by design - I wanted there to be a lot of intra-class correlation at the PSU level, to give distinction to the complex survey sampling and estimation strategy I’m exploring here. So knowing the PSU (which I think of as a suburb or meshblock) tells you a lot about whether an individual has y or not - more so than the individual’s shape or colour. Perhaps y is some highly spatially correlated indicator like a contagious disease.
To sum up - in my simulated population, there is a definite relationship between each categorical variable and y, via unobserved latent continuous variables, but it’s not dramatic.
I wrote an R function to generate this population; even though I only use it once in this blog post, it was very useful to have it as a function with clear parameter choices when I was playing around.
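To give an idea of the sort of thing involved, here is a minimal sketch of what such a generator could look like. The function name make_population, the parameter choices and the distributional details are illustrative assumptions rather than the exact code behind the results above; in particular, the sketch links latent to y with a plain logistic curve, whereas my real generator deliberately departs from a textbook logistic relationship.

```r
library(dplyr)

make_population <- function(n = 1e6, n_regions = 14, n_psu = 100, seed = 42) {
  set.seed(seed)
  regions <- LETTERS[1:n_regions]

  # unequal region sizes that add up to the total population
  region_size <- as.vector(rmultinom(1, n, prob = runif(n_regions, 0.5, 1.5)))

  # unobserved continuous effects for each level of the categorical variables
  shape_lvls  <- c("circle", "square", "triangle", "hexagon")
  colour_lvls <- c("red", "blue", "green", "yellow")
  region_eff  <- setNames(rnorm(n_regions, sd = 0.5), regions)
  shape_eff   <- setNames(rnorm(4, sd = 1),   shape_lvls)
  colour_eff  <- setNames(rnorm(4, sd = 0.4), colour_lvls)

  tibble(region = rep(regions, times = region_size)) %>%
    group_by(region) %>%
    # assign everyone to one of the PSUs (neighbourhoods) in their region
    mutate(psu = paste(region, sample(1:n_psu, n(), replace = TRUE), sep = "_")) %>%
    group_by(psu) %>%
    # strong PSU-level effect, to create plenty of intra-class correlation
    mutate(psu_y = rnorm(1, sd = 1.5)) %>%
    ungroup() %>%
    mutate(
      shape       = sample(shape_lvls,  n(), replace = TRUE),
      colour      = sample(colour_lvls, n(), replace = TRUE),
      region_y    = region_eff[region],
      shape_y     = shape_eff[shape],
      colour_y    = colour_eff[colour],
      mystery_var = rnorm(n()),
      latent      = region_y + psu_y + shape_y + colour_y + 0.2 * mystery_var,
      # plain logistic link used here for simplicity only; stored as a factor
      # with levels FALSE / TRUE
      y           = factor(runif(n()) < plogis(latent - 1))
    )
}

population <- make_population()
```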
Sampling and processing
Next step is to write a function that takes a two-stage cluster-based sample from this population. This means we first select a sample of clusters (in this case psu), and then from those clusters we select a sample of individuals. I did this in a way that selects a set number of PSUs (defaults to 10, but for most of this blog I use 5) from each and every region, then a set number of people (15 people if we have 10 PSUs per region, 30 people if we have 5) per PSU to bring the total sample size up to 2,100. That feels like an unremarkable sample size for the sorts of surveys I’ve been looking at this year.
Because the regions have quite different sizes, and the PSUs have slightly different sizes, there are many different probabilities of an individual being selected. Everyone has a non-zero chance of selection, which we need to calculate for subsequent weighting purposes. There are R packages that will perform this sort of sampling directly for you, but I’ve just done it from scratch in the code below for clarity.
Once I’ve got the sample, I:
- use the inverse of the selection probability as initial weights
- create a set of jackknife replicate weights suitable for stratified sample designs
- calibrate each set of weights to match the marginal population counts for each value of region, shape and colour.
If replicate weights are new to you, they’re worth getting on top of. When we use the jackknife or bootstrap with simple random samples, we create new samples in quite a simple way, by resampling with replacement from the original sample. When we have a complex survey, we instead create (and save) new sets of weights reflecting which points are in and which are out, and the new calibrated survey weights for that particular resample. Compared to doing all the resampling from scratch each time you do an analysis, this puts a lot of the computational load up front in the creation of multiple sets of (expensively) calibrated weights. These can be re-used repeatedly, for reproducibility and efficiency purposes.
Here’s my function, which as well as containing my home-made sampling algorithm, uses Lumley’s survey package to specify the design, replicate weights, and calibration to population totals for the analysis stage:
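In outline it looks something like the sketch below. The helper columns and exact argument choices are illustrative assumptions rather than a verbatim copy of my function, but svydesign, as.svrepdesign and calibrate are the survey package functions doing the real work.

```r
library(survey)
library(dplyr)

samp <- function(population, npsu = 10, n_total = 2100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n_regions      <- length(unique(population$region))
  people_per_psu <- round(n_total / (n_regions * npsu))

  # population counts needed for selection probabilities
  psu_sizes    <- population %>% count(region, psu, name = "N_psu")
  region_npsus <- psu_sizes  %>% count(region, name = "npsu_in_region")

  # stage 1: a fixed number of PSUs per region, equal probability within region
  chosen_psus <- psu_sizes %>%
    group_by(region) %>%
    slice_sample(n = npsu) %>%
    ungroup()

  # stage 2: a fixed number of people from each chosen PSU
  the_sample <- population %>%
    inner_join(chosen_psus, by = c("region", "psu")) %>%
    group_by(psu) %>%
    slice_sample(n = people_per_psu) %>%
    ungroup() %>%
    left_join(region_npsus, by = "region") %>%
    mutate(
      prob_psu    = npsu / npsu_in_region,        # P(this PSU selected)
      prob_person = people_per_psu / N_psu,       # P(person selected | PSU selected)
      wt          = 1 / (prob_psu * prob_person)  # initial weight = inverse selection probability
    )

  # specify the design: region as strata, psu as primary sampling unit
  des <- svydesign(ids = ~psu, strata = ~region, weights = ~wt,
                   data = the_sample, nest = TRUE)

  # jackknife replicate weights suitable for a stratified design
  des <- as.svrepdesign(des, type = "JKn")

  # calibrate every set of weights to the marginal population totals
  # for region, shape and colour
  pop_totals <- colSums(model.matrix(~ region + shape + colour, data = population))
  des <- calibrate(des, formula = ~ region + shape + colour, population = pop_totals)

  return(des)
}
```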
Like most real world complex surveys, this sampling design has a couple of inefficiencies compared to a simple random sample:
- Because we sample equally from each region, some points (from larger regions) are representing many more population individuals than others. A typical reason for this strategy would be when we are prepared to sacrifice some efficiency in estimating population totals because we want good estimates of totals for each region, even smaller regions.
- Because we first pick a PSU (think of these as a suburb or meshblock), then sample individuals from that PSU, the individual values are correlated. Once you’ve interviewed someone in the PSU, their neighbours are less valuable statistically than people chosen at random. A typical reason for following this strategy is cost and convenience of moving interviewers around.
All this means that the variance of estimates from our complex survey sample is higher than if we had a sample of the same size that was selected by simple random sampling. In fact, if the number of PSUs selected per region is the default of 10, the estimated “design effect” for this particular population and sampling strategy is 1.8. That is, the variance of our estimates is 1.8 times larger than for a hypothetical simple random sample. Another way to look at this is that if a simple random sample were possible, we would need only 2,100 / 1.8 = 1,167 members in the sample to get the same quality estimate of the total (population) y as from our cluster sample.
> des <- samp(population, npsu = 10, seed = 123)
> svymean(~y, des, deff = TRUE)
mean SE DEff
yFALSE 0.74814 0.01282 1.8346
yTRUE 0.25186 0.01282 1.8346
The design effect comes mostly from the clustering of individuals in PSUs. The more PSUs we select (and hence the fewer individuals per PSU for the same overall sample size), the lower the design effect, as seen in this chart:
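Here is a sketch of how that comparison could be computed, assuming the samp() function sketched above; the candidate values of npsu and the number of repetitions are arbitrary choices for illustration.

```r
library(survey)
library(dplyr)
library(purrr)

# average design effect for the estimated proportion with y == TRUE,
# over several independent samples drawn with a given number of PSUs per region
deff_for_npsu <- function(npsu, reps = 20) {
  map_dbl(1:reps, function(i) {
    des <- samp(population, npsu = npsu, seed = i)
    as.numeric(deff(svymean(~y, des, deff = TRUE))["yTRUE"])
  }) %>%
    mean()
}

deff_results <- tibble(npsu = c(5, 10, 25, 50)) %>%
  mutate(avg_deff = map_dbl(npsu, deff_for_npsu))

deff_results
```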
Modelling
OK, so we’ve got a function to produce our complex sample. Let’s get into the business of modelling y based on region, colour and shape. From now on, I’ll be selecting 5 PSUs per region and 30 people per PSU, which gives us a relatively high design effect, but still a total of 14 x 5 = 70 PSUs for the whole sample. This is a plausibly realistic number of PSUs if the sampling process involves moving around a whole country and interviewing people face to face. By way of comparison, Lumley’s book Complex Surveys: A Guide to Analysis Using R cites (page 40) the example of the National Health and Nutrition Examination Survey, NHANES II, which sampled 64 locations and 440 participants in each location, for a relatively intensive and expensive interview/examination process.
Because I’m following on from last week’s post, I’m interested in relative risks comparing one category to another (eg the probability of a “circle” having y, compared to the probability of a “square” having it). So in the generalized linear model I’m going to fit, I need to use a logarithm link function connecting the expected value of the response variable with the linear predictor of the explanatory variables (if you’re asking why, I don’t have time to explain that now). Two of the best candidates for this are family = quasibinomial(log) and family = quasipoisson(log). Taking e to the power of coefficients or other contrasts in the linear predictor will give us relative risks in this case. It’s the log that’s particularly important; the default link function for quasibinomial would be logit, which would give us odds ratios, not risk ratios.
Here’s an example of the sort of outcome I get from two such risk models, fitted to a single sample of 2,100 drawn with my functions set out above. We can see the relative risks of different levels of my categorical variables (compared to the base level of the variable in each case). A value of 1 would mean the risk for that level is the same as for the reference level. So we would conclude, for example, that “square” people have 1.4 to 2.5 times the relative risk of getting y as do “circle” people (circle is the reference level for shape).
I’ve marked the “true” values of the coefficients with dots, and shown the 95% confidence intervals from two quasibinomial(log) options. One of these is what I’m calling “naive” (although it’s still some fairly sophisticated stats) because it ignores the complex survey nature of the data and just uses R’s glm function as though we had a simple random sample. The other method uses Lumley’s svyglm to take the sampling and weighting appropriately into account. The clustering will tend to increase standard errors (and hence the width of confidence intervals); the use of auxiliary information in weighting to match population totals for shape, colour and region will decrease them. In this case, the net effect of treating the survey design with respect is to increase standard errors and the width of confidence intervals, as seen by the longer blue lines compared to the red lines.
Here’s the code that creates that one sample and fits the two models to it, comparing the results to the “correct” values from fitting a model to the full population of 1 million people:
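In outline, and reusing the sketched objects from earlier, it looks something like the code below. For simplicity this sketch fits quasipoisson(log) models, which converge reliably without starting values; the chart above used quasibinomial(log), and the starting-values trick for that family is shown further down.

```r
library(survey)
library(dplyr)

# a numeric 0/1 version of the response, for the log-link quasi families
population <- mutate(population, y01 = as.numeric(y == "TRUE"))

# "true" coefficient values, from fitting the model to all 1 million people
full_pop_model <- glm(y01 ~ region + shape + colour,
                      family = quasipoisson(log), data = population)

# one complex sample: 5 PSUs per region x 30 people per PSU
des <- samp(population, npsu = 5, seed = 123)

# naive: same model, ignoring weights, strata and clustering
mod_naive <- glm(y01 ~ region + shape + colour,
                 family = quasipoisson(log), data = des$variables)

# design-based: svyglm uses the calibrated replicate weights and design info
mod_svy <- svyglm(y01 ~ region + shape + colour,
                  design = des, family = quasipoisson(log))

# relative risks with 95% confidence intervals, and the full-population values
exp(cbind(RR = coef(mod_naive), confint(mod_naive)))
exp(cbind(RR = coef(mod_svy),   confint(mod_svy)))
exp(coef(full_pop_model))
```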
That’s just one sample of course. How about if we do this 100 times? That brings us to the graphic I started the blog with. We can see that both methods are fair performers when it comes to point estimates, and indeed aren’t too bad even for confidence interval coverage (by comparison, I would be pleased and surprised if 85% of one’s “95%” time series prediction intervals contained the true value, and even the naive method exceeds that here). But the coverage of the survey method very nearly matches the advertised 95% level, which is excellent, particularly as the model is slightly mis-specified compared to the data generation process.
Finally, the question I thought this blog was going to be all about - quasi-poisson versus quasi-binomial families with a log link when modelling a binary response to get relative risks? There’s an oldish but good thread on this on R-help.
My own experience so far with these two options is that they give similar results, but that the quasipoisson(log) method is less likely to fall over (ie fail to converge to a solution) than is quasibinomial(log). I don’t know why this is. quasibinomial will be modelling variance as proportional to mu(1 - mu) (where mu is the expected value of the response, given the linear predictor), whereas quasipoisson will be modelling it as proportional to mu. In effect, this means that quasipoisson expects high variance when the probability of y is close to 1 and will down-weight those points, whereas quasibinomial will do the opposite. In the data I’ve been using today, the expected values don’t get that high, so it makes very little difference.
In the simulations I’ve done for this blog, I got around the problem of the occasional non-convergence of the quasibinomial model by first fitting it with the quasipoisson option and then using the estimated coefficient values from that model as starting values for the quasibinomial model. This seems to work well; we get the natural (or so it seems to me) treatment of variance as proportional to mu(1 - mu) from the binomial approach, while still being robust enough to work without human intervention on a wide range of samples.
[As an aside, the use of the quasi-Poisson rather than plain Poisson family is particularly important if we are modelling binary outcomes as we are here. With the response constrained to be 0 or 1, the variance will usually be less than (and nearly always different from) exactly mu, which is what the straight Poisson family would expect. So we will have under-dispersion (or, less likely, over-dispersion) relative to the Poisson distribution. Use of quasipoisson allows this dispersion parameter to be estimated rather than forced onto the model.]
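For example, with the naive quasipoisson glm from the earlier sketch, the estimated dispersion can be inspected directly:

```r
# with a 0/1 response this will typically come out well below 1 (under-dispersion)
summary(mod_naive)$dispersion
```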
Here’s a comparison of the confidence interval coverage of those two methods; we can see it’s virtually identical in the long run, with this particular population and model:
So, to conclude (bearing in mind that this is only with one set of simulated population data, but the principles seem plausible):
- ignoring the complex survey nature of the data and just using glm() out of the box can still mean that you get ok point estimates
- however, your standard errors (and hence confidence interval widths, and statistical inferences in general) run the risk of understating the uncertainty and overstating the information, particularly when the original data is clustered
- there’s not much to choose from in weighing up quasibinomial(log) versus quasipoisson(log) when setting out to get relative risks from a generalized linear model with a binary response. They give similar results, and similarly successful confidence intervals.
Here’s the last chunk of code - for those more systematic checks of “naive” versus survey methods, and quasipoisson versus quasibinomial:
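The structure is roughly as in the sketch below, which reuses the objects defined in the earlier sketches and, again for simplicity, uses quasipoisson(log) for both the naive and the survey models; the real code differs in detail.

```r
library(survey)
library(dplyr)
library(purrr)

true_coefs <- coef(full_pop_model)   # from the full population of 1 million

one_rep <- function(i) {
  des <- samp(population, npsu = 5, seed = i)

  mod_naive <- glm(y01 ~ region + shape + colour,
                   family = quasipoisson(log), data = des$variables)
  mod_svy   <- svyglm(y01 ~ region + shape + colour,
                      design = des, family = quasipoisson(log))

  # point estimates and whether each 95% interval covers the "true" value
  summarise_fit <- function(mod, method) {
    ci <- suppressMessages(confint(mod))
    tibble(
      rep      = i,
      method   = method,
      term     = names(coef(mod)),
      estimate = coef(mod),
      covered  = true_coefs >= ci[, 1] & true_coefs <= ci[, 2]
    )
  }

  bind_rows(summarise_fit(mod_naive, "naive"),
            summarise_fit(mod_svy, "survey"))
}

results <- map_dfr(1:100, one_rep)

# mean squared error of point estimates and confidence interval coverage
results %>%
  group_by(method) %>%
  summarise(mse      = mean((estimate - true_coefs[term])^2),
            coverage = mean(covered))
```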
Finally and slightly off the point, for a great example of the importance of reporting absolute risks as well as relative risks, check out this blog on the recent Lancet article arguing “no safe level of alcohol usage”. The statistics in the original article look to be sound, but while the relative risk of any alcohol is indeed above 1, the actual increase in absolute risk is small. For example:
“So a total of 50,000 bottles of gin among these 1,600 people is associated with one extra health problem.”
Read the full blog for more.