# Weighted versus unweighted percentiles

## At a glance:

When working with complex survey data where the weights are related to a continuous variable of interest, using a weighted rather than unweighted percentile rank will lead to different results towards the middle of the distribution; but the two measures will be highly correlated with eachother. Also, R reportedly calculates weighted percentile ranks much much faster than Stata.

24 Jun 2023

This interesting paper came out recently: A test of the predictive validity of relative versus absolute income for self-reported health and well-being in the United States, by David Brady, Michaela Curran and Richard Carpiano. It uses a large sample of longitudinal data to exploit between-individual and within-individual (across time) variation in absolute and relative income to explore which one is more associated with well-being. Seems that in the USA, relative income is more important.

“Relative income” here is defined as your position (percentile rank) in the national distribution of incomes in a particular year. The average relative income should be 50 by definition. I noticed that the mean percentile reported in table A.1 was 57, not 50, which led to this slightly confusing (for me, anyway) exchange on Twitter. It sparked my interest in a broader question, not really related to the paper itself - what is the relationship between weighted and unweighted percentile rank when using survey data?

## Weighted percentile rank

I was surprised to find (well actually, I had discovered this before, so really should say to be “reminded”) that there doesn’t seem to be a weighted version of dplyr’s percent_rank() function in dplyr or other R packages. There are several packages with functions that will estimate quantiles from weighted data, but if you want to then turn the original vector of continuous variables into percentiles you need to take those quantiles and cut the original variable using them as breaks.

I started by making my own convenience function to do this. To calculate the breaks, I am using the wtd.quantile function from Mark Hancock’s reldist package, which was notably faster than its competition (more on this later).

Post continues after R code

That little test of whether I get the same result as dplyr::percent_rank() with unweighted data turns out ok:

  homemade dplyr difference
<dbl> <dbl>      <dbl>
1    0.189 0.189          0
2    0.557 0.557          0
3    0.869 0.869          0
4    0.944 0.944          0
5    0.704 0.704          0
6    0.448 0.448          0
7    0.411 0.411          0
8    0.121 0.121          0
9    0.584 0.584          0
10    0.307 0.307          0
# ℹ 990 more rows


## Simulating a population and drawing a sample

So to actually explore the difference between weighted and unweighted percentiles, I wanted to simulate a complex survey with unequal probabilities of observations being in the sample, and a skewed variable (income a good example) that was also related, directly or indirectly, to the probability of an observation being in the sample. I did this by creating a educatin variable with 7 different levels, and said

• the higher the education the higher the mean income
• the higher the education, the lower the chance of being sampled (or, which is effectively the same for my purpose, the lower the chance of completing a response to the survey)

Once I have drawn the sample I am going to use standard post-stratification methods to calculate weights that mean the points in the sample between them represent the original population fully.

Whether or not this sampling or non-response effect is realistic doesn’t matter much. Obviously in reality things will be much more complicated. I just wanted something where survey weights were going to vary in a known way (that could be inferred from the data) that correlates with the variable of interest.

Actual income is simulated as coming from a log-normal distribution.

Post continues after R code

This is what the mean income looks like for my different education levels:

   educ mean(income)
<int>          <dbl>
1     1         16307.
2     2         19982.
3     3         24369.
4     4         29761.
5     5         36302.
6     6         44355.
7     7         54212.


OK, now I need to draw a sample. I originally did this with dplyr::slice_sample() but found it slow. The strata() function in Yves Tillé and Alina Matei’s sampling package is about twice as fast (benchmarking later in this post) so I used that instead. The code below draws that sample, does the post-stratification weighting (by simply joining the sample to the marginal totals data frame calculated earlier and creating weights forced to add up to those marginal totals) and then calculates the percentile ranks (both weighted and unweighted).

BTW, the reason I was particularly attuned to speed in this overall exercise is that I had heard that Stata was very slow in calculating the weighted percentile ranks. But apart from drawing the sample, everything I was doing in R turned out to be very fast; even calculating percentile ranks on 200,000 observations grouped by year (so similar to the actual data in the original paper) only takes a few seconds.

Post continues after R code

## Comparing the weighted and unweighted percentiles

OK, time to do the actual comparisons. Here’s a scatter plot of the weighted and unweighted percentiles. Note the very very high correlation - 0.999! - but also the visually clear slight curvature:

That curvature is more obvious if we plot the difference between the two estimates:

Why do we get this pattern? It’s simple enough if you consider three people - at the bottom, the top and the middle of the sample’s income distribution. For the person at the bottom, whether we are weighting the observations or not, we know that this person has the lowest observed income. They’re in the bottom percentile either way, and the weights of all the people above them simply don’t matter. This applies in reverse for the person at the top. So the weighted and unweighted percentiles are identical in these extremes.

For the person in the middle however it does matter. We look at the unweighted sample and say “right, you’re bang in the middle, you’re the 50th percentile”. But then we look at the weights and say “hold on, the half of the sample that has incomes below you has relatively low weights (because, in my simulation, people with lower education were more likely to be in the survey so get lower weights to compensate). So while it looks like you’ve got higher income than 50% of people, that 50% of the sample actually only represents 45% of the population.” So we get a material difference between weighted and unweighted percentile estimates in the middle, but not at the ends, of the distribution.

There’s nothing inherent that says the weighted percentiles in the middle will always be lower than unweighted as in my simulation - that’s only the case because the lower income people on average had lower weights. With other sampling designs or non-response rates, this could be reversed. But however the weights work out, there will be more discrepancy between the weighted and unweighted percentiles in the middle of the distribution than the ends.

Here’s the simple code for drawing those plots.

Post continues after R code

A couple of other interesting observations. The unweighted mean of the unweighted percentile is 50; and so is the weighted mean of the weighted percentile. But the unweighted mean of the weighted percentile is low - about 47 - and the weighted mean of the unweighted percentile is high - about 53. This all makes sense based on the above reasoning.

Here are some summary stats on the sample:

educ mean(wt) mean(income) mean(uw_perc) mean(w_perc) n() sum(wt)
1 690.4403 16980.99 37.66090 34.70156 2067 1427140
2 762.0085 21147.84 42.89254 39.78324 1873 1427242
3 842.4838 25146.92 48.18424 45.01756 1697 1429695
4 989.3329 31281.51 54.45006 51.25665 1445 1429586
5 1175.7494 35873.40 58.81986 55.67827 1217 1430887
6 1500.4732 42648.17 62.77626 59.67482 951 1426950
7 1904.6667 50135.01 66.77896 63.82556 750 1428500

And the code to make those summary stats and compare the different means of the different percentiles.

Post continues after R code

## Some speed benchmarking

Finally, here are some speed tests that were behind some of the design choices above. As it turns out, none of it matters very much. As mentioned above, R can calculate the weighted percentiles for hundreds of thousands of observations grouped by year in only a couple of seconds, so I didn’t need to worry about that. But in case it matters, here’s my core conclusions from running the benchmarking below

• reldist::wtd.quantile is about twice as fast as Desctools::Quantile in calculating weighted quantiles and gives near-identical results.
• setting labels = FALSE in a call to cut() makes the function about four times as fast
• dplyr::slice_sample takes about the same time to do a sample with unequal probability of selection as a home made function using base::sample, but sampling:strata function is about twice as fast.
• my final homemade function wt_percent_rank for weighted percentile ranks is about four times slower than the unweighted dplyr::percent_rank, but is still plenty fast enough (<5 seconds for 200,000 observations).

Benchmarking code:

That’s all, cheerio.

### Versioning of this blog post

An earlier version of this post had some speculation about the original referenced paper without me adequately checking what was actually happening. As it’s not relevant for my main point that paragraph has been removed.