Re-creating survey microdata from marginal totals

At a glance:

I play around with creating my own synthetic unit record file of survey microdata to match published marginal totals. I conclude that making synthetic data for development or education purposes can be useful, but the effort to try to exactly match marginal totals from this sort of survey is unlikely to be useful for either bona fide researchers or ill-intentioned data snoopers.

03 Nov 2019

I recently did some pro bono work for Gun Control NZ reviewing the analysis by a market research firm of the survey that led to this media release: “Most New Zealanders back stronger gun laws”. The analysis all checked out ok. The task at that time was to make sure that any claims about different perceptions of different groups in New Zealand society were backed by adequately severe (for the context) tests, and in the end I just performed a bunch of pragmatic Chi-square tests with the aggregate results provided by the original market research company.

However, my first thought when I saw that I had access only to the marginal aggregate totals, not the original microdata, was that it might be easier to analyse if I could re-create the original microdata. Amongst other things, this would have meant I could cycle through various bits of analysis with less code. Creating an adequate set of microdata was a harder job than I had hoped, so I didn’t use that method for the original task. But since then I have been pecking away at it to see what it would take, partly for potential future use in this situation, partly as a general self-learning exercise in further understanding issues relating to statistical disclosure control (confidentialisation by cell suppression, random rounding, etc), which is more of a relevant technique than ever today.

See this story for an example claim that an extended version of the sort of methods I describe here, in combination with repeated queries of marginal totals that have been pertured for statistical disclosure control, could be used to recover original sensitive census data. Me, I’m sceptical of the practicality of the alleged “exploit” in that story, which is more theoretical than actual, but it’s true that there is a potential vulnerability here that is worth dabbling my toes in to understand a little. It’s certainly true that repeated queries of a tool that delivers marginal totals (ie crosstabs) can get around the more basic forms of statistical disclosure control perturbation, which is why national stats offices are investing heavily in methods of control such as fixed random rounding to mitigate that particular vulnerability.

But to be clear on this up front - I don’t think a data snooper re-creating sensitive data is a realistic fear with regard to the sort of survey, with limited number of variables, such as that used for a demo in today’s post. And my analysis below bears this out.

Example - Survey of New Zealand views on gun control

When tidied up, the available data from the survey I am working with looks like this, a typical set of high level results from a market research poll:

This chart shows one of two substantive questions in the survey, which is all I will be focusing on today. The visually prominent differences in response, such as by ethnicity, region, gender and rural or not location are statistically significant. The published data show cross tabulated counts by one question and one variable at a time (which I will be calling “marginal totals” in this post) but no interactions.

The code for making these nice diverging stacked bar charts for Likert-like data is shown a bit later down this post.

My task today is, it it possible to recreate a set of microdata that adds up to the aggregates in the available data? This is a special case of creating synthetic data, a common statistical disclosure control technique used to make microdata available for analysis that has the same properties as original microdata but without revealing sensitive details. Sometimes the data is analysed directly, sometimes it is used as a training set of data with which statisticians can develop their models safely before sending them off to a secure environment to be run on the real, gold standard data.

Of course, this isn’t a new problem. A nice overview of some model-based methods is provided by Alan Lee in this discussion of creating SURFs (Simulated Unit Record Files) with New Zealand data. The synthpop R package was developed as part of the Synthetic Data Estimation for UK Longitudinal Studies project and seems to be the most advanced software in this area. However, the motivation of these efforts is to create a SURF that has plausibly been generated from the same underlying joint distribution of the population as the actual sample; I have given myself the slightly extended task of generating a SURF that has exactly the same joint distribution as the actual sample (not the population), at least to the degree of detail revealed by the published marginal totals.

Is this possible? There must be at least one solution, which is the actual sample (unless the aggregate totals have been deliberately fuzzed eg by random rounding or other perturbation for disclosure control). Can I find that solution? Is there more than one solution; if not, could this method be something a snooper could do to recover sensitive information? If there are multiple solutions, does finding one allow us to get extra insight from the data, for example by fitting a model that looks for effects from one variable while controlling for others, something that isn’t possible with just the univariate aggregates? These are interesting questions.

It turns out that the answers are:

• It is possible to find a solution that very closely matches the published marginal totals, but very difficult to get an exact match
• The close solutions aren’t unique
• The existence of multiple solutions means that a snooper is unlikely to recover sensitive information by this method
• Unless you combine considerably more variables than I have done in this instance, the process of re-synthesising microdata from the marginal totals does not yield any additional insights.

Data management

Import

First step as always is data management. This first chunk of code below downloads the data, tidies it and produces the first chart.

Post continues after R code

Approach

We will be performing this resurrection of the microdata in three steps:

1. create a table of all possible combinations of each variable (income, age, ethnicity, possible answers to the question on gun control)
2. create a set of weights for each combination that add up to the observed marginal distributions (eg combinations of age with particular answers to the question), from which we can select samples that represent the same population we have inferred the original sample was drawn from
3. draw a sample from that population,
4. evaluate our sample’s marginal totals against those of the original sample, and “improve” our simulated sample by dropping excess individuals and replacing them with new ‘respondents’ that give the sample propoertise that more closely resemble the original sample

The usual process of generating a SURF performs just steps 1 to 3, which turn out to be fairly straightforward with our relatively small number of variables. It is task four that looks to be the difficult one; a good thing too, or it would be too easy for snoopers to re-create microdata from aggregates.

BTW if people are wondering about my use of the term “snooper”, this isn’t me being whimsical; this is the term generally used in the statistical disclosure control literature for the adversary that we build our confidentialisation methods to guard against.

Dealing with some variables’ idiosyncracies

When we look at counts of responses by each of the reported variables, straight away we have two interesting problems.

• The “ethnicity” variable, while reported in aggregate the same as variables such as region and income, is different because in the standard New Zealand classifications it is possible for a respondent to report two ethnicities. So we have more than 1,000 total responses by ethnicity despite the reported sample size being 1,000
• Other variables have less than 1,000 responses, almost certainly due to non-response; and the “unknown” categories for those variables are not provide.
variable sum(Freq)
Age 1000.0000
All 1000.0000
Dependent children 1000.0000
Employment 1000.0000
Ethnicity 1098.8538
Gender 995.7287
Household income 857.4557
Living Situation 928.8421
Region 1000.0100
Rural 956.4756

… which was generated with this:

It looks like age, dependent children, employment and region were all fully responded to (perhaps mandatory) whereas other variables had varying degrees of partial response, with income being the most non-answered question (unsurprising because of its sensitivity).

For income (and similar variables) we can recover the marginal totals of those for whom income is unknown by comparing the total responses for each level of the substantive question for “All” to those for whom we do have income information.

For ethnicity we need five yes/no variables for each of the possible ethnicities. When we generate our full “all combinations” population, we will eliminate those with more than two ethnicities, which helps keep size down to reasonable levels.

A bit of mucking around lets us deduce the values of those unknowns, and turn ethnicity into multiple variables.

Step 1 - creating all possible combinations

The first few rows of the object freq1b above look like this:

Region Upper North Is. Strongly support 99
Region Auckland Strongly support 155
Region Wellington/ Wairarapa Strongly support 69
Region Lower North Is. Strongly support 52
Region Canterbury Strongly support 72
Region Other South Is. Strongly support 53
Rural Yes Strongly support 50
Rural No Strongly support 436
Gender Male Strongly support 216
Gender Female Strongly support 283

Here is some code that takes that freq1b object and turns it into a big data frame of all the possible combinations of demographic variables and answers to question 1, hence representing the full population of New Zealand in scope for the survey (although not yet in the actual proportions of that population).

Note that I have eliminated by hand a few combinations of variables that are implicitly non-existent in terms of our original sample, and also that the last steps in the chunk above are to cut down our combinations of variables to those that have only one or two ethnicities.

Step 2 - weights that resemble the marginal totals from the actual sample

So far so good. The next step is the one that gets most attention in the synthetic data literature; creating a model or set of weights that will allow our table of all possible combinations of variables to actually be representative of the in-scope population, as understood from the original sample. There are many ways to go about this, with and without explicit statistical models, and I have opted for the simplest which is to use iterative proportional fitting (or “raking”) to create a set of weights for my all-combinations table that adds up to every one of the observed combinations of aggregate marginal totals. I use Thomas Lumley’s survey package as a short cut here, noting the irony that I am creating weights for a population to match a sample, rather than (as is usually the case) for a sample to match a population. Sometimes this makes it hard to keep track of exactly what I mean by population in sample in the code below.

This only takes a few seconds to run on my laptop.

So we now have the full_data object, which has 2.5 million rows and 15 columns. The first 10 rows look like this:

Age Asian Dependent_children Employment Gender Household_income Living_Situation NZ_European_Other_European NZ_Maori Other_ethnicity Pasifika q1 Region Rural wt
18-29 Not Asian Yes 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 2e-07 30-44 Not Asian Yes 30 hours or more Male Under$50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 3e-07
45-59 Not Asian Yes 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 3e-07 60+ Not Asian Yes 30 hours or more Male Under$50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 3e-07
18-29 Not Asian No 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 6e-07 30-44 Not Asian No 30 hours or more Male Under$50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 8e-07
45-59 Not Asian No 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 9e-07 60+ Not Asian No 30 hours or more Male Under$50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 1e-06
18-29 Not Asian Yes Less than 30 hours Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 1e-07 30-44 Not Asian Yes Less than 30 hours Male Under$50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 1e-07

Within rounding errors, that final wt column adds up to the published survey sample totals. For example:

Step 3 - a sample from that population

So, a sample drawn at random from full_data with probability of selection proportionate to wt has a fighting chance of coming from the true population with the same joint distribution that the original sample came from, that is, the actual in-scope population of New Zealand. If we were just trying to create a SURF for analysts to use we would stop here, but we are interested not just in whether it is drawn from the same population, but how exactly it resembles the original sample. So in the code below I create the function evaluate_dissim(), which is going to get some heavy duty use later on. It compares the discrepencies in marginal totals between the new sample and the population totals it is meant to have, returning the sum of the absolute values.

Our starting sample is out from the marginal totals of the combined sample by a total of 1,546. For example, using gender again:

The new sample has similar results to the original - it’s plausibly from the same population, which is all we’ve aimed for so far - but it’s clearly not exactly the same. For example, my sample has 64 female respondents who neither support or oppose the gun control in question, whereas the original sample published a figure of 55. Close but not exact.

Step 4 - improving the simulated sample to make it resemble the original sample’s marginal totals

The fourth step is the tricky stuff. I tried several ways to do the next step and am far from convinced I’ve got it right; in fact even for the approach I’ve ended up with, I have far from optimised code. But it’s good enough for illustrative purposes.

My first idea was to find a row of data (representing a whole respondent) that was in “excess” with regard to at least one of its variables, remove that respondent from the sample and replace them with someone else chosen at random, with extra probability given to new respondents that would improve the marginal totals.

To do this, I defined a function called improve_rnd() which takes the latest sample and confronts it with a single marginal total, just like I do with gender in the table above.

• From that I identify types of respondents who are in excess (eg Female who Neither support or oppose) and types that are missing (Female who strongly support).
• I sort my sample with the rows that most represent excess values at the top (with some randomness given there will be lots of ties)
• I lop off the top row of that sorted sample (noting that this means removing a whole respondent ie also their values for income, age, region, ethnicity, etc)
• I create a table of under-represented rows from the population with regard to this particular variable, multiplying their original population weights by how badly they are under represented
• I sample one row from that under-represented set based on those new weights and add it to my new candidate sample
• I evaluate the candidate sample overall against all marginal totals to see if overall the change improves things or makes them worse (for example, I might have worsened the totals for income even though I improved them for gender)
• If the candidate sample is better, I return this; otherwise I repeat the last two steps, systematically going through my under-represented set of data until I find one that unambiguously improves the data

As you can probably guess, that second last point above proved to be important - before I did that, I ended up cycling around a modestly improved equilibirium where improving the marginal totals for any one variable was making it worse for the others.

This function then becomes the work horse for an iterative process where I work through all my variables (gender, income, age, etc) one at a time at random, looking to drop one of the rows in my sample that makes it worst for that variable and replace it with an improvement that doesn’t degrade the other variables’ totals. I repeat this until the total discrepency gets down to some acceptable level that is probably from rounding error (noting I never had the exact frequencies from the original data, only proportions and sub-population values of n).

Here’s the code that does that procedure. This is the expensive part of the computation; it took me many hours to converge to a combined discrepency of less than 400, so in the code below I stop it once it gets below 1,200 - still much higher than I was hoping to get.

Post continues after R code

So it turned out to be a pretty inefficient solution to knock out a whole person at a time and grab a new one in the hope they would be better. In retrospect, this didn’t make much sense as a method - I clearly came to it because I was thinking in terms of “whole people” as though my simulated sample really represented people and I had no choice but to accept and reject. I was thinking in these terms because I liked the idea of continually sampling from my hypothetical population, and felt this would somehow better mimic the known joint distribution. But I don’t think that makes much sense now that I’ve tried it.

An alternative, perhaps an obvious one, is to look at my current sample, find the particular combinations of some variable that is over-represented (eg nine extra females who neither support or oppose), pick nine of them at random and make them male. Clearly not possible in the real world, but that’s the whole point of synthetic data. I’d resisted this originally because I’d been thinking of all the other characteristics that are jointly distributed with being female that I was now changing, but on a bit of reflection this is actually little different from chucking those nine women out of the sample and keeping grabbing more people from the bag until I found some who were right.

So to implement this I built a new function, improve_fix which does just that. This was much quicker than my first, resampling-based method; but is still prone to being stuck. So in the implementation below I actually combine both methods: using the “swap people’s characteristics” method as the main way of “improving” my sample, but if after 20 efforts of doing this no improvements are happening, trying my original random resampling method to get stuck out of bad equilibrium.

Post continues after R code.

This method delivers good results in only a few minutes of processing time. Here’s a plot of the decreasing total discrepencies:

You can see the first, slowish descent in total discrepencies in the first 100 or so samples, which represents my first, replacement-based method with improve_rnd(). Then from a total discrepency of about 1,200 to just about 250 there is a rapid improvement from using the improve_fix() method. That method stalls at a discrepency level of about 250 but a return to a cycle of improve_rnd() gets it out of that equilibrium; this pattern recurs a few times until we minimise our total discrepencies at 22. Importantly, we can get to that sample with discrepency of 22 from multiple ways.

So here’s the comparison of just one variable and question 1 of the sample’s marginal totals and the target, to compare with my first synthetic sample. Note that we now have nearly an exact match - we have just one too few females who “strongly support”, and one too many of unknown gender who “somewhat support”. My method as implemented doesn’t have a clear way to improve the sample in this case. Intuitively, we would think we could find one of those two “unknown gender - somewhat support” people and change them to “female - strongly support” but I’ve written my program only to change one variable at a time (in this case Gender), to minimise complications with other sets of variables. So further improvement is still possible, if anyone really wants to pursue this approach

> #-----------------End results--------------
> latest_sample %>%
+   group_by(Gender, q1) %>%
+   summarise(latest_sample_total = n()) %>%
+   left_join(pops1[[5]]) %>%
+   rename(original_sample_total = Freq)
Joining, by = c("Gender", "q1")
# A tibble: 15 x 4
# Groups:   Gender [3]
Gender         q1                        latest_sample_total original_sample_total
<chr>          <chr>                                   <int>                 <dbl>
1 Female         Neither support or oppose                  55                    55
2 Female         Somewhat oppose                            38                    38
3 Female         Somewhat support                          107                   107
4 Female         Strongly oppose                            21                    21
5 Female         Strongly support                          282                   283
6 Female         Unsure                                      9                     9
7 Male           Neither support or oppose                  63                    63
8 Male           Somewhat oppose                            49                    49
9 Male           Somewhat support                           96                    96
10 Male           Strongly oppose                            51                    51
11 Male           Strongly support                          216                   216
12 Male           Unsure                                      8                     8
13 Unknown Gender Somewhat support                            2                     1
14 Unknown Gender Strongly oppose                             1                     1
15 Unknown Gender Strongly support                            2                     2


Reflections

A few conclusions:

• This was considerably harder to do than I’d realised; and would get harder again with more sets of marginal totals to match.
• We would need several orders of magnitude more sets of marginal totals before we have to worry about there being a unique solution that gave a snooper (eg someone who knows the one Female Pasifika survey respondent in region X) confidence they had obtained new, sensitive information ie their answers to the survey.
• Because I have not included any external information (eg on the ethnic characteristics of different regions), I will have only very weak interaction relations between any variables - big underestimates of the true interactions. Considerably more information of that sort would be needed for this exercise to add any value in terms of generating insights beyond what the original marginal totals did.

Overall, at the end of 1,000+ line blog post, I would say my full method set out here is unlikely to be pragmatically helpful for any realistic use case that comes to mind for me. I think the traditional generation of synthetic data (my steps 1 to 3) is potentially useful for purpose of developing models that might then be applied to the gold standard original microdata in a secure setting; but step 4 really doesn’t add much value either for a bona fide researcher or for a ill-intentioned snooper.

Acknowledgement and caveat

Gun Control NZ who commissioned the original survey shared the data and gave me permission to write a blog post on the data here but they have not seen the content in advance or been involved any other way; I am not affiliated with them and they are in no sense responsible for any of the content, findings, errors or omissions in the above.