# Importing the New Zealand Income Survey SURF

## At a glance:

I tidy up the publicly available simulated unit record file (SURF) of the New Zealand Income Survey 2011, import it into a database, and explore income distributions, visualising the lower distribution of weekly incomes of New Zealanders of Maori and Pacific Islander ethnicity. Along the way I create a function to identify modes in a multi-modal distribution.

15 Aug 2015

## The quest for income microdata

For a separate project, I've been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it's sensitive personal financial data, either from surveys or tax data, it's not easy to get hold of, particularly if you want to do it in the comfort of your own home and don't have time / motivation to fill in application forms. So far, what I've got is a simulated unit record file (SURF) from the New Zealand Income Survey, published by Statistics New Zealand for educational and instructional purposes. It's a simplified, simulated version of the original survey and it lets me make graphics like this one:

This plot shows the density of income from all sources for four of the most prominent ethnicity types in New Zealand. New Zealand official statistics allow people to identify with more than one ethnicity, which means there is some double counting (more on that below). Three things leap out at me from this chart:

1. The density curves of the different ethnic groups are very similar visually
2. People of Maori and Pacific Peoples ethnicity have proportionately more $300-$400 per week earners than Europeans and Asians, leading to an overall noticeable lean to the lower numbers
3. Weekly income is bimodal, bunching up at $345 per week and $820 per week. Actually, this image is misleading in that respect; in reality it is trimodal, with a huge bunch of people with zero income (and some negative), who aren't shown on this plot because of the logarithmic scale. So you could say that, for New Zealanders getting any income at all, there is a bimodal distribution.
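One way to estimate modes like these is to look for local maxima in a kernel density estimate of income on the log scale. The following is a minimal sketch of that idea in base R, not necessarily the function used for the chart: `find_modes` and its `min_height` argument are hypothetical names, and the incomes are simulated for illustration rather than taken from the SURF.

```r
# Hypothetical helper: locate local maxima of a kernel density estimate.
# It works on the log10 scale because incomes are strongly right-skewed,
# so zero and negative incomes have to be dropped first.
find_modes <- function(x, min_height = 0.1, ...) {
  x <- x[is.finite(x) & x > 0]
  d <- density(log10(x), ...)
  # A grid point is a peak where the slope switches from rising to falling
  peak <- which(diff(sign(diff(d$y))) == -2) + 1
  # Ignore minor bumps lower than min_height times the tallest peak
  peak <- peak[d$y[peak] >= min_height * max(d$y)]
  10 ^ d$x[peak]
}

# Illustrative data only (not the SURF): two log-normal clusters
set.seed(123)
income <- c(rlnorm(4000, log(350), 0.25), rlnorm(6000, log(850), 0.30))

# adjust = 2 doubles the default bandwidth to smooth out spurious wiggles
modes <- find_modes(income, adjust = 2)
round(modes)
```

The result depends on the bandwidth: a smaller `adjust` finds more (possibly spurious) peaks, a larger one can merge genuine modes, so it is worth checking the estimate against the plotted density.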

Where's that bimodal distribution coming from? The obvious candidate is part time workers, and this is seen when we plot income versus hours worked in the next image:

(The interesting diagonal line, below which there are very few points, is the minimum wage per hour.)

Statistics New Zealand warn that while this simulated data is an accurate representation of the actual New Zealand population, it shouldn't be used for precise statistics, so for now I won't draw particularly strong conclusions on anything. A simulated unit record file is a great way of solving confidentiality problems, but in the end it has been created by a statistical model. There's a risk that interesting inferences might be just picking up something implicit in the model that wasn't taken into account when it was first put together. That's not likely to be the case for the basic distribution of the income values, but we'll note the exploratory finding for now and move on.

A box plot is better for examining the full range of reported ethnicities but not so good for picking up the bimodal subtlety. It should also be noted that both these graphs exclude people who made losses (negative income) in a given week; the data show income from all sources:

Here's how I drew the graphs and estimated the two modes. I think I'll want to re-use this data quite often, so it's worthwhile putting it in a database that's accessible from different projects without having to move a bunch of data files around in Windows Explorer. The way Statistics New Zealand have released the file, with codings rather than values for dimensions like ethnicity and region, also makes a database a good way to make the data analysis-ready.

After importing the data, the first significant job is to deal with that pesky ethnicity variable. In the released version of the data respondents with two ethnicities have both code numbers joined together eg 12 means both European (1) and Maori (2). To get around this I split the data into two fact tables, one with a single row per respondent with most of the data; and a second just for ethnicity with either one or two rows for each respondent. Here's how I do that with a combination of {dplyr} and {tidyr}:
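A minimal sketch of that split, written in base R so it runs without extra packages. The toy rows, the column names `id`, `ethnicity` and `ethnicity_id`, and the table names `f_mainheader` and `f_ethnicity` are all assumptions for illustration.

```r
# Toy stand-in for the imported SURF; column names are assumed
nzis <- data.frame(id        = 1:4,
                   ethnicity = c(1, 12, 2, 12),
                   income    = c(700, 350, 0, 820))

# Split each code into its digits: 12 becomes c("1", "2"), 1 stays as "1"
digits <- strsplit(as.character(nzis$ethnicity), "")

# Second fact table: one row per respondent per reported ethnicity
f_ethnicity <- data.frame(
  id           = rep(nzis$id, lengths(digits)),
  ethnicity_id = as.integer(unlist(digits))
)

# Main fact table: one row per respondent, without the combined code
f_mainheader <- nzis[, setdiff(names(nzis), "ethnicity")]

f_ethnicity
```

Respondents 2 and 4 each contribute two rows to `f_ethnicity`, which is the double counting mentioned earlier: summaries by ethnicity add up to more than the number of respondents.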

The second step is to re-create the dimension tables that turn the codes (eg 1 and 2) into meaningful values (European and Maori). Statistics New Zealand provide these, but unfortunately in an Excel workbook that's easier for humans than computers to link up to the data. There aren't too many of them, so it's easy enough to code them by hand, which the next set of code does:
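A hand-coded dimension table can be as simple as a small data frame that is then joined to the fact table. In this sketch only codes 1 and 2 are filled in, because those are the only ones spelled out in the text; the remaining codes would need to be copied from the Statistics New Zealand data dictionary. The names `d_ethnicity` and `ethnicity_id` are assumptions.

```r
# Hypothetical dimension table for ethnicity. Only codes 1 and 2 are
# confirmed in the text; the other rows should come from the data dictionary.
d_ethnicity <- data.frame(
  ethnicity_id = c(1L, 2L),
  ethnicity    = c("European", "Maori"),
  stringsAsFactors = FALSE
)

# Toy ethnicity fact table: one row per respondent per reported ethnicity
f_ethnicity <- data.frame(id           = c(1L, 2L, 2L, 3L),
                          ethnicity_id = c(1L, 1L, 2L, 2L))

# Look up the label for each code, then restore respondent order
labelled <- merge(f_ethnicity, d_ethnicity, by = "ethnicity_id")
labelled <- labelled[order(labelled$id), ]
labelled
```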

The final step in the data cleaning is to save all of our tables to a database, create some indexes so they work nice and fast, and join them up in an analysis-ready view. In the below I use an ODBC (open database connectivity) connection to a MySQL server called "PlayPen". R plays nicely with databases; set it up and forget about it.
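The database side comes down to a handful of SQL statements. The sketch below only builds those statements as strings, so it runs without a database. Every table, column, index and view name in it is hypothetical, and the commented lines assume the RODBC package and an already configured data source.

```r
# Hypothetical table and column names throughout
fact_tables <- c("f_mainheader", "f_ethnicity")

# One index on the respondent id of each fact table
index_sql <- sprintf("CREATE INDEX idx_%s_id ON %s (id)",
                     fact_tables, fact_tables)

# A view that joins codes to labels, so analysis code never sees raw codes
view_sql <- paste(
  "CREATE VIEW vw_ethnicity AS",
  "SELECT f.id, d.ethnicity",
  "FROM f_ethnicity f",
  "JOIN d_ethnicity d ON f.ethnicity_id = d.ethnicity_id"
)

statements <- c(index_sql, view_sql)

# With RODBC and a configured data source (both assumed), roughly:
#   channel <- odbcConnect("your_dsn")
#   sqlSave(channel, f_ethnicity, "f_ethnicity", rownames = FALSE)
#   for (s in statements) sqlQuery(channel, s)
statements
```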

## Average weekly income in NZ 2011 by various slices and dices

Whew, that's out of the way. The next time I use this data in a post I can go straight to the database. We're now in a position to check that our data matches the summary totals provided by Statistics New Zealand. Statistics New Zealand say this SURF can be treated as a simple random sample, which means each point can get an identical individual weight, which we can estimate from the summary tables in their data dictionary. Each person in the sample represents 117.4 in the population (in the below I have population figures in thousands, to match the Statistics New Zealand summaries).
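The 117.4 figure can be checked with a little arithmetic on the sample and population totals by sex that appear in the results for this section. A small sketch; the variable names are mine:

```r
# Sample counts and population totals (in thousands) by sex
sample_n     <- c(female = 15217, male = 14254)
population_k <- c(female = 1787.1, male = 1674.0)

# Simple random sample: every respondent carries the same weight
weight_k <- sum(population_k) / sum(sample_n)  # thousands per respondent
round(weight_k * 1000, 1)                      # people per respondent: 117.4

# A weighted population estimate for any group is then just n * weight
round(sample_n * weight_k, 1)
```

Multiplying the sample counts back by the common weight reproduces the published population totals to one decimal place.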

Statistics New Zealand doesn't provide region and occupation summary statistics, and the qualification summaries they provide use a more detailed classification than is in the actual SURF. But for the other categories - sex, age group, and the tricky ethnicity - my results match theirs, so I know I haven't munched the data.

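| sex | Mean weekly income ($) | Sample | Population (thousands) |
|---|---:|---:|---:|
| female | 611 | 15217 | 1787.10 |
| male | 779 | 14254 | 1674.00 |

| agegrp | Mean weekly income ($) | Sample | Population (thousands) |
|---|---:|---:|---:|
| 15-19 | 198 | 2632 | 309.10 |
| 20-24 | 567 | 2739 | 321.70 |
| 25-29 | 715 | 2564 | 301.10 |
| 30-34 | 796 | 2349 | 275.90 |
| 35-39 | 899 | 2442 | 286.80 |
| 40-44 | 883 | 2625 | 308.30 |
| 45-49 | 871 | 2745 | 322.40 |
| 50-54 | 911 | 2522 | 296.20 |
| 55-59 | 844 | 2140 | 251.30 |
| 60-64 | 816 | 1994 | 234.20 |
| 65+ | 421 | 4719 | 554.20 |

| qualification | Mean weekly income ($) | Sample | Population (thousands) |
|---|---:|---:|---:|
| School | 564 | 7064 | 829.60 |
| None | 565 | 6891 | 809.30 |
| Other | 725 | 1858 | 218.20 |
| Bachelor or Higher | 955 | 5223 | 613.40 |

| occupation | Mean weekly income ($) | Sample | Population (thousands) |
|---|---:|---:|---:|
| No occupation | 257 | 10617 | 1246.90 |
| Labourers | 705 | 2154 | 253.00 |
| Residual Categories | 726 | 24 | 2.80 |
| Community and Personal Service Workers | 745 | 1734 | 203.60 |
| Sales Workers | 800 | 1688 | 198.20 |
| Clerical and Administrative Workers | 811 | 2126 | 249.70 |
| Technicians and Trades Workers | 886 | 2377 | 279.20 |
| Machinery Operators and Drivers | 917 | 1049 | 123.20 |
| Professionals | 1105 | 4540 | 533.20 |
| Managers | 1164 | 3162 | 371.30 |

| region | Mean weekly income ($) | Sample | Population (thousands) |
|---|---:|---:|---:|
| Bay of Plenty | 620 | 1701 | 199.80 |
| Taranaki | 634 | 728 | 85.50 |
| Waikato | 648 | 2619 | 307.60 |
| Southland | 648 | 637 | 74.80 |
| Manawatu-Wanganui | 656 | 1564 | 183.70 |
| Northland | 667 | 1095 | 128.60 |
| Nelson/Tasman/Marlborough/West Coast | 680 | 1253 | 147.20 |
| Otago | 686 | 1556 | 182.70 |
| Gisborne / Hawke's Bay | 693 | 1418 | 166.50 |
| Canterbury | 701 | 4373 | 513.60 |
| Auckland | 720 | 9063 | 1064.40 |
| Wellington | 729 | 3464 | 406.80 |

| ethnicity | Mean weekly income ($) | Sample | Population (thousands) |
|---|---:|---:|---:|
| Residual Categories | 555 | 85 | 10.00 |
| Maori | 590 | 3652 | 428.90 |
| Pacific Peoples | 606 | 1566 | 183.90 |
| Middle Eastern/Latin American/African | 658 | 343 | 40.30 |
| Asian | 678 | 3110 | 365.20 |
| European | 706 | 22011 | 2585.00 |
| Other Ethnicity | 756 | 611 | 71.80 |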

## Graphics showing density of income

Finally, here's the code that drew the charts we started with, showing the distribution of the weekly income of New Zealanders of different ethnicities.
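As a rough illustration of the density chart, the sketch below overlays one density curve per group on a log10 income axis using base R graphics. It is an assumption-laden stand-in rather than the plotting code behind the images: the data are simulated, and the group labels "Group A" and "Group B" are placeholders, not ethnicities from the SURF.

```r
# Illustrative data only (not the SURF); group labels are placeholders
set.seed(42)
dat <- data.frame(
  group  = rep(c("Group A", "Group B"), each = 2000),
  income = c(rlnorm(2000, log(700), 0.6), rlnorm(2000, log(600), 0.6))
)

# Incomes of zero or less cannot be shown on a log scale, so drop them
pos <- dat[dat$income > 0, ]

# One density curve per group, estimated on the log10 scale
dens <- lapply(split(log10(pos$income), pos$group), density)
ymax <- max(sapply(dens, function(d) max(d$y)))

# Empty canvas with dollar labels at powers of ten, then one line per group
plot(NULL, xlim = c(1, 4), ylim = c(0, ymax), xaxt = "n",
     xlab = "Weekly income (log scale)", ylab = "Density")
axis(1, at = 1:4, labels = c("$10", "$100", "$1,000", "$10,000"))
for (i in seq_along(dens)) lines(dens[[i]], col = i)
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
```

Dropping non-positive incomes before taking logs is what hides the large group with zero income, so a chart drawn this way shows only the distribution for people with some income.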

Edited 18 August 2015 to add the scatter plot of the joint density of hours and income.