I tidy up the publicly available simulated unit record file (SURF) of the New Zealand Income Survey 2011, import into a database, and explore income distributions, visualising the lower distribution of weekly incomes New Zealanders of Maori and Pacific Islander ethnicity. Along the way I create a function to identify modes in a multi-modal distribution.
15 Aug 2015
The quest for income microdata
For a separate project, I've been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it's sensitve personal financial data either from surveys or tax data, it's not easy to get hold of, particularly if you want to do it in the comfort of your own home and don't have time / motivation to fill in application forms. So far, what I've got is a simulated unit record file (SURF) from the New Zealand Income Survey, published by Statistics New Zealand for educational and instructional purposes. It's a simplified, simulated version of the original survey and it lets me make graphics like this one:
This plot shows the density of income from all sources for four of the most prominent ethnicity types in New Zealand. New Zealand official statistics allow people to identify with more than one ethnicity, which means there is some double counting (more on that below). Three things leap out at me from this chart:
The density curves of the different ethnic groups are very similar visually
People of Maori and Pacific Peoples ethnicity have proportionately more $300-$400 per week earners than Europeans and Asians, leading to an overall noticeable lean to the lower numbers
Weekly income is bimodal, bunching up at $345 per week and $820 per week. Actually, this image is misleading in that respect; in reality it is trimodal, with a huge bunch of people with zero income (and some negative), who aren't shown on this plot because of the logarithmic scale. So you could say that, for New Zealanders getting any income at all, there is a bimodal distribution.
Where's that bimodal distribution coming from? The obvious candidate is part time workers, and this is seen when we plot income versus hours worked in the next image:
(That interesting diagonal line below which there are very few points is the minimum wage per hour)
Statistics New Zealand warn that while this simulated data is an accurate representation of the actual New Zealand population, it shouldn't be used for precise statistics, so for now I won't draw particularly strong conclusions on anything. A simulated unit record file is a great way of solving confidentiality purposes, but in the end it has been created by a statistical model. There's a risk that interesting inferences might be just picking up something implicit in the model that wasn't taken into account when it was first put together. That's not likely to be the case for the basic distribution of the income values, but we'll note the exploratory finding for now and move on.
A box plot is better for examining the full range of reported ethnicities but not so good for picking up the bi-modal subtlety. It should also be noted that both these graphs delete people who made losses (negative income) in a given week - the data show income from all sources:
Loading the NZIS SURF into a database
Here's how I drew the graphs, and estimated the two modes. I think I'll want to re-use this data quite often, so it's worth while putting in a database that's accessible from different projects without having to move a bunch of data files around in Windows Explorer. The way Statistics New Zealand have released the file, with codings rather than values for dimensions like ethnicity and region, also makes a database a good way to make the data analysis ready.
After importing the data, the first significant job is to deal with that pesky ethnicity variable. In the released version of the data respondents with two ethnicities have both code numbers joined together eg 12 means both European (1) and Maori (2). To get around this I split the data into two fact tables, one with a single row per respondent with most of the data; and a second just for ethnicity with either one or two rows for each respondent. Here's how I do that with a combination of {dplyr} and {tidyr}:
Second step is to re-create the dimension tables that turn the codes (eg 1 and 2) into meaningful values (European and Maori). Statistics New Zealand provide these, but unfortunately in an Excel workbook that's easier for humans than computers to link up to the data. There's not too many of so it's easy enough to code them by hand, which the next set of code does:
The final step in the data cleaning is to save all of our tables to a database, create some indexes so they work nice and fast, and join them up in an analysis-ready view. In the below I use an ODBC (open database connectivity) connection to a MySQL server called "PlayPen". R plays nicely with databases; set it up and forget about it.
Average weekly income in NZ 2011 by various slices and dices
Whew, that's out of the way. Next post that I use this data I can go straight to the database. We're now in a position to check our data matches the summary totals provided by Statistics New Zealand. Statistics New Zealand say this SURF can be treated as a simple random sample, which means each point can get an identical individual weight, which we can estimate from the summary tables in their data dictionary. Each person in the sample represents 117.4 in the population (in the below I have population figures in thousands, to match the Statistics New Zealand summaries.
Statistics New Zealand doesn't provide region and occupation summary statistics, and the qualification summaries they provide use a more detailed classification than is in the actual SURF. But for the other categories - sex, age group, and the trick ethnicity - my results match theirs, so I know I haven't munched the data.
sex
Mean
Sample
Population
female
611
15217
1787.10
male
779
14254
1674.00
agegrp
Mean
Sample
Population
15-19
198
2632
309.10
20-24
567
2739
321.70
25-29
715
2564
301.10
30-34
796
2349
275.90
35-39
899
2442
286.80
40-44
883
2625
308.30
45-49
871
2745
322.40
50-54
911
2522
296.20
55-59
844
2140
251.30
60-64
816
1994
234.20
65+
421
4719
554.20
qualification
Mean
Sample
Population
School
564
7064
829.60
None
565
6891
809.30
Other
725
1858
218.20
Vocational/Trade
734
8435
990.60
Bachelor or Higher
955
5223
613.40
occupation
Mean
Sample
Population
No occupation
257
10617
1246.90
Labourers
705
2154
253.00
Residual Categories
726
24
2.80
Community and Personal Service Workers
745
1734
203.60
Sales Workers
800
1688
198.20
Clerical and Adminsitrative Workers
811
2126
249.70
Technicians and Trades Workers
886
2377
279.20
Machinery Operators and Drivers
917
1049
123.20
Professionals
1105
4540
533.20
Managers
1164
3162
371.30
region
Mean
Sample
Population
Bay of Plenty
620
1701
199.80
Taranaki
634
728
85.50
Waikato
648
2619
307.60
Southland
648
637
74.80
Manawatu-Wanganui
656
1564
183.70
Northland
667
1095
128.60
Nelson/Tasman/Marlborough/West Coast
680
1253
147.20
Otago
686
1556
182.70
Gisborne / Hawke's Bay
693
1418
166.50
Canterbury
701
4373
513.60
Auckland
720
9063
1064.40
Wellington
729
3464
406.80
ethnicity
Mean
Sample
Population
Residual Categories
555
85
10.00
Maori
590
3652
428.90
Pacific Peoples
606
1566
183.90
Middle Eastern/Latin American/African
658
343
40.30
Asian
678
3110
365.20
European
706
22011
2585.00
Other Ethnicity
756
611
71.80
Graphics showing density of income
Finally, here's the code that drew the charts we started with, showing the distribution of the weekly income New Zealanders of different ethnicity.
Edited 18 August 2015 to add the scatter plot of the joint density of hours and income.