I wasn’t happy with my visualisation of individual incomes from the New Zealand income survey. Because it used a logarithmic scale to improve readability, all zero and negative values were in effect excluded from the data. Whenever I throw out data, my tail goes bushy… there has to be a better way. Those zero and negative values are an important part of the story, and it’s too easy to forget them. If you don’t include them in your standard graphic for exploring the distribution of the data, the next thing you know you’re excluding them from your model in serious analysis.
It turns out this is a very common problem, and the routine solutions (at least according to this SAS blog, and this Q&A on another blog) seem to be either adding an arbitrary constant before taking logarithms, excluding those data points and treating them as missing values, or discretizing the data altogether. If my tail was bushy before, now I’m really nervous - those sound like really dangerous things to do (and yes, I freely admit having done all three myself plenty of times).
A bit of reflection shows that what’s going on here is a mixture of distributions - one of the most dangerous things for statistics based on traditional assumptions:
- A large bunch of the population, which we sometimes mistake for the whole population, who earn positive incomes that are roughly log-normally distributed, with mean and variance depending on structural variables like education, location, etc.
- A small subset, probably of entrepreneurs, who are making losses. The losses have their own distribution of interest.
- A medium-sized subset who neither work nor own a business, and so have zero income.
It’s folly to think that these three groups in the data-generating process will result in a single smooth distribution, as is implied by adding a constant to the income and taking logarithms. It’s worse to chuck out the inconvenient second and third groups altogether.
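To make the point concrete, here is a minimal simulation sketch of such a three-part mixture. All the group proportions and distribution parameters are made up for illustration; they are not estimates from the income survey. It shows that taking logarithms yields a finite value only for the first group, so the other two quietly disappear.

```r
# Hypothetical simulation: all proportions and parameters below are made up
# for illustration and are not estimates from the income survey.
set.seed(123)
n <- 10000

# Assign each simulated person to one of the three groups
group <- sample(c("positive", "loss", "zero"), size = n, replace = TRUE,
                prob = c(0.80, 0.02, 0.18))

income  <- numeric(n)                 # the "zero" group stays at 0
is_pos  <- group == "positive"
is_loss <- group == "loss"

# Positive earners: roughly log-normal incomes
income[is_pos]  <-  rlnorm(sum(is_pos),  meanlog = 6.5, sdlog = 0.8)
# Loss makers: negative incomes with their own (assumed) distribution
income[is_loss] <- -rlnorm(sum(is_loss), meanlog = 5.0, sdlog = 1.0)

# log() is finite only for strictly positive values:
# log(0) is -Inf and log(negative) is NaN
log_income <- suppressWarnings(log(income))
dropped <- !is.finite(log_income)

# Share of observations a log scale would silently discard
mean(dropped)

# Cross-tabulation: the zero and loss groups are the ones that are lost
table(group, dropped)
```

In this sketch the dropped observations are exactly the zero and loss groups, which is the part of the story a log scale hides.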
My solution is to use a modified power transform - one that transforms the absolute value of the original data, then restores the sign (negative or positive) to the result. I started by thinking of square roots, and I needed to handle the fact that the square root of a negative number is imaginary. By applying the square root to the absolute value and then restoring the sign you get a transformation like this:
Produced by (in R):
I’ve given myself flexibility to use powers other than 0.5 (square root). When I apply this approach to the New Zealand Income Survey 2011 simulated unit record file provided by Statistics New Zealand, I get a much more satisfactory representation of the full range of incomes there. Not only can we see the two modes at $825 and $325 per week that I discussed in my last post, but we can compare them to the spike at $0 per week, and observe a little bump at the modal negative income of -$200.
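The signed power transform described above, with its adjustable power, can be sketched in a few lines of base R. The function names `signed_power` and `signed_power_inverse` and the argument name `lambda` are hypothetical, chosen for this illustration; treat this as a minimal sketch of the idea rather than the exact code behind the plots.

```r
# Hypothetical helper names; a sketch of the idea, not the post's own code.

# Apply a power to the absolute value, then restore the original sign
signed_power <- function(x, lambda = 0.5) {
  sign(x) * abs(x) ^ lambda
}

# Inverse transform, needed to label a transformed axis in original units
signed_power_inverse <- function(y, lambda = 0.5) {
  sign(y) * abs(y) ^ (1 / lambda)
}

# Weekly income values mentioned in the text, including a loss and a zero
x <- c(-200, 0, 325, 825)
y <- signed_power(x)
y

# Round trip back to the original scale (up to floating point error)
all.equal(signed_power_inverse(y), x)
```

Zero maps to zero and negative values stay negative, so nothing is dropped. A transform and inverse pair like this is the kind of thing that could be handed to `scales::trans_new()` to build a custom axis for ggplot2, although that step is not shown here.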
I haven’t heard of this approach being used before and would welcome comment on it. As taking logarithms of income data is extremely common - pretty much the standard approach in fact - I would suggest that analysts are routinely losing sight of some important complexity in their data.
Here’s the final code producing that plot. It depends on the data being present in a database, as I described in the previous post.