Transformations for compositional data

Motivation

In engaging with this Twitter thread four months ago, I discovered that there was a whole set of statistical methods that I knew nothing about - transforming data that is in the form of a simplex. Common examples of this sort of data would include soil composition (which the Twitter thread was about), chemical composition, time use composition - basically anything where by its very structure, each observation is constrained to add up to a constant number (most often 1). I now think this was a material gap in my skillset, a well-rounded applied statistician needs to know about this.

The original question was in essence “can I take these observations of the proportions of samples that are silt, clay and sand as points in three-dimensional space and just calculate distances between them?”. My first answer was “yes so long as the units are the same.” Then when it was pointed out to me that each observation was constrained to add up one I thought “hmm, perhaps use just two dimensions as the third is redundant, and maybe it doesn’t matter which two”.

Turns out this is wrong - not disastrously wrong like “sort each column in your data independently before you do a regression to get higher correlations” is disastrously wrong; but at least not the best practice in dealing with data of this sort. To my credit, I did mention that I knew nothing about soil science.

People who do know about it, particularly Morgan Williams and Dylan Beaudette, luckily chipped in and mentioned that this is a known problem and there are a bunch of standard ways to deal with it.

I don’t have time to explore all the things mentioned in that Twitter thread so I’m going to focus on one of the more fundamental - the idea of working with “isometric log ratios” and calculating the distance between them, rather than the original dimensions.

Simulated data and a naive method

First let’s look at the first thing that made me uneasily feel my Twitter-brain hadn’t thought this through, my hope that you could measure distances between points using any 2 of 3 dimensions. I simulated some data from a zero-inflated 3 dimensional multivariate log normal distribution then constrained each observation to add up to 1.

Although the first raw cut of the data had positive correlations between the variables (which in my mental model was related to the ‘size’ of each sample that observations were being taken on), once you turn them into composition proportions they naturally are strongly negatively correlated. Of course, if one element is taking up 90% of the composition, the other two elements are going to be small. So a pairs plot of the data shows an interesting triangle shape.

There’s also a skewed univariate distribution for each dimension considered by itself, which is more a product of my simulation process (which gave the underlying multivariate normal data a common mean) than an essential part of the data structure:

This was done with this code:

library(MASS)
library(tidyverse)
library(GGally)

set.seed(42)

sig <- matrix(c(1, 0.2, 0.4, 0.2, 1, -0.3, 0.4, -0.3, 1),
              byrow = TRUE, nrow = 3)
n <- 100
m1 <- exp(mvrnorm(n, mu = c(0,0,0), Sigma = sig))

# need to make some of m1 real zeroes:
m1[runif(n * 3) < 0.05] <- 0

colnames(m1) <- c("x", "y", "z")

# pairs plot (not shown in blog)
ggpairs(as_tibble(m1))

# transform to compositional proportions:
m2 <- m1 / rowSums(m1)

# check that all the rows add up to 1.0000:
stopifnot(all(unique(round(rowSums(m2), 4)) == 1))

# pairs plot shown in blog
ggpairs(as_tibble(m2)) +
    labs(title = "Simulated simplex data (x + y + z = 1)")

Let’s look at all the 4,950 pairwise distances between those 100 observations, using x and y, x and z, y and z and all three of x, y and z.

We can easily see that my naive hope that you could just pick any arbitrary two of the three dimensions and get the same result is wrong. In fact, some highschool maths would have told us this - substituting in z = (1 - x - y) for (y) in a calculation of the distance between two points (sqrt((x1 - x2)^2 + (y1 - y2 ^2))) is going to get you different results.

The different two-dimensional distances are correlated with each other of course, and even more so with the three-dimensional distances, but the correlations are noticeably below 1.

Code for this step will be shown a bit later because, for efficiencies sake, I did some of the calculations in the next section at the same time and I want to talk about them first.

Isometric logarithm (or Box-Cox) centered ratios

Overview

So let’s look at the proper way to do it. First thing I thought on seeing the name of this technique was ratios of what I wondered? And how do they get to be isometric?

Luckily I found this comprehensive answer by the always-impressive whuber on Cross Validated, the Stack Exchange statistics forum. It’s brilliant, clear, well-explained and with reproducible code; I nearly didn’t write this blog due to lack of any obvious value-add from me. So do yourself a favour if you’re interested in this - and in how to craft a useful answer on Cross Validated or Stack Overflow - and give it a read. There’s a couple of tiny bugs in his code so that can be my value-add, but the real value of writing this blog post is forcing me to think through the process myself.

So recall that we are dealing with a k x n matrix where each of the n rows is an observation, and the k columns represent observations of some proportions that are constrained to add up to 1. So you could work out the values of any one of the columns by subtracting the other columns’ values from one.

ILR turns out (as I understand it at the moment - bear in mind I’m self-learning here so may have got some terminology wrong) to be a three step process:

transform the ratios (i.e. the proportions that make up individual numbers of the simplex) to make them less skewed. The ‘L’ in ILR stands for a logarithm transform at this point, but doesn’t seem to have any theoretical necessity so it makes sense to instead use a more general Box-Cox transformation (of which a logarithm is a special case)
center the results by subtracting the geometric mean
rotate them in such a way that the data becomes two dimensional

Of that last step, whuber explains:

…the hyperplane is rotated (or reflected) to coincide with the plane with vanishing kth coordinate and one uses the first k−1 coordinates. (Because rotations and reflections preserve distance they are isometries, whence the name of this procedure.)

OK… so the end result with my 3 dimensional original data will be 2 dimensions that are a transformed but still full-information version of the original.

Center and rotate with no transformation

An interesting thing about this is that if we skip the transformation of the original data (or, equivalently, transform with a Box-Cox transformation with parameter p = 1, which means the transformation is just subtracting one from it), then this final transformed version should be just a simple mean-subtraction and rotation of the original. Which would mean that calculating distances from the two transformed dimensions should get the same results as (or a linear combination of) the original three-dimensional data!

Let’s check that out in the first instance. Like the previous plot, each point in the image below represents one of the pairwise distances between the original 100 observations:

In the straight line of points in the facet in bottom row, second from right, we see a straight line of points. This is the perfect correlation of 1.0 between the distances calculated from our rotated data (d_ilr) and those on the original (d_xyz).

To make that perfectly clear - if you skip the ‘transform’ step of the ILR process, you end up with distances between points (from data rotated to need just two dimensions) that are perfectly correlated with the distances between points in the original three-dimensional space.

Center and rotate after logarithm transform

OK, so let’s put the transform back in - after all it is pretty fundamental to the concept of ILR. Starting with a logarithm (as per the ‘L’ in ILR), which is equivalent to Box-Cox with parameter lambda = 0, here is what we see:

As expected, there is no longer a simple linear correlation between the pairwise distances of points after ILR and any of the untransformed distances - whether three dimensional or the three possible sets of two dimensional distances.

But there’s something interesting here which is that the distribution of the differences after transformation and rotation has a long, thin right tail - it’s more skewed than is the distribution of differences on the original scale. It’s not nice for exploratory data analysis, where we typically look to transform data to be roughly symmetrical.

What might be going on here? Well, remember we’re looking at a plot of the pairwise distances between points after log-transform, centering and rotation. Let’s simplify things a little but just looking at the log-transformed original data:

We can see from here what an experienced data analyst would think of as the data having been transformed too much. The original simplex data has had its right-skew fixed, but overcompensated for - so we now have left-skewed data. This is the sort of situation where a Box-Cox transformation can be handy, giving us a broader range of transformations than just the logarithm.

Center and rotate after a more generalized transform

To give an idea why, here is the original data but this time with a square root transform:

Now the data is nice! In fact, for the simulated data in whuber’s example on Cross-Validated, he uses a Box-Cox transformation with a parameter of 0.5 - which is very similar to taking the square root - and as he points out it works “beautifully” with his Dirichlet distribution data.

By the way, this is how those plots of transformed data were produced:

m2 |>
  as_tibble() |>
  mutate(across(x:z, ~log(.x))) |>
  ggpairs() +
  labs(title = "Simulated simplex data after logarithm transform")

m2 |>
  as_tibble() |>
  mutate(across(x:z, ~sqrt(.x))) |>
  ggpairs() +
  labs(title = "Simulated simplex data after square root transform")

So this leads to my final version of this generalised ILR procedure, this time using a Box-Cox transformation with lambda = 0.5.

What next

So this transformation, centering and rotation is a beginning, not an ending. Whether the purpose is exploratory data analysis or more formal modelling, we would use the transformed data for that purpose.

Following the original line of thought on Twitter, I have been focusing on pair-wise distances between observations, which cna be used in any number of ways from classification to multi-dimensional scaling, but that would take me beyond the scope of an already too-long blog.

Efficient transform-center-rotate and calculation of distances

Here’s the code that does the transformations and calculates pair-wise distances. As I abstracted the core tasks out into functions it made sense to have all this code at once at the end of the blog rather than interspersed with all the individual plots above.

The ilr() function below is very lightly adapted from whuber’s original on Cross-Validated.

#' ILR (Isometric log-ratio) transformation.
#' see https://stats.stackexchange.com/questions/259208/how-to-perform-isometric-log-ratio-transformation#:~:text=The%20ILR%20(Isometric%20Log%2DRatio,time%20spent%20in%20various%20activities.
#' @param `x`  an `n` by `k` matrix of positive observations with k >= 2.
#' @param p Box-Cox parameter
ilr <- function(x, p = 0) {
  if (any(x < 0, na.rm = TRUE)) {
    stop("x must be only positive values.")
  }
  if (abs(p) < 1e-07) {
    y <- log(x)
  } else {
    y <- (x ^ p - 1) / p
  }       
  y <- y - rowMeans(y, na.rm = TRUE)            # Recentered values
  k <- dim(y)[2]
  H <- contr.helmert(k)                       # Dimensions k by k-1
  H <- t(H) / sqrt((2:k) * (2:k - 1))         # Dimensions k-1 by k
  z <- y %*% t(H)                             # Rotated/reflected values
  if(!is.null(colnames(x))){                   # (Helps with interpreting output)
    colnames(z) <- paste0(colnames(x)[-k], "_ilr")
  }
  return(z)                          
}


#' convenience function for euclidean distance between 2 or 3 dimensional points
euc_dist <- function(d1, d2 = 0, d3 = 0){
  d <- sqrt(d1 ^ 2 + d2 ^ 2 + d3 ^ 2)
}

#' Calculate pairwise differences of points
#' Calculation based on matrix m2 in the global environment. 
#' Not a portable function, just for this particular analysis.
distances <- function(p = 0){
  d <- m2  |>
    # two dimensional version
    cbind(ilr(m2, p = p)) |>
    as_tibble() |>
    mutate(id = 1:n())
  
  d |>
    rename(x1 = x, 
           y1 = y,
           z1 = z,
           x_ilr1 = x_ilr,
           y_ilr1 = y_ilr,
           id1 = id) |>
    left_join(d, by = join_by(id1 > id)) |>
    filter(!is.na(id)) |>
    mutate(d1 = x - x1,
           d2 = y - y1,
           d3 = z - z1) |>
    mutate(
      d_xy = euc_dist(d1, d2),
      d_xz = euc_dist(d1, d3),
      d_yz = euc_dist(d2, d3),
      d_xyz = euc_dist(y-y1,x-x1, z-z1),
      d_ilr = euc_dist(x_ilr1 - x_ilr, y_ilr1 - y_ilr)
    )
}

# Plot distances using original variables of simplex
  distances() |>
    select(d_xy:d_xyz) |>
    ggpairs() +
    labs(title = "Different methods of comparing pairwise differences of compositional data",
         subtitle = "Comparing choices of two of the original dimensions with use of all three")

# Plot distnaces with center and rotate without transform
  distances(p = 1) |>
    select(d_xy:d_ilr) |>
    ggpairs() +
    labs(title = "Different methods of comparing pairwise differences of compositional data",
         subtitle = "d_ilr is isometric logarithm ratio transformation without the logarithm")

# Plot distances after log transform
  distances(p = 0) |>
    select(d_xy:d_ilr) |>
    ggpairs() +
    labs(title = "Different methods of comparing pairwise differences of compositional data",
         subtitle = "d_ilr is isometric logarithm ratio transformation") 

# Plot distances after Box-Cox (0.5)
  distances(p = 0.5) |>
    select(d_xy:d_ilr) |>
    ggpairs() +
    labs(title = "Different methods of comparing pairwise differences of compositional data",
         subtitle = "d_ilr is isometric logarithm ratio transformation with a Box-Cox (0.5) transformation instead of logarithm")