I have been thinking in my spare time a bit about synthetic control methods as an approach to evaluation, and am working on a blog post. But a side issue that popped up was sufficiently interesting to treat separately.
Technically, it is a question of fitting a regression where the coefficients are constrained to be non-negative and add up to one; in other words, the coefficients are a simplex. The reason this comes up is in that in synthetic control methods, such a regression is used to determine the weights to use in constructing a weighted average of multiple units (countries, firms, people) that is as comparable as possible to the unit that received the intervention being evaluated.
The argument for the weights adding up to one is to avoid extrapolating from the range of the real data. I don’t find this a convincing argument, but I’ll leave that for a later post if I get round to it. For now, I’m interested in the technical question of how to get such a set of non-zero weights adding up to one; or abstracting it away from the application, how to fit a regression where the coefficients are constrained to be a simplex.
Naturally, there are multiple ways you might do this. There was a good question and answer on this on Cross-Validated way back in 2012. Two of the four methods I’m going to go through below come pretty much straight from answers or comments there. The other two are my own invention (I’m not claiming to be the first, just that I haven’t seen others doing it this way).
First, let’s get started by simulating some data. I’m going to make a matrix called X
with five columns, all from different distributions. Then I create a vector of true coefficients that will relate the y
I’m about to make to that X
. To make the exercise interesting and life-like, the true coefficients are going to involve a negative number, and aren’t going to add to exactly 1. One of them is going to be zero.
Now, I’m going to fit various regressions (with no intercept) to this data and extract the coefficients and compare them to the true data generating process. First let’s start with ordinary least squares with no constraints
This approach is very good at getting close to the true data generating process, as we’d expect. But of course, like the true process, it has a negative coefficient for x3, and it sums to more than 1:
> round(coefs1, 2)
Xx1 Xx2 Xx3 Xx4 Xx5
0.20 0.61 -0.11 -0.02 0.45
> sum(coefs1)
[1] 1.122484
This is the plot we get, comparing the model’s coefficients to those from the real data generating process:
OK, all very good but it makes no effort (other than just hoping it works out) to constrain the coefficients to add up to one. So here’s my first real attempt at that, which is based on a suggestion that whuber made in the comments on the Cross-Validated post. He points out with some basic algebra that if you subtract one of the X columns from the others and from y, and then fit a regression with OLS to that transformed data, you get the correct coefficients on all the columns of X that you left in, and can calculate the remaining one by just subtracting them all from 1.
This time the coefficients add up exactly to one as required, but unfortunately there is still a negative in there
> round(coefs2, 2)
X2x1 X2x2 X2x3 X2x4
0.24 0.40 -0.04 0.00 0.40
> sum(coefs2)
[1] 1
This isn’t news - it was identified as a problem in the discussion on Cross-Validated, but with the simulated data used there it didn’t manifest. In fact, this is one reason why my simulated data is a bit dirtier, with original coefficients that don’t add to one and which have a genuine negative value among them.
So much for model 2.
Model 3 is one I invented myself and makes use of the fact that glmnet
lets you not only regularise your coefficients (squash them towards zero, to help handle multicollinearity, overfitting, and related issues) it lets you specify upper and lower bounds for coefficients. So we could get our first fully compliant set of regression coefficients (non-negative, adding to one) by using the data transformation approach in model2 but fitting the model with glmnet
instead. Like this:
The results are all non-negative, and add correctly to one:
> round(coefs3, 2)
[1] 0.11 0.24 0.00 0.00 0.64
> sum(coefs3)
[1] 1
They do differ markedly from the true coefficients:
We’ll come back to that.
Next is what we might call the “proper” solution, which is to treat the problem as the classic case for quadratic programming that it is. Quadratic programming was developed exactly to deal with this situation - optimise a multivariate quadratic function (like, least squares in a regression) subject to linear constraints (like, coefficients must be non-negative and add up to one). Hans Borchers’ R package (“Practical Numerical Math Functions”) has an easy to understand interface to do this without having to invert your own matrices, so the only thing you need to do is to express the constraints in the right way. Here’s how we do this with our problem:
Just like the previous method, our result is all non-negative and sums to zero:
> round(coefs4, 2)
[1] 0.17 0.44 0.00 0.00 0.38
> sum(coefs4)
[1] 1
This time the coefficients look a bit closer to the original ones:
Finally, I the other method I invented was to go all-in Bayesian and simply specify in Stan that the coefficients have to be a simplex and that they have a Dirichlet distribution. This actually seems the most intuitive solution to me - what better way is there of making something a simplex than by just deeming it to have a Dirichlet distribution?
So I wrote this program in Stan:
… and called it from R with this:
This is now the third method for which our result is correctly all non-negative and sums to zero:
> round(coefs5, 2)
[1] 0.17 0.45 0.00 0.00 0.38
> sum(coefs5)
[1] 1
… and this is what it looks like, basically very similar to the quadratic programming solution:
OK, which method is best? We need some sort of benchmark to test them against, better than the original coefficients from the data generating process in complying with the fundamental restrictions. Given the ultimate purpose is to construct a weighted average, we want something that has similar proportions to the original coefficients. So, just wanting to be pragmatic about this, I created a “normalised” set of coefficients by removing the negative one and scaling the remainder to add up to one.
This lets us define a success criteria - the winning method is the one that meets the constraints (ruling out models 1 and 2) and most closely matches those proportions. Here’s a nice pairs plot comparing everything at a glance:
What we see is that models 4 and 5 are clear, effectively equal, winners. Not only are they nearly identical, they also match the normalised true coefficients (correlations of 0.98 or above). Sadly, the glmnet method doesn’t perform very well at recovering the original proportions. Perhaps it would have done better if I’d let the columns of X be correlated with eachother rather than independent (dealing with multi-collinearity being elastic net regression’s home turf). Perhaps also it would get closer to the quadratic programming solution if I forced the amount of regularisation down to nearly zero. But with two good solutions already I’m not inclined right now to experiment further on that.
The quadratic programming approach was easier to write and debug and quicker to run than the explicit modelling of the Dirichlet distribution in Stan, so pragmatically it becomes the winner; with the sole caveat being that question mark in the last paragraph about what happens if the columns of X are multi-collinear.
Oh, let’s finish with the code to draw that pairs plot. For one reason or another, I used good old base R pairs()
to do this, rather the GGally
package I more normally use for scatterplot matrices. I like the easy control pairs()
gives you over the upper and lower triangles of the scatterplot matrix, by just defining your own function on the fly. So for the old-timers, here’s that base R code: