The story so far
Last week I looked at a few related methods of investigating causality with observational data, where
- the treatment is expected to be received by observational units (people or firms)…
- in a way that is structurally related to variables of importance (ie not a randomised controlled trial)…
- which also impact on the response of interest.
The real life data I used was originally analysed by Lalonde, and investigated the impact of a job training program on income. People who received the job training program were disproportionately in need (ie low income at the time of the program), so we had to take this into account in the analysis. The three methods I used were
- propensity score matching - which creates a pseudo control group to compare to a treatment group, by choosing untreated units with a similar “propensity” to receive the treatment to those that actually did; hence discarding most of the data, but creating a nice, interpretable group.
- inverse propensity treatment weighting - in which the predicted propensity to receive the treatment is used to create weights for the whole data set, with more weight being given to points that look like they should have received the treatment but didn’t, and vice versa. Those weights can be used in a simple comparison of weighted means, or for a weighted regression.
- regression - in which the propensity to receive the treatment isn’t explicitly modelled, but the various confounding variables (ethnicity, income at beginning of program, age, etc) are “controlled” for in the usual regression way.
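The three approaches can be sketched on a toy example. This is a minimal illustration in base R, not the analysis of the Lalonde data: the single confounder `x`, the sample size, the coefficients and the simple nearest-neighbour matching rule are all assumptions made for the example, and the true treatment effect is set to 1.

```r
set.seed(123)
n <- 20000
x <- rnorm(n)                                # one hypothetical confounder
treatment <- rbinom(n, 1, plogis(-2 + x))    # treatment is more likely when x is high
y <- 1 * treatment + 2 * x + rnorm(n)        # true treatment effect is 1

treated  <- which(treatment == 1)
controls <- which(treatment == 0)

# Naive comparison of means, ignoring the confounder
est_naive <- mean(y[treated]) - mean(y[controls])

# Propensity model: estimated probability of treatment given x
ps <- fitted(glm(treatment ~ x, family = binomial))

# 1. Matching: for each treated unit, pick the control with the closest
#    propensity score (with replacement), then compare means
matched <- sapply(treated, function(i) {
  controls[which.min(abs(ps[controls] - ps[i]))]
})
est_match <- mean(y[treated]) - mean(y[matched])

# 2. Inverse propensity of treatment weights on the full data,
#    used here in a comparison of weighted means
w <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))
est_iptw <- weighted.mean(y[treated], w[treated]) -
  weighted.mean(y[controls], w[controls])

# 3. Single stage regression controlling for the confounder
est_reg <- unname(coef(lm(y ~ treatment + x))["treatment"])

round(c(naive = est_naive, matching = est_match,
        iptw = est_iptw, regression = est_reg), 2)
```

Because the treatment effect is constant in this toy setup, matching (which targets the effect on the treated) and weighting (which targets the effect in the whole population) are aiming at the same number.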
The conclusion was:
Compared to the older style propensity matching to create a pseudo control sample, it may be better to weight the full data by inverse propensity score because it doesn’t discard data. Performing a regression (rather than simple cross tabs) after the weighting or matching is a good idea to handle inevitable imperfections. The whole family of methods doesn’t necessarily deliver big gains over more straightforward single stage regressions. And if you have omitted material variables you’re in trouble whatever you do.
The omitted variables problem
Today I’m presenting some simulated data to explore that conclusion, with a particular focus on the omitted variables bias. What proportion of the “real” explanatory variables need to be missing from our observed set (and hence from the propensity model and/or the regression) for the whole thing to go belly-up?
To explore this I created a bunch of datasets with the following characteristics:
- the response variable y is independently and identically distributed in a normal distribution with the expected value a linear combination of 100 continuous explanatory variables and one binary treatment variable
- the true treatment effect - the coefficient for the treatment variable when y is the response in a regression - is exactly 1.
- the true coefficients for the 100 other explanatory variables are standard normal variables (ie centered on zero with variance of one)
- the actual values of the explanatory variables are a multivariate normal distribution with non-trivial covariance, and individual means of zero and variances of one
- the propensity to get the treatment is related to the 100 explanatory variables, and in total 10% of cases get the treatment.
Here’s the setup of the session in R and the function that I used to create those datasets:
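As an illustration only (this is not the original function), a generator matching the description above might look like the sketch below. The name `make_data`, its arguments, the AR(1)-style correlation matrix and the logistic propensity model are all assumptions made for the example; it uses base R only.

```r
# Hypothetical generator: names, defaults and correlation structure are assumptions
make_data <- function(n, p = 100, rho = 0.5, prop_treated = 0.1, effect = 1) {
  # Correlation matrix with unit variances and non-zero off-diagonals
  Sigma <- rho ^ abs(outer(seq_len(p), seq_len(p), "-"))

  # Multivariate normal draws with mean zero and covariance Sigma,
  # built from independent normals and the Cholesky factor of Sigma
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  colnames(X) <- paste0("x", seq_len(p))

  beta  <- rnorm(p)   # true coefficients in the response model
  gamma <- rnorm(p)   # coefficients driving the propensity to be treated

  # Standardise the propensity score so treatment is not near-deterministic,
  # then choose the intercept so the average propensity equals prop_treated
  score <- as.numeric(scale(X %*% gamma))
  intercept <- uniroot(function(a) mean(plogis(a + score)) - prop_treated,
                       interval = c(-20, 20))$root
  treatment <- rbinom(n, 1, plogis(intercept + score))

  # Response: linear in treatment and all p variables, with standard normal noise
  y <- effect * treatment + as.numeric(X %*% beta) + rnorm(n)

  data.frame(y = y, treatment = treatment, X)
}

# Example use with a smaller number of variables to keep it quick
set.seed(42)
d <- make_data(n = 2000, p = 20)
dim(d)
mean(d$treatment)
coef(lm(y ~ ., data = d))["treatment"]
```

With every variable observed, the regression coefficient for `treatment` should land close to the true value of 1, which is the benchmark the rest of the exercise compares against.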
If observed in full, datasets generated this way should be ideal for analysis based either on propensity scores modelling or a simpler single stage regression; I haven’t introduced any complications such as outliers in any of the variables, non-linear relationships, or non-normal distribution of the response variable.
- I generated 30 data sets for each of the following sample sizes: 500, 1,000, 2,000, 5,000, 10,000, 20,000 and 100,000.
- For each of the resulting 210 datasets I created 10 datasets representing various stages of complete observation - with only 10, 20, 30, …, 100 of the 100 actual explanatory variables. The idea was to mimic the real-life situation where we don’t have access to all the variables that impact on the propensity to get the treatment, or on the eventual variable of interest.
- For each of the 2,100 datasets of various stages of observation I estimated the treatment effect (which has the true value of 1 in each case) with four different methods:
  - comparison of means after propensity score matching
  - regression on explanatory variables after propensity score matching
  - weighted regression using the inverse propensity of treatment weights, using the full dataset
  - single stage regression without modelling the propensity of treatment explicitly
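To make the "stages of observation" idea concrete, here is a small self-contained sketch, again an assumption-laden illustration rather than the code actually used. It simulates one dataset with 20 (not 100) explanatory variables and estimates the treatment effect with two of the four methods, single stage regression and inverse propensity weighted regression, when only the first 10, 15 or all 20 variables are treated as observed. The helper `estimate()` and the data generating details are hypothetical.

```r
set.seed(2017)
n <- 5000
p <- 20

# Correlated explanatory variables with unit variances (assumed AR(1)-style structure)
Sigma <- 0.5 ^ abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
colnames(X) <- paste0("x", 1:p)

# Treatment depends on all p variables; roughly 10% of cases are treated
score <- as.numeric(scale(X %*% rnorm(p)))
treatment <- rbinom(n, 1, plogis(-2.5 + score))

# Response depends on treatment (true effect 1) and all p variables
y <- treatment + as.numeric(X %*% rnorm(p)) + rnorm(n)
d <- data.frame(y, treatment, X)

# Estimate the treatment effect using only the first k explanatory variables
estimate <- function(k) {
  vars <- paste0("x", seq_len(k))

  # Propensity model and inverse propensity of treatment weights
  ps <- fitted(glm(reformulate(vars, "treatment"), family = binomial, data = d))
  d$w <- ifelse(d$treatment == 1, 1 / ps, 1 / (1 - ps))

  f <- reformulate(c("treatment", vars), "y")
  c(observed = k,
    regression = unname(coef(lm(f, data = d))["treatment"]),
    iptw_regression = unname(coef(lm(f, data = d, weights = w))["treatment"]))
}

results <- t(sapply(c(10, 15, 20), estimate))
round(results, 2)
```

The size and direction of the error when variables are left out depend on the particular coefficients drawn, so a single run like this shows the mechanics only; the full exercise repeats it over many datasets and sample sizes.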
The whole exercise took about eight hours to run on my laptop, for most of which time it was fully utilising all available resources.
Let’s first look at just the situation where we had all 100 explanatory variables in the model:
We see that propensity score matching (in blue) and then comparing the means always has a much bigger range of estimates than the three methods that use regression. We also see (unsurprisingly) that increasing sample size helps a lot for all methods.
Now compare this to when we only observe (and hence include in the analysis) 80 of the 100 true explanatory variables of the same datasets:
Ouch. The range and variance of estimates is much larger. Increasing sample size seems to help only marginally. And the advantage of the three regression methods over simple means comparison after propensity score matching is much less than before. Omitted variables bias - even when only 20% of them are missing - is a killer and it impacts on both bias and consistency. Note that the impact of bias doesn’t show up well here because different samples have different data generating processes and the biases are in different directions according to which population the sample is generated from.
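The standard omitted variable result for linear regression gives one way to see why extra sample size doesn't help. The notation here is introduced purely for illustration and isn't from the original analysis: t is the treatment indicator, X holds the observed explanatory variables, Z the unobserved ones, and the "short" regression is the one that leaves Z out.

```latex
\begin{aligned}
y &= \tau t + X\beta + Z\gamma + \varepsilon, \qquad \tau = 1 \\
Z\gamma &= \delta_t\, t + X\delta_X + u
  \quad \text{(population linear projection of } Z\gamma \text{ on } t \text{ and } X\text{)} \\
\operatorname*{plim}_{n \to \infty} \hat{\tau}_{\text{short}} &= \tau + \delta_t
\end{aligned}
```

The term \delta_t is a property of the population, not of the sample, so it doesn't shrink as n grows; and its sign depends on the coefficients and covariances of that particular population, which fits with the biases pointing in different directions for different datasets.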
We can see the impact of omitting more variables from the next graphic, which shows the results of fitting the model with different numbers of explanatory variables included:
When all 100 variables are included (to the far right of each facet), all the models are pretty good, although simple means comparison after propensity score matching is noticeably further from the true value of 1 than the other methods. But when only 90% of explanatory variables are observed and included, and much more so when fewer are in the model, the estimates of the treatment effect are systematically off, in a direction which varies for individual datasets (each represented by one coloured line). Missing 20% of the variables (ie including only 80) is enough for the effect to be bad; and increasing the sample size from 10,000 to 100,000 doesn’t make much difference.
The final graphic shows how all these estimates are inconsistent (ie don’t converge to the correct answer with increasing sample size) when fewer than the full 100 variables are included:
Again, we also see that the regression methods (with or without propensity score matching or weighting) are far better than simple means comparison after propensity score matching when all variables are observed and included in the model, but this isn’t anywhere near as obviously the case when there is an omitted variables problem.
- all four methods are basically ok when all variables are included, but simply comparing means (without a regression) after propensity matching is not efficient;
- propensity matching without regression is much less efficient than propensity matching with regression, and neither is as good as either of the methods that fit regression models to the full data (with or without weights);
- missing 10 variables out of 100 is enough for all methods to go awry;
- missing variables is worse than small sample size. At its most extreme, a sample size of 500 but with all 100 variables observed is pretty much as good as a sample size of 100,000 with only 90 variables observed - so long as a regression of some sort is used.
Here’s the R code that does the actual simulations, analysis and prepares those charts; it takes up where the chunk of code earlier in this post finishes.