Exploring propensity score matching and weighting

This post jots down some playing around with the pros, cons and limits of propensity score matching or weighting for causal social science research.

Intro to propensity score matching

One is often faced with an analytical question about causality and effect sizes when the only data around is from a quasi-experiment, not the random controlled trial one would hope for. This is common in many fields, but some of the most important occurrences are in public policy. For example, government programs to help individuals or firms are typically not allocated at random, but go to those with higher need, or higher potential to make something out of the assistance. This makes isolating the effect in the data of the treatment difficult, to say the least.

A frequently-used family of analytical methods to deal with this are grouped under propensity score matching (although not all these methods literally “match”). Such methods model the probability of each unit (eg individual or firm) receiving the treatment; and then using these predicted probabilities or “propensities” to somehow balance the sample to make up for the confounding of the treatment with the other variablers of interest.

The simplest to explain member of this family involves creating a pseudo control group from the non-treatment individuals that resembles the treatment group in that they have similar propensities to get the treatment, but differs in that they just didn’t get it. Then analysis can proceed “as though” the data had been generated by an experiment. This approach is easy to explain to non-statisticians and has the great benefit of creating a tangible, concrete “control” group.

But everything depends on the model of the probabilities of getting the treatment. If it’s a good model, you’re fine. If not - and in particular, if the chance of getting the treatment is related to unobserved variables which also impact on your response variable of interest - the propensity score approach helps very little if at all. A good text on all this (and much more) is Morgan and Winship’s Counterfactuals and Causal Inference: Methods and Principles for Social Research.

Job training example

Basics

A pretty thorough implementation of various propensity score methods in R comes in Daniel E. Ho, Kosuke Imai, Gary King, Elizabeth A. Stuart (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, Vol. 42, No. 8, pp. 1-28. There’s a good overview in the MatchIt vignette. I’ll get started with data from one of their examples, which shows a typical application of this technique:

“Our example data set is a subset of the job training program analyzed in Lalonde (1986) and Dehejia and Wahba (1999). MatchIt includes a subsample of the original data consisting of the National Supported Work Demonstration (NSW) treated group and the comparison sample from the Population Survey of Income Dynamics (PSID).1 The variables in this data set include participation in the job training program (treat, which is equal to 1 if participated in the program, and 0 otherwise), age (age), years of education (educ), race (black which is equal to 1 if black, and 0 otherwise; hispan which is equal to 1 if hispanic, and 0 otherwise), marital status (married, which is equal to 1 if married, 0 otherwise), high school degree (nodegree, which is equal to 1 if no degree, 0 otherwise), 1974 real earnings (re74), 1975 real earnings (re75), and the main outcome variable, 1978 real earnings (re78).”

First, loading up the functionality I need for the rest of this post

library(tidyverse)
library(forcats)
library(scales)
library(MASS)              # for rlm
library(boot)              # for inv.logit
library(MatchIt)
library(boot)              # for bootstrapping
library(doParallel)        # for parallel processing
library(mgcv)              # for gam

#==============example from the MatchIt vignette=================
data(lalonde)

# naive comparison - people who got the job training program
# have lower incomes in 1978 - because the training was given
# to people with income problems.  So comparison is unfair:
lalonde %>%
   group_by(treat) %>%
   summarise(Income1978 = mean(re78),
             n = n())

  treat Income1978     n
1     0   6984.170   429
2     1   6349.144   185

So we see that the 429 people who didn’t get the job training “treatment” had an average income about $635 more than the 185 beneficiaries. Program failure!

Obviously that’s unfair on the program, so we use matchit and match.data to create an artificial control group that resembles the treatment group in terms of age, education, ethnicity, marital stats, and income in 1974 and 1975:

# Choose one of the large variety of propensity score matching methods to model propensity
match_model <- matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75, 
                       data = lalonde, method = "nearest")
match_data <- match.data(match_model)
   
# Simple comparison is now much fairer
match_data %>%
   group_by(treat) %>%
   summarise(Income1978 = mean(re78),
             n = n())

  treat Income1978     n
1     0   5440.941   185
2     1   6349.144   185

That gives us an average treatment effect of $908. Notice that the new control group is the same size as the treatment group; the rest of the observations have been discarded.

You don’t need to limit yourself to simple comparisons, although in principle they should work. A regression with the matched control and treatment data, even using the same explanatory variables as were used in the matching model, helps address the inevitable lack of complete balance between the two groups. That gives us a result of $960 (output not shown). I use the robust M-estimator rather than ordinary least squares to deal with breaches of the classical regression assumptions (particularly the non-Normal response, with variance increasing as its mean increases).

# regression model estimate with matched data
coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married +  re74 + re75, 
        data = match_data))["treat"]

Regression is simpler and gives similar results

However, a similar estimate could have come from a simpler, single-step regression with the original data, skipping the propensity score modelling altogether (there are arguments pro and con). The regression below estimates a treatment effect of $1,183. I call this “similar” because the uncertainty around all these estimates is huge, which I’ll demonstrate further down the post.

# regression model estimate with original data
coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married +  re74 + re75, 
        data = lalonde))["treat"]

Weighting might be better than matching

Creating a control group by matching has the distressing side-effect of throwing away large amounts of the data, because the control group is shrunk down to the same size as the treatment group. A possibly better use of the propensity scores is to keep all observations in play but weight them according to the propensity score - one of the methods described by Peter Austin in this article on “An introduction to propensity score methods for reducing the effects of confounding in observational studies”.

Under “inverse probability of treatment weighting”, proposed by Imbens in 2000, observations that receive the treatment are given weight of $\frac{1}{p}$ and those that did not receive the treatment are given weight of $\frac{1}{1-p}$, where p is the probability of getting the treatment. That is, each observation is given weight of the inverse of the probability of the treatment they actually got. Intuitively, treatment cases that resemble the controls are interesting and given more weight, and control cases that look like they should have got the treatment are interesting and get more weight. Analysis can then proceed via weighted average values, or a regression with explanatory variables (which may or may not be the same variables as those used in the model of propensity for treatment).

I implemented this with Lalonde’s data without using the MatchIt package, partly to convince myself I understood how it worked, and partly because I wanted more flexibility in modelling propensity than is supported by that package. I’d looked at the generalized linear model with binomial family response that was used for its propensity score matching, and noticed that age was showing up as unhelpful in determining treatment. My expectation is that age is probably related in a non-linear fashion to the probability of getting job training treatment, so I used a generalized additive model to allow for this…

mod <- gam(treat ~ s(age) + educ + black + hispan + nodegree + married + re74 + re75, 
           data = lalonde, family = "binomial")

plot(mod, shade = TRUE, 
     main = "Non-linear function of age in treatment propensity")

…which turns out to be the case:

Converting predicted probabilities into weights is straightforward. In the code below I have a quick look at the resulting density of weights.

lalonde <- lalonde %>%
   mutate(propensity_gam = predict(mod, type = "response"),
          weight_gam = 1 / propensity_gam * treat + 
                       1 / (1 - propensity_gam) * (1 - treat))

ggplot(lalonde, aes(x = weight_gam, fill = as.factor(treat))) +
   geom_density(alpha = 0.5, colour = "grey50") +
   geom_rug() +
   scale_x_log10(breaks = c(1, 5, 10, 20, 40)) +
   ggtitle("Distribution of inverse probability weights")

One of the criticisms of this inverse probability of treatment weighting approach is that individual observations can get very high weights and become unduly influential. For example, in an incomplete article by Posner and Ash (there may be a complete version somewhere else):

“While this method can be shown to have nice mathematical properties, it does not work well in practice. Consider a lone treated observation that happens to have a very low probability of being treated…. The value of the inverse of the propensity score will be extremely high, asymptotically infinity. The effect size obtained will be dominated by this single value, and any fluctuations in it will produce wildly varied results, which is an undesirable property.”

Ouch. Posner and Ash go on to suggest alternative ways of weighting less vulnerable to these problems. As a simple starter, in the below I try the obvious first step of truncating the weights at 10. Here are the average incomes of the treatment and non-treatment groups using the full set of inverse probability weights, and another set truncated at 10.

lalonde %>%
   group_by(treat) %>%
   summarise(Income_allweights = round(weighted.mean(re78, weight_gam)),
             Income_truncweights = round(weighted.mean(re78, pmin(10, weight_gam))),
             n = n())

  treat Income_allweights Income_truncweights     n  
1     0              6446                6435   429
2     1              6814                7167   185

Note that we now have all 429 of the non-treatment cases, a definite advantage over the matching methods.

We’re not limited to simple weighted averages of course; we can use these weights in the analysis of our choice including our robust regression model. Ho et al (2007) suggest that (in Morgan and Winship’s paraphrase):

“the general procedure one should carry out in any multivariate analysis that aspires to generate causal inferences is to first balance one’s data as much as possible with a matching routine and then estimate a regression model on the matched data. From this perspective, matching is a preprocessor, which can be used to prepare the data for subsequent analysis with something such as a regression model.”

The “doubly robust” approach to regression modelling to identify causal effects uses weights from propensity score matching as well as a regression.

“There has been considerable debate as to which approach to confounder control is to be preferred, as the first [ie single step regression] is biased if the outcome regression model is misspecified while the second approach [ie propensity score matching] is biased if the treatment regression, ie propensity, model is misspecified. This controversy could be resolved if an estimator were available that was guaranteed to be consistent … whenever at least one of the two models was correct. … We refer to such combined methods as doubly-robust or doubly-protected as they can protect against misspecification of either the outcome or treatment model, although not against simultaneous misspecification of both.”

Robins and Rotnitzky 2001, quoted in Morgan and Winship Counterfactuals and Causal Analysis

So here is my robust M estimator regression, using inverse propensity of treatment as weights, where propensity of treatment was modelled earlier as a generalized additive model on all explanatory variables and non-linearly with age:

coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married +  re74 + re75, 
         data = lalonde, weights = as.vector(weight_gam)))["treat"]

This gives a treatment estimate of $910 and I think it’s my best point estimate so far.

These weighting methods seem better than simply matching, if only because they retain all the data, but crucially they are still vulnerable to model mis-specification and unobserved explanatory variables. There’s a good critical discussion in this article by Freedman and Berk:

“Regressions can be weighted by propensity scores in order to reduce bias. However, weighting is likely to increase random error in the estimates, and to bias the estimated standard errors downward, even when selection mechanisms are well understood. Moreover, in some cases, weighting will increase the bias in estimated causal parameters. If investigators have a good causal model, it seems better just to fit the model without weights. If the causal model is improperly specified, there can be significant problems in retrieving the situation by weighting, although weighting may help under some circumstances.”

And with respect to the “doubly robust” approach:

“you need to get at least one of the two models (and preferably both) nearly right in order for weighting to help much. If both models are wrong, weighting could easily be a dead end. There are papers suggesting that under some circumstances, estimating a shaky causal model and a shaky selection model should be doubly robust. Our results indicate that under other circumstances, the technique is doubly frail.”

When I see multi-stage approaches like propensity score matching or weighting - just like structural equation models, and two or three stage least squares - that aim to deal with causality by adding complexity, I always get very nervous; the more so when I read criticisms like those above. I’ve done some simulations exploring this issue but a write-up will have to wait for a separate post.

Bootstrap or cross-validation must envelope the propensity modelling

The literature has a range of (conflicting) views on estimating uncertainty of statistics estimated after propensity score matching or weighting. Morgan and Winship report in a footnote that the bootstrap does not to work particularly well for matching in particular, because the resampling process leaves fewer distinct cases to match to during the propensity modelling stage. However, Austin and Small reported in 2014 tests of bootstrap methods that seem quite effective. Definitely, one of the appeals of weighting (rather than matching) is that it should make the overall process more suitable for sitting inside a bootstrap. However, I don’t have time for a really thorough review of this literature, so let’s just note that estimating uncertainty after propensity score matching is a moving area of uncertainty itself.

I use the method described by Austin and Small as the “complex bootstrap”, which involves resampling from the original data and performing the propensity modelling and matching for each resample. The propensity modelling is a big source of our uncertainty in the final estimates of interest. So that source of uncertainty needs to be repeated each time we mimic the sampling process with our bootstrap (the same applies to other pre-processing steps, like imputation). Austin and Small report that this results in a small overestimate of sampling variability (standard error higher by 7% than it should be; compared to 4% for a simpler bootstrap approach) but my principle, following the approach in Harrell’s Regression Modeling Strategies and elsewhere, is to always include pre-processing inside the bootstrap. I suspect that with less ideal data than Austin and Small use in their simulations (on my reading, they assume all relevant variables are available, amongst other ideals) this will pay off. Anyway, the results reported below look plausible.

Here’s code that bootstraps four of the methods of estimating treatment effect above:

simple comparison of matched data;
robust regression with matched data;
robust regression with data weighted by inverse propensity of treatment;
robust regression with the original data.

#=================bootstrap for confidence intervals=================
# Set up a cluster for parallel computing
cluster <- makeCluster(7) # only any good if you have at least 7 processors :)
registerDoParallel(cluster)

clusterEvalQ(cluster, {
   library(MatchIt)
   data(lalonde)
})

#' Function to estimate treatment effect three different methods
#' @return a vector of four estimates of treatment effect.  See comments
#' in function for description of which is which
my_function <- function(x, i){
   resampled_data <- x[i, ]
   match_data <- match.data(
      matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75, 
              data = resampled_data, method = "nearest")
   )
   
   # simple mean of matched groups
   est1 <- with(match_data,
                mean(re78[treat == 1]) -
                   mean(re78[treat == 0]))
   
   
   # regression model estimate with matched groups
   est2 <- coef(rlm(re78 ~ treat + age + educ + black + hispan + nodegree + married +  re74 + re75, 
           data = match_data))["treat"]
   
   # regression model with IPTW
   mod <- gam(treat ~ s(age) + educ + black + hispan + nodegree + married + re74 + re75, 
              data = resampled_data, family = "binomial")
   
   resampled_data <- resampled_data %>%
      mutate(propensity_gam = predict(mod, type = "response"),
             weight_gam = 1 / propensity_gam * treat + 1 / (1 - propensity_gam) * (1 - treat))
   
   est3 <- coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married +  re74 + re75, 
            data = resampled_data, weights = as.vector(weight_gam)))["treat"]
   
   # regression model estimate with original data
   est4 <- coef(rlm(re78 ~ treat + age + educ + black + hispan + nodegree + married +  re74 + re75, 
                   data = resampled_data))["treat"]
   return(c(est1, est2, est3, est4))
}

# test gives same results as when done earlier in the post:
my_function(lalonde, 1:nrow(lalonde))

booted <- boot(lalonde, statistic = my_function, R = 5000, 
     parallel = "snow", cl = cluster)

booted_df <- as.data.frame(booted$t)
names(booted_df) <- c("Simple difference with matched data", 
                      "Regression with matched data", 
                      "Regression with weighted data",
                      "Regression with original data")

booted_tidy <- booted_df %>%
   gather("Method", "Estimate") %>%
   mutate(Method = fct_reorder(Method, Estimate)) 

booted_summary <- booted_tidy %>%
   group_by(Method) %>%
   summarise(lower = quantile(Estimate, 0.025),
             upper = quantile(Estimate, 0.975)) %>%
   gather(type, Estimate, -Method)

ggplot(booted_tidy, aes(x = Estimate, fill = Method)) +
  geom_density(alpha = 0.4, colour = "grey50") +
  geom_segment(data = booted_summary, aes(xend = Estimate), y = 0, yend = 2e-04) +
  geom_text(data = booted_summary, aes(label = round(Estimate)), y = 2.5e-04, size = 3) +
  facet_wrap(~Method) +
  theme(legend.position = "none") +
  scale_x_continuous(label = dollar) +
  ggtitle("Treatment effects from four different modelling methods",
		  "Distribution and 95% confidence intervals of estimated impact on income in 1978 of a job training program") +
  labs(x = "Estimated treatment effect",
	   caption = "Lalonde's 1986 data on a job training program",
	   fill = "")

> # biases revealed by the bootstrap:
> booted

Bootstrap Statistics :
     original     bias    std. error
t1*  908.2022 -268.85736    766.5107
t2*  959.5030  -75.55153    655.4119
t3*  909.9001   12.09643    838.2337
t4* 1182.5056   33.35265    682.6993

Those are big confidence intervals - reflecting the small sample size and the difficulty of picking up an impact of an intervention amongst all the complexities of income determinants. I’m satisfied that those are ok depictions of the uncertainty.

All three of the two stage methods that model propensity first give extremely similar results; and the simpler one-stage regression on the original unweighted data gives a result within a third of a standard error. In practice which would I do? I’d probably do it both ways and if there are wildly different results I’d worry, treating that as a diagnostic warning.

Basically with this sort of data, the big gain comes from any modelling strategy that accepts treatment as endogenous and hence you need to control for other variables that are related to both it and the response. Once you’ve adopted a method to do that, the benefits of which particular method are, to me, not particularly material.