Just before Christmas I blogged about the positive correlation between depression incidence in US counties and their vote for Trump in the 2024 presidential election. In addition to my casual interest in the topic, I used it as a case study in multilevel modelling while adjusting for spatial correlation. I explicitly said that I didn’t think it likely that the depression-vote relationship was a causal link; I suspected that most likely, some underlying variable that caused depression was also related to voting behaviour.
I’m coming back to the issue because on reflection, I have three bits of unfinished business:
- I had a nagging thought that with 50+ counties per state (and hence some degrees of freedom to spare), I perhaps should have allowed random slopes for depression incidence in each state, rather than just random intercepts
- I thought my spatial statistics workaround, of just adding a rubbery mat to space to suck up the spatial autocorrelation between observations, perhaps was a bit slap-dash and I should be as a matter of course modelling the spatial autocorrelation explicitly
- An alert reader, Jonathan Spring, pointed out that in the USA there are very marked racial differences between depression diagnoses and suggested that perhaps in my model the depression incidence was standing in as a proxy for “whiteness”.
Of these, the first two felt like bits of probably-immaterial-to-the-question methodological details that I should polish up, whereas number 3 seemed likely to be the explanation of the whole phenomenon. In other words, race is likely a confounder of the depression-voting relationship, as per this directed acyclic graph:
Which means that if we want to actually understand the causal relationship of depression on voting we would need to control for race in the regression. Now, there are other things we’d need to do too; in particular to identify and control for the various “other factors”. I’m not up for that right now - this is someone else’s job - but I’m interested enough to go part-way into things to at least check out the degree to which race makes the depression relationship go away.
This is a long post and parts of it are likely to be of interest only to people concerned with the minutiae of multilevel modelling with spatial autocorrelation. If you just want to see what including race in the model does to the depression effect on voting for Trump, you can skip down to the final section.
All the code for this post assumes that the code from the previous post has already been run. If you want a version of the whole script that just runs, it’s here in the source repository of the whole blog.
But here’s the code for just that DAG diagram:
OK, on to fixing up the loose ends with my modelling approach.
Move from gam
to gamm
I decided that the simplest material improvement to my approach to the spatial autocorrelation would be to explicitly model the spatial co-variance of the residuals. One way of doing this that is the minimal change from my approach so far would be to move the model-fitting from mgcv::gam()
to mgcv::gamm()
. gamm()
fits models by iterating between calls to nlme::lme()
and a generalized additive model until convergence. Basically this gives us the ability to use the correlation structures for residuals and mixed random and fixed effects of lme()
while still using the splines and response distributions of gam()
. This is what I need as I want to continue modelling my response with a quasi-binomial family, and I am probably going to want to keep my “rubber sheet” nuisance effect over the US space modelled with a two-dimensional spline (s(x, y)
), even while using the correlation features of lme()
.
So the first thing I do is create a model6b
, as similar as possible to the model6
that was the best and final model from the last blog post, but just changes the estimation method. So here it is, a straight transfer from gam()
to gamm()
… which gives us these two different results:
> summary(model6)
Family: quasibinomial
Link function: logit
Formula:
per_gop ~ cpe + s(x, y) + s(state_name, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.7972 20.7488 -0.135 0.893
cpe 15.5908 0.6778 23.001 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x,y) 28.39 28.94 7.995 <2e-16 ***
s(state_name) 49.00 49.00 7.949 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.386 Deviance explained = 44.8%
GCV = 3138.5 Scale est. = 3170.7 n = 3107
> summary(model6b$gam)
Family: quasibinomial
Link function: logit
Formula:
per_gop ~ cpe + s(x, y) + s(state_name, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.8878 0.1481 -19.50 <2e-16 ***
cpe 15.0166 0.6271 23.95 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x,y) 25.41 25.41 6.438 <2e-16 ***
s(state_name) 39.91 49.00 7.154 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.382
Scale est. = 3005.3 n = 3107
There’s some differences coming from the different estimation methods. The gamm()
model uses less effective degrees of freedom for both the s(x,y)
rubber mat and the s(state_name, bs="re")
state level random intercept. The fixed effect coefficient for cpe
(‘crude prevalence estimate’ of county-level depression) is a little different - 15.0 versus 15.6. But we can see we’re fitting the same model and getting substantively the same results.
Next small change I make is to move the state level random intercept from the gam
specification into lme
. Again, this is an identical model, just changing how the fitting is done:
This is a big performance speed up (which we’re going to need as the models start getting more complex) for materially the same results.
Random slopes
Now I’m ready to take advantage of the move to the gamm
syntax and I let the slopes, not just the intercepts, vary for each state:
…which gives us these results. Note that the Akaike Information Criterion has decreased by 200 from adding the random slopes, suggesting the model is overall worth the increased complexity. The cpe (depression) coefficient has decreased in size but is still quite large and definitely significant.
> AIC(model7$lme, model6c$lme)
df AIC
model7$lme 9 8605.692
model6c$lme 7 8806.902
> summary(model7$lme)
Linear mixed-effects model fit by maximum likelihood
Data: data
AIC BIC logLik
8605.692 8660.064 -4293.846
Random effects:
Formula: ~Xr - 1 | g
Structure: pdIdnot
Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7
StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986
Xr8 Xr9 Xr10 Xr11 Xr12 Xr13 Xr14
StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986
Xr15 Xr16 Xr17 Xr18 Xr19 Xr20 Xr21
StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986
Xr22 Xr23 Xr24 Xr25 Xr26 Xr27
StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986
Formula: ~1 + cpe | state_name %in% g
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 3.294624 (Intr)
cpe 15.833056 -0.985
Residual 50.963709
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: list(fixed)
Value Std.Error DF t-value p-value
X(Intercept) -2.099504 0.5204783 3054 -4.033796 0.0001
Xcpe 10.482779 2.5041023 3054 4.186243 0.0000
Xs(x,y)Fx1 0.221830 0.1339667 3054 1.655859 0.0979
Xs(x,y)Fx2 0.134028 0.1735957 3054 0.772070 0.4401
Correlation:
X(Int) Xcpe X(,)F1
Xcpe -0.985
Xs(x,y)Fx1 -0.083 0.084
Xs(x,y)Fx2 -0.121 0.126 0.568
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-9.65917765 0.05092882 0.32541129 0.62277480 6.27537901
Number of Observations: 3107
Number of Groups:
g state_name %in% g
1 50
We can use this model to make a new version of the final chart from the previous blog post:
It doesn’t look much different. So my hunch on this aspect was right; there’s enough data to justify letting the depression-vote slope vary in each state (ie we get a better model) but it doesn’t change the substantive conclusion.
Better spatial autocorrelation
Now I’m ready to improve the way I’m handling spatial autocorrelation. As I mentioned last time, my approach to this is to model the centroid of each county with an s(x, y)
two-dimensional spline, a sort of rubber mat overlaid over the USA to soak up the things counties have in common with their neighbouring counties. I still think this is much better than nothing, but admit there is still a problem that even after doing this the residuals of counties that are close together will still be correlated. Which means that each observation is not worth as much as if it were truly independent, which means that my inferences are over-confident in their precision.
First I wanted to check out if this hunch was correct. The standard way to look for spatial autocorrelation is via a variogram. I couldn’t get the Variogram()
function in nlme
to return sensible results from the models that had been fit with gamm()
so I had to make one more explicitly with the variogram()
function in gstat
:
… which gets me this:
A variogram works by calculating the distances between each pair of observations, binning these and then assessing how related the pairs of observations (in this case, residuals after the model fitting) are in each bin. It doesn’t do this by a correlation coefficient but by another measure that I haven’t got my head around but is described as ‘half the variance of the differences between all possible points spaced a constant distance apart.’. The important thing is that a score of zero means perfect correlation, and the higher the numbers are the more independent the pairs of observations contained in the bin are. So to interpret the chart above we can say that the counties that are about 20 degrees (of latitude/longitude, ignoring curvature of the earth) apart or closer have some degree of correlation with eachother; once they get to that far apart this measure more or less stablises.
I don’t know why spatial statistics doesn’t just use a good-ole correlation coefficient for this job, but presume there are interesting historical reasons. To check that I wasn’t mangling things, I decided to “roll my own” spatial correlation measure, with:
That gives me this chart, which is more like the sort of thing we use in time series analysis.
I’m pleased it tells a similar story to the variogram (which is more of a black box to me) - the correlation between pairs of counties’ residuals is above zero for counties that are within 6 degrees/units of eachother, and stablises at a small negative number from about 20 units apart onwards.
My conclusion from this is that yes, there is still some spatial autocorrelation that should be taken into account. Particularly for those counties very close together (around 2 degree/units apart or less).
Now, there is a tricky problem with the shape of spatial autocorrelation. Even if you’re prepared to assume (as I am in this case) that it is symmetrical north-south and east-west (not always the case when it depends on eg wind), there are differing shapes of decay in the level of correlation, as a function of distance. Experts can apparently/allegedly judge which method is best by looking at the shape of the variogram, but I feel safest by trying all five correlation structures available to me and choosing the one with the lowest AIC.
That gets me these results:
> sapply(model8, \(m){round(AIC(m$lme), -1)})
[1] 8230 8600 8590 8330 8580
> AIC(model7$lme)
[1] 8605.692
We see that all the models that adjust for spatial correlation are better than model7 (which doesn’t), but the models using corExp
and corRatio
are much better than the other three. The corExp
model (model8[[1]]
) is our new best model so far.
If we look at the t statistics for the coefficients of these five models, we see that for the first time our best model has a “non-significant” slope for cpe (ie depression), 1.695. Now, I’m not going to remove a variable from a complex mixed effects model like this because of a non-significant t statistic, but it is definitely note-worthy that the depression effect has become less important once we let it vary by state and do the best correction for spatial autocorrelation:
> sapply(model8, \(m){summary(m$gam)$p.t})
[,1] [,2] [,3] [,4] [,5]
(Intercept) -2.031622 -4.036009 -3.987122 -3.494438 -3.952872
cpe 1.695218 4.206850 4.193251 3.746362 4.167077
Also noteworthy though is that the other spatial correlation shapes still leave depression with significant t statistic.
Here’s the full report from the lme
part of the best fit so far:
> summary(model8[[1]]$lme)
Linear mixed-effects model fit by maximum likelihood
Data: data
AIC BIC logLik
8231.487 8291.901 -4105.744
Random effects:
Formula: ~Xr - 1 | g
Structure: pdIdnot
Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7 Xr8
StdDev: 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243
Xr9 Xr10 Xr11 Xr12 Xr13 Xr14 Xr15 Xr16
StdDev: 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243
Xr17 Xr18 Xr19 Xr20 Xr21 Xr22 Xr23 Xr24
StdDev: 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243
Xr25 Xr26 Xr27
StdDev: 6.68243 6.68243 6.68243
Formula: ~1 + cpe | state_name %in% g
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 2.702224 (Intr)
cpe 12.092468 -0.99
Residual 66.289425
Correlation Structure: Exponential spatial correlation
Formula: ~x + y | g/state_name
Parameter estimate(s):
range
0.7289188
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: list(fixed)
Value Std.Error DF t-value p-value
X(Intercept) -0.889436 0.4390918 3054 -2.0256268 0.0429
Xcpe 3.370112 1.9919865 3054 1.6918346 0.0908
Xs(x,y)Fx1 0.165268 0.1447102 3054 1.1420604 0.2535
Xs(x,y)Fx2 0.000906 0.1909150 3054 0.0047431 0.9962
Correlation:
X(Int) Xcpe X(,)F1
Xcpe -0.988
Xs(x,y)Fx1 -0.065 0.066
Xs(x,y)Fx2 -0.131 0.141 0.510
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-5.2880800 0.2592565 0.5963373 1.0331947 4.6052452
Number of Observations: 3107
Number of Groups:
g state_name %in% g
1 50
I believe that Formula: ~Xr - 1 | g
bit refers to the s(x,y)
term from the GAM when brought into the linearised LME fit. The range of 0.73 refers to the exponential spatial correlation.
Race
Finally I’m ready to look at the question that’s of most substantive interest - does incuding race in the model make the “depression effect” go away, suggesting that depression is acting as a proxy for whiteness. I sourced data on county characteristics of various sorts from the UN Census Bureau. I considered four candidate variables:
white_alone
which is the proportion of the county that describe themselves as white and not other racewhite_all
which is the proportion of the county that describe themselves as white, whether or not they also are of another race as wellhispanic
proportion of county that describe themselves as hispanichispanic_multi
proportion of county that describe themselves as hispanic and more races in addition
I deliberately left out the proportion of African-Americans in each county, assuming it would be very collinear with some combination of the others. If were seriously interested in how race worked in this election I would probably have included it anyway.
Looking at these four variables (without peeking at their relationship to vote) gives me this pairs plot:
which convinces me I should save a degree of freedom and drop the white_all
variable from my modelling as containing virtually no extra information from the white_alone
variable. Here’s the code to import that data and draw the pairs plot:
Next I wanted to look at the relationship of these race variables to the logit of vote for Trump, to check if a linearity assumption was going to be reasonable. That got me this chart, which seemed linear enough for me (for my purposes):
…produced with this code:
OK so now I fit a whole bunch of different models to be confident that individual decisions from me weren’t going to be leading to my final conclusions. I fit models with many of the combinations of using a rubber mat to deal with spatial autocorrelation, explicitly modelling the spatial autocorrelation, random or fixed slopes (ie varying by state) for cpe
(depression), random or fixed slopes (ie varying by state) for hte race variables. Then I calculated the AIC of all these models and the t statistics for cpe
(depression), which gets me this table, sorted with the best models (lowest AIC) at the top:
model | AIC | rubber_mat | random_cpe_slope | race | random_race_slope | SAC_fix | cpe_t_stat |
---|---|---|---|---|---|---|---|
15 | 5676.498 | 1 | 1 | 1 | 1 | 1 | -1.612598 |
13 | 5904.910 | 1 | 1 | 1 | 0 | 1 | -1.900880 |
10 | 6063.537 | 0 | 1 | 1 | 0 | 1 | -1.581859 |
14 | 6241.311 | 1 | 0 | 1 | 0 | 1 | -4.612077 |
11 | 6416.801 | 0 | 0 | 1 | 0 | 1 | -2.996010 |
12 | 7136.013 | 0 | 0 | 1 | 0 | 0 | 5.347793 |
8 | 8231.487 | 1 | 1 | 0 | 0 | 1 | 1.695218 |
9 | 8246.613 | 0 | 1 | 0 | 0 | 1 | 2.162823 |
Lots of interesting things here, but most important:
- The best models are definitely those that include race
- The best model of all is also the most complex - random slopes for both race and depression, rubber mat, plus addressing the spatial autocorrelation (SAC) explicitly with an exponential correlation structure
- The best models all have a negative sign for
cpe
, indicating that after controlling for race, it is the counties with less depression that were more likely to vote for Trump - a compelling change in narrative from looking at the data without controlling for race. But in the best model of all, depression isn’t very important. - One model that I describe in the code below as “definitely illegitimate because it makes no attempt at all to correct for spatial autocorrelation” is the one that gives the biggest positive effect (as in t statistics) for the depression impact on voting for Trump
I regard this as good evidence that in fact incidence of depression by county wasn’t important in driving the vote for Trump, but that race composition of counties probably was (or at least ‘more likely’ was). Of course, how that mechanism works, and whether ‘whiteness’ here is in fact standing in for something else, is beyond the scope of this blog post (already far too long) to explain.
Here’s the code that fit all those candidate models described above and generates the table:
Here’s the key results from the overall winning model:
> summary(model15$lme)
Linear mixed-effects model fit by maximum likelihood
Data: data
AIC BIC logLik
5676.498 5797.326 -2818.249
Random effects:
Formula: ~Xr - 1 | g
Structure: pdIdnot
Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7
StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787
Xr8 Xr9 Xr10 Xr11 Xr12 Xr13 Xr14
StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787
Xr15 Xr16 Xr17 Xr18 Xr19 Xr20 Xr21
StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787
Xr22 Xr23 Xr24 Xr25 Xr26 Xr27
StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787
Formula: ~1 + cpe + white_alone + hispanic | state_name %in% g
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.503864 (Intr) cpe wht_ln
cpe 6.544726 -0.761
white_alone 1.055536 -0.437 -0.188
hispanic 1.719388 -0.134 -0.152 0.180
Residual 43.651012
Correlation Structure: Exponential spatial correlation
Formula: ~x + y | g/state_name
Parameter estimate(s):
range
0.8315562
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: list(fixed)
Value Std.Error DF t-value p-value
X(Intercept) -2.395289 0.271082 3051 -8.836033 0.0000
Xcpe -1.862160 1.155263 3051 -1.611892 0.1071
Xwhite_alone 3.678207 0.191161 3051 19.241375 0.0000
Xhispanic -0.544371 0.327396 3051 -1.662729 0.0965
Xhispanic_multi 1.751862 4.307237 3051 0.406725 0.6842
Xs(x,y)Fx1 0.202288 0.093536 3051 2.162664 0.0306
Xs(x,y)Fx2 0.429865 0.131173 3051 3.277087 0.0011
Correlation:
X(Int) Xcpe Xwht_l Xhspnc Xhspn_ X(,)F1
Xcpe -0.761
Xwhite_alone -0.461 -0.164
Xhispanic -0.178 -0.053 0.161
Xhispanic_multi -0.002 -0.114 0.101 -0.278
Xs(x,y)Fx1 -0.051 0.030 0.028 -0.032 0.157
Xs(x,y)Fx2 -0.121 0.101 0.083 0.000 -0.081 0.350
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-6.1786290 0.2312745 0.4830700 0.8228585 5.5847101
Number of Observations: 3107
Number of Groups:
g state_name %in% g
1 50
Looking at that table of fixed effects, we see the white_alone
variable is definitely significant, as is the rubber mat s(x,y)
spatial correlation absorber. So we could tentatively conclude that this suggests that whiteness strongly contributed to vote for Trump; that depression incidence didn’t contribute anywhere near as much; and there was a strong spatial correlation not explained by either whiteness or depression.
Looking for a way to summarise all this I came up with this chart of the partial relationship of whiteness and of depression incidence to vote for Trump; after controlling for the other things in the final model:
That was produced with this code, which involved fitting a new model to produce the residuals after controlling for race but not controlling for depression (once combination that hadn’t yet been done in the frenzy of model-fitting above):
That’s it for today. I hope that’s of interest for someone, either as a messy but realistic modelling strategy case study, or for those interested in the specific issue of depression and the 2024 election.