A skeet floated across my Bluesky feed that looked at the cross-sectional relationship between incidence of depression in 2020 and voting for Trump in the 2024 Presidential election. The data in the skeet and immediate blog post was at state level, but the hypothesis of interest in an article that sparked all this was an individual one (are depressed people voting for Trump). I don’t have the individual level microdata that might help explore the actual hypothesis, but I was surprised to see that the state-level regression had a significant evidence of an effect, and wondered if this would continue at the county level, which still has relatively accessible data.
This also led me down an interesting but familiar rabbit hole of multilevel modelling in the presence of a spatial correlation nuisance.
Ordinary Least Squares
Well, it turns out depression at the county level does correlate with voting for Trump, as we can see with this first, simple chart which shows the expected vote based on a model fit with ordinary least squares (OLS):
I’ll be fitting some more fancy models and getting better charts further down, but the basic message is the same as in this one - counties with higher incidence of depression had a higher proportion of their vote going to Trump than was the case with counties with lower levels of depression.
Before I say anything else or show any code, let’s get straight that this is very possibly not a causal link. In fact there are at least three things that are plausibly happening here:
- People who are more depressed were more likely to vote for Trump (or less likely to turn up to vote against him, which given voluntary voting, has a similar result although for importantly different reasons)
- People (who may themselves be not depressed) who are in areas with lots of depressed people around them were more likely to vote for Trump (eg because voters think “Trump will be able to do something about all the depressed people around here”)
- Some underlying factor (eg economic, social or cultural conditions) that leads to some areas having higher rates of depression also leads to higher votes for Trump, through some other mechanism
My expectation is that #3 is the more likely explanation, but I personally don’t actually have evidence to choose between them. Nor are these hypotheses mutually exclusive; two or all of them might be true at once.
OK here’s the code that gets that data and produces the first chart and a simple model with a statistically significant effect:
Excerpt from the results:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.40002 0.01861 21.49 <2e-16 ***
cpe 1.27547 0.08712 14.64 <2e-16 ***
Generalized linear model with a quasibinomial response
Now, I wanted to improve this for all sorts of reasons, although I suspected it was actually good enough for pragmatic purposes - case proven really, that counties with more depressed people voted more for Trump. Proven enough to justify the further work with additional data needed to explore why. But I had some statistical loose ends to tidy up. First of which is that vote is a proportion, and it feels icky to use OLS (which can send values above 1 and below 0) to model a proportion when we have generalized linear models (GLMs) designed for the job and easily available.
I didn’t want to use a straight logistic regression because the county-level data is far more dispersed than would be expected if it really were individuals making their own voting decisions. But a GLM with a quasi-binomial response keeps the link function used in logistic regression and the relationship of the mean and variance, while allowing the variance to be larger (or smaller) by some consistent ratio. Here’s what I get with that GLM:
…created with this code. Note that we now have started to weight counties based on their overall number of voters, which makes particular sense for a binomial or similar family response. I suspect this is one of the key issues driving the line vertically down, compared to the OLS version. The other key difference of course is that now it is slightly curved, as the ‘linear’ in a GLM is on the link scale, not the scale the response is originally observed on and used for this chart.
Here’s an excerpt from that summary of model2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.82178 0.06823 -26.70 <2e-16 ***
cpe 9.25748 0.34153 27.11 <2e-16 ***
We see cpe
(crude prevalence estimate, ie not the age-standardised one) has very definitely “significant” evidence that it isn’t zero, with a point estimate of 9.3 and a standard error of only 0.3.
Introducing a state effect
One thing that all the world knows is how spatially-based are US politics. Everything is thought of in terms of states, in particular, and smaller areas sometimes. It follows naturally from the ways that US politics is discussed that we should use a multi-level ie mixed-effects model with state as a random intercept, when looking at something like the overall relationship between depression and voting. In other words, we have to let the vote for Trump in any state vary for all the state-specific things that aren’t in our model, and see if after doing that we still get an overall (constant nation-wide) relationship between depression and voting in the counties within each state.
I often reach to the lme4
library by Bates, Bolker and Walker as my starting point for mixed effects models and this is the results in this case:
Note we have a bunch of parallel (on link scale) lines, one per state, with their height varying by the state random effect. I love the effect of this chart, but unfortunately lme4::glmer
which is used in this case doesn’t let us use a quasibinomial family response; we have to use a binomial response which forces the variance to equal p(1-p)/n
, not just be proportional to it. The net result is that the confidence intervals are much narrower than is justified.
An alternative way to fit a similar model is the the gam()
function from the amazing mgcv
library by Simon Wood. There’s a great discussion of this on Gavin Simpson’s blog. By specifying a spline around a categorical factor like s(state_factor, bs = 're')
(‘re’ stands for random effect) we can use gam()
to add random intercepts while using the full range of families available to gam()
including quasibinomial. That gives us this alternative version of the last model; this time with much fatter (and realistic) confidence intervals!:
The confidence intervals are now very fat. But there’s still no doubt about the significance of the evidence of the relationship of the crude prevalence of depression on voting behaviour, even after allowing a random state-level intercept.
Here’s the code for both the glmer
and gam
versions of this random state effect model:
Excerpt from the gam results (the better of the two, with fatter confidence intervals but still very significantly different from zero):
per_gop ~ cpe + s(state_name, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.2466 0.2802 -11.59 <2e-16 ***
cpe 16.3333 0.6224 26.24 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(state_name) 48.44 49 20.54 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.345 Deviance explained = 40.3%
GCV = 3333.9 Scale est. = 3387.6 n = 3107
Allowing for spatial autocorrelation
OK we’re not quite there yet. The final thing at the back of my mind is that so far, we have treated counties as though they are independent, equally valuable observations. This is pretty common in these types of cross-sectional regression, but it’s not fair; it exaggerates the value of each county data point. The values of depression and Trump vote in two adjacent counties are, frankly, not equally valuable independent observations. Once you have one, you have a good chance of predicting the values in the county next door. By failing to account for this, we are treating our data as more valuable than it is, which translates to over-confidence in our inferences and confidence intervals that are too narrow.
How to fix this? Well, there are numerous ways, but my favourite pragmatic correction is to add a fixed effect that is a sort of rubber sheet laid across the geography of interest. This can be done with gam()
by including in the formula something like + s(x, y)
, where x and y are the centrepoints of the regions from where we have sourced observations.
Note - the other ways of correcting for this all seem to me to be way more complicated than this. Perhaps they are better! One thing I’m sure, it’s much better to do it my way than to ignore spatial autocorrelation altogether, which seems to be the near-universal approach. How often do you see an adjustment for spatial autocorrelation in a regression of US counties or states?
Now, even my pragmatic method is going to be a bother, because we need to get spatial data that isn’t available in our sources so far. The next chunk of code downloads shapefiles of all the US counties, calculates the centroids of each, and puts it into shape to join to our existing data.
This gives us this plot to check that we haven’t (for example) mangled states, or latitude and longitude:
Very nice. Now we can fit the model, with this code
That gives us this ‘rubber mat’ that I’ve included. The bit in the bottom left is Hawaii.
Basically this is going to absorb the variance that can be explained by mere proximity of one county to another. Which means that what is left to be explained by either the state random effect, or the relationship between depression and voting, is a fairer estimate.
Now, when we try to visualise this, the lines ar emuch squigglier than our previous charts. So I’ve chosen to put this into a faceted plot with a panel for each state. Here’s the end result:
… created with this code:
Now, I think this is the best model I’ve fit to this data. The prevalence of depression (cpe
) is still very much a significant (definitely non-zero) effect, and the large magnitude of this is comparable to the simpler models used so far. But we’ve got confidence that this isn’t just an artefact of the spatial closeness of counties, or of other non-included state level effects. So the extra work is worth it.
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.7972 20.7488 -0.135 0.893
cpe 15.5908 0.6778 23.001 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x,y) 28.39 28.94 7.995 <2e-16 ***
s(state_name) 49.00 49.00 7.949 <2e-16 ***
That’s all folks. I guess the lesson here (not that these blogs are lessons) is really just to allow for spatial effects: both categorical ones that fit into a mixed effects / multilevel paradigm (eg random intercept at state level); and those that are more subtle spatial autocorrelation. Really, this should be done much more.
And mgcv::gam()
is a great tool to address both of these at once.