Bernie Sanders defeated Hillary Clinton in the Michigan Democratic primary in what fivethirtyeight.com has termed one of the greatest upsets in electoral history. This is exciting and all that, but I’m not a political hack: my calling is to analyze, especially when the predictions have been badly off. Indeed, what the Democratic primary result in Michigan upsets is not just the political status quo, but also the facile attitude among many that, since data is plentiful, you can just wash it through some complicated algorithm and it will magically generate accurate predictions of the future. It turns out that you need to theorize about what is going on and adapt what you do with the data accordingly, not just blindly trust the data to tell you what’s going on.
There were two things going on in Michigan that the polls missed. I claim credit for having anticipated one of them. I will confess that I did not see the second one coming; I wonder whether I would have seen it had I taken a more careful look at the data, but that’s Wednesday-morning pollstering.
While polls were indicating that Hillary Clinton was running ahead of Bernie Sanders by as much as 20+ percentage points, this was limited to a sample of likely Democratic primary voters. This is part of “theory building,” which, in any poll, or indeed any data analysis, is where things are likely to go awry, especially when the underlying universe is changing rapidly. You need a structure in which to park your data, and if the ground is shifting, what the data tells you can mislead; this needs to be anticipated with care.

It turns out that there was ample reason to expect that the conventional model of likely Democratic primary voters would be wildly off. First, in other states, Sanders brought into the Democratic primary process many voters who were not Democrats and, in many cases, were participating in primaries for the first time. Further, we also knew that they were breaking at overwhelming rates for Sanders. While the extent to which these non-Democrats preferred Sanders in Michigan was not clear, polls that asked head-to-head matchups of all respondents, significantly, rather than only those whom pollsters deemed likely Democratic primary participants, indicated Sanders outperforming Clinton by around 5 points. In a de facto 50-50 state like Michigan, this implies that Sanders had around 10 points’ worth of extra votes to spare from voters who were not “likely Democratic primary voters.” If, as most polls indicated, Sanders was trailing by around a 40-60 margin among likely Democratic primary participants, the actual gap was probably more like 50-60, which, when normalized, would translate to roughly 45-55. This was roughly my expected result, possibly narrower if Sanders was able to close the gap a bit through vigorous campaigning.
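The back-of-envelope adjustment above can be sketched in a few lines. The numbers here are the rough figures from the text (a 40-60 likely-voter split, plus about 10 points of Sanders support outside the likely-voter screen), not real poll data:

```python
# Illustrative sketch of the likely-voter adjustment described above.
# Figures are the rough numbers from the text, not actual poll results.
likely = {"clinton": 60.0, "sanders": 40.0}  # split among "likely" primary voters
extra_sanders = 10.0  # Sanders votes from outside the likely-voter screen

# Add the extra Sanders support, then renormalize so shares sum to 100.
adjusted = {"clinton": likely["clinton"],
            "sanders": likely["sanders"] + extra_sanders}
total = sum(adjusted.values())
normalized = {k: round(100 * v / total, 1) for k, v in adjusted.items()}

print(normalized)  # roughly a 45-55 race, not 40-60
```

The point of the exercise is not the precise figures but that a plausible shift in who turns out moves the race from a blowout to a close contest.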
What I did not really expect, at least not explicitly, was the turnout pattern that emerged: younger voters (under 44) made up nearly half of the electorate, with the 19-26 cohort accounting for more than a fifth, according to the exit polls (for comparison, the same age group made up only 35% of the Republican sample). Of course, younger voters broke heavily for Sanders, overwhelmingly so in the case of the 19-26 cohort. I don’t know if this should have been a surprise; a similar pattern had already been seen in New Hampshire, for example. Still, there was much talk of how unreliable younger voters are, including some snooty remarks about how so many college students in Michigan would be on spring break, distracted from politics. Given the overwhelming nature of the youth vote, even a few points’ gain for Sanders among the youngest age group would have made the difference between victory and defeat, and this was the cherry on top of the cake.
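To see why a few points among the youngest cohort can decide a close race, consider a toy decomposition of the overall margin into cohort shares times cohort margins. The shares and margins below are hypothetical, chosen only to mimic the pattern described above (a heavy Sanders tilt among the young, a lean the other way among older voters):

```python
# Toy model: overall margin = sum over cohorts of (turnout share * cohort margin).
# All shares and margins are hypothetical, for illustration only.
cohorts = {
    "young":  {"share": 0.20, "sanders_margin": 0.60},   # heavy Sanders tilt
    "middle": {"share": 0.45, "sanders_margin": -0.05},
    "older":  {"share": 0.35, "sanders_margin": -0.30},
}

def overall_margin(cohorts):
    """Sanders's overall margin as a turnout-weighted sum of cohort margins."""
    return sum(c["share"] * c["sanders_margin"] for c in cohorts.values())

base = overall_margin(cohorts)          # Sanders narrowly behind

# A few extra points of support among the youngest cohort...
cohorts["young"]["sanders_margin"] += 0.05
boosted = overall_margin(cohorts)       # ...flips the overall result

print(f"before: {base:+.4f}, after: {boosted:+.4f}")
```

With a cohort that large and that lopsided, small swings in either its size or its preference translate directly into the statewide margin, which is exactly the "difference between victory and defeat" effect.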
This underscores the centrality of theorizing in the use of data. All data is generated through a biased process, but the nature of the bias shifts from situation to situation, and we don’t always know what the bias of the day is. We need to watch constantly for possible changes in the biases, and there are almost always plenty of places to look for clues if you are aware. But this demands an interest in knowing the substance of the phenomenon rather than clever algorithms, especially since the data will be too sparse to let the algorithms decide all by themselves.