The story of the Literary Digest and its failed prediction of the 1936 election is standard fare in all sorts of references on the pitfalls of sampling bias. Missing from practically all these warnings is any guidance on how one might avoid such problems, or indeed, whether these problems are avoidable at all. I’d wager, in fact, that they are not.
Without revealing too much for confidentiality reasons, I’ll just say that I had run numbers on some polling data for the current election using essentially the same methodology, but with one major difference: one run used the 2012 turnout patterns as post-stratification weights; the other did not post-stratify at all. The results are almost exactly opposite, at least in electoral college terms: post-stratifying by the 2012 numbers predicts a narrow Trump win that masquerades as an electoral college landslide; without post-stratification, the numbers imply a significant Clinton victory that looks like an even bigger electoral college landslide.
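The post-stratification step itself is mechanically simple, even if its assumptions are not. As a minimal sketch with entirely made-up numbers (not the confidential data discussed above), the difference between the two runs is just which set of cell weights you use: the poll sample’s own composition, or an assumed turnout composition from a past election.

```python
# Hypothetical poll results by demographic cell:
# cell -> (number of respondents, share supporting candidate A)
poll = {
    "18-29": (100, 0.60),
    "30-44": (200, 0.55),
    "45-64": (400, 0.45),
    "65+":   (300, 0.40),
}

# Hypothetical turnout shares from a past election (must sum to 1.0).
# In the scenario above, these would come from 2012 turnout patterns.
turnout_2012 = {"18-29": 0.15, "30-44": 0.25, "45-64": 0.35, "65+": 0.25}

def raw_estimate(poll):
    """Weight each cell by its share of the poll sample itself."""
    n = sum(respondents for respondents, _ in poll.values())
    return sum(respondents * share for respondents, share in poll.values()) / n

def poststratified_estimate(poll, turnout):
    """Re-weight each cell by its assumed share of the actual electorate."""
    return sum(turnout[cell] * share for cell, (_, share) in poll.items())

print(raw_estimate(poll))                           # 0.47
print(poststratified_estimate(poll, turnout_2012))  # 0.485
```

Both numbers are faithful summaries of the same data; they differ only in which electorate they assume will materialize, which is exactly the point made below.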
As far as “data analysis” goes, both are absolutely “correct,” within the confines of the data as we have it. The real questions concern not the alleged “conclusions” of these exercises about “who will win,” which, at this stage, amount to nothing more than an educated guess reinforced by some data about the recent past, but the underlying belief that the future will look like yesterday, depending on what we mean by “yesterday.” Specifically, if election day looks like 2012 in terms of who shows up, but the people who show up behave like the poll respondents who resemble them, Trump stands an excellent chance. If the people who show up on election day look, demographically, like those who participated in the polls, and also behave like them once demographic variation is accounted for, Clinton stands an excellent chance.
We can be fairly certain that both sets of numbers are, in all likelihood, wrong as descriptions of “reality,” since there is little question that they are tainted by variants of sampling bias: the actual electorate that materializes in November will resemble neither, on a state-by-state basis. In the absence of a better means of predicting the future, however, both sets of assumptions are at least “defensible” in a fashion. Neither is really any less reasonable than practically any other guesstimate presently possible, short of spending significant resources to model in reasonable detail who will show up, which would still suffer from much uncertainty until the day of reckoning arrives.
This points to a significant problem in the data analysis business, especially when it puts up the pretense of “prediction” or “forecast.” We are not predicting. We are merely describing the past, which, after all, is where the data originates, and, to the degree that we think it pertains to the future, the claim is predicated on the assumption that the future fits into some pattern that we are guessing today, in lieu of definite knowledge. I’m a little bit wary of the pretense that data can help “predict,” to say the least. The real ability to learn from data, I think, comes not from using it to predict things by forcing the future into the procrustean bed of the past, but from understanding the circumstances of the past that generated the data we see, and being able to assess how those data-generating processes will change in the future, and with what probabilities. This is the historian in me talking: we deal with nuances and changing circumstances that we have little reason to expect will repeat themselves precisely as they did before. The 18th of Brumaire may happen again, but what was tragedy yesterday, to paraphrase Marx, could easily be remade as a musical comedy. If it does, maybe we should know something about the differences between tragedies and musical comedies, not just the data of yesterday, or whenever the 18th of Brumaire was.