I’ve written a lot about means-centric versus variance-centric statistical thinking. What do I mean by that? In the former, we focus on the ability to “predict” something, typically its mean. Variability in the data is a nuisance, even an enemy, something to be minimized. In the latter, we want to know which variables cause the biggest variability in the data. The variance is not merely an essential component; it is the main subject of our thinking. If we can forecast outcomes predictably, that is not only boring, it is also something that can be manipulated and “pumped” until it breaks (as per my discussion of how signals in a correlated equilibrium can be abused until the equilibrium breaks down, or indeed the logic behind the Grossman-Stiglitz paradox, which is really just a variation on the same argument).
In the end, really, the math that accompanies both approaches turns out to be the same: in the means-centric approach, you identify the predictor variables that help minimize the “errors”; in the variance-centric approach, you identify the variables that happen to be correlated with the variances–which turn out to be the ones that minimize the “errors.” This convergent evolution, unfortunately, obscures the fundamental philosophical difference between the two approaches.
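A toy illustration of this convergence, using nothing beyond NumPy and simulated data (none of the numbers here are real): one variable, x, drives the mean, while another, z, drives only the variance. The mean regression flags x; looking for correlates of the squared residuals flags z. Same machinery, different question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)   # drives the mean of y
z = rng.normal(size=n)   # drives only the variance of y
y = 2.0 * x + rng.normal(size=n) * np.exp(0.5 * z)  # heteroskedastic noise

# Means-centric: regress y on x and z; only x helps minimize squared error.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Variance-centric: which variable is correlated with the squared residuals?
corr_x = np.corrcoef(x, resid**2)[0, 1]
corr_z = np.corrcoef(z, resid**2)[0, 1]
print(beta[1], beta[2])  # mean model: large coefficient on x, near zero on z
print(corr_x, corr_z)    # squared residuals: correlated with z, not with x
```

The point of the sketch is that the mean model is silent about z even though z is, from the variance-centric view, the interesting variable.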
An illustration can be found in data from the 2016 primaries. Consider the following plot.
The graph plots Trump’s support in the primaries against the 2012 Democratic voteshare, with counties split into two types by whether the white share of the population is above or below 75%, roughly the national average (the red dots indicate counties below 75% white). The biggest variability in Trump’s support is found where Romney did well in 2012 (i.e. where the Democratic voteshare was small): Republican primary voters in Republican-dominated areas with large minority populations did not like Trump, while those in largely white counties had little problem with him. Yes, Trump did well in many counties with both large Republican majorities and significant minority populations, but the counties where he performed poorly, conditional on a large Republican majority, are mostly characterized by large minority populations. As a predictor, this is terrible: the conjunction of a large minority population and a large Republican majority in 2012 does NOT necessarily predict weak support for Trump–there are too many exceptions for that. But the reality is that the conjunction of all these variables moving in the same direction does not reliably happen. To pretend that it does feeds into the conjunction fallacy identified by Tversky and Kahneman, in which people judge a conjunction of characteristics believed to be correlated with each other, rightly or wrongly, to be more likely than either characteristic alone–e.g. “Linda is a bank teller and is active in the feminist movement” rather than “Linda is a bank teller.” People are already prone to believe that conjunctions happen too frequently (which partly accounts for the beauty contest game–people trying to follow a “narrative” systematically downplay the variance)!
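The pattern in the plot can be sketched numerically. The data below are simulated to mirror the described shape–these are NOT the actual 2016 county numbers–but the calculation is the one a variance-centric reader would run: condition on Republican-dominated counties and compare the spread of Trump support across the two county types.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Hypothetical county-level data, simulated to mirror the described pattern.
dem_2012 = rng.uniform(0.1, 0.9, n)   # 2012 Democratic voteshare
white = rng.uniform(0.3, 1.0, n)      # white share of county population
# Trump support is noisiest in Republican-dominated, high-minority counties.
noise_sd = np.where((dem_2012 < 0.4) & (white < 0.75), 0.15, 0.05)
trump = np.clip(0.4 + rng.normal(0, noise_sd), 0, 1)

rep_dominated = dem_2012 < 0.4
hi_minority = white < 0.75
sd_minority = trump[rep_dominated & hi_minority].std()
sd_white = trump[rep_dominated & ~hi_minority].std()
print(sd_minority, sd_white)  # spread is far larger in high-minority counties
```

The conditional standard deviation, not the conditional mean, is what separates the two groups here–which is exactly why a means-centric model finds nothing to “predict.”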
From the variance-centric perspective, the large gap that opens up in Trump’s support in Republican-friendly areas with large minority populations is not merely interesting–it IS the point. It is the variability we are interested in. Incidentally, this is why Trump’s support numbers jump around wildly: his support in many red states (i.e. the South–where Republican electoral dominance and large minority populations coincide) is highly uncertain, leading to what Nate Cohn calls “Trump’s Red State problem,” which, to be fair, should already have been apparent from the primary data–and the polls showing Trump’s serious national-level unpopularity consistently indicated that it was driven by particularly low popularity among Republicans.
The key reason this cannot be readily translated into a prediction is that we know more than the data itself–or rather, we have a broader context, including data from elsewhere, in which to place the present data. As Gelman et al. observe, a respondent’s report (in a poll) of having voted for a particular party in the last election is a significant piece of information known to be highly correlated with their present latent choice, even if we may not entirely trust the response to be accurate or honest. To insist that this be ignored is foolish, even if it cannot be taken at face value, especially if it is correlated with a particular variability seen in the data. To the degree that reality is inherently complex and uncertain, a fully specified model that can predict everything is, quite literally, impossible. Much better to adopt a two-step approach to learning: identify the sources of variability, then investigate the correlates of that variability, with the awareness that the variability is itself a random variable–i.e. the variance itself may be correlated with particular variables. (NB: homoskedasticity is an absurd assumption and not really a necessary one, except to make OLS BLUE, since variance is always variable…)
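The two-step approach can be sketched on simulated data (all numbers below are made up for illustration): step one fits a mean model and collects residuals; step two treats the variance itself as the dependent variable, regressing the log squared residuals on the same covariates. In the simulation the noise standard deviation is exp(0.5·z), so the true variance is exp(z), and the step-two slope on z should come out near 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n) * np.exp(0.5 * z)  # noise sd = exp(0.5*z)

# Step 1: model the mean, collect the residuals.
X = np.column_stack([np.ones(n), x, z])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Step 2: model the variance itself -- regress log squared residuals
# on the covariates. Since Var(y|z) = exp(z), the slope on z should be ~1.
g, *_ = np.linalg.lstsq(X, np.log(resid**2), rcond=None)
print(g[1], g[2])  # near 0 on x, near 1 on z: the variance is predictable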