Andrew Gelman has a thoughtful explainer that purports to be about “Bayesian statistics,” but really lays out a set of points that anyone trying to use data scientifically should be cognizant of.

This, in particular, is critical:

Are there any warnings? As a famous cartoon character once said, With great power comes great responsibility. Bayesian inference is powerful in the sense that it allows the sophisticated combination of information from multiple sources via partial pooling (that is, local inferences are constructed in part from local information and in part from models fit to non-local data), but the flip side is that when assumptions are very wrong, conclusions can be far off too. That’s why Bayesian methods need to be continually evaluated with calibration checks, comparisons of observed data to simulated replications under the model, and other exercises that give the model an opportunity to fail. Statistical model building, but maybe especially in its Bayesian form, is an ongoing process of feedback and quality control.

A statistical procedure is a sort of machine that can run for a while on its own, but eventually needs maintenance and adaptation to new conditions. That’s what we’ve seen in the recent replication crisis in psychology and other social sciences: methods of null hypothesis significance testing and p-values, which had been developed for analysis of certain designed experiments in the 1930s, were no longer working in modern settings of noisy data and uncontrolled studies. Savvy observers had realized this for a while—psychologist Paul Meehl was writing acerbically about statistically-driven pseudoscience as early as the 1960s—but it took a while for researchers in many professions to catch on. I’m hoping that Bayesian modelers will be quicker to recognize their dead ends, and in my own research I’ve put a lot of effort into developing methods for checking model fit and evaluating predictions.
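The “comparisons of observed data to simulated replications under the model” that Gelman mentions are posterior predictive checks, and the mechanics fit in a few lines. Here is a minimal sketch, with made-up data and a deliberately toy model (a normal likelihood with known spread and a flat prior on the mean); everything here is illustrative, not Gelman’s own code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed data (made up for illustration): 50 noisy measurements.
y = rng.normal(loc=5.0, scale=2.0, size=50)

# Toy model: y ~ Normal(mu, sigma) with sigma known and a flat prior
# on mu, so the posterior for mu is Normal(mean(y), sigma / sqrt(n)).
sigma = 2.0
n = len(y)
mu_draws = rng.normal(loc=y.mean(), scale=sigma / np.sqrt(n), size=1000)

# Posterior predictive replications: one fake dataset per posterior draw.
y_rep = rng.normal(loc=mu_draws[:, None], scale=sigma, size=(1000, n))

# A test statistic that could expose misfit: the sample maximum.
T_obs = y.max()
T_rep = y_rep.max(axis=1)

# Posterior predictive p-value: the fraction of replications whose
# maximum exceeds the observed one.  Values near 0 or 1 flag a model
# that fails to reproduce this feature of the data.
ppp = (T_rep >= T_obs).mean()
print(f"posterior predictive p-value for max(y): {ppp:.2f}")
```

The point of the exercise is exactly the “opportunity to fail” Gelman describes: if the replicated datasets never look like the observed one on some statistic we care about, the model is broken somewhere, even if it fits on average.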

Statistics, I think, rests on three things. First, we use statistics precisely because we don’t know what the real world looks like. I don’t remember much else of what Gary Lorden taught us when I was an undergrad (I learned most of my statistical techniques in physics, in statistical mechanics, which is probably why I still tend to do strange things with data), but this is one thing I remember, and it has formed the basis of all my thinking. Second, statistics is a process of inference from a combination of theoretical assumptions and limited data. Sometimes theory obscures facts. Other times, theory helps make connections not seen in the facts. An atheoretical model, like Google Translate, sometimes works better than a theoretical one, but that, in turn, assumes a theory of its own: that the existing theories are too narrow for their own good and should be ignored. Perhaps reasonable sometimes, but not at other times. As the famous computer science story has it:

- “What are you doing?”, asked Minsky.
- “I am training a randomly wired neural net to play Tic-tac-toe,” Sussman replied.
- “Why is the net wired randomly?”, asked Minsky.
- “I do not want it to have any preconceptions of how to play,” Sussman said.

Minsky then shut his eyes.

- “Why do you close your eyes?”, Sussman asked his teacher.
- “So that the room will be empty.”

At that moment, Sussman was enlightened.

The third leg of statistics rests on both of the previous legs. We don’t know the truth. Our theory and data are probably wrong, or at least incomplete. As new data rolls in, we need to keep rethinking what it tells us about how we should think about the world.

The tricky part, though, is that both the theory and the data can be wrong or incomplete. One could trust that the theory is right and the data is wrong, which can be dealt with by adjusting the data collection procedures and reweighting the data already on hand. Or one could accept that the data is true and modify the theory to better fit it, or, more likely, do both. The problem, of course, is that there is no “right” answer: both the theory and the data are potentially wrong. You may cross-validate the theory against the data (as is the preference of the data science types), but that is really just a robustness check, a way to winnow away inferences that depend on a handful of outliers *in the existing data*. It is a mechanical step that addresses neither the theory nor the data being wrong or incomplete. The consequence, of course, is that multiple “statistical truths” can coexist: different versions compatible with different parts of the theory or the data, all “true” in the statistical sense because we do not know the full scope of the “truth” to contrast them against. (This echoes Kuhn’s observation about scientific revolutions and, more narrowly, the sunspot models in macroeconomics. We have different theories of the truth, but we don’t know what is true for lack of data. In a sense, this is analogous to the identification problem, but perhaps a bit more fundamental. In the theoretical identification problem, we at least have full data: it just supports multiple theories. In the scenario I laid out, we KNOW that the data we have is probably wrong or incomplete, even if we don’t know exactly how. This is far more common in social problems.)
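The cross-validation mentioned above is worth making concrete, because the sketch also shows its limitation. Here is a minimal, hand-rolled k-fold cross-validation on made-up data (the dataset, the outlier, and the polynomial “theories” are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a linear trend plus noise, with one planted outlier.
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=40)
y[5] += 25.0  # a single outlier

def kfold_mse(x, y, degree, k=5):
    """Mean out-of-fold squared error for a polynomial fit of a given degree."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # everything not in this fold
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[fold])
        errs.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errs))

# Compare a simple linear "theory" against a more flexible polynomial.
for deg in (1, 5):
    print(f"degree {deg}: out-of-fold MSE = {kfold_mse(x, y, deg):.2f}")
```

Note what this does and does not do: it tells us which model generalizes better *within this dataset*, and it punishes fits that chase the planted outlier. It says nothing about whether the dataset itself is wrong or incomplete, which is exactly the limitation described above.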

Gelman has written, many times in fact, about the problems that arise when “statistical hypotheses” and “scientific hypotheses” do not coincide neatly. Paul Meehl’s original paper is worth a close read as well. The problem, too, is that, as a commenter on the post points out:

I recall a student asking him, why don’t people acknowledge any of this stuff you’re talking about? He said in his delightful Minnesota dialect (heavily affected at times), “Because if they did it would mean they’d all be selling shoes!”

The problem is not so much that this cannot be done, but that scholars who try to follow this path run into a lot of trouble. For junior scholars who need publications to stay afloat, it is not a worthwhile endeavor. (I speak from personal experience, as I am now in the proverbial business of “selling shoes.” I found that addressing these issues carefully is easier when aiming at top journals than at more pedestrian ones; but most people, myself included, don’t have enough material for top journals and need to pad their CVs with weak publications.) It is apparently a problem in business, too, if the points of note must be condensed into 500-word memos that make their claims emphatically. This brings back the multiarmed economist problem: the value added of academic expertise is that it offers nuanced, conditional, and detailed guidance, but most of the nuances, conditionalities, and details get in the way. There are very few Hyman Rickovers who see salvation in the details, and to be fair, even Rickover considered abstract and theoretical details distracting, although he, in turn, thought that academics were too eager to assume away the details of practical implementation. Different lines of work, different incentives, I suppose.

Gelman’s answer is, in a sense, more a perspective than a solution. We don’t know the truth. Our theories and our data are both wrong, at least some of the time. We need the failures; indeed, we need to set up our models to fail, so that we can see what breaks under controlled conditions and fix it before we need to bet big. And even then, maybe we should hedge our bets and not say too much with too much confidence. This looks like good science in general to me, and it is something we should keep constantly in mind as we work with data.