I was absolutely fascinated by this blog entry.
This is the first really honest “data scientist” I have come across. What commonly passes as “data science” is the opposite of science, although this is not necessarily a bad thing and, in a way, is potentially even a good thing (even if I am myself skeptical). The blog’s author is not only aware of this but fully embraces it.
At the risk of sounding postmodernist, I’d like to point out that what we consider “science” is, in fact, a social construct, albeit one with a clearly defined structure. It is built on the premise that both data and the ability to process data are constrained, whether by the processing power of human brains or by the time the processing takes: we theorize, and we use our theories to fill in and connect the dots where data is lacking. This is really the core of Kuhn’s idea about scientific revolutions (and, more importantly, the periods between them). Because we lack the ability to acquire sufficiently fine-grained data in adequate quantities, we make up for it by developing theoretical explanations from a set of logical premises. But the available data is invariably compatible with multiple logical explanations, and we lack the ability to evaluate their relative explanatory power. This makes “science” inherently conservative, in the sense that it is hostile to new explanations that do not necessarily explain the existing data better than the old ones do. Only when the technology for acquiring and evaluating new data becomes available does it become possible to evaluate these competing theories and allow for rapid scientific progress: thus “scientific revolutions.”
Much of the recent controversy over string theory stems from this Kuhnian dilemma: mathematically and theoretically, string theory is wonderful. But can its implications be evaluated empirically, using real data? Not really, for the most part. Until they can be, according to people like Lee Smolin, it’s not really “science.” But this is not just a problem for string theory: it is the ultimate challenge for all of high-powered physics and, in some sense, for all of science, especially the social sciences, where good data is scarce. In a sense, science itself becomes an interesting illustration of the social processes behind science, more than of the “science” itself. I’ve been writing scathingly about the recent debate over whether Trump’s supporters are authoritarian or whatever: with limited data and poorly defined concepts, I have no idea what the claims add up to, although it is clear that something is going on there. For good or for ill, the “sociology” of an audience wanting “scientific” evidence of this or that about politics creates the demand for this sort of exercise, if only to fuel the political propaganda of one side or the other. (In addition to the usual problems of the social sciences, I’ve always thought that the greatest threat to political science is that it is political: there are too many things on which people hold strong opinions, and those opinions fill in for missing data.)
Neumeister’s proposal that data can fill in for theory turns the traditional notion of “science” on its head. Technological change has ensured that, at least in certain quarters, data is immensely plentiful. Patterns can be discovered in data that no one has even thought to theorize about. It is silly, in a sense, to waste time trying to theorize: let’s look at all this data and see what it can tell us.
But can humans overcome the temptation to jump to conclusions that are not warranted? Kahneman and Tversky found that humans may not be good at detecting patterns but are very quick to jump to conclusions intuitively. Many of these conclusions are surprisingly right, but some are very wrong. I love joking that every successful con is at least 10,000 years old: human brains keep falling for the same con because that is how we are wired. While not exactly a “con,” the same tendency leads us to buy into dubious “theories” that sound deep and profound and resonate with our intuitions but offer very little by way of real explanation. The promise of “data science” seems to be tempting us to drop theorizing altogether, in favor of inductively found “answers” backed by data. But if we are so willing to buy into data that confirms our prejudices (such as polls supposedly showing Sanders far behind in Michigan, to use a current example), “data” can easily become a tool for abuse. Peter Norvig’s response to the way he and “science” have been portrayed underscores this point. In the end, most of us humans don’t know very well how to derive useful insights from raw data or, at least, how to distinguish useful insights from wishful thinking, even if technology lets us process so much data. In the past, the damage was perhaps limited by our inability to process much data: we couldn’t make THAT big a mistake. Paired with the ability to process massive data, granted by technological progress, the same persistent lack of discipline is liable to lead us into some giant mistakes (and probably already has: see the recent financial crises, aided by systematic misuse of data).
The promise of Big Data is that we will see things that we couldn’t see before. The danger is that we will start seeing all sorts of things that aren’t there. The evolution of the human brain has not prepared us well for this problem, as we have always lived in the realm of data scarcity: we already have a tendency to believe a lot of nonsense based on scanty evidence. Too much data is as unnatural as refined sugar. Without theory to keep us penned within limits, we are liable to see even more turnip ghosts. What data can do is help us creatively stress-test old theories, to find their limits and variances, to see how conditionally “wrong” they are. Big Data, in this sense, has real potential to revolutionize science, but only if paired with real “science” and suitably constructive skepticism, which much of so-called “data science” seems eager to abandon as a millstone holding it back. I don’t mean to characterize all data science as pseudoscience, but this danger seems to be gaining recognition among many.
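To make the “seeing things that aren’t there” worry a bit more concrete, here is a minimal sketch of my own (it is not from Neumeister’s post, and the function names, sample sizes, and seed are all arbitrary assumptions). It generates a random “target” series plus a pile of equally random, unrelated series, and reports the strongest correlation found by sheer chance. The only thing that changes between the two runs is how many unrelated series we get to search through.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_obs, n_vars, seed=0):
    """Largest |r| between a random target and n_vars unrelated random series.

    Everything here is pure noise (hypothetical data), so any "pattern"
    that turns up is a coincidence by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# Same 20 observations of noise; only the number of series searched differs.
few = strongest_chance_correlation(n_obs=20, n_vars=5)
many = strongest_chance_correlation(n_obs=20, n_vars=5000)
print(f"best of    5 noise series: |r| = {few:.2f}")
print(f"best of 5000 noise series: |r| = {many:.2f}")
```

Because both runs share a seed, the 5000-series run includes the same first five series, so its best match can only be as strong or stronger. With only 20 observations and thousands of candidates to search, chance alone tends to turn up a correlation that would look impressive if presented in isolation. This is only a toy on made-up noise and says nothing about any particular real dataset, but it is the sort of turnip ghost that a prior theory, or at least some skepticism about how many comparisons were made, helps to rule out.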
Robert Darnton reminds us that scientific revolutions and pseudosciences often go hand in hand. “Science,” to the general public, is sold not on the basis of the process through which it operates, which is thought to be complicated and counterintuitive, but on the wonderful fruits that it can produce. So “science” is the internet, the moon landing, and the polio vaccine, not how we got there. If a Mesmer comes around with a magic formula that can do the wonderful things the public wants, and calls it “science,” who among the public is to say that his critics are not just bad scientists, jealous because they cannot deliver the wonderful things that he can? Yet if “real” science has to compete with the Mesmers of the world to sell its wares, who wants to be so patient and restrained in what they offer? Is there a difference between “science” that sells exaggerated and fanciful claims and a Mesmer?
Great and intriguing though the possibility of scientific discovery through appropriate data mining is, I remain apprehensive about the immense potential for abuse. “Science” is not just knowledge: it is also a social convention that keeps us from getting too smart-alecky for our own good.