Some years ago, there was a big controversy on Wikipedia over the Haymarket Square bombing. Basically, some guy kept trying to edit the page to introduce notions different from the conventional wisdom, contrary to the rules of Wikipedia, and was repeatedly slapped down. The story received much attention at the time (see, for example, this article in The Chronicle of Higher Education).
If you cheated and looked at the article, of course, you’d have noticed by now that the one troublemaker was in fact a professional historian and one of the foremost experts on the topic, and that his research showed that much of the conventional wisdom widely believed among the public is wrong. The problem is that this additional piece of information is not known in the anonymous internet world: every datum is of equal worth, without knowing what lies behind the information. The commonly available information and “the truth” become one and the same, then, absent the means to discern the truthfulness of the information. Now, here is where things get tricky: does not Wikipedia have rules on “verifiability”? Yes, but only based on secondary sources: primary sources are not allowed as sources of information. In a way, there is a good rationale behind this: primary sources are often of dubious credibility, and evaluating the information they convey is not easy for a lay audience. That is where outside expertise and “secondary sources” come in: the information is valuable because such and such “expert” said so.
But how do you evaluate the worth of the secondary sources? In many instances, they themselves perpetuate myths and wrongheaded conventional wisdom. Evaluating how “accurate” and “wrongheaded” they are itself requires nuanced expertise.
The original Encyclopédie edited by Diderot (and other early attempts at creating encyclopedias) understood the value of expertise that goes beyond quantification. They sought to bring in input from the foremost experts of the time. They did not rely on the presumed wisdom of the big anonymous crowd but on the weight of the academic reputations of a few highly renowned experts. Champions of Wikipedia and the like might claim that they would have done otherwise had the technology been available, but that is highly doubtful: the intellectuals of the Enlightenment were, quite frankly, snobs. They did not believe that the crowd had much wisdom to offer. The appeal to expertise was a very deliberate choice, unconstrained by technology.
Something in this story finds an echo in the mindset of “data mining,” at least in the naive approach thereof. The foremost concern of data miners is to identify patterns in the data, not necessarily to understand the data itself, theorize about its distributions and properties, or spend much time asking how the data was generated in the first place. (Although many statisticians have had similar weaknesses in the past, one might add–and may still. They sought to justify their beliefs about, say, human intelligence using data and statistical techniques, rather than trying to understand the nature of human intelligence itself, from which their data was emanating.) The problem is that the data available to be analyzed exists in the quantity that it does for a reason. This is, in part, an extension of the simple selection bias problem, but exacerbated by the nature of the data ecology today. Data is not simply plucked out of “reality.” It is plucked from both the reality and its many echoes. The “bad” data that fails to yield useful insights is not only abundant; it generates far more echoes than the good, useful data. In other words, “bad” data outreproduces the good exponentially, if one were to conceptualize data availability in ecological terms. The insights garnered from the “bigger data” are potentially misleading because they are drawn far more from the uninformative but more abundant data.
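The ecological point can be made concrete with a toy simulation (all numbers here are invented purely for illustration): a modest batch of observations drawn directly from “reality,” swamped by echoes of a single bad datum that keeps getting recopied, the way a wrong claim propagates through secondary sources.

```python
import random

random.seed(0)

# Hypothetical setup: the true quantity we would like to estimate.
TRUE_MEAN = 1.0

# A small amount of "good" data drawn directly from reality...
good = [random.gauss(TRUE_MEAN, 1.0) for _ in range(100)]

# ...and an "echo": one misleading observation, reproduced over and over.
echo_value = random.gauss(TRUE_MEAN + 5.0, 1.0)  # one bad datum
echoes = [echo_value] * 2000                     # outreproducing the good data

pooled = good + echoes
naive_estimate = sum(pooled) / len(pooled)   # the "bigger data" estimate
good_only = sum(good) / len(good)            # estimate from reality alone

print(f"estimate from good data only:  {good_only:.2f}")
print(f"estimate from pooled big data: {naive_estimate:.2f}")
```

The pooled “bigger data” estimate lands far from the truth, not despite its size but because of what its size consists of.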
The challenge here is the same as the problem that Wikipedia has not yet addressed–as far as I know–about such things as the Haymarket Square bombing. In the absence of the subject knowledge (or domain expertise, in data science lingo) that permits appropriate weighing of different sets of data, properly evaluating the data is difficult. But what makes Wikipedia and naive machine learning valuable is precisely that they lack “domain knowledge” to begin with, for along with knowledge comes prejudice. People “know” what conclusions to draw without sufficient evidence, either because they “know” or because they “think they know–but really don’t,” like Karl Pearson on human intelligence. In an ideal world, the subject experts would be skeptically respectful of the prospects offered by naive pattern recognition, and the data miners would be equally skeptically respectful of the subject expertise. But being “skeptically respectful” may well be the most difficult attitude towards information to attain. The data miners, for example, have tons of data that say one thing. They need to know, in terms that they can understand, why the patterns they are seeing are misleading. Simply brandishing subject credentials does not–and should not, for the sake of advancing knowledge–impress them. But the data analysts need to be cognizant of the limits of the data, both the data itself and its statistical properties (there will be another post in the near future about why taking variances seriously might be a good idea–perhaps even more so than the point estimates).
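As a small preview of the variance point, consider two hypothetical data sources (the numbers are made up) that report identical point estimates but are not equally informative. A point-estimate comparison says they agree; the variances say one of them barely constrains anything.

```python
import statistics

# Invented data: same mean, very different spread.
tight = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
noisy = [2.0, 18.0, 5.0, 15.0, 10.0, 10.0]

for name, xs in [("tight", tight), ("noisy", noisy)]:
    mean = statistics.mean(xs)
    se = statistics.stdev(xs) / len(xs) ** 0.5  # standard error of the mean
    print(f"{name}: point estimate {mean:.1f}, standard error {se:.2f}")
```

Weighing the two sources equally because their point estimates match would be exactly the mistake of treating every datum as being of equal worth.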
I just came across this blog post. In some sense, it reflects what I have just noted, but draws the opposite conclusion. The consensus of “knowledge” is inherently conservative, and new ideas, even if they are right, must always fight for their place by displacing the existing champion. Many historical accounts of the Galileo affair note this: while Galileo was ultimately vindicated as being closer to the truth than his opponents, he did not have either convincing evidence or a theoretical explanation that could overwhelm the consensus among his scientific peers. He was put on trial (and given a very light punishment) mostly because he was a foul-tempered crank who slandered the important figures of the day, not because he was challenging some immutable (anti-)scientific orthodoxy. In some sense, even if the scientific consensus was wrong, the inertia of incredulity Galileo encountered was not atypical of radically new scientific insights.
The problem is that Galileo, and for that matter experts in general, deal with other experts who are already in possession of a great deal of expertise. They do not need to be told the first principles and how they lead up to the conclusions. Galileo’s contemporaries were familiar with the Copernican theory, for example, and were deeply appreciative of it. They were not wrong to think that the weight of evidence at the time was limited, however. The amount of information needed to tip that balance was small. This is what makes the “Wikipedia vs. the truth” problem more troubling to me: it punishes the expertise of the few and perpetuates the ignorance of the masses, in the name of “democracy.” Tipping the balance of non-expert beliefs will take much more. Tipping the balance of non-expert beliefs seemingly backed up by copious data will be harder still, unless Landon loses in a landslide to FDR. (NB: The “Landon Landslide” story is far more pertinent than one might expect. The Literary Digest drew its conclusion from an exhaustive analysis of very large data, consisting of several million respondents. Its polling prowess had supposedly been validated by its successful forecasts of several previous elections. The sampling problems that bedeviled the Literary Digest polls produced publications in public opinion research for decades after the fiasco. In the same election, George Gallup proved his mettle as a pollster not by his data-analytical prowess but by his awareness, drawn from his political savvy, that the data ecology in 1936 was dramatically different from that of previous elections. To the degree that one wishes to make accurate predictions on the sample that one does not yet have–i.e., NOT your “testing” data–it may be worthwhile to focus on how to forecast changes in the data ecology, not on ingenious ways of cross-validating by naively cutting up the data that you have…although cutting up the data on hand may actually prove to be the best approach to evaluating how the model holds up vis-à-vis potential evolutions of the data ecology.)
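That last point can be sketched with a toy example (all parameters invented): a deliberately naive model is cross-validated entirely within the “old era” data, where it looks reassuring, and then evaluated against a shifted data ecology, where it fails, much as the Literary Digest’s historically validated methodology did in 1936.

```python
import random

random.seed(1)

# Hypothetical "old era" data versus a new era whose ecology has shifted:
# same underlying relationship, but a changed baseline.
def draw(n, intercept):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [intercept + 2 * x + random.gauss(0, 0.1) for x in xs]
    return list(zip(xs, ys))

old_era = draw(200, intercept=0.0)
new_era = draw(200, intercept=1.5)   # the ecology has shifted

def fit_mean(data):
    # naive model: predict the historical average outcome
    return sum(y for _, y in data) / len(data)

def mse(model, data):
    return sum((y - model) ** 2 for _, y in data) / len(data)

# Cross-validation entirely within the old era looks reassuring...
half = len(old_era) // 2
cv_model = fit_mean(old_era[:half])
cv_error = mse(cv_model, old_era[half:])

# ...but the deployed model fails on the new ecology.
deployed = fit_mean(old_era)
future_error = mse(deployed, new_era)

print(f"cross-validated error (old era): {cv_error:.2f}")
print(f"error on the shifted ecology:    {future_error:.2f}")
```

The holdout score is honest about the data on hand and silent about the data to come, which is precisely the distinction drawn above.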