Data Availability and the Newspeak of the 21st Century.

George Orwell described the invention of the fictitious newspeak in the novel 1984 as an attempt by the state to curtail the breadth of thought by limiting the scope of what is thinkable.  While I’m skeptical that limitations in linguistic structure necessarily limit the scope of thoughts themselves, I think it is definitely true that limiting the language curtails the ability of a thought to spread, by making it difficult to describe, succinctly and easily, unorthodox thoughts that don’t fit neatly into a language that only the select few are privy to.

The idea of secret lingo among conspirators is hardly original to Orwell.  Secret societies (or societies that pretend to be such) have their secret codes, handshakes, and other allegedly covert means of communication.  The crew of the Nautilus in 20,000 Leagues under the Sea speak in a secret language that is impenetrable to outsiders, which prompts this reaction from Ned Land:

“Don’t you see, these people have a language all to themselves, a language they’ve invented just to cause despair in decent people who ask for a little dinner!”

The more complex, nuanced, and subtle a language is, the more impenetrable it is to outsiders, and as such, the better it can conceal “conspiracies” among its speakers, whether they are conspiracies of thought or action.  Not surprisingly, many states, as part of political consolidation, sought to standardize the language.  Much is made of the attempts by the Germans and the Russians, say, to suppress the Polish language, or by the English to suppress Welsh.  But they were contemporaneous with the standardization of the French and German languages that led to the dominance of the capital dialects over their linguistic relatives.  Of course, they have a precedent in the standardization of the written Chinese language by the First Emperor.

The penetrability to outsiders and the inability to express subtle, complex, and nuanced thoughts are, of course, also the characteristics that make a language easier for Google Translate and its relatives.  Here, the infiltrator is not the language police from Moscow or Berlin, or even the agents of INGSOC, but something that is literally not even human and does not care to think like a human does, but abides only by a pattern recognition algorithm that is massively complex yet quite simple-minded at the same time.  Yet it is also becoming the standard tool of international business, politics, and other formal transactions.  Documents will increasingly have to be generated in a manner that makes subtleties and complexities difficult.  At least in the legal-diplomatic-business realms, the public sphere, if you will, words and concepts will have to be defined in a manner that purges such complexities as well.  Even if “true” thoughts, taking place inside people’s minds, may remain complex, nuanced, and subtle, they cannot be so in the open.  This is reminiscent of the world of newspeak, if less complete, as it achieves essentially the same aim.  “Subversive” thoughts, subversive not for their contents per se but for their complexity and “difficulties,” are to be stamped out.  Without a language suitable for their expression, they will be reduced to the status of a boutique dead language among bitter enders, like Sorbian or Ligurian, incomprehensible beyond a handful of specialists.

Perhaps this is a bit of an extreme example:  for all its influence, Google Translate is not the ultimate arbiter of human language, for far too many linguistic transactions still take place without intermediaries.  But the analogue is that, once “data analysis” becomes the language of transactions (and Google Translate is an applied form of “data analysis” that is conceptually simple-minded even if massively complex in terms of moving parts), the underlying assumptions, especially when they are not properly understood, become part of the commonly accepted “obvious common sense,” taken for granted without a second thought, even if they are ultimately wrong, misguided, or incomplete.  If you will, the users of the lingo are initiated into the cargo cult by partaking in its initiation rituals through the (implied) profession of faith in its assumptions.

One example that I became intimately familiar with is the exploding use of (simple) unidimensional models for everything in politics that accompanied the increasing popularity of DW-Nominate and its relatives, which allegedly measure “ideology.”  The short answer is that DW-Nominate has nothing direct to do with “ideology.”  It identifies patterns in recorded votes that take place in a legislature.  All that the talk about polarization in Congress, as captured by DW-Nominate, amounts to is that the votes are predictable:  most take place along the party line, and those who “defect,” if there are any (rare nowadays), are readily identifiable.  Could this say something about ideology, inside people’s minds?  Probably, since some people vote their “ideology” some of the time.  But it is hardly the only thing:  many subtle politics go into shaping votes in Congress, and all instances of subtle politics are different.  So we don’t know what exactly they are simply by glancing quickly at the data, other than that they exist and that they account, collectively, for whatever percentage of the observed voting behavior.

But “lack of explanation” is not an explanation to many audiences.  When presented with data that show “not explained,” they can respond, correctly, “so you don’t know?”  (I speak to this from actual experience.)  We have to focus on what it is that we CAN explain, and that is the obvious pattern, the whole “polarization” angle, where the Democrats vote like Democrats and Republicans vote like Republicans.  It used to be that the “moderates” who switched back and forth between them were relatively numerous and somewhat predictable (this is where a lot of nuance and details help with the explanation, but not with the “final answer”).  Now, the main result of polarization is driven by most Democrats voting like Democrats most of the time, and vice versa for the Republicans, with very few exceptions.  To the degree that we have accepted the assumption behind DW-Nominate, willingly or otherwise, by using these measures, we are naturally drawn to the explanation that legislators today are more “ideological” than they were before.

It is not necessarily a bad answer:  that the legislators are mostly voting party means that the old “subtle” politics, at least of the variety that ran counter to simple party-line voting, are increasingly going by the wayside.  We might not know what “ideology” is exactly, but it is becoming the big deal.  But does this mean that the subtle politics are insignificant?  That, I think, is a dangerous oversight.  After all, in the 2016 election, most voters probably did vote party, but an electorally significant minority that happened to be concentrated geographically did not and, for all their sparseness, that made all the difference.  By drawing attention to the obvious and the more easily quantifiable, the focus on “data analysis” blinds us to the subtleties, whether in politics, markets, or language.  On average, it probably would not matter, at least in the short term.  But there will be times when it blows up in our faces.

Furthermore, the lack of attention to the not-so-easily quantifiable also shapes the strategies of political (and other) actors.  I don’t think the heresthetic style of Trump (and Sanders) in 2016 was necessarily by chance.  We had seen this before, in the world of baseball, in the form of the 2014 and 2015 Royals.  A number of commentators pointed out that, contra the Royals’ apparent disdain for “moneyball” type strategy, they were in fact pursuing a “moneyball” strategy, by focusing on the characteristics that were being underinvested in and underinvestigated.  I think that is a bit misleading as a characterization.  Stats like OBP are easy to calculate with great reliability.  Defense, base-running, and even pitching effectiveness are harder to quantify reliably.  Precisely because the latter are not easily quantifiable, quantitative analyses of baseball tend to be a bit more careful about them.  (I realize that saber people will have an issue with this, but I think I can say with confidence that anyone who thinks that defense stats should not be taken with multiple grains of salt is nuts.)  The argument is right that the more obvious and more reliable stats will be more quickly monetized and the opportunity for arbitrage wiped out speedily.  Utilizing the less reliable, harder to quantify stats requires both heavier reliance on old fashioned baseball know-how and not inconsiderable risk-taking, which I think fairly characterizes the Royals’ strategy over the last few years, a strategy that paid off in 2014 and 2015, but not so much in 2016.  (This is, of course, what lay behind Michael Milken’s junk bond strategy too, in a sense:  the not-so-easily quantifiable is inherently uncertain and thus risky, you can mitigate the risk somewhat with a bit of specific knowledge, and you can make it big by taking on bigger risk.)

Like the Royals, Trump had a bit of old fashioned political sense that has come to be dismissed by the new political quants, and took on a big risk, even if of a calculated variety.  (It is telling that an old fashioned Midwestern politician, Bob Dole, should have been the only former Republican presidential candidate to formally endorse Trump, for old fashioned political reasons.)

In this sense, perhaps the real risk is not that we will be unable to think subversive thoughts at all, with the rise of the modern newspeak, but that we will be trapped increasingly in our linguistic bubble, our version of the fictional Nautilus, captained by a madman for reasons that make sense only to ourselves, without becoming aware of the world

PS.  I thought about what I wrote earlier about what DW-Nominate does not predict and the “subtle” politics, and about how this might be captured.  Generally, the presidential voteshare in a House district and the DW-Nominate score of the incumbent representing it are reasonably correlated.  One can capture the “presidential votes” that “should” take place by regressing the actual presidential voteshare on DW-Nominate scores (signs suitably changed to reflect left and right) and predicting the y-value, then creating a measure of surplus by examining the difference between this predicted value and the actual votes for incumbent House members running for reelection.  (This is a bit silly and crude, but hey, I’m doing this on the fly.)  Comparing these “surpluses” to the standard errors on first dimensional DW-Nominate scores (as a measure of what is not captured by the scores) yields the following graphs.

For the entire period between 1946 and 2010 (the data I have immediately available cuts off at 2010):


It is worth noting that the errors yield more votes in more recent years, when party labels actually become more meaningful:

Contrast the period after 1995 (a semi-arbitrary cutoff, admittedly).


and the period before.


Of course, being “weird” stands out more when everyone else is alike.  When everyone is already different in their own ways, the congressman/woman is being just like “everyone” when he/she is “different.”  So the KC Royals strategy is already paying more dividends now than before.
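
For anyone who wants to try the crude “surplus” calculation from the PS, here is a minimal sketch in Python.  It is only one reading of the procedure, not the code behind the graphs.  It assumes the DW-Nominate scores have already been sign-adjusted so that higher means further toward the incumbent’s own party, and that both voteshares are for the incumbent’s party, in percent.  The function names and the numbers are invented for illustration; none of them are real data.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def incumbent_surplus(nominate, pres_share, incumbent_share):
    """Incumbent voteshare minus the presidential voteshare predicted from the score."""
    intercept, slope = fit_line(nominate, pres_share)
    predicted = [intercept + slope * s for s in nominate]
    return [inc - pred for inc, pred in zip(incumbent_share, predicted)]


# Invented toy numbers (NOT real data): four hypothetical districts.
nominate = [0.1, 0.2, 0.3, 0.4]              # sign-adjusted first-dimension scores
pres_share = [50.0, 55.0, 60.0, 65.0]        # presidential voteshare, percent
incumbent_share = [52.0, 55.0, 66.0, 61.0]   # incumbent's own voteshare, percent
std_error = [0.08, 0.03, 0.12, 0.05]         # standard errors on the scores

surplus = incumbent_surplus(nominate, pres_share, incumbent_share)

# Pair each standard error with its surplus; these pairs are what one would plot.
for se, s in zip(std_error, surplus):
    print(f"std. error {se:.2f} -> surplus {s:+.1f}")
```

Plotting the pairs for different periods (say, before and after 1995) would give pictures of the same kind as the graphs described above, though the real exercise would need the actual voteshares and scores.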




