Cheap Data vs. Good Data–The Case of Assessing Japanese Military Aviation before World War II

This is a fantastic dissertation, too good for a mere MA thesis.  While it is fascinating enough just as a source of historical information on an interesting topic, it is also useful as an instructive illustration of the problems in successfully using and abusing data.

Intelligence assessment and statistical analysis are, fundamentally, the same problem. Both are extrapolations of the knowns (the data) to evaluate the unknown, with a certain set of assumptions that guide the process.  Both can never be entirely accurate, as the “knowns” do not match up neatly with the “unknowns,” but if we do enough homework and/or are sufficiently lucky, we can deduce enough about the relationship between them to make the pieces fit.  Or, in other words, we cannot rely on the data itself to just tell us what we want to know.  What we really want to know, the really valuable pieces of information, will be somehow unavailable–otherwise, we wouldn’t need to engage in the analysis in the first place.

The thesis points to an all too common problem in data analysis:  the good data pertains to something that we don’t really need to know, or worse, something potentially misleading, while the questions that we really do want answered did not generate enough high quality data.

In the case of military aviation in Japan, the good data came from the 1920s, when the Japanese, aware of the backwardness of their aviation technology, actively solicited Western support in developing their capabilities, both military and industrial.  Since they were merely trying to catch up, their work was largely imitative.  Knowing the limitations of their existing technological base, they were happier to copy what was already working in the West, even if it was a bit old, rather than try to innovate on their own.  Since they had, literally, nothing to hide, they were open about the state of their technology, practices, and industries to the Westerners, who, in a way, already knew a lot of what the Japanese were working with anyway, since most of it consisted of copies of Western wares.  In other words, the data were plentiful and of extremely high quality.  But they also conformed to the stereotype of the Japanese in the West as not especially technologically advanced or innovative.

By the 1930s, things were changing: not only were the Japanese developing new aviation technologies of their own, but the relationship with the West had cooled decisively.  They became increasingly secretive about what they were doing and, as such, good data about the state of Japanese military aviation became both scarce and unreliable.  But, in light of the increased likelihood of armed clash between Japan and the West, the state of Japanese military aviation in the 1930s (or 1940, even, given when the war eventually did break out) was the valuable information, not its state in the 1920s.  The problem, of course, is that, due to the low quality of the data from the 1930s, nothing conclusive could be drawn from them.  While there were certainly highly informative tidbits here and there, especially viewed in hindsight, there was also a lot of utterly nonsensical junk.  Distinguishing between the two was impossible, since, by definition, we don’t know what the truth looked like.  Indeed, in order to be taken seriously at all, intelligence reports on Japanese aviation had to be prefaced with an appeal to existing stereotypes, that the Japanese were not very technologically savvy–which was, of course, more than mere prejudice, as it was very much true, borne out by the actual data from the 1920s.  In other words, this misleading preface became, in John Steinbeck’s words, the pidgin and the queue–some ritual that had to be practiced to establish credibility, whether it was actually useful or not.

This is, of course, the problem that befell the analysis of the data from the 2016 presidential election.  All the data suggested, as with the state of Japanese military aviation, that Trump had no chance.  But most of the good data, figuratively speaking, came from the wrong decade, or involved a matchup that did not exist.  In all fairness, Trump was as mysterious as the Japanese military aviation of the 1930s.  There were so many different signs pointing in different directions that evaluating what they added up to, without cheating via hindsight, would have been impossible.  While many recognized that the data was the wrong kind of data, the problem was that the good data pertaining to the question at hand simply did not exist.  The best that the analysts could do was to draw up the “prediction,” with the proviso that it was based on “wrong” data that should not be trusted–which, to their credit, some did.  This approach requires introspection, a recognition of the fundamental problem of statistics/intelligence analysis–that we don’t know the right answer and we are piecing together known information of varying quality and a set of assumptions to generate these “predictions,” and sometimes, we don’t have the right pieces.  The emphasis on “prediction,” and getting “right answers,” unfortunately, interferes with this perspective.  If you hedge the bet and invest in a well-diversified portfolio, you may not lose much, but you will gain little.  Betting all on a single risky asset ensures that, should you win, you will win big.  Betting all on the single less risky asset, likewise, would ensure that you will probably gain more than hedging all around–and if everyone is in the same boat, surely, they can’t all be wrong?  (Yes, this is a variant of the beauty contest problem, a la Keynes, and its close cousin, the Grossman-Stiglitz problem, with the price system.)

I am not sure that, with the benefit of hindsight removed, an accurate assessment of Japanese military aviation capabilities in 1941 would have been possible.  The bigger problem is that, because of the systematic problems in data availability, the more rigorously data intensive the analysis (at least in terms of the mechanics), the farther from the truth its conclusions would have been.  A more honest analysis that did not care about “predicting” much would have pointed out that the “good” data is mostly useless and the useful data is mostly bad, so that a reliable conclusion cannot be reached–i.e. we can’t “predict” anything.  But there were plenty of others who were willing to make far more confident predictions without due introspection (another memory from the 2016 election) and, before election day, or the beginning of the shooting war, it is the thoughtless and not the thoughtful who seem insightful–the thoughtless can at least give you actionable intelligence.  What good does introspection do?

Indeed, in the absence of good information, all that you can do is to extrapolate from what you already “know,” and that is your existing prejudice, fortified by good data from the proverbial 1920s.  This is a problem that all data folks should be cognizant of.  Always think:  what don’t we know and what does that mean about the confidence we should attach to the “prediction” we are making?

Mirage of Data and Analytics–Baseball Again.

Fangraphs has a fascinating piece that echoes some of my ideas from a post a little while ago.

Dave Cameron starts by pointing to the problems of securing “intellectual property” in baseball:  most people who do analytics are, essentially, mercenaries, who are hired on short term contracts and move between different organizations frequently.  You cannot keep them from bringing ideas with them when they change jobs.  So ideas spread rapidly from organization to organization and the opportunity to arbitrage previously underappreciated ideas is reduced.  But he also alludes, without being explicit, to the fact that the ideas and concepts themselves are pretty simple, or at least are given to being interpreted very simply.  In other words, ideas are viewed as commodities that have constant values, rather than something that fits better with a particular philosophy or organizational strategy.  To use Cameron’s example, an uppercut swing plane is treated as a “better” approach for a batter. To use an example that I often find annoying, FIP is considered a “better” measure of a pitcher’s effectiveness than simple ERA.

Are these in fact “better” measures?  People often don’t seem to realize that FIP does not measure the same thing as ERA at the technical level:  FIP only incorporates the three “true” outcomes–HR, BB, and K’s.  It is probable that a pitcher who gives up many home runs and/or walks many batters is not very good.  But, conversely, there is something to be said if a pitcher who gives up many home runs and walks many batters doesn’t give up many runs.  Or, indeed, the same thing might be said for pitchers who give up many “unimportant” runs (i.e. give up runs only when it doesn’t count–and somehow manage to persistently keep leads, even small leads).  It could be that FIP might, on average, capture the “value” of a pitcher better than ERA, which, in turn, does a better job than simple wins and losses, but I don’t think the value of a player is a simple unidimensional value that always translates to a real number readily.  The conditional value of a pitcher varies depending on an organization’s strategy and philosophy, and these are more difficult to change–but also offer the potential of finding more lasting value than the easier, commodifiable statistics.  The optimum strategy, in a high variance matching game, is to know your own characteristics (i.e. philosophy, approach, endowments in budget and talent pool, etc.) and optimize conditional on those characteristics–and sign especially those players who don’t fit other organizations’ characteristics neatly.  Universally good traits are easily identified and their value competed away fast, now that technology is readily available.
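To make the point that the two stats measure different things concrete, here is a minimal sketch of both calculations. The FIP formula is the commonly published one (which also counts hit batsmen alongside walks); the additive constant varies by league and season, so the 3.10 default below is only an assumed placeholder, and the pitcher line in the usage lines is made up for illustration.

```python
def era(earned_runs, innings):
    """Earned run average: earned runs allowed per nine innings."""
    return 9.0 * earned_runs / innings


def fip(hr, bb, hbp, k, innings, constant=3.10):
    """Fielding independent pitching, commonly published form.

    Only home runs, walks (plus hit batsmen) and strikeouts enter;
    runs actually allowed never appear. The constant is league- and
    season-specific; 3.10 is an assumed placeholder, not a real value
    for any particular year.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / innings + constant


# Hypothetical pitcher: lots of homers and walks, yet few runs allowed.
print(round(era(earned_runs=70, innings=200), 2))
print(round(fip(hr=30, bb=80, hbp=5, k=150, innings=200), 2))
```

The hypothetical pitcher looks good by ERA and poor by FIP. Neither number is "wrong"; they answer different questions, and which one matters depends on what the team needs from him.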

Much has been made of the Royals’ success in seemingly going against the grain with regard to “analytics.”  Now, several authors have claimed that the Royals were in fact making good use of moneyball concepts, focusing on the traditional but still valuable ideas that have been neglected due to sabermetric fetish.  I think both are somewhat mistaken:  I suspect that the Royals began with a philosophy first and tried to incorporate statistics to fit the philosophy, rather than bouncing around “analytics” chasing after the fool’s gold of commodified “good” stats whose value dissipates rapidly.  Copying the Royals’ approach, without having a similar basic philosophy and organizational strengths and weaknesses, probably will not pan out.  Building the philosophy and style–and assembling personnel who appreciate them–is a long term process that requires, ironically, a deeper appreciation of what analytics do and don’t offer–specifically, the subtle differences between the many seemingly similar stats and how they mesh with the particulars of the team in order to find better “matches.”

This is hardly a new idea in business management:   in the 1980s, as per this TAL story, GM execs were puzzled that Toyota was so willing to reveal the particulars of its management strategy to its competitor in the course of their joint venture.  It turns out that Toyota’s management strategy was effective given the organizational philosophy of the firm and turned out to be very difficult to implement in GM without upending its fundamental characteristics.  It does seem that Toyota did overestimate the import of “Asian culture” as a component of its corporate philosophy, as GM was reasonably successful, over the (very) long term, in implementing many of the lessons it learned from Toyota–but most of these successes came in overseas subsidiaries far from the heart of GM’s corporate culture that impeded their implementation.  Perhaps this provides a better explanation of the much ballyhooed feud between Mike Scioscia and Jerry Dipoto that eventually led to the latter’s departure.  I don’t think Scioscia and the Angels organization have been necessarily all that hostile to the idea of “analytics” per se–they seem to have had interesting, quirky, and often statistically tenuous ideas about bullpen use and batting with runners in scoring position dating back to their championship year in 2002 at least. So a peculiar organizational culture already existed that could absorb analytical approaches of certain strains but was potentially hostile to others, and I wonder if what showed up was this, rather than “traditional” vs. “analytical” as commonly portrayed.

Here, I speak from personal experience:  I looked enough like a formal modeler to be mistaken for one by non-formal modelers, but I usually started from sufficiently different and unorthodox assumptions that I did not mesh with a lot of formal modelers, who either did not understand that their assumptions need not be universal or were hostile to different ideas in the first place.  I will concede that, on average, the usual assumptions are probably right most of the time–but when they are wrong, they are really wrong, and a great deal of value resides in identifying the unusual circumstances when usually crazy ideas might not be so crazy.  Of course, that is why people, not just baseball teams, should take statistics and probability theory more seriously when they delve into “analytics.” Never mind whether stat X is “better,” unconditionally.  Is stat X more valuable given such and such conditions than stat Y, and do these such and such conditions apply to us more than to those other guys?

PS.  This is a repeat of the story on beer markets and microbreweries, in a sense.   Bud Light is a commodity beer that seeks to fit “everyone” universally.  Its fit to any one market is imperfect, but, given the technology on hand, it can be produced much more cheaply than most beers that fit a smallish market better.  Only beer snobs are willing to trade off a much higher price for a better fit in tastes.  This is independent of the methodological problem of identifying the fits–the question is, once you have identified the better fit, how many people are willing to pay the price for the better taste?  But technological change forces a reconsideration of this business model:  the microbrewery revolution was preceded by technological change that made production of smaller batches of beer much cheaper.  Producing massive quantities of the same taste is still cheaper, but the gap is much narrower.  Less snobby beer drinkers will pay a smaller premium for a better taste fit.  So the problem is much more two-dimensional (at least) than before:  you find the better taste fit, and conditional on the taste fit (and the associated elasticities), try to identify the profit maximizing price.  This requires a subtler, more sophisticated strategy and analytical approach and is liable to produce a much more complex market outcome.  As noted before, people who are more sensitive to price than taste will still gravitate towards Bud Light, even if there is a taste that they prefer more, as long as the price gap is large enough.

With baseball (and indeed, all other forms of “analytics”), the problem is the same.  FIP or SIERA or any other advanced statistics are still in the realm of commodity stats, something that is supposed to offer a measure of “universal” value.  If you will, these are the means to produce a better Bud Light.  But soon enough, Bud Light is still Bud Light.  It is not easy to find something that suits everyone that much better.  So you trade off:  you give up the segments of the market that have a certain taste for another segment that you can cater to more easily.  Or, in the baseball context, you grab the players who may not be so good, in the overall sense, but whose strengths and weaknesses, whether quantifiable or not, complement your organizational goals and characteristics better, with the caveat that, even if they are quantifiable, the measures will be more complex than simple commodity stats like ERA or FIP, in that their usefulness would be conditional.  Perhaps one could come up with some sort of “fitness” or “correspondence” stats.  Incidentally, online dating services use this sort of stats, and this has a long history of its own:  the “stable marriage problem” is one of my favorites and is foundationally linked to the logic of equilibrium in game theory.  My research interest for years had been in “measuring” the stability/fragility of equilibria, which, in a sense, is a paradoxical notion–if it’s not stable, how can it be an equilibrium?  But the catch is that most things are in equilibrium only conditionally–this is the core of the PBE notion:  an outcome is stable conditional on beliefs that are justified by the outcome, i.e. a tautology.  If people, for whatever reason, don’t buy into the belief system, it may fall apart, depending on how many unbelievers there are.
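For readers who have not run into the stable marriage problem, here is a minimal sketch of the standard Gale-Shapley deferred acceptance procedure, recast as teams proposing to players. The team and player names and their preference lists are hypothetical, and the sketch assumes equal numbers on both sides with complete rankings.

```python
def stable_match(proposer_prefs, receiver_prefs):
    """Gale-Shapley deferred acceptance. Returns {proposer: receiver}.

    Assumes equally many proposers and receivers, each ranking
    every member of the other side (most preferred first).
    """
    # rank[r][p]: position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next index each proposer tries
    engaged = {}                                  # receiver -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p                # receiver was unmatched: accept
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p                # receiver trades up
            free.append(current)
        else:
            free.append(p)                # rejected: try the next choice later
    return {p: r for r, p in engaged.items()}


# Hypothetical preferences: both teams want P1, but P1 prefers TeamB.
teams = {"TeamA": ["P1", "P2"], "TeamB": ["P1", "P2"]}
players = {"P1": ["TeamB", "TeamA"], "P2": ["TeamA", "TeamB"]}
print(stable_match(teams, players))
```

The result is "stable" only in the conditional sense discussed above: no team and player would both rather be with each other than with their assigned match, given the stated preferences. Change the preferences and the matching can change with them.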

Using and Abusing Statistics–Baseball Edition

I like statistics and I like baseball, but the way I approach baseball stats might be a bit different from that of most other people.

At one time, pitchers were evaluated on the basis of wins and losses.  Then along came pitchers who were ludicrously good with lousy win-loss records, like Nolan Ryan, and people started realizing that wins and losses make for lousy stats and started looking at alternatives.  By and large, that was a good thing–but with a caveat that people have forgotten.

More recently, people started realizing that some pitchers have ludicrously good ERAs and are not that good, while others with lousy ERAs are better than their numbers.  Along came more advanced stats like SIERA and FIP.  By and large, this probably is a good thing–but again, with a caveat that people forget.

The caveat in both cases is that the objective in baseball is winning.  Even if you allow an average of just one run every nine innings, if you keep losing, you still lose.  So winning is a perfectly valid way of measuring a baseball player’s performance.  It is, indeed, the only measure that is actually meaningful.  Everything else is secondary.

The problem is that there are 25 players on a major league roster, so the contribution to a win by a single ballplayer is conditional.  Steve Carlton, on terrible Philly teams, was more valuable, relatively speaking, than he would have been on a good team, even if he lost 20 games (i.e. 1973).  So how valuable is a single player on another team, ceteris paribus?  This involves constructing counterfactuals and it is something statistics–the real statistics–is supposed to be good at, as it came out of the experimental research tradition.  But this is something that requires a bit more complex thinking than what most users of data, baseball and otherwise, seem interested in consuming, as it often cannot reduce the performance to a single set of numbers.

Personally, I think ERA is still the best single set of numbers, for example, for evaluating pitchers, for the ease of interpretation that it allows.  A pitcher with an ERA of 3 on a team that averages 4 runs a game is a winner, on average, while the same pitcher on a team that averages 2 runs a game will be a loser, assuming that everything except the average offense (e.g. fielding, bullpen quality, etc.) stays the same.  That’s a bad assumption, obviously, but it avoids an even more egregious and troublesome assumption that comes with measuring pitchers by their win-loss records:  that everything, including offense, is the same–except, that is, the pitcher.

Note that one can actually do a bit better, even using just ERA or win-loss record to evaluate a pitcher, by incorporating better statistical methods that don’t reduce themselves to a single number.  Pitching performance and everything else are random variables:  the offense might score an average of 5 runs a game, but with a variance of 2, say.  The pitcher may give up 2 runs a game, but with a variance of 2.  Another lineup may score an average of 4 runs a game, but with no variance whatever.  Another pitcher might give up 3 runs a game, but with a variance of 0.  The second pitcher always wins in front of the second lineup.  The first pitcher might be better on average, but he might lose, even in front of the second lineup.  But if you have the first lineup and if the pitching and hitting performances are independent (they might not be–personal catchers and all that), perhaps you might want the first pitcher rather than the second–or not, perhaps, depending on the distribution (which may not be normal).  Of course, this is a baseball application of the “tall Hungarian” problem.  A high variance distribution allows for gambling in a way that low variance distributions do not–whether you choose to gamble depends on the circumstances.  Sometimes, gambling is the only way–and occasionally, it pays off.
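The four pairings above can be worked through in a few lines. This is only a sketch under two loud assumptions: runs scored and runs allowed are independent, and both are normally distributed (which real run totals are not). Under those assumptions the run differential is normal too, so the chance of a win has a closed form.

```python
import math


def win_probability(off_mean, off_var, pitch_mean, pitch_var):
    """P(runs scored > runs allowed), ASSUMING independent normal variables.

    The difference of two independent normals is normal, with mean
    off_mean - pitch_mean and variance off_var + pitch_var.
    """
    diff_mean = off_mean - pitch_mean
    diff_var = off_var + pitch_var
    if diff_var == 0:
        # No randomness at all: the outcome is decided by the means.
        return 1.0 if diff_mean > 0 else 0.0
    z = diff_mean / math.sqrt(diff_var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


# (mean, variance) pairs taken from the paragraph above
lineup_1, lineup_2 = (5, 2), (4, 0)
pitcher_1, pitcher_2 = (2, 2), (3, 0)

for lineup_name, lineup in (("lineup 1", lineup_1), ("lineup 2", lineup_2)):
    for pitcher_name, pitcher in (("pitcher 1", pitcher_1), ("pitcher 2", pitcher_2)):
        p = win_probability(*lineup, *pitcher)
        print(f"{pitcher_name} with {lineup_name}: {p:.3f}")
```

With the second lineup, the second pitcher is a sure thing while the first is not. With the first lineup, the first pitcher comes out slightly ahead (roughly 0.93 against 0.92). The ranking of the two pitchers flips with the lineup behind them, and a different distributional assumption could flip it again.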

Further incorporation of additional variables–fielding, relief corps quality, ground ball/fly ball ratios, and all that, will further reduce the variance, but will it completely eliminate the uncertainty?  Sometimes, a Mark Lemke hits a grand slam after all and an Omar Vizquel boots a grounder, after all.  You don’t want to intentionally put Mark Lemke in a spot where he HAS to hit a home run–that would be silly.  But risks and gambles are what make baseball interesting, and betting on high variance/low mean is sometimes exactly what you must do to win–even if you will probably lose your gamble.
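Why a high variance/low mean bet can be the right one is easy to see with a tail probability. The sketch below assumes normally distributed outcomes, and the two options ("steady" and "gamble") and their numbers are made up purely for illustration.

```python
import math


def prob_at_least(mean, sd, threshold):
    """P(X >= threshold) for X ~ Normal(mean, sd**2), with sd > 0."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical options: (mean, standard deviation) of runs produced
steady = (1.0, 0.5)   # higher mean, low variance
gamble = (0.5, 2.0)   # lower mean, high variance

for needed in (0.75, 3.0):
    p_steady = prob_at_least(*steady, needed)
    p_gamble = prob_at_least(*gamble, needed)
    print(f"need {needed}: steady {p_steady:.4f}, gamble {p_gamble:.4f}")
```

When only a little is needed, the steady option is the better bet. When a lot is needed, the steady option almost never gets there, and the gamble, while still likely to fail, is far more likely to succeed than the alternative.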

Now, what being able to add more variables and reduce “errors” means is that you will be able to make better, safer gambles, but that is hardly a sure thing.  An interesting observation that has been made about investments in risky assets is that the more data-intensive the research and analyses have become, the smaller the arbitrage opportunities have become:  not shocking, since, if an opportunity is obvious, people will grab on to it and pay a premium for it.  The consequence of this is that people are taking on more risk, because it is easier to bet on your getting lucky than being good–because all the obvious answers have been addressed.  I don’t know if this tradeoff is as well understood as it should be:  (relative) success is increasingly a sign of luck rather than skill.  But, at least when it comes to sports, we want to see the luck as much as we do the skill.  You don’t expect a nobody to hit the walkoff hit to win a playoff series, but that happens often enough.

The bottom line is twofold.  First, all useful statistics are conditional (or Bayesian in a sense).  Unconditionally good stuff gets arbitraged away fast–especially since unconditionally good stuff is obvious, even if you don’t know high powered stats.  The good players, good tactics, good approaches are good only if they are good for the situations that you need them for, which are almost certain to vary from team to team.  The real value is not that player X has a WAR of 2, but how to best use a -2 WAR player (for another team, given how he was used there) to get positive wins out of him for your team.  This can be tackled statistically, but not by calculating a single number that putatively captures his entire value.  Second, the value of a player is spread out over the entire season.  A player’s performance at any one time is variable, a gamble, a lottery ticket.  You invest in probabilities, but sometimes, General Sedgwicks get shot at improbable distances.  Working with probabilities and statistics CAN improve your chances at the gamble, but this is two dimensional–do you want to win big, at a big risk, or do you want to win small, at a small risk?  This comes with the additional proviso, of course, that your understanding of the universe is limited.  The lack of appreciation for risk and uncertainty is usually how one lies with statistics, or surprises the Belgians with unexpectedly tall Hungarians.

What Makes Humans Smart…and Dumb

I had never heard about the Great Emu War until recently.  What happened in that “conflict” seems fairly predictable, actually:  in response to demands for action by farmers in marginal lands in Western Australia, the government sent soldiers armed with machine guns to cull marauding herds of large flightless birds, only to discover that mowing down wildlife with machine guns does not work as well as mowing down humans, and then scrapped the project amidst much embarrassment.  What made me wonder about this venture, though, is something a bit different:  why was it so much easier to cut down humans with machine guns than birds?

Emus are big birds–pretty much human sized.  They are fast runners, but I don’t think they are so fast that it is impractical to shoot them down with machine guns.  As far as I can tell, what made it so difficult for the Australian army machine gunners to shoot at emus effectively was that the birds ran whenever the soldiers approached them, and when the shooting began (usually at considerable distance, if only because soldiers couldn’t approach them closely) they ran in all directions in panic, making it difficult for even a hail of bullets to hit many of them.  Of course, this is exactly the kind of natural reaction that almost any critter would engage in if it were shot at–that is, except one:  humans, especially those who are trained and disciplined.  What made it so easy for machine gunners to shoot down great masses of men during World War I was that humans are trained to behave unnaturally:  they kept their formation even in the face of bullets and they actually approached the machine gunners even as the bullets were flying towards them–still packed in formations.

This is, in a sense, what human sociality achieves.  Humans do strange and unnatural things that definitely run counter to the natural instinct of self-preservation.  This is how a human “society” can remain organized even in the face of adversity–which might do them good sometimes:  a group of people engaged in effective teamwork is far more effective than just the sum of individuals (the Greek phalanx was nearly invincible in close combat as long as it could maintain a formation where each pikeman could support, and could count on support from, his neighbors).  But the same discipline that allows an entire society to operate as a team can be used as bait to wipe out an entire society–Mongols and other steppe nomads were quite good at luring an entire army into a trap–which a highly disciplined army was more apt to fall into–and wiping it out as a group.  (In a sense, the Romans were trapped and annihilated so completely at Cannae and Carrhae, by Hannibal and Surenas respectively, precisely because of the disciplined nature of their legions.)  The same, I suppose, applies to World War I and machine guns:  it takes discipline and training–precisely what makes the human animal social and usually powerful–for an army to keep formation under attack, and that makes it so easy to wipe out with industrial machinery.

Discipline turns humans into machines, in a sense–but machines can outmachine humans.  Perhaps, at least some of the time, humans need a bit of animal instinct, to break from the pack and run away from machine guns like real living creatures, not dumbly march towards them only to be cut down in droves like stupid machines that aren’t as durable as real machines?