Variance vs. Errors

Errors and variances, in statistics and its applications, are two distinct yet easily confounded concepts.

Variance is a fundamental property of a distribution: there is an inherent variability in the outcomes generated by a given data-generating process, variability that lies beyond our understanding and must simply be consigned to a "random" box.

An error is simply how far off our prediction model, built on an existing aggregate body of data, turns out to be for a particular observation.

In practice, the two are, at least quantitatively, hard to tell apart.  In repeated tosses of a fair coin, the mean number of heads per toss is 1/2 and the standard deviation, the square root of the variance, is exactly 1/2 as well.  Moreover, every toss lands exactly one standard deviation from the mean, so the squared deviations themselves do not vary at all: in every outcome, the result will ALWAYS be 1/2 + 1/2 or 1/2 - 1/2, i.e. no matter what, you will have exactly 0 heads or 1 head.  This can be estimated from the sample, of course: just take the deviations in the sample, square them, and average them, and the result will approach the variance of 1/4 (whose square root is the standard deviation of 1/2) as the sample size approaches infinity.  But most users of statistics rarely look at the average squared deviation in the sample: they look at standard errors, that is, the range of values that the true mean might plausibly take given the sample.  So, if the coin truly is fair, the standard error will approach 0 as the sample size increases, while the sample points to a mean of 1/2.
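To make the distinction concrete, here is a minimal sketch in Python (assuming only numpy; the simulation is mine, not part of the original post): the sample mean heads toward 1/2, the average squared deviation heads toward the variance of 1/4 (standard deviation 1/2), and the standard error of the mean shrinks toward zero as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (10, 1_000, 100_000):
    tosses = rng.integers(0, 2, size=n)       # 0 = tails, 1 = heads
    mean = tosses.mean()                      # estimate of p, -> 0.5
    variance = ((tosses - mean) ** 2).mean()  # average squared deviation, -> 0.25
    std_dev = np.sqrt(variance)               # -> 0.5, the spread of the outcomes
    std_err = std_dev / np.sqrt(n)            # uncertainty about the mean, -> 0
    print(f"n={n:>7}: mean={mean:.3f}  var={variance:.3f}  "
          f"sd={std_dev:.3f}  se={std_err:.4f}")
```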

As a predictor of the next coin toss, this is pretty useless information.  You will never get half a head in ANY coin toss.  It is the probability with which you might obtain a head, but the value of knowing this probability with much precision, unless you are tossing the coin hundreds of times, is pretty limited: in a small sample, the difference between a probability of 0.5 and, say, 0.6 matters little in practice and is nearly impossible to tell apart statistically.  What you want to know is whether you will get a zero or a one, or something else, and for this you want the variance coupled with the mean: what is the spread of the data, and what is the spread of the spread, if you will.
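A rough simulation of the small-sample point, again a sketch of my own using numpy rather than anything from the original post: with only ten tosses apiece, a fair coin and a coin with p = 0.6 produce head counts that overlap almost entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tosses, n_trials = 10, 100_000

heads_fair = rng.binomial(n_tosses, 0.5, size=n_trials)
heads_biased = rng.binomial(n_tosses, 0.6, size=n_trials)

# How often does the fair coin produce MORE heads than the biased one?
# Roughly a quarter of the time -- far from decisive evidence either way.
print("P(fair > biased in 10 tosses):", (heads_fair > heads_biased).mean())
```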

Interestingly, this takes us away (somewhat) from the convenient realm in which conventional statistics and the normal distribution place us, at least in the context of coin tosses: while the sample mean may be distributed approximately normally as n approaches infinity, the distribution of the outcomes themselves is anything but normal.  Even as n approaches infinity, it is literally two lines, one at 0 and the other at 1, even if the normal distribution can provide some insight into their relative heights (and the associated confidence intervals), though that is not really necessary.
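A quick numerical illustration of this point (a sketch assuming numpy, not from the original post): the sample mean concentrates tightly around 1/2 and looks roughly normal, while the outcomes themselves never take any value other than 0 or 1, no matter how large n gets.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_samples = 10_000, 2_000

samples = rng.integers(0, 2, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("outcomes take only the values:", np.unique(samples))  # [0 1]
print("spread of the sample means   :", sample_means.std())  # tiny, ~ 1/(2*sqrt(n))
```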

In other applications, the normal distribution assumption may or may not be applicable.  We know that the means of samples from a large variety of distributions tend to approximate a normal distribution as n approaches infinity, per the central limit theorem, but do natural populations come in normal distributions?  There is no good reason to believe that they do.  Yet even a normal distribution assumption yields surprises that we might not think about: the average Hungarian might well be shorter than the average Dutch person, to make up a problem on the go, but with a sufficiently high variance for the former and a sufficiently low variance for the latter, it may be fairly common to see a random Hungarian being taller than a random Dutch person.  This has, as I noted previously, significant strategic implications in all manner of settings.  All the insistence that the Hungarian is shorter than the Dutch, on the basis of "data," would be pointless because, given high variance, the probability of running into an unusually tall Hungarian will be quite high.  If the high variance is combined with a large skew (i.e. a non-normal distribution), the implications might be stranger still.
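To put rough numbers on this made-up height example (the means and standard deviations below are invented purely for illustration, not data): even with a clearly lower mean, a high-variance population produces taller individuals often enough to make comparisons of random individuals unreliable.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Hypothetical figures: shorter mean with high variance vs. taller mean with low variance.
hungarian = rng.normal(loc=176, scale=10, size=n)  # cm
dutch = rng.normal(loc=183, scale=5, size=n)       # cm

# Despite the 7 cm gap in means, a random Hungarian is taller than a
# random Dutch person in a sizable share of pairings (~0.27 here).
print("P(random Hungarian taller):", (hungarian > dutch).mean())
```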

The relative lack of attention given to variances is a serious problem with the mean-centric, prediction-focused mindset that dominates, I think, among the "data science" types.  Sometimes the world is inherently weird, so that the outcomes cannot be predicted reliably, and we are better off trying to measure the extent of this weirdness and how far off the "predictable" paths the true outcomes might reside.  Ideas developed by data science types, like cross-validation and random forests, provide potentially useful additional means of measuring variance, although, the truth is, a simple set of deviations from a conventional OLS model might be the best starting point, if the sacred cow of the normality assumption is thrown aside and we try not to get hung up on the statistical significance of the predictors.  Yet we shun large "errors," which might simply be useful clues to the true "variances" and thus far more useful than yet another estimate of the mean, all because we just want to predict the means....
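As a concrete, if hypothetical, version of the "simple set of deviations from a conventional OLS model" idea: fit ordinary least squares and study the spread and shape of the residuals themselves, rather than only the coefficients and their significance.  The data-generating process below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

x = rng.uniform(0, 10, size=n)
# Heteroskedastic, skewed noise: the "weirdness" we actually want to measure.
y = 2.0 + 0.5 * x + rng.exponential(scale=1 + 0.3 * x)

# OLS fit via least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Instead of a single "error" summary, look at the spread of the spread.
print("coefficients:", beta)
print("residual sd :", residuals.std())
print("residual quantiles (5%, 50%, 95%):",
      np.quantile(residuals, [0.05, 0.5, 0.95]))
```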
