American Scientist, May-June 2007

FEATURE ARTICLE

# The Most Dangerous Equation

Ignorance of how sample size affects statistical variation has created havoc for nearly a millennium

What constitutes a dangerous equation? There are two obvious interpretations: Some equations are dangerous if you know them, and others are dangerous if you do not. The first category may pose danger because the secrets within its bounds open doors behind which lies terrible peril. The obvious winner in this is Einstein's iconic equation $E = mc^2$, for it provides a measure of the enormous energy hidden within ordinary matter. Its destructive capability was recognized by Leo Szilard, who then instigated the sequence of events that culminated in the construction of atomic bombs.

Supporting ignorance is not, however, the direction I wish to pursue—indeed it is quite the antithesis of my message. Instead I am interested in equations that unleash their danger not when we know about them, but rather when we do not. Kept close at hand, these equations allow us to understand things clearly, but their absence leaves us dangerously ignorant.

There are many plausible candidates, and I have identified three prime examples: Kelley's equation, which indicates that the truth is estimated best when its observed value is regressed toward the mean of the group that it came from; the standard linear regression equation; and the equation that provides us with the standard deviation of the sampling distribution of the mean—what might be called de Moivre's equation:

$$
s_{\bar{x}} = \frac{s}{\sqrt{n}}
$$

where $s_{\bar{x}}$ is the standard error of the mean, $s$ is the standard deviation of the sample and $n$ is the size of the sample. (Note the square root symbol, which will be a key to at least one of the misunderstandings of variation.) De Moivre's equation was derived by the French mathematician Abraham de Moivre, who described it in his 1730 exploration of the binomial distribution, Miscellanea Analytica.
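The square root is easy to overlook, so a small numerical check may help. The sketch below is illustrative only: the function names, the normal population with a standard deviation of 10, the number of trials and the random seed are all assumptions made for this example, not anything taken from de Moivre. It compares the value given by the equation with the spread of sample means actually observed in simulated samples of several sizes. For simplicity the simulation uses a known population standard deviation where the equation, applied to real data, would use the sample's.

```python
import math
import random
import statistics


def standard_error(s: float, n: int) -> float:
    """De Moivre's equation: standard error of the mean for a sample of size n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return s / math.sqrt(n)


def simulated_sd_of_means(n: int, sigma: float = 10.0,
                          trials: int = 2000, seed: int = 0) -> float:
    """Draw `trials` samples of size n from a normal population with
    standard deviation `sigma` and return the standard deviation of
    the resulting sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# Quadrupling the sample size only halves the variation of the mean.
for n in (4, 16, 64, 256):
    predicted = standard_error(10.0, n)
    observed = simulated_sd_of_means(n)
    print(f"n={n:4d}  predicted={predicted:.3f}  simulated={observed:.3f}")
```

The predicted column falls from 5 to 2.5 to 1.25 to 0.625 as the sample size grows sixteenfold at each pair of steps, and the simulated values should track it closely. Small samples therefore produce far more extreme averages than large ones, in both directions.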

Ignorance of Kelley's equation has proved to be very dangerous indeed, especially to economists who have interpreted regression toward the mean as having economic causes rather than merely reflecting the uncertainty of prediction. Horace Secrist's The Triumph of Mediocrity in Business is but one example listed in the bibliography. Other examples of failure to understand Kelley's equation exist in the sports world, where the expression "sophomore slump" merely describes the likelihood of an average season following an especially good one.
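The article describes Kelley's equation only in words. In its usual textbook form, the estimate of the true value is a weighted average of the observed value and the group mean, with the weight on the observation equal to the reliability of the measurement. The sketch below assumes that form; the function name and the batting figures are hypothetical, chosen only to illustrate the "sophomore slump."

```python
def kelley_estimate(observed: float, group_mean: float, reliability: float) -> float:
    """Regress an observed value toward its group mean.

    Assumes the textbook form of Kelley's equation:
        estimate = reliability * observed + (1 - reliability) * group_mean
    where reliability lies between 0 (pure noise) and 1 (no noise).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return reliability * observed + (1.0 - reliability) * group_mean


# Hypothetical rookie season: a .350 batting average in a league that
# averages .260, with an assumed season-to-season reliability of 0.5.
estimate = kelley_estimate(observed=0.350, group_mean=0.260, reliability=0.5)
print(round(estimate, 3))
```

Under these made-up numbers the best guess at the player's true ability is about .305, well below the observed .350. A second season near that level would look like a slump, yet nothing about the player need have changed.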

The familiar linear regression equation contains many pitfalls to trap the unwary. The correlation coefficient that emerges from regression tells us about the strength of the linear relation between the dependent and independent variables. But alas it encourages fallacious attributions of cause and effect. It even encourages fallacious interpretation by those who think they are being careful. ("I may not be able to believe the exact value of the coefficient, but surely I can use its sign to tell whether increasing the variable will increase or decrease the answer.") The linear regression equation is also badly non-robust, but its weaknesses are rarely diagnosed appropriately, so many models are misleading. When regression is applied to observational data (as it almost always is), it is difficult to know whether an appropriate set of predictors has been selected—and if we have an inappropriate set, our interpretations are questionable. It is dangerous, ironically, because it can be the most useful model for the widest variety of data when wielded with caution, wisdom and much interaction between the analyst and the computer program.
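The parenthetical claim about trusting the sign of a coefficient can be tested on a toy data set. The sketch below is an invented illustration, not an analysis from the article: the data are made up so that the outcome equals the first predictor minus twice the second, and the two predictors rise together. The helper functions are hypothetical names that implement ordinary least squares for one and for two predictors.

```python
from statistics import fmean


def _cross_products(a, b):
    """Sum of products of deviations from the means of a and b."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    return _cross_products(x, y) / _cross_products(x, x)


def two_predictor_slopes(x1, x2, y):
    """Least-squares coefficients of y on x1 and x2.

    Centering every variable on its mean fits the intercept implicitly,
    leaving a 2-by-2 system of normal equations solved here directly.
    """
    s11 = _cross_products(x1, x1)
    s22 = _cross_products(x2, x2)
    s12 = _cross_products(x1, x2)
    s1y = _cross_products(x1, y)
    s2y = _cross_products(x2, y)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Made-up data: x2 tracks x1 closely, and y = x1 - 2 * x2 exactly.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.5, 0.8, 2.4, 2.7, 4.3, 4.9]
y = [a - 2.0 * b for a, b in zip(x1, x2)]

print("y on x1 alone:", simple_slope(x1, y))
print("y on x1 and x2:", two_predictor_slopes(x1, x2, y))
```

With the second predictor left out, the slope on the first comes out negative; with both in the model, its coefficient is positive. The sign of a coefficient thus depends on which other predictors are included, which is why an inappropriate set makes even the direction of an effect questionable.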

Yet, as dangerous as Kelley's equation and the common regression equations are, I find de Moivre's equation more perilous still. I arrived at this conclusion because of the extreme length of time over which ignorance of it has caused confusion, the variety of fields that have gone astray and the seriousness of the consequences that such ignorance has caused.

In the balance of this essay I will describe five very different situations in which ignorance of de Moivre's equation has led, over centuries, to billions of dollars of loss and untold hardship. These are but a small sampling; there are many more.