COMPUTING SCIENCE

# The Bootstrap

Statisticians can reuse their data to quantify the uncertainty of complex models

## Origin Myths

If you’ve taken an elementary statistics course, you were probably drilled in the special cases. From one end of the possible set of solutions, we can limit the kinds of estimator we use to those with a simple mathematical form—say, mean averages and other linear functions of the data. From the other, we can assume that the probability distributions featured in the stochastic model take one of a few forms for which exact calculation *is* possible, either analytically or via tables of special functions. Most such distributions have origin myths: The Gaussian bell curve arises from averaging many independent variables of equal size (say, the many genes that contribute to height in humans); the Poisson distribution comes from counting how many of a large number of independent and individually improbable events have occurred (say, radium nuclei decaying in a given second), and so on. Squeezed from both ends, the sampling distribution of estimators and other functions of the data becomes exactly calculable in terms of the aforementioned special functions.

That these origin myths invoke various limits is no accident. The great results of probability theory—the laws of large numbers, the ergodic theorem, the central limit theorem and so on—describe limits in which *all* stochastic processes in broad classes of models display the same asymptotic behavior. The central limit theorem (CLT), for instance, says that if we average more and more independent random quantities with a common distribution, and if that common distribution is not too pathological, then the distribution of their means approaches a Gaussian. (The non-Gaussian parts of the distribution wash away under averaging, but the average of two Gaussians is another Gaussian.) Typically, as in the CLT, the limits involve taking more and more data from the source, so statisticians use the theorems to find the asymptotic, large-sample distributions of their estimates. We have been especially devoted to rewriting our estimates as averages of independent quantities, so that we can use the CLT to get Gaussian asymptotics. Refinements to such results would consider, say, the rate at which the error of the asymptotic Gaussian approximation shrinks as the sample sizes grow.
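The central limit theorem's claim is easy to check numerically. The following is a minimal sketch, not part of the original analysis: it averages `n` independent draws from an exponential distribution (an arbitrary choice of a skewed, non-Gaussian source) and reports how the skewness of those averages shrinks toward the Gaussian value of zero as `n` grows. The function names, sample sizes, and seed are illustrative assumptions.

```python
import random
import statistics


def sample_means(n, replicates, rng):
    """Return `replicates` averages, each taken over n IID Exponential(1) draws."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replicates)
    ]


def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([(v - m) ** 3 for v in values]) / s ** 3


rng = random.Random(0)  # fixed seed, chosen arbitrarily, for repeatability
skew_by_n = {n: skewness(sample_means(n, 5000, rng)) for n in (1, 4, 16, 64)}

for n, skew in skew_by_n.items():
    # A Gaussian has skewness 0; for this source the theoretical value is 2 / sqrt(n).
    print(f"n = {n:3d}   skewness of the averages = {skew:.3f}")
```

Skewness is only one way to measure distance from a Gaussian; it is used here because it has a simple closed form for this source, which makes the simulated values easy to compare with theory.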

To illustrate the classical approach and the modern alternatives, I’ll introduce some data: The daily closing prices of the Standard and Poor’s 500 stock index from October 1, 1999, to October 20, 2009. (I use these data because they happen to be publicly available and familiar to many readers, not to impart any kind of financial advice.) Professional investors care more about changes in prices than their level, specifically the *log returns*, the log of the price today divided by the price yesterday. For this time period of 2,529 trading days, there are 2,528 such values *(see Figure 1).* The “efficient market hypothesis” from financial theory says the returns can’t be predicted from any public information, including their own past values. In fact, many financial models assume such series are sequences of independent, identically distributed (IID) Gaussian random variables. Fitting such a model yields the distribution function in the lower left graph of Figure 1.
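As a rough illustration of the two steps just described, the sketch below computes log returns from a price series and fits an IID Gaussian by maximum likelihood (the sample mean and standard deviation). The price series is simulated purely so that the example runs; it is not the S&P 500 data, and the function names `log_returns` and `fit_iid_gaussian` are hypothetical.

```python
import math
import random
import statistics


def log_returns(prices):
    """Log of each day's price divided by the previous day's price."""
    return [math.log(today / yesterday)
            for yesterday, today in zip(prices, prices[1:])]


def fit_iid_gaussian(returns):
    """Maximum-likelihood Gaussian fit: sample mean and (population) standard deviation."""
    return statistics.fmean(returns), statistics.pstdev(returns)


# Hypothetical price series, simulated only so the example runs;
# these are NOT the S&P 500 closing prices.
rng = random.Random(1)
prices = [100.0]
for _ in range(2528):
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, 0.01)))

returns = log_returns(prices)
mu_hat, sigma_hat = fit_iid_gaussian(returns)

print(len(prices), len(returns))  # 2,529 prices give 2,528 returns
print(f"fitted mean = {mu_hat:.5f}, fitted sd = {sigma_hat:.5f}")
```

With real data, `prices` would simply be replaced by the list of daily closing values in date order.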

An investor might want to know, for instance, how bad the returns could be. The lowest conceivable log return is negative infinity (with all the stocks in the index losing all value), but most investors worry less about an apocalyptic end of American capitalism than about large-but-still-typical losses: say, how bad are the smallest 1 percent of daily returns? Call this number $q_{0.01}$; if we know it, we know that we will do better about 99 percent of the time, and we can see whether we can handle occasional losses of that magnitude. (There are about 250 trading days in a year, so we should expect two or three days at least that bad in a year.) From the fitted distribution, we can calculate that $q_{0.01} = -0.0326$, or, undoing the logarithm, a 3.21 percent loss. How uncertain is this point estimate? The Gaussian assumption lets us calculate the asymptotic sampling distribution of $q_{0.01}$, which turns out to be another Gaussian *(see the lower right graph in Figure 1)*, implying a standard error of ±0.00104. The 95 percent confidence interval is $(-0.0347, -0.0306)$: Either the real $q_{0.01}$ is in that range, or our data set is one big fluke (at 1-in-20 odds), or the IID-Gaussian model is wrong.
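The arithmetic in this paragraph can be retraced in a few lines. The sketch below takes the point estimate and standard error quoted in the text as given (it does not derive the standard error), then undoes the logarithm, counts the expected number of bad days per year, and builds the Gaussian confidence interval. The helper names are hypothetical, and the interval agrees with the quoted one only up to rounding of the inputs.

```python
import math
from statistics import NormalDist


def gaussian_quantile(mu, sigma, p):
    """p-th quantile of a Gaussian with mean mu and standard deviation sigma."""
    return NormalDist(mu, sigma).inv_cdf(p)


def percent_loss(log_return):
    """Undo the logarithm: convert a log return into a percentage loss."""
    return 100.0 * (1.0 - math.exp(log_return))


def normal_interval(estimate, std_error, level=0.95):
    """Symmetric Gaussian confidence interval around an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# The 1 percent quantile of any Gaussian sits about 2.326 standard
# deviations below its mean, so the fitted quantile is mu + z_01 * sigma.
z_01 = gaussian_quantile(0.0, 1.0, 0.01)

q_hat = -0.0326   # point estimate quoted in the text
se_hat = 0.00104  # standard error quoted in the text, taken as given

loss = percent_loss(q_hat)                  # about 3.21 percent
bad_days_per_year = 0.01 * 250              # 2.5, i.e. "two or three"
low, high = normal_interval(q_hat, se_hat)  # close to the quoted interval

print(f"z_01 = {z_01:.3f}, loss = {loss:.2f}%, bad days/year = {bad_days_per_year}")
print(f"95% interval = ({low:.4f}, {high:.4f})")
```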
