

The Bootstrap

Statisticians can reuse their data to quantify the uncertainty of complex models

Cosma Shalizi


The bootstrap approximates the sampling distribution, with three sources of approximation error. First there’s simulation error, using finitely many replications to stand for the full sampling distribution. Clever simulation design can shrink this, but brute force (just using enough replications) can also make it arbitrarily small. Second, there’s statistical error: The sampling distribution of the bootstrap reestimates under our fitted model is not exactly the same as the sampling distribution of estimates under the true data-generating process. The sampling distribution changes with the parameters, and our initial fit is not completely accurate. But it often turns out that the distribution of estimates around the truth is more nearly invariant than the distribution of the estimates themselves, so subtracting the initial estimate from the bootstrapped values helps reduce the statistical error; there are many subtler tricks to the same end. The final source of error in bootstrapping is specification error: The data source doesn’t exactly follow our model at all. Simulating the model then never quite matches the actual sampling distribution.
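The centering trick can be made concrete with a small sketch. It assumes the Gaussian model for returns and uses synthetic data; the function names, the sample size, and the number of replicates are illustrative choices, not anything taken from the analysis described here. The code simulates from the fitted model, records how far each reestimate falls from the initial estimate, and uses those differences (rather than the raw reestimates) to build a so-called pivotal interval.

```python
import random
import statistics
from statistics import NormalDist


def gaussian_quantile_estimate(sample, p):
    """Fit a Gaussian by sample mean and standard deviation; return its p-quantile."""
    mu = statistics.fmean(sample)
    sigma = statistics.stdev(sample)
    return NormalDist(mu, sigma).inv_cdf(p)


def parametric_bootstrap_pivotal_ci(sample, p, n_replicates, rng, level=0.95):
    """Pivotal confidence interval for the p-quantile under a fitted Gaussian model."""
    mu = statistics.fmean(sample)
    sigma = statistics.stdev(sample)
    q_hat = NormalDist(mu, sigma).inv_cdf(p)
    n = len(sample)

    # Simulate from the fitted model and record (reestimate - initial estimate).
    errors = []
    for _ in range(n_replicates):
        simulated = [rng.gauss(mu, sigma) for _ in range(n)]
        errors.append(gaussian_quantile_estimate(simulated, p) - q_hat)
    errors.sort()

    alpha = 1.0 - level
    lo_idx = int((alpha / 2) * n_replicates)
    hi_idx = min(n_replicates - 1, int((1 - alpha / 2) * n_replicates))

    # Treat the error distribution as (nearly) the same around the truth:
    # truth is roughly q_hat minus a typical error.
    return q_hat, (q_hat - errors[hi_idx], q_hat - errors[lo_idx])


rng = random.Random(1)
# Synthetic stand-in for daily log returns (an assumption, not real market data).
sample = [rng.gauss(0.0, 0.014) for _ in range(500)]
q_hat, (lower, upper) = parametric_bootstrap_pivotal_ci(sample, 0.01, 500, rng)
print(q_hat, lower, upper)
```

Raising `n_replicates` shrinks only the simulation error; the statistical and specification errors stay whatever the fitted Gaussian makes them.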

Here Efron had a second brilliant idea, which is to address specification error by replacing simulation from the model with resampling from the data. After all, our initial collection of data gives us a lot of information about the relative probabilities of different values, and in certain senses this “empirical distribution” is actually the least prejudiced estimate possible of the underlying distribution—anything else imposes biases or preconceptions, which are possibly accurate but also potentially misleading. We could estimate q0.01 directly from the empirical distribution, without the mediation of the Gaussian model. Efron’s “nonparametric bootstrap” treats the original data set as a complete population and draws a new, simulated sample from it, picking each observation with equal probability (allowing repeated values) and then re-running the estimation (as shown in Figure 2).
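A minimal sketch of that resampling loop follows, using only Python's standard library. The heavy-tailed synthetic returns, the simple order-statistic rule for the quantile, the 1,000 replicates, and the percentile-style interval are all assumptions made for illustration; they are not the data or the exact interval construction behind the figures quoted in this article.

```python
import random
import statistics


def empirical_quantile(values, p):
    """Return the p-quantile of values using a simple order-statistic rule."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, max(0, int(p * len(ordered))))
    return ordered[k]


def nonparametric_bootstrap(data, estimator, n_replicates, rng):
    """Resample the data with replacement and re-run the estimator each time."""
    n = len(data)
    # rng.choices draws n observations, each with equal probability,
    # allowing the same observation to appear more than once.
    return [estimator(rng.choices(data, k=n)) for _ in range(n_replicates)]


rng = random.Random(0)
# Synthetic returns: mostly small moves, occasionally large ones (an assumption).
returns = [rng.gauss(0.0, 0.03 if rng.random() < 0.1 else 0.008)
           for _ in range(2000)]

q_hat = empirical_quantile(returns, 0.01)
reps = nonparametric_bootstrap(
    returns, lambda s: empirical_quantile(s, 0.01), 1000, rng)

std_error = statistics.stdev(reps)
ci = (empirical_quantile(reps, 0.025), empirical_quantile(reps, 0.975))
print(q_hat, std_error, ci)
```

No distributional model appears anywhere in the loop: the only inputs are the observed values and the estimator, which is why the same function works unchanged for any other statistic.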

This new method matters here because the Gaussian model is inaccurate; the true distribution is more sharply peaked around zero and has substantially more large-magnitude returns, in both directions, than the Gaussian (see the top graph in Figure 3). For the empirical distribution, q0.01=–0.0392. This may seem close to our previous point estimate of –0.0326, but it’s well beyond the confidence interval, and under the Gaussian model we should see values that negative only 0.25 percent of the time, not 1 percent of the time. Doing 100,000 nonparametric replicates (that is, resampling from the data and reestimating q0.01 that many times) gives a very non-Gaussian sampling distribution (as shown in the right graph of Figure 3), yielding a standard error of 0.00364 and a 95 percent confidence interval of (–0.0477, –0.0346).

Although this is more accurate than the Gaussian model, it’s still a really simple problem. Conceivably, some other nice distribution fits the returns better than the Gaussian, and it might even have analytical sampling formulas. The real strength of the bootstrap is that it lets us handle complicated models, and complicated questions, in exactly the same way as this simple case.

To continue with the financial example, a question of perennial interest is predicting the stock market. Figure 4 is a scatter plot of the log returns on successive days, the return for today being on the horizontal axis and that of tomorrow on the vertical. It’s mostly just a big blob, because the market is hard to predict, but I have drawn two lines through it: a straight one in blue, and a curved one in black. These lines try to predict the average return tomorrow as functions of today’s return; they’re called regression lines or regression curves. The straight line is the linear function that minimizes the mean-squared prediction error, that is, the average of the squared differences between predicted and actual returns over all the data points (the least-squares method). Its slope is negative (–0.0822), indicating that days with below-average returns tend to be followed by ones with above-average returns and vice versa, perhaps because people try to buy cheap after the market falls (pushing it up) and sell dear when it rises (pulling it down). Linear regressions with Gaussian fluctuations around the prediction function are probably the best-understood of all statistical models (their oldest forms go back two centuries now), but they’re more venerable than accurate.
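The least-squares line for today's and tomorrow's returns can be sketched in a few lines of standard-library Python. The series below is synthetic, generated with a slight built-in reversion purely for illustration, and the helper names `lagged_pairs` and `least_squares_line` are hypothetical.

```python
import random


def lagged_pairs(series):
    """Pair each value (today) with its successor (tomorrow)."""
    return series[:-1], series[1:]


def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the mean-squared prediction error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(2)
# Synthetic returns with mild mean reversion (an assumption, not market data).
series = [0.0]
for _ in range(2000):
    series.append(-0.08 * series[-1] + rng.gauss(0.0, 0.01))

today, tomorrow = lagged_pairs(series)
intercept, slope = least_squares_line(today, tomorrow)
print(intercept, slope)
```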

The black curve is a nonlinear estimate of the regression function, coming from a constrained optimization procedure called spline smoothing: Find the function that minimizes the prediction error, while capping the value of the average squared second derivative. As the constraint tightens, the optimal curve, the spline, straightens out, approaching the linear regression; as the constraint loosens, the spline wiggles to try to pass through each data point. (A spline was originally a flexible length of wood craftsmen used to draw smooth curves, fixing it to the points the curve had to go through and letting it flex to minimize elastic energy; stiffer splines yielded flatter curves, corresponding mathematically to tighter constraints.)
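In symbols, the optimization just described can be sketched as follows. The notation is assumed here rather than taken from the article: m is a candidate regression function, (x_i, y_i) are the n pairs of today's and tomorrow's returns, c is the cap on the curvature, and the integral stands in for the average squared second derivative up to a normalizing constant.

```latex
\hat{m}_c \;=\; \operatorname*{argmin}_{m} \; \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - m(x_i)\bigr)^2
\quad \text{subject to} \quad \int \bigl(m''(x)\bigr)^2 \, dx \;\le\; c

% Equivalent penalized (Lagrangian) form, with \lambda \ge 0:
\hat{m}_\lambda \;=\; \operatorname*{argmin}_{m} \; \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - m(x_i)\bigr)^2
\;+\; \lambda \int \bigl(m''(x)\bigr)^2 \, dx
```

A small c (equivalently, a large λ) forces the second derivative toward zero and so recovers the straight line; a large c (small λ) lets the curve bend toward every data point.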

To actually get the spline, I need to pick the level of the constraint. Too little smoothing, and I get an erratic curve that memorizes the sample but won’t generalize to new data; but too much smoothing erases real and useful patterns. I set the constraint through cross-validation: Remove one point from the data, fit multiple curves with multiple values of the constraint to the other points, and then see which curve best predicts the left-out point. Repeating this for each point in turn shows how much curvature the spline needs in order to generalize properly. In this case, we can see that we end up selecting a moderate amount of wiggliness; like the linear model, the spline predicts reversion in the returns but suggests that it’s asymmetric: days of large negative returns being followed, on average, by bigger positive returns than the other way around. This might be because people are more apt to buy low than to sell high, but we should check that this is a real phenomenon before reading much into it.
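The leave-one-out procedure can be sketched as below. To stay within the standard library, the sketch swaps the smoothing spline for a much simpler Gaussian-kernel smoother, whose bandwidth plays the role of the constraint; the swap, the synthetic data, and the candidate bandwidths are all assumptions made for illustration.

```python
import math
import random


def kernel_smooth(x_train, y_train, x0, bandwidth):
    """Predict at x0 as a Gaussian-kernel weighted average of the training y values."""
    weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed; fall back to the plain average.
        return sum(y_train) / len(y_train)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total


def loocv_error(x, y, bandwidth):
    """Mean squared error when each point is predicted from all the others."""
    n = len(x)
    squared_errors = 0.0
    for i in range(n):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        prediction = kernel_smooth(x_rest, y_rest, x[i], bandwidth)
        squared_errors += (y[i] - prediction) ** 2
    return squared_errors / n


def choose_bandwidth(x, y, candidates):
    """Pick the candidate amount of smoothing with the smallest left-out error."""
    return min(candidates, key=lambda h: loocv_error(x, y, h))


rng = random.Random(3)
# Synthetic curved relationship plus noise (an assumption, not market data).
x = [i * 2 * math.pi / 59 for i in range(60)]
y = [math.sin(xi) + rng.gauss(0.0, 0.3) for xi in x]

candidates = [0.02, 0.3, 5.0]
best = choose_bandwidth(x, y, candidates)
print(best, [loocv_error(x, y, h) for h in candidates])
```

The smallest bandwidth corresponds to too little smoothing and the largest to too much; cross-validation scores each by how well it predicts points it never saw.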

There are three things we should note about spline smoothing. First, it’s much more flexible than just fitting a straight line to the data; splines can approximate a huge range of functions to an arbitrary tolerance, so they can discover complicated nonlinear relationships, such as asymmetry, without guessing in advance what to look for. Second, there was no hope of using a smoothing spline on substantial data sets before fast computers, although now the estimation, including cross-validation, takes less than a second on a laptop. Third, the estimated spline depends on the data in two ways: Once we decide how much smoothing to do, it tries to match the data within the constraint; but we also use the data to decide how much smoothing to do. Any quantification of uncertainty here should reckon with both effects.

There are multiple ways to use bootstrapping to get uncertainty estimates for the spline, depending on what we’re willing to assume about the system. Here I will be cautious and fall back on the safest and most straightforward procedure: Resample the points of the scatter plot (possibly getting multiple copies of the same point), and rerun the spline smoother on this new data set. Each replication will give a different amount of smoothing and ultimately a different curve. Figure 5 shows the individual curves from 800 bootstrap replicates, indicating the sampling distribution, together with 95 percent confidence limits for the curve as a whole. The overall negative slope and the asymmetry between positive and negative returns are still there, but we can also see that our estimated curve is much better pinned down for small-magnitude returns, where there are lots of data, than for large-magnitude returns, where there’s little information and small perturbations can have more effect.
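Resampling the points of the scatter plot can be sketched generically: the function below accepts any fitting routine that returns a prediction function. For brevity the demonstration plugs in a straight-line fit on synthetic data, and the band it computes is pointwise (taken separately at each grid value), which is simpler than limits for the curve as a whole. All of these are assumptions for illustration; to capture both sources of uncertainty in a smoother, the `fit` argument would need to redo the choice of smoothing level inside every replicate.

```python
import random


def resample_pairs(x, y, rng):
    """Draw n (x, y) points with replacement, keeping each pair together."""
    n = len(x)
    idx = [rng.randrange(n) for _ in range(n)]
    return [x[i] for i in idx], [y[i] for i in idx]


def fit_line(x, y):
    """Least-squares line; returns a function that predicts y at a given x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x0: intercept + slope * x0


def bootstrap_curves(x, y, fit, grid, n_replicates, rng):
    """Refit on each resampled data set and evaluate the curve on a fixed grid."""
    curves = []
    for _ in range(n_replicates):
        xb, yb = resample_pairs(x, y, rng)
        predict = fit(xb, yb)
        curves.append([predict(g) for g in grid])
    return curves


def pointwise_band(curves, level=0.95):
    """Lower and upper percentile limits of the curves at each grid value."""
    n = len(curves)
    lo_idx = int((1 - level) / 2 * n)
    hi_idx = min(n - 1, int((1 + level) / 2 * n))
    lower, upper = [], []
    for j in range(len(curves[0])):
        column = sorted(curve[j] for curve in curves)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return lower, upper


rng = random.Random(4)
# Synthetic (today, tomorrow) returns with mild reversion (an assumption).
x = [rng.gauss(0.0, 0.01) for _ in range(300)]
y = [-0.08 * xi + rng.gauss(0.0, 0.01) for xi in x]

grid = [-0.03, -0.01, 0.0, 0.01, 0.03]
curves = bootstrap_curves(x, y, fit_line, grid, 800, rng)
lower, upper = pointwise_band(curves)
print(list(zip(grid, lower, upper)))
```

Even in this toy version the band is wider at the extreme grid values than near zero, echoing the point that the curve is better pinned down where the data are dense.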
