Statisticians can reuse their data to quantify the uncertainty of complex models
Statistics is the branch of applied mathematics that studies ways of drawing inferences from limited and imperfect data. We may want to know how a neuron in a rat’s brain responds when one of its whiskers gets tweaked, or how many rats live in Manhattan, or how high the water will get under the Brooklyn Bridge, or the typical course of daily temperatures in the city over the year. We have some data on all of these things, but we know that our data are incomplete, and experience tells us that repeating our experiments or observations, even taking great care to replicate the conditions, gives more or less different answers every time. It is foolish to treat any inference from only the data in hand as certain.
If all data sources were totally capricious, there’d be nothing to do beyond piously qualifying every conclusion with “but we could be wrong about this.” A mathematical science of statistics is possible because, although repeating an experiment gives different results, some types of results are more common than others; their relative frequencies are reasonably stable. We can thus model the data-generating mechanism through probability distributions and stochastic processes, that is, random series with some indeterminacy about how events might evolve over time, although some paths may be more likely than others. When and why we can use stochastic models are very deep questions, but ones for another time. If we can use them in a problem, then quantities of interest, such as the rat population or the yearly course of temperatures, are represented as “parameters” of the stochastic models. In other words, they are functions of the underlying probability distribution. Parameters can be single numbers, such as the total rat population; vectors; or even whole curves, such as the expected time-course of temperature over the year. Statistical inference comes down to estimating those parameters, or testing hypotheses about them.
These estimates and other inferences are functions of the data values, which means that they inherit variability from the underlying stochastic process. If we “reran the tape” (as Stephen Jay Gould used to say) of an event that happened, we would get different data with a certain characteristic distribution, and applying a fixed procedure would yield different inferences, again with a certain distribution. Statisticians want to use this distribution to quantify the uncertainty of the inferences. For instance, by how much would our estimate of a parameter vary, typically, from one replication of the experiment to another? To be precise, what is the root-mean-square deviation of the estimate from its average value (the square root of the mean of the squared deviations), known as the standard error? Or we could ask, “What are all the parameter values that could have produced these data with at least some specified probability?” In other words, what are all the parameter values under which our data are not low-probability outliers? This gives us the confidence region for the parameter: rather than a point estimate, a promise that either the true parameter point lies in that region, or something very unlikely under any circumstances happened, or our stochastic model is wrong.
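To make the idea of a standard error concrete, here is a minimal Python sketch that literally “reruns the tape.” It assumes, purely for illustration, that we know the data source exactly (independent draws from an exponential distribution with rate 1, in samples of 50), which is never true in practice. The function names and the choice of distribution are hypothetical, not something the discussion above prescribes.

```python
import math
import random
import statistics

def simulate_standard_error(estimator, draw_sample, n_replications=2000, seed=0):
    """Root-mean-square deviation of an estimator from its average value,
    computed over many simulated replications of the experiment."""
    rng = random.Random(seed)
    # Each replication: draw a fresh data set, apply the same fixed procedure.
    estimates = [estimator(draw_sample(rng)) for _ in range(n_replications)]
    mean_estimate = statistics.fmean(estimates)
    squared_deviations = [(e - mean_estimate) ** 2 for e in estimates]
    return math.sqrt(statistics.fmean(squared_deviations))

def draw_exponential_sample(rng, n=50):
    # Assumed (hypothetical) data source: n independent Exponential(rate=1) draws.
    return [rng.expovariate(1.0) for _ in range(n)]

se_mean = simulate_standard_error(statistics.fmean, draw_exponential_sample)
se_median = simulate_standard_error(statistics.median, draw_exponential_sample)
print(f"standard error of the sample mean:   {se_mean:.3f}")
print(f"standard error of the sample median: {se_median:.3f}")
```

The same routine works for any estimator that maps a list of numbers to a number, which is the point: the standard error is a property of the procedure together with the data source, not of any single data set.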
To get standard errors or confidence intervals, we need to know the distribution of our estimates around the true parameters. These sampling distributions follow from the distribution of the data, because our estimates are functions of the data. Mathematically the problem is well defined, but actually computing anything is another story. Estimates are typically complicated functions of the data, and the mathematically convenient distributions may all be poor approximations of the data source. Saying anything in closed form about the distribution of estimates can be simply hopeless. The two classical responses of statisticians have been to focus on tractable special cases, and to appeal to asymptotic analysis, which approximates the sampling distribution by its limiting form as the sample size grows large.
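As a rough illustration of the asymptotic approach, the sketch below compares a simulated standard error with a large-sample formula for one tractable special case. It assumes the data are independent Exponential(1) draws and the estimator is the sample median; for that case the standard large-sample approximation to the standard error is 1 / (2 f(m) √n), where f(m) is the density at the true median m. The setup and function names are illustrative assumptions.

```python
import math
import random
import statistics

def simulated_se_of_median(n, n_replications=2000, seed=1):
    """Standard error of the sample median of n Exponential(1) draws,
    estimated by simulating many replications."""
    rng = random.Random(seed)
    medians = [
        statistics.median([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_replications)
    ]
    # Population standard deviation = RMS deviation from the average.
    return statistics.pstdev(medians)

def asymptotic_se_of_median(n):
    """Large-sample approximation 1 / (2 * f(m) * sqrt(n)) for Exponential(1)."""
    m = math.log(2.0)            # true median of Exponential(1)
    density_at_m = math.exp(-m)  # f(m) = 0.5
    return 1.0 / (2.0 * density_at_m * math.sqrt(n))

for n in (5, 200):
    simulated = simulated_se_of_median(n)
    approximate = asymptotic_se_of_median(n)
    print(f"n={n:4d}  simulated={simulated:.3f}  asymptotic={approximate:.3f}")
```

The formula is only available because the density is known in closed form here; for a complicated estimator or an unknown data source, neither the formula nor the simulation could be written down so easily.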