Turning Scientific Perplexity into Ordinary Statistical Uncertainty
PRINCIPLES OF APPLIED STATISTICS. D. R. Cox and Christl A. Donnelly. x + 202 pp. Cambridge University Press, 2011. $39.99.
D. R. Cox published his first major book, Planning of Experiments, in 1958; he has been making major contributions to the theory and practice of statistics for as long as most current statisticians have been alive. He is now in a reflective phase of his career, and this book, coauthored with the distinguished biostatistician Christl A. Donnelly, is a valuable distillation of his experience of applied work. It stands as a summary of an entire tradition of using statistics to address scientific problems.
Statistics is a branch of applied mathematics that studies how to draw reliable inferences from partial or noisy data. The field as we know it arose from several strands of scholarship. The word “statistics,” coined in the 1770s, originally referred to the study of the human populations of states and the resources those populations offered: how many men, in what physical condition, with what life expectancies, what wealth and so on. Practitioners soon learned that there was always variation within populations, that there were stable patterns to this variation and that there were relations between these variables. (For instance, richer men tended to be taller and live longer.) Another component strand was formed when scientists began to systematically analyze or “reduce” scientific data from multiple observers or observations (especially astronomical data). It became obvious from this research that there was always variation from one observation to the next, even in controlled experiments, but again, there were patterns to the variation. In both cases, probability theory provided very useful models of the variation. Statistics was born from the weaving together of these three strands: population variability, experimental noise and probability models. The field’s mathematical problems are about how, within a probability model, one might soundly infer something about a given process from the data the model generates, and at the same time quantify how uncertain that inference is.
Applied statistics, in the sense that Cox and Donnelly profess, is about turning vexed scientific (or engineering) questions into statistical problems, and then turning those problems’ solutions into answers to the original questions. The sometimes conflicting aims are to make sure that the statistical problem is well posed enough that it can be solved, and that its solution still helps resolve the original, substantive dilemma—which is, after all, the point.
Rather than spoiling any of Cox and Donnelly’s examples, I will sketch one that recently came up in my department. The scientific question was “Are large lesions of type X in organ Z a sign of disease D?” This is far too inexact to be a statistical problem. Large compared to what: lesions of another type in the same people? Lesions of type X in people without the disease? Is the enlargement a direct marker of the disease? If so, what is the mechanism? If not, might enlargement be a side effect of something that causes the condition? What makes a lesion large—its volume, mass, diameter? How does one deal with multiple lesions in a single person? Once we agree on metrics for the size of lesions, how well can we measure them? Is the disease really an all-or-nothing affair, or might it come in stages or grades that might be related to the lesions? How do we know who has the disease, anyway? How should size be compared: differences, ratios, the number of lesions larger than some threshold? How precise is our comparison, and how much uncertainty does it inherit from our originally imprecise measurements and limited set of patients? To frame this question as an applied statistics problem, these points must all be settled, at least in a preliminary way.
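To make a few of those choices concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the diameters are invented numbers rather than data from the study, "size" is taken to be diameter, multiple lesions are reduced to each person's largest one, and the helper names `per_person_size` and `compare_groups` are hypothetical. It shows only that differences, ratios and threshold counts are distinct summaries of the same measurements, each answering a slightly different question.

```python
import numpy as np

# Invented lesion diameters (arbitrary units), one inner list per person.
# These numbers are made up purely for illustration.
diseased = [[4.0, 6.0], [8.0], [5.0, 7.0, 3.0]]
healthy = [[2.0, 3.0], [4.0], [5.0, 1.0]]


def per_person_size(people, summary=max):
    """Reduce each person's lesions to one number (here: the largest lesion)."""
    return np.array([summary(lesions) for lesions in people])


def compare_groups(group_a, group_b, threshold):
    """Three different ways of comparing lesion size between two groups."""
    a = per_person_size(group_a)
    b = per_person_size(group_b)
    return {
        "difference_of_means": a.mean() - b.mean(),
        "ratio_of_means": a.mean() / b.mean(),
        # Share of people whose largest lesion exceeds the threshold.
        "fraction_above_in_a": float((a > threshold).mean()),
        "fraction_above_in_b": float((b > threshold).mean()),
    }


summary = compare_groups(diseased, healthy, threshold=5.5)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Swapping `max` for another per-person summary, or moving the threshold, changes the answers, which is why such choices have to be settled before the statistical problem is well posed.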
For translating between scientific questions and statistical problems, we don’t (yet!) have much useful mathematics or many algorithms, but we do have good heuristics and traditions. (For my example, these include comparing individuals with the disease to those without, making sure you know how variable lesions are within both groups, and avoiding diagnosing the disease by looking at lesions.) The point of Principles of Applied Statistics is to convey those traditions. It is thus quite appropriate that the first half of the book contains very few equations. Rather, the first five chapters go into the elements necessary to make the translation: understanding what, exactly, the scientific question is getting at (chapter 1); understanding where the data came from, and weighing the pros and cons of different data-gathering tactics (chapters 2 and 3); finding clarity about exactly what was measured and how (chapter 4); and maintaining quality control over data collection and data storage, or, failing that, at least understanding the errors (chapter 5).
Only in the second half of the book (chapters 6–10) do statistical models and methods, the subject of most textbooks, make their appearance, and with them a few select equations. The book is structured this way partly because Cox has already written many books about these matters (most recently Principles of Statistical Inference, 2006), but the choice also shows a sound sense of proportion. For most people with decent mathematical training, learning the standard models and methods is not that hard. (These days, computing can substitute for math; I know a well-regarded statistician who can barely do integrals but is good at simulation.) What is harder is learning to match models and methods to the actual problem. Our most common method for relating one variable to another is called regression. It gives us models where one variable is a function of others but is also perturbed by unpredictable fluctuations, which must be accounted for. If we are mostly interested in the relation between two variables, what others should we include as controls, and which ones should we exclude? When and how should we incorporate constraints on the regression function suggested by prior scientific opinions? How can we check those constraints? When there are multiple ways of representing the same function, how can we pick one that will be easy to interpret in terms of the original scientific question? These issues of formulation and specification are the real subject of the book’s second half.
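The question of which controls to include can be illustrated with a small simulation. This is a sketch under stated assumptions, not an example from the book: the data are simulated, the variable `z` is a hypothetical confounder that influences both `x` and `y`, the coefficients 0.8 and 2.0 are arbitrary, and `fit_ols` and `simulate` are illustrative helper names. The regression is fitted by ordinary least squares using NumPy.

```python
import numpy as np


def fit_ols(columns, y):
    """Least-squares coefficients for y ~ intercept + columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def simulate(n=5000, true_effect=1.0, seed=0):
    """Simulated data in which z drives both x and y (all values assumed)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                    # hypothetical confounder
    x = 0.8 * z + rng.normal(size=n)          # x depends partly on z
    noise = rng.normal(size=n)                # unpredictable fluctuation
    y = true_effect * x + 2.0 * z + noise     # y depends on x and on z
    return x, y, z


x, y, z = simulate()
slope_without_control = fit_ols([x], y)[1]     # y regressed on x alone
slope_with_control = fit_ols([x, z], y)[1]     # y regressed on x and z

print(f"slope of y on x, ignoring z:    {slope_without_control:.2f}")
print(f"slope of y on x, controlling z: {slope_with_control:.2f}")
```

In this setup the slope that ignores `z` comes out well above the true value of 1.0, because it absorbs part of the effect of `z`, while the slope that controls for `z` lands close to 1.0. The simulation says nothing about which variables to control for in a real problem; that still depends on the science.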
Nothing that Cox and Donnelly have to say about these matters is revolutionary—it’s not supposed to be. All good applied statisticians know that they need to check that their models make sense not just as formal mathematical objects but as substantive representations of the problem being studied. They know that statistically significant results may be scientifically trivial, and vice versa; that picking a meaningful parameterization is often more helpful than using the most efficient procedure; and so forth. The point of the book is to share these fruits of experience in this tradition with newcomers, so that they can make original mistakes.
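The gap between statistical and scientific significance is easy to demonstrate. The sketch below is my own illustration with simulated data, not an example from the book. It assumes two groups of a million observations each whose true means differ by one hundredth of a standard deviation, and it uses a simple two-sample z-test with a normal approximation; the function name `two_sample_z` is hypothetical.

```python
import math

import numpy as np


def two_sample_z(a, b):
    """Difference in means, z statistic and two-sided normal p-value."""
    diff = b.mean() - a.mean()
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of a standard normal
    return diff, z, p


rng = np.random.default_rng(1)
n = 1_000_000                                  # assumed (very large) group size
control = rng.normal(0.0, 1.0, n)
treated = rng.normal(0.01, 1.0, n)             # assumed true shift: 0.01 sd

diff, z, p = two_sample_z(control, treated)
print(f"estimated difference: {diff:.4f} standard deviations")
print(f"z statistic: {z:.1f}, two-sided p-value: {p:.2g}")
```

With samples this large the test rejects the null hypothesis decisively, yet the estimated difference remains a tiny fraction of the spread within either group. Whether a shift of that size matters is a substantive question that the p-value cannot answer.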
At some points I would depart from Cox and Donnelly—I wish they had made more systematic use of graphical models when discussing causality and dependence among variables, and my experience with simple nonparametric smoothing has evidently been better than theirs—but these are small things. The big thing is their vision of what good applied statistics looks like: formulating important scientific questions as problems about carefully designed statistical models, and then solving them by applying thoughtfully selected methods to high-quality data, all while being clear about the uncertainty inherent in statistical inferences. There are other traditions of data analysis and even of applied statistics—what’s now called “data mining” is one—with different virtues. But the applied statistics tradition that Cox and Donnelly describe has been uniquely successful in helping scientists understand the world. If you do not have a few years to spend apprenticed to a master, I can think of few better ways of being initiated into that tradition than reading Principles of Applied Statistics.
Cosma Shalizi is an assistant professor in the statistics department at Carnegie Mellon University and an external professor at the Santa Fe Institute. He is writing a book on the statistical analysis of complex systems models. His blog, Three-Toed Sloth, can be found at http://bactra.org/weblog/.