The Statistical Crisis in Science
Data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don't hold up.
There is a growing realization that reported “statistically significant” claims in scientific publications are routinely mistaken. Researchers typically express the confidence in their data in terms of a p-value: the probability that a perceived result is actually the result of random variation. The value of p (for “probability”) is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p-value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear.
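To make the idea concrete, here is a minimal sketch (not from the article) of one standard way a p-value can be computed: a permutation test, which asks how often shuffled data would produce a difference at least as large as the one observed. The function name and the sample values are hypothetical, chosen purely for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test: the fraction of random relabelings of the
    pooled data whose difference in means is at least as extreme as the
    observed difference. Under the null hypothesis (labels are irrelevant),
    this fraction estimates the p-value."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            count += 1
    return count / n_perm

# Two small, similar samples: shuffling the labels easily matches the
# observed difference, so the p-value comes out large.
a = [5.1, 4.8, 5.6, 5.0, 4.9]
b = [5.0, 5.2, 4.7, 5.1, 4.8]
print(permutation_p_value(a, b))
```

A small p-value from such a test says only that the observed difference would be unusual if labels were exchangeable; as the article goes on to argue, that guarantee quietly depends on the analysis having been fixed in advance.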
The idea is that when p is less than some prespecified value such as 0.05, the null hypothesis is rejected by the data, allowing researchers to claim strong evidence in favor of the alternative. The concept of
p-values was originally developed by statistician Ronald Fisher in the 1920s in the context of his research on crop variance in Hertfordshire, England. Fisher offered the idea of
p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist,
p-values are now often used to lend credence to noisy claims based on small samples.
p-values are based on what would have happened under other possible data sets. As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military. The question may be framed nonspecifically as an investigation of possible associations between party affiliation and mathematical reasoning across contexts. The null hypothesis is that the political context is irrelevant to the task, and the alternative hypothesis is that context matters and the difference in performance between the two parties would be different in the military and healthcare contexts.
At this point a huge number of possible comparisons could be performed, all consistent with the researcher’s theory. For example, the null hypothesis could be rejected (with statistical significance) among men and not among women—explicable under the theory that men are more ideological than women. The pattern could be found among women but not among men—explicable under the theory that women are more sensitive to context than men. Or the pattern could be statistically significant for neither group, but the difference between men and women could itself be significant (still fitting the theory, as described above). Or the effect might only appear among men who are being questioned by female interviewers.
We might see a difference between the sexes in the healthcare context but not the military context; this would make sense given that health care is currently a highly politically salient issue and the military is less so. And how are independents and nonpartisans handled? They could be excluded entirely, depending on how many were in the sample. And so on: A single overarching research hypothesis—in this case, the idea that issue context interacts with political partisanship to affect mathematical problem-solving skills—corresponds to many possible choices of a decision variable.
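The cost of all these possible comparisons is easy to quantify with a simulation. The sketch below (an illustration of ours, not the article's) draws "p-values" directly as uniform random numbers, which is exactly how p-values behave when the null hypothesis is true, and estimates how often at least one of several comparisons clears the 0.05 bar by chance alone.

```python
import random

def any_significant(n_comparisons, alpha=0.05, n_sims=20_000, seed=1):
    """Under a true null hypothesis, each test's p-value is uniform on
    [0, 1]. Estimate the probability that at least one of n_comparisons
    independent tests comes out 'significant' at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        if any(rng.random() < alpha for _ in range(n_comparisons)):
            hits += 1
    return hits / n_sims

print(any_significant(1))   # ≈ 0.05: a single preregistered test is calibrated
print(any_significant(10))  # ≈ 0.40: ten available comparisons, one usually "works"
```

The exact figure is 1 − (0.95)^10 ≈ 0.40: with ten plausible comparisons available, pure noise yields at least one publishable "finding" about 40 percent of the time.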
This issue is well known in statistics and has been called “p-hacking” in an influential 2011 paper by the psychology researchers Joseph Simmons, Leif Nelson, and Uri Simonsohn. Our main point in the present article is that it is possible to have multiple potential comparisons (that is, a data analysis whose details are highly contingent on data, invalidating published
p-values) without the researcher performing any conscious procedure of fishing through the data or explicitly examining multiple comparisons.
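This point—that a single reported test can still be invalidated by data-dependent choices—can also be simulated. In the hedged sketch below (ours, not the article's; the subgroup labels are hypothetical), an analyst runs only one test, but picks whichever subgroup looks more promising after seeing the data. No explicit fishing occurs, yet the error rate doubles.

```python
import random

def forked_error_rate(n_sims=50_000, seed=2):
    """Under a true null, each subgroup's test statistic is roughly a
    standard normal z. The analyst examines both subgroups but reports
    only the more extreme one, testing it at the usual |z| > 1.96 cutoff.
    Only one p-value is ever published, but the data-dependent choice
    gives two chances to cross the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        z_men = rng.gauss(0.0, 1.0)    # hypothetical subgroup statistics
        z_women = rng.gauss(0.0, 1.0)  # under a true null effect of zero
        if max(abs(z_men), abs(z_women)) > 1.96:
            hits += 1
    return hits / n_sims

print(forked_error_rate())  # ≈ 0.10, double the nominal 5% error rate
```

The analyst here would sincerely report a single comparison at p < 0.05; the inflation comes entirely from the fork in the road that the data themselves selected.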