Nearly 150 years ago, Charles Darwin speculated that movements such as smiling or frowning are not just the consequences of emotions, but can also cause them. Until the advent of controlled psychological experimentation, it was difficult to find evidence for or against this facial feedback hypothesis. Even when controlled experiments on the facial feedback hypothesis were done in the 1970s, it was impossible to be sure that the hypothesis was properly tested—how could we know that the physical movement was the cause of an emotion rather than the fact that you were explicitly told to “smile”?
In 1988, F. Strack, L. L. Martin, and S. Stepper ingeniously solved this problem. They had subjects do a task while holding a pen in their mouth, either with their teeth (which would cause a smile) or just with their lips (which would prevent a smile). Subjects had no idea that the experiment had anything to do with emotions. Even so, subjects in the Teeth (smile) condition rated cartoons as significantly funnier than those in the Lips (no smile) condition. A striking confirmation of a century-old theory!
The facial feedback hypothesis quickly became a classic result in experimental psychology, taught in Introduction to Psychology courses worldwide.
However, just last year, an extensive effort to replicate the effect through 17 separate preregistered replications of the study, by Eric-Jan Wagenmakers and colleagues, failed to find evidence for the facial feedback hypothesis. More recently, Ulrich Schimmack and Yue Chen conducted an analysis of the scientific literature on the facial feedback hypothesis to determine the replicability of the effect, concluding that “there was never convincing evidence for the effect.”
This classic and fundamental effect has now dissolved into the ether.
This vanishing act is far from the only one out there; there is now broad consensus that many fields face a problem with reproducibility. Estimates are that up to two-thirds of peer-reviewed studies in psychology, and between 20 percent and 50 percent of studies in medicine, cannot be reproduced. Indeed, the majority of scientists surveyed in Nature last year had failed to reproduce at least one previously published result (which in many cases was their own result originally).
How can this be?
There are four main reasons why an experimental study may not be reproducible: fraud, poor methodology, control sensitivity, and random variation. Outright fraud is not replicable for good reason, and a culture of replication in science will severely limit the ability of researchers to perpetrate it.
Poor methodology, alas, is rife. There is a plethora of analytic and statistical pitfalls that are very easy to fall into, and even the best and most careful researchers do so from time to time. In many cases, proper method can require drastic changes from current practices and is often much more expensive and time-consuming, so improving overall standards meets with a great deal of inertia in the scientific community.
Control sensitivity describes when minute details of experimental setup can strongly affect experimental results. Fields where this shortcoming is an issue may need better methods for describing experimental controls and techniques. As well, the relevance of research results under real-world conditions may need to be carefully evaluated.
Finally, random variation is always possible. Some small fraction of experimental samples will be consistent with any hypothesis, simply by random variation. The goal of statistical analysis is to determine how likely it is that results are due to random chance alone, so that we can minimize the risk of being misled by it.
Improving overall scientific and statistical methodology requires institutional change and large-scale retraining of researchers, so it is slow and expensive at best. Authors of a new paper in Nature Human Behaviour, however, propose a much easier way to improve reproducibility while we work on the harder methodological issues: merely changing the p-value threshold required to call a result “statistically significant.”
A p-value, roughly speaking, is the probability of getting results at least as extreme as those observed if chance alone were at work. A low p-value means that the results are unlikely to be due to chance alone, and therefore some real process, presumably the hypothesis being tested, probably caused them.
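As a rough illustration of this definition, the sketch below computes an exact one-sided p-value for a made-up experiment: 15 "successes" in 20 trials, where chance alone would produce a success half the time. The function name and the numbers are assumptions chosen for illustration; they do not come from any study discussed here.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float = 0.5) -> float:
    """Probability of seeing at least `successes` out of `trials`
    if each trial succeeds with probability `chance` (chance alone)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 15 successes in 20 trials.
p = binomial_p_value(15, 20)
print(round(p, 4))  # 0.0207
```

A value of about 0.02 falls below the customary 0.05 threshold but above 0.005, so the same hypothetical data would count as "significant" under one convention and not under the other.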
The authors argue that the standard threshold to consider a result “statistically significant” should be lowered ten-fold, from 0.05, as is now standard in many fields, to 0.005. Such a lower threshold, the reasoning goes, will dramatically reduce the number of false-positive results that get reported, and thus the scientific literature will, on the whole, be more reliable, with many more published results being reproducible. To be clear, the many authors (from many disciplines) do not claim that such a redefinition of “statistically significant” will solve the problem, or that improving overall methodological standards is not needed. Indeed, many of the authors are at the forefront of efforts to improve statistical and experimental methodology in their fields.
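The reasoning behind the proposal can be sketched with a little arithmetic. The snippet below estimates what share of "significant" results would be false positives at each threshold. The inputs are assumptions made purely for illustration: that 10 percent of tested hypotheses are true, and that a study has an 80 percent chance of detecting a true effect. Holding that detection rate fixed at the stricter threshold is optimistic, since it would usually require larger samples.

```python
def false_positive_share(alpha: float, power: float, prior_true: float) -> float:
    """Share of significant results that are false positives.

    alpha      -- significance threshold (false-positive rate per test)
    power      -- chance that a study detects an effect that is real (assumed)
    prior_true -- fraction of tested hypotheses that are actually true (assumed)
    """
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


for alpha in (0.05, 0.005):
    share = false_positive_share(alpha, power=0.8, prior_true=0.1)
    print(f"threshold {alpha}: {share:.1%} of significant results are false positives")
# threshold 0.05: 36.0% of significant results are false positives
# threshold 0.005: 5.3% of significant results are false positives
```

Under these assumed inputs the stricter threshold looks very effective, which is why the proposal is plausible on its face. The calculation still treats each study as simply "in" or "out."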
This proposal is plausible and reasonable on its face, but I would argue it is fundamentally misguided and likely to be harmful to science. While tightening the standards for statistical significance might slightly reduce the number of irreproducible results, the proposal will reinforce pernicious ideas that prevent the scientific community from adopting better methodologies.
Indeed, the very notion that we can simply classify experimental results as “statistically significant” or not was never intended by its inadvertent inventor, pioneering statistician R. A. Fisher. As now used, it is a concept that causes much confusion and error and is therefore long overdue to be scrapped. It creates the mistaken view that science is the accumulation of trustworthy “facts” from single studies, and distorts researchers’ incentives to the great detriment of science.
Reducing the p-value threshold as proposed will have the effect of doubling down on the supposed importance of statistical significance and will only reinforce the problematic idea that a study is either “in” or “out”. (Let’s call this idea decision criterion science.) Far from ameliorating the reproducibility crisis while we implement more fundamental improvements, this proposal will, by strengthening the idea of decision criterion science, exacerbate the crisis and make it harder to implement those improvements.
You see, the value of an experimental result as evidence, either for or against a hypothesis, is very rarely an all-or-nothing affair. Nothing magical happens when going from a p-value of 0.051 to one of 0.049. Instead of a single binary “in or out” criterion of significance, we need to look at multiple quantitative measures of evidentiary value (such as p-value, Bayes factors, odds ratios, and absolute/relative effect sizes) directly, together with the reasons why these measures make sense for the problem being studied. (I am one of 88 coauthors of a response to the Nature Human Behaviour paper that makes this argument in some technical detail. This response is now in preprint.) Such practices will make it easier (though still nontrivial) to combine the evidence from multiple studies (creating metareviews) to get a clearer view of the overall picture.
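To see how little separates 0.051 from 0.049, one can translate each p-value back into the test statistic that would produce it. The sketch below assumes a two-sided test based on a standard normal distribution, which is only one common setup; the function name is hypothetical.

```python
from statistics import NormalDist


def z_from_two_sided_p(p: float) -> float:
    """Size of the standard-normal test statistic that gives two-sided p-value p."""
    return NormalDist().inv_cdf(1 - p / 2)


for p in (0.049, 0.051):
    print(f"p = {p}: z = {z_from_two_sided_p(p):.3f}")

gap = z_from_two_sided_p(0.049) - z_from_two_sided_p(0.051)
print(f"difference in z: {gap:.3f}")
```

Under this assumption the two test statistics differ by less than 0.02, a gap far smaller than the ordinary sample-to-sample wobble of the statistic itself.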
And such evidence combination is the real point.
The fundamental problem with decision criterion science is that it produces a scientific culture within which each peer-reviewed article seems to stand alone. Each individual study is considered to either prove (really, “support”) a hypothesis, and thus become a contribution to the scientific literature, or it does not and is not. Except in rare cases or in fields with greatly elaborated quantitative theories where very precise prediction is possible, no single study stands alone. All scientific knowledge rests on the integration of many bits of evidence, and a multifaceted understanding of each and every bit’s evidentiary value is needed to create this integrated view.
Of course, nearly all scientists would say that they certainly view science in this way, as a grand edifice built out of a mass of interrelated studies, each contributing partial evidence to the accumulated scientific knowledge base. However, it is not, in fact, how we usually do science. It is natural, when a researcher devotes months or years of effort to a study, not to mention seeking funding for the study, to view it as an accomplishment that stands (must stand) alone. Moreover, the notion of statistical significance as a gatekeeper to reliable results, combined with the “in or out” nature of scientific publishing decisions, reinforces and hardens the decision criterion science worldview. This view is so strongly and implicitly embedded within the assumptions of scientific practice that it is invisible to most researchers, even though it informs and incentivizes the most widespread and damaging methodological flaws in science today. Anything that strengthens decision criterion science will undercut efforts to eliminate these flaws.
Consider just two of the main methodological problems that contribute to the current crisis in scientific credibility: HARKing and the file-drawer effect. (These are two of many potential pitfalls.)
Hypothesizing after results are known (HARKing) means adjusting one’s hypothesis to better fit one’s results when the results don’t support the initial hypothesis (or when the initial hypothesis was vague). This pitfall is especially easy to fall into if there are many variables; one can just hypothesize that some other variable interacts with the main hypothesis in a way that will give a statistically significant result, and then reanalyze the data. Whenever a previously gathered data set is reanalyzed for new hypotheses, HARKing is likely. (Doing this can be useful for discovering novel hypotheses, but new experiments are needed to verify them.)
Indeed, hidden biases can creep in almost anywhere—in the formulation of hypotheses (both null and alternate), how data are sampled in collection, how data points are excluded from analysis, and on and on. There are so many degrees of freedom that with enough time, effort, and ingenuity, a publishable result can almost always be squeezed out of the data.
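A small simulation makes the point concrete. In the sketch below, every "study" measures several variables that are pure noise, tests each one, and keeps whichever gives the smallest p-value. Everything here is an assumption made for illustration: the number of variables, the sample size, the simple z-test with a known standard deviation of 1, and the function names.

```python
import random
from math import erfc, sqrt


def two_sided_p(sample: list[float]) -> float:
    """z-test p-value for 'the mean differs from 0', assuming a known sd of 1."""
    z = sum(sample) / sqrt(len(sample))
    return erfc(abs(z) / sqrt(2))


def share_with_a_hit(n_studies: int, n_variables: int, n_subjects: int,
                     alpha: float, seed: int = 0) -> float:
    """Fraction of pure-noise studies in which at least one variable
    crosses the significance threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_subjects)])
            for _ in range(n_variables)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


# One planned test versus a search across ten noise variables.
print(share_with_a_hit(2000, n_variables=1, n_subjects=30, alpha=0.05))
print(share_with_a_hit(2000, n_variables=10, n_subjects=30, alpha=0.05))
```

With a single planned test, roughly 5 percent of these noise-only studies produce a "finding." With ten candidate variables to choose from, theory predicts about 40 percent, since 1 - 0.95^10 is roughly 0.40, and the simulated share should land near that figure.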
Decision criterion science will almost inevitably lead to HARKing, if the initial results are not significant. After all, given decision criterion science, it’s only a scientific result if it meets the decision criterion, and if we didn’t manage to meet it initially, then we need to adjust our hypothesis to find one that does. (This oversight can be, of course, unconscious—no reputable scientist will intentionally draw the target after shooting the arrow—but with complex enough data, it’s easy to do if you’re not careful.)
The grand edifice view, on the other hand, will recognize the evidentiary value of the “failed” experiment—it may contribute weak evidence for the original hypothesis, or evidence (weak or strong) against it. But it isn’t worthless simply because it failed to meet some arbitrary decision criterion.
Another problematic effect linkable to decision criterion science is the file-drawer effect, the fact that the scientific record is biased because only “significant” results are published. Suppose that 20 different research groups do experiments to test some hypothesis H, and just one of the groups gets a result with p < 0.05. Because only significant results are worth publishing, only that one result is published in a peer-reviewed journal, and the other 19 end up in the proverbial file drawer. If we look just at the published research, we would say that H is supported. However, if we look at all the results, including those in the file drawer, we would reach the opposite conclusion.
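The arithmetic behind this example is worth spelling out. Assuming that H is false and that the 20 experiments are independent (both assumptions made here for illustration), the sketch below computes how likely it is that a given number of groups will cross p < 0.05 purely by chance.

```python
from math import comb


def prob_k_significant(k: int, groups: int = 20, alpha: float = 0.05) -> float:
    """Probability that exactly k of `groups` independent experiments
    reach significance when there is no real effect."""
    return comb(groups, k) * alpha**k * (1 - alpha) ** (groups - k)


for k in range(3):
    print(f"exactly {k} significant: {prob_k_significant(k):.4f}")
print(f"at least 1 significant: {1 - prob_k_significant(0):.4f}")
# exactly 0 significant: 0.3585
# exactly 1 significant: 0.3774
# exactly 2 significant: 0.1887
# at least 1 significant: 0.6415
```

Under these assumptions, exactly one "significant" result out of 20 is the single most likely outcome when there is nothing to find, and at least one appears almost two-thirds of the time.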
There are, in fact, known methods for addressing such methodological problems, such as preregistering research studies, reporting multiple measures of evidentiary value, data sharing, online open peer-review, and more. These solutions, however, are time-consuming and expensive, create burdens on researchers and funding agencies, slow apparent scientific progress, and interfere with high-speed publication of peer-reviewed articles—the main currency of scientific career advancement. For these reasons, it requires herculean efforts to improve scientific practice in these ways—hence the suggestion to take a small step forward by just changing what p-values we consider to be “statistically significant.”
Yet the difficulty of making these changes rests primarily, if implicitly and invisibly, on the assumption that the unit of scientific advance is the individual research study. If the study gets a “significant” result (however defined) it is a success, and if not, it is worthless. This hidden idea lies behind the incentives of the individual researcher to perform as much “publishable” (read “statistically significant”) research as possible, and behind the incentives of the funding agencies to fund “transformative” (read “novel, not replications of previous studies”) and “reliable” (read “statistically significant”) research.
What we need, rather, is a more expansive and communal view of the scientific enterprise, in which we examine the totality of evidence for (and against) a hypothesis from potentially many different experiments. Starting from this viewpoint, we will design our studies from the get-go to give results that contribute evidence toward resolving a question, rather than results that we expect will resolve the question. Each measure of evidentiary value (such as the p-value) will then be considered not as a binary filter (“in or out”, “significant or not”), but as a quantitative measure to be combined with other such measures from other studies, to give an overall measure of the evidence for or against the hypothesis. In other words, the unit of scientific progress is more like the comprehensive metareview than the individual research study, and each study should be designed to fit into such a metareview. (Certainly, there are significant technical challenges to be solved, and the incentive structure of modern science needs to change. But that’s the point.)
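One long-established way to combine p-values across studies is Stouffer's Z-score method, sketched below. It is offered only as an example of the general idea, not as the approach this essay or any cited paper prescribes. The sketch assumes independent studies that each report a one-sided p-value for the same hypothesis in the same direction, and the p-values used are invented.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(p_values: list[float]) -> float:
    """Combine independent one-sided p-values (each strictly between 0 and 1)
    into a single one-sided p-value using Stouffer's Z-score method."""
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1 - normal.cdf(combined_z)


# Three hypothetical studies, none "significant" on its own at 0.05.
print(stouffer_combined_p([0.08, 0.08, 0.08]))

# Conflicting hypothetical studies pull the combined value back toward 0.5.
print(stouffer_combined_p([0.08, 0.5, 0.9]))
```

In the first case three individually "failed" studies add up to evidence well below 0.05, which is exactly the information a binary filter would have thrown away. Real metareviews also have to weigh study size, quality, and effect estimates, which this sketch ignores.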
Some 400 years ago, the idea of using controlled experimentation led to a revolution in how we develop reliable bodies of knowledge, and hence to modern science. Just over a century ago, the development of rigorous statistical analysis led to a revolution in how we evaluate the import of experimental results in complex situations. The lack of reproducibility indicates that we are now at the cusp of another great leap in scientific epistemology.
Pushing scientific investigation further requires us to do science that explicitly recognizes its communal and interconnected nature. Stopgap measures, such as the “p < 0.005” proposal, that harden the pernicious assumptions of decision criterion science will only harm science in the long term, even if they make small improvements in the short term. We must keep our collective eyes on the ball: Remove the term statistically significant, and replace it with multiple relevant and justified measures of evidentiary value, while working to build the standards, techniques, incentives, and institutions needed to support the grand edifice of future science.