MACROSCOPE

# Do We Really Need the *S*-word?

The use of “significance” in reporting statistical results is fraught with problems—but they could be solved with a simple change in practice

It pervades journal articles, media reports and discussions of nearly all quantitative, data-based research. It is so entwined with statistical inference that we subconsciously think it, say it and write it without stepping back to reflect on its intended meaning. It attaches an air of importance (or lack thereof) to results often not worthy of the label. *Significance*, the new *s-*word, is overused and underdefined in the realm of connecting statistical results to the underlying science.

The current Wikipedia entry for “statistical significance” clearly distinguishes between the word’s statistical and common meanings, stating, “When used in statistics, *significant* does not mean *important* or *meaningful*, as it does in everyday speech.” However, I believe we are unable (or perhaps too untrained) to set aside the lay meaning of the word when reading it in the context of statistical results. For scientists, statisticians, journalists and others who write or speak about statistical results, I advocate a simple solution: Replace the *s-*word with words describing what you actually mean by it.

# What Does It Really Mean?

What exactly does the word *significant* mean in statistical contexts, if it does not mean “important” or “meaningful”? When someone labels a result as statistically significant, does it merely mean that the *p*-value (a value calculated to quantify evidence against a hypothesis) is less than 0.05, or that the 95-percent confidence interval does not include 0? If so, perhaps it is time to ask whether we really need to use a word that carries substantial meaning in our day-to-day language to describe something so simple.

Did the people who introduced the word’s use in statistics intend for it to be interpreted according to its current everyday meaning? The answer is not simple. In his 2001 book *The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century,* David Salsburg contends the word carried much less weight in the late 19th century, when it meant only that the result showed, or signified, something. Then, in the 20th century, *significance* began to gather the connotation it carries today, of not only signifying something but signifying something of importance. The coinciding of this change in meaning with a steady increase in its use by more scientists with less statistical training has had a big impact on the interpretation of scientific results. My sentiments echo Salsburg’s: “Unfortunately,” he writes, “those who use statistical analysis often treat a significant test statistic as implying something much closer to the modern meaning of the word.”

# New Twist, Old Debate

The dangers of implementing arbitrary *p*-value cutoffs and the importance of distinguishing between statistical significance and practical significance are now well recognized by scientists. Andrew Gelman and Hal Stern address the common error of making comparisons based on statistical significance in a 2006 paper for *The American Statistician*, titled “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” I contend that in addition, our tendency to repeatedly use the same wording to explain statistical results masks potential improvements in inference corresponding to recognition of these issues. My goal here is not to revisit ideas that have been described eloquently and repeatedly over the past decades (by James O. Berger and Donald A. Berry in a 1988 article for *American Scientist*, and by Jacob Cohen in a 1994 article for *American Psychologist*, among others) but instead to propose a simple change in wording. I suggest we replace the word *significant* with a concise and defensible description of what we actually mean to evoke by using it.

For example, if we mean that the two-sided *p*-value is less than 0.05, then let us say just that. Let the reader judge whether that result is to be deemed significant by our modern scale and by their knowledge of the science. If, instead, we mean the result is in fact practically important, let’s say as much and clearly communicate our justification for doing so. If we believe our research is important, let us convince our critics through our choice of other words. We can use the *p*-value to back up our arguments, without appealing to the *s*-word. If we do not feel forced to label results associated with larger *p*-values “nonsignificant” or “insignificant,” it may even help curb the long-standing publication bias. By replacing the *s*-word with defensible statements, we can easily and quickly clean up the often sloppy dissemination of scientific results.
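To make the suggested alternative concrete, here is a minimal sketch in Python (standard library only, with entirely made-up summary numbers, not from any real study): report the estimate, its interval and the exact two-sided *p*-value, and let the reader judge.

```python
from statistics import NormalDist

# Hypothetical summary data for illustration only (not from any real study):
# an estimated difference between two group means and its standard error.
diff = 1.8   # estimated difference
se = 0.85    # standard error of the difference

# Two-sided p-value under a normal approximation
z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95-percent confidence interval for the difference
half_width = NormalDist().inv_cdf(0.975) * se
ci = (diff - half_width, diff + half_width)

print(f"estimated difference = {diff:.2f}, "
      f"95% CI ({ci[0]:.2f}, {ci[1]:.2f}), two-sided p = {p_value:.3f}")
```

A sentence built directly from this output, reporting the estimate, the interval and the exact *p*-value, tells the reader precisely what was computed and by what criteria, with no *s*-word required.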

# The Significance of Significant

Perhaps you are thinking this is a trivial suggestion, a mere matter of semantics. Can such a small change manifest real improvements? There are not many easy ways to improve scientific inference, but I believe this is one of them. The significance of the word *significant* should not be easily dismissed. The word carries strong connotations from its everyday usage that are difficult, if not impossible, to let go of simply because we find ourselves interpreting statistical results. In practice, it is all too easy to let slip the constant vigilance required to maintain these real semantic differences amid the multitude of assumptions and procedural details involved in statistical analysis.

The *s-*word is, of course, not limited to discourse among scientists. It finds its way into media reports on research, which are read by the public, most of whom have far too little statistical background to understand the different meaning of the word in the context of statistical results. The rest of a sentence may contain incomprehensible statistical jargon, but the *s-*word is recognizable and easily digestible—and the meaning attached to it is, of course, its everyday sense: important and meaningful. Thus, “important and meaningful” is the message sent to an audience without the background to understand what led to the printing of that weighty word.

The journal *Science* suggests that potential authors “use *significant* only when discussing statistical significance,” acknowledging the subtleties attached to its use and meaning in the context of scientific research. I suggest we take a further step and omit its use in statistical contexts as well. If most of us are not capable of separating the statistical meaning from the everyday meaning, and if, as I argue, the word really is not needed to explain statistical results, why maintain our dependence on it? Let’s free ourselves to justify our statements more adequately and describe our results more wisely.

For readers with the background necessary to successfully critique results, we should provide the information they need to make their own informed opinions, based on sound reasoning and justification. Rather than giving in to the false dichotomy evoked by “significant” or “not significant”—a dichotomy most often based on arbitrary and hidden criteria—we should focus on *why* we believe results are (or are not) meaningful and important. Replacing the *s-*word in our writing and speech allows us the space to do just that.

# Case Studies in Unsignificance

Curious about the impact a ban on the *s-*word might have, three years ago I began banning the word from my two-semester Methods of Data Analysis course, which is taken primarily by nonstatistics graduate students. My motivation was to force students to justify and defend the statements they used to summarize results of a statistical analysis. In previous semesters I had noticed students using the *s-*word as a mask, an easily inserted word to replace the justification of assumptions and difficult decisions, such as arbitrary cutoffs. My students were following the example dominant in published research—perpetuating the false dichotomy of calling statistical results either significant or not and, in doing so, failing to acknowledge the vast and important area between the two extremes. The ban on the *s-*word seems to have left my students with fewer ways to skirt the difficult task of effective justification, forcing them to confront the more subtle issues inherent in statistical inference.

An unexpected realization I had was just how ingrained the word already was in the brains of even first-year graduate students. At first I merely suggested—over and over again—that students avoid using the word. When suggestion proved not to be enough, I provided more motivation by taking off precious points at the sight of the word. To my surprise, it still appears, and students later say they didn’t even realize they had used it! Even though using this *s*-word doesn’t carry the possible consequence of having one’s mouth washed out with soap, I continue to witness the clasp of hands over the mouth as the first syllable tries to sneak out—as if the speakers had caught themselves nearly swearing in front of a child or parent.

# What Would Fisher Think?

The statistician and biologist Ronald Fisher is usually given credit (or blame) for the attachment of the word *significant* to statistical results, thanks to his coining of the term “test of significance” and to the incredibly liberal use of the word in his 1925 text *Statistical Methods for Research Workers*. However, evidence suggests he borrowed the idea from previous uses of the word in discussions about judging how likely it was that an event was due to chance as opposed to a “real” effect.

The University of Southampton economist John Aldrich points to two late-19th-century uses of the word in a pre-Fisher statistical context. In 1885, the political economist Francis Ysidro Edgeworth wrote: “In order to determine whether the observed difference between the mean stature of 2,315 criminals and the mean stature of 8,585 British adult males belonging to the general population is significant.…” In 1888, the logician John Venn (of Venn diagram fame) stated, “As before, common sense would feel little doubt that such a difference was significant, but it could give no numerical estimate of the significance.” In a 1982 article for *American Psychologist*, Michael Cowles and Caroline Davis also point to a famous 1908 paper by William Gosset (who wrote under the pseudonym Student). Student used the *s*-word in introducing the *t*-distribution: “Three times the probable error in the normal curve, for most purposes, would be considered significant.”

These instances notwithstanding, it appears that Fisher’s generous use of the word in *Statistical Methods for Research Workers* was initially responsible for the need for a second definition in the dictionary today. In addition, the common use of the 0.05 *p*-value as nearly the sole criterion for dichotomizing and labeling results as significant or not appears to have some of its first roots in that text. Fisher wrote: “The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”
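Fisher’s arithmetic in that passage is easy to verify. A quick sketch (my illustration, using only Python’s standard library) confirms that a two-sided tail area of 0.05 under the normal curve corresponds to a deviation of about 1.96 standard deviations:

```python
from statistics import NormalDist

# Two-sided tail area beyond 1.96 standard deviations of the normal curve
two_sided_p = 2 * (1 - NormalDist().cdf(1.96))
print(round(two_sided_p, 4))  # close to 0.05, Fisher's "1 in 20"

# Inverting: the deviation whose two-sided tail area is exactly 0.05
cutoff = NormalDist().inv_cdf(1 - 0.05 / 2)
print(round(cutoff, 3))  # close to 1.96, "or nearly 2"
```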

In these early uses of the word, authors appear to intend the meaning of “showing something,” and not necessarily “showing something important.” Even in these cases, however, the ability to judge the strength of intended meaning is admittedly difficult, providing even more reason for promoting careful and sparing use of the word today. It seems naive to believe that we, as humans, can ignore the word’s everyday meaning when we enter the realm of statistics, particularly when it is ambiguously defined even in statistical contexts.

In his 1956 book, *Statistical Methods and Scientific Inference*, Fisher is notably more careful in his discussions of cutoffs and significance. In discussing the practice of rejecting a hypothesis, he states:

No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. It should not be forgotten that the cases chosen for applying a test are manifestly a highly selected set, and that the conditions of selection cannot be specified even for a single worker; nor that in the argument used it would clearly be illegitimate for one to choose the actual level of significance indicated by a particular trial as though it were his lifelong habit to use just this level.

This language is clearly different from the strict classification Fisher suggested in 1925. Although I lay some blame on Fisher for early uses of the *s-*word, I believe he would take issue with its rampant use today, particularly if, as Salsburg suggests, its meaning carries more weight now than it did in 1956. We can strive to be one of the scientific workers Fisher describes, and omitting the *s-*word from our work is a starting point with serious potential.

Fisher goes on to say, “In choosing the ground upon which a general hypothesis should be rejected, personal judgment may and should be properly exercised. The experimenter will rightly consider all points on which, in the light of current knowledge, the hypothesis may be imperfectly accurate.” The perfect place to insert such descriptions of reasoning and justifications of personal judgment is just where the word *significant* used to reside.

# Join the S-word Movement

In my experience, scientists making their first attempts at abandoning the *s-*word discover how wedded they are to it. The real challenge, however, lies in replacing the *s-*word with substance, not with an equally ambiguous synonym. If the scenario is a simple one—the *p*-value was 0.048, the confidence interval did not include 0 or the variable in question ended up in your top model—use the space to explicitly define your criteria. If, in fact, you believe the results are practically meaningful and important, convince your readers with sound justification using both statistical and general scientific reasoning.

Statistical inference is an art, uncomfortably dependent on practitioners and their backgrounds. It should not be construed as a way to objectivize inference or a straightforward means to classify results as significant or not. Omission of the *s-*word may seem like a rather insignificant request among the bigger issues facing statistical inference and science in general. However, given the simplicity and accessibility of this change, it is worth the potential improvements it offers in the dissemination of our scientific results. I hope you will join me and my students in working to curtail use of the *s-*word and its negative impacts on science.

# Bibliography

- Aldrich, J. 2011. Contribution to *Earliest Known Uses of Some of the Words of Mathematics*. http://Jeff560.tripod.com/s.html.
- Berger, J. O., and D. A. Berry. 1988. Statistical analysis and the illusion of objectivity. *American Scientist* 76:159–165.
- Cohen, J. 1994. The earth is round (*p* < .05). *American Psychologist* 49:997–1003.
- Cowles, M., and C. Davis. 1982. On the origins of the .05 level of statistical significance. *American Psychologist* 37:553–558.
- Edgeworth, F. Y. 1885. *Jubilee Volume, Royal Statistical Society* 181–217.
- Fisher, R. A. 1973. *Statistical Methods and Scientific Inference*, 3rd ed. New York: Hafner Press.
- Fisher, R. A. 1944. *Statistical Methods for Research Workers*, 9th ed. London: Oliver and Boyd.
- Gelman, A., and H. Stern. 2006. The difference between “significant” and “not significant” is not itself statistically significant. *The American Statistician* 60:328–331.
- Gill, J. 1999. The insignificance of null hypothesis significance testing. *Political Research Quarterly* 52:647–674.
- Goodman, S. N. 2001. Of P-values and Bayes: A modest proposal. *Epidemiology* 12:295–297.
- Poole, C. 2001. Low P-values or narrow confidence intervals: Which are more durable? *Epidemiology* 12:291–294.
- Salsburg, D. 2001. *The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century*. New York: W. H. Freeman and Co.
- Siegfried, T. 2010. Odds are, it’s wrong. *Science News* 177:26.
- Thompson, B. 1996. AERA editorial policies regarding statistical significance testing: Three suggested reforms. *Educational Researcher* 25:26–30.
- Weinberg, C. R. 2001. It’s time to rehabilitate the P-value. *Epidemiology* 12:288–290.
- *Wikipedia* contributors. Statistical significance. *Wikipedia, The Free Encyclopedia*. http://en.wikipedia.org/wiki/Statistical_significance. Accessed August 17, 2012.
