Do We Really Need the S-word?
The use of “significance” in reporting statistical results is fraught with problems—but they could be solved with a simple change in practice
What Would Fisher Think?
The statistician and biologist Ronald Fisher is usually given credit (or blame) for the attachment of the word significant to statistical results, thanks to his coining of the term “test of significance” and to the incredibly liberal use of the word in his 1925 text Statistical Methods for Research Workers. However, evidence suggests he borrowed the idea from previous uses of the word in discussions about judging how likely it was that an event was due to chance as opposed to a “real” effect.
The University of Southampton economist John Aldrich points to two late-19th-century uses of the word in a pre-Fisher statistical context. In 1885, the political economist Francis Ysidro Edgeworth wrote: “In order to determine whether the observed difference between the mean stature of 2,315 criminals and the mean stature of 8,585 British adult males belonging to the general population is significant.…” In 1888, the logician John Venn (of Venn diagram fame) stated, “As before, common sense would feel little doubt that such a difference was significant, but it could give no numerical estimate of the significance.” In a 1982 article for American Psychologist, Michael Cowles and Caroline Davis also point to a famous 1908 paper by William Gosset (who wrote under the pseudonym Student). Student used the s-word in introducing the t-distribution: “Three times the probable error in the normal curve, for most purposes, would be considered significant.”
These instances notwithstanding, it appears that Fisher’s generous use of the word in Statistical Methods for Research Workers is what first created the need for the second definition found in dictionaries today. In addition, the common use of a p-value of 0.05 as nearly the sole criterion for dichotomizing results and labeling them as significant or not appears to have some of its earliest roots in that text. Fisher wrote: “The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”
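The arithmetic behind Fisher's "1.96 or nearly 2" can be checked in a few lines. The sketch below is only an illustration, assuming a standard normal distribution and a two-sided deviation, which is how the quoted passage reads; the function and variable names are hypothetical and do not come from Fisher's text.

```python
from statistics import NormalDist

# Assumption: deviations are measured in standard deviations of a normal curve.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def two_sided_cutoff(p):
    """Deviation, in standard deviations, exceeded in either direction with probability p."""
    return standard_normal.inv_cdf(1.0 - p / 2.0)


def two_sided_tail(z):
    """Probability of a deviation larger than z standard deviations in either direction."""
    return 2.0 * (1.0 - standard_normal.cdf(z))


cutoff_at_0_05 = two_sided_cutoff(0.05)  # about 1.96
tail_at_2_sd = two_sided_tail(2.0)       # about 0.0455, a little under 1 in 20

print(f"Cutoff for P = 0.05: {cutoff_at_0_05:.2f} standard deviations")
print(f"P for a deviation beyond 2 standard deviations: {tail_at_2_sd:.4f}")
```

Rounding 1.96 up to 2 therefore makes the rule slightly stricter than 1 in 20, which suggests the convention was a matter of convenience, as Fisher's own wording says, rather than of precision.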
In these early uses of the word, authors appear to intend the meaning of “showing something,” and not necessarily “showing something important.” Even in these cases, however, judging the strength of the intended meaning is admittedly difficult, which provides even more reason to promote careful and sparing use of the word today. It seems naive to believe that we, as humans, can ignore the word’s everyday meaning when we enter the realm of statistics, particularly when it is ambiguously defined even in statistical contexts.
In his 1956 book, Statistical Methods and Scientific Inference, Fisher is notably more careful in his discussions of cutoffs and significance. In discussing the practice of rejecting a hypothesis, he states:
No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. It should not be forgotten that the cases chosen for applying a test are manifestly a highly selected set, and that the conditions of selection cannot be specified even for a single worker; nor that in the argument used it would clearly be illegitimate for one to choose the actual level of significance indicated by a particular trial as though it were his lifelong habit to use just this level.
This language is clearly different from the strict classification Fisher suggested in 1925. Although I lay some blame on Fisher for early uses of the s-word, I believe he would take issue with its rampant use today, particularly if, as Salsberg suggests, its meaning carries more weight now than it did in 1956. We can each strive to be one of the scientific workers Fisher describes, and omitting the s-word from our work is a starting point with serious potential.
Fisher goes on to say, “In choosing the ground upon which a general hypothesis should be rejected, personal judgment may and should be properly exercised. The experimenter will rightly consider all points on which, in the light of current knowledge, the hypothesis may be imperfectly accurate.” The perfect place to insert such descriptions of reasoning and justifications of personal judgment is just where the word significant used to reside.