To Throw Away Data: Plagiarism as a Statistical Crime

Whether data are numerical or narrative, removing them from their context represents an act of plagiarism

Andrew Gelman, Thomas Basbøll

A Statistical Crime

Returning to the statistical language of probability and likelihood, to falsify the provenance of a story is to imply an incorrect likelihood function and thus to lose inferential validity. (Statistically speaking, systematically excluding data without revealing the exclusion is a misspecification of the model.) As one of us (Basbøll) eventually showed, any telling of the story is a selection from several possible versions of it. By not sourcing it properly, Weick hides the opportunism of his sampling and sets Engel up to propose a convenient (for top management) “truth” about corporate strategy. This is not to say that, had Weick cited Holub appropriately, he would not ultimately have used the story to draw lessons about leadership, even ones that executives would find useful. But had he cited his source, he would have had to justify his argument, rather than merely retell the story in his own way to suit his purposes.

Scholars in fields ranging from psychology to history to computer science have recognized that stories are part of how people understand the world. As statisticians, we can consider reasoning from stories as a form of approximate inference. From this perspective, statistical principles should provide some approximate guidance about the potential biases and precision of such inferences. One key principle is not to throw away information and, if discarding data is for some reason necessary, to describe as clearly as possible the mechanism by which the relevant information was excluded. Plagiarism violates both these rules and, as such, is a violation of statistical ethics, beyond any other considerations of moral behavior.
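The principle of describing the exclusion mechanism can be made concrete with a toy simulation. The sketch below is only an illustration under assumptions of our own choosing, not an analysis of the case discussed here: the data are normal with a known standard deviation, every value at or below a cutoff is silently dropped, and all function names and settings are hypothetical. An analyst who is not told about the exclusion averages what remains and is biased. An analyst who is told the rule can write the correct (truncated) likelihood and approximately recover the true mean.

```python
import math
import random
from statistics import fmean

# Assumed settings for this toy example (not taken from the article).
TRUE_MEAN, SD, CUTOFF = 0.0, 1.0, 0.0


def simulate(n=2000, seed=1):
    """Draw n normal values, then keep only those above CUTOFF."""
    rng = random.Random(seed)
    full = [rng.gauss(TRUE_MEAN, SD) for _ in range(n)]
    kept = [x for x in full if x > CUTOFF]
    return full, kept


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_loglik(mu, data):
    """Log-likelihood of mu when the exclusion rule (x > CUTOFF) is disclosed.

    Each kept point contributes the normal log-density minus the log of the
    probability that a draw survives the exclusion.
    """
    log_density = sum(
        -0.5 * ((x - mu) / SD) ** 2 - math.log(SD * math.sqrt(2.0 * math.pi))
        for x in data
    )
    log_keep_prob = math.log(1.0 - normal_cdf((CUTOFF - mu) / SD))
    return log_density - len(data) * log_keep_prob


def fit_with_disclosed_exclusion(data, lo=-2.0, hi=2.0, steps=401):
    """Maximize the truncated log-likelihood over a simple grid of mu values."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda mu: truncated_loglik(mu, data))


full, kept = simulate()
naive_estimate = fmean(kept)  # treats the kept data as the whole story
corrected_estimate = fit_with_disclosed_exclusion(kept)

print(f"kept {len(kept)} of {len(full)} observations")
print(f"naive mean, exclusion hidden:   {naive_estimate:.3f}")
print(f"estimate, exclusion disclosed:  {corrected_estimate:.3f}")
```

The two analysts see exactly the same numbers. What separates them is knowledge of how the numbers were selected, which is the information that proper sourcing supplies.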


