A Safety Net for Scientific Data
Online data archives bolster confidence in science and provide a springboard for future scientists. Whose responsibility is it to curate aging data sets?
For the data sets that form the lifeblood of research science, it is the best of times and the worst of times. Across all fields, data sets are growing rapidly because of increased computing speed and storage capacity. Yet the accessibility of that information is often deteriorating. Traditionally, the data behind an analysis were reviewed and published along with the journal article. These days the raw information typically stays in the authors’ care. As a result, many data sets are now inaccessible, lost, or effectively lost because the authors don’t know where they are or how to retrieve them.
The problem is not a lack of space to store the data. New organizations such as Dryad, which opened in 2010 to serve researchers in ecology and evolution, offer long-term archiving of scientific data, sometimes for free and sometimes for an up-front fee paid by a journal, institution, or researcher. Dryad will permanently store 10 gigabytes of data for less than $100. Journals, too, will usually store data online in appendixes or supplemental information sections accompanying the article. Researchers have been slow to adopt these new strategies, however.
What is needed are ways to encourage or require scientists to archive their data, ideally in locations that are freely available to the public, proposes Tim Vines, managing editor of the journal Molecular Ecology. “There is a growing crisis of confidence in science, both within science and more broadly. Science publication has become increasingly focused on novelty and excitement of results, rather than overwhelming solidity of results,” he says. In 2011, many leading evolution journals, including Molecular Ecology, created and adopted the Joint Data Archiving Policy, which calls for authors to archive their data when their paper is published in a participating journal. With almost 1,000 data sets archived in Dryad so far, Molecular Ecology is one of the journals leading the charge for improved data availability.
Vines knows firsthand the importance of long-term data availability. A data set on morphology of fire-bellied toads (Bombina species) recorded by a naturalist in the early 1930s was instrumental to his dissertation work; it allowed him to show that hybridization between two species had been constant in a particular location for almost 60 years. “The old data allowed us to apply a very sophisticated analysis. But for many other papers about these toads, even much more recently in the 1970s and 1980s, the data were not available,” Vines says. “In particular, the individual data were not available. We couldn’t take the older data they had and compare it to the data we had, because all we had were the summary statistics. That really hampered our ability to make any inferences about stability of these hybrid zones.”
After adopting the Joint Data Archiving Policy at Molecular Ecology, Vines and other journal publishers sought ways to make sure it was enforced. “It became clear to me that other journals with equal or slightly different archiving policies were achieving much, much lower rates of data archiving,” he says. The Public Library of Science (PLoS) journals, for example, were not achieving high rates of archiving despite a comprehensive archiving policy and a vocal stance on open access: about 12 percent of papers at PLoS One had archived data online.
In response, Vines and several colleagues launched a study of the effectiveness of different data archiving policies by assessing the proportion of published papers that archived their data. They checked 229 papers from 12 journals that had no policy on data archiving, merely recommended it, or required it. To make sure the difficulty of archiving was comparable throughout, the researchers limited their study to papers that all used the same type of data, the input to a common genetic analysis.
The results of the study, published in the FASEB Journal in January 2013, showed that a mandatory data archiving policy was much more effective than either recommending archiving or having no policy at all. More than 50 percent of papers had publicly archived data when archiving was mandated, although success varied widely within this group because journals differed in how strictly they enforced the policy. In contrast, approximately 30 percent of papers had publicly archived data in journals with a weak policy and 20 percent in those with no policy at all. Of the four journals in the study that mandated data archiving, the two that stood out for their especially high rates of archiving also required authors to state, in the methods section of the paper, where any publicly available data related to the paper could be found.
“The differences are pretty dramatic,” Vines says. “The journals that have a policy saying you must do this and make authors have a statement in the manuscript about what they have done achieve more than 90 percent archiving rates. Journals that tell you only that you should do it achieve about the same rates as having no policy whatsoever.”
While Vines was working on the FASEB Journal paper, he was also struck by how information tended to vanish over time. “It’s pretty easy to get data from an author one or two years after publication, but everybody knows that 10 years after publication, or 20 years after publication, it’s almost impossible to get the data back,” he says. “I was rooting around for references in the literature, trying to find a paper that quantified the rate at which data disappeared, and there wasn’t one. I realized that this was going to be a really important piece of the puzzle. Everyone assumes that data held by authors disappear, but how fast?”
Data disappear for multiple reasons. Sometimes the contact information for the authors is out of date and current contact information cannot be found; this problem is especially pronounced for papers published prior to the rise of email. Sometimes the storage device has been lost, damaged, or stolen. Sometimes the author has the data but does not remember where they are stored. Sometimes the data exist only on an obsolete storage device, such as a floppy disk. And sometimes the data are accessible yet unusable because the metadata (information on what the column headings mean, what the individual identifiers are, and which files contain what kind of data) have been lost.
Vines spent much of 2013 working with some of his coauthors from the FASEB Journal paper, as well as new collaborators, to quantify the rate at which scientific data become unavailable. The results are in press, and Vines is making plans for his next project: testing whether published studies are reproducible. Even when an author has provided the source data, inconsistencies in the data and a lack of metadata can make it difficult to replicate the statistical analysis originally described.
Vines concludes that better scientific data archiving practices have to begin with the journals that publish the results. “Journals are the only group that has real power to make data archiving happen, particularly for published papers,” he says. “There’s a crystal-clear, simple point at which the journal says, ‘All of the data used for this study must be made available, or else this paper is not going to be typeset.’”
Journals should require authors to describe the location of their data when writing about their study methods, Vines recommends, because this approach substantially increases compliance and reduces the amount of time that a journal’s staff needs to spend working out what data should be archived. Peer reviewers could then evaluate the authors’ data archiving plans as part of their assessment of the study. Vines also proposes that journals include a statement in their author guidelines, as Molecular Ecology will soon do, emphasizing that, all else being equal, papers that do an exemplary job of archiving their data and analysis code will be given priority for publication.
The scientific community has been slow to embrace the idea of public data archiving, as attested by a spirited debate among ecologists on Twitter over the summer. Some researchers have a sense of ownership of their data and feel that it is their right to choose with whom they share data. Nevertheless, the practical reality is that most scientists don’t have the time to curate all the data they collect over their careers.
“Journals feel somewhat obliged to listen to the community,” Vines acknowledges, “but I think this is a case where the greater good of the community and the reputation of science requires that journals step up and enforce data archiving at publication.” —Katie L. Burke