Science Needs More Moneyball
Baseball's data-mining methods are starting a similar revolution in research
Lost to the Past
As useful as the idiot’s guide approach has been across fields, gleaning meaning from old data serves up severe challenges. Difficulties can arise because at the time events happened, the data recorders did not anticipate that the information would be analyzed in ways not yet imagined. In cases stretching across baseball, biology and language, important items were not reported or, in some cases, observed at all. Using past data also raises a twin problem, one that challenges the whole community: Knowing that data are often used in ways unimagined at the time of collection, how can we make the data we record today more usable and valuable in the future?
As baseball and the sciences have taken an interest in mining old data for new insights, it has turned out that the old data sets are often sufficiently complete for us to discover new “laws” of baseball or science. Yet in far too many cases, fresh scrutiny of old data reveals painful omissions proving that science has missed an opportunity.
In retrospect, I am amazed at how little interest baseball and biology have shown for the future use of data. In baseball, the traditional play-by-play record of games was all that was reliably available until 1988, when the pitch-by-pitch record became the standard. The new record turned out to be important in many ways—for example, in managing a pitcher’s productivity, health and longevity.
Until recently, biology was equally shortsighted in its data collection; this has created a problem for biologists who would like to analyze other scientists’ published data. For example, Cathy Lozupone and Rob Knight at the University of Colorado figured out from analyses of others’ data that the most difficult evolutionary transition in the history of bacteria has been from saline to non-saline environments and vice versa. However, because the original researchers did not record the actual salinity levels, Lozupone and Knight could not pinpoint the precise concentration of salinity that has been most difficult to cross.
Previous standards of data collection in biology were typically limited to what might be interesting for the experiment at hand or perhaps for some future experiment in the same lab. Today, biologists are increasingly expected to anticipate likely uses by others of the data we gather and are taking pains to do so, but this forethought is not easy.
I recently met with Hilmar Lapp, a database expert at the National Evolutionary Synthesis Center (NESCent), and discussed how researchers could avoid omitting important elements of data. He said that it is too much to expect, in the case of biology, for one researcher to think to include all the observations worthy of recording for posterity; what is needed, he suggested, is a “crowd intelligence.” Accordingly, NESCent and other organizations have sponsored working groups to pool ideas and propose standards and directions of biological data collection in novel areas of inquiry (that is, to foster crowd intelligence). For example, the Genomic Standards Consortium recently established standards for recording environmental data when genes and genomes are sampled; earlier action might have avoided the debacle of the missing salinity data that Lozupone and Knight encountered.
In some cases, we do not have data on old events, not because of a lack of imagination but because the appropriate technology was not available at the time. In the case of baseball, the new, high-tech Advanced Value Matrix (AVM) system automatically describes each hit ball by its trajectory, velocity and point of hitting the ground. The AVM description of a hit allows analysis of how frequently a fielder can catch a ball that usually ends up being a double. But no one could analyze the skill of fielders at this level prior to the advent of this technology.
Until recently in biology, a lack of microbiological technology limited plant ecologists’ understanding of the factors allowing a particular plant species to grow. Plant ecologists discovered only recently that the success of many plant species in nature is determined by helpful and harmful microbes that live in the soil. Therefore, decades of studies trying to understand the successes and failures of plants came up short because they failed to collect data on soil microbes.
In linguistics, the lack of technology for audio recording has hindered an analysis of spoken English usage over time. You might think that dialog written in novels and stories would be a good substitute for actual sound recordings; these pages are frequently as good a record as we will get. However, it is discouraging that a corpus-based analysis of word usage in speech versus fiction by lexicographer and author Orin Hargraves has shown that certain clichéd phrases, which appear to mimic spoken language, are actually used far more frequently in literature than in real life. For example, hardly anyone really says “he bolted upright” or “she drew her breath,” but these forms are found with surprisingly high frequency in literature. Consequently, an unbiased, corpus-based account of spoken English usage begins with abundant voice recording in the 20th century.
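To make the idea of a corpus-based comparison concrete, here is a minimal sketch in Python of how one might count a phrase's frequency per million words in a corpus of fiction and in a corpus of transcribed speech. It is not the method Hargraves used, which the text does not describe. The function name `per_million` and the two tiny sample texts are invented for illustration; a real analysis would draw on corpora of many millions of words.

```python
import re


def per_million(phrase: str, text: str) -> float:
    """Return occurrences of `phrase` per million words of `text`.

    Matching is case-insensitive and ignores punctuation between words.
    (Hypothetical helper, written for this illustration only.)
    """
    words = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if not words or n == 0:
        return 0.0
    # Slide a window of n words across the text and count exact matches.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return 1_000_000 * hits / len(words)


# Invented placeholder snippets, NOT real corpus data.
fiction_sample = (
    "He bolted upright in bed. She drew her breath and he bolted upright again."
)
speech_sample = "So I sat up really fast and then she kind of gasped you know"

fiction_rate = per_million("bolted upright", fiction_sample)
speech_rate = per_million("bolted upright", speech_sample)
```

Normalizing to a rate per million words, rather than comparing raw counts, is what allows corpora of very different sizes to be compared fairly.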
Analyses of huge data sets allow us to move beyond our previous understanding, which was based on much less data than we have available to us today. There is so much possibility for a data-driven explosion of understanding of games, creatures and words by explorers today and in the future. We owe these future explorers the best and most complete record of life today that we can offer.
The Moneyball film opens with wisdom from Mickey Mantle: “It’s unbelievable how much you don’t know about the game you’ve been playing all your life.” Surely the same is true for many in the natural and social sciences, pondering the areas they have been studying all their careers.
Bibliography
- Arumugam, M., et al. 2011. Enterotypes of the human gut microbiome. Nature 473:174–180.
- Becraft, E., F. M. Cohan, M. Kühl, S. Jensen and D. M. Ward. 2011. Fine-scale distribution patterns of Synechococcus ecological diversity in the microbial mat of Mushroom Spring, Yellowstone National Park. Applied and Environmental Microbiology 77:7689–7697.
- Butterfield, J. 2008. Damp Squid: The English Language Laid Bare. Oxford: Oxford University Press.
- Cohan, F. M., and S. M. Kopac. 2011. Microbial genomics: E. coli relatives out of doors and out of body. Current Biology 21:R587–R589.
- James, B. 2011. Solid Fool’s Gold: Detours on the Way to Conventional Wisdom. Chicago: ACTA Sports.
- Kopac, S., and F. M. Cohan. 2011. A theory-based pragmatism for discovering and classifying newly divergent bacterial species. In Genetics and Evolution of Infectious Diseases, ed. M. Tibayrenc. London: Elsevier.
- Lewis, M. 2003. Moneyball: The Art of Winning an Unfair Game. New York: W. W. Norton.
- Lozupone, C. A., and R. Knight. 2007. Global patterns in bacterial diversity. Proceedings of the National Academy of Sciences of the U.S.A. 104:11436–11440.
- Wiedenbeck, J., and F. M. Cohan. 2011. Origins of bacterial diversity through horizontal gene transfer and adaptation to new ecological niches. FEMS Microbiology Reviews 35:957–976.
- Zimmer, B. 2011. The jargon of the novel, computed. New York Times, July 29.