Science Needs More Moneyball
Baseball's data-mining methods are starting a similar revolution in research
As I see it, the baseball revolution produced an “idiot’s guide” to creating a team roster—a handbook based on things one can learn not through decades of experience and intuition but by applying general quantitative methods. It’s the same kind of approach we should employ more in the sciences. Mountains of data and a capacity for analyzing them have also become available to science in the past few years. Data are now poised to trump the intuition of experts and the “facts” that scientists have championed over the years.
For instance, consider my own field, biology. Every biologist “knows” what a species is: a group of organisms that can interbreed to produce viable, fertile offspring. Biologists have long believed that species defined this way represent the fundamental units of ecology and evolution.
In the case of evolutionary microbiology (my specialty), it is particularly important to be able to recognize all the fundamental units of ecology among closely related bacteria. We especially need to distinguish those that are dangerous from those that are not and those that are helpful from those that are not. Indeed, we would like to identify all the bacterial populations that play distinct ecological roles in their communities.
As in baseball, the discovery of bacterial diversity has shifted from reliance on the subjective judgment of experts to objective and universal statistical methods. Originally, discovery and demarcation of bacterial species required a lot of expertise with a particular group of organisms, involving difficult measures of metabolic and chemical differences. To make the taxonomy more accessible, decades ago the field complemented this arduous approach with a kind of idiot’s guide, in which anyone could use widely available molecular techniques to identify species by a simple criterion, such as a certain level of overall DNA sequence similarity.
One popular universal criterion is to identify species as groups of organisms that are at least 99 percent similar in a particular universal gene. The problem is that, as in baseball, where batting average, RBIs and home runs were used to supplement expert knowledge, nobody in microbiology tested whether the new molecular techniques actually came closer to solving the problem of recognizing the most closely related species.
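To make the similarity rule concrete, here is a minimal sketch in Python of how a fixed cutoff carves strains into "species." Everything in it is an assumption made for illustration: the helper names, the toy 100-base sequences and the single-linkage grouping rule are invented here, and the sequences are taken to be already aligned. It is not the procedure of any particular taxonomy tool.

```python
from itertools import combinations

# Hypothetical helpers and toy data, invented for illustration only.
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}

def mutate(seq, positions):
    """Return a copy of seq with a substitution at each listed position."""
    chars = list(seq)
    for i in positions:
        chars[i] = SWAP[chars[i]]
    return "".join(chars)

def percent_identity(a, b):
    """Percent of matching positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def group_by_threshold(seqs, threshold):
    """Single-linkage grouping: strains share a group when a chain of
    pairwise identities at or above the threshold connects them."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

reference = "ACGT" * 25  # made-up 100-base marker gene
toy_strains = {
    "strain_A": reference,
    "strain_B": mutate(reference, [10]),          # 1 difference from strain_A
    "strain_C": mutate(reference, [20, 40, 60]),  # 3 differences from strain_A
}

groups_at_99 = group_by_threshold(toy_strains, 99.0)
groups_at_97 = group_by_threshold(toy_strains, 97.0)
print(len(groups_at_99), len(groups_at_97))
```

With these made-up strains, the 99 percent cutoff yields two groups and a 97 percent cutoff yields one. The number of "species" depends entirely on a number chosen in advance, and nothing in the rule checks whether the resulting groups differ ecologically.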
Unfortunately, microbiology’s current DNA-based idiot’s guide, as well as the expert-driven metabolic criteria that preceded it, has yielded species with unhelpfully broad dimensions. For example, Escherichia coli contains strains that live in our guts peaceably, as well as various pathogens that attack the gut lining and others that attack the urinary tract. Moreover, established fecal-contamination detection kits that are designed to identify E. coli in the environment are now known to register a positive result with E. coli relatives that normally spend their lives in freshwater ponds, with little capacity for harming humans. And E. coli is not alone—there is a Yugoslavia of diversity within the typical recognized species: Much like the veneer of a unified country that hid a great diversity of ethnicities and religions, E. coli (and most recognized species) contains an enormous level of ecological and genomic diversity obscured under the banner of a single species name.
We can fix this confusion the same way that baseball improved its data analysis: by letting the game, or in our case nature, decide which stats best predict what we most want to know. In microbiology the trick is to let the bacteria tell us which DNA sequence approach most accurately identifies the bacteria that are significantly different in their habitats and ways of making a living. Two teams, Martin Polz’s group at the Massachusetts Institute of Technology and my group at Wesleyan and Montana State Universities, have developed computer algorithms for identifying groups of bacteria specialized to different habitat types within an officially recognized species. These algorithms reject the expert-based criteria for how much diversity should be placed within a species. Instead, they analyze the dynamics of bacterial evolution to let the organisms themselves tell us the DNA sequence criterion that best demarcates ecologically distinct populations for a particular group of bacteria.
Another opportunity for discovery in biology through data mining stems from the new Human Microbiome Project. Here, DNA sequences are collected from various bacteria-laden human habitats, such as the gut, mouth, skin and genitals, with samples taken from individuals of different age, sex, health, weight and diet.
For example, Dusko Ehrlich of the French National Institute for Agricultural Research and his colleagues recently analyzed the bacterial genes purified from the feces of 39 humans from six European countries, amounting to about 100 million bases of bacterial DNA per person. They attempted to identify bacterial biochemical functions associated with age and body mass. Their intuition suggested various guesses for the identity of these genes, which were largely supported, but data-driven methods identified genes that gave much stronger relationships. One important data-driven discovery indicated a negative relation between obesity and the microbes’ capacity for harvesting energy.
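The contrast between testing a hunch and letting the data rank every candidate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the body-mass values and the abundance numbers are all invented, and a plain Pearson correlation stands in for whatever statistics the actual study used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_by_association(trait, abundances):
    """Score every function against the trait and sort by strength of
    correlation, regardless of which ones anyone expected to matter."""
    scores = {name: pearson(trait, values) for name, values in abundances.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up numbers for six hypothetical people; not data from any study.
body_mass_index = [20, 22, 25, 28, 31, 35]
function_abundance = {
    "hypothesized_function": [5, 7, 4, 8, 6, 9],
    "unanticipated_function": [1, 2, 3, 4, 5, 7],
    "unrelated_function": [3, 3, 4, 3, 4, 3],
}

ranking = rank_by_association(body_mass_index, function_abundance)
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

In this toy case the function chosen by intuition does correlate with body mass, but a function nobody singled out in advance ranks first. A real analysis would screen a great many functions at once and would need to correct for multiple comparisons, which this sketch omits.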
Ongoing massive sequencing projects in human, marine and soil environments allow us to characterize the diversification of bacteria: to discover the most newly divergent bacterial species, to characterize them as specialized to different habitats and to identify the biochemical functions most important in each habitat. However, the approach depends critically on how well we describe the habitats we sample.