Science Needs More Moneyball
Baseball's data-mining methods are starting a similar revolution in research
Beyond the field of microbiology, data-mining revolutions are extending across the natural and social sciences (although meteorology and economics, with decades-long access to mountains of data, are still the granddaddies of this approach). In the social sciences, it is particularly interesting to see how data mining has recently helped linguists analyze how words are actually used in writing and speech, as the challenge of producing a dictionary shows. Traditionally, analysis of language use has involved assessment of written texts, usually from a canon of books accepted by experts as exemplars of “proper” usage, a step that required an army of volunteers who sent in quotations to the dictionary editors. Then an appointed set of language experts made subjective decisions about new usage: what is acceptable, what is vulgar and what is vile. A data revolution in linguistics is freeing us from needing the army of volunteers, as well as from the opinions of the learned experts. Language analysis is heading toward a data-driven idiot’s guide that can decide on acceptable usage based on what is actually accepted in writing and in speech.
Various corpora of written and spoken language have emerged online, and these allow extensive analysis of how and where words are used. Entire uploaded texts can be searched and analyzed. The largest is the Oxford Corpus, launched in 2006 and covering texts from the entire Anglosphere. The U.S.-centered Corpus of Contemporary American English (COCA) features a user-friendly website (http://corpus.byu.edu/coca/). A search of these corpora returns a 10-word neighborhood around each use of the word, which yields much information. For instance, a searcher can see whether the word is used in the singular or plural form, which words are frequently co-located with it and so on. In Damp Squid, Jeremy Butterfield describes how these corpora can yield a picture of English (or potentially any language) as it is actually used, as validated by the entire community of writers and speakers.
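To make the idea of a word "neighborhood" concrete, here is a minimal sketch in Python of how such a search might work. It is not the method used by the Oxford Corpus or COCA; the function names, the five-words-on-each-side window and the toy sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and split it into simple word tokens.
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, target, window=5):
    # For every use of `target`, keep up to `window` words on each side
    # (about a 10-word neighborhood in total; the split is an assumption).
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

def collocates(hits):
    # Count how often each word appears in the neighborhoods.
    counts = Counter()
    for left, _, right in hits:
        counts.update(left + right)
    return counts

# Toy text invented for this sketch, not drawn from any real corpus.
sample = ("The main criterion is cost. The criteria are strict. "
          "This criteria is new. Those criteria are old.")

tokens = tokenize(sample)
hits = concordance(tokens, "criteria", window=5)
neighbors = collocates(hits)

for left, word, right in hits:
    print(" ".join(left), f"[{word}]", " ".join(right))
print(neighbors.most_common(3))
```

Each printed line shows one use of the word with its surrounding context, and the final count hints at which words tend to keep it company. A real corpus would add part-of-speech tagging and far more text, but the principle is the same.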
One way that corpus-based analysis bucks expert opinion is in deciding when an evolutionary change in usage has become acceptable, simply by the criterion of how frequently it is used. For example, the word “criteria,” on entering the English language from Greek, maintained its original meaning as the plural of “criterion.” Cringe though we may, our own experiences plus analysis of the Oxford Corpus show that use of “criteria” as the singular is catching up with its use as the plural. The corpus also allows us to note changes in old expressions that still hold meaning for us, but only if we change the words a little. Shakespeare’s “in one fell swoop” is still a popular phrase four centuries later, but only through changing the obsolete adjective “fell” to one that sounds similar and holds a similar meaning, which is “foul.” Despite the resistance of experts, the language is de facto evolving, and the corpus allows us to validate these changes.
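The frequency criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the sample sentences are invented rather than taken from the Oxford Corpus, and a real analysis would need far more text and a more careful treatment of context.

```python
import re

def phrase_count(text, phrase):
    # Count whole-phrase matches, ignoring case.
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

def variant_shares(text, variants):
    # Share of each competing variant among all matches found.
    counts = {v: phrase_count(text, v) for v in variants}
    total = sum(counts.values())
    return {v: (counts[v] / total if total else 0.0) for v in variants}

# Toy text invented for this sketch, not real corpus data.
sample = ("He won in one fell swoop. She lost in one foul swoop. "
          "They left in one fell swoop.")

shares = variant_shares(sample, ["one fell swoop", "one foul swoop"])
for variant, share in shares.items():
    print(f"{variant}: {share:.0%}")
```

Run against a large corpus and repeated over time, a tally of this kind would show whether a newer variant is gaining on the older one, which is the evidence a data-driven dictionary would weigh.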