American Scientist, May-June 2011


Bit Lit

With digitized text from five million books, one is never at a loss for words

Brian Hayes


This past January, Michel, Aiden and a dozen co-authors from Harvard, Google and elsewhere published a research article in Science introducing the new linguistic corpus and presenting some of their early findings. They also announced the Google Books Ngram Viewer, an online tool that allows anyone to query the database. Finally, they made the entire n-gram data set available for download under a Creative Commons license.

Some of the results reported in the Science article show how n-gram data can be used to document changes in the structure of language. One study examines the shifting balance between regular and irregular verbs in English—those that form the past tense with -ed and those that follow older or odder rules. Between 1800 and 2000 six verbs migrated from irregular to regular (burn, chide, smell, spell, spill and thrive) but two others went the opposite way (light and wake). In the case of sneaked vs. snuck, it’s too soon to tell.
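The verb comparison can be pictured as a small calculation: for each year, take the count of the regular past-tense form as a share of both forms combined, and note when that share first passes one half. The Python sketch below is only an illustration of that idea, not the method used in the Science article. The function names are hypothetical, and the counts are invented toy numbers rather than values from the n-gram data set.

```python
def regular_share(regular_count, irregular_count):
    """Fraction of past-tense uses that take the regular (-ed) form."""
    total = regular_count + irregular_count
    if total == 0:
        raise ValueError("no occurrences of either form")
    return regular_count / total


def first_majority_year(counts_by_year):
    """Earliest year in which the regular form exceeds half of all uses.

    counts_by_year maps year -> (regular_count, irregular_count).
    Years with no occurrences are skipped. Returns None if the regular
    form never reaches a majority.
    """
    for year in sorted(counts_by_year):
        regular, irregular = counts_by_year[year]
        if regular + irregular == 0:
            continue
        if regular_share(regular, irregular) > 0.5:
            return year
    return None


# Invented toy counts for a pair such as "burned" vs. "burnt".
# These are NOT real n-gram frequencies.
toy_counts = {
    1800: (20, 80),
    1850: (35, 65),
    1900: (55, 45),
    1950: (70, 30),
    2000: (85, 15),
}

crossover = first_majority_year(toy_counts)
print(crossover)
```

With these toy numbers the regular form takes the majority in 1900. A verb moving the other way, as light and wake did, would show the share falling below one half instead.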

Michel and Aiden describe their work as culturomics, a word formed on the model of genomics (but not yet to be found in the n-gram data set). In the same way that large-scale collections of DNA sequences can reveal patterns in biology, high-volume linguistic data can aid the analysis of human culture. For example, Michel and Aiden examined changes in the trajectory of fame over the past two centuries by counting occurrences of celebrity names. According to the n-gram analysis, modern celebrities come to public attention at an earlier age, and their fame grows faster, but they fade faster, too. “In the future, everyone will be famous for 7.5 minutes,” they remark (attributing the quote to “Whatshisname”).

Another study looked at linguistic evidence of censorship and political repression. In English, the frequency of the name Marc Chagall grows steadily throughout the 20th century, but in German texts it disappears almost entirely during the Nazi years, when the artist’s work was deemed “degenerate.” Similar cases of suppression were found in China, Russia and the United States. (The American victims were the Hollywood 10—writers and directors blacklisted from 1947 until 1960 because of supposed Communist sympathies.)

Having found that known cases of censorship or suppression could be detected in the n-gram data, Michel and Aiden then asked whether new instances could be identified by searching among the millions of time series for those with a telltale pattern. In the case of the Nazi era, the team devised a “suppression index” that compares n-gram frequencies before, during and after the Hitler years. Starting with a list of 56,500 names of people, they found that almost 10 percent showed evidence of suppression in the German-language data, but not in English.
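The article does not spell out how the suppression index is computed here, so the following Python sketch rests on an assumption: it divides the mean frequency of a name during the period of interest by the average of its mean frequencies before and after. A value well below 1 would then suggest a dip confined to that period. The function name, the 1933 to 1945 window and the frequency values are all hypothetical choices made for illustration; the numbers are invented, not drawn from the German-language corpus.

```python
from statistics import mean


def suppression_index(freq_by_year, start, end):
    """Assumed index: mean frequency during [start, end] divided by the
    average of the mean frequencies before and after that window.

    This is one plausible reading of "compares n-gram frequencies before,
    during and after", not the published definition.
    """
    before = [f for y, f in freq_by_year.items() if y < start]
    during = [f for y, f in freq_by_year.items() if start <= y <= end]
    after = [f for y, f in freq_by_year.items() if y > end]
    if not (before and during and after):
        raise ValueError("need data before, during and after the window")
    baseline = (mean(before) + mean(after)) / 2
    if baseline == 0:
        raise ValueError("name never appears outside the window")
    return mean(during) / baseline


# Invented toy frequencies (occurrences per million words) for one name.
toy_series = {
    1925: 4.0,
    1930: 5.0,
    1935: 1.0,
    1940: 0.5,
    1950: 6.0,
    1955: 7.0,
}

index = suppression_index(toy_series, 1933, 1945)
print(round(index, 3))
```

Running the same function over the English-language series for the same name would give the comparison the article describes: a low index in one language and an index near 1 in the other.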


