Bit Lit

With digitized text from five million books, one is never at a loss for words

Brian Hayes

The Oracle of N-grams

Reading about these experiments gave me the itch to try running some of my own. And it turns out that many interesting questions can be investigated with little effort and no cost using the Google Ngram Viewer. The protocol for this service is simple: Type in a comma-separated series of n-grams, and get back a graph showing their frequency as a function of time. The frequencies are normalized to adjust for linguistic inflation, the expansion of the language as more books are published each year. The normalized frequency is the number of occurrences of an n-gram in a given year divided by the sum of all n-gram occurrences recorded in that year.
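The normalization described above can be sketched in a few lines of Python. This is only an illustration, not the Ngram Viewer's own code: the function names, the data layout, and the counts are all invented for the example.

```python
def parse_query(query):
    """Split a comma-separated query string into individual n-grams."""
    return [term.strip() for term in query.split(",") if term.strip()]


def normalized_frequencies(counts, totals):
    """Divide each n-gram's yearly count by that year's total n-gram count.

    counts: {(ngram, year): occurrences of the n-gram in that year}
    totals: {year: sum of all n-gram occurrences recorded in that year}
    Both layouts are assumptions made for this sketch.
    """
    freqs = {}
    for (ngram, year), n in counts.items():
        total = totals.get(year, 0)
        if total > 0:  # skip years with no recorded text
            freqs[(ngram, year)] = n / total
    return freqs


# Invented numbers: the raw count grows eightfold, the corpus fourfold.
counts = {("biology", 1900): 50, ("biology", 2000): 400}
totals = {1900: 1_000_000, 2000: 4_000_000}
freqs = normalized_frequencies(counts, totals)
```

With these made-up figures the raw count rises by a factor of eight, but the normalized frequency only doubles, because the corpus itself is four times larger in the later year. That is the inflation the normalization is meant to cancel.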

Shown at right is the Ngram Viewer’s output in response to a simple query: a list of six nouns. Interpreted with care, a chart like this one might tell us something about the shifting fortunes of scientific disciplines, but the careful interpretation is crucial. This is a popularity contest among words, not among the concepts they denote. From the graph it would appear that Biology did not exist before about 1840, and that’s close to the truth if we’re speaking of the word itself. But the science of living things goes back further.

The curves have some curious features that I can’t explain, such as synchronized humps in about 1815 and 1875. Was there a real (but short-lived) upsurge in publishing books on the sciences in those years? Or are we seeing some artifact of librarianship or the selection process? The geology curve appears to have a persistent oscillation with a period of roughly 20 years. What, if anything, is that about?
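One way to check whether an apparent oscillation is really there would be to look at the autocorrelation of the yearly series and see which lag gives the strongest match. The sketch below is an assumption-laden illustration: the function names are hypothetical, the series is synthetic rather than the actual geology curve, and only the mean is removed, whereas a real frequency curve would probably need its long-term trend subtracted first.

```python
import math


def autocorrelation(series, lag):
    """Biased sample autocorrelation of a mean-centered series at one lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    if variance == 0:
        return 0.0
    covariance = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return covariance / variance


def dominant_lag(series, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(series, lag))


# Synthetic stand-in for a yearly curve: a 20-year cycle over 200 years.
synthetic = [1.0 + 0.1 * math.sin(2 * math.pi * year / 20)
             for year in range(200)]
period = dominant_lag(synthetic, min_lag=5, max_lag=30)
```

The lower bound on the lag matters: adjacent years are always strongly correlated, so very short lags have to be excluded or they will win by default. A peak found this way would show only that a rhythm exists, not whether it comes from publishing, librarianship, or the selection process.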

The same query words without the initial capital letters yield somewhat different results. So do the corresponding agent nouns—astronomer, biologist, and so forth.

The Ngram Viewer can become an absorbing (and time-consuming) entertainment. You might even turn it into a party game: One player draws the graph, the others try to guess the query. But less-frivolous applications are also within reach. Here’s one possibility: With well-crafted queries, it might be possible to gauge the penetration of various foreign languages into English publications (or vice versa). From a very small sample, I get the impression that the frequency of German words in English text sagged during the World Wars, whereas Russian peaked in the Cold War.
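A rough version of the foreign-language gauge might add up the normalized frequencies of a handful of marker words, year by year, to form a single index. In this sketch the function name, the choice of marker words, and every number are placeholders invented for illustration; none of them are measurements.

```python
def language_index(freqs_by_word, words):
    """Sum per-year normalized frequencies over a list of marker words.

    freqs_by_word: {word: {year: normalized frequency}} (assumed layout)
    words: the marker words to include in the index
    """
    index = {}
    for word in words:
        for year, freq in freqs_by_word.get(word, {}).items():
            index[year] = index.get(year, 0.0) + freq
    return dict(sorted(index.items()))


# Placeholder values only; these are not real Ngram Viewer results.
placeholder_freqs = {
    "zeitgeist": {1900: 2e-7, 1950: 1e-7},
    "angst": {1900: 3e-7, 1950: 4e-7},
}
index = language_index(placeholder_freqs, ["zeitgeist", "angst"])
```

The result depends heavily on which words go into the list, which is why the queries would need to be well crafted: a word that is also an English surname or a common abbreviation would swamp the signal.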

