With digitized text from five million books, one is never at a loss for words
Terror of Terabytes
The Viewer is an excellent oracle for the n-gram data, but it answers just one kind of question: How has the frequency of a specific n-gram varied over time? Many other questions cannot be expressed in this form. You might want to know which n-grams are the most common, or how word frequency varies as a function of word length, or which words entered the printed record first. To answer these questions and others like them, you’ll have to work a little harder. For starters, you’ll have to download the data, which is not a trivial undertaking.
The complete set of English n-grams weighs in at 340 gigabytes of compressed files, which expand to fill 2.5 terabytes of disk space. I have not yet tried to swallow all that data; doing anything interesting with it would require more hardware. I have been working solely with the English 1-gram files, which amount to 10 gigabytes when decompressed. I’ve been able to manage them on a laptop, although I’ve needed a refresher course in “external” algorithms—those that manipulate data on disk rather than in main memory. (This was a common practice when memory topped out at 64 kilobytes, but that was a long time ago.)
The 1-gram data are scattered over 10 files, which I merged into one. Then I set about gathering some basic facts and figures. In the English 1-gram data set there are 7,380,256 unique words, which occur a total of 359,675,008,445 times. Thus the mean number of occurrences per word is 48,735—but that’s a somewhat misleading number, because the distribution is highly skewed. (The top 100 words account for half of all word occurrences.) A more meaningful statistic is the median, which is 166.
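As a rough illustration of how figures like these might be gathered, here is a minimal Python sketch. It is not the method actually used for the numbers above. It assumes a tab-separated record layout of word, year, match count, followed by further columns, and it assumes that all records for a given word sit on adjacent lines of the merged file. The names `word_totals` and `summarize` and the sample records are invented for the example.

```python
import io
import statistics
from itertools import groupby


def word_totals(lines):
    """Yield (word, total_count) pairs from tab-separated 1-gram records.

    Assumed layout (an assumption, not confirmed by the text):
        word <TAB> year <TAB> match_count <TAB> ...
    All records for one word are assumed to be adjacent, so only one
    word's running total is held in memory at a time.
    """
    records = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for word, group in groupby(records, key=lambda rec: rec[0]):
        yield word, sum(int(rec[2]) for rec in group)


def summarize(lines):
    """Collect per-word totals and report the basic statistics."""
    totals = [count for _, count in word_totals(lines)]
    return {
        "unique_words": len(totals),
        "occurrences": sum(totals),
        "mean": sum(totals) / len(totals),
        "median": statistics.median(totals),
    }


# Tiny made-up sample standing in for the merged 1-gram file.
SAMPLE = io.StringIO(
    "aardvark\t1990\t3\t3\t2\n"
    "aardvark\t1991\t5\t4\t3\n"
    "the\t1990\t900\t50\t10\n"
    "the\t1991\t1100\t60\t12\n"
    "zebra\t1991\t12\t6\t4\n"
)

stats = summarize(SAMPLE)
print(stats)
```

Even in this toy sample the skew shows up: one very common word pulls the mean far above the median. The raw records are read one line at a time, and only the per-word totals are kept in memory. For a few million distinct words that list is modest next to the 10 gigabytes of text.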
Which are the most common 1-grams? Setting aside a few common marks of punctuation, the highest-frequency words are: the, of, and, to, in, a, is, that, for, was. Another trivia question: What’s the longest word in the corpus? I think the longest that’s really a word and that wasn’t invented just to set records is phosphoribosylaminoimidazolecarboxamide.
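Both trivia questions reduce to a single pass over the per-word totals. The sketch below is one hypothetical way to ask them in Python, given (word, count) pairs. The helper names, the punctuation set and the sample counts are all invented for illustration. The alphabetic filter is only a crude stand-in for the judgment call about what is "really a word."

```python
import heapq


def top_words(totals, k=10, skip=frozenset()):
    """Return the k (word, count) pairs with the largest counts.

    heapq.nlargest keeps only k candidates at a time, so the full
    list of words never has to be sorted.
    """
    candidates = ((word, count) for word, count in totals if word not in skip)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


def longest_word(totals):
    """Return the longest entry made up of letters only."""
    return max((word for word, _ in totals if word.isalpha()), key=len)


# Made-up totals; the real input would be millions of pairs.
sample_totals = [
    ("the", 2000), (",", 1500), ("of", 1200), ("and", 900),
    ("zebra", 12), ("x9q", 40), ("aardvark", 8),
]

# Hypothetical set of punctuation marks to set aside.
PUNCTUATION = frozenset({",", ".", ";", ":", '"', "'", "(", ")", "?", "!"})

top3 = top_words(sample_totals, k=3, skip=PUNCTUATION)
longest = longest_word(sample_totals)
print(top3)
print(longest)
```

A filter like `isalpha` removes numbers and letter-digit mixtures, but it cannot tell a genuine chemical name from a string coined to set a record. That last step still needs a human reader.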
Prowling around in the data with a text editor reveals a multitude of oddities. Choose an entry at random, and it’s likely to be a word you’ve never seen before. Indeed, there’s a good chance it’s not a word at all in the strict sense, but rather a number or a mixture of letters and digits, or something even more mysterious. For example, my eye fell on this curious “word”:
How could such a zany-looking string of letters turn up at least 40 times in published books? As it happens, we have a tool for answering such questions, namely Google Books. Since the Google OCR program produced this string, the Books search engine should be able to find it. And there it is: a row of letters in a word-search game—a game that has apparently been reprinted in dozens of puzzle books.