With digitized text from five million books, one is never at a loss for words
The Library of Babble
Suppose we had a magically cleansed version of the Harvard-Google database, free of all OCR errors. Would the time-series graphs from the Ngram Viewer look much different? I doubt it. Thus for present purposes the noise introduced by the OCR process is a minor distraction we can safely ignore.
But there are purposes beyond the present ones. Google has announced the grandiose goal of digitizing all the world’s books. They may succeed. Some of those books may survive only in digital versions. And someone may even want to read them! If the scanning protocol now in use is the main channel by which we are to transmit 600 years of human culture to future generations, there’s reason to worry.
But for the moment I am not inclined to complain. The n-gram collection released by the Harvard-Google team is a marvelous gift. I would much rather have it now than wait for some unattainable level of perfection. And now that it’s been made public, it’s ours as well as theirs, and we can all help improve it.
Bibliography
- Abello, James, Panos M. Pardalos and Mauricio G. C. Resende (eds). 2002. Handbook of Massive Data Sets. Massive Computing, 4. Dordrecht: Kluwer Academic Publishers.
- Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, the Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak and Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science 331:176–182.
- Norvig, Peter. 2009. Natural language corpus data. In Beautiful Data, edited by Toby Segaran and Jeff Hammerbacher, pp. 219–242. Sebastopol, Calif.: O’Reilly.
- Smith, Ray. 2007. An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition, ICDAR 2007, Vol. 2, pp. 629–633. New York: IEEE.
- Vitter, Jeffrey Scott. 2008. Algorithms and Data Structures for External Memory. Series on Foundations and Trends in Theoretical Computer Science. Hanover, Mass.: Now Publishers.