With digitized text from five million books, one is never at a loss for words
Michel and Aiden set out to study language and culture, but they have also created a resource for the study of optical character recognition.
Based on a random sample from the 1-gram files, I estimate that 15 percent of the entries are affected in some way by OCR errors or anomalies. This sounds horrendous, but it does not mean that the OCR program made mistakes on 15 percent of the words it read; the word-recognition error rate is probably well under 1 percent. The problem is that there’s only one way to read a word correctly, but there are countless ways to go wrong. Suppose the program reads 1,000 words and gets 990 of them right. If it makes a different mistake on each of the remaining 10 words, then the final list has 11 entries, 10 of which are erroneous.
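The arithmetic is easy to check with a toy example. The Python sketch below builds a list of 1,000 readings of a single word, 990 of them correct and 10 wrong in different ways, and then compares the error rate per word read with the error rate per distinct entry. The misspellings are illustrative stand-ins, not data drawn from the 1-gram files.

```python
from collections import Counter

# 990 correct readings of one word, plus 10 distinct misreadings.
# The misspellings are illustrative only, not taken from the real data.
correct = ["Recovery"] * 990
misreads = ["Rccovery", "Reeovery", "Reoovery", "Rerovery", "Revovery",
            "Recovcry", "Rccovcry", "Recoverv", "Reeoverv", "Rocovery"]

counts = Counter(correct + misreads)  # one entry per distinct spelling

# Error rate measured over words read (tokens).
token_error_rate = sum(n for w, n in counts.items() if w != "Recovery") / sum(counts.values())

# Error rate measured over distinct entries in the resulting list.
entry_error_rate = sum(1 for w in counts if w != "Recovery") / len(counts)

print(len(counts))                  # 11 entries
print(token_error_rate)             # 1 percent of the words read
print(round(entry_error_rate, 2))   # about 0.91 of the entries
```

A 1 percent error rate per word read thus becomes a 91 percent error rate per entry, because every correct reading collapses into a single line of the list while each distinct mistake gets a line of its own.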
Because of this effect, efforts to tidy up OCR errors would not only improve the accuracy of the data set but would also reduce its bulk. Entries for Rccovery, Reeovery, Reoovery, Rerovery and Revovery could all be merged into Recovery. But making such repairs is a daunting task, especially if you want to preserve other variations and errors, introduced not by the OCR process but by authors and printers.
Consider: bomemaker is probably an OCR error; invertibrate is probably a human error; cerimoniale is probably not English. What about haccalaureate? Is that an OCR error or is it a degree granted by a programming school? A human reader can make judgments in such cases, but hand-grooming multigigabyte files is not an attractive prospect. We need a mechanized solution.
A few special cases look doable. The OCR program has encoded some instances of “fi” and “fl” as ligatures, combining the two letters into a single character, while other instances remain as pairs of letters. For most uses of the data set, it would probably be better to treat all these cases consistently; this seems easy to accomplish.
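Here is a minimal sketch of the ligature cleanup in Python, assuming the ligatures appear as the Unicode characters U+FB01 (ﬁ) and U+FB02 (ﬂ) and that the counts have been loaded into a mapping from word to number of occurrences. The function names are my own, not part of any existing tool.

```python
from collections import Counter

# Map each single-character ligature to the two-letter sequence it stands for.
LIGATURES = str.maketrans({"\ufb01": "fi", "\ufb02": "fl"})

def expand_ligatures(word):
    """Return the word with fi and fl ligatures written as letter pairs."""
    return word.translate(LIGATURES)

def merge_counts(counts):
    """Combine entries that differ only in their use of ligatures."""
    merged = Counter()
    for word, n in counts.items():
        merged[expand_ligatures(word)] += n
    return merged

# Made-up counts: the same word stored once with a ligature and once without.
sample = {"\ufb02ow": 3, "flow": 7, "\ufb01sh": 2}
print(merge_counts(sample))
```

Unicode compatibility normalization (`unicodedata.normalize("NFKC", word)`) would also expand these ligatures, but it rewrites many other characters as well, so an explicit two-entry table seems the more cautious choice.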
More challenging but perhaps still within reach is the problem of the “long s” (ſ), which was part of English orthography through the 18th century.
OCR programs (like many human readers) tend to interpret this character as the letter “f,” leading to an abundance of comical fricative spellings such as quickfilver and abfceffes. I suspect that an algorithm could successfully correct a large fraction of these misreadings without turning too many flaws into slaws.
The idea is to make a change only when the “s” form of a word is substantially more common than the “f” form and when the “f” version has a strong peak of popularity before 1800. But I have not yet tried implementing this algorithm, so I don’t know how many new errors it will introduce.
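As a rough illustration of the rule, and not a tested implementation, the idea might be sketched in Python as follows. Everything specific here is an assumption: the data layout (a dictionary mapping each word to its counts by year), the function names, the 10-to-1 ratio standing in for “substantially more common,” and the use of the share of occurrences before 1800 as a stand-in for a “strong peak.” Only lowercase “f” is considered, since the long s had no capital form.

```python
from itertools import combinations

def s_variants(word, max_f=6):
    """Yield every spelling made by changing one or more 'f' to 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    if len(positions) > max_f:  # keep the number of combinations small
        return
    for k in range(1, len(positions) + 1):
        for subset in combinations(positions, k):
            chars = list(word)
            for i in subset:
                chars[i] = "s"
            yield "".join(chars)

def long_s_correction(word, yearly, ratio=10.0, early_share=0.5, cutoff=1800):
    """Return the preferred 's' spelling of word, or None to leave it alone.

    yearly maps each word to a {year: count} dictionary (assumed layout).
    ratio and early_share are arbitrary thresholds chosen for illustration.
    """
    f_counts = yearly.get(word, {})
    f_total = sum(f_counts.values())
    if f_total == 0:
        return None
    # Require that most occurrences of the 'f' form fall before the cutoff.
    early = sum(n for year, n in f_counts.items() if year < cutoff)
    if early / f_total < early_share:
        return None
    # Among the 's' forms that are much more common, keep the most common.
    best, best_total = None, 0
    for candidate in s_variants(word):
        s_total = sum(yearly.get(candidate, {}).values())
        if s_total >= ratio * f_total and s_total > best_total:
            best, best_total = candidate, s_total
    return best

# Made-up counts for illustration only.
yearly = {
    "quickfilver": {1750: 40, 1780: 30, 1850: 2},
    "quicksilver": {1750: 5, 1850: 500, 1950: 400},
    "flaws": {1750: 3, 1850: 200, 1950: 300},
    "slaws": {1950: 4},
}
print(long_s_correction("quickfilver", yearly))
print(long_s_correction("flaws", yearly))
```

With these invented counts the sketch changes quickfilver to quicksilver and leaves flaws alone. Because far fewer books survive from before 1800, a serious version would presumably compare relative frequencies per year and not raw counts.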
With other OCR quirks, the chronological clue is lacking, and so we must resort to blunter tools such as a matrix estimating the probability that any one character will be mistaken for another. No doubt much can be accomplished in this way. On the other hand, if these mistakes were easy to fix, they would have been fixed already.
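To give a sense of what such a matrix might look like in use, here is a small Python sketch. It is one possible approach under stated assumptions, not an established method for this data set: the probabilities are invented, only one-for-one character substitutions are handled, and the list of trusted words with counts is hypothetical. Each candidate word is scored by the probability that OCR would turn it into the observed spelling, multiplied by how common the candidate is.

```python
import math

# Hypothetical P(OCR output | printed character), keyed as (printed, output).
# The numbers are invented for illustration, not measured from any scanner.
CONFUSION = {
    ("c", "e"): 0.02,
    ("c", "o"): 0.01,
    ("c", "r"): 0.002,
    ("c", "v"): 0.001,
}
P_CORRECT = 0.97  # assumed chance that a character is read correctly

def log_likelihood(seen, printed):
    """Log-probability that the printed word was read as seen."""
    if len(seen) != len(printed):
        return float("-inf")  # insertions and deletions are not modeled
    total = 0.0
    for s, p in zip(seen, printed):
        prob = P_CORRECT if s == p else CONFUSION.get((p, s), 0.0)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total

def best_correction(seen, lexicon_counts):
    """Pick the trusted word most likely to lie behind the seen spelling."""
    total = sum(lexicon_counts.values())
    best, best_score = None, float("-inf")
    for word, n in lexicon_counts.items():
        score = log_likelihood(seen, word) + math.log(n / total)
        if score > best_score:
            best, best_score = word, score
    return best

# Hypothetical trusted word list with positive counts.
lexicon = {"Recovery": 900, "Recovers": 100}
print(best_correction("Reeovery", lexicon))
print(best_correction("Rexovery", lexicon))
```

With this toy matrix, Reeovery is mapped back to Recovery, while Rexovery is left uncorrected because the matrix assigns no probability to any substitution that would produce the “x.”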