Crawling toward a Wiser Web
Computing with data sets as large as the World Wide Web was once the exclusive prerogative of large corporations; the Common Crawl gives the rest of us a chance.
To put all human knowledge at everyone’s fingertips—that was the grandiose vision of Paul Otlet, a Belgian librarian and entrepreneur. Starting in the 1890s, he copied snippets of text onto index cards, which he classified, cross-referenced, and filed in hundreds of wooden drawers. The collection eventually grew to 12 million cards, tended by a staff of “bibliologists.” Otlet aimed to compile “an inventory of all that has been written at all times, in all languages, and on all subjects.” The archive in Brussels was open to the public, and queries were also answered by mail or telegram. In other words, Otlet was running a search engine 100 years before Google came along.
In 2015, knowledge at your fingertips is no longer an idle dream. Although not everything worth knowing is on the Internet, a few taps on a keyboard or a touchscreen will summon answers to an amazing variety of questions. The World Wide Web offers access to at least 10¹⁰ documents, and perhaps as many as 10¹²; with the help of a search engine, any of those texts can be retrieved in seconds. Needless to say, it’s not done with three-by-five cards. Instant access to online literature depends on a vast infrastructure of computers, disk drives, and fiber-optic cables. The main work of Otlet’s bibliologists—the gathering, sorting, indexing, and ranking of documents—is now done by algorithms that require no human intervention.
For all the remarkable progress made over the past century, it’s important to keep in mind that we have not reached an end point in the evolution of information technology. There’s more to come. In particular, today’s search engines provide speedy access to any document, but that’s not the same as access to every document. Someday we may have handy tools capable of digesting the entire corpus of public online data in one gulp, delivering insights that can only emerge from such a global view. An ongoing project called the Common Crawl offers a glimpse of how that might work.