Indexing the Information Age
By Monica Westin
Search engines can locate specific pages among the trillions of objects on the internet thanks to tagging norms established in 1995: the Dublin Core metadata standard.
One weekend in March 1995, a group of librarians and web technologists found themselves in Dublin, Ohio, arguing over what single label should be used to designate the person responsible for the intellectual content of any file that could be found on the World Wide Web. Many were in favor of using something generic and all-inclusive, such as “responsible agent,” but others argued for the label of “author” as the most fundamental and intuitive way to describe the individual creating a document or work. The group then had to decide what to do about the roles of nonauthors who also contributed to any given work, such as editors and illustrators, without unnecessarily expanding the list. New labels were proposed, and the conversation started over.
The group was participating in a workshop hosted by OCLC (at the time the Online Computer Library Center, now officially OCLC, Inc.) and the National Center for Supercomputing Applications (NCSA) in an attempt to create a concise but comprehensive set of tags that could be added to every document, from text files to images and maps, that had been uploaded to the web. Arguments about these hypothetical tags, often based on wildly different assumptions about the future of the internet, raged over the next few days and continued long into the nights. By Saturday afternoon, the workshop organizers were in despair of ever being able to reach any kind of consensus. Yet, by the end of the long weekend, the eclectic crowd had created a radical system for describing and discovering online content that still directly powers web searches today, and which paved the way for how all content is labeled and discovered on the open web.
Two developments added importance, and pressure, to the 1995 workshop: NCSA’s launch of Mosaic, the first widely available web browser; and the subsequent rapid pace of content being uploaded online. Mosaic, which became available to the public in 1993, had a graphical point-and-click interface that anyone could use. It removed the need for users to write their own interfaces to explore the web, so suddenly anyone could go online, and public use of the web surged. Mosaic also allowed users to upload their own text documents, images, and videos, leading to a spike in content scattered across the web. Meanwhile, a large and growing body of scholarly material had been moving online throughout the early 1990s, and university librarians were among the first to flag how difficult these files were to find.
By 1995, there were about a half-million unique, content-rich pages, text files, and other “document-like objects,” as the workshop participants called them, on the web, but there was no good way to search for them without already knowing where and what they were. Early web-search tools, such as Archie and Gopher, could query only the titles of files or their locations, meaning that you had to know either the exact name of the file you were looking for or exactly where it was located (its full uniform resource locator, or URL). So, for example, if you wanted to find a copy of an essay that you thought someone had posted online, you couldn’t just search for the author’s name or some keywords. This barrier made many, if not most, online documents essentially inaccessible to most people. As Stu Weibel of OCLC’s research group wrote in his report on the 1995 workshop: “The whereabouts and status of this [online] material is often passed on by word-of-mouth among members of a given community.”
To make these files discoverable for users, some kind of tagging system with top-level information, such as author and subject, was needed: in other words, metadata. In this context, metadata can be thought of as a short set of labels associated with a document that allow you to both find the document and know what it is without opening it.
Librarians have been creating bibliographic metadata for thousands of years. For example, at the Library of Alexandria in ancient Egypt, each papyrus scroll had a small tag attached to it with information about its title, author, and subject, so that readers didn’t need to unroll it to know what it was. Librarians could also use these tags to return the scrolls to the correct pots or shelves.
Now the web needed the same thing.
The workshop to solve this problem began as a hallway conversation in October 1994 at the second International World Wide Web Conference in Chicago. Weibel was standing around drinking coffee in the hallway with five or six people, including Terry Noreault, his boss; and Eric Miller, his colleague on the OCLC research team. As Weibel remembers: “We were talking about how nice it would be if there were easier ways to find the 500,000 individually addressable objects [documents] on the web. . . . I looked at my boss, and he just nodded and agreed to organize a workshop for it.”
The workshop was quickly co-organized by Weibel and Miller, who wanted to be able to take the results to the next web conference in Darmstadt, Germany, the following spring. In order to develop a system that worked, they knew they needed input from three different groups of people: encoding and markup experts in specialized disciplines, who could help ensure that metadata was effectively associated with the online files; computer scientists; and librarians—or, as multiple people who attended the first workshop told me with deep affection, “the freaks, the geeks, and the ones with sensible shoes.”
Some 52 people showed up to the workshop in Dublin, Ohio. The variety of attendees, and of their perspectives on how documents on the web should be organized, was striking. As Priscilla Caplan, a librarian who attended the conference, wrote: “There were the IETF [Internet Engineering Task Force] guys, astonishingly young and looking as if they were missing a fraternity party to be there. There were TEI [Text Encoding Initiative] people . . . geospatial metadata people . . . publishers and software developers and researchers.” All had very different goals, but “nearly everyone agreed that there was a tremendous need for some standard.”
In 1995, most librarians were using MARC (Machine-Readable Cataloging) to create metadata for their library catalogs. MARC records are complex, extremely long, and require deep expertise to create; these kinds of elaborate descriptions could never work at scale for the entire web. Automated approaches weren’t on the table back then, and it soon became clear to all attendees, even those who had shown up thinking that they might be tweaking an existing system, that the metadata standard for the web would have to be something entirely new: simple enough for anyone to label their own documents as they posted them online, but still meaningful and specific enough for other people and machines to find and index them. A brand-new, simple, and succinct metadata system would mean one agreed-upon way of adding the tags, with the same kinds of information in each, for the approximately half-million existing items online and the billions more that everyone knew were coming.
Creating these labels involved determining not just what information was needed to locate the files online at that moment, but also what might be needed later as web content continued to snowball. There was no formalized voting or veto process to come up with the system; each piece of metadata was created through consensus, compromise, and, occasionally, impassioned disputes. Much of the arguing concerned the nature of a future no one could predict.
For example, many attendees didn’t anticipate that automated search engines were coming, though some of the more technical people saw them on the horizon and pushed for requirements supporting geolocated discovery. As Miller says: “I remember introducing the [geolocation] coverage element and getting a lot of blowback. I made the point that coverage is going to be local as well as global, like: Find a restaurant near me. We were trying to push the envelope so that we would be ready when other technologies advanced and other services became available.” Other attendees saw geospatial data as an unnecessary detail included to assuage a particular person or community, and they weren’t sure it made sense to include it given the need to keep the system lean.
In the beginning, the disagreements seemed insurmountable, and Miller felt disheartened. “The first night we thought: This is gonna fail miserably,” he said. “At first, nobody saw eye to eye or trusted each other enough yet to let each other in and try to figure out the art of the possible.” But as concessions and then agreements were made, people began to feel energized by the creation of a new system, even if imperfect; one piece at a time, their system could bring the content of the web within reach for everyone. As Caplan remembers: “By the second day, there was a lot of drinking and all-night working groups. We were running on adrenaline and energy. By the last day, we realized we were making history.”
The result of all the arguments was Dublin Core (DC) metadata, the first metadata standard for describing content on the web. The final set of DC tags, or metadata elements, was drawn from a longer list that had been developed, iterated, analyzed, argued over, and eventually cut down to 13. In his workshop report, Weibel provided an example of the elements, using the University of Virginia Library’s record of Maya Angelou’s poem “On the Pulse of Morning,” transcribed by the library from Angelou’s performance at U.S. President Bill Clinton’s inauguration:
Subject: Poetry
Title: On the Pulse of Morning
Author: Maya Angelou
Publisher: University of Virginia Library Electronic Text Center
Other Agent: Transcribed by the University of Virginia Electronic Text Center
Date: 1993
Object: Poem
Form: 1 ASCII file
Identifier: AngPuls1
Source: Newspaper stories and oral performance of text at the presidential inauguration of Bill Clinton
Language: English
This 11-element example does not include Coverage (the location or duration of the object) or Relation (relationship to other objects), which round out the original 13. In 1998, Weibel participated in a Network Working Group that added two more elements: Description (textual space for abstracts or content descriptions) and Rights (information about intellectual property). This 15-element Dublin Core vocabulary became the basis for future tagging.
Fiercely truncated compared with a library catalog record (the MARC format allows for up to 999 fields) and simple enough that anyone could use it, DC was revolutionary in its creation of a very new middle ground: a record that is “more informative than an index entry but is less complete than a formal cataloging record,” as Weibel wrote in his 1995 report. DC tags could be created manually and easily by anyone, not just librarians, allowing many more documents to be described in a standardized way so that automated tools could index them comprehensively. The ease and simplicity of DC tags, while still being specific enough to be meaningful, were key to their success. As Miller explains, DC “makes the simple things simple, and the complex things possible.”
Today, DC looks very familiar, even obvious, in part because it has so deeply influenced the way that metadata is embedded into web pages. Metadata tags, or metatags, are now fundamental infrastructure for the open web, where they usually take the form of elements in HTML (Hypertext Markup Language), the most widely used system for displaying content in a web browser. HTML metatags label pages for crawling and indexing by search engines such as Google and other web-scale search services. For example, the metatag <meta name="dc.Author" content="Maya Angelou" /> is flagged and parsed by web indexers to mean that the author of the content on the web page is Maya Angelou. The information embedded in metatags is used for matching queries to search engine result pages; much of search engine optimization work is just adding comprehensive, detailed metatags.
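As a minimal sketch of how such tags sit in a page, assuming the same lowercase “dc.” element names used in the example above (real-world pages vary in naming and casing, and this file is hypothetical), the head of an HTML document carrying DC metatags might look like this:

  <head>
    <title>On the Pulse of Morning</title>
    <!-- Dublin Core metatags: top-level labels that web indexers parse -->
    <meta name="dc.Title" content="On the Pulse of Morning" />
    <meta name="dc.Author" content="Maya Angelou" />
    <meta name="dc.Subject" content="Poetry" />
    <meta name="dc.Date" content="1993" />
    <meta name="dc.Language" content="English" />
  </head>

A crawler that recognizes the dc. prefix can index the page under Angelou’s name without ever rendering the page itself.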
The original DC metatags are still used globally. They also directly influenced many others, from the Web 2.0 metatags for social media posts to generic and specific tags for various types of content-rich HTML pages on the web. For example, the Poetry Foundation’s electronic version of “On the Pulse of Morning” contains multiple sets of metadata embedded in the HTML source code: from standard DC metatags, such as dcterms.Title; to tags for X (formerly known as Twitter), such as the twitter:image tag used to add an image when the poem is shared on the platform; and the Open Graph tag og:see_also that Facebook uses to point users to related content.
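A sketch of how those vocabularies coexist in a single page head (the attribute values and URLs below are illustrative placeholders, not copied from the Poetry Foundation’s actual source code):

  <!-- Dublin Core: generic description for any web indexer -->
  <meta name="dcterms.Title" content="On the Pulse of Morning" />
  <!-- X/Twitter card: the image attached when the page is shared there -->
  <meta name="twitter:image" content="https://example.org/images/pulse-of-morning.jpg" />
  <!-- Open Graph: related content that Facebook can surface to users -->
  <meta property="og:see_also" content="https://example.org/poems/related" />

Each platform reads only the tags in its own vocabulary, so the sets can be layered in one document without interfering with one another.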
DC has directly influenced and shaped the past 30 years of finding documents on the web. Miller, who moved from OCLC to the World Wide Web Consortium (W3C), the group that oversees standards, protocols, and languages for the entire web, told me, “You can connect any web standard back to certain DC characteristics, lessons learned, or principles. . . . A lot of the stuff that came out of the global standards was directly to solve the industry standards that DC defined.”
The summer after the 1995 Dublin Core workshop, librarian Lorcan Dempsey, then based at the University of Bath and leading the U.K. Office for Library and Information Networking, offered to help spread the standard. Dempsey helped run subsequent workshops, initiating with Weibel the annual conference series known today as the Dublin Core Metadata Initiative, which still meets to tweak the standard and argue over what changes will be needed in the future. But the core set of metatags has remained remarkably stable. As Weibel told me: “After two and a half days, we had reached general consensus about what the elements should be, as well as their characteristics. People are still arguing at Dublin Core conferences, but we did a pretty good job of figuring out the basic skeleton.”
Uptake of the brand-new standard was quick. In 1997, a team in Germany, led by Roland Schwänzl at Osnabrück University, added DC tags to their content pages, representing the first real-world use of the metatags. “It marked the first time that I realized the impact of this navel-gazing that we were doing,” Weibel said. “Other people without an intellectual stake went and built systems based on it.” Once the standards were encoded, the “navel-gazing” had become legacy code, and DC was real.
Everyone I interviewed who attended the first few workshops still seems surprised that they managed to reach consensus at all after just a few days, and they all assert that this agreement was as remarkable a product as the list of metadata elements itself. The structure of productive disagreement, and the sharing of different visions of where things were going, were part of the success. “It almost didn’t matter what we ended up with,” Weibel said. “It wasn’t rocket science—there was no magic about it. . . . We simply had shared problems that needed to be addressed, and it worked.” If there was magic in DC, it was the social process, not the technology, set against a backdrop of profound optimism at the possibility of a new world.
What made the workshop difficult was also what made the standards successful: the intellectual diversity of those involved. The consensus-driven success of DC, led by a diverse, noncommercial group from wildly different fields, would frankly be almost impossible today; even the concept of bringing together a heterogeneous web community to solve web-sized problems with neutral, open standards now seems quaint. DC also encouraged broad use of metadata to improve description and organization of digital resources, leading to a more connected and discoverable web ecosystem. That ecosystem is disappearing rapidly in the platformed, hyper-corporate space of the current web.
The web is now moving firmly in the opposite direction from DC’s values of transparent information discovery. Large commercial tech platforms work hard to keep users in their walled gardens, relying on in-app black-box algorithms rather than linking out to external locations. Generative AI tools are incentivized to replace exploration of the open web, and almost none of the current text-based generative AI tools cite their sources.
These changes mark what is often described as the end of the open web, a new paradigm in which corporations centralize services and invisibly control what you see when you try to find something, using business models that are ever more incentivized to move away from interoperable web standards. The historic period of the open web, which arguably began with Mosaic 30 years ago, is receding and perhaps has already ended. The story of DC bookmarks the historic moment of when this era was new, and when the web inspired widespread creativity and optimism.
This article is adapted from a version previously published on Aeon, aeon.co.
Click "American Scientist" to access home page
American Scientist Comments and Discussion
To discuss our articles or comment on them, please share them and tag American Scientist on social media platforms. Here are links to our profiles on Twitter, Facebook, and LinkedIn.
If we re-share your post, we will moderate comments/discussion following our comments policy.