Logo IMG


Bit Rot

Brian Hayes

Standardized and Generalized

Do any existing file formats offer versatility and the plausible hope of longevity?

Postscript and TEX have already been mentioned as highly readable formats with open standards. On the other hand they are not well-suited to abstract markup. Postscript is essentially a write-only language: Almost anything can be turned into Postscript, but going the other way is difficult. TEX, the creation of Donald Knuth of Stanford University, is the preferred formatting language in the physical sciences and mathematics. With a vast literature already committed to the format, TEX has excellent prospects for long-term survival. But it too is fundamentally a visual formatting language; abstract markup is possible, but it takes discipline.

The leading candidate for a file format for the ages is SGML, the Standard Generalized Markup Language, developed in the early 1980s by Charles F. Goldfarb of IBM and now an international standard. SGML is actually a language for defining markup languages. In the SGML formalism, all elements of a document are labeled by descriptive tags embedded in the text. A title, for example, might be surrounded by the tags [title] and [/title]. Visual presentation is deferred until these tags are processed in a later stage of formatting. New tags, and whole new classes of tags, can be defined as needed.

SGML has fervent adherents. It has been received with particular enthusiasm by organizations that produce quantities of highly structured documents, such as technical manuals. It has also been adopted by several journal publishers, including the American Institute of Physics and the IEEE Computer Society. And yet SGML has not won the hearts of content providers everywhere. Maybe it's just that the name is so forbiddingly standardized and general, but many consider SGML unwieldy and cumbersome. The levels of indirection annoy people. First you define a tag for a title, then you enter the tag and the title into the document, then you define how the tagged title is to be formatted in the finished output. That's a lot of rigmarole when a word processor allows you to just click on the title and apply the formatting directly.

A decade ago James H. Coombs, Allen H. Renear and Steven J. DeRose of Brown University wrote an eloquent plea for SGML and other abstract markup languages. They argued that we all mark up our texts anyway—even conventional punctuation is a markup language—so we might as well choose the style of markup that captures the most important and long-lived information. They also argued that inserting abstract tags requires less cognitive effort than doing typographic formatting; it's easier to remember "This is a [title]" than "Titles are set in 28-point Bodoni Bold." I find the arguments persuasive, and yet I note that the users of Microsoft Word still outnumber the users of SGML .

For a time it seemed that SGML might finally catch on through the reflected glitter of HTML, which began as a kind of SGML-lite—smaller, simpler and lacking the facility to define new tags. The hope was that people would learn the advantages of abstract markup, yearn for something more powerful, and move up to the mother tongue. What's happened instead is that people have gone to extreme lengths to turn HTML into a visual formatting language. They embed text in tables so that they can control margins and columns; they sprinkle pages with hundreds of invisible images to control the placement of text; in preference to abstract tags such as [h1] they use a [font size] tag.

Abstract markup is going to get one more chance. A new language called XML, for Extensible Markup Language, is simpler than SGML but still retains the ability to define new tags (which is what makes it extensible). In principle, anyone can devise a private set of XML tags, but the main interest is in dialects defined for entire communities or disciplines. For example, there are groups creating XML variants for mathematics, chemistry, biological sequence data, astronomy and meteorology. Farther afield, other XML dialects describe real-estate listings, financial data, classified ads, legal documents and genealogies.

The mathematical dialect, MathML, illustrates the potential of XML. At one level, MathML addresses the messy problem of how to display mathematical expressions on the Web. In HTML this is often done by embedding scads of miniature images in the page—a solution that works, more or less, but offends the finer sensibilities. MathML puts mathematical notation directly into the markup language. But there's more to it than that. MathML can capture meaning as well as appearance. An expression such as x2 can be written in such a way that it will be recognized not as "x superscript 2" but as "the square of x." The dream is copying an equation from a published paper directly into a program such as Mathematica or Maple, where is can be solved or graphed or manipulated algebraically. Whether the dream comes true depends on whether authors accept the discipline of abstract tagging or turn XML into yet another visual formatting language.

comments powered by Disqus


Of Possible Interest

Perspective: The Tensions of Scientific Storytelling

Letters to the Editors: The Truth about Models

Spotlight: Briefings

Subscribe to American Scientist