Bit Rot

Brian Hayes

Self-Documenting Documents

A useful exercise in thinking about data preservation and conversion is to imagine yourself a paleographer in the distant future, long after the collapse of civilization (brought on, no doubt, not by a wayward asteroid but by the year 2000 bug). Your job is to recover the wisdom of the ancients from the disks and tapes they left behind. This situation may seem contrived, but it really isn't that different from rediscovering a forgotten carton of eight-inch floppy disks full of WordStar and VisiCalc files.

What characteristics of a file format would help you recover the contents when the program that created the file is defunct? One obvious help is documentation. It's always easier to find your way if you have a map, and if the streets have signs. Archivists call it metadata: all the information about the information, starting with the handwritten label stuck on a floppy disk. The ideal is a self-documenting file—one that explains its own structure. If you want to be a fundamentalist about self-documentation, it becomes a game like communicating with extraterrestrials. Every disk has to include instructions for building a machine to read it, and instructions for reading the instructions, and so on. But in practice it's possible to supply a lot of metadata without getting caught in a bottomless regress. For example, an image file might consist of 307,200 eight-bit bytes; interpreting this block of data is easier with the clue that the bytes represent the colors of pixels arranged in a rectangular array of 480 rows and 640 columns.
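
The arithmetic behind that example is 480 × 640 = 307,200, at one byte per pixel. As a rough sketch of how little metadata it takes to make such a block interpretable, the Python below prefixes the pixel bytes with a one-line text header and then uses that header to rebuild the rows. The header layout and the function names are invented for illustration; they do not correspond to any real image format.

```python
# Hypothetical self-describing image layout (an assumption for illustration):
# one ASCII header line such as "rows=480 cols=640 bits_per_pixel=8",
# terminated by a newline, followed by the raw pixel bytes.

def write_described_image(pixels: bytes, rows: int, cols: int) -> bytes:
    """Prefix raw 8-bit pixel data with a header stating its dimensions."""
    if len(pixels) != rows * cols:
        raise ValueError("pixel count does not match rows * cols")
    header = f"rows={rows} cols={cols} bits_per_pixel=8\n".encode("ascii")
    return header + pixels


def read_described_image(blob: bytes) -> list[bytes]:
    """Use the header to split the byte block back into rows of pixels."""
    # Split at the first newline only; later 0x0A bytes belong to the pixels.
    header, _, body = blob.partition(b"\n")
    fields = dict(item.split("=") for item in header.decode("ascii").split())
    rows, cols = int(fields["rows"]), int(fields["cols"])
    if len(body) != rows * cols:
        raise ValueError("header does not match the amount of pixel data")
    return [body[r * cols:(r + 1) * cols] for r in range(rows)]


# Usage: an all-black 640-by-480 image, 307,200 bytes of pixel data.
blob = write_described_image(bytes(480 * 640), rows=480, cols=640)
grid = read_described_image(blob)
```

Without the header line, a reader holding only the 307,200 bytes would have to guess among many possible shapes (640 by 480, 480 by 640, 320 by 960 and so on) before the picture made sense.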

If the file can't fully document itself, then at least it can be documented elsewhere. The PostScript page-description language would not be easy to fathom without help, but it is thoroughly described in a series of fat books. If those manuals survive the millennium, future generations should be well equipped to read PostScript. The TeX typesetting system and its nephew LaTeX are also meticulously documented. But with a few notable and laudable exceptions, the file formats of commercial software are closed and proprietary. If you want to figure them out, you're on your own.

Finally, the job of recovery and reconstruction is a great deal easier for files that employ abstract markup. The nature of abstract markup is to tell you what is in the file, rather than how to present it. That's the ultimate in metadata, and just what you need to maximize your chances of correctly understanding the information.
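
To make the contrast concrete, here is a small sketch in Python using the standard library's `html.parser` module. The two sample strings are made up for illustration: one marks a title as a heading, the other merely asks for large bold type. A recovery program that looks for headings finds the title in the first and nothing in the second, even though both would look much the same on screen.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect the text of every element marked up as a heading."""

    def __init__(self) -> None:
        super().__init__()
        self._in_heading = False
        self.headings: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


def find_headings(markup: str) -> list[str]:
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    return collector.headings


# Illustrative samples (assumptions, not taken from any real file).
ABSTRACT = "<h2>Self-Documenting Documents</h2><p>A useful exercise.</p>"
VISUAL = '<font size="5"><b>Self-Documenting Documents</b></font><br>A useful exercise.'

abstract_headings = find_headings(ABSTRACT)  # the title is recoverable
visual_headings = find_headings(VISUAL)      # nothing says "this is a title"
```

The visual version still contains the words, but the fact that they form a section title has to be inferred from the type size, which is exactly the kind of guesswork abstract markup spares the reader.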

Most people don't choose their computer software by evaluating the qualities of file formats. They are swayed instead by lists of features, and by the sensuous experience of clicking on tool palettes or dragging-and-dropping. This situation is unlikely to change, and so the file formats of popular commercial programs are the ones that future antiquarians will have to deal with. In this respect an intriguing development is Microsoft's recent decision to make HTML a "companion" file format for all the programs of the Microsoft Office suite, including Word and the Excel spreadsheet. The ability to save files in HTML format is nothing unusual; what's important about the Microsoft initiative is that HTML files can also be read by the applications. A Microsoft press release promises "seamless round-tripping" from HTML to other formats. In principle, then, HTML could become the primary medium for much digital information. Regrettably, the HTML generated by the Office programs is heavily laden with visual markup.


