Bit Rot

Brian Hayes

The Nitty Gritty

Strictly speaking, a computer file doesn't have a format; it has many formats, built in layers one atop the other. At the bottom of the hierarchy is the pattern of magnetized stripes on a disk or tape, or the microscopic pits in the reflective surface of a CD-ROM. This physical layer is the domain of hardware; you can't even perceive the recorded information, much less make sense of it, without the right machinery.

Figure 1. Hierarchy of digital file formats

At the next level, patterns of bits are interpreted as numbers, characters, images and the like. The meaning of the patterns is not always obvious. Numbers can be stored in a baffling variety of formats. An integer might be represented by 8, 16, 32 or 64 bits. The bits could be read from left to right or from right to left. Negative numbers could be encoded according to either of two conventions, called one's complement and two's complement. Other variations include binary-coded decimal and floating-point numbers.
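To make the ambiguity concrete, here is a minimal Python sketch that reads the same two bytes under several of the conventions just listed. The helper names `ones_complement` and `packed_bcd` are invented for this illustration, and byte order stands in for the left-to-right versus right-to-left question; real file formats vary in further ways not shown here.

```python
def ones_complement(value: int, bits: int) -> int:
    """Read an unsigned bit pattern as a one's-complement signed integer."""
    mask = (1 << bits) - 1
    if value & (1 << (bits - 1)):        # sign bit set: flip every bit, then negate
        return -(~value & mask)
    return value


def packed_bcd(data: bytes) -> int:
    """Read bytes as packed binary-coded decimal, two decimal digits per byte."""
    result = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("not a valid BCD digit")
        result = result * 100 + high * 10 + low
    return result


raw = bytes([0xFF, 0xFE])                # the same two bytes in every reading below

unsigned_big = int.from_bytes(raw, "big")             # 65534
unsigned_little = int.from_bytes(raw, "little")       # 65279
twos_big = int.from_bytes(raw, "big", signed=True)    # -2
ones_big = ones_complement(unsigned_big, 16)          # -1

bcd_value = packed_bcd(bytes([0x12, 0x34]))           # 1234
```

Four readings of one bit pattern give four different numbers, and nothing in the bytes themselves says which reading the author of the file intended.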

For text the situation is not much better. "Plain ASCII text" is often considered the lowest common denominator among computer file formats—a rudimentary language that any system ought to understand—but in practice it doesn't always work that way. ASCII stands for American Standard Code for Information Interchange. The "American" part of the name is a tip-off to one problem: ASCII represents only the characters commonly appearing in American English. If a text includes anything else—such as letters with accents or mathematical symbols—it lies beyond the bounds of pure ASCII.

Each ASCII character is represented by a seven-bit binary number, which has room for values in the range from 0 to 127. Most computers store information in bytes of eight bits each, allowing for another 128 characters. Unfortunately, every designer seems to have chosen a different set of extra characters. Not that there aren't standards for the use of the eighth bit. That's just the problem: There are more than a dozen of them. Grown men and women have given up decades of their lives to sit on committees arguing over the proper place of the dollar sign in computer character sets.
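The disagreement over the eighth bit is easy to demonstrate. The sketch below decodes a single byte with the high bit set under three of the eight-bit character sets that ship with Python; the choice of these three codecs is an assumption made for illustration, not a list drawn from the standards mentioned above.

```python
byte = bytes([0xE9])                     # value 233: eighth bit set, outside 7-bit ASCII

# Pure ASCII text never uses values above 127.
highest_ascii = max("Plain ASCII text".encode("ascii"))

# A strict ASCII decoder rejects the byte outright.
try:
    byte.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False

# Three eight-bit character sets each assign the byte a different character.
readings = {name: byte.decode(name) for name in ("latin-1", "cp437", "mac_roman")}

for name, char in readings.items():
    print(f"{name:10} 0x{byte[0]:02X} -> {char}")
```

A file that looks correct on the machine that wrote it can therefore show the wrong character on another machine, even though not a single bit has changed.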

Many of ASCII's limitations are addressed in a new standard for character representation called Unicode. By giving each character two bytes instead of one, Unicode can specify more than 65,000 characters, enough for all the world's major alphabetic languages as well as the thousands of symbols in Chinese, Japanese and Korean. Unicode seems to be catching on. It is built into Microsoft Windows NT and the Java programming language, and Apple has announced its plan to support the standard. In the long run, this is good news; Unicode will solve some ticklish problems. On the other hand, it will mean another round of conversions for those 12,000 files I drag around behind me. Indeed, almost every computer file in existence today may eventually need to be converted.
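A rough sketch of what such a conversion involves, assuming the old file was written in one particular eight-bit character set (Latin-1 here, chosen only for illustration) and using Python's `utf-16-be` codec as a stand-in for the two-bytes-per-character scheme described above:

```python
legacy = "café".encode("latin-1")        # an old one-byte-per-character file: 4 bytes

# Conversion only works if we know which eight-bit set the file was written in.
text = legacy.decode("latin-1")
converted = text.encode("utf-16-be")     # two bytes per character: 8 bytes

code_space = 2 ** 16                     # 65536 distinct two-byte values

han = "中".encode("utf-16-be")            # a Chinese character also fits in two bytes

print(len(legacy), len(converted), code_space, len(han))
```

The arithmetic is trivial; the hard part of converting an old file is knowing which character set to assume on the way in.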
