
From American Scientist, January–February 2012, Volume 100, Number 1, page 84. DOI: 10.1511/2012.94.84
HANDBOOK OF FLOATING-POINT ARITHMETIC. Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé and Serge Torres. xxiv + 572 pp. Birkhäuser, 2010. $129.
The digital computer was originally conceived as a machine for doing arithmetic, and so you might think it could add, subtract, multiply and divide without error. After all, these are skills we teach to young children. But arithmetic is a subtler art than it seems. For some problems, a computer cannot possibly give an exact numerical answer, because irrational quantities such as the square root of 2 cannot be represented in a finite number of digits. The most you can reasonably ask of the computer is that it always calculate the best approximation to the true answer, within the constraints of finite precision. But even this goal is sometimes exceedingly difficult to achieve.
In the empyrean world of mathematics—as distinct from the material world of computation—arithmetic has rules you can count on. For example, the relation a + 1 > a holds true for every finite value of a on the real number line. But if you try a few experiments with computer software, you can readily find values of a for which a + 1 = a. A spreadsheet program on my computer gives this preposterous result when a is set equal to 9,007,199,254,740,992. Another familiar rule that computer arithmetic flagrantly violates is the associative law of addition, which states that (a + b) + c = a + (b + c). In the spreadsheet, when I assign the values a = 10^17, b = –10^17 and c = 1, I find that (a + b) + c returns a result of 1.0, whereas a + (b + c) yields the value 0.0.
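Readers who want to see these anomalies firsthand can reproduce them in a few lines of Python, whose float type is the same IEEE-754 binary64 format discussed below. The snippet is my own illustration, not anything taken from the Handbook:

```python
# Python floats are IEEE-754 binary64 values, so they reproduce the
# spreadsheet's behavior exactly.

a = 9_007_199_254_740_992.0      # 2**53
print(a + 1.0 == a)              # True: adding 1 changes nothing

a, b, c = 1e17, -1e17, 1.0
print((a + b) + c)               # 1.0
print(a + (b + c))               # 0.0 -- addition is not associative
```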
These anomalies are not the result of programming errors or hardware malfunctions, and they are certainly not confined to this one spreadsheet program; they reflect inescapable limitations of computer arithmetic. The real numbers of mathematics form an infinite continuum, with no limits on their size or on how closely they crowd together; between any two real numbers, there’s room for infinitely many more. But a machine with a finite set of discrete parts cannot represent all of those numbers. In the spreadsheet program, every number must fit into 64 binary digits of storage, and so there can be no more than 2^64 (or roughly 10^19) distinct numbers.
The question is: Which 2^64 numbers should be included in this privileged set? The design trade-off is between precision and range. Packing all the numbers into a narrow interval allows fine discrimination between nearby values; scattering the numbers more sparsely covers a wider territory. The choice made by the spreadsheet program—and by much other software—is a “floating point” scheme similar to scientific notation. It affords high precision for small numbers (close to zero) but much coarser resolution in the outer reaches of the number line.
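The uneven spacing is easy to measure. In Python 3.9 or later, math.ulp reports the gap between a given float and the next representable one; the short sketch below is my own illustration, not drawn from the book:

```python
import math, sys

# The gap between adjacent floats ("unit in the last place") grows with magnitude.
print(math.ulp(1.0))       # about 2.2e-16: very fine spacing near 1
print(math.ulp(1e16))      # 2.0: whole numbers start to be skipped
print(math.ulp(1e300))     # about 1.5e+284: enormous gaps far from zero

# The same 64 bits nonetheless span a huge range.
print(sys.float_info.max)  # about 1.8e+308
print(sys.float_info.min)  # about 2.2e-308 (smallest normal value)
```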
In the 64-bit version of floating-point arithmetic used by the spreadsheet program, a number takes the form m × 2^E. Here m is the significand, with 53 bits of precision, and E is the exponent, which is allocated 11 bits. These facts about the internal details of the number format provide some clues about the oddities described above. In particular, the mysterious number 9,007,199,254,740,992 is equal to 2^53, which is where the significand runs out of digits. In this numeric neighborhood, rounding a number to the nearest representable floating-point value yields the same result for both a and a + 1.
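A few lines of Python can pry the two fields apart and confirm the boundary at 2^53. The sketch is my own; the bit layout is that of the IEEE-754 binary64 format, in which 52 significand bits are stored explicitly and a 53rd leading bit is implicit:

```python
import struct

def fields(x: float):
    """Split a binary64 value into sign, unbiased exponent and stored significand bits."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63
    exponent = ((bits >> 52) & 0x7FF) - 1023   # 11-bit field, biased by 1023
    significand = bits & ((1 << 52) - 1)       # 52 stored bits; a 53rd bit is implicit
    return sign, exponent, significand

print(fields(9_007_199_254_740_992.0))   # (0, 53, 0): exactly 2**53
print(2.0**53 + 1.0 == 2.0**53)          # True: the added 1 falls off the end
```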
Floating-point arithmetic is not a novel idea; mainframe computers had it in the 1950s. But two events in the 1980s shaped modern development of the technology. First, a coprocessor chip made by Intel provided hardware support for floating-point arithmetic in early personal computers. Second, the publication of a detailed standard for floating-point operations, designated IEEE-754, brought a measure of consistency and predictability. The standard was widely adopted both by hardware manufacturers and by the designers of programming languages, so that the IEEE-754 version of floating-point arithmetic soon became part of the computational infrastructure, available on almost all computing platforms. (These days, even cell phones have it.)
It’s often the fate of infrastructure that everybody uses it but nobody thinks about it. Particularly welcome, then, is this thorough new Handbook of Floating-Point Arithmetic, which delves deeply into the murky underworld of numerical computing. Most of the book’s nine authors are affiliated with the École Normale Supérieure de Lyon in France. As a group they have a distinguished portfolio of research publications on numerical analysis, computer arithmetic and related topics.
The Handbook’s introductory chapters discuss the nitty-gritty details of how a number is encoded in a sequence of bits, how the basic operations of arithmetic are to be carried out according to the dictates of the IEEE standard, the importance of correct rounding, and what can go wrong. The what-can-go-wrong part gets a disproportionate share of attention, if only because the “corner cases” are where the interesting problems are found. How should the system respond when it is asked to evaluate 1/0? How about 0/0? Questions like these are answered by introducing symbols for exceptional cases: +∞, –∞, NaN (which stands for “not a number”) and the concept of “signed zero,” with the dual symbols +0 and –0. All these novelties bring further conundrums, such as the curious fact that the statement NaN=NaN is false: The symbol is not equal to itself. On the other hand, +0=–0 is true.
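These special values are easy to poke at from Python, with one caveat: Python raises an exception for division by zero rather than returning the IEEE infinity, so the snippet below (again my own illustration) constructs the special values directly:

```python
import math

nan = float('nan')                       # the IEEE "not a number" value
print(nan == nan)                        # False: NaN is not equal to itself
print(math.isnan(nan))                   # True: the sanctioned way to detect it

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)              # True: the two zeros compare equal
print(1.0 / math.inf, 1.0 / -math.inf)   # 0.0 -0.0: yet they can be told apart
print(math.copysign(1.0, neg_zero))      # -1.0: the sign of -0.0 is observable
```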
A further series of chapters, in a section headed “Cleverly Using Floating-Point Arithmetic,” is aimed at the ordinary practitioner—perhaps a scientist-programmer trying to find the best compromise between speed and accuracy while avoiding the worst pitfalls. Presented here are some recommended algorithms for addition and multiplication with a guarantee of exactness (as well as methods for quantifying error in other cases) and advice on common tasks such as evaluating polynomials and dealing with complex numbers. But this is not a text on numerical analysis. You will need to turn elsewhere to calculate error bounds on an overall computation (as opposed to a single step) or to evaluate the stability of a numerical process.
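One classic algorithm of this kind is the TwoSum procedure usually credited to Møller and Knuth, which delivers not only the rounded sum of two floats but also, exactly, the error that the rounding threw away. Here is a minimal Python rendering of the standard algorithm, my own sketch rather than the Handbook's code:

```python
def two_sum(a: float, b: float):
    """Error-free addition: return (s, e) with s = fl(a + b) and
    a + b == s + e exactly, for any two finite binary64 values."""
    s = a + b
    a_virtual = s - b             # the portion of s that came from a
    b_virtual = s - a_virtual     # the portion of s that came from b
    err_a = a - a_virtual         # what was lost from a
    err_b = b - b_virtual         # what was lost from b
    return s, err_a + err_b       # a + b == s + (err_a + err_b), exactly

s, e = two_sum(1e17, 1.0)
print(s, e)                       # 1e+17 1.0: the rounded sum, plus the digit it lost
```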
Three more chapters address a rather different audience: They speak to the engineer who would implement floating-point arithmetic, either in hardware or in software. Admittedly, this is a task that very few people will ever have occasion to undertake, just as few drivers design their own automobiles or few musicians build their own pianos. Still, there’s the perennial fascination of learning how things work. Take the algorithm for floating-point multiplication. The basic idea is to add the exponents and multiply the significands. Efficient hardware can perform these two tasks concurrently, but then the sum of the exponents may need to be adjusted because of a carry from the product of the significands. Furthermore, multiplying the significands can double the number of digits, and so the product must be rounded to fit the allotted space.
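The procedure is simple enough to sketch in a few lines. The toy routine below is my own illustration, with significands carried as plain Python integers rather than hardware registers: it multiplies two values of the form m × 2^e, adds the exponents, renormalizes, and rounds to nearest, ties to even.

```python
P = 53  # bits in a binary64 significand

def fp_mul(m1: int, e1: int, m2: int, e2: int):
    """Toy multiply for values m * 2**e with P-bit normalized significands:
    add the exponents, multiply the significands, then renormalize and round."""
    m = m1 * m2                        # product has 2*P or 2*P - 1 bits
    e = e1 + e2
    excess = m.bit_length() - P        # bits that will not fit
    lower = m & ((1 << excess) - 1)    # the bits to be rounded away
    half = 1 << (excess - 1)
    m >>= excess
    e += excess
    if lower > half or (lower == half and m & 1):   # round to nearest, ties to even
        m += 1
        if m.bit_length() > P:         # rounding carried out of the top bit
            m >>= 1
            e += 1
    return m, e

# 1.5 is stored as m = 3 * 2**51 with e = -52; the toy multiply gives 2.25.
m, e = fp_mul(3 * 2**51, -52, 3 * 2**51, -52)
print(m * 2.0**e)                      # 2.25
```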
The final sections of the Handbook reach beyond the four operations of elementary-school arithmetic to the calculation of logarithms, exponentials and trigonometric functions. In this borderland between mathematics and computer science the challenges get deeper, and correct answers are not always to be had, even with the most ingenious bit-twiddling. The issue, again, is correct rounding. With the operations of ordinary arithmetic (and also in extracting square roots), calculating a few extra bits of precision will always suffice to determine whether a value that lies between two representable numbers should be rounded up or rounded down. In computing approximations to the transcendental functions there is no such guarantee. In the case of natural logarithms, there is at least one instance where correctly rounding the answer to 53 bits of precision requires calculating 117 bits. It has not been proved that the number of bits needed is always finite.
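The standard escape, associated with Abraham Ziv, is to retry the computation with ever more working precision until the rounding decision stops changing. The sketch below conveys the retry idea using the third-party mpmath library for the extra-precision values; the function names are my own, and a production version would use a rigorous error bound rather than the simple agreement test shown here:

```python
from mpmath import mp, mpf, log

def log_rounded_to_53(x: float, work_prec: int):
    """Compute log(x) with work_prec bits, then round that result to 53 bits."""
    mp.prec = work_prec
    y = log(mpf(x))
    mp.prec = 53
    return +y                     # unary + re-rounds an mpf to the current precision

def ziv_log(x: float) -> float:
    """Simplified Ziv-style loop: raise the working precision until two
    successive precisions agree on the 53-bit rounding."""
    prec = 64                     # start with a handful of guard bits
    while log_rounded_to_53(x, prec) != log_rounded_to_53(x, prec + 32):
        prec += 32                # hard cases may need many more bits (117 in one known case)
    return float(log_rounded_to_53(x, prec))

print(ziv_log(2.0))               # 0.6931471805599453, matching math.log(2.0)
```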
This state of affairs suggests a rather gloomy outlook for floating-point arithmetic. If we insist on correctness, some programs might be very slow, or they could even run forever without returning an answer. If instead we demand efficiency, we might have to accept answers that are not only inexact but also in some sense wrong—further from the true answer than is strictly necessary at a given level of precision. These facts have long been clear to the insiders who design floating-point systems, but perhaps they are not so well known to those who rely on the outputs. There is no reason for despair, just because the 53rd bit may be incorrect. Marvels of science and engineering were accomplished even in the dark ages of the slide rule, an instrument that came nowhere near 53 bits of precision. But the situation remains mildly irksome. And whereas users of slide rules were acutely aware of how coarse their approximations were, users of modern computers may have delusions of omniscience.
In my own casual programming, I often go out of my way to avoid floating-point arithmetic, favoring methods that give exact results for rational numbers, even though I pay a steep price in efficiency. And I sometimes get curmudgeonly about the overwhelming “mind share” that floating-point methods command in the world of scientific computing. I have the sense that floating-point arithmetic is often chosen without even considering alternatives. Reading the Handbook of Floating-Point Arithmetic has deepened my awareness of the snags and snares that await the unwary floating-point programmer. At the same time, however, the Handbook has inspired in me a renewed admiration for modern floating-point methods, despite their quirks. There is a huge body of knowledge standing behind the decisions made in formulating the IEEE-754 standard, and in the more recent work described here by the Lyon group. A floating-point calculation may not get every bit right, but it’s not leaving many bits on the table unnecessarily.
Brian Hayes is Senior Writer for American Scientist. He is the author most recently of Group Theory in the Bedroom, and Other Mathematical Diversions (Hill and Wang, 2008).