Brian Hayes

The Virtual Machine

The idea of a virtual machine hardly originated with Java. It goes back to the very origins of computer science and is one of the many ingenious inventions of Alan M. Turing. The idea is that any sufficiently powerful computer can emulate, or mimic, any other computer. Such emulation is not just a theoretical toy. Practical emulators allow a Macintosh or a Unix box to dream it is a PC. Likewise, an emulator allows just about any computer to act as a Java virtual machine.

The virtual-machine strategy has a simple combinatorial advantage. Writing N programs for M platforms calls for an amount of labor proportional to N × M. With a virtual machine the work needed is N + M. The N operations are needed to write one version of each program; the M operations consist of building the virtual machine for each platform. In the 1970s this approach to software portability was tried in the P-code system, developed at the University of California at San Diego. P-code was intended to be a universal intermediate language. Compilers for many high-level languages could generate P-code, which would be run by interpreters on various computers.

In the case of Java, the intermediate language consists of byte codes, which make up the instruction set of the virtual machine. Because an eight-bit byte can take only 256 distinct values, and each instruction is identified by a single byte, the machine's repertory of actions is limited to no more than 256 instructions. The architecture of the virtual machine is centered on a "pushdown stack," where values are stored while operations are pending. Consider the sequence of three instructions iload_0, iload_1 and iadd, which Java happens to encode in the bytes whose decimal values are 26, 27 and 96. The two iload instructions push the first two local variables (numbers 0 and 1) onto the top of the stack. Then iadd pops the two numbers off the stack, adds them and pushes the sum on the stack in their place. The i prefixed to each instruction indicates that the operands must be integers; there are equivalent instructions for other data types, such as floating-point numbers.
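The three-instruction sequence can be traced with a toy interpreter. What follows is a minimal sketch, not the real virtual machine: the class name TinyStackMachine and its run method are invented for illustration, and only the three opcodes mentioned above (spelled iload_0, iload_1 and iadd in the Java virtual machine specification) are handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy interpreter for three JVM opcodes; not the real virtual machine.
final class TinyStackMachine {
    static final int ILOAD_0 = 26; // push local variable 0
    static final int ILOAD_1 = 27; // push local variable 1
    static final int IADD = 96;    // pop two ints, push their sum

    // Executes the byte codes in order and returns the value left on top of the stack.
    static int run(int[] code, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int opcode : code) {
            switch (opcode) {
                case ILOAD_0:
                    stack.push(locals[0]);
                    break;
                case ILOAD_1:
                    stack.push(locals[1]);
                    break;
                case IADD: {
                    int b = stack.pop();
                    int a = stack.pop();
                    stack.push(a + b);
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported opcode " + opcode);
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        int[] code = {ILOAD_0, ILOAD_1, IADD};
        int[] locals = {2, 3};
        System.out.println(run(code, locals)); // prints 5
    }
}
```

With local variables 2 and 3, the stack holds 2, then 2 and 3, and finally the single value 5.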

When a Java program is compiled, the output, called a class file, is not just a stream of byte codes. The file format includes several additional fields, structures and markers. For example, every valid class file must begin with a magic number, 3405691582. (The number seems less arbitrary when it is written in hexadecimal notation, where the 16 digits run from 0 to 9 and A to F. Converted to base 16, the magic number is CAFEBABE.)
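The arithmetic is easy to check. The sketch below converts the decimal magic number to hexadecimal and tests whether a byte array begins with it; the class and method names are invented for illustration, and the test assumes the four bytes are stored most significant byte first, as they are in a class file.

```java
// Hypothetical helper for inspecting the class-file magic number.
final class MagicNumber {
    static final long MAGIC = 3405691582L;

    // Renders a value in upper-case hexadecimal.
    static String toHex(long value) {
        return Long.toHexString(value).toUpperCase();
    }

    // Returns true if the first four bytes, read most significant byte first, equal the magic number.
    static boolean startsWithMagic(byte[] header) {
        if (header.length < 4) {
            return false;
        }
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (header[i] & 0xFF); // mask to treat the byte as unsigned
        }
        return value == MAGIC;
    }

    public static void main(String[] args) {
        System.out.println(toHex(MAGIC)); // prints CAFEBABE
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        System.out.println(startsWithMagic(header)); // prints true
    }
}
```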

The byte-code verifier ensures that a class file has the right format, and it also runs many checks on the byte codes themselves. In analyzing the three-byte program fragment given above, the verifier would make sure that both of the operands are integers, and it would prove that the stack cannot overflow or underflow. These checks enhance the reliability of Java programs, since type mismatches and stack failures are errors that would likely cause the program to crash. The same checks are also the main line of defense against malicious software. (Java's armor against hostile programs has been found to have a few chinks, but so far most of them have been flaws of implementation, not design.)
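The flavor of these checks can be conveyed by a sketch that runs the program on types rather than values. This is an illustration only, under strong assumptions: the class TinyVerifier, its Type enum and the maxStack parameter are invented here, only the three opcodes 26, 27 and 96 are recognized, and the code is taken to be a straight line with no branches. The real verifier must also follow every path through loops and jumps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of verifier-style checks for straight-line code; not Sun's algorithm.
final class TinyVerifier {
    enum Type { INT, FLOAT }

    // Simulates the stack using types only; returns false on a type mismatch,
    // an overflow past maxStack, an underflow, or an unrecognized opcode.
    static boolean verify(int[] code, Type[] localTypes, int maxStack) {
        Deque<Type> stack = new ArrayDeque<>();
        for (int opcode : code) {
            switch (opcode) {
                case 26:   // iload_0
                case 27: { // iload_1
                    int index = opcode - 26;
                    if (index >= localTypes.length || localTypes[index] != Type.INT) {
                        return false; // local variable missing or not an integer
                    }
                    if (stack.size() >= maxStack) {
                        return false; // push would overflow the stack
                    }
                    stack.push(Type.INT);
                    break;
                }
                case 96: { // iadd
                    if (stack.size() < 2) {
                        return false; // pop would underflow the stack
                    }
                    Type b = stack.pop();
                    Type a = stack.pop();
                    if (a != Type.INT || b != Type.INT) {
                        return false; // operands must both be integers
                    }
                    stack.push(Type.INT);
                    break;
                }
                default:
                    return false; // opcode outside this toy subset
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Type[] ints = {Type.INT, Type.INT};
        System.out.println(verify(new int[]{26, 27, 96}, ints, 2)); // prints true
        System.out.println(verify(new int[]{96}, ints, 2));         // prints false
    }
}
```

The point of the design is that no actual numbers are needed: the check succeeds or fails before the program ever runs.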

Interestingly, one thing the byte-code verifier cannot verify is that a class file was actually generated by a Java compiler, rather than coming from some other source. Since the format of the class file has been spelled out in complete detail, a compiler for another language can emit byte codes that will be executed by the Java virtual machine just as if they were authentic Java. Note that these cuckoo-egg byte codes are subject to the same defenses against malicious programs, since the ersatz class file has to pass through the verifier. In effect, the Java language and the Java virtual machine are completely decoupled. Programs written in any language can be compiled into byte codes and run on the Java virtual machine; they thus gain the benefits of platform independence. Conversely, Java programs could be compiled for platforms other than the virtual machine.

Hijacking the Java virtual machine in this way is not just a hypothetical possibility. Per Bothner of Cygnus Solutions has written a compiler called Kawa that translates Scheme (my own pet language) into Java byte codes. Furthermore, the Kawa compiler is itself written in Java, so it will run on any platform that has a Java virtual machine. Other languages, including Ada, are being grafted onto the Java virtual machine in the same way.

The one nagging doubt about this ruse for fooling the virtual machine has to do with efficiency. The architecture of the virtual machine was designed to be a good match for typical Java programs; it is probably less than optimal for very different languages such as Scheme. But efficiency is a troublesome issue even for the "100% Pure Java" that Sun advocates. Compiling a program into byte codes, rather than into the "native code" of a specific processor, interposes a layer of interpretation that inevitably slows execution. This penalty may be acceptable for the occasional Java "applet" downloaded from a Web site and run once or twice; it will be intolerable if the major applications that people work with every day are rewritten in Java. (Corel Corporation has announced plans to publish Corel Office for Java, a suite of Java programs including a word processor and a spreadsheet.)

Sun's answer to the efficiency problem is the JavaChip—a microprocessor whose native instruction set consists of Java byte codes. Thus the virtual machine becomes real, and the overhead of interpretation is eliminated. But with this vision Java has come full circle. It is no longer a bridge between platforms but a new platform competing with all the others.

Meanwhile, Java has not quite reached the promised land of platform independence even among the existing platforms. The small Java program of Figure 3 is one of the first examples given in The Java Tutorial, by Mary Campione and Kathy Walrath. The source code is the same for all platforms, but the tutorial's instructions for running the program are different for Unix, Windows and Macintosh computers. What's worse, the program also produces different results for each platform! (The source of these differences is that the program counts characters typed at the keyboard, and line-ends are encoded differently by the three operating systems.)
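The line-end effect can be isolated in a few lines. The sketch below is not the tutorial's program or Figure 3; the class CountChars and its methods are invented for illustration, and the "typed" input is supplied as a string with each system's conventional line-end (line feed on Unix, carriage return plus line feed on Windows, carriage return on the classic Macintosh). It shows only how the encoding of a line-end changes the character count, not every difference between the platforms.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical character counter; a stand-in for a program that counts keyboard input.
final class CountChars {
    // Counts every character the reader delivers, line-end characters included.
    static int count(Reader in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    // Convenience wrapper that treats a string as the typed input.
    static int countTyped(String typed) {
        try {
            return count(new StringReader(typed));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countTyped("hello\n"));   // line feed: prints 6
        System.out.println(countTyped("hello\r\n")); // carriage return + line feed: prints 7
        System.out.println(countTyped("hello\r"));   // carriage return: prints 6
    }
}
```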
