

Ode to the Code

Brian Hayes

The genetic code was cracked 40 years ago, and yet we still don't fully understand it. We know enough to read individual messages, translating from the language of nucleotide bases in DNA or RNA into the language of amino acids in a protein molecule. The RNA language is written in an alphabet of four letters (A, C, G, U), grouped into words three letters long, called triplets or codons. Each of the 64 codons specifies one of 20 amino acids or else serves as a punctuation mark signaling the end of a message. That's all there is to the code. But a nagging question has never been put to rest: Why this particular code, rather than some other? Given 64 codons and 20 amino acids plus a punctuation mark, there are 10⁸³ possible genetic codes. What's so special about the one code that, with a few minor variations, rules all life on Planet Earth?
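For readers who like to see the mechanics, here is a minimal sketch of the reading process just described. It assumes the standard codon table written in the conventional U, C, A, G order, with one-letter amino acid abbreviations and "*" for the stop signal; the names CODON_TABLE and translate are illustrative choices, not anything taken from the literature.

```python
from itertools import product

BASES = "UCAG"  # conventional row and column order of the printed codon table
# One letter per codon in UCAG order; '*' marks the three stop codons.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Map each of the 64 triplets to its amino acid (or stop).
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(rna):
    """Read an RNA string three letters at a time, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCUUUUAA"))  # MAF
```

The table has 64 keys but only 21 distinct values, which is the degeneracy that the rest of the story turns on.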

The canonical nonanswer to this question came from Francis Crick, who argued that the code need not be special at all; it could be nothing more than a "frozen accident." The assignment of codons to amino acids might have been subject to reshuffling and refinement in the earliest era of evolution, but further change became impossible because the code was embedded so deeply in the core machinery of life. A mutation that altered the codon table would also alter the structure of every protein molecule, and thus would almost surely be lethal. In other words, the genetic code is the qwerty keyboard of biology—not necessarily the best solution, but too deeply ingrained to be replaced or improved.

There has always been resistance to the frozen-accident theory. Who wants to believe that the key to life is so arbitrary and ad hoc? And there is evidence that the accident is not quite frozen. Certain protozoa, bacteria and intracellular organelles employ genetic codes slightly different from the standard one, hinting that changes to codon assignments are not impossible after all. And if the code is subject to change, then it must also be subject to natural selection, which in turn suggests the possibility of ongoing improvement. Perhaps ours is not the very best of all possible codes, but after four billion years of evolution it ought to be a pretty darn good one.


The urge to find something singular and superlative about the code was already evident even before it was deciphered. For several years before experiments began to reveal the true structure of the genetic code, theorists were at liberty to dream up codes of their own. Some of the proposals were so ingenious that the real code seemed a bit disappointing. An earlier column in this series (January-February 1998) described that era of imaginary genetic engineering. But the creative thinking did not end with the publication of the codon table; indeed speculation seems to have been inhibited very little by the constraints of mere fact. This sequel is meant to bring the story up to date, covering both the biological mainstream and a few ideas from wilder shores.

Egged on by Error

Early guesses about the nature of the code often started from an assumption that it would maximize information density. One conjecture had each nucleotide base spelling out three messages at once. The concern with efficiency turned out to be misplaced; information density is not a very high priority for most organisms. The concept that has replaced efficiency as the great desideratum in genetic coding is error tolerance, or robustness. In one way or another, the code is thought to minimize the incidence and the consequences of errors in the transmission of genetic information, so that meaning can be recovered even from garbled messages.

Among the many ways that genetic signals could go awry, two kinds of errors have been singled out for attention: mistranslations and mutations. Errors in translation disrupt the reading of the genetic message—the flow of information from DNA to RNA and then to protein—but they leave the DNA itself intact. Translation errors were probably of great importance early in the history of life, when the machinery of protein synthesis was imprecise. Mistranslations are less frequent now, and less harmful. Each error disables only a single protein molecule. Mutations are another matter: They alter the DNA, the permanent genetic archive. Whereas a translation error is like an inkblot marring one copy of a book, a mutation is a flaw in the printing plate, reproduced in every copy. The simplest "point" mutations substitute one nucleotide for another at a single site on the DNA (with a corresponding change on the opposite strand).

The idea that fault tolerance might shape the genetic code arose as soon as biologists got their first glimpse of the codon table. The mapping from codons to amino acids is highly degenerate: In many cases multiple codons specify the same amino acid. But the synonymous codons are not just scattered haphazardly across the table; they clump together. Because of these clusters, a misreading or mutation has a better-than-average chance of producing a new codon that still translates into the same amino acid.
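The clumping can be checked by brute force. The sketch below, which assumes the standard codon table in U, C, A, G order, counts how many single-base changes to a sense codon leave the amino acid unchanged, and makes the same count for one arbitrarily scrambled table as a baseline. The helper names and the random seed are illustrative, and a single scrambled table is only an informal comparison.

```python
import random
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
NATURAL = dict(zip(CODONS, AMINO_ACIDS))

def point_mutants(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

def count_synonymous(table):
    """Return (unchanged, total) over all point mutations of sense codons."""
    same = total = 0
    for codon, amino_acid in table.items():
        if amino_acid == "*":
            continue  # skip mutations that start from a stop codon
        for mutant in point_mutants(codon):
            total += 1
            same += table[mutant] == amino_acid
    return same, total

natural_same, natural_total = count_synonymous(NATURAL)
print(natural_same, natural_total)  # 134 549

# Baseline: the same 64 assignments scattered at random over the table.
letters = list(AMINO_ACIDS)
random.Random(0).shuffle(letters)
scrambled_same, _ = count_synonymous(dict(zip(CODONS, letters)))
print(scrambled_same)
```

In the natural table about a quarter of the point mutations are silent; a scrambled table with the same number of codons per amino acid manages far fewer.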

Closer examination of the table—with some knowledge of amino acid chemistry—revealed another possible strategy for coping with errors. When a change to a single nucleotide does not yield the same amino acid, it nonetheless has a good chance of producing one with similar properties. For example, all the codons with a middle nucleotide of U correspond to amino acids that are hydrophobic, or water-repellent, a trait governing how the chain of amino acids in a protein molecule folds up in the aqueous environment of the cell. Thus at least two-thirds of the time a point mutation in one of these codons will either leave the identity of the amino acid unchanged or will substitute another hydrophobic amino acid.
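Both halves of that claim are easy to verify, again assuming the standard table in U, C, A, G order. The sketch lists the amino acids encoded by codons with U in the middle, then counts what fraction of point mutations leave the middle base alone. Whether those amino acids count as hydrophobic is a matter of chemistry that the code takes on trust rather than computes.

```python
from fractions import Fraction
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Amino acids specified by the 16 codons whose middle base is U.
middle_u = sorted({aa for codon, aa in CODON_TABLE.items() if codon[1] == "U"})
print(middle_u)  # ['F', 'I', 'L', 'M', 'V']

# Of the nine possible point mutations of a codon, three change each position,
# so six of the nine leave the middle base (and hence the middle U) intact.
kept = sum(1 for pos in range(3) for _ in range(3) if pos != 1)
fraction_keeping_middle = Fraction(kept, 9)
print(fraction_keeping_middle)  # 2/3
```

The five letters stand for phenylalanine, isoleucine, leucine, methionine and valine.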

Reshuffling the Deck of Codons

As early as 1969, Cynthia Alff-Steinberger of the University of Geneva began trying to quantify the code's resilience to error by means of computer simulation. The basic idea was to randomly generate a series of codes that reshuffle the codon table but retain certain statistical properties, such as the number of codons associated with each amino acid. Then the error-resistance of the codes was evaluated by generating point mutations that caused amino acid substitutions. A code scored well if the erroneous amino acids were similar to the original ones. With the computing facilities available in the 1960s, Alff-Steinberger was able to test only 200 variant codes. She concluded that the natural code tolerates substitutions better than a typical random code.


A decade later J. Tze-Fei Wong of the University of Toronto approached the same question from another angle and reached a different conclusion. Instead of generating many random codes, he tried a hand-crafted solution, identifying the best substitution for each amino acid. Wong found that the substitutions generated by the natural code are less than half as close, on average, as the best ones possible. This result was taken as evidence that the code has not evolved to maximize error tolerance. But Wong did not attempt to find a complete, self-consistent code that would generate all the optimal substitutions.

Returning to studies of random codes, David Haig and Laurence D. Hurst of the University of Oxford generated 10,000 of them in 1991, keeping the same blocks of synonymous codons found in the natural code but permuting the amino acids assigned to them. The result depended strongly on what criterion was chosen to judge the similarity of amino acids. Using a measure called polar requirement, which indicates whether an amino acid is hydrophobic or hydrophilic, the natural code was a stellar performer, better than all but two of the 10,000 random permutations. But in other respects the biological code was only mediocre; 56 percent of the random codes did a better job of matching the electric charge of substituted amino acids.
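An experiment in this spirit can be sketched in a few dozen lines. What follows is not Haig and Hurst's program but a simplified reconstruction under stated assumptions: the cost of a code is taken to be the mean squared change in polar requirement over all single-base changes between sense codons, the stop codons stay fixed, and the polar requirement values are the ones commonly quoted from Woese's measurements, which should be checked against a primary source before anyone relies on them. The sample size and seed are arbitrary.

```python
import random
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
NATURAL = dict(zip(CODONS, AMINO_ACIDS))

# Polar requirement per amino acid (assumed values; verify before reuse).
PR = {"A": 7.0, "R": 9.1, "N": 10.0, "D": 13.0, "C": 4.8, "Q": 8.6, "E": 12.5,
      "G": 7.9, "H": 8.4, "I": 4.9, "L": 4.9, "K": 10.1, "M": 5.3, "F": 5.0,
      "P": 6.6, "S": 7.5, "T": 6.6, "W": 5.2, "Y": 5.4, "V": 5.6}

def point_mutants(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

def error_cost(table):
    """Mean squared polar-requirement change over sense-to-sense point mutations."""
    costs = []
    for codon, old in table.items():
        if old == "*":
            continue
        for mutant in point_mutants(codon):
            new = table[mutant]
            if new != "*":  # changes to a stop codon are left out of the score
                costs.append((PR[old] - PR[new]) ** 2)
    return sum(costs) / len(costs)

def permuted_code(rng):
    """Keep the synonymous blocks but reassign the 20 amino acids among them."""
    names = sorted(PR)
    shuffled = names[:]
    rng.shuffle(shuffled)
    swap = dict(zip(names, shuffled))
    swap["*"] = "*"
    return {codon: swap[aa] for codon, aa in NATURAL.items()}

rng = random.Random(1991)
natural_cost = error_cost(NATURAL)
random_costs = [error_cost(permuted_code(rng)) for _ in range(1000)]
better = sum(cost < natural_cost for cost in random_costs)
print(f"natural cost {natural_cost:.2f}; "
      f"{better} of {len(random_costs)} permuted codes score lower")
```

Swapping PR for a different property, such as electric charge, changes only the dictionary, which is why the choice of similarity measure carries so much weight in the results.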

Focusing on the encouraging result with polar requirement, Hurst and Stephen J. Freeland (now at the University of Maryland, Baltimore County) later repeated the experiment with a sample size of 1 million random codes. Using the same evaluation rule as in the smaller simulation, they found that 114 of the million codes gave better substitutions than the natural code when evaluated with respect to polar requirement. Then they refined the model. In the earlier work, all mutations and all mistranslations were considered equally likely, but nature is known to have certain biases—some errors are more frequent than others. When the algorithm was adjusted to account for the biases, the natural code emerged superior to every random permutation with a single exception. They published their results under the title "The genetic code is one in a million."
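One familiar bias of this kind is that transitions (purine to purine, or pyrimidine to pyrimidine) tend to be more common than transversions. The sketch below shows how such a bias might enter a scoring rule as a weight on each substitution. The weight of 2.0 is a hypothetical placeholder rather than a figure from Freeland and Hurst, and the function names are illustrative.

```python
PURINES = {"A", "G"}  # the other two RNA bases, C and U, are pyrimidines

def is_transition(a, b):
    """True if a -> b swaps a purine for a purine or a pyrimidine for a pyrimidine."""
    return a != b and ((a in PURINES) == (b in PURINES))

def mutation_weight(a, b, transition_bias=2.0):
    """Relative likelihood of the substitution a -> b (hypothetical bias value)."""
    return transition_bias if is_transition(a, b) else 1.0

def weighted_mean(costs_and_weights):
    """Average the costs, counting likelier errors more heavily."""
    total_weight = sum(w for _, w in costs_and_weights)
    return sum(c * w for c, w in costs_and_weights) / total_weight

# Two made-up substitution costs: a cheap transition and a costly transversion.
example = [(1.0, mutation_weight("A", "G")), (4.0, mutation_weight("A", "C"))]
print(weighted_mean(example))  # 2.0
```

With equal weights the two example costs would average 2.5; counting the transition twice pulls the mean down to 2.0, so a code that keeps its commonest errors cheap gets credit for doing so.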

But still there was the question of whether polar requirement is the right criterion for estimating the similarity of amino acids. Choosing the one factor that gives the best result and ignoring all others is not an experimental protocol that will convince skeptics. This issue was addressed in a further series of experiments by Freeland and Hurst in collaboration with Robin D. Knight and Laura F. Landweber of Princeton University. Rather than try to deduce nature's criteria for comparing amino acids, they inferred them from data on actual mutations. If two amino acids are often found occupying the same position in variant copies of the same protein, then it seems safe to conclude that the amino acids are physiologically compatible. Conversely, amino acids that are never found to occupy the same position would not be likely substitutions in a successful genetic code. There is a circularity to this formulation: The structure of the genetic code helps determine which substitutions are seen most often, and then the frequencies of substitutions serve to rank candidate genetic codes. Freeland and his colleagues argue that they can break the cycle by choosing an appropriate subset of the mutation data, including only proteins at substantial evolutionary distance, which should be separated by many mutations.

Using this bootstrap criterion, Freeland and his colleagues compared the biological code with another set of a million random variations. The natural code emerged as the uncontested champion. They wrote of the biological code: "[It] appears at or very close to a global optimum for error minimization: the best of all possible codes."


The idea that the genetic code is evolving under pressure to ameliorate errors—or indeed that it is evolving at all—has not won universal assent. Some cogent objections were set forth as early as 1967 by Carl R. Woese of the University of Illinois at Urbana-Champaign. Among other points, he noted that if a trait is actively evolving, you would expect to see some variation. In particular he called attention to the various "extremophiles" that live at high temperature, high salt concentration, and so on. These organisms tend to have unusual proteins and unusual nucleic acids, but they all have the standard genetic code.

The few variant codes known in protozoa and organelles are thought to be offshoots of the standard code, but there is no evidence that the changes to the codon table offer any adaptive advantage. In fact, Freeland, Knight, Landweber and Hurst found that the variants are inferior or at best equal to the standard code. It seems hard to account for these facts without retreating at least part of the way back to the frozen-accident theory, conceding that the code was subject to change only in a former age of miracles, which we'll never see again in the modern world.

Another challenge to the error-reduction hypothesis is the difficulty of showing causation in an evolutionary context. Even if the pattern of codon assignments is consistent with such a mechanism, the same pattern might have arisen in some other way.

Computer experiments like Alff-Steinberger's and Freeland's reveal nothing about pathways of evolution. A program churning out a million random genetic codes is not what you expect to see in nature. To simulate the step-by-step process of mutation and selection is much more demanding; after all, the biosphere has been working at it for a few billion years. Nevertheless, models of this kind are being attempted. Guy Sella and David H. Ardell of Stanford University are running a simulation that includes both a nucleic acid genotype and a protein phenotype, linked by a mutable genetic code. They point out that change can be introduced into the genetic code without utterly disrupting cell metabolism if there are multiple codons for a given amino acid, and some of them fall into disuse; these rarely used codons are then free to take on new roles. The mechanism is analogous to the gene duplication that often precedes evolutionary divergence of proteins: One copy of the gene carries on the original function, allowing the other to explore new territory. Thus degeneracy or redundancy is not just an accidental feature of the code but is necessary to allow scope for evolution.

Code On, Codon

Solomon W. Golomb of the University of Southern California, who was a central figure in the first round of speculations about the genetic code, has summed up the spirit of that era: The approach taken in those days was to ask, "How would Nature have done it, if she were as clever as I?" Now that we know how nature has done it, you might think that the period of freewheeling conjecture would be over, but I am pleased to report that there is no lack of adventurous ideas about patterns and structures in the genetic code. Here are just a few of the ideas in circulation.

One of the themes of the earlier period was the need to find some compelling relation between the numbers 64 and 20. And this quest had spectacular successes: In at least two schemes, the 64 codons could specify exactly 20 amino acids, neither more nor less. The mathematics was so beautiful, it was hard to believe nature would pass up an opportunity to make use of it. Pierre Béland and T. F. H. Allen of the St. Lawrence National Institute of Ecotoxicology in Montreal argue that nature did not miss the opportunity. They propose a primordial genetic code in which information was read from both strands of the DNA at once, and all messages were palindromic, so that they could be read in either direction. Under these conditions, meaning can be assigned to only 20 of the 64 triplets.

A double-stranded translation system may sound outlandish, and yet there are hints that the "antisense" strand of DNA may be more than just a placeholder. Jaromir Konecny, Michael Schöniger and G. Ludwig Hofacker of the Technical University of Munich point out that a rough symmetry of the genetic code creates a kind of antigene opposite every normal gene. Wherever the sense strand calls for a hydrophilic amino acid, the antisense strand (read in the opposite direction) is likely to code for a hydrophobic one. It's even possible that some of these antisense pseudogenes are transcribed in vivo. William F. Pendergraft III and six colleagues at the University of North Carolina at Chapel Hill have recently detected immunological reactions to one such antisense protein.
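The symmetry can be seen directly in the codon table. Reading a codon from the opposite strand means complementing each base and reversing the order, so a middle U always becomes a middle A. The sketch below, assuming the standard table in U, C, A, G order, pairs each middle-U codon with its antisense partner. Which amino acids count as hydrophilic or hydrophobic is a chemical judgment that the code does not attempt to make.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
TABLE = dict(zip(CODONS, AMINO_ACIDS))

COMPLEMENT = str.maketrans("ACGU", "UGCA")  # Watson-Crick pairing in RNA letters

def antisense_codon(codon):
    """Codon read from the opposite strand: complement each base, then reverse."""
    return codon.translate(COMPLEMENT)[::-1]

partners = {}
for codon in CODONS:
    if codon[1] == "U":
        partners[codon] = antisense_codon(codon)
        print(codon, TABLE[codon], "<->", partners[codon], TABLE[partners[codon]])
```

Every codon in the left column stands for one of the middle-U amino acids (F, L, I, M, V); every partner on the right has A in the middle and stands for Y, H, Q, N, K, D, E or a stop.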

More generally, there is growing recognition that the genetic code may encompass more information than just the simple mapping from codons to amino acids. Synonymous codons may not always be completely equivalent. It's certainly true that codon frequencies are not random or uniform. Among the several codons that specify a given amino acid, some may be common and some rare, and these usage biases can vary both within and between genomes. The biases probably help to regulate the rate of protein synthesis: If the transfer RNA that matches a codon is rare, then translation of genes including that codon will be slowed. For some proteins there is evidence that such pace-setting codons help ensure correct folding of the amino acid chain.

Another fertile area is the search for symmetries and patterns in the genetic code. The standard table of codon assignments derives from the obvious representation of the triplet code as a 4×4×4 cube. Several authors, observing that 64 is equal not only to 4³ but also to 2⁶, suggest organizing the codon table as a six-dimensional (2×2×2×2×2×2) hypercube. A mutation is a movement from one vertex to an adjacent vertex in this structure. The geometry is intriguing, and there are interesting connections with Gray codes and even with the I Ching, but I'm not so sure that biologists will find the concept useful.
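To make the hypercube concrete, here is a small sketch in which each base gets a two-bit label, so that a codon becomes a six-bit vertex address. The particular labels are a hypothetical choice, arranged in Gray-code order, and the published schemes may assign them differently. Two vertices are adjacent when their addresses differ in exactly one bit.

```python
from itertools import product

# Hypothetical two-bit labels; going around U, C, G, A flips one bit per step.
BITS = {"U": "00", "C": "01", "G": "11", "A": "10"}
CODONS = ["".join(c) for c in product("UCAG", repeat=3)]

def vertex(codon):
    """Six-bit hypercube address of a codon, as an integer from 0 to 63."""
    return int("".join(BITS[base] for base in codon), 2)

def hamming(x, y):
    """Number of bit positions in which two addresses differ."""
    return bin(x ^ y).count("1")

def point_mutants(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in "UCAG":
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

def edge_mutations(codon):
    """Count the point mutations that move to an adjacent vertex."""
    return sum(hamming(vertex(codon), vertex(m)) == 1 for m in point_mutants(codon))

print(vertex("AUG"), edge_mutations("AUG"))  # 35 6
```

Under this labeling six of the nine point mutations of any codon are steps along an edge of the hypercube, matching the six neighbors of each vertex; the remaining three flip both bits of a base and so cut across a face.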


Not every interesting idea takes the form of a paper in the Journal of Theoretical Biology. Another quite different geometrical interpretation of the genetic code has been presented to the world in the form of a design for a toy. Mark White, a physician and inventor in Bloomington, Indiana, discovered that the genetic code can be represented succinctly on a dodecahedron (a solid whose surface consists of 12 pentagons) or its dual the icosahedron (made up of 20 triangles). Each face of the dodecahedron is labeled with one of the four nucleotides, each of which appears three times. Any grouping of three adjacent faces, read in the right order, generates the appropriate amino acid. White has made prototypes of toys that incorporate this design. He observes that the icosahedral model is closely related to the very first proposal for a triplet genetic code, the "diamond code" devised in 1955 by George Gamow. This neatly closes the circle and takes us back to the beginning of the story.

© Brian Hayes


