The Invention of the Genetic Code

Brian Hayes

Reality Intrudes

The comma-free codes were not quite the last word in the wildcat era of genetic code-building. In 1959 Robert Sinsheimer suggested a scheme where the genetic alphabet had only two letters; A and C were interpreted as the same symbol, and so were G and U. This device was a way of coping with the recent discovery of wide variations in the ratio of (A+U) to (G+C) in various organisms. Of course reducing the code to binary notation meant that triplets could not code for 20 amino acids; the codons would have to be at least quintuplets (providing 32 combinations).

As far as I know, no one ever proposed a three-letter, ternary code. Such a code might distinguish A from U but lump together C and G, producing 27 codons. This plan has a faint echo in the real genetic code, where the third base in a codon is sometimes interpreted merely as A or G versus U or C.
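The arithmetic behind these counts is easy to check. The short Python sketch below finds the shortest fixed codon length that yields at least 20 combinations for an alphabet of a given size. The function name `min_codon_length` is my own label for illustration, not something drawn from the coding literature of the 1950s.

```python
def min_codon_length(alphabet_size: int, needed: int = 20) -> int:
    """Return the smallest n such that alphabet_size ** n >= needed."""
    if alphabet_size < 2:
        raise ValueError("alphabet must have at least two letters")
    n = 1
    while alphabet_size ** n < needed:
        n += 1
    return n


# Binary (Sinsheimer-style), ternary (hypothetical), and four-letter alphabets.
for k in (2, 3, 4):
    n = min_codon_length(k)
    print(f"{k} letters: codons of length {n}, {k ** n} combinations")
```

A two-letter alphabet needs quintuplets (32 combinations), whereas three letters already suffice with triplets (27), and four letters with triplets give the familiar 64.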

I'm also surprised that no one gave serious thought to schemes where the codons can vary in length. In engineering, the idea of choosing shorter sequences to represent more frequent symbols was already a well-established trick for compressing a message. David Huffman had created a theory of such codes in 1951, and of course the Morse code went back a century further. Biologists were clearly aware of the principle, and they were mindful of coding efficiency, but they did not explore the possibility.
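To make the variable-length idea concrete, here is a minimal sketch of Huffman's construction in Python. The five symbols and their counts are invented purely for illustration; they are not amino acid frequencies, and nothing here was proposed by the biologists of the period.

```python
import heapq
from itertools import count


def huffman_code(freqs: dict) -> dict:
    """Build a prefix-free binary code; frequent symbols get shorter codewords."""
    if len(freqs) == 1:
        return {sym: "0" for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codewords with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in left.items()}
        merged.update({s: "1" + bits for s, bits in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical symbol counts (out of 100), chosen only to show the effect.
freqs = {"A": 45, "B": 25, "C": 15, "D": 10, "E": 5}
code = huffman_code(freqs)
average = sum(freqs[s] * len(code[s]) for s in freqs) / sum(freqs.values())
print(code)
print(f"average codeword length: {average} bits")
```

With these made-up counts the average codeword is 2 bits long, compared with the 3 bits a fixed-length binary code would need to distinguish five symbols.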

Perhaps if the era of speculation had continued a few years more, these wrong ideas would also have been given their turn. But in 1961 the whole coding craze was brought up short by unexpected news from the lab bench. Marshall W. Nirenberg and J. Heinrich Matthaei of the National Institutes of Health announced that artificial RNAs could stimulate protein synthesis in a cell-free system. What's more, the first RNA they tried was poly-U, a long chain of repeating uracil units. In comma-free codes, UUU has to be a nonsense codon, but Nirenberg and Matthaei's result implied that it codes for the amino acid phenylalanine. A few more codons were identified over the next year or two. Then Philip Leder and Nirenberg found an even better experimental protocol, and by 1965 the genetic code was mostly solved.

Figure 5. Codon assignments

The code resembled none of the theoretical notions. As the table assigning codons to amino acids was filled in, it became apparent that the magic number 20 held no magic after all. All the clever mathematical contrivances for getting 20 amino acids out of 64 codons turned out to be figments of the human urge to find pattern, not reflections of any natural order. The "extra" codons are merely redundant: Some amino acids have one or two codons, one has three, some have four, some have six. (Three codons serve as stop signs.) At first glance the mapping between codons and amino acids appeared arbitrary, even haphazard.
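The redundancy can be tallied directly from the codon table. The sketch below encodes the standard genetic code as a 64-character string of one-letter amino acid symbols, with codons taken in U, C, A, G order and `*` marking a stop; the variable names are my own.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon (UUU, UUC, UUA, UUG, UCU, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many codons does each amino acid (or the stop signal) have?
codons_per_symbol = Counter(CODON_TABLE.values())
stop_codons = codons_per_symbol.pop("*")

# How many amino acids share each level of redundancy?
degeneracy = Counter(codons_per_symbol.values())
for n_codons in sorted(degeneracy):
    print(f"{degeneracy[n_codons]} amino acids have {n_codons} codon(s)")
print(f"{stop_codons} stop codons")
```

The tally comes to two amino acids with a single codon, nine with two, one with three (isoleucine), five with four and three with six, which together with the three stops accounts for all 64 triplets.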

Nature also ignored all the mathematical ingenuity applied to solving the frame-shift problem. The living cell does it by a kind of dead-reckoning. Ribosomes march along the messenger RNA in strides of three bases, translating as they go. Except for signals that mark where the ribosome is supposed to start, there is nothing in the code itself to enforce the correct reading frame.
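A few lines of Python show how much depends on that starting point. The message below is a made-up RNA string, and `read_codons` is a hypothetical helper that simply counts off triplets from a given position, as the ribosome is described as doing above.

```python
def read_codons(rna: str, start: int = 0) -> list[str]:
    """Split rna into consecutive triplets, beginning at index start."""
    return [rna[i:i + 3] for i in range(start, len(rna) - 2, 3)]


# Invented 15-base message; only the frame beginning at the AUG is "intended".
message = "AUGGCUUACGAAUGA"

for frame in range(3):
    print(frame, read_codons(message, frame))

# The start signal fixes the frame; everything after it is dead reckoning.
start = message.find("AUG")
print("from start signal:", read_codons(message, start))
```

Shifting the starting point by one or two bases produces entirely different triplets from the same string, and nothing in the triplets themselves flags the error.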

When I mentioned to a biologist friend that I find some of the hypothetical genetic codes of the 1950s more appealing than the real thing, she protested that the actual code is one of the most elegant creations of biochemistry, and she pointed out some of its subtle refinements. The codon table is not entirely arbitrary. Its redundancies confer a kind of error tolerance, in that many mutations convert between synonymous codons. When a mutation does alter an amino acid, the substitute is likely to have properties similar to those of the original. Computer simulations by David Haig and Laurence D. Hurst show that the present code is nearly optimal in this respect.
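One crude way to see the error tolerance is to count how many single-base substitutions leave the amino acid unchanged. The sketch below does only that, using the standard codon table; it is not a reconstruction of the Haig and Hurst simulations, which also weigh how chemically similar the substituted amino acids are. The function name is my own.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in U, C, A, G order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_by_position():
    """For each codon position, count substitutions that keep the amino acid.

    Only mutations starting from one of the 61 sense codons are considered.
    """
    same = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                if CODON_TABLE[mutant] == aa:
                    same[pos] += 1
    return same, total


same, total = synonymous_by_position()
for pos in range(3):
    print(f"position {pos + 1}: {same[pos]} of {total[pos]} substitutions are synonymous")
```

Nearly all of the silent substitutions fall at the third base of the codon: 126 of 183 there, against 8 at the first base and none at the second.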

These observations suggest that I should be grateful my genes were not designed by George Gamow or Francis Crick. With Gamow's overlapping codes, any mutation could alter three adjacent amino acids at once, probably disabling the protein. Comma-free codes are even more brittle in this respect, since a mutated codon is likely to become nonsense and terminate translation.
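The fragility of a fully overlapping triplet code can be sketched in a few lines. The sequence and the mutation below are arbitrary examples, and `codons_changed` is a hypothetical helper: a stride of 1 models a fully overlapping reading, a stride of 3 the non-overlapping reading used by real ribosomes.

```python
def codons_changed(seq: str, pos: int, new_base: str, stride: int) -> int:
    """Count triplets that differ after replacing seq[pos] with new_base."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    starts = range(0, len(seq) - 2, stride)
    return sum(seq[i:i + 3] != mutated[i:i + 3] for i in starts)


sequence = "ACGUACGUAC"  # arbitrary example
position, replacement = 4, "G"  # change the A at index 4 to a G

print("overlapping:", codons_changed(sequence, position, replacement, stride=1))
print("non-overlapping:", codons_changed(sequence, position, replacement, stride=3))
```

In the overlapping reading the single change touches three successive triplets; read in strides of three, it touches only one.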

But criticisms of this kind are not entirely fair. They pluck the invented code out of its theoretical context and plug it into a biochemical system that has been evolving for three billion years or more in concert with a very different code. It's like replacing a man's arms with the wings of a bird and expecting him to fly. The reciprocal transplant would be no more successful. That is, if we should ever visit a planet where life has evolved for a few billion years with a comma-free genetic code, we would doubtless find that our own code was maladaptive.

Imagine that in 1957 a clairvoyant biologist offered as a hypothesis the exact genetic code and mechanism of protein synthesis understood today. How would the proposal have been received? My guess is that Nature would have rejected the paper. "This notion of the ribosome ratcheting along the messenger RNA three bases at a time—it sounds like a computer reading a data tape. Biological systems don't work that way. In biochemistry we have templates, where all the reactants come together simultaneously, not assembly lines where machines are built step by step."
