

The Invention of the Genetic Code

Brian Hayes

Comma-Free Codes

By the later 1950s, there was growing support for the idea of messenger RNA—a single-strand molecule acting as an intermediary between DNA and the protein-synthesizing machinery. At the same time Crick was formulating the "adaptor hypothesis," the idea that amino acids do not interact directly with messenger RNA but are carried by small molecules that recognize specific codons. (Today, of course, the adaptor molecules have been identified as transfer RNAs.) The codons were by then thought to be nonoverlapping triplets of bases.

The process of gene expression was imagined as going something like this. First the appropriate segment of DNA was transcribed into messenger RNA; like replication, this was done by blind copying, without regard to the meaning of the sequence. Then the messenger RNA stretched out in the cytoplasm of the cell with its long row of codons exposed like a sow's nipples. Each adaptor molecule, already charged with the correct amino acid, poked around until it latched onto the right codon. When all the codons were occupied, the amino acids were linked together, and the completed protein was peeled off the template.

The scenario must have seemed highly plausible. Even looking back from the 1990s, it seems like the kind of chemistry that living organisms do. The nonsequential pattern-matching needed to line up adaptors on the messenger RNA is vaguely like an enzyme-substrate reaction or like the binding of antibody to antigen. And yet there was a serious problem with the vision of piglets suckling on RNA: A piglet might very well wind up between nipples.

Suppose somewhere in a messenger RNA is the partial sequence ... UGUCGUAAG.... (Note that in RNA uracil replaces the thymine of DNA, and so the code is written with U rather than T.) The intended reading is ... UGU, CGU, AAG..., but the RNA molecule has no spaces or commas to indicate codon boundaries. The sequence could equally well be read as ... UG, UCG, UAA, G ... or ... U, GUC, GUA, AG.... Each of these alternatives would have a different meaning. Furthermore, in the suckling-pig model of protein synthesis, adaptor molecules that attached to the messenger RNA in different reading frames might interfere with one another and prevent any protein at all from being produced.
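The three competing readings are easy to generate mechanically. The following Python sketch is purely illustrative (the function name reading_frames is invented here, not taken from the historical literature); it groups a sequence into triplets starting at each of the three possible offsets and keeps any leftover bases at the ends as partial groups.

```python
def reading_frames(seq):
    """Return the three ways of grouping seq into triplets, one per offset."""
    frames = []
    for offset in range(3):
        groups = []
        if offset:
            # Partial group left over before the first full triplet.
            groups.append(seq[:offset])
        # Full triplets, plus a possibly short group at the very end.
        groups.extend(seq[i:i + 3] for i in range(offset, len(seq), 3))
        frames.append(groups)
    return frames


for groups in reading_frames("UGUCGUAAG"):
    print(", ".join(groups))
```

The first line printed is the intended reading; the other two are the shifted readings described above.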

Figure 3. Overlapping code packs

The frame-shift problem doesn't arise with an overlapping code, because all three reading frames are simultaneously valid. With sequential codons, however, the translation machinery has to be guided to the right frame. In 1957 Crick devised a solution that seemed at once so clever and so obvious that it just had to be right. He suggested that adaptor molecules might exist for only a subset of the 64 codons, with the result that only that subset would be meaningful; the rest of the triplets would be "nonsense codons." Then the trick is to construct a code in such a way that when any two meaningful codons are put next to each other, the frame-shifted overlap codons are always nonsense. For example, if CGU and AAG are sense codons, then GUA and UAA must be nonsense, because they appear inside the concatenated sequence CGUAAG. Similarly, AGC and GCG are ruled out by the sequence AAGCGU. If all the out-of-frame triplets are nonsense, then the message has only one reading. A code with this property is said to be comma-free, since messages remain unambiguous even when words are run togetherwithoutcommasorspaces.
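The bookkeeping in the CGU and AAG example can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, with a hypothetical helper named out_of_frame_triplets; it collects every triplet that straddles the boundary when two sense codons are placed side by side.

```python
def out_of_frame_triplets(first, second):
    """Return the two triplets that straddle the boundary of first + second."""
    joined = first + second
    return [joined[1:4], joined[2:5]]


sense = ["CGU", "AAG"]
must_be_nonsense = set()
for a in sense:
    for b in sense:
        # Every ordered pair counts, including a codon followed by itself.
        must_be_nonsense.update(out_of_frame_triplets(a, b))

print(sorted(must_be_nonsense))
```

Besides the four triplets named in the text, the result also includes those produced when CGU or AAG is followed by a copy of itself.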

Do such codes exist? In English you might try to find a subset of all three-letter words that can be jammed together without creating any additional instances of the words in the subset. To make the problem more manageable, consider this list of 10 three-letter words: ass, ate, eat, sat, sea, see, set, tat, tea, tee. Is there a subset that forms a comma-free language? Trial and error shows that the words ate, eat and tea cannot all appear together, because teatea, for example, contains both eat and ate. Similarly, sea combines with tat, tea or tee to produce eat. One set of words that has no conflicts is ass, sat, see, set, tat, tea and tee.
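Trial and error of this kind is tedious by hand but trivial for a program. The Python sketch below is an illustration only; the function name conflicts and its output format are assumptions made for this example. For every ordered pair of words in a candidate set, it reports any word of the set that turns up across the boundary between them.

```python
from itertools import product


def conflicts(words):
    """List (first, second, hidden) triples where hidden is a word of the
    set found straddling the boundary of first + second."""
    code = set(words)
    found = []
    for first, second in product(sorted(code), repeat=2):
        joined = first + second
        for start in (1, 2):  # the two shifted positions
            hidden = joined[start:start + 3]
            if hidden in code:
                found.append((first, second, hidden))
    return found


# ate, eat and tea get in one another's way...
print(conflicts(["ate", "eat", "tea"]))
# ...whereas the seven-word set reports no conflicts at all.
print(conflicts(["ass", "sat", "see", "set", "tat", "tea", "tee"]))
```

An empty list means the set is comma-free; any entry names a pair of words that cannot safely be run together.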

Figure 4. To build a comma-free code

How many words can a comma-free code include? For the case of RNA, Crick and his Cambridge colleagues John Griffith (another physicist) and Leslie Orgel carried out a straightforward analysis. They pointed out first that the codons AAA, CCC, GGG and UUU cannot appear in any comma-free code, since they cannot combine with themselves without generating reading-frame ambiguity. The remaining 60 codons can be sorted into groups of three, where the codons within each group are related by a cyclic permutation. For example, the codons AGU, GUA and UAG form one such group. A comma-free code can have no more than one codon from each of these permutation classes. How many classes are there? Dividing 60 objects into groups of three produces exactly 20 groups. Bingo!
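The counting argument can be checked by brute force. The Python sketch below (an illustration, with hypothetical names such as rotations) enumerates all 64 triplets, discards the four made of a single repeated base, and groups the rest by cyclic permutation.

```python
from itertools import product

BASES = "ACGU"


def rotations(codon):
    """Return all cyclic permutations of a codon as a frozenset."""
    return frozenset(codon[i:] + codon[:i] for i in range(len(codon)))


codons = ["".join(p) for p in product(BASES, repeat=3)]
usable = [c for c in codons if len(set(c)) > 1]  # drops AAA, CCC, GGG, UUU
classes = {rotations(c) for c in usable}         # one entry per permutation class

print(len(codons), len(usable), len(classes))
```

Only one member of each class can be used because a codon followed by itself contains its own rotations: AGUAGU, for instance, includes both GUA and UAG.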

The analysis just given sets the maximum possible size of a comma-free genetic code, but it does not guarantee that a maximal code actually exists. Nevertheless, Crick, Griffith and Orgel went on to construct several examples. And they offered a vision of how the code might work: "This scheme ... allows the intermediates to accumulate at the correct positions on the template without ever blocking the process by settling, except momentarily, in the wrong place. It is this feature which gives it an advantage over schemes in which the intermediates are compelled to combine with the template one after the other in the correct order."
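That a 20-word code really can be built is easy to confirm by computer. The Python sketch below is an illustration and is not claimed to reproduce any of the published examples: it assumes an arbitrary ordering of the bases (A < C < G < U) and keeps every triplet whose middle base is strictly greater than the first and at least as great as the last. The helper is_comma_free is a hypothetical name for a direct check of the definition.

```python
from itertools import product

BASES = "ACGU"  # assumed ordering for the rule below: A < C < G < U


def is_comma_free(code):
    """True if no code word straddles the boundary of two joined code words."""
    code = set(code)
    for first, second in product(code, repeat=2):
        joined = first + second
        if joined[1:4] in code or joined[2:5] in code:
            return False
    return True


# Keep triplets xyz with x < y and y >= z under the assumed ordering.
candidate = [
    x + y + z
    for x, y, z in product(BASES, repeat=3)
    if x < y >= z
]

print(len(candidate), is_comma_free(candidate))
```

The rule works because a shifted triplet always violates one of the two inequalities: the first shift begins with a pair in which the first base is not smaller than the second, and the second shift ends with a pair in which the middle base is smaller than the last.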

Crick and his colleagues were quick to point out that they had no experimental evidence for the comma-free code. As a nonoverlapping code, it put no constraints on amino acid sequences, so there was no point in looking for confirmation there. The code did strongly constrain the base sequences of DNA and RNA, but those sequences were unknown. "The arguments and assumptions which we have had to employ to deduce this code are too precarious for us to feel much confidence in it on purely theoretical grounds," they wrote. "We put it forward because it gives the magic number – 20 – in a neat manner and from reasonable physical postulates." The magic number was enough to persuade both biologists and the wider public. Carl Woese later wrote: "The comma-free codes received immediate and almost universal acceptance.... They became the focus of the coding field, simply because of their intellectual elegance and the appeal of their numerology.... For a period of five years most of the thinking in this area either derived from the comma-free codes or was judged on the basis of compatibility with them."

The intellectual elegance also attracted the attentions of coding-theory professionals, most notably Solomon W. Golomb, now at the University of Southern California. Golomb and his colleagues (including the physicist-biologist Max Delbrück) wrote several papers on comma-free codes, taking the biological problem as their point of departure but going on to explore more abstract and generalized ideas. They quickly deduced a formula for the maximum size of a comma-free code: For an alphabet of n letters grouped into k-letter words, the formula takes a particularly simple form when k is a prime: (n^k - n)/k. For n = 4 and k = 3 (the case of interest to biologists) they showed that there are 408 maximal comma-free codes and gave a procedure for constructing them. And they devised some more elaborate related codes. For example, a transposable comma-free code is designed so that both strands of the DNA have the comma-free property. Using triplets, the largest transposable code has only 10 codons, but a quadruplet code yields 20. Golomb also invented a genetic code based on sextuplets; it is not only comma-free and transposable but also can correct any two simultaneous errors in translation, and detect a third error. Life would be a lot more reliable if Solomon Golomb were in charge.
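For prime word lengths the formula is simple enough to evaluate directly: n raised to the power k, minus n, all divided by k. The Python sketch below is an illustration only (the function name max_comma_free_size is hypothetical); it confirms that the biological case of four letters and three-letter words gives 20.

```python
def max_comma_free_size(n, k):
    """Evaluate (n**k - n) / k, the size quoted in the text for prime k.

    For prime k the division is exact (Fermat's little theorem), so
    integer division loses nothing. Other values of k are outside the
    scope of this simple form.
    """
    return (n**k - n) // k


print(max_comma_free_size(4, 3))
```

The subtraction removes the n words made of a single repeated letter, and the division by k counts one word per class of cyclic permutations, mirroring the argument of Crick, Griffith and Orgel for triplets.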


