COMPUTING SCIENCE

# Prototeins

## High-Scoring Molecules

What makes for a good folding? In proteins the usual measure is the Gibbs free energy, a thermodynamic quantity that depends on both energy and entropy. If you could tug on the ends of a protein chain and straighten it out, the result would be a state of high energy and low entropy. The energy is high because amino acids that "want" to be close together are held at a distance; the entropy is low because the straight chain is a highly ordered configuration. When you let go, the chain springs back into a shape with lower energy and higher entropy, changes that translate into a lower value of the Gibbs free energy. The "native" state of a protein—the folding it adopts under natural conditions—is usually assumed to be the state with the lowest possible free energy.

Prototeins can get along with a simpler folding criterion. Standard practice is to rank foldings simply by counting *H-H* contacts. It's more like keeping score than measuring energy. If the *H*'s are viewed as analogues of hydrophobic amino acids, the scoring system reflects the tendency of hydrophobic groups to seek shelter from water. But the prototein model is so abstract that it doesn't really matter what kind of force is at play between the *H*'s. Just say that *H*'s are sticky, and it takes energy to pull them apart.
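As a minimal sketch of this scoring rule, the snippet below counts contacts for a single folding on the square lattice. The move-string encoding (`U`, `D`, `L`, `R`) and the function names are assumptions made for illustration, not taken from any particular program. A contact is taken to be a pair of *H*'s on adjacent lattice sites that are not neighbours along the chain.

```python
# Unit steps on the square lattice (an assumed encoding, chosen for illustration).
STEPS = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def walk_to_coords(moves):
    """Turn a string of moves into the list of lattice sites the chain visits."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = STEPS[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def hh_contacts(sequence, moves):
    """Count H-H contacts for a sequence of 'H'/'P' beads folded along `moves`."""
    coords = walk_to_coords(moves)
    if len(coords) != len(sequence):
        raise ValueError("a chain of n beads needs exactly n - 1 moves")
    if len(set(coords)) != len(coords):
        raise ValueError("folding is not self-avoiding")
    site = {c: i for i, c in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only to the right and upward so each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = site.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score


print(hh_contacts("HPPH", "URD"))  # the two end beads meet: 1 contact
print(hh_contacts("HHHH", "RRR"))  # a straight chain has no contacts: 0
```

Beads that are adjacent in the sequence always sit on adjacent sites, so they are excluded from the count; only contacts created by the folding itself earn points.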

One strategy for finding good folds, then, is to look for configurations that maximize the number of *H-H* contacts. A program to carry out the search runs through all the foldings of all the sequences of a given length, keeping only those foldings with the maximum number of contacts.
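A brute-force version of that search, for a single sequence, might look like the following. This is a sketch under stated assumptions rather than the program described here: the function name `best_foldings` is hypothetical, foldings that differ only by rotation or reflection are counted separately, and the running time grows exponentially, so it is practical only for short chains.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def best_foldings(sequence):
    """Return (max_contacts, foldings) for one H/P sequence by exhaustive search.

    Each folding is a tuple of lattice coordinates, one per bead.
    """
    n = len(sequence)
    best = {"score": -1, "folds": []}
    coords = [(0, 0)]
    occupied = {(0, 0): 0}  # lattice site -> index of the bead sitting there

    def extend(score):
        i = len(coords)  # index of the bead about to be placed
        if i == n:
            if score > best["score"]:
                best["score"], best["folds"] = score, [tuple(coords)]
            elif score == best["score"]:
                best["folds"].append(tuple(coords))
            return
        x, y = coords[-1]
        for dx, dy in NEIGHBOURS:
            site = (x + dx, y + dy)
            if site in occupied:
                continue  # the chain may not cross itself
            # Contacts gained by placing bead i here: H neighbours already on
            # the lattice, excluding the bead's predecessor in the chain.
            gained = 0
            if sequence[i] == "H":
                for ex, ey in NEIGHBOURS:
                    j = occupied.get((site[0] + ex, site[1] + ey))
                    if j is not None and j != i - 1 and sequence[j] == "H":
                        gained += 1
            coords.append(site)
            occupied[site] = i
            extend(score + gained)
            del occupied[site]
            coords.pop()

    extend(0)
    return best["score"], best["folds"]


score, folds = best_foldings("HHPPHH")
print(score, len(folds))
```

Running the same function over every *H*/*P* string of a given length, and keeping only the top scorers, gives the full search described above. In practice one would also fix the direction of the first step to avoid revisiting symmetric copies of the same shape.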

How many contacts are possible in a folded prototein? A little doodling on graph paper shows that the highest possible ratio of contacts to *H*'s is 7:6. Sequences that attain this limit are exceedingly rare. (I leave it as a puzzle for the reader to find the shortest such sequence, which I believe has 26 beads.) But proteins are not required to solve such mathematical puzzles. To find the stablest configurations of a given sequence, all you need do is find the foldings that have more *H-H* contacts than any other foldings of the same sequence, whether or not the number of contacts is the theoretical maximum.

There is a shortcut for identifying these stable foldings. It begins with the sequence made up entirely of *H*'s, which is rather like double-sided sticky tape that collapses on itself in a crumpled ball. If any sequence at all has a folding with a given number of *H-H* contacts, then that configuration must also be among the stablest foldings of the all-*H* sequence. In the all-*H* folding, however, some of the *H*'s may not form contacts, and so they can be changed to *P*'s without altering the score of the folding. By making all such substitutions, you recover the sequence with the minimum number of *H*'s that can give rise to a given folding.
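The substitution step of the shortcut can be sketched as follows. Given a folding as a list of lattice coordinates (an assumed representation), the hypothetical helper `minimal_h_sequence` treats every bead as an *H*, then keeps an *H* only where the bead touches some other bead that is not its neighbour along the chain.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def minimal_h_sequence(coords):
    """Fewest-H sequence that keeps every contact of the all-H chain in this folding."""
    site = {c: i for i, c in enumerate(coords)}
    beads = []
    for i, (x, y) in enumerate(coords):
        touching = False
        for dx, dy in NEIGHBOURS:
            j = site.get((x + dx, y + dy))
            # A contact needs an occupied adjacent site that is not a chain neighbour.
            if j is not None and abs(i - j) > 1:
                touching = True
                break
        beads.append("H" if touching else "P")
    return "".join(beads)


# A six-bead hairpin: out along the bottom row, back along the top row.
hairpin = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
print(minimal_h_sequence(hairpin))  # HHPPHH
```

The two beads at the turn of the hairpin touch only their chain neighbours, so they become *P*'s, while the four beads that face each other across the fold remain *H*'s.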

Sequences with rigid, heavily cross-linked folds are fairly rare. Among chains with 21 beads the maximum number of *H-H* contacts is 12, and a chain must have at least 14 *H*'s to reach this limit. There are only 80 sequences of 14 *H*'s and 7 *P*'s that produce 12 contacts, out of the universe of more than two million 21-bead sequences.
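The sizes of these populations are easy to check. The short calculation below assumes that every string of *H*'s and *P*'s counts as a distinct sequence (a chain and its reversal are not merged); the figure of 80 is simply the one quoted above, not something the code derives.

```python
from math import comb

n_beads, n_h = 21, 14

all_sequences = 2 ** n_beads      # every H/P string of length 21
with_14_h = comb(n_beads, n_h)    # strings with exactly 14 H's and 7 P's
winners = 80                      # count quoted in the text (taken as given)

print(all_sequences)              # 2097152
print(with_14_h)                  # 116280
print(f"{winners / with_14_h:.3%} of the 14-H sequences")
print(f"{winners / all_sequences:.4%} of all 21-bead sequences")
```

So even among sequences with the right composition, fewer than one in a thousand reach the 12-contact limit.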

Figure 1 shows some of the 80 maximally cross-linked 21-bead prototeins, along with a few other foldings chosen at random. The two populations of molecules are very different. The randomly chosen configurations tend to be loose and floppy, and their average number of *H-H* contacts works out to less than 1. The highest-scoring folds, in contrast, are all very compact, with the chain either wound around itself in a spiral shape or folded into zigzags.

A lifelike feature of the compact foldings is a tendency for the *H*'s to congregate in the interior of the molecule, leaving the *P*'s exposed on the surface. The model has no explicit rule favoring the formation of such a hydrophobic core; it happens automatically when you select foldings with numerous *H-H* contacts. In this connection, Dill points out that for short prototein chains a two-dimensional lattice model may be more realistic than a three-dimensional one. The reason is that the perimeter-to-area ratio of a short chain in two dimensions approximates the surface-to-volume ratio of a longer chain in three dimensions.

Not all features of the high-scoring prototein foldings inspire confidence in the model's realism. For example, a disproportionate number of the best sequences have *H*'s at both ends, and these molecules tend to fold up with their ends tucked into the hydrophobic core. The reason is easy to see: An *H* at the end of a chain can participate in three contacts, whereas interior *H*'s can have no more than two. But the sticky-end effect is an artifact of the model; there is no comparable phenomenon in real proteins.
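The counting behind the sticky-end effect can be written out. Every site on the square lattice has four neighbours; an interior bead gives up two of them to its neighbours along the chain, an end bead only one. Writing $h_{\text{int}}$ and $h_{\text{end}}$ for the numbers of interior and terminal *H*'s (symbols introduced here for illustration), and noting that each contact uses one free neighbouring site at each of its two *H*'s, the number of contacts $C$ satisfies

$$
C \;\le\; \frac{2\,h_{\text{int}} + 3\,h_{\text{end}}}{2}, \qquad h_{\text{end}} \in \{0, 1, 2\}.
$$

This is only an upper bound and need not be attainable. For a 21-bead chain with 14 *H*'s, two of them at the ends, it gives $C \le (2 \cdot 12 + 3 \cdot 2)/2 = 15$, which is consistent with the observed maximum of 12 contacts. Moving an *H* from the interior to an end raises the numerator by one, which is why high-scoring sequences favor sticky ends.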

Another peculiarity can be traced to the choice of a square lattice. Two *H*'s on a square lattice can form a contact only if they are separated within the prototein sequence by an even number of intervening beads. As a result, every prototein can be divided into odd and even subsequences that do not interact. No such parity effect is seen in proteins. This failure of realism is unfortunate; on the other hand, the segregation of odd and even sublattices allows some very handy optimizations in a simulation program.
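The parity rule follows from colouring the lattice like a checkerboard: each step along the chain changes colour, and adjacent sites always differ in colour. The sketch below checks the rule by brute force for short chains. The function names are hypothetical, and the enumeration is a plain depth-first search, not the optimized program alluded to above.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def can_touch(i, j):
    """Parity rule: beads i and j can be in contact only if an even, nonzero
    number of beads lies between them, i.e. |i - j| is odd and at least 3."""
    gap = abs(i - j)
    return gap >= 3 and gap % 2 == 1


def self_avoiding_walks(n):
    """Yield every self-avoiding placement of n beads, starting at the origin."""
    coords = [(0, 0)]
    occupied = {(0, 0)}

    def extend():
        if len(coords) == n:
            yield tuple(coords)
            return
        x, y = coords[-1]
        for dx, dy in NEIGHBOURS:
            site = (x + dx, y + dy)
            if site not in occupied:
                coords.append(site)
                occupied.add(site)
                yield from extend()
                occupied.remove(site)
                coords.pop()

    yield from extend()


def contact_pairs(coords):
    """All pairs of beads on adjacent sites that are not chain neighbours."""
    site = {c: i for i, c in enumerate(coords)}
    pairs = set()
    for i, (x, y) in enumerate(coords):
        for dx, dy in ((1, 0), (0, 1)):
            j = site.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def parity_rule_holds(n):
    """True if every contact in every n-bead folding obeys the parity rule."""
    return all(can_touch(i, j)
               for walk in self_avoiding_walks(n)
               for i, j in contact_pairs(walk))


print(parity_rule_holds(8))  # True
```

A search program can exploit this by testing a newly placed bead only against earlier beads of the opposite parity, which roughly halves the number of neighbour checks.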
