COMPUTING SCIENCE

# Prototeins

Anyone who has ever struggled to fold a roadmap should have an extra measure of respect for protein molecules, which fold up all on their own and practically put themselves away in the glove box. Protein folding is so remarkably efficient that it has been called a paradox. Thirty years ago Cyrus Levinthal pointed out that a typical protein molecule has so many possible configurations that it would need eons to explore all of them and find the best shape; yet proteins fold in seconds.

Looking at the tangled loops and coils of a folded protein, you might imagine that the arrangement is haphazard—like a randomly crumpled map rather than a properly folded one—but in fact every twist and turn is precisely specified. Chemically, a protein is a linear polymer, a sequence of the smaller molecules called amino acids, which are joined end to end like pop-beads. The sequence of amino acids is the only information about the protein encoded in the genes, but the protein can do its job only if the one-dimensional chain of amino acids folds into the correct three-dimensional structure. Apparently the sequence alone is enough to guide the folding. If two protein molecules have the same sequence, they fold up into the same shape.

One way to gain a better appreciation of the protein molecule's knack for folding is to simulate it with a computer program. The most detailed simulations track the motion of every atom and try to reproduce all the chemistry and physics going on in the system. The ultimate goal is to predict the native structure of the protein based on nothing more than the sequence of amino acids. Unfortunately, that goal is a distant one. The models require hours of computer time just to simulate a few picoseconds of molecular dynamics.

I have been exploring a protein model at the other end of the complexity scale—a minimalist model, where every aspect of the simulation is reduced to its simplest possible form. A model so abstract cannot reveal anything about the structure of particular protein molecules—it cannot show how insulin or myoglobin folds—but it may offer clues to some general principles of protein folding. For example, one might hope to learn what kinds of amino acid sequences lead to a stable and compact molecule.

The great advantage of a really simple model is that you can solve it exactly, at least for short chains of amino acids. You can examine every possible folding of every possible sequence, picking out the ones of interest. You can know with certainty which configurations have the most favorable properties.

Another advantage of a minimalist model is that you don't have to be an expert in protein chemistry or molecular dynamics to play with it. A curious amateur can write a rudimentary program in a few days or weeks, and run it on commonly available machinery. Indeed, the simplified protein structures are so well suited to the needs of the amateur that I am tempted to call them amteins—they're not quite ready to turn pro yet. However, I have been persuaded to choose a name slightly less facetious, and so I shall call them prototeins.

# Foursquare Folding

The specific model I've been toying with was devised 10 years ago by Ken A. Dill of the University of California at San Francisco, who has continued to explore it since then with the help of several colleagues. Almost all of my experiments merely replicate their earlier work.

Dill's molecules would not be recognized as proteins by a biochemist (or by a ribosome, for that matter). They are radically simplified in three ways.

First, whereas real proteins are constructed from 20 kinds of amino acids (which differ in size, shape, electric charge, affinity for water and other properties), the building blocks of prototeins come in just two flavors. Dill designates them *H* and *P*, for *hydrophobic* and *polar*; the *H* units repel water while the *P* units attract it.

Second, the various forces acting between amino acids in proteins (electrostatic attractions and repulsions, hydrogen bonds, solvent interactions) are reduced in prototeins to a single rule: *H*'s like to stick together. The *P* units in prototeins are inert, neither attracting nor repelling.

Third, prototeins do their folding on a lattice, as if the molecules were laid out on graph paper. Think of the *H's* and *P's* as colored dots placed at the grid points of the lattice; the chemical bonds in the backbone of the prototein are lines drawn on the grid to connect the dots. Confining the molecules to a lattice is a major computational convenience. It keeps the number of configurations finite. If the chain could bend and twist in continuous space, there would be no clear way of counting the arrangements, and you could never be sure you had tried them all. Dill and others have explored several lattice geometries in both two and three dimensions. My own experiments all inhabit the two-dimensional square lattice, which is the simplest.

Dots and lines on graph paper: That's really all there is to a prototein. Or else the model could be described in terms of colored beads, laid down on a board with a gridlike pattern of dimples to hold the beads in place. To build a sequence of amino acids, you string together *H*-beads and *P*-beads in whatever order you choose. To fold the molecule, you arrange the string of beads on the lattice board. The string is not allowed to stretch or break, and so successive beads in the sequence have to occupy nearest-neighbor sites on the lattice. No two beads can be piled up at the same site, and so the chain cannot cross itself. If two *H*-beads that are not adjacent within the linear sequence wind up on adjacent sites after the chain is folded, their attraction creates a cross-link, or contact, that helps to stabilize the molecule. Foldings that give rise to many such contacts are favored over those with few contacts.

Simple and abstract the model surely is—so much so that you can't help wondering if the process of abstraction hasn't sucked all the life out of it. The squared-off, flattened molecules certainly don't look very biological. But the proof of the prototein is in the folding.

# Self-Avoidance

A program for studying prototeins has two main tasks to accomplish. The first chore is to generate all possible sequences of *H'*s and *P*'s. This part is easy; it's just binary counting. Any prototein sequence of length *r* can be mapped onto an *r*-bit binary number, simply by replacing each 1 in the binary representation of the number with an *H,* and each 0 with a *P*. The complete set of *r*-bit sequences is enumerated by counting from 0 to 2^{r}–1. For example, in the case of *r* = 5 there are 32 sequences, starting with *PPPPP*, *PPPPH* and *PPPHP*, and continuing through *HHHHH*.

The program's second task is to generate all possible foldings of each sequence. This is a little more challenging. A folding is modeled by a self-avoiding walk: a path through the lattice that visits no site more than once. The shortest self-avoiding walks are easy to analyze. On the square lattice there are exactly four self-avoiding walks one step long, namely the walks that move one site north, east, west or south of the origin. Each of these walks can be extended in three different ways to form two-step walks. The walk that begins with an eastward step can continue with a second step to the east, north or south; it cannot go west, because it would be retracing its own steps, and that is forbidden. Thus there are 4 x 3 = 12 walks of two steps each. The same kind of reasoning shows there are 36 three-step walks. But beyond this point the counting begins to get messy. Consider the three-step walk that goes first east, then north, then west. On the fourth step this walk cannot turn east, since that would constitute illegal backtracking. It also cannot go south, since it would thereby return to the origin—a site it has already visited. Hence there are only two available directions for this particular walk, whereas some other walks still have three options. It gets even worse: A walk can box itself in so that there are *no* legal moves, and the walk has to be abandoned.

When I began experimenting with algorithms for self-avoiding walks, I found them so diverting that I thought I might never get back to the larger project of folding proteins. I could easily fill up an entire column with self-avoiding walks—and so that is what I have decided to do. I will make them the subject of a future column, and here give only a brief summary of how they fit into the world of prototeins.

To survey all possible foldings of a prototein of *r* beads, you must generate all self-avoiding walks of *r*–1 steps. There is no shortcut for producing the complete set of walks; you have to enumerate them all. And each time you add a step, you have to check to make sure the destination site is not already occupied. There are tricks for speeding up the process, but none of them fundamentally change the nature of the algorithm.

As the walks get longer, the effort of counting them grows exponentially; adding one step multiplies the number of walks by about 2.6. Through a prodigious feat of computing, A. R. Conway and A. J. Guttmann have counted all the self-avoiding walks of up to 51 steps (there are more than 10^{22}), but for the amateur in self-avoidance the practical limit is probably between 20 and 30 steps. If your computer has enough memory, you can store a list of walks rather than regenerate them for each prototein sequence; this saves a great deal of time.

Symmetries can reduce the number of walks you need to generate or store. For the purposes of molecular modeling, taking two steps east and one step north is no different from going two steps north and one step west; the paths are the same but for a 90-degree rotation. When all such symmetries are taken into account, the number of unique walks is cut to approximately 1/16th the total number. But there are still plenty of walks. At a length of 15 steps, 401,629 unique walks remain after all symmetries are eliminated.

Given a procedure for generating binary *H-P* sequences and another procedure for generating self-avoiding walks, it is a simple matter to combine them. The idea is to produce all possible combinations of sequences and walks, folding up each sequence into the geometry defined by each walk. From this collection of folded molecules you can then gather statistical information—such as the average number of *H-H* contacts—or search for notably good foldings.

# High-Scoring Molecules

What makes for a good folding? In proteins the usual measure is the Gibbs free energy, a thermodynamic quantity that depends on both energy and entropy. If you could tug on the ends of a protein chain and straighten it out, the result would be a state of high energy and low entropy. The energy is high because amino acids that "want" to be close together are held at a distance; the entropy is low because the straight chain is a highly ordered configuration. When you let go, the chain springs back into a shape with lower energy and higher entropy, changes that translate into a lower value of the Gibbs free energy. The "native" state of a protein—the folding it adopts under natural conditions—is usually assumed to be the state with the lowest possible free energy.

Prototeins can get along with a simpler folding criterion. Standard practice is to rank foldings simply by counting *H*-*H* contacts. It's more like keeping score than measuring energy. If the *H*'s are viewed as analogues of hydrophobic amino acids, the scoring system reflects the tendency of hydrophobic groups to seek shelter from water. But the prototein model is so abstract that it doesn't really matter what kind of force is at play between the *H*'s. Just say that *H*'s are sticky, and it takes energy to pull them apart.

One strategy for finding good folds, then, is to look for configurations that maximize the number of *H-H* contacts. A program to carry out the search runs through all the foldings of all the sequences of a given length, keeping only those foldings with the maximum number of contacts.

How many contacts are possible in a folded prototein? A little doodling on graph paper shows that the highest possible ratio of contacts to *H*'s is 7:6. Sequences that attain this limit are exceedingly rare. (I leave it as a puzzle for the reader to find the shortest such sequence, which I believe has 26 beads.) But proteins are not required to solve such mathematical puzzles. To find the stablest configurations of a given sequence, all you need do is find the foldings that have more *H-H* contacts than any other foldings of the same sequence, whether or not the number of contacts is the theoretical maximum. There is a shortcut for identifying these stable foldings. It begins with the sequence made up entirely of *H*'s, which is rather like double-sided sticky tape that collapses on itself in a crumpled ball. If any sequence at all has a folding with a given number of *H-H* contacts, then that configuration must also be among the stablest foldings of the all-*H* sequence. In the all-*H* folding, however, some of the *H*'s may not form contacts, and so they can be changed to *P*'s without altering the score of the folding. By making all such substitutions, you recover the sequence with the minimum number of *H*'s that can give rise to a given folding.

Sequences with rigid, heavily cross-linked folds are fairly rare. Among chains with 21 beads the maximum number of *H-H* contacts is 12, and a chain must have at least 14 *H*'s to reach this limit. There are only 80 sequences of 14 *H*'s and 7 *P*'s that produce 12 contacts, out of the universe of more than two million 21-bead sequences.

Figure 1 shows some of the 80 maximally cross-linked 21-bead prototeins, along with a few other foldings chosen at random. The two populations of molecules are very different. The randomly chosen configurations tend to be loose and floppy, and their average number of *H-H* contacts works out to less than 1. The highest-scoring folds, in contrast, are all very compact, with the chain either wound around itself in a spiral shape or folded into zigzags.

A lifelike feature of the compact foldings is a tendency for the *H*'s to congregate in the interior of the molecule, leaving the *P*'s exposed on the surface. The model has no explicit rule favoring the formation of such a hydrophobic core; it happens automatically when you select foldings with numerous *H-H* contacts. In this connection, Dill points out that for short prototein chains a two-dimensional lattice model may be more realistic than a three-dimensional one. The reason is that the perimeter-to-area ratio of a short chain in two dimensions approximates the surface-to-volume ratio of a longer chain in three dimensions.

Not all features of the high-scoring prototein foldings inspire confidence in the model's realism. For example, a disproportionate number of the best sequences have *H*'s at both ends, and these molecules tend to fold up with their ends tucked into the hydrophobic core. The reason is easy to see: An *H* at the end of a chain can participate in three contacts, whereas interior *H*'s can have no more than two. But the sticky-end effect is an artifact of the model; there is no comparable phenomenon in real proteins.

Another peculiarity can be traced to the choice of a square lattice. Two *H*'s on a square lattice can form a contact only if they are separated within the prototein sequence by an even number of intervening beads. As a result, every prototein can be divided into odd and even subsequences that do not interact. No such parity effect is seen in proteins. This failure of realism is unfortunate; on the other hand, the segregation of odd and even sublattices allows some very handy optimizations in a simulation program.

# Escaping Degeneracy

Are folds that maximize the number of *H-H* contacts the best folds for a prototein? Not necessarily.

A high *H-H* score enhances a molecule's stability, which is certainly a useful property in a biopolymer, but there are other factors to consider as well. Stability implies that once a molecule is folded, it will probably stay folded. It's also important, however, that all molecules with the same sequence fold up to yield the same structure. The way to achieve such uniformity is to select sequences that have a unique best folding, even if that folding does not have the highest possible *H-H* score.

A molecule with many equally good foldings is said to have a degenerate ground state. The all-*P* sequence is an obvious example: Every folding has an *H-H* score of zero. The all-*H* sequence is also degenerate. Obviously, any sequences with unique preferred foldings must be found between these extremes, but the existence of such sequences cannot be taken for granted. You can search for them by sorting all the foldings of a sequence into bins according to their *H-H* score; if the highest-scoring bin has a single occupant, that sequence has a unique best folding.

On the square lattice, uniquely folding sequences do exist for all but one of the chain lengths I was able to test. (The exception is length 5.) The longest chains I examined have 14 beads. Among the 16,384 sequences of this length, 955 have a unique folding. Within this subset, 96 foldings have seven *H-H* contacts, which is the maximum observed in 14-bead prototeins. The sequences in this elite subset, combining uniqueness with high stability, might be considered among the most lifelike prototeins.

Low degeneracy and numerous contacts are not the only criteria for judging a prototein fold. Martin Karplus and Eugene Shakhnovich work with three-dimensional lattice models and employ a more realistic energy spectrum than the simple contact counting of the *H-P* scheme. Their findings highlight the importance of having a large energy gap between the best folding and the next-best one. They have also looked into the kinetics of folding, asking not just which configuration is stablest but also how long it takes a randomly wriggling molecule to find that conformation. Among 200 candidate sequences, 30 repeatedly discovered the state of minimum energy after no more than 50 million small random rearrangements.

# How Do Proteins Do It?

Although the prototein model is only a crude caricature of real protein folding, even this simplified simulation can be computationally taxing. For prototeins of length *r*, the number of sequences is 2^{r}, and the number of foldings is approximately 2.6^{r–1}; the effort needed to solve a model is proportional to the product of these numbers. That product grows steeply. The five-bead model can be solved by hand, and a commodity computer disposes of the 10-bead model in seconds. But a chain of 15 beads combines 32,768 sequences with 148,752 folds, for a total of almost 5 billion cases. At 20 beads, the product of sequences and folds is over 20 trillion, which is way beyond the limit of this amateur's patience (and lifespan).

Attacking these models with brute-force computations could turn out to be comically stupid. Maybe there's some clever algorithm waiting to be discovered that will make folding easy. Maybe, but not likely. In the past few months two groups have proved that models much like the one described here belong in the class of hard problems known as NP-complete. Bonnie Berger and Tom Leighton give a proof for the three-dimensional *H-P* model. The two-dimensional case is proved in a quite different way by Pierluigi Crescenzi, Deborah Goldman, Christos Papadimitriou, Antonio Piccolboni and Mihalis Yannakakis.

Showing that a problem is NP-complete doesn't actually prove it is hard; NP-completeness merely certifies that the problem is as hard as a bunch of others, and a method for efficiently solving any one of the problems could be adapted to all the rest. Some miraculous algorithm could sweep the whole class of problems away. But don't hold your breath.

Which brings us back to Levinthal's question: If protein folding is so hard, how do proteins do it? There are three kinds of answers.

One possibility is that protein molecules are capable of mathematical wizardry beyond the reach of conventional computers. This would stand the NP-completeness result on its head; instead of proving that protein folding is hard, it would show that everything else is easy. You could encode an instance of any NP-complete problem in a synthetic sequence of amino acids, then let the protein fold itself up; from the folded configuration you could read out the solution to the original problem.

Another answer is that proteins, contrary to their reputation, do *not* always fold efficiently and spontaneously. Some of them need help, in the form of "chaperone" molecules. Some may fold erroneously and be recycled by proteolytic enzymes. And it's possible that the native state of some proteins is not in fact the state of lowest free energy. A biological molecule doesn't have to be absolutely stable; it only has to last long enough to do its job. Perhaps the appropriate model of protein folding is not an exhaustive search for the best conformation but an approximation algorithm that is guaranteed to quickly find a good folding. William E. Hart and Sorin Istrail have published just such an algorithm for the *H-P* model.

The third option is that proteins do quickly find the best among all possible foldings, but only because they have evolved to exhibit precisely this property. In other words, the only amino acid sequences that survive under natural selection are those that happen to fold rapidly. Sequences that fold hierarchically could fit this description: If small sections of the chain condense independently into secondary structures such as helices, which then aggregate without further internal rearrangement, the combinatorial monster might be tamed. Although this mechanism cannot explain everything about protein folding—indeed, Dill argues that secondary structures are a consequence of compact folding rather than a cause—it certainly helps.

In any case, the idea that any arbitrary amino acid sequence would fold efficiently is surely overoptimistic—which nixes the fantasy of a protein-folding computer for NP-complete problems. But the subset of rapidly folding sequences remains poorly understood. Computational models, even simplistic ones, offer a means of probing it.

In this connection it is worth noting that nature itself has hardly begun to explore the full space of amino acid sequences. All the proteins in all the organisms that ever lived on the earth could not sample more than an utterly negligible fraction of the 20^{100} or so possible sequences. Thus a computation something like the ones carried out in the *H-P* model is running at this moment, all over the planet, in the big green computer.

© Brian Hayes

EMAIL TO A FRIEND :