COMPUTING SCIENCE

# Prototeins

# How Do Proteins Do It?

Although the prototein model is only a crude caricature of real protein folding, even this simplified simulation can be computationally taxing. For prototeins of length *r*, the number of sequences is 2^{r}, and the number of foldings is approximately 2.6^{r–1}; the effort needed to solve a model is proportional to the product of these numbers. That product grows steeply. The five-bead model can be solved by hand, and a commodity computer disposes of the 10-bead model in seconds. But a chain of 15 beads combines 32,768 sequences with 148,752 folds, for a total of almost 5 billion cases. At 20 beads, the product of sequences and folds is over 20 trillion, which is way beyond the limit of this amateur's patience (and lifespan).

Attacking these models with brute-force computations could turn out to be comically stupid. Maybe there's some clever algorithm waiting to be discovered that will make folding easy. Maybe, but not likely. In the past few months two groups have proved that models much like the one described here belong in the class of hard problems known as NP-complete. Bonnie Berger and Tom Leighton give a proof for the three-dimensional *H-P* model. The two-dimensional case is proved in a quite different way by Pierluigi Crescenzi, Deborah Goldman, Christos Papadimitriou, Antonio Piccolboni and Mihalis Yannakakis.

Showing that a problem is NP-complete doesn't actually prove it is hard; NP-completeness merely certifies that the problem is as hard as a bunch of others, and a method for efficiently solving any one of the problems could be adapted to all the rest. Some miraculous algorithm could sweep the whole class of problems away. But don't hold your breath.

Which brings us back to Levinthal's question: If protein folding is so hard, how do proteins do it? There are three kinds of answers.

One possibility is that protein molecules are capable of mathematical wizardry beyond the reach of conventional computers. This would stand the NP-completeness result on its head; instead of proving that protein folding is hard, it would show that everything else is easy. You could encode an instance of any NP-complete problem in a synthetic sequence of amino acids, then let the protein fold itself up; from the folded configuration you could read out the solution to the original problem.

Another answer is that proteins, contrary to their reputation, do *not* always fold efficiently and spontaneously. Some of them need help, in the form of "chaperone" molecules. Some may fold erroneously and be recycled by proteolytic enzymes. And it's possible that the native state of some proteins is not in fact the state of lowest free energy. A biological molecule doesn't have to be absolutely stable; it only has to last long enough to do its job. Perhaps the appropriate model of protein folding is not an exhaustive search for the best conformation but an approximation algorithm that is guaranteed to quickly find a good folding. William E. Hart and Sorin Istrail have published just such an algorithm for the *H-P* model.

The third option is that proteins do quickly find the best among all possible foldings, but only because they have evolved to exhibit precisely this property. In other words, the only amino acid sequences that survive under natural selection are those that happen to fold rapidly. Sequences that fold hierarchically could fit this description: If small sections of the chain condense independently into secondary structures such as helices, which then aggregate without further internal rearrangement, the combinatorial monster might be tamed. Although this mechanism cannot explain everything about protein folding—indeed, Dill argues that secondary structures are a consequence of compact folding rather than a cause—it certainly helps.

In any case, the idea that any arbitrary amino acid sequence would fold efficiently is surely overoptimistic—which nixes the fantasy of a protein-folding computer for NP-complete problems. But the subset of rapidly folding sequences remains poorly understood. Computational models, even simplistic ones, offer a means of probing it.

In this connection it is worth noting that nature itself has hardly begun to explore the full space of amino acid sequences. All the proteins in all the organisms that ever lived on the earth could not sample more than an utterly negligible fraction of the 20^{100} or so possible sequences. Thus a computation something like the ones carried out in the *H-P* model is running at this moment, all over the planet, in the big green computer.

© Brian Hayes

EMAIL TO A FRIEND :

**Of Possible Interest**

**Spotlight**: Briefings

**Computing Science**: Belles lettres Meets Big Data

**Technologue**: Quantum Randomness