COMPUTING SCIENCE

# Prototeins

# Self-Avoidance

A program for studying prototeins has two main tasks to accomplish. The first chore is to generate all possible sequences of *H'*s and *P*'s. This part is easy; it's just binary counting. Any prototein sequence of length *r* can be mapped onto an *r*-bit binary number, simply by replacing each 1 in the binary representation of the number with an *H,* and each 0 with a *P*. The complete set of *r*-bit sequences is enumerated by counting from 0 to 2^{r}–1. For example, in the case of *r* = 5 there are 32 sequences, starting with *PPPPP*, *PPPPH* and *PPPHP*, and continuing through *HHHHH*.

The program's second task is to generate all possible foldings of each sequence. This is a little more challenging. A folding is modeled by a self-avoiding walk: a path through the lattice that visits no site more than once. The shortest self-avoiding walks are easy to analyze. On the square lattice there are exactly four self-avoiding walks one step long, namely the walks that move one site north, east, west or south of the origin. Each of these walks can be extended in three different ways to form two-step walks. The walk that begins with an eastward step can continue with a second step to the east, north or south; it cannot go west, because it would be retracing its own steps, and that is forbidden. Thus there are 4 x 3 = 12 walks of two steps each. The same kind of reasoning shows there are 36 three-step walks. But beyond this point the counting begins to get messy. Consider the three-step walk that goes first east, then north, then west. On the fourth step this walk cannot turn east, since that would constitute illegal backtracking. It also cannot go south, since it would thereby return to the origin—a site it has already visited. Hence there are only two available directions for this particular walk, whereas some other walks still have three options. It gets even worse: A walk can box itself in so that there are *no* legal moves, and the walk has to be abandoned.

When I began experimenting with algorithms for self-avoiding walks, I found them so diverting that I thought I might never get back to the larger project of folding proteins. I could easily fill up an entire column with self-avoiding walks—and so that is what I have decided to do. I will make them the subject of a future column, and here give only a brief summary of how they fit into the world of prototeins.

To survey all possible foldings of a prototein of *r* beads, you must generate all self-avoiding walks of *r*–1 steps. There is no shortcut for producing the complete set of walks; you have to enumerate them all. And each time you add a step, you have to check to make sure the destination site is not already occupied. There are tricks for speeding up the process, but none of them fundamentally change the nature of the algorithm.

As the walks get longer, the effort of counting them grows exponentially; adding one step multiplies the number of walks by about 2.6. Through a prodigious feat of computing, A. R. Conway and A. J. Guttmann have counted all the self-avoiding walks of up to 51 steps (there are more than 10^{22}), but for the amateur in self-avoidance the practical limit is probably between 20 and 30 steps. If your computer has enough memory, you can store a list of walks rather than regenerate them for each prototein sequence; this saves a great deal of time.

Symmetries can reduce the number of walks you need to generate or store. For the purposes of molecular modeling, taking two steps east and one step north is no different from going two steps north and one step west; the paths are the same but for a 90-degree rotation. When all such symmetries are taken into account, the number of unique walks is cut to approximately 1/16th the total number. But there are still plenty of walks. At a length of 15 steps, 401,629 unique walks remain after all symmetries are eliminated.

Given a procedure for generating binary *H-P* sequences and another procedure for generating self-avoiding walks, it is a simple matter to combine them. The idea is to produce all possible combinations of sequences and walks, folding up each sequence into the geometry defined by each walk. From this collection of folded molecules you can then gather statistical information—such as the average number of *H-H* contacts—or search for notably good foldings.

EMAIL TO A FRIEND :

**Of Possible Interest**

**Computing Science**: Clarity in Climate Modeling

**Sightings**: Cell by Cell, Life Appears

**Feature Article**: Candy Crush's Puzzling Mathematics