COMPUTING SCIENCE

# Sorting Out the Genome

To put your genes in order, flip them like pancakes

# Gene Sorting

Geneticists give *Drosophila* genes charming names like *groucho*, *kojak* and *dreadlocks*, but for algorithmic purposes it's a lot easier to just number them. From this point of view a chromosome is nothing more than a permutation of the numbers from 1 through *n*.

The goal is a method for finding the reversal distance between any two permutations, such as 3 1 5 2 4 and 4 1 2 3 5. Because the labeling of the genes is arbitrary, however, we can always assign the numbers in such a way that one configuration is the *canonical* permutation 1 2 3...*n*, with the integers in ascending order. Then the task is simply to *sort* the other permutation, finding a sequence of reversals that restores it to the canonical order. Instead of converting 4 1 2 3 5 into 3 1 5 2 4, we can sort the sequence 5 2 4 1 3 into its canonical order 1 2 3 4 5; the operations required are identical.

Can every permutation be sorted by some series of reversal moves, without any need for other kinds of operations? A simple variation on the naive pancake-flipping algorithm supplies an affirmative answer. From any starting configuration, there must be some block of genes that can be rotated 180 degrees to bring gene 1 into its correct, leftmost position. Thereafter some other block can be flipped to put gene 2 in its proper place; moreover, this can be done without disturbing gene 1. Here is the full sequence of reversals that sorts our five-digit example (the underline designates the segment that is about to be reversed):

Using this left-to-right algorithm, the five genes are sorted with four reversals. It's not hard to see that any *n*-element array can be sorted by the same method in at most *n -* 1 reversals. The procedure is similar to the bottom-up pancake algorithm, but because we don't have to work only from one end of the chromosome, each gene can be put in its place with a single flip rather than two.

Unfortunately, the left-to-right algorithm cannot always be counted on to find the *shortest* sequence of reversals for sorting a chromosome. In this example, four steps is not the minimum; there is a three-step sequence that accomplishes the same task:

But there is no obvious logic behind this series of moves, so the question becomes: How do you find such shorter pathways, and how do you know when you've found the shortest route of all?

The importance of identifying the shortest reversal sequence is not that we need a high-performance algorithm for sorting genes on chromosomes. Sorting the genes is not really the point. The shortest pathway matters because it is the best available estimate of the true reversal distance between two genomes—the measure of their evolutionary kinship.

In general, we can't reconstruct the actual course of an evolutionary process with any certainty, but parsimony argues that short and direct pathways are more likely than long, zigzag ones. A change in gene order from 1 2 3 to 3 2 1 *might* have come about through a multistep transformation such as 123→213→312→321, but the single reversal 123→321 is usually a better guess. Thus the shortest pathway is always the starting point for inferring a phylogeny. (The shortest series of reversals is not necessarily unique; there may be several reversal pathways of the same minimal length.)

EMAIL TO A FRIEND :