COMPUTING SCIENCE

# Sorting Out the Genome

To put your genes in order, flip them like pancakes

# Breakpoints

In sorting out the process of sorting by reversals, a good place to begin is with the work of John Kececioglu, now of the University of Arizona, and David Sankoff of the University of Ottawa. They were not the first to write about the problem, but they were the first to make substantial progress on it, starting in the early 1990s. Their principal tool was the concept of a breakpoint.

A breakpoint appears wherever a permutation brings together two nonconsecutive numbers. For example, this eight-element permutation has two breakpoints, marked with carets:

A segment *not* interrupted by a breakpoint is called a strip; it consists of consecutive numbers in either ascending or descending order. The importance of breakpoints is that the canonical permutation has none. Thus any procedure that keeps reducing the number of breakpoints until the count reaches zero is a method for sorting a permutation. (A complication is that both the forward permutation 1 2 3...*n* and the backward one *n*...3 2 1 are without breakpoints. The remedy is to augment the sequence with "anchoring" elements 0 and *n*+1, which compel the sorting algorithm to choose the forward direction.)

Breakpoint analysis suggests a lower bound on the complexity of sorting by reversals. A reversal can create or destroy breakpoints only at the ends of the segment being reversed; it follows that a single reversal can eliminate at most two breakpoints. If the initial permutation has *m* breakpoints, no sorting procedure can possibly eliminate all of them in fewer than *m*/2 reversals.

Kececioglu and Sankoff devised a sorting algorithm that comes within a factor of two of this limit; in other words, the algorithm's maximum number of reversals will not exceed the number of breakpoints in the original permutation. The algorithm employs a "greedy" strategy: At each stage in the computation it maximizes the number of breakpoints removed. If there is an opportunity to eliminate two breakpoints, that move is given higher priority than all reversals that remove just one breakpoint, etc.

A focus on breakpoints makes intuitive sense in sorting permutations. After all, the breakpoints are the positions where the numbers are out of order, whereas the strips are already sorted. This line of reasoning leads to the conjecture that an optimal algorithm for sorting by reversals should operate only at breakpoints, never cutting the permutation within a strip. Unfortunately, the conjecture does not hold. Consider the permutation 3 4 1 2, which has a single breakpoint, between the 4 and the 1. Sorting with an algorithm that never cuts a strip takes three reversals: 3412→4312→4321→1234. But another sequence gets to the goal in just two reversals: 3412→1432→1234.

Kececioglu and Sankoff's greedy algorithm runs efficiently but gives only approximate results. They also described a program that makes the opposite tradeoff, finding the optimum sorting sequence but paying a penalty in running time and memory requirements. At the heart of this algorithm is a routine that may have to examine every possible sequence of reversals for sorting the given permutation. The number of such reversal sequences grows exponentially with *n*; indeed, the worst-case rate of growth exceeds *n* * ^{n}* . By means of a "branch and bound" strategy Kececioglu and Sankoff were able to greatly reduce the number of sequences considered, but the worst-case performance remains exponential, and their program was able to find exact solutions only up to about

*n*= 30.

In 1995 Kececioglu and Sankoff speculated that optimal sorting by reversals belongs to the class of computational problems designated "NP-hard." These are problems for which—barring miracles—only exponential algorithms will ever be found. Two years later Alberto Caprara of the University of Bologna proved that finding the shortest sequence of reversals for sorting an arbitrary permutation is indeed NP-hard.

EMAIL TO A FRIEND :