Events & News

Computer Science Colloquium

Category	Lecture
Date	Thursday, August 7, 2008
Time	11:30 am
Location	GS 906
Details	Light refreshments will be served at 11:15AM in the 9th floor atrium
Speaker	Dean Starrett
Title	Doctoral Candidate
Affiliation	Department of Computer Science

On Aligning Alignments Exactly

PhD Dissertation Defense
Committee Chair: John Kececioglu

An increasingly essential tool in biology is the alignment of multiple sequences. Bio-scientists use multiple sequence alignments for tasks such as predicting protein structure and function, reconstructing phylogenetic trees, and finding motifs.

Computationally hard, both in theory and in practice, constructing multiple alignments is typically done by heuristic methods. The majority of state-of-the-art multiple alignment programs employ a form and polish approach, where a multiple alignment is built up by progressively merging smaller alignments, starting with single sequences. Then in the polishing phase, the alignment is repeatedly split back into smaller alignments and re-merged.

This merging of alignments, the basic computational problem in these construction and local-search phases of the best multiple alignment heuristics, is called the Aligning Alignments Problem. Under the sum-of-pairs objective, it seems this problem may be a simple extension of 2-sequence alignment. However, it is proven here that with affine gap costs (which are recognized as necessary to get biologically-informative alignments of sequence pairs) the problem is NP-complete when gaps are counted exactly. Interestingly, this form of multiple alignment is polynomial-time solvable when we relax the exact count, showing that exact gap counts are inherently hard in multiple sequence alignment.

However, unlike general multiple alignment, we demonstrate that Aligning Alignments (with affine gap costs and exact counts) is tractable in practice with an algorithm and implementation. Our software AlignAlign is both time and space efficient on biological data. Computational experiments on natural biological data show instances from two standard datasets can be optimally aligned with surprising efficiency, and experiments on simulated biological data show this efficiency scales.