Computer Science Colloquium

Colloquium Speaker

Speaker:		Tao Jiang McMaster University Hamilton, Ontario,Canada

Topic:		Approximation Algorithms for Multiple Sequence Alignment

Date:		Tuesday, March 23, 1999

Time:		3:15 PM

Place:		Gould-Simpson, Room 701, Tandy Warnow's class

ABSTRACT

The primary structure of a deoxyribonucleic acid (DNA) molecule is a sequence consisting of four types of letters, A, C, G, and T, each standing for a nucleotide. The length of such a DNA sequence ranges from several thousand letters for a simple virus to three billion letters for a human. We all know that these long and mysterious sequences encode Life as well as genetic diseases, but decoding the sequences is perhaps one of the most challenging tasks in the world. The ultimate goal of molecular biology is to understand what segments of a DNA are responsible for a biological function such as the color of eyes or a genetic disease such as cancer, and how these segments are formed and work. These functionally meaningful segments of a DNA are usually called genes.
To find genes which are responsible for a biological function, a biologist often compares a set of DNA sequences that share the same function and tries to identify regions which are ``conserved'' in all of these sequences. On the other hand, a biologist may also infer the ``closeness'' of organisms by comparing their DNA sequences and computing the degree of similarity of the sequences. Such ``closeness'' information is useful in the inference of evolutionary trees. These are all example issues in an important field of computational molecular biology commonly referred to as sequence comparison.
Multiple sequence alignment is one of the fundamental problems in sequence comparison. It is known as one of the most challenging problems in computational molecular biology, because all of its meaningful formulations lead to NP-hard optimization problems. The problem also arises in some fields of computer science such as syntactical analysis of HTML tables and electronic chain mail analysis. In this lecture, I will first give a brief introduction to some popular models of multiple sequence alignment including multiple alignment with tree cost and sum-of-pairs cost, and then a survey of recent approximation algorithms for these problems. A few simple constructions and analyses will be sketched if time permits. The algorithms not only run in reasonable time, but also produce results with a guaranteed quality. Some of the algorithms have been implemented and are available for download via WWW.