Our current research is focused on two main projects: in computational biology, robust algorithms and software for DNA sequence assembly and multiple sequence alignment; in algorithm implementation, portable, efficient software for fundamental string and graph problems.
National Science Foundation CAREER Award, ``Applied Algorithms for Computational Molecular Biology,'' NSF Grant DBI-0196202 funded by the Computational Biology Activities and Theory of Computing programs, September 1997-August 2001, $241,111.
While significant strides have been made in recent years in the instrumentation and molecular biology behind DNA sequencing and mapping, advances in algorithms and software for two key problems are needed to maintain the present rate of progress. Automated DNA sequence assembly is critical for completion of the large-scale sequencing projects being undertaken by the Human Genome Project, and will soon be the bottleneck in current sequencing efforts. High-quality multiple sequence alignment is crucial both for the initial phase of DNA sequencing (to accurately determine the consensus of overlapping fragments and to disambiguate repeats when fragments from separate copies are overlaid), and for the later phase of determining sequence function.
This project aims to develop robust algorithms and software for high-quality DNA sequence assembly in the presence of sequencing error, repetitive elements, and chimeric clones, and multiple sequence alignment under the maximum-trace objective function. The research is unique in two aspects.
Implementations of the algorithms will form the kernel of two systems: Saber, for ``sequence assembly in the presence of error,'' and Primal, for ``practical rigorous multiple alignment.'' This will be the first freely available software for sequence assembly and multiple alignment that produces solutions of guaranteed quality.
Dali, short for ``a discrete algorithms library,'' is a unified implementation of data structures and algorithms for fundamental problems on strings and graphs. The emphasis is on code reusability, portability, efficiency, and elegance. In contrast to many algorithm collections, the implementation is uniform, with consistent, compatible interfaces; in contrast to many object-oriented libraries, the implementation is lightweight, with low operation overhead and small code size.
Data structures currently implemented in Dali include
Dali arose from experiences with implementing algorithms for discrete problems in computational biology that use basic tools from string algorithms and combinatorial optimization. Dali is written in a portable subset of C in a fully object-oriented style, and will be freely available to the research community.