Projects

Our current research is focused on two main projects: in computational biology, robust algorithms and software for DNA sequence assembly and multiple sequence alignment; in algorithm implementation, portable, efficient software for fundamental string and graph problems.

Applied algorithms for computational molecular biology
A discrete algorithms library

Applied algorithms for computational molecular biology

National Science Foundation CAREER Award, ``Applied Algorithms for Computational Molecular Biology,'' NSF Grant DBI-0196202 funded by the Computational Biology Activities and Theory of Computing programs, September 1997-August 2001, $241,111.

While significant strides have been made in recent years in the instrumentation and molecular biology behind DNA sequencing and mapping, advances in algorithms and software for two key problems are needed to maintain the present rate of progress. Automated DNA sequence assembly is critical for completion of the large-scale sequencing projects being undertaken by the Human Genome Project, and will soon be the bottleneck in current sequencing efforts. High-quality multiple sequence alignment is crucial both for the initial phase of DNA sequencing (to accurately determine the consensus of overlapping fragments and to disambiguate repeats when fragments from separate copies are overlaid), and for the later phase of determining sequence function.

This project aims to develop robust algorithms and software for high-quality DNA sequence assembly in the presence of sequencing error, repetitive elements, and chimeric clones, and multiple sequence alignment under the maximum-trace objective function. The research is unique in two aspects.

In contrast to most of the prior practical work on sequence assembly and multiple alignment, the software we propose uses techniques from combinatorial optimization to yield solutions of guaranteed quality: when our implementations output an assembly or an alignment, the solution comes either with a guarantee that it is optimal under the objective function, or with a concrete bound on how much its value differs from optimal.
In contrast to most of the prior theoretical work on DNA sequencing and alignment, our algorithms address the sources of error inherent in the data: in sequence assembly, sequencing error, repeats, and chimerism; in multiple alignment, weak and inconsistent pairwise relationships.

Implementations of the algorithms will form the kernel of two systems: Saber, for ``sequence assembly in the presence of error,'' and Primal, for ``practical rigorous multiple alignment.'' This will be the first freely available software for sequence assembly and multiple alignment that produces solutions of guaranteed quality.

Dali: A discrete algorithms library

Dali, short for ``a discrete algorithms library,'' is a unified implementation of data structures and algorithms for fundamental problems on strings and graphs. The emphasis is on code reusability, portability, efficiency, and elegance. In contrast to many algorithm collections, the implementation is uniform, with consistent, compatible interfaces; in contrast to many object-oriented libraries, the implementation is lightweight, with low operation overhead and small code size.

Data structures currently implemented in Dali include

lists,
arrays,
search trees,
hash tables,
mergeable heaps,
disjoint sets, and
directed graphs.

Algorithms that are implemented include

topological sort,
weakly connected components,
strongly connected components,
longest paths in acyclic graphs,
shortest paths with positive edge lengths,
minimum spanning trees,
maximum flow,
minimum source-sink cut,
minimum unrestricted cut,
maximum weight branchings,
maximum weight bipartite matchings,
maximum cardinality general matchings,
nearest common ancestors,
heaviest common subsequence, and
string alignment with linear gap costs.

Currently underway are implementations of

arbitrary integers,
suffix trees,
string matching, and
maximum weight general matchings.

Dali arose from experiences with implementing algorithms for discrete problems in computational biology that use basic tools from string algorithms and combinatorial optimization. Dali is written in a portable subset of C in a fully object-oriented style, and will be freely available to the research community.

Research