Colloquium Speaker

Speaker: S.Cenk Sahinalp
Case Western Reserve University
Cleveland, Ohio
Topic:Sequence Nearest Neighbors and Block Edit Operations
Date:Tuesday, December 5, 2000
Time:11:00 AM
Place:Gould-Simpson, Room 701


Refreshments will be served in the 7th-floor lobby of Gould-Simpson at 10:45 AM


ABSTRACT


Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q, we can quickly find a sequence S in D for which d(S,Q) < d(S,T) for any other sequence T in D. Here d(S,Q) denotes the "block edit distance" between sequences S and Q, which is defined to be the minimum number of "block edit" operations needed to transform S to T - all edit operations will be reversible so that d(S,T) = d(T,S). The block edit operations correspond to the notion of similarity between sequences in intended application and include character edits (inserts, replacements, deletes etc) as well as block edits (moves, copies, deletes, reversals) and numerical transformations (scaling by an additive or a multiplicative constant). We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences under any set of edit operations. Our preprocessing time and space is polynomial in size of D and query time is near-linear in size of Q; the approximation factor we achieve is O(log k log*^2 k), where k is the size of the longest sequence in D.

The talk will summarize two papers. The first one was co-authored by G. Cormode, M. Paterson, U. Vishkin and appeared in SODA'00. The second one was coauthored by S. Muthukrishnan and appeared in STOC'00.