Computer Science Colloquium

Colloquium Speaker

Speaker:	S.Cenk Sahinalp Case Western Reserve University Cleveland, Ohio

Topic:	Sequence Nearest Neighbors and Block Edit Operations

Date:	Tuesday, December 5, 2000

Time:	11:00 AM

Place:	Gould-Simpson, Room 701

Refreshments will be served in the 7th-floor lobby of Gould-Simpson at 10:45 AM

ABSTRACT

Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q, we can quickly find a sequence S in D for which d(S,Q) < d(S,T) for any other sequence T in D. Here d(S,Q) denotes the "block edit distance" between sequences S and Q, which is defined to be the minimum number of "block edit" operations needed to transform S to T - all edit operations will be reversible so that d(S,T) = d(T,S). The block edit operations correspond to the notion of similarity between sequences in intended application and include character edits (inserts, replacements, deletes etc) as well as block edits (moves, copies, deletes, reversals) and numerical transformations (scaling by an additive or a multiplicative constant). We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences under any set of edit operations. Our preprocessing time and space is polynomial in size of D and query time is near-linear in size of Q; the approximation factor we achieve is O(log k log*^2 k), where k is the size of the longest sequence in D.
The talk will summarize two papers. The first one was co-authored by G. Cormode, M. Paterson, U. Vishkin and appeared in SODA'00. The second one was coauthored by S. Muthukrishnan and appeared in STOC'00.