University of Arizona, Department of Computer Science

CSc 120: Sequence Similarity

Expected Behavior

Write a Python function seq_sim(seq1, seq2, k) that takes two strings, seq1 and seq2, and an integer k, and returns a floating-point value between 0 and 1 (inclusive) giving the similarity between seq1 and seq2. Compute the similarity as the Jaccard index of the sets of k-grams of seq1 and seq2 (where k is the third argument to the function): the number of k-grams that occur in both sets divided by the number of k-grams that occur in at least one of them. See Algorithm A1 in the phylogenetic problem spec for details of the algorithm.

You can use the code from the previous short problem as a helper function for this problem.

You can assume that both seq1 and seq2 have length at least k.
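The description above can be sketched roughly as follows. This is only one possible shape for a solution, not the reference implementation: the helper name kgrams is hypothetical (your code from the previous short problem may use a different name or return type), and the sketch assumes k is at least 1.

```python
def kgrams(seq, k):
    """Return the set of all length-k substrings of seq.

    Hypothetical helper: stands in for the code from the previous
    short problem, whose actual name and interface may differ.
    """
    return set(seq[i:i + k] for i in range(len(seq) - k + 1))


def seq_sim(seq1, seq2, k):
    """Return the Jaccard index of the k-gram sets of seq1 and seq2."""
    grams1 = kgrams(seq1, k)
    grams2 = kgrams(seq2, k)
    shared = grams1 & grams2      # k-grams that occur in both sequences
    combined = grams1 | grams2    # k-grams that occur in at least one sequence
    # Both sequences have length at least k, so combined is never empty.
    # float() keeps the division floating-point under Python 2 as well.
    return float(len(shared)) / float(len(combined))
```

Using sets means that a k-gram repeated many times within one sequence is counted only once, which is why two sequences of different lengths can still have similarity 1.0.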

IMPORTANT: CloudCoder runs Python 2, while we have used Python 3 in our class. One of the differences between the two is that in Python 2 the '/' operator performs integer division when both operands are integers (i.e., 1/2 evaluates to 0), while in Python 3 the '/' operator always performs floating-point division. You can get around this problem by writing float(num1)/float(num2) in CloudCoder to compute the floating-point result of dividing num1 by num2.
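As a small illustration of the workaround, the snippet below divides two integers the portable way. The values 3 and 7 are arbitrary choices for the example; num1 and num2 are the names used in the note above.

```python
num1 = 3
num2 = 7

# With two int operands, num1 / num2 evaluates to 0 under Python 2.
# Converting to float first gives a floating-point quotient
# under both Python 2 and Python 3.
ratio = float(num1) / float(num2)
print(ratio)
```

Converting the operands before dividing is what matters: float(num1 / num2) would still perform the integer division first under Python 2.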

Examples

  1. Call: seq_sim('aaaaaaaaaa', 'aaaa', 3)
    Return value: 1.0

  2. Call: seq_sim('aaaaab', 'aaaaaa', 3)
    Return value: 0.5

  3. Call: seq_sim('aabaacaad', 'abaaba', 3)
    Return value: 0.42857142857142855

  4. Call: seq_sim('aabaacaad', 'abaaba', 4)
    Return value: 0.2857142857142857
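
To see where a value such as the one in Example 3 comes from, it can help to print the intermediate sets. The trace below is just an illustrative sketch that builds the k-gram sets inline; the variable names are arbitrary.

```python
seq1, seq2, k = 'aabaacaad', 'abaaba', 3

# Collect every length-k substring of each sequence into a set.
grams1 = set(seq1[i:i + k] for i in range(len(seq1) - k + 1))
grams2 = set(seq2[i:i + k] for i in range(len(seq2) - k + 1))

shared = grams1 & grams2      # k-grams in both sequences
combined = grams1 | grams2    # k-grams in at least one sequence

print(sorted(grams1))   # ['aab', 'aac', 'aad', 'aba', 'aca', 'baa', 'caa']
print(sorted(grams2))   # ['aab', 'aba', 'baa']
print(len(shared))      # 3
print(len(combined))    # 7
```

Three shared 3-grams out of seven distinct 3-grams overall gives 3/7, which matches the return value listed for Example 3.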