CSc 120: Sequence Similarity
Expected Behavior
Write a Python function seq_sim(seq1, seq2, k) that takes as arguments
two strings seq1 and seq2 and an integer k, and returns a floating-point
value between 0 and 1 (inclusive) giving the similarity between seq1 and
seq2. The similarity value should be computed as the Jaccard index
applied to the sets of k-grams of seq1 and seq2 (where k is the third
argument to the function). See Algorithm A1 on the phylogenetic problem
spec for details of the algorithm.
You can use the code from the previous short problem as a helper function
for this problem.
You can assume that both seq1 and seq2
have length at least k.
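As a rough illustration of the computation described above, here is a minimal sketch. It assumes that a k-gram is a contiguous substring of length k and that the Jaccard index of two sets is the size of their intersection divided by the size of their union. The helper name kgrams is hypothetical (a stand-in for the helper from the previous short problem), and Algorithm A1 in the problem spec remains the authoritative description.

```python
def kgrams(seq, k):
    """Hypothetical helper: the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seq_sim(seq1, seq2, k):
    """Jaccard index of the k-gram sets of seq1 and seq2 (a sketch)."""
    grams1 = kgrams(seq1, k)
    grams2 = kgrams(seq2, k)
    shared = grams1 & grams2      # k-grams that occur in both sequences
    combined = grams1 | grams2    # k-grams that occur in either sequence
    # Both sequences have length at least k, so combined is never empty.
    # float() keeps the division floating-point under Python 2 and Python 3.
    return float(len(shared)) / float(len(combined))
```

Using sets means that repeated k-grams are counted once, which is why a sequence such as 'aaaaaaaaaa' contributes only the single 3-gram 'aaa'.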
IMPORTANT: CloudCoder runs Python 2, whereas we have used Python 3 in
class. One difference between the two is that in Python 2 the '/'
operator performs integer division when both operands are integers
(i.e., 1/2 evaluates to 0), while in Python 3 the '/' operator always
performs floating-point division. You can get around this problem by
writing float(num1)/float(num2) in CloudCoder to compute the
floating-point result of dividing num1 by num2.
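The workaround can be seen in a small sketch. The function name divide is hypothetical and is used only for illustration. Converting both operands with float() gives the same result under either Python version.

```python
def divide(num1, num2):
    """Floating-point quotient that behaves the same in Python 2 and 3."""
    return float(num1) / float(num2)


print(divide(1, 2))   # 0.5 in both Python 2 and Python 3
print(1 // 2)         # 0: '//' is floor division in both versions
```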
Examples
- Call: seq_sim('aaaaaaaaaa', 'aaaa', 3)
  Return value: 1.0
- Call: seq_sim('aaaaab', 'aaaaaa', 3)
  Return value: 0.5
- Call: seq_sim('aabaacaad', 'abaaba', 3)
  Return value: 0.42857142857142855
- Call: seq_sim('aabaacaad', 'abaaba', 4)
  Return value: 0.2857142857142857