University of Arizona, Department of Computer Science

CSc 120: Sequence-set Similarity

Expected Behavior

Write a Python function seq_set_sim(seq_set1, seq_set2, k) that takes as arguments two sets of strings, seq_set1 and seq_set2, and an integer k, and returns a floating-point value between 0 and 1 (inclusive) giving the similarity between the two sets. Compute the similarity value as follows:

Use the Jaccard index to compute the similarity between individual strings.
Compute the similarity between the sets of strings seq_set1 and seq_set2 as the maximum similarity between any string in seq_set1 and any string in seq_set2.
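
The two steps above can be sketched as follows. This is a minimal sketch, not a reference solution. It assumes that the Jaccard index of two strings is taken over their sets of k-grams (all substrings of length k), which is consistent with the examples in this handout. The helper names kgrams and jaccard are hypothetical stand-ins for the helpers from the previous short problems.

```python
def kgrams(s, k):
    """Return the set of all length-k substrings of s (hypothetical helper)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(set1, set2):
    """Jaccard index: size of the intersection divided by size of the union.

    Assumes at least one of the sets is non-empty, which holds here because
    every string has length at least k.
    """
    return len(set1 & set2) / len(set1 | set2)


def seq_set_sim(seq_set1, seq_set2, k):
    """Maximum Jaccard similarity over all pairs (s1, s2) with s1 in seq_set1
    and s2 in seq_set2, comparing the strings by their k-gram sets."""
    return max(
        jaccard(kgrams(s1, k), kgrams(s2, k))
        for s1 in seq_set1
        for s2 in seq_set2
    )
```

For instance, with k = 3 the string 'aaaa' has the single k-gram 'aaa' and 'aaab' has the k-grams 'aaa' and 'aab', so their Jaccard index is 1/2.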

You can use the code from the previous short problems as helper functions for this problem.

You can assume that seq_set1 and seq_set2 are both non-empty and that the strings in these sets all have length at least k.

Examples

Call: seq_set_sim(set(['aaaa','aabb']), set(['aaab']), 3)
Return value: 0.5
Call: seq_set_sim(set(['aaabba','aabbcc']), set(['aaab','abbc']), 4)
Return value: 0.3333333333333333
Call: seq_set_sim(set(['aaabba','abbc']), set(['aaab','aabbcc']), 2)
Return value: 0.6
Call: seq_set_sim(set(['ababab','acacac']), set(['bababa','cacaca']), 3)
Return value: 1.0
Call: seq_set_sim(set(['abbbbba','bcccccb']), set(['aaaaab','aaaaac']), 3)
Return value: 0.0