CSc 120: Phylogenetic Trees
This problem brings together many different programming constructs and
techniques we covered over the course of the semester including: string
manipulation, (Python) lists, dictionaries, tuples, classes,
list comprehensions, and trees. It is one of the most
technically challenging programs assigned in this class this semester
and it's also one of the most interesting.
Background
An
evolutionary tree (also called a
phylogenetic tree) is a tree that expresses
evolutionary relationships between a set of organisms.
This program involves writing code to construct phylogenetic trees starting from
the genome sequences of a set of organisms. (Of course, there is nothing
inherently genetic about the techniques we use and the code we write: for
example, since programs are sequences of characters, we could just as easily
apply this approach to sets of programs.)
Expected Behavior
Write a Python program, in a file phylo.py, that
behaves as specified below.
-
Read in the input parameters:
-
Read in the name of an input file using input('FASTA file: ').
-
Read in an integer value N using
input('n-gram size: ').
-
Read in the input file. The file format is
specified under Input format (below).
For each organism whose genome sequence appears in the input file:
-
Extract its ID (a string) from the first line of its information.
-
This first line will be followed by the organism's
genome sequence, which is a long sequence of consisting of
the letters A, T, C, and G.
This sequence will appear over a number of adjacent lines
and will be terminated by an empty line. Read in the
genome sequence and concatenate the different lines into a
single string.
-
For each organism, construct the set of its N-grams using the value
N specified by the user in Step 1.
[Programming requirements: item #3].)
Note: You should construct a set of N-grams, i.e., an
unordered collection without duplicates. If you don't, the similarity
values you compute will be different from ours, and the phylogenetic tree
will therefore also be different. You can learn more about Python sets
here.
-
For each pair of distinct organisms name1 and name2
(i.e., name1 ≠ name2), use the N-grams computed
in the previous step to compute the similarity between their
genome sequences (see Algorithm A1 below).
[Programming requirements: item #4].)
-
Use the similarity values computed in the previous step to compute the
phylogenetic tree (see Algorithm A2 and Algorithm A3
below.)
[Programming requirements: item #5].)
-
Print out the phylogenetic tree. Some examples are shown
here.
Input format
The input file uses a text-based format called
FASTA. An file can contain the
genome sequences for several different organisms. Each
such genome sequence has the following format:
-
A description line, which begins with the
character '>' and contains identifying information about
the genome sequence.
For the purposes of this program, each genome sequence has an "ID"
that is used to refer to it. This ID is obtained from the description
line as follows: it begins at the character after the initial
'>' and continues up to the first whitespace (blank or tab)
character in the line.
-
The description line is followed by one or more non-empty lines of
genome sequence data. For the purposes of this program, these lines
should should be concatenated into a single string.
-
The genome sequence ends when an empty line is encountered.
For example, the following is a small portion of the
Zika virus genome (source:
www.ncbi.nlm.nih.gov):
The ID for this genome
sequence, shown highlighted in yellow, is NC_012532.1.
>NC_012532.1 Zika virus, complete genome
AGTTGTTGATCTGTGTGAGTCAGACTGCGACAGTTCGAGTCTGAAGCGAGAGCTAACAACAGTATCAACA
GGTTTAATTTGGATTTGGAAACGAGAGTTTCTGGTCATGAAAAACCCCAAAGAAGAAATCCGGAGGATCC
...
AGACTCCATGAGTTTCCACCACGCTGGCCGCCAGGCACAGATCGCCGAACTTCGGCGGCCGGTGTGGGGA
AATCCATGGTTTCT
Output format
The output format for a tree is as follows:
def __str__(self):
if self.is_leaf():
return self.id()
else:
return "({}, {})".format(str(self.left()), str(self.right()))
Here the method is_leaf() returns True if and only
if the tree node it is invoked on is a leaf node; and id(),
left(),
and right() return, respectively, the ID of the node
and the left and right subtrees of the tree node.
Some examples are shown
here.
Comment: If you want, you can use the code described
here to
print out your tree in a more visually appealing manner. However,
this code is not part of the problem specification and is
provided purely for your entertainment.
If you use this code, make sure you turn
off the pretty-printing before you submit your code.
Algorithms
Algorithm A1. Similarity between two genome sequences
The similarity between two sequences is a number between 0 and 1:
a similarity of 1 indicates that the sequences are identical, while a
similarity of 0 indicates that the sequences have nothing in common.
The intuition behind our algorithm for computing the similarity between two
sequences is that if we chop each sequence up into small pieces, then
similar sequences will have a lot of pieces in common while dissimilar
sequences will not.
This intuition is formalized using
the
Jaccard index.
Given two sequences A and B, let the corresponding sets of
N-grams be ngrams(A) and ngrams(B)
respectively.
Then, the similarity between A and B is given by
similarity(A, B) =
|
| ngrams(A) ⋂ ngrams(B) |
|
| ngrams(A) ⋃ ngrams(B) |
|
where |S| denotes the size of the set S.
Note that different values of N for computing N-grams can produce
different distance values for the same sequences, and thereby give rise to
different phylogenetic trees.
For the purposes of this assignment, you should implement
your N-grams as Python sets. The reason is to simplify the
computation of unions and intersections:
-
to create an empty set: set()
-
the set corresponding to a list L
is given by:
set(L)
-
the union of sets S1 and S2 is
given by:
S1.union(S2)
-
the intersection of sets S1 and S2
is given by:
S1.intersection(S2)
Note: Treating the N-grams for genome sequences as sets means that
the presence or absence of duplicates in the sequences will not affect
their similarity measure. This may not always seem intuitive.
Algorithm A2. Distance between two phylogenetic trees
Algorithm A1, discussed above, computes the similarity between two
sequences. However, the tree construction process described below
proceeds by computing the distance between trees, where each tree
contains a set of sequences. So the sequence-similarity algorithm
described above has to be extended to deal with sets of sequences. This
can be done in a number of different ways, of which we chose one.
For the purposes of this assignment, the similarity between two trees
tree1 and tree2 is given by the highest similarity
between a leaf in tree1 and a leaf in tree2. In
other words:
-
Consider all possible pairs of leaves (leaf1, leaf2) where
leaf1 ∈ tree1 and leaf2 ∈ tree2.
-
Each leaf node considered in the previous step is an organism (because the
leaf nodes of the phylogenetic tree are individual organisms). Each organism
has a genome sequence. Compute the similarity value between each pair of leaf
nodes mentioned in the previous step as the similarity between the genome
sequences of the corresponding organisms.
-
From the set of similarity values computed in the previous step,
pick the largest value. This is the similarity between the trees
tree1 and tree2.
Algorithm A3. Constructing a phylogenetic tree
There are many different algorithms for constructing phylogenetic trees.
For this assignment, you should use the algorithm described below.
-
Initialization.
Construct a list of trees tree_list, each of which is a
leaf node containing a single organism.
-
Iteration.
Repeatedly perform the following steps until tree_list
contains just a single tree:
-
Find trees t1 and t2 in tree_list
(t1 ≠ t2)
that have the highest similarity among all pairs of trees
in tree_list.
-
Create a new tree t0 that has t1
and t2 as its two children.
Important: To facilitate grading, set the left and
right child of the new tree t0 as follows:
-
If str(t1) < str(t2)
choose t1 to be the left child and t2 to
be the right child of t0.
-
Otherwise, choose t2 to be the left child and
t1 to be the right child of t0.
-
Remove t1 and t2 from tree_list.
-
Add t0 to tree_list.
At this point, at the end of one iteration of the loop,
the length of tree_list has decreased by 1.
Programming Requirements
-
You need a way to map the ID of an organism to its genome sequence
as well as the N-grams obtained from that sequence. For this,
implement a class GenomeData representing genome data
for an organism. This should have at least the attributes listed below,
together with the appropriate methods for accessing them:
-
_name or _id: the ID of the
organism;
-
_sequence: the genome sequence for the organism; and
-
_ngrams: the set of N-grams for the organism's genome
sequence.
Put the code implementing this class in a file genome.py.
You will have to import this code into your program using something like
from genome import *
-
Represent the nodes in the phylogenetic tree using a class Tree.
This should have at least the attributes listed below, together with the
appropriate methods for accessing them:
-
_name or _id: the name of the
node. For leaf nodes, this should be the ID of the corresponding
organism. For non-leaf nodes, one simple way to indicate that
the node is not a leaf is to set this attribute to None.
-
_left and _right: references to the
left and right subtrees of the node.
Additionally, I found it convenient to have an attribute at each node
giving the set of IDs of all of the leaf nodes "under" that node:
this came in handy when computing the similarity between two trees.
But this is not a requirement.
Put the code implementing this class in a file tree.py.
You will have to import this code into your program using something like
from tree import *
-
Use a list comprehension to compute the list of N-grams for each
genome sequence (make sure that the last few elements of this
list are not shorter than the value of N specified).
(You can obtain a set of N-gramsd by applying set()
to this list.)
-
Use a dictionary to map pairs of organism IDs to the similarity
between those organisms.
-
To facilitate grading, set the children of each non-leaf node in
the phylogenetic tree as as follows. During the iterative construction
of the tree, when you combine two trees t1 and t2
as subtrees of a new tree node, if
str(t1) < str(t2) then
t1 should be the left child; otherwise t2 should
be the left child of the new node.
Errors
The following are errors. In each case, your program should
give the error message specified (if any) and behave as indicated.
-
Input file cannot be read.
Program behavior: Use a try statement to detect this. Give
an error message and quit.
Error message:
"ERROR: could not open file " + filename
-
The value of N (for the size of N-grams) cannot be read or cannot be
converted to an integer.
Program behavior: Use a try statement to detect this. Give
an error message and quit.
Error message:
"ERROR: Bad value for N"
Examples
Some examples are shown
here.
Testing
This is probably the most complex program assigned in this class this
semester, which makes it essential that you test your program thoroughly.
I strongly urge you to use the testing strategies we have discussed in
class. In particular, you should combine black-box testing, unit (white
box) testing, and judicious use of asserts. Test your
program as you develop it: because of the program's size and complexity,
it will be harder to track down early-stage bugs if you wait all the way
until the end of coding before you start testing.
For black box testing, you may want to start with small files created
specifically for testing. For example, use several small files like
tiny02.fasta: these files
are small enough that you can manually compute N-grams and similarity
values and check that the program is doing what is expected. Move on
to larger input files only after working with several such small input files.
Files to be submitted
This problem requires you to submit three files:
- phylo.py
- genome.py
- tree.py