Constructing Phylogenetic Trees Using Maximum Likelihood
4-9-2012
Recommended Citation
Cho, Anna, "Constructing Phylogenetic Trees Using Maximum Likelihood" (2012). Scripps Senior Theses. Paper 46.
http://scholarship.claremont.edu/scripps_theses/46
This Open Access Senior Thesis is brought to you for free and open access by the Scripps Student Scholarship at Scholarship @ Claremont. It has been
accepted for inclusion in Scripps Senior Theses by an authorized administrator of Scholarship @ Claremont.
Constructing Phylogenetic Trees using
Maximum Likelihood
Anna Cho
March 9, 2012
Department of Mathematics
Abstract
Acknowledgments
I’d like to thank my parents for their love and support. I wouldn’t have
gotten to the point I am at today without them. I’d also like to thank my
advisor, Professor Towse, for all of the advice and emotional support that
he offered me on my thesis. Thank you for your energy and for helping me
see the beauty of math. I’d also like to thank my reader, Professor Radun-
skaya, for all of her brilliant advice. I would have been lost without your
help.
Chapter 1
More DNA sequences are being analyzed today than ever before.
With this increase in the accumulation of DNA sequences comes a
demand for the study of the ancestry of organisms and their phylogenetic
trees. Scientists are interested in how closely related one species is to an-
other. Studying phylogenetic trees and the evolutionary processes that they
model allows scientists to gain a better understanding of how organisms
have arrived at the state they are in today. Phylogenetic trees give us the
ability to see how species evolve and adapt throughout different time peri-
ods with different conditions and needs. Studying these evolutionary pro-
cesses is clearly important to the advancement of biology, but finding the
correct phylogenetic tree for a set of related species is very difficult consid-
ering that we are only given the data that we can observe today, namely the
DNA sequences of those species. We have overcome much of this difficulty
using statistical inference. Statistical models and Markov models allow us
to estimate how similar a phylogenetic tree is to the actual, unknown phy-
logenetic tree for a given set of DNA sequences. We use the maximum like-
lihood method to infer what the true phylogenetic tree of our set of data
looks like. Maximum likelihood uses an explicit evolutionary model. We
assume that the data we observe are identically distributed according to this model.
Before defining maximum likelihood, we review some of the terminol-
ogy used in this statistical approach. We have been using the term phylo-
genetic tree to indicate a branching diagram describing a set of species and
their common ancestors. The terms phylogenetic tree and evolutionary tree
are often used interchangeably in the field of computational biology. In
2 The Stochastic Process of DNA Base Substitutions
this paper, we will be using the term phylogenetic tree exclusively. The set
of data, namely the DNA sequences of the species we are observing, will
be at the tips of the phylogenetic tree. The internal nodes of the tree repre-
sent the DNA sequences of the ancestors of the species we are examining.
The segments of the diagram that connect one DNA sequence to another
are called the branches of the tree. Finally, the root of the tree represents the
DNA sequence of the sole common ancestor of all of the species we observe
in our set of data. Each species at the tip of the tree can be traced back to
this common ancestor at the root of the tree. Figure 1.1 shows an example
of a phylogenetic tree.
We refer to the tree shape, that is, the way in which the tips, nodes, and root
are connected by branches, as the topology. It is important to note that this use
of the word topology is different from the branch of mathematics known
as topology. To topologists, each phylogenetic tree would have the same,
trivial topology and would be indistinguishable. In evolutionary biology,
two topologies are considered different from one another if one topology
cannot be recreated from the second topology without disassembling
a connection between two nodes or between a node and a tip. Figure 1.2
shows both an example of equivalent topologies and an example of
topologies that are different from one another.
Figure 1.2: Tree A and Tree B have equivalent topologies. Tree C has the
same number of tips as Tree A and Tree B, but the branches connecting the
two middle tips cannot be changed to look like Tree A and Tree B without
detaching them. Thus, Tree C has a different topology from Tree A and
Tree B.
1.1 Likelihood
The process of finding a phylogenetic tree using maximum likelihood in-
volves finding the topology and branch lengths of the tree that will give us
the greatest probability of observing the DNA sequences in our data. After
each step, we take the likelihood of each tree that we examine. The tree
that gives us the largest likelihood is then chosen to be examined in the
next step. We will describe this process in more detail in Chapters 2 and 3.
This definition seems quite simple, but we need to be careful not to use
the term likelihood as we otherwise would in everyday English. When we say "the
likelihood of a phylogenetic tree," we are not referring to the probability of
seeing that particular tree. Rather, we are referring to the probability of see-
ing the DNA sequences that we have in front of us, given that phylogenetic
tree.
In order to see how likelihood works, we will first consider a very sim-
ple example.
that randomly chooses the probability with which the robot will raise its
left arm for the entire season of the game show. Out of ten episodes of
the gameshow, six contestants left with a prize while four contestants went
away with nothing. One ambitious frequent viewer of the show is inter-
ested in finding the likelihood of the robot’s decision to raise its left arm
for this season. Let p be the probability that the robot raises its left arm,
and let X be the proportion of times the robot raises its left arm. One
hypothesis we could consider is that the robot is fair. Then, the likelihood that
p = 0.5 is P(X = 3/5 | p = 0.5). If the ten episodes are independent, this is
the binomial probability C(10, 6)(0.5)^10 ≈ 0.205.
Another hypothesis we could consider is that the robot has a 3/5 probability
of raising its left arm and a 2/5 probability of raising its right arm.
The likelihood of this hypothesis is P(X = 3/5 | p = 3/5) = C(10, 6)(3/5)^6(2/5)^4 ≈ 0.251. Then,
P(X = 3/5 | p = 0.5) < P(X = 3/5 | p = 3/5). The viewer’s best guess is
that the probability of the robot raising its left arm is 3/5.
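The comparison of the two hypotheses can be checked numerically. The following is a minimal sketch that assumes the ten episodes are independent, so that the number of prizes follows a binomial distribution (an assumption made for illustration). The function name `binomial_likelihood` and the grid search are hypothetical helpers, not part of the original example.

```python
from math import comb

def binomial_likelihood(p, wins=6, episodes=10):
    """Probability of exactly `wins` prizes in `episodes` independent episodes,
    given that the robot raises its left arm with probability p (assumed model)."""
    return comb(episodes, wins) * p**wins * (1 - p) ** (episodes - wins)

fair = binomial_likelihood(0.5)
biased = binomial_likelihood(0.6)
print(f"L(p = 0.5) = {fair:.4f}")
print(f"L(p = 0.6) = {biased:.4f}")

# Coarse grid search for the value of p with the greatest likelihood.
grid = [k / 100 for k in range(1, 100)]
best_p = max(grid, key=binomial_likelihood)
print("grid maximizer:", best_p)
```

Under this assumption, the likelihood is largest at the observed proportion 3/5, which is the viewer's best guess.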
species are to each other. We model the probability of a DNA base substi-
tution as a continuous time Markov process (see [4]). We will describe this
in more detail in the following section.
P (A | B) = P (A).
P (A ∩ B) = P (A)P (B).
This allows us to compute the likelihood of a given set of DNA se-
quences one site at a time. Once we have computed the likelihoods of each
site of the sequences, the product of the likelihoods of each individual site
gives us the likelihood of the set of DNA sequences as a whole. Thus, the
bulk of the maximum likelihood method lies in finding the optimal
phylogenetic tree for each single site of DNA.
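Because sites are treated as independent, the likelihood of the whole alignment is a product over sites. A small sketch of that bookkeeping follows; the per-site values are hypothetical placeholders, and summing logarithms is a common numerical convenience rather than something the text prescribes.

```python
import math

def sequence_log_likelihood(site_likelihoods):
    """Log-likelihood of a set of sequences from independent per-site likelihoods."""
    return sum(math.log(x) for x in site_likelihoods)

# Hypothetical per-site likelihoods for three DNA sites.
site_likelihoods = [0.045, 0.012, 0.030]

total = math.prod(site_likelihoods)            # direct product over sites
log_total = sequence_log_likelihood(site_likelihoods)

print("product of site likelihoods:", total)
print("sum of log-likelihoods:     ", log_total)
```

For long sequences the direct product underflows quickly, which is one reason to work with the sum of logarithms instead.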
Example 3. In the case of base substitutions, at any time t during the
evolutionary process of a site of DNA, the state of the site is a
random variable taking one of the values 1, 2, 3, or 4. The process of
substituting bases is indexed by the time set T, which has a continuous range. Thus, the process
is a continuous-time process. The stochastic process {X(t) : t ∈ T} can
take on the values 1, 2, 3, and 4 at each time t in our index set T.
are not dependent on past states. This means that a future state only de-
pends on the present state, X(s). It is not affected by any past state, X(u),
where 0 ≤ u < s. Since there are not any discrete indicators of when a base
will mutate and the index, T , takes on a continuous range, the process of
base substitution is a continuous-time Markov process, as described below.
Definition 6. A stochastic process, {X(t) : t ≥ 0}, is called a continuous-time
Markov process when it possesses these properties:
1. Each event of the process is independent of previous events. (In other
words, the process has the Markov property).
2. When entering a state, the process will stay in that state for a random
amount of time before transitioning to another state.
We denote the probability that a continuous-time Markov process currently in
state i will be in state j after t time units as
\[ P_{ij}(t) = P(X(t + s) = j \mid X(s) = i). \]
Matrices are used to represent Markov processes with rows representing all
possible current states and columns representing all possible future states.
Each term, Pij , of a matrix representing a Markov process represents the
probability of the process moving from state i to state j in a span of time, t.
Example 4. Suppose the robot in Example 1 works without any problems
for a period which is exponentially distributed with parameter, λ. Then, it
breaks down and a replacement robot will have to substitute for the bro-
ken robot on the gameshow for a period which is exponentially distributed
with parameter, µ. Since the time that each robot spends on the gameshow
is exponentially distributed with different parameters, the times that each
robot spends on the show are independent of each other. Let X(t) = 1 if the
robot is working at time t. Let X(t) = 2 if a replacement robot is being used
at time t. Then, {X(t) : t ≥ 0} is a continuous-time Markov process with
each event being independently distributed and the following transition
probabilities: P11 = P22 = 0, P12 = P21 = 1.
As we stated earlier, sometimes, we represent base substitution proba-
bilities in a matrix called a transition matrix. The rows of the matrix repre-
sent the initial state, and the columns represent the state after a base sub-
stitution has occurred. Our transition matrix for this problem would be:
\[
(P_{ij}) = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}
         = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.
\]
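The two-state process of Example 4 can be simulated directly from its description: exponential holding times, followed by a certain jump to the other state. This is a minimal sketch; the function name `simulate_robot`, the rates, and the time horizon are illustrative assumptions.

```python
import random

def simulate_robot(lam, mu, t_end, seed=0):
    """Simulate the working (1) / replacement (2) process up to time t_end.

    Holding times are exponential with rate lam in state 1 and rate mu in
    state 2; every jump goes to the other state (P12 = P21 = 1).
    Returns a list of (jump_time, new_state) pairs, starting in state 1.
    """
    rng = random.Random(seed)
    t, state = 0.0, 1
    path = [(0.0, 1)]
    while True:
        rate = lam if state == 1 else mu
        t += rng.expovariate(rate)      # random time spent in the current state
        if t >= t_end:
            return path
        state = 2 if state == 1 else 1  # the jump itself is deterministic
        path.append((t, state))

path = simulate_robot(lam=1.0, mu=2.0, t_end=50.0, seed=1)
print("number of jumps:", len(path) - 1)
print("first few events:", path[:4])
```

The randomness lies entirely in how long the process stays in each state, which matches the second property in Definition 6.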
and in general,
for all s and t ≥ 0. We say that a random variable X with this property is
memoryless. The process’s determination of what state it is in at time t + s
is not affected by the state it was in at time t for all 0 ≤ t < s; it does not
remember its prior states. If we restate this property using the definition of
conditional probability, we have
\[ P(X > s + t \mid X > t) = \frac{P(X > s + t \cap X > t)}{P(X > t)} = P(X > s), \]
which is equivalent to
If the Poisson process has N (t) events at time t, then we represent the process
by:
\[ P(N(t + s) - N(s) = n) = e^{-\lambda t} \frac{(\lambda t)^n}{n!}, \quad \text{for } n = 0, 1, \ldots. \]
We will assume that in a small interval of time of length dt, there is a
probability µdt that the current base at a site will transform, where µ is the
rate of base substitution per unit of time. It is clear that at time t = 0, there
will have been 0 base substitutions. Thus, the first property of our defini-
tion of a Poisson process is fulfilled. The probability of transitioning bases,
µdt, is the same for all intervals of time with length dt. This is the second
property of a Poisson process. Further, the number of base substitutions
during a particular time interval is independent of the history of changes
outside of this interval, so each time increment is independent of the other
time increments. Finally, the probability of a change in base in a time
interval is very small and the number of base changes can be modeled by
a Poisson distribution, satisfying the third property of a Poisson process.
Thus, the process of changing from the current base to another base in a
time interval dt is a Poisson process. If we let N(t) be the number of
transitions from base i to base j by time t, then $P(N(t) = k) = e^{-\mu t} (\mu t)^k / k!$. Then,
$P(N(t) = 0) = e^{-\mu t}$ and $P(N(t) > 0) = 1 - e^{-\mu t}$. If we let X be the time
of the first event, then the probability of the first event occurring before a
time t > 0 is $P(X \le t) = P(N(t) > 0) = 1 - e^{-\mu t}$. The complement of this
probability, the probability that no mutation occurs before a time t > 0, is
$1 - (1 - e^{-\mu t}) = e^{-\mu t}$. Then,
Definition 8. Ergodicity is the property that the limiting probability $\lim_{t \to \infty} P_{ij}(t)$
exists and is independent of the initial state, i. We call a Markov process
with this property ergodic.
The Pacific Science Center examples are all positive recurrent since the
line will not stay in state 0 or state 1 for an infinite amount of time. For
a nonexample, consider the robot gameshow example once more. Sup-
pose that after a robot malfunctions, the gameshow completely replaces the
robot forever. Then, if we let 1 denote the state that robot 1 is being used
on the gameshow, the expected amount of time between two consecutive
returns to state 1 is infinite. We never see robot 1 again since it is replaced.
Therefore, state 1 is not positive recurrent in this case.
Consider a continuous Markov process, {X(t) : t ≥ 0}, with state space,
S. Suppose that Pij > 0 for each i, j ∈ S. Restated, this means that for
all i, j ∈ S, the states, i and j, are accessible to each other. Thus, all of
the states in S communicate with each other, making the Markov process
irreducible. Also, suppose that, starting from state i, the process will return
to state i with probability 1, and the expected number of transitions before
a first return to i is finite. This means that the Markov process is positive recurrent.
An irreducible continuous Markov process in which each state is positive recurrent
is ergodic [5].
To understand the reasoning behind this argument we will consider
a continuous Markov process. For a continuous Markov process, we call
the limiting probabilities for transitions into state j, denoted πj , stationary
distributions. We want to show that $\pi_j = \lim_{t \to \infty} P_{ij}(t)$. Suppose that the
Markov process is reducible, and that there are two irreducible partitions of
S, say $s_1$ and $s_2$. Let j be an element of $s_2$. Also, suppose that $\lim_{t \to \infty} P_{ij}(t)$
converges to a limiting probability for an i in $s_1$. But since none of the
elements of $s_1$ are accessible from $s_2$ and vice versa, $P_{ij}(t) = 0$ for all t ≥ 0.
This means that depending on whether the initial state i is in $s_1$ or $s_2$, the
limiting probabilities could differ. The limiting probabilities would not be
independent of state i, a necessary property for an ergodic Markov process.
Therefore, the Markov process will not be ergodic. The Markov process
must also be recurrent. Otherwise, if i is the initial state, $P_{ii}(t) = 0$ for all
t ≥ 0 and $\lim_{t \to \infty} P_{ii}(t) = 0$. Again, the limiting probability depends on
whether or not the initial state is i. Hence, the Markov process needs to be
irreducible, and each of its states must be positive recurrent.
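Ergodicity can be illustrated numerically. The sketch below assumes a substitution model of the form $P_{ij}(t) = e^{-\mu t}\delta_{ij} + (1 - e^{-\mu t})\pi_j$, which is the form that the likelihood expressions in the later worked example take with µ = 1. The base frequencies used here are hypothetical.

```python
import math

def transition_matrix(t, pi, mu=1.0):
    """Transition probabilities after time t under the assumed model:
    with probability exp(-mu*t) no event occurs and the base is kept;
    otherwise the base is redrawn from the frequencies pi."""
    decay = math.exp(-mu * t)
    return [[decay * (i == j) + (1 - decay) * pi[j] for j in range(4)]
            for i in range(4)]

pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical base frequencies
for t in (0.1, 1.0, 10.0, 50.0):
    P = transition_matrix(t, pi)
    print(f"t = {t:5.1f}  row 1: {[round(x, 4) for x in P[0]]}"
          f"  row 4: {[round(x, 4) for x in P[3]]}")
```

As t grows, every row approaches the same vector of base frequencies, so the limit does not depend on the starting state i.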
The Likelihood of a
Phylogenetic Tree
Let us first consider a tree with tips 1 and 2, root 0, and branch lengths
$v_1$ and $v_2$, as in Figure 2.1. If the state, $S_0$, of node 0 were known, the
likelihood of the tree would simply be the product of the probabilities of base
substitution in each tree branch and the base frequency, $\pi_{S_0}$, of state $S_0$.
Since $S_0$ is not observed, we sum this product over its four possible values:
\[
L = \sum_{S_0} \pi_{S_0} P_{S_0 S_1}(v_1) P_{S_0 S_2}(v_2)
  = \sum_{S_0 = 1}^{4} \pi_{S_0} P_{S_0 S_1}(v_1) P_{S_0 S_2}(v_2).
\]
The states $S_1$ and $S_2$ and the branch lengths $v_1$ and $v_2$ are given (from the tree), and $\pi_i$, for i ∈
{1, 2, 3, 4}, is given as a fixed constant, independent of the tree. Notice that
when we are calculating the likelihood of a phylogenetic tree, we are only
able to determine the likelihood of the shape of the tree (i.e., the likelihood
of the locations of the branches and nodes of the tree). Thus, we cannot
determine the states of interior nodes simply by calculating the likelihood
of the tree.
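The two-tip likelihood is short enough to compute by a direct sum over the four possible root states. This sketch assumes the substitution probabilities $P_{ij}(v) = e^{-v}\delta_{ij} + (1 - e^{-v})\pi_j$ (the form used in the later worked example with µ = 1) and codes the bases as 0 to 3 rather than 1 to 4. The function names and the example inputs are illustrative.

```python
import math

def p_sub(i, j, v, pi):
    """Assumed substitution probability from base i to base j along a branch of length v."""
    decay = math.exp(-v)
    return decay * (i == j) + (1 - decay) * pi[j]

def two_tip_likelihood(s1, s2, v1, v2, pi):
    """Sum over the unknown root state s0 of pi[s0] * P(s0 -> s1) * P(s0 -> s2)."""
    return sum(pi[s0] * p_sub(s0, s1, v1, pi) * p_sub(s0, s2, v2, pi)
               for s0 in range(4))

pi = [0.25, 0.25, 0.25, 0.25]   # equal base frequencies (hypothetical choice)
L = two_tip_likelihood(s1=0, s2=2, v1=1.0, v2=1.0, pi=pi)
print("likelihood of observing bases 0 and 2 at the tips:", L)
```

A useful sanity check is that the likelihoods of all 16 possible pairs of tip states add up to 1, since exactly one pair must be observed.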
\[
L = \sum_{S_0 = 1}^{4} \sum_{S_5 = 1}^{4} \sum_{S_6 = 1}^{4}
    \pi_{S_0} P_{S_0 S_5}(v_5) P_{S_5 1}(v_1) P_{S_5 2}(v_2)
    P_{S_0 S_6}(v_6) P_{S_6 3}(v_3) P_{S_6 4}(v_4)
\tag{2.1}
\]
This calculation sums over all four possible states for each node with an
unknown state.
Unfortunately, this calculation is extremely long. The likelihood calcu-
lation for Figure 2.1 has 4 terms, while the calculation for Figure 2.2 has 64
terms. Phylogenetic trees with more species have even more terms. It is
helpful to move each summation as far to the right in the likelihood cal-
culation as possible, allowing us to find the likelihoods of each individual
segment of the tree. If we do this with (2.1), our calculation would look like
this:
\[
L = \sum_{S_0} \pi_{S_0} \Big( \sum_{S_5} [P_{S_0 S_5}(v_5)(P_{S_5 1}(v_1))(P_{S_5 2}(v_2))] \Big)
    \times \Big( \sum_{S_6} [P_{S_0 S_6}(v_6)(P_{S_6 3}(v_3))(P_{S_6 4}(v_4))] \Big)
\tag{2.2}
\]
substitution probability Pij (v). Within each set of parentheses, we have the
likelihood for each branch connected to a tip. Within each bracket, we have
the likelihood for each of the two branches that stem from the root, node
0. It should be clear that this relationship between the equation and the
topology of the tree is a result of our construction of the likelihood calcula-
tion for the tree, but this construction also allows us to compute likelihood
using the conditional likelihoods of each segment of the tree.
We will use the notation $L_s^{(k)}$ for the likelihood of the tree at and above
node k on the tree, given that node k has state s. Then, each $L_s^{(k)}$
corresponds to segments of the tree beginning with the tips of the tree. For
example, in Figure 2.2,
\[
L_S^{(5)} = \sum_{s_5 = 1}^{4} (P_{S_5 1}(v_1) L_1^{(1)})(P_{S_5 2}(v_2) L_2^{(2)}).
\tag{2.3}
\]
Since the tips of our tree contain our set of data, we know the states of
the tips of the tree. Thus, if k is a tip of our phylogenetic tree, $L_s^{(k)}$ will be
0 for all states except for its observed state. If $s^*$ is the observed state at tip
k, then $L_{s^*}^{(k)} = 1$. Remember that this simply means that the probability of
tip k having its observed state $s^*$ is 1 given the hypothesis that tip k has
state $s^*$ and 0 given any other hypothesis. For example, in Figure
2.2, $L_1^{(1)} = 1$, $L_2^{(2)} = 1$, $L_3^{(3)} = 1$, and $L_4^{(4)} = 1$. Now that we have an
easy calculation for the tips of our tree we are able to begin our likelihood
calculation for the entire tree in Figure 2.2 at its tips.
Since our likelihood calculation has been reduced to conditional likeli-
hood calculations and we can easily find the likelihoods of the tips of the
tree, we begin the computation from the tips of the tree and work our way
down to the root. We can compute the conditional likelihoods of the nodes
in the tree with tips as their immediate descendants. In Figure 2.2, nodes 5
and 6 satisfy this property.
Example 9. For node 6 in Figure 2.2, the likelihood that node 6 has state S
is:
\[
L_S^{(6)} = \sum_{S_6 = 1}^{4}
  \Big[ \sum_{S_3 = 1}^{4} P_{S_6 S_3}(v_3) L_{S_3}^{(3)} \Big]
  \Big[ \sum_{S_4 = 1}^{4} P_{S_6 S_4}(v_4) L_{S_4}^{(4)} \Big].
\]
But we know that $L_3^{(3)} = 1$ and $L_4^{(4)} = 1$, so the calculation reduces to:
\[
L_S^{(6)} = \sum_{S} [(P_{S_6 3}(v_3)(1))(P_{S_6 4}(v_4)(1))].
\]
Computing the Likelihood of a Phylogenetic Tree 17
Once we have found the conditional likelihoods for all nodes with tips
as immediate descendants, we can think of those nodes as our new "tips"
by using their conditional likelihoods to compute the likelihoods of their
ancestor nodes. Thus, for any node x with immediate descendants y and z,
the conditional likelihood at node x is
\[
L_S^{(x)} = \sum_{s}
  \Big[ \sum_{S_y} P_{S_x S_y}(v_y) L_{S_y}^{(y)} \Big]
  \Big[ \sum_{S_z} P_{S_x S_z}(v_z) L_{S_z}^{(z)} \Big].
\tag{2.4}
\]
The base frequency for the root of our tree, πS0 , gives us the probability that
the root is in state S0 .
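The tips-to-root procedure can be sketched for the four-tip tree of Figure 2.2 and compared against the 64-term sum in (2.1). The sketch stores one conditional likelihood per state of each node, in the spirit of the per-state values computed in Example 10. It assumes substitution probabilities of the form $P_{ij}(v) = e^{-v}\delta_{ij} + (1 - e^{-v})\pi_j$ with equal base frequencies, and codes bases as 0 to 3. The function names and branch lengths are hypothetical.

```python
import math
from itertools import product

PI = [0.25] * 4  # equal base frequencies (assumed)

def p_sub(i, j, v):
    """Assumed substitution probability from base i to base j over branch length v."""
    decay = math.exp(-v)
    return decay * (i == j) + (1 - decay) * PI[j]

def tip(state):
    """Conditional likelihoods at a tip: 1 for the observed base, 0 otherwise."""
    return [1.0 if s == state else 0.0 for s in range(4)]

def combine(left, v_left, right, v_right):
    """Conditional likelihoods at a node, one per state, from its two children."""
    return [
        sum(p_sub(s, a, v_left) * left[a] for a in range(4))
        * sum(p_sub(s, b, v_right) * right[b] for b in range(4))
        for s in range(4)
    ]

def pruning_likelihood(tips, v):
    """Work from the tips (1..4) through nodes 5 and 6 down to the root, node 0."""
    node5 = combine(tip(tips[0]), v[1], tip(tips[1]), v[2])
    node6 = combine(tip(tips[2]), v[3], tip(tips[3]), v[4])
    root = combine(node5, v[5], node6, v[6])
    return sum(PI[s] * root[s] for s in range(4))

def brute_force_likelihood(tips, v):
    """Direct evaluation of the triple sum over the states of nodes 0, 5, and 6."""
    total = 0.0
    for s0, s5, s6 in product(range(4), repeat=3):
        total += (PI[s0]
                  * p_sub(s0, s5, v[5]) * p_sub(s5, tips[0], v[1]) * p_sub(s5, tips[1], v[2])
                  * p_sub(s0, s6, v[6]) * p_sub(s6, tips[2], v[3]) * p_sub(s6, tips[3], v[4]))
    return total

tips = (0, 1, 2, 3)                                          # observed bases at tips 1..4
v = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.6}         # hypothetical branch lengths
print("pruning:    ", pruning_likelihood(tips, v))
print("brute force:", brute_force_likelihood(tips, v))
```

Both routes give the same number; the tips-to-root version simply reuses the partial sums for each subtree instead of expanding all 64 terms.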
Example 10.
\[
P_{ij} = \begin{pmatrix}
0.8 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.8 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.8 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.8
\end{pmatrix}
\]
Suppose the terms in the matrix above represent the base substitution
probabilities of a phylogenetic tree with two species that are both one branch
length unit away from a common ancestor. (These illustrative entries are not
normalized: each row sums to 1.1, whereas the rows of a true transition matrix
sum to 1.) As we have seen before, i will
represent a prior state, and j will represent a later state for the DNA site
under consideration. Both i and j can take on values 1, 2, 3, and 4, corresponding
to DNA bases A, C, T, and G, respectively. Suppose we are studying
orcs and trolls in our set of data. For the DNA site that we are examining,
orcs have an A, and trolls have a T. Then, the likelihood of seeing our
where 0 is the node in our tree representing the common ancestor and S
is the state of node 0. We take the base substitution probabilities from our
matrix and see that $L_1^{(0)} = 0.08$, $L_2^{(0)} = 0.01$, $L_3^{(0)} = 0.08$, and $L_4^{(0)} = 0.01$.
Now, say that the common ancestor is the root of our tree. Then, we simply
need to find the sum of the products of these conditional likelihoods
and the base frequencies for each hypothesized ancestor base. If the base
frequencies are 0.25 for each base, then
\[
L = \sum_{S = 1}^{4} \pi_S L_S^{(0)}
  = \pi_1 L_1^{(0)} + \pi_2 L_2^{(0)} + \pi_3 L_3^{(0)} + \pi_4 L_4^{(0)} = 0.045.
\]
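The arithmetic of Example 10 can be reproduced in a few lines. The sketch uses the matrix entries exactly as given in the example and codes the bases A, C, T, G as indices 0 to 3; the variable names are illustrative.

```python
# Substitution probabilities as given in Example 10 (rows: ancestor base, columns: tip base).
P = [
    [0.8, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.8],
]
pi = [0.25, 0.25, 0.25, 0.25]   # base frequencies from the example
orc, troll = 0, 2               # orcs have an A (index 0), trolls have a T (index 2)

# Conditional likelihood of the data for each hypothesized ancestor base.
conditional = [P[s][orc] * P[s][troll] for s in range(4)]
likelihood = sum(pi[s] * conditional[s] for s in range(4))

print("conditional likelihoods:", [round(x, 4) for x in conditional])
print("likelihood at the root: ", round(likelihood, 4))
```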
\[
L = \sum_{s_0 = 1}^{4} \pi_{s_0} \big( (P_{s_0 s_5}(v_5) L_{s_5}^{(5)}) (P_{s_0 s_6}(v_6) L_{s_6}^{(6)}) \big).
\tag{2.5}
\]
We will prove (2.7) using the Law of Total Probability. Since {X(s) =
k | k ∈ S} is a set of mutually exclusive events, we can apply the Law of
Total Probability to get:
The Pulley Principle 21
Figure 2.3: Tree A, Tree B, and Tree C are all equivalent due to the pulley
principle. Tree A shows the tree rooted at node 0. Tree B demonstrates how
the tree root can be moved. Tree C shows the unrooted tree.
Chapter 3
Finding a Maximum
Likelihood Tree
complete the likelihood calculations for each possible topology. Finally, the
topology producing the greatest likelihood out of all of the topologies
we examine is the tree that we choose as the best phylogenetic tree for our
set of data.
can we determine whether or not Figure 3.1 contains all possible labeled,
rooted bifurcating trees for a tree with three tips?
Figure 3.1: All possible labeled, rooted bifurcating trees for three species.
Figure 3.2: Tree A shows a 3-species tree. When we want to add species
4 to the tree, we must add it to a new node on an existing branch in the
3-species tree as shown in trees B-F . If we decide to add species 4 to an
existing node, as we have done in Tree G, then our tree will no longer be a
rooted bifurcating tree.
Example 11. In Figure 3.2, tree A has 5 branches, so species 4 can be added
in 5 ways. Notice that when we add species 4 to tree A, as shown in tree
B, there are 2 new branches and 1 new interior node. Then, when we add
How Many Possible n-species Trees are There? 27
\[ \frac{(2n - 3)!}{2^{n-2}(n - 2)!} = 8{,}200{,}794{,}532{,}637{,}891{,}559{,}375. \]
This means that if a computer were able to evaluate the likelihood of a
20-species tree in one hour, it would take us about 8.2 × 10^21 hours, or about
9.4 × 10^17 years. Remember that this is only for the calculation of the
likelihood of each tree, without testing various branch lengths.
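The count for 20 species can be checked with exact integer arithmetic. This sketch uses the product form 1 × 3 × 5 × ... × (2n − 3) for the number of labeled, rooted, bifurcating trees on n species, which is stated here as an assumption and reproduces the figure quoted for n = 20. The one-hour-per-tree rate is the hypothetical rate from the text, and the function name is illustrative.

```python
def rooted_tree_count(n):
    """Number of labeled, rooted, bifurcating trees on n species,
    computed as the product 1 * 3 * 5 * ... * (2n - 3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):   # odd numbers 3, 5, ..., 2n - 3
        count *= k
    return count

for n in (3, 4, 5, 10, 20):
    print(n, rooted_tree_count(n))

# Time to evaluate every 20-species tree at one tree per hour.
hours = rooted_tree_count(20)
years = hours / (24 * 365)
print(f"{hours:.2e} hours, about {years:.2e} years")
```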
Clearly, there are far too many possible trees to examine all
of them when we have a tree with more than ten species. We must also
keep in mind that we have to examine all possible trees for each site of the
DNA sequences, making this process even less feasible. Thus, we must
use a different algorithm for examining as many possible tree topologies as
possible.
3 × 5 × 7 × · · · × (2(n − 1) − 3) = 1 × 3 × 5 × 7 × · · · × (2n − 5)
Figure 3.3: Tree A shows a labeled, unrooted, bifurcating tree. We can root
Tree A at species 1, resulting in Tree B, a rooted, bifurcating tree.
Figure 3.4: Nearest Neighbor Interchange. Here we see our original tree with
subtrees A, B, C, and D. For the particular interior branch that we are con-
cerned with, we erase that particular branch along with all of the branches
directly connected to it. Then, we reassemble the subbranches in the two
different ways shown above. This gives us a total of three possible ways to
construct our tree, including the tree before local rearrangements.
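The three arrangements around one interior branch can be enumerated explicitly. In this sketch the four subtrees are treated as opaque labels, and an arrangement is written as the pair of groups on either side of the interior branch. The representation and function names are assumptions made for illustration.

```python
def nni_rearrangements(a, b, c, d):
    """Given subtrees arranged as ((a, b), (c, d)) around one interior branch,
    return the original arrangement followed by its two nearest neighbor interchanges."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def canonical(arrangement):
    """Order-independent form of an arrangement, used to compare topologies."""
    return frozenset(frozenset(side) for side in arrangement)

trees = nni_rearrangements("A", "B", "C", "D")
for left, right in trees:
    print(left, "|", right)

# All three arrangements are distinct topologies.
print("distinct topologies:", len({canonical(t) for t in trees}))
```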
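\begin{align}
L &= \sum_{S_0} \sum_{S_4} \sum_{S_3} \pi_{S_0} (P_{S_0 S_4}(0) L_{S_4}^{(4)}) (P_{S_0 S_3}(v_3) L_{S_3}^{(3)}) \tag{3.1} \\
  &= \sum_{S_0} \pi_{S_0} L_{S_0}^{(4)} \Big( \sum_{S_3} P_{S_0 S_3}(v_3) L_{S_3}^{(3)} \Big) \tag{3.2}
\end{align}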
The equality in (3.2) comes from the fact that $P_{S_0 S_4}(0) = 1$ when $S_4 = S_0$ and 0 otherwise. Now, if we
substitute (1.2) into (3.2), letting µ = 1, we get:
\[
L = e^{-v_3} \sum_{S} \pi_S L_S^{(4)} L_S^{(3)}
  + (1 - e^{-v_3}) \sum_{S_4} \pi_{S_4} L_{S_4}^{(4)} \sum_{S_3} \pi_{S_3} L_{S_3}^{(3)}
\tag{3.3}
\]
Now, we must figure out how to find $L_{S_4}^{(4)}$ and $L_{S_3}^{(3)}$ without knowing
the branch lengths in the tree. Recall that in Section 2.1, for a tip, k, of our
tree with base i, $L_i^{(k)} = 1$, while $L_j^{(k)} = 0$ for $j \ne i$. Since species 1, 2, and 3 are all
located at the tips of our tree, they have a known conditional likelihood,
$L_{S_k}^{(k)}$. In the 2-species tree that we added species 3 to in order to get Tree C,
we should have evaluated branch lengths $v_1$ and $v_2$. We can then use these
branch lengths and the known conditional likelihoods of the tips of our tree
to compute $L_{S_4}^{(4)}$.
Recall that this calculation for likelihood only gives us the likelihood
for the phylogenetic tree representing one DNA site. For all K DNA sites
in the sequences of our data set, the likelihood is:
\[ \prod_{i = 1}^{K} (A_i q + B_i p) \tag{3.4} \]
and
\[ B_i = \sum_{S_4} \pi_{S_4} L_{S_4}^{(4)} \sum_{S_3} \pi_{S_3} L_{S_3}^{(3)} \]
for the i-th DNA site in the set of sequences. We are interested in finding
the value of v3 that will maximize the likelihood. We can find this value of
v3 by finding the value of p that will maximize (3.3). Then, we can solve for
v3 since v3 = − ln(1 − p).
Now, we will explore some properties that follow from these equations
that will help us understand Felsenstein’s iteration formula for finding
segment lengths that will maximize the likelihood. If we take the logarithm of
(3.4), we get
\[ \ln(L) = \sum_{i = 1}^{K} \ln(A_i q + B_i p). \]
If we take the derivative of this equation and set it equal to zero, we get
\[ \frac{d \ln(L)}{dp} = \sum_{i = 1}^{K} \frac{B_i - A_i}{A_i q + B_i p} = 0. \tag{3.5} \]
We can use (3.5) to take out the terms in the numerator of (3.6) containing
q. We end up with
\[
\sum_{i = 1}^{K} \frac{B_i - (B_i - A_i) q}{A_i q + B_i p}
  = \sum_{i = 1}^{K} \frac{B_i}{A_i q + B_i p}
  - q \sum_{i = 1}^{K} \frac{B_i - A_i}{A_i q + B_i p}.
\]
\[ p^{(k+1)} = \frac{B p^{(k)}}{A q^{(k)} + B p^{(k)}}. \]
Now, we will substitute an x for p(k) . Then, we have a function in terms of
x:
\[ f(x) = \frac{Bx}{A(1 - x) + Bx} = \frac{Bx}{(B - A)x + A}. \]
Notice that this equation is in the form of a fractional linear transformation
such that it has only two fixed points. These two fixed points are x = p =
0 and x = p = 1. Then, the iteration will converge to either p = 1 or
p = 0. Hence, if we use this iteration algorithm for the single site, we will
get that the branch length we are estimating is either 0 or undefined since
$p = 1 - e^{-v}$, for branch length v. If $p = 0 = 1 - e^{-v}$, then v = 0. If
$p = 1 = 1 - e^{-v}$, then v = ∞. Therefore, in practice, we only use the EM
algorithm with DNA sequences with two or more sites.
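The single-site behavior is easy to observe numerically. The sketch below iterates the single-site update from an interior starting value; the values of A and B are hypothetical and chosen only to show both outcomes.

```python
def iterate_single_site(A, B, p0=0.5, steps=200):
    """Repeatedly apply the single-site update p <- B*p / (A*q + B*p), with q = 1 - p."""
    p = p0
    for _ in range(steps):
        q = 1 - p
        p = B * p / (A * q + B * p)
    return p

# Hypothetical coefficients: the iteration is driven to 1 when B > A and to 0 when B < A.
print("B > A:", iterate_single_site(A=0.02, B=0.05))
print("B < A:", iterate_single_site(A=0.05, B=0.02))
```

In both cases the iterate leaves the interior of [0, 1] and settles at one of the two fixed points, which corresponds to a branch length of infinity or 0.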
We must do this iteration technique for each of the branches on each
topology. Once we iterate and optimize for vx , we fix this value of vx and
1 A C T G T
2 A A C G G
We will assume that we can model base substitution probabilities using the
Jukes-Cantor model (see [7]). Under this model, base frequencies are all
equal. In other words, π1 = π2 = π3 = π4 = 1/4.
First, we will focus on site 1 of the DNA sequences. We will add each
base to the tree in alphabetical order (A, C, G, T, T). Our 2-species tree will
look like this:
Now, since there is only one branch in our tree, there is only one possi-
ble location for us to add species 3. Then, our 3-species tree will look like
38 Finding the Maximum Likelihood Tree - An Example
this:
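\begin{align}
L &= \pi_1 L_1^{(*)} (e^{-v_1} + (1 - e^{-v_1}) \pi_1) + \sum_{S_0 \ne 1} \pi_{S_0} L_{S_0}^{(*)} (1 - e^{-v_1}) \pi_1 \tag{4.1} \\
  &= e^{-v_1} \pi_1 L_1^{(*)} + (1 - e^{-v_1}) \sum_{S_0} \pi_{S_0} L_{S_0}^{(*)} \pi_1. \tag{4.2}
\end{align}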
Our equation is now in the same form as (3.3), so we can label our terms
as we did in (3.7) for site 1. In this example, term $A_1 = \pi_1 L_1^{(*)}$, term
$B_1 = \sum_{S_0} \pi_{S_0} L_{S_0}^{(*)} \pi_1$, term $q = e^{-v_1}$, and term $p = 1 - q = 1 - e^{-v_1}$.
Next, we consider DNA site 2, so that we can find A2 and B2 . If we
order the bases in site 2 in the same order as we ordered the bases in site
1, we have A, A, G, C, and G. The phylogenetic tree for site 2 will look like
this:
Then, the likelihood for the phylogenetic tree for site 2 is:
\[ L = e^{-v_1} \pi_1 L_1^{(*)} + (1 - e^{-v_1}) \sum_{S_0} \pi_{S_0} L_{S_0}^{(*)} \pi_1. \]
In this equation, $A_2 = \pi_1 L_1^{(*)}$ and $B_2 = \sum_{S_0} \pi_{S_0} L_{S_0}^{(*)} \pi_1$.
Since the number of sites in the DNA sequences of our data is K = 2,
our iteration formula will simply be:
\[ p^{(k+1)} = \frac{1}{2} \sum_{i = 1}^{2} \frac{B_i p^{(k)}}{A_i q^{(k)} + B_i p^{(k)}} \tag{4.3} \]
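The iteration in (4.3) can be sketched as a short loop. The version below averages over K sites, which reduces to (4.3) when K = 2, and converts the converged p to a branch length through v = −ln(1 − p). The coefficients A_i and B_i are hypothetical values chosen so that the maximum lies strictly between 0 and 1; they are not the values from this example, and the function name and stopping rule are assumptions.

```python
import math

def em_branch_length(A, B, p0, tol=1e-12, max_iter=10_000):
    """Iterate p <- (1/K) * sum_i B_i*p / (A_i*q + B_i*p), with q = 1 - p,
    until successive estimates differ by less than tol.
    Returns the final p and the branch length v = -ln(1 - p)."""
    K = len(A)
    p = p0
    for _ in range(max_iter):
        q = 1 - p
        p_next = sum(b * p / (a * q + b * p) for a, b in zip(A, B)) / K
        converged = abs(p_next - p) < tol
        p = p_next
        if converged:
            break
    return p, -math.log(1 - p)

# Hypothetical coefficients for K = 2 sites.
A = [0.05, 0.01]
B = [0.02, 0.03]
p_hat, v_hat = em_branch_length(A, B, p0=1 - math.exp(-1))
print("estimated p:", p_hat)
print("estimated branch length v:", v_hat)
```

For these particular coefficients, setting the derivative of the log-likelihood to zero gives p = 7/12, and the iteration settles at that value.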
In order to use this iteration formula, we must first determine what $B_i$
and $A_i$ are for i = 1 and for i = 2. For each $A_i$, we know that $\pi_1 = 1/4$ under
the Jukes-Cantor model. Then, we are left with finding $L_1^{(*)}$. Similarly, for
each $B_i$, we know that $\pi_{S_0} = \pi_1 = 1/4$. Thus, we are left with the task of
finding $L_{S_0}^{(*)}$. By (2.4), we have that
\[ L_1^{(*)} = (P_{13}(v_3))(P_{12}(v_2)) \]
and
\[ L_{S_0}^{(*)} = \sum_{S_0} (P_{S_0 4}(v_3))(P_{S_0 2}(v_2)) \]
in $A_1$,
\[ L_1^{(*)} = ((1 - e^{-1})(1/4))(e^{-1} + (1 - e^{-1})(1/4)) \approx 0.083 \]
in $A_2$, and
\[ L_{S_0}^{(*)} = 2((1 - e^{-1})(1/4))^2 + 2((e^{-1} + (1 - e^{-1})(1/4))((1 - e^{-1})(1/4))) \approx 0.216 \]
in both $B_1$ and $B_2$.
Now, we must take an initial estimate of $p^{(1)}$. We will also estimate that
$p^{(1)} = 1 - e^{-1}$. Once we run the EM algorithm a few times, it converges to
\begin{align}
L_{S_0}^{(*)} &= 2[(1 - e^{-7.969})(1/4)][(1 - e^{-1})(1/4)] \notag \\
  &\quad + [e^{-7.969} + (1 - e^{-7.969})(1/4)][(1 - e^{-1})(1/4)] \notag \\
  &\quad + [(1 - e^{-7.969})(1/4)][e^{-1} + (1 - e^{-1})(1/4)] \approx 0.250. \tag{4.4}
\end{align}
For site 2,
\[ L_1^{(*)} = [e^{-7.969} + (1 - e^{-7.969})(1/4)][(1 - e^{-1})(1/4)] \approx 0.0395 \]
and
\begin{align}
L_{S_0}^{(*)} &= 2[(1 - e^{-7.969})(1/4)][(1 - e^{-1})(1/4)] \notag \\
  &\quad + [e^{-7.969} + (1 - e^{-7.969})(1/4)][(1 - e^{-1})(1/4)] \notag \\
  &\quad + [(1 - e^{-7.969})(1/4)][e^{-1} + (1 - e^{-1})(1/4)] \approx 0.250. \tag{4.5}
\end{align}
Now, we have all of the components to find $v_2$ using our iteration formula
(4.3). Again, we will estimate $p^{(1)} = 1 - e^{-1}$. Once we run the EM
algorithm a few times, it will converge to a number as it did for
$v_1$. We will then be able to use this number to find $v_2$.
We follow the same procedure to find an estimate for v3 . Using our new
estimates for v1 and v2 , the EM algorithm will converge to some value. We
will use this value to find an estimation for v3 . We use these approxima-
tions of v1 , v2 , and v3 in the EM algorithm again until each of the values
converges.
We continue to use the resulting branch length estimates from the EM
algorithm to produce better estimates of the branch lengths until the succes-
sive estimates begin to converge. Then, the branch lengths that the iteration
converges to are accepted as the best estimates for the actual branch lengths
of the phylogenetic tree in this step of the maximum likelihood process.
We can now proceed to add our fourth species to the tree. Since there
are three branches in the 3-species tree, there are three possible locations
for species 4. For site 1 of the DNA sequences, the 4-species trees that can
result from the 3-species tree are:
Notice that a new node and two new branches are created after the addition
of species 4 to the tree.
Now, we proceed as we did with the three-species tree. For each new
tree produced, we evaluate the branch lengths using the EM algorithm.
After the iteration converges and gives an estimate of each branch length,
we evaluate the likelihoods of each topology. We accept the topology that
gives us the greatest likelihood and dismiss the other two topologies. After
we have found our optimal 4-species tree, we can add species 5 to each
possible location in the accepted 4-species tree. After evaluating the branch
lengths and likelihoods of each resulting 5-species tree topology, we have a
5-species tree.
Figure 4.1: The 5-species tree resulting in the greatest likelihood before local
rearrangements.
Before we can accept this tree as our maximum likelihood tree, we must
perform local rearrangements to help test some possible phylogenetic trees
that we may have missed due to the order in which we added each species
to the tree. Suppose the tree in Figure 4.1 is the tree we obtain. For each
interior branch of the tree, we can perform Nearest Neighbor Interchange
(NNI). For interior branch v7, the resulting topologies are:
Next, we evaluate the branch lengths and likelihoods of these two topolo-
gies, accepting the tree that results in the largest likelihood. If this tree’s
likelihood is greater than the likelihood of the tree in Figure 4.1, we accept
the new tree. In the tree that we accept, whether it be the tree in Figure
4.1 or the new tree, we continue to perform NNI on any remaining interior
branches. Each time we do NNI, we evaluate the lengths of the branches
and find the likelihood of the new trees. Then, we accept each tree that
increases the likelihood. Remember that this is a greedy algorithm, so we
continue this process of local rearrangements until we feel that a tree gives
Figure 4.2: Suppose this is our maximum likelihood tree for wizards, elves,
hobbits, dwarves, and humans. The branch lengths, v1 , v2 , . . . , v7 , are
known from the EM algorithm calculations. They help us see how closely
related one species is to another.
Chapter 5
We have examined how probability and statistics can describe the process
of estimating the most likely phylogenetic tree for a set of DNA sequences.
As with any mathematical modeling problem, there are several assump-
tions that we made in order to ease the computational burden of finding
the maximum likelihood tree. It is important to consider the assumptions
that we have made and investigate ways in which to make our approach
better.
In Chapter 4, we took quite a large leap in modeling base substitution
with the Jukes-Cantor model. There are several models of DNA base sub-
stitution, including Kimura, Felsenstein, and Tamura (see [3] for descrip-
tions of these models). We did not collect any actual information about the
DNA sequences in our data besides the bases that we could observe. Af-
ter running a few experiments, one is able to determine which model of
base substitution best fits the set of species one is examining. For example,
in some DNA sequences, the rate of base substitution of a transition
can be different from the rate of base substitution of a transversion.
Transitions and transversions categorize base substitutions. Both states before
and after the substitution can be purines (A or G) or pyrimidines (C or
T), in which case we call the mutation a transition. If the base was a
purine and was substituted with a pyrimidine, or vice versa, we call the
base substitution a transversion (note: the name, transition, does not relate
to the transition probabilities we discussed earlier). There are also cases in
which we are unable to observe which specific base we have at the tips of
our trees (i.e., in our set of DNA sequences). Sometimes, we are only able
to observe whether we have a purine or a pyrimidine at the tips of
the tree. In order to determine which model of base substitution best de-
46 The Limitations of Our Model
[1] Dempster, A. P., Laird, N. M., and Rubin, D. (1977). Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society, 39:1–38.
[4] Gascuel, O., editor (2007). Mathematics of evolution and phylogeny. Ox-
ford University Press, Oxford.
[8] Sadava, D., Heller, C., Orians, G., Purves, B., and Hillis, D. (2008). Life:
The Science of Biology. The Courier Companies, Inc., Sunderland, MA,
eighth edition.