Phylogenetic Tree Construction - Methods

Building a phylogenetic tree requires four distinct steps: (Step 1) identify and acquire a set of homologous DNA or protein sequences, (Step 2) align those sequences, (Step 3) estimate a tree from the aligned sequences, and (Step 4) present that tree in such a way as to clearly convey the relevant information to others. Typically you would use your favorite web browser to identify and download the homologous sequences from a national database such as GenBank, then one of several alignment program

Uploaded by

vanigo1824
Copyright
© © All Rights Reserved

FUNDAMENTALS OF BIOINFORMATICS

Module 23: Phylogenetic Tree Construction – Methods

Welcome all to a new session on Fundamentals of Bioinformatics. In this
session, we will discuss the theory behind the different methods of
phylogenetic tree construction.

An evolutionary tree is a two-dimensional graph showing evolutionary
relationships among organisms or, in the case of sequences, among certain
genes in separate organisms. There are currently two main categories of
tree-building methods, each having advantages and limitations.

The first category is based on discrete characters, which are molecular
sequences from individual taxa. The basic assumption is that characters at
corresponding positions in a multiple sequence alignment are homologous
among the sequences involved. Therefore, the character states of the common
ancestor can be traced from this dataset. Another assumption is that each
character evolves independently and is therefore treated as an individual
evolutionary unit.

The second category of phylogenetic methods is based on distance, which is
the amount of dissimilarity between pairs of sequences, computed on the
basis of sequence alignment. The distance-based methods assume that all
sequences involved are homologous and that tree branches are additive,
meaning that the distance between two taxa equals the sum of all branch
lengths connecting them.
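
To make the idea of a pairwise distance concrete, here is a minimal sketch that computes the proportion of differing sites (the p-distance) for every pair of aligned sequences. The sequences, taxon labels and function names are hypothetical, and the p-distance is only one simple choice; real analyses usually apply a substitution-model correction on top of it.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip any column where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Return a symmetric dict {(name1, name2): distance} for all taxa."""
    names = sorted(alignment)
    dist = {(n, n): 0.0 for n in names}
    for m, n in combinations(names, 2):
        dist[(m, n)] = dist[(n, m)] = p_distance(alignment[m], alignment[n])
    return dist

# Hypothetical toy alignment, for illustration only.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGTTT",
    "D": "TCGAACG-TT",
}
matrix = distance_matrix(alignment)
```

A matrix of this kind is the only input that the distance-based methods use; the alignment itself is not consulted again.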

First, we will look at distance-based methods.

I. Distance Based Methods

The algorithms for the distance-based tree-building method can be subdivided
into either clustering based or optimality based.

The clustering-type algorithms compute a tree based on a distance matrix,
starting from the most similar sequence pairs. These algorithms include the
unweighted pair group method using arithmetic average (UPGMA) and
neighbor joining (NJ).

The optimality-based algorithms compare many alternative tree topologies
and select one that has the best fit between estimated distances in the tree
and the actual evolutionary distances. This category includes the
Fitch–Margoliash and minimum evolution algorithms.

1. Clustering-Based Methods

In this category, we will discuss two important methods.

Unweighted Pair Group Method Using Arithmetic Average (UPGMA)

The simplest clustering method is UPGMA, which builds a tree by a sequential
clustering method. Given a distance matrix, it starts by grouping the two taxa
with the smallest pairwise distance in the distance matrix. A node is placed
at the midpoint or half distance between them. It then creates a reduced
matrix by treating the new cluster as a single taxon: the distances between
this new composite taxon and all remaining taxa are calculated as averages
over the members of the cluster. The same grouping process is repeated and
another newly reduced matrix is created. The iteration continues until all
taxa are placed on the tree. The last taxon added is considered the out-group,
producing a rooted tree. The basic assumption of the UPGMA method is that all
taxa evolve at a constant rate and that they are equally distant from the
root, implying that a molecular clock is in effect. Real data rarely meet this
assumption, so UPGMA often produces erroneous tree topologies. Nevertheless,
owing to its fast speed of calculation, it has found extensive usage in
clustering analysis of DNA microarray data.
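
The clustering loop described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name, the nested-tuple tree representation and the toy distances are all assumptions made for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a nested tuple (left, right, node_height); leaves are labels.
    """
    size = {n: 1 for n in names}   # number of taxa inside each cluster
    d = dict(dist)                 # working copy of the distance matrix
    active = list(names)
    while len(active) > 1:
        # Group the two clusters with the smallest pairwise distance.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2   # node placed at the midpoint
        new = (a, b, height)
        size[new] = size[a] + size[b]
        active = [c for c in active if c not in (a, b)]
        # Reduced matrix: size-weighted average of the two old distances.
        for c in active:
            d[frozenset((new, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / size[new]
        active.append(new)
    return active[0]

# Hypothetical clock-like distances for four taxa.
toy = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("AD"): 6,
    frozenset("BC"): 4, frozenset("BD"): 6, frozenset("CD"): 6,
}
tree = upgma(["A", "B", "C", "D"], toy)
```

In this toy case D is the last taxon joined, so it plays the role of the out-group at the root of the resulting tree.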

Neighbour Joining (NJ)

The UPGMA method uses unweighted distances and assumes that all taxa
have constant evolutionary rates. Since this molecular clock assumption is
often not met in biological sequences, the neighbour joining (NJ) method can
be used to build more accurate phylogenetic trees. NJ is somewhat similar to
UPGMA in that it builds a tree by using stepwise reduced distance matrices.
However, the NJ method does not assume the taxa to be equidistant from the
root.

The tree construction process is somewhat opposite to that used in UPGMA.
Rather than building trees from the closest pair of branches and progressing
to the entire tree, the NJ method begins with a completely unresolved
star tree by joining all taxa onto a single node and progressively decomposes
the tree by selecting pairs of taxa based on rate-corrected pairwise
distances, in which each raw distance is adjusted for how far the two taxa
are, on average, from all other taxa. This allows the taxa with the shortest
corrected distances to be joined first as a node. After the first node is
constructed, the newly created cluster reduces the matrix by one taxon and
allows the next most closely related taxon to be joined next to the first
node. The cycle is repeated until all internal nodes are resolved. This
process is called star decomposition.
Unlike UPGMA, NJ and most other phylogenetic methods produce unrooted
trees. The out-group has to be determined based on external knowledge.
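
A single joining step can be sketched as follows. The text does not spell out the corrected distance, so the sketch assumes the standard NJ criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from taxon i to all other taxa. The function name, data layout and five-taxon distances are hypothetical.

```python
from itertools import combinations

def nj_step(names, d):
    """Perform one neighbor-joining step on three or more taxa.

    names: list of taxon labels; d: dict frozenset({a, b}) -> distance.
    Returns (pair, branch_lengths, new_names, new_d).
    """
    n = len(names)
    # Net divergence r_i: sum of distances from taxon i to all others.
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(i, j):
        # Rate-corrected criterion (assumed standard form); smallest wins.
        return (n - 2) * d[frozenset((i, j))] - r[i] - r[j]

    i, j = min(combinations(names, 2), key=lambda p: q(*p))
    dij = d[frozenset((i, j))]
    # Branch lengths from i and j to the new internal node.
    li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    lj = dij - li
    node = f"({i},{j})"
    rest = [k for k in names if k not in (i, j)]
    # Reduced matrix: keep old entries, add distances to the new node.
    new_d = {frozenset(p): d[frozenset(p)] for p in combinations(rest, 2)}
    for k in rest:
        new_d[frozenset((node, k))] = (
            d[frozenset((i, k))] + d[frozenset((j, k))] - dij
        ) / 2
    return (i, j), (li, lj), rest + [node], new_d

# Hypothetical distances among five taxa.
d5 = {
    frozenset("ab"): 5, frozenset("ac"): 9, frozenset("ad"): 9,
    frozenset("ae"): 8, frozenset("bc"): 10, frozenset("bd"): 10,
    frozenset("be"): 9, frozenset("cd"): 8, frozenset("ce"): 7,
    frozenset("de"): 3,
}
pair, lengths, names2, d4 = nj_step(list("abcde"), d5)
```

In this toy matrix the smallest raw distance is between d and e, yet the corrected criterion joins a and b first, which is exactly the behaviour that distinguishes NJ from UPGMA. The two branch lengths returned for the joined pair are unequal, so no molecular clock is imposed.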

One of the disadvantages of the NJ method is that it generates only one tree
and does not test other possible tree topologies. This can be problematic
because, in many cases, in the initial step of NJ, there may be more than one
equally close pair of neighbours to join, leading to multiple trees. Ignoring
these multiple options may yield a suboptimal tree. To overcome the
limitations, a generalized NJ method has been developed, in which multiple
NJ trees with different initial taxon groupings are generated. The tree that
best fits the actual evolutionary distances is then selected from this pool
of regular NJ trees. This more extensive tree search means that this
approach has a better chance of finding the correct tree.

2. Optimality-Based Methods

The clustering-based methods produce a single tree as output. However, there
is no criterion for judging how this tree compares with alternative trees.
In contrast, optimality-based methods have a well-defined algorithm to
compare all possible tree topologies and select a tree that best fits the actual
evolutionary distance matrix. Based on the differences in optimality criteria,
there are two types of algorithms, Fitch–Margoliash and minimum evolution,
which we will discuss next.
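
To give a sense of what comparing all possible tree topologies involves, the sketch below counts the unrooted, fully bifurcating topologies for a given number of taxa, using the standard result that there are 1 x 3 x 5 x ... x (2n - 5) of them. The function name is hypothetical; the count itself is not stated in the text and is added here only as background.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating tree topologies for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

The count grows so quickly that a truly exhaustive comparison is practical only for small datasets, which is why heuristic searches are used in practice.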

Fitch–Margoliash

The Fitch–Margoliash (FM) method selects a best tree among all possible trees
based on minimal deviation between the distances calculated from the
branches in the tree and the distances in the original dataset. It starts by
randomly clustering two taxa in a node and creating three equations to
describe the distances, and then solving the three algebraic equations for
unknown branch lengths. The clustering of the two taxa helps to create a
newly reduced matrix. This process is iterated until a tree is completely
resolved. The method searches all tree topologies and selects the one that
has the lowest squared deviation between the actual distances and the
distances calculated from the tree branch lengths.
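
The optimality criterion can be illustrated by scoring one candidate tree against the observed distances. The sketch below is an assumption-laden illustration: the tree is a hypothetical four-taxon tree with given branch lengths, the helper names are invented, and the optional weighting by 1/d^2 is the form commonly associated with the FM criterion rather than something stated in the text.

```python
def build_tree(edges):
    """Build an adjacency dict {node: {neighbor: length}} from an edge list."""
    tree = {}
    for a, b, length in edges:
        tree.setdefault(a, {})[b] = length
        tree.setdefault(b, {})[a] = length
    return tree

def patristic(tree, start, goal, parent=None):
    """Sum of branch lengths on the path between two nodes of a tree."""
    if start == goal:
        return 0.0
    for nxt, length in tree[start].items():
        if nxt != parent:
            sub = patristic(tree, nxt, goal, start)
            if sub is not None:
                return length + sub
    return None  # goal is not reachable through this branch

def squared_deviation(tree, observed, weighted=True):
    """Sum of (optionally 1/d^2-weighted) squared differences."""
    total = 0.0
    for pair, d_obs in observed.items():
        a, b = sorted(pair)
        d_tree = patristic(tree, a, b)
        weight = 1 / d_obs ** 2 if weighted else 1
        total += weight * (d_obs - d_tree) ** 2
    return total

# Hypothetical tree ((A,B),(C,D)) with internal nodes u and v.
tree = build_tree([("A", "u", 1.0), ("B", "u", 2.0), ("u", "v", 1.0),
                   ("C", "v", 1.0), ("D", "v", 2.0)])
observed = {
    frozenset("AB"): 4.0, frozenset("AC"): 3.0, frozenset("AD"): 4.0,
    frozenset("BC"): 4.0, frozenset("BD"): 5.0, frozenset("CD"): 3.0,
}
score = squared_deviation(tree, observed)
```

Repeating this scoring for every candidate topology and keeping the lowest score is the selection step the method describes.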

Minimum Evolution

Minimum evolution (ME) constructs a tree with a similar procedure, but uses
a different optimality criterion that finds a tree among all possible trees with
a minimum overall branch length. Searching for the minimum total branch
length is an indirect approach to achieving the best fit of the branch lengths
with the original dataset.

So far our discussion has been on distance-based methods for phylogenetic
tree construction. Now we will move on to the second category.

II. Character-Based Methods

Character-based methods (also called discrete methods) are based directly on
the sequence characters rather than on pairwise distances. They count
mutational events accumulated on the sequences and may therefore avoid the
loss of information when characters are converted to distances. This
preservation of character information means that evolutionary dynamics of
each character can be studied. Ancestral sequences can also be inferred. The
two most popular character-based approaches are the maximum parsimony
(MP) and maximum likelihood (ML) methods.

Maximum Parsimony (MP)

The parsimony method chooses a tree that has the fewest evolutionary
changes or shortest overall branch lengths. It is based on a principle related
to a medieval philosophy called Occam’s razor. The principle is attributed
to William of Occam, who worked in the early fourteenth century, and states
that the simplest explanation is probably the correct one. This is because
the simplest explanation requires the fewest assumptions and the fewest leaps
of logic. In dealing with problems that may have an infinite number of
possible solutions, choosing the simplest model may help to “shave off” those
variables that are not really necessary to explain the phenomenon. By doing
this, model development may become easier, and there may be less chance of
introducing inconsistencies, ambiguities, and redundancies, hence, the name
Occam’s razor.

For phylogenetic analysis, parsimony seems a good assumption. By this
principle, a tree with the fewest substitutions is probably the best
to explain the differences among the taxa under study. This view is justified
by the fact that evolutionary changes are relatively rare within a reasonably
short time frame. This implies that a tree with minimal changes is likely to be
a good estimate of the true tree. By minimizing the changes, the method
minimizes the phylogenetic noise owing to homoplasy and independent
evolution.

How Does MP Tree Building Work?

Parsimony tree building works by searching all possible tree topologies
and reconstructing ancestral sequences that require the minimum number of
changes to evolve to the current sequences. To save computing time, only a
small number of sites that have the richest phylogenetic information are used
in tree determination. These sites are the so-called informative sites, which
are defined as sites that have at least two different kinds of characters, each
occurring at least twice. Informative sites are the ones that can often be
explained by a unique tree topology. Other sites are non-informative: these
are constant sites or sites that have changes occurring only once. Constant
sites have the same state in all taxa and are obviously useless in evaluating
the various topologies. The sites that have changes occurring only once are
not very useful either for constructing parsimony trees because they can be
explained by multiple tree topologies. The non-informative sites are thus
discarded in parsimony tree construction.
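
The definition of an informative site translates directly into code. The following is a minimal sketch with a hypothetical four-sequence alignment; the function names are invented for the example and gaps are not given any special treatment.

```python
from collections import Counter

def is_informative(column):
    """True if at least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2

def informative_sites(sequences):
    """Return the 0-based column indices that are parsimony informative."""
    return [i for i, col in enumerate(zip(*sequences)) if is_informative(col)]

# Hypothetical aligned sequences, one per taxon.
sequences = [
    "AAGAGT",
    "AGCCGT",
    "AGATAT",
    "AGAGAT",
]
sites = informative_sites(sequences)
```

In this toy alignment only the fifth column (index 4, with G, G, A, A) qualifies; the first and last columns are constant and the others contain changes that occur only once.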

Once the informative sites are identified and the non-informative sites
discarded, the minimum number of substitutions at each informative site is
computed for a given tree topology. The number of changes is then summed
over all informative sites for each possible tree topology. The tree that
has the smallest number of changes is chosen as the best tree.
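
The text does not say how the minimum number of substitutions is computed for a given topology. One common way, used in the sketch below as an assumption, is Fitch's algorithm, which passes sets of possible states from the tips toward the root. The tree encoding (nested pairs of labels), the taxon names and the single-site alignment are hypothetical.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one site.

    tree:   a leaf label (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf label to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(length)
    )

# One hypothetical informative site and the three possible groupings.
alignment = {"t1": "G", "t2": "G", "t3": "A", "t4": "A"}
candidates = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t3"), ("t2", "t4")),
    (("t1", "t4"), ("t2", "t3")),
]
scores = [parsimony_score(t, alignment) for t in candidates]
best = candidates[scores.index(min(scores))]
```

The grouping that keeps the two G taxa together needs one change while the other two groupings need two, so the first candidate is chosen as the most parsimonious tree for this site.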

A related term in this category is weighted parsimony.

Weighted Parsimony

The parsimony method discussed so far is unweighted because it treats all
mutations as equivalent. This may be an oversimplification; some kinds of
mutations are known to occur less frequently than others, for example,
transversions versus transitions, and changes at functionally important sites
versus neutral sites. Therefore, a weighting scheme that takes into account
the different kinds of mutations helps to select tree topologies more
accurately. The MP method that incorporates a weighting scheme is called
weighted parsimony.
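
As a sketch of how a weighting scheme changes the score, the example below charges a transversion twice as much as a transition and finds the cheapest assignment of ancestral states with a Sankoff-style dynamic programme. The algorithm choice, the cost values and all names are assumptions for illustration; the text does not prescribe a particular scheme.

```python
PURINES = {"A", "G"}
BASES = "ACGT"

def cost(x, y, ts=1, tv=2):
    """Cost of changing x to y: 0 if equal, ts for transitions, tv otherwise."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return ts if same_class else tv

def sankoff(tree, states, ts=1, tv=2):
    """Return {base: minimum cost of the subtree if its root has that base}."""
    if isinstance(tree, str):
        return {b: 0 if b == states[tree] else float("inf") for b in BASES}
    children = [sankoff(child, states, ts, tv) for child in tree]
    return {
        b: sum(min(cost(b, c, ts, tv) + child[c] for c in BASES)
               for child in children)
        for b in BASES
    }

def weighted_score(tree, states, ts=1, tv=2):
    """Minimum weighted cost of one site on the given tree."""
    return min(sankoff(tree, states, ts, tv).values())

# Hypothetical site on a four-taxon tree.
tree = (("t1", "t2"), ("t3", "t4"))
site = {"t1": "A", "t2": "G", "t3": "C", "t4": "C"}
weighted = weighted_score(tree, site)            # transversions cost 2
unweighted = weighted_score(tree, site, tv=1)    # all changes cost 1
```

With equal costs the site needs two changes; with the assumed weights it costs three, because one of the two changes is a transversion.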

Maximum Likelihood Method (ML)

Another character-based approach is ML, which uses probabilistic models to
choose a best tree that has the highest probability or likelihood of reproducing
the observed data. It finds a tree that most likely reflects the actual
evolutionary process. ML is an exhaustive method that searches every
possible tree topology and considers every position in an alignment, not just
informative sites. By employing a particular substitution model that has
probability values of residue substitutions, ML calculates the total likelihood
of ancestral sequences evolving to internal nodes and eventually to existing
sequences. It sometimes also incorporates parameters that account for rate
variations across sites.

How Does the Maximum Likelihood Method Work?

ML works by calculating the probability of a given evolutionary path for a
particular extant sequence. The probability values are determined by a
substitution model (either for nucleotides or amino acids). For a particular
site, the probability of a tree path is the product of the probabilities from
the root to all the tips, including every intermediate branch in the tree
topology. Because multiplication often results in very small values, it is
computationally more convenient to express all probability values as natural
log likelihood (lnL) values, which also converts multiplication into
summation. Because ancestral characters at internal nodes are normally
unknown, all possible scenarios of ancestral states have to be computed.

After logarithmic conversion, the likelihood score for the topology is the sum
of log likelihood of every single branch of the tree. After computing for all
possible tree paths with different combinations of ancestral sequences, the
tree path having the highest likelihood score is the final topology at the site.
Because all characters are assumed to have evolved independently, the log
likelihood scores are calculated for each site independently. The overall log
likelihood score for a given tree path for the entire sequence is the sum of log
likelihood of all individual sites. The same procedure has to be repeated for
all other possible tree topologies. The tree having the highest likelihood score
among all others is chosen as the best tree, which is the ML tree. This process
is exhaustive in nature and therefore very time consuming.
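
The likelihood of a single fixed tree can be sketched as follows. Several things here are assumptions rather than part of the text: the Jukes-Cantor (JC69) substitution model, the fixed branch lengths, the tree encoding, and the use of Felsenstein's pruning recursion, which sums over all ancestral states at each internal node. Per-site log likelihoods are then added up, as described above.

```python
import math

BASES = "ACGT"

def jc69(x, y, t):
    """JC69 probability of observing y after time t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(tree, site):
    """Return {base: P(tips below this node | node has that base)}.

    tree: a leaf label (str) or a tuple of (child_subtree, branch_length).
    site: dict mapping each leaf label to its character at this site.
    """
    if isinstance(tree, str):
        return {b: 1.0 if b == site[tree] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in tree:
        below = partial(child, site)
        for b in BASES:
            # Sum over every possible state at the child node.
            result[b] *= sum(jc69(b, c, t) * below[c] for c in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-site log likelihoods, with equal base frequencies."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        root = partial(tree, site)
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total

# Hypothetical alignment and two candidate trees with equal branch lengths.
alignment = {"t1": "AACG", "t2": "AACG", "t3": "GGCT", "t4": "GGCT"}
t = 0.1
tree_12_34 = (((("t1", t), ("t2", t)), t), ((("t3", t), ("t4", t)), t))
tree_13_24 = (((("t1", t), ("t3", t)), t), ((("t2", t), ("t4", t)), t))
lnl_a = log_likelihood(tree_12_34, alignment)
lnl_b = log_likelihood(tree_13_24, alignment)
```

In this toy case the tree that groups the identical sequences together receives the higher log likelihood. A full ML analysis would also optimise the branch lengths for each topology, which this sketch deliberately leaves out.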

Quartet Puzzling

The most commonly used heuristic ML method is called quartet puzzling,
which uses a divide-and-conquer approach. In this approach, the full set of
taxa is divided into many subsets of four taxa known as quartets.
An optimal ML tree is constructed from each of these quartets. This is a
relatively easy process as there are only three possible unrooted topologies for
a four-taxon tree. All the quartet trees are subsequently combined into a
larger tree involving all taxa. This process is like joining pieces in a jigsaw
puzzle, hence the name. The problem in drawing a consensus is that the
branching patterns in quartets with shared taxa may not agree. In this case,
a majority rule is used to determine the positions of branches to be inserted
to create the consensus tree.
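
The divide step can be sketched by enumerating the quartets and the three topologies each one allows. The function names are hypothetical, and the scoring function passed to `best_quartet_tree` stands in for an ML evaluation that is not implemented here.

```python
from itertools import combinations

def all_quartets(taxa):
    """Every subset of four taxa, as tuples."""
    return list(combinations(taxa, 4))

def quartet_topologies(a, b, c, d):
    """The three possible unrooted topologies for four taxa."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def best_quartet_tree(quartet, score):
    """Pick the topology with the highest score (e.g. a log likelihood)."""
    return max(quartet_topologies(*quartet), key=score)

taxa = ["A", "B", "C", "D", "E"]
quartets = all_quartets(taxa)

# Placeholder score for illustration only: favours grouping A with B.
def toy_score(topology):
    return 1.0 if ("A", "B") in topology else 0.0

chosen = best_quartet_tree(("A", "B", "C", "D"), toy_score)
```

Five taxa already give five quartets and ten taxa give 210, so the number of small trees grows quickly even though each one is cheap to evaluate.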

NJML

NJML is a hybrid algorithm combining aspects of NJ and ML. It constructs an
initial tree using the NJ method with bootstrapping (which will be described
later). The branches with low bootstrap support are collapsed to produce
multifurcating branches. The polytomy is resolved using the ML method.
Although the performance of this method is not yet as good as the complete ML
method, it is at least ten times faster.

Genetic Algorithm

A recent addition to fast ML search methods is the genetic algorithm (GA), a
computational optimization strategy that uses biological terminology as a
metaphor because the method involves “crossing” mathematical routines to
generate new “offspring” routines. The algorithm works by selecting an optimal
result through a mix-and-match process using a number of existing random
solutions. A “fitness” measure is used to monitor the optimization process. By
keeping record of the fitness scores, the process simulates the natural
selection and genetic crossing processes. For instance, a subroutine that has
the best score (best fit process) is selected in the first round and is used as a
starting point for the next round of the optimization cycle. Again using
biological metaphors, this is to generate more “offspring,” which are
mathematical trials with modifications from the previous ones. Different
computational routines (or “chromosomes”) are also allowed to combine (or
“crossover”) to produce a new solution. The iteration continues until an
optimal solution is found.

When applying GA to phylogenetic inference, the method strongly resembles
the pruning and re-grafting routines used in the branch-swapping process. In
GA-based tree searching, the fitness measure is the log likelihood score. The
tree search begins with a population of random trees with arbitrary branch
lengths. The tree with the highest log likelihood score is allowed to leave
more “offspring” with “mutations” on the tree topology. The mutational
process is essentially branch rearrangement. Mutated new trees are scored.
Those that are scored higher than the parent tree are allowed to mutate more
to produce even higher scored offspring, if possible. This process is repeated
until no higher scored trees can be found. The advantage of this algorithm is
its speed; a near optimal tree can often be obtained within a limited number
of iterations.

Bayesian Analysis

Another recent development of a speedy ML method is the use of the Bayesian
analysis method. The essence of Bayesian analysis is to make inferences about
something unobserved based on existing observations. It makes use of an
important concept known as posterior probability, which is defined as the
probability that is revised from prior expectations after learning something
new about the data. In mathematical terms, Bayesian analysis calculates the
posterior probability of two joint events from the prior probability and
conditional probability values using the following simplified formula:

posterior probability = (prior probability x conditional probability) / total probability

Without going into much mathematical detail, it is important to know that the
Bayesian method can be used to infer phylogenetic trees with maximum
posterior probability. In Bayesian tree selection, the prior probability is the
probability for all possible topologies before analysis. The probability for each
of these topologies is equal before tree building. The conditional probability is
the substitution frequency of characters observed from the sequence
alignment. These two pieces of information are used as a condition by the
Bayesian algorithm to search for the most probable trees that best satisfy the
observations.
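
For a handful of candidate topologies the posterior probabilities can be computed directly from Bayes' rule. In the sketch below the topology names and log likelihood values are hypothetical, and equal priors are assumed unless others are supplied, matching the equal prior probabilities described above.

```python
import math

def posterior(log_likelihoods, priors=None):
    """Posterior probability of each topology.

    log_likelihoods: dict topology -> ln P(data | topology).
    priors: dict topology -> prior probability (equal if omitted).
    """
    names = list(log_likelihoods)
    if priors is None:
        priors = {n: 1 / len(names) for n in names}
    # Subtract the largest log likelihood before exponentiating so that
    # very negative values do not underflow to zero.
    peak = max(log_likelihoods.values())
    weights = {
        n: priors[n] * math.exp(log_likelihoods[n] - peak) for n in names
    }
    total = sum(weights.values())   # plays the role of total probability
    return {n: w / total for n, w in weights.items()}

# Hypothetical log likelihoods for the three four-taxon topologies.
lnl = {"((A,B),(C,D))": -100.0, "((A,C),(B,D))": -101.0, "((A,D),(B,C))": -103.0}
post = posterior(lnl)
```

With more than a few taxa the set of topologies is far too large to enumerate like this, which is why a sampling strategy is needed.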

The tree search incorporates an iterative random sampling strategy based on
the Markov chain Monte Carlo (MCMC) procedure. MCMC is designed as a
“hill-climbing” procedure, seeking higher and higher likelihood scores while
searching for tree topologies, although occasionally it goes downhill because
of the random nature of the search. Over time, high-scoring trees are sampled
more often than low-scoring trees. When MCMC reaches high scored regions,
a set of near optimal trees is selected to construct a consensus tree.
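
The sampling behaviour can be sketched with a Metropolis sampler over a tiny set of topologies. This is a simplified illustration under stated assumptions: the candidate set and log likelihoods are hypothetical, priors are equal, and proposals are drawn uniformly from the candidates rather than by rearranging branches as a real tree sampler would.

```python
import math
import random
from collections import Counter

def metropolis(log_like, steps, seed=0):
    """Sample topologies in proportion to their likelihood.

    log_like: dict topology -> log likelihood (equal priors assumed).
    Returns the fraction of steps spent at each topology.
    """
    rng = random.Random(seed)
    states = list(log_like)
    current = rng.choice(states)
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(states)          # symmetric proposal
        delta = log_like[proposal] - log_like[current]
        # Always accept uphill moves; accept downhill moves sometimes.
        if delta >= 0 or rng.random() < math.exp(delta):
            current = proposal
        visits[current] += 1
    return {s: visits[s] / steps for s in states}

# Hypothetical log likelihoods for three candidate topologies.
lnl = {"T1": -100.0, "T2": -101.0, "T3": -103.0}
freq = metropolis(lnl, steps=20000)
```

The visiting frequencies approximate the posterior probabilities, so the best topology is sampled most often while poorer ones are still visited occasionally.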

In the end, the Bayesian method can achieve the same or even better
performance than the complete ML method, but is much faster than regular
ML and is able to handle very large datasets. The reason that Bayesian
analysis may achieve better performance than ML is that the ML method
searches for a single best tree, whereas the Bayesian method samples a set of
best trees. The advantage of the Bayesian method can be explained as a
matter of probability. Because the true tree is not known, an optimal ML tree
may have, say, a 90% probability of representing reality. However, the
Bayesian method produces hundreds or thousands of optimal or near-optimal
trees, each with, say, an 88% to 90% probability of representing reality.
Thus, the latter approach has a better chance overall to guess the true tree
correctly.
