Phylogenic Tree

Uploaded by

21020075
Copyright © All Rights Reserved

Phylogenetic trees

Cladistics: tree of life
What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:

1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest
Common Phylogenetic Tree Terminology

• Terminal nodes (A, B, C, D, E): represent the taxa (genes, populations, species, etc.) used to infer the phylogeny
• Branches or lineages
• Internal nodes or divergence points: represent hypothetical ancestors of the taxa
• Ancestral node or ROOT of the tree
Phylogenetic trees diagram the evolutionary relationships between the taxa

[Figure: a rooted tree whose tips, from top to bottom, are Taxon B, Taxon C, Taxon A, Taxon D, Taxon E.]

There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.

The horizontal dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses

This says that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
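
The nested-parentheses notation is easy to process mechanically. The sketch below is a minimal illustration, not a full Newick parser: it assumes bare taxon names with no branch lengths, and the helper names `parse_newick` and `clades` are invented for this example. It parses the string above and lists the set of taxa under each internal node.

```python
def parse_newick(s):
    """Parse nested parentheses such as '((A,(B,C)),(D,E))' into nested tuples.

    Assumption: plain taxon names only, no branch lengths or trailing ';'.
    """
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while s[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]               # a leaf name

    return node()


def clades(tree):
    """Return the leaf set below every internal node (children before parents)."""
    found = []

    def walk(t):
        if isinstance(t, str):
            return frozenset([t])
        leaves = frozenset().union(*(walk(child) for child in t))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

The clade {B, C} appears in the output and {A, B} does not, which is the statement "B and C are more closely related to each other than either is to A" in set form.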
Three types of trees

[Figure: the same tree of Taxon B, Taxon C, Taxon A and Taxon D drawn three ways.]

• Cladogram: branch lengths have no meaning
• Phylogram: branch lengths show genetic change (the figure labels its branches 6, 1, 1, 3, 1 and 5)
• Ultrametric tree: branch lengths show time

All show the same evolutionary relationships, or branching orders, between the taxa.
Example: Which species are the closest living relatives of modern humans?

Molecular view (tree tips, top to bottom: Humans, Chimpanzees, Bonobos, Gorillas, Orangutans; time axis from 14 MYA to 0):
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.

Pre-molecular view (tree tips, top to bottom: Gorillas, Chimpanzees, Bonobos, Orangutans, Humans; time axis from 15-30 MYA to 0):
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:

• Completely unresolved or "star" phylogeny: all lineages meet at a single polytomy (multifurcation)
• Partially resolved phylogeny
• Fully resolved, bifurcating phylogeny: every internal node is a bifurcation

[Figure: the three cases drawn for taxa A to E.]


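There are three possible unrooted trees for four taxa (A, B, C, D)

• Tree 1: A and B on one side of the internal branch, C and D on the other
• Tree 2: A and C on one side, B and D on the other
• Tree 3: A and D on one side, B and C on the other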
Phylogenetic tree building (or inference) methods are aimed at
discovering which of the possible unrooted trees is "correct".
We would like this to be the “true” biological tree — that is, one
that accurately represents the evolutionary history of the taxa.
However, we must settle for discovering the computationally
correct or optimal tree for the phylogenetic method of choice.
The number of unrooted trees increases in a greater than exponential manner with number of taxa

[Figure: example unrooted trees for 3, 4, 5 and 6 taxa.]

# Taxa (N)   # Unrooted trees
3            1
4            3
5            15
6            105
7            945
8            10,395
9            135,135
10           2,027,025
.            .
.            .
30           ≈ 8.69 × 10^36

(2N - 5)!! = # unrooted trees for N taxa
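
The double-factorial formula can be checked directly. The following is a small sketch (the function names are invented for this example) that evaluates (2N - 5)!! for unrooted trees and (2N - 3)!! for rooted trees using exact integer arithmetic.

```python
def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2; returns 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, unrooted_trees(n))
```

Because each added taxon multiplies the count by the next odd number, the growth is faster than any fixed exponential, which is why exhaustive search stops being practical after a dozen or so taxa.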


Inferring evolutionary relationships between the taxa requires rooting the tree:

[Figure: an unrooted tree of taxa A, B, C and D with two alternative root positions marked, each shown next to the rooted tree it produces.]
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1 (A and B on one side, C and D on the other) with its five branches numbered 1 to 5, and the rooted trees 1a to 1e obtained by placing the root on each branch.]

These trees show five different evolutionary relationships among the taxa!
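
To make the five rootings concrete, the sketch below places a root on each of the five branches of unrooted tree 1 and prints the resulting rooted tree in nested-parentheses form. It assumes tree 1 pairs A with B and C with D; the internal node names `x` and `y` and the function names are invented for this illustration.

```python
# Unrooted tree 1 as an adjacency map; "x" and "y" are the two internal nodes
# (hypothetical names, not from the slides).
ADJ = {
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
    "x": ["A", "B", "y"], "y": ["C", "D", "x"],
}
BRANCHES = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]


def subtree(node, parent):
    """Nested-parentheses string for the part of the tree seen from `parent`."""
    if len(ADJ[node]) == 1:               # a leaf
        return node
    parts = sorted(subtree(c, node) for c in ADJ[node] if c != parent)
    return "(" + ",".join(parts) + ")"


def root_on_branch(u, v):
    """Place the root in the middle of branch u-v."""
    return "(" + ",".join(sorted([subtree(u, v), subtree(v, u)])) + ")"


rooted = [root_on_branch(u, v) for u, v in BRANCHES]
for newick in rooted:
    print(newick)
```

Sorting the children only fixes a drawing order; it does not change the relationships. The five strings differ in which taxa are grouped together, which is what makes them five different evolutionary hypotheses.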
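All of these rearrangements show the same evolutionary relationships between the taxa

[Figure: rooted tree 1a redrawn several times with its branches rotated around the internal nodes; the top-to-bottom order of A, B, C and D changes, the branching order does not.]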
Possible evolutionary trees

Taxa (n)   Unrooted/rooted

2          1/1
3          1/3
4          3/15
Possible evolutionary trees

Taxa (n)   rooted                             unrooted
           $\frac{(2n-3)!}{2^{n-2}(n-2)!}$    $\frac{(2n-5)!}{2^{n-3}(n-3)!}$

2          1                                  1
3          3                                  1
4          15                                 3
5          105                                15
6          945                                105
7          10,395                             945
8          135,135                            10,395
9          2,027,025                          135,135
10         34,459,425                         2,027,025
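
The factorial expressions in the table header are the double factorials $(2n-3)!!$ and $(2n-5)!!$ written in closed form. A short derivation for the rooted case, splitting $(2n-3)!$ into its odd and even factors (valid for $n \ge 2$):

```latex
\begin{align*}
(2n-3)! &= \underbrace{1\cdot 3\cdot 5\cdots(2n-3)}_{(2n-3)!!}
           \;\cdot\;
           \underbrace{2\cdot 4\cdot 6\cdots(2n-4)}_{2^{\,n-2}\,(n-2)!} \\
\Rightarrow\quad (2n-3)!! &= \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}
\end{align*}
```

Replacing $n$ by $n-1$ gives the unrooted count $(2n-5)!! = \frac{(2n-5)!}{2^{n-3}(n-3)!}$. For $n = 4$ this yields $5!/(2^{2}\cdot 2!) = 15$ rooted and $3!/(2^{1}\cdot 1!) = 3$ unrooted trees, matching the table.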
There are two major ways to root trees:
By outgroup:
Uses taxa (the “outgroup”) that are
known to fall outside of the group of
interest (the “ingroup”). Requires
some prior knowledge about the
relationships among the taxa. The
outgroup can either be species (e.g.,
birds to root a mammalian tree) or
previous gene duplicates (e.g.,
α-globins to root β-globins).

By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

Example (tree with terminal branch lengths A = 10, B = 2, C = 2, D = 5 and an internal branch of length 3):
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
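
The midpoint calculation can be sketched in a few lines. This example uses the branch lengths of the figure (10, 2, 3, 2, 5) and assumes that A and B attach to one internal node and C and D to the other; the internal node names `x` and `y` and all function names are invented for the illustration.

```python
from itertools import combinations

# Branch lengths from the example. Node names "x"/"y" and the placement of
# B and C are assumptions made for this sketch.
BRANCHES = {("A", "x"): 10, ("B", "x"): 2, ("x", "y"): 3, ("C", "y"): 2, ("D", "y"): 5}

adj = {}
for (u, v), length in BRANCHES.items():
    adj.setdefault(u, []).append((v, length))
    adj.setdefault(v, []).append((u, length))


def path(src, dst, parent=None):
    """Return the steps (node, next_node, length) from src to dst, or None."""
    if src == dst:
        return []
    for nxt, length in adj[src]:
        if nxt != parent:
            rest = path(nxt, dst, src)
            if rest is not None:
                return [(src, nxt, length)] + rest
    return None


def midpoint_root():
    """Find the two most distant taxa and the point halfway between them."""
    leaves = [n for n in adj if len(adj[n]) == 1]
    far = max(combinations(leaves, 2), key=lambda p: sum(s[2] for s in path(*p)))
    steps = path(*far)
    total = sum(s[2] for s in steps)
    remaining = total / 2
    for u, v, length in steps:
        if remaining <= length:
            # The root sits on branch u-v, `remaining` units away from u.
            return far, total, (u, v), remaining
        remaining -= length


pair, total, branch, offset = midpoint_root()
print(pair, total, total / 2, branch, offset)
```

With these lengths the most distant pair is A and D at distance 18, so the root falls 9 units from A, that is, on A's own branch, 1 unit short of the internal node.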
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA
or protein sequences, directly during tree inference.
Taxa Characters
Species A ATGGCTATTCTTATAGTACG
Species B ATCGCTAGTCTTATATTACA
Species C TTCACTAGACCTGTGGTCCA
Species D TTGACCAGACCTGTGGTCCG
Species E TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

Example 1 (above the diagonal): Uncorrected “p” distance (= observed percent sequence difference)

Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
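
Both distances can be computed from the aligned sequences shown earlier. The sketch below (function name invented for this example) counts transitions and transversions between two sequences and applies the standard Kimura 2-parameter formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. It assumes gap-free sequences of equal length.

```python
import math

PURINES = {"A", "G"}


def p_and_k2p(seq1, seq2):
    """Return (uncorrected p distance, Kimura 2-parameter distance)."""
    n = len(seq1)
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x != y:
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if (x in PURINES) == (y in PURINES):
                transitions += 1
            else:
                transversions += 1
    P, Q = transitions / n, transversions / n
    p = P + Q
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return p, k2p


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
p, k2p = p_and_k2p(species_a, species_b)
print(round(p, 2), round(k2p, 2))
```

For Species A and Species B there are four differences in 20 sites (one transition, three transversions), giving p = 0.20 and a Kimura distance of about 0.23, the two values in the A/B cells of the matrix. The corrected distance is always at least as large as p because it accounts for substitutions hidden by multiple hits.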
The biological basis of evolution

Mother DNA: TCTGCCTC

TCTGCCTC GATGCCTC TCTGCCTCGGG

GATGCATC GACGCCTC GCTGCCTCGGG

GATGAATC GCCGCCTC GCTAAGCCTCGGG


Present species
Types of computational methods:

Clustering algorithms: Use pairwise distances. These are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion. They tend to be very fast programs that produce a single tree rooted by distance. There is no objective function for comparison with other trees, even if numerous other trees could explain the data equally well. Warning: Finding a single tree is not necessarily the same as finding the “true” evolutionary tree.

Optimality approaches: Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist. Warning: Finding an optimal tree is not necessarily the same as finding the “true” tree.
Computational methods for finding optimal trees:

Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:
• Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.
• Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.

Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.
Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:

[Figure: example unrooted trees for 3, 4, 5 and 6 taxa.]

# Taxa (N)   # Unrooted trees
3            1
4            3
5            15
6            105
7            945
8            10,395
9            135,135
10           2,027,025
.            .
.            .
30           ≈ 8.69 × 10^36

(2N - 5)!! = # unrooted trees for N taxa
Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima.

Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima.

[Figure: two search landscapes, one for a minimum search and one for a maximum search, each marking a local optimum and the GLOBAL MINIMUM or GLOBAL MAXIMUM.]
Classification of phylogenetic inference methods

                 COMPUTATIONAL METHOD
DATA TYPE        Optimality criterion      Clustering algorithm

Characters       PARSIMONY
                 MAXIMUM LIKELIHOOD

Distances        MINIMUM EVOLUTION         UPGMA
                 LEAST SQUARES             NEIGHBOR-JOINING


Clustering (distance) methods

Optimality criterion: NONE. The algorithm itself builds ‘the’ tree.
Advantages:
• Can be used on indirectly-measured distances (immunological, hybridization).
• Distances can be ‘corrected’ for unseen events.
• The fastest of the methods available (NJ is screamingly fast!).
• Can therefore analyze very large datasets quickly (needed for HIV, etc.).
• Can be used for some types of rate and date analysis.

Disadvantages:
• Similarity and relationship are not necessarily the same thing, so clustering by
similarity does not necessarily give an evolutionary tree.
• Cannot be used for character analysis!
• Have no explicit optimization criteria, so one cannot even know if the program
worked properly to find the correct tree for the method.
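NJ algorithm

Input: n × n distance matrix D
Output: phylogenetic tree T

Step 1: For every pair of taxa compute $Q(i,j) = (n-2)\,D(i,j) - \sum_k D(i,k) - \sum_k D(j,k)$ and choose the pair f and g with the smallest Q to join. (Joining the pair with the smallest raw value of D is what UPGMA does; NJ corrects each distance for how far the two taxa are from everything else.)

Step 2: Join the two taxa f and g to a new taxon u.

Step 3: Update the distance between u and every other taxon k: $D(u,k) = \tfrac{1}{2}\,\bigl(D(f,k) + D(g,k) - D(f,g)\bigr)$.

Step 4: The distance matrix D’ now contains n - 1 taxa. If more than 2 taxa are left, go to step 1; otherwise join the two remaining taxa $t_i$ and $t_j$ by a branch of length $d(t_i, t_j)$.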
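
The steps can be sketched as a short topology-only implementation. This is an illustration under stated assumptions, not a production NJ: it uses the standard neighbor-joining selection criterion Q(i, j) = (n - 2) D(i, j) - Σ D(i, k) - Σ D(j, k), it does not estimate branch lengths, and the function name and the example matrix are invented. The matrix is additive on a tree with branches A = 1, B = 4, C = 1, D = 4 and an internal branch of 1, chosen so that the closest pair by raw distance (A and C) are not true neighbors.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the NJ topology as nested tuples.

    dist maps frozenset({a, b}) -> distance. Branch lengths are not computed.
    """
    d = dict(dist)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Step 1: the pair minimising Q (ties go to the first pair found).
        f, g = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        # Step 2: join f and g into a new node u.
        u = (f, g)
        # Step 3: distances from u to every remaining node k.
        for k in nodes:
            if k not in (f, g):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((f, k))] + d[frozenset((g, k))] - d[frozenset((f, g))]
                )
        # Step 4: one taxon fewer; loop again while more than two remain.
        nodes = [k for k in nodes if k not in (f, g)] + [u]
    return tuple(nodes)


raw = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
       ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree = neighbor_joining(["A", "B", "C", "D"], dist)
print(tree)
```

Although D(A, C) = 3 is the smallest entry, the Q criterion joins A with B (and C with D), recovering the tree that generated the distances. A method that simply clusters the smallest distance first would group A with C instead.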
Maximum Parsimony
• Character based method
• NP-hard problem
• Widely-used
• Slower than NJ
• Faster than ML
Maximum Parsimony
• Input: Set S of n aligned sequences of
length k
• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling
the internal nodes of T

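such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $E(T)$ is the set of edges of $T$ and $H(i,j)$ is the Hamming distance between the sequences labeling nodes $i$ and $j$.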
Maximum parsimony
(example)
• Input: Four sequences
– ACT
– ACA
– GTT
– GTA
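• Question: which of the three trees has the best MP score?

Maximum Parsimony

The three possible trees, with the internal-node labels shown in the figure and the number of changes on each edge:

• Tree 1: ACT and GTT on one side, GTA and ACA on the other; internal nodes labeled GTT and GTA; edge costs 2 + 1 + 2; MP score = 5
• Tree 2: ACA and GTT on one side, ACT and GTA on the other; internal nodes labeled GTT and GTA; edge costs 3 + 1 + 3 = 7 with these labels. Labeling both internal nodes ACA lowers the total to 6, so MP score = 6
• Tree 3: ACT and ACA on one side, GTT and GTA on the other; internal nodes labeled ACA and GTA; edge costs 1 + 2 + 1; MP score = 4

Tree 3 is the optimal MP tree.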
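
The score of a tree whose internal nodes are already labeled is just the sum of Hamming distances over its edges. The sketch below evaluates that sum for the three trees of the example. The internal labels (GTT/GTA, GTT/GTA and ACA/GTA) are read off the figure and should be treated as an assumption; the function names are invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges):
    """Sum of Hamming distances over the edges of a fully labeled tree."""
    return sum(hamming(a, b) for a, b in edges)


def quartet_edges(left_leaves, left_label, right_label, right_leaves):
    """Edges of a four-leaf tree with two labeled internal nodes."""
    return (
        [(leaf, left_label) for leaf in left_leaves]
        + [(left_label, right_label)]
        + [(right_label, leaf) for leaf in right_leaves]
    )


tree1 = quartet_edges(["ACT", "GTT"], "GTT", "GTA", ["GTA", "ACA"])
tree2 = quartet_edges(["ACA", "GTT"], "GTT", "GTA", ["ACT", "GTA"])
tree3 = quartet_edges(["ACT", "ACA"], "ACA", "GTA", ["GTT", "GTA"])
tree2_relabeled = quartet_edges(["ACA", "GTT"], "ACA", "ACA", ["ACT", "GTA"])

for edges in (tree1, tree2, tree3, tree2_relabeled):
    print(tree_length(edges))
```

The value obtained for a tree depends on the internal labels: the second tree costs 7 with labels GTT and GTA but 6 when both internal nodes are labeled ACA. The MP score of a tree is the minimum over all labelings, so the ranking of the three trees (tree 3 best, with score 4) is unchanged.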
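Maximum Parsimony: computational complexity

For a fixed tree, an optimal labeling of the internal nodes can be computed in linear time, O(nk).

[Figure: tree 3 again (ACT and ACA on one side, GTT and GTA on the other, internal nodes ACA and GTA, edge costs 1 + 2 + 1), MP score = 4.]

Finding the optimal MP tree is NP-hard.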
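
One standard way to get the optimal labeling cost of a fixed tree in O(nk) time is Fitch's algorithm; naming it here is an assumption, since the slides do not say which method they have in mind. The sketch below (function name invented) takes a tree as nested tuples of sequences, rooted on its internal branch, and does one bottom-up pass per site. For this unweighted cost the position of the root does not affect the score.

```python
def fitch_score(tree):
    """Minimum number of changes on a binary tree given as nested tuples.

    Leaves are equal-length strings; each internal node is a pair (left, right).
    """
    def walk(t):
        if isinstance(t, str):
            return [{c} for c in t], 0          # one state set per site
        (left_sets, left_cost), (right_sets, right_cost) = walk(t[0]), walk(t[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            if a & b:
                sets.append(a & b)              # agreement: no change needed
            else:
                sets.append(a | b)              # disagreement: one change
                cost += 1
        return sets, cost

    return walk(tree)[1]


trees = [
    (("ACT", "GTT"), ("GTA", "ACA")),
    (("ACA", "GTT"), ("ACT", "GTA")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
scores = [fitch_score(t) for t in trees]
print(scores)
```

Each node is visited once and does constant work per site, which gives the O(nk) bound. The minimum for the second tree is 6; a value of 7 corresponds to the particular internal labels GTT and GTA rather than to an optimal labeling.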


Local search strategies

[Figure: cost plotted over the space of phylogenetic trees, marking a local optimum and the global optimum.]

Local search for MP
• Determine a candidate solution s
• While s is not a local minimum
– Find a neighbor s’ of s such that MP(s’)<MP(s)
– If found set s=s’
– Else return s and exit

• Time complexity: unknown, could take forever or end quickly depending on starting tree and local move
• Need to specify how to construct starting tree and local move
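
The loop above can be sketched on the smallest possible search space, the three trees for the four sequences ACT, ACA, GTT and GTA. Several choices here are assumptions made for the illustration: trees are nested tuples, the neighbors of a four-leaf tree are the two other pairings of its leaves, the cost is computed with a Fitch-style pass, and the search moves to the best improving neighbor rather than to any improving neighbor. All function names are invented.

```python
def mp_score(tree):
    """Fitch-style minimum number of changes for a tree of nested tuples."""
    def walk(t):
        if isinstance(t, str):
            return [{c} for c in t], 0
        (ls, lc), (rs, rc) = walk(t[0]), walk(t[1])
        sets, cost = [], lc + rc
        for a, b in zip(ls, rs):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                cost += 1
        return sets, cost

    return walk(tree)[1]


def quartet_neighbors(tree):
    """The two other ways of pairing up the four leaves."""
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]


def local_search(start, cost, neighbors):
    """Move to a better neighbor until none exists; return the local minimum."""
    s = start
    while True:
        better = [t for t in neighbors(s) if cost(t) < cost(s)]
        if not better:
            return s
        s = min(better, key=cost)


start = (("ACT", "GTT"), ("GTA", "ACA"))        # score 5
result = local_search(start, mp_score, quartet_neighbors)
print(result, mp_score(result))
```

Starting from the tree with score 5, the search moves to the tree pairing ACT with ACA (score 4) and stops there because neither neighbor is better. With four taxa the local minimum is also the global one; with more taxa the same loop can stop at a tree that is only locally best.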
Starting tree for MP
• Random phylogeny: O(n) time
• Greedy-MP
Greedy-MP

Greedy-MP takes O(n^2k^2) time


Local moves for MP: NNI
• For each internal edge we get two different topologies
• Neighborhood size is 2n-6 (a tree on n taxa has n-3 internal edges)
Local moves for MP: SPR
• Neighborhood size is quadratic in number of taxa
• Computing the minimum number of SPR moves
between two rooted phylogenies is NP-hard
Local moves for MP: TBR
• Neighborhood size is cubic in number of taxa
• Computing the minimum number of TBR moves between two rooted phylogenies is NP-hard
Local optima are a problem

[Figure: plot for TNT; vertical axis from 0 to 0.08, horizontal axis from 1 to 336.]
Iterated local search: escape local optima by perturbation

[Figure, built up over three slides: Local search reaches a Local optimum; a Perturbation moves away from it; Local search restarts from the Output of perturbation.]
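
The cycle of local search, perturbation and renewed local search can be sketched on a toy one-dimensional cost landscape. Everything specific here is an assumption made for the illustration: the landscape values, the fixed-size "kick" used as the perturbation, the number of rounds, and the function names. A real tree search would perturb the tree itself, for example with a few random rearrangements.

```python
def local_search(cost, i):
    """Walk downhill on the list `cost` from index i to a local minimum."""
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(cost)]
        best = min(neighbors, key=lambda j: cost[j])
        if cost[best] >= cost[i]:
            return i                      # no neighbor is better
        i = best


def iterated_local_search(cost, start, kick=3, rounds=4):
    """Alternate perturbation and local search, keeping the best optimum seen."""
    best = local_search(cost, start)
    for _ in range(rounds):
        perturbed = (best + kick) % len(cost)      # perturbation
        candidate = local_search(cost, perturbed)  # local search again
        if cost[candidate] < cost[best]:
            best = candidate
    return best


landscape = [5, 3, 4, 6, 2, 1, 2, 7]   # toy costs; global minimum at index 5
plain = local_search(landscape, 0)
iterated = iterated_local_search(landscape, 0)
print(plain, landscape[plain], iterated, landscape[iterated])
```

Plain local search from index 0 stops at the local minimum with cost 3. One perturbation carries the search over the barrier, and the following local search reaches the global minimum with cost 1.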
