4.phylogenetics

The document discusses phylogeny and models of DNA evolution, highlighting the concepts of synonymous and non-synonymous substitutions, as well as various models like Jukes-Cantor and Kimura's Two Parameter Model. It also covers phylogenetic trees, their representations, and methods for tree construction such as UPGMA and Neighbour Joining. Additionally, it addresses parsimony and maximum likelihood methods for estimating evolutionary relationships among sequences.

Uploaded by

Gohar Iqbal
Copyright
© All Rights Reserved

Phylogeny

Understanding and deriving evolutionary relationships
Synonymous vs Non-synonymous substitutions

Degeneracy: the genetic code is inherently redundant!

Multiple codons correspond to the same amino acid
Arginine: CGA, CGC, CGG, CGU (4-fold degenerate)
Histidine: CAU, CAC (2-fold degenerate)
18 out of 20 amino acids have more than one codon
Note: Redundancy but no ambiguity
Synonymous vs. Non-synonymous substitutions

The degeneracy of the genetic code is responsible for synonymous mutations

Synonymous mutation – same a.a.
Arginine: CGA, CGC, CGG, CGU
Non-synonymous mutation – different a.a., e.g. CGA → CCA (Arginine → Proline)
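
The distinction can be sketched in a few lines of Python. This is only an illustration: `CODON_TABLE` is a deliberately partial table holding the arginine, proline, histidine and glutamine codons, and the function name `classify_substitution` is a hypothetical helper, not part of any library.

```python
# Partial codon table (RNA alphabet). Only the codons needed for this
# illustration are listed; a real analysis would use the full 64-codon table.
CODON_TABLE = {
    "CGA": "Arg", "CGC": "Arg", "CGG": "Arg", "CGU": "Arg",
    "CCA": "Pro", "CCC": "Pro", "CCG": "Pro", "CCU": "Pro",
    "CAU": "His", "CAC": "His",
    "CAA": "Gln", "CAG": "Gln",
}


def classify_substitution(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same amino acid."""
    aa_before = CODON_TABLE[codon_before]
    aa_after = CODON_TABLE[codon_after]
    return "synonymous" if aa_before == aa_after else "non-synonymous"


print(classify_substitution("CGA", "CGC"))  # Arg -> Arg
print(classify_substitution("CGA", "CCA"))  # Arg -> Pro
print(classify_substitution("CAU", "CAA"))  # His -> Gln
```

A change at the third position of a 4-fold degenerate codon such as CGA is always synonymous, while the same position in a 2-fold degenerate codon such as CAU can change the amino acid.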
Modeling of DNA Evolution

Assumptions
1. Site independence

2. Site homogeneity.

3. Markovian: given the current base, future substitutions are independent of the past.

4. Temporal homogeneity: stationary Markov chain.


Models of DNA Evolution

A number of different models are available for DNA sequence evolution.

Jukes-Cantor Model
Kimura’s Two Parameter Model
Felsenstein Model

Models differ in the parameters used to describe the rates at which one nucleotide replaces another during evolution.
Jukes-Cantor Model
Simplest of all models

Proposed in 1969 – JC69

Assumes
•Equal base frequencies
πA = πC = πG = πT = ¼

•Equal mutation rates

Only one parameter, the overall mutation rate α

[Figure: substitution diagram for A, C, G, T with every arrow labelled α]


Equal mutation rate assumption is unrealistic

Nucleotide categories
Purines: A, G
Pyrimidines: C, T/U

Nucleotide substitutions within and between categories happen at different rates

Transitions (more frequent):
Purine ↔ Purine (A ↔ G)
Pyrimidine ↔ Pyrimidine (C ↔ T)

Transversions:
Purine ↔ Pyrimidine (A ↔ C, A ↔ T, G ↔ C, G ↔ T)
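
As a small illustration of the two categories, the sketch below classifies a single substitution. The function name `substitution_type` is a hypothetical helper introduced here, not an existing API.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def substitution_type(base_from, base_to):
    """Classify a substitution as a transition or a transversion."""
    if base_from == base_to:
        raise ValueError("not a substitution: both bases are identical")
    pair = {base_from, base_to}
    # A transition stays inside one chemical class; a transversion crosses classes.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Count the 12 possible ordered substitutions among the four DNA bases.
counts = {"transition": 0, "transversion": 0}
for x in "ACGT":
    for y in "ACGT":
        if x != y:
            counts[substitution_type(x, y)] += 1
print(counts)
```

Only 4 of the 12 possible substitutions are transitions, so the observation that transitions are nevertheless more frequent is what motivates giving them their own rate.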
Kimura’s Two Parameter Model
Distinguishes between transitions and transversions

Transitions occur at rate α
Transversions occur at rate β

Proposed in 1980 – K80

Assumes equal base frequencies
πA = πC = πG = πT = ¼

[Figure: substitution diagram for A, C, G, T with transitions (A ↔ G, C ↔ T) labelled α and transversions labelled β]
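Felsenstein’s 1981 Model
Extension of JC69

Proposed in 1981 – F81

Assumes
•Non-equal base frequencies
πA, πC, πG, πT are free to differ (they need not all equal ¼)

•Equal mutation rates (a single rate parameter α, as in JC69)

[Figure: substitution diagram for A, C, G, T with every arrow labelled α]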
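
The three models above can be compared by writing out their instantaneous rate matrices. The sketch below is one common way to parameterise them, stated as an assumption rather than taken from the slides: off-diagonal entries hold the substitution rate from the row base to the column base, each diagonal entry makes its row sum to zero, and for F81 the rate towards base j is taken to be α·πj. The helper names `build_q`, `jc69`, `k80` and `f81` are hypothetical.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def build_q(rate):
    """Build a rate matrix as a dict of dicts from a function rate(i, j)."""
    q = {}
    for i in BASES:
        row = {j: rate(i, j) for j in BASES if j != i}
        row[i] = -sum(row.values())  # diagonal: rows of a rate matrix sum to zero
        q[i] = row
    return q


def jc69(alpha):
    """JC69: every substitution happens at the same rate alpha."""
    return build_q(lambda i, j: alpha)


def k80(alpha, beta):
    """K80: transitions at rate alpha, transversions at rate beta."""
    return build_q(lambda i, j: alpha if (i, j) in TRANSITIONS else beta)


def f81(pi, alpha=1.0):
    """F81 (assumed form): rate towards base j is alpha * pi[j]."""
    return build_q(lambda i, j: alpha * pi[j])


q = k80(alpha=0.2, beta=0.05)
print(q["A"]["G"], q["A"]["C"], q["A"]["A"])
```

With equal base frequencies and a matching overall rate, `f81` produces the same matrix as `jc69`, which is the sense in which F81 extends JC69.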


Models of DNA Evolution
More complex models present / possible

• Unequal base frequencies, unequal rates for transitions/transversions

• Unequal rates for base substitutions

• Codon based – synonymous vs. non-synonymous

• …

Models to describe rate variation among sites in a sequence are also available
Phylogenetics
Phylogenetic trees – also called dendrograms
• Graphical representation of the evolutionary relatedness of 3 or more sequences

Nodes – distinct taxonomic units

• Terminal / leaf nodes: genes or organisms for which sequence data is present
• Internal nodes: inferred common ancestor that gave rise to 2 lineages

[Figure: tree with leaf nodes 1 to 5 and internal nodes A, B, C, D]
Phylogenetics
Newick notation: (((1,2),(3,4)),5)

[Figure: the tree on leaves 1 to 5 with internal nodes A, B, C, D, drawn scaled and unscaled]

Scaled Trees:
Branch lengths correspond to the amount of change between neighbouring nodes
Unscaled Trees:
Only show the relationships between different sequences
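
To show how the Newick string maps onto a tree structure, here is a minimal recursive parser. It is a sketch under stated assumptions: it handles only leaf names, commas and parentheses (no branch lengths, no internal node labels, no quoted names), and `parse_newick` and `leaves` are hypothetical helper names.

```python
def parse_newick(text):
    """Parse a Newick string without branch lengths into nested tuples."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(tree):
    """List leaf names from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("(((1,2),(3,4)),5);")
print(tree)          # ((('1', '2'), ('3', '4')), '5')
print(leaves(tree))  # ['1', '2', '3', '4', '5']
```

Each pair of parentheses corresponds to one internal node, so the nesting depth of a leaf is the number of internal nodes between it and the root.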
Phylogenetics
Layout is not important

All of these trees are the same

[Figure: three drawings of one tree with leaf orders 1 2 3 4 5, 2 1 4 3 5 and 5 4 3 2 1, each with internal nodes A, B, C, D]
Phylogenetics
Rooted Trees:
Allow inference about the common ancestor and the direction of evolution

A single node (the root) is assigned as the common ancestor

Unique path from the root to any other node through evolutionary time

Root assigned using an outgroup – something that unambiguously separated earlier than the species being considered

[Figure: rooted tree on leaves 1 to 5 with internal nodes B, C, D, root A, and a time axis]
Phylogenetics
Unrooted Trees:
Specify the relationships between nodes

No information about the direction of evolution

[Figure: the tree on leaves 1 to 5 drawn rooted and unrooted]

Why not always use rooted trees?
1. Rooted trees require a clear outgroup
2. Computationally difficult: there are many more rooted trees than unrooted trees to search

Consider 3 sequences 1, 2 and 3:

3 possible rooted trees: ((1,2),3), ((1,3),2), ((2,3),1)

Only 1 unrooted tree, joining 1, 2 and 3 at a single internal node
Why not always use rooted trees?

Number of Sequences   Number of Rooted Trees   Number of Unrooted Trees
3                     3                        1
4                     15                       3
5                     105                      15
10                    34,459,425               2,027,025
15                    213,458,046,676,875      7,905,853,580,625
…                     …                        …

How many trees?

Rooted: $\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Unrooted: $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
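
The two counting formulas are easy to evaluate directly. The sketch below uses exact integer arithmetic; the function names `rooted_trees` and `unrooted_trees` are hypothetical helpers chosen for this example.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted binary trees on n labelled sequences (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted binary trees on n labelled sequences (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10, 15):
    print(f"{n:>2}  {rooted_trees(n):>22,}  {unrooted_trees(n):>20,}")
```

The number of rooted trees on n sequences equals the number of unrooted trees on n + 1 sequences, because rooting a tree is equivalent to attaching one extra leaf (the outgroup) to one of its branches.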
Phylogeny methods
Major Methods:
Clustering Methods
• UPGMA
• WPGMA
• Neighbour Joining
• Single Linkage
• Complete Linkage
Objective Criterion based Methods
• Maximum Parsimony
• Maximum Likelihood
• Least Squares distance
UPGMA
Unweighted Pair Group Method with Arithmetic mean

Agglomerative hierarchical clustering

Assumes a constant rate of evolution

Requires data to be condensed to a distance measure – genetic distance

Build a distance matrix between sequences
UPGMA
Consider 4 sequences A, B, C and D

     A     B     C     D
A    0
B    dAB   0
C    dAC   dBC   0
D    dAD   dBD   dCD   0

dAB is the pairwise distance between A and B

Step 1: Cluster the two sequences with the smallest distance into a composite group, e.g. if dAB is the smallest, make a group “AB”
Step 2: Create a new distance matrix between “AB”, C and D
d“AB”C = ½ (dAC + dBC) and d“AB”D = ½ (dAD + dBD)
Step 3: Repeat the above until all sequences have been grouped. When groups of different sizes are joined, the distance between two groups is the mean over all pairs of their member sequences
Step 4: Put each node halfway between the grouped sequences
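
The four steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `upgma` and its input format (a dict of pairwise distances keyed by name pairs) are assumptions made for this example, ties are broken by whichever pair is encountered first, and the distances are the five-sequence values A to E used in this section's worked example.

```python
from itertools import combinations


def upgma(distances):
    """Cluster sequences by UPGMA.

    distances: dict {(x, y): d} covering every unordered pair of names.
    Returns (tree as nested tuples, dict of internal node -> height).
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    size = {name: 1 for pair in distances for name in pair}  # cluster -> member count
    heights = {}
    while len(size) > 1:
        # Step 1: find the two clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Step 4: the new node sits halfway between the two clusters.
        heights[merged] = d[frozenset((a, b))] / 2
        # Step 2: distance to every other cluster, averaged over all member
        # pairs (hence weighted by the sizes of the two merged clusters).
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
        # Step 3: the loop repeats until a single cluster remains.
    (root,) = size
    return root, heights


example = {
    ("A", "B"): 5, ("A", "C"): 8, ("A", "D"): 12, ("A", "E"): 15,
    ("B", "C"): 11, ("B", "D"): 14, ("B", "E"): 16,
    ("C", "D"): 9, ("C", "E"): 13, ("D", "E"): 7,
}
tree, heights = upgma(example)
print(tree)
for node, height in heights.items():
    print(node, round(height, 4))
```

Weighting by cluster size is what makes the method "unweighted" with respect to the original sequences: every sequence contributes equally to the average. Replacing the weighted mean with a plain ½(x + y) would turn this sketch into WPGMA.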
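UPGMA - Example
Initial distance matrix for 5 sequences A to E:

     A     B     C     D
B    5
C    8     11
D    12    14    9
E    15    16    13    7

Smallest distance: dAB = 5, so group (A,B)

     AB     C     D
C    9.5
D    13     9
E    15.5   13    7

d“AB”C = ½(8 + 11) = 9.5
Smallest distance: dDE = 7, so group (D,E)

     AB      C
C    9.5
DE   14.25   11

d“AB”“DE” = ¼(12 + 14 + 15 + 16) = 14.25 and dC“DE” = ½(9 + 13) = 11
Smallest distance: d“AB”C = 9.5, so group ((A,B),C)

     ABC
DE   13.17

d“ABC”“DE” = (2 × 14.25 + 1 × 11) / 3 ≈ 13.17, the mean of all six distances between {A, B, C} and {D, E}. The plain average ½(14.25 + 11) = 12.625 is the WPGMA value, not the UPGMA one.

Final tree: (((A,B),C),(D,E))

UPGMA – Adding branch lengths
Each node is placed at a height equal to half the distance between the two groups it joins:
(A,B): height 5 / 2 = 2.5; branches to A and B are 2.5 each
(D,E): height 7 / 2 = 3.5; branches to D and E are 3.5 each
((A,B),C): height 9.5 / 2 = 4.75; branch to C is 4.75, branch to (A,B) is 4.75 − 2.5 = 2.25
(((A,B),C),(D,E)): height 13.17 / 2 ≈ 6.58; branch to ((A,B),C) is ≈ 1.83, branch to (D,E) is ≈ 3.08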


UPGMA
Very sensitive to unequal evolutionary rates

Requires data to be ultrametric – all lineages have diverged by an equal amount
Neighbour Joining
Special case of the star decomposition method

Neighbours are defined as a pair of OTUs (operational taxonomic units, the leaves of the tree) which have only one node connecting them.
Neighbour Joining

Every node is a neighbour of other nodes

Neighbours:
1 and 2
3 and 4

Non-neighbours:
1 and 3


Parsimony
Non-parametric method

Character-based tree estimation method

• Distance matrix not required
• Uses DNA sequence data (or any other discrete data)

Explains the observed sequences with a minimum number of substitutions

Best for small sets of sequences with high similarity


Parsimony: Fitch’s Algorithm

Two-step procedure

1. Starting from the leaves, traverse the tree “up” to the root. Find the set of possible ancestral states for each internal node.

2. Traverse the tree “down”, from the root to the leaves. Determine the ancestral states for the internal nodes.

Assumption:
Independent sites – can solve one site at a time.
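
The two passes can be sketched for a single site as follows. This is a minimal illustration under stated assumptions: the tree is given as nested 2-tuples of leaf names, the topology (((A,B),C),(D,E)) and the leaf states are chosen only for this example, ambiguous choices in the downward pass are resolved arbitrarily (alphabetically), and the function name `fitch` is hypothetical.

```python
def fitch(tree, states):
    """Run both Fitch passes for one site.

    tree: nested 2-tuples of leaf names; states: dict leaf name -> base.
    Returns (number of changes, candidate state sets, one assignment).
    """
    sets = {}

    def up(node):
        """Pass 1 (leaves to root): build candidate sets and count changes."""
        if isinstance(node, str):
            sets[node] = {states[node]}
            return 0
        left, right = node
        cost = up(left) + up(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common  # children agree: no change needed here
        else:
            sets[node] = sets[left] | sets[right]
            cost += 1  # children disagree: one substitution is required
        return cost

    assigned = {}

    def down(node, parent_state):
        """Pass 2 (root to leaves): keep the parent's state where possible."""
        if parent_state in sets[node]:
            assigned[node] = parent_state
        else:
            assigned[node] = min(sets[node])  # arbitrary tie-break
        if not isinstance(node, str):
            for child in node:
                down(child, assigned[node])

    cost = up(tree)
    down(tree, None)
    return cost, sets, assigned


tree = ((("A", "B"), "C"), ("D", "E"))
states = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
cost, sets, assigned = fitch(tree, states)
print("changes:", cost)
print("root candidates:", sorted(sets[tree]))
print("root assignment:", assigned[tree])
```

Here the root's candidate set contains both C and T: either choice explains the data with a single substitution, which is how ambiguous assignments at internal nodes arise.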
Parsimony – Tree using complete sequence data

Treat each site independently

Evaluate the score (number of changes) at each position separately

Get the total score by summing the scores at all sites

Select the tree with the lowest total score.


Parsimony – Tree using complete sequence data

Example
Site         1  2  3  4  5  6  7  8  9  10
Species 1 -  A  G  G  G  T  A  A  C  T  G
Species 2 -  A  C  G  A  T  T  A  T  T  A
Species 3 -  A  T  A  A  T  T  G  T  C  T
Species 4 -  A  A  T  G  T  T  G  T  C  G
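
The scoring procedure can be applied to this alignment with a short script. It is a sketch, not the author's method: it counts changes per site with the bottom-up pass of Fitch's algorithm and compares the three possible groupings of four species. The names `site_score` and `tree_score` are hypothetical helpers, and each unrooted grouping is written as a rooted pair of pairs, which does not affect the count.

```python
ALIGNMENT = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}

# The three ways of splitting four species into two pairs.
TREES = {
    "((1,2),(3,4))": (("Species 1", "Species 2"), ("Species 3", "Species 4")),
    "((1,3),(2,4))": (("Species 1", "Species 3"), ("Species 2", "Species 4")),
    "((1,4),(2,3))": (("Species 1", "Species 4"), ("Species 2", "Species 3")),
}


def site_score(tree, column):
    """Minimum number of changes at one site (Fitch bottom-up pass)."""
    def up(node):
        if isinstance(node, str):
            return {column[node]}, 0
        left_set, left_cost = up(node[0])
        right_set, right_cost = up(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return up(tree)[1]


def tree_score(tree, alignment):
    """Total score: sum of the per-site scores over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        site_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )


scores = {label: tree_score(tree, ALIGNMENT) for label, tree in TREES.items()}
for label, score in scores.items():
    print(label, score)
print("most parsimonious:", min(scores, key=scores.get))
```

For these sequences the grouping ((1,2),(3,4)) needs 13 changes, against 15 for ((1,3),(2,4)) and 14 for ((1,4),(2,3)). Sites 7 and 9 favour the first grouping and site 4 favours the third; the remaining sites score the same on all three trees.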
Parsimony

Problems:

Statistically inconsistent:
Not guaranteed to produce the true tree with high probability, even given sufficient data

Consistency: convergence on the correct answer with the addition of more data

Ambiguous assignments possible at internal nodes


Parsimony - Ambiguous assignment
What are the ancestral states?

Leaf:   A  B  C  D  E
State:  C  C  C  T  T

[Figure: tree whose internal nodes are assigned C, T and C, with the root ambiguous: C/T]
Maximum Likelihood

Purely statistical method

Considers the probabilities of every nucleotide substitution in the set of aligned sequences

Problem: Ancestor unknown

Requires testing all possible trees.

Select the tree that gives the highest probability


Maximum Likelihood

Example:

P(A): 0.25
P(G): 0.25
P(A → G): 0.2
P(G → A): 0.2
P(A → A): 0.6
P(G → G): 0.6
Assumption: C and T are very unlikely

[Figure: tree with three observed leaves (G, A and G) and unknown ancestors]
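Maximum Likelihood
Possible Trees:
The observed leaves are G, A and G. The root joins the first G to an internal node, which is the ancestor of A and the second G.
Probability = P(root) × P(root → G) × P(root → internal) × P(internal → A) × P(internal → G)

1. Root A (0.25), internal node G
Probability = 0.25 × 0.2 × 0.2 × 0.2 × 0.6 = 0.0012

2. Root G (0.25), internal node G
Probability = 0.25 × 0.6 × 0.6 × 0.2 × 0.6 = 0.0108

3. Root A (0.25), internal node A
Probability = 0.25 × 0.2 × 0.6 × 0.6 × 0.2 = 0.0036

4. Root G (0.25), internal node A
Probability = 0.25 × 0.6 × 0.2 × 0.6 × 0.2 = 0.0036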
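
The four products can be checked with a few lines of Python. This sketch assumes the arrangement suggested by the slide residue: the root is the parent of one observed G and of an internal node, and the internal node is the parent of the observed A and the other G. The names `PRIOR`, `P` and `assignment_probability` are hypothetical.

```python
# Probabilities taken from the example; C and T are ignored as very unlikely.
PRIOR = {"A": 0.25, "G": 0.25}
P = {
    ("A", "A"): 0.6, ("A", "G"): 0.2,
    ("G", "A"): 0.2, ("G", "G"): 0.6,
}


def assignment_probability(root, internal):
    """Probability of the observed leaves for one choice of ancestral bases."""
    prob = PRIOR[root]
    prob *= P[(root, "G")]        # root -> observed leaf G
    prob *= P[(root, internal)]   # root -> internal node
    prob *= P[(internal, "A")]    # internal node -> observed leaf A
    prob *= P[(internal, "G")]    # internal node -> observed leaf G
    return prob


results = {
    (root, internal): assignment_probability(root, internal)
    for root in "AG"
    for internal in "AG"
}
for (root, internal), prob in results.items():
    print(f"root {root}, internal {internal}: {prob:.4f}")

best = max(results, key=results.get)
print("highest probability: root", best[0], "internal", best[1])
```

The assignment with G at both ancestral nodes has the highest probability, since it requires only one substitution (G → A) along the tree.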
Phylogenetic Software
PHYLIP: Phylogeny Inference Package – Free
http://evolution.genetics.washington.edu/phylip.html

Very useful. Includes many programs, including distance-based methods, parsimony, maximum likelihood, …

PAUP: Phylogenetic Analysis Using Parsimony – Not Free
http://paup.csit.fsu.edu/

Also includes distance-based and maximum likelihood methods

PAML: Phylogenetic Analysis by Maximum Likelihood
http://abacus.gene.ucl.ac.uk/software/paml.html
