Phylogeny_Notes
Phylogenetics is the study of evolutionary relationships among species (taxa) or genes, often
represented as a tree-like diagram called a phylogenetic tree. The goal of phylogenetics is to
reconstruct the evolutionary history of organisms, showing how they are related to one another
based on shared characteristics, typically molecular data (e.g., DNA, RNA, or protein
sequences).
The basic principle of phylogenetics is that organisms that share a common ancestor will have
inherited some of their traits from that ancestor, and these traits can be used to infer
relationships. The more homologous traits two species or taxa share, the more closely related
they are inferred to be. Phylogenetic trees represent these relationships by showing the
branching patterns of common ancestry.
A phylogenetic tree is constructed from data such as molecular sequences (e.g., DNA or
protein sequences). These data are compared across the different taxa, and the taxa are then
grouped according to their shared similarities and differences.
A phylogenetic tree is composed of nodes and branches. Internal nodes represent common
ancestors, and branches represent the evolutionary paths from ancestors to descendants. The
root of the tree represents the most recent common ancestor of all taxa in the tree. The leaves
or tips of the tree represent the current species (taxa) or genes being studied.
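The node, branch, root, and tip vocabulary above maps directly onto a small recursive data structure. The sketch below is illustrative only: the `Node` class, the helper names, the taxon labels, and the branch lengths are all assumptions made for the example, and the Newick text format used for output is a common convention that these notes do not otherwise discuss.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted tree; a node without children is a leaf (tip)."""
    name: Optional[str] = None      # taxon label for tips, usually empty for ancestors
    branch_length: float = 0.0      # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: Node) -> List[str]:
    """Return the tip labels below `node`, in left-to-right order."""
    if node.is_leaf():
        return [node.name]
    labels: List[str] = []
    for child in node.children:
        labels.extend(leaves(child))
    return labels


def to_newick(node: Node) -> str:
    """Serialize the subtree rooted at `node` as Newick text (without the final ';')."""
    label = node.name or ""
    if node.is_leaf():
        return f"{label}:{node.branch_length}"
    inner = ",".join(to_newick(child) for child in node.children)
    return f"({inner}){label}:{node.branch_length}"


# Hypothetical three-taxon tree: human and chimp share an ancestor not shared with mouse.
primates = Node(branch_length=0.2, children=[Node("human", 0.1), Node("chimp", 0.1)])
root = Node(children=[primates, Node("mouse", 0.5)])
newick = to_newick(root) + ";"
```

Here `root` plays the role of the most recent common ancestor of all three taxa, and `leaves(root)` lists the tips being studied.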
There are different methods for constructing phylogenetic trees, including distance-based
methods (like Neighbor-Joining), character-based methods (such as Maximum Parsimony), and
probabilistic methods (like Maximum Likelihood and Bayesian Inference). These methods use
different algorithms and assumptions to infer the best tree, but they all aim to represent the
evolutionary relationships among taxa as accurately as possible.
Phylogenetic analysis helps us understand the evolutionary processes that have shaped the
diversity of life on Earth, including how species evolved, adapted, and diversified over time. It
also provides insight into the origins of specific traits or diseases, as well as helping to identify
and classify new species based on their genetic relationships.
Creating a phylogenetic tree for a particular gene family involves several key steps.
Identify the gene family of interest by searching genomic or protein databases (e.g., GenBank,
UniProt) for sequences associated with this family.
For nucleotide data, select a suitable substitution model (e.g., GTR, HKY). For protein data,
you may choose a model like WAG or JTT, depending on your dataset. Use software like
jModelTest (for nucleotides) or ProtTest (for proteins) to determine the most appropriate model
based on the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
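To make the two criteria concrete, here is a minimal sketch of how AIC and BIC could be computed from a model's log-likelihood. The log-likelihood values and the alignment length are invented for illustration, and the parameter counts include only substitution-model parameters (branch lengths are assumed to be shared by all models and are left out). Lower scores indicate a better trade-off between fit and complexity.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L, with n = alignment length."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model -> (log-likelihood, number of free model parameters)
fits = {"JC": (-5210.4, 0), "HKY": (-5105.7, 4), "GTR": (-5101.2, 8)}
n_sites = 1200  # hypothetical alignment length

scores = {
    model: (aic(lnl, k), bic(lnl, k, n_sites))
    for model, (lnl, k) in fits.items()
}
best_by_aic = min(scores, key=lambda m: scores[m][0])
best_by_bic = min(scores, key=lambda m: scores[m][1])
```

With these made-up numbers the two criteria disagree: AIC prefers the richer GTR model, while BIC, which penalizes extra parameters more heavily for long alignments, prefers HKY.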
5. Tree Construction
Neighbor-Joining (NJ): Fast and simple, but may be less accurate for complex datasets.
Maximum Likelihood (ML): Often the most accurate, implemented in software like RAxML or
IQ-TREE.
Bayesian Inference (BI): Provides a posterior probability distribution for the tree, implemented
in software like MrBayes or BEAST.
Run the chosen tree construction method, inputting the aligned sequences and selected model.
6. Tree Visualization
Visualize the resulting tree using tools like FigTree, iTOL, or Dendroscope.
Evaluate the robustness of the tree by generating bootstrap values (for NJ and ML trees) or
posterior probabilities (for Bayesian trees).
7. Tree Interpretation
Analyze the tree to identify major clades, gene duplications, and evolutionary relationships
within the gene family. Examine the nodes for statistical support to assess the confidence of the
branching order.
If relevant, interpret the evolutionary history and functional implications based on the tree's
structure.
Creating a supermatrix for a species tree involves combining molecular data (usually from
multiple genes or loci) into a single, comprehensive matrix that can be used to infer the
evolutionary relationships between species.
1. Data Collection
Select Genes/Loci: Choose a set of orthologous genes or loci that are present in all species you
want to include in the tree. Ideally, these genes should be conserved across species but also have
enough variability to resolve species relationships.
Sequence Retrieval: Retrieve the sequences for the selected genes or loci from a variety of
species. These can be obtained from public databases like GenBank, Ensembl, or from your own
sequencing efforts.
2. Gene/Locus Alignment
Sequence Alignment: Align the individual gene or locus sequences using a multiple sequence
alignment tool like Clustal Omega, MAFFT, or MUSCLE. This step is crucial to ensure that
homologous positions across species are aligned properly.
Manual Inspection: After automated alignment, manually inspect the alignments for errors or
ambiguous regions (such as gaps or low-quality sequences) and make necessary adjustments.
3. Concatenate Alignments
Once each gene or locus is individually aligned, concatenate the alignments into a single
supermatrix. This creates a matrix with species as rows and concatenated gene sequences as
columns.
Formatting the Supermatrix: Ensure that the supermatrix is in a format accepted by your
phylogenetic software (e.g., NEXUS, PHYLIP, or FASTA). Record where each gene begins and ends
so that each gene can be treated as a separate block; the final matrix should have one row per
species and one column per site across all loci.
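The concatenation step can be sketched as follows. This is a minimal illustration, not a replacement for dedicated tools: the function name, the gene and species names, and the choice of `?` as the padding character for species that lack a gene are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def concatenate(
    alignments: Dict[str, Dict[str, str]], missing: str = "?"
) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Join per-gene alignments into one supermatrix.

    `alignments` maps gene name -> {species: aligned sequence}.
    Species absent from a gene are padded with `missing`.
    Returns the supermatrix and a list of (gene, start, end) column
    ranges, 1-based and inclusive, one per gene block.
    """
    species = sorted({sp for aln in alignments.values() for sp in aln})
    pieces: Dict[str, List[str]] = {sp: [] for sp in species}
    partitions: List[Tuple[str, int, int]] = []
    start = 1
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for sp in species:
            pieces[sp].append(aln.get(sp, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    matrix = {sp: "".join(parts) for sp, parts in pieces.items()}
    return matrix, partitions


# Hypothetical input: sp2 has no sequence for geneB.
genes = {
    "geneA": {"sp1": "ATGC", "sp2": "ATGA", "sp3": "ATCA"},
    "geneB": {"sp1": "GG-TT", "sp3": "GGATT"},
}
supermatrix, partitions = concatenate(genes)
```

The returned `partitions` list keeps track of which columns belong to which gene, which is the information a partitioned analysis needs later on.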
Filter Sites: Remove poorly aligned regions (e.g., hypervariable regions) and sites with
excessive missing data. Missing data can be handled by removing taxa with too many gaps or by
using inference methods that account for it (e.g., treating gaps as missing data).
Subsampling: If the supermatrix becomes too large (in terms of taxa or loci), consider
subsampling species or loci to avoid computational challenges.
Check for Saturation: Ensure that the genes or loci you are using are not saturated with
substitutions, as this could lead to incorrect tree topologies.
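As one possible way to implement the site-filtering idea above, the sketch below drops alignment columns in which too many taxa have a gap or missing character. The function names, the 25% threshold, and the choice of `-` and `?` as missing characters are assumptions for illustration; real pipelines usually rely on dedicated trimming tools.

```python
from typing import Dict


def filter_sites(
    matrix: Dict[str, str], max_missing: float = 0.5, missing_chars: str = "-?"
) -> Dict[str, str]:
    """Keep only columns whose fraction of missing characters is <= max_missing."""
    taxa = list(matrix)
    n_sites = len(matrix[taxa[0]])
    keep = []
    for col in range(n_sites):
        n_missing = sum(matrix[t][col] in missing_chars for t in taxa)
        if n_missing / len(taxa) <= max_missing:
            keep.append(col)
    return {t: "".join(matrix[t][c] for c in keep) for t in taxa}


def missing_fraction(seq: str, missing_chars: str = "-?") -> float:
    """Fraction of a taxon's sequence that is gap or missing data."""
    return sum(ch in missing_chars for ch in seq) / len(seq)


# Hypothetical supermatrix: the fourth column is a gap in every taxon.
matrix = {"sp1": "ATG-CA", "sp2": "AT--CA", "sp3": "A?G-CT", "sp4": "ATG-C?"}
filtered = filter_sites(matrix, max_missing=0.25)
per_taxon_missing = {t: missing_fraction(s) for t, s in matrix.items()}
```

The per-taxon fractions in `per_taxon_missing` could be used in the same way to decide which taxa to drop.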
5. Model Selection
Choose an Appropriate Substitution Model: For each gene or locus, determine the best-fit
molecular model (e.g., GTR, HKY, or JC for nucleotides, or WAG, JTT, or LG for proteins)
using tools like jModelTest or ProtTest.
Test Homogeneity: Check that the selected loci are evolving under similar models. Significant
differences in evolutionary models across loci may necessitate more complex models (such as
partitioned analyses).
6. Supermatrix Analysis
Phylogenetic Inference: With the concatenated and cleaned supermatrix, perform phylogenetic
analysis using tools like RAxML, IQ-TREE, or MrBayes. These tools allow for maximum
likelihood (ML) or Bayesian inference (BI) of the species tree.
Bootstrap/Posterior Support: Assess the robustness of the tree by generating bootstrap support
values (for ML analyses) or posterior probabilities (for Bayesian analyses).
Supertree approach:
The supertree approach is a method used to construct a species tree by combining multiple
phylogenetic trees (gene trees or other types of trees) that have been inferred for different sets of
loci, genes, or datasets. The main idea behind the supertree approach is to integrate the
information from individual gene trees (which may not be congruent due to incomplete lineage
sorting, horizontal gene transfer, or other factors) into a single, comprehensive species tree.
Sequence Selection: Select genes or loci for which individual phylogenies (gene trees) will be
constructed. These genes should be sufficiently conserved across species but also variable
enough to resolve evolutionary relationships.
Phylogenetic Analysis: For each gene, construct a phylogenetic tree (gene tree) using methods
like Maximum Likelihood (ML), Bayesian Inference (BI), or Neighbor-Joining (NJ). Tools like
RAxML, IQ-TREE, and MrBayes can be used to generate these gene trees.
Multiple Gene Trees: Typically, multiple gene trees are used to increase the robustness of the
final species tree. Different genes may provide different insights into the evolutionary history of
the species.
2. Supertree Construction
The supertree approach involves integrating information from multiple gene trees into a single
species tree. There are several methods to achieve this, including:
Matrix Representation with Parsimony (MRP) Method: This is one of the simplest and most widely
used approaches to construct a supertree. The gene trees are encoded as a single binary matrix,
where each row corresponds to a taxon and each column represents a clade (bipartition) found in
one of the gene trees; taxa absent from a gene tree are scored as missing. The supertree is then
obtained by analyzing this matrix with maximum parsimony, which minimizes the conflict among
the gene trees.
Cladistic Methods: These methods apply traditional cladistic principles to combine multiple
trees. One approach involves creating a matrix of bipartitions, with each bipartition representing
a split in the phylogeny. The goal is to find a consensus tree that preserves as many bipartitions
as possible, despite conflicts between gene trees.
MRC (Matrix Representation with Compatibility): This approach uses the same kind of matrix but
looks for the largest set of mutually compatible columns, so that the supertree reflects the
monophyletic groups that can coexist across gene trees.
Weighted Supertree Methods: These methods assign weights to the input gene trees based on
their confidence levels (e.g., bootstrap support values or posterior probabilities). Gene trees with
higher confidence receive greater weight in determining the final species tree.
Bayesian Supertree Methods: A more sophisticated method involves using Bayesian inference to
build the supertree. This involves integrating over the space of all possible trees, using a
probabilistic framework to account for uncertainty in both the gene trees and the relationships
between them. Summary methods such as ASTRAL are a commonly used alternative; ASTRAL is not
Bayesian, but instead searches for the species tree that agrees with the largest number of
quartets in the input gene trees.
Bayesian Model Averaging (BMA): This method combines different supertree methods by
averaging their probabilities, providing a more robust estimation of the species tree.
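To illustrate the matrix-representation idea from the list above, the sketch below encodes a set of gene trees as a binary matrix. It assumes the common layout with one row per taxon and one column per clade of each input tree: `1` if the taxon is in the clade, `0` if it is in that tree but outside the clade, and `?` if the taxon is absent from that tree. Trees are written as nested tuples of taxon names, which is a representation chosen only for this example; the resulting matrix would still need to be analyzed (for instance with parsimony) to obtain a supertree.

```python
def clades(tree):
    """Return (leaf set, list of clades) for a nested-tuple tree of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leafset = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leafset |= child_leaves
        found.extend(child_clades)
    found.append(leafset)  # the clade defined by this internal node
    return leafset, found


def matrix_representation(trees):
    """Encode trees as {taxon: string of '0', '1', '?'}, one column per clade."""
    all_taxa = sorted(set().union(*(clades(t)[0] for t in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        tree_taxa, tree_clades = clades(tree)
        for clade in tree_clades:
            if clade == tree_taxa:  # the root clade contains everything: uninformative
                continue
            for taxon in all_taxa:
                if taxon not in tree_taxa:
                    rows[taxon].append("?")
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return {taxon: "".join(cells) for taxon, cells in rows.items()}


# Two hypothetical gene trees with partly overlapping taxon sets.
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "B"), "E")
mr = matrix_representation([tree1, tree2])
```

In this toy case the columns are the clades {A, B} and {C, D} from the first tree and {A, B} from the second, so taxa A and B receive identical rows.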
Distance-based methods are rooted in the idea that evolutionary distance between species can
be inferred by comparing overall dissimilarities in their genetic data. These methods calculate a
pairwise distance matrix, where each entry reflects the degree of difference between two taxa.
The distances can be derived using various measures, such as the number of nucleotide or amino
acid differences between homologous sequences, or more sophisticated models like the
Jukes-Cantor or Kimura distances, which correct for multiple substitutions at the same site.
Once the pairwise distances are calculated, the tree is typically constructed using algorithms
like Neighbor-Joining (NJ) or UPGMA (Unweighted Pair Group Method with Arithmetic Mean),
which successively join the taxa judged to be closest. Distance-based methods are
computationally efficient and can handle large datasets, but they simplify the evolutionary
process by focusing only on overall dissimilarity, potentially losing details about the specific
nature of evolutionary changes.
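The difference between a raw count of differences and a corrected distance can be shown in a few lines. The sketch below computes the proportion of differing sites (the p-distance) and the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3) for nucleotide sequences. The function names and example sequences are illustrative assumptions, and sites with a gap in either sequence are simply skipped.

```python
import math


def p_distance(seq1: str, seq2: str, gap_chars: str = "-?") -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    pairs = [
        (a, b) for a, b in zip(seq1, seq2)
        if a not in gap_chars and b not in gap_chars
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance (expected substitutions per site) from a p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance too large: sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment: dict) -> dict:
    """Pairwise Jukes-Cantor distances as a nested dict keyed by taxon name."""
    taxa = list(alignment)
    return {
        a: {b: jukes_cantor(p_distance(alignment[a], alignment[b])) for b in taxa}
        for a in taxa
    }


# Hypothetical pair of sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
d = jukes_cantor(p)
```

The corrected distance `d` is slightly larger than `p`, reflecting substitutions that are assumed to have occurred at the same site and are therefore hidden from a simple count.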
Character-based methods, in contrast, do not summarize the data into a single distance metric
but instead evaluate individual characters (usually nucleotide or amino acid sites) across all taxa.
These methods examine the state of each character (e.g., A, C, G, or T at a particular position in
a sequence) and analyze the most likely evolutionary changes that could have given rise to the
observed data. Common character-based methods include Maximum Parsimony (MP) and
Maximum Likelihood (ML). In Maximum Parsimony, the tree that minimizes the number of
evolutionary changes (steps) required to explain the observed characters is selected as the most
likely tree. Maximum Likelihood, on the other hand, evaluates the probability of observing the
data under a specific evolutionary model and chooses the tree that maximizes this likelihood.
Character-based methods tend to provide more accurate trees, as they consider the detailed
changes at each site and often incorporate more complex models of evolution, though they can
be computationally intensive, especially for large datasets or complex models.
The calculation of these methods differs significantly. In distance-based methods, the first step
is to compute a pairwise distance matrix, which is then used to construct the tree. For example, in
Neighbor-Joining, the algorithm iteratively joins the pair of taxa that minimizes a rate-corrected
distance criterion and updates the distance matrix until all taxa are joined into a tree. In
contrast, character-based methods start with the raw data and search among candidate tree
topologies (exhaustively for small datasets, heuristically for larger ones) to find the one that
best fits the observed data under the chosen criterion. For Maximum Parsimony, this involves
counting the minimum number of changes required to explain the character states across all
taxa. In Maximum Likelihood, complex calculations based on substitution models and tree
topologies are carried out to find the most probable evolutionary scenario.
Advantages and limitations of these methods vary. Distance-based methods are faster and
easier to compute, especially with large datasets, but they can be less accurate, particularly when
evolutionary changes are complex or when convergence is an issue. They rely on a simpler view
of evolution, where only the overall similarity or dissimilarity between sequences is considered,
which can obscure detailed evolutionary processes. Character-based methods, while more
accurate and capable of modeling the complexity of evolutionary processes, require more
computational resources and may struggle with very large datasets, especially when many
different evolutionary models need to be considered.
In summary, distance-based methods provide a more general approach by summarizing the data
into pairwise distances, making them computationally efficient but potentially oversimplifying
evolutionary relationships. Character-based methods, on the other hand, focus on detailed, site-
specific evolutionary changes, offering greater accuracy but at a higher computational cost. The
choice between these methods depends on the size and complexity of the dataset, as well as the
computational resources available. In practice, both methods can complement each other in
phylogenetic analysis, with distance-based methods offering a first-pass exploration and
character-based methods providing a more refined and accurate final result.
Neighbor-Joining (NJ)
The Neighbor-Joining (NJ) method is a widely used distance-based method for constructing
phylogenetic trees. It is a greedy heuristic that aims to keep the total branch length of the tree
small. The process begins with the calculation of a pairwise distance matrix, where each entry
represents the evolutionary distance between a pair of taxa. This matrix can be derived from
various distance metrics, such as corrected counts of nucleotide or amino acid substitutions.
At each step, the NJ algorithm does not simply pick the pair of taxa with the smallest raw
distance. Instead, it picks the pair that minimizes a rate-corrected criterion, usually written Q,
which adjusts each pairwise distance for the total distance of both taxa to all other taxa. These
two taxa are then joined to form a new node, representing their most recent common ancestor.
The distances from the new node to all other taxa are then recalculated: the new node’s distance
to any other taxon is the average of the distances from the two joined taxa to that taxon, minus
half the distance between the two joined taxa. This process is repeated iteratively: in each step,
the pair of taxa (or nodes) with the smallest Q value is joined, the distance matrix is updated,
and the tree grows, with branches representing evolutionary distances. The procedure continues
until all taxa are joined, resulting in an unrooted tree; the tree can be rooted afterwards, for
example with an outgroup.
NJ is computationally efficient, with a running time that grows roughly with the cube of the
number of taxa, making it suitable for larger datasets than character-based methods such as
Maximum Parsimony or Maximum Likelihood. Unlike UPGMA, it does not assume a constant rate
of evolution across lineages. However, NJ has its limitations: it is guaranteed to recover the
correct tree only when the distances are additive, and it reduces the data to pairwise distances,
which can lead to inaccuracies when the distances are poorly estimated or the data violate the
assumptions of the distance model. Nonetheless, NJ remains a popular method for constructing
trees, particularly when computational speed is a priority and when the evolutionary
relationships between taxa are relatively straightforward. In the resulting tree, each branch
length is proportional to the estimated amount of evolutionary change along that branch.
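A compact sketch of the Neighbor-Joining procedure is given below. It uses the standard Q-criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other current nodes, to choose which pair to join. The function name, the `nodeN` labels for internal nodes, and the five-taxon distance matrix are assumptions made for the example; the output is a list of edges of an unrooted tree rather than a formatted tree file.

```python
def neighbor_joining(names, dist):
    """Build an unrooted NJ tree.

    names: list of taxon labels.
    dist:  nested dict with dist[a][b] = distance between a and b (symmetric).
    Returns a list of edges (node, neighbor, branch_length).
    """
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a]: total distance from a to every other current node
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # choose the pair minimizing the Q-criterion
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # branch lengths from the new node u to the two joined nodes
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((u, a, len_a))
        edges.append((u, b, len_b))
        # distances from u to every remaining node
        d[u] = {}
        for k in nodes:
            if k in (a, b):
                continue
            d[u][k] = d[k][u] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges


# Hypothetical additive distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: {y: rows[i][j] for j, y in enumerate(names)} for i, x in enumerate(names)}
edges = neighbor_joining(names, dist)
```

Because this example matrix is additive, the branch lengths along the path between any two taxa in the resulting tree add up to their original distance; with real data that match is only approximate.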
Maximum Parsimony
Maximum Parsimony (MP) is a character-based phylogenetic method that seeks to find the tree
topology that minimizes the total number of evolutionary changes (or steps) required to explain
the observed data. The key idea behind MP is that the simplest explanation, or the one requiring
the fewest changes, is the most likely. In the context of sequence data, this means that the MP
algorithm aims to minimize the number of substitutions (nucleotide or amino acid changes)
across the entire dataset.
The process of building a phylogenetic tree using Maximum Parsimony begins with an
alignment of homologous sequences from different taxa. Each column in the alignment
represents a character (or a site in the sequence), and each taxon has a character state (e.g., a
particular nucleotide or amino acid) for that site. The method then examines all possible tree
topologies, calculating the number of evolutionary changes required to explain the sequence data
under each topology. For each tree, MP considers the possible character states at each internal
node of the tree, which represents the common ancestor of a group of taxa. The algorithm tries to
infer the most likely ancestral state at each node by comparing the states of the child taxa and the
observed data. The number of changes is minimized by selecting the character states at each
internal node that require the fewest evolutionary steps.
To compute the tree, MP evaluates the "parsimony score" for each possible tree topology. This
score is calculated by summing the number of changes (steps) needed to explain all the character
states across all sites in the alignment. The tree with the lowest parsimony score is chosen as the
best-fitting tree, as it represents the simplest hypothesis of evolutionary relationships based on
the observed data. MP is typically implemented through exhaustive searching, heuristic search
methods, or branch-and-bound algorithms that explore tree space efficiently. However,
exhaustive searches can be computationally prohibitive for large datasets due to the vast number
of possible tree topologies, so heuristic methods (e.g., stepwise addition or tree-bisection-
reconnection) are often employed.
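For a single, fixed topology, the parsimony score can be computed with Fitch's algorithm, which is one standard way to do the counting described above (the notes do not name a specific algorithm, so this choice is an assumption). The sketch represents a rooted binary tree as nested tuples of taxon names and uses a small invented alignment; searching over topologies is left out and only three candidate trees are scored.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a tip: its state is observed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children can agree: no extra change needed here
        return common, left_cost + right_cost
    # children disagree: one change is required on a branch below this node
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical four-taxon alignment and the three possible groupings.
alignment = {"A": "ACGT", "B": "ACGA", "C": "GCTA", "D": "GCTT"}
candidates = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(tree, alignment) for name, tree in candidates.items()}
best = min(scores, key=scores.get)
```

With four taxa all three groupings can be scored directly; for larger datasets the same scoring function would be called inside a heuristic search.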
One of the key advantages of MP is its simplicity and interpretability. The method does not rely
on complex probabilistic models or assumptions about the underlying evolutionary process,
making it easier to apply to various types of data. Additionally, MP does not require the
specification of an evolutionary model, which is advantageous when little is known about the
evolutionary processes at play. However, MP also has its limitations. It can be sensitive to
long-branch attraction (LBA) artifacts, where rapidly evolving, distantly related taxa are
incorrectly placed together because they have accumulated similar changes independently.
Moreover, because exhaustive searches are rarely feasible, the heuristic searches used in practice
may not always find the globally optimal tree, and there may be many equally parsimonious
trees. This can lead to a lack of resolution in some cases, particularly when the data are
complex or contain a lot of homoplasy (similarities that are not inherited from a common
ancestor but arise independently in different lineages).
Maximum Likelihood
Maximum Likelihood (ML) is a statistical method used to estimate the most likely
phylogenetic tree based on observed sequence data, under a specific model of evolution. Unlike
Maximum Parsimony, which minimizes the number of evolutionary changes, ML calculates the
probability of observing the data given a particular tree topology and evolutionary model. The
goal is to find the tree that maximizes this likelihood.
The process begins with an alignment of sequences and the selection of an appropriate model of
sequence evolution (such as Jukes-Cantor or GTR). The model specifies parameters like
substitution rates and base frequencies. ML then evaluates the likelihood of observing the data
for each candidate tree topology by considering all possible ways the characters could have
evolved along the tree branches, based on the chosen model. For each site, this involves
summing the probabilities of the observed states over all possible ancestral states at the
internal nodes; the site likelihoods are then multiplied across sites, which are treated as
independent.
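The likelihood calculation for one fixed tree can be sketched with Felsenstein's pruning algorithm under the Jukes-Cantor model. This is a minimal illustration under stated assumptions: branch lengths are fixed rather than optimized, base frequencies are equal (as Jukes-Cantor assumes), and the tree encoding (an internal node is a tuple of `(child, branch_length)` pairs, a tip is a taxon name) and the alignment are invented for the example.

```python
import math

BASES = "ACGT"


def jc69_prob(i: str, j: str, t: float) -> float:
    """Jukes-Cantor probability of base i becoming base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial_likelihoods(node, site_states):
    """Likelihood of the data below `node`, for each possible base at `node`."""
    if isinstance(node, str):  # tip: only the observed base is possible
        return {b: 1.0 if b == site_states[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        child_l = partial_likelihoods(child, site_states)
        for b in BASES:
            # sum over every base the child could have had
            result[b] *= sum(jc69_prob(b, c, length) * child_l[c] for c in BASES)
    return result


def log_likelihood(tree, alignment) -> float:
    """Log-likelihood of an alignment: sum of per-site log-likelihoods."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        partial = partial_likelihoods(tree, states)
        total += math.log(sum(0.25 * partial[b] for b in BASES))
    return total


# Hypothetical alignment and two candidate trees with all branch lengths set to 0.1.
alignment = {"A": "ACGT", "B": "ACGA", "C": "GCTA", "D": "GCTT"}
tree_ab_cd = (((("A", 0.1), ("B", 0.1)), 0.1), ((("C", 0.1), ("D", 0.1)), 0.1))
tree_ac_bd = (((("A", 0.1), ("C", 0.1)), 0.1), ((("B", 0.1), ("D", 0.1)), 0.1))
lnl_ab_cd = log_likelihood(tree_ab_cd, alignment)
lnl_ac_bd = log_likelihood(tree_ac_bd, alignment)
```

In this toy example the tree grouping A with B has the higher log-likelihood. A real ML program would additionally optimize branch lengths and model parameters for every topology it visits.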
Because there are many possible tree topologies, ML uses search algorithms (such as tree
bisection-reconnection) to explore tree space and find the topology that maximizes the
likelihood. Although ML provides accurate and detailed results, it is computationally intensive,
especially for large datasets. The accuracy of the method depends on the correct choice of the
evolutionary model, as a misspecified model can lead to biased results. Despite its computational
cost, ML is widely used for its robustness and ability to account for complex evolutionary
processes.
Bootstrapping
1. Resampling:
   • Columns (or sites) in the sequence alignment are randomly resampled with
     replacement to generate new datasets of the same size as the original alignment.
   • This process captures variability and simulates the effect of sampling error.
2. Tree Reconstruction:
   • For each bootstrapped dataset, a phylogenetic tree is reconstructed using the
     chosen method (e.g., maximum likelihood, neighbor-joining, or parsimony).
3. Consensus Tree:
   • The trees from all bootstrap replicates are combined into a single consensus tree.
   • Each branch is annotated with bootstrap support values, representing the
     percentage of replicates in which that branch appears.
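The three steps above can be sketched as follows. The resampling function draws alignment columns with replacement, and the support function counts how often each clade is recovered. Everything named here is an assumption for illustration: `build_tree` stands for whatever tree-building method is used and is expected to return the clades of its tree as sets of taxon names, and `closest_pair_clade` is only a toy stand-in that reports the most similar pair of sequences as a clade.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, n_replicates=100, seed=0):
    """Percentage of replicates in which each clade appears.

    build_tree(alignment) must return an iterable of clades (frozensets of taxa).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        counts.update(set(build_tree(replicate)))
    return {clade: 100.0 * n / n_replicates for clade, n in counts.items()}


def closest_pair_clade(alignment):
    """Toy stand-in for a tree builder: the most similar pair forms the only clade."""
    taxa = sorted(alignment)

    def n_diff(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    pairs = [(a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:]]
    a, b = min(pairs, key=lambda pair: n_diff(*pair))
    return [frozenset((a, b))]


# Hypothetical four-taxon alignment.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "TCGAACTTGC",
    "D": "TCGAACTTGG",
}
support = bootstrap_support(alignment, closest_pair_clade, n_replicates=50, seed=42)
```

In a real analysis, `build_tree` would wrap a call to the chosen inference method, and the support values would be written onto the branches of the tree inferred from the original alignment or of a consensus tree.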
Interpretation:
• High Bootstrap Values (e.g., >70% or >95%): Indicate strong support for a clade,
suggesting it is a robust feature of the data.
• Low Bootstrap Values (<50%): Suggest weak or uncertain support for a clade, implying
it may not reflect true evolutionary relationships.
Applications:
Limitations:
• Resampling Bias: Bootstrapping assumes sites are independent, which may not hold true
due to factors like linkage disequilibrium or shared evolutionary constraints.
• Overconfidence: Very high support values might still not guarantee accuracy if the
model or input data are flawed.
Standard bootstrap is a widely used method for assessing statistical confidence in phylogenetic
trees. In this approach, multiple replicate datasets are generated by resampling the original
alignment with replacement, ensuring each column (site) in the alignment has an equal chance of
being included in the replicates. A phylogenetic tree is then reconstructed for each replicate, and
the frequency with which a particular branch (or clade) appears across all replicates is calculated.
This frequency, expressed as a percentage, is used as the bootstrap support value for the branch.
Standard bootstrap is computationally intensive because it requires constructing a full
phylogenetic tree for each replicate, making it particularly demanding for large datasets or
complex tree-building methods. Moreover, it is known to provide reliable support values but
may be conservative, often underestimating support for well-supported clades.
In essence, the primary difference between standard bootstrap and the ultrafast bootstrap
(UFBoot, implemented in IQ-TREE) lies in computational methodology and efficiency.
Standard bootstrap emphasizes methodological rigor, requiring a full tree reconstruction for
every replicate, albeit at a high computational cost. UFBoot sacrifices some of this
rigor for speed by leveraging approximations and computational shortcuts, providing a practical
alternative for large and complex phylogenetic analyses without significantly compromising
accuracy in most scenarios.
Posterior probability support, on the other hand, is derived from Bayesian inference. In this
approach, the posterior probability of a branch is calculated as the proportion of sampled trees
(from a posterior distribution) in which the branch appears, given the data and a specified
evolutionary model. Unlike bootstrap support, posterior probability incorporates prior
information about the parameters and the model, combining it with the likelihood of the
observed data to estimate the probability of a branch being correct. Posterior probabilities
range from 0 to 1 and can be interpreted as the probability that a clade is true, conditional on
the data, the model, and the priors.
The choice between the two measures depends on the context and goals of the analysis. Posterior
probability support often gives higher values than bootstrap support for the same data, as it
benefits from the incorporation of prior information and the inherent assumptions of Bayesian
methods. However, it may overestimate support if the priors or models are misspecified or if the
alignment contains systematic biases. Bootstrap support, by contrast, does not depend on priors
and tends to be more conservative, making it a safer option when confidence in model
assumptions is low or when minimizing the risk of overestimation is crucial.
ProtTest and ModelFinder are two popular tools used in molecular evolution studies to identify the
best-fit models of protein sequence evolution for phylogenetic analysis. Both tools aim to select
the most appropriate model to infer evolutionary relationships based on sequence data. While they
serve similar purposes, they operate in different ways and offer unique features.
ProtTest:
ProtTest is a software tool designed for selecting the most suitable model of protein evolution from
a set of candidate models based on a given protein alignment. It evaluates a variety of candidate
models based on their statistical likelihood and ranks them according to how well they fit the data.
Model Selection: ProtTest evaluates a wide range of models, including both standard models like
the JTT, WAG, LG, and more complex models that account for site-specific rate variation or rate
heterogeneity.
Model Testing Criteria: It uses model comparison methods like Akaike Information Criterion (AIC),
Bayesian Information Criterion (BIC), and other likelihood-based criteria to identify the best model.
Assessing Substitution Patterns: ProtTest models substitution patterns, accounting for differences
in amino acid frequencies, exchangeabilities, and rate variation across sites. It also allows testing
models that include gamma-distributed rate heterogeneity.
Use of Statistical Tests: It implements likelihood ratio tests (LRT) to compare nested models,
helping to determine whether more complex models improve the fit significantly.
Comprehensive Model Library: ProtTest includes a large set of protein evolution models, both for
simpler and more complex scenarios.
Steps Involved:
The tool provides a ranked list of models, with the best model according to different statistical
tests.
ModelFinder:
ModelFinder is another model selection tool, which is often considered part of the IQ-TREE
phylogenetic software suite. It aims to find the best-fit model of nucleotide or protein evolution
using sophisticated algorithms and statistical approaches.
Efficient Search Algorithms: ModelFinder searches the set of candidate models efficiently rather
than relying on a fixed list alone and, for partitioned data, can use a greedy strategy to merge
partitions that are best described by the same model.
Comprehensive Models: Like ProtTest, ModelFinder offers a wide range of models for protein and
nucleotide sequences, including complex models that account for site-specific rate variation and
other evolutionary features.
Automatic Model Selection: ModelFinder automatically selects the best-fit model by using
information-theoretic criteria, such as the AIC, BIC, and the corrected Akaike Information Criterion
(AICc), and likelihood-based tests.
Integrated with IQ-TREE: ModelFinder is integrated with IQ-TREE, a powerful phylogenetic inference
tool, which can immediately apply the best-fit model to build phylogenetic trees.
Flexibility with Data: It is versatile and can handle not just protein sequences but also nucleotide
sequences. It also supports partitioned models, where different regions of a dataset can evolve
under different models.
Steps Involved:
ModelFinder evaluates various models of evolution, considering site-specific rate variation and
other advanced features.
It then outputs the best-fit model, ranked according to AIC, BIC, or other criteria.
ProtTest uses a simpler approach based on comparing predefined models using statistical criteria.
ModelFinder employs more advanced search strategies, such as greedy algorithms and parallel
computing, which may give it an edge for large datasets or complex model evaluations.
ModelFinder is part of the IQ-TREE software suite, meaning it is often used in conjunction with tree-
building algorithms for phylogenetic analysis.
Complexity of Models:
ProtTest offers a wide range of models, but its focus is generally on more traditional substitution
models, often limited to simpler structures.
ModelFinder tends to support a broader range of complex models, including those that account for
site-specific substitution patterns and rate heterogeneity.
ModelFinder is optimized for larger datasets and offers parallel computation for faster
performance.
ProtTest has a more straightforward, command-line-based interface, but it may be less flexible for
partitioned or complex evolutionary models.
ModelFinder is highly flexible, with integrated support for handling multiple data partitions and
large-scale analyses in conjunction with phylogenetic tree inference.
Overall, ProtTest is a robust tool for selecting protein evolutionary models, known for its simplicity
and ease of use, but with somewhat less flexibility than ModelFinder. ModelFinder, part of IQ-TREE,
provides advanced methods for model selection, offers more flexibility for complex datasets, and is
integrated with phylogenetic tree inference software, making it better suited for large-scale
analyses. In essence, ProtTest is a more standalone, easy-to-use tool for model selection, whereas
ModelFinder is more advanced, scalable, and integrated within the IQ-TREE framework for
phylogenetic analysis. The choice between them would depend on the complexity of the analysis,
the size of the dataset, and whether you need to infer phylogenetic trees alongside model selection.