Bayesian Evolutionary Analysis With BEAST PDF
ALEXEI J. DRUMMOND
University of Auckland, New Zealand
REMCO R. BOUCKAERT
University of Auckland, New Zealand
University Printing House, Cambridge CB2 8BS, United Kingdom
www.cambridge.org
Information on this title: www.cambridge.org/9781107019652
© Alexei J. Drummond and Remco R. Bouckaert 2015
Preface
Acknowledgements
Summary of most significant capabilities of BEAST 2
Part I Theory
1 Introduction
1.1 Molecular phylogenetics
1.2 Coalescent theory
1.3 Virus evolution and phylodynamics
1.4 Before and beyond trees
1.5 Probability and Bayesian inference
2 Evolutionary trees
2.1 Types of trees
2.2 Counting trees
2.3 The coalescent
2.4 Birth–death models
2.5 Trees within trees
2.6 Exercise
Part II Practice
References
Index of authors
Index of subjects
Preface
This book consists of three parts: theory, practice and programming. The theory part
covers theoretical background, which you need to gain some insight into the various
components of a phylogenetic analysis. This includes trees, substitution models, clock
models and, of course, the machinery used in Bayesian analysis such as Markov chain
Monte Carlo (MCMC) and Bayes factors.
In the practice part we start with a hands-on phylogenetic analysis and explain how
to set up, run and interpret such an analysis. We examine various choices of prior, where
each is appropriate, and how to use software such as BEAUti, FigTree and DensiTree
to assist in a BEAST analysis. Special attention is paid to advanced analysis such as
sampling from the prior, demographic reconstruction, phylogeography and inferring
species trees from multilocus data. Interpreting the results of an analysis requires some
care, as explained in the post-processing chapter, which has a section on troubleshooting
with tips on detecting and preventing failures in MCMC analysis. A separate chapter is
dedicated to visualising phylogenies.
BEAST 2.2 uses XML as a file format to specify various kinds of analysis. In the
third part, the XML format and its design philosophy are described. BEAST 2.2 was
developed as a platform for creating new Bayesian phylogenetic analysis methods, by a
modular mechanism for extending the software. In the programming part we describe
the software architecture and guide you through developing BEAST 2.2 extensions.
We recommend that everyone reads Part I for background information, especially
introductory Chapter 1. Part II and Part III can be read independently. Users of BEAST
should find much practical information in Part II, and may want to read about the XML
format in Part III. Developers of new methods should read Part III, but will also find
useful information about various methods in Part II.
The BEAST software can be downloaded from http://beast2.org and for developers,
source code is available from https://github.com/CompEvol/beast2/. There is a lot of
practical information available at the BEAST 2 wiki (http://beast2.org), including links
to useful software such as Tracer and FigTree, a list of the latest packages and links
to tutorials. The wiki is updated frequently. A BEAST users’ group is accessible at
http://groups.google.com/group/beast-users.
Summary of most significant capabilities of BEAST 2
This book is part science, part technical, and all about the computational analysis of
heritable traits: things like genes, languages, behaviours and morphology. This book is
centred around the description of the theory and practice of a particular open source
software package called BEAST (Bayesian evolutionary analysis by sampling trees).
The BEAST software package started life as a small science project in New Zealand
but it has since grown tremendously through the contributions of many scientists from
around the world, chief among them the research groups of Alexei Drummond, Andrew
Rambaut and Marc Suchard. A full list of contributors to the BEAST software package
can be found on the BEAST GitHub page or printed to the screen when running the
software.
Very few things challenge the imagination as much as does evolution. Every living
thing is the result of the unfolding of this patient process. While the basic concepts of
Darwinian evolution and natural selection are second nature to many of us, it is the
detail of life’s tapestry which still inspires an awe of the natural world. The scientific
community has spent a couple of centuries trying to understand the intricacies of the
evolutionary process, producing thousands of scientific articles on the subject. Despite
this Herculean effort, it is tempting to say that we have only just scratched the surface.
As with many other fields of science, the study of biology has rapidly become dominated
by the use of computers in recent years. Computers are the only way that biologists
can effectively organise and analyse the vast amounts of genomic data that are now
being collected by modern sequencing technologies. Although this revolution of data
has really only just begun, it has already resulted in a flourishing industry of computer
modelling of molecular evolution.
This book has the modest aim of describing this still new computational science of
evolution, at least from the corner we are sitting in. In writing this book we have not
aimed for it to be comprehensive and gladly admit that we mostly focus on the models
that the BEAST software currently supports. Dealing, as we do, with computer models
of evolution, there is a healthy dose of mathematics and statistics. However, we have
made a great effort to describe in plain language, as clearly as we can, the essential
concepts behind each of the models described in this book. We have also endeavoured
to provide interesting examples to illustrate and introduce each of the models. We hope
you enjoy it.
1.1 Molecular phylogenetics
The informational molecules central to all biology are deoxyribonucleic acid (DNA),
ribonucleic acid (RNA) and protein sequences. These three classes of molecules are
commonly referred to in the molecular evolutionary field as molecular sequences, and
from a mathematical and computational point of view an informational molecule can
often be treated simply as a linear sequence of symbols on a defined alphabet (see
Figure 1.1). The individual building blocks of DNA and RNA are known as nucleotides,
while proteins are composed of 20 different amino acids. For most life forms it is
the DNA double-helix that stores the essential information underpinning the biological
function of the organism and it is the (error-prone) replication of this DNA that
transmits this information from one generation to the next. Given that replication is
a binary reaction that starts with one genome and ends with two similar if not identical
genomes, it is unsurprising that the natural and appropriate structure for visualising the
replication process over multiple generations is a bifurcating tree. At the broadest scale
of consideration the structure of this tree represents the relationships between species
and higher-order taxonomic groups. But even when considering a single gene within a
single species, the ancestral relationships among genes sampled from that species will
be represented by a tree. Such trees have come to be referred to as phylogenies and it
is becoming clear that the field of molecular phylogenetics is relevant to almost every
scientific question that deals with the informational molecules of biology. Furthermore,
many of the concepts developed to understand molecular evolution have turned out to
transfer with little modification to the analysis of other types of heritable information in
natural systems, including language and culture. It is unsurprising then that a book on
computational evolutionary analysis would start with phylogenetics.
The study of phylogenetics is principally concerned with reconstructing the evolutionary
history (phylogenetic tree) of related species, individuals or genes. Although
algorithmic approaches to phylogenetics pre-date genetic data, it was the availability of
genetic data, first allozymes and protein sequences, and then later DNA sequences, that
provided the impetus for development in the area.
A phylogenetic tree is estimated from some data, typically a multiple sequence alignment
(see Figure 1.2), representing a set of homologous (derived from a common ancestor)
genes or genomic sequences that have been aligned, so that their comparable regions
are matched up. The process of aligning a set of homologous sequences is itself a hard
computational problem, and is in fact entangled with that of estimating a phylogenetic
tree (Lunter et al. 2005; Redelings and Suchard 2005). Nevertheless, following convention
we will – for the most part – assume that a multiple sequence alignment is known
and predicate phylogenetic reconstruction on it.
DNA {A,C,G,T}
RNA {A,C,G,U}
Proteins {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
Figure 1.1 The alphabets of the three informational molecular classes.
Figure 1.2 A small multiple sequence alignment of mitochondrial sequence fragments isolated
from 12 species of primate. The alignment has 898 alignment columns and the individual
sequences vary in length from 893 to 896 nucleotides. Individual differences from the
consensus sequence are highlighted. 373/898 (41.5%) sites are identical across all 12 species and
the average pairwise identity is 75.7%. The data matrix size is 10 776 (898 × 12) with only 30
gap states. This represents a case in which obtaining an accurate multiple sequence alignment
from unaligned sequences is quite easy and taking account of alignment uncertainty is probably
unnecessary for most questions.
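Summary statistics of the kind quoted in the caption of Figure 1.2 are simple to compute once an alignment is held as a list of equal-length strings. The following Python sketch is illustrative only: the three-sequence toy alignment and the function names are our own assumptions, and a gap is treated as just another character state, which may differ from the convention used for the figure.

```python
from itertools import combinations


def identical_sites(alignment):
    """Count alignment columns in which every sequence has the same character."""
    return sum(1 for column in zip(*alignment) if len(set(column)) == 1)


def average_pairwise_identity(alignment):
    """Mean fraction of matching columns over all pairs of sequences."""
    length = len(alignment[0])
    pairs = list(combinations(alignment, 2))
    total = sum(sum(a == b for a, b in zip(s, t)) / length for s, t in pairs)
    return total / len(pairs)


# Hypothetical toy alignment (not the primate data); '-' denotes a gap.
toy = ["ACGTAC", "ACGTTC", "ACG-AC"]
matrix_size = len(toy) * len(toy[0])  # rows times columns, as in 898 x 12

print(identical_sites(toy), matrix_size)
print(round(average_pairwise_identity(toy), 4))
```

For a real data set the same two functions can be applied to the rows of the alignment after reading them from file.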
1.2 Coalescent theory
When a gene tree has been estimated from individuals sampled from the same popu-
lation, statistical properties of the tree can be used to learn about the population from
which the sample was drawn. In particular the size of the population can be estimated
using Kingman’s n-coalescent, a stochastic model of gene genealogies described by
Kingman (1982). Coalescent theory has developed greatly in the intervening decades
and the resulting genealogy-based population genetics methods are routinely used to
infer many fundamental parameters governing molecular evolution and population
dynamics, including effective population size (Kuhner et al. 1995), rate of population
growth or decline (Drummond et al. 2002; Kuhner et al. 1998), migration rates and
population structure (Beerli and Felsenstein 1999, 2001; Ewing and Rodrigo 2006a;
Ewing et al. 2004), recombination rates and reticulate ancestry (Bloomquist and
Suchard 2010; Kuhner et al. 2000).
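To make the link between a genealogy and population size concrete, the sketch below simulates the waiting times between coalescent events for a sample of n lineages. It assumes a haploid population of constant size N with time measured in generations, so that k lineages coalesce at rate k(k − 1)/(2N); other scalings are in common use. The function names and parameter values are hypothetical.

```python
import random


def simulate_coalescent_intervals(n, pop_size, rng):
    """Waiting times while n, n-1, ..., 2 lineages remain (constant population size)."""
    intervals = []
    for k in range(n, 1, -1):
        # k(k-1)/2 pairs of lineages, each pair coalescing at rate 1/pop_size
        rate = k * (k - 1) / (2.0 * pop_size)
        intervals.append(rng.expovariate(rate))
    return intervals


def expected_root_height(n, pop_size):
    """Sum of the expected waiting times: 2N(1 - 1/n) under the scaling assumed here."""
    return 2.0 * pop_size * (1.0 - 1.0 / n)


rng = random.Random(1)
n, pop_size = 10, 1000.0
heights = [sum(simulate_coalescent_intervals(n, pop_size, rng)) for _ in range(20000)]
mean_height = sum(heights) / len(heights)

print(mean_height, expected_root_height(n, pop_size))
```

Because the expected tree height scales linearly with N, observed coalescent times carry information about the population size, which is the intuition behind coalescent-based estimators.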
When the characteristic time scale of demographic fluctuations is comparable to the
rate of accumulation of substitutions, past population dynamics are ‘recorded’ in
the substitution patterns of molecular sequences.
The coalescent process is highly variable, so sampling multiple unlinked loci
(Felsenstein 2006; Heled and Drummond 2008) or increasing the temporal spread
of sampling times (Seo et al. 2002) can both be used to increase the statistical power
of coalescent-based methods and improve the precision of estimates of both population
size and substitution rate (Seo et al. 2002).
In many situations the precise functional form of the population size history is
unknown, and simple population growth functions may not adequately describe the
population history of interest. Non-parametric coalescent methods provide greater
flexibility by estimating the population size as a function of time directly from the
sequence data and can be used for data exploration to guide the choice of parametric
population models for further analysis. These methods first cut the time-tree into
segments, then estimate the population size of each segment separately according to the
coalescent intervals within it.
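The idea of estimating one population size per segment can be sketched in its simplest form, with one segment per coalescent interval (often called a classic skyline plot). The code below is a minimal illustration under the same assumed scaling as a constant-size coalescent with rate k(k − 1)/(2N); it is not the estimator implemented in any particular package, and the interval lengths are made-up numbers.

```python
def classic_skyline(interval_lengths, n_samples):
    """Estimate a population size for each coalescent interval.

    interval_lengths[i] is the length of time during which n_samples - i
    lineages exist. With k lineages the expected interval length is
    2N / (k(k-1)), so length * k(k-1)/2 is a moment estimate of N.
    """
    estimates = []
    for i, length in enumerate(interval_lengths):
        k = n_samples - i
        estimates.append(length * k * (k - 1) / 2.0)
    return estimates


# Hypothetical interval lengths for a tree of four contemporaneous samples.
intervals = [10.0, 25.0, 120.0]
print(classic_skyline(intervals, 4))
```

Single-interval estimates are very noisy, which is one reason practical methods group several intervals into each segment.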
Recently there has been renewed interest in developing mathematical modelling
approaches and computational tools for investigating the interface between population
processes and species phylogenies. The multispecies coalescent is a model of gene
coalescence within a species tree (Figure 1.3; see Section 2.5.1 and Chapter 8 for
further details). There is currently a large amount of development of phylogenetic
inference techniques based on the multispecies coalescent model (Bryant et al. 2012;
Heled and Drummond 2010; Liu et al. 2008, 2009a,b). Its many advantages over
standard phylogenetic approaches centre around the ability to take into account real
differences in the underlying gene tree among genes sampled from the same set of
individuals from closely related species.

Figure 1.3 A four-taxon species tree (G. fortis, G. magnirostris, C. parvulus and C. olivacea) with an
embedded gene tree that relates multiple individuals
from each of the sampled species. The species tree has (linear) population size functions
associated with each branch, visually represented by the width of each species branch on the
x-axis (population). The y-axis is a measure of time.

Due to incomplete lineage sorting it is possible
for unlinked genes from the same set of multispecies individuals to have different
gene topologies, and for a particular gene to exhibit a gene tree that has different
relationships among species than the true species tree. The multispecies coalescent can
be employed to estimate the common species tree that best reconciles the
coalescent-induced differences among genes, and provides more accurate estimates of divergence
time and measures of topological uncertainty in the species tree. This exciting new
field of coalescent-based species tree estimation is still in its infancy and there are
many promising directions for development, including incorporation of population
size changes (Heled and Drummond 2010), isolation with migration (Hey 2010),
recombination and lateral gene transfer, among others.
1.3 Virus evolution and phylodynamics
A number of good recent reviews have been written about the impact of statistical
phylogenetics and evolutionary analysis on the study of viral epidemiology (Kühnert
et al. 2011; Pybus and Rambaut 2009; Volz et al. 2013). Although epidemic modelling
of infectious diseases has a long history in both theoretical and empirical research, the
term phylodynamics has a recent origin reflecting a move to integrate theory from mathematical
epidemiology and statistical phylogenetics into a single predictive framework
for studying viral evolutionary and epidemic dynamics. Many RNA viruses evolve so
quickly that their evolution can be directly observed over months, years and decades
(Drummond et al. 2003a). Figure 1.4 illustrates the effect that this has on the treatment
of phylogenetic analysis.
Molecular phylogenetics has had a profound impact on the study of infectious diseases,
particularly rapidly evolving infectious agents such as RNA viruses. It has given
insight into the origins, evolutionary history, transmission routes and source populations
of epidemic outbreaks and seasonal diseases. One of the key observations about rapidly
evolving viruses is that the evolutionary and ecological processes occur on the same
time scale (Pybus and Rambaut 2009). This is important for two reasons. First, it means
that neutral genetic variation can track ecological processes and population dynamics,
providing a record of past evolutionary events (e.g. genealogical relationships) and past
ecological/population events (geographical spread and changes in population size and
structure) that were not directly observed. Second, the concomitance of evolutionary
and ecological processes leads to their interaction that, when non-trivial, necessitates
joint analysis.
Figure 1.4 A hypothetical serially sampled gene tree of a rapidly evolving virus, showing that the
sampling time interval (Δt = t2 − t0) represents a substantial fraction of the time back to the
common ancestor. Red circles represent sampled viruses (three viruses sampled at each of three
times, t2, t1 and t0) and yellow circles represent unsampled common ancestors.
1.4 Before and beyond trees
dynamic programming, hidden Markov models and optimisation algorithms have been
applied to this task. ClustalW and ClustalX (Larkin et al. 2007) are limited but widely
used programs for this task. The Clustal algorithm uses a guide tree constructed by
a distance-based method to progressively construct a multiple sequence alignment via
pairwise alignments. Since most Bayesian phylogenetic analyses aim to reconstruct a
tree, and the Clustal algorithm already assumes a tree to guide the alignment, it may be
that a bias is introduced towards recovering the guide tree. This guide tree is based on a
relatively simple model and it may contain errors resulting in sub-optimal alignments.
T-Coffee (Notredame et al. 2000) is another popular program that builds a library of
pairwise alignments to guide the construction of the complete alignment.
With larger data sets comes a need for high-throughput multiple sequence alignment
algorithms; two popular and fast multiple sequence alignment algorithms are MUSCLE
(Edgar 2004a,b) and MAFFT (Katoh and Standley 2013, 2014; Katoh et al. 2002).
However, a more principled approach in line with the philosophy of this book is
to perform statistical alignment in which phylogenetic reconstructions and sequence
alignments are simultaneously evaluated in a joint Bayesian analysis (Arunapuram et
al. 2013; Bradley et al. 2009; Lunter et al. 2005; Novák et al. 2008; Redelings and
Suchard 2005; Suchard and Redelings 2006). It has been shown that uncertainty in the
alignment can lead to different conclusions (Wong et al. 2008), but in most cases it is
hard to justify the extra computational effort required, and statistical alignment is not
yet available in BEAST 2.2.
Ancestral recombination graphs: A phylogenetic tree is not always sufficient
to reflect the sometimes complex evolutionary origin of a set of homologous gene
sequences when processes such as recombination, reassortment, gene duplication or
lateral gene transfer are involved in the evolutionary history.
Coalescent theory has been extended to account for recombination due to
homologous crossover (Hudson 1990) and the ancestral recombination graph (ARG)
(Bloomquist and Suchard 2010; Griffiths and Marjoram 1996; Kuhner 2006; Kuhner et
al. 2000) is the combinatorial object that replaces a phylogenetic tree as the description
of the ancestral (evolutionary) history. However, in this book we will limit ourselves to
trees.
1.5 Probability and Bayesian inference
At the heart of this book is the idea that much of our understanding about molecular
evolution and phylogeny will come from a characterisation of the results of random or
stochastic processes. The sources of this randomness are varied, including the vagaries
of chance that drive processes like mutation, birth, death and migration. An appropriate
approach to modelling data that are generated by random processes is to consider the
probabilities of various hypotheses given the observed data. In fact the concept of
probability and the use of probability calculus within statistical inference procedures
is pervasive in this book. We will not, however, attempt to introduce the concept of
probability or inference in any formal way. We suggest Bolstad (2011), Brooks et al.
(2010), Gelman et al. (2004), Jaynes (2003) and MacKay (2003) for a more thorough introduction
or reminder about this fundamental material. In this section we will just lay out some of
the terms, concepts and standard relationships and give a brief introduction to Bayesian
inference.
A conditional probability Pr(X = x|Y = y) gives the probability that random variable
X takes on value x, given the condition that the random variable Y takes on value $y \in S_Y$.
This can be shortened to Pr(x|y) and the following relation exists:

$$\Pr(x) = \sum_{y \in S_Y} \Pr(x \mid y)\,\Pr(y), \tag{1.1}$$

and Pr(x) is known as the marginal probability of X = x in the context of the joint
probability distribution on (X, Y).
A probability density function f(x) defines a probability distribution over a continuous
parameter $x \in S_X$ (where $S_X$ is now a continuous space) so that the integral of the
density over the sample space equals 1:

$$\int_{x \in S_X} f(x)\,dx = 1,$$

and f(x) is everywhere non-negative, but may take on values greater than 1 (see
Figure 1.5 for an example). In this case the random variable X is continuous, and takes
on a value in set $E \subseteq S_X$ with probability

$$\Pr(X \in E) = \int_{x \in E} f(x)\,dx.$$

This expresses the relationship between probabilities and probability densities on continuous
random variables, where the left-hand side of the equation represents the probability
of the continuous random variable X taking on a value in the set E, and the
right-hand side is the area under the density function in the set E (see Figure 1.5). If X
is univariate real and E is the interval [a, b] then we have:

$$\Pr(a \le X \le b) = \int_a^b f(x)\,dx.$$

When X has probability density f(·) we write X ∼ f(·). Finally, the expectation E(·)
of a random variable X ∼ f(·) can also be computed by an integral:

$$E(X) = \int_{x \in S_X} x f(x)\,dx.$$
Figure 1.5 A probability density function $f(x) = 2e^{-2x}$ (i.e. an exponential distribution with a
rate of 2), plotted for x between 0 and 2. The area under the curve for
$\Pr(0.5 \le X \le 1.0) = \int_{0.5}^{1.0} f(x)\,dx \approx 0.2325442$ is
filled in.
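The relationships between a density, a probability and an expectation can be checked numerically for the exponential density of Figure 1.5. The sketch below uses a simple midpoint rule; the function names, the number of integration steps and the truncation of the sample space at x = 50 are our own choices for illustration.

```python
import math


def density(x, rate=2.0):
    """Exponential probability density with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def integrate(func, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


prob = integrate(density, 0.5, 1.0)                     # Pr(0.5 <= X <= 1.0)
exact = math.exp(-1.0) - math.exp(-2.0)                 # closed form for comparison
total = integrate(density, 0.0, 50.0)                   # should be close to 1
mean = integrate(lambda x: x * density(x), 0.0, 50.0)   # E(X) = 1/rate

print(prob, exact, total, mean, density(0.0))
```

Note that the density at x = 0 is 2, illustrating that a density may exceed 1 even though every probability obtained by integrating it cannot.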
$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)},$$
1 We use Pr(·) to denote probability distributions over discrete states, such as an alignment. Further, we use
f (·) to denote densities over continuous spaces, such as rates or divergence times. Where the space is partly
discrete and partly continuous, such as for (time-)trees, we will use the notation for densities.
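As a small worked instance of Bayes' formula, consider a parameter θ that can take only two values. The sketch below is a toy example of our own devising (a coin with probability of heads 0.5 or 0.8, equal prior weight on each, and 8 heads observed in 10 tosses); none of these numbers come from the text.

```python
from math import comb


def binomial_likelihood(theta, heads, tosses):
    """Probability of the observed number of heads given theta."""
    return comb(tosses, heads) * theta**heads * (1.0 - theta) ** (tosses - heads)


def posterior(prior, likelihood):
    """Apply Bayes' formula over a finite set of parameter values."""
    evidence = sum(prior[t] * likelihood[t] for t in prior)  # p(D)
    return {t: prior[t] * likelihood[t] / evidence for t in prior}


prior = {0.5: 0.5, 0.8: 0.5}
likelihood = {theta: binomial_likelihood(theta, 8, 10) for theta in prior}
post = posterior(prior, likelihood)

print(post)
```

The denominator p(D) only rescales the products of prior and likelihood so that the posterior sums to 1, which is why it can be ignored when only ratios of posterior values are needed.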
$$f(\theta) = \sqrt{\det(I(\theta))}, \tag{1.3}$$
where I(·) is the Fisher Information matrix of θ and det(·) is the determinant of a
matrix. This prior works well on one-dimensional problems, but its limitations for larger
problems have led to the development of reference priors (Berger and Bernardo 1992).
Non-informative priors hold a natural appeal, especially for proponents of focusing
exclusively on the evidence at hand. However, such priors can also often be improper
in a formal sense if they don’t have a finite integral. Such improper priors are extremely
dangerous, and should be treated with great caution. For example, certain Bayesian
methods (such as path sampling, see Section 1.5.7) require proper priors and will give
undefined results if improper priors are used.
Figure 1.6 Left, posterior landscapes can contain many local optima. The MCMC sampler aims
to return more samples from high posterior areas and fewer from low posterior regions. If run for
sufficiently long, the sampler will visit all points. Right, the MCMC ‘robot’ evaluates a proposal
location in the posterior landscape. If the proposed location is better, it accepts the proposal and
moves to the new location. If the proposed location is worse, it calculates the ratio of the
posterior at the new location and that of the current location and accepts a step with probability
equal to this ratio. So, if the proposed location is slightly worse, it will be accepted with high
probability, but if the proposed location is much worse, it will almost never be accepted. The right
panel is annotated with three example proposals: a step from 18 to 19.5 is always accepted because
it is an increase; a step from 14 to 12 is accepted with probability 12/14; and a step from 10 to 2 is
accepted with probability 2/10 = 0.2.
a single parameter. If the proposal distribution is symmetric, the new state is accepted
with probability

$$\min\left(1, \frac{f(\theta' \mid D)}{f(\theta \mid D)}\right), \tag{1.4}$$

hence if the posterior of the new state $\theta'$ is better it is always accepted, and if it is worse
a random number is drawn from the unit interval and the new state is accepted only if that number is
less than the ratio $f(\theta' \mid D)/f(\theta \mid D)$. Either way, the step count
of the chain is incremented.
A component of the proposal distribution may be termed an operator. Since a single
operator typically only moves the state a small amount, the states $\theta$ and $\theta'$ will be
highly dependent. However, after a sufficiently large number of steps, the states will
become independent samples from the posterior. So, once every fixed number of steps, the
posterior and various attributes of the state are sampled and stored in a trace log and
tree logs. Note that if the number of steps between samples is too small, subsequent samples will still be
autocorrelated, which means that the number of samples can be larger than the effective
sample size (ESS).
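The acceptance rule for a symmetric proposal, together with logging once every fixed number of steps, can be sketched in a few lines. The code below is a toy illustration, not the BEAST implementation: the target is an assumed standard normal 'posterior' on a single parameter, the operator is a uniform random-walk window, and all names and settings are hypothetical.

```python
import math
import random


def log_posterior(theta):
    """Unnormalised log density of a standard normal (an assumed toy target)."""
    return -0.5 * theta * theta


def metropolis(n_steps, window, log_every, rng, start=0.0):
    """Random-walk Metropolis sampler that logs the state every log_every steps."""
    theta, samples = start, []
    for step in range(1, n_steps + 1):
        proposal = theta + rng.uniform(-window, window)  # symmetric proposal
        log_ratio = log_posterior(proposal) - log_posterior(theta)
        # A better state is always accepted; a worse one with probability = ratio.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta = proposal
        # The step count advances whether or not the proposal was accepted.
        if step % log_every == 0:
            samples.append(theta)
    return samples


samples = metropolis(200000, 1.0, 10, random.Random(42))
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

print(len(samples), mean, variance)
```

Working with log densities avoids numerical underflow, and increasing `log_every` reduces the autocorrelation between logged samples at the cost of storing fewer of them.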
It is a fine art to design and compose proposal distributions and the operators that
implement them. The efficiency of MCMC algorithms crucially depends on a good
mix of operators forming the proposal distribution. Note that a proposal distribution
$q(\theta' \mid \theta)$ differs from the target distribution $f(\theta \mid D)$ in that changes to the former only
affect the efficiency with which MCMC will produce an estimate of the latter. Some
operators are not symmetric, so that $q(\theta' \mid \theta) \neq q(\theta \mid \theta')$. However, the MCMC algorithm
described above assumes the probability of proposing $\theta'$ when in state $\theta$ is the same
as the probability of proposing $\theta$ when in state $\theta'$. The Metropolis–Hastings algorithm
(Hastings 1970) is an MCMC algorithm that compensates to maintain reversibility by
factoring in a Hastings ratio and accepts a proposed state with probability

$$\alpha = \min\left(1, \frac{f(\theta' \mid D)}{f(\theta \mid D)}\,\frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)}\right). \tag{1.5}$$

The Hastings ratio corrects for any bias introduced by the proposal distribution. Correct
calculation and implementation of non-symmetric operators in complex problems like
phylogenetics is difficult (Holder et al. 2005).
Green (Green 1995; Green et al. 2003) describes a general method to calculate the
Hastings ratio that also works when $\theta$ is not of fixed dimension. Green’s recipe assumes
that $\theta'$ can be reached from $\theta$ by selecting one or more random variables $u$, and likewise
$\theta$ can be reached from $\theta'$ by selecting $u'$ so that the vectors $(\theta', u')$ and $(\theta, u)$ have
the same dimension. Let $g$ and $g'$ be the probability (densities) of selecting $u$ and $u'$
respectively. Green showed that the ratio $q(\theta \mid \theta')/q(\theta' \mid \theta)$ can be calculated as
$\frac{g'(u')}{g(u)}\,|J|$, where $|J|$ is the absolute value of the determinant of the Jacobian of the
transformation $(\theta, u) \to (\theta', u')$.
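For example, a scale operator with scale factor $\beta \in (0, 1)$ has a proposal distribution
$q_\beta(\theta' \mid \theta)$ that transforms a parameter $\theta' = u\theta$ by selecting a random number $u$ uniformly
from the interval $(\beta, 1/\beta)$. Note that this scale operator has a probability of $(1-\beta)/(1/\beta-\beta)$
of decreasing and thus a higher probability of increasing $\theta$, so the Hastings ratio cannot
be 1. Moving between $(\theta, u)$ and $(\theta', u')$ such that $\theta' = u\theta$ requires that $u' = 1/u$.
Consequently, the Hastings ratio when selecting $u$ is

$$\mathrm{HR} = \frac{g'(u')}{g(u)}\left|\frac{\partial(\theta', u')}{\partial(\theta, u)}\right|
= \frac{1/(1/\beta - \beta)}{1/(1/\beta - \beta)}
\left|\det\begin{pmatrix} \partial\theta'/\partial\theta & \partial\theta'/\partial u \\ \partial(1/u)/\partial\theta & \partial(1/u)/\partial u \end{pmatrix}\right|
= \left|\det\begin{pmatrix} u & \theta \\ 0 & -1/u^2 \end{pmatrix}\right|
= \frac{1}{u}. \tag{1.6}$$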
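The effect of the Hastings ratio 1/u for the scale operator can be checked by simulation. The sketch below is a minimal Metropolis–Hastings sampler under assumptions of our own: the target is an exponential density with rate 2 on a single positive parameter, β is fixed at 0.5, and the function names are hypothetical rather than taken from BEAST.

```python
import math
import random


def scale_proposal(theta, beta, rng):
    """Propose theta' = u * theta with u uniform on (beta, 1/beta).

    Returns the proposed value and the Hastings ratio 1/u.
    """
    u = rng.uniform(beta, 1.0 / beta)
    return u * theta, 1.0 / u


def probability_of_decrease(beta):
    """Chance that u < 1, i.e. that the operator proposes a smaller value."""
    return (1.0 - beta) / (1.0 / beta - beta)


def log_target(theta):
    """Unnormalised log density of an exponential with rate 2 (assumed toy target)."""
    return -2.0 * theta


def mh_scale(n_steps, beta, rng, start=1.0):
    """Metropolis-Hastings chain that uses only the scale operator."""
    theta, samples = start, []
    for _ in range(n_steps):
        proposal, hastings = scale_proposal(theta, beta, rng)
        log_alpha = log_target(proposal) - log_target(theta) + math.log(hastings)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            theta = proposal
        samples.append(theta)
    return samples


samples = mh_scale(200000, 0.5, random.Random(7))
sample_mean = sum(samples) / len(samples)

print(probability_of_decrease(0.5), sample_mean)
```

With the Hastings ratio included the sample mean should be close to 0.5, the mean of the target; omitting the `math.log(hastings)` term would bias the chain because the operator proposes increases more often than decreases.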
Operators typically only change a small part of the state, for instance only the clock
rate while leaving the tree and substitution model parameters the same. Using operators
that sample from the conditional distribution of a subset of parameters given the
remaining parameters results in a Gibbs sampler (Geman and Geman 1984), and these
operators can be very efficient in exploring the state space, but can be hard to implement
since they require that this conditional distribution is available.
Operators often have tuning parameters that influence how radical the proposals are.
For the scale operator mentioned above, small values of β lead to bold moves, while
values close to 1 lead to timid proposals. The value of β is set at the start of the run, but it
can be tuned during the MCMC run so that it makes bolder moves if many proposals
are accepted, or more timid ones if many are rejected. For example, in the BEAST inference
engines (Bouckaert et al. 2014; Drummond et al. 2012), the way operator parameters are
tuned is governed by an operator schedule, and there are a number of tuning functions.
$$
\begin{bmatrix} - & r_1 & r_2 & r_3 \\ r_4 & - & r_5 & r_6 \\ r_7 & r_8 & - & r_9 \\ r_{10} & r_{11} & r_{12} & - \end{bmatrix}
\qquad
\begin{bmatrix} - & 1 & 0 & 0 \\ 1 & - & 1 & 0 \\ 0 & 1 & - & 1 \\ 0 & 0 & 1 & - \end{bmatrix}
\qquad
\begin{bmatrix} - & r_1 & 0 & 0 \\ r_4 & - & r_5 & 0 \\ 0 & r_8 & - & r_9 \\ 0 & 0 & r_{12} & - \end{bmatrix}
$$
Figure 1.7 Left, rate matrix where all rates are continuously sampled. Middle, indicator matrix
with binary values that are sampled. Right, rate matrix that is actually used, which combines rates
from the rate matrix on the left with indicator variables in the middle.
Tuning typically only changes the parameters much during the start of the chain, and
the tuning parameter will settle on a specific value as the chain progresses, guaranteeing
the correctness of the resulting sample from the posterior distribution. In BEAST, not
every operator has a tuning parameter, but if it does, its value is reported at the
end of a run, and a different value is suggested if the operator does not perform well.
It is quite common that the parameter space is not of a fixed dimension, for example
when a nucleotide substitution model is chosen but the number of parameters is
unknown. The MCMC algorithm can accommodate this using reversible jump (Green
1995) or Bayesian variable selection (BSVS; e.g. Kuo and Mallick 1998; Wu and
Drummond 2011; Wu et al. 2013). With reversible jump, the state space actually
changes dimension, which requires care in calculating the Hastings ratios of the operators
that propose dimension changes, but is expected to be more computationally
efficient than BSVS. BSVS involves a state space that contains the parameters of all the
sub-models. A set of Boolean indicator variables is also sampled; these determine which
model parameters are used and which are excluded from the likelihood calculation in
each step of the Markov chain. Figure 1.7 shows an example for a rate matrix. The
unused parameters are still part of the state space, and a prior is defined on them so
proposals are still performed on these unused parameters, making BSVS less efficient
than reversible jump. However, the benefit of BSVS is that it is easy to implement.
EBSP (Heled and Drummond 2008) and discrete phylogeography (Lemey et al. 2009a)
are examples that use BSVS, and the RB substitution model in the RBS plug-in is an
example that uses reversible jump.
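The way BSVS combines sampled rates with sampled indicators, as in Figure 1.7, amounts to an element-wise product. The sketch below illustrates this with made-up rate values (rate r_k is simply set to k) and with `None` standing in for the diagonal entries; the function names are hypothetical.

```python
def effective_rates(rates, indicators):
    """Multiply each off-diagonal rate by its 0/1 indicator; None marks the diagonal."""
    return [
        [None if rate is None else rate * flag for rate, flag in zip(rate_row, flag_row)]
        for rate_row, flag_row in zip(rates, indicators)
    ]


def active_parameter_count(indicators):
    """Number of rates that currently enter the likelihood calculation."""
    return sum(flag for row in indicators for flag in row if flag is not None)


# Hypothetical values: rate r_k is given the value k, for k = 1..12.
rates = [
    [None, 1, 2, 3],
    [4, None, 5, 6],
    [7, 8, None, 9],
    [10, 11, 12, None],
]
indicators = [
    [None, 1, 0, 0],
    [1, None, 1, 0],
    [0, 1, None, 1],
    [0, 0, 1, None],
]

used = effective_rates(rates, indicators)
print(used)
print(active_parameter_count(indicators))
```

All twelve rates remain in the state and continue to receive proposals, but only those with indicator 1 affect the likelihood, which is the source of both the simplicity and the inefficiency described above.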
for data D. Although it has been suppressed in previous sections, here we make the
posterior dependence on the model explicit by including the model on the right of the
vertical bar. So Pr(D|M) is the marginal likelihood with respect to the prior on the
parameters $\theta_M$ of model M:

$$\Pr(D \mid M) = \int_{\theta_M} \Pr(D \mid \theta_M, M)\, f(\theta_M \mid M)\, d\theta_M, \tag{1.8}$$
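For a model with a single parameter the marginal likelihood integral can be evaluated directly by numerical integration. The sketch below does this for a toy model of our own choosing (8 heads in 10 coin tosses with a uniform prior on the probability of heads), for which the exact answer is 1/11; realistic phylogenetic models have far too many dimensions for this approach.

```python
from math import comb


def likelihood(theta, heads=8, tosses=10):
    """Binomial probability of the data given the parameter theta."""
    return comb(tosses, heads) * theta**heads * (1.0 - theta) ** (tosses - heads)


def prior_density(theta):
    """Uniform prior density on the unit interval (an assumed, proper prior)."""
    return 1.0


def marginal_likelihood(steps=100000):
    """Midpoint-rule approximation of the integral of likelihood times prior."""
    width = 1.0 / steps
    midpoints = ((i + 0.5) * width for i in range(steps))
    return sum(likelihood(t) * prior_density(t) for t in midpoints) * width


evidence = marginal_likelihood()
print(evidence, 1.0 / 11.0)
```

Because the prior enters the integrand as a weight, the marginal likelihood is only defined when the prior is proper, in line with the earlier warning about improper priors.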
where β runs from 0 to 1 following a sigmoid function (Baele et al. 2013). Such
direct comparison of models instead of estimating marginal likelihoods in separate
path sampling analyses has the advantage that fewer steps are required. Furthermore,
computational gains can be made if large parts of the model are shared by models $M_1$
and $M_2$. For example, if $M_1$ and $M_2$ only differ in the tree prior, but the likelihoods are
the same, $\Pr(D \mid \theta_{M_1}, M_1) = \Pr(D \mid \theta_{M_2}, M_2)$, Equation (1.9) reduces to
0.001, even though we know that only a few fast-evolving viruses have such high rates
(Duffy et al. 2008).
Priors can be set using the objective view, which is based on the principle that priors
should be non-informative and only the model and data should influence the posterior
(Berger 2006). Alternatively, subjective Bayesian practitioners set priors so as to contain
prior information from previous experience, expert opinions and information from the
literature (Goldstein 2006). From a more pragmatic point of view, this offers the benefit
that the Bayesian approach allows us to combine information from heterogeneous
sources into a single analysis based on a formal foundation in probability theory. For
example, integrating DNA sequence data, information about geographic distribution
and data from the fossil record into a single analysis becomes quite straightforward, as
we shall see later in this book.
Another practical consideration is that after setting up priors and data, an MCMC
algorithm does not generally require as much special attention or tuning as many hill-
climbing or simulated annealing algorithms used in ML. A well-designed MCMC algo-
rithm just runs, automatically tunes itself and produces a posterior distribution. The
MCMC algorithm is guaranteed to converge in the limit of the number of samples
(Hastings 1970; Metropolis et al. 1953), but in practice it tends to converge much faster.
The algorithm is particularly suited to navigating the multi-modal and highly peaked
probability landscape typical of phylogenetic problems.
2 Evolutionary trees
Alexei Drummond and Tanja Stadler
This book is about evolution, and one of the fundamental features of evolutionary
analysis is the tree. The terms tree and phylogeny are used quite loosely in the literature
for the purposes of describing a number of quite distinct objects. Evolutionary trees are
a subset of the group of objects that graph theorists know as trees, which are themselves
connected graphs that do not contain cycles. An evolutionary tree typically has labelled
leaf nodes (tips) and unlabelled internal nodes (an internal node may also be known
as a divergence or coalescence). The leaf nodes are labelled with taxa, which might
represent an individual organism, or a whole species, or more typically just a gene
fragment, while the internal nodes represent unsampled (and thus inferred) common
ancestors of the sampled taxa. For reasons mainly of history, the types of trees are many
and varied; in the following we introduce the main types.
Tanja Stadler, ETH Zürich, Department of Biosystems Science & Engineering (D-BSSE), Mattenstrasse 26,
4058 Basel, Switzerland
Figure 2.1 Two leaf-labelled binary trees; (a) is rooted and (b) is unrooted.
[Figure panels: (a) a star tree; (b) partially resolved tree; (c) fully resolved tree; each on taxa A, B, C, D, E.]
know that the evolutionary process finishes at the leaves, but we do not know from
which point in the tree it starts. An unrooted binary tree is a tree in which each node has
one branch (leaf) or three branches (internal node) attached, and can be obtained from
a rooted binary tree through replacing the root node and its two attached branches by a
single branch. An unrooted binary tree of n taxa can be described by 2n − 2 nodes and
2n − 3 branches with branch lengths. In the unrooted tree diagram in Figure 2.1b the
path length between two leaf nodes is simply the sum of the branch lengths along the
shortest path connecting the two leaves. This figure makes it more obvious that taxon A
is the closest to taxon D.
There are many different units that the branch lengths of a tree could be expressed in,
but a common unit that is used for trees estimated from molecular data is substitutions
per site. We will expand more on this later in the book when we examine how one
estimates a tree using real molecular sequence data.
illustrates a completely unresolved (star) tree, along with a partially resolved (and thus
multifurcating) tree and a fully resolved (and thus binary or bifurcating) tree.
2.1.3 Time-trees
A time-tree is a special form of rooted tree, namely a rooted tree where branch lengths
correspond to calendar time, i.e. the duration between branching events. A time-tree of n
taxa can be described by 2n − 2 edges (branches) and 2n − 1 nodes with associated node
times (note that assigning 2n−2 branch lengths and the time of one node is equivalent to
assigning 2n−1 node times). The times of the internal nodes are called divergence times,
ages or coalescent times, while the times of the leaves are known as sampling times.
Figure 2.5 is an example of a time-tree. Often we are interested in trees in which all taxa
are represented by present-day samples, such that all the sampling times are the same.
In this case it is common for the sampling times to be set to zero, and the divergence
times to be specified as times before present (ages), so that time increases into the past.
Table 2.1 Notation used for time-trees in all subsequent sections and chapters
Notation Description
sufficiently large variance in the relation between amount and duration of evolution,
a relaxed clock model needs to be considered (see Chapter 4) since it attempts to model
this variance. Since in a time-tree node heights correspond to the ages of the nodes (or at
least relative ages if there is no calibration information), such rooted tree models have
fewer parameters than unrooted tree models, approaching roughly half as many for a large
number of taxa (for n taxa, there are 2n − 3 branch lengths for unrooted trees, while
for rooted time trees there are n − 1 node heights for internal nodes and a constant but
low number of parameters for the clock model). As BEAST infers rooted, binary (time)
trees we will focus on these objects. If not specified otherwise, a tree will refer to a
rooted binary tree in the following.
Estimating a tree from molecular data turns out to be a difficult problem. The diffi-
culty of the problem can be appreciated when one considers how many possible tree
topologies, i.e. trees without branch lengths, there are. Consider Tn , the set of all tip-
labelled rooted binary trees of n taxa. The number of distinct tip-labelled rooted binary
trees |Tn | for n taxa is (Cavalli-Sforza and Edwards 1967):
\[ |T_n| = \prod_{k=2}^{n} (2k - 3) = \frac{(2n - 3)!}{2^{n-2}\,(n - 2)!}. \tag{2.1} \]
Table 2.2 shows the number of tip-labelled rooted trees up to ten taxa, and other related
quantities.
Table 2.2 The number of unlabelled rooted tree shapes, the number of labelled rooted
trees, the number of labelled ranked trees (on contemporaneous tips) and the number of
fully ranked trees (on distinctly timed tips) as a function of the number of taxa, n

 n  #shapes  #trees, |Tn|  #ranked trees, |Rn|  #fully ranked trees, |Fn|
 2        1             1                    1                          1
 3        1             3                    3                          4
 4        2            15                   18                         34
 5        3           105                  180                        496
 6        6           945                 2700                     11 056
 7       11        10 395               56 700                    349 504
 8       23       135 135            1 587 600                 14 873 104
 9       46     2 027 025           57 153 600                819 786 496
10       98    34 459 425        2 571 912 000             56 814 228 736
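The second and third count columns of Table 2.2 can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the closed form used for |Rn| (n!(n − 1)!/2^(n−1)) is the standard count of ranked trees on contemporaneous tips, which is assumed here because its equation does not appear in this excerpt.

```python
from math import factorial


def num_rooted_trees(n: int) -> int:
    """|T_n| as the product of (2k - 3) for k = 2..n (Equation 2.1)."""
    count = 1
    for k in range(2, n + 1):
        count *= 2 * k - 3
    return count


def num_rooted_trees_closed_form(n: int) -> int:
    """|T_n| as (2n - 3)! / (2^(n-2) (n-2)!), valid for n >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_ranked_trees(n: int) -> int:
    """|R_n| = n! (n-1)! / 2^(n-1); assumed formula, not shown in the text."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


if __name__ == "__main__":
    for n in range(2, 11):
        print(n, num_rooted_trees(n), num_ranked_trees(n))
```

Both forms of Equation (2.1) agree for every n in the table, which is a useful check on the reconstruction of the closed form.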
In general, the number of tree shapes (or unlabelled rooted tree topologies) of n taxa
(an ) is given by (Cavalli-Sforza and Edwards 1967):
\[ a_n = \begin{cases} \sum_{i=1}^{(n-1)/2} a_i a_{n-i} & n \text{ is odd} \\ a_1 a_{n-1} + a_2 a_{n-2} + \cdots + \frac{1}{2} a_{n/2} (a_{n/2} + 1) & n \text{ is even.} \end{cases} \tag{2.2} \]
This result can easily be obtained by considering that for any tree shape of size n, it
must be composed of two smaller tree shapes of size i and n − i that are joined by the
root to make the larger tree shape. This leads directly to the simple recursion above. So
for five taxa, the two branches descending from the root can split the taxa into subtrees
of size (1 and 4) or (2 and 3). There are a1 a4 = 2 tree shapes of the first kind and
a2 a3 = 1 tree shape of the second kind, giving a5 = a1 a4 + a2 a3 = 3. This result can
be built upon to obtain a6 and so on.
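The recursion in Equation (2.2) translates directly into a memoised function. The sketch below assumes the base case a1 = 1 (a single tip has one shape); the function name is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def num_tree_shapes(n: int) -> int:
    """Number of unlabelled rooted binary tree shapes on n tips (Equation 2.2)."""
    if n == 1:
        return 1  # assumed base case: a single tip
    if n % 2 == 1:
        # Odd n: the two subtrees at the root always differ in size.
        return sum(num_tree_shapes(i) * num_tree_shapes(n - i)
                   for i in range(1, (n - 1) // 2 + 1))
    # Even n: unequal splits, plus unordered pairs of shapes of size n/2.
    half = num_tree_shapes(n // 2)
    unequal = sum(num_tree_shapes(i) * num_tree_shapes(n - i)
                  for i in range(1, n // 2))
    return unequal + half * (half + 1) // 2


if __name__ == "__main__":
    print([num_tree_shapes(n) for n in range(2, 11)])
```

For five taxa the function evaluates a1 a4 + a2 a3 = 2 + 1 = 3, matching the worked example in the text.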
All ranked trees with four tips are shown in Figure 2.3. When a tree has non-
contemporaneous times for the sampled taxa we term the tree fully ranked (Gavryushk-
ina et al. 2013) and the number of fully ranked trees of n tips can be computed by
recursion:
\[ F(n_1, \ldots, n_m) = \sum_{i=1}^{n_m} \frac{|R_{n_m}|}{|R_i|}\, F(n_1, n_2, \ldots, n_{m-2}, n_{m-1} + i), \tag{2.4} \]
where ni is the number of tips in the ith set of tips, grouped by sample time (see
Gavryushkina et al. 2013 for details).
Let Fn be the set of fully ranked trees on n tips, each with a distinct sampling time,
then:
\[ |F_n| = F(\underbrace{1, 1, \ldots, 1}_{n \text{ times}}). \]
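A short sketch of the recursion in Equation (2.4) follows. Two assumptions are made that are not stated in this excerpt: the recursion terminates with F(n) = |Rn| when a single group of contemporaneous tips remains, and |Rn| = n!(n − 1)!/2^(n−1). With these assumptions the function reproduces the |Fn| column of Table 2.2 for small n.

```python
from functools import lru_cache
from math import factorial


def num_ranked_trees(n: int) -> int:
    """|R_n| = n! (n-1)! / 2^(n-1); assumed formula for contemporaneous tips."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


@lru_cache(maxsize=None)
def fully_ranked(groups: tuple) -> int:
    """F(n_1, ..., n_m) from Equation (2.4); groups are tip counts per sample time."""
    if len(groups) == 1:
        return num_ranked_trees(groups[0])  # assumed base case
    *head, second_last, last = groups
    total = 0
    for i in range(1, last + 1):
        # |R_last| / |R_i| is an exact integer because |R_n| / |R_(n-1)| = n(n-1)/2.
        weight = num_ranked_trees(last) // num_ranked_trees(i)
        total += weight * fully_ranked(tuple(head) + (second_last + i,))
    return total


def num_fully_ranked_trees(n: int) -> int:
    """|F_n|: every one of the n tips has its own distinct sampling time."""
    return fully_ranked((1,) * n)


if __name__ == "__main__":
    print([num_fully_ranked_trees(n) for n in range(2, 8)])
```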
2.2.3 Time-trees
Consider Gn , the (infinite) set of time-trees of size n. Gn can be constructed by the Carte-
sian product of (1) the set of ranked trees Rn and (2) D the set of ordered divergence
In the remainder of this chapter, we discuss models giving rise to time-trees (and thus
also to ranked trees and tree shapes).
Figure 2.4 A haploid Wright–Fisher population of a dozen individuals with the ancestry of two
individuals sampled from the current generation traced back in time. Going back in time, the
traced lineages coalesce on a common ancestor 11 generations in the past, and from there
onwards the ancestry is shared.
Now consider two random members of the current generation from a population of
fixed size N (refer to Figure 2.4). By complete mixing, the probability they share a
concestor (common ancestor) in the previous generation is 1/N. It can easily be shown
by a recursive argument that the probability the concestor is t generations back is
\[ \Pr\{t\} = \frac{1}{N} \left(1 - \frac{1}{N}\right)^{t-1}. \]
It follows that X = t − 1 has a geometric distribution with a success rate of λ = 1/N,
and so has mean N − 1 ≈ N and variance N^2 − N ≈ N^2.
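The geometric waiting time can be checked numerically. The sketch below evaluates Pr{t} and also simulates two lineages choosing parents uniformly at random in a haploid Wright–Fisher population; the function names and the simulation set-up are illustrative assumptions rather than anything prescribed by the text.

```python
import random


def coalescence_pmf(t: int, pop_size: int) -> float:
    """Pr{t}: two lineages first share an ancestor exactly t generations back."""
    p = 1.0 / pop_size
    return p * (1.0 - p) ** (t - 1)


def simulate_coalescence_time(pop_size: int, rng: random.Random) -> int:
    """Trace two lineages back until both pick the same parent."""
    generations = 0
    while True:
        generations += 1
        # Each lineage picks its parent uniformly among pop_size individuals.
        if rng.randrange(pop_size) == rng.randrange(pop_size):
            return generations


if __name__ == "__main__":
    rng = random.Random(2015)
    times = [simulate_coalescence_time(12, rng) for _ in range(10000)]
    print(sum(times) / len(times))  # should be close to N = 12
```

The sample mean of t should be close to N, and the mean of X = t − 1 close to N − 1.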
With k lineages the time to the first coalescence can be derived in the same way. The
probability that none of the k lineages coalesces in the previous generation is
\[ \frac{N-1}{N}\, \frac{N-2}{N} \cdots \frac{N-k+1}{N} = 1 - \frac{\binom{k}{2}}{N} + O(1/N^2). \]
Thus the probability of a coalescent event is $\binom{k}{2}/N + O(1/N^2)$. Now for large N we
can drop the order $O(N^{-2})$ term, and this results in a success rate of $\lambda = \binom{k}{2}/N$ and the
mean waiting time to the first coalescence among k lineages ($\tau_k$) of
\[ E(\tau_k) = \frac{N}{\binom{k}{2}}. \]
Dropping the O(N^{-2}) term implicitly assumes that N is much larger than k, so that the
probability of two coalescent events occurring in the same generation is negligible.
Kingman (1982) showed that as N grows the coalescent process converges to
a continuous-time Markov chain. The rate of coalescence in the Markov chain is
$\lambda = \binom{k}{2}/N$, i.e. going back in time, the probability of a pair coalescing from k lineages
on a short time interval Δt is O(λΔt). Unsurprisingly the solution turns out to be the
exponential distribution:
\[ f(\tau_k) = \frac{\binom{k}{2}}{N} \exp\left(-\frac{\binom{k}{2}\, \tau_k}{N}\right). \]
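As a hedged illustration of the continuous-time limit, the sketch below draws the waiting times τn, ..., τ2 for a sample of n lineages from exponential distributions with rate C(k, 2)/N, and computes the expected tree height as the sum of the expected waiting times. Function names are illustrative; time is measured in generations.

```python
import random
from math import comb


def expected_waiting_time(k: int, pop_size: float) -> float:
    """E(tau_k) = N / C(k, 2)."""
    return pop_size / comb(k, 2)


def expected_tree_height(n: int, pop_size: float) -> float:
    """Sum of E(tau_k) for k = n, ..., 2, which equals 2N(1 - 1/n)."""
    return sum(expected_waiting_time(k, pop_size) for k in range(2, n + 1))


def simulate_coalescent_intervals(n: int, pop_size: float, rng: random.Random) -> list:
    """Return [tau_n, ..., tau_2], each exponential with rate C(k, 2) / N."""
    return [rng.expovariate(comb(k, 2) / pop_size) for k in range(n, 1, -1)]


if __name__ == "__main__":
    rng = random.Random(1982)
    print(simulate_coalescent_intervals(5, 100.0, rng))
    print(expected_tree_height(5, 100.0))
```

Note that the final interval (k = 2) has expectation N, which is half of the expected tree height in the limit of large n.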
exhibits sample genealogies that have the statistical properties of an idealised population
of size Ne . Although this means that the absolute values of Ne are difficult to relate
to the true census size, it allows different populations to be compared on a common
scale (Sjödin et al. 2005). Theoretical extensions to the concept of coalescent effective
population size have also received recent attention (Wakeley and Sargsyan 2009) and the
complexities of interpreting effective population size in the context of HIV-1 evolution
have received much consideration (Kouyos et al. 2006). However, see Gillespie (2001)
for an argument that these neutral evolution models are irrelevant to much real data
because neutral loci will frequently be sufficiently close to loci under selection, that
genetic draft and genetic hitchhiking will destroy the relationship between population
size and genetic diversity that coalescent theory relies on for its inferential power.
The original formulation was for a constant population (as outlined above), but the
theory has been generalised in a number of ways (see Hudson (1990) for a classic
review), including its application to recombination (Hudson 1987; Hudson and Kaplan
1985), island migration models (Hudson 1990; Slatkin 1991; Tajima 1989), population
divergence (Tajima 1983; Takahata 1989) and deterministically varying functions of
population size for which the integral $\int_{t_0}^{t_1} N(t)^{-1}\, dt$ can be computed (Griffiths and
Tavaré 1994). Since all but the final extension require more complex combinatorial
objects than time-trees (either ancestral recombination graphs or structured time-trees
in which tips and internal branch segments are discriminated by their subpopulation),
we will largely restrict ourselves to single-population non-recombining coalescent
models in the following sections. For information about structured time-tree models, see
Chapter 5.
clock constraints) can be obtained and then used to obtain a maximum likelihood
estimate of the mutation-scaled effective population size Ne μ (where μ is mutation
rate per generation time) using coalescent theory. If the mutation rate per generation is
known, then Ne can be estimated directly from a time-tree in which the time is expressed
in generations; otherwise if the mutation rate is known in some calendar units, μc , then
the estimated population size parameter will be Ne gc where μ = μc gc and thus gc is
the generation time in calendar units.
Consider a demographic function Ne (x) and an ordered set of node times t = {ti :
i ∈ V}. Let ki denote the number of lineages co-existing in the time interval (ti−1 , ti )
between node i − 1 and node i. Note that for a contemporaneous tree, ki decreases
monotonically with increasing i.
The probability those times are the result of the coalescent process reducing n
lineages into one is obtained by multiplying the (independent) probabilities for each
coalescence event,
\[ f(t \mid N_e(x)) = \prod_{i \in Y} \frac{\binom{k_i}{2}}{N_e(t_i)} \prod_{i \in V} \exp\left(-\int_{t_{i-1}}^{t_i} \frac{\binom{k_i}{2}}{N_e(x)}\, dx\right), \tag{2.5} \]
where t0 = 0 is defined to effect compact notation in the second product. Note that the
second product is over all nodes (including leaf nodes) to provide for generality of the
result when leaf nodes are non-contemporaneous (i.e. dated tips).
\[ f(g \mid N_e(x)) = \frac{1}{|R_n|}\, f(t \mid N_e(x)), \]
since there are |Rn | ranked trees of equal probability under the coalescent process
(Aldous 2001). Note that this second result only holds for contemporaneous tips, as
not all ranked trees are equally probable under the coalescent when tips are non-
contemporaneous (see Section 2.3.2 for non-contemporaneous tips). The function Ne (x)
that maximises the likelihood of the time-tree g is the maximum likelihood estimate of
the population size history. For the simplest case of constant population size Ne (x) = Ne
and contemporaneous tips, this density becomes:
\[ f(g \mid N_e) = \prod_{i \in Y} \frac{1}{N_e} \exp\left(-\frac{\binom{k_i}{2}\, \tau_i}{N_e}\right). \]
Note above that we use Ne understanding that the times, τi , are measured in gener-
ations. If they are measured in calendar units then Ne would be replaced by Ne gc where
gc is the generation time in the calendar units employed.
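For the constant-size case with contemporaneous tips, the density above is easy to evaluate on the log scale. The sketch below assumes the intervals are supplied in order from the tips towards the root, so that the first interval has n lineages and the last has two. The maximum likelihood estimator is not given in the text; it follows from setting the derivative of the log density with respect to Ne to zero, and is included only as an illustration.

```python
from math import comb, log


def log_coalescent_density(intervals, pop_size: float) -> float:
    """log f(g | Ne) for constant Ne; intervals[j] is spent with n - j lineages."""
    n = len(intervals) + 1
    log_density = 0.0
    for j, tau in enumerate(intervals):
        k = n - j  # number of lineages during this interval
        log_density += -log(pop_size) - comb(k, 2) * tau / pop_size
    return log_density


def ml_population_size(intervals) -> float:
    """Ne maximising the density: sum of C(k, 2) * tau_k divided by (n - 1)."""
    n = len(intervals) + 1
    scaled = sum(comb(n - j, 2) * tau for j, tau in enumerate(intervals))
    return scaled / (n - 1)


if __name__ == "__main__":
    taus = [0.5, 1.0]  # hypothetical intervals for n = 3 tips
    print(log_coalescent_density(taus, 2.0), ml_population_size(taus))
```

If the times are measured in calendar units rather than generations, the same code applies with `pop_size` interpreted as Ne gc.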
Furthermore, there is often considerable uncertainty in the reconstructed genealogy.
In order to allow for this uncertainty it is necessary to compute the average probability
of the population parameters of interest. The calculation involves integrating over
genealogies distributed according to the coalescent (Felsenstein 1988, 1992; Griffiths
1989; Griffiths and Tavaré 1994; Kuhner et al. 1995). Integration for some models of
interest can be carried out using Monte Carlo methods. Importance-sampling algorithms
have been developed to estimate the population parameter Θ ∝ Ne μ (Griffiths and
Tavaré 1994; Stephens and Donnelly 2000), migration rates (Bahlo and Griffiths 2000)
and recombination (Fearnhead and Donnelly 2001; Griffiths and Marjoram 1996).
Metropolis–Hastings Markov chain Monte Carlo (MCMC) (Hastings 1970; Metropolis
et al. 1953) has been used to obtain sample-based estimates of Θ (Kuhner et al. 1995),
exponential growth rate (Kuhner et al. 1998), migration rates (Beerli and Felsenstein
1999, 2001) and recombination rate (Kuhner et al. 2000).
In addition to developments in coalescent-based population genetic inference,
sequence data sampled at different times are now available from both rapidly evolving
viruses such as HIV-1 (Holmes et al. 1993; Rodrigo et al. 1999; Shankarappa et al.
1999; Wolinsky et al. 1996), and from ancient DNA sources (Barnes et al. 2002;
Lambert et al. 2002; Leonard et al. 2002; Loreille et al. 2001). Temporally spaced data
provide the potential to observe the accumulation of mutations over time, thus allowing
the estimation of mutation rate (Drummond and Rodrigo 2000; Rambaut 2000). In fact
it is even possible to estimate variation in the mutation rate over time (Drummond et al.
2001). This leads naturally to the more general problem of simultaneous estimation of
population parameters and mutation parameters from temporally spaced sequence data
(Drummond and Rodrigo 2000; Drummond et al. 2001, 2002; Rodrigo and Felsenstein
1999; Rodrigo et al. 1999).
\[ p_{nc} = e^{-\tau/N_e}, \qquad p_c = 1 - p_{nc}. \]
If there is a coalescence more recent than time τ then the tree must be one of the
following topologies: ((A,B),(C,D)), (((A,B),C),D), (((A,B),D),C).
Figure 2.6 (a) Only three ranked trees are possible if coalescence of A and B occurs more recently
than τ . (b) All ranked topologies are possible if A and B do not coalesce more recently than τ .
removed from the infected population by way of acquired immunity, death or some
other inability to infect others or to be reinfected. The number of susceptible, infected
and removed individuals under an SIR model can be deterministically described forward
in time by a trio of coupled ordinary differential equations (ODEs), where β and μ
respectively represent the transition rates from susceptible S to infected I, and infected
I to removed R, such that
\[ \frac{d}{d\tau} S(\tau) = -\beta I(\tau) S(\tau), \tag{2.6} \]
\[ \frac{d}{d\tau} I(\tau) = \beta I(\tau) S(\tau) - \mu I(\tau), \tag{2.7} \]
\[ \frac{d}{d\tau} R(\tau) = \mu I(\tau). \tag{2.8} \]
Considering this model in the reverse time direction, we have S(t) = S(z0 − t),
I(t) = I(z0 − t), R(t) = R(z0 − t), from an origin z0 ago.
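The deterministic trajectory defined by Equations (2.6)–(2.8) can be obtained by numerical integration. The sketch below uses a classical fourth-order Runge–Kutta scheme in plain Python; the choice of integrator, the function names and the parameter values in the usage example are all illustrative assumptions, not taken from the text.

```python
def sir_derivatives(state, beta: float, mu: float):
    """Right-hand sides of Equations (2.6)-(2.8) for state = (S, I, R)."""
    s, i, _ = state
    new_infections = beta * i * s
    removals = mu * i
    return (-new_infections, new_infections - removals, removals)


def integrate_sir(s0, i0, r0, beta, mu, duration, steps):
    """Integrate forward in time with RK4; returns a list of (time, S, I, R)."""
    h = duration / steps
    state = (s0, i0, r0)
    trajectory = [(0.0, *state)]
    for n in range(1, steps + 1):
        k1 = sir_derivatives(state, beta, mu)
        k2 = sir_derivatives(tuple(x + 0.5 * h * d for x, d in zip(state, k1)), beta, mu)
        k3 = sir_derivatives(tuple(x + 0.5 * h * d for x, d in zip(state, k2)), beta, mu)
        k4 = sir_derivatives(tuple(x + h * d for x, d in zip(state, k3)), beta, mu)
        state = tuple(x + h / 6.0 * (a + 2 * b + 2 * c + d)
                      for x, a, b, c, d in zip(state, k1, k2, k3, k4))
        trajectory.append((n * h, *state))
    return trajectory


if __name__ == "__main__":
    # Hypothetical epidemic: 999 susceptibles, one infected, R0 = beta * S0 / mu of about 5.
    final_time, s, i, r = integrate_sir(999.0, 1.0, 0.0, 0.0005, 0.1, 100.0, 1000)[-1]
    print(final_time, round(s, 2), round(i, 2), round(r, 2))
```

Because the three derivatives sum to zero, S + I + R stays constant along the trajectory, which gives a simple sanity check on any implementation.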
Recall that the coalescent calculates the probability density of a tree given the
coalescent rate. The coalescent rate for k lineages is $\binom{k}{2}$ times the inverse of the product
of effective population size Ne and generation time gc . Volz (2012) proposed a
coalescent approximation to epidemiological models such as the SIR, where the effective
population size Ne is the expected number of infected individuals through time, and the
generation time gc is derived as follows.
We summarise the epidemiological parameters η = {β, μ, S0 , z0 } with the start of
the epidemic being at time z0 in the past and S0 being the susceptible population size
at time z0 . The expected epidemic trajectory obtained from Equations (2.6)–(2.8) is
denoted V = (S, I, R). Let f (g|V, η) be the probability density of a tree given the
expected trajectory and the epidemic parameters η. The coalescent rate λk (t) of k
co-existing lineages computed following Volz (2012) is:
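\begin{align}
\lambda_k(t) &= (\text{Prob. of coalescence given single birth in population}) \times (\text{total birth rate}) \notag \\
&= \binom{k}{2} \binom{I(t)}{2}^{-1} \times \beta S(t) I(t) \notag \\
&\approx \binom{k}{2} \frac{2 \beta S(t)}{I(t)}. \tag{2.9}
\end{align}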
Note that ‘birth’ in this context refers to an increase in the number of infected hosts
by the infection process. In the last line, the approximation I(t) ≈ I(t) − 1 is used,
following Volz (2012).
In summary, the relationship between the estimators of Ne gc (t), such as the Bayesian
skyline plot, and the coalescent model approximating the SIR dynamics described by
Volz (2012), is:
\[ \binom{k}{2} \frac{1}{N_e(t)\, g_c(t)} = \lambda_k(t) \approx \binom{k}{2} \frac{2 \beta S(t)}{I(t)}, \tag{2.10} \]
where $N_e(t) = I(t)$ is a deterministically varying population size, and $g_c(t) \approx \frac{1}{2 \beta S(t)}$ is
a deterministically varying generation time that results from the slowdown in infection
rate per lineage as the susceptible pool is used up.
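The relationship in Equations (2.9) and (2.10) can be made concrete with a small numerical sketch. The function names and the numbers in the usage example are hypothetical; the second function evaluates the rate before the approximation I(t) − 1 ≈ I(t) is applied, so the two can be compared.

```python
from math import comb


def volz_coalescent_rate(k: int, beta: float, susceptible: float, infected: float) -> float:
    """Approximate rate of Equation (2.9): C(k, 2) * 2 * beta * S / I."""
    return comb(k, 2) * 2.0 * beta * susceptible / infected


def unapproximated_rate(k: int, beta: float, susceptible: float, infected: float) -> float:
    """C(k, 2) / C(I, 2) * beta * S * I, i.e. before replacing I - 1 by I."""
    pairs_of_infected = infected * (infected - 1.0) / 2.0
    return comb(k, 2) / pairs_of_infected * beta * susceptible * infected


def ne_times_generation_time(beta: float, susceptible: float, infected: float) -> float:
    """Ne(t) * gc(t) = I(t) / (2 * beta * S(t)), from Equation (2.10)."""
    return infected / (2.0 * beta * susceptible)


if __name__ == "__main__":
    # Hypothetical snapshot of an epidemic: S = 500, I = 100, beta = 0.001.
    print(volz_coalescent_rate(3, 0.001, 500.0, 100.0))
    print(unapproximated_rate(3, 0.001, 500.0, 100.0))
    print(ne_times_generation_time(0.001, 500.0, 100.0))
```

The two rates differ by a factor of I/(I − 1), so the approximation is tight once more than a handful of hosts are infected.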
The probability density of coalescent intervals t given an epidemic description can
then be easily computed, following Equation (2.5) and using calendar time units:
\[ f(t \mid V, \eta) = \prod_{i \in Y} \lambda_{k_i}(t_i) \prod_{i \in V} \exp\left(-\int_{t_{i-1}}^{t_i} \lambda_{k_i}(x)\, dx\right). \tag{2.11} \]
The coalescent can be derived by considering the data set to be a small sample from
an idealised Wright–Fisher population. Coalescent times only depend on deterministic
population size, meaning population size is the only parameter in the coalescent.
If the assumption of small sample size or deterministic population size breaks down
(Popinga et al. 2015; Stadler et al. 2015), or if parameters other than population size
govern the distribution of time-trees, then different models need to be considered.
Classically, forward-in-time birth–death processes (Kendall 1948) are used as alternatives
to the coalescent. We will first discuss birth–death processes with constant rates and
samples taken at one point in time, which includes the Yule model (Harding 1971; Yule
1924). This model is then generalised to allow for changing rates, and finally sequential
sampling is considered.
\[ f_{BD}(g \mid \lambda, \mu, \rho, z_0) = p_1(z_0) \prod_{i=1}^{n-1} \lambda\, p_1(t_{n+i}) \]
Figure 2.7 Complete tree of age z0 (left) and corresponding reconstructed tree (right). In the
reconstructed phylogeny, all extinct and non-sampled extant species are pruned; only the
sampled species (denoted with a black circle) are included.
with
\[ p_1(x) = \frac{\rho (\lambda - \mu)^2 e^{-(\lambda - \mu) x}}{\left(\rho \lambda + (\lambda (1 - \rho) - \mu) e^{-(\lambda - \mu) x}\right)^2}, \]
where p1 (x) is the probability of an individual at time x in the past having precisely one
sampled descendant.
When assuming additionally a probability density f (z0 ) for z0 , we obtain the prob-
ability density of the time-tree g ∈ G together with its time of origin z0 , given a birth
rate λ, a death rate μ and a sampling probability ρ:
fBD (g, z0 |λ, μ, ρ) = fBD (g|λ, μ, ρ, z0 )f (z0 ).
We made the assumption here that the distribution of z0 is independent of λ, μ, ρ, i.e.
f (z0 |λ, μ, ρ) ≡ f (z0 ), following the usual assumption of independence of hyper-prior
distributions.
To obtain unbiased estimates, we suggest conditioning the process on yielding
at least one sample (i.e. g ∈ GS with GS = G1 ∪ G2 ∪ G3 . . .), as we only consider data
sets which contain at least one sample (Stadler 2013b). We obtain such a conditioning
through dividing fBD (g, z0 |λ, μ, ρ) by the probability of obtaining at least one sample.
We define p0 (x) to be the probability of an individual at time x in the past not having
any sampled descendants, and thus 1 − p0 (x) is the probability of at least one sample
(Yang and Rannala 1997):
\[ 1 - p_0(x) = \frac{\rho (\lambda - \mu)}{\rho \lambda + (\lambda (1 - \rho) - \mu) e^{-(\lambda - \mu) x}}. \]
The probability density of the time-tree g ∈ G together with its time of origin z0 , given
a birth rate λ, a death rate μ, a sampling probability ρ and conditioned on at least one
sample (S) is:
\[ f_{BD}(g, z_0 \mid \lambda, \mu, \rho; S) = f(z_0)\, \frac{p_1(z_0)}{1 - p_0(z_0)} \prod_{i=1}^{n-1} \lambda\, p_1(t_{n+i}). \tag{2.12} \]
Equation (2.12) defines the probability density of the time-tree g ∈ GS . Thus when
performing parameter inference, the number of samples n is considered as part of the
data. In contrast, under the coalescent, the probability density of the time-tree g ∈ Gn
is calculated (Equation (2.5)). Thus, parameter inference is conditioned on the number
of samples n, which means that we do not use the information n for inference and thus
ignore some information in the data by conditioning on it.
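The quantities p1(x) and 1 − p0(x) and the tree density conditioned on the origin are simple to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, the prior factor f(z0) of Equation (2.12) is left out, and the formulas as written require λ ≠ μ.

```python
from math import exp, log


def p1(x: float, lam: float, mu: float, rho: float) -> float:
    """Probability that an individual at time x has exactly one sampled descendant."""
    decay = exp(-(lam - mu) * x)
    denom = rho * lam + (lam * (1.0 - rho) - mu) * decay
    return rho * (lam - mu) ** 2 * decay / denom ** 2


def prob_at_least_one_sample(x: float, lam: float, mu: float, rho: float) -> float:
    """1 - p0(x): probability of at least one sampled descendant."""
    decay = exp(-(lam - mu) * x)
    return rho * (lam - mu) / (rho * lam + (lam * (1.0 - rho) - mu) * decay)


def log_bd_density(internal_times, origin, lam, mu, rho, condition_on_sample=True):
    """log of p1(z0) * prod(lam * p1(t)), optionally divided by 1 - p0(z0).

    internal_times holds the n - 1 internal node times; f(z0) is not included.
    """
    log_density = log(p1(origin, lam, mu, rho))
    for t in internal_times:
        log_density += log(lam) + log(p1(t, lam, mu, rho))
    if condition_on_sample:
        log_density -= log(prob_at_least_one_sample(origin, lam, mu, rho))
    return log_density


if __name__ == "__main__":
    # Hypothetical three-tip tree with origin 2.0 and internal node times 1.0 and 0.5.
    print(log_bd_density([1.0, 0.5], 2.0, lam=2.0, mu=1.0, rho=0.5))
```

In the Yule case (μ = 0, ρ = 1) the expression for p1(x) collapses to exp(−λx) and 1 − p0(x) equals 1, which provides a convenient check.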
In order to compare the constant-rate birth–death process to the coalescent, we may
want to condition on observing n tips. Such trees correspond to simulations where first
a time z0 is sampled, and then a tree with age z0 is being simulated but only kept if the
final number of tips is n. The probability density of the time-tree g ∈ Gn together with
its time of origin z0 , given λ, μ and ρ, is (Stadler 2013b):
with
\[ f_{BD}(g \mid \lambda, \mu, \rho, z_0; n) = \prod_{i=1}^{n-1} \lambda\, \frac{p_1(t_i)}{q(z_0)}, \]
where
Note that when we condition on n (instead of considering it as part of the data), the
distribution for z0 as well as all other hyper-prior distributions are formally chosen with
knowledge of n.
We conclude the section on the constant-rate birth–death model with several remarks:
The constant-rate birth–death model with μ = 0 and ρ = 1, i.e. no extinction and
complete sampling, corresponds to the well-known Yule model (Edwards 1970; Yule
1924).
The three parameters λ, μ, ρ are non-identifiable, meaning that the probability density
of a time-tree is determined by the two parameters λ − μ and λρ (the probability dens-
ity only depends on these two parameters if the probability density is conditioned on
survival or on n samples (Stadler 2009)). Thus, if the priors for all three parameters are
non-informative, we obtain large credible intervals when estimating these parameters.
One may speculate that the distribution of g ∈ Gn under the constant-rate birth–death
process (Equation (2.13)) for ρ → 0 converges to the distribution of g ∈ Gn under
the coalescent (Equation (2.5)). However, this is only true in expectation under special
assumptions (Gernhard 2008; Stadler 2009). One difference between the constant-rate
birth–death process limit and the coalescent is that the birth–death process induces
a stochastically varying population size while classic coalescent theory relies on a
deterministic population size. Recent work demonstrates that even when coalescent
theory is applied to a population with a stochastically varying population size, the result
is not equivalent to the constant-rate birth–death limit (Stadler et al. 2015).
is given in (Stadler 2010, Corollary 3.7). This model was used to quantify the basic
reproductive number for HIV in Switzerland (Stadler et al. 2012).
We note that if μ = 0, we obtain time-trees with all extinct lineages being included.
If ρ = 0, then it is straightforward to show that the probability density of a time-tree
only depends on two parameters λ − μ − ψ and λψ, if conditioned on survival (Stadler
2010).
Again, the model can be extended to piecewise changing rates (birth–death skyline
plot), and the probability density of the time-tree g ∈ G,
is given in (Stadler et al. 2013, Theorem 1). The birth–death skyline plot was used to
recover epidemiological dynamics of HIV in the UK and HCV in Egypt.
Recall that when analysing serially sampled data using the coalescent, we condition
the analysis on the number of samples n as well as on the sampling times. The birth–
death model, however, treats n and the sampling times as part of the data, and thus the
parameters are informed by the sampling times of the particular data sets.
Recent work started incorporating diversity-dependent models (Kühnert et al. 2014;
Leventhal et al. 2014) and trait-dependent models (Stadler and Bonhoeffer 2013) for
serially sampled data. Diversity-dependent models can be used to explicitly model
epidemiological dynamics in infectious diseases by acknowledging the dependence of
transmission rates on the number of susceptible individuals, formalised in SI, SIS, SIR
or SEIR models described in Section 2.3.3. In such models the birth rate of the birth–
death model is λ = βS, with β being the transition rate from susceptible to infected and
S being the number of susceptibles. Trait-dependent models may be used for structured
populations, where different population groups are characterized by a trait. Chapter 5
discusses some more details of these phylodynamic models.
Figure 2.8 A species tree on nS = 3 species with a gene tree of n = (3, 4, 3) samples embedded.
[Figure axes: time; population size, N.]
tree being equal to the transmission history is problematic and may bias transmission
time estimates.
In this interpretation, we argue that the natural generative model at the level of
the host population is a branching process of infections, where each branching event
represents a transmission of the disease from one infected individual to the next, and
the terminal branches of this transmission tree represent the transition from infectious
to recovery or death of the infected host organism. For multicellular host species there
is an additional process of proliferation of infected cells within the host’s body (often
restricted to certain susceptible tissues) that also has a within-host branching process
of cell-to-cell infections. This two-level hierarchical process can be extended to con-
sider different infectious compartments at the host level, representing different stages in
disease progression, and/or different classes of dynamic behaviour among hosts.
Accepting the above as the basic schema for the generative process, one needs to
also consider a typical observation process of an epidemic or endemic disease. It is
often the case that data are obtained through time from some fraction, but not all, of the
infected individuals. Figure 2.9 illustrates the relationship between the full transmission
history and the various sampled histories. The transmission tree prior may be one of the
birth–death models introduced in Section 2.4. The gene tree prior may be a coalescent
process within the transmission tree (analogous to a coalescent process within a species
tree). Notice that the sampled host transmission tree may have internal nodes with a
single child lineage representing a direct ancestor of a subsequent sample. We refer to
such trees as sampled ancestor trees (Gavryushkina et al. 2013) and a reversible-jump
Bayesian inference scheme for such trees has recently been described (Gavryushkina
et al. 2014).
Figure 2.9 (1) A transmission tree and embedded pathogen gene tree; (2) the sampled
transmission tree; (3) the sampled pathogen gene tree; (4) the sampled host transmission tree.
This model schema (combining both the generative and observational processes) can
be readily simulated with a recently developed BEAST 2 package called MASTER
(Vaughan and Drummond 2013), and inference approaches under models of similar
form have been described (Ypma et al. 2013), but full likelihood inference under the
model depicted in Figure 2.9 is not yet available.
2.6 Exercise
The simplest measure of distance between a pair of aligned molecular sequences is the
number of sites at which they differ. This is known as the Hamming distance (h). This
raw score can be normalised for the length of a sequence (l) to get the proportion of sites
that differ between the two sequences, p = h/l. Consider two hypothetical nucleotide
fragments of length l = 10:
Sequence 1 AATCTGTGTG
Sequence 2 AGCCTGGGTA
In these sequences h = 4 and p = 4/10 = 0.4. The proportion of sites that are
different, p, is an estimate of the evolutionary distance between these two sequences.
A single nucleotide site can, given enough time, undergo multiple substitution events.
Because the alphabet of nucleotide sequences is small, multiple substitutions can be
hidden by reversals and parallel changes. If this is the case, some substitutions will not
be observed. Therefore the estimate of 0.4 substitutions/site in this example could be
an underestimate. This is easily recognised if one considers two hypothetical sequences
separated by a very large evolutionary distance – for example ten substitutions/site.
Even though the two sequences will be essentially random with respect to each other,
they will still, by chance alone, have matches at about 25% of the sites. This would
give them an uncorrected distance, p, of 0.75 substitutions/site, despite being actually
separated by ten substitutions/site.
To compensate for this tendency to underestimate large evolutionary distances, a
technique called distance correction is used. Distance correction requires an explicit
model of molecular evolution. The simplest of these models is the Jukes–Cantor (JC)
model (Jukes and Cantor 1969). Under the JC model, an estimate for the evolutionary
distance between two nucleotide sequences is:
\[ \hat{d} = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\, p\right). \]
For the example above, the estimated genetic distance $\hat{d} \approx 0.571605$, and this is
an estimate of the expected number of substitutions per site. This model assumes that
all substitutions are equally likely and that the frequencies of all nucleotides are equal
and at equilibrium. This chapter describes the JC model and related continuous-time
Markov processes (CTMPs). In general these models are assumed to act independently
and identically across sites in a sequence alignment.
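The worked example above can be reproduced with a few lines of Python. This is a minimal sketch with illustrative function names; it assumes the two sequences are already aligned and of equal length, and it treats the correction as undefined once p reaches 3/4.

```python
from math import log


def hamming_distance(seq1: str, seq2: str) -> int:
    """Number of aligned sites at which the two sequences differ (h)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))


def jukes_cantor_distance(p: float) -> float:
    """JC-corrected distance -(3/4) ln(1 - (4/3) p), in substitutions per site."""
    if p >= 0.75:
        raise ValueError("the JC correction is undefined for p >= 3/4")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    s1, s2 = "AATCTGTGTG", "AGCCTGGGTA"
    h = hamming_distance(s1, s2)
    p = h / len(s1)
    print(h, p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as p, and it grows without bound as p approaches 3/4, the proportion of differences expected between two unrelated sequences.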
A CTMP is a stochastic process taking values from a discrete state space at random
times, and which satisfies the Markov property.
Let X(t) be the random variable representing the state of a Markov process at time t.
Assuming that the Markov process is in state i ∈ {A, C, G, T} at time t, then in the next
small moment, the probability that the process transitions to state j ∈ {A, C, G, T} is
governed by the instantaneous transition rate matrix Q:
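$$Q = \begin{bmatrix} \cdot & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & \cdot & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & \cdot & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & \cdot \end{bmatrix},$$
with the diagonal entries $q_{ii} = -\sum_{j \neq i} q_{ij}$, so that the rates for a given state sum
to zero across the row. The off-diagonal entries $q_{ij} > 0$, $i \neq j$, are positive and represent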
rates of flow from nucleotide i to nucleotide j. The diagonal entries qii represent the
total flow out of nucleotide state i and are thus negative. At equilibrium, the total rate of
change per site per unit time is thus:
$$\mu = -\sum_{i} \pi_i q_{ii},$$
where πi is the probability of being in state i at equilibrium, so that μ is just the weighted
average outflow rate at equilibrium.
Figure 3.1 depicts the states and instantaneous transition rates in a general Markov
model on the DNA alphabet. In particular, for a small time Δt and i ≠ j we have:
$$\Pr\{X(t + \Delta t) = j \mid X(t) = i\} \approx q_{ij}\,\Delta t.$$
In general, the transition probability matrix, P(t), provides the probability of being
in state j after time t, assuming the process began in state i:
P(t) = exp(Qt).
For simple models, such as the Jukes–Cantor model (see following section) the ele-
ments of the transition probability matrix have analytical closed-form solutions. How-
ever, for more complex models (including general time-reversible; GTR) this matrix
exponentiation can only be computed numerically, typically by eigendecomposition
(Stewart 1994).
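To make the relationship P(t) = exp(Qt) concrete, the sketch below computes the matrix exponential in plain Python. Note that it does not use the eigendecomposition mentioned above; as a simpler stand-in it uses a truncated Taylor series combined with scaling and squaring. The function names and the default settings (`squarings`, `terms`) are our own illustrative choices.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Qt) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / 2 ** squarings
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    # Taylor series of exp(small): I + small + small^2/2! + ...
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(Qt) = (exp(Qt / 2^s))^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# A rate matrix with all off-diagonal rates equal to 1/3 and rows summing to zero.
q_equal = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
p = transition_matrix(q_equal, 0.5)
# For this equal-rates matrix the diagonal of P(t) has a known closed form.
p_same_closed_form = 0.25 + 0.75 * math.exp(-4.0 * 0.5 / 3.0)
```

Each row of the resulting matrix should sum to one, and for this particular matrix the diagonal entry can be compared with the closed-form value.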
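[Figure 3.1: the four nucleotide states A, C, G and T, with an arrow labelled by the instantaneous transition rate q_ij for every ordered pair of distinct states.]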
because they describe processes that are more mathematically tractable, as opposed to
biologically realistic.
A stationary CTMP has the following properties:
πQ = 0, π P(t) = π , ∀t,
3.2.1 Jukes–Cantor
The Jukes–Cantor process (Jukes and Cantor 1969) is the simplest CTMP. All transi-
tions have equal rates, and all bases have equal frequencies (πA = πC = πG = πT =
1/4). An unnormalised Q̂ matrix for the Jukes–Cantor model is:
$$\hat{Q} = \begin{bmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{bmatrix}.$$
However, it is customary when describing substitution processes with a CTMP to
use a normalised instantaneous rate matrix Q = β Q̂, so that the normalised matrix
has an expected mutation rate of 1 per unit time, i.e. $\mu = -\sum_i \pi_i q_{ii} = 1$. This can be
achieved by choosing $\beta = 1 / \left(-\sum_i \pi_i \hat{q}_{ii}\right)$. For the above Q̂ matrix, setting Q = (1/3) Q̂ leads
to a normalised Q matrix for the Jukes–Cantor model of:
$$Q = \begin{bmatrix} -1 & 1/3 & 1/3 & 1/3 \\ 1/3 & -1 & 1/3 & 1/3 \\ 1/3 & 1/3 & -1 & 1/3 \\ 1/3 & 1/3 & 1/3 & -1 \end{bmatrix}.$$
Notice that this matrix has no free parameters. The benefit of normalising to unit
output is that the times for which transition probabilities are calculated can now be expressed
in substitutions per site (i.e. genetic distances). The entries of the transition probability
matrix for the Jukes–Cantor process are easily computed and are as follows:
$$p_{ij}(d) = \begin{cases} \frac{1}{4} + \frac{3}{4} \exp\left(-\frac{4}{3} d\right) & \text{if } i = j \\ \frac{1}{4} - \frac{1}{4} \exp\left(-\frac{4}{3} d\right) & \text{if } i \neq j \end{cases}, \qquad (3.1)$$
where d is the evolutionary time in units of substitutions per site. The transition
probabilities from nucleotide A are plotted against genetic distance (d) in Figure 3.2. It
can be seen that at large genetic distances the transition probabilities all asymptote
to 1/4, reflecting the fact that for great enough evolutionary time, all nucleotides are
equally probable, regardless of the initial nucleotide state.
Figure 3.2 The transition probabilities from nucleotide A for the Jukes–Cantor model, plotted
against genetic distance (d = μt). [Plot: JC transition probability against genetic distance d from 0 to 2.0; the curves p_AA(d) and p_AC(d) = p_AG(d) = p_AT(d) both approach 0.25.]
Armed with the transition probabilities in Equation (3.1) it is possible to develop a
probability distribution over Hamming distance h for a given genetic distance (d) between
two sequences of length l:
$$L(d) = \Pr(h \mid d) = \binom{l}{h}\, p_{ii}(d)^{\,l-h}\, [1 - p_{ii}(d)]^{h}.$$
Using the example from the beginning of the chapter, where the two sequences were
of length l = 10 and differed at h = 4 sites, the (log-)likelihood as a function of d is
shown in Figure 3.3.
[Figure 3.3: the log-likelihood ln L(d) plotted against d.]
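The likelihood of the Hamming distance can be evaluated directly from Equation (3.1). The sketch below, with function names of our own choosing, locates the maximum of L(d) for l = 10 and h = 4 by a simple grid search; the grid spacing and range are arbitrary illustrative choices.

```python
import math

def jc_p_same(d):
    """Probability p_ii(d) that a site is in the same state after distance d."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

def hamming_likelihood(d, h, l):
    """Binomial probability of observing h differences among l sites."""
    p = jc_p_same(d)
    return math.comb(l, h) * p ** (l - h) * (1.0 - p) ** h

# Grid search for the maximum-likelihood distance (step 0.0001, up to d = 3).
grid = [i / 10000.0 for i in range(1, 30001)]
d_ml = max(grid, key=lambda d: hamming_likelihood(d, 4, 10))
print(d_ml)
```

The maximum falls where 1 − p_ii(d) equals the observed proportion h/l, so it should agree, up to the grid spacing, with the Jukes–Cantor corrected distance of about 0.5716.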
3.2.2 K80
The K80 model (Kimura 1980) distinguishes between transitions (A ←→ G and
C ←→ T state changes) and transversions (state changes from a purine to pyrimidine
or vice versa). The model assumes base frequencies are equal for all characters. The
transition/transversion bias is governed by the κ parameter and the Q matrix is:
$$Q = \beta \begin{bmatrix} -2-\kappa & 1 & \kappa & 1 \\ 1 & -2-\kappa & 1 & \kappa \\ \kappa & 1 & -2-\kappa & 1 \\ 1 & \kappa & 1 & -2-\kappa \end{bmatrix}.$$
Figure 3.4 plots the first row in the K80 transition probability matrix as a function of
genetic distance, for κ = 4.
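The K80 transition probabilities are not written out in the text. The sketch below uses the standard closed-form expressions for the normalised K80 model (β = 1/(κ + 2)), which we state here as an assumption rather than as the formulae used to draw Figure 3.4; the function name is ours.

```python
import math

def k80_probabilities(d, kappa):
    """Return (p_same, p_transition, p_transversion) at genetic distance d.

    p_transversion is the probability of one specific transversion, of which
    there are two for every starting nucleotide.
    """
    e1 = math.exp(-4.0 * d / (kappa + 2.0))
    e2 = math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0))
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1
    return p_same, p_transition, p_transversion

# With kappa = 4 the transition probability crosses 1/4 at d = ln 2.
same, ts, tv = k80_probabilities(math.log(2.0), 4.0)
```

Under these expressions the transition probability p_AG(d) equals 1/4 exactly at d = ln 2 when κ = 4, and exceeds 1/4 beyond that point before decaying back towards it.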
3.2.3 F81
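The F81 model (Felsenstein 1981) allows for unequal base frequencies (πA ≠ πC ≠
πG ≠ πT). The Q matrix is:
$$Q = \beta \begin{bmatrix} \pi_A - 1 & \pi_C & \pi_G & \pi_T \\ \pi_A & \pi_C - 1 & \pi_G & \pi_T \\ \pi_A & \pi_C & \pi_G - 1 & \pi_T \\ \pi_A & \pi_C & \pi_G & \pi_T - 1 \end{bmatrix}.$$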
Figure 3.4 The transition probabilities from nucleotide A for the K80 model with κ = 4, plotted
against genetic distance (d = μt). Although, in this case, p_AG(d) exceeds 1/4 at genetic distances
above d = ln 2 ≈ 0.693147, all transition probabilities still asymptote to 1/4 for very large d.
Figure 3.5 The transition probabilities from nucleotide A for the F81 model with
π = (πA = 0.429, πC = 0.262, πG = 0.106, πT = 0.203), plotted against genetic
distance (d = μt).
Figure 3.5 shows how these transition probabilities asymptotically approach the
equilibrium base frequency of the final state with increasing genetic distance.
3.2.4 HKY
The HKY process (Hasegawa et al. 1985) was introduced to better model the substitu-
tion process in primate mtDNA. The model combines the parameters in the K80 and
F81 models to allow for both unequal base frequencies and a transition/transversion
bias. The Q matrix has the following structure:
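$$Q = \beta \begin{bmatrix} \cdot & \pi_C & \kappa\pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa\pi_T \\ \kappa\pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa\pi_C & \pi_G & \cdot \end{bmatrix}.$$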
The diagonal entries are omitted for clarity, but as usual are set so that the rows sum
to zero. The transition probabilities for the HKY model can be computed and have a
closed-form solution; however, the formulae are rather long-winded and are omitted for
brevity.
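As an illustration of how the HKY matrix is assembled and normalised, the following Python sketch fills in the off-diagonal entries, sets the diagonal so that rows sum to zero, and chooses β so that the expected rate is one substitution per unit time. The function name and data layout are our own; the base frequencies are those quoted for Figure 3.5 and κ = 2 is an arbitrary value.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(pi, kappa):
    """Normalised HKY rate matrix; pi lists frequencies in the order A, C, G, T."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                # Rate towards base j is its frequency, scaled by kappa
                # when the change is a transition.
                q[i][j] = pi[j] * (kappa if (x, y) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # diagonal makes the row sum to zero
    mu = -sum(pi[i] * q[i][i] for i in range(4))
    beta = 1.0 / mu  # normalise to one expected substitution per unit time
    return [[beta * v for v in row] for row in q]

pi = [0.429, 0.262, 0.106, 0.203]
q_hky = hky_rate_matrix(pi, 2.0)
```

Setting κ = 1 recovers F81, equal frequencies recover K80, and both together recover the normalised Jukes–Cantor matrix with off-diagonal entries of 1/3.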
3.2.5 GTR
The GTR model is the most general time-reversible stationary CTMP for describing the
substitution process. The Q matrix is:
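$$Q = \beta \begin{bmatrix} \cdot & a\pi_C & b\pi_G & c\pi_T \\ a\pi_A & \cdot & d\pi_G & e\pi_T \\ b\pi_A & d\pi_C & \cdot & \pi_T \\ c\pi_A & e\pi_C & \pi_G & \cdot \end{bmatrix}.$$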
3.3 Codon models
Protein-coding genes have a natural pattern due to the genetic code that can be exploited
by extending a four-nucleotide state space to a 64-codon state space.
Two pairs of researchers published papers on codon-based Markov models of substi-
tution in the same volume in 1994 (Goldman and Yang 1994; Muse and Gaut 1994).
The key features they shared were:
• a 61-codon state space (excluding the three stop codons);
• a zero rate for substitutions that changed more than one nucleotide in a codon at
any given instant (so each codon has nine immediate neighbours, minus any stop
codons);
• a synonymous/nonsynonymous bias parameter making synonymous mutations,
that is mutations that do not change the amino acid that the codon codes for, more
likely than nonsynonymous mutations.
Given two codons i = (i1 , i2 , i3 ) and j = (j1 , j2 , j3 ) the Muse–Gaut-94 codon model
has the following entries in the Q matrix:
$$q_{ij} = \begin{cases} \beta\omega\pi_{j_k} & \text{if nonsynonymous change at codon position } k \\ \beta\pi_{j_k} & \text{if synonymous change at codon position } k \\ 0 & \text{if codons } i \text{ and } j \text{ differ at more than one position} \end{cases}.$$
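The structure of these entries can be sketched in Python as follows. We assume the standard genetic code (written here in the conventional TCAG ordering) and a dictionary `pi` of nucleotide frequencies; the function names are ours, and the diagonal entries, which are set so that rows sum to zero, are not computed here.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop codon).
GENETIC_CODE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}
SENSE_CODONS = [c for c, aa in GENETIC_CODE.items() if aa != "*"]

def mg94_rate(codon_i, codon_j, pi, omega, beta=1.0):
    """Off-diagonal rate q_ij between two codons (0.0 is returned for i == j)."""
    if "*" in (GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]):
        return 0.0  # stop codons are excluded from the state space
    changed = [k for k in range(3) if codon_i[k] != codon_j[k]]
    if len(changed) != 1:
        return 0.0  # identical codons, or more than one nucleotide differs
    target = codon_j[changed[0]]  # nucleotide j_k at the changed position k
    if GENETIC_CODE[codon_i] == GENETIC_CODE[codon_j]:
        return beta * pi[target]          # synonymous change
    return beta * omega * pi[target]      # nonsynonymous change

def neighbours(codon, pi, omega):
    """Sense codons reachable from `codon` in a single substitution."""
    return [c for c in SENSE_CODONS if mg94_rate(codon, c, pi, omega) > 0.0]

pi = {"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25}
omega = 0.5
```

For example, TTT has all nine single-nucleotide neighbours available, whereas TAT has only seven because TAA and TAG are stop codons.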
A microsatellite (or short tandem repeat; STR) is a region of DNA in which a short DNA
sequence motif (length one to six nucleotides) is repeated in an array, e.g. the sequence
AGAGAGAGAGAGAG is a dinucleotide microsatellite comprising seven repeats of
the motif AG. Because they are abundant, widely distributed in the genome and highly
polymorphic, microsatellites have become one of the most popular genetic markers for
making population genetic inferences in closely related populations.
Unequal crossing over (Richard and Pâques 2000; Smith 1976) and replication slip-
page (Levinson and Gutman 1987) are the two main mechanisms proposed to explain
the high mutation rate of microsatellites. The simplest microsatellite model is the step-
wise mutation model (SMM) proposed by Ohta and Kimura (1973), which states that
the length of the microsatellite increases or decreases by one repeat unit at a rate
independent of the microsatellite length. For SMM the Q matrix has the following
entries:
$$q_{ij} = \begin{cases} \beta & \text{if } |i - j| = 1 \\ 0 & \text{if } |i - j| > 1 \\ -\sum_{k \neq i} q_{ik} & \text{if } i = j \end{cases}.$$
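A small Python sketch of the SMM rate matrix follows. To obtain a finite matrix we assume that repeat counts are confined to a range between `min_repeats` and `max_repeats`; this truncation, and the function name, are our own illustrative choices and are not part of the model definition above.

```python
def smm_rate_matrix(min_repeats, max_repeats, beta=1.0):
    """Stepwise mutation model on repeat counts min_repeats..max_repeats."""
    lengths = list(range(min_repeats, max_repeats + 1))
    n = len(lengths)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(lengths[i] - lengths[j]) == 1:
                q[i][j] = beta  # gain or loss of exactly one repeat unit
        q[i][i] = -sum(q[i])    # diagonal makes the row sum to zero
    return lengths, q

lengths, q_smm = smm_rate_matrix(1, 5, beta=0.5)
```

The matrix is tridiagonal: interior states leave at total rate 2β, while the two boundary states of the truncated range leave at rate β.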
There are a large number of more complex models that have been introduced in the
literature to account for length-dependent mutation rates, mutational bias (unequal rates
of expansion and contraction) and ‘two-phase’ dynamics, in which the length of the
repeat changes by more than one repeat unit with a single mutation. For a review of these
models, and a description of a nested family of microsatellite models that encompasses
most of the variants, see Wu and Drummond (2011).
where Ω = {Q, μ} includes parameters of the substitution model and the overall
rate μ.
Let DY represent the (n − 1) × L matrix whose rows are the unknown ancestral
sequences at the internal nodes (Y) of the tree g:
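$$D_Y = \begin{pmatrix} D_{Y_1} \\ D_{Y_2} \\ \vdots \\ D_{Y_{n-1}} \end{pmatrix} = \begin{pmatrix} s_{Y_1,1} & s_{Y_1,2} & \cdots & s_{Y_1,L} \\ s_{Y_2,1} & s_{Y_2,2} & \cdots & s_{Y_2,L} \\ \vdots & \vdots & \ddots & \vdots \\ s_{Y_{n-1},1} & s_{Y_{n-1},2} & \cdots & s_{Y_{n-1},L} \end{pmatrix}.$$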
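$$\Pr\{D \mid g, \Omega\} = \sum_{D_Y \in \mathcal{D}} \prod_{\langle i,j \rangle \in R} \prod_{k=1}^{L} \left[ e^{Q\mu(t_i - t_j)} \right]_{s_{i,k},\, s_{j,k}}.$$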
(In the above formula, compact notation is obtained by including in the product over
edges an edge terminating at the root from an ancestor of infinite age.)
The sum over all possible ancestral sequences D_Y looks onerous, but Felsenstein (1981)
introduced an efficient polynomial-time pruning algorithm that makes this summation
over unknown ancestral sequences feasible (see Section 3.7).
It is common to allow rate variation across sites, and a key component of most models
of rate variation across sites is the discrete gamma model introduced by Yang (1994).
For K discrete categories this involves K times the computation as can be seen in the
likelihood where an extra sum is used to average the likelihood over the K categories
for each site in the alignment:
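$$\Pr\{D \mid g, \Omega\} = \prod_{k=1}^{L} \sum_{c=1}^{K} \frac{1}{K} \left( \sum_{D_Y \in \mathcal{D}} \prod_{\langle i,j \rangle \in R} \left[ e^{Q\mu r_c (t_i - t_j)} \right]_{s_{i,k},\, s_{j,k}} \right).$$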
Here r_c is the relative rate of the cth rate category in the discrete gamma distribution
and Ω = {Q, μ, α} includes the shape parameter (α) of the discrete gamma model of
rate variation across sites, governing the values of r_1, r_2, . . . , r_K. Figure 3.7 shows how
the density of the continuous gamma distribution varies from L-shaped (α ≤ 1) to
bell-shaped (α ≫ 1) with the shape parameter α.
Despite this flexibility, the introduction of a separate category of invariant sites that
are assumed to have an evolutionary rate of zero can improve the fit to real data. This
is the so-called Γ + I approach to modelling rate variation (Gu et al. 1995; Waddell
and Penny 1996). Model selection will often favour Γ + I over Γ, although accurate
estimation of the two parameters (a proportion of invariant sites, p_inv, and α) is highly
sensitive to sampling effects (Sullivan and Swofford 2001; Sullivan et al. 1999).
Figure 3.7 The gamma distribution for different values of the shape parameter α with scale
parameter set to 1/α so that the mean is 1 in all cases. [Plot: probability density for α = 0.25, 1, 4 and 16.]
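3.7 Felsenstein’s pruning algorithm
Figure 3.8 An example tree to illustrate the ‘pruning’ algorithm. [Tree: leaves 1 to 5 and internal nodes 6 to 9; the root 9 has children 7 and 8, node 7 has children 1 and 2, node 8 has children 5 and 6, and node 6 has children 3 and 4; branch b_j leads to node j.]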
The ‘pruning’ algorithm for computing the phylogenetic likelihood was introduced by
Felsenstein (1981). In the following discussion we will consider a single site s and the
corresponding nucleotide states associated with the ancestral nodes sY in the five-taxon
tree in Figure 3.8:
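$$s = \begin{pmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \end{pmatrix}, \qquad s_Y = \begin{pmatrix} s_6 \\ s_7 \\ s_8 \\ s_9 \end{pmatrix}.$$
Together they form a full observation:
$$s_V = \begin{pmatrix} s \\ s_Y \end{pmatrix} = \begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ s_9 \end{pmatrix}.$$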
Consider the tree in Figure 3.8. If we had full knowledge of the sequences at internal
nodes of this tree then the probability of site pattern sV can be easily computed as a
product of transition probabilities over all branches ⟨i, j⟩ multiplied by the prior
probability of the nucleotide state at the root node ($\pi_{s_9}$), i.e.:
$$\Pr\{s_V \mid g\} = \pi_{s_9} \prod_{\langle i,j \rangle} p_{s_i s_j}(b_j).$$
However, since we don’t know the ancestral nucleotides sY we must integrate over
their possible values to get the probability of the leaf data:
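$$\begin{aligned} \Pr\{s \mid g\} &= \sum_{s_Y} \Pr\{s_V \mid g\} \\ &= \sum_{s_9 \in C} \sum_{s_8 \in C} \sum_{s_7 \in C} \sum_{s_6 \in C} \pi_{s_9}\, p_{s_9 s_7}(b_7)\, p_{s_7 s_1}(b_1)\, p_{s_7 s_2}(b_2) \\ &\qquad \times\, p_{s_9 s_8}(b_8)\, p_{s_8 s_5}(b_5)\, p_{s_8 s_6}(b_6)\, p_{s_6 s_3}(b_3)\, p_{s_6 s_4}(b_4). \end{aligned}$$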
Felsenstein makes the point that you can move the summations in the above equation
rightwards to reduce the amount of repeated calculation:
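$$\begin{aligned} \Pr\{s \mid g\} = \sum_{s_9 \in C} \pi_{s_9} &\left\{ \sum_{s_7 \in C} p_{s_9 s_7}(b_7)\, p_{s_7 s_1}(b_1)\, p_{s_7 s_2}(b_2) \right\} \\ \times &\left\{ \sum_{s_8 \in C} p_{s_9 s_8}(b_8)\, p_{s_8 s_5}(b_5) \left[ \sum_{s_6 \in C} p_{s_8 s_6}(b_6)\, p_{s_6 s_3}(b_3)\, p_{s_6 s_4}(b_4) \right] \right\}. \end{aligned} \qquad (3.2)$$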
Notice that the pattern of brackets mirrors the topology of the tree. This is a clue that
Equation (3.2) can be redefined in terms of a recursion that can be efficiently computed
by dynamic programming.
Now define a matrix of partial likelihoods:
$$L = \begin{pmatrix} L_{1,A} & L_{1,C} & L_{1,G} & L_{1,T} \\ L_{2,A} & L_{2,C} & L_{2,G} & L_{2,T} \\ \vdots & \vdots & \vdots & \vdots \\ L_{2n-1,A} & L_{2n-1,C} & L_{2n-1,G} & L_{2n-1,T} \end{pmatrix},$$
where Li,c is the partial likelihood of the data under node i, given the ancestral character
state at node i is c ∈ C. The entries of L can be defined recursively. Assuming the two
descendant branches of internal node y are ⟨y, j⟩ and ⟨y, k⟩ we have:
$$L_{y,c} = \left[ \sum_{x \in C} p_{cx}(b_j)\, L_{j,x} \right] \times \left[ \sum_{x \in C} p_{cx}(b_k)\, L_{k,x} \right].$$
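The recursion can be sketched in a few lines of Python for the five-taxon tree of Figure 3.8, with the topology read off Equation (3.2). Several things here are our own assumptions: Jukes–Cantor transition probabilities with a uniform root distribution, hypothetical branch lengths, the usual convention that a leaf has partial likelihood 1 for its observed state and 0 otherwise, and all function names. A brute-force sum over the 256 ancestral state combinations is included as a check.

```python
import itertools
import math

STATES = "ACGT"
# Children of each internal node of the tree in Figure 3.8.
CHILDREN = {9: (7, 8), 7: (1, 2), 8: (5, 6), 6: (3, 4)}
# Hypothetical branch lengths b_1..b_8; branch j leads to node j.
BRANCH = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.05, 5: 0.3, 6: 0.1, 7: 0.25, 8: 0.2}

def jc_prob(x, y, d):
    """Jukes-Cantor transition probability p_xy(d), Equation (3.1)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node, tips):
    """Partial likelihoods L[node][c] for each state c, computed recursively."""
    if node not in CHILDREN:  # leaf: indicator of the observed state
        return {c: 1.0 if c == tips[node] else 0.0 for c in STATES}
    result = {c: 1.0 for c in STATES}
    for child in CHILDREN[node]:
        lower = partial(child, tips)
        for c in STATES:
            result[c] *= sum(jc_prob(c, x, BRANCH[child]) * lower[x]
                             for x in STATES)
    return result

def pruning_likelihood(tips):
    """Pr{s|g}: root partial likelihoods weighted by the root frequencies."""
    root = partial(9, tips)
    return sum(0.25 * root[c] for c in STATES)

def brute_force_likelihood(tips):
    """The same quantity by explicit summation over all ancestral states."""
    total = 0.0
    for s6, s7, s8, s9 in itertools.product(STATES, repeat=4):
        s = dict(tips)
        s.update({6: s6, 7: s7, 8: s8, 9: s9})
        prob = 0.25
        for parent, kids in CHILDREN.items():
            for kid in kids:
                prob *= jc_prob(s[parent], s[kid], BRANCH[kid])
        total += prob
    return total

tips = {1: "A", 2: "A", 3: "C", 4: "C", 5: "G"}
site_likelihood = pruning_likelihood(tips)
```

The recursive version visits each branch once per pair of states, whereas the brute-force sum grows exponentially with the number of internal nodes; the two should agree to within rounding error.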
3.8 Miscellanea
The dual concepts of a time-tree and a molecular clock are central to any attempt at
interpreting the chronological context of molecular variation. As we saw in Chapter 2,
several natural tree prior distributions (e.g. coalescent and birth–death families) generate
time-trees, rather than unrooted trees. But how are these time-trees reconciled with
the genetic differences between sequences modelled in Chapter 3? The answer is the
application of a molecular clock. The concept of a molecular clock traces back at least
to the 1960s. One of the early applications of the molecular clock was instrumental
in a celebrated re-calibration of the evolutionary relatedness of humans to other great
apes, when Allan Wilson and Vincent Sarich described an ‘evolutionary clock’ for
albumin proteins and exploited the clock to date the common ancestor of humans and
chimpanzees to five million years ago (Sarich and Wilson 1967).
The molecular clock is not a metronome. Each tick of the clock occurs after a stochastic
waiting time, determined by the substitution rate μ. Let D(t) be a random variable
representing the number of substitutions experienced over evolutionary time t. The
probability that exactly k substitutions have been experienced in time t is:
$$\Pr\{D(t) = k\} = \frac{e^{-\mu t} (\mu t)^k}{k!}.$$
Figure 4.1 shows six realisations of the Poisson accumulation process, D(t), showing
the total number of substitutions accumulated through time. All of the realisations have
the same substitution rate (μ = 1), but the time at which they reach 25 substitutions
varies substantially. This figure was produced with the following R code:
# set substitution rate to 1.0
mu <- 1.0
# construct matrix with six Poisson trajectories
# each column has 25 cumulating exponential waiting times
ti <- replicate(6, c(0, cumsum(rexp(25, mu))))
# construct a parallel matrix of unit steps (0, 1, ..., 25 in each column)
k <- replicate(ncol(ti), 0:(nrow(ti) - 1))
# plot each column of the matrix in unit steps
matplot(ti, k, type = "s", xlab = "t", ylab = "D(t)", lty = 1)
Figure 4.1 Six realisations of a Poisson accumulation process (μ = 1) showing the total number
of substitutions accumulated through time. All the realisations have the same substitution rate,
but the time at which they reach 25 substitutions varies substantially. [Plot: D(t) against t.]
Figure 4.2 The Poisson probability distribution of experiencing k substitution events in genetic
time d = μt = 12. [Plot: Pr(k|d=12) against k from 0 to 25.]
Figure 4.3 A time-tree of four taxa, branches labelled with rates of evolution and the resulting
non-clock-like tree with branches drawn proportional to substitutions per site.
with 95% of such lineages having between 6 and 18 substitutions inclusive. The plot in
this figure can be produced in R with the following two lines of code:
# Plot Poisson distribution; k = 0 to 25 events, d = mu * t = 12
k <- 0:25; d <- 12
barplot(dpois(k, d), names.arg = k, xlab = "k", ylab = "Pr(k|d=12)")
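The interval of 6 to 18 substitutions quoted above for d = 12 can be checked by summing the Poisson probabilities directly. The short Python sketch below does this with a helper function of our own naming.

```python
import math

def poisson_pmf(k, d):
    """Probability of exactly k substitutions at genetic distance d = mu * t."""
    return math.exp(-d) * d ** k / math.factorial(k)

# Total probability of observing between 6 and 18 substitutions inclusive.
mass_6_to_18 = sum(poisson_pmf(k, 12.0) for k in range(6, 19))
print(mass_6_to_18)
```

The printed value is the probability mass covered by the interval, which can be compared with the figure of roughly 95% given in the text.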
Researchers have grappled with the tension between molecular and non-molecular
evidence for evolutionary time scales since Wilson and Sarich’s ground-breaking
work (Sarich and Wilson 1967; Wilson and Sarich 1969) on the ancestral relationship
between humans and chimpanzees. Recently, a number of authors have developed
‘relaxed molecular clock’ methods. These methods accommodate variation in the rate
of molecular evolution from lineage to lineage (see Figure 4.3 which depicts a time-tree
of four taxa, branches labelled with varying rates of evolution, alongside a tree with
branch lengths that display the resulting non-clock-like genetic distances). In addition
to allowing non-clock-like relationships among sequences related by a phylogeny,
modelling rate variation among lineages in a gene tree also enables researchers to
incorporate multiple calibration points that may not be consistent with a strict molecular
clock. These calibration points can be associated either with the internal nodes of the
tree or the sampled sequences themselves. Furthermore, relaxed molecular clock models
appear to fit real data better than either a strict molecular clock or the other extreme
of no clock assumption at all. In spite of these successes, controversy still remains
around the particular assumptions underlying some of the popular relaxed molecular
clock models currently employed.
Figure 4.4 A time-tree of four taxa, branches labelled with rates of evolution. [Tree: tips 1 to 4 at time 0, internal node times t5, t6 and t7, and branch rates r1 to r6.]
A number of authors argue that changes in the rate
of evolution do not necessarily occur smoothly nor on every branch of a gene tree. The
alternative view holds that large subtrees share the same underlying rate of evolution and
that any variation can be described entirely by the stochastic nature of the evolutionary
process. These phylogenetic regions or subtrees of rate homogeneity are separated by
changes in the rate of evolution. This alternative model may be especially important
for gene trees that have dense taxon sampling, in which case there are potentially many
short, closely related lineages, amongst which there is no reason a priori to assume
differences in the underlying rate of substitution.
A Bayesian framework for allowing the substitution rate to vary across branches has
the following structure:
$$f(g, r, \theta \mid D) = \frac{\Pr\{D \mid g, r, \theta\}\, f(r \mid \theta, g)\, f(g \mid \theta)\, f(\theta)}{\Pr\{D\}}, \qquad (4.1)$$
where r is a vector of substitution rates, one for each branch in the tree (g) (Figure 4.4).
For brevity, the vector θ represents all other parameters in the model, including the
parameters of the relaxed-clock model that govern r (e.g. μ and S² in the case of
log-normally distributed rates among branches), as well as the parameters of the substitution
model and tree prior. As usual, D is the multiple sequence alignment. In the following
sections we will survey a number of alternative approaches to relaxing the molecular
clock. In most cases the term f (r|θ, g) can be further broken down into a product of
densities over all the branches in the tree:
$$f(r \mid \theta, g) = \prod_{\langle i,j \rangle \in R} f(r_j \mid \theta),$$
where j is a unique index associated with the tip-ward node of branch ⟨i, j⟩ in tree g.
where A(i) is the index of the parent branch of the ith branch, rA(root) = μR by definition,
and τi is the time separating branch i from its parent. The precise definition of τi and the
means by which the ri parameters are applied to the tree has varied in the literature. In
the original paper τi was defined as the time between the midpoint of branch i and the
midpoint of its parent branch:
where ti is the time of the tip-ward node of the ith branch and bi = tA(i) − ti .
Subsequently, Kishino et al. (2001) associated a rate ri with the ith node (the node on
the tip-ward end of the branch i, whose time is ti ) and applied a bias-corrected geometric
Brownian motion model:
$$f(r_i \mid \mu_R, \sigma^2) = \frac{1}{r_i \sqrt{2\pi b_i \sigma^2}} \exp\left\{ -\frac{1}{2 b_i \sigma^2} \left[ \ln(r_i / r_{A(i)}) + \frac{b_i \sigma^2}{2} \right]^2 \right\}.$$
The aim of the bias correction term $b_i \sigma^2 / 2$ is to produce a child substitution rate whose
expectation is equal to that of the parent rate, i.e. E(r_i) = r_A(i). Finally, instead of using
r_i as the substitution rate of the ith branch, Kishino et al. (2001) used the arithmetic
average of the two node-associated rates: (r_i + r_A(i))/2.
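The effect of the bias correction can be illustrated by simulation. In the Python sketch below, the logarithm of the child rate is drawn from a normal distribution with mean ln r_A(i) − b_i σ²/2 and variance b_i σ², which is the distribution implied by the density above. The function name, the seed and the parameter values are arbitrary choices of ours.

```python
import math
import random

def sample_child_rate(parent_rate, branch_length, sigma2, rng):
    """Draw r_i given r_A(i) under bias-corrected geometric Brownian motion."""
    var = branch_length * sigma2
    # The -var/2 shift of the log-scale mean is the bias correction term.
    return parent_rate * math.exp(rng.gauss(-var / 2.0, math.sqrt(var)))

rng = random.Random(1)
parent_rate, branch_length, sigma2 = 2.0, 0.5, 0.4
draws = [sample_child_rate(parent_rate, branch_length, sigma2, rng)
         for _ in range(100000)]
mean_child_rate = sum(draws) / len(draws)
print(mean_child_rate)
```

With the correction in place the sample mean of the child rates should be close to the parent rate; omitting the shift would inflate the expected child rate by a factor of exp(b_i σ²/2).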
There are many variations on the autocorrelated clock models described above,
including the compound Poisson process (Huelsenbeck et al. 2000), and the Thorne–
Kishino continuous autocorrelated model (Thorne and Kishino 2002). The first draws
the rate multiplier ri from a Poisson distribution, while the latter draws ri from a
log-normal with mean rA(i) and variance proportional to the branch length.
where M is the mean of the logarithm of the rate and S² is the variance of the log rate.
An alternative parameterisation used by Rannala and Yang (both parameterisations are
available in BEAST) employs the mean rate μ instead of M:
$$f(r_i \mid \mu, S) = \frac{1}{r_i \sqrt{2\pi S^2}} \exp\left\{ -\frac{1}{2S^2} \left[ \ln(r_i / \mu) + \tfrac{1}{2} S^2 \right]^2 \right\}.$$
Figure 4.5 Left, the indicator variables for changes in rate. Centre, branches labelled with
corresponding elements of Φ. Right, the resulting rates on the branches, once the indicator mask
is applied.
The simpler of the two flavours of random local clock computes branch rates for all
nodes apart from the root as follows:
$$r_i = \begin{cases} r_{A(i)} & \text{if } \beta_i = 0 \\ \phi_i & \text{otherwise} \end{cases}.$$
From this it can be seen that B acts as a set of binary indicators, one for each branch.
If βi = 0 then the ith branch simply inherits the same substitution rate as its parent
branch, otherwise it takes up an entirely independent rate, φi . Thus the branches where
βi = 1 are the locations in the tree where the substitution rate changes from one local
clock to the next (see Figure 4.5). Setting all βi to zero gives rise to a strict clock.
The second flavour, and the one described in detail by Drummond and Suchard (2010),
computes branch rates as follows:
$$r_i = \begin{cases} r_{A(i)} & \text{if } \beta_i = 0 \\ \phi_i \times r_{A(i)} & \text{otherwise} \end{cases}.$$
This model retains some autocorrelation in substitution rate from a parent local clock
to a ‘child’ local clock, with the parameters of Φ interpreted as relative rate modifiers
in this framework, rather than absolute rates as in the first construction. Regardless, in
order to sample either of these models in a Bayesian framework, the key to success is
the construction of an appropriate prior on Φ and B. Drummond and Suchard (2010)
defined K, the number of rate change indicators:
$$K = \sum_{i=1}^{2n-2} \beta_i$$
and suggested a prior on B that would induce a truncated Poisson prior distribution
on K:
$$K \sim \text{Truncated-Poisson}(\lambda).$$
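Both flavours of the random local clock can be sketched with a single recursive function in Python. The tree encoding (a dictionary mapping each branch to its parent, with `None` marking the root), the explicit `root_rate` argument and all names are our own assumptions for illustration; the example values of β and φ are arbitrary.

```python
def random_local_clock_rates(parent, beta, phi, root_rate, relative=False):
    """Branch rates under a random local clock.

    parent[i] is the parent of branch i (None for the root), beta[i] is the
    0/1 rate change indicator and phi[i] the rate (or, when relative=True,
    the rate multiplier) used on branches where beta[i] == 1.
    """
    rates = {}

    def rate(i):
        if parent[i] is None:
            return root_rate
        if i not in rates:
            inherited = rate(parent[i])
            if beta[i] == 0:
                rates[i] = inherited           # same local clock as parent
            elif relative:
                rates[i] = phi[i] * inherited  # second flavour
            else:
                rates[i] = phi[i]              # first flavour
        return rates[i]

    return {i: rate(i) for i in parent if parent[i] is not None}

# A four-taxon tree: tips 1-4, internal nodes 5 and 6, root 7.
parent = {1: 5, 2: 5, 3: 6, 4: 7, 5: 6, 6: 7, 7: None}
beta = {1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1}
phi = {1: 1.5, 2: 2.0, 3: 0.7, 4: 1.1, 5: 0.9, 6: 0.5}
absolute = random_local_clock_rates(parent, beta, phi, root_rate=1.0)
scaled = random_local_clock_rates(parent, beta, phi, root_rate=1.0, relative=True)
k_changes = sum(beta.values())  # K, the number of rate changes
```

In this example K = 2: branch 6 starts a new local clock that is inherited by branches 5, 3 and 1, and branch 2 starts another. Setting every indicator to zero reduces both flavours to a strict clock at the root rate.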
The prior on Φ can be chosen relatively freely. Independent and identically distributed
draws from a log-normal distribution would be quite appropriate for the elements
of Φ in the uncorrelated RLC model. Independent and identically distributed draws from a
gamma distribution have been suggested for the elements of Φ in the autocorrelated RLC
model (Drummond and Suchard 2010).
Although the basic idea of the random local clock model is simple, there are a number
of complications to its implementation in a Bayesian framework that are not dealt
with here. These include normalisation of the substitution rate across the phylogeny
in the absence of calibrations and issues related to sampling across models of differing
dimensions. For readers interested in these finer details, we refer you to the original
paper by Drummond and Suchard (2010).
4.4 Calibrating the molecular clock
Molecular clock analysis and divergence-time dating go hand in hand. The calibration
of a molecular clock to an absolute time scale leads directly to ages of divergences in
the associated time-tree. Although divergence-time dating is a well-established corner-
stone of evolutionary biology, there is still no widely accepted objective methodology
for converting information from the fossil record to calibration information for use in
molecular phylogenies.
which molecular divergence the fossil corresponds to, (2) the need to determine the
exact form of the calibration density and (3) the assumption that the fossil does indeed
correspond to a node in the molecular phylogeny (as opposed to some extinct offshoot,
or to a direct ancestor along one of the lineages in between two divergences). These
shortcomings are somewhat intertwined and the last two can be somewhat alleviated
by developing a calibration density for the dated node based on an explicit model of
phylogenetic speciation and extinction. Using stochastic simulation based on a simple
phylogenetic birth–death model, Monte Carlo model-based calibration densities can be
constructed (Matschiner and Bouckaert 2013).
However, regardless of how the calibration density for a phylogenetic divergence
is arrived at, there is an additional challenge for Bayesian node dating. The task of
correctly constructing a Bayesian tree prior that composes one or more fossil calibration
densities with an underlying tree process prior involves computational challenges when
the phylogenetic tree is also inferred (Heled and Drummond 2012, 2013).1
1 The composition of calibrated birth–death tree priors is more straightforward for Bayesian divergence-time
dating on a fixed tree topology (Yang and Rannala 2006).
in a calibration density aims to describe the unknown time separating the fossil from the
divergence it is associated with. That separation time is effectively directly modelled in
total-evidence dating. However, Ronquist et al.’s approach does not fully address the
final shortcoming of node dating because their model effectively posits that all fossils
are ‘extinct offshoots’, by representing them as leaves in the phylogeny.2
2 Although one could perhaps consider a fossil to be a direct ancestor of another leaf if the posterior estimate
of the branch length attaching it to the phylogeny was very short.
5 Structured trees and phylogeography
This chapter describes multi-type trees and various extensions to the basic phylogen-
etic model that can account for population structure, geographical heterogeneity and
epidemiological population dynamics.
1 But note that in some methods described below the sampled taxa may be explicitly recognised as a statistical
sample from underlying populations, of which properties are inferred, such as levels of gene flow, or
geographic isolation.
2 Alternatives to model-based statistical phylogeography are mainly forms of parsimony in various guises,
sometimes ironically called ‘statistical phylogeography’ or ‘statistical parsimony’.
5.2 Multi-type trees
Figure 5.1 A multi-type tree T = (V, R, t, M) with V = I ∪ Y where I = {x, y, z}, Y = {i, j},
R = {⟨x, i⟩, ⟨y, i⟩, ⟨i, j⟩, ⟨z, j⟩} and the coalescence times t and type mappings M are as shown.
Here we have selected the type set D = {blue, red, green, orange}, although this can be
composed of the values of any discrete trait.
and defined such that ϕ⟨i,j⟩(t) is the type associated with the time t on edge ⟨i, j⟩ ∈ R.
Such a tree is illustrated in Figure 5.1.
3 In analyses in which only the transition rate matrix needs to be inferred (without the full multi-type tree) it
is possible to numerically integrate over all possible transition events, which may be computationally less
intensive (e.g. Stadler and Bonhoeffer 2013).
algorithms (Vaughan et al. 2014). More details are provided in the description of the
individual models below.
5.4 The structured coalescent
The structured coalescent (Hudson 1990) can also be employed to study phylogeog-
raphy. The structured coalescent has also been extended to heterochronous data (Ewing
et al. 2004), thus allowing the estimation of migration rates between demes in calendar
units (see Figure 5.2). The serial structured coalescent was first applied to an HIV
data set with two demes to study the dynamics of subpopulations within a patient
(Ewing et al. 2004), but the same type of inference can be made at the level of the
host population. Further development of the model allowed for the number of demes to
change over time (Ewing and Rodrigo 2006a). MIGRATE (Beerli and Felsenstein 2001)
also employs the structured coalescent to estimate subpopulation sizes and migration
rates in both Bayesian and maximum likelihood frameworks and has also recently
been used to investigate spatial characteristics of viral epidemics (Bedford et al. 2010).
Figure 5.2 A simulation of the serially sampled structured coalescent on three demes.
The population size of the three demes is equal (N1 = N2 = N3 = 1000). [Diagram: demes N1, N2 and N3 connected by migration rates m12, m21, m13, m31, m23 and m32.]
Additionally, some studies have focused on the effect of ghost demes (Beerli 2004;
Ewing and Rodrigo 2006b). Although the structured coalescent model is promising,
its application in Bayesian MCMC is computationally demanding because the standard
form of the likelihood calculation (Beerli and Felsenstein 2001; Hudson 1990) requires
that the genealogical tree be augmented with all of the unknown migration events in the
ancestry of the sample. The migration events themselves are typically treated as nuis-
ance parameters and integrated out using MCMC (Beerli and Felsenstein 2001; Ewing
et al. 2004). Recently some effort has been made to apply uniformisation (Fearnhead
and Sherlock 2006; Rodrigue et al. 2008) to obtain efficient Bayesian MCMC sampling
algorithms for structured coalescent inference from large serially sampled data sets
(Vaughan et al. 2014). However, despite this activity, to our knowledge there are no
models explicitly incorporating population structure, heterochronous samples and non-
parametric population size history yet available.
One ad hoc solution is the mugration model described in the previous section, which
involves modelling the migration process along the tree in a way that is conditionally
independent of the population sizes estimated by the skyline plot (Lemey et al. 2009a).
Thus, conditional on the tree, the migration process is independent of the coalescent
prior. However, this approach does not capture the interaction between migration and
coalescence that is implicit in the structured coalescent, since coalescence rates should
depend on the population size of the deme the lineages are in, leading to a natural
interaction between the migration and branching processes. The mugration method
also does not permit the population sizes of the individual demes to be accurately
estimated as part of the analysis. As we will see in the following section, statistical
5.6 Phylogeography in a spatial continuum 73
The birth–death models introduced in Section 2.4 can also be extended to model popula-
tion structure (Kendall 1949). Similarly to the structured coalescent process this results
in a fully probabilistic approach in which the migration process among discrete demes
depends on the characteristics of the demes.
Such multi-type birth–death models come in different flavours, depending on the
research question posed. When samples are indeed taken from geographical locations
with migration among them, migration events should be occurring along the branches
in the phylogeny. In other cases type changes at branching events are more reasonable,
e.g. when trying to identify superspreaders in an HIV epidemic (Stadler and Bonhoeffer
2013).
In either case, one can either employ the multi-type trees introduced above (see
Figure 5.3), or integrate out the migration events such that standard BEAST trees can
be used for inference of the migration rates among types.
When applied to virus epidemics, a birth–death tree prior allows the reconstruction
of epidemiological parameters such as the effective reproduction number R (see Sec-
tion 5.7). Using a structured birth–death model, these parameters can differ among
demes and be estimated separately.
In some cases it is more appropriate to model the spatial aspect of the samples as a
continuous variable. The phylogeography of wildlife host populations has often been
modelled in a spatial continuum by using diffusion models, since viral spread and host
movement tend to be poorly modelled by a small number of discrete demes. One example is
the expansion of geographic range in the eastern United States of the raccoon-specific
rabies virus (Biek et al. 2007; Lemey et al. 2010). Brownian diffusion, via the compar-
ative method (Felsenstein 1985; Harvey and Pagel 1991), has been utilised to model
the phylogeography of feline immunodeficiency virus collected from the cougar (Puma
concolor) population around western Montana. The resulting phylogeographic recon-
struction was used as a proxy for the host demographic history and population structure
due to the predominantly vertical transmission of the virus (Biek et al. 2006). However,
one of the assumptions of Brownian diffusion is rate homogeneity on all branches.
This assumption can be relaxed by extending the concept of relaxed clock models
(Drummond et al. 2006) to the diffusion process (Lemey et al. 2010). Simulations
show that the relaxed diffusion model has better coverage and statistical efficiency than
Brownian diffusion when the underlying process of spatial movement resembles an
over-dispersed random walk.
74 Structured trees and phylogeography
Figure 5.3 Three realisations of the structured coalescent on two demes. The population size of
the two demes is equal (N0 = N1 = 1000) and the migration rates in both directions are
m01 = m10 = 0.00025 in units of expected migrants per generation.
Like their mugration model counterparts, these models ignore the interaction of pop-
ulation density and geographic spread in shaping the sample genealogy. However, there
has been progress in the development of mathematical theory that extends the coalescent
framework to a spatial continuum (Barton et al. 2002, 2010a, 2010b), although no
methods have yet been developed providing inference under these models.
5.7 Phylodynamics with structured trees
A new area, known as phylodynamics (Grenfell et al. 2004; Holmes and Grenfell 2009),
promises to synthesise the distinct fields of mathematical epidemiology and statistical
phylogenetics (Drummond and Rambaut 2007; Drummond et al. 2002, 2012; Ronquist
et al. 2012a; Stadler 2010) to produce a coherent framework (Kühnert et al. 2014;
Leventhal et al. 2014; Mollentze et al. 2014; Palmer et al. 2013; Rasmussen et al. 2011;
Stadler and Bonhoeffer 2013; Stadler et al. 2013; Volz 2012; Volz et al. 2009; Welch
2011) in which genomic data from epidemic pathogens can directly inform sophisti-
cated epidemiological models. Phylodynamics is particularly well suited to inferential
epidemiology because many viral and bacterial pathogens (Gray et al. 2011) evolve
so quickly that their evolution can be directly observed over weeks, months or years
(Kühnert et al. 2011; Pybus and Rambaut 2009; Volz et al. 2013). So far, only part
of the promise of phylodynamics has been realised. Early efforts include: (1) mod-
elling the size of the pathogen population through time using a deterministic model
for the epidemic (Volz 2012; Volz et al. 2009); (2) adopting new types of model for
the transmission tree itself that are more suited to the ways in which pathogens are
spread and sampled (Stadler et al. 2013); and (3) coupling this with an approximation
to a stochastic compartmental model for the pathogen population (Kühnert et al. 2014;
Leventhal et al. 2014). Only the last two of these approaches have been implemented
in software and made available to practitioners. These efforts are just scratching the
surface of this complex problem. They all make approximations that introduce biases
of currently unknown magnitude into estimates.
Figure 5.4 depicts a multi-type (or structured) SIR process, in which there are two
(coupled) demes, each of which is undergoing a stochastic SIR process. The number
of infected individuals in the two demes is shown as a stochastic jump process, and
the vertical lines emphasise the correspondence between events in the tree and events
in the underlying infected populations. Every internal node in the tree corresponds to
an infection event in the local epidemic of one of the demes. Likewise every migration
Figure 5.4 A two-deme phylodynamic time-tree with associated stochastic dynamics of infected
compartments (axes: infected individuals and time). (With thanks to Tim Vaughan for
producing this figure.)
5.8 Conclusion
information from fossil evidence and illustrates some practical issues for setting up an
analysis.
To run through this exercise, you will need the following software at your disposal,
which is useful for most BEAST analyses:
This chapter will guide you through the analysis of an alignment of sequences
sampled from 12 primate species (see Figure 1.2). The goal is to estimate the phylogeny,
the rate of evolution on each lineage and the ages of the uncalibrated ancestral
divergences.
The first step will be to convert a NEXUS file with a DATA or CHARACTERS block
into a BEAST XML input file. This is done using the program BEAUti (which stands
for Bayesian Evolutionary Analysis Utility). This is a user-friendly program for setting
the evolutionary model and options for the MCMC analysis. The second step is to run
BEAST using the input file generated by BEAUti, which contains the data, model and
analysis settings. The final step is to explore the output of BEAST in order to diagnose
problems and to summarise the results.
6.1 BEAUti
The program BEAUti is a user-friendly program for setting the model parameters for
BEAST. Run BEAUti by double-clicking on its icon. Once running, BEAUti will look
similar irrespective of which computer system it is running on. For this chapter, the Mac
OS X version is used in the figures but the Linux and Windows versions will have the
same layout and functionality.
Figure 6.1 A screenshot of the data tab in BEAUti. This and all following screenshots were
taken on an Apple computer running Mac OS X and will look slightly different on other
operating systems.
Figure 6.2 A screenshot of the Partitions tab in BEAUti after linking and renaming the clock
model and tree.
first drop-down menu in the Clock Model column and rename the shared clock model
to ‘clock’. Likewise, rename the shared tree to ‘tree’. This will make the following options
and the generated log files easier to read.
and more readable in later parts of the exercise. Finally check the ‘estimate’ box for the
Substitution rate parameter and also check the Fix mean substitution rate box. This
will allow the individual partitions to have their relative rates estimated for the unlinked
site models.
Lastly, hold the ‘shift’ key to select all site models on the left-hand side, and click OK
to clone the settings from noncoding into 1stpos, 2ndpos and 3rdpos (Figure 6.3). If you go
through each site model, you will see that their configurations are now the same.
Priors
The Priors tab allows priors to be specified for each parameter in the model. The model
selections made in the Site Model and Clock Model tabs result in the inclusion of
various parameters in the model, and these are shown in the Priors tab (see Figure 6.4).
84 Bayesian evolutionary analysis by sampling trees
Here we also specify that we wish to use the Calibrated Yule model (Heled and
Drummond 2012) as the tree prior. The Yule model is a simple model of speciation
that is generally more appropriate when considering sequences from different species
(see Section 2.4 for details). Select this from the Tree prior drop-down menu.
We should convince ourselves that the priors shown in the priors panel really reflect
the prior information we have about the parameters of the model. We will specify diffuse
‘uninformative’ but proper priors on the overall molecular clock rate (clockRate) and
the speciation rate (birthRateY) of the Yule tree prior. For each of these parameters
select Gamma from the drop-down menu and, using the arrow button, expand the view
to reveal the parameters of the gamma prior. For both the clock rate and the Yule birth
rate set the Alpha (shape) parameter to 0.001 and the Beta (scale) parameter to 1000.
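To get a feel for how diffuse this gamma prior is, its first two moments can be computed directly. The sketch below assumes the shape/scale parameterisation suggested by the labels above (mean = Alpha × Beta, variance = Alpha × Beta²); the function name is ours and is not part of BEAST.

```python
import math

def gamma_moments(shape, scale):
    """Mean, variance and standard deviation of a gamma distribution
    in the shape/scale parameterisation (an assumption, see lead-in)."""
    mean = shape * scale
    variance = shape * scale ** 2
    return mean, variance, math.sqrt(variance)

# Values used for clockRate and birthRateY in this exercise.
mean, variance, sd = gamma_moments(0.001, 1000.0)
print(f"mean = {mean:.3f}, variance = {variance:.1f}, sd = {sd:.2f}")
```

A mean of 1 with a standard deviation of roughly 31.6 spreads the prior mass over many orders of magnitude, which is what makes the prior diffuse while remaining proper.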
By default each of the gamma shape parameters has an exponential prior distribution
with a mean of 1. This implies (see Figure 3.7) we expect some rate variation. By default
the kappa parameters for the HKY model have a log-normal (1,1.25) prior distribution,
which broadly agrees with empirical evidence (Rosenberg et al. 2003) on the range of
realistic values for transition/transversion bias. These default priors are kept since they
are suitable for this particular analysis.
Figure 6.5 A screenshot of the calibration prior options in the Priors panel in BEAUti.
Name the taxa set by filling in the taxon set label entry. Call it human-chimp, since
it will contain the taxa for Homo sapiens and Pan. In the list below you will see the
available taxa. Select each of the two taxa in turn and press the >> arrow button. Click
OK and the newly defined taxa set will be added to the prior list. As this is a calibrated
node to be used in conjunction with the Calibrated Yule prior, monophyly must be
enforced, so select the checkbox marked Monophyletic?. This will constrain the
tree topology so that the human–chimp grouping is kept monophyletic during the course
of the MCMC analysis.
To encode the calibration information we need to specify a distribution for the most
recent common ancestor (MRCA) of human–chimp. Select the Log-normal distribution
from the drop-down menu to the right of the newly added human-chimp.prior.
Click on the black triangle and a graph of the probability density function will appear,
along with parameters for the log-normal distribution. We are going to set M = 1.78
and S = 0.085, which will specify a distribution centred at about 6 million years with a
standard deviation of about 0.5 million years. This will give a central 95% probability
range covering 5–7 Mya. This roughly corresponds to the current consensus estimate of
the date of the MRCA of humans and chimpanzees (Figure 6.5).
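The stated centre, standard deviation and 95% range can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that M and S are the mean and standard deviation of the logarithm of the age (that is, they are given in log space rather than real space); the function name is illustrative only.

```python
import math

def lognormal_summary(m, s, z=1.959963984540054):
    """Summarise a log-normal distribution with log-space mean m and
    log-space standard deviation s. z is the standard normal quantile
    for a central 95% interval."""
    mean = math.exp(m + s * s / 2.0)
    return {
        "median": math.exp(m),
        "mean": mean,
        "sd": mean * math.sqrt(math.exp(s * s) - 1.0),
        "lower": math.exp(m - z * s),
        "upper": math.exp(m + z * s),
    }

summary = lognormal_summary(1.78, 0.085)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Under this assumption the median is about 5.9, the standard deviation about 0.5 and the central 95% range roughly 5.0 to 7.0 (in million years), in agreement with the values quoted above.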
Firstly we have the Chain Length. This is the number of steps the MCMC will make
in the chain before finishing. How long this should be depends on the size of the data
set, the complexity of the model and the quality of answer required. The default value of
10 000 000 is entirely arbitrary and should be adjusted according to the size of your data
set. For this analysis let’s set the chain length to 6 000 000 as this will run reasonably
quickly on most modern computers (a few minutes).
The Store Every field determines how often the state is stored to file. Storing the
state periodically is useful for situations where the computing environment is not very
reliable and a BEAST run can be interrupted. Having a stored copy of the recent state
allows you to resume the chain instead of restarting from the beginning, so you do not
need to get through burn-in again. The Pre Burnin field specifies the number of samples
that are not logged at the very beginning of the analysis. We leave the Store Every and
Pre Burnin fields set to their default values. Below these are the details of the log files.
Each one can be expanded by clicking the black triangle.
The next options specify how often the parameter values in the Markov chain should
be displayed on the screen and recorded in the log file. The screen output is simply for
monitoring the program’s progress so can be set to any value (although if set too small,
the sheer quantity of information being displayed on the screen will actually slow the
program down). For the log file, the value should be set relative to the total length of the
chain. Sampling too often will result in very large files with little extra benefit in terms
of the accuracy of the analysis. Sample too infrequently and the log file will not record
sufficient information about the distributions of the parameters. You probably want to
aim to store no more than 10 000 samples, so this should be set to no less than chain
length / 10 000.
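The rule of thumb above is simple arithmetic, sketched here for the chain length used in this exercise. The helper names are ours, and the sample count deliberately leaves out the initial state of the chain.

```python
def min_log_interval(chain_length, max_samples=10_000):
    """Smallest logging interval that stores at most max_samples samples
    (ceiling division, so the limit is never exceeded)."""
    return -(-chain_length // max_samples)

def samples_stored(chain_length, log_every):
    """Number of samples written, not counting the initial state."""
    return chain_length // log_every

chain_length = 6_000_000
interval = min_log_interval(chain_length)
stored = samples_stored(chain_length, 1000)
print(f"log no more often than every {interval} steps")
print(f"logging every 1000 steps stores {stored} samples")
```

For a chain of 6 000 000 steps the interval should be no less than 600, so a file log interval of 1000 stays within the limit.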
For this exercise we will set the screen log to 10 000 and the file log to 1000. The
final two options give the file names of the log files for the sampled parameters and the
trees. These will be set to a default based on the name of the imported NEXUS file.
If you are using the Windows operating system then we suggest you add the suffix
.txt to both of these (so, Primates.log.txt and Primates.trees.txt) so
that Windows recognises these as text files.
6.2 Running BEAST
Now run BEAST and when it asks for an input file (Figure 6.6), provide your newly
created XML file as input. BEAST will then run until it has finished, reporting informa-
tion to the screen. The actual results files are saved to the disk in the same location as
your input file. The output to the screen will look something like this:
Source code distributed under the GNU Lesser General Public License:
http://github.com/CompEvol/beast2
BEAST developers:
Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled,
Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li,
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel,
Oliver Pybus, Chieh-Hsi Wu, Walter Xie
Thanks to:
Roald Forsberg, Beth Shapiro and Korbinian Strimmer
Alignment(primate-mtDNA)
12 taxa
898 sites
413 patterns
Bouckaert RR, Heled J, Kuehnert D, Vaughan TG, Wu C-H, Xie D, Suchard MA,
Rambaut A, Drummond AJ (2014) BEAST 2: A software platform for Bayesian
evolutionary analysis. PLoS Computational Biology 10(4): e1003537
===============================================================================
Writing file Primates.log
Writing file Primates.trees
Sample posterior ESS(posterior) likelihood prior
0 -7924.3599 N -7688.4922 -235.8676 --
10000 -5529.0700 2.0 -5459.1993 -69.8706 --
20000 -5516.8159 3.0 -5442.3372 -74.4786 --
30000 -5516.4959 4.0 -5439.0839 -77.4119 --
40000 -5521.1160 5.0 -5445.6047 -75.5113 --
50000 -5520.7350 6.0 -5444.6198 -76.1151 --
60000 -5512.9427 7.0 -5439.2561 -73.6866 2m31s/Msamples
70000 -5513.8357 8.0 -5437.9432 -75.8924 2m31s/Msamples
... ... ... ... ...
5990000 -5516.6832 474.6 -5442.5945 -74.0886 2m31s/Msamples
6000000 -5512.3802 472.2 -5440.8928 -71.4874 2m31s/Msamples
Tuning: The value of the operator’s tuning parameter, or ’-’ if the operator can’t be optimized.
#accept: The total number of times a proposal by this operator has been accepted.
#reject: The total number of times a proposal by this operator has been rejected.
Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).
Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator).
Note that there is some useful information at the start concerning the alignments
and which tree likelihoods are used. Also, all citations relevant for the analysis are
mentioned at the start of the run, which can easily be copied to manuscripts reporting
about the analysis. Then follows reporting of the chain, which gives some real-time
feedback on progress of the chain.
6.3 Analysing the results 89
At the end, an operator analysis is printed, which lists all operators used in the
analysis together with how often the operator was tried, accepted and rejected (see
columns #total, #accept and #reject, respectively). The acceptance rate is the proportion
of times an operator is accepted when it is selected for doing a proposal. In general, an
acceptance rate that is high, say over 0.5, indicates the proposals are conservative and
do not explore the parameter space efficiently. On the other hand, a low acceptance rate
indicates that proposals are too aggressive and almost always result in a state that is
rejected because of its low posterior. Both too high and too low acceptance rates result
in low effective sample size (ESS) values. An acceptance rate of 0.234 is the target
(based on very limited evidence provided by Gelman et al. 1996) for many (but not all)
operators implemented in BEAST.
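The acceptance rate reported as Pr(acc|m) can be reproduced from the #accept and #reject columns. The sketch below uses made-up counts for a single hypothetical operator (they are not taken from this run) and compares the result with the 0.234 target and the 0.5 level mentioned above.

```python
TARGET = 0.234  # target acceptance rate quoted in the text

def acceptance_rate(n_accept, n_reject):
    """#accept as a fraction of all proposals made by one operator."""
    total = n_accept + n_reject
    if total == 0:
        raise ValueError("operator was never selected")
    return n_accept / total

# Hypothetical counts, for illustration only.
n_accept, n_reject = 2340, 7660
rate = acceptance_rate(n_accept, n_reject)
print(f"Pr(acc|m) = {rate:.3f} (target {TARGET})")
if rate > 0.5:
    print("proposals may be too conservative")
```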
Some operators have a tuning parameter, for example the scale factor of a scale
operator. If the final acceptance rate is not near the target, BEAST will suggest a new
value for the tuning parameter, which is printed in the operator analysis. In this case,
all acceptance rates are good for the operators that have tuning parameters. Operators
without tuning parameters include the wide exchange and Wilson–Balding operators
for this analysis. Both of these operators attempt to change the topology of the tree with
large steps, but since the data support a single topology overwhelmingly, these radical
proposals are almost always rejected.
The BEAST run produces two logs: a trace log and a tree log. To inspect the trace log,
run the program called Tracer. When the main window has opened, choose Import
Trace File... from the File menu and select the file that BEAST has created called
Primates.log (Figure 6.7).
Remember that MCMC is a stochastic algorithm so the actual numbers will not be
exactly the same as those depicted in the figure.
On the left-hand side is a list of the different quantities that BEAST has logged to
file. There are traces for the posterior (this is the natural logarithm of the product of the
tree likelihood and the prior density), and the continuous parameters. Selecting a trace
on the left brings up analyses for this trace on the right-hand side, depending on the tab
that is selected. When first opened, the ‘posterior’ trace is selected and various statistics
of this trace are shown under the Estimates tab. In the top right of the window is a table
of calculated statistics for the selected trace.
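Because the posterior trace is a log density, it should equal the sum of the logged likelihood and prior columns. The short check below uses three rows copied from the screen output shown earlier; the tolerance allows for the four-decimal rounding of the printed values.

```python
# (sample, posterior, likelihood, prior) as printed by BEAST above.
rows = [
    (0, -7924.3599, -7688.4922, -235.8676),
    (10000, -5529.0700, -5459.1993, -69.8706),
    (6000000, -5512.3802, -5440.8928, -71.4874),
]

# log(likelihood x prior) = log likelihood + log prior
differences = []
for sample, posterior, likelihood, prior in rows:
    diff = abs(posterior - (likelihood + prior))
    differences.append(diff)
    print(f"sample {sample}: |posterior - (likelihood + prior)| = {diff:.4f}")
```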
Select the clockRate parameter in the left-hand list to look at the average rate
of evolution (averaged over the whole tree and all sites). Tracer will plot a (marginal
posterior) histogram for the selected statistic and also give you summary statistics such
as the mean and median. The 95% HPD interval stands for highest posterior density
interval and represents the most compact interval on the selected parameter that contains
95% of the posterior probability. It can be loosely thought of as a Bayesian analogue
to a confidence interval. The TreeHeight parameter gives the marginal posterior
distribution of the age of the root of the entire tree.
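One common way to estimate an HPD interval from MCMC samples is to take the narrowest window that contains the required fraction of the sorted samples. The sketch below illustrates that idea for a unimodal posterior; it is not claimed to be the exact procedure Tracer uses, and the function name and toy data are ours.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples."""
    if not 0 < mass <= 1:
        raise ValueError("mass must be in (0, 1]")
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = math.ceil(mass * n)  # number of samples the interval must cover
    # Slide a window of k consecutive sorted samples; keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Toy data: the densest half of the samples lies between 10 and 12.
toy = [0, 10, 11, 12, 20, 40]
interval_50 = hpd_interval(toy, mass=0.5)
print(interval_50)
```

Unlike an equal-tailed interval, this window moves towards wherever the samples are most concentrated, which is why it is the most compact interval for the chosen probability.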
• What is the estimated rate of molecular evolution for this gene tree (include the
95% HPD interval)?
• What sources of error does this estimate include?
• How old is the root of the tree (give the mean and the 95% HPD range)?
Figure 6.8 A screenshot of the 95% HPD intervals of the root height and the user-specified
(human–chimp) MRCA in Tracer.
To show the relative rates for the four partitions, select the mutationRate parameter for
each of the four partitions, and select the marginal density tab in Tracer. Figure 6.9
shows the marginal densities for the relative substitution rates. The plot shows that
codon positions 1 and 2 have substantially different rates (0.452 versus 0.181) and both
are far slower than codon position 3 with a relative rate of 2.95. The non-coding partition
has a rate intermediate between codon positions 1 and 2 (0.344). Taken together this
result suggests strong purifying selection in both the coding and non-coding regions of
the alignment.
Likewise, a marginal posterior estimate can be obtained for the gamma shape param-
eter and the kappa parameter, which are shown in Figures 6.10 and 6.11, respectively.
The plot for the gamma shape parameter suggests that there is considerable rate variation
for all of the partitions with the least rate variation in the third codon position.
The plot for the kappa parameter (Figure 6.11) shows that all partitions show consid-
erable transition/transversion bias, but that the third codon position in particular has a
high bias, with a mean κ of almost 29 (that is, a transition rate almost 29 times the
transversion rate).
Figure 6.9 A screenshot of the marginal posterior densities of the relative substitution rates of the
four partitions (relative to the site-weighted mean rate).
Figure 6.10 The marginal prior and posterior densities for the shape (α) parameters (axes:
gamma shape (α) and density). The prior is in green. The posterior density estimate for each
partition is also shown: non-coding (black) and first (blue), second (red) and third (orange)
codon positions.
Figure 6.11 The marginal prior and posterior densities for the transition/transversion bias (κ)
parameters (axes: transition/transversion bias (κ) and density). The prior is in green. The
posterior density estimate for each partition is also shown: non-coding (black) and first
(blue), second (red) and third (orange) codon positions.
6.5 Obtaining an estimate of the phylogenetic tree
probability for each node. Run the TreeAnnotator program and set it up as depicted in
Figure 6.12.
The burn-in is the number of trees to remove from the start of the sample. Unlike
Tracer, which specifies the number of steps as a burn-in, in TreeAnnotator you need to
specify the actual number of trees. For this run, you specified a chain length of 6 000 000
steps, sampling every 1000 steps. Thus the trees file will contain about 6000 trees and so
to specify a 1% burn-in use the value 60.
The Posterior probability limit option specifies a limit such that if a node is found
at less than this frequency in the sample of trees (i.e. has a posterior probability less
than this limit), it will not be annotated. The default of 0.5 means that only nodes seen
in the majority of trees will be annotated. Set this to zero to annotate all nodes.
The Target tree type specifies the tree topology that will be annotated. You can either
choose a specific tree from a file or ask TreeAnnotator to find a tree in your sample. The
default option, Maximum clade credibility tree, finds the tree with the highest product
of the posterior probability of all its nodes.
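The maximum clade credibility criterion can be illustrated on a toy sample of trees. In this sketch each tree is represented, by assumption, as a set of clades, and each clade as a set of taxon names; clade posterior probabilities are estimated as sample frequencies, and the tree with the highest product (computed as a sum of logarithms) is selected. It illustrates the criterion only and is not TreeAnnotator's implementation.

```python
import math
from collections import Counter

def mcc_tree_index(trees):
    """Index and log score of the tree whose clades have the highest
    product of sample frequencies."""
    n = len(trees)
    counts = Counter(clade for tree in trees for clade in tree)

    def log_credibility(tree):
        return sum(math.log(counts[clade] / n) for clade in tree)

    scores = [log_credibility(tree) for tree in trees]
    best = max(range(n), key=scores.__getitem__)
    return best, scores[best]

root = frozenset({"1", "2", "3"})
tree_a = frozenset({frozenset({"1", "2"}), root})  # ((1,2),3)
tree_b = frozenset({frozenset({"2", "3"}), root})  # (1,(2,3))
sample = [tree_a, tree_a, tree_b, tree_a]

best, score = mcc_tree_index(sample)
print(best, round(math.exp(score), 3))
```

In this sample the clade (1,2) appears in three of the four trees, so the first tree wins with a clade credibility product of 0.75.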
For node heights, the default is Common Ancestor Heights, which calculates the
height of a node as the mean of the MRCA time of all pairs of nodes in the clade. For
trees with large uncertainty in the topology and thus many clades with low support,
some other methods can result in trees with negative branch lengths. In this analysis,
the support for all clades in the summary tree is very high, so this is not an issue here.
Choose Mean heights for node heights. This sets the heights (ages) of each node in the
tree to the mean height across the entire sample of trees for that clade.
For the input file, select the trees file that BEAST created and select a file for the
output (here we called it Primates.MCC.tree). Now press Run and wait for the
program to finish.
Finally, we can visualise the tree in another program called FigTree. Run this program,
and open the Primates.MCC.tree file by using the Open command in the File
menu. The tree should appear. You can now try selecting some of the options in the
control panel on the left. Try selecting Node Bars to get node age error bars. Also turn
on Branch Labels and select posterior to get it to display the posterior probability for
each node. Under Appearance you can also tell FigTree to colour the branches by the
rate. You should end up with something similar to Figure 6.13.
An alternative view of the tree can be made with DensiTree, which is part of
BEAST 2. The advantage of DensiTree is that it is able to visualise both uncertainty in
node heights and uncertainty in topology. For this particular data set, the most probable
topology is present in more than 99% of the samples. So, we conclude that this analysis
results in a very high consensus on topology (Figure 6.13).
It is a good idea to rerun the analysis while sampling from the prior to make sure that
interactions between priors are not affecting your prior information. The interaction
between priors can be problematic, especially when using calibrations since it means
putting multiple priors on the tree (see Section 9.1 for more details). Using BEAUti,
you can set up the same analysis under the MCMC options by selecting the Sample
from prior only option. This will allow you to visualise the full prior distribution in the
absence of your sequence data. Summarise the trees from the full prior distribution and
compare the summary to the posterior summary tree.
6.7 Comparing your results to the prior 95
Figure 6.13 A screenshot of FigTree (top) and DensiTree (bottom) for the primate data.
Divergence time estimation using ‘node dating’ of the type described in this
chapter has been applied to answer a variety of different questions in ecology and
evolution. For example, node dating with fossils was used in determining the species
diversity of cycads (Nagalingum et al. 2011), analysing the rate of evolution in flowering
plants (Smith and Donoghue 2008) and investigating the origins of hot and cold desert
cyanobacteria (Bahl et al. 2011).
7 Setting up and running a phylogenetic analysis
In this chapter, we will go through some of the more common decisions involved
in setting up a phylogenetic analysis in BEAST. The order in which the issues are
presented follows more or less the order in which an analysis is set up in BEAUti for
a standard analysis. So, we start with issues involved in the alignment, then setting up
site and substitution models, clock models and tree priors and all of their priors. Some
notes on calibrations and miscellanea are followed by some practicalities of running a
BEAST analysis. Note that a lot of the advice in this section is rather general. Since
every situation has its special characteristics, the advice should be interpreted in the
context of what you know about your data.
Some tips on selecting samples and loci for alignments are discussed in (Ho and Shapiro
2011; Mourier et al. 2012; Silva et al. 2012).
Recombinant sequences: Though under some circumstances, horizontal transmis-
sion was shown not to impact the tree and divergence time estimates (Currie et al. 2010;
Greenhill et al. 2010), the models in BEAST cannot handle recombinant sequences
properly at the time of writing. So, it is recommended that these are removed from the
alignment. There are many programs that can help identify recombinant sequences, for
example 3seq (Boni et al. 2007) or SplitsTree (Huson and Bryant 2006).
Duplicate sequences: An erroneous argument for removal of duplicate sequences in
the alignment is that multiple copies will lead to ambiguous trees and slow down the
analysis. However, a Bayesian approach aims to sample all trees that have an appreci-
able probability given the data. One of the assumptions underlying common Bayesian
phylogenetic models is that there is a binary tree according to which the data were
generated. If, for example, three taxa have identical sequences, it does not mean that
they represent the same individual, or that they are equally closely related in the true
tree. All that can be said is that there were no mutations in the sampled part of the
genome during the ancestral history of those three taxa. In this case, BEAST would
sample all three subtrees with equal probability: ((1,2),3), (1,(2,3)), ((1,3),2). If you
summarise the BEAST output as a single tree (see Section 11.4) you will see some
particular sub-tree over these identical sequences based on the selected representative
tree. But the posterior probability for that particular sub-tree will probably be low
(around one-third in our example), since other trees have also been sampled in the chain.
One of the results of a Bayesian phylogenetic analysis is that it gives an estimate of
how closely related the sampled sequences are, even if the sequences are identical. This
is possible because all divergences in the phylogeny are estimated using a probabilistic
model of substitution. For identical sequences, this amounts to determining how old the
common ancestor of these sequences could be given that no mutations were observed in
their common ancestry, and given the estimated substitution rate and sequence length.
Among a set of identical sequences, the only divergence with the possibility of signifi-
cant support would be their common ancestor. If this is the case then you can confidently
report the age of their common ancestor, but should not try to make any statements about
relationships or divergence times within the group of identical sequences.
Finally, there is a population genetic reason not to remove identical sequences. Im-
agine you have sequenced 100 random individuals and among them you observe only
20 unique haplotypes. You are tempted to just analyse the 20 haplotypes. However, if
you are applying a population genetic prior like the coalescent, then this is equivalent to
misrepresenting the data, since the coalescent tree prior assumes that you have randomly
sampled the 20 individuals. If only unique haplotypes are analysed, then it will appear
that every random individual sampled had a unique haplotype. If this was actually the
case you would conclude that the background population from which these individuals
came must be very large. As a result, by removing all the identical sequences you will
cause an overestimation of the population size.
Outgroups: An outgroup is a taxon or set of taxa that is closely related to the taxa
of interest (the ingroup), but definitely has a common ancestor with the ingroup that is
more ancient than the most recent common ancestor of the ingroup. In unrooted
phylogenetic reconstruction the outgroup traditionally serves as a phylogenetic reference and
provides a root for the ingroup (Felsenstein 2004). However, adding an outgroup is
generally discouraged in Bayesian time-tree analyses because inclusion of outgroups
can introduce long branches which can make many estimation tasks more difficult.
Having said that, a well-chosen outgroup can provide additional information about the
ingroup root position, even when a molecular clock is already being used to estimate a
rooted tree.
Nevertheless, outgroups are usually less well sampled than the ingroup, which violates
a basic assumption of many of the standard time-tree priors. Most time-tree priors
assume that the entire tree is sampled with consistent intensity across all clades at each
sampling time. For example, this assumption underlies most coalescent and birth–death
priors (but for alternative sampling assumptions see Höhna et al. 2011).
Also, in a population genetic context, if the outgroup is from a different species than
the ingroup and the ingroup taxa are from the same species, care should be taken in
selecting a prior, and your options may be limited, compared to analyses restricted
to the single-species ingroup. Coalescent-based priors are appropriate for the ingroup,
but birth–death-based priors are more appropriate for divergences separating different
species in the tree.
Table 7.1 IUPAC-IUB ambiguity codes for nucleotide data

Code      States        Code      States
R         A, G          B         C, G, T
Y         C, T          D         A, G, T
M         A, C          H         A, C, T
W         A, T          V         A, C, G
S         C, G          N, ?, -   A, C, G, T
K         G, T
Finally, traditionally the outgroup was picked to be the one most genetically similar
to the ingroup, that is, on the shortest branch. However, this tends to select for atypical
taxa that are evolving slowly, which has a biasing impact on the relaxed and strict clock
analyses required to do divergence-time dating (Suchard et al. 2003).
So, the simple answer to the question ‘How do I instruct BEAST to use an outgroup?’
is that you may not want to. A Bayesian time-tree analysis will sample the root position
along with the rest of the tree topology. If you then calculate the proportion of sampled
trees that have a particular root, you obtain a posterior probability for the root position.
However, if you do include an outgroup and have a strong prior belief that it really is
an outgroup then you should probably reflect that in your model by constraining the
ingroup to be monophyletic.
Ambiguous data: When a site in a sequence cannot be unambiguously determined,
but it is known to be from a subset of characters, these positions in the sequence can be
encoded as ambiguous. For example, for nucleotide data an ‘R’ in a sequence indicates
that the state is either ‘A’ or ‘G’ but certainly not ‘C’ or ‘T’ (see Table 7.1 for the
IUPAC-IUB ambiguity codes for nucleotide data). By default, ambiguous data are treated as
unknowns for reasons of efficiency, so internally they are replaced by a ‘?’ or ‘-’.
Both unknowns and gaps in the sequence are treated the same in most likelihood-
based phylogenetic analyses. Note there are alternative approaches that use indels as
phylogenetic information (Lunter et al. 2005; Novák et al. 2008; Redelings and Suchard
2005, 2007; Suchard and Redelings 2006). By treating ambiguities as unknowns the
phylogenetic likelihood (see Section 3.5) can be calculated about twice as fast as when
ambiguities are explicitly modelled. When a large proportion of the data consists of
ambiguities, it may be worth modelling the ambiguities exactly. This can be done by
setting the useAmbiguities="true" flag on the tree-likelihood element in the XML (see
Chapter 13).
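As a rough illustration of the two treatments, the following Python sketch encodes a tip character as a vector of 0/1 partial likelihoods over (A, C, G, T). The helper name `tip_partials` and its `use_ambiguities` argument are hypothetical and only mirror the idea behind the flag; this is not BEAST's implementation.

```python
# IUPAC-IUB nucleotide ambiguity codes (Table 7.1) as sets of compatible states.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "W": "AT", "S": "CG", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "?": "ACGT", "-": "ACGT",
}
STATES = "ACGT"


def tip_partials(char, use_ambiguities=False):
    """Return 0/1 partial likelihoods over (A, C, G, T) for one tip character.

    Hypothetical helper: with use_ambiguities=False every ambiguous code is
    treated as completely unknown, as in the default behaviour described above.
    """
    compatible = IUPAC[char.upper()]
    if len(compatible) > 1 and not use_ambiguities:
        compatible = STATES  # treat the ambiguity as '?'
    return [1.0 if state in compatible else 0.0 for state in STATES]


print(tip_partials("R"))                        # ambiguity treated as unknown
print(tip_partials("R", use_ambiguities=True))  # only A and G are compatible
```

With the ambiguity modelled explicitly, the tip carries more information (two compatible states instead of four), at the cost of the slower likelihood calculation mentioned above.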
However, simulations and empirical analysis suggest that missing data are not
problematic, and that for sufficiently long sequences taxa with missing data will be
accurately placed in the phylogeny. When the number of characters in the analysis is
larger than 100, the tree can still be reconstructed correctly even if up to 95% of the
sequence data are missing (Wiens and Moen 2008). Model misspecification is probably a
larger problem in assuring phylogenetic accuracy than missing data (Roure et al. 2013).
Partitioning: Alignments can be split into various subsets called partitions. For
example, if the alignment is known to be protein encoding, it often makes sense to
split according to codon position (Bofkin and Goldman 2007; Shapiro et al. 2006). If
there is a coding and non-coding part of a sequence, it may make sense to separate the
two parts in different partitions (see for instance the exercise in Chapter 6). Partitions
can be combined by linking site models, clock models and trees for the partitions of
interest. This has the effect of appending the alignments in the partitions. Which of the
various combinations of linking and unlinking site models, clock models and trees is
appropriate depends on the scenario. For example:
Scenario 1: If you are interested in the gene tree from multiple genes sampled from
a single evolutionarily linked molecule like the mitochondrial genome or the hepatitis
C virus (HCV) genome, then it is more likely you are interested in estimating a single
phylogenetic tree from a number of different genes, and for each gene you would like a
different substitution model and relative rate.
Scenario 2: If you are interested in a species tree from multiple genes that are not
linked, then a multispecies coalescent (*BEAST; Heled and Drummond 2010) analysis
is more appropriate.
Scenario 3: If you are interested in estimating the average birth rate of lineages from
a number of different phylogenetic data sets you would set up a multi-partition analysis.
But you would have a series of phylogenetic trees that all share the same Yule birth rate
parameter to describe the distribution of branch lengths (instead of a series of population
genealogies sharing a population size parameter). In addition they may have different
relative rates of evolution and different substitution models, which you could set up by
having multiple site models, one for each partition.
If one gene sequence is missing in one taxon then it should be fine to just use ‘?’
to represent this (or gaps, ‘-’; they are treated the same way). If the shorter sequences
can be assumed to follow a different evolutionary path than the remainder, it is better to
split the sequences into partitions that share a single tree, but allow different substitution
models.
Partitions should not be chosen too small. There should still be sufficient information
in each partition to be able to estimate parameters of interest. One has to keep in mind
that it is not the number of sites in the partition that matters, but the number of unique
site patterns. A site pattern is an assignment of characters in an alignment for a particular
site, illustrated as any column in the alignment in Figure 1.2.
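To make the distinction between sites and site patterns concrete, here is a minimal Python sketch that counts the unique columns of an alignment. The function name and the toy alignment are made up for illustration.

```python
from collections import Counter


def site_patterns(alignment):
    """Count how often each column (site pattern) occurs in an alignment.

    `alignment` is a list of equal-length sequences, one per taxon.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all sequences must have the same length")
    # zip(*alignment) iterates over columns; each column is one site pattern.
    return Counter(zip(*alignment))


# Toy alignment (made up): 3 taxa, 8 sites.
toy = ["ACGTACGA",
       "ACGTACGA",
       "ACTTACTA"]
patterns = site_patterns(toy)
print(len(toy[0]), "sites,", len(patterns), "unique site patterns")
```

In this toy case the eight sites collapse to four unique patterns, so the partition carries less information than its length alone suggests.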
Methods that automatically pick partitions from data include those of Wu et al. (2013)
for arbitrary partitions and Bouckaert et al. (2013) for consecutive partitions.
Combining partitions: You can combine partitions if they share the same taxa
simply by linking their tree, site model and clock model. This has the same effect as
concatenating the alignments into a single partition. Note that sharing just part of the
model across multiple partitions is also possible.
7.2 Choosing priors/model set-up

In general, it is very hard to give an answer to the question ‘Which prior should I use?’,
because the only proper answer is ‘It depends’. In this section we make an attempt to
give insight into the main issues around choosing priors. However, it is important to
realise that for each of the pieces of advice given in the following paragraphs there are
exceptions. As a reference, Table 7.2 shows some of the more common distributions
used as priors.

Table 7.2 Some common distributions used as priors and some of their properties. Plots indicate some
of the shapes
[Table body not recoverable from the extracted text. The only legible entry is the uniform
distribution: U(x|l, u) = 1/(u − l) if l ≤ x ≤ u, and 0 otherwise, with lower bound l and upper bound u.]
* If offset is set to non-zero, the offset should be added to the range.
† M is the mean of log x, but the log-normal distribution can also be specified by its true mean, μ. If so, μ is the mean of the distribution.
‡ NB a number of parameterisations are in use; this shows the one we use in BEAST.
Table 7.3 GTR model can be used to define other nucleotide substitution models by linking parameters.

[Diagram: the four nucleotides A, C, G and T, with the relative rate between each pair labelled by a Greek letter.]

Model          α   β   γ   δ   ε   ω   # Dimensions
F81 (JC69)     1   1   1   1   1   1   0
HKY85 (K80)    a   1   a   a   1   a   1
TN93           a   b   a   a   1   a   2
TIM            a   b   c   c   1   a   3
GTR (SYM)      a   b   c   d   1   e   5

On the left, the four letters and the rates between them are shown. On the right, the models
with frequencies estimated (models that have their frequencies not estimated in parentheses) and the
corresponding values, which show which parameters are linked by sharing the same alphabetic
character. The dimensions column shows the number of parameters that need to be estimated for
the model.
that is set to 1. Still other programs normalise so that the sum of all six relative rates is
6 or 1, or other numbers (for example, so that the expected output of the instantaneous
rate matrix is 1 – which also requires knowledge of the base frequencies). Luckily, you
can easily compare two sets of relative rates that have different normalisations by (for
example) rescaling one set of relative rates so that their sum (all six) is equal to the sum
of (all six of) the other.
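The rescaling described above can be sketched in a few lines of Python. The two rate vectors are made up for illustration and are assumed to list the six relative rates in the same order.

```python
def rescale_to_match_sum(rates, reference):
    """Rescale `rates` so that their sum equals the sum of `reference`."""
    factor = sum(reference) / sum(rates)
    return [r * factor for r in rates]


# Made-up relative rates from two programs, same order of the six rates.
set_a = [0.5, 2.0, 0.4, 0.6, 2.0, 0.5]     # normalised so the sum is 6
set_b = [0.25, 1.0, 0.2, 0.3, 1.0, 0.25]   # normalised so that one rate is 1

set_b_rescaled = rescale_to_match_sum(set_b, set_a)
for a, b in zip(set_a, set_b_rescaled):
    print(f"{a:.3f}  {b:.3f}")
```

After rescaling, the two sets can be compared entry by entry; here they turn out to describe the same relative rates.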
Use GTR to define other models: The GTR model can be used to define less
parameter-rich models such as TN93 (Tamura and Nei 1993) or even JC69 (Jukes and
Cantor 1969) by linking some of the parameters. In fact, GTR can be used to define
any reversible nucleotide substitution model by linking parameters and changing the
frequency estimation method. Table 7.3 shows which rates should be shared to get
a particular substitution model. Note that since the rates are normalised in a BEAST
analysis so that on average one substitution is expected in one unit of time, there is one
parameter that can be fixed to a constant value, which is required to ensure convergence.
In general, it is a good idea to fix a rate for which there is sufficient data to get a good
estimate from the alignment. Most nucleotide alignments contain C–T transitions, so
the default in BEAST is to fix that rate to 1 and estimate other rates relative to that.
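The linking patterns in Table 7.3 can be written down as data, which makes the "# Dimensions" column easy to check. The following Python sketch is illustrative only; the dictionary and helper names are hypothetical and do not correspond to BEAST classes.

```python
# Linking patterns from Table 7.3: equal letters share one parameter,
# and "1" marks a rate fixed to the constant value 1.
LINKING = {
    "F81 (JC69)":  ["1", "1", "1", "1", "1", "1"],
    "HKY85 (K80)": ["a", "1", "a", "a", "1", "a"],
    "TN93":        ["a", "b", "a", "a", "1", "a"],
    "TIM":         ["a", "b", "c", "c", "1", "a"],
    "GTR (SYM)":   ["a", "b", "c", "d", "1", "e"],
}


def dimensions(pattern):
    """Number of free rate parameters: distinct letters, ignoring fixed rates."""
    return len(set(pattern) - {"1"})


def expand(pattern, values):
    """Fill in the six relative rates from a dict of parameter values."""
    return [1.0 if p == "1" else values[p] for p in pattern]


for model, pattern in LINKING.items():
    print(f"{model:12s} {dimensions(pattern)}")

# Example: HKY85 with a single (made-up) shared value for the linked rates.
print(expand(LINKING["HKY85 (K80)"], {"a": 0.25}))
```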
Number of gamma categories: The number of gamma categories determines how
well rate heterogeneity following a gamma distribution is approximated by its discret-
isation into a number of distinct categories. The computational time required by BEAST
is linear with the number of gamma categories, so a model with 12 rate categories takes
three times longer than one with four rate categories. For most applications four to
eight categories do a sufficiently good job at approximating a gamma distribution
in this context (Yang 1994). A good strategy is to start with four categories. If the
estimate of the shape parameter turns out to be close to zero this is an indication that
there is substantial rate heterogeneity and a larger number of categories can result in
a better fit. One situation where rate heterogeneity among sites occurs is when the
sequences are protein coding, and sites at codon positions 1 and 2 are known to have a
lower rate than those at codon position 3 due to purifying selection on the underlying
amino acid sequences and redundancy of the genetic code leading to reduced selection
at codon position 3. In these cases, partitioning the data by codon position may fit
better than gamma categories (Bofkin and Goldman 2007; Shapiro et al. 2006) and
this model is up to x times faster than using gamma categories, where x is the number of
categories.
Prior for the gamma shape parameter α: The default prior for the gamma shape
parameter in BEAST is an exponential with mean 1. Suppose you used a uniform
distribution between 0 and 1000 instead. Then the prior probability of the
shape parameter α being >1 (and smaller than 1000) is 0.999 (99.9%) and the prior
probability of 10 < α < 1000 is 0.99. As you can appreciate, this is a fairly strong prior
for almost-equal rates across sites. Even if the data have reasonably strong support for
a shape of, say, 1.0, this may still be overcome by the prior distribution. All priors in
BEAST are merely defaults, and it’s impossible ahead of time to pick sensible defaults
for all possible analyses. So it is important in a Bayesian analysis to assess your priors.
It is better to choose a good prior than to fix the parameter value to the maximum
likelihood estimate. Doing the latter artificially reduces the variance of the posterior
distribution by using the data twice if the estimate came from the same data.
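The prior probabilities quoted above for the uniform prior are easy to verify, and comparing them with the exponential prior with mean 1 shows how differently the two priors weight large shape values. This is a small Python sketch with hypothetical helper names.

```python
import math


def uniform_prob(a, b, lower=0.0, upper=1000.0):
    """P(a < x < b) under a Uniform(lower, upper) prior."""
    a, b = max(a, lower), min(b, upper)
    return max(b - a, 0.0) / (upper - lower)


def exponential_prob(a, b, mean=1.0):
    """P(a < x < b) under an exponential prior with the given mean."""
    return math.exp(-a / mean) - math.exp(-b / mean)


print("uniform:     P(1 < alpha < 1000)  =", uniform_prob(1, 1000))
print("uniform:     P(10 < alpha < 1000) =", uniform_prob(10, 1000))
print("exponential: P(1 < alpha < 1000)  =", exponential_prob(1, 1000))
print("exponential: P(10 < alpha < 1000) =", exponential_prob(10, 1000))
```

Under the exponential prior only about 37% of the prior mass lies above α = 1, whereas the uniform prior on (0, 1000) places 99.9% of its mass there.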
Invariant sites: Without invariant sites (that is, proportion invariant = 0), the gamma
model acts on all sites. When setting the proportion of invariant sites to a non-zero
value, the gamma model for rate heterogeneity across sites will need to explain less
rate variation in the remaining variable sites. As a consequence, the remaining sites
look less heterogeneous and the estimated value of the shape parameter will increase.
Estimating a proportion of invariant sites introduces an extra random variable, and thus
injects extra uncertainty in the analysis, which may not always be desirable (see model
selection strategy in Chapter 10). The paradox is that for α < 1 and short total tree
lengths you expect sites that are invariant even though they have a non-zero rate.
Frequency model: For the HKY (Hasegawa et al. 1985), GTR and many other
substitution models it is possible to choose the way frequencies are determined when
calculating the tree likelihood. The three standard methods are uniform frequencies
(one-quarter each for nucleotide models), empirically estimated from the alignment and
estimated by MCMC. Uniform frequencies are generally only useful when you have
done some simulations and uniform is the truth. Empirical frequencies are regarded
as a reasonable choice, although obviously they are based on a simple average, so
that if you have a lot of identical sequences at the end of a long branch then the
empirical frequencies will be dominated by that one sequence which might not be
representative. Estimating frequencies is the most principled approach and the estimates
typically converge fast. However, estimating the frequencies does introduce a bit of
uncertainty in the analysis, which can be detrimental to analysis of parameter-rich
models on low-diversity data.
Fix mean substitution rate: When there are two or more partitions that share a clock
model, for example, when splitting a coding sequence into its first, second and third
codon positions, it can be useful to estimate relative substitution rates. This is done in
such a way that the mean substitution rate is 1, where the mean is weighted by the
number of sites in each of the partitions. So, for two partitions where the first is twice
as long as the second, the relative substitution rate of the first partition is weighted by
two-thirds, and the second partition by one-third. In this case the first substitution rate can
be 0.5 and the second 2, giving a mean substitution rate of 0.5 × 2/3 + 2 × 1/3 = 1.
Note that the unweighted mean does not equal 1 in this case.
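The arithmetic of this example can be checked with a short Python sketch. The partition lengths of 200 and 100 sites are hypothetical; only their 2:1 ratio matters.

```python
def weighted_mean_rate(rates, n_sites):
    """Mean substitution rate, weighting each partition by its number of sites."""
    total = sum(n_sites)
    return sum(rate * n / total for rate, n in zip(rates, n_sites))


rates = [0.5, 2.0]     # relative substitution rates from the example above
n_sites = [200, 100]   # hypothetical lengths: the first partition is twice as long

print("weighted mean:  ", weighted_mean_rate(rates, n_sites))
print("unweighted mean:", sum(rates) / len(rates))
```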
For a mean substitution rate to be fixed to 1, substitution rates of at least two partitions
must be marked as estimated. Warning: when the mean substitution rate is not fixed, but
substitution rates are estimated and a clock rate for the partition is also estimated, the
chain will not converge since the clock and substitution rates are not identifiable. This
happens because there are many combinations of values of substitution rate and clock
rate that result in the same tree likelihood.
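A toy Python sketch of this non-identifiability follows, assuming (as stated in this section) that only the product of the two rates enters the likelihood. The numbers are made up.

```python
def effective_rate(substitution_rate, clock_rate):
    """Rate actually used for a partition: the product of the two rates."""
    return substitution_rate * clock_rate


# Made-up (substitution rate, clock rate) pairs with the same product.
pairs = [(1.0, 0.02), (2.0, 0.01), (0.5, 0.04)]
products = {effective_rate(s, c) for s, c in pairs}
print(products)
```

All three pairs give the same effective rate, so the data cannot distinguish between them unless one of the two rates is constrained.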
Typically, BEAST uses substitution rates for relative rates, and the clock rate for
absolute rates. But in the end, the actual rate used to calculate the likelihood is the
product of these two. BEAUti just makes it convenient to add a prior on substitution
rates (by checking the Fix Mean Substitution Rate check box) such that the mean rate
of the substitution rates across multiple partitions is one.
In short, for low-diversity intra-species data sets there are often low levels of rate
variation between branches, and in that case the strict molecular clock can be superior
for rate and divergence-time estimation (Brown and Yang 2011). Even with significant
evidence for moderate rate variation, it is worth using a strict molecular clock if you
determine that the model misspecification is not severe (i.e. if there is no evidence to
reject S < 0.1).
Absolute and relative rates: The substitution rate for a partition is the product of
the substitution rate specified for the site model and the clock rate of the clock model.
Care must be taken to determine which rate parameters should be estimated and which
should be fixed to constant values. Failure to choose the correct combination will result
in invalid analysis or nonsensical results.
If there is one partition and no timing information in the form of calibration or tip
dates, both substitution rate and clock rate should be fixed to a constant value or have
a strongly informative prior. Usually substitution rate is set to 1.0 and clock rate is
set to either 1.0, or a known molecular clock rate. If a proper informative prior has
been constructed based on the literature or previous analyses, then the clock rate can be
‘estimated’, in the sense that uncertainty can be incorporated into its prior. Since clock
rate is a scale parameter, its prior should typically be a log-normal, rather than a normal,
so that the prior probability is zero for a rate of zero.
If for a single partition there is some timing information, one of the clock rate
and substitution rate should be estimated, but not both. If both substitution and molec-
ular clock rates are estimated, the analysis will be invalid due to non-identifiability.
Typically, you want the clock rate to be estimated, which is set by BEAUti by
default.
Conversely, if both rates are fixed and the rate k̂ implied by the timing information, t
(in conjunction with the genetic distances d from the sequence data; i.e. k̂ = d/t) differs
considerably from the specified overall clock rate then divergence-time estimation may
be compromised. A mismatch of timing information and fixed molecular clock rates
could also cause poor convergence.
For multiple partitions sharing the same tree, one partition can be chosen as reference
and the remainder can have their substitution rates estimated relative to the reference
partition. Preferably, the substitution rates of all partitions are estimated, but with the
constraint that the mean substitution rate per site is fixed to 1 (Drummond 2002; Pybus
et al. 2003). To calculate the mean substitution rate per site, the length of partitions is
taken into account, as longer partitions will have a bigger impact on the mean substitution
rate than shorter ones.
Multiple partitions do not need to share a single time-tree. For each individual tree
in the analysis, molecular clock rates can be estimated if there is timing information
to inform the rate. However, per tree at least one substitution rate should be fixed or
alternatively the mean of the substitution rates must be fixed.
Set the clock rate: As mentioned above, when a clock rate is available from another
source, like a generalised insect clock rate of 2.3% per million years, you may want to
use this fixed rate instead of estimating it. A rate of 2.3% per million years equals 0.023
substitutions per site per million years (or half that if it was a divergence rate, that is the
rate of divergence between two extant taxa), so when you set the clock rate to 0.023 it
means your timescale will be in millions of years.
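The unit conversion can be sketched in Python as follows. The function name is hypothetical, and the branch length used at the end is a made-up value.

```python
def clock_rate_from_percent(percent_per_myr, is_divergence_rate=False):
    """Convert a rate in % per million years to substitutions/site/Myr.

    A divergence rate (between two extant taxa) is halved to obtain the
    rate along a single lineage.
    """
    rate = percent_per_myr / 100.0
    return rate / 2.0 if is_divergence_rate else rate


clock_rate = clock_rate_from_percent(2.3)
print(clock_rate)                                               # substitution rate
print(clock_rate_from_percent(2.3, is_divergence_rate=True))    # if 2.3% were a divergence rate

# With this clock rate, a (made-up) branch of 0.046 substitutions per site
# corresponds to a duration expressed in millions of years.
branch_length = 0.046
print(branch_length / clock_rate, "million years")
```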
Mutation/clock rate prior: If your prior on the rate parameters were the default
ones (currently, BEAUti assumes a uniform distribution with an extremely large upper
bound), and your timing information is not strongly informative, then you should expect
unreasonable values for rate estimates. This is because you are sampling an inappro-
priate default prior, which due to the lack of signal in the data will dominate the rate
estimate. The uniform prior in this context puts a large fraction of the prior mass on large
rates (larger than one substitution per unit of time), resulting in nonsensical estimates.
Of course if you want a normal prior in log-space then you can choose the log-normal
prior. The log-normal is recommended rather than normal since the normal distribution
has a non-zero density for a rate of zero, which makes no sense for a scale parameter like
clock and substitution rate. A log-normal with appropriate mean and 95% HPD bounds
is a good choice for a substitution rate prior in our opinion. If you have a number of
independent rate estimates from other papers on relevant taxa and gene regions, you
can simply fit a log-normal distribution to them and use that as a prior for your new
analysis.
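One simple way to fit a log-normal to a handful of published estimates is to take the mean and standard deviation of the log-transformed rates. The Python sketch below does this; the rate values are invented for illustration, and M and S denote the mean and standard deviation of log(rate).

```python
import math
import statistics


def fit_lognormal(rates):
    """Return (M, S): the mean and standard deviation of log(rate)."""
    logs = [math.log(r) for r in rates]
    return statistics.mean(logs), statistics.stdev(logs)


# Invented rate estimates (substitutions/site/Myr), for illustration only.
published = [0.008, 0.012, 0.010, 0.015, 0.009]

M, S = fit_lognormal(published)
median = math.exp(M)
lower = math.exp(M - 1.96 * S)
upper = math.exp(M + 1.96 * S)
print(f"M = {M:.3f}, S = {S:.3f}")
print(f"median = {median:.4f}, central 95% interval = ({lower:.4f}, {upper:.4f})")
```

With only a few estimates the fitted S is itself uncertain, so it may be sensible to widen it rather than treat the fit as exact.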
Alternatively, a diffuse gamma distribution (shape 0.001, scale 1000) can be appro-
priate. In general, if the data are sufficiently informative and there is a good calibration,
the prior should not have too much impact on the analysis. The initial value should not
matter, but setting it as the mean of the distribution does no harm.
There is a reasonably good (but broad) prior on the rate of evolution available for
almost any group of organisms with a careful read of the literature. For example, ver-
tebrate mitochondrial evolution tends to be in the range 0.001–1 substitutions per site
per million years. While this is a broad prior it certainly rules out more than it rules
in. However, if you want to assert ignorance about the clock rate parameter it is fine to
take a broader prior on the clock rate as long as there is a proper prior on one of the
divergences in the tree and a proper prior on the standard deviation of the log rate S if
you use a log-normal relaxed clock. The default prior for S (exponential with mean 1/3)
is suitable for this purpose, although a smaller mean (e.g. 0.1) of the exponential prior
on S for closely related taxa can be easily argued for.
Figure 7.1 Left, a simulated Yule tree; right, a simulated coalescent (with constant population)
tree with 20 taxa. Note, coalescent trees have much shorter branches near the tips.
also be included as a random variable in the model (Stadler 2009; Yang and Rannala
1997). See Chapter 2 for details on these tree priors.
The sampling proportion can make a big difference to the prior if it is very small
(Stadler 2009). But if we put the sampling proportion aside for a moment then Yule
versus birth–death is not as big a difference in priors as Yule versus coalescent. The
coalescent prior varies quadratically with the number of lineages spanning an inter-node
time ($E[t_k] \propto \exp\left(-\binom{k}{2}\right)$, where $t_k$ is the length of time
spanned by the $k$th interval in the tree) whereas the Yule prior varies linearly
($E[t_k] \propto \exp(-k)$) (see Figure 7.1).
The Yule prior assumes that the birth rate of new lineages is the same everywhere in
the tree. For example, it would assume that the birth rate of new lineages within the set of
taxa of interest is the same as the rate of lineage birth in the outgroup. This assumption
may not be appropriate. A relaxed molecular clock also allows quite some scope for
changes in the rate from branch to branch, so that the molecular clock model can interact
with the tree prior to accommodate a poor fit of the tree prior by compensating with
lineage-specific rate variation.
The coalescent is a tree prior for time-trees relating a small sample of individuals
from a large background population, where the background population may have
experienced changes in population size over the time period that the tree spans. Non-
parametric forms of the coalescent tree prior are more flexible and so may accommodate
a wider range of divergence time distributions. However, the parameters of the
coalescent tree prior will be difficult to interpret if your sequences come from different
species.
The skyline (Drummond et al. 2005) and skyride (Minin et al. 2008) priors are
coalescent priors that are useful for complex dynamics. The parameterised coalescent
priors like the constant and exponential coalescent are useful when you want to estimate
specific parameters, for example the growth rate. However, you have to be confident that
the model is a good description of the data for such estimates to be valid.
Note that some priors, such as the Yule prior, assume all tips are sampled at the same
time. These priors are not available for use with serially sampled data, and the birth–
death skyline plot can be used instead (Stadler et al. 2013).
For divergence-time dating, once a tree prior has been chosen, we strongly recommend
that practitioners look carefully at the resulting prior distribution (running the
analysis without data) on node ages to see whether the prior reflects prior beliefs about
possible times of origin of clades of interest (see also Section 7.2.4 on calibrations).
Prior for Yule birth rate: If you are using the Yule tree prior then the birth rate λ
needs to be specified. This parameter governs the rate at which species diverge. This
rate, in turn, determines the (prior) expected age of the species tree (denoted by $t_{\mathrm{root}}$).
The formula connecting these two quantities is

$$\lambda = \frac{1}{t_{\mathrm{root}}} \sum_{k=1}^{n-1} \frac{k}{n(n-k)}, \qquad (7.2)$$

where n is the number of species. So, if you have some information about the speci-
ation rate, you can use this to specify a prior on λ. This will effectively form a prior
distribution on the age of the tree through the tree prior.
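Reading Equation 7.2 as λ = (1/t_root) Σ_{k=1}^{n−1} k/(n(n−k)), the relation between birth rate and expected root age can be evaluated with the following Python sketch. The function names and the example values (20 species, root age of 10 time units) are hypothetical.

```python
def yule_sum(n):
    """The sum over k = 1 .. n-1 of k / (n (n - k)) from Equation 7.2."""
    if n < 2:
        raise ValueError("need at least two species")
    return sum(k / (n * (n - k)) for k in range(1, n))


def yule_birth_rate(t_root, n):
    """Birth rate implied by a prior expected root age t_root for n species."""
    return yule_sum(n) / t_root


def expected_root_age(birth_rate, n):
    """Inverse relation: prior expected root age for a given birth rate."""
    return yule_sum(n) / birth_rate


lam = yule_birth_rate(t_root=10.0, n=20)
print("birth rate:", lam)
print("expected root age:", expected_root_age(lam, 20))
```

The birth rate is expressed per unit of time in whatever units the root age is given, so a root age in millions of years yields a rate per lineage per million years.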
If no prior information on the speciation rate λ is available, then a uniform prior is
okay. Note that in practice for *BEAST analysis only we found that 1/x works better for
a birth rate prior on the species tree. Such a prior puts higher preference on lower birth
rates, hence older species trees. But the age of the species tree is limited from above
by corresponding divergences in the gene trees, which each have their own coalescent
prior. As a result the species tree will ‘snugly’ fit the gene trees, while with a uniform
birth rate prior it might linger lower than the gene trees would justify.
Prior for constant population size: When using the constant-size coalescent tree
prior, a choice needs to be made for the prior on the constant population size
hyper-parameter. If you have prior information, then an informative prior is always
preferable.
Otherwise, the 1/x prior (which for the population size parameter is also the Jeffreys
prior) is a good non-informative prior for population size. We showed in Table 3 of
Drummond et al. (2002) that this prior leads to good recovery of the population size
in terms of frequentist coverage statistics, but we were not attempting to provide a
general result in that paper. It is well known that estimation of the growth rate for
exponentially growing populations is positively biased (especially when using a single
locus, see Kuhner et al. 1998) and we do not know of any work that has been done on
appropriate priors for that parameter. Although the 1/x prior is acceptable for θ, it is
not suitable if you want to do path sampling (see Section 9.5), because path sampling is
only possible when the prior is proper, whereas the 1/x prior is improper since it does
not have a finite integral.
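To see why the 1/x prior is improper, consider its integral over a hypothetical interval [a, b] (the bounds are introduced here purely for illustration):

```latex
\int_a^b \frac{1}{\theta}\,\mathrm{d}\theta = \ln b - \ln a
\;\longrightarrow\; \infty
\quad \text{as } a \to 0 \text{ or } b \to \infty .
```

The density therefore cannot be normalised over (0, ∞). Restricting θ to a finite interval [a, b] with a > 0 would give a proper prior with normalising constant 1/ln(b/a).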
Number of BSP groups: The number of groups for a BSP prior determines how well
the demographic function can be approximated. When there is only a single group, BSP
becomes equivalent to a simple constant population size model. The maximum number
of groups is one per coalescent event and in this case a population size is estimated
for every interval between two coalescent events. Choosing the maximum number of
groups leads to a large number of estimated parameters and these estimates will be very
noisy. So, the choice for the number of groups typically needs to be in between these
two extremes. Too few groups results in loss of signal of the population size history, and
too many groups result in noisy estimates that are difficult to interpret.
The optimal value depends on the data at hand and the demographic history being
inferred. A rule-of-thumb is to start with five groups and if more resolution is required
increase the number of groups in subsequent runs. Alternatively, use EBSP, which
supports estimation of the number of change-points from the data as well as use of
multiple loci (Heled and Drummond 2008; see Section 2.3.1).
Different behaviour for different tree hyper-priors: Different tree hyper-parameter
priors can result in differences in rate estimates. For example, a uniform prior on Yule
birth rate works the way one expects, while the uniform prior on the population size of
a constant-size coalescent prior does not. The reason for this lies in the way in which
the different models are parameterised. If the coalescent prior had been parameterised
with a parameter that was equal to 1 over the constant population size, then a uniform
prior would have behaved as expected (in effect the Jeffreys prior is performing this
re-parameterisation). Conversely, if the Yule tree model had been parameterised by the
mean branch length (for derivation of mean branch length in a Yule tree see Steel and
Mooers 2010) it would have behaved in a similar way to the coalescent prior with a uniform
prior on the population size.
Before you start thinking that we parameterised the coalescent prior incorrectly, it
is important to realise there is no parameterisation that is correct for all questions. For
some hypotheses one prior distribution is correct, for others another prior distribution
works better. The important thing is to understand how the individual marginal priors
interact with each other. For an analysis involving divergence-time dating and rate
estimation one should be aware that the tree prior has the potential to influence the
rate estimates and vice versa.
Finally, if you have strongly informative prior distributions on divergence times or
rates (like gamma or log-normal distributions with moderate standard deviations) then
most of these effects become negligible. But refer to Heled and Drummond (2012, 2013)
for a detailed discussion of calibrated tree priors.
7.2.4 Calibrations
Choosing a prior for a calibration: If you have some idea about where most of the
probability for a calibration may be, for example, you are 95% sure that the date must
be between X and Y, or you are 95% sure that the divergence date is not older than X,
then BEAUti can help in choosing the correct prior. BEAUti can display a probability
density for a variety of potential prior distributions, and also report the 2.5% and 5%
upper and lower tails of the distribution. So, you can select the parameters of a prior by
matching the 90% or 95% bounds of your prior information with the tails of the chosen
prior distribution.
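The same matching of bounds can be done by hand for a log-normal prior. The Python sketch below solves for M and S (the mean and standard deviation of log x) so that a central interval of the prior matches given bounds. The function name and the bounds of 5 and 20 time units are hypothetical, and this is not a description of how BEAUti computes its tails.

```python
import math
from statistics import NormalDist


def lognormal_from_interval(lower, upper, mass=0.95):
    """Return (M, S) of a log-normal whose central `mass` interval is (lower, upper)."""
    if not 0 < lower < upper:
        raise ValueError("need 0 < lower < upper")
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)   # about 1.96 for mass = 0.95
    M = (math.log(lower) + math.log(upper)) / 2.0
    S = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return M, S


# Hypothetical prior belief: 95% sure the divergence date lies between 5 and 20.
M, S = lognormal_from_interval(5.0, 20.0)
log_date = NormalDist(M, S)   # distribution of log(date)
check_lower = math.exp(log_date.inv_cdf(0.025))
check_upper = math.exp(log_date.inv_cdf(0.975))
print(f"M = {M:.4f}, S = {S:.4f}")
print(f"2.5% and 97.5% quantiles: {check_lower:.3f}, {check_upper:.3f}")
```

For a one-sided statement such as "95% sure that the date is not older than X", the same approach applies with a single quantile, but a second piece of information (for example a median) is then needed to fix both M and S.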
Note that a calibration density based on the estimated divergence time of a previous
analysis of the same data will lead to inappropriately small credible intervals for the
estimated divergence ages. Using the data twice (double-dipping) leads to credible
intervals that no longer accurately represent the true uncertainty under the model. But it
should be noted that using data to define priors is in certain circumstances regarded as
reasonable, especially when employed in empirical Bayes approaches which have been
used successfully in some phylogenetic contexts (Huelsenbeck et al. 2001; Nielsen and
Yang 1998; Yang et al. 2000).
An alternative is to base priors on an independent data set. This can be effective when
there is not enough temporal information in the data set you are working with. The
estimate of the clock rate of the first data set can be used to put a prior on the molec-
ular clock rate of the data set of interest. You can then approximate this distribution
with a normal or gamma distribution and use it as an independent empirically derived
prior. However, it is important to make sure the data sets are independent and have no
sequences in common and the samples are indeed independent, that is, the sequences
from the data sets are not para- or polyphyletic in a combined phylogeny.
Divergence times are scale parameters, so they are always positive. Therefore
you should use priors that are appropriate for scale parameters, such as the log-normal,
which is a density that is only defined for positive values (see Table 7.2 for other
distributions). As a rule, normal distributions should not be used for calibration densities.
Divergence times are defined only on the positive number line, and so your prior should
be as well.
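To see why a normal calibration density is a poor fit, you can compute how much prior mass it places on impossible negative ages. The sketch below uses hypothetical values (mean 10, standard deviation 5) chosen only for illustration.

```python
from statistics import NormalDist


def negative_mass(mean, sd):
    """Probability that a normal(mean, sd) density assigns to ages below zero."""
    return NormalDist(mean, sd).cdf(0.0)


# Hypothetical calibration: mean age 10, standard deviation 5.
p_negative = negative_mass(10.0, 5.0)  # mass lying two standard deviations below the mean
```

A log-normal density places no mass below zero, whatever its parameters.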
Another method for determining the shape of the calibration density is through
the CladeAge package (Matschiner and Bouckaert 2013). It requires specification
of the net diversification rate, turnover rate and fossil sampling rate. Using this
information, together with the interval that specifies the geological age range of the
fossil, a simulation-based calibration density can be generated.
The calibrations, monophyly constraints and the tree prior can interact with each
other in unpredictable ways. For example, a calibration on the MRCA time of a clade
limiting its height has an impact on any calibration on a subclade that has tails exceeding
the limit of the parent clade (Heled and Drummond 2012, 2013). Therefore, you should
use a calibrated tree prior wherever possible when doing divergence time dating in
BEAST (Heled and Drummond 2012, 2013). If that is not feasible then always run
your analysis without data to see how the tree prior and calibration densities interact.
This can be done by selecting the sample from prior option in BEAUti. This way,
you can determine what the actual marginal prior distributions are for divergence
times of interest. Calibration densities combined naively with the tree prior (Yule, or
coalescent) can yield surprising results, and you might need to change your calibrations
accordingly.
When calibrating an analysis with a single calibration density and using the Yule
prior, the ‘calibrated Yule prior’ can be used (Heled and Drummond 2012) that will
guarantee that the marginal distribution of the calibrated node corresponds to the cali-
bration density. With more than one calibration, the calibrated birth–death prior imple-
mented in BEAST can become computationally burdensome, and this is an active area
of research (Heled and Drummond 2013). A superior alternative for fossil dating has
very recently been developed that can handle large numbers of fossils under the birth–death
sampling model, including estimation of the correct phylogenetic location for
each fossil taxon (Gavryushkina et al. 2014).
Multiple calibrations: When using the strict clock, a single calibration tends to be
sufficient, assuming there are sufficient sequence data. If you decide to use multiple
calibrations, then BEAST, as always, samples the tree topology and branch lengths that
give good posterior probabilities, that is, it samples a region of the parameter space
proportional to its posterior probability. If multiple calibrations do not fit well with
each other in light of the sequence data, then some kind of compromised inference will
result. This could mean that some of the marginal posterior distributions of calibrated
divergence times end up far away from their prior distributions. As noted before, unless
carefully designed, multiple calibrations can interact with each other, so make sure that
the joint effect of the calibrations represents your prior information by running BEAST
without data. Alternatively, consider using BEAST’s implementation of the fossilised
birth–death prior for calibration (Gavryushkina et al. 2014; Heath et al. 2014).
Calibration and monophyly: Depending on the source of information used for
informing a calibration, the clade to which the calibration applies is often assumed
to be monophyletic (the oldest fossil penguin can’t be assigned to an ancestral
divergence in the tree if it is not precisely clear which taxa constitute the penguins).
Furthermore, if the calibrated clade is not constrained to be monophyletic, the MCMC
chain may mix poorly. When a taxon moves into the calibrated clade during the MCMC
chain, the tree prior and posterior may prefer an older age for the clade, whereas when
the taxon moves out again, the opposite will be true. Moving between these different
modes can pose a problem for the standard MCMC operators.1 For both of these rea-
sons we recommend constraining calibrated nodes to be monophyletic when it can be
justified.
Calibrating the age of the root vs. fixing the molecular clock rate: It can happen
that there is no timing information, or the timing information is of no interest to the
analysis, for example, if only the tree topology, or relative divergence times are required,
or the geographical origin in a phylogeographical analysis is of interest. It is tempting
to put a calibration on the root of the tree, perhaps with a small variance. However, if
timing is of no interest, it might be better to fix the clock rate to 1.0 because the MCMC
chain will mix more efficiently that way. This is because some operators act to change
the topology by changing the root node age, which is hampered by a tight calibration
on the root. Fixing the clock rate to 1.0 will not interfere with the efficiency of the tree
operators and will result in branch lengths that have units of expected substitutions.
7.3 Miscellanea
1 But if needed mixing problems like this can be addressed using Metropolis-coupled MCMC.
computers. More information for specific operating systems and on where packages are
installed is available on the BEAST wiki.
Log frequency: In order to prevent log files from becoming too large and hard to
handle, the log frequency should be chosen so that the number of states sampled to
the log file does not exceed 10 000. So, for a run of 10 million, choose 1000 and
for a run of 50 million, choose 5000 for the log frequency. Note that to be able to
resume it is a good idea to keep log intervals for trace logs, tree logs and the number
of samples between storing the state all the same. If they differ and an MCMC run is
interrupted, log files can be of different lengths and some editing may be required (see
below).
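The rule of thumb above amounts to dividing the chain length by the maximum number of logged states. A small sketch, with a hypothetical helper name:

```python
def log_every(chain_length, max_samples=10_000):
    """Smallest logging interval that keeps the log at or below max_samples states."""
    return max(1, -(-chain_length // max_samples))  # ceiling division


interval_10m = log_every(10_000_000)  # 1000
interval_50m = log_every(50_000_000)  # 5000
```

Using the same interval for the trace log, the tree log and the state file keeps all files aligned if the run is interrupted.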
Operator weights: There can be orders of magnitude difference between operator
weights. For example, the default weight for the uniform operator which changes diver-
gence times in a tree without changing the topology is 30, while the default weight for
the exchange operator on frequencies is just 0.01. This is because in general the age
of divergences in trees is hard to estimate hence requires many moves, while the fre-
quencies for substitution models tend to be strongly driven by alignment data in the
analysis. In BEAUti, you can change operator weights from the defaults. Typically, the
defaults give reasonable convergence for a wide range of problems. However, if you
find that ESSs are low for some parameters while many others are high, generally you
want to increase the weight on operators acting on parameters with low ESSs in order to
equalise out the mixing of different components of the posterior. Weights for operators
on the tree should increase if the number of taxa in the tree increases. Whether the
increase should be linear or sub-linear requires further research.
Fixed topology: The topology can be kept constant while estimating other param-
eters, for example divergence times. This can be done in BEAUti by setting the weights
of operators that change the topology of the tree to zero. Alternatively, the operators
can be removed from the XML. The standard operators that change the tree topol-
ogy are subtree-slide, Wilson–Balding and the narrow and wide exchange operators
(for information about tree proposals, see Höhna and Drummond 2012). For *BEAST
analysis, the node-reheight operator affects the topology of the species tree (Heled and
Drummond 2010).
Newick starting tree: A user-defined starting tree can be provided by editing the
XML. Using the beast.util.TreeParser class, a tree in Newick format can be
specified. There are a few XML files in the examples directory that show how to do this.
BEAST only accepts binary trees, so if your tree has polytomies you have to convert
the tree to a binary tree and create an extra branch of zero length. For example, if your
polytomy has three taxa and one internal node, say (A:0.3,B:0.3,C:0.3), then the binary
tree ((A:0.3,B:0.3):0.0,C:0.3) is a suitable representation of your tree.
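The conversion can be automated. The following sketch is not part of BEAST: it is a minimal Newick reader and writer, written for this illustration, that handles only plain labels and branch lengths (no quoted names or comments). It resolves each polytomy by grouping the first two children under a new branch of zero length.

```python
def parse_newick(text):
    """Parse a Newick string into nested (label, length, children) tuples.
    Branch lengths are kept as strings so they are written back unchanged."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                       # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                   # skip ","
                children.append(node())
            pos += 1                       # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label, length = text[start:pos], None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = text[start:pos]
        return label, length, children

    return node()


def binarise(tree):
    """Resolve every polytomy by inserting internal branches of length 0.0."""
    label, length, children = tree
    children = [binarise(child) for child in children]
    while len(children) > 2:
        children = [("", "0.0", children[:2])] + children[2:]
    return label, length, children


def to_newick(tree):
    """Write a (label, length, children) tree back as a Newick string."""
    label, length, children = tree
    text = "(" + ",".join(to_newick(c) for c in children) + ")" if children else ""
    text += label
    if length is not None:
        text += ":" + length
    return text


binary = to_newick(binarise(parse_newick("(A:0.3,B:0.3,C:0.3)")))
```

Which pair of children is grouped first is arbitrary here; any resolution with zero-length branches represents the same polytomy.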
Multi-epoch models: When there is reason to believe that the method of evolution
changes at different time intervals, for example because part of the time frame is gov-
erned by an ice age, a single substitution model may not be appropriate. In such a
situation, a multi-epoch model (Bielejec et al. 2014) can be useful. In the BEASTlabs
package there is an implementation of an EpochSubstitutionModel where you
can specify different substitution models at different time intervals.
Random number seed: When starting BEAST, a random number seed can be speci-
fied. Random number generators form a large specialised topic (Knuth 1997). A good
random number generator should not allow you to predict what a sequence will be for
seed B if you know the sequence for seed A (even if A and B are close). BEAST uses the
Mersenne Twister pseudo-random number generator (Matsumoto and Nishimura
1998), which is considered to be better than the linear congruential generators that are
the default in many programming languages, including Java. A pseudo-random number
generator produces random numbers in a deterministic way. That means that if you run
a particular version of BEAST with the same seed and the same input file, the outcome
will be exactly the same. It is a good idea to run BEAST multiple times to check
that these runs all converge on the same estimates. Needless to say, the runs should
be started with different seeds, otherwise the runs will all show exactly the same result.
Which seed you use does not matter and seeds that only differ by one result in com-
pletely different sequences of random numbers being generated. By default, BEAST
initialises a seed with the number of milliseconds since 1970 according to the clock
on your computer at the time you started BEAST. If you want a run you can exactly
reproduce you should override this with a seed number of your choice.
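The behaviour of seeds is easy to demonstrate outside BEAST. Python's random module happens to use a Mersenne Twister as well, so the sketch below illustrates the same properties; it does not reproduce BEAST's own random number stream, and the seed values are arbitrary.

```python
import random
import time


def draws(seed, n=5):
    """First n pseudo-random numbers produced from the given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


same_seed_same_numbers = draws(42) == draws(42)      # deterministic
neighbouring_seeds_differ = draws(42) != draws(43)   # seeds differing by one

# A clock-based default seed: milliseconds since 1970.
default_seed = int(time.time() * 1000)
```

Recording the seed alongside the XML file and the BEAST version is what makes a run exactly reproducible.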
Stopping early: Once a BEAST run is started, the process can be stopped due to
computer failure or by killing the process manually. If this happens, you should check
that the log files were properly completed, because the program might have stopped
in the middle of writing a line to the file. If a log line is corrupted, the partial line
should be removed. For some post-processing programs like Tracer, when doing a
demographic reconstruction it is important for the tree file to be properly finalised with
a line containing ‘End;’. Failing to do so may result in the analysis being halted.
Resuming runs: A state file representing the location in sample space and operator
tuning parameters can be stored at regular intervals so that the chain can be resumed
when the program is interrupted due to a power failure or unexpected computer outage.
BEAST can resume a run using this state file, which is named the same as the BEAST
XML input file name with .state appended to the end (e.g. beast.xml.state
for beast.xml). When resuming, the log files will be extended from the point of the
last log line. Log files need to end in the same sample number, so it is recommended
that the log frequency for all log and tree files is kept the same, otherwise a run that is
interrupted requires editing of the log files to remove the last lines so that all log files
end with the same sample number as the shortest log. The same procedure applies when
a run is interrupted during the writing of a log file and the log files become corrupted
during that process.
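The editing step can be sketched for trace logs as follows. This is an illustration only, under the assumption that a trace log is tab-separated, has optional comment lines starting with #, a header row, and then one row per sample with the sample number in the first column. Tree logs use a different format and need analogous treatment. The function names are hypothetical.

```python
def is_complete(line, n_fields):
    """A row is complete if it ends in a newline and has the expected field count."""
    return line.endswith("\n") and len(line.rstrip("\n").split("\t")) == n_fields


def trim_trace_log(lines, last_sample=None):
    """Drop corrupted rows and rows sampled after last_sample.
    Returns (kept_lines, final_sample_number)."""
    kept, n_fields, final = [], None, None
    for line in lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)                        # comment or blank line
        elif n_fields is None:
            n_fields = len(line.rstrip("\n").split("\t"))
            kept.append(line)                        # header row
        elif is_complete(line, n_fields):
            sample = int(line.split("\t", 1)[0])
            if last_sample is None or sample <= last_sample:
                kept.append(line)
                final = sample
    return kept, final


# Two hypothetical logs from one interrupted run: the first ends in a partial line.
log_a = ["Sample\tposterior\n", "0\t-10.5\n", "1000\t-9.8\n", "2000\t-9.9\n", "3000\t-9"]
log_b = ["Sample\tposterior\n", "0\t-10.5\n", "1000\t-9.8\n"]

_, end_a = trim_trace_log(log_a)
_, end_b = trim_trace_log(log_b)
common = min(end_a, end_b)                 # last sample present in every log
trimmed_a, _ = trim_trace_log(log_a, common)
```

The sketch is deliberately conservative: a final row without a trailing newline is treated as partial and removed.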
BEAGLE: BEAGLE (Ayres et al. 2012; Suchard and Rambaut 2009) is a library for
speeding up phylogenetic likelihood calculations that can result in dramatic improve-
ments in run time. You have to install BEAGLE separately, and it depends on your
hardware and data how much performance difference you will get. BEAGLE can utilise
some types of graphics processing units, which can significantly speed up phylogenetic
likelihood calculations, especially with large data sets or large state spaces like codon
models. There are a considerable number of BEAGLE options that can give a perfor-
mance boost, in particular switching off scaling when there are not a large number
of taxa, choosing single precision over double precision calculations and choosing the
SSE version over the CPU version of BEAGLE. These options may need some extra
command line arguments to BEAST. To see which options are available, run BEAST
with the -help option. It requires a bit of experimentation with different BEAGLE
settings to find out what gives the best performance. It actually depends on the data
in the analysis and the hardware you use.
Note that some models, such as the multi-epoch substitution model and stochastic
Dollo model (Nicholls and Gray 2008) and multi-state stochastic Dollo (Alekseyenko et
al. 2008), are not supported by BEAGLE. Also, the tree-likelihood for SNAPP (Bryant
et al. 2012) is not currently supported by BEAGLE.
8 Estimating species trees from
multilocus data
The increasing availability of sequence data from multiple loci raises the question of
how to determine the species tree from such data. It is well established that just con-
catenating nucleotide sequences results in misleading estimates (Degnan and Rosenberg
2006; Heled and Drummond 2010; Kubatko and Degnan 2007). There are a number
of more sophisticated methods to infer a species phylogeny from sequences obtained
from multiple genes. This chapter starts with an example of a single locus analysis to
highlight some of the issues, then details the multispecies coalescent. The remainder
describes two multilocus methods for inferring a species phylogeny from DNA and
SNP data respectively. Though even the multispecies coalescent may suffer from detectable
model misspecification (Reid et al. 2013), it has not been shown that it is worse than
concatenation.
Consider the situation where you have data from a single locus, but have a number
of gene sequences sampled from each species and you are interested in estimating
the species phylogeny. Arguably, even in this case, an approach that explicitly models
incomplete lineage sorting is warranted. The ancestral relationships in the species tree
can differ considerably from those of an individual gene tree, due (among other things)
to incomplete lineage sorting. This arises from the fact that, in the absence of gene
flow, a pair of genes sampled from related species must diverge
earlier than the corresponding speciation time (Pamilo and Nei 1988). More generally,
a species is defined by the collection of all its genes (each with their own history
of ancestry) and analysing just a single gene to determine a species phylogeny may
therefore be misleading, unless the potential discrepancy between the gene tree and the
species tree is explicitly modelled.
For example, consider a small multiple sequence alignment of the mitochondrial con-
trol region, sampled from 16 specimens representing four species of Darwin’s finches.
The variable columns of the sequence alignment are presented in Figure 8.1.
The alignment is composed of three partial sequences from each of Camarhynchus
parvulus and Certhidea olivacea, four from Geospiza fortis and six from G. magni-
rostris (Sato et al. 1999). The full alignment has 1121 columns and can be found
Figure 8.1 The 105 variable alignment columns from the control region of the mitochondrial
genome sampled from a total of 16 specimens representing four species of Darwin’s finches.
The full alignment is 1121 nucleotides long.
Figure 8.2 Single locus analysis of Darwin’s finches. Top left: gene trees. Top middle: species
trees. Top right: gene trees sampled from the prior when the species tree is fixed,
showing that the prior prefers gene trees that are as low as the species tree allows.
Bottom: the summary tree of the gene trees fitted in the species tree. Branch width of
the species tree indicates population sizes.
A standard non-*BEAST analysis of these data results in approximately the same tree
as the gene tree shown in Figure 8.2.
When the species tree and population sizes are fixed, a sample from the prior distri-
bution on gene trees is possible in a *BEAST analysis. Figure 8.2 shows three gene tree
samples from this prior, with the species tree set to the median estimate from a posterior
analysis of the Darwin’s finches data. It can be seen that with the estimated divergence
times and population sizes, paraphyly is expected for the two Geospiza species, whereas
monophyly is the tendency for both C. olivacea and C. parvulus.
The posterior point estimate of the gene tree enclosed in the corresponding species
tree estimate is shown at the bottom of Figure 8.2. This shows the summary tree for the
gene tree fitted in the summary tree for the species tree. Branch widths of the species tree
indicate population sizes estimated by *BEAST, showing increase in population sizes
with time going forward for all the species. The clades for C. parvulus and C. olivacea
are each monophyletic, lending support for the older estimates of their corresponding
speciation times relative to population sizes.
In an empirical study (Leaché and Rannala 2011) it was shown that Bayesian methods for
species tree estimation perform better than maximum likelihood and parsimony. In a
Bayesian setting the probability of the species tree S given the sequence data (D) can be
written (Heled and Drummond 2010):
$$f(g, S \mid D) = \frac{f(S)}{\Pr(D)} \prod_{i=1}^{m} \Pr(D_i \mid g_i)\, f(g_i \mid S), \tag{8.1}$$
where $D = \{D_1, \ldots, D_m\}$ is the set of m sequence alignments, one for each of the gene
trees, $g = \{g_1, \ldots, g_m\}$. The term $\Pr(D_i \mid g_i)$ is a standard tree likelihood, which typically
subsumes a substitution model, site model and clock model for each of the individual
genes (see Chapters 3 and 4 for details). $f(g_i \mid S)$ is the multispecies coalescent likelihood,
which is a prior on a gene tree given the species tree and $f(S)$ is the prior on the species
tree (see Section 2.5.1 for details).
The species tree prior $f(S)$ can be thought of as consisting of two parts: a prior on
the species time-tree ($g_S$), $f(g_S)$, and a prior on population sizes, $f(N)$, together giving
$f(S) = f(g_S) f(N)$. The species time-tree prior $f(g_S)$ is typically the Yule or birth–death
prior (see Chapter 2 for details).
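On the log scale, the unnormalised posterior of Equation (8.1) is a sum of the species tree prior terms and, for each locus, a tree likelihood plus a multispecies coalescent density. The sketch below shows only this bookkeeping: the function name and all numerical values are hypothetical, and in a real analysis each term is computed by BEAST from the model and data.

```python
def log_unnormalised_posterior(log_time_tree_prior, log_pop_size_prior,
                               log_tree_likelihoods, log_coalescent_densities):
    """log f(S) + sum_i [log Pr(D_i | g_i) + log f(g_i | S)], omitting log Pr(D)."""
    if len(log_tree_likelihoods) != len(log_coalescent_densities):
        raise ValueError("need one likelihood and one coalescent density per locus")
    log_f_S = log_time_tree_prior + log_pop_size_prior  # f(S) = f(g_S) f(N)
    per_locus = zip(log_tree_likelihoods, log_coalescent_densities)
    return log_f_S + sum(lik + coal for lik, coal in per_locus)


# Hypothetical values for m = 2 loci.
log_post = log_unnormalised_posterior(
    log_time_tree_prior=-12.0,
    log_pop_size_prior=-3.5,
    log_tree_likelihoods=[-1500.0, -1320.5],
    log_coalescent_densities=[-20.0, -18.25],
)
```

The normalising constant Pr(D) is left out because MCMC only needs ratios of posterior densities.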
In order to estimate a species tree with the posterior distribution in Equation (8.1) one
approach is to simply sample the full state space of gene trees and species tree using
MCMC and treat the gene trees (g) as nuisance parameters, thereby summarising the
marginal posterior distribution of the species tree. This effectively integrates the gene
trees out via MCMC (Heled and Drummond 2010; Liu 2008):
$$f(S \mid D) = \frac{f(S)}{\Pr(D)} \int_{G} \prod_{i=1}^{m} \Pr(D_i \mid g_i)\, f(g_i \mid S)\, dG. \tag{8.2}$$
An alternative approach that has been used in the application of the multispecies
coalescent to SNP data (Bryant et al. 2012) is to numerically integrate out the gene
trees for each SNP so that:
$$f(S \mid D) = \frac{f(S)}{\Pr(D)} \prod_{i=1}^{m} \Pr(D_i \mid S), \tag{8.3}$$
where $\Pr(D_i \mid S) = \int_{G_i} \Pr(D_i \mid g_i)\, f(g_i \mid S)\, dG_i$ (Bryant et al. 2012).
8.3 *BEAST
BEAST 2 includes a Bayesian framework for species tree estimation. The statistical
methodology described in this section is known by the name *BEAST (pronounced
‘star-beast’, which is an acronym for Species Tree Ancestral Reconstruction using
BEAST) (Heled and Drummond 2010). The model assumes no recombination within
Figure 8.3 A species tree with constant size per species branch (axes: time and population
size). For nS species this leads to 2nS − 1 population size parameters.
each locus and free recombination between loci. Approaches that include hybridisation
to the species tree are in development (Camargo et al. 2012; Chung and Ané 2011; Yu
et al. 2011). A tutorial is available for *BEAST that uses three gopher genes (Belfiore
et al. 2008) to estimate the species tree (see BEAST wiki).
*BEAST does not require that each gene alignment contains the same number of
sequences. It also does not need the same individuals to be sampled for each gene,
nor does it need to match individuals from one gene to the next. All that is needed is
that each sequence in each gene alignment is mapped to the appropriate species. Note
that *BEAST cannot be used with time-stamped sequences at the time of writing due
primarily to technical limitations in the implementation of the MCMC proposals. For
details on the multispecies coalescent model that underlies *BEAST, see Section 2.5.1.
Most multispecies coalescent models assume that the population size is constant over
each branch in the species tree (Figure 8.3). However, two other models of population
size history are implemented in *BEAST. The first allows linearly changing population
sizes within each branch of the species tree including the final ancestral population at
the root (see Figure 8.4). The second also allows linear changing population sizes, but
has a constant population size for the ancestral population stemming from the root (see
Figure 8.2 for an example of this latter option). The linear model is the most general
implemented in *BEAST. The other two models can be used when fewer data are
available.
The population sizes prior f (N) depends on the model used. For constant population
per branch (see Figure 8.3), the population size is assumed to be a sample from a gamma
distribution with a mean 2ψ and a shape of α, that is, Γ(α, ψ) (defaults to α = 2 at
Figure 8.4 The population size priors on the branches of a three-species tree, drawn against a
time axis with the ancestral and extant species marked. Labels in the figure: Ai ∼ Γ(k; Θ);
Aleft(i) ∼ Γ(k; Θ) and Aright(i) ∼ Γ(k; Θ), i.e. Aleft(i) + Aright(i) ∼ Γ(2k; Θ).
the time of writing). Unless we have some specific knowledge about population size,
an objective Bayesian-inspired choice might be fψ (x) ∝ 1/x for hyper-parameter ψ,
although note that this choice may be problematic if the marginal likelihood needs to be
computed by path sampling.
In the continuous linear model, we have ns population sizes at the tips of the species
tree, and two for each of the (ns − 1) internal nodes, expressing the starting population
size of each of the descendant species (Figure 8.4). The priors for the population sizes
at the internal nodes are as above, but those at the tips are assumed to come
from a Γ(2k, ψ) distribution. This is chosen to ensure a smooth transition at speciations
because independent X1, X2 ∼ Γ(k, ψ) implies X1 + X2 ∼ Γ(2k, ψ). This corresponds to having
the same prior on all final (most recent) population sizes of both extant and ancestral
species (see Figure 8.4).
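The additivity of gamma variables with a common scale can be checked by simulation. The sketch below uses hypothetical values k = 2 and ψ = 0.5 and compares the sample mean and variance of X1 + X2 with the values expected under Γ(2k, ψ), namely 2kψ and 2kψ². Agreement of two moments is consistent with the result but is not a proof.

```python
import random
from statistics import fmean, pvariance


def simulate_sum(k, psi, n, seed=1):
    """Draw n values of X1 + X2 with X1, X2 independent gamma(shape k, scale psi)."""
    rng = random.Random(seed)
    return [rng.gammavariate(k, psi) + rng.gammavariate(k, psi) for _ in range(n)]


k, psi = 2.0, 0.5              # hypothetical shape and scale
sums = simulate_sum(k, psi, 20_000)

mean_sum = fmean(sums)         # expected under gamma(2k, psi): 2 * k * psi
var_sum = pvariance(sums)      # expected under gamma(2k, psi): 2 * k * psi ** 2
```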
*BEAST has been applied to determine that polar bears are an old and distinct bear
lineage (Hailer et al. 2012), to distinguish between single and dual origin hypotheses of
rice (Molina et al. 2011), to analyse the speciation process of forest geckos (Leaché and
Fujita 2010) and to examine cryptic diversity in butterflies (Dincă et al. 2011).
gene tree is part of a certain species. At the multispecies coalescent tab you can choose
the population function (constant, linear or linear with constant root). Also, for each of
the gene trees you can specify the ploidy of the sequences.
Convergence: Note that *BEAST analysis can take a long time to converge, espe-
cially when a large number of gene trees are involved. The usual methods for speeding
up tree-likelihood calculations such as using threads and fine-tuning with BEAGLE
apply, as well as specifying as much prior information as possible. If few loci are
available, you can run a *BEAST analysis with just a few lineages (say 20) for an
initial run in order to determine model settings. Reducing sequences speeds up the
chain considerably without affecting accuracy of estimates too much, since adding more
sequences is not as effective as adding more loci.
Linking trees: If the genes are linked they should only be represented by a single tree
in *BEAST. So genes from the same ‘non-recombining’ region of the Y chromosome
should be represented by only a single tree in *BEAST (or any other multispecies
coalescent method for that matter).
*BEAST for species tree with single gene: Estimating a species tree from a single
gene tree is perfectly valid. In fact, we highly recommend it, as it will give much more
realistic assessments of the posterior clade supports. That is, it will correctly reduce the
level of certainty on species tree groupings, since incomplete lineage sorting may mean
the species tree is different from the gene tree, even in the face of high posterior support
for groupings in the gene tree topology.
Visualising trees: DensiTree can be used to visualise a species tree where the branch
widths represent population sizes. There are tools in the biopy package that can help
visualise species trees and gene trees as well.1 The *BEAST tutorial has some examples
on visualising species trees.
Population size estimates: In terms of accuracy, the topology of the species tree is
typically well recovered by a *BEAST analysis, the time estimates of the species tree
contain larger uncertainty and population size estimates contain even larger uncertainty.
The most effective way to increase accuracy of population size estimates is to add more
loci (Heled and Drummond 2010). As a result, the number of lineage trees increases,
and with it the number of coalescent events that are informing population size estimates.
Increasing sequence lengths helps in getting more accurate time estimates, but since
the number of coalescent events does not increase, population size estimates do not
improve in accuracy as much.
Species assignment: In a *BEAST analysis, it is assumed that you know the species
that a sequence belongs to. However, if it is uncertain whether a sequence belongs to
species A or species B, you can run two analyses, one with each assignment. The chain
with the best marginal likelihood (see Section 9.5) can be assumed to contain the correct
assignment (Grummer et al. 2014).
It can happen that the species assignment is incorrect, which results in coalescent
events higher up in the species tree than if the species assignment is correct. As a
result population size estimates will be unusually high. So, relatively high population
size estimates may indicate incorrect assignments, but it can also indicate the existence
of a cryptic species in your data.
Note that a species assignment of lineages does not enforce monophyly of the lin-
eages belonging to a single species, as shown in Figure 8.2.
8.4 SNAPP
SNAPP (SNP and AFLP Package for Phylogenetic analysis) is a package in BEAST to
perform MCMC analysis on SNP and AFLP data using the method described in Bryant
et al. (2012). It calculates Equation (8.1) just as for a *BEAST analysis, but works with
binary data such as SNP or amplified fragment length polymorphism (AFLP) instead
of nucleotide data typically used in *BEAST analysis. If we call the sequence values
‘green’ and ‘red’, this means we have to specify a substitution rate u for going from
green to red, and a substitution rate v for going from red to green. Another difference
with *BEAST is that instead of keeping track of individual gene trees, these are inte-
grated out using a smart mathematical technique. So, instead of calculating the integral
as in Equation (8.2) by MCMC, the integral is solved numerically. This means that the
individual gene trees are not available any more, as they are with *BEAST. A coalescent
process is assumed with constant population size for each of the branches, so with every
branch a population size is associated.
A common source of confusion, both for SNAPP and similar methods, is that the
rates of mutation and times are correlated, so we typically rescale the substitution rates
such that the average number of mutations per unit of time is one. Let μ be the expected
number of mutations per site per generation, g the length of generation time in years,
N the effective population size (number of individuals) and θ the expected number of
mutations between two individuals. For a diploid population, θ = 4Nμ. If μ is instead
the expected number of mutations per site per year then we would have θ = 4Nμg. In
the analysis in Bryant et al. (2012), time is measured in terms of expected number of
mutations. Hence
• the expected number of mutations per unit time is one;
• a branch of length τ in the species tree corresponds to τ/μ generations, or τ g/μ
years;
• the backward and forward substitution rates, u and v, are constrained so that the
total expected number of mutations per unit time is one, which gives
$$\frac{v}{u+v}\,u + \frac{u}{u+v}\,v = 1;$$
• the θ values are unaffected by this rescaling. If the true mutation rate μ is known,
then the θ values returned by the program can be converted into effective popu-
lation sizes using N = θ/(4μ).
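The conversions in the list above can be sketched as follows. The function names and all numerical values are hypothetical. The sketch assumes a diploid population, μ measured per site per generation, and stationary frequencies v/(u + v) for green and u/(u + v) for red, as implied by the constraint on u and v.

```python
def rates_from_frequency(pi_green):
    """Return (u, v) such that green has stationary frequency pi_green = v / (u + v)
    and one mutation is expected per unit time."""
    pi_red = 1.0 - pi_green
    return 1.0 / (2.0 * pi_green), 1.0 / (2.0 * pi_red)


def expected_mutation_rate(u, v):
    """Left-hand side of the constraint: v/(u+v) * u + u/(u+v) * v."""
    return v / (u + v) * u + u / (u + v) * v


def effective_population_size(theta, mu):
    """N = theta / (4 mu) for a diploid population."""
    return theta / (4.0 * mu)


def branch_length_in_years(tau, mu, generation_time):
    """tau / mu generations, multiplied by the generation time in years."""
    return tau * generation_time / mu


u, v = rates_from_frequency(0.25)            # hypothetical green frequency
rate = expected_mutation_rate(u, v)          # should equal one

N = effective_population_size(theta=0.004, mu=1e-8)                     # about 100 000
years = branch_length_in_years(tau=0.001, mu=1e-8, generation_time=25)  # about 2.5 million
```

Because the θ values are unaffected by the rescaling, only an external estimate of μ is needed to turn them into population sizes.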
There is no technical limitation to using SNAPP for divergence date estimation using
either calibrations or serially sampled data. However, this has not been formally
tested yet.
Prior α β κ
independent inverse gamma distributions for thetas, so (2/r) has an inverse gamma
(α, β) distribution. That means that r has density proportional to (1/r²) · invΓ(2/r | α, β),
where invΓ is the inverse gamma distribution.
The Cox–Ingersoll–Ross (CIR) process (Cox et al. 1985) ensures that the mean of θ
reverts to a Γ(α, β) distribution, but can deviate from it over time. The speed at which the rate
reverts is determined by κ. The correlation between time 0 and time t is $e^{-\kappa t}$, so with
the CIR prior the distribution for rates is not uniform throughout the tree.
When selecting ‘uniform’ we assume the rate is uniformly distributed in the range
0 ... 10 000, which means a large proportion of the prior indicates a large value, with a
mean of 5000.
Figure 8.5 Population size estimate issues for species tree (in grey) with three species and a
single gene tree in black. Left, sequences too close, middle sequences too diverse, right
insufficient sequences.
sequences are too diverse, all coalescent events happen in the root and no accurate
population size estimates can be guaranteed for all lower-lying branches. Make sure
you have at least a couple of sequences per species, otherwise it becomes impossible
to get an estimate of population sizes. This is because with just a single lineage for a
species, there will be no coalescent event in the branch of the species tree ending in the
taxon for that lineage, and population size estimates rely on coalescent events.
It is a good idea to run a SNAPP analysis with different settings for rate priors
to check that the estimates are not just samples from the prior but are informed by
the sequence data. If the rate estimates are shaped according to the prior parameters,
you should be suspicious of their accuracy. You can see this by inspecting the log file in
Tracer. When the estimate changes with the prior parameters in individual runs with
different priors you know that the estimates are not informed by sequence data, and
these estimates should be considered uninformative. In general, it is quite hard to get
good population size estimates. Like with *BEAST analysis, the estimate for the species
tree topology tends to be most accurate, while timing estimates tend to be less accurate,
and population size estimates even less accurate.
If you have large population size estimates relative to estimates for most branches,
this may indicate errors in the data. Incorrect species assignment for a sequence will
create the appearance of high diversity within the population it is placed in. Existence
of cryptic species in your data also leads to large effective population size estimates.
The models underlying SNAPP, as well as most of the methods in BEAST, assume a
randomly mating population, at least approximately. Violations of this assumption could
have unpredictable impacts on the remaining inferences. If cryptic species or popula-
tion substructure is suspected it is preferable to rerun the analysis with subpopulations
separated. In summary, large population size estimates should be reason for caution. By
using model selection it is possible to reliably assign lineages to species and perform
species delimitation (Leaché et al. 2014).
SNAPP is currently used in analysis of human SNPs, including some based on
ancient DNA sequences, European and African blue tits, the latter in Morocco and on
the Canary Islands, and various species of reptiles and amphibians such as western
fence lizards (Sceloporus occidentalis), horned lizards (Phrynosoma), West African
forest geckos (Hemidactylus fasciatus complex), African agama lizards (Agama agama
complex), tailed frogs (Ascaphus truei) and African leaf litter frogs (Arthroleptis
poecilonotus) (personal communications).
9 Advanced analysis
There are a number of reasons to run an analysis in which you sample only from the
prior before running the full analysis. One reason to do this is to confirm that a prior
is proper. Here we mean proper in a strict mathematical sense, that is, that the prior
integrates to unity (or any finite constant).
Although there are some situations in which improper priors can be argued for
(Berger and Bernardo 1992), having an improper prior typically results in an improper
posterior and if that is the case then the results of the MCMC analysis will be
meaningless, and its statistical properties undefined. Often the observed behaviour
of the chain will be that some parameters meander to either very large or very small
values and never converge to a steady-state target distribution. For example, a uniform
prior with bounds 0 and +∞ for a molecular clock rate will result in the clock rate
wandering to extreme values when an attempt is made to sample the prior in the absence
of data. Even if no problems are evident when sampling the posterior, certain analyses
rely on a proper prior and will return invalid results regardless of whether the posterior
appears to be sampled correctly (an example of a method that requires a proper prior
is path sampling for model comparison, see Section 1.5.7). For this reason, you should
choose proper priors unless you know what you are doing.
Another reason to sample from the prior is to make sure that the various priors do
not produce an unexpected joint prior in combination, or if they do, to check that
the resulting prior is close enough to the practitioner’s intentions. Especially when
calibrations are used this can be an issue (Heled and Drummond 2012, 2013) since
calibrations are priors on a part of a tree and a prior for the full tree is usually also
specified (like Yule or coalescent). This means there are overlapping priors on the same
parameter, which can produce unexpected results. Another situation occurs where a
calibration with an upper bound on an ancestral clade sets an upper bound on the age of
all the descendant clades, since none of them can exceed the age of their direct ancestor.
When a non-bounded calibration is specified on a descendant this prior will be truncated
at the top by the calibration on the ancestral clade. This phenomenon arises because
calibrations are specified as one-dimensional marginal priors, even though it would
be more appropriate to directly specify multi-dimensional priors when more than one
divergence is calibrated. The result of specifying independent one-dimensional priors
on divergences that are mutually constrained (e.g. x < y) can, for example, show up
Figure 9.1 Left, a distribution of a clade by sampling from the prior. The clade has a normal
(mean = 9, σ = 1 indicated by red line) calibration and a superclade has a uniform distribution
with upper bound of 10 (blue line). The distribution still has a mean close to 9, but is now clearly
asymmetric and the probability mass of the tail on the right is squeezed in between 9 and 10.
Right, a marginal density of MRCA time of a clade with a log-normal prior together with the
distribution of the MRCA time of a subclade and of a superclade. Note that the subclade follows
the parent’s distribution mixed with a tail to the left. The superclade has almost the same
distribution as the calibrated clade since the Yule prior on the tree tries to reduce the height of
the tree, but the calibration prevents it from pushing it lower.
as a truncated distribution when sampling from the prior (see Figure 9.1). If there is
any reason to believe a clade for which there is a calibration is monophyletic, it is
always a good idea to incorporate this information as a monophyletic constraint since
this typically reduces the complexity of interaction between multiple calibrations and
simplifies quantification of the joint prior.
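The truncation effect shown in Figure 9.1 (left) can be mimicked with a few lines of Python. This is only a sketch under strong simplifying assumptions: it ignores the tree prior and treats the upper bound of 10 on the superclade as a hard limit on the age of the calibrated clade, which otherwise follows the normal (mean = 9, σ = 1) calibration. The function name, seed and sample size are our own choices.

```python
import random
import statistics


def sample_truncated_calibration(n, mean=9.0, sd=1.0, upper=10.0, seed=1):
    """Rejection-sample clade ages from a normal calibration cut off at `upper`."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        age = rng.gauss(mean, sd)
        # Keep only ages that are positive and respect the ancestral bound.
        if 0.0 < age < upper:
            samples.append(age)
    return samples


ages = sample_truncated_calibration(20000)
# The sample mean drops below the nominal 9 and no sample exceeds the bound.
print(statistics.mean(ages), max(ages))
```

Comparing such a sample with the calibration you specified is the same check that sampling from the prior in BEAST provides, with the difference that BEAST also accounts for the tree prior.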
The easiest way to sample from the prior is to set up an analysis in BEAUti and on
the MCMC panel click the ‘Sample from prior’ checkbox. Running such an analysis
is typically very fast since most of the computational time in a full analysis is spent
calculating the phylogenetic likelihood(s) (see Section 3.7 for details of the algorithm),
which are not computed when sampling from the prior. After verifying that the prior
distribution is an adequate representation of your prior belief, uncheck the checkbox in
BEAUti and run the full analysis.
Calibrating one or more internal nodes (node dating), as described in Section 4.4, is one
way to introduce temporal information into an analysis. Another method to achieve this
is to use the sampling dates of the taxa themselves. Especially with fast-evolving species
such as RNA viruses like HIV and influenza, the evolutionary rate is high enough and
the sampling times wide enough that rates can be estimated accurately (Drummond et
al. 2003a).
Note that having serially sampled data is not always sufficient to establish a time
scale for the tree, or equivalently estimate molecular clock rates with any degree of
certainty. Figure 9.2 illustrates why in some situations having dated tips is not sufficient
to get an accurate divergence-time estimate. If the age of the root (t_root) is orders of
magnitude greater than the difference between the oldest and most recent samples (the
sampling interval t), one must wonder how accurate an estimate of the root age will be.
9.3 Demographic reconstruction 129
Figure 9.2 Left, tree without sampling dates for which no divergence-time estimate is possible
without calibrations on internal nodes or a strong prior on the rate of the molecular clock.
Middle, tree where tip date information is sufficiently strong (provided long sequences) to allow
divergence-time estimates with a good degree of certainty, since the sample dates cover a
relatively large fraction of the total age of the tree (sample interval t = (1/3) t_root). Right, tree
where tip date information is available, but may not be sufficient for accurate estimation
(sample interval t = (1/25) t_root), so divergence-time estimates will also be informed by the prior
to a large extent.
Relative population sizes through time can be estimated from a time-tree using coales-
cent theory as explained in Section 2.3. One of the nice features of such a reconstruction
population size parameter values can be exported to a spreadsheet where they can be
divided by τ and a new graph created. The median tends to be a more robust estimator
when the posterior distribution has a long tail, and a log scale for the y-axis can help
visualisation when population sizes vary over orders of magnitude, as can be the case
during periods of exponential growth or decline.
The advantage of the extended Bayesian skyline plot (EBSP; Heled and Drummond
2008) over the BSP is that it does not require specifying the number of pieces in the
piecewise population function. EBSP estimates the number of population size changes
directly from the data using Bayesian stochastic variable selection (BSVS).
For EBSP analysis, a separate log file is generated, which can be processed with
the EBSPAnalyser, which is part of BEAST 2. The program reads in the log file
and generates a tab-delimited file containing time and population size mean, median
and 95% HPD interval information that can be visualised with any statistical analysis
package (e.g. R) or spreadsheet program.
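The summary statistics mentioned here can also be computed directly from a list of posterior samples. The sketch below is not the EBSPAnalyser implementation. It uses a common definition of the 95% HPD interval as the shortest interval containing 95% of the samples, and the function names and the simulated samples are hypothetical.

```python
import math
import random
import statistics


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval that contains `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = min(n, max(1, math.ceil(mass * n)))  # samples inside the interval
    best_lo, best_hi = xs[0], xs[k - 1]
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


def summarise(samples):
    """Mean, median and 95% HPD bounds of one population size parameter."""
    lo, hi = hpd_interval(samples)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "hpd_lower": lo,
        "hpd_upper": hi,
    }


# Hypothetical long-tailed posterior sample of a population size.
rng = random.Random(1)
pop_sizes = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
summary = summarise(pop_sizes)
print(summary)
```

For a long-tailed sample like this one the mean is pulled towards the tail, which is why the median is often the more robust point estimate.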
The number of groups used by the EBSP analysis is recorded in the trace log file. It
is not uncommon when comparing a BSP analysis with an EBSP analysis to find that
(with default priors) the EBSP analysis uses on average fewer groups when analysing
a single locus. As a result, the demographic reconstruction based on EBSP does not
show a lot of detail, and in fact may often converge on a single group (which makes it
equivalent to a constant population size model). Be aware that the default prior on the
number of change points in the population size function may be prone to under-fitting,
so that increasing the prior on the number of population size changes may increase the
sensitivity of the analysis for the detection of more subtle signals in the data.
To increase the resolution of the demographic reconstruction, it is tempting to just
add more taxa. However, due to the nature of coalescence, EBSP, like all coalescent
methods, will generally benefit more from using multiple loci than from just adding
taxa (Heled and Drummond 2008). So, adding an extra locus, with an independent gene
tree but sharing the EBSP tree prior gives an increase in accuracy of the population
size estimation that is generally larger than just doubling the number of taxa of a single
alignment. This does not mean that adding more taxa does not help, just that adding
another locus will generally help more than adding the same number of new sequences
to existing loci. Likewise, increasing the length of the sequence tends not to help as
much as increasing the number of loci. For a more refined analysis, see Felsenstein (2006),
who made a careful study of the sampling tradeoffs in population size estimation using
the coalescent.
However, the reason these HPD intervals do not behave as expected is that the bound-
aries of the underlying piecewise-constant population function coincide with coalescent
events (by design). If instead the boundaries were chosen to be equidistant, then the
expected increase in uncertainty going back towards the root would occur because on
average there would be fewer coalescent events in intervals that are nearer to the root
and the estimated times of those coalescent events nearer the root would also be more
uncertain. However, because of the designed coincidence of population size changes at
coalescent events in the BSP and EBSP, the number of coalescent events per interval (i.e.
the amount of information available to estimate population size per interval) is a priori
equal across intervals. So in (E)BSP models, uncertainty in population size is traded
for a loss of resolution in the timing of population size changes (since the underlying
population size can typically only change at wide intervals near the root of the tree).
Note that this modelling choice is different from that used in the birth–death skyline
plot (Stadler et al. 2013), in which the piecewise function is chosen to have changes
at regularly spaced intervals. Consequently, (net) birth rates reconstructed through time
using the birth–death skyline model tend to show 95% HPD intervals that grow larger
the further one goes back in time.
It is sometimes the case that a horizontal line can be drawn through the 95% HPD
intervals of the resulting BSP. This doesn’t mean that the reconstruction is consistent
with a constant population size. A better way to decide whether there is a trend in
population size is to count the fraction of trees in the posterior sample that suggest
a trend. For example, if more than 95% of the posterior samples describe a skyline
plot where the population size at the root is smaller than at the tips, then a growing
population size can be inferred. In the case of EBSP, a more direct test is simply to
calculate the posterior probability that there is more than one piece in the population
size function.
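Counting the fraction of posterior samples that support a trend is straightforward once the relevant quantities have been extracted from the logs. In the sketch below the input lists are hypothetical: each index corresponds to one posterior sample, giving the population size at the root, the population size at the tips and, for EBSP, the number of pieces in the population size function. How these values are extracted from the log files is not shown.

```python
def fraction_growing(root_sizes, tip_sizes):
    """Fraction of posterior samples whose root population size is below the tip size."""
    if not root_sizes or len(root_sizes) != len(tip_sizes):
        raise ValueError("need two non-empty lists of equal length")
    growing = sum(1 for root, tip in zip(root_sizes, tip_sizes) if root < tip)
    return growing / len(root_sizes)


def prob_more_than_one_piece(piece_counts):
    """Posterior probability that the population size function has more than one piece."""
    if not piece_counts:
        raise ValueError("need at least one sample")
    return sum(1 for count in piece_counts if count > 1) / len(piece_counts)


# Hypothetical posterior samples.
root_sizes = [1.0, 2.0, 3.0, 10.0]
tip_sizes = [5.0, 5.0, 5.0, 5.0]
piece_counts = [1, 1, 2, 3]

print(fraction_growing(root_sizes, tip_sizes))
print(prob_more_than_one_piece(piece_counts))
```

With real data, growth would be inferred only if the first fraction exceeded 0.95, following the criterion given in the text.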
One reason that a demographic reconstruction can come with large 95% HPD
intervals, suggesting large uncertainty in population sizes, is uncertainty in the
molecular clock rate. Figure 9.4 shows the effect of having low uncertainty in the
relative divergence times, but large uncertainty in the overall molecular clock rate. If
only trees corresponding to low molecular clock rates and thus old trees were considered
in the posterior, the BSP would exhibit narrow 95% HPD intervals; likewise for trees
associated with high molecular clock rates. However, since the posterior contains both
the old and young trees, together the 95% HPD interval spans the range and the HPD
bounds become quite large. A BSP drawn in units of fractions of the root age or based on
a fixed clock rate (say the posterior median estimate of the clock rate) would show much
smaller credible intervals, and may uncover a strong underlying signal for population
dynamics, that would otherwise be confounded.
Finally, one has to be aware that non-parametric methods like BSP can contain systematic
bias due to sampling strategies when using measurably evolving populations
(Silva et al. 2012). This can lead to conclusions that an epidemic slowed, while the
facts do not support this. On the other hand, parametric methods can give unbiased
estimates if the population sizes are large enough, and the correct parametric model is
employed.
Figure 9.4 A BSP analysis in which the relative divergence times are well estimated, but the
absolute times are not, will produce a posterior distribution containing trees with a wide
variation of root ages. It may be the case that conditional on a particular root age, there is a
strong trend in population size through time with low uncertainty, but as a result of the
uncertainty in the absolute time scale the overall result is a BSP with large uncertainty. In this
hypothetical example, the time-trees with old root ages have a BSP indicated by the
flattest curve (green) with (conditional) 95% HPD indicated by dashed lines. Likewise,
the youngest time-trees from the posterior sample can have a BSP indicated by the curve with
the youngest transition in population size (blue). The middle curve shows a conditional
demographic estimate associated with an average tree age. However, together the 95% HPD
will cover all of these curves, which shows up as large 95% HPD intervals in the skyline plot
of the full posterior.
BEAST can perform various phylogeographical analyses. One way to look at these
analyses is as extending alignments with extra information to indicate the location. For
discrete phylogeographical analysis, a single character (data column) is added and for
continuous analysis a latitude–longitude pair is added. These characters share the tree
with the alignment, but have their own substitution and clock model. That is why for
discrete and continuous phylogeography they are treated in BEAUti as just a separate
partition, and they are listed in the alignment tab just like the other partitions.
Discrete phylogeography (Lemey et al. 2009a) can be interpreted as a way to do
ancestral reconstruction on a single character, which represents the location of the taxa.
Sampled taxa are associated with locations and the ancestral states of the internal nodes
in a tree can be reconstructed from the taxon locations. In certain circumstances it is
not necessarily obvious how to assign taxa to a set of discrete locations, or indeed how
to choose the number of discrete regions used to describe the geographical distribution
of the taxa. The number of taxa must be much larger than the number of regions for
the analysis to have any power. This approach was applied to reconstruction of the
initial spread of human influenza A in the 2009 epidemic (Lemey et al. 2009b), tracing
Clostridium difficile, which is the leading cause of antibiotic-associated diarrhoea
worldwide (He et al. 2013) and geospatial analysis of musk ox with ancient DNA
(Campos et al. 2010).
Besides the ancestral locations, the key object of inference in a phylogeographical
model is the migration rate matrix (see Chapter 5). In a symmetrical migration model
with K locations there are K(K−1)/2 migration rates, labelled 1, 2, ..., K(K−1)/2. These
labels are in row-major order, representing the rates in the upper right triangle of the
migration matrix (see Figure 9.5). For a non-symmetric migration model the number of
rates is K(K−1).
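The labelling of migration rates can be made concrete with a small sketch. The two functions below are hypothetical helpers, not part of BEAST. They map a pair of 0-based location indices to the 1-based rate label, using row-major order over the upper right triangle for the symmetric model and, as read off Figure 9.5, row-major order over all off-diagonal entries for the non-symmetric model.

```python
def symmetric_rate_label(i, j, k):
    """1-based label of the rate between locations i and j in a symmetric model."""
    if i == j or not (0 <= i < k and 0 <= j < k):
        raise ValueError("need two distinct locations in the range 0..k-1")
    if i > j:
        i, j = j, i  # the rate is shared by both directions
    # Rows above row i hold (k-1) + (k-2) + ... + (k-i) upper-triangle entries.
    return i * (k - 1) - i * (i - 1) // 2 + (j - i)


def asymmetric_rate_label(i, j, k):
    """1-based label of the rate from location i to location j in a non-symmetric model."""
    if i == j or not (0 <= i < k and 0 <= j < k):
        raise ValueError("need two distinct locations in the range 0..k-1")
    # Each row holds k-1 rates because the diagonal is skipped.
    return i * (k - 1) + (j if j < i else j - 1) + 1


locations = "ABCDE"
k = len(locations)
print(symmetric_rate_label(locations.index("C"), locations.index("D"), k))
print(asymmetric_rate_label(locations.index("D"), locations.index("A"), k))
```

For the five locations of Figure 9.5 the largest labels are K(K−1)/2 = 10 and K(K−1) = 20 respectively.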
For continuous phylogeography (Lemey et al. 2010), the locations (or regions) of the
individual taxa need to be encoded in latitude and longitude. A model of migration via a
random walk is assumed, and this makes such an analysis a lot more powerful than a dis-
crete phylogeography analysis since distances between locations are taken into account.
For many species, a random walk is a reasonable null model for migration, especially
when the dispersal kernel is allowed to be over-dispersed compared to regular Brownian
motion (Lemey et al. 2010). However, for species that have seasonal migrations such as
shore birds, this may not be a good null model. Likewise, viruses transmitted by humans
     A   B   C   D   E            A   B   C   D   E
A    -   1   2   3   4       A    -   1   2   3   4
B    1   -   5   6   7       B    5   -   6   7   8
C    2   5   -   8   9       C    9  10   -  11  12
D    3   6   8   -  10       D   13  14  15   -  16
E    4   7   9  10   -       E   17  18  19  20   -
Figure 9.5 Left, a matrix for ancestral reconstruction with five locations {A,B,C,D,E}, with
symmetric rates. Right, a matrix with five locations {A,B,C,D,E}, with non-symmetric rates.
9.5 Bayesian model comparison 135
Since BEAST provides a large number of models to choose from, an obvious question is
which one to choose. The most sound theoretical framework for comparing two models
Figure 9.6 Reconstruction of hepatitis B migration from Asia through northern North America
using the landscape-aware model. The summary tree is projected onto the map as a thick line,
and the set of trees representing the posterior is projected onto the map as lightly coloured dots
indicating some uncertainty in the migration path, especially over some of the islands. Note the
migration backwards into Alaska after first moving deeply into Canada. The model assumed
much higher migration along coastlines than over water or land.
From a practical standpoint, it should be obvious that you cannot do a valid BF analysis
if one of the chains has not converged, as indicated by low effective sample sizes (ESSs).
So unless you are able to get good ESSs for all competing models, your BF estimates
are meaningless. A low ESS for a particular model might indicate that there is not much
signal in the data and you will have to make sure your priors on the model parameters are
proper and sensible. Also, see troubleshooting tips (Section 10.3) for more suggestions.
Once you have adequate posterior samples from each of the competing models, the
easiest, fastest but also least reliable way (Baele et al. 2012) to compare models is to
use the harmonic mean estimator (HME; see Section 1.5.6) to estimate the marginal
likelihood. This can be done in Tracer (Rambaut and Drummond 2014) by loading the
log files and using the Analysis → Model Comparison menu. The HME is known to
be very sensitive to outliers in the posterior sample. Tracer implements a smoothed
HME (sHME) that employs a bootstrap approach (Redelings and Suchard 2005) by
taking a number of pseudoreplicates from the posterior sample of likelihoods so that
the standard deviation of the estimate can be calculated, which accounts for the effect
of outliers somewhat. Still, even the sHME is not recommended since like the HME it
tends to produce poor estimates of the marginal likelihood (Baele et al. 2013a).
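To see why the HME is so sensitive to outliers, it helps to write the estimator down. The sketch below is not Tracer's implementation and omits the smoothing and bootstrap steps. It computes the plain harmonic mean of the sampled likelihoods, working with log-likelihoods to avoid numerical underflow. The function names and the example values are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of the likelihoods in a posterior sample."""
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("need at least one sample")
    # The harmonic mean is n / sum(1 / L_i), so its log is log(n) - log(sum(exp(-log L_i))).
    return math.log(n) - log_sum_exp([-ll for ll in log_likelihoods])


# Hypothetical posterior samples of the log-likelihood.
typical = [-100.0] * 999
with_outlier = typical + [-150.0]

print(log_harmonic_mean(typical))
print(log_harmonic_mean(with_outlier))
```

A single sample with a very low likelihood among a thousand is enough to shift the estimate by tens of log units, which illustrates why the text advises against relying on this estimator.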
Path sampling and the stepping stone algorithm (Baele et al. 2012; Xie et al. 2011) are
more advanced algorithms for estimating the marginal likelihood (see Section 1.5.7).
They work by running an MCMC chain that samples a family of target distributions
π_β(θ) = f(θ) Pr(D|θ)^β constructed from a schedule of values of β, where f(θ) is the
prior, Pr(D|θ) the likelihood, θ the parameters of the model and D the data. When
β = 0, the chain samples from the prior; when β = 1, it samples from the posterior; and
intermediate values of β bridge between these two extremes. Of the two approaches,
the stepping stone method tends to be more robust. Empirically, a set of values for β,
the steps, that gives an efficient estimate of the marginal likelihood (Xie et al. 2011)
is obtained by following evenly spaced quantiles of a Beta(0.3, 1.0) distribution. To set up a path
sampling analysis in BEAST you need to set up a new XML file that refers to the
MCMC analysis of the model. To reduce computation on burn-in, the end-state of a run
for one value β is used as the starting state for the next β value. This works for every
value of β, except for the first run, which requires getting through burn-in completely.
So, you need to specify (1) a burn-in for the first run, (2) a burn-in for consecutive runs
and (3) the chain length for generating the samples. Appropriate values depend on the
kind of data and model being used, but burn-in for the first run should not be less than
burn-in used for running a standard MCMC analysis on the same data. The log files can
be inspected with Tracer to see whether the value of burn-in is sufficiently large and that
the chain length produces ESSs such that the total ESS used for estimating the marginal
likelihood is sufficiently large.
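The schedule of β values can be written down explicitly. The sketch below assumes that the steps are placed at evenly spaced quantiles of a Beta(0.3, 1.0) distribution. Since the cumulative distribution function of Beta(a, 1) is x^a, the quantile at probability u is u^(1/a). The function name is hypothetical, and the order in which BEAST visits the steps is not addressed here.

```python
def beta_schedule(n_steps, alpha=0.3):
    """Values of beta at evenly spaced quantiles of a Beta(alpha, 1) distribution."""
    if n_steps < 1:
        raise ValueError("need at least one step")
    return [(k / n_steps) ** (1.0 / alpha) for k in range(n_steps + 1)]


schedule = beta_schedule(10)
for beta in schedule:
    print(f"{beta:.6f}")
```

Most of the steps lie close to β = 0, that is, close to the prior, which is where the power posterior typically changes most quickly.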
The number of steps needs to be specified as well, and this number is also dependent
on the combination of model and data. To determine an appropriate number of steps,
run the path sampling analysis with a low number of steps (say ten) first, then increase
the number of steps (with, say, increments of ten, or doubling the number of steps)
and see whether the marginal likelihood estimates remain unchanged. Large differences
between estimates indicate that the number of steps is not sufficiently large. It may not
be practical to run a path sampling analysis because of the computational time involved,
especially for large analyses where running the main analysis can take days or longer.
In these cases a pragmatic decision can be made using the AICM method instead (see
Section 1.5.6 for details).
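The advice to increase the number of steps until the estimate stabilises can be tried out on a toy problem where the answer is known. The sketch below is not BEAST's path sampling implementation. It assumes a normal model (θ ~ Normal(0, 1), each observation y ~ Normal(θ, 1)) for which the power posterior is itself normal and can be sampled exactly, so no MCMC or burn-in is needed. It uses the stepping stone estimator with β values at evenly spaced quantiles of Beta(0.3, 1.0). All names and data values are hypothetical.

```python
import math
import random


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(theta, data):
    """Log-likelihood of the data when each y ~ Normal(theta, 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2 for y in data)


def exact_log_marginal(data):
    """Analytic log marginal likelihood under the prior theta ~ Normal(0, 1)."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
            - 0.5 * (ss - s * s / (1 + n)))


def stepping_stone(data, n_steps, n_samples=2000, alpha=0.3, seed=1):
    """Stepping stone estimate of the log marginal likelihood."""
    rng = random.Random(seed)
    n, s = len(data), sum(data)
    betas = [(k / n_steps) ** (1.0 / alpha) for k in range(n_steps + 1)]
    log_z = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # The power posterior at b_prev is Normal(mean, 1 / precision).
        precision = 1.0 + b_prev * n
        mean, sd = b_prev * s / precision, precision ** -0.5
        terms = [(b_next - b_prev) * log_likelihood(rng.gauss(mean, sd), data)
                 for _ in range(n_samples)]
        # Log of the sample mean of L^(b_next - b_prev).
        log_z += log_sum_exp(terms) - math.log(n_samples)
    return log_z


data = [0.8, 1.3, 0.2, 1.1, 0.6]  # hypothetical observations
print("exact", exact_log_marginal(data))
for n_steps in (10, 20, 40):
    print(n_steps, stepping_stone(data, n_steps))
```

In a real analysis the exact value is unknown, so the only available check is whether the estimates stop changing as the number of steps grows.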
The efficiency of running a path sampling analysis can be improved by using threads
or a high-performance cluster. Model comparison is an active area of research, so it is
possible that new more efficient and more robust methods will be available in the near
future.
Simulation studies can be used to find out the limits of the power of some models. Some
questions that can be answered using simulation studies are:
• How much sequence data is needed to estimate the model parameters reliably?
• How well can a tree topology be recovered using a specific model?
• How uncertain are rate estimates under various tip sampling schemes?
BEAST contains a sequence simulator that can be used to generate synthetic mul-
tiple sequence alignments. It requires specification of a tree-likelihood, with its tree,
site model, substitution model and molecular clock model, and generates a random
sequence alignment according to the specification. The tree can be a fixed tree speci-
fied in Newick, or a random tree generated from a coalescent process, possibly with
monophyletic constraints and calibrations. The site and substitution model, as well as
the molecular clock model, can be any of the ones available in BEAST or one of its
packages. However, note that using a non-strict clock model requires extra care because
of the way it is initialised.
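The idea behind a sequence simulator can be illustrated with a minimal sketch. This is not the BEAST sequence simulator. It assumes a fixed tree given as nested tuples, the Jukes–Cantor (JC69) substitution model, no rate variation among sites and branch lengths measured in expected substitutions per site. All names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under the JC69 model."""
    # Probability that a site ends the branch in a different state.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(tree, seq, rng, alignment):
    """Recursively simulate down a tree of (name, branch_length, children) tuples."""
    name, branch_length, children = tree
    seq = evolve(seq, branch_length, rng)
    if not children:
        alignment[name] = seq  # only the tips end up in the alignment
    for child in children:
        simulate(child, seq, rng, alignment)
    return alignment


rng = random.Random(1)
root_seq = "".join(rng.choice(BASES) for _ in range(200))
tree = ("root", 0.0, [
    ("A", 0.10, []),
    ("BC", 0.05, [("B", 0.05, []), ("C", 0.05, [])]),
])
alignment = simulate(tree, root_seq, rng, {})
for name in sorted(alignment):
    print(name, alignment[name][:40])
```

Analysing many such alignments with the model that generated them is one way to approach the questions listed above, for example how often the true topology is recovered.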
To perform simulation studies that involve dynamics of discrete populations the
Moments and Stochastic Trees from Event Reactions (MASTER) (Vaughan and
Drummond 2013) package can be used. It supports simulation of single and multiple
population size trajectories as well as corresponding realisations of the birth–death
branching processes (trees). Some applications include: simulation of dynamics under
a stochastic logistic model, estimating moments from an ensemble of realisations of
an island migration model, simulating an infection transmission tree from an epidemic
model and simulating structured coalescent trees. MASTER can be used in conjunction
with a sequence simulator to generate alignments under these models. For example,
a tree simulated from the structured coalescent can be used to generate an alignment,
which can be used to try to recover the original tree and the population parameters that
generated it.
10 Posterior analysis and
post-processing
In this chapter we will have a look at interpreting the output of an MCMC analysis. At
the end of a BEAST run, information is printed to the screen, and saved in trace log and
tree log files. This chapter considers how to interpret the screen and trace log, while the
next chapter deals with tree logs. We have a look at how to use the trace log to compare
different models, and diagnose problems when a chain does not converge. As you will
see, we emphasise comparing posterior samples with samples from the prior, since you
want to be aware whether the outcome of your analysis is due to the data or a result of
the priors used in the analysis.
Interpreting BEAST screen log output: At the end of a BEAST run, some infor-
mation is printed to screen (see listing on page 88) detailing how well the operators
performed. Next to each operator in the analysis, the performance summary shows the
number of times an operator was selected, accepted and rejected. If the acceptance
probability is low (< 0.1) or very low (< 0.01) this may be an indication that either
the chain did not mix very well, or that the tuning parameters for the operator were not
appropriate for this analysis. BEAST provides some suggestions to help with the latter
case. Note that a low acceptance rate does not necessarily mean that the operator is not
appropriately parameterised. For example, when the sequence alignment data strongly
support one particular topology, operators that make large changes to the topology (like
the wide exchange operator) will almost always be rejected. So, some common sense is
required in interpreting low acceptance rates.
If the acceptance rate is high (> 0.5), this indicates that the operator is probably
making jumps that are too small, and BEAST may produce a suggestion to change
a parameter setting for the operator. The exceptions are operators that use the
Gibbs distribution (Geman and Geman 1984), which are generally efficient and always
accepted.
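The rules of thumb above can be collected into a small helper that flags operators for closer inspection. This is a hypothetical sketch and not part of BEAST. The operator names and counts are invented, and the thresholds are the ones quoted in the text. As noted above, a flag is only a prompt to look more closely, not proof of a problem.

```python
def assess_operator(accepted, rejected, always_accepted=False):
    """Return the acceptance rate of an operator and a rough verdict."""
    total = accepted + rejected
    if total == 0:
        return None, "never selected"
    rate = accepted / total
    if always_accepted:
        verdict = "ok (Gibbs-type operator)"
    elif rate < 0.01:
        verdict = "very low"
    elif rate < 0.1:
        verdict = "low"
    elif rate > 0.5:
        verdict = "high"
    else:
        verdict = "ok"
    return rate, verdict


# Hypothetical operator summary: name -> (accepted, rejected).
operators = {
    "wide exchange": (5, 995),
    "tree scaler": (250, 750),
    "clock rate scaler": (800, 200),
}
for name, (accepted, rejected) in operators.items():
    rate, verdict = assess_operator(accepted, rejected)
    print(f"{name}: {rate:.3f} {verdict}")
```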
For relaxed clock models, if the uniform operator on the branch-rate categories
parameter has a good acceptance probability (say > 0.1) then you do not need the
random walk integer operator on branch-rate categories. You could just remove it
completely and increase the weight on the uniform operator on branch-rate categories.
Any operator that changes the branch-rate categories parameter changes the rates
on the branches, and thus the branch lengths in substitutions and therefore the
likelihood.
Probably the first thing to do after a BEAST run has finished is to determine that the
chain has converged to the target distribution. There are many statistics for determining
whether a chain has converged (Brooks and Gelman 1998; Gelman and Rubin 1992;
Gelman et al. 2004; Geweke 1992; Heidelberger and Welch 1983; Raftery and Lewis
1992; Smith 2007), all of which have their advantages and disadvantages, in particular
sensitivity to burn-in. The practitioner of MCMC should always visually inspect the
trace log in Tracer (see Figure 6.7 for a screenshot) to detect obvious problems with the
chain, and check the ESS (Kass et al. 1998).
How much burn-in? The first thing to determine is the amount of burn-in that
the chain requires. Burn-in is the number of samples the chain takes to reach
equilibrium at the target distribution, and these samples must be discarded. By default, Tracer
assumes that 10% of the chain is burn-in and if this is not sufficient the chain probably
needs to be run for longer. In fact, a hardline view held by some Bayesian MCMC
aficionados is that if any burn-in is required, then the chain has not been run long
enough. More pragmatically, you really need to visually inspect the graph of each of the
posterior statistics that are logged since it is usually easy to see when burn-in is reached;
during burn-in the values increase or decrease steadily and an upward or downward
trend is obviously present. After burn-in the trace should not show a trend any more,
and should oscillate around a stationary distribution (the target distribution). The smaller the
amount of burn-in, the more samples are available. If a substantial proportion of the
chain (>10%) is burn-in, it becomes difficult to be sure the remaining part is a good
representation of the posterior distribution. In such a case it would be prudent to run
the chain for, say, ten times longer, or run many independent chains and combine their
post-burn-in portions.
As you can see in Figure 10.1, it is very important to actually visually inspect the
trace output to make sure that burn-in is removed, and when running multiple chains
that all the runs have converged on the same distribution.
Figure 10.1 Left, trace with an ESS of 36 that should be rejected, not only due to the low ESS but
mainly due to the obvious presence of burn-in that should be removed. Right, trace with ESS of
590 that is acceptable.
10.1 Trace log file interpretation 141
All about ESS: After determining the correct burn-in, the next thing to check is
the ESS (Kass et al. 1998) of the sampled parameters. The ESS is the number of
independent draws from the posterior distribution that the Markov chain is equivalent
to. This differs from the actual number of samples recorded in the trace log, which are
autocorrelated and thus dependent samples. In general, a larger ESS means the estimate
is a better approximation of the true posterior distribution. Tracer flags ESSs smaller
than 100, and also indicates whether ESSs are between 100 and 200. But this may be
a bit liberal and ESSs over 200 are more desirable. On the other hand, chasing ESSs
larger than 10 000 is almost certainly a waste of computational resources. Generally
ESSs over 200 would be adequate for most purposes so ≈1000 is very good. If the
ESS of a parameter is small then the estimate of the marginal posterior distribution of
that parameter can be expected to be poor. In Tracer you can calculate the standard
error of the estimated mean of a parameter (note that this is not the standard deviation
of the parameter, but the error in the estimate of the mean of the marginal posterior
distribution of the parameter). If the ESS is small then the standard error will be large,
and vice versa.
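The link between autocorrelation, ESS and the standard error of the mean can be sketched in R. This is an illustrative estimator only, not the one Tracer implements: the function names `ess_simple` and `se_mean`, the simulated traces, and the rule of truncating the autocorrelation sum at the first non-positive lag are all assumptions made for this example.

```r
# Rough ESS estimate: N / (1 + 2 * sum of positive-lag autocorrelations).
# Assumption: the sum is truncated at the first non-positive autocorrelation;
# Tracer uses its own estimator, so the numbers will not match exactly.
ess_simple <- function(x) {
  n <- length(x)
  rho <- acf(x, lag.max = n - 1, plot = FALSE)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  n / (1 + 2 * sum(rho))
}

# Standard error of the estimated posterior mean (not the posterior sd)
se_mean <- function(x) sd(x) / sqrt(ess_simple(x))

set.seed(1)
indep <- rnorm(2000)                                   # uncorrelated draws
autoc <- as.numeric(arima.sim(list(ar = 0.9), 2000))   # autocorrelated trace

ess_simple(indep)   # close to the number of samples
ess_simple(autoc)   # far smaller than 2000
se_mean(autoc)      # larger than sd(autoc) / sqrt(2000)
```

Because the standard error is the posterior standard deviation divided by the square root of the ESS, the autocorrelated trace gives a much less precise estimate of the mean than the same number of independent draws would.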
The ESSs of parameters in the log file don’t necessarily tell you whether the MCMC
chain is mixing in phylogenetic tree space. At the moment BEAST does not include any
tools for directly examining the ESS of the tree or clade statistics. For this purpose a
program like AWTY can be used (Nylander et al. 2008).
Increasing ESS: These are some ways of increasing the ESS of a parameter:
• The most straightforward way of increasing the ESS is to increase the chain
length. This obviously requires more computer resources and may not be
practical.
• If only a few items get low ESSs, generally you want to reduce the weight on
operators that affect parameters with very high ESSs and increase weights on
operators operating on the parameters with low ESSs.
• You can increase the sampling frequency. The ESS is calculated by measuring
the correlation between sampled states in the chain which are the entries in the
log file. If the sampling frequency is very low these will be uncorrelated. This
will be indicated by the ESS being approximately equal to the number of states
sampled in the log file (minus the burn-in). If this is the case, then it may be
possible to improve the ESSs simply by increasing the sampling frequency until
the samples in the log file begin to be autocorrelated. However, be warned that
sampling too frequently will not affect the ESSs but will increase the size of the
log file and the time it takes to analyse it.
• Combine the results of multiple independent chains. It is a good idea to do
multiple independent runs of your analyses and compare the results to check
that the chains are converging and mixing adequately. If this is the case then
each chain should be sampling from the same distribution and the results could
be combined (having removed a suitable burn-in from each). The continuous
parameters in the log file can be analysed and combined using Tracer. The
tree files will currently have to be combined manually using a text editor or
LogCombiner. An advantage of this approach is that the different runs can be
x% HPD interval: Tracer shows the 95% highest posterior density (HPD) interval
of every item that is in the log. In general, the x% HPD interval is the smallest interval
that contains x% of the samples. If one tail is much longer than the other then most of
the removed values will come from the longer tail.
To calculate the x% HPD interval, sort the values and consider the interval starting
from the very smallest value to the x-percentile value and store the difference in the
values (that is, the size of the interval). Now increment one value at a time (at both
ends, so that you always have x% of the values in the interval) until you get to the
interval that starts at the (100-x)-percentile value and finishes at the largest value. Each
step compares the new interval width to the minimum found so far. If it is smaller than
the minimum so far, update the minimum and store the start value. This algorithm takes
O(N) operations where N is the size of the sample. Since the sample needs to be sorted
first, which takes O(N log N), this dominates the calculation time.
The x% central posterior density (CPD) interval, on the other hand, is the interval
containing x% of the samples after [(100-x)/2]% of the samples are removed from each
tail. The shorthand for both types of interval is the x% credible interval.
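The sliding-window procedure for the HPD interval and the tail-trimming definition of the CPD interval can be sketched in R as follows. The function names `hpd_interval` and `cpd_interval` and the simulated sample are assumptions made for illustration; this is not Tracer's code.

```r
# x% HPD interval: shortest interval containing x% of the sorted samples.
hpd_interval <- function(x, level = 0.95) {
  s <- sort(x)                          # O(N log N), dominates the cost
  n <- length(s)
  k <- max(1, ceiling(level * n))       # number of samples inside the interval
  starts <- seq_len(n - k + 1)          # every possible left end point
  widths <- s[starts + k - 1] - s[starts]
  i <- which.min(widths)                # O(N) scan for the narrowest window
  c(lower = s[i], upper = s[i + k - 1])
}

# x% CPD interval: remove (100 - x)/2 % of the samples from each tail.
cpd_interval <- function(x, level = 0.95) {
  a <- (1 - level) / 2
  q <- quantile(x, c(a, 1 - a), names = FALSE)
  c(lower = q[1], upper = q[2])
}

set.seed(42)
skewed <- rexp(10000)     # made-up sample with a long right tail
hpd_interval(skewed)      # starts near zero; removed values come from the right tail
cpd_interval(skewed)      # trims 2.5% from each tail, so it is wider here
```

For a symmetric, unimodal sample the two intervals nearly coincide; they differ noticeably only for skewed distributions such as the one simulated here.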
Clock rate units: Tracer shows the mean of the posterior but not the units. The units
for the clock rate depend on the units used for calibrations. When a calibration with,
say, a mean of 20 is used representing a divergence date 20 million years ago, the clock
rate is in substitutions per site per million years. If all digits are used, that is 20 000 000
or 20E6 for the mean of the calibration, the clock rate unit will be substitutions/site/year.
If the calibration represents 20 years ago, the unit is substitutions/site/year. Likewise,
units for tree divergence ages are dependent on the units used to express timing infor-
mation; with a calibration of 20 representing 20 million years ago, a divergence age of
40 means 40 million years ago.
If there is no timing information, by default the clock rate is not estimated but fixed
to 1, and branch lengths represent substitutions per site. Of course, if the clock rate is
estimated, a prior on the clock rate needs to be provided and the units for the clock rate are
the same as the units used for the clock rate prior. Alternatively, the clock rate can be fixed to
a value other than 1 if an estimate is available from the literature or from an independent
data set. The ages of nodes in the tree are in the units of time used for the clock rate. If the
clock rate is expressed in substitutions/site/million years then estimated divergence ages
will be expressed in millions of years.
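Because genetic distance is the product of rate and time, changing the time unit rescales the rate by the inverse factor. A small R check of this, using made-up values chosen only for illustration:

```r
# Assumed illustration values (not taken from any real analysis)
rate_per_myr <- 0.01                   # substitutions/site/million years
node_age_myr <- 40                     # node age in millions of years

rate_per_year  <- rate_per_myr / 1e6   # substitutions/site/year
node_age_years <- node_age_myr * 1e6   # node age in years

# Expected genetic distance (rate x time) is the same in either unit system
rate_per_myr * node_age_myr
rate_per_year * node_age_years
```

Both products are in substitutions per site, which is why a mismatch between calibration units and clock rate units shows up as an implausible rate or age rather than as an error message.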
Age of tree and divergence times: To interpret the age of the tree (root height)
and divergence times in Tracer, you have to keep in mind that time runs backwards in
BEAST by default. The most recently sampled sequence has an age of zero, and the age
of the tree, at say 340, represents the MRCA of all sequences, at 340 years (assuming
your units are in years) in the past. If you have a taxon with two virus samples, one
from the year 2000 and one from the year 2005, and you get an MRCA time value of
10, these viruses diverged in 1995. Note that MRCA time is counted from the youngest
taxon member.
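The arithmetic in the virus example can be written as a one-line helper. The function name `height_to_year` is hypothetical, and the sketch assumes that heights and sampling dates are both expressed in years.

```r
# Convert a node height (time before the youngest sample) to a calendar year.
height_to_year <- function(height, sample_years) {
  max(sample_years) - height
}

# Two samples from 2000 and 2005 with an MRCA time of 10
height_to_year(10, c(2000, 2005))   # 1995
```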
Coefficient of variation: With relaxed clock models a coefficient of variation (cv)
is logged, which is defined as the standard deviation divided by the mean of the clock
rate. This is a normalised measure of dispersion. The coefficient gives information about
how clock-like the data are. A coefficient of 0.633 means that it was estimated that the
difference in the rate of evolution of two typical lineages in the analysis varied by 63.3%
of the absolute clock rate. Values closer to zero indicate the data are more clock-like and
a strict clock may be more appropriate. There is no strict rule, but values below 0.1 are
generally considered to be low enough to justify the use of a strict molecular clock
model (e.g. see Brown and Yang 2011). For the log-normal distribution cv is a simple
function of the standard deviation of the log rate S: $cv = \sqrt{\exp(S^2) - 1}$. For $S \ll 1$,
$cv \approx S$.
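The log-normal relationship between S and the coefficient of variation, cv = sqrt(exp(S^2) - 1), can be checked numerically. The sketch below uses a hypothetical helper `cv_lognormal` and an arbitrary value of S for the simulation check.

```r
# Coefficient of variation of a log-normal distribution with log-scale sd S
cv_lognormal <- function(S) sqrt(exp(S^2) - 1)

S <- c(0.01, 0.1, 0.5, 1)
data.frame(S = S, cv = cv_lognormal(S))   # cv stays close to S while S is small

# Monte Carlo check for S = 0.5 (illustrative value)
set.seed(7)
r <- rlnorm(1e6, meanlog = 0, sdlog = 0.5)
sd(r) / mean(r)                           # close to cv_lognormal(0.5)
```

The table shows the approximation degrading as S grows, which is why S and cv can only be used interchangeably near the strict-clock end of the range.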
If the marginal posterior distribution of S or cv extends to values very close to zero,
this is also an indication that the strict clock model cannot be rejected. Using a strict
molecular clock for clock-like data has the advantage of requiring fewer parameters
and increasing the precision of rate estimates (Ho et al. 2005) and topological inference
(Drummond et al. 2006) without compromising accuracy. If the value of the coefficient
of variation is large (e.g. cv ≫ 0.1) then the standard deviation is larger and a relaxed
molecular clock is required (see Figure 10.2). If S or cv is greater than 1, then the
data are very non-clock-like and probably shouldn’t be used for estimating divergence
times.
A relaxed molecular clock with exponential distribution of rates across branches will
always have a coefficient of variation of 1.0. It is a one-parameter distribution and the
parameter determines both the mean and the variance. The only reason that BEAST
reports a number slightly under 1 is because of the way the distribution is discretised
across the branches in the tree. So you should ignore the ESS estimates for this statistic
when using the exponential model. If you do not think that the coefficient of variation
is about 1 for your data, then you probably should not use the exponentially distributed
rates across branches model.
Rates for clades: Rates for clades are not logged in the trace log but can be found
in the tree log. You can calculate a summary tree with TreeAnnotator and visualise the
rates for a clade in FigTree or IcyTree.
10.2 Model selection
In Sections 1.5.6 and 9.5 we have already looked at comparing models using Bayes
factors (BFs). Here we look at strategies for comparing the various substitution models,
clock models and tree priors.
Model selection strategy: The first rule to take into account when selecting a model
is not to use a complex model until you can get a simple model working. A good
starting point is the HKY substitution model, with a strict molecular clock and a constant
population size coalescent tree prior (if you want a coalescent tree prior, otherwise start
with Yule instead). If you cannot get convergence to the target distribution using this
simple model then there is no point in going further. Convergence means running several
independent chains long enough that they give the same answers and have large enough
ESSs. If you can get this working then add the parameters you are interested in. For
example, if you want to get some information about population size change, try BSP
with a small number of groups (say four) or EBSP, which finds the number of groups.
You should probably add gamma rate heterogeneity across sites to your HKY model
once everything else is working.
The important thing about model choice is the sensitivity of the estimated parameter
of interest to changes in the model and prior. So in many respects it is more important
to identify which aspects of the modelling have an impact on the answer you care about
than to find the ‘right’ model.
You could argue that a ‘correct’ analysis would compare all possible combinations of
demographic, clock and substitution models, assuming your state of knowledge before-
hand was completely naive about the relative appropriateness of the different models.
This is a principled but extreme perspective. Nevertheless, there has been some progress
in implementing model averaging over all the substitution models automatically within
a single MCMC analysis. For example, the model of Wu et al. (2013) is implemented
in the subst-bma package and simultaneously estimates the appropriate substitution
models and partitioning of the alignment. The reversible-jump-based model (Bouckaert
et al. 2013) is a substitution model that jumps between models in a hierarchy of models,
and is available through the RBS package as the RB substitution model. It can also
automatically partition the alignment, but unlike subst-bma assumes that partitions
consist of a fixed number of contiguous runs of sites. One can average over clock models
in a single MCMC analysis as well (Li and Drummond 2012).
For the standard models (HKY, GTR, gamma categories, invariant sites) using Model-
Test (Posada and Crandall 1998) or jModelTest (Posada 2008) is probably a reasonable
thing to do if you feel that time is a precious commodity. However, some practitioners
would assert that mixing ML and Bayesian techniques is not a principled approach, and
ideally one would want to select or average over models in a Bayesian framework, if
Bayesian inference is the final aim.
Regardless, if you have protein-coding sequences then you should consider a codon-
position model by splitting the alignment into codon positions (Bofkin and Goldman
2007; Shapiro et al. 2006) and based on a BF decide between it and alternative substi-
tution models.
One pleasant aspect of Bayesian inference is that it is easier to take into account the
full range of sources of uncertainty when making a decision, like choosing a substitution
model. With a BF you will be taking into account the uncertainty in the tree topology
in your assessment of different substitution models, whereas in ModelTest you have
to assume a specific tree (usually a neighbour-joining tree). While this may not be too
important (Yang et al. (1995) argue that if the tree is reasonable then substitution model
estimation will be robust), the Bayesian method is definitely more satisfying in this
regard.
A disadvantage of ModelTest for protein-coding sequences is that it does not consider
the best biological models, those that take into account the genetic code, either through
a full codon model (generally too slow computationally to estimate trees) or by partitioning
the data into codon positions. See Shapiro et al. (2006) for empirical evidence
that codon position models are generally superior to GTR + Γ + I, and just as fast if not
faster.
As far as demographic models are concerned, it is our experience that they do not
generally have a great effect on the ranking of substitution models.
Strict vs. relaxed clock: The random local clock model (Drummond and Suchard
2010) can be used as a Bayesian test of the strict molecular clock. If the posterior
probability of zero rate changes (i.e. a strict molecular clock) is in the 95% credible set,
then the data can be considered compatible with a strict clock. Alternatively, a heuristic
comparison between log-normal relaxed and strict molecular clocks is relatively easy.
Use a log-normal relaxed clock first, with a prior on S that places 50% of the probability
mass below S = 0.1 (e.g. a gamma (shape = 0.5396, scale = 0.3819) distribution has a
median of ≈ 0.1 and has 97.5% of prior probability below S = 1). If the estimated
S > 0.1 and there is no probability mass near zero in the marginal posterior distribu-
tion of S (see also Section 10.1) then you cannot use a strict clock. However, if the
estimate of S is smaller than 0.1, then a strict clock can probably be safely employed
(Brown and Yang 2011). Note that if the clock model is the parameter of interest,
then you should employ formal model comparison, such as by calculating BFs (see
Section 1.5.6).
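The quoted properties of the suggested gamma prior on S can be checked with the base R distribution functions `qgamma` and `pgamma`. The values mentioned in the comments are those stated in the text, not computed output.

```r
# Gamma prior on S with the shape and scale given in the text
shape <- 0.5396
scale <- 0.3819

prior_median   <- qgamma(0.5, shape = shape, scale = scale)
prior_below_1  <- pgamma(1,   shape = shape, scale = scale)

prior_median    # the text states a median of about 0.1
prior_below_1   # the text states about 97.5% of prior mass below S = 1
```

The same two calls make it easy to try alternative shape and scale values when a differently centred prior on S is wanted.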
Constant vs. exponential vs. logistic population: To select a population function
for a coalescent tree prior, use exponential growth first. If the marginal posterior dis-
tribution of the growth rate includes zero, then your data are compatible with constant
population size.
If you are only doing a test between exponential growth and constant then you can
simply inspect the posterior distribution of the growth rate and determine if zero is
contained in the 95% HPD. If it is, then you cannot reject a constant population size
based on the data. Of course this is only valid if you are using a parameterisation that
allows negative growth rates.
Note that if the population history is the parameter of interest, then you should
employ formal model comparison, such as by calculating BFs using path sampling (see
Sections 1.5.6 and 1.5.7).
Alternative *BEAST species trees: Suppose you have two alternative assignments
of sequences to species in a *BEAST analysis, and you want to choose one of the species
trees. If your observations were actually gene trees instead of sequence alignments then
it would be straightforward: the species.coalescent likelihoods would be the appropriate
likelihoods to be comparing by BFs. However, since we do not directly observe the gene
trees, but only infer them from sequence alignments, the species.coalescent likelihoods
are not likelihoods but rather parametric priors on gene trees, where we happen to care
about the parameters and structure of the prior, that is, the species tree topology and the
assignments of individuals to species.
If you are considering two possible species tree assignments and the posterior distri-
butions of the species.coalescent from the two runs are not overlapping, then you can
pick the model that has the higher posterior probability, assuming the species.coalescent
is computed up to the same constant for differing numbers of species and species
assignments.
Comparing clade ages: If you want to compare the estimated ages for two (non-overlapping)
clades, say A and B, probably the best approach is to estimate the posterior
probability that the age of clade A is greater than that of B. The R statistical programming
language can compute this in a few lines (assuming a BEAST log file ‘post.log’,
containing columns called ‘tmrcaA’ and ‘tmrcaB’):
1 # read in the log file
2 post <- read.table("post.log", sep="\t", header=TRUE)
3 # remove 10 percent burnin
4 post <- post[round(nrow(post)*0.1+1):nrow(post),]
5 # calculate posterior probability of tmrcaA > tmrcaB
6 pp <- mean(post$tmrcaA > post$tmrcaB)
Alternatively it is easy enough to load the trace log in which the ages have been
reported as MRCA statistics into a spreadsheet. Copy the two columns of interest into a
new sheet, and in the third column use ‘=IF(A1>B1, 1, 0)’ and take the average over the
last column. This gives you Pr(A > B|D) where D is the data. Either way, you should
remember to remove the burn-in (lines 3–4 in the R script).
As always, it is important to not only report the posterior support for A > B but also
the prior support. Without this, the value of Pr(A > B|D) is harder to interpret in terms
of the evidence provided by the data at hand, since the prior can in principle be set up to
support any hypothesis over another. If no special tree priors are added and the samples
are contemporaneous, we would expect the prior Pr(A > B) to be 0.5 if A and B are
of the same size, indicating 50% of the time one clade should be higher than the other.
Pr(A > B) can be calculated with the same procedure used to calculate Pr(A > B|D),
but this time with the sample from the prior. The BF is then calculated as (Suchard et al.
2001):
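Written as posterior odds divided by prior odds, which is the standard form of a Bayes factor for a hypothesis against its complement:

$$\mathrm{BF} = \frac{\Pr(A > B \mid D)}{1 - \Pr(A > B \mid D)} \bigg/ \frac{\Pr(A > B)}{1 - \Pr(A > B)}$$

A minimal R sketch of this calculation follows. The function name `bayes_factor` and the two probabilities are illustrative assumptions; in practice both values would come from applying the procedure above to the posterior and prior samples.

```r
# Bayes factor for A > B as posterior odds divided by prior odds
bayes_factor <- function(pp_post, pp_prior) {
  (pp_post / (1 - pp_post)) / (pp_prior / (1 - pp_prior))
}

# Illustrative values only: posterior support 0.9, prior support 0.5
bayes_factor(0.9, 0.5)   # approximately 9

# The same posterior support is weaker evidence if the prior already favoured A > B
bayes_factor(0.9, 0.8)   # approximately 2.25
```

The second call shows why the prior support has to be reported: an identical posterior probability can correspond to very different amounts of evidence from the data.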
Table 10.1 Balanced labelled rooted trees of four taxa are represented by two labelled
histories, depending on which cherry (pair of taxa with a common parent) is older
10.3 Troubleshooting
There are a large number of potential problems that may be encountered when running
an MCMC analysis. In this section, we will have a look at some of the more common
issues and how to solve them.
Checking for signal in serially sampled data: For serially sampled data, a quick
test to see whether there is signal in the data is to construct a neighbour-joining tree
and look at the amount of evolution by inspecting the distribution of root-to-tip genetic
distances. Path-O-gen¹ (Rambaut 2010) can be useful to show more formally how
well divergence correlates with time of sampling (Drummond et al. 2003). If there is
no structure, this might indicate that some sequences are misaligned or recombinant
sequences, causing larger divergence times than expected. As a consequence estimates
for any of the parameters of interest may be different as well.
Fitting an elephant: Our experience is that many practical problems in phylogenetic
MCMC with BEAST arise from users immediately trying to fit the most complex
model to their data. Though it is tempting to ‘fit an elephant’ (Steel 2005), a pragmatic
approach that starts with a simple model works best in our experience. Only after you
have a handle on your data, by having analysed it with a simple model, would we then
recommend that you use model selection and model averaging tools to see if a more
complex model might be justified (see Sections 10.2 and 9.5).
Non-starting: BEAST might not start because the initial state has a posterior prob-
ability of zero, which is an invalid state for the chain to be in. If this happens, BEAST
will print out all components that make up the posterior, and the first one that is marked
as negative infinity is the component that should be investigated more closely.
There are several reasons why the posterior can be calculated as zero (hence the log
posterior as −∞). When using hard bounds in calibrations you need to ensure that the
starting tree is compatible with the bounds you have specified. In BEAST 2, randomly
generated trees are automatically adjusted to these constraints, as long as the constraints
are mutually compatible. The priors can be incompatible if a calibration on a subclade
has a lower bound that is higher than the upper bound of a calibration on a superclade.
In this situation no tree exists that fits both of these constraints. If for some reason
no random starting tree can be found and the calibrations are compatible, you could
provide a starting tree in Newick format that is consistent with the calibrations. If such
a starting tree is inconsistent, BEAST will start with a log posterior that is −∞ and
stop immediately.
Another reason for BEAST to register a zero initial posterior probability is when there
are a large number of taxa and the initial tree is very far from the optimum. Then, the
tree likelihood will be so small, since the data do not fit the tree, that the tree likelihood
can’t be calculated due to numerical underflow and the likelihood will return 0 (hence
−∞ in log space). To prevent this happening, a better starting tree could be a UPGMA
(unweighted pair group method with arithmetic mean) or neighbour-joining tree, or a
tree estimated using another program in Newick format that has been adjusted to obey
all calibration constraints.
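The underflow itself is easy to reproduce. The following R lines are a generic illustration of how multiplying many small probabilities reaches zero in floating-point arithmetic; they are not BEAST's likelihood code, and the per-term value and count are made up.

```r
# Made-up example: 2000 factors of 1e-3, as for a tree that fits the data badly
small_terms <- rep(1e-3, 2000)

prod(small_terms)        # 0: the true value 1e-6000 is not representable
log(prod(small_terms))   # -Inf, the value that stops the chain from starting
sum(log(small_terms))    # finite when the same quantity is accumulated in log space
```

A starting tree that fits the data better keeps the individual terms larger, so the product stays within the representable range.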
When using an initial clock rate that is many orders of magnitude smaller than the
actual clock rate, it can happen that underflow occurs when scaling the branch lengths
to units of substitution. A potential reason is that some calibrations are in different units,
say millions of years, while the clock rate is in another unit, say years. This is easily
fixed by choosing a different starting value for the clock rate.
1 Available from http://tree.bio.ed.ac.uk/software/pathogen.
When BEAST does not start and the coalescent likelihood is reported as −∞, this
probably means that a parameter like the growth rate is initially so high that there are
numerical issues calculating the coalescent likelihood.
The MCMC chain does not converge: If there are a lot of sequences in the align-
ment, you need to run a very long chain, perhaps more than 100 million states. Alter-
natively, you can run a large number of shorter chains, say 20 million states on ten
computers. Though a systematic quantitative analysis has yet to be done, in our experi-
ence the chain length needs to increase quadratically in the number of sequences. So, if
you double the number of sequences then you need to quadruple the length of the chain
to get the same ESS. Thus, you could analyse a sub-sample of, say, 20 sequences and
use this to estimate how long a chain you need to analyse 40 or 80 or 160 sequences
and obtain the same ESS. In general if you have much more than 100 sequences you
should expect the analysis to involve a lot of computation and doing multiple runs and
combining them is a good idea.
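The quadratic rule of thumb can be turned into a small extrapolation helper. The function name `scale_chain_length` and the pilot run length are hypothetical; the rule itself is the experience-based one described above, not a proven result.

```r
# Rule of thumb from the text: chain length grows quadratically with the
# number of sequences for the same ESS.
scale_chain_length <- function(pilot_length, pilot_n, target_n) {
  pilot_length * (target_n / pilot_n)^2
}

# Hypothetical pilot: 20 sequences needed 1 million states for an adequate ESS
scale_chain_length(1e6, 20, c(40, 80, 160))   # 4e6, 1.6e7, 6.4e7
```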
One reason a chain does not converge is when the posterior is multi-modal. Mixing
for multi-modal posteriors can take a long time because the operators can have difficulty
finding a path through parameter space between the different modes. Since the ESS is
chain length divided by autocorrelation time (ACT) and the ACT can be very large for
multi-modal traces, the ESS will be very small. Inspecting the traces of parameters in
Tracer can help determine if some parameters have two or more modes; however, this
is more difficult if the modes are in tree space (Höhna and Drummond 2012; Whidden
et al. 2014).
Lack of convergence can also occur if the model is non-identifiable. For example, if
both the clock rate and substitution rate are estimated when analysing a single partition,
there is no reason to expect the chain to converge, because there are two parameters, but
there is only one degree of freedom (i.e. the likelihood only changes as a function of the
product of the two parameters). In this case, even if the chain appears to converge, there
is no guarantee that the algorithm will produce a sample from the correct target distribu-
tion. A model can be made identifiable by choosing fewer parameters to estimate, and
in the example above this is achieved by fixing one of the two parameters to, say, 1.0.
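A toy example makes the single degree of freedom visible. The model below is an assumption chosen for simplicity, not BEAST's tree likelihood: the number of substitutions on one branch of known duration is taken to be Poisson with mean clock rate times substitution rate times duration.

```r
# Toy likelihood that depends on the two rates only through their product
loglik <- function(clock_rate, subst_rate, n_subst = 12, duration = 10) {
  dpois(n_subst, lambda = clock_rate * subst_rate * duration, log = TRUE)
}

loglik(1.0, 1.2)
loglik(2.0, 0.6)    # same product, same log-likelihood
loglik(1.2, 1.0)    # fixing one parameter to 1 leaves a single free parameter
```

Every pair of rates with the same product lies on a ridge of equal likelihood, so only the prior can distinguish between them.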
In general, more complex models with more parameters converge more slowly
because each parameter will be changed less often by operators. Hence, parameter-rich
models require more samples to reach the same ESS. There are many more reasons a
chain does not converge that are specific to some models. Some of these model-specific
problems will be addressed below.
Sampling from the prior does not converge: Sampling from the prior is highly
recommended in order to be able to check how the various priors interact. Unfortunately,
the space that needs to be sampled for the prior tends to be a lot larger than the posterior
space. Therefore, it may take a lot more samples for convergence to be reached when
sampling from the prior than when sampling the posterior. It is not uncommon for
the difference between the number of samples to be one or two orders of magnitude.
Fortunately, evaluating the prior takes a lot less time than evaluating the posterior, so
the prior chain should run a lot faster.
It is not possible to sample from the prior if improper priors (see Section 1.5.4) are
used. For example, OneOnX priors and uniform priors without bounds can’t be used if
you want to sample from the prior, or perform path sampling (see Section 1.5.7).
Issues with BSP analysis: When running a BSP analysis, it is not uncommon to see
low ESS values for group sizes. The group size is the number of steps in the population
size function. These are integer values, and group sizes tend to be highly autocorrelated
since they do not change value very often. Unfortunately high ESSs are needed for this
parameter to ensure correctness of the analysis.
When there is low variability in the sequences, the estimates for coalescent times will
contain large uncertainties and the skyline plot will be hard to reconstruct. So, expect
the BSP to fail when there are only a handful of mutations in sequences.
Estimating highly parametric demographic functions from single gene alignments
of just a handful of sequences is not likely to be very illuminating. For the single-
population coalescent, sequencing more individuals will help, but there is very little
return for sequencing more than 20–40 individuals. Longer sequences are better as
long as they are completely linked (for example, the mitochondrial genome). Another
approach is to sequence multiple independent loci and use a method that can combine
the information across multiple loci (like EBSP). Felsenstein (2006) provides an illu-
minating analysis of the factors contributing to accuracy of coalescent-based estimates.
Much of this appears to translate to the context of highly parametric population size
estimation (Heled and Drummond 2008).
Demographic reconstruction fails: There are two common reasons for Tracer to fail
when performing a demographic reconstruction. Firstly, there may be no ‘End;’ at the
end of the tree log. This can be fixed easily by adding a line to the log with ‘End;’ in
a text editor. Secondly, Tracer fails when the sample frequencies of the trace log and
the tree log file differ. The file with the highest frequency can be sub-sampled using
LogCombiner.
Relaxed clock gone wrong: Sometimes models using the relaxed clock model take
a long time to converge. In general, they tend to take longer than when using a strict
molecular clock model with otherwise the same settings. See tips on increasing ESS in
Section 10.1 for some techniques to deal with this.
When a relaxed clock model does converge, but the coefficient of variation is much
larger than 1, this may be an indication something did not go well. A coefficient of
variation of 0.1 < cv < 1 is already a lot of variation in rates from branch to branch,
especially in an intra-species data set. Having a coefficient of variation of much larger
than 1 represents a very large amount of branch rate heterogeneity and if that were really
the case it would be very hard to sample from the posterior distribution due to the strong
correlations between the divergence times and the highly variable branch rates. It is not
much use to try to estimate divergence times under these circumstances. This problem
can occur due to a prior on S, the standard deviation of the log rate (when using the
log-normal relaxed clock) that is too broad. It may also be due to failure to converge,
which can be tested by rerunning the analysis a few times to see whether the same
posterior estimates are reached.
If you are looking mostly at intra-species data then the uncorrelated relaxed clock
should be used with caution. We recommend using a log-normal uncorrelated relaxed
clock with a prior on S centred around S = 0.1, and with the bulk of the prior probability
below S = 1. Alternatively, a random local clock might be more appropriate, since
one would expect most of the diversity within a species to be generated by the same
underlying evolutionary rate.
Rate estimate gone wrong: If you only have isochronous sequences, that is, all
sampled from the same time, and you do not have any extra calibration information
on any of the internal nodes nor a narrow prior on clock rates, then there is nothing you
can do in BEAST that will create information about rates and dates.
If you are looking at fast-evolving species such as RNA viruses, then a decade of
sampling through time is typically more than enough to get good estimates of the rate of
evolution. However, if you are looking at a slow-evolving species like humans or mice,
this is not a sufficient time interval to allow estimation of mutation rates, unless perhaps
if you have sequenced whole genomes. The issue is that with species exhibiting very low
evolutionary rates (like mammals), all tip dates become effectively contemporaneous
and we end up in the situation with (almost) no calibration information.
RNA viruses like HIV, influenza and hepatitis C have substitution rates of about $10^{-3}$
(Drummond et al. 2003a; Jenkins et al. 2002; Lemey et al. 2004); however, other viruses
can evolve a lot more slowly, some as slowly as $10^{-8}$ (Duffy et al. 2008). For those
slow-evolving viruses there is the same issue as for mammals, and BEAST will not be able
to estimate a rate based on samples that have been isolated only decades apart. This can
show up in the estimate of the root height having 95% HPDs that are very large. If the
data reflect rates that are too slow to estimate, the posterior of the rate estimate should
equal the prior, as is indeed the case with BEAST when using simulated data (Firth et al.
2010). However, care is needed because real data can contain different signals causing
the rate estimates to be too high due to model misspecification.
Another factor is the number of sampling times available. For example, if you only
have two distinct sampling times, far apart compared to the coalescent rate, so that the
clades of the old and new sequences are reciprocally monophyletic with respect to each
other, then you will not be able to reliably estimate the rate because there is no way of
determining where along the branch between the two clusters the root should attach. In
this situation, due to the tree topology, there is a ‘pulley’ at the root which creates large
uncertainty in the rate estimates.
One thing you should remember about a Bayesian estimate is that it includes uncer-
tainty. So even if the point estimates of a rate are quite different, they are not signif-
icantly different unless the point estimate of one analysis is outside the 95% HPD
interval of another analysis with a different prior. In other words, the point estimates
under different priors may be different just because the estimates are very uncertain.
Conflicting calibration and clock rate priors: It is possible to have timing infor-
mation (that is, tip or node dating) while also having an informative prior on the clock
rate, or even fixing the clock rate. If the clock rate is constrained to a value consider-
ably different from the rate implied by the (tip or node) calibration information, then
convergence could become a problem. In BEAST, genetic distances are modelled by
the product of rate and time. If you try to fix both the rate and the time (by having
informative tip or node calibration(s)) then there is the potential to get a strong conflict
between the genetic distances implied by those data, and those implied by the product
of the priors. When a chain has to accommodate such a mismatch, this can manifest
itself in poor convergence, unrealistic estimates in parameter values and unexpected tree
topologies. It is important to make sure that this situation does not occur inadvertently
by fixing the clock rate to 1, when calibration information is available.
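The size of such a conflict can be made concrete with a little arithmetic. The sketch below assumes two contemporaneous tips separated by a given genetic distance (substitutions per site) whose common ancestor is dated by a calibration, and compares the clock rate implied by distance = rate × time with a fixed clock rate. The function names and all numbers are hypothetical and serve only to illustrate the mismatch that fixing the clock rate to 1 can create.

```python
def implied_rate(genetic_distance, ancestor_age):
    """Clock rate implied by distance = rate * time for two contemporaneous tips.

    The path between the tips runs up to the ancestor and back down,
    so the elapsed time is twice the ancestor's age.
    """
    return genetic_distance / (2.0 * ancestor_age)

def mismatch(fixed_rate, genetic_distance, ancestor_age):
    """Fold difference between a fixed clock rate and the implied rate."""
    rate = implied_rate(genetic_distance, ancestor_age)
    return max(fixed_rate, rate) / min(fixed_rate, rate)

# Hypothetical example: 0.1 substitutions/site between two tips whose
# common ancestor is calibrated at 50 years before the present.
print(implied_rate(0.1, 50.0))   # implied rate in substitutions/site/year
print(mismatch(1.0, 0.1, 50.0))  # how far a clock rate fixed to 1 is from it
```

A mismatch of several orders of magnitude, as in this toy case, is the kind of situation in which the chain may struggle to converge.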
Tracer out of memory: Tracer reads the complete log file into memory before
processing it, and it can run out of memory in the process. The log files increase
proportionally to the number of samples. For very long runs the log intervals should
be increased so that the total number of entries that end up in the log is no more than
10 000. So, for a chain of 50 million states, do not log more often than once every 5000 states. If you
have produced a log with too many samples, then use LogCombiner to down-sample the
log file before loading it in Tracer.
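The rule of thumb above amounts to a one-line calculation. The sketch below derives the smallest logging interval that keeps a log at or below 10 000 entries, and shows a simple way to thin an existing list of log rows. The function names are assumptions for illustration, and `down_sample` only stands in for the kind of resampling LogCombiner performs; it does not reproduce that tool.

```python
def min_log_interval(chain_length, max_entries=10_000):
    """Smallest logging interval that keeps the log at or below max_entries rows."""
    return (chain_length + max_entries - 1) // max_entries  # integer ceiling

def down_sample(rows, max_entries=10_000):
    """Keep every k-th row of an existing log so that at most max_entries remain."""
    step = max(1, (len(rows) + max_entries - 1) // max_entries)
    return rows[::step]

print(min_log_interval(50_000_000))  # 5000, the figure quoted in the text
```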
11 Exploring phylogenetic tree space
In phylogenetic inference the parameters to estimate are the tree topology (or ranked
tree) and associated divergence times of the common ancestors. This is not a standard
statistical problem because the parameter space is not a simple Euclidean space,¹ i.e. it is
not R^n, or any simple convex subspace of R^n. While the space of trees can be considered
to be constructed of Euclidean subspaces that do lie in R^n (one n-dimensional orthant,
that is, the restriction of R^n to non-negative reals, for each tree topology or ranked tree;
see Figure 11.1), its overall structure is not Euclidean (see Figure 11.2), and thus new
statistical techniques are required for analysis and visualisation.
Formally, a phylogenetic tree space, or just tree space for short, is a metric space
such that the points of the space are in one-to-one correspondence with the set of
phylogenetic trees on n taxa (i.e. there is a bijection between the metric space and the set
of all trees, where the set of trees has the size of the continuum, since each combination
of divergence times on a topology represents a distinct tree). The distance of the metric
space induces (via the bijection mentioned above) a distance between any pair of
trees. Here, as throughout the book, we are specifically considering time-trees and time-
tree space, but we will use the term tree space for short.
Despite the fact that statistical inference of phylogenetic trees is a decades-old pur-
suit, surprisingly little theoretical development of tree metric spaces has taken place
and there are many open problems and challenges that derive from the non-Euclidean
nature of tree space. A key challenge lies in how one should summarise a set of trees in
tree space. Basic concepts in statistics, such as the mean and variance of a sample, are
challenging to define for tree space (unless the topology is fixed), and have unusual
properties compared to their counterparts in Euclidean spaces. A key result in this
area was the description of the geometry of tree space for phylogenetic trees with
unconstrained branch lengths (i.e. not time-trees), called BHV space (Billera et al.
2001). Recently there have been a number of exciting developments regarding BHV
space, including the description of a polynomial-time algorithm to compute distances
between trees in BHV space (Owen and Provan 2011), opening up new approaches
to producing Bayesian point estimates of phylogenetic reconstructions (Benner et al.
2014). However, no analogous results are available for time-tree space.
1 Although those familiar with differential geometry or philosophy might disagree that Euclidean space is
simple!
11.1 Tree space
Figure 11.1 A Euclidean two-dimensional space representing the space of all possible time-trees
for the topology ((1,2),3). There are two parameters, x and y, one for each of the two
inter-coalescent intervals, the sum of which is the age of the root (t_root = x + y). Three trees are
displayed, along with their arithmetic mean tree, also called the centroid. The dashed lines show
the path connecting each of the three trees to the mean tree by the shortest distance (i.e. their
deviations from the mean).
Figure 11.2 T_3, the simplest non-trivial tree space (for time-trees), representing the space of
time-trees for n = 3 taxa sampled contemporaneously. Each of the three non-degenerate tree
topologies is represented by a two-dimensional Euclidean space (as illustrated in Figure 11.1)
and these subspaces meet at a single shared edge representing the star tree, which is a
one-dimensional subspace and thus has a single parameter (the age of the root). The dashed
lines show the paths of shortest distance between the four displayed trees.
It is precisely because of the non-standard nature of tree space, and the consequent
limits to its statistical characterisation, that there are specialist programs for Bayesian
phylogenetic inference. Otherwise, popular general tools for Bayesian statistical infer-
ence such as Stan (Hoffman and Gelman 2014; Stan Development Team 2014) or BUGS
(Lunn et al. 2000, 2009) could be used. This statement implies that Stan or BUGS
could be productively employed for Bayesian phylogenetic inference problems that
are conditioned on a fixed topology, or ranked history, depending on the tree space
employed.
A Bayesian phylogenetic MCMC analysis produces one or more tree log files containing
a sample of trees from the posterior distribution over tree space. The MCMC algorithm
produces a chain of states, and these are autocorrelated so that in general two adjacent
states in the chain are not independent draws from the posterior distribution. Thus a
critical element of evaluating the resulting chain of sampled trees from a Bayesian
phylogenetic MCMC analysis is determining whether the chain is long enough to be
representative of the full posterior distribution over tree space (Nylander et al. 2008). In
general it is not trivial to quantify MCMC exploration of phylogenetic tree space (Whid-
den et al. 2014) and careful investigation can reveal unexpected properties. Research
into efficient sampling of phylogenetic tree space, including improving existing MCMC
algorithms, is an active field of enquiry (Höhna and Drummond 2012; Lakner et al.
2008; Whidden et al. 2014). In this chapter we will primarily consider what one does
after a representative posterior sample of trees has been obtained. In practice this stage is
reached by running a long enough chain (or multiple chains) so that standard diagnostic
tests are passed, and then taking a regular subsampling of the full chain(s) as your
resulting posterior sample (i.e. the contents of the tree log file(s)).
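As a minimal sketch of that last step, assume the logged states are held in a Python list and that an initial burn-in portion is discarded before any further subsampling. The 10% default, the `step` parameter and the function name are assumptions for illustration, not BEAST settings.

```python
def posterior_sample(chain, burnin_percent=10, step=1):
    """Drop the first `burnin_percent` of the chain, then keep every `step`-th state."""
    start = len(chain) * burnin_percent // 100
    return chain[start::step]

logged_trees = list(range(20_000))          # stand-in for 20 000 logged trees
print(len(posterior_sample(logged_trees)))  # 18000 retained after 10% burn-in
```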
The main methods for dealing with such a posterior sample of trees are:
In this chapter we briefly review the first five of these methods and compare them.
The methods are judged on their ability to clarify properties of the posterior sample,
and in particular whether they can highlight areas of certainty and uncertainty in both
divergence times and tree topology. We will pay special attention to summary trees
and DensiTrees. Visualising phylogenies and phylogenetic tree space are active areas of
research (Graham and Kennedy 2010; Procter et al. 2010; Whidden et al. 2014).
11.3 Tree set analysis methods
In this section, we consider some tree set analysis methods. Some work equally well for
both rooted and unrooted trees, while others are designed only for rooted trees.
in the posterior sample is a Monte Carlo estimate of the clade’s posterior probability.
These are sometimes termed posterior clade probabilities or posterior clade support.
The problem with this approach is that C(n), the number of potential non-trivial clades
(i.e. strict subsets of taxa of size ≥ 2), grows exponentially with the number of taxa,
n: C(n) = 2^n − n − 2.
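The count follows from taking all 2^n subsets of the taxa and removing the empty set, the n single-taxon sets and the full set. The sketch below checks the closed form against brute-force enumeration for a small case and evaluates it for 55 taxa; the function names are illustrative only.

```python
from itertools import combinations

def n_clades(n):
    """Number of potential non-trivial clades on n taxa: 2^n - n - 2."""
    return 2**n - n - 2

def n_clades_by_enumeration(taxa):
    """Count the strict subsets of size >= 2 directly (feasible for small n only)."""
    taxa = list(taxa)
    return sum(1 for size in range(2, len(taxa))
               for _ in combinations(taxa, size))

print(n_clades(4), n_clades_by_enumeration("ABCD"))  # 10 10
print(n_clades(55))
```

For 55 taxa the result is roughly 3.6 × 10^16.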
Although the posterior support can often be localised in a small fraction of tree space,
if everything else is equal we may well assume that the clades with significant posterior
support might also grow roughly exponentially with the number of taxa analysed. When
there are many closely related taxa in a data set, the number of clades appearing in 5% or
more of the posterior sample can easily be in the hundreds or thousands. Obviously, such
clade sets would be hard to interpret without good visualisation. Another problem is that
credible sets only provide information about the uncertainty in the tree topology, and do
not inform about the uncertainty in divergence times, unless each clade is additionally
annotated with information about the marginal posterior distribution of the age of the
clade’s most recent common ancestor (e.g. 95% HPD intervals).
As an illustration, a typical posterior sample from the Anolis data set of 18 000
trees (20 000 trees minus 10% burn-in) would comprise 9200 ± 110 unique topologies
(9163, 8942, 9328, 9473). More comfortingly, the 5% credible set of topologies contains
only 58 different clades (all four runs agreed). The 50% credible set includes only
about 107 ± 2.7 clades (103, 106, 105, 115) and the 95% credible set includes only
156 ± 4.1 clades (158, 152, 148, 167). These are all quite small numbers compared to
the approximately 36 million billion possible clades for a 55-taxon tree. So, while there
are a large number of tree topologies, a relatively small number of clades dominate the
tree distribution. In a typical run of the Anolis data there are 41 clades that each occur in
over 90% of the trees. In fact (A. angusticeps, A. paternus) is one of 13 two-taxa clades
that have a posterior probability of 1.0. So the topological uncertainty is limited to a
small number of divergences, and a small number of alternatives at those divergences.
But because there is uncertainty present in a number of different parts of the phylogeny,
there is a combinatorial explosion of full resolutions of the tree, leading to the large
number of unique trees in the posterior sample and credible set.
While it is easy to provide a list of topologies and clades and the number of times
they appear in the posterior sample (for example, with the TreeLogAnalyser tool in
BEAST), such data are difficult to interpret and do not give much intuition about
the overall structure of the posterior distribution. However, such a list can be useful for testing
specific a priori phylogenetic hypotheses such as whether a clade is monophyletic (see
Section 10.2).
11.4 Summary trees
There are many ways to construct a summary tree, also known as a consensus tree, from
a set of trees. See the excellent review by Bryant (2003) for a few dozen methods. Most
of these methods create representations of the tree set that are not necessarily good
estimators of the phylogeny (Barrett et al. 1991). A few of the more popular options are
the following.
Figure 11.4 Multi-dimensional scaling result for the Anolis tree set.
The majority rule consensus tree is a tree constructed so that it contains all of the
clades that occur in at least 50% of the trees in the posterior distribution. In other
words it contains only the clades that have a posterior probability of ≥ 50%. The
extended majority tree is a fully resolved consensus tree where the remaining clades
are selected in order of decreasing posterior probability, under the constraint that each
newly selected clade be compatible with all clades already selected. It should be noted
that it is quite possible for the majority consensus tree to be a tree topology that has
never been sampled and in certain situations it might be a tree topology with relatively
low probability, although it will have many features that have quite high probability.
Holder et al. (2008) argue that the majority rule consensus tree is the optimal tree for
answering the question ‘What tree should I publish for this group of taxa, given my
data?’ assuming a linear cost in the number of incorrect and missing clades, with a higher
cost associated with missing clades.
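The two consensus constructions can be sketched in a few lines, assuming rooted trees are written as nested tuples of taxon names and that clade support is simply the fraction of sampled trees containing the clade. The representation, the function names and the toy sample are assumptions for illustration; this is not the code used by any BEAST tool.

```python
from collections import Counter

def clades(tree):
    """Set of clades (frozensets of two or more taxa) in a nested-tuple tree."""
    found = set()

    def taxa_below(node):
        if isinstance(node, str):          # a leaf is just its taxon name
            return frozenset([node])
        below = frozenset().union(*(taxa_below(child) for child in node))
        found.add(below)
        return below

    taxa_below(tree)
    return found

def clade_support(trees):
    """Fraction of trees containing each clade (a Monte Carlo estimate)."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}

def compatible(a, b):
    """Two clades can occur in the same tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)

def majority_rule(trees, threshold=0.5):
    """Clades with support of at least `threshold`, following the text."""
    return {c for c, p in clade_support(trees).items() if p >= threshold}

def extended_majority(trees):
    """Greedily add mutually compatible clades in order of decreasing support."""
    support = clade_support(trees)
    chosen = []
    for clade in sorted(support, key=support.get, reverse=True):
        if all(compatible(clade, other) for other in chosen):
            chosen.append(clade)
    return set(chosen)

# Hypothetical posterior sample of ten 4-taxon trees.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = ((("B", "C"), "A"), "D")
sample = [t1] * 4 + [t2] * 3 + [t3] * 3
print(sorted("".join(sorted(c)) for c in majority_rule(sample)))
print(sorted("".join(sorted(c)) for c in extended_majority(sample)))
```

In this toy sample no two-taxon clade reaches 50% support, so the majority rule tree is unresolved below the clade (A,B,C), whereas the extended majority tree adds (A,B), the best-supported compatible clade. Note that with a threshold of exactly 50% two conflicting clades could both qualify; a strict inequality avoids that corner case.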
The maximum clade credibility (MCC) tree produced by the original version of
TreeAnnotator is the tree in the posterior sample that had the maximum sum of
posterior clade probabilities. However, we will restrict our use of the term MCC to
the more natural current default of TreeAnnotator, which is the tree with the maximum
product of posterior clade probabilities. The MCC tree is always a tree in the tree set,
and is often shown in publications that use BEAST for phylogenetic reconstruction.
Recent empirical experiments show that MCC trees perform well on a range of criteria
(Heled and Bouckaert 2013).
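The product-based definition of the MCC tree can be sketched as follows, again assuming rooted trees written as nested tuples of taxon names. Summing log probabilities is equivalent to maximising the product and avoids numerical underflow for large trees. The names and the toy sample are assumptions for illustration; this is not TreeAnnotator's implementation.

```python
import math
from collections import Counter

def clades(tree):
    """Set of clades (frozensets of two or more taxa) in a nested-tuple tree."""
    found = set()

    def taxa_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(taxa_below(child) for child in node))
        found.add(below)
        return below

    taxa_below(tree)
    return found

def mcc_tree(trees):
    """Sampled tree with the maximum product of posterior clade probabilities."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    n = len(trees)

    def log_credibility(tree):
        # The sum of logs is the log of the product of clade probabilities.
        return sum(math.log(counts[clade] / n) for clade in clades(tree))

    return max(trees, key=log_credibility)

# Hypothetical posterior sample of ten 4-taxon trees.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = (("A", "B"), ("C", "D"))
sample = [t1] * 6 + [t2] * 3 + [t3]
print(mcc_tree(sample))
```

Because the candidate trees are drawn from the sample itself, the result is always a tree that was actually visited by the chain.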
The term maximum a posteriori tree or MAP tree has a number of interpretations.
It has sometimes been used to describe the tree associated with the sampled state in
the MCMC chain that has the highest posterior probability density (Rannala and Yang
1996). This is problematic, because the sampled state with the highest posterior prob-
ability density may just happen to have extremely good branch lengths on an otherwise
fairly average tree topology. A better definition of the MAP tree topology is the tree
topology that has the greatest posterior probability, averaged over all branch lengths
and substitution parameter values. For data sets that are very well resolved, or have a
small number of taxa, this is easily calculated via Monte Carlo, by just determining
which tree topology has been sampled the most often in the chain. However, for large
data sets it is quite possible that every sampled tree has a unique topology. In this case,
conditional clade probabilities can be used to estimate the MAP topology (Höhna and
Drummond 2012; Larget 2013).
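For the well-resolved case, the Monte Carlo estimate is just a frequency count over topologies. The sketch below assumes trees written as nested tuples of taxon names and uses a canonical string so that the same topology is recognised regardless of the order in which children are listed. The function names are illustrative, and the conditional clade probability approach needed for large data sets is not shown.

```python
from collections import Counter

def canonical(tree):
    """Order-independent string form of a topology (children sorted)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(child) for child in tree)) + ")"

def map_topology(trees):
    """Most frequently sampled topology and its estimated posterior probability."""
    counts = Counter(canonical(tree) for tree in trees)
    topology, count = counts.most_common(1)[0]
    return topology, count / len(trees)

# Hypothetical sample: the first two trees share a topology.
sample = [(("A", "B"), "C"), ("C", ("B", "A")), (("A", "C"), "B")]
print(map_topology(sample))
```

If every sampled tree has a unique topology, all counts equal one and this estimate carries no information, which is the situation described above for large data sets.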
A natural candidate for a point estimate is the tree with the maximum product of the
posterior clade probabilities, the so-called maximum credibility tree. To the extent that
the posterior probabilities of different clades are independent (so that they multiply), this definition is an estimate
of the total probability of the given tree topology, that is, it provides a way of estimating
the maximum a posteriori tree (MAP) topology. Höhna and Drummond (2012) and
Larget (2013) introduce a method to estimate posterior probabilities of a tree based on
conditional clade probabilities.
To define a median tree, one must first define a metric on tree space. This turns out to
be quite a difficult task to perform, but there are a number of candidate metrics described
in the literature. If the trees in the posterior sample are visualised as a cluster of points in a
high-dimensional space, then the median tree is the tree in the middle of the cluster: it
has the shortest average distance to the other trees in the posterior
distribution. With a metric defined (say the Robinson–Foulds distance; Robinson and
Foulds 1981), a candidate for the median tree would be the tree in the posterior sample
that has the minimum mean distance to the other trees in the sample.
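A minimal sketch of this candidate follows, assuming rooted trees written as nested tuples of taxon names and using a rooted variant of the Robinson–Foulds distance: the number of clades found in one tree but not the other. The original metric is defined on splits of unrooted trees, so the clade-based version, the function names and the toy sample are all assumptions for illustration.

```python
def clades(tree):
    """Set of clades (frozensets of two or more taxa) in a nested-tuple tree."""
    found = set()

    def taxa_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(taxa_below(child) for child in node))
        found.add(below)
        return below

    taxa_below(tree)
    return found

def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

def median_tree(trees):
    """Sampled tree with the minimum mean distance to the trees in the sample."""
    def mean_distance(tree):
        return sum(rf_distance(tree, other) for other in trees) / len(trees)
    return min(trees, key=mean_distance)

# Hypothetical posterior sample of ten 4-taxon trees.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = (("A", "B"), ("C", "D"))
sample = [t1] * 6 + [t2] * 3 + [t3]
print(median_tree(sample), rf_distance(t2, t3))
```

The pairwise comparison is quadratic in the number of sampled trees, so for large samples one would cache the clade sets or subsample first.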
Once a tree topology (or set of topologies) is found that best summarises a Bayesian
phylogenetic analysis, the next question is what divergence times (node heights) to report.
One obvious solution is to report the mean (or median) divergence time for each of the
clades in the summary tree. This is especially suitable for the majority consensus tree
and the maximum credibility tree, however defined. For the median tree, it should be
noted that some metrics, that is, those that take account of branch lengths and topology,
allow for a single tree in the sample to be chosen that has the median topology and
branch lengths. Likewise, if the MAP sampled state is the chosen tree topology, then
the associated branch lengths of the chosen state can be reported.
Instead of constructing a single summary tree, a small set of summary trees can be
used to represent the tree set (Stockham et al. 2002).
Figure 11.5 Single consensus tree of the Anolis tree set, with bars representing uncertainty in
node height.
HPD interval for the height with many methods will be of size zero, since the interval
will be based on all (A,B) clades in the tree set, and this clade only occurs once. This
illustrates the danger of basing estimates on only the clades occurring in the summary
tree, and ignoring other information in the tree set.
Uncertainty in topology does not show up in the consensus tree. An alternative is to
label the tree nodes with their support in the tree set. In this case, every node is
annotated with the number (or percentage) of trees that contain the clade associated with that
node. A low number could indicate low consensus in the topology. For the Anolis tree,
most branches have support of over 90%, indicating most of the tree topology is strongly
supported by the data. However, for those branches with lower support, it is not clear
what the alternative topologies are. In summary, distinguishing between uncertainty in
topology and uncertainty in branch lengths requires careful examination of the tree and
its annotations.
Algorithms for generating summary trees that use all clades in the posterior sample, rather than
only those found in the summary tree, were developed recently
(Heled and Bouckaert 2013). These algorithms do not suffer from negative branch
lengths. An implementation is available in biopy (https://code.google.com/p/biopy),
which is integrated with DensiTree (see next section). One set of
algorithms tries to match clade heights as closely as possible, resulting in large trees with
long branches. Another set of algorithms tries to match branch lengths. These methods
have a tendency to collapse branches that have little clade support, but tend to have
higher likelihood on the original data used in simulation experiments.
The common ancestor heights algorithm determines the height of nodes in a summary
tree by using the average height of the summary tree clades in the trees of the tree set.
This method tends to do well in estimating divergence times and is fast to compute
(Heled and Bouckaert 2013). Since the height of a node of a clade is always at least as
high as any of its subclades, this guarantees that all branch lengths are non-negative. It
is the default setting in TreeAnnotator in BEAST 2.2.
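The idea can be sketched as follows, assuming that the height of a summary-tree clade in a sampled tree means the height of the most recent common ancestor (MRCA) of that clade's taxa in the sampled tree, whether or not the clade itself is present there. Trees are written as tuples `(height, child, child, ...)` with taxon names as leaves at height 0. The representation, names and toy trees are assumptions for illustration; this is not TreeAnnotator's code.

```python
def clade_sets(tree):
    """Taxon sets of all internal nodes of a timed tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node[1:]))
        found.append(below)
        return below

    visit(tree)
    return found

def mrca_height(tree, taxa):
    """Height of the most recent common ancestor of `taxa` in a timed tree."""
    result = []

    def visit(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node[1:]:
            below |= visit(child)
        # Post-order traversal: the first node covering all taxa is the MRCA.
        if not result and taxa <= below:
            result.append(node[0])
        return below

    visit(tree)
    return result[0]

def common_ancestor_heights(summary, trees):
    """Mean MRCA height over the tree set for each clade of the summary tree."""
    return {clade: sum(mrca_height(t, clade) for t in trees) / len(trees)
            for clade in clade_sets(summary)}

# Hypothetical summary tree and tree set; the third tree lacks clade (A,B).
summary = (2.0, (1.0, "A", "B"), "C")
tree_set = [(2.0, (1.0, "A", "B"), "C"),
            (3.0, (1.0, "A", "B"), "C"),
            (4.0, (2.0, "A", "C"), "B")]
heights = common_ancestor_heights(summary, tree_set)
print({"".join(sorted(c)): h for c, h in heights.items()})
```

In every sampled tree the MRCA of a clade is at least as old as the MRCA of any subset of its taxa, so the averaged heights preserve that ordering and no branch of the summary tree can be negative.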
11.5 DensiTree
A DensiTree (Bouckaert 2010) is an image of a tree set where every tree in the set is
drawn transparently on top of each other. The result is that areas where there is large
consensus on the topology of the tree show up as distinct, fat lines while areas where
there is no consensus show as blurs. The advantage of a DensiTree is that it is very clear
where the uncertainty in the tree set occurs, and no special skills are required to interpret
annotations on the tree.
Figure 11.6 shows the DensiTree for the Anolis data, together with a consensus tree.
The image clearly shows there is large consensus on the topology of the trees close
to the leaves of the tree. Also, the outgroup consisting of the clade with Diplolaemus
darwinii and Polychrus acutirostris at the bottom of the image is clearly separated
from the other taxa. Again, this might not quite be what is expected from the large
number of different topologies in the Anolis tree set noted in Section 11.3.2. Where
the topology gets less certain is in the middle, as indicated by the crossing line in the
DensiTree.
Figure 11.6 DensiTree of the Anolis tree set. Bars indicate 95% HPD of the height of clades. Only
clades with more than 90% support have their bar drawn.
Consider the clade just above the outgroup consisting of A. aeneus, A. richardi, A.
luciae, Phenacosaurus nicefori, A. agassizi and A. microtus. The first three form a solid
clade, say clade X, and the last two, say clade Y, as well. However, it is not clear where
P. nicefori fits in. There are three options: P. nicefori split off before the other two
clades, clade X split off before P. nicefori and clade Y, or clade Y split off before P.
nicefori and clade X. There is support in the data for all three scenarios, though the last
scenario has the most support with 50%, judging from the 50% posterior probability of the clade
consisting of clade X and P. nicefori. Further, there is 35% support for the first scenario, leaving
15% support for the second scenario. So, where the summary tree shows the most likely
scenario, there are two other scenarios in the 95% credible set of scenarios and these
are visualised in the image.
However, when a lot of taxa are closely related there can be a lot of uncertainty inside
clades. This is shown in Figure 11.7, which shows the tree set for dengue-4 virus.
Many samples were used in this analysis that only differ by a few mutations from other
[Figure 11.7 image residue: tip labels are dengue-4 isolates named in the form D4/PR/<isolate>/<month>-<year>, with sampling years from 1981 to 1998.]
Figure 11.7 DensiTree of the dengue-4 tree set. Bars as in Figure 11.6.
12 Getting started with BEAST 2
BEAST is software for performing a wide range of phylogenetic analyses. The vision
we have for BEAST is that it provides tools for computational science that are
1. easy to use, that is, well documented, having intuitive user interfaces with a shallow
learning curve.
2. open access, that is, open source, open XML format, facilitating reproducibility
of results, and running on many platforms.
3. easy to extend, by having extensibility in their design.
We limit the scope of BEAST to efficient Bayesian computation for sequence data
analysis involving tree models. Making BEAST easy to use is one of the things that
motivated writing this book. The code is set up to encourage documentation that is used
in user interfaces like BEAUti. By dividing the code base into a core set of classes
that can be extended by packages (Chapter 15), we hope that it will be easier for new
developers to learn how to write new functionality and perform new science. Further
help and documentation are available via the BEAST 2 wiki.¹
We want BEAST to be open access (Vision 2) and therefore it is written in Java, open
source and licensed under the GNU Lesser General Public License.² BEAST 2 typically runs as
a standalone application, by double-clicking its icon (in modern operating systems) or
starting from the command line with java -jar beast.jar. A BEAST 2 XML
file should be specified as the command line argument. XML files are used to store
models and data in a single place. The XML format is an open format described in
Chapter 13.
Since we want the system to be extensible (Vision 3), everything in the system
implements BEASTInterface. The BEASTObject class provides a basic imple-
mentation and many classes in BEAST derive from BEASTObject. We will say
that an object is a BEAST-object if it implements BEASTInterface and an object
is a BEASTObject if it derives from BEASTObject. Every BEAST-object can
specify inputs to connect with other BEAST-objects, which allows for flexible model
building. Input objects contain information on type of input and how they are stored
in BEAST XML files. To extend the code, you write Java classes that implement
BEASTInterface by deriving from the BEASTObject class, or deriving from any
1 http://beast2.org/wiki
2 The BEAST 2 source code can be downloaded from http://beast2.org.
of the more specialised classes that subclass BEASTObject. But before we get into
the gory details of writing BEAST-objects, let us first have a guided tour of BEAST 2.
12.1 A quick tour of BEAST 2
Figure 12.1 shows (part of) a model, representing a nucleotide sequence analysis using
the Jukes–Cantor substitution model. The ‘rockets’ represent BEAST-objects, and their
‘thrusters’ the inputs. Models can be built up by connecting BEAST-objects through
these inputs with other BEAST-objects. For example, in Figure 12.1 the SiteModel
BEAST-object has a JC69 substitution model BEAST-object as input, and Tree,
SiteModel and Alignment are inputs to the TreeLikelihood BEAST-object.
The TreeLikelihood calculates the likelihood of the alignment for a given tree.
To do this, the TreeLikelihood also needs at least a SiteModel as input, and
potentially also a BranchRateModel (not necessary in this example, since a strict
clock is assumed by default). The SiteModel specifies everything related to the
transition probabilities for a site from one node to another in the Tree, such as the
number of gamma categories, proportion of invariant sites and substitution model. In
Figure 12.1, the Jukes–Cantor substitution model is used. In this section, we extend this
with the HKY substitution model and show how this model interacts with the operators,
state, loggers and other bits and pieces in the model.
To define the HKY substitution model, first we need to find out what its inputs should
be. The kappa parameter of the HKY model represents a variable that can be estimated.
BEAST-objects in the calculation model (that is, the part of the model that performs the
posterior calculation) are divided into StateNodes and CalculationNodes.
StateNodes are classes an operator can change, while CalculationNodes
are classes that change their internal state based on inputs. The HKY model is a
CalculationNode, since it internally stores an eigenvalue matrix that is calculated
based on kappa. Kappa can be changed by an operator and does not calculate anything
itself, so the kappa parameter is a StateNode.
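The division of labour between the two kinds of node can be illustrated with a toy sketch. It is written in Python rather than Java and does not use or mimic the BEAST 2 API: the class names, the version counter and the caching scheme are all assumptions made for illustration. For simplicity the calculation node caches the unnormalised HKY rate matrix (transitions scaled by kappa) instead of an eigendecomposition.

```python
class Kappa:
    """Plays the role of a StateNode: a value that an operator may change."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def set_value(self, value):
        self.value = value
        self.version += 1   # tells dependants that cached results are stale


class HKY:
    """Plays the role of a CalculationNode: derives and caches a result."""

    BASES = "ACGT"
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def __init__(self, kappa, frequencies):
        self.kappa = kappa
        self.frequencies = frequencies
        self._seen_version = None
        self._rates = None

    def rate_matrix(self):
        # Recalculate only when the input has changed since the last call.
        if self._seen_version != self.kappa.version:
            self._rates = self._compute()
            self._seen_version = self.kappa.version
        return self._rates

    def _compute(self):
        rates = {}
        for i in self.BASES:
            for j in self.BASES:
                if i != j:
                    scale = self.kappa.value if (i, j) in self.TRANSITIONS else 1.0
                    rates[i, j] = scale * self.frequencies[j]
            # Each row of a rate matrix sums to zero.
            rates[i, i] = -sum(rates[i, j] for j in self.BASES if j != i)
        return rates


kappa = Kappa(2.0)
hky = HKY(kappa, dict.fromkeys("ACGT", 0.25))
print(hky.rate_matrix()["A", "G"])   # transition rate, scaled by kappa
kappa.set_value(1.0)                 # what an operator's proposal would do
print(hky.rate_matrix()["A", "G"])   # recalculated after the change
```

The point of the split is that an operator only ever touches the state node, while everything derived from it is recomputed lazily when, and only when, its inputs have changed.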
The other bit of information required for the HKY model is the character frequen-
cies. These can be calculated from the alignment or estimated using a parameter, so
[Figure 12.1 diagram labels: sequence, alignment, tree, siteModel (gammaCategoryCount=1), substModel (JC69), treeLikelihood, state, distribution, operator, logger, mcmc.]
Figure 12.1 Example of a model specifying the Jukes–Cantor substitution model (JC69). It shows
BEAST-objects represented by rocket shapes connected to other BEAST-objects through inputs
(the thrusters of the rocket).
[Figure residue (captions not recovered): diagrams extending the model of Figure 12.1 step by step with an HKY substitution model (kappa, value=1.0; frequencies), operators (kappaScaler and treeScaler with scaleFactor=0.5, Uniform, narrow and wide exchange, WilsonBalding, SubtreeSlide), a State (storeEvery=100000) and loggers (TreeLogger, TraceLogger and ScreenLogger with logEvery=10000; fileName=test.$(seed).trees and test.$(seed).log).]
with parameter values (a tab delimited file that can be analysed with Tracer) and one
log file with trees in Newick format.
Finally, the Alignment consists of a list of Sequences. Each sequence object
contains the actual sequence and taxon information. This completes the model, shown
in Figure 12.6 and this model can be executed by BEAST 2.
However, this does not represent a proper Bayesian analysis, since no prior is defined.
We need to define one prior for each of the items that form the State, in this case a
tree and the kappa parameter of the HKY model. A posterior is a Distribution that
is the product of prior and likelihood, which are themselves Distributions. For
such distributions there is the CompoundDistribution, and Figure 12.7 shows the
complete model with posterior, prior and likelihood as CompoundDistribution
BEAST-objects. The prior consists of a log-normal prior on the kappa parameter and a
Yule prior on the tree. Since the Yule prior has a birth rate that can be estimated, the
12.2 BEAST core: BEAST-objects and inputs 173
Figure 12.6 Adding the sequences. This forms a complete description of the model, which can be
executed in BEAST 2.
Figure 12.7 Adding a posterior, prior and likelihood and appropriate priors on kappa, the tree and
birth rate of the Yule prior. This model forms a proper Bayesian analysis that can be executed in
BEAST 2.
birth rate parameter requires a prior as well (and is part of the State, not shown in the
figure). The prior on the birth rate is a uniform prior in Figure 12.7.
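The product structure of the posterior can be sketched in plain Java. On the log scale the product of prior and likelihood becomes a sum, so a compound distribution only has to add the log densities of its parts. The class name CompoundSketch, the toy likelihood and the log-normal parameter values are assumptions made for illustration; this is not the BEAST 2 CompoundDistribution class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

// Hypothetical stand-in for a compound distribution; not the BEAST 2 class.
class CompoundSketch {
    // Log density of a compound: the sum of the log densities of its parts,
    // which is the log of the product of the densities.
    static double logP(List<DoubleUnaryOperator> parts, double x) {
        double sum = 0.0;
        for (DoubleUnaryOperator part : parts) {
            sum += part.applyAsDouble(x);
        }
        return sum;
    }

    // Log density of a log-normal with mean m and standard deviation s
    // on the log scale.
    static double logNormal(double x, double m, double s) {
        double z = (Math.log(x) - m) / s;
        return -Math.log(x * s * Math.sqrt(2 * Math.PI)) - 0.5 * z * z;
    }

    public static void main(String[] args) {
        // Toy prior on kappa; the parameter values are arbitrary.
        DoubleUnaryOperator prior = kappa -> logNormal(kappa, 1.0, 1.25);
        // Toy stand-in for a tree-likelihood viewed as a function of kappa.
        DoubleUnaryOperator likelihood = kappa -> -0.5 * (kappa - 2.0) * (kappa - 2.0);
        double logPosterior = logP(Arrays.asList(prior, likelihood), 2.0);
        System.out.println("unnormalised log posterior at kappa = 2: " + logPosterior);
    }
}
```

Because a compound is itself a log density, compounds can be nested, which mirrors how the posterior contains a prior that is again a compound.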
One way of looking at BEAST is that it is a library consisting of two parts: an MCMC
library, which lives in the beast.core package, and an evolution library in the
beast.evolution package. Beast, BEAUti, SequenceGenerator and a handful
of other tools are applications built on top of these libraries, and the application-
specific code is in the beast.app package. Since all computational science heavily
174 Getting started with BEAST 2
Bayesian computation is most often accomplished using the MCMC algorithm. Box
12.1 shows the basic structure of the MCMC algorithm. A glance at this bit of pseudo-
code reveals that the least that is required are the following components:
    Read data
    Initialize state
    while (not tired) {
        Propose new state
        calculateLogPosterior();
        if (new state is acceptable)
            // do something
        else
            // do something else
        Log state
    }
12.3.1 MCMC/runnable
A good understanding of the implementation of the MCMC algorithm in BEAST is
essential in writing efficient BEAST-objects. In this section we will go through its
details. The main loop of the MCMC algorithm executes the following steps after the
state is initialised:
    Propose new state
    logP = calculateLogPosterior();
    if (new state is acceptable)
        // do something
    else
        // do something else

1 Store state
2 Propose new state
3 logP = calculateLogPosterior();
4 if (new state is acceptable)
5     accept state
6 else
7     restore state
Figure 12.8 The difference between StateNode and CalculationNode. StateNodes are
part of the State and can only be changed by operators. CalculationNodes change when
one of their inputs is a StateNode that changed or a CalculationNode that changed.
At the start of the loop in line 1 the state is stored, which makes it easy to restore
the state later, if required. When a new state is proposed (line 2), one or more of the
StateNodes in the state will be given a new value; for example, a parameter may
have its values scaled or a tree may have its topology changed. The State keeps track
of which of the StateNodes are changed. All StateNodes that changed have a flag
marking that they are ‘dirty’, while all other StateNodes are marked as ‘clean’. If the
state turns out to be acceptable, the state is notified (line 5) and all StateNodes that
were marked dirty before are now marked clean again. If the state is not acceptable, the
state should be restored to the old state that was stored in line 1.
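The store, propose and accept-or-restore cycle described above can be sketched as a small self-contained Java program. ToyState is a hypothetical one-parameter state, the target is a standard normal density chosen only for illustration, and java.util.Random stands in for the generator; none of this is BEAST 2 code.

```java
import java.util.Random;

// Minimal sketch of the store / propose / accept-or-restore loop.
// ToyState is a hypothetical one-parameter state, not a BEAST 2 class.
class McmcLoopSketch {
    static class ToyState {
        double value;        // current value of the single state node
        double storedValue;  // copy made by store()
        boolean dirty;       // set when a proposal changed the value

        void store()   { storedValue = value; }
        void accept()  { dirty = false; }
        void restore() { value = storedValue; dirty = false; }
    }

    // Log density of the toy target: a standard normal, up to a constant.
    static double calculateLogPosterior(ToyState state) {
        return -0.5 * state.value * state.value;
    }

    // Runs the chain and returns the mean of the sampled values.
    static double run(long seed, int chainLength) {
        Random random = new Random(seed);
        ToyState state = new ToyState();
        double oldLogP = calculateLogPosterior(state);
        double sum = 0.0;
        for (int i = 0; i < chainLength; i++) {
            state.store();                                     // store state
            state.value += 2.0 * (random.nextDouble() - 0.5);  // propose new state
            state.dirty = true;
            double logP = calculateLogPosterior(state);
            // The proposal is symmetric, so only the posterior ratio matters.
            if (Math.log(random.nextDouble()) < logP - oldLogP) {
                state.accept();                                // accept state
                oldLogP = logP;
            } else {
                state.restore();                               // restore state
            }
            sum += state.value;
        }
        return sum / chainLength;
    }

    public static void main(String[] args) {
        System.out.println("sample mean: " + run(42L, 200000));
    }
}
```

Storing before every proposal keeps the rejection path cheap: nothing has to be recomputed, the old value is simply copied back.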
The State is aware of the network of BEAST-objects and can calculate which of the
CalculationNodes may be impacted by a change of a StateNode. Figure 12.8
shows a State consisting of a kappa parameter and a tree. When the kappa parameter
changes, this has an effect on the HKY substitution model, which causes a change in the
site model, which requires the TreeLikelihood to be updated. However, if the tree
changes, but the kappa parameter remains the same, there is no need to update the HKY
model or the site model. In fact, the TreeLikelihood is set up to detect which part
of the tree requires updating so that the peeling algorithm does not need to be applied
for the complete tree every time a part of a tree changes. The following listing shows
where the CalculationNodes get updated during the execution of the main loop.
    Store state
    Propose new state
    store calculation nodes
    check dirtiness of calculation nodes (requiresRecalculation())
    logP = calculateLogPosterior();
    if (new state is acceptable)
        accept state
        mark calculation nodes clean (store())
    else
        restore state
        restore calculation nodes (restore())
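The interplay between state nodes and calculation nodes in this listing can be illustrated with a toy example. The sketch below assumes a single parameter and a single calculation node that caches the square of that parameter; the classes Parameter and SquareNode are hypothetical stand-ins that borrow the method names from the listing, not the BEAST 2 classes.

```java
// Hypothetical stand-ins for a StateNode and a CalculationNode. The method
// names follow the listing above, but these are not the BEAST 2 classes.
class CalculationNodeSketch {
    static class Parameter {
        double value;
        double storedValue;
        boolean dirty;

        void set(double newValue) { value = newValue; dirty = true; }
        void store()   { storedValue = value; }
        void accept()  { dirty = false; }
        void restore() { value = storedValue; dirty = false; }
    }

    // Caches the square of its input and recalculates only when needed.
    static class SquareNode {
        final Parameter input;
        double result;
        double storedResult;
        boolean needsUpdate = true;
        int recalculations = 0;

        SquareNode(Parameter input) { this.input = input; }

        boolean requiresRecalculation() {
            if (input.dirty) {
                needsUpdate = true;
            }
            return input.dirty;
        }

        double get() {
            if (needsUpdate) {
                result = input.value * input.value;  // stands in for costly work
                recalculations++;
                needsUpdate = false;
            }
            return result;
        }

        void store()   { storedResult = result; }
        void restore() { result = storedResult; needsUpdate = false; }
    }

    // One iteration of the main loop; newValue == null means the proposal
    // changed some other state node and left this parameter untouched.
    static double step(Parameter kappa, SquareNode node, Double newValue, boolean acceptable) {
        kappa.store();                     // store state
        if (newValue != null) {
            kappa.set(newValue);           // propose new state
        }
        node.store();                      // store calculation nodes
        node.requiresRecalculation();      // check dirtiness
        double logP = -node.get();         // toy "log posterior"
        if (acceptable) {
            kappa.accept();                // accept state, mark clean
        } else {
            kappa.restore();               // restore state
            node.restore();                // restore calculation nodes
        }
        return logP;
    }

    public static void main(String[] args) {
        Parameter kappa = new Parameter();
        kappa.value = 3.0;
        SquareNode node = new SquareNode(kappa);
        node.get();                        // initial calculation
        step(kappa, node, 4.0, false);     // kappa proposal, rejected
        step(kappa, node, null, true);     // other proposal, accepted
        System.out.println(node.get() + " after " + node.recalculations + " recalculations");
    }
}
```

The rejected proposal costs one recalculation, and the restore brings back the cached value without any further work; the proposal that leaves the parameter untouched costs none.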
12.3 MCMC library 177
12.3.4 Operators
Operators determine how the state space is explored. An Operator has at least
one StateNode as input and implements the proposal method. Most operators can
The evolution library can be found in the beast.evolution package, and contains
BEAST-objects for handling alignments, phylogenetic trees and various BEAST-objects
for calculating the likelihood of an alignment for a tree and various priors. You can
find the details of the individual classes by reading the Java-doc documentation, or by
directly looking at the Java classes. In this section, we concentrate on how the various
packages and some of the classes inside these packages are related to each other and
give a high-level overview of the library.
Figure 12.9 Splitting an alignment into two alignments using two filtered alignments, one
partition for every third site in the alignment, and one for the first and second sites, skipping
every third site. For each of these partitions a likelihood can be defined.
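The split described in the caption of Figure 12.9 can be sketched on a plain string. The method below is a hypothetical illustration that selects sites by codon position; it does not use the BEAST 2 alignment classes, and the example sequence is made up.

```java
// Hypothetical illustration of splitting sites by codon position.
class CodonPartitionSketch {
    // Keeps the sites whose 0-based index modulo 3 equals one of the offsets
    // (0 = first, 1 = second, 2 = third codon position).
    static String filter(String sequence, int... offsets) {
        StringBuilder kept = new StringBuilder();
        for (int site = 0; site < sequence.length(); site++) {
            for (int offset : offsets) {
                if (site % 3 == offset) {
                    kept.append(sequence.charAt(site));
                    break;
                }
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String sequence = "ACGTTAGCA";
        System.out.println("third sites:            " + filter(sequence, 2));
        System.out.println("first and second sites: " + filter(sequence, 0, 1));
    }
}
```

Together the two partitions contain every site exactly once, so a separate likelihood can be attached to each.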
Figure 12.10 Example illustrating most of the components of the evolution library.
12.4.2 Tree-likelihood
Figure 12.10 shows a model with an HKY substitution model, a strict clock and a
coalescent prior with constant population size, which illustrates most of the components
of the evolution library. Let’s have a look at this model, going from the posterior at the
right down to its inputs. The posterior is a compound distribution from the MCMC
library. Its inputs are a coalescent prior and a tree-likelihood, representing the prior
and the likelihood for this model. The coalescent is a tree prior with a demographic
component, and the simple coalescent with constant population size can be found in the
beast.evolution.tree.coalescent package together with a number of more
complex tree priors, such as the (extended) Bayesian skyline plot.
The beast.evolution.likelihood package contains the tree-likelihood
classes. By default, BEAST tries to use the BEAGLE implementation of the peeling
algorithm, and otherwise falls back on a Java implementation. Since the tree-likelihood
calculates the likelihood of an alignment for a given tree, it should come as no
surprise that the tree-likelihood has an alignment and a tree as its inputs. The
beast.evolution.alignment package contains classes for alignments, sequences and
taxon sets. The beast.evolution.tree package contains the Tree state node,
classes for initialising trees randomly (RandomTree) and for logging tree information
(TreeHeightLogger, TreeWithMetaDataLogger), and TreeDistribution,
which is the base class for priors over trees, including the coalescent-based priors.
There is another group of tree priors that are not based on coalescent theory but on
theories about speciation, such as the Yule and birth–death priors. Since only a single
prior on the tree should be specified, none of these priors is shown in Figure 12.10.
These priors can be found in the beast.evolution.speciation package
together with priors for *BEAST analysis and utility classes for initialising species
trees and logging for *BEAST.
The tree-likelihood requires a site model as input, which takes a substitution model
as input, an HKY model in Figure 12.10. In the evolution library, there are packages for
site models and substitution models. The site model package
(beast.evolution.sitemodel) only contains an implementation of the gamma site model,
which allows a proportion of the sites to be invariant. The substitution model package
(beast.evolution.substitutionmodel) contains the most popular models,
including Jukes–Cantor, HKY and general time-reversible substitution models for
nucleotide data as well as JTT, WAG, MTREV, CPREV, BLOSUM and Dayhoff
substitution models for amino-acid data. Root frequencies are in the substitution model
package.
The tree-likelihood has a branch-rate model input, which can be used to define
clock models on the tree. In Figure 12.10, a strict clock model is shown. The package
beast.evolution.branchratemodel contains other clock models, such as the
uncorrelated relaxed clock model and the random local clock model.
There is a package for operators in the evolution library, which contains most of
the Operator implementations. It is part of the evolution library since it contains
operators on trees, and it is handy to have all general-purpose operators together in a
single package. More details on operators can be found in Section 12.3.4.
A few more notable packages outside the core and evolution packages can be useful,
namely the following.
12.6 Exercise 183
The beast.math package contains classes mainly for mathematical items. The
beast.math.distributions package contains distributions for constructing
prior distributions over parameters and MRCA times. The beast.math.statistics
package contains a class for calculating statistics and for entering mathematical
calculations.
The beast.util package contains utilities such as random number generation, file
parsing and managing packages. It contains Randomizer for random number generation.
Note that it is recommended to use the Randomizer class for generating random
numbers instead of the java.util.Random class because it makes debugging a
lot easier (see Section 14.5.2) and helps ensure that an analysis started with the same
seed leads to reproducible results. The beast.util package also contains classes for
reading and writing a number of file formats, such as XMLParser and XMLProducer
for reading and writing BEAST XML files, and NexusParser for reading a subset
of NEXUS files. TreeParser parses Newick trees. LogAnalyser handles trace
log files and calculates some statistics on them. Further, the beast.util package
contains classes for installing, loading and uninstalling BEAST 2 packages.
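Why a single seedable generator helps reproducibility can be shown with a small sketch. The class below is a hypothetical stand-in that wraps java.util.Random purely for illustration; it is not the BEAST 2 Randomizer class and makes no claim about its API. The point is only that when every random draw in a program goes through one central generator, resetting the seed replays exactly the same sequence.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in illustrating one central, seedable generator.
// BEAST 2 provides its own Randomizer class; this sketch does not use it.
class SeededGeneratorSketch {
    private static Random generator = new Random();

    static void setSeed(long seed) { generator = new Random(seed); }

    static double nextDouble() { return generator.nextDouble(); }

    // Resets the seed and draws n numbers from the shared generator.
    static double[] draw(long seed, int n) {
        setSeed(seed);
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = nextDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        double[] first = draw(127L, 3);
        double[] second = draw(127L, 3);
        System.out.println("same seed, same draws: " + Arrays.equals(first, second));
    }
}
```

If parts of a program created their own generators instead, the sequence of draws would depend on how those generators were seeded, and a run could no longer be replayed from a single seed.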
The beast.app package contains applications built on the MCMC and evolution
libraries, such as BEAST and BEAUti, and its classes are typically not reused with the
exception of input-editors for BEAUti. See Section 15.3.2 for more details.
12.6 Exercise
BEAST uses XML as a file format for specifying an analysis. Typically, the XML file
is generated through BEAUti, but for new kinds of analysis or analyses not directly
supported by BEAUti, it is necessary to construct the XML by hand in a text editor.
BEAST-object developers also need to know how to write BEAST XML files in order to
test and use their BEAST-objects. BEAUti also uses XML as a file format for specifying
templates, which govern its behaviour.
This chapter starts with a short description of XML, then explains how BEAST
interprets XML files and how an XML file can be modified. Finally, we work through
an example of a typical BEAST specification.
XML stands for eXtensible Markup Language and has some similarities with HTML.
However, XML was designed for encoding data, while HTML was designed for
displaying information. The easiest way to explain what XML is without going into
unnecessary detail is to have a look at the example shown in Figure 13.1.
Here, we have an (incomplete) BEAST specification that starts with the so-called
XML declaration, which specifies the character set used (UTF-8 here) and is left
unchanged most of the time, unless it is necessary to encode information in another
character set. The second line shows a tag called ‘beast’. Tags come in pairs, an opening
tag and a closing tag. Opening tags are of the form <tag-name> and can have extra
information specified in attributes. Attributes are pairs of names and values such as
version='2.0' for the beast-tag in the example. Names and values are separated
by an equals sign and values are surrounded by single or double quotes. Both are valid,
but they should match, so a value started with a double quote needs to end with a double
quote. Closing tags are of the form </tag-name> and have no attributes. Everything
between an opening tag and closing tag is called an element.
Elements can have other elements nested in them. In the example above, the
input element is nested inside the beast element. Likewise, the kappa element
is nested inside the input element. When an element does not have any other
elements nested inside, the opening and closing tags can be combined in an abbreviated
tag of the form <tag-name/>. In the example, the data element only has an
13.1 What is XML? 185
Figure 13.1 Small XML example with all of its items annotated.
idref attribute and has no element enclosed, so it can be abbreviated from <data
idref="alignment"></data> to <data idref="alignment"/>.
XML comments start with <!-- and end with --> and text in between is ignored. Any
text except double dash -- is allowed inside comments.
There are a few special characters that should be used with care. Since tags are
identified by < and > characters, these cannot be used in attribute names, attribute
values or tag names. The other special characters are the single quote, the double
quote and the ampersand. XML entities are interpreted as follows:
    &lt;      <
    &gt;      >
    &quot;    "
    &apos;    '
    &amp;     &
CDATA sections are XML constructs that are interpreted as literal strings, so the
content is not parsed. Without the CDATA section, every <, >, ', " and & character
would require an XML entity, which would not make such fragments very readable.
Since elements are nested within other elements the nesting defines a hierarchy. So,
we can speak of a parent and child relationship between an element and those nested
within it. The first element is not nested inside any other element and is called the top-
level element. Only one top-level element is allowed.
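The notions of elements, attributes, entities and nesting can be explored with the XML parser that ships with Java. The fragment in the sketch below is made up for illustration (the note attribute in particular is not a BEAST input), and the class name XmlHierarchySketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XmlHierarchySketch {
    // Illustrative fragment only; names and values are made up.
    static final String XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<beast version='2.0'>"
        + "  <!-- comments are skipped when walking the elements -->"
        + "  <input name=\"operator\" note=\"a &lt; b &amp; c\">"
        + "    <data idref=\"alignment\"/>"
        + "  </input>"
        + "</beast>";

    // Parses the text and returns the single top-level element.
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Prints each element, indented according to its depth in the hierarchy.
    static void print(Element element, String indent) {
        System.out.println(indent + element.getTagName());
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                print((Element) children.item(i), indent + "  ");
            }
        }
    }

    public static void main(String[] args) {
        Element root = parse(XML);
        print(root, "");
        Element input = (Element) root.getElementsByTagName("input").item(0);
        // Entities in attribute values are decoded by the parser.
        System.out.println(input.getAttribute("note"));
    }
}
```

The parser accepts single and double quotes around attribute values alike, and rejects text with more than one top-level element.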
186 BEAST XML
There are two input elements, the first one specifying a BEAST-object
of class beast.evolution.operators.ScaleOperator. The attributes
scaleFactor and weight set primitive inputs to value 0.5 and 1, respectively.
The scale operator has an input with name parameter, and the nested element refers
through the idref attribute to a BEAST-object that should be defined elsewhere in the
XML.
element. The list is a colon-separated list of packages; for example, to specify a name
space containing beast.core and beast.evolution.operators use
    <beast namespace="beast.core:beast.evolution.operators">
The parser finds a BEAST-object class by going through the list and appending the
value of the spec attribute to the package name. By default, the top-level package
is part of the list (at the end), even when not specified explicitly in the namespace
attribute. With the above name space definition, the fragment shown earlier is equivalent
to
    <input name="operator" id="kappaScaler"
           spec="ScaleOperator" scaleFactor="0.5" weight="1">
        <input name="parameter" idref="hky.kappa"/>
    </input>
Note that if there are BEAST-objects with the same name in different packages, the
BEAST-object that matches with the first package in the class path is used. To prevent
such name clashes, the complete class name can be used.
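The lookup through the name space can be sketched as a loop over package names. This is a simplified illustration rather than the XMLParser code: JDK packages stand in for BEAST packages so that the sketch runs without BEAST on the class path, and trying the spec value on its own as a complete class name in the last step is an assumption of this sketch.

```java
// Simplified illustration of resolving a spec value against a name space.
class NamespaceSketch {
    // Tries each package of the colon-separated list in order and returns the
    // first class found; finally tries spec on its own as a complete class name.
    static Class<?> resolve(String namespace, String spec) {
        for (String pkg : namespace.split(":")) {
            try {
                return Class.forName(pkg + "." + spec);
            } catch (ClassNotFoundException e) {
                // not in this package, try the next one
            }
        }
        try {
            return Class.forName(spec);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("cannot resolve " + spec);
        }
    }

    public static void main(String[] args) {
        // JDK packages stand in for BEAST packages here.
        System.out.println(resolve("java.util:java.io", "File").getName());
        System.out.println(resolve("java.util:java.io", "ArrayList").getName());
    }
}
```

Because the list is searched in order, a class name that occurs in two packages resolves to the package listed first, which is why a complete class name is the safe choice when names clash.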
Note that when the name attribute is specified, the tag name will be ignored. Further,
the end tag should have the same name as the start tag, so both have the tag name
‘operator’.
is interpreted as
    <parameter id="color.red" value="1.0"/>
    <parameter id="color.green" value="1.0"/>
    <parameter id="color.blue" value="1.0"/>
Note that plates can be nested when different variable names are used.
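The effect of a plate is a textual expansion, which can be mimicked in a few lines of Java. The sketch assumes a loop variable called n that is written as $(n) in the template and takes its values from a comma-separated list; the variable name, the list notation and the class PlateSketch are assumptions for illustration and are not taken from the parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of expanding a template once per value.
class PlateSketch {
    // Replaces every occurrence of $(var) in the template by each value in turn.
    static List<String> expand(String template, String var, String values) {
        List<String> result = new ArrayList<>();
        for (String value : values.split(",")) {
            result.add(template.replace("$(" + var + ")", value));
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "<parameter id=\"color.$(n)\" value=\"1.0\"/>";
        for (String line : expand(template, "n", "red,green,blue")) {
            System.out.println(line);
        }
    }
}
```

Nesting works in the same way as long as the inner and outer variables have different names, since each replacement only touches its own placeholder.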
a map-element. The map element has a name attribute that defines the tag name
and the text content describes the class. For example, to map tag name prior to
beast.math.distributions.Prior, use
    <map name='prior'>beast.math.distributions.Prior</map>
The following analysis estimates the tree using an HKY substitution model for which
the kappa is estimated and a strict clock model. A Yule prior is placed on the tree and a
1/X prior on kappa. Comments are added to highlight peculiarities of the BEAST XML
parser. Figure 13.2 shows the model view of the file, but with sequences and loggers
removed for clarity.
The first line is the XML declaration which tells the parser about the character encod-
ing. The second line indicates that this is a BEAST version 2 file. Furthermore, a list of
packages is defined that constitute the name space.
8 <sequence taxon="orangutan">AGAAATATGTCTGACAAAAGAGTTACTTTGATAGAGTAAAAAATAGAGGT...</sequence>
9 <sequence taxon="siamang">AGAAATACGTCTGACGAAAGAGTTACTTTGATAGAGTAAATAACAGGGGT...</sequence>
10 </data>
Note that data (line 3) and sequence (lines 4 to 9) are reserved names that
are associated with beast.evolution.alignment.Alignment and
beast.evolution.alignment.Sequence BEAST-objects, respectively. Most BEAST
XML files have a data block element at the start. The sequence BEAST-object has
an input called ‘value’, and the text inside the sequence tags, representing character
sequences, is assigned to this ‘value’ input. The dots in the XML fragment indicate
that there are a lot more sites in the sequence not shown here to save space. The data
element has an id attribute, so that it can later be referred to from, for example, the
tree-likelihood.
Lines 11 to 34 define the posterior from which we sample. Note that the spec
attribute in line 11 contains part of the package (util.CompoundDistribution)
that contains the CompoundDistribution class. The posterior contains two distri-
butions: a compound distribution for the prior (line 12); and a tree-likelihood for the
likelihood (line 23). The prior consists of a Yule prior (line 13) and a prior on kappa
(line 18). Note that the kappa parameter referred to in line 18 is defined in the state (line
37), showing that idrefs can refer to BEAST-objects specified later in the file.
The tree likelihood has a sitemodel as input (line 24), which here has a HKY substi-
tution model (line 25) as input. The likelihood also has a clock model (line 30) which
here is a strict clock. All elements have id attributes so that they can be referred to, for
example, for logging.
Lines 35 to 78 specify the MCMC BEAST-object. This is the main entry point for
the analysis. The child elements of the MCMC element are the state, the distribution
to sample from, a list of operators, a list of loggers and an initialiser for the tree. The
state (line 37 to 44) lists the state nodes that are operated on, here just the tree and kappa
parameter. Note that the tree element (line 38) has a name attribute linking it to the state
through its stateNode input. If the tag name were set to stateNode, the name
attribute would be superfluous, but a spec attribute would be required to specify the
class, which is implicit when ‘tree’ is used as a tag.
35 <run chainLength="10000000" id="mcmc" preBurnin="0" spec="MCMC">
36
46 <distribution idref="posterior"/>
The distribution (line 46) the MCMC analysis samples from refers to the posterior
defined earlier in the file at line 11. Lines 47 to 53 list the operators used in the MCMC
chain. Operators need to refer to at least one state node defined in the state element (line
37 to 44).
47 <operator degreesOfFreedom="1" id="treeScaler" scaleFactor="0.5"
       spec="ScaleOperator" tree="@Tree" weight="1.0"/>
48 <operator id="UniformOperator" spec="Uniform" tree="@Tree"
       weight="10.0"/>
49 <operator gaussian="true" id="SubtreeSlide" optimise="true" size="1.0"
       spec="SubtreeSlide" tree="@Tree" weight="5.0"/>
50 <operator id="narrow" isNarrow="true" spec="Exchange" tree="@Tree"
       weight="1.0"/>
51 <operator id="wide" isNarrow="false" spec="Exchange" tree="@Tree"
       weight="1.0"/>
52 <operator id="WilsonBalding" spec="WilsonBalding" tree="@Tree"
       weight="1.0"/>
53 <operator degreesOfFreedom="1" id="KappaScaler" scaleFactor="0.5"
       spec="ScaleOperator" weight="1.0" parameter='@kappa'/>
To log the states of the chain at regular intervals, three loggers are defined: a trace
logger (line 54), whose output can be analysed by the Tracer program; a screen logger
(line 63), which gives feedback on screen while running the chain; and a tree logger
(line 70) for writing a NEXUS file to store a tree set.
13.4 Exercise 193
Finally, the XML file needs a closing tag for the run element and a closing element
for the top-level BEAST element.
78 </run>
79 </beast>
A notable difference from BEAST 1 is that the order in which BEAST-objects are
specified does not matter.
13.4 Exercise
Are the following XML fragments equivalent, assuming that the name space is
‘beast.evolution.sitemodel:beast.evolution.substitutionmodel:beast.evolution.likelihood’?
Fragment 1
    <input name='substModel' id="hky" spec="HKY">
        <input name='kappa' idref="hky.kappa" >
        <input name='frequencies' id="freqs" spec="Frequencies">
            <input name='data' idref="alignment"/>
        </input>
    </input>

    <input spec="TreeLikelihood">
        <input name='data' idref='alignment'/>
        <input name='tree' idref='tree'/>
        <input name='siteModel' spec="SiteModel">
            <input name='substModel' idref='hky'/>
        </input>
    </input>
Fragment 2
    <substModel id="hky" spec="HKY" kappa="@hky.kappa" >
        <frequencies id="freqs" spec="Frequencies"
            data="@alignment"/>
    </substModel>
We will show a few patterns commonly used with BEAST 2, illustrating how the
MCMC framework can be exploited to write efficient models. Basing BEAST 2
classes on BEAST-objects has the advantage that it automatically defines
a fairly readable form of XML and allows automatic checking of a number of validation
rules. The validation helps in debugging models. These advantages are based on Java
introspection, and there are a few peculiarities due to the limitations of Java introspection
which can lead to unexpected behaviour. These issues will be highlighted in this chapter.
We start with some basic patterns for BEAST-objects, inputs and the core StateNode
and CalculationNode BEAST-objects, and how to write efficient versions of them.
We also show a number of the most commonly used BEAST-objects from the evolution
library and how to extend them. Finally, there are some tips and a list of common errors
that one should be aware of.
1 Note we use BEAST-object and BEASTObject for implementations of BEASTInterface and extensions
of BEASTObject, respectively.
196 Coding and design patterns
    // members next
    private Object myObject;

    // initAndValidate
    @Override
    public void initAndValidate() {
        //...
    }

    // class specific methods

    // Overriding methods
}
14.1 Basic patterns 197
    @Override
    public void initAndValidate() throws Exception {...}

    @Override
    public void getTransitionProbabilities(double distance,
            double[] matrix) {...}

    @Override
    protected boolean requiresRecalculation() {...}

    @Override
    protected void store() {...}

    @Override
    protected void restore() {...}
}
assigned value, for instance, through the XMLParser. So, BEAST-objects typically do not
have a constructor (though see Section 14.2.3 for an exception).
After the initialisation, class-specific methods and overriding methods (with notably
store, restore and requiresRecalculation at the end) conclude a BEAST-
object.
Box 14.2 shows the skeleton of a larger example. Note that apart from the
Description annotation, there is also a Citation annotation that can be used
to list a reference and DOI of a publication. At the start of a run, BEAST visits all
BEAST-objects in a model and lists the citations, making it easy for users to reference
work done by BEAST-object developers.
The HKY BEAST-object has a single input for the kappa parameter. There are more
details on Input construction and validation in Section 14.2. There is no further valid-
ation required in the initAndValidate method, where only a few shadow param-
eters are initialised. The getTransitionProbabilities method is where the
work for a substitution model takes place. The methods requiresRecalculation,
store and restore complete the BEAST-object with implementation of
CalculationNode methods.
Note at least two arguments are required for an Input constructor: the name of the
input and a short description of the function of the input. Inputs of a BEAST-object can
be other BEAST-objects, which are declared in a similar way:
    public Input<Frequencies> freqsInput =
        new Input<Frequencies>("frequencies",
            "substitution model equilibrium state frequencies");
Inputs can have multiple values. When a list of inputs is specified, the Input construc-
tor should contain a (typically empty) List as a start value so that the type of the list can
be determined through Java introspection (as far as we know this cannot be done from
the declaration alone due to Java introspection limitations).
    public Input<List<RealParameter>> parametersInput =
        new Input<List<RealParameter>>("parameter",
            "parameter, part of the state",
            new ArrayList<RealParameter>());
When the XMLParser processes an XML fragment, these validation rules are auto-
matically checked. So, when the kappa input is not specified in the XML, the parser
throws an exception. These input rules are also used in BEAUti to make sure the model
is consistent, and in documentation generated for BEAST-objects.
If a list of inputs needs to have at least one element specified, the required argument
needs to be provided.
    public Input<List<Operator>> operatorsInput =
        new Input<List<Operator>>("operator",
            "operator for generating proposals in MCMC state space",
            new ArrayList<Operator>(), Validate.REQUIRED);
Sometimes either one or another input is required, but not both. In that case an input
is declared XOR and the other input is provided as an extra argument. The XOR goes on
the second input. Note that the order of inputs matters since at the time of construction
of an object the members are created in order of declaration. This means that the first
input cannot access the second input at the time just after it was created. Therefore, the
XOR rule needs to be put on the second input.
    public Input<Tree> treeInput =
        new Input<Tree>("tree",
            "if specified, all tree branch length are scaled");
    public Input<Parameter> parameterInput =
        new Input<Parameter>("parameter",
            "if specified, this parameter is scaled",
            Validate.XOR, treeInput);
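How such rules might be checked can be sketched with a simplified stand-in. The class SketchInput below is hypothetical and much reduced compared with the BEAST 2 Input class; it only illustrates the REQUIRED and XOR rules and why the XOR rule sits on the input that is declared second.

```java
// Hypothetical, simplified stand-in for validation rules on inputs.
// It mimics the idea only and is not the BEAST 2 Input class.
class InputRuleSketch {
    enum Validate { OPTIONAL, REQUIRED, XOR }

    static class SketchInput<T> {
        final String name;
        final Validate rule;
        final SketchInput<?> other;  // partner input, used by the XOR rule only
        T value;

        SketchInput(String name, Validate rule, SketchInput<?> other) {
            this.name = name;
            this.rule = rule;
            this.other = other;
        }

        // Throws if the rule attached to this input is violated.
        void validate() {
            if (rule == Validate.REQUIRED && value == null) {
                throw new IllegalArgumentException("Input '" + name + "' must be specified");
            }
            if (rule == Validate.XOR && (value == null) == (other.value == null)) {
                throw new IllegalArgumentException(
                    "Specify either '" + name + "' or '" + other.name + "', but not both");
            }
        }
    }

    // Returns true when exactly one of tree and parameter is given.
    static boolean isValid(String tree, String parameter) {
        // The first input exists before the second is created, so only the
        // second can refer to its partner at construction time.
        SketchInput<String> treeInput = new SketchInput<>("tree", Validate.OPTIONAL, null);
        SketchInput<String> parameterInput = new SketchInput<>("parameter", Validate.XOR, treeInput);
        treeInput.value = tree;
        parameterInput.value = parameter;
        try {
            treeInput.validate();
            parameterInput.validate();
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("aTree", null));
        System.out.println(isValid(null, null));
        System.out.println(isValid("aTree", "aParameter"));
    }
}
```

Attaching the rule to the input itself means that a parser, a GUI and a documentation generator can all consult the same declaration.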
    public GTR() {
        rates.setRule(Validate.OPTIONAL);
    }
The input only represents the link between BEAST-objects. Say, an input inputX
has a BEAST-object X as its value. By ‘writing’ an input to a new BEAST-object Y
a new link is created to Y, but that does not replace the BEAST-object X. This is not
a problem when X has no other outputs than inputX, but it can lead to unexpected
results when there are more outputs. There are exceptions, for example programs for
editing models, like BEAUti and ModelBuilder, but care must be taken when assigning
input values.
        if (calculationNodeInput.get().isDirtyCalculation()) {
            return true;
        }
        return false;
    }
    void update() {
        someThing = ...;
        needsUpdate = false;
    }

    public boolean requiresRecalculation() {
        if (someInputIsDirty()) {
            needsUpdate = true;
            return true;
        }
        return false;
    }

    public void store() {super.store();}
14.5 Common extensions 203
    Object intermediateResult;
    Object storedIntermediateResult;
There are a few classes that we would like to highlight for extensions and point
out a few notes and hints on how to do this. To add a clock model, implement the
BranchRateModel interface, which has just one method getRateForBranch.
To add a new Tree prior, extend TreeDistribution (not just Distribution)
and implement calculateLogP.
14.6 Tips
14.6.1 Debugging
Debugging MCMC chains is a hazardous task. To help check that the model is valid,
BEAST recalculates the posterior at every third sample during an initial number of steps.
14.7 Known ways to get into trouble 205
Before recalculating the posterior, all StateNodes become marked dirty so all
CalculationNodes should update themselves. If the recalculated posterior differs
from the current posterior, BEAST halts and reports the difference. To find the bug that
caused this dreaded problem, it is handy to find out which operator caused the last state
change, and thus which CalculationNodes might not have updated themselves
properly. Have a look at the MCMC doloop method and the debugging code inside for
further details.
14.8 Exercise
Write a clock model that, like the uncorrelated relaxed clock, selects a rate for a branch,
but where the number of different categories is limited to a given upper bound.
Implement it as a lean CalculationNode. Optimise the class by implementing it as a fat
CalculationNode.
15 Putting it all together
15.1 Introduction
• A jar file that contains the class files of BEAST-objects, all supporting code
and potentially some classes for supporting BEAUti. Other libraries used for
developing the package can be added separately.
• A jar file with the source code. BEAST 2 is licensed under LGPL, since the
BEAST team are strong advocates for open source software. All derived work
should have its source code made available.
• Example XML files illustrating typical usage of the BEAST-object, similar to the
example files distributed with BEAST 1.
• Documentation describing the purpose of the package and perhaps containing
articles that can serve as a reference for the package.
• A BEAUti 2 template can be added so that any BEAST-objects in the package are
directly available for usage in a GUI.
• A file named version.xml which contains the name and version of the
package, and the names and versions of any other packages it depends on. For
instance, the version.xml file for a package which depends only on the
current release of BEAST would have the following simple structure:
The package consists of a zip archive containing all of the above. To make the package
available to other BEAST users, the zip file should be downloadable from a public
URL. To install a package, download it and unzip it in the beast2 directory, whose
location depends on the operating system. This is best done through
BEAUti, but can also be done from the command line using the packagemanager
program. BEAST and BEAUti automatically check these directories and load
any of the available jar files.
BEAST expects packages to have the following directory structure:
myPackage.src.jar    source files
examples/            XML examples
lib/                 libraries used (if any)
doc/                 documentation
templates/           BEAUti templates (optional)
version.xml          package metadata file
The natural order in which to develop a package is to develop one or more BEAST-
objects, test (see Section 15.4) and document these BEAST-objects, develop example
XML files and then develop BEAUti support. To start package development, you need
to check out the BEAST 2 code and set up a new project that has a dependency on
BEAST 2. Details for setting up a package in IDEs like Eclipse or IntelliJ can be found
on the BEAST 2 wiki. Directions on writing BEAST-objects were already discussed in
Chapter 14.
15.3 BEAUti
To make a package popular it is important to have GUI support, since many users are
not keen to edit raw BEAST XML files. BEAUti is a panel-based GUI for manipulating
BEAST models and reading and writing XML files. Unfortunately, a lot of concepts are
involved in GUI development, as well as understanding BEAST models. Consequently,
this section is rather dense, so be prepared for a steep learning curve. This section aims
to introduce the basic concepts, but once you have read it you will probably want to study
the templates that come with the BEAST distribution as well before writing your own.
The easiest way to make a BEAST-object available is to write a BEAUti template,
which can define new BEAUti panels and determines which sub-models go in which
BEAUti panel. A BEAUti template is stored in an XML file in BEAST XML format.
There are two types of templates: main templates and sub-templates. Main templates
define a complete analysis, while sub-templates define a sub-model, for example a
substitution model, which can be used in main templates like the Standard or *BEAST
template. Main templates are specified using a BeautiConfig object, while sub-
templates are specified through BeautiSubTemplate objects.
template. When the inputs of X are expanded, for every input of X a suitable input editor
is found and added to the panel. It is also possible to suppress some of the inputs of X
in order to keep the GUI looking clean or hide some of the more obscure options.
There are a number of ways to define the behaviour of an input editor, such as whether
to show all inputs of a BEAST-object, and whether to suppress some of the inputs. For
list input editors, buttons may be shown to add, remove or edit items from a list. For
example, for the list of operators it makes sense to allow editing, but not adding.
To write an input editor, derive a class from InputEditor or from some class that
already derives from InputEditor. In particular, if the input editor you want to write
can handle list inputs, then derive from ListInputEditor.
The important methods to implement are type or types and init. The type
method tells BEAUti to use this input editor for the particular input class (use types if
multiple classes are supported). The init method should add components containing
all the user-interface components for manipulating the input. This is where custom-made
code can be inserted to create the desired user-interface for an input. A complex
example is the tip-dates input editor, which can be used to edit the traits-input
of a tree. Though the tip dates are simply encoded as comma-separated strings with
name=value pairs, it is much more desirable to manipulate these dates in a table,
which is what the tip-dates input editor does.
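To make the tip-dates example concrete, the sketch below converts a comma-separated string of name=value pairs into an ordered table and back, which is the data handling such an editor needs before any GUI code. It is only an illustration: the class TraitParser and its methods are hypothetical names, and the sketch assumes that every value is a plain number, which may not hold for all date formats.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not the tip-dates input editor itself.
class TraitParser {
    // Parses "taxonA=1990.5, taxonB=2001" into an ordered name -> value table.
    static Map<String, Double> parse(String traits) {
        Map<String, Double> table = new LinkedHashMap<>();
        for (String pair : traits.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue;  // tolerate empty entries such as a trailing comma
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("expected name=value: " + trimmed);
            }
            String name = trimmed.substring(0, eq).trim();
            double value = Double.parseDouble(trimmed.substring(eq + 1).trim());
            table.put(name, value);
        }
        return table;
    }

    // Writes the table back as a comma-separated string of name=value pairs.
    static String format(Map<String, Double> table) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> entry : table.entrySet()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return sb.toString();
    }
}
```

A table-based editor would display the rows of the parsed table and call format whenever the user changes a cell.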
Input editors are discovered by BEAUti through Java introspection, so they can
be part of any jar file in any BEAST package. However, they are only expected to be in
the Java package beast.app and won’t be picked up from other Java packages.
For more details, have a look at existing input editors in BEAST. It is often easy to
build input editors out of existing ones, which can save quite a bit of boilerplate code.
The CDATA section from lines 2 to 8 contains the sub-graph created when the
template is activated. For a Yule prior on the tree (line 3), a prior on the birth rate
(line 6) and a scale operator on the birth rate (line 7) are created. Lines 9 to 14
specify the connections that need to be made through BeautiConnectors. A
connector specifies a BEAST-object to connect from (through the srcID attribute), a
BEAST-object to connect to (through the targetID attribute) and the name of the input
in the target BEAST-object to connect with (through the inputName attribute). The
connector is only activated when some conditions are met. If the condition is not
met, BEAUti will attempt to disconnect the link (if it exists). The conditions are
separated by the ‘and’ keyword in the if attribute. The conditions are mostly of the
form inposterior(YuleModel.t:$(n)), which tests whether the BEAST-
object with id YuleModel.t:$(n) is a predecessor of a BEAST-object with id
posterior in the model. Further, there are conditions of the form Tree.t:$(n)/
estimate=true, used to test whether an input value of a BEAST-object has a certain
value. This is mostly relevant to test whether StateNodes are estimated or not, since
if they are not estimated no operator should be defined on them, and logging them is
not very useful.
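The two kinds of conditions described above can be evaluated with very little code. The following sketch is an assumption about how such an evaluation might look, not BEAUti's implementation: the class ConnectorCondition is hypothetical, the set of ids that are predecessors of the posterior and the map of input values are supplied by the caller, and the partition placeholder is assumed to have been filled in already.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical evaluator for the 'if' attribute of a connector.
class ConnectorCondition {
    // idsInPosterior: ids of BEAST-objects that are predecessors of 'posterior'.
    // inputValues: current input values, keyed by "id/inputName".
    static boolean holds(String ifAttribute, Set<String> idsInPosterior,
                         Map<String, String> inputValues) {
        // All conditions separated by 'and' must be met.
        for (String part : ifAttribute.split("\\s+and\\s+")) {
            String condition = part.trim();
            if (condition.startsWith("inposterior(") && condition.endsWith(")")) {
                String id = condition.substring("inposterior(".length(),
                        condition.length() - 1);
                if (!idsInPosterior.contains(id)) {
                    return false;
                }
            } else if (condition.contains("=")) {
                int eq = condition.indexOf('=');
                String key = condition.substring(0, eq);
                String expected = condition.substring(eq + 1);
                if (!expected.equals(inputValues.get(key))) {
                    return false;
                }
            } else {
                throw new IllegalArgumentException("unrecognised condition: " + condition);
            }
        }
        return true;
    }
}
```

When holds returns false for a connector, the corresponding link would be disconnected rather than created.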
Line 9 connects the prior to the BEAST-object with id ‘prior’. This refers to a compound
distribution inside the MCMC, and the Yule prior is added to the input with the
name ‘distribution’. The birth rate parameter is added to the state (line 10), the prior on
the birth rate is added to the prior (line 11), the scale operator is connected to the MCMC
operators input (line 12) and the Yule prior and birth rate are added to the trace log (lines
13 and 14). Note that these connections are only made if the condition specified in
the if input is true; otherwise BEAUti attempts to remove the connection.
Connectors are tested in order of appearance. It is always a good idea to make the
first connector the one connecting the main BEAST-object in the sub-template, since if
this main BEAST-object is disconnected, most of the others should be disconnected as
well. For this tree prior, the tree’s estimate flag can become false when the tree for
the partition is linked.
Instead of defining all sub-templates explicitly for the BeautiConfig BEAST-
object, a merge-point can be defined. Before processing a template, all merge-points
are replaced by XML fragments in sub-templates. For example, the main template
can contain <mergepoint id='substModelTemplates'/> inside a
BeautiConfig BEAST-object and a sub-template can contain an XML fragment
like this:
1 <mergewith point='substModelTemplates'>
2   <subtemplate id='JC69' class='beast.evolution.substitutionmodel.JukesCantor' mainid='JC69.s:$(n)'>
3   <![CDATA[
4     <distr spec='JukesCantor' id='JC69.s:$(n)'/>
5   ]]>
6   </subtemplate>
7 </mergewith>
This fragment will be inserted in the XML of the main template. The sub-template needs an id, a
class specifying the class of the input it can be connected to and the id of the BEAST-
object that needs to be connected. The actual BEAST-object that is created when the
sub-template is activated is defined inside a CDATA section starting at line 3 with
<![CDATA[ and closing at line 5 with ]]>. The BeautiSubTemplate has an input
called ‘value’ which contains the XML fragment specifying all BEAST-objects in the
sub-graph and everything inside the CDATA section is assigned to that input.
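The ids in the fragment contain the text $(n). Assuming, as these ids suggest, that $(n) is a placeholder that is replaced by the name of a partition when the sub-template is activated, the substitution step can be sketched as follows. The class TemplateInstantiator and its method are hypothetical and only illustrate the idea.

```java
// Hypothetical helper; assumes $(n) stands for the partition name.
class TemplateInstantiator {
    // Returns the XML fragment with every $(n) replaced by the partition name.
    static String instantiate(String cdataFragment, String partitionName) {
        // String.replace substitutes literally, so '$' and '(' need no escaping.
        return cdataFragment.replace("$(n)", partitionName);
    }
}
```

For a partition named dna, the id JC69.s:$(n) would become JC69.s:dna under this assumption.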
15.4 Variable selection-based substitution model package example
The variable selection-based substitution model (VS model) is a substitution model that
jumps between six substitution models using BSVS. The frequencies at the root of the
trees are empirically estimated from sequence data. Figure 15.1 shows the parameters
involved in the six substitution models. Since BEAST normalises transition probability
matrices such that on average one substitution is expected per unit length on a branch,
one parameter can be set to 1. The models are selected so that they are nested, that is,
every model with i parameters can be expressed in models with j parameters if j > i.
The popular models F81, HKY85, TN93, TIM and GTR follow this pattern. In order to
complete the set of models, we chose an extra model, labelled EVS in Figure 15.1, that
obeys the nesting constraint. Moving from model i to model i + 1 means that the
VS model uses one more variable. If a variable is not used, it is effectively sampled
from the prior. In this section we will have a look at what is involved in adding the VS
model to BEAST as a package.
Figure 15.1 The six substitution models and their parameters in the VS substitution model.
Figure 15.2 The VS model consists of a substitution model, two parameters, two operators on
these parameters, and priors on the rates and the count. The accompanying BEAUti template contains a
set of rules to connect its BEAST-objects to a larger model, e.g. the rates to the rate prior, the
loggers to the trace-log, etc. For clarity, the tree, sequences and loggers are omitted.
We need to add an extra input to indicate how many dimensions are in use.
public Input<IntegerParameter> countInput = new Input<IntegerParameter>(
        "count",
        "model number used 0 = F81, 1 = HKY, 2 = TN93, 3 = TIM, "
        + "4 = EVS, 5 and higher GTR (default 0)",
        Validate.REQUIRED);
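The meaning of the count value, as given in the description of the input above, can be written down as a small lookup. The class and method names below are hypothetical, and rejecting negative values is an assumption; only the mapping itself is taken from the input description.

```java
// Hypothetical helper mapping the count value to the model it selects.
class VSModelNames {
    static String nameForCount(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        switch (count) {
            case 0: return "F81";
            case 1: return "HKY";
            case 2: return "TN93";
            case 3: return "TIM";
            case 4: return "EVS";
            default: return "GTR";  // 5 and higher
        }
    }
}
```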
Since the substitution model is intended for nucleotide data only, we need to add a
method indicating that other data-types are not supported. To this end, we override the
canHandleDataType method.
@Override
public boolean canHandleDataType(DataType dataType) throws Exception {
    // The VS model is defined for nucleotide data only.
    if (dataType instanceof Nucleotide) {
        return true;
    }
    throw new Exception("Can only handle nucleotide data");
}
The superclass takes care of everything else, including storing, restoring and setting
a flag for recalculating items. A small efficiency gain could be achieved by letting the
requiresRecalculation method test whether any of the relevant rates changed,
which is left as an exercise to the reader.
model and check how close the substitution model that is estimated is to the one used
to generate the data. Ideally, when sampling from the VS model with x parameters, we
obtain an estimate of x parameters when analysing the data. The proportion of time
the number of parameters is indeed estimated as x is a measure of how well the model
performs.
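The performance measure described above is a simple proportion. A minimal sketch, assuming the sampled values of the count parameter have already been read from the trace log into an array (the class and method names are hypothetical):

```java
// Hypothetical helper for summarising a simulation test of the VS model.
class ModelRecovery {
    // Fraction of posterior samples whose count equals the count that was
    // used to simulate the data.
    static double proportionRecovered(int[] sampledCounts, int trueCount) {
        if (sampledCounts.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int hits = 0;
        for (int count : sampledCounts) {
            if (count == trueCount) {
                hits++;
            }
        }
        return (double) hits / sampledCounts.length;
    }
}
```

In practice the samples taken during burn-in would be discarded before computing this proportion.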
There is quite a complex workflow involved in setting up such tests, and much of this
is repetitive since we want to run through the process multiple times. You might find
the BEASTShell package useful for this, as it was designed to help with testing BEAST
models.
Some inputs, like eigenSystem, should not show in BEAUti, which is marked in this
declaration. At its heart, the template consists of a BEAST XML fragment specifying
the model, its priors and operators, wrapped in a CDATA section.
<![CDATA[
<substModel spec='VS' id='VS.s:$(n)'>
    <count spec='parameter.IntegerParameter' id='VScount.s:$(n)' value='5' lower='0' upper='5'/>
    <rates spec='parameter.RealParameter' id='VSrates.s:$(n)' value='1' dimension='5' lower='0.01' upper='100.0'/>
    <frequencies id='freqs.s:$(n)' spec='Frequencies'>
        <data idref='$(n)'/>
    </frequencies>
</substModel>
15.5 Exercise
Write an exciting new package for BEAST 2 and release it to the public!
References
Akaike, H (1974). ‘A new look at the statistical model identification’. In: IEEE Transactions on
Automatic Control 19.6, pp. 716–723 (page 18).
Aldous, D (2001). ‘Stochastic models and descriptive statistics for phylogenetic trees, from Yule
to today’. In: Statistical Science 16, pp. 23–34 (pages 31, 39).
Alekseyenko, AV, CJ Lee and MA Suchard (2008). ‘Wagner and Dollo: a stochastic duet by
composing two parsimonious solos’. In: Systematic Biology 57.5, pp. 772–784 (page 115).
Allen, LJ (2003). An introduction to stochastic processes with applications to biology. Upper
Saddle River, NJ: Pearson Education (page 12).
Amenta, N and J Klingner (2002). ‘Case study: visualizing sets of evolutionary trees’. In:
INFOVIS 2002. IEEE Symposium on Information Visualization, pp. 71–74 (page 159).
Anderson, RM and RM May (1991). Infectious diseases of humans: dynamics and control.
Oxford: Oxford University Press (page 34).
Arunapuram, P, I Edvardsson, M Golden, et al. (2013). ‘StatAlign 2.0: combining statistical
alignment with RNA secondary structure prediction’. In: Bioinformatics 29.5, pp. 654–655
(page 9).
Atarhouch, T, L Rüber, EG Gonzalez, et al. (2006). ‘Signature of an early genetic bottleneck
in a population of Moroccan sardines (Sardina pilchardus)’. In: Molecular Phylogenetics and
Evolution 39.2, pp. 373–383 (page 133).
Avise, J (2000). Phylogeography: the history and formation of species. Cambridge, MA: Harvard
University Press (page 68).
Ayres, D, A Darling, D Zwickl, et al. (2012). ‘BEAGLE: a common application programming
interface and high-performance computing library for statistical phylogenetics’. In: Systematic
Biology 61, pp. 170–173 (page 114).
Baele, G, P Lemey, T Bedford, et al. (2012). ‘Improving the accuracy of demographic and
molecular clock model comparison while accommodating phylogenetic uncertainty’. In:
Molecular Biology and Evolution 29.9, pp. 2157–2167 (pages 18, 19, 136, 137).
Baele, G, WLS Li, AJ Drummond, MA Suchard and P Lemey (2013a). ‘Accurate model selection
of relaxed molecular clocks in Bayesian phylogenetics’. In: Molecular Biology and Evolution
30.2, pp. 239–243 (pages 18, 137).
Baele, G, P Lemey and S Vansteelandt (2013b). ‘Make the most of your samples: Bayes factor
estimators for high-dimensional models of sequence evolution’. In: BMC Bioinformatics 14.1,
p. 85 (page 19).
Bahl, J, MC Lau, GJ Smith, et al. (2011). ‘Ancient origins determine global biogeography of hot
and cold desert cyanobacteria’. In: Nature Communications 2, art. 163 (page 96).
Bahlo, M and RC Griffiths (2000). ‘Inference from gene trees in a subdivided population’. In:
Theoretical Population Biology 57.2, pp. 79–95 (page 31).
References 221
Billera, L, S Holmes and K Vogtmann (2001). ‘Geometry of the space of phylogenetic trees’. In:
Advances in Applied Mathematics 27, pp. 733–767 (page 154).
Bloomquist, EW and MA Suchard (2010). ‘Unifying vertical and nonvertical evolution: a
stochastic ARG-based framework’. In: Systematic Biology 59.1, pp. 27–41 (pages 6, 10).
Bofkin, L and N Goldman (2007). ‘Variation in evolutionary processes at different codon
positions’. In: Molecular Biology and Evolution 24.2, pp. 513–521 (pages 100, 104, 146).
Bolstad, WM (2011). Understanding computational Bayesian statistics. Vol. 644. Hoboken, NJ:
Wiley (page 10).
Boni, MF, D Posada and MW Feldman (2007). ‘An exact nonparametric method for inferring
mosaic structure in sequence triplets’. In: Genetics 176.2, pp. 1035–1047 (page 97).
Bouckaert, RR (2010). ‘DensiTree: making sense of sets of phylogenetic trees’. In: Bioinformatics
26.10, pp. 1372–1373 (pages 156, 163).
Bouckaert, RR and D Bryant (2012). ‘A rough guide to SNAPP’. In: BEAST 2 wiki (pages x,
124).
Bouckaert, RR, P Lemey, M Dunn, et al. (2012). ‘Mapping the origins and expansion of the
Indo-European language family’. In: Science 337, pp. 957–960 (pages 57, 135).
Bouckaert, RR, M Alvarado-Mora and J Rebello Pinho (2013). ‘Evolutionary rates and HBV:
issues of rate estimation with Bayesian molecular methods’. In: Antiviral Therapy 18, pp. 497–
503 (pages 57, 100, 145).
Bouckaert, RR, J Heled, D Kühnert, et al. (2014). ‘BEAST 2: a software platform for Bayesian
evolutionary analysis’. In: PLoS Computational Biology 10.4, e1003537 (page 16).
Bradley, RK, A Roberts, M Smoot, et al. (2009). ‘Fast statistical alignment’. In: PLoS Computa-
tional Biology 5.5, e1000392 (page 9).
Brooks, S and A Gelman (1998). ‘Assessing convergence of Markov chain Monte Carlo
algorithms’. In: Journal of Computational and Graphical Statistics 7, pp. 434–455 (page 140).
Brooks, S, A Gelman, GL Jones and XL Meng (2010). Handbook of Markov chain Monte Carlo.
Boca Raton, FL: Chapman & Hall/CRC. (page 10).
Brown, JM (2014). ‘Predictive approaches to assessing the fit of evolutionary models’. In:
Systematic Biology 63.3, pp. 289–292 (page 18).
Brown, RP and Z Yang (2011). ‘Rate variation and estimation of divergence times using strict
and relaxed clocks’. In: BMC Evolutionary Biology 11, art. 271 (pages 105, 106, 144, 146).
Bryant, D (2003). ‘A classification of consensus methods for phylogenetics’. In: BioConsensus
(Piscataway, NJ, 2000/2001). Providence, RI: AMS, pp. 163–184 (page 159).
Bryant, D, RR Bouckaert, J Felsenstein, N Rosenberg, and A RoyChoudhury (2012). ‘Inferring
species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent
analysis’. In: Molecular Biology and Evolution 29.8, pp. 1917–1932 (pages 7, 115, 119, 123).
Bunce, M, TH Worthy, T Ford, et al. (2003). ‘Extreme reversed sexual size dimorphism in the
extinct New Zealand moa Dinornis.’ In: Nature 425, pp. 172–175 (page 129).
Burnham, K and D Anderson (2002). Model selection and multimodel inference: a practical
information-theoretic approach. New York: Springer Verlag (page 32).
Camargo, A, LJ Avila, M Morando and JW Sites (2012). ‘Accuracy and precision of species trees:
effects of locus, individual, and base pair sampling on inference of species trees in lizards of the
Liolaemus darwinii group (Squamata, Liolaemidae)’. In: Systematic Biology 61.2, pp. 272–288
(page 120).
Campos, PF, E Willerslev, A Sher, et al. (2010). ‘Ancient DNA analyses exclude humans as
the driving force behind late Pleistocene musk ox (Ovibos moschatus) population dynamics’.
In: Proceedings of the National Academy of Sciences USA 107.12, pp. 5675–5680 (page 134).
References 223
Cavalli-Sforza, L and A Edwards (1967). ‘Phylogenetic analysis: models and estimation proced-
ures’. In: American Journal of Human Genetics 19, pp. 233–257 (pages 24, 25).
Chaves, JA and TB Smith (2011). ‘Evolutionary patterns of diversification in the Andean
hummingbird genus Adelomyia’. In: Molecular Phylogenetics and Evolution 60.2, pp. 207–218
(page 165).
Chung, Y and C Ané (2011). ‘Comparing two Bayesian methods for gene tree/species tree
reconstruction: simulations with incomplete lineage sorting and horizontal gene transfer’. In:
Systematic Biology 60.3, pp. 261–275 (page 120).
Coop, G and RC Griffiths (2004). ‘Ancestral inference on gene trees under selection’. In:
Theoretical Population Biology 66.3, pp. 219–232 (pages 69, 76).
Cox, J, J Ingersoll and S Ross (1985). ‘A theory of the term structure of interest rates’. In:
Econometrica 53, pp. 385–407 (page 125).
Currie, TE, SJ Greenhill, R Mace, et al. (2010). ‘Is horizontal transmission really a problem for
phylogenetic comparative methods? A simulation study using continuous cultural traits’. In:
Philosophical Transactions of the Royal Society B: Biological Sciences 365, pp. 3903–3912
(page 97).
Debruyne, R, G Chu, CE King, et al. (2008). ‘Out of America: ancient DNA evidence for a new
world origin of late quaternary woolly mammoths’. In: Current Biology 18.17, pp. 1320–1326
(page 133).
Degnan, JH and NA Rosenberg (2006). ‘Discordance of species trees with their most likely gene
trees’. In: PLoS Genetics 2.5, e68 (page 116).
Didelot, X, J Gardy and C Colijn (2014). ‘Bayesian inference of infectious disease transmission
from whole genome sequence data’. In: Molecular Biology and Evolution, msu121 (page 41).
Dincă, V, VA Lukhtanov, G Talavera and R Vila (2011). ‘Unexpected layers of cryptic diversity
in wood white Leptidea butterflies’. In: Nature Communications 2, art. 324 (pages 121, 165).
Drummond, AJ (2002). ‘Computational statistical inference for molecular evolution and popula-
tion genetics’. PhD thesis. University of Auckland (page 106).
Drummond, AJ and A Rambaut (2007). ‘BEAST: Bayesian evolutionary analysis by sampling
trees’. In: BMC Evolutionary Biology 7, art. 214 (pages 14, 74, 79).
Drummond, AJ and AG Rodrigo (2000). ‘Reconstructing genealogies of serial samples under the
assumption of a molecular clock using serial-sample UPGMA’. In: Molecular Biology and
Evolution 17.12, pp. 1807–1815 (page 31).
Drummond, AJ and MA Suchard (2008). ‘Fully Bayesian tests of neutrality using genealogical
summary statistics’. In: BMC Genetics 9, art. 68 (page 18).
Drummond, AJ and MA Suchard (2010). ‘Bayesian random local clocks, or one rate to rule them
all’. In: BMC Biology 8.1, art. 114 (pages 63–65, 83, 146).
Drummond, AJ, R Forsberg and AG Rodrigo (2001). ‘The inference of stepwise changes in
substitution rates using serial sequence samples.’ In: Molecular Biology and Evolution 18.7,
pp. 1365–1371 (page 31).
Drummond, AJ, GK Nicholls, AG Rodrigo and W Solomon (2002). ‘Estimating mutation param-
eters, population history and genealogy simultaneously from temporally spaced sequence
data’. In: Genetics 161.3, pp. 1307–1320 (pages 6, 14, 31, 33, 34, 66, 74, 109).
Drummond, AJ, OG Pybus, A Rambaut, R Forsberg and AG Rodrigo (2003a). ‘Measurably
evolving populations’. In: Trends in Ecology & Evolution 18, pp. 481–488 (pages 8, 66, 128,
152).
Drummond, AJ, OG Pybus and A Rambaut (2003b). ‘Inference of viral evolutionary rates from
molecular sequences’. In: Advances in Parasitology 54, pp. 331–358 (page 149).
224 References
Drummond, AJ, A Rambaut, B Shapiro and OG Pybus (2005). ‘Bayesian coalescent inference
of past population dynamics from molecular sequences’. In: Molecular Biology and Evolution
22.5, pp. 1185–1192 (pages 32, 108, 130, 133).
Drummond, AJ, SYW Ho, MJ Phillips and A Rambaut (2006). ‘Relaxed phylogenetics and dating
with confidence’. In: PLoS Biology 4.5, e88 (pages 62, 63, 73, 144).
Drummond, AJ, MA Suchard, D Xie and A Rambaut (2012). ‘Bayesian phylogenetics with
BEAUti and the BEAST 1.7’. In: Molecular Biology and Evolution 29.8, pp. 1969–1973
(pages 16, 74).
Duffy, S, LA Shackelton and EC Holmes (2008). ‘Rates of evolutionary change in viruses:
patterns and determinants’. In: Nature Reviews Genetics 9.4, pp. 267–276 (pages 20, 152).
Durbin, R, SR Eddy, A Krogh and G Mitchison (1998). Biological sequence analysis: probabilis-
tic models of proteins and nucleic acids. Cambridge: Cambridge University Press (page 8).
Edgar, RC (2004a). ‘MUSCLE: a multiple sequence alignment method with reduced time and
space complexity’. In: BMC Bioinformatics 5, p. 113 (page 9).
Edgar, RC (2004b). ‘MUSCLE: multiple sequence alignment with high accuracy and high
throughput’. In: Nucleic Acids Research 32.5, pp. 1792–1797 (page 9).
Edwards, A (1970). ‘Estimation of the branch points of a branching diffusion process
(with discussion)’. In: Journal of the Royal Statistical Society, Series B 32, pp. 155–174
(page 38).
Edwards, A and L Cavalli-Sforza (1965). ‘A method for cluster analysis’. In: Biometrics, pp. 362–
375 (page 6).
Edwards, CTT, EC Holmes, DJ Wilson, et al. (2006). ‘Population genetic estimation of the loss
of genetic diversity during horizontal transmission of HIV-1’. In: BMC Evolutionary Biology
6, art. 28 (page 32).
Etienne, R, B Haegeman, T Stadler, et al. (2012). ‘Diversity-dependence brings molecular
phylogenies closer to agreement with the fossil record’. In: Proceedings of the Royal Society B:
Biological Sciences, doi: 10.1098/rspb.2011.1439 (page 39).
Ewing, G and AG Rodrigo (2006a). ‘Coalescent-based estimation of population parameters when
the number of demes changes over time’. In: Molecular Biology and Evolution 23.5, pp. 988–
996 (pages 6, 71).
Ewing, G and AG Rodrigo (2006b). ‘Estimating population parameters using the structured serial
coalescent with Bayesian MCMC inference when some demes are hidden’. In: Evolutionary
Bioinformatics 2, pp. 227–235 (page 72).
Ewing, G, G Nicholls and AG Rodrigo (2004). ‘Using temporally spaced sequences to simultan-
eously estimate migration rates, mutation rate and population sizes in measurably evolving
populations’. In: Genetics 168.4, pp. 2407–2420 (pages 6, 14, 69, 71, 72).
Faria, NR, MA Suchard, A Abecasis, et al. (2012). ‘Phylodynamics of the HIV-1 CRF02_AG
clade in Cameroon’. In: Infection, Genetics and Evolution 12.2, pp. 453–460 (page 135).
Fearnhead, P and P Donnelly (2001). ‘Estimating recombination rates from population genetic
data’. In: Genetics 159.3, pp. 1299–1318 (page 31).
Fearnhead, P and C Sherlock (2006). ‘An exact Gibbs sampler for the Markov-modulated
Poisson process’. In: Journal of the Royal Statistical Society, Series B 68.5, pp. 767–784
(page 72).
Felsenstein, J (1981). ‘Evolutionary trees from DNA sequences: a maximum likelihood
approach’. In: Journal of Molecular Evolution 17, pp. 368–376 (pages 6, 49, 53, 55, 69, 70).
Felsenstein, J (1985). ‘Phylogenies and the comparative method’. In: The American Naturalist
125.1, pp. 1–15 (page 73).
References 225
Felsenstein, J (1988). ‘Phylogenies from molecular sequences: inference and reliability’. In:
Annual Review of Genetics 22, pp. 521–565 (page 31).
Felsenstein, J (1992). ‘Estimating effective population size from samples of sequences: ineffi-
ciency of pairwise and segregating sites as compared to phylogenetic estimates’. In: Genetical
Research 59, pp. 139–147 (pages 29, 31).
Felsenstein, J (2001). ‘The troubled growth of statistical phylogenetics’. In: Systematic Biology
50.4, pp. 465–467 (pages 6, 19, 68).
Felsenstein, J (2004). Inferring phylogenies. Sunderland, MA: Sinauer Associates (pages 14, 98).
Felsenstein, J (2006). ‘Accuracy of coalescent likelihood estimates: do we need more sites, more
sequences, or more loci?’ In: Molecular Biology and Evolution 23.3, pp. 691–700 (pages 6,
131, 151).
Finlay, EK, C Gaillard, S Vahidi, et al. (2007). ‘Bayesian inference of population expansions in
domestic bovines’. In: Biology Letters 3.4, pp. 449–452 (page 133).
Firth, C, A Kitchen, B Shapiro, et al. (2010). ‘Using time-structured data to estimate evolutionary
rates of double-stranded DNA viruses’. In: Molecular Biology and Evolution 27, pp. 2038–
2051 (page 152).
Fisher, R (1930). Genetical theory of natural selection. Oxford: Clarendon Press (page 28).
FitzJohn, R (2010). ‘Quantitative traits and diversification’. In: Systematic Biology 59.6, pp. 619–
633 (page 39).
FitzJohn, RG, WP Maddison and SP Otto (2009). ‘Estimating trait-dependent speciation
and extinction rates from incompletely resolved phylogenies’. In: Systematic Biology 58.6,
pp. 595–611 (page 39).
Ford, CB, PL Lin, MR Chase, et al. (2011). ‘Use of whole genome sequencing to estimate the
mutation rate of Mycobacterium tuberculosis during latent infection’. In: Nature Genetics 43.5,
pp. 482–486 (page 66).
Fraser, C, CA Donnelly, S Cauchemenz, et al. (2009). ‘Pandemic potential of a strain of influenza
A (H1N1): early findings’. In: Science 324, pp. 1557–1561 (page 133).
Fu, YX (1994). ‘A phylogenetic estimator of effective population size or mutation rate’. In:
Genetics 136.2, pp. 685–692 (page 29).
Gavryushkina, A, D Welch and AJ Drummond (2013). ‘Recursive algorithms for phylogenetic
tree counting’. In: Algorithms in Molecular Biology 8, p. 26 (pages 25, 42).
Gavryushkina, A, D Welch, T Stadler and AJ Drummond (2014). ‘Bayesian inference of sampled
ancestor trees for epidemiology and fossil calibration’. In: arXiv preprint arXiv:1406.4573
(pages 42, 67, 111, 112).
Gelman, A and DB Rubin (1992). ‘A single series from the Gibbs sampler provides a false sense
of security’. In: Bayesian statistics. Ed. by J Bernardo, JO Berger, JO Dawid and AFM Smith.
Vol. 4. Oxford: Oxford University Press, pp. 625–631 (page 140).
Gelman, A, GO Roberts and WR Gilks (1996). ‘Efficient Metropolis jumping rules’. In: Bayesian
statistics. Ed. by JM Bernardo, JO Berger, AP Dawid and AFM Smith. Vol. 5. Oxford: Oxford
University Press, pp. 599–608 (page 89).
Gelman, A, J Carlin, H Stern and D Rubin (2004). Bayesian data analysis. 2nd edn. New York:
Chapman & Hall/CRC (pages 10, 140).
Geman, S and D Geman (1984). ‘Stochastic relaxation, Gibbs distribution, and the Bayesian
restoration of images’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 6,
pp. 721–741 (pages 16, 139).
Gernhard, T (2008). ‘The conditioned reconstructed process’. In: Journal of Theoretical Biology
253.4, pp. 769–778 (page 38).
226 References
Harvey, PH and MD Pagel (1991). The comparative method in evolutionary biology. Oxford:
Oxford University Press (page 73).
Hasegawa, M, H Kishino and T Yano (1985). ‘Dating the human–ape splitting by a molecular
clock of mitochondrial DNA’. In: Journal of Molecular Evolution 22, pp. 160–174 (pages 50,
104).
Hastings, W (1970). ‘Monte Carlo sampling methods using Markov chains and their applica-
tions’. In: Biometrika 57, pp. 97–109 (pages 14, 16, 20, 31).
He, M, F Miyajima, P Roberts, et al. (2013). ‘Emergence and global spread of epidemic
healthcare-associated Clostridium difficile’. In: Nature Genetics 45.1, pp. 109–113 (page 134).
Heath, TA, JP Huelsenbeck and T Stadler (2014). ‘The fossilized birth–death process for coherent
calibration of divergence-time estimates’. In: Proceedings of the National Academy of Sciences
USA 111.29, pp. 2957–2966 (pages 67, 112).
Heidelberger, P and PD Welch (1983). ‘Simulation run length control in the presence of an initial
transient’. In: Operations Research 31.6, pp. 1109–1144 (page 140).
Hein, J, M Schierup and C Wiuf (2004). Gene genealogies, variation and evolution: a primer in
coalescent theory. Oxford: Oxford University Press (page 27).
Heled, J and RR Bouckaert (2013). ‘Looking for trees in the forest: summary tree from posterior
samples’. In: BMC Evolutionary Biology 13.1, art. 221 (pages 159, 160, 163).
Heled, J and AJ Drummond (2008). ‘Bayesian inference of population size history from multiple
loci’. In: BMC Evolutionary Biology 8.1, art. 289 (pages 6, 17, 32, 110, 130, 131, 151).
Heled, J and AJ Drummond (2010). ‘Bayesian inference of species trees from multilocus data’.
In: Molecular Biology and Evolution 27.3, pp. 570–580 (pages 7, 8, 100, 113, 116, 119, 122).
Heled, J and AJ Drummond (2012). ‘Calibrated tree priors for relaxed phylogenetics and
divergence time estimation’. In: Systematic Biology 61.1, pp. 138–149 (pages 66, 84, 110,
111, 127).
Heled, J and AJ Drummond (2013). ‘Calibrated birth–death phylogenetic time-tree priors for
Bayesian inference’. In: arXiv preprint arXiv:1311.4921 (pages 66, 110, 111, 127).
Hey, J (2010). ‘Isolation with migration models for more than two populations’. In: Molecular
Biology and Evolution 27.4, pp. 905–920 (page 8).
Hillis, D, T Heath and K St John (2005). ‘Analysis and visualization of tree space’. In: Systematic
Biology 54.3, pp. 471–482 (page 159).
Ho, SY and B Shapiro (2011). ‘Skyline-plot methods for estimating demographic history from
nucleotide sequences’. In: Molecular Ecology Resources 11.3, pp. 423–434 (page 97).
Ho, SYW, MJ Phillips, AJ Drummond and A Cooper (2005). ‘Accuracy of rate estimation using
relaxed-clock models with a critical focus on the early Metazoan radiation’. In: Molecular
Biology and Evolution 22, pp. 1355–1363 (page 144).
Hoffman, J, S Grant, J Forcada and C Phillips (2011). ‘Bayesian inference of a historical
bottleneck in a heavily exploited marine mammal’. In: Molecular Ecology 20.19, pp. 3989–
4008 (page 133).
Hoffman, MD and A Gelman (2014). ‘The no-U-turn sampler: adaptively setting path lengths
in Hamiltonian Monte Carlo’. In: Journal of Machine Learning Research 15, pp. 1593–1623
(page 156).
Höhna, S and AJ Drummond (2012). ‘Guided tree topology proposals for Bayesian phylogenetic
inference’. In: Systematic Biology 61.1, pp. 1–11 (pages 113, 150, 156, 159, 161).
Höhna, S, T Stadler, F Ronquist and T Britton (2011). ‘Inferring speciation and extinction rates
under different sampling schemes’. In: Molecular Biology and Evolution 28.9, pp. 2577–2589
(page 98).
Holder, MT, PO Lewis, DL Swofford and B Larget (2005). ‘Hastings ratio of the LOCAL
proposal used in Bayesian phylogenetics’. In: Systematic Biology 54, pp. 961–965 (page 16).
Holder, MT, J Sukumaran and PO Lewis (2008). ‘A justification for reporting the
majority-rule consensus tree in Bayesian phylogenetics’. In: Systematic Biology 57.5, pp. 814–821
(page 160).
Holmes, EC and BT Grenfell (2009). ‘Discovering the phylodynamics of RNA viruses’. In: PLoS
Computational Biology 5.10, e1000505 (page 74).
Holmes, EC, LQ Zhang, P Simmonds, AS Rogers and AJ Leigh Brown (1993). ‘Molecular
investigation of human immunodeficiency virus (HIV) infection in a patient of an HIV-infected
surgeon’. In: Journal of Infectious Diseases 167.6, pp. 1411–1414 (page 31).
Hudson, RR (1987). ‘Estimating the recombination parameter of a finite population model
without selection’. In: Genetics Research 50.3, pp. 245–250 (page 29).
Hudson, RR (1990). ‘Gene genealogies and the coalescent process’. In: Oxford surveys in
evolutionary biology. Ed. by D Futuyma and J Antonovics. Vol. 7. Oxford: Oxford University
Press, pp. 1–44 (pages 29, 71, 72).
Hudson, RR and NL Kaplan (1985). ‘Statistical properties of the number of recombination events
in the history of a sample of DNA sequences’. In: Genetics 111.1, pp. 147–164 (page 29).
Huelsenbeck, JP and F Ronquist (2001). ‘MrBayes: Bayesian inference of phylogenetic trees’.
In: Bioinformatics 17, pp. 754–755 (page 14).
Huelsenbeck, JP, B Larget and DL Swofford (2000). ‘A compound Poisson process for relaxing
the molecular clock’. In: Genetics 154, pp. 1879–1892 (page 62).
Huelsenbeck, JP, F Ronquist, R Nielsen and JP Bollback (2001). ‘Bayesian inference of
phylogeny and its impact on evolutionary biology’. In: Science 294, pp. 2310–2314 (pages 6,
111).
Huelsenbeck, JP, B Larget and ME Alfaro (2004). ‘Bayesian phylogenetic model selection using
reversible jump Markov chain Monte Carlo’. In: Molecular Biology and Evolution 21.6,
pp. 1123–1133 (page 57).
Hurvich, CM and CL Tsai (1989). ‘Regression and time series model selection in small samples’.
In: Biometrika 76.2, pp. 297–307 (page 32).
Huson, DH and D Bryant (2006). ‘Application of phylogenetic networks in evolutionary studies’.
In: Molecular Biology and Evolution 23.2, pp. 254–267 (page 97).
Jackman, T, A Larson, KD Queiroz and J Losos (1999). ‘Phylogenetic relationships and tempo
of early diversification in Anolis lizards’. In: Systematic Biology 48.2, pp. 254–285 (page 157).
Jaynes, ET (2003). Probability theory: the logic of science. Cambridge: Cambridge University
Press (pages 10, 19).
Jeffreys, H (1946). ‘An invariant form for the prior probability in estimation problems’.
In: Proceedings of the Royal Society A: Mathematical and Physical Sciences 186.1007,
pp. 453–461 (page 14).
Jeffreys, H (1961). Theory of probability. 3rd edn. London: Oxford University Press (page 14).
Jenkins, GM, A Rambaut, OG Pybus and EC Holmes (2002). ‘Rates of molecular evolution in
RNA viruses: a quantitative phylogenetic analysis’. In: Journal of Molecular Evolution 54.2,
pp. 156–165 (pages 66, 152).
Jones, G (2011). ‘Calculations for multi-type age-dependent binary branching processes’. In:
Journal of Mathematical Biology 63.1, pp. 33–56 (page 39).
Jukes, T and C Cantor (1969). ‘Evolution of protein molecules’. In: Mammalian protein
metabolism. Ed. by H Munro. New York: Academic Press, pp. 21–132 (pages 44, 46, 103).
Kass, R and A Raftery (1995). ‘Bayes factors’. In: Journal of the American Statistical Association
90, pp. 773–795 (page 18).
Kass, RE, BP Carlin, A Gelman and RM Neal (1998). ‘Markov chain Monte Carlo in practice: a
roundtable discussion’. In: The American Statistician 52.2, pp. 93–100 (pages 140, 141).
Katoh, K and DM Standley (2013). ‘MAFFT multiple sequence alignment software version
7: improvements in performance and usability’. In: Molecular Biology and Evolution 30.4,
pp. 772–780 (page 9).
Katoh, K and DM Standley (2014). ‘MAFFT: iterative refinement and additional methods’. In:
Methods in Molecular Biology 1079, pp. 131–146 (page 9).
Katoh, K, K Misawa, K Kuma and T Miyata (2002). ‘MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform’. In: Nucleic Acids Research 30.14,
pp. 3059–3066 (page 9).
Keeling, MJ and P Rohani (2008). Modeling infectious diseases in humans and animals.
Princeton, NJ: Princeton University Press (page 34).
Kendall, DG (1948). ‘On the generalized “birth-and-death” process’. In: Annals of Mathematical
Statistics 19.1, pp. 1–15 (page 36).
Kendall, DG (1949). ‘Stochastic processes and population growth’. In: Journal of the Royal
Statistical Society, Series B 11.2, pp. 230–282 (page 73).
Kimura, M (1980). ‘A simple model for estimating evolutionary rates of base substitutions
through comparative studies of nucleotide sequences’. In: Journal of Molecular Evolution 16,
pp. 111–120 (page 48).
Kingman, J (1982). ‘The coalescent’. In: Stochastic Processes and Their Applications 13.3,
pp. 235–248 (pages 6, 28).
Kishino, H, JL Thorne and WJ Bruno (2001). ‘Performance of a divergence time estimation
method under a probabilistic model of rate evolution’. In: Molecular Biology and Evolution
18.3, pp. 352–361 (page 62).
Knuth, D (1997). The art of computer programming. Vol. 2. Seminumerical algorithms. Reading,
MA: Addison-Wesley (page 114).
Kouyos, RD, CL Althaus and S Bonhoeffer (2006). ‘Stochastic or deterministic: what is the
effective population size of HIV-1?’ In: Trends in Microbiology 14.12, pp. 507–511 (page 29).
Kubatko, LS and JH Degnan (2007). ‘Inconsistency of phylogenetic estimates from concatenated
data under coalescence’. In: Systematic Biology 56.1, pp. 17–24 (page 116).
Kuhner, MK (2006). ‘LAMARC 2.0: maximum likelihood and Bayesian estimation of population
parameters’. In: Bioinformatics 22.6, pp. 768–770 (pages 10, 14).
Kuhner, MK, J Yamato and J Felsenstein (1995). ‘Estimating effective population size and
mutation rate from sequence data using Metropolis–Hastings sampling’. In: Genetics 140,
pp. 1421–1430 (pages 6, 31).
Kuhner, MK, J Yamato and J Felsenstein (1998). ‘Maximum likelihood estimation of population
growth rates based on the coalescent’. In: Genetics 149, pp. 429–434 (pages 6, 31, 109).
Kuhner, MK, J Yamato and J Felsenstein (2000). ‘Maximum likelihood estimation of
recombination rates from population data’. In: Genetics 156.3, pp. 1393–1401 (pages 6, 10, 31).
Kühnert, D, CH Wu and AJ Drummond (2011). ‘Phylogenetic and epidemic modeling of rapidly
evolving infectious diseases’. In: Infection, Genetics and Evolution 11.8, pp. 1825–1841
(pages 8, 66, 75).
Kühnert, D, T Stadler, TG Vaughan and AJ Drummond (2014). ‘Simultaneous
reconstruction of evolutionary history and epidemiological dynamics from viral sequences with
the birth–death SIR model’. In: Journal of the Royal Society Interface 11.94 (pages 40, 75,
130).
Kuo, L and B Mallick (1998). ‘Variable selection for regression models’. In: Sankhya B 60,
pp. 65–81 (page 17).
Lakner, C, P van der Mark, JP Huelsenbeck, B Larget and F Ronquist (2008). ‘Efficiency of
Markov chain Monte Carlo tree proposals in Bayesian phylogenetics’. In: Systematic Biology
57.1, pp. 86–103 (page 156).
Lambert, DM, PA Ritchie, CD Millar, et al. (2002). ‘Rates of evolution in ancient DNA from
Adélie penguins’. In: Science 295, pp. 2270–2273 (pages 31, 66, 129).
Laplace, P (1812). Théorie analytique des probabilités. Paris: Courcier (page 13).
Larget, B (2013). ‘The estimation of tree posterior probabilities using conditional clade
probability distributions’. In: Systematic Biology 62.4, pp. 501–511 (pages 159, 161).
Larget, B and D Simon (1999). ‘Markov chain Monte Carlo algorithms for the Bayesian analysis
of phylogenetic trees’. In: Molecular Biology and Evolution 16, pp. 750–759 (page 14).
Larkin, M, G Blackshields, NP Brown, et al. (2007). ‘Clustal W and Clustal X version 2.0’. In:
Bioinformatics 23, pp. 2947–2948 (page 9).
Lartillot, N and H Philippe (2006). ‘Computing Bayes factors using thermodynamic integration’.
In: Systematic Biology 55, pp. 195–207 (page 136).
Lartillot, N, T Lepage and S Blanquart (2009). ‘PhyloBayes 3: a Bayesian software package for
phylogenetic reconstruction and molecular dating’. In: Bioinformatics 25.17, pp. 2286–2288
(page 14).
Leaché, AD and MK Fujita (2010). ‘Bayesian species delimitation in West African forest geckos
(Hemidactylus fasciatus)’. In: Proceedings of the Royal Society B: Biological Sciences 277,
pp. 3071–3077 (page 121).
Leaché, AD and B Rannala (2011). ‘The accuracy of species tree estimation under simulation: a
comparison of methods’. In: Systematic Biology 60.2, pp. 126–137 (page 119).
Leaché, AD, MK Fujita, VN Minin and RR Bouckaert (2014). ‘Species delimitation using
genome-wide SNP data’. In: Systematic Biology, syu018 (page 126).
Leitner, T and W Fitch (1999). ‘The phylogenetics of known transmission histories’. In: The
evolution of HIV. Ed. by KA Crandall. Baltimore, MD: Johns Hopkins University Press,
pp. 315–345 (page 41).
Lemey, P, OG Pybus, B Wang, et al. (2003). ‘Tracing the origin and history of the HIV-2
epidemic’. In: Proceedings of the National Academy of Sciences USA 100.11, pp. 6588–6592
(page 32).
Lemey, P, OG Pybus, A Rambaut, et al. (2004). ‘The molecular population genetics of HIV-1
group O’. In: Genetics 167.3, pp. 1059–1068 (pages 32, 152).
Lemey, P, A Rambaut, AJ Drummond and MA Suchard (2009a). ‘Bayesian phylogeography finds
its roots’. In: PLoS Computational Biology 5.9, e1000520 (pages 17, 69, 71, 72, 76, 134).
Lemey, P, MA Suchard and A Rambaut (2009b). ‘Reconstructing the initial global spread of
a human influenza pandemic: a Bayesian spatial–temporal model for the global spread of
H1N1pdm’. In: PLoS Currents RRN1031 (pages 71, 129, 134).
Lemey, P, A Rambaut, JJ Welch and MA Suchard (2010). ‘Phylogeography takes a relaxed
random walk in continuous space and time’. In: Molecular Biology and Evolution 27.8,
pp. 1877–1885 (pages 73, 134).
Lemey, P, A Rambaut, T Bedford, et al. (2014). ‘Unifying viral genetics and human transportation
data to predict the global transmission dynamics of human influenza H3N2’. In: PLoS
Pathogens 10.2, e1003932 (page 135).
Leonard, J, R Wayne, J Wheeler, et al. (2002). ‘Ancient DNA evidence for Old World origin of
New World dogs’. In: Science 298, p. 1613 (page 31).
Lepage, T, D Bryant, H Philippe and N Lartillot (2007). ‘A general comparison of relaxed
molecular clock models’. In: Molecular Biology and Evolution 24.12, pp. 2669–2680
(page 63).
Leventhal, GE, H Guenthard, S Bonhoeffer and T Stadler (2014). ‘Using an epidemiological
model for phylogenetic inference reveals density-dependence in HIV transmission’. In:
Molecular Biology and Evolution 31.1, pp. 6–17 (pages 34, 40, 75).
Levinson, G and GA Gutman (1987). ‘High frequencies of short frameshifts in poly-CA/TG
tandem repeats borne by bacteriophage M13 in Escherichia coli K-12’. In: Nucleic Acids
Research 15.13, pp. 5323–5338 (page 52).
Lewis, PO (2001). ‘A likelihood approach to estimating phylogeny from discrete morphological
character data’. In: Systematic Biology 50.6, pp. 913–925 (page 57).
Lewis, PO, MT Holder and KE Holsinger (2005). ‘Polytomies and Bayesian phylogenetic
inference’. In: Systematic Biology 54.2, pp. 241–253 (page 14).
Li, S, D Pearl and H Doss (2000). ‘Phylogenetic tree construction using Markov chain Monte
Carlo’. In: Journal of the American Statistical Association 95, pp. 493–508 (page 14).
Li, WLS and AJ Drummond (2012). ‘Model averaging and Bayes factor calculation of relaxed
molecular clocks in Bayesian phylogenetics’. In: Molecular Biology and Evolution 29.2,
pp. 751–761 (pages 63, 145).
Liu, L (2008). ‘BEST: Bayesian estimation of species trees under the coalescent model’. In:
Bioinformatics 24.21, pp. 2542–2543 (page 119).
Liu, L, DK Pearl, RT Brumfield and SV Edwards (2008). ‘Estimating species trees using
multiple-allele DNA sequence data’. In: Evolution 62.8, pp. 2080–2091 (page 7).
Liu, L, L Yu, L Kubatko, DK Pearl and SV Edwards (2009a). ‘Coalescent methods for
estimating phylogenetic trees’. In: Molecular Phylogenetics and Evolution 53.1, pp. 320–328
(page 7).
Liu, L, L Yu, DK Pearl and SV Edwards (2009b). ‘Estimating species phylogenies using
coalescence times among sequences’. In: Systematic Biology 58.5, pp. 468–477 (page 7).
Loreille, O, L Orlando, M Patou-Mathis, et al. (2001). ‘Ancient DNA analysis reveals divergence
of the cave bear, Ursus spelaeus, and brown bear, Ursus arctos, lineages’. In: Current Biology
11.3, pp. 200–203 (page 31).
Lunn, DJ, A Thomas, N Best and D Spiegelhalter (2000). ‘WinBUGS – a Bayesian modelling
framework: concepts, structure, and extensibility’. In: Statistics and Computing 10.4, pp. 325–
337 (page 156).
Lunn, D, D Spiegelhalter, A Thomas and N Best (2009). ‘The BUGS project: evolution, critique
and future directions’. In: Statistics in Medicine 28.25, pp. 3049–3067 (page 156).
Lunter, G, I Miklos, AJ Drummond, JL Jensen and J Hein (2005). ‘Bayesian coestimation of
phylogeny and sequence alignment’. In: BMC Bioinformatics 6, art. 83 (pages 4, 9, 99).
MacKay, DJ (2003). Information theory, inference and learning algorithms. Cambridge:
Cambridge University Press (page 10).
Maddison, DR and WP Maddison (2005). MacClade 4.08. Sunderland, MA: Sinauer Associates
(page 68).
Maddison, WP (2007). ‘Estimating a binary character’s effect on speciation and extinction’. In:
Systematic Biology 56.5, pp. 701–710 (pages 39, 69).
Matschiner, M and RR Bouckaert (2013). ‘A rough guide to CladeAge’. In: BEAST 2 wiki
(pages 66, 111).
Nee, SC (2001). ‘Inferring speciation rates from phylogenies’. In: Evolution 55.4, pp. 661–668
(page 107).
Nee, SC, EC Holmes, RM May and PH Harvey (1994a). ‘Extinction rates can be estimated from
molecular phylogenies’. In: Philosophical Transactions of the Royal Society B: Biological
Sciences 344.1307, pp. 77–82 (page 107).
Nee, SC, RM May and PH Harvey (1994b). ‘The reconstructed evolutionary process’. In:
Philosophical Transactions of the Royal Society B: Biological Sciences 344, pp. 305–311
(page 39).
Nee, SC, EC Holmes, A Rambaut and PH Harvey (1995). ‘Inferring population history from
molecular phylogenies’. In: Philosophical Transactions of the Royal Society B: Biological
Sciences 349.1327, pp. 25–31 (page 29).
Newton, M and A Raftery (1994). ‘Approximate Bayesian inference with the weighted likelihood
bootstrap’. In: Journal of the Royal Statistical Society, Series B 56, pp. 3–48 (page 18).
Nicholls, GK and RD Gray (2006). ‘Quantifying uncertainty in a stochastic model of vocabulary
evolution’. In: Phylogenetic methods and the prehistory of languages. Ed. by P Forster and C
Renfrew. Cambridge: McDonald Institute for Archaeological Research, pp. 161–171 (page 57).
Nicholls, GK and RD Gray (2008). ‘Dated ancestral trees from binary trait data and their
application to the diversification of languages’. In: Journal of the Royal Statistical Society,
Series B 70.3, pp. 545–566 (page 115).
Nielsen, R and Z Yang (1998). ‘Likelihood models for detecting positively selected amino acid
sites and applications to the HIV-1 envelope gene’. In: Genetics 148, pp. 929–936 (page 111).
Notredame, C, DG Higgins and J Heringa (2000). ‘T-coffee: a novel method for fast and accurate
multiple sequence alignment’. In: Journal of Molecular Biology 302.1, pp. 205–217 (page 9).
Novák, A, I Miklós, R Lyngsø and J Hein (2008). ‘StatAlign: an extendable software package
for joint Bayesian estimation of alignments and evolutionary trees’. In: Bioinformatics 24.20,
pp. 2403–2404 (pages 9, 99).
Nylander, JAA, JC Wilgenbusch, DL Warren and DL Swofford (2008). ‘AWTY (are we there
yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics’.
In: Bioinformatics 24.4, pp. 581–583 (pages 141, 156).
Ohta, T and M Kimura (1973). ‘A model of mutation appropriate to estimate the number of
electrophoretically detectable alleles in a finite population’. In: Genetical Research 22.2, pp. 201–204
(page 52).
Opgen-Rhein, R, L Fahrmeir and K Strimmer (2005). ‘Inference of demographic history from
genealogical trees using reversible jump Markov chain Monte Carlo’. In: BMC Evolutionary
Biology 5.1, art. 6 (page 32).
Owen, M and JS Provan (2011). ‘A fast algorithm for computing geodesic distances in tree space’.
In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 8.1, pp. 2–13
(page 154).
Pagel, MD and A Meade (2004). ‘A phylogenetic mixture model for detecting
pattern-heterogeneity in gene sequence or character-state data’. In: Systematic Biology 53.4,
pp. 571–581 (page 14).
Palacios, JA and VN Minin (2012). ‘Integrated nested Laplace approximation for Bayesian
nonparametric phylodynamics’. In: Proceedings of the 28th Conference on Uncertainty in
Artificial Intelligence. Ed. by N de Freitas and K Murphy. Sterling, VA: AUAI Press,
pp. 726–735 (page 33).
Palacios, JA and VN Minin (2013). ‘Gaussian process-based Bayesian nonparametric inference
of population trajectories from gene genealogies’. In: Biometrics 69.1, pp. 8–18 (page 33).
Palmer, D, J Frater, R Phillips, AR McLean and G McVean (2013). ‘Integrating genealogical and
dynamical modelling to infer escape and reversion rates in HIV epitopes’. In: Proceedings of
the Royal Society B: Biological Sciences 280.1762, art. 2013.0696 (pages 69, 75).
Pamilo, P and M Nei (1988). ‘Relationships between gene trees and species trees’. In: Molecular
Biology and Evolution 5.5, pp. 568–583 (pages 41, 116).
Penny, D, BJ McComish, MA Charleston and MD Hendy (2001). ‘Mathematical elegance with
biochemical realism: the covarion model of molecular evolution’. In: Journal of Molecular
Evolution 53.6, pp. 711–723 (page 57).
Pereira, L, F Freitas, V Fernandes, et al. (2009). ‘The diversity present in 5140 human
mitochondrial genomes’. In: American Journal of Human Genetics 84.5, pp. 628–640 (page 102).
Popinga, A, TG Vaughan, T Stadler and AJ Drummond (2015). ‘Inferring epidemiological
dynamics with Bayesian coalescent inference: the merits of deterministic and stochastic
models’. In: Genetics 199.2, pp. 595–607 (page 36).
Posada, D (2008). ‘jModelTest: phylogenetic model averaging’. In: Molecular Biology and
Evolution 25.7, pp. 1253–1256 (page 145).
Posada, D and KA Crandall (1998). ‘Modeltest: testing the model of DNA substitution’. In:
Bioinformatics 14.9, pp. 817–818 (page 145).
Procter, JB, J Thompson, I Letunic, et al. (2010). ‘Visualization of multiple alignments,
phylogenies and gene family evolution’. In: Nature Methods 7, S16–S25 (page 156).
Pybus, OG and A Rambaut (2002). ‘GENIE: estimating demographic history from molecular
phylogenies’. In: Bioinformatics 18.10, pp. 1404–1405 (page 29).
Pybus, OG and A Rambaut (2009). ‘Evolutionary analysis of the dynamics of viral infectious
disease’. In: Nature Reviews Genetics 10.8, pp. 540–550 (pages 8, 75).
Pybus, OG, A Rambaut and PH Harvey (2000). ‘An integrated framework for the inference of
viral population history from reconstructed genealogies’. In: Genetics 155, pp. 1429–1437
(pages 29, 32).
Pybus, OG, MA Charleston, S Gupta, et al. (2001). ‘The epidemic behavior of the hepatitis C
virus’. In: Science 292, pp. 2323–2325 (page 29).
Pybus, OG, AJ Drummond, T Nakano, BH Robertson and A Rambaut (2003). ‘The epidemiology
and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach’. In:
Molecular Biology and Evolution 20.3, pp. 381–387 (pages 29, 32, 106).
Pybus, OG, E Barnes, R Taggart, et al. (2009). ‘Genetic history of hepatitis C virus in East Asia’.
In: Journal of Virology 83.2, pp. 1071–1082 (page 33).
Pybus, OG, MA Suchard, P Lemey, et al. (2012). ‘Unifying the spatial epidemiology and
molecular evolution of emerging epidemics’. In: Proceedings of the National Academy of
Sciences USA 109.37, pp. 15 066–15 071 (page 135).
Pyron, RA (2011). ‘Divergence time estimation using fossils as terminal taxa and the origins of
Lissamphibia’. In: Systematic Biology, syr047 (page 57).
Rabosky, D (2007). ‘Likelihood methods for detecting temporal shifts in diversification rates’. In:
Evolution 60.6, pp. 1152–1164 (page 39).
Raftery, AE and SM Lewis (1992). ‘[Practical Markov chain Monte Carlo]: comment: one long
run with diagnostics: implementation strategies for Markov chain Monte Carlo’. In: Statistical
Science 7.4, pp. 493–497 (page 140).
Raftery, A, M Newton, J Satagopan and P Krivitsky (2007). ‘Estimating the integrated likelihood
via posterior simulation using the harmonic mean identity’. In: Bayesian statistics. Ed. by JM
Bernardo, MJ Bayarri, JO Berger et al. Vol. 8. Oxford: Oxford University Press, pp. 1–45
(page 18).
Stadler, T (2013a). ‘Recovering speciation and extinction dynamics based on phylogenies’. In:
Journal of Evolutionary Biology 26.6, pp. 1203–1219 (page 39).
Stadler, T (2013b). ‘How can we improve accuracy of macroevolutionary rate estimates?’ In:
Systematic Biology 62.2, pp. 321–329 (pages 37, 38).
Stadler, T and S Bonhoeffer (2013). ‘Uncovering epidemiological dynamics in heterogeneous
host populations using phylogenetic methods’. In: Philosophical Transactions of the Royal
Society B: Biological Sciences 368.1614 (pages 40, 70, 73, 75).
Stadler, T, R Kouyos, V von Wyl, et al. (2012). ‘Estimating the basic reproductive number from
viral sequence data’. In: Molecular Biology and Evolution 29, pp. 347–357 (page 40).
Stadler, T, D Kühnert, S Bonhoeffer and AJ Drummond (2013). ‘Birth–death skyline plot reveals
temporal changes of epidemic spread in HIV and HCV’. In: Proceedings of the National
Academy of Sciences USA 110.1, pp. 228–233 (pages 39, 40, 75, 108, 130, 132).
Stadler, T, TG Vaughan, A Gavruskin, et al. (2015). ‘How well can the exponential-growth
coalescent approximate constant rate birth–death population dynamics?’ In: Proceedings of
the Royal Society B: Biological Sciences, in press (pages 36, 38).
Stan Development Team (2014). Stan: A C++ Library for probability and sampling. Version 2.4
(page 156).
Steel, M (2005). ‘Should phylogenetic models be trying to “fit an elephant”?’ In: Trends in
Genetics 21.6, pp. 307–309 (page 149).
Steel, M and A Mooers (2010). ‘The expected length of pendant and interior edges of a Yule tree’.
In: Applied Mathematics Letters 23.11, pp. 1315–1319 (page 110).
Stephens, M and P Donnelly (2000). ‘Inference in molecular population genetics’. In: Journal of
the Royal Statistical Society, Series B 62.4, pp. 605–655 (page 31).
Stewart, WJ (1994). Introduction to the numerical solution of Markov chains. Vol. 41. Princeton,
NJ: Princeton University Press (pages 12, 45).
Stockham, C, LS Wang and T Warnow (2002). ‘Statistically based postprocessing of phylogenetic
analysis by clustering’. In: Bioinformatics 18.suppl 1, S285–S293 (page 161).
Strimmer, K and OG Pybus (2001). ‘Exploring the demographic history of DNA sequences using
the generalized skyline plot’. In: Molecular Biology and Evolution 18.12, pp. 2298–2305
(page 32).
Suchard, MA and A Rambaut (2009). ‘Many-core algorithms for statistical phylogenetics’. In:
Bioinformatics 25.11, pp. 1370–1376 (pages 71, 114).
Suchard, MA and BD Redelings (2006). ‘BAli-Phy: simultaneous Bayesian inference of
alignment and phylogeny’. In: Bioinformatics 22.16, pp. 2047–2048 (pages 9, 99).
Suchard, MA, RE Weiss and JS Sinsheimer (2001). ‘Bayesian selection of continuous-time
Markov chain evolutionary models’. In: Molecular Biology and Evolution 18, pp. 1001–1013
(page 147).
Suchard, MA, RE Weiss and JS Sinsheimer (2003). ‘Testing a molecular clock without an
outgroup: derivations of induced priors on branch length restrictions in a Bayesian framework’.
In: Systematic Biology 52, pp. 48–54 (page 99).
Sullivan, J and DL Swofford (2001). ‘Should we use model-based methods for phylogenetic
inference when we know that assumptions about among-site rate variation and nucleotide
substitution pattern are violated?’ In: Systematic Biology 50, pp. 723–729 (page 54).
Sullivan, J, DL Swofford and GJ Naylor (1999). ‘The effect of taxon sampling on estimating
rate heterogeneity parameters of maximum-likelihood models’. In: Molecular Biology and
Evolution 16, pp. 1347–1356 (page 54).
Swofford, DL (2003). PAUP*: phylogenetic analysis using parsimony (* and other methods).
Version 4. Sunderland, MA: Sinauer Associates (page 68).
Tajima, F (1983). ‘Evolutionary relationship of DNA sequences in finite populations’. In:
Genetics 105.2, pp. 437–460 (pages 29, 43).
Tajima, F (1989). ‘DNA polymorphism in a subdivided population: the expected number of
segregating sites in the two-subpopulation model’. In: Genetics 123.1, pp. 229–240 (page 29).
Takahata, N (1989). ‘Gene genealogy in three related populations: consistency probability
between gene and population trees’. In: Genetics 122.4, pp. 957–966 (page 29).
Tamura, K and M Nei (1993). ‘Estimation of the number of nucleotide substitutions in the
control region of mitochondrial DNA in humans and chimpanzees’. In: Molecular Biology
and Evolution 10, pp. 512–526 (page 103).
Teixeira, S, EA Serrão and S Arnaud-Haond (2012). ‘Panmixia in a fragmented and unstable
environment: the hydrothermal shrimp Rimicaris exoculata disperses extensively along the
mid-Atlantic Ridge’. In: PLoS One 7.6, e38521 (page 133).
Thorne, JL and H Kishino (2002). ‘Divergence time and evolutionary rate estimation with
multilocus data’. In: Systematic Biology 51.5, pp. 689–702 (page 62).
Thorne, JL, H Kishino and IS Painter (1998). ‘Estimating the rate of evolution of the rate of
molecular evolution’. In: Molecular Biology and Evolution 15.12, pp. 1647–1657 (page 62).
Vaughan, TG and AJ Drummond (2013). ‘A stochastic simulator of birth–death master equations
with application to phylodynamics’. In: Molecular Biology and Evolution 30.6, pp. 1480–1493
(pages 43, 138).
Vaughan, TG, D Kühnert, A Popinga, D Welch and AJ Drummond (2014). ‘Efficient Bayesian
inference under the structured coalescent’. In: Bioinformatics, btu201 (pages 14, 71, 72).
Vijaykrishna, D, GJ Smith, OG Pybus, et al. (2011). ‘Long-term evolution and transmission
dynamics of swine influenza A virus’. In: Nature 473.7348, pp. 519–522 (page 129).
Volz, EM (2012). ‘Complex population dynamics and the coalescent under neutrality’. In:
Genetics 190.1, pp. 187–201 (pages 35, 69, 75).
Volz, EM, SL Kosakovsky Pond, MJ Ward, AJ Leigh Brown and SDW Frost (2009).
‘Phylodynamics of infectious disease epidemics’. In: Genetics 183.4, pp. 1421–1430 (page 75).
Volz, EM, K Koelle and T Bedford (2013). ‘Viral phylodynamics’. In: PLoS Computational
Biology 9.3, e1002947 (pages 8, 75).
Waddell, P and D Penny (1996). ‘Evolutionary trees of apes and humans from DNA sequences’.
In: Handbook of human symbolic evolution. Ed. by AJ Lock and CR Peters. Oxford: Clarendon Press,
pp. 53–73 (page 54).
Wakeley, J and O Sargsyan (2009). ‘Extensions of the coalescent effective population size’. In:
Genetics 181.1, pp. 341–345 (page 29).
Wallace, R, H HoDac, R Lathrop and W Fitch (2007). ‘A statistical phylogeography of influenza
A H5N1’. In: Proceedings of the National Academy of Sciences USA 104.11, pp. 4473–4478
(page 68).
Welch, D (2011). ‘Is network clustering detectable in transmission trees?’ In: Viruses 3.6,
pp. 659–676 (page 75).
Whidden, C and FA Matsen IV (2014). ‘Quantifying MCMC exploration of phylogenetic
tree space’. In: arXiv preprint arXiv:1405.2120 (pages 150, 156).
Wiens, JJ and DS Moen (2008). ‘Missing data and the accuracy of Bayesian phylogenetics’. In:
Journal of Systematics and Evolution 46.3, pp. 307–314 (page 99).
Wilson, AC and VM Sarich (1969). ‘A molecular time scale for human evolution’. In:
Proceedings of the National Academy of Sciences USA 63.4, pp. 1088–1093 (page 60).
Wilson, IJ and DJ Balding (1998). ‘Genealogical inference from microsatellite data’. In: Genetics
150.1, pp. 499–510 (page 14).
Wolinsky, S, B Korber, A Neumann, et al. (1996). ‘Adaptive evolution of human
immunodeficiency virus type-1 during the natural course of infection’. In: Science 272, pp. 537–542
(page 31).
Wong, KM, MA Suchard and JP Huelsenbeck (2008). ‘Alignment uncertainty and genomic
analysis’. In: Science 319, pp. 473–476 (page 9).
Worobey, M, M Gemmel, DE Teuwen, et al. (2008). ‘Direct evidence of extensive diversity of
HIV-1 in Kinshasa by 1960’. In: Nature 455.7213, pp. 661–664 (page 32).
Worobey, M, P Telfer, S Souquière, et al. (2010). ‘Island biogeography reveals the deep history
of SIV’. In: Science 329, p. 1487 (page 129).
Wright, S (1931). ‘Evolution in Mendelian populations’. In: Genetics 16.2, pp. 97–159 (page 28).
Wu, CH and AJ Drummond (2011). ‘Joint inference of microsatellite mutation models, population
history and genealogies using transdimensional Markov chain Monte Carlo’. In: Genetics
188.1, pp. 151–164 (pages 17, 52).
Wu, CH, MA Suchard and AJ Drummond (2013). ‘Bayesian selection of nucleotide substitution
models and their site assignments’. In: Molecular Biology and Evolution 30.3, pp. 669–688
(pages 17, 57, 100, 145).
Xie, W, P Lewis, Y Fan, L Kuo and M Chen (2011). ‘Improving marginal likelihood estimation
for Bayesian phylogenetic model selection’. In: Systematic Biology 60, pp. 150–160 (pages 19,
136, 137).
Yang, Z (1994). ‘Maximum likelihood phylogenetic estimation from DNA sequences with
variable rates over sites: approximate methods’. In: Journal of Molecular Evolution 39.3,
pp. 306–314 (pages 54, 103).
Yang, Z and B Rannala (1997). ‘Bayesian phylogenetic inference using DNA sequences: a
Markov chain Monte Carlo method’. In: Molecular Biology and Evolution 14.7, pp. 717–724
(pages 14, 37, 108).
Yang, Z and B Rannala (2006). ‘Bayesian estimation of species divergence times under a
molecular clock using multiple fossil calibrations with soft bounds’. In: Molecular Biology
and Evolution 23.1, pp. 212–226 (page 66).
Yang, Z and A Yoder (1999). ‘Estimation of the transition/transversion rate bias and species
sampling’. In: Journal of Molecular Evolution 48.3, pp. 274–283 (page 102).
Yang, Z, N Goldman and A Friday (1995). ‘Maximum likelihood trees from DNA sequences: a
peculiar statistical estimation problem’. In: Systematic Biology 44.3, pp. 384–399 (page 146).
Yang, Z, R Nielsen, N Goldman and A Pedersen (2000). ‘Codon-substitution models for
heterogeneous selection pressure at amino acid sites’. In: Genetics 155.1, pp. 431–449 (page 111).
Yoder, A and Z Yang (2000). ‘Estimation of primate speciation dates using local molecular
clocks’. In: Molecular Biology and Evolution 17.7, pp. 1081–1090 (page 63).
Ypma, RJF, WM van Ballegooijen and J Wallinga (2013). ‘Relating phylogenetic trees to
transmission trees of infectious disease outbreaks’. In: Genetics 195.3, pp. 1055–1062
(page 43).
Yu, Y, C Than, JH Degnan and L Nakhleh (2011). ‘Coalescent histories on phylogenetic networks
and detection of hybridization despite incomplete lineage sorting’. In: Systematic Biology 60.2,
pp. 138–149 (page 120).
Yule, G (1924). ‘A mathematical theory of evolution based on the conclusions of Dr. J.C. Willis’.
In: Philosophical Transactions of the Royal Society B: Biological Sciences 213, pp. 21–87
(pages 36, 38).
Index