Molecular Evolution Lecture Notes: Anders Gorm Pedersen
Lecture Notes
Anders Gorm Pedersen
[email protected]
February 4, 2005
Chapter 1
Brief Introduction to
Evolutionary Theory
© Anders Gorm Pedersen, 2005
1.1 Classification
One of the main goals of early biological research was classification, i.e., the
systematic arrangement of living organisms into categories reflecting their
natural relationships. The most successful system was invented by the Swede
Carl Linnaeus, and presented in his book “Systema Naturae”, first published
in 1735. The system we use today is essentially the one devised by Linnaeus.
It is a hierarchical system with seven major ranks: kingdom, phylum, class,
order, family, genus, and species.
[Margin note: Carl Linnaeus (1707–1778)]
Specifically, groups of similar species are placed together in a genus,
groups of related genera are placed together in a family, families are
grouped into orders, orders into classes, classes into phyla, and phyla into
kingdoms. When depicted graphically, the Linnean system can be shown in the
form of a tree with individual species at the tips, and with internal nodes
in the tree representing higher-level categories (Fig. 1.1). Along with this
classification system, Linnaeus also developed the so-called binomial system
in which all organisms are identified by a two-part Latinized name. The
first name is capitalized and identifies the genus, while the second
identifies the species within that genus. For example, the genus Canis
includes Canis lupus, the wolf, Canis latrans, the coyote, and Canis
familiaris, the domestic dog. Similarly, the genus Vulpes contains Vulpes
vulpes, the red fox, Vulpes chama, the Cape fox, and others. Both genera
(Canis and Vulpes) belong to the family Canidae, which in its turn is part
of the order Carnivora, the carnivores.
The Linnean system:
• Kingdom
• Phylum
• Class
• Order
• Family
• Genus
• Species
Note that it is non-trivial to come up with a generally applicable
definition of what exactly a “species” is. According to the so-called
biological species concept, a species is a group of “actually or potentially
interbreeding natural populations which are reproductively isolated from
other such groups”. This definition is due to the evolutionary biologist
Ernst Mayr (1904–) and is perhaps what most people intuitively understand by
the word “species”. However, the biological species concept does not address
the issue of how to define species within groups of organisms that do not
reproduce sexually (e.g., bacteria), or when organisms are known only from
fossils. An alternative definition is the morphological species concept
which states that “species are groups of organisms that share certain
morphological or biochemical traits”. This definition is more broadly
applicable, but is also far more subjective than Mayr’s.
species of animals. Linnaeus believed that God was the ordering principle
behind this classification system, and that its structure somehow reflected
the divine master plan.
It was not until after the 1859 publication of Charles Darwin’s “On the
Origin of Species” that an alternative explanation was widely accepted. Ac-
cording to Darwin (and others), the ordering principle behind the Linnean
system was instead a history of “common descent with modification”: all life
was believed to have evolved from one—or a few—common ancestors, and
taxonomic groupings were simply manifestations of the tree-shaped evolu-
tionary history connecting all present-day species (Fig. 1.2).
The theory of common descent did not in itself address the issue of how
evolutionary change takes place, but it was able to explain a great deal
of puzzling observations. For instance, similar species are often found in
adjacent or overlapping geographical regions, and fossils often resemble (but
are different from) present-day species living in the same location. These
phenomena are easily explained as the result of divergence from a common
ancestor, but have no clear cause if one assumes that each species has been
created individually.
1. In each generation, more offspring are born than the environment can
support; a fraction of the offspring therefore die before reaching
reproductive age.
If all four postulates are true (and this is generally the case) then advan-
tageous traits will automatically tend to spread in the population, which
thereby changes gradually through time. This is natural selection. Let us
consider, for instance, a population of butterflies that are preyed upon by
birds. Now imagine that at some point a butterfly is born with a muta-
tion that makes the butterfly more difficult to detect. This butterfly will
obviously have a smaller risk of being eaten, and will consequently have an
increased chance of surviving to produce offspring. A fraction of the fortu-
nate butterfly’s offspring will inherit the advantageous mutation, and in the
next generation there will therefore be several butterflies with an improved
chance of surviving to produce offspring. After a number of generations it
is possible that all butterflies will have the mutation, which is then said to
be “fixed”.
If two sub-populations of a species are somehow separated (for instance
due to a geographical barrier), then it is hypothesized that this process may
lead to the gradual build-up of differences to the point where the populations
are in fact separate species. This process is called speciation.
One problem with the theory described in “Origin of Species” was that its
genetic basis (the nature of heritability) was entirely unknown. In later
editions of the book, Darwin proposed a model of inheritance where “hered-
itary substances” from the two parents merge physically in the offspring, so
that the hereditary substance in the offspring will be intermediate in form
(much like blending red and white paint results in pink paint). Such “blend-
ing inheritance” is in fact incompatible with evolution by natural selection,
since the constant blending will quickly result in a completely homogeneous
population from which the original, advantageous trait cannot be recovered
(in the same way it is impossible to extract red paint from pink paint).
Moreover, due to the much higher frequency of the original trait, the result-
ing homogeneous mixture will be very close to the original trait, and very far
from the advantageous one. (In the paint analogy, if one single red butterfly
is born at some point, then it will have to mate with a white butterfly re-
sulting in pink offspring. The offspring will most probably mate with white
butterflies and their offspring will be a lighter shade of pink, etc., etc. In the
long run, the population will end up being a very, very light shade of pink,
instead of all red).
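The dilution argument in the paint analogy can be checked numerically. The
following is a minimal sketch and not part of the original notes: it assumes
that colour is a single number (1.0 for red, 0.0 for white), that an
offspring is the exact average of its two parents, and that every descendant
of the red butterfly mates with a white partner. The function name
blend_lineage is hypothetical.

```python
def blend_lineage(generations, red=1.0, white=0.0):
    """Return the colour of one lineage under blending inheritance.

    Assumption (illustrative only): colour is a number between 0 (white)
    and 1 (red), and each offspring is the exact average of its parents.
    Every descendant of the single red butterfly mates with a white one.
    """
    shades = [red]
    for _ in range(generations):
        # Blending: the offspring is intermediate between the two parents.
        shades.append((shades[-1] + white) / 2)
    return shades


if __name__ == "__main__":
    for generation, shade in enumerate(blend_lineage(5)):
        print(generation, shade)
```

Under these assumptions the red contribution is halved in every generation,
so the pure red trait is never recovered.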
However, as shown by the Austrian monk Gregor Mendel (1822–1884),
inheritance is in fact particulate in nature: parental genes do not merge
physically; instead they are retained in their original form within the
offspring, making it possible for the pure, advantageous trait to be
recovered and, eventually, to be fixed by natural selection. Although Mendel
published his work in 1866, it was not widely noticed until around 1900, and
not until the 1930s was Mendelian genetics fully integrated into
evolutionary theory (the so-called “Modern Synthesis”). This led to the
creation of the new science of population genetics, which now forms the
theoretical basis for all evolutionary biology.
Figure 1.3: Mendelian genetics. Each diploid parent contains two alleles. The s
allele is recessive and results in wrinkled peas when present in two copies.
haploid cells). There are also organisms (e.g., ferns) where the life cycle
alternates between a haploid, multicellular generation and a diploid, multi-
cellular generation. Asexual reproduction is seen in both haploid organisms
(e.g., bacteria) and diploid organisms (e.g., yeast and some plants).
Chapter 2
Brief Introduction to
Population Genetics
2.1 Introduction
The science of population genetics deals with genetic variation within pop-
ulations, and with the forces that change this variation. I will now give you
a very brief introduction to a few important results in the field.
My goal with this section is mostly to make you aware of some of the
ways in which evolutionary theory can be approached in a stringent, quanti-
tative manner. Specifically, we will discuss the effects that growth, selection,
mutation, and genetic drift have on the genetic composition of a population.
Most of the concepts will be introduced in the context of haploid, asexually
reproducing organisms since that makes the analysis easier.
The material covered here does not directly relate to reconstruction of
phylogenetic trees. However, any evolutionary history is necessarily the
result of processes that resemble the ones described in this section, and it
is therefore relevant to have at least passing knowledge of the underlying
theory.
$$N_1 = N_0 \times R$$
The population size after two generations can be found by multiplying this
[Figure: population size, N, plotted against generation no., t]
[Figure: population size, N, plotted against generation no., t]
Figure 2.3: Logistic growth. The plot shows the growth of a population with
initial size N0 = 100, rate of increase r = 1.1, and carrying capacity
K = 10,000. (Axes: population size, N, against generation no., t.)
$$N_{A,0} = pN_0$$
$$N_{a,0} = qN_0$$
We again assume that the average, per capita reproductive rate (R) of the
entire population remains constant in subsequent generations. We further-
more assume that the two genotypes have the same growth rate (Thus,
RA = Ra = R). The average, per capita life-time reproductive rate of a
genotype is also referred to as that genotype’s “absolute fitness”. From
equation 2.1 we have the following expressions for the number of individuals
with genotypes A and a after one generation:
$$\begin{aligned}
N_1 &= N_{A,1} + N_{a,1} \\
    &= pN_0R + qN_0R \\
    &= N_0R \times (p + q) \\
    &= N_0R
\end{aligned}$$
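The derivation above can be checked with a few lines of code. This is a
minimal sketch, not taken from the notes: the function names and the starting
values (N0 = 100, p = 0.25, R = 3) are chosen purely for illustration. It
shows that when both genotypes share the same absolute fitness R, the total
population grows by a factor R per generation while the frequency of A stays
constant.

```python
def next_generation(n_A, n_a, R_A, R_a):
    """Numbers of genotypes A and a after one generation of growth."""
    return n_A * R_A, n_a * R_a


def frequency_of_A(n_A, n_a):
    """Frequency p of genotype A in the population."""
    return n_A / (n_A + n_a)


if __name__ == "__main__":
    N0, p, R = 100, 0.25, 3  # illustrative values, not from the notes
    n_A, n_a = p * N0, (1 - p) * N0
    for t in range(1, 6):
        n_A, n_a = next_generation(n_A, n_a, R, R)
        print(t, n_A + n_a, frequency_of_A(n_A, n_a))
```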
2.4 Selection
Let us now consider the more interesting case where two haploid genotypes
do not have the same absolute fitness. We will again investigate an example
where the alleles A and a are present at a locus in a haploid organism that
reproduces as described above. Let us imagine that the absolute fitness of
genotype A is RA = 4, while that of genotype a is Ra = 2. Recall that
for organisms such as the one we are examining here, R is the product of
fecundity and survival rate. It is therefore possible that the difference in
fitness between the two genotypes is caused by differential fecundity,
differential survival rate, or both. Let us say, for instance, that genotype
A has a fecundity of 200 offspring per generation, and a survival rate of 2%,
giving RA = 200 × 2% = 4. We may further imagine that genotype a has
a higher fecundity (400 offspring per generation) but a much lower survival
rate (0.5%) resulting in an overall fitness that is half that of genotype A
(Ra = 400 × 0.5% = 2).
[Figure 2.4: population size, N, of genotype A and genotype a plotted against generation no., t]
The number of individuals with genotype A therefore grows faster than
the number of individuals with genotype a. An example of this is shown
in Figure 2.4. Table 2.1 gives the corresponding genotype numbers and
frequencies, and includes a few extra generations compared to the figure.
In this example, the initial population consists of one single individual with
genotype A (perhaps a newly created mutation), and 99 individuals with
genotype a. It can be seen how the proportion of individuals with genotype
A rapidly increases from the initial 1%, and after 7 generations A is the
predominant genotype. After only 10 generations genotype A makes up more
than 90% of the population, and intuitively it seems to be approaching 100%
asymptotically (Table 2.1). But instead of guessing, we should of course
develop a mathematical model that we can use to predict the genotype
frequencies at any time t.
 t       NA      Na     Ntot     p     q
 0        1      99      100  0.01  0.99
 1        4     198      202  0.02  0.98
 2       16     396      412  0.04  0.96
 3       64     792      856  0.07  0.93
 4      256    1584     1840  0.14  0.86
 5     1024    3168     4192  0.24  0.76
 6     4096    6336    10432  0.39  0.61
 7    16384   12672    29056  0.56  0.44
 8    65536   25344    90880  0.72  0.28
 9   262144   50688   312832  0.84  0.16
10  1048576  101376  1149952  0.91  0.09
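The quantities tabulated in Table 2.1 can be recomputed directly from the
absolute fitnesses. The sketch below is an illustration rather than the
author's own code; the function name selection_table is hypothetical, and the
default arguments are the values used in the running example (one A
individual, 99 a individuals, RA = 4, Ra = 2).

```python
def selection_table(n_A=1, n_a=99, R_A=4, R_a=2, generations=10):
    """Rows (t, N_A, N_a, N_tot, p, q) for two haploid genotypes."""
    rows = []
    for t in range(generations + 1):
        n_tot = n_A + n_a
        rows.append((t, n_A, n_a, n_tot, n_A / n_tot, n_a / n_tot))
        # Each genotype multiplies by its own absolute fitness.
        n_A, n_a = n_A * R_A, n_a * R_a
    return rows


if __name__ == "__main__":
    print(" t       NA      Na     Ntot     p     q")
    for t, n_A, n_a, n_tot, p, q in selection_table():
        print(f"{t:2d} {n_A:8d} {n_a:7d} {n_tot:8d}  {p:.2f}  {q:.2f}")
```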
$$N_{A,0} = pN_0$$
$$N_{a,0} = qN_0$$
The term $R_a/R_A$ is the so-called “relative fitness” of allele a. The
relative fitness of a genotype (usually denoted W) is the fitness of that
genotype relative
[Figure 2.5: frequency of genotype A and genotype a plotted against generation no., t]
Substituting $W_a$ for $R_a/R_A$ in equation 2.4, we get:
$$p_t = \frac{p}{p + qW_a^t} \tag{2.5}$$
And here, finally, is our result: equation 2.5 enables us to compute how
natural selection changes the frequencies of genotypes A and a over time.
(The frequency of genotype a can, for instance, be found as qt = 1 − pt.) An
important conclusion from equation 2.5 is that the effect of natural selection
only depends on the relative fitness. This means that we would get the same
change in frequency regardless of whether the absolute fitnesses of A and a
were, for instance, 10 and 5, or 6 and 3, or even 0.8 and 0.4 respectively.
The so-called selection coefficient (s) is often used instead of the relative
fitness W . The selection coefficient against the least fit allele is defined as
s = 1 − W where W is the relative fitness of the least fit allele. This means
that W = 1 − s and equation 2.5 can of course be rewritten by substituting
(1 − s) for Wa , if one is interested in expressing the frequencies in terms of
selection coefficients instead of relative fitness.
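Equation 2.5 is easy to evaluate numerically. The following sketch is
illustrative only (the function names are hypothetical): the first function
implements equation 2.5, and the second obtains the same frequency by
explicitly growing both genotypes with their absolute fitnesses. Comparing
the two shows that only the relative fitness matters.

```python
def freq_A(p0, W_a, t):
    """Frequency of genotype A after t generations (equation 2.5)."""
    q0 = 1 - p0
    return p0 / (p0 + q0 * W_a ** t)


def freq_A_by_counting(p0, R_A, R_a, t, N0=100.0):
    """Same frequency, found by growing both genotypes explicitly."""
    n_A = p0 * N0 * R_A ** t
    n_a = (1 - p0) * N0 * R_a ** t
    return n_A / (n_A + n_a)


if __name__ == "__main__":
    # Absolute fitnesses differ, but the relative fitness is always 0.5.
    for R_A, R_a in ((4, 2), (10, 5), (6, 3), (0.8, 0.4)):
        print(R_A, R_a, freq_A_by_counting(0.01, R_A, R_a, 10))
    s = 0.5  # selection coefficient, so W_a = 1 - s
    print(freq_A(0.01, 1 - s, 10))
```

All lines printed by this example should show the same frequency, up to
floating-point rounding.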
Figure 2.5 shows how the genotype frequencies change over time in our
example (i.e., when Wa = 0.5). A relative fitness of 0.5 corresponds to a
selection coefficient of s = 1 − 0.5 = 0.5. It can be seen that even though
genotype A has an initial frequency of only 1%, it has essentially reached
fixation after just 16 generations. It should be noted that a selection
coefficient of 0.5 is quite high, but it is not unrealistic. For instance, it
has been estimated that natural selection acting on the melanic peppered
moths that spread in industrial Britain during the 1800s involved a selection
coefficient of approximately 0.3. It is believed that this selection was
driven by the dark, melanic forms being harder to detect on soot-covered
tree bark compared to the lighter, more easily spotted form of the moth.
Selection for drug resistance in HIV and pesticide resistance in mosquitoes
has also been reported to be of this magnitude.
Figure 2.6 shows another example where the a allele has a relative fitness
of 0.99 (corresponding to a selection coefficient s = 0.01). In this example
genotype A has essentially reached fixation after 1000 generations. It is
important to note that there are situations where natural selection will not
lead to fixation of one allele, but will instead act to maintain a certain level
of diversity at a locus. One example of this is when a diploid organism
that is heterozygous at some locus has higher fitness than either of the two
homozygotes.
There is one final issue we can investigate using this simple model of
selection. You may have noted from figures 2.5 and 2.6 that the frequency of
A appears to change rapidly when both genotypes are fairly common, while
it changes more slowly when one genotype predominates. Let us derive an
expression that sheds light on this problem. The frequency of genotype A
is initially p. Using equation 2.5, we can see that after one generation, the
frequency of A will be:
$$p_1 = \frac{p}{p + qW_a}$$
We can rewrite this expression using that p = 1 − q and that Wa = (1 − s):
$$p_1 = \frac{p}{(1 - q) + q(1 - s)} = \frac{p}{1 - q + q - sq} = \frac{p}{1 - sq} \tag{2.6}$$
This expression tells us the frequency after one generation has passed. The
amount of change during one generation can now be seen to be:
$$\Delta p = p_1 - p_0 = \frac{p}{1 - sq} - p$$
Multiplying p by $\frac{1 - sq}{1 - sq}$ (i.e., multiplying by 1) allows us
to collect all the terms in one single fraction:
$$\Delta p = \frac{spq}{1 - sq} \tag{2.7}$$
[Figure 2.6: frequency of genotype A and genotype a plotted against generation no., t]
Since 1 > s > 0 (genotype a has a lower fitness than genotype A), it can be
seen from equation 2.7 that ∆p must be positive. This is consistent with the
fact that genotype A is more fit than genotype a and that selection should
therefore act to increase its frequency. The change in frequency of the less
fit genotype is simply:
$$\Delta q = -\Delta p = \frac{-spq}{1 - sq} \tag{2.8}$$
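The per-generation change can be tabulated to see where selection acts
fastest. This is a small illustrative sketch with hypothetical function
names: delta_p implements equation 2.7, next_p implements equation 2.6, and
the two are compared to confirm that they agree. The value s = 0.8 is the one
quoted in the caption of Figure 2.7.

```python
def delta_p(p, s):
    """Change in the frequency of A in one generation (equation 2.7)."""
    q = 1 - p
    return s * p * q / (1 - s * q)


def next_p(p, s):
    """Frequency of A after one generation (equation 2.6)."""
    return p / (1 - s * (1 - p))


if __name__ == "__main__":
    s = 0.8
    for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
        print(f"p = {p:.2f}  delta_p = {delta_p(p, s):.4f}")
```

The printed values are small when p is close to 0 or 1 and larger at
intermediate frequencies, in line with the observation made from figures 2.5
and 2.6.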
Figure 2.7: Change in p (i.e., the frequency of genotype A) during one
generation, as a function of the current value of p. In this example the
selection coefficient against allele a is s = 0.8. (Axes: frequency change,
delta-p, against p.)
$$p_1 = p^{*}(1 - m)$$
$$p_1 = \frac{p(1 - m)}{1 - sq} \tag{2.10}$$
This expression tells us how the forces of selection and mutation combine
to change the frequency of A (from p to p1 ) in one generation. We can now
find the equilibrium frequency by using the following insight: at equilibrium
the frequency of A is constant. From this it follows that the frequency of
A after one generation (p1 ) must equal the frequency of A in the previous
life-cycle (p). Substituting p for p1 in equation 2.10 gives us:
$$p = \frac{p(1 - m)}{1 - sq}$$
Footnote 1: This derivation follows Felsenstein. See
http://evolution.genetics.washington.edu/pgbook/pgbook.html
We can now eliminate p from this equation and isolate the equilibrium fre-
quency of a:
$$1 - sq = 1 - m$$
$$sq = m$$
$$q = \frac{m}{s} \tag{2.11}$$
Equation 2.11 shows us that the equilibrium frequency of the disadvanta-
geous allele depends on only the mutation rate and the selection coefficient
in a very simple way. In our example we find that the equilibrium frequency
of a is:
$$q = \frac{m}{s} = \frac{10^{-7}}{0.05} = 2 \times 10^{-6}$$
The conclusion from the analysis presented here is that when both
selection and mutation are acting at the same time, a constant and
predictable level of genetic diversity will be maintained in the population.
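The equilibrium can also be found by simply iterating equation 2.10 until the
frequency stops changing. The sketch below is illustrative and rests on some
assumptions: the function names are hypothetical, m is read as the
per-generation rate at which A mutates to a (as equation 2.10 suggests), and
the starting frequency q0 = 0.01 and the number of generations are arbitrary
choices. The values s = 0.05 and m = 10^-7 are those of the example above.

```python
def next_q(q, s, m):
    """Frequency of a after one round of selection and mutation.

    Uses equation 2.10, p1 = p(1 - m) / (1 - s*q), with q1 = 1 - p1.
    """
    p = 1 - q
    p1 = p * (1 - m) / (1 - s * q)
    return 1 - p1


def equilibrium_q(s, m, q0=0.01, generations=2000):
    """Iterate next_q from q0 (an arbitrary starting value)."""
    q = q0
    for _ in range(generations):
        q = next_q(q, s, m)
    return q


if __name__ == "__main__":
    s, m = 0.05, 1e-7
    print("iterated:", equilibrium_q(s, m))
    print("m / s:   ", m / s)
```

The iterated value should agree closely with m/s from equation 2.11.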
0.0001, then there will still be the same chance for the frequency to either
increase or decrease in the next generation. If this fluctuation continues for
sufficiently long then p will eventually wander to either 0 or 1. Once that has
happened the allele frequency can no longer change (at least if we assume
that there is no mutation and no migration from other populations).
The process of random change in genotype frequencies is called genetic
drift. From the discussion above, it can be seen that genetic drift (on its own)
tends to reduce the level of genetic variation in a population. This is similar
to the effects of selection described in section 2.4, but in the case of fixation
by drift, the fixed allele will not be advantageous compared to the lost alleles.
In fact, fixation will be the result of entirely random processes, and different
alleles will be fixed in different populations. The DNA sequences in isolated
populations will therefore tend to drift apart over time.
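Genetic drift is straightforward to simulate. The following is a minimal
sketch under stated assumptions, not the model used in the notes: each
generation consists of N haploid individuals, each of which copies the allele
of a randomly chosen parent, with no selection, mutation, or migration. The
function names and parameter values are hypothetical.

```python
import random


def drift_until_fixed(N, p0, rng):
    """Follow the frequency of allele A until it reaches 0 or 1.

    Assumption (illustrative): each generation consists of N haploid
    individuals, each copying the allele of a randomly chosen parent
    from the previous generation. There is no selection or mutation.
    """
    count = round(p0 * N)
    generations = 0
    while 0 < count < N:
        p = count / N
        # Each of the N offspring carries A with probability p.
        count = sum(1 for _ in range(N) if rng.random() < p)
        generations += 1
    return count / N, generations


def fraction_fixed(N, p0, runs, seed=1):
    """Fraction of independent populations in which A becomes fixed."""
    rng = random.Random(seed)
    fixed = sum(drift_until_fixed(N, p0, rng)[0] == 1.0 for _ in range(runs))
    return fixed / runs


if __name__ == "__main__":
    print(drift_until_fixed(20, 0.5, random.Random(0)))
    print(fraction_fixed(20, 0.5, runs=300))
```

Every run ends with the allele either lost or fixed, and different runs end
differently, which mirrors the statement that different alleles will be fixed
in different populations.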
[Figure 2.8: genotype frequencies of alleles A1, A2, A3, and A4 plotted against time, with time points t1 to t4 marked]
only note that, for some organisms, 2N (or 4N ) generations is a very long
time indeed. (This obviously depends on both the population size and the
generation time.) You should compare these time spans to the speed with
which natural selection leads to fixation of an advantageous allele (section
2.4).
The loss of genetic diversity caused by genetic drift is, however, coun-
terbalanced by the constant production of new mutations. The net result is
a dynamic equilibrium where the population maintains a certain amount of
variation, but the specific alleles making up this variation are changing over
time. This process is illustrated in figure 2.8 where we follow the frequencies
of alleles at a specific locus. Initially, alleles A1 and A2 are present. At time
t1 , allele A3 is produced by mutation. Allele A1 is lost by drift at time t2 ,
and a new allele (A4 ) subsequently arises by mutation at time t3 . Later,
allele A3 is again lost by drift. Note that the average level of variability
at the locus remains roughly constant over time, but that the actual alleles
(and their frequencies) accounting for this variability change over time.
We will now consider some aspects of how genetic drift interacts with
selectively neutral mutations that are generated at a constant rate. Let us
again assume that we are examining a haploid, asexual population with a
constant size of N individuals. Mutations are constantly being produced at
a rate µ. This rate is fairly constant and is perhaps mostly controlled by the
interplay between the error rate of the DNA polymerase during replication
and the activity of DNA repair systems. A certain fraction f0 of mutations
are neutral. These are produced at the rate u = f0 µ. Note that u only refers
to the neutral mutation rate and that this is lower than the total mutation
rate. Most newly arisen neutral mutations are immediately lost due to
genetic drift, but some eventually become fixed. We are now interested in
determining the overall rate at which neutral mutations become fixed in the
population. This “rate of fixation” tells us how quickly the DNA sequences
of two isolated populations drift apart.
Figure 2.9: The average time it takes for a neutral mutation to reach fixation
(in a population of haploid organisms) is 2N. The average time between the
fixation of different alleles at a locus is 1/u. (Axes: genotype frequency
against time, with the intervals 2N and 1/u marked.)
The number of mutations produced per generation (at the locus we are
examining) is $Nu$. For instance, if $u = 2 \times 10^{-7}$ mutations per
generation for this locus, and if $N = 10^6$, then an average of
$2 \times 10^{-7} \times 10^6 = 0.2$ new mutations will be produced at this
locus per generation in the entire population (corresponding to one new
mutation every five generations). In the previous section we found that a
single gene in a population of N haploid individuals has the probability
$1/N$ of being fixed by genetic drift. This must therefore also be the
probability a newly arisen neutral mutation has of becoming fixed, since the
mutant allele will initially be present as a single copy among a total of N
genes.
Recall that the rate of fixation is the number of new mutations that
become fixed in a given population per generation (or any other unit of
time). We can now determine this rate simply by multiplying the number of
mutations produced per generation ($Nu$) by the probability that a mutation
eventually becomes fixed ($1/N$). Denoting the rate of fixation by k we
therefore have:
$$k = \frac{1}{N} \times uN = u \tag{2.12}$$
This simple but slightly surprising result shows that the rate at which
neutral mutations become fixed is independent of the population size.
Moreover, the rate of fixation is simply equal to u, the neutral mutation
rate. Note that the average time between fixation of different alleles at a
locus is $t = 1/k = 1/u$. In our example from before, we have that
$k = u = 2 \times 10^{-7}$ fixed mutations per generation. The average time
between fixations is therefore
$t = 1/k = 1/(2 \times 10^{-7}) = 5{,}000{,}000$ generations per fixed
mutation (at this locus). You should distinguish between the average time it
takes for a mutation to reach fixation (2N generations in a haploid) and the
average time between fixation of different mutants ($1/k$; figure 2.9).
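The arithmetic behind equation 2.12 can be reproduced in a few lines. This is
an illustrative sketch with a hypothetical function name. It uses the example
value u = 2 × 10^-7 and evaluates the result for several population sizes (of
which only N = 10^6 comes from the text) to show that N cancels out.

```python
def neutral_fixation(N, u):
    """Return (new mutations per generation, fixation rate k, 1/k)."""
    mutations_per_generation = N * u
    fixation_probability = 1 / N  # a single copy among N genes
    k = mutations_per_generation * fixation_probability
    return mutations_per_generation, k, 1 / k


if __name__ == "__main__":
    u = 2e-7
    for N in (10**3, 10**6, 10**9):
        new_mutations, k, waiting_time = neutral_fixation(N, u)
        print(N, new_mutations, k, waiting_time)
```

The number of new mutations per generation grows with N, but the fixation
rate k and the average time between fixations stay the same, up to
floating-point rounding.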
the observed molecular polymorphism was in fact neutral and therefore had
no effect on fitness.
According to this so-called “Neutral Theory” of molecular evolution most
mutations are disadvantageous and are quickly removed by natural selection,
a vanishingly small proportion are advantageous and are quickly brought to
fixation, while the vast majority of fixed (and therefore observed) muta-
tions are selectively neutral. That most mutations are disadvantageous and
rarely observed is in agreement with the previously prevalent views (now
referred to as “selectionist”). Selectionists and neutralists also agree that
adaptation must be the result of advantageous mutations that are brought
to fixation by natural selection. The main point of difference concerns the
fraction of mutations that are advantageous: the extreme selectionist view
is that almost all observed mutations are advantageous, while the neutralist
believes that practically all observed mutations are neutral with respect to
fitness. Today, we have many examples of mutations that appear to have
been fixed by natural selection, but there is also a great deal of evidence for
the importance of neutral mutation and genetic drift. The truth probably
lies somewhere between the two extreme viewpoints.
Figure 2.10: The molecular clock. Genetic distance (number of amino acid
replacements per site) plotted against a geological estimate of divergence
times for various pairs of mammalian groups. Combined sequence data
from α-globin, β-globin, cytochrome c, and fibrinopeptide A. (From Graur
and Li, 2000, Molecular Evolution, 2nd edition, Sinauer Associates Inc.)
faster. Secondly, things are not quite as tidy as figure 2.10 implies. There
are many examples where evolution does not proceed at a constant rate.
However, it is probably fair to say that all in all, the examples that we
do have of rate constancy are sufficiently striking to require some sort of
explanation. Finally, selectionist explanations for the molecular clock have
also been proposed, although generally these seem to be slightly ad hoc and
unsatisfactory.