
Evolution Strategies

Nikolaus Hansen, Dirk V. Arnold, Anne Auger

To cite this version:

Nikolaus Hansen, Dirk V. Arnold, Anne Auger. Evolution Strategies. Janusz Kacprzyk; Witold Pedrycz. Handbook of Computational Intelligence, 871-898, Springer, 2015, 978-3-662-43504-5. hal-01155533

HAL Id: hal-01155533
https://inria.hal.science/hal-01155533v1
Submitted on 9 Aug 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Evolution Strategies
Nikolaus Hansen, Dirk V. Arnold and Anne Auger
February 11, 2015

Contents

1 Overview

2 Main Principles
2.1 (µ/ρ +, λ) Notation for Selection and Recombination
2.2 Two Algorithm Templates
2.3 Recombination Operators
2.4 Mutation Operators

3 Parameter Control
3.1 The 1/5th Success Rule
3.2 Self-Adaptation
3.3 Derandomized Self-Adaptation
3.4 Non-Local Derandomized Step-Size Control (CSA)
3.5 Addressing Dependencies Between Variables
3.6 Covariance Matrix Adaptation (CMA)
3.7 Natural Evolution Strategies
3.8 Further Aspects

4 Theory
4.1 Lower Runtime Bounds
4.2 Progress Rates
4.2.1 (1+1)-ES on Sphere Functions
4.2.2 (µ/µ, λ)-ES on Sphere Functions
4.2.3 (µ/µ, λ)-ES on Noisy Sphere Functions
4.2.4 Cumulative Step-Size Adaptation
4.2.5 Parabolic Ridge Functions
4.2.6 Cigar Functions
4.2.7 Further Work
4.3 Convergence Proofs

Abstract
Evolution strategies are evolutionary algorithms that date back to the 1960s and that are most commonly
applied to black-box optimization problems in continuous search spaces. Inspired by biological evolution,
their original formulation is based on the application of mutation, recombination and selection in populations
of candidate solutions. From the algorithmic viewpoint, evolution strategies are optimization methods
that sample new candidate solutions stochastically, most commonly from a multivariate normal probability
distribution. Their two most prominent design principles are unbiasedness and adaptive control of parameters
of the sample distribution. In this overview the important concepts of success-based step-size control, self-adaptation and derandomization are covered, as well as more recent developments like covariance matrix
adaptation and natural evolution strategies. The latter give new insights into the fundamental mathematical
rationale behind evolution strategies. A broad discussion of theoretical results includes progress rate results
on various function classes and convergence proofs for evolution strategies.

1 Overview

Evolution Strategies [1, 2, 3, 4], sometimes also referred to as Evolutionary Strategies, and Evolutionary Programming [5] are search paradigms inspired by the principles of biological evolution. They belong to the family of evolutionary algorithms that address optimization problems by implementing a repeated process of (small) stochastic variations followed by selection: in each generation (or iteration), new offspring (or candidate solutions) are generated from their parents (candidate solutions already visited), their fitness is evaluated, and the better offspring are selected to become the parents for the next generation.

Evolution strategies most commonly address the problem of continuous black-box optimization. The search space is the continuous domain, R^n, and solutions in search space are n-dimensional vectors, denoted as x. We consider an objective or fitness function f : R^n → R, x ↦ f(x), to be minimized. We make no specific assumptions on f, other than that f can be evaluated for each x, and refer to this search problem as black-box optimization. The objective is, loosely speaking, to generate solutions (x-vectors) with small f-values while using a small number of f-evaluations.¹

In this context, we present an overview of methods that sample new offspring, or candidate solutions, from normal distributions. Naturally, such an overview is biased by the authors' viewpoints, and our emphasis will be on important design principles and on contemporary evolution strategies that we consider most relevant in practice or future research. More comprehensive historical overviews can be found elsewhere [6, 7].

In the next section the main principles are introduced and two algorithm templates for an evolution strategy are presented. Section 3 presents six evolution strategies that mark important conceptual and algorithmic developments. Section 4 summarizes important theoretical results.

¹ Formally, we like to "converge" to an essential global optimum of f, in the sense that the best f(x) value gets arbitrarily close to the essential infimum of f (i.e., the smallest f-value for which all larger, i.e. worse, f-values have sublevel sets with positive volume).

Symbols and Abbreviations. Throughout this chapter, vectors like z ∈ R^n are column vectors, their transpose is denoted as z^T, and transformations like exp(z), z^2, or |z| are applied component-wise. Further symbols are

|z| = (|z_1|, |z_2|, ...)^T, the absolute value taken component-wise

‖z‖ = (Σ_i z_i^2)^(1/2), the Euclidean length of a vector

∼ equality in distribution

∝ in the limit proportional to

◦ binary operator giving the component-wise product of two vectors or matrices (Hadamard product), such that for a, b ∈ R^n we have a ◦ b ∈ R^n and (a ◦ b)_i = a_i b_i

1_α the indicator function, 1_α = 0 if α is false or 0 or empty, and 1_α = 1 otherwise

λ ∈ N number of offspring, offspring population size

µ ∈ N number of parents, parental population size

µ_w = (Σ_{k=1}^µ |w_k|)^2 / Σ_{k=1}^µ w_k^2, the variance effective selection mass or effective number of parents, where always µ_w ≤ µ and µ_w = µ if all recombination weights w_k are equal in absolute value

(1+1) elitist selection scheme with one parent and one offspring, see Section 2.1

(µ +, λ), e.g. (1+1) or (1, λ), selection schemes, see Section 2.1

(µ/ρ +, λ) selection scheme with recombination (if ρ > 1), see Section 2.1

ρ ∈ N number of parents for recombination

σ > 0 a step-size and/or standard deviation

σ ∈ R^n_+ a vector of step-sizes and/or standard deviations

ϕ ∈ R a progress measure, see Definition 2 and Section 4.2

c_{µ/µ,λ} the progress coefficient for the (µ/µ, λ)-ES [8]; equals the expected value of the average of the largest µ order statistics of λ independent standard normally distributed random numbers and is in the order of (2 log(λ/µ))^(1/2)

C ∈ R^{n×n} a (symmetric and positive definite) covariance matrix

C^{1/2} ∈ R^{n×n} a matrix that satisfies C^{1/2} C^{1/2} = C and is symmetric if not stated otherwise. If C^{1/2} is symmetric, the eigendecomposition C^{1/2} = BΛB^T with BB^T = I and diagonal matrix Λ exists and we find C = C^{1/2} C^{1/2} = BΛ^2 B^T as eigendecomposition of C.

e_i the ith canonical basis vector

f : R^n → R fitness or objective function to be minimized

I ∈ R^{n×n} the identity matrix (identity transformation)

i.i.d. independent and identically distributed

N(x, C) a multivariate normal distribution with expectation and modal value x and covariance matrix C, see Section 2.4

n ∈ N search space dimension

P a multiset of individuals, a population

s, s_σ, s_c ∈ R^n a search path or evolution path

s, s_k endogenous strategy parameters (also known as control parameters) of a single parent or the kth offspring; they typically parametrize the mutation, for example with a step-size σ or a covariance matrix C

t ∈ N time or iteration index

w_k ∈ R recombination weights

x, x^(t), x_k ∈ R^n solution or object parameter vector of a single parent (at iteration t) or of the kth offspring; an element of the search space R^n that serves as argument to the fitness function f : R^n → R

diag : R^n → R^{n×n} the diagonal matrix from a vector

exp^α : R^{n×n} → R^{n×n}, A ↦ Σ_{i=0}^∞ (αA)^i / i!, is the matrix exponential for n > 1, otherwise the exponential function. If A is symmetric and BΛB^T = A is the eigendecomposition of A with BB^T = I and Λ diagonal, we have exp(A) = B exp(Λ) B^T = B (Σ_{i=0}^∞ Λ^i / i!) B^T = I + BΛB^T + BΛ^2 B^T / 2 + ... . Furthermore, exp^α(A) = exp(A)^α = exp(αA) and exp^α(x) = (e^α)^x = e^{αx}.

2 Main Principles

Evolution strategies derive inspiration from principles of biological evolution. We assume a population, P, of so-called individuals. Each individual consists of a solution or object parameter vector x ∈ R^n (the visible traits), further endogenous parameters, s (the hidden traits), and an associated fitness value, f(x). In some cases the population contains only one individual. Individuals are also denoted as parents or offspring, depending on the context. In a generational procedure,

1. one or several parents are picked from the population (mating selection) and new offspring are generated by duplication and recombination of these parents;

2. the new offspring undergo mutation and become new members of the population;

3. environmental selection reduces the population to its original size.

Within this procedure, evolution strategies employ the following main principles that are specified and applied in the operators and algorithms further below.
Environmental Selection is applied as so-called truncation selection. Based on the individuals' fitnesses, f(x), only the µ best individuals from the population survive. In contrast to roulette wheel selection in genetic algorithms [9], only fitness ranks are used. In evolution strategies, environmental selection is deterministic. In evolutionary programming, like in many other evolutionary algorithms, environmental selection has a stochastic component. Environmental selection can also remove "overaged" individuals first.

Mating Selection and Recombination. Mating selection picks individuals from the population to become new parents. Recombination generates a single new offspring from these parents. Specifically, we differentiate two common scenarios for mating selection and recombination.

fitness-independent mating selection and recombination do not depend on the fitness values of the individuals and can be either deterministic or stochastic. Environmental selection is then essential to drive the evolution toward better solutions.

fitness-based mating selection and recombination, where the recombination operator utilizes the fitness ranking of the parents (in a deterministic way). Environmental selection can potentially be omitted in this case.

Mutation and Parameter Control. Mutation introduces small, random and unbiased changes to an individual. These changes typically affect all variables. The average size of these changes depends on endogenous parameters that change over time. These parameters are also called control parameters, or endogenous strategy parameters, and define the notion of "small", for example via the step-size σ. In contrast, exogenous strategy parameters are fixed once and for all, for example the parent number µ. Parameter control is not always directly inspired by biological evolution, but is an indispensable and central feature of evolution strategies.

Unbiasedness is a generic design principle of evolution strategies. Variation resulting from mutation or recombination is designed to introduce new, unbiased "information". Selection, on the other hand, biases this information towards solutions with better fitness. Under neutral selection (i.e., fitness-independent mating and environmental selection), all variation operators are desired to be unbiased. Maximum exploration and unbiasedness are in accord. Evolution strategies are unbiased in the following respects.

• The type of mutation distribution, the Gaussian or normal distribution, is chosen in order to have rotational symmetry and maximum entropy (maximum exploration) under the given variances. Decreasing the entropy would introduce prior information and therefore a bias.

• Object parameters and endogenous strategy parameters are unbiased under recombination and unbiased under mutation. Typically, mutation has expectation zero.

• Invariance properties avoid a bias towards a specific representation of the fitness function, e.g. representation in a specific coordinate system or using specific fitness values (invariance to strictly monotonic transformations of the fitness values can be achieved). Parameter control in evolution strategies strives for invariance properties [10].

2.1 (µ/ρ +, λ) Notation for Selection and Recombination

An evolution strategy is an iterative (generational) procedure. In each generation new individuals (offspring) are created from existing individuals (parents). A mnemonic notation is commonly used to describe some aspects of this iteration. The (µ/ρ +, λ)-ES, where µ, ρ and λ are positive integers, also frequently denoted as (µ +, λ)-ES (where ρ remains unspecified), describes the following.

• The parent population contains µ individuals.

• For recombination, ρ (out of µ) parent individuals are used. We have therefore ρ ≤ µ.

• λ denotes the number of offspring generated in each iteration.

• +, describes whether or not selection is additionally based on the individuals' age. An evolution strategy applies either 'plus'- or 'comma'-selection. In 'plus'-selection, age is not taken into account and the µ best of µ + λ individuals are chosen. Selection is elitist and, in effect, the parents are the µ all-time best individuals. In 'comma'-selection, individuals die out after one iteration step and only the offspring (the youngest individuals) survive to the next generation. In that case, environmental selection chooses µ parents from λ offspring.

In a (µ, λ)-ES, λ ≥ µ must hold, and the case λ = µ requires fitness-based mating selection or recombination. In a (µ + λ)-ES, λ = 1 is possible and known as the steady-state scenario.

Occasionally, a subscript to ρ is used in order to denote the type of recombination, e.g. ρ_I or ρ_W for intermediate or weighted recombination, respectively. Without a subscript we tacitly assume intermediate recombination, if not stated otherwise. The notation has also been expanded to include the maximum age, κ, of individuals as (µ, κ, λ)-ES [11], where 'plus'-selection corresponds to κ = ∞ and 'comma'-selection corresponds to κ = 1.

2.2 Two Algorithm Templates

Template 1 The (µ/ρ +, λ)-ES
0 given n, ρ, µ, λ ∈ N+
1 initialize P = {(x_k, s_k, f(x_k)) | 1 ≤ k ≤ µ}
2 while not happy
3   for k ∈ {1, ..., λ}
4     (x_k, s_k) = recombine(select_mates(ρ, P))
5     s_k ← mutate_s(s_k)
6     x_k ← mutate_x(s_k, x_k) ∈ R^n
7   P ← P ∪ {(x_k, s_k, f(x_k)) | 1 ≤ k ≤ λ}
8   P ← select_by_age(P)      // identity for '+'
9   P ← select_µ_best(µ, P)   // by f-ranking

Template 2 The (µ/µ +, λ)-ES
0 given n, λ ∈ N+
1 initialize x ∈ R^n, s, P = {}
2 while not happy
3   for k ∈ {1, ..., λ}
4     s_k = mutate_s(s)
5     x_k = mutate_x(s_k, x)
6     P ← P ∪ {(x_k, s_k, f(x_k))}
7   P ← select_by_age(P)      // identity for '+'
8   (x, s) ← recombine(P, x, s)

Template 1 gives pseudocode for the evolution strategy. Given is a population, P, of at least µ individuals (x_k, s_k, f(x_k)), k = 1, ..., µ. Vector x_k ∈ R^n is a solution vector and s_k contains the control or endogenous strategy parameters, for example a success counter or a step-size that primarily serves to control the mutation of x (in Line 6). The values of s_k may be identical for all k. In each generation, first λ offspring are generated (Lines 3–6), each by recombination of ρ ≤ µ individuals from P (Line 4), followed by mutation of s (Line 5) and of x (Line 6). The new offspring are added to P (Line 7). Overaged individuals are removed from P (Line 8), where individuals from the same generation have, by definition, the same age. Finally, the best µ individuals are retained in P (Line 9).

The mutation of the x-vector in Line 6 always involves a stochastic component. Lines 4 and 5 may have stochastic components as well.

When select_mates in Line 4 selects ρ = µ individuals from P, it reduces to the identity. If ρ = µ and recombination is deterministic, as is commonly the case, the result of recombine is the same parental centroid for all offspring. The computation of the parental centroid can be done once before the for loop or as the last step of the while loop, simplifying the initialization of the algorithm. Template 2 shows the pseudocode in this case.

In Template 2, only a single parental centroid (x, s) is initialized. Mutation takes this parental centroid as input (notice that s_k and x_k in Lines 4 and 5 are
now assigned rather than updated) and "recombination" is postponed to the end of the loop, computing in Line 8 the new parental centroid. While (x_k, s_k) can contain all necessary information for this computation, it is often more transparent to use x and s as additional arguments in Line 8. Selection based on f-values is now limited to mating selection in procedure recombine (that is, procedure select_µ_best is omitted and µ is the number of individuals in P that are actually used by recombine).

Using a single parental centroid has become the most popular approach, because such algorithms are simpler to formalize, easier to analyze, and even perform better in various circumstances as they allow for maximum genetic repair (see below). All instances of evolution strategies given in Section 3 are based on Template 2.

2.3 Recombination Operators

In evolution strategies, recombination combines information from several parents to generate a single new offspring. Often, multi-recombination is used, where more than two parents are recombined (ρ > 2). In contrast, in genetic algorithms often two offspring are generated from the recombination of two parents. In evolutionary programming, recombination is generally not used. The most important recombination operators used in evolution strategies are the following.

Discrete or dominant recombination, denoted by (µ/ρ_D +, λ), is also known as uniform crossover in genetic algorithms. For each variable (component of the x-vector), a single parent is drawn uniformly from all ρ parents to inherit the variable value. For ρ parents that all differ in each variable value, the result is uniformly distributed across ρ^n different x-values. The result of discrete recombination depends on the given coordinate system.

Intermediate recombination, denoted by (µ/ρ_I +, λ), takes the average value of all ρ parents (computes the center of mass, the centroid).

Weighted multi-recombination [12, 10, 13], denoted by (µ/ρ_W +, λ), is a generalization of intermediate recombination, usually with ρ = µ. It takes a weighted average of all ρ parents. The weight values depend on the fitness ranking, in that better parents never get smaller weights than inferior ones. With equal weights, intermediate recombination is recovered. By using comma selection and ρ = µ = λ, where some of the weights may be zero, weighted recombination can take over the role of fitness-based environmental selection, and negative weights become a feasible option [12, 13].²

In principle, recombination operators from genetic algorithms, like one-point and two-point crossover or line recombination [14], can alternatively be used. However, they have been rarely applied in evolution strategies.

In evolution strategies, the result of selection and recombination is often deterministic (namely, if ρ = µ and recombination is intermediate or weighted). This means that eventually all offspring are generated by mutation from the same single solution vector (the parental centroid) as in Template 2. This leads, for given variances, to maximum entropy because all offspring are independently drawn from the same normal distribution.³

The role of recombination in general is to keep the variation in a population high. Discrete recombination directly introduces variation by generating different solutions. Their distance resembles the distance between the parents. However, discrete recombination, as it depends on the given coordinate system, relies on separability: it can introduce variation successfully only if the values of disrupted variables do not strongly depend on each other. Solutions resulting from discrete recombination lie on the vertices of an axis-parallel box.

² The sum of weights must be either one or zero, or recombination must be applied to the vectors x_k − x and the result added to x.

³ With discrete recombination, the offspring distribution is generated from a mixture of normal distributions with different mean values. The resulting distribution has lower entropy unless it has a larger overall variance.
Intermediate and weighted multi-recombination do not lead to variation within the new population as they result in the same single point for all offspring. However, they do allow the mutation operator to introduce additional variation by means of genetic repair [15]: recombinative averaging reduces the effective step length taken in unfavorable directions by a factor of √µ (or √µ_w in case of weighted recombination), but leaves the step length in favorable directions essentially unchanged, see also Section 4.2. This may allow increased variation by enlarging mutations by a factor of about µ (or µ_w), as revealed in Eq. (16), to achieve maximal progress.

2.4 Mutation Operators

The mutation operator introduces ("small") variations by adding a point-symmetric perturbation to the result of recombination, say a solution vector x ∈ R^n. This perturbation is drawn from a multivariate normal distribution⁴, N(0, C), with zero mean (expected value) and covariance matrix C ∈ R^{n×n}. We have x + N(0, C) ∼ N(x, C), meaning that x determines the expected value of the new offspring individual. We also have x + N(0, C) ∼ x + C^{1/2} N(0, I), meaning that the linear transformation C^{1/2} generates the desired distribution from the vector N(0, I) that has i.i.d. N(0, 1) components.⁵

Figure 1 shows different normal distributions in dimension n = 2. Their lines of equal density are ellipsoids. Any straight section through the 2-dimensional density recovers a 1-dimensional Gaussian bell. Based on multivariate normal distributions, three different mutation operators can be distinguished.

Spherical/isotropic (Figure 1, left) where the covariance matrix is proportional to the identity, i.e., the mutation distribution follows σN(0, I) with step-size σ > 0. The distribution is spherical and invariant under rotations about its mean. Below, Algorithm 1 uses this kind of mutation.

Axis-parallel (Figure 1, middle) where the covariance matrix is a diagonal matrix, i.e., the mutation distribution follows N(0, diag(σ)^2), where σ is a vector of coordinate-wise standard deviations and the diagonal matrix diag(σ)^2 has eigenvalues σ_i^2 with eigenvectors e_i. The principal axes of the ellipsoid are parallel to the coordinate axes. This case includes the previous isotropic case. Below, Algorithms 2, 3, and 4 implement this kind of mutation distribution.

General (Figure 1, right) where the covariance matrix is symmetric and positive definite (i.e. x^T C x > 0 for all x ≠ 0), generally non-diagonal, and has (n^2 + n)/2 degrees of freedom (control parameters). The general case includes the previous axis-parallel and spherical cases. Below, Algorithms 5 and 6 implement general multivariate normally distributed mutations.

In the first and the second cases, the variations of variables are independent of each other, i.e. uncorrelated. This limits the usefulness of the operator in practice. The third case is "incompatible" with discrete recombination: for a narrow, diagonally oriented ellipsoid (not to be confused with a diagonal covariance matrix), a point resulting from selection and discrete recombination lies within this ellipsoid only if each coordinate is taken from the same parent (which happens with probability 1/ρ^{n−1}) or from a parent with a very similar value in this coordinate. The narrower the ellipsoid, the more similar (i.e. correlated) the value needs to be. As another illustration, consider sampling, neutral selection and discrete recombination based on Figure 1, right: after discrete recombination, the points (−2, 2) and (2, −2) outside the ellipsoid have the same probability as the points (2, 2) and (−2, −2) inside the ellipsoid.

The mutation operators introduced are unbiased in several ways. They are all point-symmetrical and have expectation zero. Therefore, mutation alone will

⁴ Besides normally distributed mutations, Cauchy mutations [16, 17, 18] have also been proposed in the context of evolution strategies and evolutionary programming.

⁵ Using the normal distribution has several advantages. The N(0, I) distribution is the most convenient way to implement an isotropic perturbation. The normal distribution is stable: sums of independent normally distributed random variables are again normally distributed. This facilitates the design and analysis of algorithms remarkably. Furthermore, the normal distribution has maximum entropy under the given variances.
Figure 1: Three 2-dimensional multivariate normal distributions N(0, C) ∼ C^{1/2} N(0, I). The covariance matrix C of the distribution is, from left to right, the identity I (isotropic distribution), the diagonal matrix (1/4 0; 0 4) (axis-parallel distribution), and (2.125 1.875; 1.875 2.125) with the same eigenvalues (1/4, 4) as the diagonal matrix. Shown are in each subfigure the mean at 0 as a small black dot (a different mean solely changes the axis annotations), two eigenvectors of C along the principal axes of the ellipsoids (thin black lines), two ellipsoids reflecting the set of points {x : (x − 0)^T C^{−1} (x − 0) ∈ {1, 4}} that represent the 1-σ and 2-σ lines of equal density, and 100 sampled points (however, a few of them are likely to be outside of the area shown).
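Samples from N(0, C) can be generated with any factorization A satisfying AA^T = C. The following sketch, assuming NumPy and using a Cholesky factor in place of the symmetric C^{1/2}, also verifies the caption's claim that the general (non-diagonal) matrix shares the eigenvalues (1/4, 4) of the diagonal one:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2

C_iso = np.eye(n)                                   # isotropic
C_axis = np.diag([0.25, 4.0])                       # axis-parallel
C_gen = np.array([[2.125, 1.875], [1.875, 2.125]])  # general

# same eigenvalues (1/4, 4) for the axis-parallel and the general case
assert np.allclose(np.sort(np.linalg.eigvalsh(C_gen)), [0.25, 4.0])

# sampling N(0, C) via x = A z with A A^T = C and z ~ N(0, I)
A = np.linalg.cholesky(C_gen)
samples = rng.standard_normal((10000, n)) @ A.T
# the empirical covariance approaches C_gen
assert np.allclose(np.cov(samples.T), C_gen, atol=0.2)
```

Replacing `C_gen` by `C_iso` or `C_axis` yields the other two mutation distributions of Figure 1.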

almost certainly not lead to better fitness values in expectation. The isotropic mutation operator features the same distribution along any direction. The general mutation operator is, as long as C remains unspecified, unbiased towards the choice of a Cartesian coordinate system, i.e. unbiased towards the representation of solutions x, which has also been referred to as invariance to affine coordinate system transformations [10]. This, however, depends on the way C is adapted (see below).

3 Parameter Control

Controlling the parameters of the mutation operator is key to the design of evolution strategies. Consider the isotropic operator (Figure 1, left), where the step-size σ is a scaling factor for the random vector perturbation. The step-size controls to a large extent the convergence speed. In situations where larger step-sizes lead to larger expected improvements, a step-size control technique should aim at increasing the step-size (and decreasing it in the opposite scenario).

The importance of step-size control is illustrated with a simple experiment. Consider a spherical function f(x) = ‖x‖^α, α > 0, and a (1+1)-ES with constant step-size equal to σ = 10^−2, i.e. with mutations drawn from 10^−2 N(0, I). The convergence of the algorithm is depicted in Fig. 2 (constant σ graphs). We observe, roughly speaking, three stages: up to 600 function evaluations, progress towards the optimum is slow. In this stage the fixed step-size is too small. Between 700 and 800 evaluations, fast progress towards the optimum is observed. In this stage the step-size is close to optimal. Afterwards, the progress

Figure 2: Runs of the (1+1)-ES with constant step-size, of pure random search (uniform in [−0.2, 1]10 ),
and of the (1 + 1)-ES with 1/5th success rule (Algorithm 1) on a spherical function f (x) = kxkα , α > 0
(because of invariance to monotonic f -transformation the same graph is observed for any α > 0). For each
algorithm there are three runs in the left plot and three runs in the right plot. The x-axis is linear in the
left and in log-scale in the right hand plot. For the (1+1)-ES with constant step-size, σ equals 10−2 . For
the (1+1)-ES with 1/5th success rule, the initial step-size is chosen very small to 10−9 and the parameter
d equals 1 + 10/3. On the left, also the evolution of the step-size of one of the runs of the (1+1)-ES with
1/5th success rule is shown. All algorithms are initialized at 1. Eventually, the (1+1)-ES with 1/5th success
rule reveals linear behavior on the left, while the other two algorithms reveal eventually linear behavior in
the right hand plot.

decreases and approaches the rate of the pure random search algorithm, well illustrated in the right subfigure. In this stage the fixed step-size is too large and the probability to sample better offspring becomes very small.

The figure also shows runs of the (1+1)-ES with 1/5th success rule step-size control (as described in Section 3.1) and the step-size evolution associated to one of these runs. The initial step-size is far too small and we observe that the adaptation technique increases the step-size in the first iterations. Afterwards, step-size is kept roughly proportional to the distance to the optimum, which is in fact optimal and leads to linear convergence in the left subfigure.

Generally, the goal of parameter control is to drive the endogenous strategy parameters close to their optimal values. These optimal values, as we have seen for the step-size in Figure 2, can significantly change over time or depending on the position in search space. In the most general case, the mutation operator has (n² + n)/2 degrees of freedom (see Section 2.4). The conjecture is that in the desired scenario lines of equal density of the mutation operator resemble locally the lines of equal fitness [4, p242f]. In case of convex-quadratic fitness functions this resemblance can be perfect and, apart from the step-size, optimal parameters do not change over time (as illustrated in Fig. 3 below).

Control parameters like the step-size can be stored on different “levels”. Each individual can have its own step-size value (like in Algorithms 2 and 3), or a single step-size is stored and applied to all individuals in the population. In the latter case, sometimes different populations with different parameter values are run in parallel [19].

In the following, six specific evolution strategies are outlined, each of them representing an important achievement in parameter control.
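The behavior shown in Figure 2 is easy to reproduce. The sketch below (Python, standard library only; the function names, evaluation budget and tolerances are illustrative assumptions) implements the (1+1)-ES with 1/5th success rule, with the damping d = 1 + n/3 taken from the figure caption, and checks two claims from the text: the strategy converges on a spherical function, and, because selection uses only f-comparisons, its trajectory is unchanged under a strictly monotonic transformation of f.

```python
import math
import random

def one_plus_one_es(f, x0, sigma, budget, seed=1):
    """(1+1)-ES with 1/5th success rule step-size control."""
    rng = random.Random(seed)
    n = len(x0)
    d = 1 + n / 3                    # damping, as in the caption of Figure 2
    x, fx = list(x0), f(x0)
    for _ in range(budget):
        y = [xi + sigma * rng.gauss(0, 1) for xi in x]
        fy = f(y)
        success = 1.0 if fy <= fx else 0.0
        sigma *= math.exp((success - 0.2) / d)   # grow on success, else shrink
        if fy <= fx:                             # (1+1) selection: keep the better point
            x, fx = y, fy
    return x

sphere = lambda x: sum(xi * xi for xi in x)            # f(x) = ||x||^2, i.e. alpha = 2
root_sphere = lambda x: math.sqrt(sphere(x))           # monotone transform, alpha = 1

x_final = one_plus_one_es(sphere, [1.0] * 10, 1e-9, budget=5000)
assert sphere(x_final) < 1e-10                         # step-size adapted, then linear convergence
# rank-based selection: a monotone f-transformation leaves the run unchanged
assert x_final == one_plus_one_es(root_sphere, [1.0] * 10, 1e-9, budget=5000)
```

The second assertion is exactly the invariance noted in the Figure 2 caption: f(x) = ‖x‖² and f(x) = ‖x‖ induce identical runs, because only comparisons f(y) ≤ f(x) enter the algorithm.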

3.1 The 1/5th Success Rule

Algorithm 1  The (1+1)-ES with 1/5th Rule

0  given n ∈ N+, d ≈ n + 1
1  initialize x ∈ R^n, σ > 0
2  while not happy
3      x1 = x + σ × N(0, I)                       // mutation
4      σ ← σ × exp^{1/d}(1[f(x1) ≤ f(x)] − 1/5)
5      if f(x1) ≤ f(x)                            // select if better
6          x = x1                                 // x-value of new parent

The 1/5th success rule for step-size control is based on an important discovery made very early in the research of evolution strategies [1]. A similar rule had also been found independently before in [20]. As a control mechanism in practice, the 1/5th success rule has been mostly superseded by more sophisticated methods. However, its conceptual insight remains remarkably valuable.

Consider a linear fitness function, for example f : x ↦ x1 or f : x ↦ Σi xi. In this case, any point-symmetrical mutation operator has a success probability of 1/2: in one half of the cases, the perturbation will improve the original solution, in one half of the cases the solution will deteriorate. Following from Taylor's formula, smooth functions are locally linear, that is, they appear to be more and more linear with decreasing neighborhood size. Therefore, the success probability becomes 1/2 for step-size σ → 0. On most non-linear functions, the success rate is indeed a monotonously decreasing function in σ and goes to zero for σ → ∞. This suggests to control the step-size by increasing it for large success rates and decreasing it for small ones. This mechanism can drive the step-size close to the optimal value.

Rechenberg [1] investigated two simple but quite different functions, the corridor function

    f : x ↦ x1 if |xi| ≤ 1 for i = 2, . . . , n, and ∞ otherwise,

and the sphere function f : x ↦ Σ xi². He found optimal success rates for the (1+1)-ES with isotropic mutation to be ≈ 0.184 > 1/6 and ≈ 0.270 < 1/3, respectively (for n → ∞) [1].⁶ This leads to approximately 1/5 as being the success value where to switch between decreasing and increasing the step-size.

Algorithm 1 implements the (1+1)-ES with 1/5th success rule in a simple and effective way [21]. Lines 4–6 implement Line 8 from Template 2, including selection in Line 7. Line 4 in Algorithm 1 updates the step-size σ of the single parent. The step-size does not change if and only if the argument of exp is zero. While this cannot happen in a single generation, we still can find a stationary point for σ: log σ is unbiased if and only if the expected value of the argument of exp is zero. This is the case if E 1[f(x1) ≤ f(x)] = 1/5, in other words, if the probability of an improvement with f(x1) ≤ f(x) is 20%. Otherwise, log σ increases in expectation if the success probability is larger than 1/5 and decreases if the success probability is smaller than 1/5. Hence, Algorithm 1 indeed implements the 1/5th success rule.

⁶ Optimality here means to achieve the largest expected approach of the optimum in a single generation.

3.2 Self-Adaptation

A seminal idea in the domain of evolution strategies is parameter control via self-adaptation [3]. In self-adaptation, new control parameter settings are generated similar to new x-vectors by recombination and mutation. Algorithm 2 presents an example with adaptation of n coordinate-wise standard deviations (individual step-sizes).

Algorithm 2  The (µ/µ, λ)-σSA-ES

0  given n ∈ N+, λ ≥ 5n, µ ≈ λ/4 ∈ N, τ ≈ 1/√n, τi ≈ 1/n^{1/4}
1  initialize x ∈ R^n, σ ∈ R^n_+
2  while not happy
3      for k ∈ {1, . . . , λ}
           // random numbers i.i.d. for all k
4          ξk = τ N(0, 1)                         // global step-size (scalar)
5          ξ_k = τi N(0, I)                       // coordinate-wise σ (vector)
6          zk = N(0, I)                           // x-vector change
           // mutation
7          σk = σ ◦ exp(ξ_k) × exp(ξk)
8          xk = x + σk ◦ zk
9      P = sel µ best({(xk, σk, f(xk)) | 1 ≤ k ≤ λ})
       // recombination
10     σ = (1/µ) Σ_{σk ∈ P} σk
11     x = (1/µ) Σ_{xk ∈ P} xk

First, for conducting the mutation, random events are drawn in Lines 4–6. In Line 7, the step-size vector for each individual undergoes (i) a mutation common for all components, exp(ξk), and (ii) a component-wise mutation with exp(ξ_k). These mutations are unbiased, in that E log σk = log σ. The mutation of x in Line 8 uses the mutated vector σk. After selection in Line 9, intermediate recombination is applied to compute x and σ for the next generation. By taking the average over σk we have Eσ = Eσk in Line 10. However, the application of mutation and recombination on σ introduces a moderate bias such that σ tends to increase under neutral selection [22]. In order to achieve stable behavior of σ, the number of parents µ must be large enough, which is reflected in the setting of λ. A setting of τ ≈ 1/4 has been proposed in combination with ξk being uniformly distributed across the two values in {−1, 1} [2].

3.3 Derandomized Self-Adaptation

Derandomized self-adaptation [23] addresses the problem of selection noise that occurs with self-adaptation of σ as outlined in Algorithm 2. Selection noise refers to the possibility that very good offspring may be generated with poor strategy parameter settings and vice versa. The problem occurs frequently and has two origins.

• A small/large component in |σk ◦ zk| (Line 8 in Algorithm 2) does not necessarily imply that the respective component of σk is small/large. Selection of σ is disturbed by the respective realizations of z.

• Selection of a small/large component of |σk ◦ zk| does not imply that this is necessarily a favorable setting: more often than not, the sign of a component is more important than its size and all other components influence the selection as well.

Due to selection noise, poor values are frequently inherited and we observe stochastic fluctuations of σ. Such fluctuations can in particular lead to very small values (very large values are removed by selection more quickly). The overall magnitude of these fluctuations can be implicitly controlled via the parent number µ, because intermediate recombination (Line 10 in Algorithm 2) effectively reduces the magnitude of σ-changes and biases log σ to larger values. For µ ≪ n the stochastic fluctuations become prohibitive and therefore µ ≈ λ/4 ≥ 1.25n is chosen to make σ-self-adaptation reliable.

Derandomization addresses the problem of selection noise on σ directly without resorting to a large parent number. The derandomized (1, λ)-σSA-ES is outlined in Algorithm 3 and addresses selection noise twofold. Instead of introducing new variations in σ by means of exp(ξ_k), the variations from zk are directly used for the mutation of σ in Line 7. The variations are dampened compared to their use in the mutation of x (Line 6) via d and di, thereby mimicking the effect of intermediate recombination on σ [23, 24]. The order of the two mutation equations becomes irrelevant.

For Algorithm 3 also a (µ/µ, λ) variant with recombination is feasible. However, in particular in the (µ/µI, λ)-ES, σ-self-adaptation tends to generate too small step-sizes. A remedy for this problem is to use non-local information for step-size control.

3.4 Non-Local Derandomized Step-Size Control (CSA)

When using self-adaptation, step-sizes are associated with individuals and selected based on the fitness of each individual. However, step-sizes that serve individuals well by giving them a high likelihood to be selected are generally not step-sizes that maximize the progress of the entire population or the parental
Algorithm 3  Derandomized (1, λ)-σSA-ES

0  given n ∈ N+, λ ≈ 10, τ ≈ 1/3, d ≈ n, di ≈ n
1  initialize x ∈ R^n, σ ∈ R^n_+
2  while not happy
3      for k ∈ {1, . . . , λ}
           // random numbers i.i.d. for all k
4          ξk = τ N(0, 1)
5          zk = N(0, I)
           // mutation, re-using random events
6          xk = x + exp(ξk) × σ ◦ zk
7          σk = σ ◦ exp^{1/di}(|zk| / E|N(0, 1)| − 1) × exp^{1/d}(ξk)
8      (x1, σ1, f(x1)) ← select single best({(xk, σk, f(xk)) | 1 ≤ k ≤ λ})
       // assign new parent
9      σ = σ1
10     x = x1

Algorithm 4  The (µ/µ, λ)-ES with Search Path

0  given n ∈ N+, λ ∈ N, µ ≈ λ/4 ∈ N, cσ ≈ √(µ/(n + µ)), d ≈ 1 + √(µ/n), di ≈ 3n
1  initialize x ∈ R^n, σ ∈ R^n_+, sσ = 0
2  while not happy
3      for k ∈ {1, . . . , λ}
4          zk = N(0, I)                           // i.i.d. for each k
5          xk = x + σ ◦ zk
6      P ← sel µ best({(xk, zk, f(xk)) | 1 ≤ k ≤ λ})
       // recombination and parent update
7      sσ ← (1 − cσ) sσ +
7b         √(cσ(2 − cσ)) (√µ/µ) Σ_{zk ∈ P} zk
8      σ ← σ ◦ exp^{1/di}(|sσ| / E|N(0, 1)| − 1)
8b         × exp^{cσ/d}(‖sσ‖ / E‖N(0, I)‖ − 1)
9      x = (1/µ) Σ_{xk ∈ P} xk

centroid x. We will see later that, for example, the optimal step-size may increase linearly with µ (Section 4.2 and Eq. (16)). With self-adaptation on the other hand, the step-size of the µth best offspring is typically even smaller than the step-size of the best offspring. Consequently, Algorithm 3 often assumes too small step-sizes and can be considerably improved by using non-local information about the evolution of the population. Instead of single (local) mutation steps z, an exponentially fading record, sσ, of mutation steps is taken. This record, referred to as search path or evolution path, can be pictured as a sequence or sum of consecutive successful z-steps that is non-local in time and space. A search path carries information about the interrelation between single steps. This information can improve the adaptation and search procedure remarkably. Algorithm 4 outlines the (µ/µI, λ)-ES with cumulative path length control, also denoted as cumulative step-size adaptation (CSA), and additionally with non-local individual step-size adaptation [25, 26].

In the (µ/µ, λ)-ES with search path, Algorithm 4, the factor ξk for changing the overall step-size has disappeared (compared to Algorithm 3) and the update of σ is postponed until after the for loop. Instead of the additional random variate ξk, the length of the search path ‖sσ‖ determines the global step-size change in Line 8b. For the individual step-size change, |zk| is replaced by |sσ|.

Using a search path is justified in two ways. First, it implements a low-pass filter for selected z-steps, removing high-frequency (most likely noisy) information. Second, and more importantly, it utilizes information that is otherwise lost: even if all single steps have the same length, the length of sσ can vary, because it depends on the correlation between the directions of z-steps. If single steps point into similar directions, the path will be up to almost √(2/cσ) times longer than a single step and the step-size will increase. If they oppose each other the path will be up to almost √(cσ/2) times shorter and the step-size will decrease. The same is true for single components of sσ.

The factors √(cσ(2 − cσ)) and √µ in Line 7b guarantee unbiasedness of sσ under neutral selection, as

usual.

All evolution strategies described so far are of somewhat limited value, because they feature only isotropic or axis-parallel mutation operators. In the remainder we consider methods that entertain not only an n-dimensional step-size vector σ, but also correlations between variables for the mutation of x.

3.5 Addressing Dependencies Between Variables

The evolution strategies presented so far sample the mutation distribution independently in each component of the given coordinate system. The lines of equal density are either spherical or axis-parallel ellipsoids (compare Figure 1). This is a major drawback, because it allows to solve problems with a long or elongated valley efficiently only if the valley is aligned with the coordinate system. In this section we discuss evolution strategies that allow to traverse non-axis-parallel valleys efficiently by sampling distributions with correlations.

Full Covariance Matrix  Algorithms that adapt the complete covariance matrix of the mutation distribution (compare Section 2.4) are correlated mutations [3], the generating set adaptation [26], the covariance matrix adaptation (CMA) [27], a mutative invariant adaptation [28], and some instances of natural evolution strategies [29, 30, 31]. Correlated mutations and some natural evolution strategies are however not invariant under changes of the coordinate system [32, 10, 31]. In the next sections we outline two evolution strategies that adapt the full covariance matrix reliably and are invariant under coordinate system changes: the covariance matrix adaptation evolution strategy (CMA-ES) and the exponential natural evolution strategy (xNES).

Restricted Covariance Matrix  Algorithms that adapt non-diagonal covariance matrices, but are restricted to certain matrices, are the momentum adaptation [33], direction adaptation [26], main vector adaptation [34], and limited memory CMA-ES [35]. These variants are limited in their capability to shape the mutation distribution, but they might be advantageous for larger dimensional problems, say larger than a hundred.

3.6 Covariance Matrix Adaptation (CMA)

The covariance matrix adaptation evolution strategy (CMA-ES) [27, 10, 36] is a de facto standard in continuous domain evolutionary computation. The CMA-ES is a natural generalization of Algorithm 4 in that the mutation ellipsoids are not constrained to be axis-parallel, but can take on a general orientation. The CMA-ES is also a direct successor of the generating set adaptation [26], replacing self-adaptation to control the overall step-size with cumulative step-size adaptation.

The (µ/µW, λ)-CMA-ES is outlined in Algorithm 5. Two search paths are maintained, sσ and sc. The first path, sσ, accumulates steps in the coordinate system where the mutation distribution is isotropic and which can be derived by scaling in the principal axes of the mutation ellipsoid only. The path generalizes sσ from Algorithm 4 to non-diagonal covariance matrices and is used to implement cumulative step-size adaptation, CSA, in Line 10 (resembling Line 8b in Algorithm 4). Under neutral selection, sσ ∼ N(0, I) and log σ is unbiased.

The second path, sc, accumulates steps, disregarding σ, in the given coordinate system.⁷ The covariance matrix update consists of a rank-one update, based on the search path sc, and a rank-µ update with µ nonzero recombination weights wk. Under neutral selection the expected covariance matrix equals the covariance matrix before the update.

⁷ Whenever sσ is large and therefore σ is increasing fast, the coefficient hσ prevents sc from getting large and quickly changing the distribution shape via C. Given hσ ≡ 1, under neutral selection sc ∼ N(0, C). The coefficient ch in line 11 corrects for the bias on sc introduced by events hσ = 0.

The updates of x and C follow a common principle. The mean x is updated such that the likelihood of successful offspring to be sampled again is maximized (or increased if cm < 1). The covariance matrix C is updated such that the likelihood of successful steps (xk − x)/σ to appear again, or the likelihood

Algorithm 5  The (µ/µW, λ)-CMA-ES

0  given n ∈ N+, λ ≥ 5, µ ≈ λ/2, wk = w0(k)/Σk w0(k), w0(k) = log(λ/2 + 1/2) − log rank(f(xk)), µw = 1/Σk wk², cσ ≈ µw/(n + µw), d ≈ 1 + √(µw/n), cc ≈ (4 + µw/n)/(n + 4 + 2µw/n), c1 ≈ 2/(n² + µw), cµ ≈ µw/(n² + µw), cm = 1
1  initialize sσ = 0, sc = 0, C = I, σ ∈ R+, x ∈ R^n
2  while not happy
3      for k ∈ {1, . . . , λ}
4          zk = N(0, I)                           // i.i.d. for all k
5          xk = x + σ C^{1/2} × zk
6      P = sel µ best({(zk, f(xk)) | 1 ≤ k ≤ λ})
7      x ← x + cm σ C^{1/2} Σ_{zk ∈ P} wk zk
8      sσ ← (1 − cσ) sσ + √(cσ(2 − cσ)) √µw Σ_{zk ∈ P} wk zk          // search path for σ
9      sc ← (1 − cc) sc + hσ √(cc(2 − cc)) √µw Σ_{zk ∈ P} wk C^{1/2} zk   // search path for C
10     σ ← σ × exp^{cσ/d}(‖sσ‖ / E‖N(0, I)‖ − 1)
11     C ← (1 − c1 + ch − cµ) C + c1 sc sc^⊤ + cµ Σ_{zk ∈ P} wk C^{1/2} zk (C^{1/2} zk)^⊤

where hσ = 1[‖sσ‖²/n < 2 + 4/(n + 1)], ch = c1 (1 − hσ²) cc (2 − cc), and C^{1/2} is the unique symmetric positive definite matrix obeying C^{1/2} × C^{1/2} = C. All c-coefficients are ≤ 1.

to sample (in direction of) the path sc, is increased. A more fundamental principle for the equations is given in the next section.

Using not only the µ best but all λ offspring can be particularly useful for the “rank-µ” update of C in Line 11 where negative weights wk for inferior offspring are advisable. Such an update has been introduced as active CMA [37].

The factor cm in Line 7 can be equally written as a mutation scaling factor κ = 1/cm in Line 5, compare [38]. This means that the actual mutation steps are larger than the inherited ones, resembling the derandomization technique of damping step-size changes to address selection noise as described in Section 3.3.

An elegant way to replace Line 10 is

    σ ← σ × exp^{(cσ/d)/2}(‖sσ‖²/n − 1)    (1)

and this form is often used in theoretical investigations of the update, such as those presented in Section 4.2.

A single run of the (5/5W, 10)-CMA-ES on a convex-quadratic function is shown in Fig. 3. For sake of demonstration the initial step-size is chosen far too small (a situation that should be avoided in practice) and increases quickly for the first 400 f-evaluations. After no more than 5500 f-evaluations the adaptation of C is accomplished. Then the eigenvalues of C (square roots of which are shown in the lower left) reflect the underlying convex-quadratic function and the convergence speed is the same as on the sphere function and about 60% of the speed of the (1+1)-ES as observed in Fig. 2. The resulting convergence speed is about ten thousand times faster than without adaptation of C and at least one thousand times faster compared to any of the algorithms from the previous sections.

3.7 Natural Evolution Strategies

The idea of using natural gradient learning [39] in evolution strategies has been proposed in [29] and further pursued in [40, 31]. Natural evolution strategies (NES) put forward the idea that the update of all distribution parameters can be based on the same fundamental principle. NES have been proposed as a more principled alternative to CMA-ES and characterized by operating on Cholesky factors of a covariance matrix. Only later was it discovered that also CMA-ES implements the underlying NES principle of natural gradient learning [41, 31].

For simplicity, let the vector θ represent all parameters of the distribution to sample new offspring. In case of a multivariate normal distribution as above, we have a bijective transformation between θ and mean and covariance matrix of the distribution, θ ↔ (x, σ²C).

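Before turning to the natural-gradient derivation, the search-path mechanism of Section 3.4 can be made concrete in code. The sketch below (Python, standard library only) follows the global part of Algorithm 4, i.e., a single step-size σ without individual step-sizes, with parameter settings from the approximations given there; the test function, budget, and tolerance are illustrative assumptions.

```python
import math
import random

def sphere(x):
    return sum(xi * xi for xi in x)

def csa_es(f, x0, sigma, budget=8000, lam=10, seed=2):
    """(mu/mu, lam)-ES with cumulative step-size adaptation (global sigma only)."""
    rng = random.Random(seed)
    n = len(x0)
    mu = max(lam // 4, 1)
    c_sigma = math.sqrt(mu / (n + mu))           # cumulation constant, cf. Algorithm 4
    d = 1 + math.sqrt(mu / n)                    # damping
    chi_n = math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n * n))  # ~ E||N(0, I)||
    x, s = list(x0), [0.0] * n
    for _ in range(budget // lam):
        zs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(lam)]
        xs = [[xi + sigma * zi for xi, zi in zip(x, z)] for z in zs]
        fs = [f(xk) for xk in xs]
        best = sorted(range(lam), key=fs.__getitem__)[:mu]       # select mu best
        z_avg = [sum(zs[i][j] for i in best) / mu for j in range(n)]
        # search path: exponentially fading record of selected (averaged) z-steps
        a = math.sqrt(c_sigma * (2 - c_sigma) * mu)
        s = [(1 - c_sigma) * si + a * zi for si, zi in zip(s, z_avg)]
        norm_s = math.sqrt(sum(si * si for si in s))
        # lengthen sigma if the path is longer than expected under neutral selection
        sigma *= math.exp((c_sigma / d) * (norm_s / chi_n - 1))
        x = [sum(xs[i][j] for i in best) / mu for j in range(n)]  # intermediate recombination
    return x

x_end = csa_es(sphere, [1.0] * 10, 0.5)
assert sphere(x_end) < 1e-8
```

The path update mirrors Line 7/7b of Algorithm 4, with the factors √(cσ(2 − cσ)) and √µ keeping the path unbiased under neutral selection.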
[Figure 3: four panels against function evaluations — fitness, σ and axis ratio (blue: abs(f), cyan: f − min(f), green: sigma, red: axis ratio); object variables (mean, 10-D, popsize ≈ 10); scaling (all main axes); standard deviations in all coordinates]
Figure 3: A single run of the (5/5W, 10)-CMA-ES on the rotated ellipsoid function x ↦ Σ_{i=1}^n αi² yi² with αi = 10^{3(i−1)/(n−1)}, y = Rx, where R is a random matrix with R^⊤R = I, for n = 10. Shown is the evolution of various parameters against the number of function evaluations. Upper left: best (thick blue line), median and worst fitness value that reveal the final convergence phase after about 5500 function evaluations where the ellipsoid function has been reduced to the simple sphere; minimal and maximal coordinate-wise standard deviation of the mutation distribution and in between (mostly hidden) the step-size σ that is initialized far too small and increases quickly in the beginning, that increases afterwards several times again by up to one order of magnitude and decreases with maximal rate during the last 1000 f-evaluations; axis ratio of the mutation ellipsoid (square root of the condition number of C) that increases from 1 to 1000 where the latter corresponds to αn/α1. Lower left: sorted principal axis lengths of the mutation ellipsoid disregarding σ (square roots of the sorted eigenvalues of C, see also Fig. 1) that adapt to the (local) structure of the underlying optimization problem; they finally reflect almost perfectly the factors αi^{−1} up to a constant factor. Upper right: x (distribution mean) that is initialized with all ones and converges to the global optimum at zero while correlated movements of the variables can be observed. Lower right: standard deviations in the coordinates disregarding σ (square roots of diagonal elements of C) showing the R-dependent projections of the principal axis lengths into the given coordinate system. The straight lines to the right of the vertical line at about 6300 only annotate the coordinates and do not reflect measured data.
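The run shown in Figure 3 can be imitated on a small scale. The following sketch (an illustrative reduction, not a reference implementation) is a two-dimensional CMA-ES with only the rank-µ update (c1 = ch = 0 in Algorithm 5) and cumulative step-size adaptation; in dimension two the symmetric square root C^{1/2} has a closed form, avoiding a general eigendecomposition. The correlated convex-quadratic test function and all budget values are assumptions made for this example.

```python
import math
import random

def sqrt_spd2(a, b, d):
    """Symmetric square root of the SPD 2x2 matrix [[a, b], [b, d]]."""
    s = math.sqrt(a * d - b * b)             # sqrt(det C)
    t = math.sqrt(a + d + 2 * s)
    return (a + s) / t, b / t, (d + s) / t

def cma2d(f, x, sigma, budget=6000, lam=8, seed=3):
    """Rank-mu CMA-ES (c1 = 0) with CSA, restricted to dimension n = 2."""
    rng = random.Random(seed)
    n, mu = 2, lam // 2
    w0 = [math.log(lam / 2 + 0.5) - math.log(k + 1) for k in range(mu)]
    w = [wi / sum(w0) for wi in w0]          # positive, normalized recombination weights
    mu_w = 1 / sum(wi * wi for wi in w)
    c_sigma = mu_w / (n + mu_w)
    d_sigma = 1 + math.sqrt(mu_w / n)
    c_mu = min(1.0, mu_w / (n * n + mu_w))
    chi_n = math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n * n))  # ~ E||N(0, I)||
    ca, cb, cd = 1.0, 0.0, 1.0               # covariance matrix C, stored as (a, b, d)
    s = [0.0, 0.0]                           # search path for sigma
    for _ in range(budget // lam):
        ra, rb, rd = sqrt_spd2(ca, cb, cd)   # C^{1/2}
        zs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(lam)]
        ys = [(ra * z0 + rb * z1, rb * z0 + rd * z1) for z0, z1 in zs]  # y = C^{1/2} z
        xs = [(x[0] + sigma * y0, x[1] + sigma * y1) for y0, y1 in ys]
        fs = [f(p) for p in xs]
        best = sorted(range(lam), key=fs.__getitem__)[:mu]
        zw = [sum(w[i] * zs[k][j] for i, k in enumerate(best)) for j in range(2)]
        x = (x[0] + sigma * (ra * zw[0] + rb * zw[1]),
             x[1] + sigma * (rb * zw[0] + rd * zw[1]))
        fac = math.sqrt(c_sigma * (2 - c_sigma) * mu_w)
        s = [(1 - c_sigma) * si + fac * zi for si, zi in zip(s, zw)]
        sigma *= math.exp((c_sigma / d_sigma) * (math.hypot(*s) / chi_n - 1))
        # rank-mu update of C from the mu best steps y = C^{1/2} z
        ua = ub = ud = 0.0
        for i, k in enumerate(best):
            y0, y1 = ys[k]
            ua += w[i] * y0 * y0
            ub += w[i] * y0 * y1
            ud += w[i] * y1 * y1
        ca = (1 - c_mu) * ca + c_mu * ua
        cb = (1 - c_mu) * cb + c_mu * ub
        cd = (1 - c_mu) * cd + c_mu * ud
    return x

# correlated convex-quadratic test function (illustrative), condition number 100
f_ell = lambda p: (p[0] + p[1]) ** 2 + 100 * (p[0] - p[1]) ** 2
x_opt = cma2d(f_ell, (3.0, 3.0), 1.0)
assert f_ell(x_opt) < 1e-8
```

As in Figure 3, the covariance matrix first elongates along the valley of the quadratic, after which convergence proceeds at the same rate as on the sphere function.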
We consider a probability density p(·|θ) over R^n parametrized by θ, a non-increasing function W_θ^f : R → R,⁸ and the expected W_{θ0}^f-transformed fitness [42]

    J(θ) = E(W_{θ0}^f(f(x)))    x ∼ p(·|θ)
         = ∫_{R^n} W_{θ0}^f(f(x)) p(x|θ) dx ,    (2)

where the expectation is taken under the given sample distribution. The maximizer of J w.r.t. p(·|θ) is, for any fixed W_{θ0}^f, a Dirac distribution concentrated on the minimizer of f. A natural way to update θ is therefore a gradient ascent step in ∇θ J direction. However, the “vanilla” gradient ∇θ J depends on the specific parametrization chosen in θ. In contrast, the natural gradient, denoted by ∇̃θ, is associated to the Fisher metric that is intrinsic to p and independent of the chosen θ-parametrization. Developing ∇̃θ J(θ) under mild assumptions on f and p(·|θ) by exchanging differentiation and integration, recognizing that the gradient ∇̃θ does not act on W_{θ0}^f, using the log-likelihood trick ∇̃θ p(·|θ) = p(·|θ) ∇̃θ ln p(·|θ), and finally setting θ0 = θ yields⁹

    ∇̃θ J(θ) = E(W_θ^f(f(x)) ∇̃θ ln p(x|θ)) .    (3)

⁸ More specifically, W_θ^f : y ↦ w(Pr_{z∼p(·|θ)}(f(z) ≤ y)) computes the p_θ-quantile, or cumulative distribution function, of f(z) with z ∼ p(·|θ) at point y, composed with a non-increasing predefined weight function w : [0, 1] → R (where w(0) > w(1/2) = 0 is advisable). The value of W_θ^f(f(x)) is invariant under strictly monotonous transformations of f. For x ∼ p(·|θ) the distribution of W_θ^f(f(x)) ∼ w(U[0, 1]) depends only on the predefined w; it is independent of θ and f and therefore also (time-)invariant under θ-updates. Given λ samples xk, we have the rank-based consistent estimator W_θ^f(f(xk)) ≈ w((rank(f(xk)) − 1/2)/λ).
⁹ We set θ0 = θ because we will estimate W_{θ0} using the current samples that are distributed according to p(·|θ).

A Monte-Carlo approximation of the expected value by the average finally yields the comparatively simple expression

    ∇̃θ J(θ) ≈ (1/λ) Σ_{k=1}^λ W_θ^f(f(xk)) ∇̃θ ln p(xk|θ)    (4)

(with W_θ^f(f(xk)) acting as preference weight and ∇̃θ ln p(xk|θ) as intrinsic candidate direction) for a natural gradient update of θ, where xk ∼ p(·|θ) is sampled from the current distribution. The natural gradient can be computed as ∇̃θ = F_θ^{−1} ∇θ, where F_θ is the Fisher information matrix expressed in θ-coordinates. For the multivariate Gaussian distribution, ∇̃θ ln p(xk|θ) can indeed be easily expressed and computed efficiently. We find that in CMA-ES (Algorithm 5), the rank-µ update (Line 11 with c1 = 0) and the update in Line 7 are natural gradient updates of C and x, respectively [41, 31], where the kth largest wk is a consistent estimator for the kth largest W_θ^f(f(xk)) [42].

While the natural gradient does not depend on the parametrization of the distribution, a finite step taken in the natural gradient direction does. This becomes relevant for the covariance matrix update, where natural evolution strategies take a different parametrization than CMA-ES. Starting from Line 11 in Algorithm 5, we find for c1 = ch = 0

    C ← (1 − cµ) C + cµ Σ_{zk ∈ P} wk C^{1/2} zk (C^{1/2} zk)^⊤
      = C^{1/2} ((1 − cµ) I + cµ Σ_{zk ∈ P} wk zk zk^⊤) C^{1/2}
      = C^{1/2} (I + cµ (Σ_{zk ∈ P} wk zk zk^⊤ − I)) C^{1/2}      [using Σ wk = 1]
      ≈ C^{1/2} exp^{cµ}(Σ_{zk ∈ P} wk zk zk^⊤ − I) C^{1/2} .    (5)

¹⁰ For a given C on the right hand side of Eq. (5), we have under neutral selection the stationarity condition E(C_new) = C for the first three lines and E(log(C_new)) = log(C) for the last line, where log is the inverse of the matrix exponential exp.

The term bracketed between the matrices C^{1/2} in the lower three lines is a multiplicative covariance matrix update expressed in the natural coordinates, where the covariance matrix is the identity and C^{1/2} serves as coordinate system transformation into the given coordinate system. Only the lower two lines of Eq. (5) do not rely on the constraint Σk wk = 1 in order to satisfy a stationarity condition on C.¹⁰ The last line of Eq. (5) is used in the exponential natural evolution

strategy, xNES [31] and guarantees positive definiteness of C even with negative weights, independent of cµ and of the data zk. The xNES is depicted in Algorithm 6.

Algorithm 6  The Exponential NES (xNES)

0  given n ∈ N+, λ ≥ 5, wk = w0(k)/Σk |w0(k)|, w0(k) ≈ log(λ/2 + 1/2) − log rank(f(xk)), ηc ≈ (5 + λ)/(5 n^{1.5}) ≤ 1, ησ ≈ ηc, ηx ≈ 1
1  initialize C^{1/2} = I, σ ∈ R+, x ∈ R^n
2  while not happy
3      for k ∈ {1, . . . , λ}
4          zk = N(0, I)                           // i.i.d. for all k
5          xk = x + σ C^{1/2} × zk
6      P = {(zk, f(xk)) | 1 ≤ k ≤ λ}
7      x ← x + ηx σ C^{1/2} Σ_{zk ∈ P} wk zk
8      σ ← σ × exp^{ησ/2}(Σ_{zk ∈ P} wk (‖zk‖²/n − 1))
9      C^{1/2} ← C^{1/2} × exp^{ηc/2}(Σ_{zk ∈ P} wk (zk zk^⊤ − (‖zk‖²/n) I))

In xNES, sampling is identical to CMA-ES and environmental selection is omitted entirely. Line 8 resembles the step-size update in (1). Comparing the updates more closely, with cσ = 1 Eq. (1) uses

    µw ‖Σk wk zk‖²/n − 1

whereas xNES uses

    Σk wk (‖zk‖²/n − 1)

for updating σ. For µ = 1 the updates are the same. For µ > 1, the latter only depends on the lengths of the zk, while the former depends on their lengths and directions. Finally, xNES expresses the update Eq. (5) in Line 9 on the Cholesky factor C^{1/2}, which does not remain symmetric in this case (C = C^{1/2} × C^{1/2⊤} still holds). The term −‖zk‖²/n keeps the determinant of C^{1/2} (and thus the trace of log C^{1/2}) constant and is of rather cosmetic nature. Omitting the term is equivalent to using ησ + ηc instead of ησ in line 8.

The exponential natural evolution strategy is a very elegant algorithm. Like CMA-ES it can be interpreted as an incremental Estimation of Distribution Algorithm [43]. However, it performs generally inferior compared to CMA-ES because it does not use search paths for updating σ and C.

3.8 Further Aspects

Internal Parameters  Adaptation and self-adaptation address the control of the most important internal parameters in evolution strategies. Yet, all algorithms presented have hidden and exposed parameters in their implementation. Many of them can be set to reasonable and robust default values. The population size parameters µ and λ however change the search characteristics of an evolution strategy significantly. Larger values, in particular for parent number µ, often help address highly multimodal or noisy problems more successfully. In practice, several experiments or restarts are advisable, where different initial conditions for x and σ can be employed. For exploring different population sizes, a schedule with increasing population size, IPOP, is advantageous [44, 45, 46], because runs with larger populations take typically more function evaluations. Preceding long runs (large µ and λ) with short runs (small µ and λ) leads to a smaller (relative) impairment of the later runs than vice versa.

Internal computational complexity  Algorithms presented in Sections 3.1–3.4 that sample isotropic or axis-parallel mutation distributions have an internal computational complexity linear in the dimension. The internal computational complexity of CMA-ES and xNES is, for constant population size, cubic in the dimension due to the update of C^{1/2}. Typical implementations of the CMA-ES however have quadratic complexity, as they implement a lazy update scheme for C^{1/2}, where C is decomposed into C^{1/2} C^{1/2} only after about n/λ iterations. An exact quadratic update for CMA-ES

has also been proposed [47]. While never considered in the literature, a lazy update for xNES to achieve quadratic complexity seems feasible as well.

Invariance  Selection and recombination in evolution strategies are based solely on the ranks of offspring and parent individuals. As a consequence, the behavior of evolution strategies is invariant under order-preserving (strictly monotonous) transformations of the fitness function value. In particular, all spherical unimodal functions belong to the same function class, of which the convex-quadratic sphere function is the most pronounced member. This function is more thoroughly investigated in Section 4. All algorithms presented are invariant under translations and Algorithms 1, 5 and 6 are invariant under rotations of the coordinate system, provided that the initial x is translated and rotated accordingly.

Parameter control can introduce yet further invariances. All algorithms presented are scale invariant due to step-size adaptation. Furthermore, ellipsoidal functions that are in the reach of the mutation operator of the evolution strategies presented in Sections 3.2 to 3.7 are eventually transformed, effectively, into spherical functions. These evolution strategies are invariant under the respective affine transformations of the search space, given the initial conditions are chosen respectively.

Variants  Evolution strategies have been extended and combined with other approaches in various ways. We mention here constraint handling [48, 49], fitness surrogates [50], multi-objective variants [51, 52], and exploitation of fitness values [53].

4 Theory

There is ample empirical evidence that on many unimodal functions evolution strategies with step-size control, as those outlined in the previous section, converge fast and with probability one to the global optimum. Convergence proofs supporting this evidence are discussed in Section 4.3. On multimodal functions on the other hand, the probability to converge to the global optimum (in a single run of the same strategy) is generally smaller than one (but larger than zero), as suggested by observations and theoretical results [54]. Without parameter control on the other hand, elitist strategies always converge to the essential¹¹ global optimum, however at a much slower rate (compare random search in Figure 2).

In this section we use a time index t to denote iteration and assume, for notational convenience and without loss of generality (due to translation invariance), that the optimum of f is at x* = 0. This simplifies writing x^(t) − x* to simply x^(t) and then ‖x^(t)‖ measures the distance to the optimum of the parental centroid in time step t.

Linear convergence plays a central role for evolution strategies. For a deterministic sequence x^(t), linear convergence (towards zero) takes place if there exists a c > 0 such that

    lim_{t→∞} ‖x^(t+1)‖ / ‖x^(t)‖ = exp(−c) ,    (6)

which means, loosely speaking, that for t large enough, the distance to the optimum decreases in every step by the constant factor exp(−c). Taking the logarithm of Eq. (6), then exchanging the logarithm and the limit, and taking the Cesàro mean yields

    lim_{T→∞} (1/T) Σ_{t=0}^{T−1} log(‖x^(t+1)‖/‖x^(t)‖) = −c ,    (7)

where the telescoping average equals (1/T) log(‖x^(T)‖/‖x^(0)‖).

For a sequence of random vectors we define linear convergence based on Eq. (7) as follows.

Definition 1 (linear convergence). The sequence of random vectors x^(t) converges almost surely linearly to 0 if there exists a c > 0 such that

    −c = lim_{T→∞} (1/T) log(‖x^(T)‖/‖x^(0)‖)    a.s.
       = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} log(‖x^(t+1)‖/‖x^(t)‖)    a.s.    (8)

¹¹ On a bounded domain and with mutation variances bounded away from zero, non-elitist strategies generate a subsequence of x-values converging to the essential global optimum.

19
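Definition 1 lends itself to a direct numerical check: run a strategy, record log(‖x(T)‖/‖x(0)‖), and divide by T. The sketch below does this for a (1+1)-ES on the sphere function; the distance proportional step-size rule and the constant 1.2 are illustrative assumptions for this example only.

```python
import numpy as np

def linear_convergence_rate(n=10, sigma_star=1.2, T=5000, seed=1):
    """Estimate c of Definition 1 for a (1+1)-ES on the sphere
    f(x) = ||x||^2 (optimum at x = 0), using the distance
    proportional step-size sigma = sigma_star * ||x|| / n."""
    rng = np.random.default_rng(seed)
    x = np.ones(n)
    norm0 = np.linalg.norm(x)
    for _ in range(T):
        sigma = sigma_star * np.linalg.norm(x) / n
        y = x + sigma * rng.standard_normal(n)      # offspring
        if np.linalg.norm(y) <= np.linalg.norm(x):  # plus-selection
            x = y
    # -c is approximated by (1/T) log(||x^(T)|| / ||x^(0)||), cf. Eq. (8)
    return -np.log(np.linalg.norm(x) / norm0) / T

c_hat = linear_convergence_rate()
```

For n = 10 the estimate comes out on the order of 10⁻² per iteration; the same recipe applies to any of the strategy variants discussed above.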
The sequence converges in expectation linearly to 0 if there exists a c > 0 such that

    −c = lim_{t→∞} E log( ‖x(t+1)‖ / ‖x(t)‖ ) .    (9)

The constant c is the convergence rate of the algorithm.

Linear convergence hence means that asymptotically in t, the logarithm of the distance to the optimum decreases linearly in t like −ct. This behavior has been observed in Figure 2 for the (1+1)-ES with 1/5th success rule on a unimodal spherical function. Note that λ function evaluations are performed per iteration, and it is then often useful to consider a convergence rate per function evaluation, i.e. to normalize the convergence rate by λ.

The progress rate measures the reduction of the distance to the optimum within a single generation [1].

Definition 2 (progress rate). The normalized progress rate is defined as the expected relative reduction of ‖x(t)‖,

    ϕ∗ = n E[ ( ‖x(t)‖ − ‖x(t+1)‖ ) / ‖x(t)‖ | x(t), s(t) ]
       = n ( 1 − E[ ‖x(t+1)‖ / ‖x(t)‖ | x(t), s(t) ] ) ,    (10)

where the expectation is taken over x(t+1), given (x(t), s(t)). In situations commonly considered in theoretical analyses, ϕ∗ does not depend on x(t) and is expressed as a function of the strategy parameters s(t).

Definitions 1 and 2 are related, in that for a given x(t)

    ϕ∗ ≤ −n log E( ‖x(t+1)‖ / ‖x(t)‖ )    (11)
       ≤ −n E log( ‖x(t+1)‖ / ‖x(t)‖ ) = nc .    (12)

Therefore, progress rate ϕ∗ and convergence rate nc do not agree in general, and we might even observe convergence (c > 0) while ϕ∗ < 0. However, for n → ∞ we typically have ϕ∗ = nc [55].

The normalized progress rate ϕ∗ for evolution strategies has been extensively studied in various situations, see Section 4.2. Scale-invariance and (sometimes artificial) assumptions on the step-size typically ensure that the progress rates do not depend on t.

Another way to describe how fast an algorithm approaches the optimum is to count the number of function evaluations needed to reduce the distance to the optimum by a given factor 1/ε or, similarly, the runtime to hit a ball of radius ε around the optimum, starting, e.g., from distance one.

Definition 3 (runtime). The runtime is the first hitting time of a ball around the optimum. Specifically, the runtime in number of function evaluations as a function of ε reads

    λ × min{ t : ‖x(t)‖ ≤ ε × ‖x(0)‖ } = λ × min{ t : ‖x(t)‖ / ‖x(0)‖ ≤ ε } .    (13)

Linear convergence with rate c as given in Eq. (9) implies that, for ε → 0, the expected runtime divided by log(1/ε) goes to the constant λ/c.

4.1 Lower Runtime Bounds

Evolution strategies with a fixed number of parent and offspring individuals cannot converge faster than linearly, and with a convergence rate of O(1/n). This means that their runtime is lower bounded by a constant times log(1/εⁿ) = n log(1/ε) [56, 57, 58, 59, 60]. This result can be obtained by analyzing the branching factor of the tree of possible paths the algorithm can take. It therefore holds for any optimization algorithm taking decisions based solely on a bounded number of comparisons between fitness values [56, 57, 58]. More specifically, the runtime of any (1 +, λ)-ES with isotropic mutations cannot be asymptotically faster than ∝ n log(1/ε) λ/log(λ) [61]. Considering more restrictive classes of algorithms can provide more precise non-asymptotic bounds [59, 60]. Different approaches address in particular the (1+1)- and (1, λ)-ES and precisely characterize the fastest convergence rate that can be obtained with isotropic normal distributions on any objective function with any step-size adaptation mechanism [62, 55, 63, 64].
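Definition 3 can be exercised with the same illustrative (1+1)-ES sketched earlier (an assumption for the example; here λ = 1, so iterations equal function evaluations). Under linear convergence the runtime grows like log(1/ε)/c, so tightening ε from 10⁻² to 10⁻⁴ should roughly double the hitting time:

```python
import numpy as np

def hitting_time(eps, n=10, sigma_star=1.2, seed=2, t_max=100_000):
    """First iteration t with ||x^(t)|| <= eps * ||x^(0)|| for a
    (1+1)-ES with distance proportional step-size on the sphere;
    with lambda = 1 this t is the runtime of Definition 3."""
    rng = np.random.default_rng(seed)
    x = np.ones(n)
    norm0 = np.linalg.norm(x)
    for t in range(1, t_max + 1):
        sigma = sigma_star * np.linalg.norm(x) / n
        y = x + sigma * rng.standard_normal(n)
        if np.linalg.norm(y) <= np.linalg.norm(x):
            x = y
        if np.linalg.norm(x) <= eps * norm0:
            return t
    return t_max

t_coarse = hitting_time(1e-2)
t_fine = hitting_time(1e-4)   # log(1/eps) doubles, runtime should too
```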
Considering the sphere function, the optimal convergence rate is attained with distance proportional step-size, that is, a step-size proportional to the distance of the parental centroid to the optimum, σ = const × ‖x‖ = σ∗‖x‖/n. Optimal step-size and optimal convergence rate according to Eqs. (8) and (9) can be expressed in terms of expectations of some random variables that are easily simulated numerically. The convergence rate of the (1+1)-ES with distance proportional step-size is shown in Figure 4 as a function of the normalized step-size σ∗ = nσ/‖x‖. The peak of each curve is the upper bound for the convergence rate that can be achieved on any function with any form of step-size adaptation. As for the general bound, the evolution strategy converges linearly and the convergence rate c decreases to zero like 1/n for n → ∞ [55, 65, 64], which is equivalent to linear scaling of the runtime in the dimension. The asymptotic limit for the convergence rate of the (1+1)-ES, as shown in the lowest curve in Figure 4, coincides with the progress rate expression given in the next section.

Figure 4: Normalized convergence rate nc versus normalized step-size nσ/‖x‖ of the (1+1)-ES with distance proportional step-size for n = 2, 3, 5, 10, 20, ∞ (top to bottom). The peaks of the graphs represent the upper bound for the convergence rate of the (1+1)-ES with isotropic mutation (corresponding to the lower runtime bound). The limit curve for n to infinity (lower black curve) reveals the optimal normalized progress rate of ϕ∗ ≈ 0.202 of the (1+1)-ES on sphere functions for n → ∞.

4.2 Progress Rates

This section presents analytical approximations to progress rates of evolution strategies for sphere, ridge, and cigar functions in the limit n → ∞. Both one-generation results and those that consider multiple time steps and cumulative step-size adaptation are considered.

The first analytical progress rate results date back to the early work of Rechenberg [1] and Schwefel [3], who considered the sphere and corridor models and very simple strategy variants. Further results have since been derived for various ridge functions, several classes of convex quadratic functions, and more general constrained linear problems. The strategies that results are available for have increased in complexity as well and today include multi-parent strategies employing recombination as well as several step-size adaptation mechanisms. Only strategy variants with isotropic mutation distributions have been considered up to this point. However, parameter control strategies that successfully adapt the shape of the mutation distribution (such as CMA-ES) effectively transform ellipsoidal functions into (almost) spherical ones, thus lending extra relevance to the analysis of sphere and sphere-like functions.

The simplest convex quadratic functions to be optimized are variants of the sphere function (see also the discussion of invariance in Section 3.8)

    f(x) = ‖x‖² = Σ_{i=1}^{n} x_i² = R² ,

where R denotes the distance from the optimal solution. Expressions for the progress rate of evolution strategies on sphere functions can be computed by decomposing mutation vectors into two components z⊙ and z⊖ as illustrated in Fig. 5. Component z⊙ is the projection of z onto the negative of the gradient vector ∇f of the objective function.
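The remark above that these quantities are easily simulated numerically can be made concrete: with distance proportional step-size the one-step log-ratio distribution does not depend on x, so nc is a plain expectation accessible to Monte Carlo. A sketch (sample size, seed, and σ∗ = 1.2 are arbitrary choices for the example):

```python
import numpy as np

def normalized_rate(sigma_star, n, samples=200_000, seed=3):
    """Monte Carlo estimate of n*c for the (1+1)-ES with distance
    proportional step-size on the sphere. By scale invariance we may
    set ||x|| = 1; plus-selection keeps min(||x + (sigma*/n) z||, 1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[0] = 1.0
    y = x + (sigma_star / n) * rng.standard_normal((samples, n))
    ratio = np.minimum(np.linalg.norm(y, axis=1), 1.0)
    return -n * float(np.mean(np.log(ratio)))   # n * c, cf. Eq. (9)

nc_estimate = normalized_rate(1.2, n=10)
```

Sweeping sigma_star reproduces the curves of Figure 4.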
It contributes positively to the fitness of the offspring candidate solution y = x + z if and only if −∇f(x) · z > 0. Component z⊖ = z − z⊙ is perpendicular to the gradient direction and contributes negatively to the offspring fitness. Its expected squared length exceeds that of z⊙ by a factor of n − 1. Considering normalized quantities σ∗ = σn/R and ϕ∗ = ϕn/R allows giving concise mathematical representations of the scaling properties of various evolution strategies on spherical functions, as shown below. Constant σ∗ corresponds to the distance proportional step-size from Section 4.1.

Figure 5: Decomposition of mutation vector z into a component z⊙ in the direction of the negative of the gradient vector of the objective function and a perpendicular component z⊖.

4.2.1 (1+1)-ES on Sphere Functions

The normalized progress rate of the (1+1)-ES on sphere functions is

    ϕ∗ = (σ∗/√(2π)) e^(−σ∗²/8) − (σ∗²/2) [ (1 − erf(σ∗/√8)) / 2 ]    (14)

in the limit of n → ∞ [1]. The expression in square brackets is the success probability (i.e., the probability that the offspring candidate solution is superior to its parent and thus replaces it). The first term in Eq. (14) is the contribution to the normalized progress rate from the component z⊙ of the mutation vector that is parallel to the gradient vector. The second term results from the component z⊖ that is perpendicular to the gradient direction.

The black curve in Figure 4 illustrates how the normalized progress rate of the (1+1)-ES on sphere functions in the limit n → ∞ depends on the normalized mutation strength. For small normalized mutation strengths, the normalized progress rate is small, as the short steps that are made do not yield significant progress. The success probability is nearly one half. For large normalized mutation strengths, progress is near zero, as the overwhelming majority of steps result in poor offspring that are rejected. The normalized progress rate assumes a maximum value of ϕ∗ = 0.202 at normalized mutation strength σ∗ = 1.224. The range of step-sizes for which close to optimal progress is achieved is referred to as the evolution window [1]. In the runs of the (1+1)-ES with constant step-size shown in Fig. 2, the normalized step-size initially is to the left of the evolution window (large relative distance to the optimal solution) and in the end to its right (small relative distance to the optimal solution), achieving maximal progress at a point in between.

4.2.2 (µ/µ, λ)-ES on Sphere Functions

The normalized progress rate of the (µ/µ, λ)-ES on sphere functions is described by

    ϕ∗ = σ∗ cµ/µ,λ − σ∗²/(2µ)    (15)

in the limit n → ∞ [2]. The term cµ/µ,λ is the expected value of the average of the µ largest order statistics of λ independent standard normally distributed random numbers. For λ fixed, cµ/µ,λ decreases with increasing µ. For fixed truncation ratio µ/λ, cµ/µ,λ approaches a finite limit value as λ and µ increase [15, 8].

It is easily seen from Eq. (15) that the normalized progress rate of the (µ/µ, λ)-ES is maximized by the normalized mutation strength

    σ∗ = µcµ/µ,λ .    (16)
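Eq. (14) is straightforward to evaluate and reproduces the figures quoted in this section (maximum ϕ∗ ≈ 0.202 at σ∗ ≈ 1.224, with success probability about 0.27 there). A sketch:

```python
import math

def progress_rate_1p1(s):
    """Eq. (14): normalized progress rate of the (1+1)-ES on sphere
    functions for n -> infinity, as a function of s = sigma*."""
    p_success = (1 - math.erf(s / math.sqrt(8))) / 2
    return (s / math.sqrt(2 * math.pi) * math.exp(-s * s / 8)
            - s * s / 2 * p_success)

# locate the maximum of Eq. (14) on a fine grid
grid = [i / 1000 for i in range(1, 3001)]
s_opt = max(grid, key=progress_rate_1p1)
phi_opt = progress_rate_1p1(s_opt)
p_at_opt = (1 - math.erf(s_opt / math.sqrt(8))) / 2
```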
The normalized progress rate achieved with that setting is

    ϕ∗ = µc²µ/µ,λ / 2 .    (17)

The progress rate is negative if σ∗ > 2µcµ/µ,λ. Figure 6 illustrates how the optimal normalized progress rate per offspring depends on the population size parameters µ and λ. Two interesting observations can be made from the figure.

• For all but the smallest values of λ, the (µ/µ, λ)-ES with µ > 1 is capable of significantly more rapid progress per offspring than the (1, λ)-ES. This contrasts with findings for the (µ/1, λ)-ES, the performance of which on sphere functions for n → ∞ monotonically deteriorates with increasing µ [8].

• For large λ, the optimal truncation ratio is µ/λ = 0.27, and the corresponding progress per offspring is 0.202. Those values are identical to the optimal success probability and resulting normalized progress rate of the (1+1)-ES. Beyer [8] shows that the correspondence is no coincidence and indeed exact. The step-sizes that the two strategies employ differ widely, however. The optimal step-size of the (1+1)-ES is 1.224; that of the (µ/µ, λ)-ES is µcµ/µ,λ and for fixed truncation ratio µ/λ increases (slightly superlinearly) with the population size. For example, optimal step-sizes of the (µ/µ, 4µ)-ES for µ ∈ {1, 2, 3} are 1.029, 2.276, and 3.538, respectively. If offspring candidate solutions can be evaluated in parallel, the (µ/µ, λ)-ES is preferable to the (1+1)-ES, which does not benefit from the availability of parallel computational resources.

Figure 6: Maximal normalized progress per offspring of the (µ/µ, λ)-ES on sphere functions for n → ∞ plotted against the truncation ratio. The curves correspond to, from bottom to top, λ = 4, 10, 40, 100, ∞. The dotted line represents the maximal progress rate of the (1+1)-ES.

Equation (15) holds in the limit n → ∞ for any finite value of λ. In finite but high dimensional search spaces, it can serve as an approximation to the normalized progress rate of the (µ/µ, λ)-ES on sphere functions in the vicinity of the optimal step-size, provided that λ is not too large. A better approximation for finite n is derived in [15, 8] (however, compare also [55]).

The improved performance of the (µ/µ, λ)-ES for µ > 1 compared to the strategy that uses µ = 1 is a consequence of the factor µ in the denominator of the term in Eq. (15) that contributes negatively to the normalized progress rate. The components z⊙ of mutation vectors selected for survival are correlated and likely to point in the direction opposite to the gradient vector. The perpendicular components z⊖ in the limit n → ∞ have no influence on whether a candidate solution is selected for survival and are thus uncorrelated. The recombinative averaging of mutation vectors results in a length of the z⊙-component similar to those of individual mutation vectors. However, the squared length of the components perpendicular to the gradient direction is reduced by a factor of µ, resulting in the reduction of the negative term in Eq. (15) by a factor of µ. Beyer [15] has coined the term genetic repair for this phenomenon.

Weighted recombination (compare Algorithms 5 and 6) can significantly increase the progress rate of the (µ/µ, λ)-ES on sphere functions. If n is large, the kth best candidate solution is optimally associated with a weight proportional to the expected value of the kth largest order statistic of a sample of λ independent standard normally distributed random numbers. The resulting optimal normalized progress rate per offspring candidate solution for large values of λ then approaches a value of 0.5, exceeding that of optimal unweighted recombination by a factor of almost two and a half [13].
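The progress coefficient cµ/µ,λ has no elementary closed form but is easy to estimate by Monte Carlo, which reproduces the optimal step-sizes µcµ/µ,λ quoted above (1.029 for the (1/1, 4)-ES and 2.276 for the (2/2, 8)-ES). A sketch with an arbitrarily chosen sample size:

```python
import numpy as np

def c_mu_lam(mu, lam, samples=200_000, seed=4):
    """Monte Carlo estimate of c_{mu/mu,lambda}: the expected average
    of the mu largest of lambda standard normal variates."""
    rng = np.random.default_rng(seed)
    z = np.sort(rng.standard_normal((samples, lam)), axis=1)
    return float(z[:, lam - mu:].mean())

# Eq. (16)/(17): optimal sigma* = mu*c, optimal phi* = mu*c^2/2
opt_sigma_1_4 = 1 * c_mu_lam(1, 4)
opt_sigma_2_8 = 2 * c_mu_lam(2, 8)
```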
The weights are symmetric about zero. If only positive weights are employed and µ = ⌊λ/2⌋, the optimal normalized progress rate per offspring with increasing λ approaches a value of 0.25. The weights in Algorithms 5 and 6 closely resemble those positive weights.

4.2.3 (µ/µ, λ)-ES on Noisy Sphere Functions

Noise in the objective function is most commonly modeled as being Gaussian. If evaluation of a candidate solution x yields a noisy objective function value f(x) + σε N(0, 1), then inferior candidate solutions will sometimes be selected for survival and superior ones discarded. As a result, progress rates decrease with increasing noise strength σε. Introducing the normalized noise strength σε∗ = σε n/(2R²), in the limit n → ∞ the normalized progress rate of the (µ/µ, λ)-ES on noisy sphere functions is

    ϕ∗ = σ∗ cµ/µ,λ / √(1 + ϑ²) − σ∗²/(2µ) ,    (18)

where ϑ = σε∗/σ∗ is the noise-to-signal ratio that the strategy operates under [66]. Noise does not impact the term that contributes negatively to the strategy's progress. However, it acts to reduce the magnitude of the positive term stemming from the contributions of mutation vectors parallel to the gradient direction. Notice that unless the noise scales such that σε∗ is independent of the location in search space (i.e., the standard deviation of the noise term increases in direct proportion to f(x), such as in a multiplicative noise model with constant noise strength), Eq. (18) describes progress in single time steps only rather than a rate of convergence.

Figure 7 illustrates for different offspring population sizes λ how the optimal progress rate per offspring depends on the noise strength. The curves have been obtained from Eq. (18) for optimal values of σ∗ and µ. As the averaging of mutation vectors results in a vector of reduced length, increasing λ (and µ along with it) allows the strategy to operate using larger and larger step-sizes. Increasing the step-size reduces the noise-to-signal ratio ϑ that the strategy operates under and thereby reduces the impact of noise on selection for survival. Through genetic repair, the (µ/µ, λ)-ES thus implicitly implements the rescaling of mutation vectors proposed in [2] for the (1, λ)-ES in the presence of noise. (Compare cm and ηx in Algorithms 5 and 6 that, for values smaller than one, implement the explicit rescaling.) It needs to be emphasized, though, that in finite-dimensional search spaces, the ability to increase λ without violating the assumptions made in the derivation of Eq. (18) is severely limited. Nonetheless, the benefits resulting from genetic repair are significant, and the performance of the (µ/µ, λ)-ES is much more robust in the presence of noise than that of the (1+1)-ES.

Figure 7: Optimal normalized progress rate per offspring of the (µ/µ, λ)-ES on noisy sphere functions for n → ∞ plotted against the normalized noise strength. The solid lines depict results for, from bottom to top, λ = 4, 10, 40, 100, ∞ and optimally chosen µ. The dashed line represents the optimal progress rate of the (1+1)-ES [67].

4.2.4 Cumulative Step-Size Adaptation

All progress rate results discussed up to this point consider single time steps of the respective evolution strategies only. Analyses of the behavior of evolution strategies that include some form of step-size adaptation are considerably more difficult. Even for objective functions as simple as sphere functions, the state of the strategy is described by several variables with nonlinear, stochastic dynamics, and simplifying assumptions need to be made in order to arrive at quantitative results.
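Eq. (18) can be explored numerically by maximizing it over σ∗ on a grid for several noise strengths. The values µ = 3 and cµ/µ,λ = 1 below are placeholders chosen only for this example; the zero-noise optimum µc²µ/µ,λ/2 is recovered, and the achievable progress decreases as the noise strength grows:

```python
import math

MU, C = 3.0, 1.0   # placeholder values for the example

def progress_noisy(sigma, sigma_eps):
    """Eq. (18): phi* = sigma*c/sqrt(1 + theta^2) - sigma^2/(2 mu),
    with noise-to-signal ratio theta = sigma_eps / sigma."""
    theta = sigma_eps / sigma
    return sigma * C / math.sqrt(1 + theta ** 2) - sigma ** 2 / (2 * MU)

def best_progress(sigma_eps):
    """Maximize Eq. (18) over sigma* on a grid covering (0, 20]."""
    grid = [i / 200 for i in range(1, 4001)]
    return max(progress_noisy(s, sigma_eps) for s in grid)

phi0, phi1, phi2 = best_progress(0.0), best_progress(2.0), best_progress(4.0)
```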
In the following we consider the (µ/µ, λ)-ES with cumulative step-size adaptation (Algorithm 4 with Eq. (1) in place of Lines 8 and 8b for mathematical convenience) and parameters set such that cσ → 0 as n → ∞ and d = Θ(1). The state of the strategy on noisy sphere functions with σε∗ = const (i.e., noise that decreases in strength as the optimal solution is approached) is described by the distance R of the parental centroid from the optimal solution, the normalized step-size σ∗, the length of the search path s parallel to the direction of the gradient vector of the objective function, and that path's overall squared length. After initialization effects have faded, the distribution of the latter three quantities is time invariant. Mean values of the time invariant distribution can be approximated by computing expected values of the variables after a single iteration of the strategy in the limit n → ∞ and imposing the condition that those be equal to the respective values before that iteration. Solving the resulting system of equations for σε∗ ≤ √2 µcµ/µ,λ yields

    σ∗ = µcµ/µ,λ √( 2 − (σε∗/(µcµ/µ,λ))² )    (19)

for the average normalized mutation strength assumed by the strategy [68, 69]. The corresponding normalized progress rate

    ϕ∗ = ((√2 − 1)/2) µc²µ/µ,λ [ 2 − (σε∗/(µcµ/µ,λ))² ]    (20)

is obtained from Eq. (18). Both the average mutation strength and the resulting progress rate are plotted against the noise strength in Fig. 8. For small noise strengths, cumulative step-size adaptation generates mutation strengths that are larger than optimal. The evolution window continually shifts toward smaller values of the step-size, and adaptation remains behind its target. However, the resulting mutation strengths achieve progress rates within 20 percent of the optimal ones. For large noise strengths the situation is reversed, and the mutation strengths generated by cumulative step-size adaptation are smaller than optimal. However, increasing the population size parameters µ and λ allows shifting the operating regime of the strategy toward the left hand side of the graphs in Fig. 8, where step-sizes are near optimal. As above, it is important to keep in mind the limitations of the results derived in the limit n → ∞. In finite dimensional search spaces the ability to compensate for large amounts of noise by increasing the population size is more limited than Eqs. (19) and (20) suggest.

Figure 8: Normalized mutation strength and normalized progress rate of the (µ/µ, λ)-ES with cumulative step-size adaptation on noisy sphere functions for n → ∞ plotted against the normalized noise strength. The dashed lines depict optimal values.
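Eqs. (19) and (20) quantify the statement about progress within 20 percent of optimal: at zero noise, cumulative step-size adaptation realizes σ∗ = √2 µcµ/µ,λ instead of the optimal µcµ/µ,λ, and the resulting progress rate is 2(√2 − 1) ≈ 0.83 times the optimum µc²µ/µ,λ/2. A sketch with placeholder values µ = 3 and cµ/µ,λ = 1 chosen only for the example:

```python
import math

MU, C = 3.0, 1.0   # placeholder values for the example

def sigma_csa(noise):
    """Eq. (19): average normalized step-size realized by cumulative
    step-size adaptation, valid for noise <= sqrt(2)*MU*C."""
    return MU * C * math.sqrt(2 - (noise / (MU * C)) ** 2)

def phi_csa(noise):
    """Eq. (20): the corresponding normalized progress rate."""
    return (math.sqrt(2) - 1) / 2 * MU * C * C * (2 - (noise / (MU * C)) ** 2)

sigma0 = sigma_csa(0.0)                    # sqrt(2)*MU*C, above the optimal MU*C
ratio = phi_csa(0.0) / (MU * C * C / 2)    # realized / optimal = 2*(sqrt(2) - 1)
phi_edge = phi_csa(math.sqrt(2) * MU * C)  # progress vanishes at the noise limit
```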
4.2.5 Parabolic Ridge Functions

A class of test functions that poses difficulties very different from those encountered in connection with sphere functions are ridge functions,

    f(x) = x₁ + ξ ( Σ_{i=2}^{n} x_i² )^(α/2) = x₁ + ξR^α ,

which include the parabolic ridge for α = 2. The x₁-axis is referred to as the ridge axis, and R denotes the distance from that axis. Progress can be made by minimizing the distance from the ridge axis or by proceeding along it. The former requires decreasing step-sizes and is limited in its effect, as R ≥ 0. The latter allows indefinite progress and requires that the step-size does not decrease to zero. Short and long term goals may thus be conflicting, and inappropriate step-size adaptation may lead to stagnation.

As an optimal solution to the ridge problem does not exist, the progress rate ϕ of the (µ/µ, λ)-ES on ridge functions is defined as the expectation of the step made in the direction of the negative ridge axis. For constant step-size, the distance R of the parental centroid from the ridge axis assumes a time-invariant limit distribution. An approximation to the mean value of that distribution can be obtained by identifying that value of R for which the expected change is zero. Using this value yields

    ϕ = 2µc²µ/µ,λ / ( nξ ( 1 + √( 1 + (2µcµ/µ,λ/(nξσ))² ) ) )    (21)

for the progress rate of the (µ/µ, λ)-ES on parabolic ridge functions [70]. The strictly monotonic behavior of the progress rate, increasing from a value of zero for σ = 0 to ϕ = µc²µ/µ,λ/(nξ) for σ → ∞, is fundamentally different from that observed on sphere functions. However, the derivative of the progress rate with regard to the step-size tends to zero for large values of σ. The limited time horizon of any search, as well as the intent of using ridge functions as local rather than global models of practically relevant objective functions, both suggest that it may be unwise to increase the step-size without bounds.

The performance of cumulative step-size adaptation on parabolic ridge functions can be studied using the same approach as described above for sphere functions, yielding

    σ = µcµ/µ,λ / (√2 nξ)    (22)

for the (finite) average mutation strength [71]. From Eq. (21), the corresponding progress rate

    ϕ = µc²µ/µ,λ / (2nξ)    (23)

is greater than half of the progress rate attained with any finite step-size.

4.2.6 Cigar Functions

While parabolic ridge functions provide an environment for evaluating whether step-size adaptation mechanisms are able to avoid stagnation, the ability to make continual meaningful positive progress with some constant nonzero step-size is of course atypical for practical optimization problems. A class of ridge-like functions that requires continual adaptation of the mutation strength, and is thus a more realistic model of problems requiring ridge following, are cigar functions

    f(x) = x₁² + ξ Σ_{i=2}^{n} x_i² = x₁² + ξR²

with parameter ξ ≥ 1 being the condition number of the Hessian matrix. Small values of ξ result in sphere-like characteristics, large values in ridge-like ones. As above, R measures the distance from the x₁-axis.

Assuming successful adaptation of the step-size, evolution strategies exhibit linear convergence on cigar functions. The expected relative per-iteration change in the objective function value of the population centroid is referred to as the quality gain ∆ and determines the rate of convergence. In the limit n → ∞ it is described by

    ∆∗ = σ∗² / (2µ(ξ − 1))          if σ∗ < 2µcµ/µ,λ (ξ − 1)/ξ
    ∆∗ = cµ/µ,λ σ∗ − σ∗²/(2µ)       otherwise,

where σ∗ = σn/R and ∆∗ = ∆n/2 [72]. That relationship is illustrated in Fig. 9 for several values of the conditioning parameter. The parabola for ξ = 1 reflects the simple quadratic relationship for sphere functions seen in Eq. (15).
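The formulas above fit together consistently, which is easy to verify numerically: Eq. (21) increases monotonically toward the supremum µc²µ/µ,λ/(nξ); inserting the step-size of Eq. (22) into Eq. (21) yields exactly half of that supremum, i.e. Eq. (23); and the two branches of the cigar quality gain agree at the regime boundary, with the ξ = 1 case collapsing to Eq. (15). The sketch below uses placeholder values µ = 3, cµ/µ,λ = 1, n = 10 chosen only for the example:

```python
import math

MU, C, N = 3.0, 1.0, 10.0   # placeholder values for the example

def phi_ridge(sigma, xi):
    """Eq. (21): progress rate of the (mu/mu,lambda)-ES on the
    parabolic ridge for constant step-size sigma."""
    t = (2 * MU * C / (N * xi * sigma)) ** 2
    return 2 * MU * C * C / (N * xi * (1 + math.sqrt(1 + t)))

def quality_gain_cigar(s, xi):
    """Piecewise quality gain on cigar functions for n -> infinity:
    ridge regime for small sigma*, sphere regime otherwise."""
    if xi > 1 and s < 2 * MU * C * (xi - 1) / xi:
        return s * s / (2 * MU * (xi - 1))   # ridge regime
    return C * s - s * s / (2 * MU)          # sphere regime

XI = 2.0
sigma_csa = MU * C / (math.sqrt(2) * N * XI)   # Eq. (22)
phi_limit = MU * C * C / (N * XI)              # supremum of Eq. (21)
phi_at_csa = phi_ridge(sigma_csa, XI)          # should equal Eq. (23)

b = 2 * MU * C * (100.0 - 1) / 100.0           # cigar regime boundary at xi = 100
gap = abs(quality_gain_cigar(b - 1e-9, 100.0)
          - quality_gain_cigar(b + 1e-9, 100.0))
sphere_case = quality_gain_cigar(MU * C, 1.0)  # Eq. (15) at its optimum
```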
(For the case of sphere functions, normalized progress rate and normalized quality gain are the same.) For cigar functions with large values of ξ, two separate regimes can be identified. For small step-sizes, the quality gain of the strategy is limited by the size of the steps that can be made in the direction of the x₁-axis. The x₁-component of the population centroid virtually never changes sign. The search process resembles one of ridge following, and we refer to the regime as the ridge regime. In the other regime, the step-size is such that the quality gain of the strategy is effectively limited by the ability to approach the optimal solution in the subspace spanned by the x₂, …, xₙ-axes. The x₁-component of the population centroid changes sign much more frequently than in the ridge regime, as is the case on sphere functions. We thus refer to the regime as the sphere regime.

Figure 9: Normalized quality gain of the (µ/µ, λ)-ES on cigar functions for n → ∞ plotted against the normalized mutation strength for ξ ∈ {1, 4, 100}. The vertical line represents the average normalized mutation strength generated by cumulative step-size adaptation.

The approach to the analysis of the behavior of cumulative step-size adaptation explained above for sphere and parabolic ridge functions can be applied to cigar functions as well, yielding

    σ∗ = √2 µcµ/µ,λ

for the average normalized mutation strength generated by cumulative step-size adaptation [72]. The corresponding normalized quality gain is

    ∆∗ = (√2 − 1) µc²µ/µ,λ       if ξ < √2/(√2 − 1)
    ∆∗ = µc²µ/µ,λ / (ξ − 1)      otherwise.

Both are compared with optimal values in Fig. 10. For small condition numbers, the (µ/µ, λ)-ES operates in the sphere regime and is within 20 percent of the optimal quality gain, as seen above. For large condition numbers, the strategy operates in the ridge regime and achieves a quality gain within a factor of two of the optimal one, in accordance with the findings for the parabolic ridge above.

4.2.7 Further Work

Further research regarding the progress rate of evolution strategies in different test environments includes work analyzing the behavior of mutative self-adaptation for linear [22], spherical [73], and ridge functions [74]. Hierarchically organized evolution strategies have been studied when applied to both parabolic ridge and sphere functions [75, 76]. Several step-size adaptation techniques have been compared for ridge functions, including, but not limited to, parabolic ones [77]. A further class of convex quadratic functions for which quality gain results have been derived is characterized by the occurrence of only two distinct eigenvalues of the Hessian, both of which occur with high multiplicity [78, 79].

An analytical investigation of the behavior of the (1+1)-ES on noisy sphere functions finds that failure to reevaluate the parental candidate solution results in the systematic overvaluation of the parent and thus in potentially long periods of stagnation [67]. Contrary to what might be expected, the increased difficulty of replacing parental candidate solutions can have a positive effect on progress rates, as it tends to prevent the selection for survival of offspring candidate solutions solely due to favorable noise values. The convergence behavior of the (1+1)-ES on finite dimensional sphere functions is studied by Jebalia et al. [80], who show that the additive noise model is inappropriate in finite dimensions unless the parental
mutation strength σ ∗ /(µcµ/µ,λ )

2.0
analyses of simple constraint handling techniques [84,
85, 86].
1.5
4.3 Convergence Proofs
1.0
In the previous section we have described theoretical
results that involve approximations in their deriva-
0.5 tion and consider the limit for n → ∞. In this sec-
realised
optimal tion, exact results are discussed.
0.0 Convergence proofs with only mild assumptions on
1.0 10.0 100.0
the objective function are easy to obtain for evolution
condition number ξ strategies with a step-size that is effectively bounded
from below and above (and, for non-elitist strate-
1.0e+00
gies, when additionally the search space is bounded)
quality gain ∆∗ /(µc2µ/µ,λ )

realised
optimal [63, 12]. In this case, the expected runtime to reach
an -ball around the global optimum (see also Defi-
nition 3) cannot be faster than ∝ 1/n , as obtained
1.0e-01 with pure random search for  → 0 or n → ∞.12 Sim-
ilarly, convergence proofs can be obtained for adap-
tive strategies that include provisions for using a fixed
step-size and covariance matrix with some constant
[Figure 10: Normalized mutation strength and normalized quality gain of the (µ/µ, λ)-ES with cumulative step-size adaptation on cigar functions for n → ∞, plotted against the condition number ξ of the cigar. The dashed curves represent optimal values. (Plot not reproduced.)]

candidate solution is reevaluated, and who suggest a multiplicative noise model instead. An analysis of the behavior of the (µ, λ)-ES (without recombination) for noisy sphere functions finds that, in contrast to the situation in the absence of noise, strategies with µ > 1 can outperform the (1, λ)-ES if there is noise present [81]. The use of non-singleton populations increases the signal-to-noise ratio and thus allows for more effective selection of good candidate solutions. The effects of non-Gaussian forms of noise on the performance of the (µ/µ, λ)-ES applied to the optimization of sphere functions have also been investigated [82]. Finally, there are some results regarding the optimization of time-varying objectives [83] as well as […] probability.

Convergence proofs for strategy variants that do not explicitly ensure that long steps are sampled for a sufficiently long time typically require much stronger restrictions on the set of objective functions that they hold for. Such proofs, however, have the potential to reveal much faster, namely linear, convergence. Evolution strategies with the artificial distance-proportional step-size σ = const × ‖x‖ exhibit, as shown above, linear convergence on the sphere function with an associated runtime proportional to log(1/ε) [88, 62, 80, 64]. This result can easily be proved by using a law of large numbers, because the ratios ‖x^(t+1)‖/‖x^(t)‖ are independent and identically distributed for all t.

Without the artificial choice of step-size, σ/‖x‖ becomes a random variable. If this random variable is a homogeneous Markov chain and stable enough to satisfy the law of large numbers, linear convergence is maintained [88, 63].

¹² If the mutation distribution is not normal and exhibits a singularity in zero, convergence can be much faster than with random search even when the step-size is bounded away from zero [87].
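The law-of-large-numbers argument can be illustrated numerically. The sketch below (not taken from the chapter; the constant c and all parameter values are ad hoc illustrative choices) runs a (1+1)-ES with the artificial distance-proportional step-size σ = c · ‖x‖ on the sphere function f(x) = ‖x‖². Because σ is coupled to ‖x‖, the per-iteration ratios ‖x^(t+1)‖/‖x^(t)‖ are independent and identically distributed, so log ‖x^(t)‖ decreases on average linearly in t, at a rate estimated by the sample mean of the log-ratios.

```python
import numpy as np

def scale_invariant_es(n=10, c=0.12, iters=3000, seed=3):
    """(1+1)-ES with distance-proportional step-size sigma = c*||x||
    minimizing the sphere function f(x) = ||x||^2 (illustrative sketch;
    c = 0.12 is an ad hoc choice, roughly c*n ~ 1.2 for n = 10)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    log_ratios = []                             # log(||x^(t+1)|| / ||x^(t)||), i.i.d. here
    for _ in range(iters):
        sigma = c * np.linalg.norm(x)           # step-size proportional to the distance
        y = x + sigma * rng.standard_normal(n)  # isotropic Gaussian mutation
        if y @ y <= x @ x:                      # plus-selection: keep the better point
            log_ratios.append(0.5 * np.log((y @ y) / (x @ x)))
            x = y
        else:
            log_ratios.append(0.0)              # rejected offspring: ratio equals 1
    return x, float(np.mean(log_ratios))

x_final, rate = scale_invariant_es()
print(rate)                     # negative: estimated linear convergence rate per iteration
print(np.linalg.norm(x_final))  # many orders of magnitude below the starting distance
```

With an adaptive rather than distance-proportional step-size, the same argument requires establishing stability of σ/‖x‖ as a Markov chain, which is the difficulty addressed by the proofs discussed in the text.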
The stability of the Markov chain associated to the self-adaptive (1, λ)-ES on the sphere function has been shown in dimension n = 1 [89], thus providing a proof of linear convergence of this algorithm. The extension of this proof to higher dimensions is straightforward.

Proofs that are formalized as upper bounds on the time needed to reduce the distance to the optimum by a given factor can also capture the linear dependency of the convergence rate on the dimension n. The (1 + λ)- and the (1, λ)-ES with common variants of the 1/5th success rule converge linearly on the sphere function with a runtime of O(n log(1/ε) λ/√(log λ)) [90, 61]. When λ is smaller than O(n), the (1 + λ)-ES with a modified success rule is even √(log λ) times faster and therefore matches the general lower runtime bound Ω(n log(1/ε) λ/log(λ)) [61, Theorem 5]. On convex-quadratic functions, the asymptotic runtime of the (1+1)-ES is the same as on the sphere function and, at least in some cases, proportional to the condition number of the problem [91].

Convergence proofs for modern evolution strategies with recombination, such as the CSA-ES, CMA-ES, or xNES, are not yet available; however, we believe that some of them are likely to be achieved in the coming decade.

References

[1] I. Rechenberg: Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution (Frommann-Holzboog Verlag, 1973)

[2] I. Rechenberg: Evolutionsstrategie ’94 (Frommann-Holzboog Verlag, 1994)

[3] H.-P. Schwefel: Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie (Birkhäuser, 1977)

[4] H.-P. Schwefel: Evolution and Optimum Seeking (Wiley, 1995)

[5] L. J. Fogel, A. J. Owens, M. J. Walsh: Artificial Intelligence through Simulated Evolution (Wiley, 1966)

[6] H.-G. Beyer, H.-P. Schwefel: Evolution strategies — A comprehensive introduction, Natural Computing 1(1), 3 – 52 (2002)

[7] D. B. Fogel: Evolutionary Computation: The Fossil Record (Wiley – IEEE Press, 1998)

[8] H.-G. Beyer: The Theory of Evolution Strategies (Springer Verlag, 2001)

[9] D. E. Goldberg: Genetic Algorithms in Search, Optimization and Machine Learning (Addison Wesley, 1989)

[10] N. Hansen, A. Ostermeier: Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation 9(2), 159 – 195 (2001)

[11] H.-P. Schwefel, G. Rudolph: Contemporary evolution strategies, Advances in Artificial Life, ed. by F. Morán et al. (Springer Verlag 1995) 891 – 907

[12] G. Rudolph: Convergence Properties of Evolutionary Algorithms (Verlag Dr. Kovač, 1997)

[13] D. V. Arnold: Weighted multirecombination evolution strategies, Theoretical Computer Science 361(1), 18 – 37 (2006)

[14] H. Mühlenbein, D. Schlierkamp-Voosen: Predictive models for the breeder genetic algorithm I. Continuous parameter optimization, Evolutionary Computation 1(1), 25 – 49 (1993)

[15] H.-G. Beyer: Toward a theory of evolution strategies: On the benefits of sex — The (µ/µ, λ) theory, Evolutionary Computation 3(1), 81 – 111 (1995)

[16] C. Kappler: Are evolutionary algorithms improved by large mutations?, Parallel Problem Solving from Nature (PPSN IV), ed. by H.-M. Voigt et al. (Springer Verlag 1996) 346 – 355

[17] G. Rudolph: Local convergence rates of simple evolutionary algorithms with Cauchy mutations, IEEE Transactions on Evolutionary Computation 1(4), 249 – 258 (1997)
[18] X. Yao, Y. Liu, G. Lin: Evolutionary programming made faster, IEEE Transactions on Evolutionary Computation 3(2), 82 – 102 (1999)

[19] M. Herdy: The number of offspring as strategy parameter in hierarchically organized evolution strategies, ACM SIGBIO Newsletter 13(2), 2 – 9 (1993)

[20] M. Schumer, K. Steiglitz: Adaptive step size random search, IEEE Transactions on Automatic Control 13(3), 270 – 276 (1968)

[21] S. Kern, S. D. Müller, N. Hansen, D. Büche, J. Ocenasek, P. Koumoutsakos: Learning probability distributions in continuous evolutionary algorithms — A comparative review, Natural Computing 3(1), 77 – 112 (2004)

[22] N. Hansen: An analysis of mutative σ-self-adaptation on linear fitness functions, Evolutionary Computation 14(3), 255 – 275 (2006)

[23] A. Ostermeier, A. Gawelczyk, N. Hansen: A derandomized approach to self-adaptation of evolution strategies, Evolutionary Computation 2(4), 369 – 380 (1994)

[24] T. Runarsson: Reducing random fluctuations in mutative self-adaptation, Parallel Problem Solving from Nature (PPSN VII), ed. by J. J. Merelo Guervós et al. (Springer Verlag 2002) 194 – 203

[25] A. Ostermeier, A. Gawelczyk, N. Hansen: Step-size adaptation based on non-local use of selection information, Parallel Problem Solving from Nature (PPSN III), ed. by Y. Davidor et al. (Springer Verlag 1994) 189 – 198

[26] N. Hansen, A. Ostermeier, A. Gawelczyk: On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation, International Conference on Genetic Algorithms (ICGA ’95), ed. by L. J. Eshelman (Morgan Kaufmann 1995) 57 – 64

[27] N. Hansen, A. Ostermeier: Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, International Conference on Evolutionary Computation (ICEC ’96) (IEEE Press 1996) 312 – 317

[28] A. Ostermeier, N. Hansen: An evolution strategy with coordinate system invariant adaptation of arbitrary normal mutation distributions within the concept of mutative strategy parameter control, Genetic and Evolutionary Computation Conference (GECCO 1999), ed. by W. Banzhaf et al. (Morgan Kaufmann 1999) 902 – 909

[29] D. Wierstra, T. Schaul, J. Peters, J. Schmidhuber: Natural evolution strategies, IEEE Congress on Evolutionary Computation (CEC 2008) (IEEE Press 2008) 3381 – 3387

[30] Y. Sun, D. Wierstra, T. Schaul, J. Schmidhuber: Efficient natural evolution strategies, Genetic and Evolutionary Computation Conference (GECCO 2009) (ACM Press 2009) 539 – 546

[31] T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, J. Schmidhuber: Exponential natural evolution strategies, Genetic and Evolutionary Computation Conference (GECCO 2010) (ACM Press 2010) 393 – 400

[32] N. Hansen: Invariance, self-adaptation and correlated mutations in evolution strategies, Parallel Problem Solving from Nature (PPSN VI), ed. by M. Schoenauer et al. (Springer Verlag 2000) 355 – 364

[33] A. Ostermeier: An evolution strategy with momentum adaptation of the random number distribution, Parallel Problem Solving from Nature (PPSN II), ed. by R. Männer, B. Manderick (Elsevier 1992) 199 – 208

[34] J. Poland, A. Zell: Main vector adaptation: A CMA variant with linear time and space complexity, Genetic and Evolutionary Computation Conference (GECCO 2001), ed. by L. Spector et al. (Morgan Kaufmann 2001) 1050 – 1055
[35] J. N. Knight, M. Lunacek: Reducing the space-time complexity of the CMA-ES, Genetic and Evolutionary Computation Conference (GECCO 2007) (ACM Press 2007) 658 – 665

[36] N. Hansen, S. Kern: Evaluating the CMA evolution strategy on multimodal test functions, Parallel Problem Solving from Nature (PPSN VIII), ed. by X. Yao et al. (Springer Verlag 2004) 282 – 291

[37] G. A. Jastrebski, D. V. Arnold: Improving evolution strategies through active covariance matrix adaptation, IEEE Congress on Evolutionary Computation (CEC 2006) (IEEE Press 2006) 2814 – 2821

[38] H.-G. Beyer: Mutate large, but inherit small! On the analysis of rescaled mutations in (1̃, λ̃)-ES with noisy fitness data, Parallel Problem Solving from Nature (PPSN V), ed. by A. E. Eiben et al. (Springer Verlag 1998) 109 – 118

[39] S. I. Amari: Natural gradient works efficiently in learning, Neural Computation 10(2), 251 – 276 (1998)

[40] Y. Sun, D. Wierstra, T. Schaul, J. Schmidhuber: Stochastic search using the natural gradient, International Conference on Machine Learning (ICML ’09), ed. by A. P. Danyluk et al. (ACM Press 2009) 1161 – 1168

[41] Y. Akimoto, Y. Nagata, I. Ono, S. Kobayashi: Bidirectional relation between CMA evolution strategies and natural evolution strategies, Parallel Problem Solving from Nature (PPSN XI), ed. by R. Schaefer et al. (Springer Verlag 2010) 154 – 163

[42] L. Arnold, A. Auger, N. Hansen, Y. Ollivier: Information-geometric optimization algorithms: A unifying picture via invariance principles, ArXiv e-prints (2011)

[43] M. Pelikan, M. W. Hauschild, F. G. Lobo: Introduction to estimation of distribution algorithms. In: Handbook of Computational Intelligence, ed. by J. Kacprzyk, W. Pedrycz (Springer Verlag 2013)

[44] G. R. Harik, F. G. Lobo: A parameter-less genetic algorithm, Genetic and Evolutionary Computation Conference (GECCO 1999), ed. by W. Banzhaf et al. (Morgan Kaufmann 1999) 258 – 265

[45] F. G. Lobo, D. E. Goldberg: The parameter-less genetic algorithm in practice, Information Sciences 167(1), 217 – 232 (2004)

[46] A. Auger, N. Hansen: A restart CMA evolution strategy with increasing population size, IEEE Congress on Evolutionary Computation (CEC 2005) (IEEE Press 2005) 1769 – 1776

[47] T. Suttorp, N. Hansen, C. Igel: Efficient covariance matrix update for variable metric evolution strategies, Machine Learning 75(2), 167 – 197 (2009)

[48] Z. Michalewicz, M. Schoenauer: Evolutionary algorithms for constrained parameter optimization problems, Evolutionary Computation 4(1), 1 – 32 (1996)

[49] E. Mezura-Montes, C. A. Coello Coello: Constraint-handling in nature-inspired numerical optimization: Past, present, and future, Swarm and Evolutionary Computation 1(4), 173 – 194 (2011)

[50] M. Emmerich, A. Giotis, M. Özdemir, T. Bäck, K. Giannakoglou: Metamodel-assisted evolution strategies, Parallel Problem Solving from Nature (PPSN VII), ed. by J. J. Merelo Guervós et al. (Springer Verlag 2002) 361 – 370

[51] C. Igel, N. Hansen, S. Roth: Covariance matrix adaptation for multi-objective optimization, Evolutionary Computation 15(1), 1 – 28 (2007)

[52] T. Voß, N. Hansen, C. Igel: Improved step size adaptation for the MO-CMA-ES, Genetic and Evolutionary Computation Conference (GECCO 2010) (ACM Press 2010) 487 – 494

[53] R. Salomon: Evolutionary algorithms and gradient search: Similarities and differences, IEEE Transactions on Evolutionary Computation 2(2), 45 – 55 (1998)
[54] G. Rudolph: Self-adaptive mutations may lead to premature convergence, IEEE Transactions on Evolutionary Computation 5(4), 410 – 414 (2001)

[55] A. Auger, N. Hansen: Reconsidering the progress rate theory for evolution strategies in finite dimensions, Genetic and Evolutionary Computation Conference (GECCO 2006) (ACM Press 2006) 445 – 452

[56] O. Teytaud, S. Gelly: General lower bounds for evolutionary algorithms, Parallel Problem Solving from Nature (PPSN IX), ed. by T. P. Runarsson et al. (Springer Verlag 2006) 21 – 31

[57] H. Fournier, O. Teytaud: Lower bounds for comparison based evolution strategies using VC-dimension and sign patterns, Algorithmica 59(3), 387 – 408 (2011)

[58] O. Teytaud: Lower bounds for evolution strategies. In: Theory of Randomized Search Heuristics: Foundations and Recent Developments, ed. by A. Auger, B. Doerr (World Scientific Publishing 2011) Chap. 11, pp. 327 – 354

[59] J. Jägersküpper: Lower bounds for hit-and-run direct search, International Symposium on Stochastic Algorithms: Foundations and Applications (SAGA 2007), ed. by J. Hromkovic et al. (Springer Verlag 2007) 118 – 129

[60] J. Jägersküpper: Lower bounds for randomized direct search with isotropic sampling, Operations Research Letters 36(3), 327 – 332 (2008)

[61] J. Jägersküpper: Probabilistic runtime analysis of (1+,λ) evolution strategies using isotropic mutations, Genetic and Evolutionary Computation Conference (GECCO 2006) (ACM Press 2006) 461 – 468

[62] M. Jebalia, A. Auger, P. Liardet: Log-linear convergence and optimal bounds for the (1+1)-ES, Evolution Artificielle (EA ’07), ed. by N. Monmarché et al. (Springer Verlag 2008) 207 – 218

[63] A. Auger, N. Hansen: Theory of evolution strategies: A new perspective. In: Theory of Randomized Search Heuristics: Foundations and Recent Developments, ed. by A. Auger, B. Doerr (World Scientific Publishing 2011) Chap. 10, pp. 289 – 325

[64] A. Auger, D. Brockhoff, N. Hansen: Analyzing the impact of mirrored sampling and sequential selection in elitist evolution strategies, Foundations of Genetic Algorithms (FOGA 11) (ACM Press 2011) 127 – 138

[65] A. Auger, D. Brockhoff, N. Hansen: Mirrored sampling in evolution strategies with weighted recombination, Genetic and Evolutionary Computation Conference (GECCO 2011) (ACM Press 2011) 861 – 868

[66] D. V. Arnold, H.-G. Beyer: Local performance of the (µ/µI, λ)-ES in a noisy environment, Foundations of Genetic Algorithms (FOGA 6), ed. by W. N. Martin, W. M. Spears (Morgan Kaufmann 2001) 127 – 141

[67] D. V. Arnold, H.-G. Beyer: Local performance of the (1 + 1)-ES in a noisy environment, IEEE Transactions on Evolutionary Computation 6(1), 30 – 41 (2002)

[68] D. V. Arnold: Noisy Optimization with Evolution Strategies (Kluwer Academic Publishers, 2002)

[69] D. V. Arnold, H.-G. Beyer: Performance analysis of evolutionary optimization with cumulative step length adaptation, IEEE Transactions on Automatic Control 49(4), 617 – 622 (2004)

[70] A. I. Oyman, H.-G. Beyer: Analysis of the (µ/µ, λ)-ES on the parabolic ridge, Evolutionary Computation 8(3), 267 – 289 (2000)

[71] D. V. Arnold, H.-G. Beyer: Evolution strategies with cumulative step length adaptation on the noisy parabolic ridge, Natural Computing 7(4), 555 – 587 (2008)
[72] D. V. Arnold, H.-G. Beyer: On the behaviour of evolution strategies optimising cigar functions, Evolutionary Computation 18(4), 661 – 682 (2010)

[73] H.-G. Beyer: Toward a theory of evolution strategies: Self-adaptation, Evolutionary Computation 3(3) (1995)

[74] S. Meyer-Nieberg, H.-G. Beyer: Mutative self-adaptation on the sharp and parabolic ridge, Foundations of Genetic Algorithms (FOGA 9), ed. by C. R. Stephens et al. (Springer Verlag 2007) 70 – 96

[75] D. V. Arnold, A. MacLeod: Hierarchically organised evolution strategies on the parabolic ridge, Genetic and Evolutionary Computation Conference (GECCO 2006) (ACM Press 2006) 437 – 444

[76] H.-G. Beyer, M. Dobler, C. Hämmerle, P. Masser: On strategy parameter control by meta-ES, Genetic and Evolutionary Computation Conference (GECCO 2009) (ACM Press 2009) 499 – 506

[77] D. V. Arnold, A. MacLeod: Step length adaptation on ridge functions, Evolutionary Computation 16(2), 151 – 184 (2008)

[78] D. V. Arnold: On the use of evolution strategies for optimising certain positive definite quadratic forms, Genetic and Evolutionary Computation Conference (GECCO 2007) (ACM Press 2007) 634 – 641

[79] H.-G. Beyer, S. Finck: Performance of the (µ/µI, λ)-σSA-ES on a class of PDQFs, IEEE Transactions on Evolutionary Computation 14(3), 400 – 418 (2010)

[80] M. Jebalia, A. Auger, N. Hansen: Log-linear convergence and divergence of the scale-invariant (1+1)-ES in noisy environments, Algorithmica 59(3), 425 – 460 (2011)

[81] D. V. Arnold, H.-G. Beyer: On the benefits of populations for noisy optimization, Evolutionary Computation 11(2), 111 – 127 (2003)

[82] D. V. Arnold, H.-G. Beyer: A general noise model and its effects on evolution strategy performance, IEEE Transactions on Evolutionary Computation 10(4), 380 – 391 (2006)

[83] D. V. Arnold, H.-G. Beyer: Optimum tracking with evolution strategies, Evolutionary Computation 14(3), 291 – 308 (2006)

[84] D. V. Arnold, D. Brauer: On the behaviour of the (1 + 1)-ES for a simple constrained problem, Parallel Problem Solving from Nature (PPSN X), ed. by G. Rudolph et al. (Springer Verlag 2008) 1 – 10

[85] D. V. Arnold: On the behaviour of the (1, λ)-ES for a simple constrained problem, Foundations of Genetic Algorithms (FOGA 11) (ACM Press 2011) 15 – 24

[86] D. V. Arnold: Analysis of a repair mechanism for the (1, λ)-ES applied to a simple constrained problem, Genetic and Evolutionary Computation Conference (GECCO 2011) (ACM Press 2011) 853 – 860

[87] A. A. Zhigljavsky: Theory of Global Random Search (Kluwer Academic Publishers, 1991)

[88] A. Bienvenüe, O. François: Global convergence for evolution strategies in spherical problems: Some simple proofs and difficulties, Theoretical Computer Science 306(1-3), 269 – 289 (2003)

[89] A. Auger: Convergence results for (1,λ)-SA-ES using the theory of ϕ-irreducible Markov chains, Theoretical Computer Science 334(1-3), 35 – 69 (2005)

[90] J. Jägersküpper: Algorithmic analysis of a basic evolutionary algorithm for continuous optimization, Theoretical Computer Science 379(3), 329 – 347 (2007)

[91] J. Jägersküpper: How the (1+1) ES using isotropic mutations minimizes positive definite quadratic forms, Theoretical Computer Science 361(1), 38 – 56 (2006)