2 Heuristic Optimization
2.1 Introduction
Optimization problems are concerned with finding the values for one or several decision variables that meet the objective(s) best without violating the constraint(s).
The identification of an efficient portfolio in the Markowitz model (1.7) on page 7
is therefore a typical optimization problem: the values for the decision variables xi
have to be found under the constraints that (i) they must not exceed certain bounds
((1.7f): 0 ≤ xi ≤ 1 and (1.7e): ∑i xi = 1) and (ii) the portfolio return must have a
given expected value (constraint (1.7c)); the objective is to find values for the assets’
weights that minimize the risk, which is computed in a predefined way. If there are several competing objectives, a trade-off between them usually has to be defined: In the modified objective function (1.7a*) on page 9, the objectives of minimizing the risk while maximizing the return are considered simultaneously.
Though not solvable analytically, there exist numerical procedures by which the Markowitz model can be solved for a given set of parameter values.
In portfolio management, these difficulties with the objective functions are frequently observed when market frictions have to be considered. To find solutions anyway, common ways of dealing with them would be to either eliminate these frictions (leading to models that represent the real world in a stylized and simplified way) or to approach them with inappropriate methods (which might lead to suboptimal and misleading results without being able to recognize these errors).
As opposed to the well-defined problems considered so far, there also exist prob-
lems where the underlying structure is unknown, partially hidden – or simply
too complex to be modeled. When an underlying structure can be assumed or
when there are pairs of input/output data, these questions can be approached, e.g.,
with econometric1 or Artificial Intelligence2 methods. In finance, time series analy-
sis, pricing of complex securities, model selection problems, and artificial markets
would be typical examples.3 In this contribution, however, only well-defined opti-
mization problems will be considered.
The computation time typically grows with the problem size n and can be expressed as c · O(f(n)), where c is a constant specific to the problem or algorithm. E.g., reading n pages of a book might consume linear time (i.e., O(n)), but finding out whether there are any duplicates in a pile of n pages demands that each of them is compared to the remaining (n − 1) pages; the complexity becomes O(n · (n − 1)) ≈ O(n²) and is therefore quadratic in n.
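To make this concrete, the following Python fragment (an illustration of the page example above; function names are ours) contrasts the linear scan with the quadratic pairwise comparison:

def read_pages(pages):
    # one pass over n pages -> O(n)
    for page in pages:
        pass

def has_duplicates(pages):
    # every page is compared to the remaining (n - 1) pages -> O(n^2)
    n = len(pages)
    for i in range(n):
        for j in range(n):
            if i != j and pages[i] == pages[j]:
                return True
    return False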
The constant c will differ across programming platforms as well as actual CPU
capacities. For sufficiently large n, the main contribution to computational time will
come from the argument related to n or, if the argument consists of several compo-
nents, the “worst” of them, i.e., the one that eventually outgrows the others: When the complexity is O(ln(n) · (n/k)²), then, for any constant k and sufficiently large n, the quadratic term will outweigh the logarithmic term, and the CPU time can be considered quadratic in n. A sole increase in CPU power, which implies a reduction in c, will
therefore not have a sustainable effect on the (general) computability of a problem. A
real improvement of the complexity can only be achieved when a better algorithm is
found. This applies not only to the (sometimes called “easy”) polynomial problems, where the exponent k can become quite large; it is all the more true for a special
class of optimization and search problems: For the group of non-deterministic poly-
nomial time (NP) complete problems, there is no deterministic algorithm known
that can find an exact solution within polynomial time.5 This means that for decid-
ing whether the found solution is optimal, the number of necessary steps is not a
polynomial, but (at least) an exponential function of the problem size in the worst
case. Two well-known problems of this group are the Traveling Salesman Problem
(TSP)6 and the Knapsack Problem (KP)7 . One of the difficulties with optimization
5 Actually, it can be shown that if one of these problems could be solved in polynomial time, this
solution could be transferred and all of these problems could be solved in polynomial time.
6 In the TSP, a salesperson has to find the shortest route for traveling to n different cities, usually
without visiting one city twice. The problem can be modeled by a graph where nodes are the cities
and the arcs are the distances between any two cities. Finding the shortest route corresponds then
to finding the shortest path through this graph (or the shortest Hamiltonian cycle, if the salesperson
takes a round trip where the tour starts and ends in the same city). If the graph is fully connected,
i.e., there is a direct connection between any two cities, there are n! alternative routes to choose from
– which also characterizes the worst case computational complexity of this problem.
7 In the KP, a tourist finds a number of precious stones that differ in size and in value per size unit.
As the capacity of her knapsack is limited, the task of this 1/0 knapsack problem is to select stones
such that the value of the contents is maximized. Fast solutions to this problem are possible only
under rare circumstances. See Kellerer, Pferschy, and Pisinger (2004) and section 4.2.1.
problems is that the complexity for solving them is not always apparent: Proofs that
an optimization problem belongs to a certain complexity class are therefore helpful
when deciding which solution strategy ought to be considered.8
If closed-form solutions do not exist, then the problem usually has to be solved for each individual set of parameters. The approach that needs the least optimization skills
would be complete enumeration where simply all possible (and valid) values for the
decision variables are tested. This approach has some severe downsides: first and
foremost, it is frequently time-consuming far beyond acceptability. Second, it de-
mands the set of candidate solutions to be discrete and finite; if the decision vari-
ables are continuous (i.e., have infinitely many alternatives) then they have to be
discretized, i.e., transformed into a countable number of alternatives – ensuring that
the actual optimum is not excluded due to too large steps while keeping the resulting
number of alternatives manageable.
In chapter 4, e.g., the problem will be to select k out of N assets and optimize their weights. The problem size can quickly get out of hand: there are C(N, k) = N! / ((N − k)! · k!) alternative combinations for selecting k out of N assets without optimizing the weights; correspondingly, the complexity of an exhaustive search would be O(C(N, k)). Selecting just 10 out of 100 assets comes with C(100, 10) ≈ 1.73 × 10^13 alternatives. For each of these alternatives, the optimal weights would have to be found: When the weights of 10 assets may be either zero or multiples of 10% and short-sales are disallowed, the granularity of the weights is g = 1/10% = 10, which comes with k^g = 10^10 possible weight structures per alternative; the complexity is then increased to O(C(N, k) · k^g). Having a computer that is able to evaluate a million cases per second, complete enumeration would take 1.32 × 10^11 years – which is approximately ten times the time since the Big Bang. When k is increased by just one additional asset from 10 to 11 (other things equal), the CPU time would increase to 233 times the time since the Big Bang; and if, in addition, the granularity were increased to multiples of 5% (which would still be too rough for realistic applications), then the CPU time would increase to more than 6 trillion times the time since the Big Bang.
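The combinatorial counts are easy to verify; a quick Python check (variable names are ours; the year figures in the text additionally rest on the assumed evaluation speed):

from math import comb

selections = comb(100, 10)             # C(100, 10): about 1.73e13 asset selections
weight_structures = 10 ** 10           # k^g with k = 10 assets and g = 10 weight steps
print(f"{selections:.3g} selections, {selections * weight_structures:.3g} cases in total")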
Some of the problems dealt with in the following chapters have opportunity sets that are orders of magnitude larger; complete enumeration is therefore not a realistic alternative, nor would a sheer increase in computational power (by faster CPUs or parallel computers) do the trick.
Linear Programming will be applied when the optimization problem has a linear
objective function and its constraints, too, are all linear (in-)equalities. The
most popular method is the Simplex Algorithm where first the inequalities are
transformed into equalities by adding slack variables and then including and
excluding base variables until the optimum is found. Though its worst case
computational complexity is exponential, it is found to work quite efficiently
for many instances. Some parametric models for portfolio selection therefore
prefer linear risk measures (accepting that these risk measures have less de-
sirable properties than the variance).
Quadratic and Concave Programming can be applied when the constraints are lin-
ear (in-)equalities, yet the objective function is quadratic. This is the case for
the Markowitz model (1.7); how this can be done, will be presented in sec-
tion 2.1.2.4. When the Kuhn-Tucker conditions hold,11 a modified version of
the Simplex Algorithm exists that is capable of solving these problems – the
computational complexity of which, however, is also exponential in the worst
case.
Dynamic Programming is a general concept rather than a strict algorithm and ap-
plies to problems that have, e.g., a temporal structure. For financial multi-
temporal problems, the basic idea would be to split the problem into several
sub-problems which are all myopic, i.e., have no temporal aspects when con-
sidered separately. First, the sub-problem for the last period, T, is solved. Next, the optimal solution for the last-but-one period, T − 1, is determined such that it leads into the optimal solution for T, and so on until all sub-problems are solved.
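As an illustration, the 0/1 knapsack problem from footnote 7 lends itself to this treatment, with the items taking the role of the periods; the following Python sketch (illustrative values, ours) builds the overall solution from sub-problems over the remaining capacity:

def knapsack(values, sizes, capacity):
    # best[c] = maximum value achievable with remaining capacity c
    best = [0] * (capacity + 1)
    for v, s in zip(values, sizes):
        # iterate capacities downwards so that each stone is used at most once
        for c in range(capacity, s - 1, -1):
            best[c] = max(best[c], best[c - s] + v)
    return best[capacity]

print(knapsack(values=[60, 100, 120], sizes=[1, 2, 3], capacity=5))   # 220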
Greedy Algorithms always prefer the next step that yields the maximum improvement but do not assess its consequences. Given a current (subopti-
mal) solution, a greedy algorithm would search for a modified solution within
a certain neighborhood and choose the “best” among them. This approach is
sometimes called hill-climbing, referring to a mountaineer who will choose
her every next step in a way that brings the utmost increase. As these algo-
rithms are focused on the next step only, they get easily stuck when there are
many local optima and the initial values are not chosen well. Hence, this approach demands smooth solution spaces and a monotonic objective function to deliver good solutions, and is related to the concept of gradient search.
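A minimal hill-climbing sketch for a one-dimensional maximization problem (function and parameter names are ours) makes the myopia explicit: only strictly improving neighbors are accepted, so the search stalls at the local optimum nearest to the start value:

import random

def hill_climb(f, x0, step=0.1, iters=1000):
    # greedy neighborhood search: accept a neighbor only if it improves f
    x, fx = x0, f(x0)
    for _ in range(iters):
        candidate = x + random.uniform(-step, step)
        if f(candidate) > fx:          # strictly uphill moves only
            x, fx = candidate, f(candidate)
    return x, fx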
Gradient Search can be performed when the objective function f(x) is differentiable and strictly convex12 and the optimum can be found with the first order condition ∂f/∂x = 0. Given the current candidate solution x′, the gradient ∇f(x) = [∂f/∂x_1, ..., ∂f/∂x_n] is computed for x = x′. The solution is readjusted according to x′ := x′ + δ · ∇f(x′), which corresponds to x′_j := x′_j + δ · (∂f/∂x_j)|_{x=x′} for all j. This readjustment is repeated until the optimum x* with ∇f(x*) = 0 is reached. Graphically speaking, this procedure determines the tangent at point x′ and moves the decision variables towards values for which the tangent's slope is expected to be 0 and any slight change of any x_j would worsen the value of the objective function f.
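A minimal gradient-ascent sketch (assuming the gradient ∇f is supplied as a function; names and the example are ours):

import numpy as np

def gradient_search(grad_f, x0, delta=0.01, tol=1e-8, max_iter=100_000):
    # gradient ascent: x := x + delta * grad f(x), stopped once the
    # first order condition grad f(x*) = 0 holds (numerically)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + delta * g
    return x

# f(x) = -(x1 - 1)^2 - (x2 + 2)^2 has its maximum at (1, -2)
print(gradient_search(lambda x: np.array([-2 * (x[0] - 1), -2 * (x[1] + 2)]), [0.0, 0.0]))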
Divide and Conquer Algorithms iteratively split the problem into sub-problems un-
til the sub-problems can be solved in reasonable time. These partial results
are then merged for the solution of the complete problem. These approaches
demand that the original problem can be partitioned in a way that the quality
of the solutions for sub-problems will not interfere with each other, i.e., that
the sub-problems are not interdependent.
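Merge sort is the textbook instance of this principle; a short generic illustration (not an example from the portfolio context):

def merge_sort(xs):
    if len(xs) <= 1:                   # sub-problem solvable directly
        return xs
    mid = len(xs) // 2                 # divide ...
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0            # ... and merge the partial results
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]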
Branch and Bound Algorithms can be employed in some instances where parts of
the opportunity set and candidate solutions can be excluded by selection tests.
The idea is to iteratively split the opportunity space into subsets and iden-
tify as soon as possible those subsets where the optimum is definitely not
a member of, mainly by either repeatedly narrowing the boundaries within
which the solution must fall, by excluding infeasible solutions, or by “prun-
ing” those solutions that are already outperformed by some other solution
12 Here, maximization problems are considered. For minimization problems, similar arguments hold for concave functions. Note that any maximization problem can be transformed into a minimization problem (usually by taking the inverse of the objective function or by multiplying it by −1) and vice versa.
found so far. The opportunity set is therefore repeatedly narrowed down until
either a single valid solution is found or until the problem is manageable with
other methods, such as complete enumeration of all remaining solutions or
another numerical method.
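For the 0/1 knapsack problem from footnote 7, a compact sketch might look as follows (a hypothetical implementation; the fractional "optimistic bound" used for pruning is one common choice among several):

def knapsack_bb(values, sizes, capacity):
    n = len(values)
    # sort items by value density so a simple fractional bound is available
    order = sorted(range(n), key=lambda i: values[i] / sizes[i], reverse=True)
    v = [values[i] for i in order]
    s = [sizes[i] for i in order]
    best = 0

    def bound(k, cap, val):
        # optimistic bound: fill the remaining capacity fractionally
        for i in range(k, n):
            if s[i] <= cap:
                cap, val = cap - s[i], val + v[i]
            else:
                return val + v[i] * cap / s[i]
        return val

    def branch(k, cap, val):
        nonlocal best
        best = max(best, val)
        if k == n or bound(k, cap, val) <= best:
            return                      # prune: this subset cannot beat the incumbent
        if s[k] <= cap:
            branch(k + 1, cap - s[k], val + v[k])   # include item k
        branch(k + 1, cap, val)                     # exclude item k

    branch(0, capacity, 0)
    return best

print(knapsack_bb([60, 100, 120], [1, 2, 3], 5))    # 220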
Many software packages for optimization problems offer routines and standard so-
lutions for Linear and Quadratic Programming Problems. Linear Programming (LP)
can be applied when the objective function is linear in the decision variable and the
constraints are all equalities or inequalities that, too, are linear. A general statement
would be
min_x f · x
subject to
    Ax = a
    Bx ≤ b
where x is the vector of decision variables and A, B, a, and b are matrices and vec-
tors, respectively, that capture the constraints. Note that any minimization problem
can be transformed into a maximization problem simply by changing the sign of the
objective function (i.e., by multiplying f with −1) and that by choosing the appro-
priate signs, inequalities of the type Cx ≥ c can be turned into −Cx ≤ −c, i.e., b
can contain upper and lower limits alike.
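Such a problem can be handed to any LP routine; a minimal sketch using SciPy's linprog (assuming SciPy is available; the numbers are illustrative stand-ins for f, A, a, B, and b):

import numpy as np
from scipy.optimize import linprog

f = np.array([-1.0, -2.0])                        # objective coefficients of min f.x
A, a = np.array([[1.0, 1.0]]), np.array([1.0])    # equality constraints   A x  = a
B, b = np.array([[1.0, 0.0]]), np.array([0.8])    # inequality constraints B x <= b

res = linprog(c=f, A_ub=B, b_ub=b, A_eq=A, b_eq=a, bounds=[(0, None)] * 2)
print(res.x)                                      # optimal decision variables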
Quadratic Programming (QP) additionally allows for a quadratic term in the objective function; the general statement then becomes

min_x ½ · x′Hx + f · x
subject to
    Ax = a
    Bx ≤ b.
This can be applied to determine the Markowitz efficient portfolio for a return of r_P by implementing the model (1.7) as follows. If r and Σ denote the return vector and the covariance matrix, respectively, then f = 0_{1×N}, H = 2Σ, A = [1_{N×1} r]′, a = [1 r_P]′, B = −I_{N×N}, and b = 0_{N×1}, where 0 is the zero vector.
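A sketch of this computation with a general QP-capable solver (here SciPy's SLSQP rather than a dedicated QP routine; Sigma, r, and r_P are illustrative stand-ins):

import numpy as np
from scipy.optimize import minimize

r = np.array([0.08, 0.12, 0.10])                   # expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.06]])             # covariance matrix
r_P = 0.10                                         # required portfolio return

res = minimize(lambda x: x @ Sigma @ x,            # portfolio variance x' Sigma x
               x0=np.full(3, 1 / 3),
               method="SLSQP",
               bounds=[(0, 1)] * 3,                # 0 <= x_i <= 1
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1},
                            {"type": "eq", "fun": lambda x: x @ r - r_P}])
print(res.x)                                       # efficient weights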
If, however, the whole efficient line of the Markowitz model is to be identified, then the objective function (1.7a*) is to be applied and the respective parameters are f = −λr, H = 2(1 − λ)Σ, A = 1_{1×N}, a = 1, B = −I_{N×N} and b = 0_{N×1}. Since λ
measures the trade-off between risk and return, λ = 0 will lead to the identification
of the Minimum Variance Portfolio. On the other hand, λ = 1 puts all the weight on
the expected return and will therefore report the portfolio with the highest possible
yield which, for the given model with non-negativity constraints but no upper limits
on xi , will contain exclusively the one asset with the highest expected return. To
identify the efficient portfolios between these two extremes, a usual way would be
to increase λ in sufficiently small steps from zero to one and solve the optimization
problem for these values.
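Continuing the sketch above, this λ-sweep might look as follows (again illustrative; λ = 1 itself is the degenerate maximum-return corner and is therefore approached but not evaluated):

frontier = []
for lam in np.linspace(0.0, 0.99, 100):            # lambda = 1 is the degenerate corner
    res = minimize(lambda x: (1 - lam) * (x @ Sigma @ x) - lam * (x @ r),
                   x0=np.full(3, 1 / 3),
                   method="SLSQP",
                   bounds=[(0, 1)] * 3,
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
    frontier.append((res.x @ r, np.sqrt(res.x @ Sigma @ res.x)))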
In a Tobin framework as presented in section 1.1.2.3 where the set of risky assets
is supplemented with one safe asset, the investor will be best off when investing an
amount α into the safe asset and the remainder of (1 − α ) into the tangency portfo-
lio T. Given an exogenously chosen value 0 < α < 1, the respective parameters for the quadratic programming model are f = −[r r_s], H = 2 · [Σ, 0_{N×1}; 0_{1×N}, 0], A = [1_{1×N}, 1; 0_{1×N}, 1], a = [1 α]′, B = −I_{(N+1)×(N+1)}, and b = 0_{(N+1)×1} (block-matrix rows separated by semicolons).
The resulting vector x is of dimension (N + 1) × 1, where the first N elements represent (1 − α) · x_T, whereas the (N + 1)-st element is the weight of the safe asset in the investor's overall portfolio and (by constraint) has the value of α. By the separation theorem, the weights for T can then be determined by x_T = (1/(1 − α)) · [x_1 ... x_N]′.
Classical optimization techniques as presented so far can be divided into two main
groups. The first group of methods is based on exhaustive search or (complete) enu-
meration, i.e., testing all candidate solutions. The crux of approaches like branch
and bound is to truncate as much of the search space as possible and hence to elim-
inate groups of candidates that can be identified as inferior beforehand. However,
even after pruning the search space, the remaining number of candidates might still
exceed the available capacities, provided the number of solutions is discrete and fi-
nite in the first place.
The second type comprises techniques that are typically based on the differen-
tial calculus, i.e., they apply the first order conditions and push the decision vari-
ables towards values where the first derivative or gradient of the objective function is
(presumably) zero. An implicit assumption is that there is just one optimum and/or
that the optimum can be reached on a “direct path” from the starting point. The
search process itself is usually based on deterministic numerical rules. This implies
that, given the same initial values, repeated runs will always report the same result
– which, as argued, is not necessarily a good thing: repeated runs with the same (deterministically generated) initial values will report the same results, leaving one unable to judge whether the global or just a local optimum has been found. To illustrate this prob-
lem, reconsider the function depicted in Figure 2.1 on page 39. If the initial guess
is a value for x that is near one of the local optima xA or xC , then a traditional nu-
merical procedure is likely to end up at the local maximum closest to the initial
guess, and the global optimum, xB , will remain undiscovered. In practice, the deter-
ministic behavior and the straightforward quest for the closest optimum from the current solution's perspective can be a serious problem, in particular when there are
many local optima which are “far apart” from the global optimum, but close to the
starting value. Also, slight improvements in the objective function might come with
substantially different values for the decision variables.
The heuristics discussed in due course and applied in the main part of this con-
tribution were designed to solve optimization problems by repeatedly generating
and testing new solutions. These techniques therefore address problems for which a well-defined model and objective function actually exist. If this is not the case,
there exist alternative methods in soft computing 14 and computational intelligence15 .
13 It is not always possible to guarantee beforehand that none of the constraints is violated; also, ascer-
taining that only valid candidate solutions are generated might be computationally costly. In these
cases, a simple measure would be not to care about these constraints when generating the candidate
solutions but to add a punishment term to the objective value when this candidate turns out to be
invalid.
14 Coined by the inventor of fuzzy logic, Lotfi A. Zadeh, the term soft computing refers to methods and
procedures that not only tolerate uncertainty, fuzziness, imprecision and partial correctness but also
make use of them; see, e.g., Zadeh and Garibaldi (2004).
15 Introduced by James Bezdek, computational intelligence refers to methods that use numerical pro-
cedures to simulate intelligent behavior; see Bezdek (1992, 1994).
A popular method of this type is the Neural Network, which mimics natural brain processes by learning through a non-linear regression of input-output data.16
Although there are 360 different combinations in the standard case,17 the second player is supposed to find the right solution within eight guesses or less. Complete
enumeration is therefore not possible. The typical beginner’s approach is to per-
form a Monte Carlo search by trying several perfectly random guesses (or the other
player’s favorite colors) and hoping to find the solution either by sheer chance or
by eventually interpreting the outcome of the independent guesses. With unlimited
guesses, this strategy will eventually find the solution; when limited to just eight
guesses, the hit rate is disappointingly low.
More advanced players also start off with a perfectly random guess, but they re-
duce the “degree of randomness” in the subsequent guesses by considering the out-
comes from the previous guesses: E.g., when the previous guess brought two white
16 See Russell and Norvig (2003) for a general presentation; applications to time series forecasting are
presented in Azoff (1994).
17 The standard case demands all four pegs to be of different color with six colors to choose from.
Alternative versions allow for “holes” in the structure, repeated colors and/or also the “white” and
“black” pegs, used to indicate “correct color” and “correct color and position”, respectively – result-
ing in up to 6 561 combinations.
pegs (i.e., only two right colors, none in the right position, and two wrong colors),
the next guess should contain some variation in the color; if the answer were four
white pegs (i.e., all the colors are right, yet all in the wrong position), the player
can concentrate on the order of the previously used pegs rather than experimenting
with new colors. The individual guesses are therefore not necessarily independent,
yet (usually) there is no deterministic rule for how to make the next guess. Master-
mind might therefore serve as an example where the solution to a problem can be
found quite efficiently by applying an appropriate heuristic optimization method.
Generation of new solutions. A new solution can be generated by modifying the cur-
rent solution (neighborhood search) or by building a new solution based on
past experience or results. In doing so, a deterministic rule, a random guess
or a combination of both (e.g., deterministically generating a number of alter-
natives and randomly selecting one of them) can be employed.
18 For an alternative classification, see, e.g., Silver (2002) and Winker and Gilli (2004).
Number of search agents. Whereas in some methods, a single agent aims to improve
her solution, population based methods often make use of collective knowl-
edge gathered in past iterations.
Limitations of the search space. Given the usually vast search space, new solutions
can be found by searching within a certain neighborhood of a search agent’s
current solution or of what the population (implicitly) considers promising.
Some methods, on the other hand, explicitly exclude certain neighborhoods or regions to avoid cyclic search paths or spending too much computation time on supposedly irrelevant alternatives.
Prior knowledge. When there exist general guidelines of what is likely to make a
good solution, this prior knowledge can be incorporated in the choice of the
initial solutions or in the search process (guided search). Though the inclusion
of prior knowledge might significantly reduce the search space and increase
the convergence speed, it might also lead to inferior solutions as the search
might get guided in the wrong direction or the algorithm might have severe
problems in overcoming local optima. Prior knowledge is therefore found in a rather limited number of HO methods, and there, too, it is rather an option than a prerequisite.
Flexibility for specific constraints. Whereas there exist true general purpose meth-
ods that can be applied to virtually any type of optimization problem, some
methods are tailor-made to particular types of constraints and are therefore
difficult to apply to other classes of optimization problems.
Other aspects allow for testing and ranking different algorithms and might also
affect the decision which method to select for a particular optimization problem:
Convergence speed. The CPU time (or, alternatively, the number of evaluated can-
didate solutions) until no further improvement is found, is often used as a
measure to compare different algorithms. Speed might be a salient property of an algorithm in practical applications – though it is not too meaningful when taken as a sole criterion, as it does not necessarily differentiate between convergence to a local and to the global optimum, at least as long as a "reasonable" time limit is not exceeded.
Reliability. For some major heuristics, proofs exist that these methods will converge
towards the global optimum – given sufficient computation time and an ap-
propriate choice of parameters. In practice, one often has to accept a trade-off
between low computational time (or high convergence speed) and the chance
that the global optimum is missed. With the inherent danger of getting stuck
in a local optimum, heuristics are therefore frequently judged by the rate at which they report local optima or other inferior solutions.
To reduce the vagueness of these aspects, section 2.3 presents some of the ma-
jor and commonly used heuristics that are typical representatives for this type of
methods and that underline the differences in the methods with regard to the above aspects.19
19 For general presentations and comparisons of HO methods, see, e.g., Osman and Kelly (1996), Tail-
lard, Gambardella, Gendreau, and Potvin (2001), Michalewicz and Fogel (1999), Aarts and Lenstra
(2003) or Winker and Gilli (2004). Osman and Laporte (1996) offer an extensive bibliography of the
theory and application of meta-heuristics, including 1 380 references. Ausiello and Protasi (1995)
investigate local search heuristics with respect to NP optimization problems.
20 See, e.g., Hertz and Widmer (2003).
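In outline, both methods share the following skeleton (a minimal sketch, consistent with Listing 2.6 below; the acceptance criterion is the main difference between the two methods: SA accepts an impairment ∆F with probability exp(∆F/Ti), whereas TA accepts it whenever it does not exceed the current threshold τi):

generate valid random initial solution x;
FOR i := 1 TO I DO
    generate candidate x′ in the neighborhood of x;
    ∆F := F(x′) − F(x);
    IF ∆F > 0 THEN
        x := x′
    ELSE
        accept x := x′ with probability exp(∆F/Ti)   {SA}
        or whenever −∆F < τi                         {TA};
    lower Ti (respectively τi);
END;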
Listing 2.1: Basic structure for Simulated Annealing (SA) and Threshold Accepting (TA)
21 See Aarts and van Laarhoven (1985) and Althöfer and Koschnik (1991), respectively.
22 For a concise presentation of TA, its properties and applications in economics as well as issues re-
lated to evaluating heuristically obtained results, see Winker (2001).
Evolution based methods gained significant recognition with the advent of Ge-
netic Algorithms (GA). Based on some of his earlier writings as well as related ap-
proaches in the literature, Holland (1975) attributes probabilities for reproduction
to the individual “chromosomes,” xi , that reflect their relative fitness within the pop-
ulation. In the sense of the “survival of the fittest” principle, high fitness increases
the chances of (multiple) reproduction, low fitness will ultimately lead to extinction.
New offspring is generated by combining the chromosomes of two parent chromosomes; in the simplest case, this cross-over can be done by "cutting" each parent's chromosomes into two pieces and creating two siblings by recombining each parent's first part with the other parent's second part (see Figure 2.2(a)). In addition, mutation can take place, again by randomly modifying an existing (i.e., a parent's or a newly generated offspring's) solution (see Figure 2.2(b)).

Fig. 2.2: (a) Cross-over: parents 10010 and 11011, cut after the fourth gene, produce offspring 10011 and 11010; (b) mutation: 11010 becomes 01010
Over the last decades, GA have become the prime method for evolutionary op-
timization – with a number of suggestions for alternative cross-over operations
(not least because GA were originally designed for chromosomes coded as binary
strings), mutation frequency, cloning (i.e., unchanged replication) of existing chro-
mosomes, etc. Listing 2.3 therefore indicates just the main steps of a GA; the struc-
ture of actual implementations might differ. Fogel (2001) offers a concise overview
of methods and literature in evolutionary computation.
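For concreteness, a minimal GA sketch for binary chromosomes (the fitness function, parameter values, and one-point cross-over are illustrative choices; as noted, actual implementations differ):

import random

def genetic_algorithm(fitness, n_bits=10, pop_size=20, generations=100,
                      p_mut=0.05, rng=random.Random(0)):
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # reproduction probability reflects relative fitness ("survival of the fittest")
        weights = [fitness(c) for c in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        offspring = []
        for p1, p2 in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, n_bits)               # one-point cross-over
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                offspring.append([1 - g if rng.random() < p_mut else g
                                  for g in child])       # mutation: flip genes
        pop = offspring
    return max(pop, key=fitness)

print(genetic_algorithm(fitness=sum))    # maximize the number of ones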
Fig. 2.3: Simple foraging example for a colony with two ants
Suppose two ants leave the nest in search of food, randomly choosing between the two routes to the food source F: the first ant takes the right (i.e., short) route, leaving one pheromone trail on the way to F and a second pheromone trail while returning to the nest. Meanwhile, the second ant, which has taken the left (i.e., long) route, has reached
F and wants to bring the food to the nest. Again, F can be left on two routes: the left
one (= long) has now one trail on it, the right one (= short) already has two trails. As the ant prefers routes with more pheromone on them, it is likely to return on the right path – which is the shorter one and will then have a third trail on it (versus one on
the left path). The next time the ants leave the nest, they already consider the right
route to be more attractive and are likely to select it over the left one. In real life,
this self-reinforcing principle is enhanced by two further effects: shorter routes get
more pheromone trails as ants can travel on them more often within the same time
span than they could on longer routes; and old pheromone trails tend to evaporate
making routes without new trails less attractive.
Based on this reinforcement mechanism, the tendency towards the shorter route
will increase. At the same time, there remains a certain probability that routes with
less scent will be chosen; this assures that new, yet unexplored alternatives can be
considered. If these new alternatives turn out to be shorter (e.g., because they lead to a closer food source), the ant principle will reinforce them and – in the long run – they will become the new most attractive route; if they are longer, the detour is unlikely to leave a lasting impression on the colony's behavior.
Dorigo, Maniezzo, and Colorni (1991) transfer this metaphor to a heuristic optimization method called Ant System (AS) by having a population of artificial ants search a graph where the nodes correspond to locations and the arcs represent the amount of pheromone, i.e., the attractiveness of choosing the path linking these locations. Being placed at an arbitrary location and having to decide where to move next, an artificial ant will choose (among the feasible routes) those marked with more pheromone with higher probability. The pheromone is usually administered in
a pheromone matrix where two basic kinds of updates take place: on the one hand,
new trails are added that are the stronger the more often they are chosen and the
better the corresponding result; on the other hand, trails evaporate making rarely
chosen paths even less attractive.
Since the original concept of AS parallels the Traveling Salesman Problem,25 List-
ing 2.4 presents this algorithm for the task of finding the shortest route when a given
number of cities have to be visited. Meanwhile, there exist several extensions and
variants most of which suggest improved trail update rules or selection procedures
leading to higher reliability. Also there exist modifications to open this algorithm
for optimization problems other than ordering. A survey can be found in Bonabeau,
Dorigo, and Theraulaz (1999).
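A condensed sketch of the idea (illustrative; the original AS additionally weights the selection by "visibility", i.e., inverse distance, and parameterizes the update rules):

import random

def ant_system(dist, n_ants=20, n_iter=200, evaporation=0.5, rng=random.Random(0)):
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]                  # pheromone matrix
    best_tour, best_len = None, float("inf")
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            while len(tour) < n:
                i = tour[-1]
                feasible = [j for j in range(n) if j not in tour]
                weights = [tau[i][j] for j in feasible]  # prefer stronger trails
                tour.append(rng.choices(feasible, weights=weights, k=1)[0])
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        # evaporation, then reinforcement: shorter tours deposit more pheromone
        tau = [[t * evaporation for t in row] for row in tau]
        for tour, length in tours:
            for k in range(n):
                i, j = tour[k], tour[(k + 1) % n]
                tau[i][j] += 1.0 / length
                tau[j][i] += 1.0 / length
    return best_tour, best_len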
The concept of the pheromone matrix facilitates the gathering and sharing of collective knowledge and experience: Whereas the previously presented methods SA, TA, and GA derive their new solutions from one (or two parental) existing solution(s) by adding a random term, in ant systems the contributions of many ants (from the current and past generations) support the generation of a new solution. As a result, ant based systems usually have high convergence speed and reliability – yet they are also computationally more demanding, as trail updates and the generation of new solutions are more complex. Another disadvantage is that ant based algorithms are less flexible in their application.
Single agent neighborhood search methods such as SA or TA, where one solution
is modified step by step until convergence, are successful in particular when there
is a limited number of local optima, when the agent can at least roughly figure out
in which direction the global optimum can be expected and when this optimum
is easily reachable given the step size and the distance between the initial and the optimal solution.

initialize population;
REPEAT
    perform individual neighborhood search;
    compete;
    perform individual neighborhood search;
    cooperate;
    adjust acceptance criterion;
UNTIL halting criterion met;

If the algorithm appears to have problems finding a solution or is likely
to get stuck in local optima, one common remedy is to have a higher number of
independent runs with different starting points, i.e., the optimization problem is
solved repeatedly and eventually the best of all found solutions is reported. Though the advantage of independent runs is that misleading paths to local optima from one run cannot affect the search in another, prior experience is lost and has to be gathered again, which increases inefficiency and run time. In population based
methods such as GA, a whole population of agents produces several solutions at a
time, which are regularly compared and the best of which are combined or re-used
for new solutions. Population based methods therefore tend to be more likely to
(eventually) overcome local optima. At the same time, they might have problems
when already being close to the optimum where local neighborhood search would
easily do the trick.
In section 1.1.3, different ways for estimating the volatility were presented, including
GARCH models where the volatility can change over time and is assumed to follow
an autoregressive process. Applying these models, however, is not always trivial, as the parameters have to be estimated by maximizing the likelihood function (1.14) (see page 22), which might have many local optima. In the absence of closed-form solutions, traditional numerical procedures are usually employed – which might produce quite different results.
26 See Dawkins (1976, chapter 7). For a more in-depth presentation and discussion of the meme con-
cept and its application in social sciences, see, e.g., Blackmore (1999).
27 Meanwhile, the literature holds many different versions of Memetic Algorithms, some of which are population based whereas others are not, and where the local search is not based on SA but on alternative methods such as Fred Glover's Tabu Search (where a list of recently visited solutions is kept that must not be revisited in order to avoid cycles), etc.; more details can be found in several contributions in Corne, Glover, and Dorigo (1999).
Given a time series r_t, Fiorentini, Calzolari, and Panattoni (1996) consider the simple GARCH(1,1) model (our notation)

    r_t = µ + e_t                                        (2.1a)
    σ_t² = α0 + α1 · e_{t−1}² + β1 · σ_{t−1}²            (2.1b)

where the residuals e_t are conditionally normally distributed with mean zero and variance σ_t²; the parameters are estimated by maximizing the log-likelihood function (1.14).
To illustrate how to use heuristic optimization techniques and to test whether the
obtained results are reliable, we approach the maximization problem (2.1) with one
of the simpler of the introduced HO techniques, namely Simulated Annealing (SA)
which has been introduced in section 2.3.1. Based on the pseudo-code in listing 2.1,
the SA heuristic includes the following steps:
• First, initial values for all decision variables (collected in the vector ψ =
[µ , α0 , α1 , β1 ]) are generated by random guesses. The only prerequisite for
these guesses is that the guessed values are “valid” with respect to the con-
straints.
• The main part of SA consists of a series of iterations in which a new candidate solution is generated in the neighborhood of the current one, evaluated, and then accepted or rejected.
Listing 2.6 provides pseudo-code for the algorithm as presented.
randomly generate a valid initial solution ψ;
FOR i := 2 TO I DO
    ψ′ := ψ;
    j := RandomInteger ∈ [1, ..., narg(ψ)];
    z̃i := RandomValue ∈ [−1, +1];
    ψ′j := ψ′j + ui · z̃i;
    ∆L := L(ψ′) − L(ψ);
    IF ∆L > 0 THEN
        ψ := ψ′
    ELSE
        with probability p = p(∆L, Ti) = exp(∆L / Ti) DO
            ψ := ψ′
    END;
    ui+1 := ui · γu; Ti+1 := Ti · γT;
END;
Listing 2.6: Pseudo-code for GARCH parameter estimation with Simulated Annealing
As the counter for the iterations starts with the value 2, the number of candidate solutions produced by the algorithm (including the initialization) is equal to I; note also that the iteration loop will be entered only if I ≥ 2 and skipped otherwise.
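A compact Python rendering of Listing 2.6 applied to the GARCH(1,1) likelihood of model (2.1) might look as follows (a sketch under the parameter choices discussed in the following subsections; function names and defaults are ours):

import numpy as np

def garch_loglik(psi, r):
    # log-likelihood of the GARCH(1,1) model (2.1); sigma_0^2 is initialized
    # with the mean squared residual (cf. footnote 33)
    mu, a0, a1, b1 = psi
    r = np.asarray(r, dtype=float)
    e = r - mu
    s2 = np.empty(len(r))
    s2[0] = np.mean(e ** 2)
    for t in range(1, len(r)):
        s2[t] = a0 + a1 * e[t - 1] ** 2 + b1 * s2[t - 1]
    if np.any(s2 <= 0):
        return -np.inf                    # invalid candidate solution
    return -0.5 * np.sum(np.log(2 * np.pi) + np.log(s2) + e ** 2 / s2)

def simulated_annealing(r, I=50_000, u1=0.025, uI_ratio=0.001, T1=20.0, rng=None):
    rng = rng or np.random.default_rng()
    gamma_u = uI_ratio ** (1 / I)                      # u_{i+1} = u_i * gamma_u
    gamma_T = (2.6 * u1 * uI_ratio / T1) ** (1 / I)    # with T_I = 2.6 * u_I (see below)
    # random valid initial guess: mu in [-1, +1], the others in [0, 1]
    psi = np.array([rng.uniform(-1, 1), rng.uniform(), rng.uniform(), rng.uniform()])
    L, u, T = garch_loglik(psi, r), u1, T1
    for _ in range(2, I + 1):
        cand = psi.copy()
        j = rng.integers(len(psi))                     # modify one decision variable
        cand[j] += u * rng.uniform(-1, 1)
        dL = garch_loglik(cand, r) - L
        if dL > 0 or rng.uniform() < np.exp(dL / T):   # accept impairment w.p. exp(dL/T)
            psi, L = cand, L + dL
        u, T = u * gamma_u, T * gamma_T
    return psi, L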
Unfortunately, there is no unique recipe for how to approach this task. Actually,
it can be considered a demanding optimization problem in itself – which is particularly tricky: A certain parameter setting will not produce a unique, deterministic
result but rather various results that are (more or less) randomly distributed; the
task is therefore to find a combination where the distribution of the reported results
is favorable. And as with many demanding problems, there are many possible pa-
rameter settings that appear to work equally well, yet it is hard to tell which one is
actually the best among them.
Generally speaking, a good parameter setting is one where the algorithm finds
reliable solutions within reasonable time. A common way for tuning the algorithm’s
parameters is to predefine several plausible parameter settings and to perform a se-
ries of independent experiments with each of these settings. The results can then be evaluated statistically, e.g., by finding the median or the quantiles of the reported solutions, and eventually selecting the parameter setting for which the considered statistics are the best; this approach will be used in the following section. Alternative
approaches include response surface analysis and regression analysis where a func-
tional relationship between the algorithm’s parameters and the quality of the re-
ported solutions is considered.
For complex algorithms where the number of parameters is high and their effects
on the algorithm’s quality are highly interdependent, a preselection of plausible pa-
rameter values is more difficult; in these circumstances, the parameter values can be
found either by a Monte Carlo search – or by means of a search heuristic.
The appropriate step width u is to be seen relative to the variable that is to be changed; hence, the proper value for u will also depend on the magnitude of the different ψj's.
Large values for u allow fast movements through the solution space – yet also
increase the peril that the optimum is simply stepped over and therefore remains
unidentified. Smaller step widths, on the other hand, increase the number of steps necessary to traverse a certain distance; if u is rather small, the number of iterations has to be high. Furthermore, u is salient for overcoming local optima: To escape a
local optimum, a sequence of (interim) impairments of the objective function has to
be accepted; the smaller u, the longer this sequence is. Smaller values for u demand
a more tolerant acceptance criterion which might eventually lead to a perfectly ran-
dom search strategy, not really different from a Monte Carlo search. With a strict
acceptance criterion, small values for u will enforce an uphill search and therefore help to find the optimum close to the current position, which might be advantageous in an advanced stage of the search.
All this indicates that it might be favorable to have large values for u during the
first iteration steps and small values during the last. Also, it might be reasonable to
allow for different values for each decision variable if there are large differences in
the plausible ranges for the values of the different decision variables.
Finding Proper Values Given the data set for which the GARCH model is to be es-
timated, the optimal value for ψ1 = µ can be supposed to be in the range [−1, +1].
Also, it is reasonable to assume that the estimated variance should be non-negative
and finite at any point of time. For the variables α0 , α1 , and β1 in equation (2.1b)
(represented in the algorithm by ψ2 , ψ3 , and ψ4 ), it is plausible to assume that their
values are non-negative, but do not exceed 1; hence, their optimal values are ex-
pected in the range [0, +1].
As only one of these decision variables is modified per iteration and the number
of iterations might be rather small, we will test three alternatives where u1 will have
an initial value of 0.05, 0.025, or 0.01; the actual modification, u1 · z̃, will then be uniformly distributed in the range [−u1, +u1].
The ratio between the initial and the terminal value of u, u1/uI, will be set to 1, 10, 100, or 1 000. The value of u shall be lowered gradually in the course of the iterations according to ui+1 = ui · γu. This implies that the parameter γu is to be determined according to γu = (uI/u1)^(1/I). With the chosen values for u1 and uI, γu can take the values 1, 0.1^(1/I), 0.01^(1/I), and 0.001^(1/I).
Finding Proper Values The algorithm will report a solution which it has actually
guessed and reached by chance at one stage of the search process. The algorithm,
however, does not have a mechanism that guides the search, e.g., by using gradi-
ents, estimating the step width with some interpolation procedure or based on past
experience. At the same time, we demand a high precision for the parameters. The
algorithm is therefore conceded 50 000 guesses before it reports a solution; these
guesses can be spent on a few independent runs with many iterations or the other way round. We distinguish four versions: all guesses are used on one search run (i.e., the number of iterations is set to I = 50 000); 5 or 50 independent runs are performed with I = 10 000 or I = 1 000 iterations, respectively; or no iterative search is performed at all and all the guesses are used on perfectly random values. This last version corresponds to a Monte Carlo search and can serve as a benchmark on whether the iterative search by SA has a favorable effect on the search process or whether a perfectly random search strategy might be enough.
Finding good values for the temperature depends strongly on what "typical" impairments look like. These can be gauged by performing a series of modifications, evaluating the resulting changes in the objective function, ∆L, and determining the quantiles of the distribution of the negative ∆L's.
Tab. 2.1: Quantiles for modifications with negative ∆ L for different values of u, based on 10 000
replications each (italics and boldface as explained in the text)
Once suitable values for the initial and the terminal temperature, T1 and TI, have been found, the cooling parameter γT = (TI/T1)^(1/I) can be determined, where I is the chosen number of iterations per run. In each iteration, the temperature is then lowered according to Ti+1 = Ti · γT.
Finding Proper Values In order to find suitable values for the temperature, the dis-
tribution of the potential impairments due to one local neighborhood search step
has to be found. This can be done by a Monte Carlo approach where first a num-
ber of candidate solutions (that might occur in the search process) are randomly
generated and for which the effect of a modification is evaluated. As this distrib-
ution depends on what is considered a local neighborhood, Table 2.1 summarizes
the quantiles for impairments for the different values of u in the first and the last
iteration.
As stated above, the initial values for u were selected from the alternatives [0.05,
0.025, 0.01]. For these three alternatives, the 90% quantiles of the impairments were
approximately –13 (see figures in italics in Table 2.1). Hence, if in the beginning 90% of all impairments should be accepted with a probability of at least p = 0.5, the temperature should be set to T1 = −13 / ln(0.5) ≈ 20.
The 10% quantiles of the impairments depend strongly on the value of u; they can be approximated by –6 · u (see figures in boldface in Table 2.1). Hence, if only the 10% of impairments that are smaller than this critical value shall be accepted with a probability of more than p = 0.1 in the last iteration, I, the temperature should be set to TI = −6 · uI / ln(0.1) ≈ 2.6 · uI. As this shall be the case in the last iteration, the cooling factor is set to γT = (2.6 · uI / 20)^(1/I) where I is the number of iterations.31
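The calibration arithmetic in compact form (a quick check with the values from the text; variable names are ours):

import math

T1 = -13 / math.log(0.5)             # about 18.8, rounded to 20 in the text
u1, ratio, I = 0.025, 0.001, 50_000  # one of the tested parameter settings
uI = u1 * ratio
TI = -6 * uI / math.log(0.1)         # about 2.6 * uI
gamma_T = (TI / 20) ** (1 / I)       # cooling factor per iteration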
2.4.3 Results
Based on the above considerations, there are three candidate values for the initial value of u (u1 = 0.05, 0.025, or 0.01) and four alternatives for the terminal value of u (uI/u1 = 1, 1/10, 1/100, and 1/1 000; the values for γu follow directly according to γu = (uI/u1)^(1/I)). Since, in addition, we test four different alternatives of using the conceded 50 000 guesses (ranging from a single run with I = 50 000 iterations to 50 000 independent runs with I = 1, i.e., without subsequent iterations32), there are 48 different
parameter settings to be tested. With each of these combinations, the algorithm was
applied to the optimization problem several hundred times, and from each of these
experiments, the best of the 50 000 candidate solutions was reported and the dis-
tributions of the reported solutions are evaluated. Implemented in Delphi (version
7), the CPU time per experiment (i.e., per 50 000 candidate solutions) was approx-
imately 15 seconds on a Centrino Pentium M 1.4 GHz. Table 2.2 summarizes the
medians and 10% quantiles for the deviations between reported solutions and the
optimum.
When an adequate parameter setting has been chosen, the algorithm is able to
find good solutions with high probability: when the number of iterations is suffi-
ciently high (e.g., I = 50 000) and the parameters for the neighborhood search are
well chosen (e.g., u1 = 0.025 and uI/u1 = 0.001), then half of the reported solutions
will have an L which is at most 0.00001 below the optimum. The best 10% of the solutions generated with this parameter setting will deviate by just 0.000001 or less.
31 A more sophisticated consideration could take into account that during the last iterations, the algorithm has converged towards the optimum and that a random step in this region might have a different effect on ∆L than the same modification would cause in a region far from the optimum. Also, the values for p were chosen based on experience from implementations for similar problems; more advanced considerations could be performed. However, for our purpose (and for many practical applications), the selection process as presented seems to generate good enough results.
32 Note that in Listing 2.6, the loop of iterations is not entered when I = 1: To assure that the initial values, too, count towards the total number of guesses, the loop is performed only I − 1 times.
2.4. Heuristic Optimization at Work 73
Tab. 2.2: 10% quantiles and medians of differences between reported solutions and optimum
solution for different parameter settings from several hundred independent experiments (typi-
cally 400; MC: 7 800) with 50 000 candidate solutions each
If, on the other hand, the heuristic search part is abandoned and all of the allowed
50 000 guesses are used on generating independent random (initial) values for the
vector of decision variables, then the algorithm performs a sheer Monte Carlo (MC)
search where there is no neighborhood search (and, hence, the values for ui are
irrelevant) and where, again, only the best of the 50 000 guesses per experiment is
reported. The results are worse by orders of magnitude than for SA with a suitable set of parameters (see last column, labeled MC). This also supports that the use of the search
heuristic leads to significantly better results – provided an appropriate parameter
setting has been selected.
A closer look at the results also underlines the importance of suitable parameters
and that inappropriate parameters might turn the algorithm’s advantages into their
exact opposite. When there are only few iterations and the neighborhood is chosen
too small (i.e., small initial value for u which is further lowered rapidly), then the
step size is too small to get anywhere near the optimum within the conceded number
of search steps. As a consequence, the algorithm virtually freezes at (or near) the
initial solution.
However, it also becomes apparent that for most of the tested parameter com-
binations, the algorithm performs well and that preliminary considerations might
help to quickly tune a heuristic optimization algorithm such that it produces good
results with high reliability. Traditional methods are highly dependent on the initial
values which might lead the subsequent deterministic search to the ever same local
optimum. According to Brooks, Burke, and Persand (2001) the lack of sophisticated
initializations is one of the reasons why the tested software packages found solutions
for the considered problem that sometimes differ considerably from the benchmark.
Table 2.3 reproduces their parameter estimates from different software packages33
together with the benchmark values and the optimum as found by the SA algorithm
33 Brooks, Burke, and Persand (2001) report only three significant figures for the estimates from the different software packages; also, the packages might use alternative initializations for σ0². (Our implementation uses the popular approach σ0² = (1/T) · ∑_{t=1}^{T} e_t² with e_t coming from equation (2.1a).) Reliable calculations of the respective values for L that would allow for statistically sound tests on the estimation errors are not possible.
Tab. 2.3: Results for the GARCH estimation: the benchmark provided in Bollerslev and Ghysels (1996), the solution found by heuristic optimization, and the results from the software packages (with default settings) as reported in Brooks, Burke, and Persand (2001)

Method                    ψ1 = µ        ψ2 = α0     ψ3 = α1    ψ4 = β1
Benchmark                 –0.00619041   0.0107613   0.153134   0.805974
Heuristic optimization    –0.00619034   0.0107614   0.153134   0.805973
Software packages:
  E-Views                 –0.00540      0.0096      0.143      0.821
  Gauss-Fanpac            –0.00600      0.0110      0.153      0.806
– which, by concept, uses perfectly random initial values.34 Unlike with traditional deterministic optimization techniques, the reliability of the heuristic can be increased arbitrarily by increasing the runtime (which is the basic conclusion from the convergence proofs for HO algorithms).
2.5 Conclusion
In this chapter, some basic concepts of optimization in general and heuristic opti-
mization methods in particular were introduced. The heuristics presented in this
chapter differ significantly in various aspects: the varieties range from repeatedly
modifying one candidate solution per iteration to whole populations of search
agents each of them representing one candidate solution; from neighborhood search
strategies to global search methods, etc. As diverse as these methods are, so diverse are
34 In practice, HO techniques do not always benefit when the initial values come from some "sophisticated guess" or another optimization, as this often means that the optimizer's first and prime task is to overcome a local optimum. Conversely, heuristically determined solutions might sometimes be used as initial values for traditional methods. Likewise, it might be reasonable to have the fine-tuning of the heuristic's last iterations done by a strict up-hill search.
also their advantages and disadvantages: Simulated Annealing and Threshold Ac-
cepting are relatively easy to implement and are good general purpose methods,
yet they tend to have problems when the search space is excessively large and has
many local optima. Other methods such as Genetic Algorithms or Memetic Algo-
rithms, on the other hand, are more complex and their implementation demands
some experience with heuristic optimization, yet they can deal with more compli-
cated and highly demanding optimization problems. Hence, there is not one best heuristic that would be superior to all other methods. It is rather a "different courses, different horses" situation where criteria such as the type of optimization problem, restrictions on computational time, experience with implementing different HO algorithms, the programming environment, the availability of toolboxes, and so on influence the decision which heuristic to choose – or eventually lead to new or hybrid methods.