
A Genetic Algorithm Tutorial

Article in Statistics and Computing, October 1998. DOI: 10.1007/BF00175354


Department of
Computer Science

A Genetic Algorithm Tutorial


Darrell Whitley

Technical Report CS-93-103

March 10, 1993

Colorado State University


A Genetic Algorithm Tutorial
Darrell Whitley
Computer Science Department, Colorado State University
Fort Collins, CO 80523 [email protected]

Abstract
This tutorial covers the canonical genetic algorithm as well as more experimental
forms of genetic algorithms, including parallel island models and parallel cellular genetic
algorithms. The tutorial also illustrates genetic search by hyperplane sampling. The
theoretical foundations of genetic algorithms are reviewed, including the schema theorem
as well as recently developed exact models of the canonical genetic algorithm.

Keywords: Genetic Algorithms, Search, Parallel Algorithms

1 Introduction
In its most general usage, Genetic Algorithms refer to a family of computational models
inspired by evolution. These algorithms encode a potential solution to a specific problem
on a simple chromosome-like data structure and apply recombination operators to these
structures so as to preserve critical information. Genetic algorithms are often viewed as
function optimizers, although the range of problems to which genetic algorithms have been
applied is quite broad.
An implementation of a genetic algorithm begins with a population of (typically random)
chromosomes. One then evaluates these structures and allocates reproductive opportunities
in such a way that those chromosomes which represent a better solution to the target problem
are given more chances to "reproduce" than those chromosomes which are poorer solutions.
The "goodness" of a solution is typically defined with respect to the current population
average, or the population median.
This particular description of a genetic algorithm is intentionally abstract because in
some sense, the term genetic algorithm has two meanings. In a strict interpretation, the
genetic algorithm refers to a model introduced and investigated by John Holland (1975) and
by students of Holland (e.g., DeJong, 1975). It is still the case that most of the existing
theory for genetic algorithms applies either solely or primarily to the model introduced by
Holland, as well as variations on what will be referred to in this paper as the canonical
genetic algorithm. Recent theoretical advances in modeling genetic algorithms also apply
primarily to the canonical genetic algorithm (Vose, 1993).

In a broader usage of the term, a genetic algorithm is any population-based model that
uses selection and recombination operators to generate new sample points in a search space.
Many genetic algorithm models have been introduced by researchers largely working from
an experimental perspective. Many of these researchers are application oriented and are
typically interested in genetic algorithms as optimization tools.
The goal of this tutorial is to present genetic algorithms in such a way that students new
to this field can grasp the basic concepts behind genetic algorithms as they work through
the tutorial. It should allow the more sophisticated reader to absorb this material with
relative ease. The tutorial also covers topics, such as inversion, which have sometimes been
misunderstood and misused by researchers new to the field.
The tutorial begins with a very low level discussion of optimization to introduce both
basic ideas in optimization and basic concepts that relate to genetic algorithms. In
Section 2 variants of the canonical genetic algorithm are reviewed. In Section 3 the principle
of hyperplane sampling is explored and some basic crossover operators are introduced. In
Section 4 various versions of the schema theorem are developed in a step by step fashion
and other crossover operators are discussed. In Section 5 binary alphabets and their effects
on hyperplane sampling are considered. In Section 6 an exact model of the genetic algorithm
is developed which assumes infinite population sizes. The last three sections of the tutorial
cover alternative forms of genetic algorithms and evolutionary computational models,
including specialized parallel implementations.

1.1 Encodings and Optimization Problems


Usually there are only two main components of most genetic algorithms that are problem
dependent: the problem encoding and the evaluation function.
Consider a parameter optimization problem where we must optimize a set of variables to
either maximize some target such as profit, or to minimize cost or some measure of error. We
might view such a problem as a black box with a series of control dials representing different
parameters; the only output of the black box is a value returned by an evaluation function
indicating how well a particular combination of parameter settings solves the optimization
problem. The goal is to set the various parameters so as to optimize some output. In more
traditional terms, we wish to minimize (or maximize) some function F(X_1, X_2, ..., X_M).

Most users of genetic algorithms typically are concerned with problems that are nonlinear:
in other words, it is not possible to treat each parameter as an independent variable which
can be solved in isolation from the other variables. There are interactions such that the
combined effects of the parameters must be considered in order to maximize or minimize
the output of the black box. In the genetic algorithm community, the interaction between
variables is sometimes referred to as epistasis.
The first assumption that is typically made is that the variables representing parameters
can be represented by bit strings. This means that the variables are discretized in an a
priori fashion, and that the range of the discretization corresponds to some power of 2. For
example, with 10 bits per parameter, we obtain a range with 1024 discrete values. If the
parameters are actually continuous then this discretization is not a particular problem. This
assumes, of course, that the discretization provides enough resolution to make it possible to
adjust the output with the desired level of precision. It also assumes that the discretization
is in some sense representative of the underlying function.
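As a concrete illustration of this discretization, a bit string can be decoded by treating it as an unsigned integer and scaling it onto the parameter's interval. This is a minimal sketch, not the tutorial's own code; the function name and the example interval are illustrative.

```python
def decode(bits, lo, hi):
    """Map a bit string onto the interval [lo, hi].

    With n bits there are 2**n evenly spaced values, e.g. 1024 values
    for 10 bits per parameter, as in the example above.
    """
    n = len(bits)
    i = int(bits, 2)                    # integer in 0 .. 2**n - 1
    return lo + i * (hi - lo) / (2 ** n - 1)

# 10 bits discretize the interval [-5.0, 5.0] into 1024 values:
lowest = decode("0000000000", -5.0, 5.0)    # -5.0
highest = decode("1111111111", -5.0, 5.0)   # 5.0
```

Whether this resolution is sufficient depends, as noted above, on the precision the application requires.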
If some parameter can only take on an exact finite set of values then the coding issue
becomes more difficult. For example, what if there are exactly 1200 discrete values which
can be assigned to some variable X_i? We need at least 11 bits to cover this range, but this
codes for a total of 2048 discrete values. The 848 unnecessary bit patterns may result in
no evaluation (i.e., a default worst possible evaluation) or some parameter settings may be
represented twice so that all binary strings result in a legal set of parameter values which can
be evaluated. Solving such coding problems is usually considered to be part of the design of
the evaluation function.
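The second option above, representing some settings twice so that every binary string is legal, can be sketched with a simple modulo mapping. This is one of several possible choices, and the helper name is hypothetical.

```python
def decode_1200(bits):
    """Map an 11-bit string onto one of exactly 1200 legal values.

    2**11 = 2048 bit patterns must cover 1200 values, so 848 patterns
    are redundant: taking the integer modulo 1200 makes every string
    decode to a legal value, at the cost of representing the first 848
    values twice.
    """
    return int(bits, 2) % 1200
```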
Aside from the coding issue, the evaluation function is usually given as part of the problem
description. On the other hand, developing an evaluation function can sometimes involve
developing a simulation. In other cases, the evaluation may be performance based and
may represent only an approximate or partial evaluation. For example, consider a control
application where the system can be in any one of an exponentially large number of possible
states. Assume a genetic algorithm is used to optimize some form of control strategy. In
such cases, the state space must be sampled in a limited fashion and the resulting evaluation
of control strategies is approximate and noisy (cf. Fitzpatrick and Grefenstette, 1988).
The evaluation function must also be relatively fast. This is typically true for any opti-
mization method, but it may particularly pose an issue for genetic algorithms. Since genetic
algorithms work with a population of potential solutions, they incur the cost of evaluating this
population. Furthermore, the population is replaced (all or in part) on a generational basis.
The members of the population reproduce, and their offspring must then be evaluated. If it
takes 1 hour to do an evaluation, then it takes over 1 year to do 10,000 evaluations. This
would be approximately 100 generations for a population of 100, but only 10 generations
using a population of 1000 strings.

1.2 How Hard is Hard?


Assuming the interaction between parameters is nonlinear, the size of the search space is
related to the number of bits used in the problem encoding. For a bit string encoding of
length L, the size of the search space is 2^L and forms a hypercube. The genetic algorithm
samples the corners of this L-dimensional hypercube.
Generally, most test functions are at least 30 bits in length and most researchers would
probably agree that larger test functions are probably needed. Anything much smaller
represents a space which can be enumerated. (Considering for a moment that the national
debt of the United States in 1993 is approximately 2^42 dollars, 2^30 does not sound quite
so large.) Of course, the expression 2^L grows exponentially with respect to L. Consider a
problem with an encoding of 400 bits. How big is the associated search space? A classic
introductory textbook on Artificial Intelligence gives one characterization of a space of this
size. Winston (1992:102) points out that 2^400 is a good approximation of the effective size
of the search space of possible board configurations in chess. (This assumes the effective
branching factor at each possible move to be 16 and that a game is made up of 100 moves:
16^100 = (2^4)^100 = 2^400.) Winston states that this is "a ridiculously large number. In fact, if
all the atoms in the universe had been computing chess moves at picosecond rates since the
big bang (if any), the analysis would be just getting started."
The point is that as long as the number of "good solutions" to a problem is sparse with
respect to the size of the search space, then random search or search by enumeration of a large
search space is not a practical form of problem solving. On the other hand, any search other
than random search imposes some bias in terms of how it looks for better solutions and where
it looks in the search space. Genetic algorithms indeed introduce a particular bias in terms
of what new points in the space will be sampled. Nevertheless, genetic algorithms belong
to the class of methods known as "weak methods" in the Artificial Intelligence community
because they make relatively few assumptions about the problem that is being solved.
Of course, there are many optimization methods that have been developed in mathe-
matics and operations research. What role do genetic algorithms play as an optimization
tool? Genetic algorithms are often described as a global search method that does not use
gradient information. Thus, nondifferentiable functions as well as functions with multiple
local minima represent classes of problems to which genetic algorithms might be applied.
Genetic algorithms, as a weak method, are robust but very general. If there exists a good
specialized optimization method for a specific problem, then a genetic algorithm may not be
the best optimization tool for that application. On the other hand, some researchers work
with hybrid algorithms that combine existing methods with genetic algorithms.

2 Variants of the Canonical Genetic Algorithm


The first step in the implementation of any genetic algorithm is to generate an initial pop-
ulation. In the canonical genetic algorithm each member of this population will be a binary
string of length L which corresponds to the problem encoding. Each string is sometimes
referred to as a "genotype" (Holland, 1975) or, alternatively, a "chromosome" (Schaffer,
1987). In most cases the initial population is generated randomly. After creating an initial
population, each string is then evaluated and assigned a fitness value.
The notions of evaluation and fitness are sometimes used interchangeably. However, it
is useful to distinguish between the evaluation function and the fitness function used by a
genetic algorithm. In this tutorial, the evaluation function, or objective function, provides a
measure of performance with respect to a particular set of parameters. The fitness function
transforms that measure of performance into an allocation of reproductive opportunities.
The evaluation of a string representing a set of parameters is independent of the evaluation
of any other string. The fitness of that string, however, is always defined with respect to
other members of the current population.
In the canonical genetic algorithm, fitness is defined by f_i/f̄, where f_i is the evaluation
associated with string i and f̄ is the average evaluation of all the strings in the population.
Fitness can also be assigned based on a string's rank in the population (Baker, 1985; Whitley,
1989) or by sampling methods, such as tournament selection (Goldberg, 1990).
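The canonical fitness assignment f_i/f̄ can be computed directly from the evaluations. This is a minimal sketch; the function name is illustrative.

```python
def fitness(evals):
    """Canonical fitness: each evaluation divided by the population average.

    The resulting values average 1.0 over the population, so f_i / f-bar
    is directly the expected number of copies string i places in the
    intermediate population.
    """
    avg = sum(evals) / len(evals)
    return [e / avg for e in evals]

f = fitness([4.0, 2.0, 2.0])   # average evaluation is 8/3, so f = [1.5, 0.75, 0.75]
```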

        Selection                 Recombination
       (Duplication)               (Crossover)

    String 1  --->  String 1  --->  Offspring-A (1 X 2)
    String 2  --->  String 2  --->  Offspring-B (1 X 2)
    String 3  --->  String 2  --->  Offspring-A (2 X 4)
    String 4  --->  String 4  --->  Offspring-B (2 X 4)

    Current         Intermediate    Next
    Generation t    Generation t    Generation t + 1

Figure 1: One generation is broken down into a selection phase and recombination phase.
This figure shows strings being assigned into adjacent slots during selection. In fact, they
can be assigned slots randomly in order to shuffle the intermediate population. Mutation (not
shown) can be applied after crossover.

It is useful to view the execution of the genetic algorithm as a two stage process. It
starts with the current population. Selection is applied to the current population to create an
intermediate population. Then recombination and mutation are applied to the intermediate
population to create the next population. The process of going from the current population
to the next population constitutes one generation in the execution of a genetic algorithm.
Goldberg (1989) refers to this basic implementation as a Simple Genetic Algorithm (SGA).
We will first consider the construction of the intermediate population from the current
population. In the first generation the current population is also the initial population. After
calculating f_i/f̄ for all the strings in the current population, selection is carried out. In the
canonical genetic algorithm strings in the current population are copied (i.e., duplicated) in
proportion to their fitness and placed in the intermediate generation.
There are a number of ways that the selection process can be carried out. For example,
we might view the entire population as mapping onto a roulette wheel, where each individual
is represented on the roulette wheel by a space that proportionally corresponds to its fitness.
By repeatedly spinning the roulette wheel, individuals are chosen to fill the intermediate
population. This amounts to "stochastic sampling with replacement."
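The roulette-wheel view can be sketched as follows. This is a minimal illustration, not the tutorial's own code; the function name and the seeded random number generator are assumptions.

```python
import random

def roulette(fitnesses, rng):
    """One spin of the roulette wheel: return the index of the chosen string.

    Each individual occupies a slice of the wheel proportional to its
    fitness; repeated spins implement stochastic sampling with replacement.
    """
    total = sum(fitnesses)
    spin = rng.random() * total
    acc = 0.0
    for i, f in enumerate(fitnesses):
        acc += f
        if spin < acc:
            return i
    return len(fitnesses) - 1   # guard against floating-point round-off

# A string with 3/4 of the total fitness should win about 3/4 of the spins:
rng = random.Random(0)
picks = [roulette([3.0, 1.0], rng) for _ in range(10000)]
```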

A selection process that will more closely match the expected fitness values is "remainder
stochastic sampling." For each string i where f_i/f̄ is greater than 1.0, the integer portion of
this number indicates how many copies of that string are directly placed in the intermediate
population. All strings (including those with f_i/f̄ less than 1.0) then place additional copies
in the intermediate population with a probability corresponding to the fractional portion of
f_i/f̄. For example, a string with f_i/f̄ = 1.36 places 1 copy in the intermediate population,
and then receives a 0.36 chance of placing a second copy. A string with a fitness of f_i/f̄ = 0.54
has a 0.54 chance of placing one string in the intermediate population.
One can view "remainder stochastic sampling" as using a roulette wheel which represents
the fractional parts of the fitness values in the correct proportions. In sampling with replace-
ment, the same roulette wheel is used to pick copies of strings until the population is full.
In sampling without replacement, the number of strings represented on the roulette wheel is
reduced by one each time a string is chosen to be placed in the intermediate population.
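Remainder stochastic sampling as described above can be sketched directly: guaranteed copies from the integer part, a probabilistic extra copy from the fractional part. The function name is illustrative, and the sketch assumes the fitnesses are already scaled as f_i/f̄.

```python
import random

def remainder_stochastic_sampling(fitnesses, rng):
    """Build the intermediate population as a list of parent indices.

    `fitnesses` are f_i / f-bar values, so they sum to the population
    size. The integer part of each value gives guaranteed copies; the
    fractional part is the probability of one additional copy.
    """
    intermediate = []
    for i, f in enumerate(fitnesses):
        copies = int(f)                   # guaranteed copies
        if rng.random() < f - copies:     # fractional part as a probability
            copies += 1
        intermediate.extend([i] * copies)
    return intermediate
```

For example, with fitness 1.36 the string always places one copy and places a second with probability 0.36, matching the text above.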
After selection has been carried out the construction of the intermediate population is
complete and recombination can occur. This can be viewed as creating the next population
from the intermediate population. Crossover is applied to randomly paired strings with
a probability denoted p_c. (The population should already be sufficiently shuffled by the
random selection process.) Pick a pair of strings. With probability p_c, "recombine" these
strings to form two new strings that are inserted into the next population.
Consider the following binary string: 1101001100101101. The string would represent a
possible solution to some parameter optimization problem. New sample points in the space
are generated by recombining two parent strings. Consider the string 1101001100101101 and
another binary string, yxyyxyxxyyyxyxxy, in which the values 0 and 1 are denoted by x and
y. Using a single randomly chosen crossover point, recombination occurs as follows.
11010 \/ 01100101101
yxyyx /\ yxxyyyxyxxy

Swapping the fragments between the two parents produces the following offspring.
11010yxxyyyxyxxy and yxyyx01100101101
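Single-point crossover amounts to slicing both parents at the same point and exchanging the tails. This minimal sketch reproduces the example above; the helper name is illustrative, and in the algorithm the point would be drawn uniformly between 1 and L-1.

```python
def cross_at(p1, p2, point):
    """Single-point crossover: exchange everything after `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

# Reproduce the example above, with the crossover point after bit 5:
o1, o2 = cross_at("1101001100101101", "yxyyxyxxyyyxyxxy", 5)
print(o1)   # 11010yxxyyyxyxxy
print(o2)   # yxyyx01100101101
```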

After recombination, we can apply a mutation operator. For each bit in the population,
mutate with some low probability p_m. Typically the mutation rate is less than 1%.
In some cases, mutation is interpreted as randomly generating a new bit,
in which case, only 50% of the time will the "mutation" actually change the bit value. In
other cases, mutation is interpreted to mean actually flipping the bit. The difference is no
more than an implementation detail as long as the user/reader is aware of the difference
and understands that the first form of mutation produces a change in bit values only half as
often as the second, and that one version of mutation is just a scaled version of the other.
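The second interpretation, actually flipping each bit with probability p_m, can be sketched as follows; the function name is illustrative.

```python
import random

def mutate(bits, rate, rng):
    """Bit-flip mutation: each bit flips independently with probability `rate`."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < rate else b
        for b in bits
    )
```

Under the first interpretation (randomly regenerating the bit), a rate of 2 * p_m would produce the same expected number of changed bits.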
After the processes of selection, recombination and mutation are complete, the next pop-
ulation can be evaluated. The process of evaluation, selection, recombination and mutation
forms one generation in the execution of a genetic algorithm.

The algorithm that has just been described is not precisely the genetic plan first described
by Holland. All strings are selected according to fitness in the preceding version of the
canonical genetic algorithm. However, Holland (1975) assumed that one string was picked
according to fitness, and the second was picked randomly. These strings were then placed in
the intermediate population as a pair. Holland's original algorithm can be summarized as
follows. We will now characterize the process as one of directly constructing a new population
from the old population.
1. Initialize Population.
2. Evaluate each member of the Population.
3. Select a chromosome from the Population based on fitness.
4. Perform crossover with probability p_c. If crossover is not performed, put the chromosome
into the New Population and Goto Step 5. Otherwise:
(a) Select a mate from the Population with uniform probability.
(b) Select a crossover point between 1 and L-1 with uniform probability.
(c) Recombine the chromosomes and place both offspring in the New Population.
5. If the New Population is not full, Goto Step 3.
6. The New Population is full: Population = New Population. Goto Step 2.
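The steps above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not Holland's own code: the objective (`evaluate`, counting 1 bits) is a stand-in, the parameter values are arbitrary, and mutation is omitted because the listed steps do not include it.

```python
import random

L = 8             # string length
P_C = 0.7         # crossover probability
POP_SIZE = 20

def evaluate(s):
    # Stand-in objective: count of 1 bits, offset so every fitness is positive.
    return s.count("1") + 1

def next_generation(pop, rng):
    fits = [evaluate(s) for s in pop]
    new_pop = []
    while len(new_pop) < POP_SIZE:                        # step 5
        parent = rng.choices(pop, weights=fits, k=1)[0]   # step 3: fitness-based
        if rng.random() < P_C:                            # step 4
            mate = rng.choice(pop)                        # step 4a: uniform mate
            point = rng.randint(1, L - 1)                 # step 4b: crossover point
            new_pop.append(parent[:point] + mate[point:]) # step 4c: both offspring
            new_pop.append(mate[:point] + parent[point:])
        else:
            new_pop.append(parent)
    return new_pop[:POP_SIZE]

rng = random.Random(0)
pop = ["".join(rng.choice("01") for _ in range(L)) for _ in range(POP_SIZE)]
for _ in range(30):                                       # step 6: repeat
    pop = next_generation(pop, rng)
```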
One reason for considering both the simple genetic algorithm and Holland's original
genetic plan is to understand the different theoretical constructs which can be found in the
literature. Some literature is based on one form of the algorithm, while other works are
based on the alternative form. See Schaffer (1987) for a nice discussion of these two forms.

2.1 Why does it work? Search Spaces as Hypercubes.


The question that most people who are new to the field of genetic algorithms ask at this
point is why such a process should do anything useful. Why should one believe that this is
going to result in an effective form of search or optimization?
The answer which is most widely given to explain the computational behavior of genetic
algorithms came out of John Holland's work. In his classic 1975 book, Adaptation in Nat-
ural and Artificial Systems, Holland develops several arguments designed to explain how a
"genetic plan" or "genetic algorithm" can result in complex and robust search by implicitly
sampling hyperplane partitions of a search space.
Perhaps the best way to understand how a genetic algorithm can sample hyperplane
partitions is to consider a simple 3-dimensional space (see Figure 2). Assume we have a
problem encoded with just 3 bits; this can be represented as a simple cube with the string
000 at the origin. The corners in this cube are numbered by bit strings and all adjacent
corners are labelled by bit strings that differ by exactly 1 bit. An example is given in the
top of Figure 2. The front plane of the cube contains all the points that begin with 0.
If "*" is used as a "don't care" or wild card match symbol, then this plane can also be
represented by the special string 0**. Strings that contain * are referred to as schemata;
each schema corresponds to a hyperplane in the search space. The "order" of a hyperplane
refers to the number of actual bit values that appear in its schema. Thus, 1** is order-1
while 1**1******0** would be of order-3.
The bottom of Figure 2 illustrates a 4-dimensional space represented by a cube "hanging"
inside another cube. The points can be labeled as follows. Label the points in the inner cube
and outer cube exactly as they are labeled in the top 3-dimensional space. Next, prefix each
inner cube labeling with a 1 bit and each outer cube labeling with a 0 bit. This creates an
assignment to the points in hyperspace that gives the proper adjacency in the space between
strings that are 1 bit different. The inner cube now corresponds to the hyperplane 1***
while the outer cube corresponds to 0***. It is also rather easy to see that *0** corresponds
to the subset of points that corresponds to the fronts of both cubes. The order-2 hyperplane
10** corresponds to the front of the inner cube.
A bit string matches a particular schema if that bit string can be constructed from
the schema by replacing the "*" symbols with the appropriate bit values. In general, all
bit strings that match a particular schema are contained in the hyperplane partition rep-
resented by that particular schema. Every binary encoding is a "chromosome" which
corresponds to a corner in the hypercube and is a member of 2^l - 1 different hyperplanes,
where l is the length of the binary encoding. (The string of all * symbols corresponds to the
space itself and is not counted as a partition of the space (Holland 1975:72).) This can be
shown by taking a bit string and looking at all the possible ways that any subset of bits can
be replaced by "*" symbols. In other words, there are l positions in the bit string and each
position can be either the bit value contained in the string or the "*" symbol.
It is also relatively easy to see that 3^l - 1 hyperplane partitions can be defined over the
entire search space. For each of the l positions in the bit string we can have either the value
*, 1 or 0, which results in 3^l combinations.
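Both counting arguments can be verified directly for a small l by enumeration. This is a minimal sketch; the function and variable names are illustrative.

```python
from itertools import product

def matches(schema, s):
    """True if bit string s lies in the hyperplane named by `schema`."""
    return all(c == "*" or c == b for c, b in zip(schema, s))

l = 3
all_schemata = ["".join(p) for p in product("01*", repeat=l)]      # 3**l strings
partitions = [h for h in all_schemata if h != "*" * l]             # 3**l - 1 partitions

# Any one string, e.g. 101, is a member of 2**l - 1 of these partitions:
containing = [h for h in partitions if matches(h, "101")]
```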
Establishing that each string is a member of 2^l - 1 hyperplane partitions doesn't provide
very much information if each point in the search space is examined in isolation. This is why
the notion of a population based search is so critical to genetic algorithms. A population
of sample points provides information about numerous hyperplanes; furthermore, low order
hyperplanes should be sampled by numerous points in the population. (This issue is reexam-
ined in more detail in subsequent sections of this paper.) A key part of a genetic algorithm's
intrinsic or implicit parallelism is derived from the fact that many hyperplanes are sampled
when a population of strings is evaluated (Holland 1975); in fact, it can be argued that
far more hyperplanes are sampled than the number of strings contained in the population.
Many different hyperplanes are evaluated in an implicitly parallel fashion each time a single
string is evaluated (Holland 1975:74); but, of course, it is the cumulative effects of evaluating
a population of points that provides statistical information about any particular subset of
hyperplanes.^1

^1 Holland initially used the term intrinsic parallelism in his 1975 monograph, then decided to switch to
implicit parallelism to avoid confusion with terminology in parallel computing. Unfortunately, the term
implicit parallelism in the parallel computing community refers to parallelism which is extracted from code
written in functional languages that have no explicit parallel constructs. Implicit parallelism does not refer to
the potential for running genetic algorithms on parallel hardware, although genetic algorithms are generally
viewed as highly parallelizable algorithms.

[Figure 2 appears here: a 3-dimensional cube with corners labeled 000 through 111, and a
4-dimensional hypercube drawn as an inner cube nested inside an outer cube.]

Figure 2: A 3-dimensional cube and a 4-dimensional hypercube. The corners of the inner
cube and outer cube in the bottom 4-D example are numbered in the same way as in the upper
3-D cube, except a 1 is added as a prefix to the labels of the inner cube and a 0 is added as a
prefix to the labels of the outer cube. Only select points are labeled in the 4-D hypercube.

Implicit parallelism implies that many hyperplane competitions are simultaneously solved
in parallel. The theory indicates that through the process of reproduction and recombina-
tion, the schemata of competing hyperplanes increase or decrease their representation in
the population according to the relative fitness of the strings that lie in those hyperplane
partitions. Because genetic algorithms operate on populations of strings, one can track the
proportional representation of a single schema representing a particular hyperplane in a pop-
ulation and indicate whether that hyperplane will increase or decrease its representation in
the population over time when fitness based selection is combined with crossover to produce
offspring from existing strings in the population.

3 Two Views of Hyperplane Sampling


Another way of looking at hyperspace partitions is presented in Figure 3. A function over a
single variable is plotted as a one-dimensional space, with function maximization as a goal.
The hyperplane 0****...** spans the first half of the space and 1****...** spans the second
half of the space. Since the strings in the 0****...** partition are on average better than those
in the 1****...** partition, the search will be proportionally biased toward this partition.
In the second graph the portion of the space corresponding to **1**...** is shaded, which
also highlights the intersection of 0****...** and **1**...**, namely, 0*1*...**. Finally, in
the third graph, 0*10 is highlighted.
One of the points of Figure 3 is that the sampling of hyperplane partitions is not really
affected by local minima. At the same time, increasing the sampling rate of partitions that
are above average compared to other competing partitions does not guarantee convergence
to a global optimum. The global optimum could be a relatively isolated peak, for example.
Nevertheless, good solutions that are globally competitive should be found.
It is also a useful exercise to look at an example of a simple genetic algorithm in action.
In Table 1, the first 3 bits of each string are given explicitly while the remainder of the bit
positions are unspecified. The goal is to look at only those hyperplanes defined over the
first 3 bit positions in order to see what actually happens during the selection phase where
strings are duplicated according to fitness. The theory behind genetic algorithms suggests
that the sampling rate of each hyperplane should change according to the average fitness of
the strings in the population that are contained in the corresponding hyperplane partition.
Thus, even though a genetic algorithm never explicitly evaluates any particular hyperplane
partition, it should change the sampling rates of these hyperplanes as if it had.
The example population in Table 1 contains only 21 (partially specified) strings. Since we
are not particularly concerned with the exact evaluation of these strings, the fitness values
will be assigned according to rank. (The notion of assigning fitness by rank rather than by
fitness proportional representation has not been discussed in detail, but the current example
relates to change in representation due to fitness and not how that fitness is assigned.) The
table includes information on the fitness of each string and the number of copies to be
placed in the intermediate population. The number of copies produced during selection is
determined by automatically assigning the integer part, then assigning the fractional part by
generating a random value between 0.0 and 1.0 (a form of remainder stochastic sampling).
[Figure 3 appears here: three plots of a function F(X) over the variable X on the interval
[0, K], shading in turn the partitions 0***...*, **1*...*, and 0*10*...*.]

Figure 3: A function and various partitions of hyperspace. Fitness is scaled to a 0 to 1 range
in this diagram.

11
Strings and Fitness Values

String                     Fitness   Random   Copies
001 b1,4  ... b1,L          2.0       --        2
101 b2,4  ... b2,L          1.9       0.93      2
111 b3,4  ... b3,L          1.8       0.65      2
010 b4,4  ... b4,L          1.7       0.02      1
111 b5,4  ... b5,L          1.6       0.51      2
101 b6,4  ... b6,L          1.5       0.20      1
011 b7,4  ... b7,L          1.4       0.93      2
001 b8,4  ... b8,L          1.3       0.20      1
000 b9,4  ... b9,L          1.2       0.37      1
100 b10,4 ... b10,L         1.1       0.79      1
010 b11,4 ... b11,L         1.0       --        1
011 b12,4 ... b12,L         0.9       0.28      1
000 b13,4 ... b13,L         0.8       0.13      0
110 b14,4 ... b14,L         0.7       0.70      1
110 b15,4 ... b15,L         0.6       0.80      1
100 b16,4 ... b16,L         0.5       0.51      1
011 b17,4 ... b17,L         0.4       0.76      1
000 b18,4 ... b18,L         0.3       0.45      0
001 b19,4 ... b19,L         0.2       0.61      0
100 b20,4 ... b20,L         0.1       0.07      0
010 b21,4 ... b21,L         0.0       --        0

Table 1: A population with fitness assigned to strings according to rank. The Random column
is used to determine whether or not to award a copy of a string for the fractional portion of
the fitness measure: if the random value is greater than (1 - remainder), then an additional
copy is awarded to the corresponding individual.
Since the theory behind genetic algorithms suggests that many hyperplanes are processed
implicitly in parallel when selection acts on the population of strings, Table 2 enumerates the
27 (i.e., 3^3) hyperplanes that can be defined over the first three bits of the strings in the
population and explicitly calculates the fitness of the sample of the corresponding hyperplane
partition. The true fitness of a hyperplane partition corresponds to the average fitness of all
strings that lie in that hyperplane partition. The genetic algorithm uses the population as a
sample for estimating the fitness of that hyperplane partition. Of course, the only time the
sample is random is during the first generation. After this, the sampling is biased toward
regions that have proven to contain strings that are above average with respect to previous
populations.

Schemata and Fitness Values

Schema      Avg    Exp    Obs       Schema      Avg    Exp    Obs
101*...*    1.70   3.40    3        *0**...*    0.99  10.90    9
111*...*    1.70   3.40    4        00**...*    0.96   5.76    4
1*1*...*    1.70   6.80    7        0***...*    0.93  11.20   10
*01*...*    1.38   6.90    6        011*...*    0.90   2.70    4
**1*...*    1.30  13.00   14        010*...*    0.90   2.70    2
*11*...*    1.22   6.10    8        01**...*    0.90   5.40    6
11**...*    1.17   4.68    6        0*0*...*    0.83   4.98    3
001*...*    1.16   3.48    3        *10*...*    0.80   4.00    4
1***...*    1.09   9.80   11        000*...*    0.76   2.28    1
0*1*...*    1.03   6.18    7        **0*...*    0.72   7.92    7
10**...*    1.02   5.10    5        *00*...*    0.66   3.96    3
*1**...*    1.01  10.10   12        110*...*    0.65   1.30    2
***...*     1.00  21.00   21        1*0*...*    0.60   3.00    4
                                    100*...*    0.56   1.70    2

Table 2: The average fitnesses (Avg) associated with the samples from the 27 hyperplanes
defined over the first three bit positions are explicitly calculated. The Expected representation
(Exp) and Observed representation (Obs) are shown.
If the genetic algorithm works as advertised, the number of copies of strings that actually
fall in a particular hyperplane partition after selection should approximate the expected
number of copies that should fall in that partition. The expected number of strings sampling
a hyperplane partition after selection can be calculated by multiplying the number of copies
in the current population before selection by the average fitness of the strings in the
population that fall in that partition. Both the expected and observed number of copies in all 27
hyperplane partitions are given in Table 2. As can be seen, in most cases the match between
the expected and observed sampling rates is fairly good: the error is largely a result of
sampling error due to the small population size.
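This bookkeeping is easy to reproduce. The sketch below is my own illustration (the names `matches` and `hyperplane_stats` are hypothetical); it computes the sample average fitness of a schema and its expected representation after selection, M(H,t) times the average fitness of its sample divided by the population average.

```python
def matches(schema, string):
    """True when the string lies in the hyperplane named by the schema;
    '*' is the don't-care symbol."""
    return all(s == '*' or s == c for s, c in zip(schema, string))

def hyperplane_stats(schema, population):
    """population is a list of (bitstring, fitness) pairs.  Returns the
    sample average fitness of the schema and the expected number of
    copies after selection: M(H, t) * f(H) / f_bar."""
    hits = [fit for s, fit in population if matches(schema, s)]
    f_bar = sum(fit for _, fit in population) / len(population)
    if not hits:
        return 0.0, 0.0
    avg = sum(hits) / len(hits)
    return avg, len(hits) * avg / f_bar
```

Running this over a population like the one in Table 1 reproduces the Avg and Exp columns of Table 2.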
At this point it is useful to begin formalizing the idea of tracking the sampling rate of a
particular hyperplane, H. Let M(H, t) be the number of strings sampling hyperplane H at
the current generation t in some population. Let (t + intermediate) index the generation t
after selection, and let f(H) be the average evaluation of the sample of strings in partition H
in the current population, with f̄ denoting the average evaluation of all strings in the
population. Formally, the notion that we are implicitly sampling a hyperplane H according to
hyperplane fitness is expressed as follows:

    M(H, t + intermediate) = M(H, t) * f(H) / f̄.

3.1 Crossover Operators and Schemata
The expected representation of hyperplanes in Table 2 corresponds to the representation in the
intermediate population after selection but before recombination. What does recombination
do to the observed sampling rate of the hyperplane partitions? It is easy to see that the
order-1 hyperplane samples are not affected by recombination, since the single critical bit is
always inherited by one of the offspring. However, the observed sampling rate of hyperplane
partitions of order-2 and higher can be affected by crossover. Furthermore, hyperplanes
of the same order are not necessarily all affected with the same probability. Consider 1-point
crossover. This recombination operator is convenient because it is relatively easy to quantify
its effects on different schemata representing hyperplanes. To keep things simple, assume we
are working with a string encoded with just 12 bits. Now consider the following two
schemata.
11********** and 1**********1

The probability that the bits in the first schema will be separated during 1-point crossover
is only 1/(L-1), since in general there are L-1 crossover points in a string of length L. The
probability that the bits in the second, rightmost schema are disrupted by 1-point crossover,
however, is (L-1)/(L-1), or 1.0, since each of the L-1 crossover points separates the bits in
the schema. This leads to a general observation: when using 1-point crossover the positions
of the bits in the schema are important in determining the likelihood that those bits will
remain together during crossover.

3.1.1 2-point Crossover


What happens if a 2-point crossover operator is used? A 2-point crossover operator uses two
randomly chosen crossover points. Strings exchange the segment that falls between these
two points. Ken DeJong first observed (1975) that 2-point crossover treats strings and
schemata as if they form a ring, which can be illustrated as follows:

             b12  b1                    1    1
         b11         b2             *            *
       b10             b3         *                *
       b9              b4         *                *
         b8          b5             *            *
             b7   b6                    *    *

where the ring on the right lays out the schema 1**********1 in the same fashion, so that its
two defined bits become adjacent.

When viewed in this way, 1-point crossover is a special case of 2-point crossover where
one of the crossover points always occurs at the wrap-around position between the first
and last bit. Maximum disruption for order-2 schemata now occurs when the 2 bits are at
complementary positions on this ring.
For 1-point and 2-point crossover it is clear that schemata which have bits that are close
together on the string encoding are less likely to be disrupted by crossover. More precisely,
hyperplanes represented by schemata with more compact representations should be sampled
at rates that are closer to those achieved under selection without crossover. To be even
more precise, we need to define what is meant by a compact representation with respect to
schemata. For current purposes a compact representation with respect to schemata is one
that minimizes the probability of disruption during crossover. Note that this definition is
operator dependent, since the two order-2 schemata examined in section 3.1 are equally
and maximally compact with respect to 2-point crossover, but are maximally different with
respect to 1-point crossover.

3.1.2 Linkage and Defining Length


Another concept that relates to this idea that different schemata are affected differently
by crossover is linkage. Usually linkage refers to a set of bits that are physically adjacent
(or at least close together) in the problem encoding. The idea behind linkage is that such
bits act as "coadapted alleles" that tend to be inherited together as a block. In this case an
allele would correspond to a particular bit value in a specific position on the chromosome. Of
course, linkage can be seen as a generalization of the notion of a compact representation with
respect to schemata. Equating linkage with physical adjacency in a string implicitly assumes
that 1-point crossover is the operator being used; linkage for 2-point crossover is different
and must be defined with respect to distance on the chromosome when treated as a ring.
Nevertheless, linkage usually is equated with physical adjacency on a string, as measured by
defining length.
The defining length of a schema is based on the distance between the first and last
bit in the schema with value either 0 or 1 (i.e., not a * symbol). Given that each position
in a schema can be 0, 1 or *, then scanning left to right, if Ix is the index of the position
of the rightmost occurrence of either a 0 or 1 and Iy is the index of the leftmost occurrence
of either a 0 or 1, then the defining length is merely Ix - Iy. Thus, the defining length of
****1**0**10** is 12 - 5 = 7. The defining length of a schema representing a hyperplane
H is denoted here by Δ(H). The defining length is a direct measure of how many crossover
points fall within the significant portion of a schema. If 1-point crossover is used, then
Δ(H)/(L-1) is also a direct measure of how likely crossover is to fall within the significant
portion of a schema during crossover.
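The scanning rule above translates directly into code. This is a small sketch of my own; the function names are illustrative rather than standard.

```python
def order(schema):
    """Number of defined (0 or 1) positions in the schema."""
    return sum(c in '01' for c in schema)

def defining_length(schema):
    """Distance between the leftmost and rightmost defined bits,
    using 1-indexed positions as in the text."""
    positions = [i + 1 for i, c in enumerate(schema) if c in '01']
    return positions[-1] - positions[0] if positions else 0

def disruption_bound_1pt(schema, L):
    """Probability that 1-point crossover cuts inside the significant
    portion of the schema: defining_length / (L - 1)."""
    return defining_length(schema) / (L - 1)
```

Applied to the two 12-bit schemata of section 3.1, this gives disruption probabilities of 1/11 and 1.0 respectively, matching the discussion above.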

3.1.3 Linkage and Inversion


Along with mutation and crossover, inversion is often considered to be a basic genetic
operator. Inversion can potentially change the linkage of bits on the chromosome such that bits
which have greater nonlinear interactions can potentially be moved closer together on the
chromosome.
Typically, inversion is implemented by reversing a random segment of the chromosome.
However, before one can start moving bits around on the chromosome to improve linkage,
the bits must have a position independent decoding. A common error that some researchers
make when first implementing inversion is to start reversing bit segments of a directly
encoded chromosome. But just reversing some random segment of bits is nothing more than
large scale mutation if the mapping from bits to parameters is position dependent.
A position independent encoding requires that each bit be tagged in some way. For
example, consider the following encoding composed of pairs where the first number is a bit
tag which indexes the bit and the second represents the bit value.
((6 0) (2 1) (7 1) (5 1) (3 0) (1 0) (4 0))
The linkage in this string can now be changed by moving around the tag-bit pairs, but
the string remains the same when decoded: 0100101. One must now also consider how
recombination is to be implemented. Goldberg and Bridges (1990), Whitley (1991), as well
as Holland (1975), discuss the problems of exploiting linkage and the recombination of tagged
representations.
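A minimal sketch of such a tagged, position-independent decoding (my own illustration; the function name is hypothetical):

```python
def decode(tagged):
    """Decode a chromosome of (tag, bit) pairs into a plain bit string:
    the tag gives the bit's position, so reordering the pairs (e.g. by
    inversion) does not change the decoded string."""
    return ''.join(str(bit) for _, bit in sorted(tagged))

# The example chromosome from the text.
chromosome = [(6, 0), (2, 1), (7, 1), (5, 1), (3, 0), (1, 0), (4, 0)]
```

Reversing or rotating any segment of the pair list leaves `decode` unchanged, which is exactly what makes inversion safe under this encoding.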

4 The Schema Theorem


A foundation has been laid to now develop the fundamental theorem of genetic algorithms.
The schema theorem (Holland, 1975) provides a lower bound on the change in the sampling
rate for a single hyperplane from generation t to generation t + 1.
Consider again what happens to a particular hyperplane, H, when only selection occurs:

    M(H, t + intermediate) = M(H, t) * f(H) / f̄.
To calculate M(H, t+1) we must consider the effects of crossover as the next generation
is created from the intermediate generation. First we consider that crossover is applied
probabilistically to a portion of the population. For that part of the population that does
not undergo crossover, the representation due to selection is unchanged. When crossover
does occur, then we must calculate losses due to its disruptive effects:

    M(H, t+1) = (1 - pc) M(H, t) f(H)/f̄ + pc [ M(H, t) f(H)/f̄ (1 - losses) + gains ].
In the derivation of the schema theorem a conservative assumption is made at this point.
It is assumed that crossover within the defining length of the schema is always disruptive
to the schema representing H. In fact this is not true and an exact calculation of the effects
of crossover is presented later in this paper. For example, assume we are interested in the
schema 11*****. If a string such as 1110101 were recombined between the first two bits with
a string such as 1000000 or 0100000, no disruption would occur in hyperplane 11***** since
one of the offspring would still reside in this partition. Also, if 1000000 and 0100000 were
recombined exactly between the first and second bit, a new independent offspring would
sample 11*****; this is the source of the gains that is referred to in the above calculation. To
simplify things, gains are ignored and the conservative assumption is made that crossover
falling in the significant portion of a schema always leads to disruption. Thus,

    M(H, t+1) >= (1 - pc) M(H, t) f(H)/f̄ + pc [ M(H, t) f(H)/f̄ (1 - disruptions) ]

where disruptions overestimates losses by assuming crossover is always disruptive. We might
wish to consider one exception: if two strings that both sample H are recombined, then no
disruption occurs. In Holland's original algorithm a mate is chosen uniformly without bias.
Thus the probability of a mate sampling H is just M(H, t) divided by the population size,
which will be denoted by P(H, t). Disruption is thus given by:

    Δ(H)/(L-1) * (1 - P(H, t)).

At this point, the inequality can be simplified. Both sides can be divided by the population
size to convert this into an expression for P(H, t+1), the proportional representation
of H at generation t + 1. Furthermore the expression can be rearranged with respect to pc:

    P(H, t+1) >= P(H, t) f(H)/f̄ [ 1 - pc * Δ(H)/(L-1) * (1 - P(H, t)) ].
We now have a useful version of the schema theorem (although it does not yet consider
mutation); but it is not the only version in the literature. For example, this version is
consistent with the fact that selection for the first parent string is fitness based and the second
parent is chosen randomly. But we have also examined a form of the simple genetic algorithm
where both parents are chosen based on fitness. This can be added to the schema theorem
by merely indicating that the alternative parent is chosen from the intermediate population
after selection:

    P(H, t+1) >= P(H, t) f(H)/f̄ [ 1 - pc * Δ(H)/(L-1) * (1 - P(H, t) f(H)/f̄) ].
Finally, mutation is included. Let o(H) be a function that returns the order of the
hyperplane H. The order of H exactly corresponds to a count of the number of bits in the
schema representing H that have value 0 or 1. Let the mutation probability be pm, where
mutation always flips the bit. Thus the probability that mutation does not affect the schema
representing H is (1 - pm)^o(H). This leads to the following expression of the schema theorem:

    P(H, t+1) >= P(H, t) f(H)/f̄ [ 1 - pc * Δ(H)/(L-1) * (1 - P(H, t) f(H)/f̄) ] (1 - pm)^o(H)
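The final inequality can be evaluated numerically. The sketch below is an illustration of mine (with f(H)/f̄ passed in as a single ratio); it returns the right-hand-side lower bound.

```python
def schema_lower_bound(P, fitness_ratio, delta, L, pc, pm, order):
    """Right-hand side of the schema theorem: a lower bound on P(H, t+1).

    P             -- proportion of the population sampling H at time t
    fitness_ratio -- f(H) / f_bar
    delta         -- defining length of the schema representing H
    order         -- number of defined bits, o(H)
    """
    after_selection = P * fitness_ratio
    crossover_survival = 1.0 - pc * (delta / (L - 1)) * (1.0 - after_selection)
    mutation_survival = (1.0 - pm) ** order
    return after_selection * crossover_survival * mutation_survival
```

With pc = pm = 0 the bound reduces to the selection-only expression P(H, t) f(H)/f̄, as it should.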

4.1 Crossover, Mutation and Premature Convergence


Clearly the schema theorem places the greatest emphasis on the role of crossover and
hyperplane sampling in genetic search. To maximize the preservation of hyperplane samples
after selection, the disruptive effects of crossover and mutation should be minimized. This
suggests that mutation should perhaps not be used at all, or at least used at very low levels.
The motivation for using mutation, then, is to prevent the permanent loss of any particular
bit or allele. After several generations it is possible that selection will drive all the bits
in some position to a single value: either 0 or 1. If this happens without the genetic
algorithm converging to a satisfactory solution, then the algorithm has prematurely converged.
This may particularly be a problem if one is working with a small population. Without a
mutation operator, there is no possibility for reintroducing the missing bit value. Also, if the
target function is nonstationary and the fitness landscape changes over time (which is
certainly the case in real biological systems), then there needs to be some source of continuing
genetic diversity. Mutation, therefore, acts as a background operator, occasionally changing
bit values and allowing alternative alleles (and hyperplane partitions) to be retested.
This particular interpretation of mutation ignores its potential as a hill-climbing
mechanism: from the strict hyperplane sampling point of view imposed by the schema theorem,
mutation is a necessary evil. But this is perhaps a limited point of view. Several
experimental researchers have pointed out that genetic search using mutation and no crossover
often produces a fairly robust search. And there is little or no theory that has addressed the
interactions of hyperplane sampling and hill-climbing in genetic search.
Another problem related to premature convergence is the need for scaling the population
fitness. As the average evaluation of the strings in the population increases, the variance
in fitness decreases in the population. There may be little difference between the best and
worst individual in the population after several generations, and the selective pressure based
on fitness is correspondingly reduced. This problem can partially be addressed by using
some form of fitness scaling (Grefenstette, 1986; Goldberg, 1989). In the simplest case, one
can subtract the evaluation of the worst string in the population from the evaluations of
all strings in the population. One can now compute the average string evaluation as well
as fitness values using this adjusted evaluation, which will increase the resulting selective
pressure. Alternatively, one can use some rank based form of selection.
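The subtract-the-worst scaling just described can be sketched as follows. This is my own minimal illustration; the uniform fallback for a fully converged population is an added assumption, not part of the text.

```python
def scale_fitness(evaluations):
    """Subtract the worst evaluation from every evaluation, restoring
    selective pressure when raw evaluations have become similar."""
    worst = min(evaluations)
    return [e - worst for e in evaluations]

def selection_probabilities(evaluations):
    """Fitness-proportional selection probabilities on the scaled values."""
    scaled = scale_fitness(evaluations)
    total = sum(scaled)
    if total == 0.0:  # every string identical: no pressure left, go uniform
        return [1.0 / len(scaled)] * len(scaled)
    return [s / total for s in scaled]
```

For a nearly converged population such as evaluations 10.0, 10.5, 11.0, raw proportional selection gives nearly uniform probabilities, while the scaled version strongly favors the best string.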

4.2 How Recombination Moves Through a Hypercube


The nice thing about 1-point crossover is that it is easy to model analytically. But it is
also easy to show analytically that if one is interested in minimizing schema disruption, then
2-point crossover is better. But operators that use many crossover points should be avoided
because of extreme disruption to schemata. This is again a point of view imposed by a strict
interpretation of the schema theorem. On the other hand, disruption may not be the only
factor affecting the performance of a genetic algorithm.

4.2.1 Uniform Crossover


The operator that has received the most attention in recent years is uniform crossover.
Uniform crossover was studied in some detail by Ackley (1987) and popularized by Syswerda
(1989). Uniform crossover works as follows: for each bit position 1 to L, randomly pick each
bit from either of the two parent strings. This means that each bit is inherited independently
from any other bit and that there is, in fact, no linkage between bits. It also means that
uniform crossover is unbiased with respect to defining length. In general the probability of
disruption is 1 - (1/2)^(o(H)-1), where o(H) is the order of the schema we are interested in.

(It doesn't matter which offspring inherits the first critical bit, but all other bits must be
inherited by that same offspring. This is also a worst case probability of disruption which
assumes no alleles found in the schema of interest are shared by the parents.) Thus, for
any order-3 schema the probability of uniform crossover separating the critical bits is
always 1 - (1/2)^2 = 0.75. Consider for a moment a string of 9 bits. The defining length
of a schema must equal 6 before the disruptive probability of 1-point crossover, given by
Δ(H)/(L-1) = 6/8, matches that associated with uniform crossover. We can define 84
different order-3 schemata over any particular string of 9 bits (i.e., 9 choose 3). Of these
schemata, only 19 of the 84 order-3 schemata have a disruption rate higher than 0.75 under
1-point crossover. Another 15 have exactly the same disruption rate, and 50 of the 84
order-3 schemata have a lower disruption rate. It is relatively easy to show that, while uniform
crossover is unbiased with respect to defining length, it is also generally more disruptive
than 1-point crossover. Spears and DeJong (1991) have shown that uniform crossover is in
every case more disruptive than 2-point crossover for order-3 schemata for all defining
lengths.
Despite these analytical results, several researchers have suggested that uniform crossover
is sometimes a better recombination operator. One can of course point to its lack of
representational bias with respect to schema disruption as a possible explanation, but this is
unlikely since it is uniformly worse than 2-point crossover. However, Spears and DeJong
(1991:314) speculated that, "With small populations, more disruptive crossover operators
such as uniform or n-point (n >> 2) may yield better results because they help overcome
the limited information capacity of smaller populations and the tendency for more
homogeneity." Eshelman (1991) has made similar arguments outlining the advantages of disruptive
operators.
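For reference, uniform crossover itself is only a few lines. This is a sketch of mine with an injectable random source; the disruption function simply restates the worst-case bound discussed above.

```python
import random

def uniform_crossover(p1, p2, rng=random.random):
    """Each position of the first child is drawn independently from
    either parent; the second child receives the complementary bits."""
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if rng() < 0.5:
            c1.append(a)
            c2.append(b)
        else:
            c1.append(b)
            c2.append(a)
    return ''.join(c1), ''.join(c2)

def uniform_disruption(order):
    """Worst-case probability that uniform crossover separates the
    defined bits of an order-o schema: 1 - (1/2)**(o - 1)."""
    return 1.0 - 0.5 ** (order - 1)
```

Note that for order-3 schemata `uniform_disruption(3)` gives the 0.75 figure used in the comparison with 1-point crossover.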
There is another sense in which uniform crossover is unbiased. Assume we wish to
recombine the bit strings 0000 and 1111. We can conveniently lay out the 4-dimensional
hypercube as shown in Figure 4. We can also view these strings as being connected by a set
of minimal paths through the hypercube. To define a minimal path through the hypercube,
pick one parent string as the origin and the other as the destination. Now change a single
bit in the binary representation corresponding to the point of origin. Any such move will
reach a point that is one move closer to the destination. In the following graph, which is
arranged with respect to 0000 and 1111, it is easy to see that changing a single bit is a move
up or down in this graph.
All of the points between 0000 and 1111 are reachable by some single application of
uniform crossover. However, 1-point crossover only generates strings that lie along two
complementary paths (in the figure, the leftmost and rightmost paths) through this 4-dimensional
hypercube. In general, uniform crossover will draw a complementary pair of sample points
with equal probability from all points that lie along any complementary minimal paths in
the hypercube between the two parents, while 1-point crossover samples points from only
two specific complementary minimal paths between the two parent strings. It is also easy to
see that 2-point crossover is less restrictive than 1-point crossover, since it can sample pairs
of points where the bits inherited from each parent appear as a contiguous substring (in
this case all strings except 0101 and 1010). But, of course, this coverage does not hold for
spaces of higher dimensions. The number of bits that are different between two strings is
just the Hamming distance, H. Not including the original parent strings, uniform crossover
can generate 2^H - 2 different strings, while 1-point crossover can generate 2(H - 1) different

1111

0111 1011 1101 1110

0011 0101 0110 1001 1010 1100

0001 0010 0100 1000

0000
Figure 4: This graph illustrates paths through 4-D space. A 1-point crossover of 1111 and
0000 can only generate offspring that reside along the dashed paths at the edges of this graph.
strings since there are H - 1 crossover points that produce offspring that are not copies of the
parents (see the discussion in the next section) and each crossover produces 2 offspring. The
2-point crossover operator can generate 2 * C(H,2) = H^2 - H different offspring, since there
are H choose 2 different crossover points that will result in offspring that are not copies of
the parents and each pair of crossover points generates 2 strings.
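These offspring counts are easy to verify for the 0000/1111 example. The check below is my own; the function names are illustrative.

```python
from math import comb

def hamming(a, b):
    """Number of bit positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def offspring_counts(a, b):
    """Distinct non-parent offspring reachable from two parents at
    Hamming distance H, per the counts in the text."""
    H = hamming(a, b)
    return {
        "uniform": 2 ** H - 2,        # any mix of the differing bits
        "one_point": 2 * (H - 1),     # H - 1 useful cut points, 2 children each
        "two_point": 2 * comb(H, 2),  # = H**2 - H
    }
```

For 0000 and 1111 (H = 4) this yields 14 strings under uniform crossover, 6 under 1-point, and 12 under 2-point, matching Figure 4 and the counts above.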

4.3 Reduced Surrogates


Consider crossing the following two strings and a "reduced" version of the same strings,
where the bits the strings share in common have been removed.
0001111011010011 ----11---1-----1
0001001010010010 ----00---0-----0

Both strings lie in the hyperplane 0001**101*01001*. The flip side of this observation
is that crossover is really restricted to a subcube defined over the bit positions that are
different. We can isolate this subcube by removing all of the bits that are equivalent in
the two parent structures. Booker (1987) refers to strings such as ----11---1-----1 and
----00---0-----0 as the "reduced surrogates" of the original parent chromosomes.
When viewed in this way, it is clear that recombination of these particular strings occurs in
a 4-dimensional subcube, more or less identical to the one examined in the previous example.
Uniform crossover is unbiased with respect to this subcube in the sense that uniform crossover
will still sample in an unbiased, uniform fashion from all of the pairs of points that lie
along complementary minimal paths in the subcube defined between the two original parent
strings. On the other hand, simple 1-point or 2-point crossover will not. To help illustrate
this idea, we recombine the original strings, but examine the offspring in their "reduced"
forms. For example, simple 1-point crossover will generate offspring ----11---1-----0
and ----00---0-----1 with a probability of 6/15, since there are 6 crossover points in the
original parent strings between the third and fourth bits of the reduced subcube and L-1
= 15. On the other hand, ----10---0-----0 and ----01---1-----1 are sampled with a
probability of only 1/15 since there is only a single crossover point in the original parent
structures that falls between the first and second bits that define the subcube.
One can remove this particular bias, however, by applying crossover to the reduced
surrogates. Crossover can then exploit the fact that there is really only 1 crossover point between
any significant bits that appear in the reduced surrogate forms. There is also another benefit.
If at least 1 crossover point falls between the first and last significant bits in the reduced
surrogates, the offspring are guaranteed not to be duplicates of the parents (assuming
the parents differ by at least two bits). Thus, new sample points in hyperspace are generated.
Note that recombining the original parent strings between any of the first four bits results
in offspring that are duplicates of the parents.
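A sketch of reduced-surrogate 1-point recombination, as my own illustration of the idea: identify the differing positions, cut only among them, and copy the shared bits through untouched.

```python
def reduced_surrogate_1pt(p1, p2, cut):
    """1-point crossover applied to the reduced surrogates: 'cut' is how
    many of the differing positions (left to right) the first child keeps
    from p1; the remaining differing bits come from p2.  Bits shared by
    the parents are copied through unchanged."""
    diff = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    c1, c2 = list(p1), list(p2)
    for i in diff[cut:]:  # swap the tail of the differing bits
        c1[i], c2[i] = p2[i], p1[i]
    return ''.join(c1), ''.join(c2)
```

Choosing `cut` between 1 and H-1 guarantees offspring that duplicate neither parent, which is the benefit described above.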
The debate on the merits of uniform crossover and operators such as 2-point reduced
surrogate crossover is not a closed issue. To fully understand the interaction between hyperplane
sampling, population size, premature convergence, crossover operators, genetic diversity and
the role of hill-climbing by mutation requires better analytical methods.

5 The Case for Binary Alphabets


The motivation behind the use of a minimal binary alphabet is based on relatively simple
counting arguments. A minimal alphabet maximizes the number of hyperplane partitions
directly available in the encoding for schema processing. These low order hyperplane partitions
are also sampled at a higher rate than would occur with an alphabet of higher cardinality.
Any set of order-1 schemata such as 1*** and 0*** cuts the search space in half. Clearly,
there are L pairs of order-1 schemata. For order-2 schemata, there are C(L,2) ways to pick
the locations of the 2 critical bits, and there are 2^2 possible ways to assign values to those
bits. This can be illustrated as follows:


Order 1 Schemata Order 2 Schemata
0*** *0** **0* ***0 00** 0*0* 0**0 *00* *0*0 **00
1*** *1** **1* ***1 01** 0*1* 0**1 *01* *0*1 **01
10** 1*0* 1**0 *10* *1*0 **10
11** 1*1* 1**1 *11* *1*1 **11

In general, if we wish to count how many schemata representing hyperplanes exist at
some order i, this value is given by 2^i * C(L,i), where C(L,i) counts the number of ways to
pick i positions that will have significant bit values in a string of length L, and 2^i is the
number of ways to assign values to those positions.
These counting arguments naturally lead to questions about the relationship between
population size and the number of hyperplanes that are sampled by a genetic algorithm.
One can take a very simple view of this question and ask how many schemata of order-1
are processed and how well they are represented in a population of size N. These numbers

are based on the assumption that we are interested in hyperplane representations associated
with the initial random population, since selection changes the distributions over time. In
a population of size N there should be N/2 samples of each of the 2L order-1 hyperplane
partitions. Therefore 50% of the population falls in any particular order-1 partition. Each
order-2 partition is sampled by 25% of the population. In general, each hyperplane of
order i is sampled by 1/2^i of the population.

5.1 The N^3 Argument


These counting arguments set the stage for the claim that a genetic algorithm processes on
the order of N^3 hyperplanes when the population size is N. The derivation used here is based
on the proof found in the appendix of Fitzpatrick and Grefenstette (1988).


First, pick a minimal level of representation. We wish to have at least φ samples before
claiming that we are statistically sampling hyperplanes of a given order. Let θ be the highest
order of hyperplane which is represented in a population of size N by at least φ copies; θ is
given by log2(N/φ).
Recall that the number of different hyperplane partitions of order-θ is given by 2^θ * C(L,θ),
which is just the number of different ways to pick θ different positions and to assign all
possible binary values to each subset of the θ positions. Thus, we now only need to show

    2^θ * C(L,θ) >= N^3,   which implies   2^θ * C(L,θ) >= (2^θ * φ)^3,

since θ = log2(N/φ), 2^θ = N/φ and N = 2^θ * φ. Fitzpatrick and Grefenstette now make the
following arguments. Assume L >= 64 and 2^6 <= N <= 2^20. Pick φ = 8, which implies
3 <= θ <= 17. By inspection the number of schemata processed is greater than N^3.

Notice that this counts only those schemata that are exactly of order-θ. While it is true
that the number of schemata of order-θ is the dominating factor when counting the total
number of schemata processed, the sum of all schemata from order-1 to order-θ that are
processed is given by the sum over x = 1 to θ of 2^x * C(L,x).
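The counting argument can be checked directly. This quick sketch is mine; the function names are illustrative.

```python
from math import comb, log2

def schemata_at_order(L, i):
    """Number of hyperplanes of order i over strings of length L:
    2**i * C(L, i)."""
    return 2 ** i * comb(L, i)

def highest_sampled_order(N, phi):
    """theta = log2(N / phi): the highest order represented by at least
    phi copies in a random population of size N."""
    return int(log2(N / phi))
```

With L = 64 and φ = 8, both ends of the claimed range check out: for N = 2^6 we get θ = 3 and 2^3 * C(64,3) = 333312 > N^3 = 262144, and the inequality also holds for N = 2^20 with θ = 17.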

5.2 The Case for Nonbinary Alphabets


There are two basic arguments against using higher cardinality alphabets. First, there
will be fewer explicit hyperplane partitions. Second, the alphabetic characters (and the
corresponding hyperplane partitions) associated with a higher cardinality alphabet will not
be as well represented in a finite population. This either forces the use of larger population
sizes or diminishes the effectiveness of statistical sampling.
The arguments for using binary alphabets assume that the schemata representing
hyperplanes must be explicitly and directly manipulated by recombination. Antonisse (1989) has
argued that this need not be the case and that higher cardinality alphabets offer as much
richness in terms of hyperplane samples as lower cardinality alphabets. For example, using an
alphabet of the four characters A, B, C, D one can define all the same hyperplane partitions
available in a binary alphabet by defining partitions such as (A and B), (C and D), etc. In
general, Antonisse argues that one can look at all subsets of the power set of value assignments
to the "don't care" positions in a schema as defining hyperplanes. Viewed in this way, higher
cardinality alphabets yield more hyperplane partitions than binary alphabets. Antonisse's
arguments fail to show, however, that the hyperplanes that correspond to the subsets defined
in this scheme are actually processed in a meaningful and useful way by a genetic algorithm.
There are other arguments for nonbinary encodings. Davis (1991) argues that the
disadvantages of nonbinary encodings can typically be offset by the larger range of operators
that can be applied to problems, and that more natural or problem dependent aspects of
the coding can be exploited. Schaffer and Eshelman (1992) as well as Wright (1991) have
made interesting arguments for real-valued encodings. Goldberg (1991) has suggested that
virtual minimal alphabets that facilitate hyperplane sampling can emerge from higher
cardinality alphabets.

6 An Executable Model of the Genetic Algorithm


Consider the complete version of the schema theorem before dropping the gains term and
simplifying the calculation of losses:

    P(Z, t+1) = P(Z, t) f(Z)/f̄ (1 - pc * losses) + pc * gains.

This is also part of an idealized model of a simple genetic algorithm (Goldberg, 1987).
In the current formulation, Z might refer to either a string or a schema representing a
hyperplane. Since modeling strings models the highest order schemata, the model implicitly
includes all lower order schemata. In the remainder of this paper, the equations are developed
only for strings. Given a specification of Z, one can calculate string losses and gains. Losses
occur when a string crosses with another string and the resulting offspring fails to preserve
the original string. Gains occur when two different strings cross and independently create
a new copy of some string. For example, if Z = 000 then recombining 100 and 001 will
always produce a new copy of 000. Assuming 1-point crossover is used as an operator, the
probabilities of "losses" and "gains" for the string Z = 000 are calculated as follows:
losses = (111) (111 ) +
f (101) (101 ) f
PI 0
 f
I0
P
 ;t P
f
P ;t

+ f (110) (110 ) + (011 ) (011 ) f ;t


PI 1
 f
P I2  ;t P
f
P ;t :

gains = f (001) (001 ) (100) (100 ) +f f (010) (010 ) (100) (100 ) f


PI 0
f P
 ;t
f
P ;t PI 1
 f
P
 ;t
f
P ;t

+ f (011) (011 ) (100) (100 ) +


f (001) (001 ) (110) (110 ) f f
PI 1
f P
 ;t I2 
f
P
 ;t P
f
P ;t
f
P ;t

+ I 2 (001) P
(010)
 (001 )  (010 )
f

f
P ;t
f

f
P ;t :

When Z is a string, $P_{I0} = 1$ and represents the probability of crossover somewhere on the
string. For Z = 000 there are two other order-2 schemata that are of interest. The first
order-2 schema of interest, $h_1$, spans from the first to the second bit and the second order-2
schema of interest, $h_2$, spans from the second to the third bit. These are of interest because
the defining lengths of these order-2 schemata provide probability information about the
disruption of the order-3 schema. The probability that one-point crossover will fall between
the first and second bit is merely $\Delta(h_1)/(L-1)$, which will be referred to as $P_{I1}$. Likewise,
$P_{I2}$ will denote the probability that one-point crossover will fall between the second and
third bit, which is given by $\Delta(h_2)/(L-1)$.

The equations can be generalized to cover the remaining 7 strings in the space. This translation
is accomplished using bitwise addition modulo 2 (i.e., a bitwise exclusive-or denoted
by ⊕; see Figure 4 and Section 6.4). The function $(S_i \oplus Z)$ is applied to each bit string, $S_i$,
contained in the equations presented in this section to produce the appropriate corresponding
strings for generating an expression for computing all terms of the form P(Z, t+1).
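As a check on these equations, the losses and gains for Z = 000 can be evaluated directly. The following sketch (the function name and toy inputs are my own) computes P(000, t+1) from a table of proportions and fitnesses. With uniform fitness and uniform proportions the result stays at 1/8 for any crossover rate, as expected for a population in equilibrium.

```python
def next_prop_000(P, f, pc, L=3):
    """P: dict string -> proportion at time t; f: dict string -> fitness;
    pc: crossover probability.  Returns P(000, t+1) per the schema-theorem
    equation with exact losses and gains under 1-point crossover."""
    fbar = sum(f[s] * P[s] for s in P)            # average population fitness

    def sel(s):                                   # f(s)/fbar * P(s, t)
        return f[s] / fbar * P[s]

    # crossover-point probabilities for the two order-2 schemata of 000
    PI0, PI1, PI2 = 1.0, 1.0 / (L - 1), 1.0 / (L - 1)
    losses = (PI0 * sel('111') + PI0 * sel('101')
              + PI1 * sel('110') + PI2 * sel('011'))
    gains = (PI0 * sel('001') * sel('100')
             + PI1 * sel('010') * sel('100') + PI1 * sel('011') * sel('100')
             + PI2 * sel('001') * sel('110') + PI2 * sel('001') * sel('010'))
    return sel('000') * (1 - pc * losses) + pc * gains
```

With all fitnesses equal and all eight strings at proportion 1/8, losses (3/8) and gains (3/64) cancel exactly, so the proportion of 000 is unchanged.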

6.1 A Generalized Form Based on Equation Generators


The 3 bit equations were generated by hand; however, a general pattern exists which allows
the automatic generation of equations for arbitrary problems (Whitley, Das and Crabb,
1992). The number of terms in the equations is greater than the number of strings in the
search space. Therefore it is only practical to develop exact equations for problems with
approximately 15 bits in the encoding.
The development of a general form for these equations is illustrated by generating the
loss and gain terms in a systematic fashion. Note that the equations need only be defined
once for one string in the space; the standard form of the equation is always defined for
the string composed of all zero bits. Let S represent the set of binary strings of length L,
indexed by i. In general, the string composed of all zero bits is denoted $S_0$.

6.2 Generating String Losses for 1-point crossover


Consider two strings 00000000000 and 00010000100. Using 1-point crossover, if the crossover
occurs before the first "1" bit or after the last "1" bit, no disruption will occur. Any crossover
between the 1 bits, however, will produce disruption: neither parent will survive crossover.
Also note that recombining 00000000000 with any string of the form 0001####100 will
produce the same pattern of disruption. We will refer to this string as a generator: it is like
a schema, but # is used here instead of * to better distinguish between a generator and the
corresponding hyperplane. Bridges and Goldberg (1987) formalize the notion of a generator
in the following way. Consider strings B and B' where the first x bits are equal, the middle
$(\delta + 1)$ bits have the pattern $b\#\#\ldots\#b$ for B and $\bar b\#\#\ldots\#\bar b$ for B'. Given that the strings
are of length L, the last $(L - \delta - x - 1)$ bits are equivalent. The $\bar b$ bits are referred to as
"sentry bits" and they are used to define the probability of disruption. In standard form,
B = $S_0$ and the sentry bits must be 1. The following directed acyclic graph illustrates all
generators for "string losses" for the standard form of a 5 bit equation for $S_0$.
1###1
/ \
/ \
01##1 1##10
/ \ / \
/ \ / \
001#1 01#10 1#100
/ \ / \ / \
/ \ / \ / \
00011 00110 01100 11000

The graph structure allows one to visualize the set of all generators for string losses. In
general, the root of this graph is defined by a string with a sentry bit in the first and last bit
positions, and the generator token "#" in all other intermediate positions. A move down
and to the left in the graph causes the leftmost sentry bit to be shifted right; a move down
and to the right causes the rightmost sentry bit to be shifted left. All bits outside the sentry
positions are "0" bits. Summing over the graph, one can see that there are
$\sum_{j=1}^{L-1} j\,2^{L-j-1}$ or $(2^L - L - 1)$ strings generated as potential sources of string loss.


For each string $S_i$ produced by one of the "middle" generators in the above graph structure,
a term of the following form is added to the losses equations:

$$\frac{\Delta(S_i)}{L-1}\;\frac{f(S_i)}{\bar f}\,P(S_i, t)$$

where $\Delta(S_i)$ is a function that calculates the number of crossover points between sentry bits
in string $S_i$, thus defining the critical crossover region for this particular string.
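The loss generators can be expanded mechanically. The sketch below (my own brute-force expansion, rather than a walk of the graph) enumerates every string whose first and last "1" bits act as sentries, recording Δ, the number of crossover points between them; for L = 5 it recovers the 26 strings counted above.

```python
import itertools

def loss_generators(L):
    """Expand the string-loss generators for the all-zeros string S0:
    every binary string with at least two '1' bits (the sentry bits),
    mapped to Delta = number of crossover points between the sentries."""
    out = {}
    for bits in itertools.product('01', repeat=L):
        s = ''.join(bits)
        if s.count('1') >= 2:                    # needs two sentry bits
            first, last = s.index('1'), s.rindex('1')
            out[s] = last - first                # crossover points between sentries
    return out

losses = loss_generators(5)
print(len(losses))                               # 2**5 - 5 - 1 = 26
```

Strings with zero or one "1" bit are excluded because at least one offspring always preserves $S_0$, matching the $(2^L - L - 1)$ count.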

6.3 Generating String Gains for 1-point crossover


Bridges and Goldberg (1987) note that string gains for a string B are produced from two
strings Q and R which have the following relationship to B:

    Region ->            beginning      middle         end
    Length ->            alpha          L - alpha - omega   omega
    Q Characteristics    ##...#b        =              =
    R Characteristics    =              =              b#...#

The "=" symbol denotes regions where the bits in Q and R match those in B; again B =
$S_0$ for the standard form of the equations. Sentry bits are located such that 1-point crossover
between sentry bits produces a new copy of B, while crossover of Q and R outside the sentry
bits will not produce a new copy of B.
Bridges and Goldberg define a beginning function $A[B, \alpha]$ and an ending function $\Omega[B, \omega]$,
assuming $L - \omega > \alpha - 1$, where for the standard form of the equations:
$$A[S_0, \alpha] = \#\#\ldots\#\#\,1_{\alpha-1}\,0_{\alpha}\ldots 0_{L-1}
\quad\text{and}\quad
\Omega[S_0, \omega] = 0_0\ldots 0_{L-\omega-1}\,1_{L-\omega}\,\#\#\ldots\#\#.$$

These generators can again be presented as a directed acyclic graph structure composed
of paired templates which will be referred to as the upper A-generator and lower Ω-generator.
The following are the generators in a 5 bit problem.
10000
00001
/ \
/ \
#1000 10000
00001 0001#
/ \ / \
/ \ / \
##100 #1000 10000
00001 0001# 001##
/ \ / \ / \
/ \ / \ / \
###10 ##100 #1000 10000
00001 0001# 001## 01###

In this case, the root of the directed acyclic graph is defined by starting with the most
specific generator pair. The A-generator of the root has a "1" bit as the sentry bit in the
first position, and all other bits are "0." The Ω-generator of the root has a "1" bit as the
sentry bit in the last position, and all other bits are "0." A move down and left in the graph
is produced by shifting the left sentry bit of the current upper A-generator to the right.
A move down and right is produced by shifting the right sentry bit of the current lower
Ω-generator to the left. Each vacant bit position outside of the sentry bits which results
from a shift operation is filled using the # symbol.
This graph structure allows one to observe that the number of generator pairs at each
level of the tree is equal to the depth of the level at which the pair is located (where the
root is level 1). Also, for any level k, the number of string pairs generated at that level is
$2^{k-1}$ for each pair of generators. Therefore, the total number of string pairs that must be
included in the equations to calculate string gains for $S_0$ of length L is
$\sum_{k=1}^{L-1} k\,2^{k-1}$.

Let $S_{\alpha+x}$ and $S_{\omega+y}$ be two strings produced by a generator pair, such that $S_{\alpha+x}$ was
produced by the A-generator and has a sentry bit at location $\alpha - 1$, and $S_{\omega+y}$ was produced
by the Ω-generator with a sentry bit at $L-\omega$. (The x and y terms are simply correction factors
added to $\alpha$ and $\omega$ in order to uniquely index a string in S.) Let the critical crossover region
associated with $S_{\alpha+x}$ and $S_{\omega+y}$ be computed by a corresponding function
$\Delta(S_{\alpha+x}, S_{\omega+y}) = L - \omega - \alpha$. For each string pair $S_{\alpha+x}$ and $S_{\omega+y}$ a term of the following form is added to the
gains equations:

$$\frac{\Delta(S_{\alpha+x}, S_{\omega+y}) + 1}{L-1}\;
\frac{f(S_{\alpha+x})}{\bar f}\,P(S_{\alpha+x}, t)\;
\frac{f(S_{\omega+y})}{\bar f}\,P(S_{\omega+y}, t)$$

where $(\Delta(S_{\alpha+x}, S_{\omega+y}) + 1)/(L-1)$ is the probability that 1-point crossover will fall within
the critical region defined by the sentry bits located at $\alpha - 1$ and $L - \omega$.
The generators are used as part of a two stage computation where the generators are
first used to create an exact equation in standard form. This allows the resulting equations
to be implemented in an executable form with a simple transformation function for all other
strings in the space.

6.4 The Vose and Liepins Models


The executable equations developed by Whitley (1993a) represent a special case of the model
of a simple genetic algorithm introduced by Vose and Liepins (1991). In the Vose and Liepins
model, the vector $s^t \in \Re$ represents the t th generation of the genetic algorithm and the i
th component of $s^t$ is the probability that the string i is selected for the gene pool. Using
i to refer to a string in s can sometimes be confusing. The symbol S has already been
used to denote the set of binary strings, also indexed by i. This notation will be used
where appropriate to avoid confusion. Note that $s^t$ corresponds to the expected distribution
of strings in the intermediate population in the generational reproduction process (after
selection has occurred, but before recombination).
In the Vose and Liepins formulation,

$$s_i^t \propto P(S_i, t)\,f(S_i)$$

where $\propto$ is the equivalence relation such that $x \propto y$ if and only if $\exists\,\lambda > 0 \mid x = \lambda y$. In this
formulation, the term $1/\bar f$, which would represent the average population fitness normally
associated with fitness proportional reproduction, can be absorbed by the $\lambda$ term.
The expected number of strings is also represented as a vector $p^t \in \Re$ where the k th
component of the vector is equal to the proportional representation of k at generation t
before selection occurs. This would be the same as $P(S_k, t)$ in the notation more commonly
associated with the schema theorem. Finally, let $r_{i,j}(k)$ be the probability that string k
results from the recombination of strings i and j. Now, using E to denote expectation,

$$E\,p_k^{t+1} = \sum_{i,j} s_i^t s_j^t\, r_{i,j}(k).$$

To further generalize this model, the function $r_{i,j}(k)$ is used to construct a mixing matrix
M where the i, j th entry $m_{i,j} = r_{i,j}(0)$. Note that this matrix gives the probabilities that
crossing strings i and j will produce the string $S_0$. For current purposes, assume no mutation
is used and 1-point crossover is used as the recombination operator. Then matrix M is
obviously symmetric and is zero everywhere on the diagonal except for entry $m_{0,0}$, which is
1.0. Note that M is expressed entirely in terms of string gain information. Therefore, the
first row/column of the matrix is the inverse of that portion of the string losses probabilities
that corresponds to $\Delta(S_i)$, where each string in the set S is crossed with $S_0$. For completeness,
the $\Delta(S_i)$ for strings not produced by the string generators is 0.0 and, thus, the probability
of obtaining $S_0$ during reproduction is 1.0. The remainder of the matrix can be calculated
using the string gain generators and $(\Delta(S_{\alpha+x}, S_{\omega+y}) + 1)/(L-1)$. For each pair of strings produced by the

A Transform Function to Redefine Equations

    000 ⊕ 010 ⇒ 010        100 ⊕ 010 ⇒ 110
    001 ⊕ 010 ⇒ 011        101 ⊕ 010 ⇒ 111
    010 ⊕ 010 ⇒ 000        110 ⊕ 010 ⇒ 100
    011 ⊕ 010 ⇒ 001        111 ⊕ 010 ⇒ 101

Figure 5: The operator ⊕ is bit-wise exclusive-or. Let $r_{i,j}(k)$ be the probability that k results
from the recombination of strings i and j. If recombination is a combination of crossover and
mutation then $r_{i,j}(k \oplus 0) = r_{i\oplus k, j\oplus k}(0)$. The strings shown here are reordered with respect to
010.
string gains generators determine their index and enter the value returned by the function
into the corresponding location in M. For completeness, $\Delta(S_j, S_k) = -1$ for all arbitrary
pairs of strings not generated by the string gains generators, which implies entry $m_{j,k} = 0$.
Once defined, M does not change since it is not affected by variations in fitness or proportional
representation in the population. Thus, given the assumption of no mutations,
that the $\lambda$ term is updated each generation to correct for changes in the population average, and that
1-point crossover is used, then the standard form of the executable equations corresponds to
the following portion of the Liepins and Vose model:

$$s^T M s$$

where T denotes transpose.
While the fitness terms are defined explicitly in the executable equations, so far it has
been assumed that s includes fitness information without explicitly indicating how this is
calculated. Fitness information is later added to the Vose and Liepins model. For now,
however, this expression is first generalized to cover all strings in the search space. Vose and
Liepins formalize the notion that bitwise exclusive-or can be used to remap all the strings in
the search space, in this case represented by the vector s. They show that if recombination
is a combination of crossover and mutation then

$$r_{i,j}(k \oplus q) = r_{i\oplus k, j\oplus k}(q) \quad\text{and specifically}\quad
r_{i,j}(k) = r_{i,j}(k \oplus 0) = r_{i\oplus k, j\oplus k}(0).$$

This allows one to reorder the elements in s with respect to any particular point in the
space. This reordering is equivalent to remapping the variables in the executable equations
(see Figure 4). A permutation function, $\sigma$, is defined as follows:

$$\sigma_j \langle s_0, \ldots, s_{V-1} \rangle^T = \langle s_{j\oplus 0}, \ldots, s_{j\oplus(V-1)} \rangle^T$$

where the vectors are treated as columns and V is the size of the search space. A general
operator $\mathcal{M}$ can now be defined over s which remaps $s^T M s$ to cover all strings in the search
space:

$$\mathcal{M}(s) = \langle\, (\sigma_0 s)^T M\, \sigma_0 s,\; \ldots,\; (\sigma_{V-1} s)^T M\, \sigma_{V-1} s \,\rangle^T$$

Recall that s carries fitness information such that it corresponds to the intermediate
phase of the population (after selection, but before recombination) as the genetic algorithm
goes from generation t to t + 1. Thus, to complete the cycle and reach a point at which the
Vose and Liepins models can be executed in an iterative fashion, fitness information is now
explicitly introduced to transform the population at the beginning of iteration t + 1 to the
next intermediate population. A fitness matrix F is defined such that fitness information is
stored along the diagonal; the i, i th element is given by f(i) where f is the fitness function.
The transformation from the vector $p^{t+1}$ to the next intermediate population represented
by $s^{t+1}$ is given as follows:

$$s^{t+1} \propto F\,\mathcal{M}(s^t).$$

Vose and Liepins give equations for calculating the mixing matrix M which not only
include probabilities for 1-point crossover, but also mutation. More complex extensions of
the Vose and Liepins model include finite population models using Markov models (Nix and
Vose, 1992). Vose (1993) surveys the current state of this research.
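The pieces above can be assembled into a small executable model. In the sketch below (my own construction, not the authors' code), $r_{i,j}(0)$ is computed by enumerating 1-point crossover events and averaging over the two offspring of each cut, a convention I adopt so that the ordered double sum reproduces the gains terms; $\sigma_k$ is the XOR reordering, and one generation applies fitness to $\mathcal{M}(s)$ and renormalizes in place of the ∝ relation.

```python
L = 3
N = 2 ** L

def r0(i, j):
    """r_{i,j}(0) under 1-point crossover, no mutation: probability that a
    uniform cut point and a uniform choice of offspring yield string 0."""
    total = 0.0
    for cut in range(1, L):
        mask = (1 << (L - cut)) - 1              # suffix bits taken from the mate
        kids = [(i & ~mask) | (j & mask), (j & ~mask) | (i & mask)]
        total += kids.count(0) / 2.0
    return total / (L - 1)

# the mixing matrix: symmetric, zero on the diagonal except m_{0,0} = 1
M = [[r0(i, j) for j in range(N)] for i in range(N)]

def generation(s, f):
    """One cycle: p_k = (sigma_k s)^T M (sigma_k s), with (sigma_k s)_i =
    s_{k XOR i}; then s^{t+1} proportional to F p, renormalized."""
    p = [sum(s[k ^ i] * M[i][j] * s[k ^ j]
             for i in range(N) for j in range(N)) for k in range(N)]
    weighted = [f[k] * p[k] for k in range(N)]
    tot = sum(weighted)
    return [w / tot for w in weighted]
```

With uniform fitness, the uniform vector is a fixed point of `generation`, since crossover alone cannot move a population already in linkage equilibrium.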

7 Other Models of Evolutionary Computation


Up to this point, the discussion has focused on the canonical genetic algorithm introduced
by Holland (1975) and on related forms, such as the Simple Genetic Algorithm described by
Goldberg (1989). But there are also several other population based algorithms that are either
spin-offs of Holland's genetic algorithm, or which were developed independently. Evolution
Strategies and Evolutionary Programming refer to two computational paradigms that use a
population based search and largely rely on mutation rather than crossover.
Evolutionary Programming is based on the early work of Fogel, Owens and Walsh (1966),
Artificial Intelligence Through Simulated Evolution. The individuals, or "organisms," in this
study were finite state machines. Organisms that best solved some target function obtained
the opportunity to reproduce. Parents were mutated to create offspring.
There has been renewed interest in Evolutionary Programming as reflected by the 1992
First Annual Conference on Evolutionary Programming (Fogel and Atmar 1992). Fogel and
Atmar (1990) also compare genetic operators against Gaussian mutation and report that
a population based search using only mutation found the best solution slightly more often
than a search using crossover (6 out of 10 experiments), thus raising the issue of the relative
merits of mutation versus crossover.
Evolution Strategies are based on the work of Rechenberg (1973) and Schwefel (1975;
1981) and are discussed in a survey by Back, Hoffmeister and Schwefel (1991). Two examples
of Evolution Strategies (ES) are the (μ+λ)-ES and (μ,λ)-ES. In a (μ+λ)-ES, μ parents produce
λ offspring; the population is then reduced again to μ parents by selecting the best solutions
from among both the parents and offspring. Thus, parents survive until they are replaced by
better solutions. The (μ,λ)-ES is closer to the generational model used in canonical genetic
algorithms; offspring replace parents and then undergo selection. Recombination operators

have been defined for Evolution Strategies, but they tend to look different from Holland-style
crossover, allowing operations such as averaging parameters, for example, to create an
offspring. There is also generally a much greater emphasis on mutation.
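The two selection schemes can be stated compactly. The following sketch (an illustration of the reduction step, not reference ES code) keeps the μ best from either the combined pool or the offspring alone:

```python
def es_select(parents, offspring, mu, fitness, plus=True):
    """Selection step of a (mu+lambda)-ES (plus=True) or a (mu,lambda)-ES
    (plus=False), maximising fitness.  In the 'plus' strategy parents
    compete with offspring and can survive indefinitely; in the 'comma'
    strategy only offspring are eligible."""
    pool = parents + offspring if plus else offspring
    return sorted(pool, key=fitness, reverse=True)[:mu]
```

Note that the comma strategy can discard a parent better than every offspring, which is exactly what makes it closer to the generational model.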

7.1 Genitor
Genitor (Whitley 1988; 1989) was the first of what Syswerda (1989) has termed "steady
state" genetic algorithms. The name "steady state" is somewhat of a misnomer, since
Syswerda (1991) has shown that such genetic algorithms show more variance than canonical
genetic algorithms in terms of hyperplane sampling behavior. Such algorithms are therefore
more susceptible to sampling error and genetic drift. These algorithms would not closely
conform to the kind of ideal sampling behavior seen in the executable models of canonical
genetic algorithms. The advantage, on the other hand, is that the best points found in the
search are maintained in the population. This often results in a more aggressive form of
search that in practice is often quite effective.
On the surface, there are three main differences between Genitor-style algorithms and
canonical genetic algorithms. First, reproduction occurs one individual at a time. Two
parents are selected for reproduction and produce an offspring that is immediately placed
back into the population. The second major difference is in how that offspring is placed
back in the population. In Genitor, the worst individual in the population is replaced. The
important thing is that offspring do not replace parents, but rather the least fit (or some
relatively less fit) member of the population.
The third major difference between Genitor and other forms of genetic algorithms is
that fitness is assigned according to rank rather than by fitness proportionate reproduction.
Ranking helps to maintain a more constant selective pressure over the course of search.
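The three differences above can be sketched as a single reproduction step. The linear-ranking index formula below follows Whitley (1989); keeping the population sorted best-first, and the toy arguments, are my own assumptions.

```python
import random

def genitor_step(pop, fitness, crossover, bias=2.0):
    """One Genitor reproduction: pop is assumed sorted best-first.  Two
    parents are chosen by linear ranking (bias > 1 is the selective
    pressure), their offspring replaces the worst member, and the
    population is re-sorted."""
    n = len(pop)

    def rank_select():
        # Whitley's (1989) linear-ranking index: 0 is the best rank
        u = random.random()
        i = int(n * (bias - (bias * bias - 4 * (bias - 1) * u) ** 0.5)
                / (2 * (bias - 1)))
        return pop[min(i, n - 1)]

    child = crossover(rank_select(), rank_select())
    pop[-1] = child                        # replace the worst, not a parent
    pop.sort(key=fitness, reverse=True)    # keep sorted best-first
    return pop
```

Because the worst member is overwritten, the best strings accumulate monotonically, which is the (μ+λ)-like behavior discussed below.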
As pointed out in a recent paper by DeJong and Sarma (1993), Holland examined both
generational reproductive plans, such as the canonical genetic algorithm, and plans where
a single individual was produced and inserted into the population; replacement was done
by deleting one individual using a random uniform distribution. Holland also showed that
generational reproduction and one-at-a-time selection and reproduction were theoretically
equivalent, but that one-at-a-time selection was more subject to genetic drift.
One real difference between algorithms such as Genitor and the canonical genetic algorithm
lies in the replacement scheme. Goldberg and Deb (1991) have shown that by replacing
the worst member of the population Genitor generates much higher selective pressure than
the canonical genetic algorithm. But higher selective pressure is not the only difference between
Genitor and the canonical genetic algorithm. To borrow from the terminology used by
the Evolution Strategy community (as suggested by Larry Eshelman), Genitor is a (μ+λ)
strategy while the canonical genetic algorithm is a (μ,λ) strategy. This implies that the
accumulation of improved strings in the population is monotonic in Genitor.

7.2 CHC
Another genetic algorithm that monotonically collects the best strings found so far is the
CHC algorithm developed by Larry Eshelman (1991). CHC stands for Cross-generational
elitist selection, Heterogeneous recombination (by incest prevention) and Cataclysmic
mutation, which is used to restart the search when the population starts to converge.
CHC explicitly borrows from the (μ+λ) strategy of Evolution Strategies. After recombination,
the N best unique individuals are drawn from the parent population and offspring
population to create the next generation. This also implies that duplicates are removed from
the population. As Goldberg has shown with respect to Genitor, this kind of "survival of the
fittest" replacement method already imposes considerable selective pressure, so that there
is no real need to use another selection mechanism. Thus CHC uses random selection, except
that restrictions are imposed on which strings are allowed to mate. Strings with binary encodings
must be a certain Hamming distance away from one another before they are allowed
to reproduce. This form of "incest prevention" is designed to promote diversity. Eshelman
also uses a form of uniform crossover called HUX where exactly half of the differing bits
are swapped during crossover. CHC is typically run using small population sizes (e.g. 50);
thus using uniform crossover in this context is consistent with DeJong and Spears' (1991)
conjecture that uniform crossover can provide better sampling coverage in the context of
small populations.
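HUX and incest prevention are easy to sketch. In the following (my own illustration), swapping ⌊d/2⌋ bits when the parents differ in an odd number d of positions is an assumed convention, as is the exact form of the mating threshold.

```python
import random

def hamming(a, b):
    """Number of bit positions at which strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def hux(a, b):
    """HUX crossover: swap exactly half of the differing bits (rounded
    down here -- an assumption), chosen uniformly at random."""
    a, b = list(a), list(b)
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    for i in random.sample(diff, len(diff) // 2):
        a[i], b[i] = b[i], a[i]
    return ''.join(a), ''.join(b)

def may_mate(a, b, threshold):
    """Incest prevention: parents must be more than `threshold` bits apart."""
    return hamming(a, b) > threshold
```

Because exactly half of the differing bits are exchanged, both offspring land at maximal Hamming distance from both parents, which is what makes HUX a highly disruptive operator.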
The rationale behind CHC is to have a very aggressive search (by using monotonic selection
through survival of the best strings) and to offset the aggressiveness of the search by
using highly disruptive operators such as uniform crossover. With such small population sizes,
however, the population converges to the point that it begins to more or
less reproduce many of the same strings. At this point the CHC algorithm uses cataclysmic
mutation; this is the only time mutation is used in the CHC algorithm. All strings undergo
heavy mutation, except that the best string is preserved intact. After mutation, the search
is restarted using only crossover. Eshelman and Schaffer have reported quite good results
using CHC on a wide variety of test problems (Eshelman 1991; Eshelman and Schaffer 1991).

7.3 Hybrid Algorithms


L. "Dave" Davis states in the Handbook of Genetic Algorithms, "Traditional genetic algorithms,
although robust, are generally not the most successful optimization algorithm on
any particular domain" (1991:59). Davis argues that hybridizing genetic algorithms with
the most successful optimization methods for particular problems gives one the best of both
worlds: correctly implemented, these algorithms should do no worse than the (usually more
traditional) method with which the hybridizing is done. And in many cases the hybridization
can result in a superior method, since the genetic algorithm introduces the advantages of a
broader, population based search. Of course, it also introduces the additional computational
overhead of a population based search.
In general, Davis' approach to genetic algorithms is to customize the genetic algorithm
to fit the problem rather than forcing the problem to fit into the constraints imposed by
the canonical genetic algorithm. Thus, Davis often uses real valued encodings instead of
binary encodings, and employs "recombination operators" that may be domain specific.
Other researchers, such as Michalewicz (1992), also use nonbinary encodings and specialized
operators in combination with genetic based models of search.
Muhlenbein takes a similar opportunistic view of hybridization. In a description of a
parallel genetic algorithm, Muhlenbein (1991) states that, after the initial population is created,
"Each individual does local hill-climbing." Furthermore, after each offspring is created, "The
offspring does local hill-climbing."
Researchers interested in theory and those interested in practical results tend to have very
different views toward problem encodings, but the experimental researchers and theoreticians
are particularly divided on the issue of hybridization.
By adding hill-climbing or hybridizing with some other optimization method, learning
is being added to the evolution process. Coding the learned information back onto the chromosome
means that the search utilizes a form of Lamarckian evolution. The chromosomes
improved by local hill-climbing or other methods are placed in the genetic population and
allowed to compete for reproductive opportunities.
The main criticism is that if we wish to preserve the schema processing capabilities of
the genetic algorithm, then Lamarckian learning should not be used. Changing information
in the offspring inherited from the parents results in a loss of inherited schemata. This
alters the statistical information about hyperplane partitions that is implicitly contained in
the population. Therefore using local optimization to improve each offspring undermines
the genetic algorithm's ability to search via hyperplane sampling. Hybrid algorithms that
incorporate local optimization should result in greater reliance on hill-climbing and less
emphasis on hyperplane sampling. It follows that the resulting search should be less global,
since it is hyperplane sampling that gives the genetic algorithm its global search ability.
Despite the theoretical objections, hybrid genetic algorithms typically do well at optimization
tasks. There may be several reasons for this. First, the hybrid genetic algorithm
is hill-climbing from multiple points in the search space. Unless the objective function is
severely multimodal, it may be likely that some strings (offspring) will be in the basin of
attraction of the global solution, in which case hill-climbing is a fast and effective form of
search. Second, a hybrid strategy impairs hyperplane sampling, but does not disrupt it entirely.
For example, using learning or local optimization to improve the initial population
of strings only biases the initial hyperplane samples, but does not interfere with subsequent
hyperplane sampling. Third, in general hill-climbing may find a small number of significant
improvements, but may not dramatically change the offspring. In this case, the effects on
schemata and hyperplane sampling may be minimal.
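A minimal sketch of the Lamarckian write-back described above, with a toy next-ascent bit-flip climber standing in for the problem-specific optimizer (both helpers are my own illustrations):

```python
def bit_hill_climb(s, fitness):
    """Toy next-ascent hill-climber on bit strings: repeatedly take the
    first single-bit flip that improves fitness, until none does."""
    improved = True
    while improved:
        improved = False
        for i in range(len(s)):
            t = s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]
            if fitness(t) > fitness(s):
                s, improved = t, True
                break
    return s

def lamarckian_improve(population, hill_climb):
    """Hybrid step: locally improve each offspring and write the improved
    string back onto the chromosome (Lamarckian learning)."""
    return [hill_climb(ind) for ind in population]
```

The write-back in `lamarckian_improve` is precisely the step that overwrites inherited schemata, which is the source of the theoretical objection discussed above.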

8 Hill-climbers or Hyperplane Samplers?


In a recent paper entitled "How genetic algorithms really work: I. Mutation and Hillclimbing,"
Muhlenbein shows that an Evolution Strategy algorithm using only mutation works
quite well on a relatively simple test suite. Muhlenbein implies that for many problems
"many nonstandard genetic algorithms work well and the standard genetic algorithm performs
poorly" (1992:24).
Whitley et al. (1991) analyzed the behavior of a Genitor-style genetic algorithm to investigate
its hill-climbing behavior. The test problem was a small neural network optimization
problem. Each parameter (i.e., weight) was represented by a single real-valued number instead
of a bit encoding. Relatively high mutation rates were also used. The experiments showed
that as the population size was decreased down to a population of 1 individual, search
became faster but the probability of converging to a satisfactory solution decreased.
Convergence was 90% for a population of 50 strings, 80% for a population of 5 strings and
72% for a population of 1 string that used only mutation. But the population of 1 string
converged (when it converged) an order of magnitude faster than the population of 50 strings.
This raised a very interesting issue. When is a genetic algorithm a hyperplane sampler and
when is it a hill-climber? This is a nontrivial question, since it is the hyperplane sampling
abilities of genetic algorithms that are usually touted as the source of global sampling.
On the other hand, researchers in the evolutionary programming community tend to argue
that crossover is unnecessary and that mutation is sufficient for robust and effective search.
It should be emphasized that all of the theory concerning hyperplane sampling has been
developed with respect to the canonical genetic algorithm, and the alternative forms of
genetic algorithms often use mechanisms, such as monotonic selection of the best strings,
which could easily lead to increased hill-climbing. This is a question which the genetic
algorithm community needs to consider in more detail. Of course, the answer may often be
that many genetic algorithms exploit a bit of both. In practice, however, there may be some
clues as to when hill-climbing is a dominant factor in a search.
First, hyperplane sampling requires larger populations. Small populations are much more
likely to rely on hill-climbing. A population of 20 individuals just doesn't provide very much
information about hyperplane partitions, except perhaps the very low order hyperplanes
(note that there are only 5 samples of each order-2 hyperplane in a population of 20).
Second, very high selective pressure suggests hill-climbing may play a dominant role in a search.
Some researchers use an exponential bias toward the best individuals. But if one has a
population of 100 strings and the 5 best individuals reproduce 95% of the time, then the
effective population size is probably not large enough to support hyperplane sampling.

9 Parallel Genetic Algorithms


Part of the biological metaphor used to motivate genetic algorithms is that it is inherently
a parallel form of search. In nature, populations of thousands or even millions of individuals
exist in parallel. This suggests a degree of parallelism that is directly proportional to the
population size used in genetic search. It also suggests a way of using larger population
sizes without increasing the amount of time that it takes to evaluate a single generation.
Gordon, Whitley and Bohm (1992) have shown that when genetic algorithms are executed
on a dataflow simulator which exploits all sources of parallelism, the time required to execute
a single generation is indeed independent of population size. Of course, large populations
also typically display slower convergence, which means that while the time to execute a
single generation may not change with larger population sizes, the number of generations to
convergence may increase.
In this paper, three different ways of exploiting parallelism in genetic algorithms will be
reviewed. First, a parallel genetic algorithm similar to the canonical genetic algorithm will
be reviewed; next an Island Model using distinct subpopulations will be presented. Finally,
a fine grain massively parallel implementation that assumes one individual resides at each
processor will be explored. It can be shown that such algorithms are in fact a subclass of
cellular automata (Whitley 1993b). Therefore, while these algorithms have been referred to
by a number of somewhat awkward names (e.g., fine grain genetic algorithms, or massively
parallel genetic algorithms), the name cellular genetic algorithm is used in this tutorial.
In each of the following models, strings are mapped to processors in a particular way.
Usually this is done in a way that maximizes parallelism while avoiding unnecessary processor
communication. However, any of these models could be implemented in massively parallel,
fine grain fashion; similarly, any of these models could be implemented using a more coarse
grain implementation. What does tend to be different is the role of local versus global
communication.

9.1 Global Populations with Parallelism


The most direct way to implement a parallel genetic algorithm is to implement something
close to a canonical genetic algorithm. The only change that will be made is that selection
will be done by Tournament Selection (Goldberg 1990).
Tournament selection implements an approximate form of ranking. Recall that the
implementation of one generation in a canonical genetic algorithm can be seen as a two-step
process. First, selection is used to create an intermediate population of duplicate strings
selected according to fitness. Second, crossover and mutation are applied to produce the next
generation. Instead of using fitness proportionate reproduction or directly using ranking,
tournaments are held to fill the intermediate population. Assume two strings are selected
out of the current population after evaluation. The best of the two strings is then placed in
the intermediate population. This process of randomly selecting two strings from the current
population and placing the best in the intermediate population is repeated until the
intermediate population is full. A simple argument can be used to illustrate how this procedure
will approximate a linear ranking with a bias of 2.0. A linear bias of 2.0 implies the best
individual should be assigned a fitness of 2.0, the individual at the 75th percentile in the
population should be assigned a fitness of 1.5, the median individual should be assigned a
fitness of 1.0, the individual at the 25th percentile should be assigned a fitness of 0.5, and
the worst individual in the population should have a fitness of 0.0. Now consider a
tournament size of two, where the best of the two is selected. On average, each individual in the
population is involved in two tournaments. Of course, the best individual will win both and
in expectation will place 2 copies in the intermediate population. The median individual
should have a 50% chance of winning any tournament, and thus, in expectation, places 1
copy in the intermediate population. The worst individual cannot win a tournament and
has an effective fitness of 0.0.
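The binary tournament just described can be sketched in a few lines. This is an illustrative sketch rather than the paper's own code; the function and variable names are my own.

```python
import random

def tournament_selection(population, fitness, size=2):
    """Fill an intermediate population by repeated tournaments.

    Each slot is filled by sampling `size` distinct strings at random
    and keeping the fittest; with size=2 this approximates linear
    ranking with a bias of 2.0, as argued above.
    """
    intermediate = []
    for _ in range(len(population)):
        # Pick `size` distinct contestants; the fittest wins the slot.
        contestants = random.sample(range(len(population)), size)
        winner = max(contestants, key=lambda i: fitness[i])
        intermediate.append(population[winner])
    return intermediate
```

Note that because the two contestants in a tournament are distinct, the worst string in the population can never be selected, matching its effective fitness of 0.0.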

[Figure: paired (odd/even) string slots pass through Tournament Selection, then Crossover and Mutation.]

Figure 6: An example of exploiting fine-grain parallelism in a simple genetic algorithm using
Tournament Selection.
With the addition of tournament selection, a parallel form of the canonical genetic
algorithm can now be implemented in a fairly direct fashion. A simple genetic algorithm
can be executed in parallel on N/2 processors, where N is the population size. Assume the
processors are numbered 1 to N/2 and the population size is even. At each processor X
reside two strings labeled 2X and 2X-1. First, the strings are evaluated. Next, each
processor holds two independent tournaments by randomly sampling strings in the population,
and each processor then keeps the winners of the two tournaments. The new strings that
now reside in the processors represent the intermediate generation. Tournament selection
also has the side effect of randomly shuffling the intermediate population. Probabilistic
crossover can be implemented by instructing those processors numbered less than pc × N/2
to execute crossover on the strings in residence. All processors can carry out mutation on
both strings, thus completing a single generation.
Note that this model is relatively fine grained, but preserves the global (i.e., panmictic)
mating scheme characteristic of canonical genetic algorithms. This same basic approach can
also be implemented in parallel with 4, 6 or any even number of strings per processor.
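The N/2-processor scheme can be simulated sequentially, with each loop iteration playing the role of one processor. The sketch below is a hypothetical rendering of the description above; one-point crossover and bit-flip mutation over binary strings are my assumptions, as are all the names.

```python
import random

def one_generation(population, fitness, pc=0.6, pm=0.01):
    """One generation of the N/2-'processor' scheme, simulated in sequence."""
    n = len(population)

    def tourney():
        # Binary tournament over the whole (panmictic) population.
        i, j = random.sample(range(n), 2)
        return population[i] if fitness[i] >= fitness[j] else population[j]

    # Selection phase: every processor keeps the winners of two tournaments.
    intermediate = [tourney() for _ in range(n)]

    def mutate(s):
        # Flip each bit independently with probability pm.
        return ''.join(('1' if c == '0' else '0') if random.random() < pm else c
                       for c in s)

    next_gen = []
    for proc in range(n // 2):
        a, b = intermediate[2 * proc], intermediate[2 * proc + 1]
        if proc < pc * (n // 2):  # processors numbered below pc * N/2 recombine
            point = random.randrange(1, len(a))
            a, b = a[:point] + b[point:], b[:point] + a[point:]
        next_gen.extend((mutate(a), mutate(b)))
    return next_gen
```

Because tournament selection has already shuffled the intermediate population, applying crossover only at the lowest-numbered processors still amounts to probabilistic crossover.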

9.2 Island Models


One motivation for using Island Models is to exploit a more coarse-grain parallel model.
Assume we wish to use 16 processors and have a population of 1,600 strings; or we might wish
to use 64 processors and 6,400 strings. One way to do this is to break the total population
down into subpopulations of 100 strings each. Each one of these subpopulations could then
execute as a normal genetic algorithm. It could be a canonical genetic algorithm, or Genitor,
or CHC. But occasionally, perhaps every five generations or so, the subpopulations would
swap a few strings. This migration allows the subpopulations to share genetic material.

[Figure: two panels, labeled "An Island Model Genetic Algorithm" and "A Cellular Genetic Algorithm".]

Figure 7: An example of both an island model and a cellular genetic algorithm. The coloring
of the cells in the cellular genetic algorithm represents genetically similar material that forms
virtual islands isolated by distance.

(Whitley and Starkweather, 1990; Gorges-Schleuter, 1991; Tanese, 1989; Starkweather et al.,
1991.)
Assume for a moment that one executes 16 separate genetic algorithms, each using a
population of 100 strings without migration. In this case, 16 independent searches occur.
Each search will be somewhat different since the initial populations will impose a certain
sampling bias; also, genetic drift will tend to drive these populations in different directions.
Sampling error and genetic drift are particularly significant factors in small populations and,
as previously noted, are even more pronounced in genetic algorithms such as Genitor and CHC
when compared to the canonical genetic algorithm.
By introducing migration, the Island Model is able to exploit differences in the various
subpopulations, which in fact represent a source of genetic diversity. Each subpopulation is
an island, and there is some designated way in which genetic material is moved from one
island to another. If a large number of strings migrate each generation, then global mixing
occurs and local differences between islands will be driven out. If migration is too infrequent,
it may not be enough to prevent each small population from prematurely converging.
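As a concrete illustration, the sketch below implements one hypothetical migration policy on a ring of islands: copies of each island's best strings replace the worst strings of the next island. The paper does not prescribe a particular topology or replacement rule, so treat these details, and the names, as assumptions.

```python
def migrate(islands, fitness, k=2):
    """Ring migration: island i sends copies of its k best to island i+1.

    `islands` is a list of lists of individuals; `fitness` maps an
    individual to a score. Migrants replace the destination's k worst.
    """
    # Snapshot each island's best BEFORE any replacement, so migrants
    # do not propagate around the ring within a single migration step.
    best = [sorted(isl, key=fitness, reverse=True)[:k] for isl in islands]
    for i, isl in enumerate(islands):
        incoming = best[(i - 1) % len(islands)]
        isl.sort(key=fitness)      # worst individuals first
        isl[:k] = list(incoming)   # copies of the neighbour's best replace them
    return islands
```

With identity fitness and islands `[[1,2,3,4],[5,6,7,8],[9,10,11,12]]`, one step yields `[[12,11,3,4],[4,3,7,8],[8,7,11,12]]`: each island keeps its own best and absorbs its predecessor's.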

9.3 Cellular Genetic Algorithms


Perhaps the easiest way to understand this model is to work from the architecture to the
algorithm. Assume we have 2,500 simple processors laid out on a 50x50 2-dimensional grid.
Processors only communicate with their immediate neighbors (e.g., north, south, east and
west: NSEW). But the processors on the left edge of the grid wrap around so that they
are directly connected to the processors in the same row on the right edge. Likewise, the
processors on the bottom row are connected to the processors in the same column in the
top row, thus forming a torus. How should one implement a genetic algorithm on such an
architecture?
One can of course assign one string per processor or cell. But global random mating
would now seem inappropriate given the communication restrictions. Instead, it is much
more practical to have each string (i.e., processor) seek a mate close to home. One way to do
this would be to have each processor check the locations to the north, south, east and west,
make a copy of the best string that it finds, then recombine the resident string with the
selected string. Alternatively, some form of local probabilistic selection could be used. In
either case, only one offspring would be produced. (Either one of the two possible offspring
could be randomly selected, or they both could be evaluated and the best retained.) The
offspring would become the new resident at that processor.
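A minimal sketch of this update rule, assuming a toroidal grid of strings and a user-supplied crossover operator; the synchronous update and all the names are my own choices, not the paper's:

```python
import random

def cellular_step(grid, fitness, crossover):
    """One synchronous update of a cellular GA on a toroidal grid.

    Each cell copies the fittest of its four NSEW neighbours and
    recombines it with the resident string; one offspring, chosen at
    random from the pair, replaces the resident.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Toroidal NSEW neighbourhood: indices wrap at the edges.
            neighbours = [grid[(r - 1) % rows][c], grid[(r + 1) % rows][c],
                          grid[r][(c - 1) % cols], grid[r][(c + 1) % cols]]
            mate = max(neighbours, key=fitness)           # local selection
            child_a, child_b = crossover(grid[r][c], mate)
            new_grid[r][c] = random.choice((child_a, child_b))
    return new_grid
```

Repeatedly applying this step is what produces the local pockets of similar strings described below: information can only spread one cell per generation, so distant regions evolve almost independently.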
There are no explicit islands in the model, but there is the potential for similar effects.
Assuming that mating is restricted to adjacent processors (i.e., NSEW), if one neighborhood
of strings is 20 or 25 moves away from another neighborhood of strings, these neighborhoods
are just as isolated as two subpopulations on separate islands. This kind of separation is
referred to as isolation by distance (Wright, 1932; Collins and Jefferson, 1991; Muhlenbein,
1991; Gorges-Schleuter, 1991). Of course, neighbors that are only 4 or 5 moves away have a
greater potential for interaction.
After the first random population is evaluated, the pattern of strings over the set of
processors tends to be itself rather random. After a few generations, however, there are
many small local pockets of similar strings with similar fitness values. This is because local
mating and selection create local evolutionary trends, again due to sampling effects in the
initial population and genetic drift. After several more generations, competition between
these local groups means that there are now fewer different kinds of neighborhoods and of
course the remaining neighborhoods are larger.
The mating scheme outlined for the cellular genetic algorithm is one of many possible
schemes. The common theme is that selection and mating are typically restricted to a local
neighborhood. Manderick (personal communication) reports that the population must be
composed of at least 1000 individuals before local trends can emerge. Search proceeds by
one neighborhood taking over another, although interesting new genetic mixes often appear
at the boundaries between neighborhoods and may give rise to the emergence of a whole
new local neighborhood.

10 Conclusions
One thing that is striking after reviewing the various genetic algorithms and the various
parallel models is the richness of this form of computation. What may seem like simple
changes in the algorithm often result in surprising kinds of emergent behavior. Recent
theoretical advances have also improved our understanding of genetic algorithms and have
opened the door to using more advanced analytical methods. We now need to use these
tools in combination with empirical experiments to decide which classes of problems
genetic algorithms are best suited to, as well as what kinds of problems are best solved using
other algorithms.

Acknowledgements
A paper such as this is really a product of years of interaction with many people. It not only
represents information transmitted through scholarly works, but also through conference
presentations, personal discussions, debates and even disagreements. My thanks to all of the
people in the Genetic Algorithm community who have educated me over the years. Work
presented in the tutorial has been supported by NSF grant IRI-9010546.

References
Ackley, D. (1987) A Connectionist Machine for Genetic Hillclimbing. Kluwer Academic Publishers.
Antonisse, H.J. (1989) A New Interpretation of the Schema Notation that Overturns the Binary
Encoding Constraint. Proc. 3rd International Conf. on Genetic Algorithms, Morgan Kaufmann.
Back, T., Hoffmeister, F. and Schwefel, H.P. (1991) A Survey of Evolution Strategies. Proc. 4th
International Conf. on Genetic Algorithms, Morgan Kaufmann.
Baker, J. (1985) Adaptive selection methods for genetic algorithms. Proc. International Conf. on
Genetic Algorithms and Their Applications, J. Grefenstette, ed. Lawrence Erlbaum.
Booker, L. (1987) Improving Search in Genetic Algorithms. In Genetic Algorithms and Simulated
Annealing, L. Davis, ed. Morgan Kaufmann, pp. 61-73.
Bridges, C. and Goldberg, D. (1987) An analysis of reproduction and crossover in a binary-coded
genetic algorithm. Proc. 2nd International Conf. on Genetic Algorithms and Their Applications,
J. Grefenstette, ed. Lawrence Erlbaum.
Collins, R. and Jefferson, D. (1991) Selection in Massively Parallel Genetic Algorithms. Proc. 4th
International Conf. on Genetic Algorithms, Morgan Kaufmann, pp. 249-256.
Davidor, Y. (1991) A Naturally Occurring Niche & Species Phenomenon: The Model and First
Results. Proc. 4th International Conf. on Genetic Algorithms, Morgan Kaufmann, pp. 257-263.
Davis, L.D. (1991) Handbook of Genetic Algorithms. Van Nostrand Reinhold.
DeJong, K. (1975) An Analysis of the Behavior of a Class of Genetic Adaptive Systems. PhD
Dissertation, Department of Computer and Communication Sciences, University of Michigan, Ann
Arbor.
DeJong, K. and Sarma, J. (1993) Generation Gaps Revisited. Foundations of Genetic Algorithms
-2-, D. Whitley, ed. Morgan Kaufmann.
Eshelman, L. (1991) The CHC Adaptive Search Algorithm. Foundations of Genetic Algorithms,
G. Rawlins, ed. Morgan Kaufmann, pp. 256-283.
Eshelman, L. and Schaffer, J.D. (1991) Preventing Premature Convergence in Genetic Algorithms
by Preventing Incest. Proc. 4th International Conf. on Genetic Algorithms, Morgan Kaufmann.
Fitzpatrick, J.M. and Grefenstette, J.J. (1988) Genetic Algorithms in Noisy Environments.
Machine Learning, 3(2/3): 101-120.

Fogel, L.J., Owens, A.J., and Walsh, M.J. (1966) Artificial Intelligence Through Simulated
Evolution. John Wiley.
Fogel, D., and Atmar, J.W. (1990) Comparing Genetic Operators with Gaussian Mutations in
Simulated Evolutionary Processes Using Linear Systems. Biological Cybernetics, 63:111-114.
Fogel, D., and Atmar, J.W., eds. (1992) First Annual Conference on Evolutionary Programming.
Goldberg, D. and Bridges, C. (1990) An Analysis of a Reordering Operator on a GA-Hard Problem.
Biological Cybernetics, 62:397-405.
Goldberg, D. (1987) Simple Genetic Algorithms and the Minimal, Deceptive Problem. In Genetic
Algorithms and Simulated Annealing, L. Davis, ed. Pitman.
Goldberg, D. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Reading,
MA: Addison-Wesley.
Goldberg, D. (1990) A Note on Boltzmann Tournament Selection for Genetic Algorithms and
Population-oriented Simulated Annealing. TCGA Report 90003, Department of Engineering
Mechanics, University of Alabama.
Goldberg, D. (1991) The Theory of Virtual Alphabets. Parallel Problem Solving from Nature,
Springer Verlag.
Goldberg, D., and Deb, K. (1991) A Comparative Analysis of Selection Schemes Used in Genetic
Algorithms. Foundations of Genetic Algorithms, G. Rawlins, ed. Morgan Kaufmann, pp. 69-93.
Gordon, S., Whitley, D., and Bohm, A. (1992) Dataflow Parallelism in Genetic Algorithms. Parallel
Problem Solving from Nature -2-, R. Manner and B. Manderick, eds. North Holland.
Gorges-Schleuter, M. (1991) Explicit Parallelism of Genetic Algorithms through Population
Structures. Parallel Problem Solving from Nature, Springer Verlag, pp. 150-159.
Grefenstette, J.J. (1986) Optimization of Control Parameters for Genetic Algorithms. IEEE Trans.
Systems, Man, and Cybernetics, 16(1): 122-128.
Holland, J. (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press.
Liepins, G. and Vose, M. (1990) Representation Issues in Genetic Algorithms. Journal of
Experimental and Theoretical Artificial Intelligence, 2:101-115.
Michalewicz, Z. (1992) Genetic Algorithms + Data Structures = Evolution Programs. Springer-
Verlag, AI Series, New York.
Muhlenbein, H. (1991) Evolution in Time and Space - The Parallel Genetic Algorithm. Foundations
of Genetic Algorithms, G. Rawlins, ed. Morgan Kaufmann, pp. 316-337.
Muhlenbein, H. (1992) How genetic algorithms really work: I. Mutation and Hillclimbing. Parallel
Problem Solving from Nature -2-, R. Manner and B. Manderick, eds. North Holland.
Nix, A. and Vose, M. (1992) Modeling Genetic Algorithms with Markov Chains. Annals of
Mathematics and Artificial Intelligence, 5:79-88.
Rechenberg, I. (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. Frommann-Holzboog Verlag, Stuttgart.
Schaffer, J.D. (1987) Some Effects of Selection Procedures on Hyperplane Sampling by Genetic
Algorithms. In Genetic Algorithms and Simulated Annealing, L. Davis, ed. Pitman.
Schaffer, J.D., and Eshelman, L. (1993) Real-Coded Genetic Algorithms and Interval Schemata.
Foundations of Genetic Algorithms -2-, D. Whitley, ed. Morgan Kaufmann.

Schwefel, H.P. (1975) Evolutionsstrategie und numerische Optimierung. Dissertation, Technische
Universitat Berlin.
Schwefel, H.P. (1981) Numerical Optimization of Computer Models. Wiley.
Spears, W. and DeJong, K. (1991) An Analysis of Multi-Point Crossover. Foundations of Genetic
Algorithms, G. Rawlins, ed. Morgan Kaufmann.
Syswerda, G. (1989) Uniform Crossover in Genetic Algorithms. Proc. 3rd International Conf. on
Genetic Algorithms, Morgan Kaufmann, pp. 2-9.
Syswerda, G. (1991) A Study of Reproduction in Generational and Steady-State Genetic
Algorithms. Foundations of Genetic Algorithms, G. Rawlins, ed. Morgan Kaufmann, pp. 94-101.
Starkweather, T., Whitley, D., and Mathias, K. (1991) Optimization Using Distributed Genetic
Algorithms. Parallel Problem Solving from Nature, Springer Verlag.
Tanese, R. (1989) Distributed Genetic Algorithms. Proc. 3rd International Conf. on Genetic
Algorithms, Morgan Kaufmann, pp. 434-439.
Vose, M. (1993) Modeling Simple Genetic Algorithms. Foundations of Genetic Algorithms -2-, D.
Whitley, ed. Morgan Kaufmann.
Vose, M. and Liepins, G. (1991) Punctuated Equilibria in Genetic Search. Complex Systems,
5:31-44.
Whitley, D. (1989) The GENITOR Algorithm and Selective Pressure. Proc. 3rd International Conf.
on Genetic Algorithms, Morgan Kaufmann, pp. 116-121.
Whitley, D. (1991) Fundamental Principles of Deception in Genetic Search. Foundations of Genetic
Algorithms, G. Rawlins, ed. Morgan Kaufmann.
Whitley, D. (1993a) An Executable Model of a Simple Genetic Algorithm. Foundations of Genetic
Algorithms -2-, D. Whitley, ed. Morgan Kaufmann.
Whitley, D. (1993b) Toward Models of Island and Cellular Parallel Genetic Algorithms. Submitted
to International Conference on Genetic Algorithms, 1993.
Whitley, D., and Kauth, J. (1988) GENITOR: a Different Genetic Algorithm. Proceedings of the
Rocky Mountain Conference on Artificial Intelligence, Denver, CO, pp. 118-130.
Whitley, D. and Starkweather, T. (1990) Genitor II: a Distributed Genetic Algorithm. Journal of
Experimental and Theoretical Artificial Intelligence, 2:189-214.
Whitley, D., Dominic, S., and Das, R. (1991) Genetic Reinforcement Learning with Multilayer
Neural Networks. Proc. 4th International Conf. on Genetic Algorithms, Morgan Kaufmann.
Whitley, D., Das, R., and Crabb, C. (1992) Tracking Primary Hyperplane Competitors During
Genetic Search. Annals of Mathematics and Artificial Intelligence, 6:367-388.
Winston, P. (1992) Artificial Intelligence, Third Edition. Addison-Wesley.
Wright, A. (1991) Genetic Algorithms for Real Parameter Optimization. Foundations of Genetic
Algorithms, G. Rawlins, ed. Morgan Kaufmann.
Wright, S. (1932) The Roles of Mutation, Inbreeding, Crossbreeding, and Selection in Evolution.
Proc. 6th Int. Congr. on Genetics, pp. 356-366.
