Evolutionary Computation and Games
Simon Lucas
Graham Kendall, University of Nottingham, UK
Games provide competitive, dynamic ...
Footnote 1: Although the term brute force is widely used in this context, we should point out that making this kind of search work well in practice typically requires very elegant and finely-tuned programming.
Fitness Function
Depending on the nature of the game and the objectives of the research, the fitness function can be based upon game-playing ability
against a fixed (non-evolved) set of opponents, or against a population of evolved agents; the latter case is known as co-evolution and has
the advantage of not requiring any human-designed playing strategies to be implemented. Instead, natural selection leads to an improving
population of players. When using a fixed opponent, the problem arises of choosing the correct level of difficulty: if the opponent is either too strong or too weak, then the randomly constructed agents of the initial population will nearly all lose (or all win), leaving regions of strategy space with insufficient fitness gradient for selection to act on. A useful technique here, when possible, is to choose a strong fixed player, but have it make random moves with a probability that is tuned to give a member of the current population an even chance of winning.
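To make this concrete, the sketch below shows one way such a fitness function might be organized in Java. The Player and GameState interfaces, the method names, and the win-counting convention are illustrative assumptions rather than code from the work described here; the opponent's noise level epsilon would be tuned until candidates win roughly half of their games.

import java.util.List;
import java.util.Random;

/** Illustrative sketch (assumed interfaces): fitness against a strong fixed
 *  opponent that is "softened" by making random moves with probability epsilon. */
interface Player {
    int chooseMove(GameState state);              // returns one of state.legalMoves()
}

interface GameState {
    List<Integer> legalMoves();
    GameState apply(int move);                    // returns the successor state (immutable states)
    boolean isTerminal();
    int winner();                                 // +1 first player, -1 second player, 0 draw
}

class NoisyOpponent implements Player {
    private final Player strongPlayer;
    private final double epsilon;                 // probability of playing a random move
    private final Random rng = new Random();

    NoisyOpponent(Player strongPlayer, double epsilon) {
        this.strongPlayer = strongPlayer;
        this.epsilon = epsilon;
    }

    @Override
    public int chooseMove(GameState state) {
        List<Integer> moves = state.legalMoves();
        if (rng.nextDouble() < epsilon) {
            return moves.get(rng.nextInt(moves.size()));  // occasional random move
        }
        return strongPlayer.chooseMove(state);            // otherwise play strongly
    }
}

class Fitness {
    /** Fraction of games won by the candidate against the noisy fixed opponent. */
    static double evaluate(Player candidate, Player fixedOpponent,
                           double epsilon, int games, GameState start) {
        Player opponent = new NoisyOpponent(fixedOpponent, epsilon);
        int wins = 0;
        for (int g = 0; g < games; g++) {
            boolean candidateFirst = (g % 2 == 0);        // alternate colours
            GameState s = start;
            boolean firstToMove = true;
            while (!s.isTerminal()) {
                Player toMove = (firstToMove == candidateFirst) ? candidate : opponent;
                s = s.apply(toMove.chooseMove(s));
                firstToMove = !firstToMove;
            }
            int w = s.winner();
            if ((candidateFirst && w == +1) || (!candidateFirst && w == -1)) wins++;
        }
        return wins / (double) games;
    }
}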
Exploration
It is important to encourage the evolving agents to explore the space of game strategies sufficiently. This can be done by ensuring the population of players is sufficiently large, and/or by adding noise to the players' move selections; otherwise, deterministic players may play too limited a range of games.
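One simple way of adding such noise, sketched below with assumed names and an illustrative temperature parameter, is to sample moves from a softmax (Boltzmann) distribution over their evaluations rather than always playing the greedy choice.

import java.util.Random;

/** Sketch: softmax (Boltzmann) move selection, one way to add noise so that
 *  otherwise deterministic players explore a wider range of games.
 *  The temperature parameter is an assumption for illustration. */
class SoftmaxSelection {
    private static final Random RNG = new Random();

    /** Returns the index of a move sampled in proportion to exp(value / temperature). */
    static int select(double[] moveValues, double temperature) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : moveValues) max = Math.max(max, v);    // subtract max for numerical stability
        double[] p = new double[moveValues.length];
        double sum = 0.0;
        for (int i = 0; i < moveValues.length; i++) {
            p[i] = Math.exp((moveValues[i] - max) / temperature);
            sum += p[i];
        }
        double r = RNG.nextDouble() * sum;
        for (int i = 0; i < p.length; i++) {
            r -= p[i];
            if (r <= 0) return i;
        }
        return p.length - 1;                                   // fallback for rounding error
    }
}

Raising the temperature makes play more random; lowering it recovers near-greedy play.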
Implementation
An important aspect of evolving game playing agents, and one that often receives scant attention in research papers, is that of efficient
implementation of the game. This is important since the quality of the evolved strategies, and the complexity of the architectures that can be evolved, are critically dependent upon the number of fitness evaluations: in other words, upon the number of games played or, in the case of a real-time game, the number of time steps executed. In recent work on evolving Othello strategies, one of the authors (Lucas) made an initial
Java prototype implementation of the system that was able to play only five games per second (when evolving a weighted piece counter at
1-ply). However, through careful re-engineering of the software, he improved this to more than one thousand games per second for the
same setup. The tricks used included the use of a profiler to observe which operations were taking the most time, replacing any Java Collec-
tions used with simpler, more restricted custom-designed collection classes, removing all unnecessary dynamic memory management, and
performing incremental evaluation of the board (i.e., evaluating the effects of making a move given a particular weighted piece counter,
without actually making the move). Also, using 1-dimensional instead of 2-dimensional arrays, with a blank border around the board to detect off-board cases, was observed to make a significant difference; this latter trick was borrowed from Thomas Runarsson's Othello implementation in C.
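The sketch below illustrates the last two of these tricks under assumed constants and method names (it is not the authors' code): a 1-D board with an off-board border, so the eight directions become simple integer offsets, and an incremental computation of how a move would change a weighted piece counter's score without actually making the move.

/** Sketch of two of the implementation tricks described above. Constants and
 *  names are illustrative assumptions. */
class OthelloEval {
    static final int N = 10;                      // 8x8 playing area plus a 1-cell border
    static final int OFF = 2;                     // border marker, distinct from +1/-1/EMPTY
    static final int EMPTY = 0;
    // The eight direction steps in the 1-D representation.
    static final int[] DIRS = {-N - 1, -N, -N + 1, -1, +1, N - 1, N, N + 1};

    /** Change in the weighted-piece-counter score if 'player' (+1 or -1) moves at 'sq'.
     *  board[i] is +1, -1, EMPTY or OFF; weights has one entry per cell.
     *  The move itself is never applied to the board (legality is not checked here). */
    static double moveDelta(int[] board, double[] weights, int sq, int player) {
        double delta = weights[sq] * player;      // the disc placed on the move square
        for (int d : DIRS) {
            double lineDelta = 0.0;
            int i = sq + d;
            while (board[i] == -player) {         // walk over a run of opponent discs
                lineDelta += 2 * player * weights[i];  // a flip changes -player into +player
                i += d;
            }
            if (board[i] == player) {             // run is bracketed by our disc: it flips
                delta += lineDelta;
            }                                      // EMPTY or OFF border: nothing flips here
        }
        return delta;
    }
}

The border cells mean the inner loop always terminates without an explicit bounds check, which is part of why the 1-D layout is fast.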
Game Interface
This defines how the evolved agent will interact with the game. The main choices are state evaluator or action selector. The state evaluator's task is to evaluate the desirability of any given state from the evaluator's perspective. This is a very flexible interface that can be used directly with mini-max search, where the leaves of the game tree are passed to the state evaluation function. The alternative is the action selector, which gives the agent more control over which move to make but is typically harder to learn. The action selector is given the current board state and asked to choose an action. Given this setup, it is challenging for a neural network even to select a legal move, and this interface is often modified so that the highest-rated legal move is played, rather than the highest-rated move overall (which may be illegal).
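In Java the two interfaces might look something like the following sketch; the names, the feature-vector representation of the board, and the helper for restricting the choice to legal moves are illustrative assumptions, not the interfaces actually used in the work described.

import java.util.List;

/** Sketch of the two main agent/game interfaces discussed above. */
interface StateEvaluator {
    /** Desirability of a state from the evaluator's point of view;
     *  typically called on the leaves of a mini-max search. */
    double evaluate(double[] boardFeatures);
}

interface ActionSelector {
    /** A rating for every possible action (legal or not) in the given state,
     *  indexed by action number. */
    double[] rateActions(double[] boardFeatures);
}

class ActionSelection {
    /** Common modification: pick the highest-rated *legal* move rather than
     *  the highest-rated move overall, which may be illegal. */
    static int bestLegalMove(ActionSelector agent, double[] boardFeatures,
                             List<Integer> legalMoves) {
        double[] ratings = agent.rateActions(boardFeatures);
        int best = legalMoves.get(0);
        for (int move : legalMoves) {
            if (ratings[move] > ratings[best]) best = move;
        }
        return best;
    }
}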
Ply Depth
Each game has its unique space of strategies, and the game dynamics should influence the setup of the system. For two-player perfect
information games, an important consideration is the ply-depth to use during evolution. Chellapilla and Fogel, and Hughes independently
settled on 4-ply when evolving checkers players, as a compromise between good quality evaluation and CPU time. For other games,
a different ply-depth may be appropriate. Chong et al. [57] used 2-ply when evolving Othello players, for example.
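The sketch below shows where the ply-depth setting enters: a plain fixed-depth negamax search (an equivalent formulation of mini-max) that calls the evolved evaluator at the leaves. The interfaces are assumed for illustration; a depth of 4 would correspond to the checkers experiments mentioned above, and 2 to the Othello work of Chong et al.

import java.util.List;

/** Sketch: fixed ply-depth negamax search with an evolved state evaluator at
 *  the leaves. GameState and StateEvaluator are illustrative assumptions. */
class PlyLimitedSearch {
    interface GameState {
        List<Integer> legalMoves();
        GameState apply(int move);
        boolean isTerminal();
        double terminalValue();                   // result from the side to move's viewpoint
        double[] features();                      // input to the evolved evaluator
    }

    interface StateEvaluator {
        double evaluate(double[] features);       // value for the side to move
    }

    /** Negamax value of 'state' searched to 'depth' plies. */
    static double negamax(GameState state, int depth, StateEvaluator eval) {
        if (state.isTerminal()) return state.terminalValue();
        if (depth == 0) return eval.evaluate(state.features());  // evolved evaluator at the leaf
        double best = Double.NEGATIVE_INFINITY;
        for (int move : state.legalMoves()) {
            // The child's value is from the opponent's viewpoint, so negate it.
            best = Math.max(best, -negamax(state.apply(move), depth - 1, eval));
        }
        return best;
    }
}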
Evolutionary Algorithm
For some games, it seems that the setup of the EA may be non-critical. Runarsson and Lucas, however, found that when evolving piece
counters for small-board Go, many of the details were very important [58]. To get good performance with co-evolution, they found it essential to use parent/child weighted averaging and a sufficiently large population of 30 players (a population of 10 was not sufficient).
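The exact averaging scheme is not spelled out here, but one common way to realize parent/child weighted averaging is to move a surviving parent's weight vector only part of the way toward a successful offspring rather than replacing it outright, which helps smooth the noisy fitness signal. The blend factor alpha in the sketch below is an assumption for illustration.

/** Sketch of parent/child weighted averaging: a parent is blended toward a
 *  successful offspring instead of being replaced. The update rule and the
 *  blend factor alpha are illustrative assumptions, not the published scheme. */
class WeightedAveraging {
    /** parent <- (1 - alpha) * parent + alpha * child, updated in place. */
    static void blendToward(double[] parentWeights, double[] childWeights, double alpha) {
        for (int i = 0; i < parentWeights.length; i++) {
            parentWeights[i] = (1 - alpha) * parentWeights[i] + alpha * childWeights[i];
        }
    }
}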
board position, and it outputs a value that is used in a mini-max search. During the training period, the program is given no information other than whether it won or lost (it is not even told by how much). Blondie24 is not provided with a strategy and contains no database of opening and ending game positions. Co-evolution is used to develop Blondie24 by playing games against itself. Once it is able to play at a suitable level, it often searches to a depth of 10, but depths of 6 and 8 are common in play. This program was available to the delegates of the Congress on Evolutionary Computation (CEC) conference for two years (CEC'00 San Diego and CEC'01 Seoul), with Fogel offering a prize of $100 (CEC'00) and $200 (CEC'01) to anybody who could defeat it. The prize remained unclaimed.

Hughes has shown experimentally that good playing performance could be achieved by evolving position-weighted piece counters for Checkers [23], with his Brunette player competing closely with Blondie when allowed to search to a deeper ply, commensurate with the reduced evaluation time per board that results from the simpler evaluation function. Hughes has also investigated both the use of co-evolution and Monte-Carlo methods for position evaluation as alternatives to mini-max search [24].

There are many goals of this research, and one emerging theme is using EC to generate opponents that are more interesting and fun to play against, rather than being necessarily superior.

Monte-Carlo simulation can be applied to evaluate a game state simply by playing out a game to its conclusion by making a succession of random moves, and repeating this process a large number of times to estimate the probability of winning from that state. Although the probability is actually estimated for purely random players, this method has, nonetheless, proven to give surprisingly good estimates. It has been used to good effect in a number of different games, including Go [26] and Real-Time Strategy games [27].

As well as chess and checkers, which tend to receive the most media interest, other games have also made significant contributions. Research into Backgammon [28], [29] has made advances in reinforcement learning and other machine-learning techniques. Go [3], which remains a massive challenge, has led to advances in knowledge representation and search techniques. Bridge [30] has inspired areas such as partition search, the practical application of Monte-Carlo techniques to realistic problems, and the use of squeaky wheel optimization [31] in game playing. Othello has also been the subject of significant research, culminating in Logistello's 6-0 defeat of the then world champion [32]. Furthermore, there is an Othello competition associated with CEC 2006, which aims to find the best Othello position evaluation function when restricted to 1-ply search.

Games of Imperfect Information
Games of imperfect information are classified by the fact that some of the information is hidden from some (or all) of the players. Card games typically fall into this category with, perhaps, poker attracting the most recent research interest. As far back as 1944, game theory (developed by von Neumann and Morgenstern [33] to model and study the economic environment) was using a simplified version of poker as a test bed. They recognized that accomplished poker players regularly adopt bluffing as part of their game, which would have to be accounted for in any automated poker player.

Findler [34] studied automated poker during a 20-year period. He also worked on a simplified game that was based on 5-card draw poker with no ante and no consideration of betting position, due to the computer always playing last. He concluded that dynamic and adaptive algorithms are required for successful play, and that static mathematical models were unsuccessful and easily beaten. Other than von Neumann's and ...
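Returning to the Monte-Carlo position evaluation described above: the following sketch (with assumed GameState methods) estimates the probability of winning from a state by repeatedly playing uniformly random moves to the end of the game.

import java.util.List;
import java.util.Random;

/** Sketch of Monte-Carlo position evaluation: play random games to completion
 *  from the given state and estimate the probability that a given side wins.
 *  GameState and its methods are illustrative assumptions; non-terminal states
 *  are assumed to always have at least one legal move. */
class MonteCarloEval {
    interface GameState {
        List<Integer> legalMoves();
        GameState apply(int move);
        boolean isTerminal();
        int winner();                             // +1, -1, or 0 for a draw
    }

    private static final Random RNG = new Random();

    /** Estimated probability that 'player' (+1 or -1) wins from 'state'. */
    static double winProbability(GameState state, int player, int rollouts) {
        int wins = 0;
        for (int r = 0; r < rollouts; r++) {
            GameState s = state;
            while (!s.isTerminal()) {
                List<Integer> moves = s.legalMoves();
                s = s.apply(moves.get(RNG.nextInt(moves.size())));  // uniformly random move
            }
            if (s.winner() == player) wins++;
        }
        return wins / (double) rollouts;
    }
}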
Much is made of the fact that co-evolution can be used to learn game strategies simply by having a population of players play many games against each other, and allowing the more successful players to have more offspring than the less successful players. However, temporal difference learning (TDL) is also able to achieve the same remarkable kind of feat, of learning simply by playing.

A key difference between the two approaches is that temporal difference learning attempts to learn a value function by observing the progression of game states during game-play. It learns from the sequence of game states in an entirely unsupervised way; there is no expert knowledge used to train the system. In the simplest form of TDL, called TD(0), the learner adjusts the value function (this could involve using back-propagation to update the weights of an MLP, for example) to make the value of the current state more like the value of the next state. It may seem surprising that this works, since changes in an incorrect value function would seem meaningless, but there are two reasons why this is not the case. Firstly, the true values of board positions are available at the end of the game, so the information is fed back directly and is guaranteed to be correct. Secondly, as the value function starts to become meaningful, the changes in the value function during game-play will also have some statistical importance, in the sense that they should be better than a purely random estimate. As this happens, the information available to the learner (albeit partially self-generated) begins to improve dramatically. To give a chess-based example, consider a weighted piece counter style of evaluation function. Assume that initially the learner knows nothing of the value of the pieces (they are set to random values, or to zero). After playing many games, the learner notes that 80 percent of the time, the winner has one more queen than the loser. Now, during game-play, the loss of a queen will have a negative effect on the player's value estimate, and it can potentially learn that board positions (i.e., game states) that lead to such a loss should have a low value. This is a more direct form of credit assignment than is usually performed with co-evolutionary learning, in which only win/lose/draw information is fed back at the end of a set of games (see Figure 3).

FIGURE 3 Illustration of game/learner interactions for TDL and co-evolution. A co-evolutionary learner gets feedback on the number of games won after playing a set of games, while the TDL learner gets feedback after every move of every game. Note also that co-evolution involves a population of players (depicted as neural networks here), whereas TDL typically uses a single player, playing against itself. The bold arrows indicate the reliable feedback that TDL receives at the end of each game on the value of a final position.

There have been very few direct comparisons between TDL and co-evolutionary learning, where the algorithms have been used to learn the parameters of the same architecture applied to the same game. So far, only two such studies are known. In a study using neural networks to play Gin Rummy, Kotnik and Kalita [59] found that co-evolution significantly outperformed TDL. Runarsson and Lucas compared TDL with co-evolution for learning strategies for small-board Go [58] and found that, under most experimental settings, TDL was able to learn better strategies than co-evolution and to learn them more quickly. They also found, however, that the very best strategies were learned by a carefully designed co-evolutionary algorithm, with special attention paid to the weight-sharing scheme that was necessary to smooth the noise inherent in the fitness function. A slightly less direct (in the sense that he did not perform all of the experiments) comparison was made by Darwen for Backgammon [60], in which he found that co-evolution outperformed TDL for training a perceptron, but found the opposite to be true when learning the parameters of an MLP. An interesting exercise would be to construct (or evolve!) new games that could clearly show cases where one method outperformed the other. In a similar way, it is noted that GP has already been used to evolve simple test functions to illustrate cases in which differential evolution outperforms particle swarm optimization and vice versa [61].
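As a concrete sketch of the TD(0) update described above, the code below nudges the weights of a weighted piece counter so that the value of the previous state moves toward the value of the state that followed it (or toward the final result at the end of the game). The tanh squashing function and the fixed learning rate are assumptions for illustration, not details taken from the text.

/** Sketch of a TD(0) update for a weighted piece counter, assuming a tanh
 *  squashing function and a fixed learning rate; both are illustrative
 *  assumptions rather than details given in the article. */
class TdZeroPieceCounter {
    final double[] weights;                       // one weight per board cell
    final double learningRate;

    TdZeroPieceCounter(int cells, double learningRate) {
        this.weights = new double[cells];
        this.learningRate = learningRate;
    }

    /** Value of a board: tanh of the weighted sum of its cell contents (+1/-1/0). */
    double value(int[] board) {
        double sum = 0.0;
        for (int i = 0; i < board.length; i++) sum += weights[i] * board[i];
        return Math.tanh(sum);
    }

    /** TD(0): move value(previousBoard) toward 'target', where the target is the
     *  value of the next state during play, or the game result (+1/-1/0) at the end. */
    void update(int[] previousBoard, double target) {
        double v = value(previousBoard);
        double error = target - v;
        double slope = 1.0 - v * v;               // derivative of tanh at the current output
        for (int i = 0; i < previousBoard.length; i++) {
            weights[i] += learningRate * error * slope * previousBoard[i];
        }
    }
}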