Anticipations Control Behavior: Animal Behavior in an Anticipatory Learning Classifier System
The concept of anticipations controlling behavior is introduced. Background is provided about the importance of anticipations from a psychological perspective. Based on the psychological background wrapped in a framework of anticipatory behavioral control, the anticipatory learning classifier system ACS2 is explained. ACS2 learns and generalizes on-line a predictive environmental model (a model that allows the prediction of future environmental states). The model is a subjective model, that is, no global state information is available to the agent. It is shown that ACS2 can simulate anticipatory learning processes and anticipatory controlled behavior by means of the model. The simulations of various rat experiments, previously conducted by Colwill and Rescorla, show that the incorporation of anticipations is indeed crucial for simulating the behavior observed in rats. Despite the simplicity of the tasks, we show that the observed behavior reaches beyond the capabilities of model-free reinforcement learning as well as model-based reinforcement learning without on-line generalization. Possible future impacts of anticipations in adaptive learning systems are outlined.
Correspondence to: M. V. Butz, Department of Cognitive Psychology, University of Würzburg, Roentgenring 11, 97070 Würzburg, Germany. E-mail: [email protected]
Copyright © 2002 International Society for Adaptive Behavior, Vol 10(2): 75–96.
tion takes place. That is, the RL agent usually does not generalize over the provided perceptions (or sensations, or features) while interacting with the environment. Moreover, the actual benefit or necessity of anticipatory behavior has not been investigated so far.

This work provides evidence for anticipatory behavioral influences in animals and humans, introduces a psychologically motivated behavioral learning theory of anticipatory behavioral control, and analyzes the theoretical constraints in the anticipatory learning classifier system ACS2 (Stolzmann, 1997; Butz, 2002). It is investigated how well ACS2 implements the psychological theory and how the approach allows the realization of anticipatory controlled behavior. It is shown that additional anticipatory learning and behavioral mechanisms can be added easily. Taking the animat approach (Wilson, 1991) to competent adaptive behavior systems, we simulate the behavior of ACS2 in several simple rat experiments. Despite the experiments' simplicity, we reveal that the observed rat behavior cannot be simulated with model-free RL since effect associations beyond plain reinforcement values are necessary. Moreover, we reveal that, if at all, off-line generalizing model-based RL techniques cannot simulate the behavior either, since modifications in the environment occur during the experiment. We suppose that an on-line learned and on-line generalized predictive model representation in combination with anticipatory processes enables strong behavioral competence.

In the next section, we reveal the importance of anticipations and knowledge about anticipation-influenced behavior from a cognitive psychology perspective. Moreover, we provide a framework for anticipatory behavioral control that is consistent with the psychological findings. Section 3 introduces ACS2, providing details to all relevant processes as well as comparing the learning processes to the anticipatory behavioral control framework. Section 4 introduces a rat experiment and compares behavior of ACS2 with that of the rats. Section 5 studies two further rat experiments. In these experiments additional anticipatory processes are simulated in ACS2 to solve the tasks. Section 6 summarizes and offers conclusions on the findings.

2 Anticipations Control Instrumental Behavior: Recent Experimental Evidence in Animals and Humans

A legacy of behaviorism, which restricted itself to objectively observable behavioral phenomena and disregarded any cognitive- or even consciousness-related explanations of behavior, is that in artificial intelligence learning is mostly considered as being the formation of stimulus–action connections associated with previous reinforcement sensations. This notion can be traced back to Thorndike's "law of effect," according to which the presentation of a reinforcer following an action strengthens a connection between the stimulus or situation present when the action is performed and the action itself so that subsequent presentations of the stimulus elicit the action as a response (Thorndike, 1911).

More recently, though, it has become clear that learning theories based simply on reinforcement are insufficient to explain all observed behavior in cognitive psychology experiments. This section provides evidence for anticipations controlling behavior in animals and humans. Moreover, a framework of anticipatory behavioral control is sketched.

2.1 Evidence in Experiments with Animals

The crucial role of action–outcome relations in instrumental behavior of animals was first acknowledged by Tolman and his collaborators (Tolman & Honzik, 1930; Tolman, 1932, 1949). Tolman's major argument for the insufficiency of traditional behaviorism is the observation of latent learning. In a typical latent learning experiment by Tolman and Honzik (1930) two groups of rats explore a multiple T maze in several trials, with the first group receiving reinforcement (food) at the end of the maze and the second group not. It is shown that the rats in the second group move toward the end of the maze faster once food is also provided to them. This shows that the rats must have formed a predictive model representation of their environment that they subsequently exploit to solve an explicit task.

Figure 1 In Colwill and Rescorla (1985), rats were able either to press a lever or pull a chain (R1, R2) that led to either food pellets or sucrose. After the devaluation of one outcome (O1), the action that previously led to the other outcome (O2) was preferred. The result is not explainable with a stimulus–response approach.

Figure 2 Colwill and Rescorla (1990) showed that some forms of situation-dependent R–O relations are learned by rats. After the rats were taught different S–(R–O) relations (light or noise in combination with pressing a lever or pulling a chain leads to food pellet or sucrose), and one outcome was devalued, the action that, dependent on the situation (light or noise), previously led to the other outcome was preferred.

Although the diverse latent learning experiments (cf. Thistlethwaite, 1951, for an overview) have been subject to several critiques, convincing experimental demonstrations of animal action–outcome learning have been given by the use of an outcome-devaluation
procedure, first employed by Adams and Dickinson (1981). Let us consider an outcome-devaluation experiment by Colwill and Rescorla (1985), which we will investigate throughout this work: A group of rats is first trained to perform two different actions that lead to two different outcomes (e.g., lever pressing leads to food pellets and chain pulling leads to a sucrose solution). After training, one of the two outcomes is devalued by associating it with a mild nausea (in this case lithium chloride, LiCl). When the rats are subsequently given the choice between the performance of the two actions in an extinction phase, in which no reinforcement is provided, the animals clearly prefer the action that previously led to the non-devalued outcome. Figure 1 shows the experimental setup schematically.

Obviously, the animals did not respond to the situation with any action that was directly reinforced before; rather, expectations of the forthcoming outcome of the available actions led to the avoidance of the previously devalued outcome. Thus, animal behavior is at least partly determined by anticipations of expected action outcomes. The result suggests three conclusions: First, the animals have not only acquired stimulus–response (S–R) relations, but also some relations about which actions will lead to which outcome, that is, response–outcome (R–O) associations. Second, the acquired R–O representations are involved in the propagation of the subsequently modified outcome value (devaluation). Third, and most important, the (modified) R–O representations influence the choice of behavior. In Section 5.4 we examine the performance of ACS2 in this experiment, validating the three suggested conclusions.

In a further experiment, Colwill and Rescorla (1990) examined the impact of the situational context on an animal's choice of actions with different outcomes. The assignments of two different outcomes to two different actions were reversed in the presence of discriminative stimuli (see Figure 2). In one setting, for example, rats received food pellets for pressing a lever and a sucrose solution for pulling a chain in the presence of noise, whereas in the presence of light, lever pressing resulted in sucrose and chain pulling resulted in food pellets. After this discrimination training one of the two outcomes, say, sucrose, was devalued by pairing its consumption with a mild nausea. Finally, the animals were again given the choice between the two actions in the presence of either the noise or the light. They clearly preferred the action under the present stimulus that previously resulted in the nondevalued outcome (in our case, food pellets). Particularly, the rats preferred lever pressing in the presence of noise whereas they preferred chain pulling in the presence of light. This preference is again unexplainable by S–R theories since the devaluation was experienced in the absence of any pressing or pulling action. Simple stimulus associations are not sufficient, either, since the action dependence is the crucial ingredient of the experiment. Thus, as Colwill and Rescorla (1990) argue, the rats have acquired hierarchical S–(R–O) representations that enable them to predict the outcomes of their actions depending on the given situation. Consequently, they preferred that action that in the present situation led to the relatively more desirable outcome. Similar behavior in ACS2 is demonstrated in Section 5.5.

The impact of R–O relations on animal behavior as well as their conditionalization to discriminative situational contexts has been demonstrated in numerous other experiments (cf. Roitblat, 1994; Rescorla, 1990, 1991, 1995; Dickinson, 1994; Pearce, 1997). Of course, this does not exclude that S–R and S–O relations contribute to the control of animal behavior.
However, the available evidence for the acquisition of contingent R–O relations, which are conditionalized to discriminative stimuli if necessary, is by far stronger than the evidence for direct S–R associations. Thus, anticipations of the expected outcomes of actions are a central part of animal behavioral control.

2.2 Evidence in Experiments with Humans

The emphasis of the role of action–outcome anticipations in animal action control has its pendant in the classical ideomotor hypothesis (IMH) of human action control. According to the IMH, humans (and animals) select and initiate voluntary actions by an anticipation of their sensory outcomes:

An anticipatory image, then, of the sensorial consequences of a movement, plus (on certain occasions) the fiat that these consequences shall become actual, is the only psychic state which introspection lets us discern as the forerunner of our voluntary acts. (James, 1890/1981, p. 501)

Although the IMH was widely acknowledged at the end of the 19th century (Harleß, 1861; Lotze, 1852; Münsterberg, 1889), it soon fell into disrepute because the notion that instrumental behavior might be determined by only introspectively available mental states like "anticipatory images" was not respectable in the upcoming rigorous behaviorism (cf. Greenwald, 1970). Recently, however, the IMH has experienced a revival in theoretical considerations (e.g., Hoffmann, 1993; Prinz, 1990, 1997; Hommel, 1998) as well as in experimental research (e.g., Elsner & Hommel, 2001; Hommel, 1996; Stock & Hoffmann, 2002; Kunde, 2001; Hoffmann, Sebald, & Stöcker, 2001; Ziessler, 1998; Ziessler & Nattkemper, 2001).

For an empirical confirmation of the IMH two things need to be shown: First, when performing goal-oriented actions, primarily associations between the performed actions and their contingent sensory outcomes should be formed instead of associations between stimulus conditions and actions. Second, anticipations of the expected outcomes should be the forerunners of action initiation. The following exemplar study strongly supports the IMH.

To examine the impact of action outcomes as forerunners of action initiation, Kunde (2001) came up with the simple but straightforward idea of exploring R–O compatibility effects. Participants were required to perform as quickly as possible, for example, a strong key press to a red signal and a soft key press to a green signal. In the compatible R–O condition the strong key press was consistently followed by a loud tone and the soft key press by a soft tone. In the incompatible condition the action–outcome assignments were reversed. Although the tones were exclusively delivered after the required action had been initiated, the action–outcome compatibility nevertheless substantially influenced response times: On average, participants responded about 50 ms faster if the required actions resulted in compatible outcomes than if they resulted in incompatible outcomes. Since influences of possible associations between the response signals and the outcome tones were ruled out by a control experiment, "[the results] confirm the central assumption of IMH that anticipatory effect representations become endogenously activated for the purpose of response selection" (Kunde, 2001, p. 393).

To summarize, there is growing evidence that goal-oriented behavior in animals as well as in humans is to a great part determined by anticipations of the expected outcomes of available actions. Behavioral control by outcome anticipations necessarily presupposes the learning and representation of consistent R–O relations. The integration of discriminative stimulus conditions seems to be a secondary process by which R–O relations become conditionalized, that is, S–(R–O) representations are formed.

2.3 Anticipatory Behavioral Control: A Tentative Framework

Hoffmann (1993) proposed a tentative framework for the acquisition of behavioral competence that takes the primacy of R–O learning as well as the conditionalization of R–O relations on relevant situational contexts into account. The framework departs from the following basic assumptions (cf. Figure 3):

1. It is supposed that any voluntary action is preceded by an anticipation of to-be-reached outcomes. Hereby, a voluntary action is defined as performing an action to attain some desired outcome. Thus, a desired outcome, as general and imprecise as it might be specified in the first place, has to be represented in some way before a voluntary action can be performed.
current knowledge about the environment. RL techniques are applied to adapt behavior.

This section introduces ACS2, the current state-of-the-art of ACS including genetic generalization and further modifications. Moreover, ACS2 is compared to the theory of anticipatory behavioral control. First, a background of related artificial learning systems is provided.

3.1 Background

All learning systems that represent and utilize predictions of future states to adapt behavior are related to ACS2. One of the first approaches in this respect was pursued in Sutton's dynamical architecture Dyna (Sutton, 1991b). In Dyna an environmental model is learned for the further improvement of RL capabilities. With the learned environmental model, anticipatory behavioral processes can also be simulated. Whereas previous Dyna approaches usually explicitly stored each experienced situation–action–resulting-situation triple with statistics, ACS2 generalizes on-line over perceptual attributes. Thus, ACS2 is basically the next step in the general Dyna architecture. Several algorithms and processes introduced in this work actually stem from work on Dyna.

Holland (1990) proposed a somewhat similar idea in the learning classifier system framework (Holland, 1976; Lanzi, Stolzmann, & Wilson, 2000). The idea is to include tags in the message list (comparable to a feature vector) that allow the distinction of predictions, perceptions, actions, and so forth. Riolo (1991) integrated this concept in his CFSC2 system showing that the system is able to form a predictive environmental model and use the model to adapt behavior. However, CFSC2 did not apply any generalization mechanisms so that the learning classifier system spirit was somewhat lost. Moreover, the tags appear hard to handle and seem to cause more interference than benefit.

Other related systems with predictive environmental model representations include model-learning artificial neural networks (NNs) as well as anticipatory learning classifier systems (ALCSs). On the NN side, for example, Tani (1996) succeeded in the simulation of model-based learning on a mobile-robot platform. His recurrent neural net (RNN) succeeded in diminishing the state-prediction error. Moreover, it was shown that planning was possible once the model was present in the RNN and the net was situated in the environmental context. Problems appeared to be scalability and reliability of the model-learning approach as well as the difficulty of determining the accuracy of the predictions.

On the ALCS side, Drescher (1991) provided an early approach (not yet calling the system an ALCS). Based on Piagetian development theory he developed a schema mechanism that forms a generalized environmental model on-line. He was able to show interesting developmental stages in his system, drawing relations to the Piagetian theory of development. However, his system did not prove to be robust, as can be seen in his limited experimental results. Further investigation of Drescher's ideas, however, seems worthwhile.

Recently, another ALCS, termed YACS, has been introduced that applies different learning mechanisms but evolves a similar model (Gérard & Sigaud, 2001b). Also a generalization mechanism was added to YACS that proved to evolve maximally compact environmental representations (Gérard & Sigaud, 2001a). It is necessary to study further the differences between YACS and the current ACS2 system. Although more research has been published with ACS2 and more problems solved, YACS has been shown to solve certain maze tasks with a smaller number of overall classifiers. The size of the environmental model, however, is similar in both systems.

Another ALCS system is the dynamic expectancy model (Witkowski, 1997). The system builds an environmental model consisting of rules, similar to ACS2. However, although Witkowski mentions a generalization mechanism, the mechanism has not been applied in the provided results. Interesting animat behavior has nevertheless been shown as, for example, extinction behavior in Witkowski (2000).

Finally, we want to mention Robert Rosen's contribution to the notion of anticipations. His book on anticipatory systems (Rosen, 1985) was the first contribution that approached anticipations from a mathematical perspective. Later, Rosen sees anticipations as a necessary ingredient in the manifestation of life (Rosen, 1991). Anticipations allow a new kind of complexity that is mandatory for living beings. These propositions might sound rather strong and we do not pursue them any further herein. We rather intend to contribute to the general idea and importance of anticipations in adaptive behavior. For this, we give an
overview of the investigated ACS2 system in the next sections.

3.2 Agent and Knowledge Representation

Similar to other agent architectures, ACS2 autonomously interacts with an environment. In a behavioral act at a certain time t, the agent perceives (or senses) a certain situation σ(t) ∈ {ι1, …, ιm}^L (where L denotes the string length of a sensed situation, m denotes the number of possible values for each attribute in σ(t), and ιi denotes a possible value). The system then acts upon the environment executing an action α(t) ∈ A (where A denotes the set of all possible actions). After the execution of α(t), the environment provides a scalar reinforcement R.

While interacting, ACS2 iteratively learns a predictive model of the encountered environment. The model is represented by a population [P] of condition–action–effect rules, that is, the classifiers. Each classifier predicts action effects given the specified condition. A classifier in ACS2 always specifies the state of all resulting sensory attributes. It consists of the following main components:

• Condition part (C) specifies the set of situations in which the classifier is applicable.
• Action part (A) proposes a possible action.
• Effect part (E) predicts the effects of the proposed action in the specified conditions.
• Quality (q) measures the accuracy of the predicted effects.
• Reward prediction (r) estimates the long-term reinforcement encountered after the execution of action A in condition C.
• Immediate reward prediction (ir) estimates the direct reinforcement encountered after execution of action A in condition C.

C and E consist of the values perceived from the environment and "#" symbols (i.e., C, E ∈ {ι1, …, ιm, #}^L). A # symbol in C, called the don't care symbol, denotes that the classifier matches any value in this attribute. A # symbol in E, called the pass-through symbol, specifies that the classifier predicts that the value of this attribute will not change after the execution of the specified action. Non-pass-through symbols in E anticipate the change of the particular attribute to the specified value. The action part A specifies any action possible in the environment. The measures q, r, and ir are scalar values where q ∈ [0,1], r ∈ ℜ, and ir ∈ ℜ. A classifier with a quality q greater than the reliability threshold θr (usually set to 0.9) is called reliable and becomes part of the internal environmental model. A classifier with a quality q lower than the inadequacy threshold θi (usually set to 0.1) is considered as inadequate and is consequently deleted. The immediate reward prediction ir is separated from the usual reward prediction r to enable proper internal RL updates. All parts are modified according to an RL mechanism, and according to two model-learning mechanisms specified in Section 3.3.

Additionally, each classifier comprises a mark (M) that records the values of each attribute of all situations in which the classifier did not predict correctly sometimes. The mark has the structure M = (m1, …, mL). Each attribute mi ⊆ {ι1, …, ιm} records all values at position i of perceptual strings in which the specified effect did not take place after execution of action A. Moreover, each classifier specifies a genetic algorithm (GA) time stamp tga, an anticipatory learning process (ALP) time stamp talp, an application average aav, an experience counter exp, and a numerosity num. The two time stamps record the time of the last learning module applications. The application average estimates the frequency with which the classifier is updated (i.e., part of an action set). The experience counter counts the number of applications. The numerosity denotes how many identical classifiers this macroclassifier represents.

3.3 Learning Processes

Figure 4 illustrates the interaction of ACS2 with its environment and its learning application in further detail. After the perception of the current situation σ(t), ACS2 forms a match set [M] comprising all classifiers in the population [P] whose conditions are satisfied in σ(t). Thus, [M] holds the complete predictive knowledge for the current situation. Next, an action α(t) is chosen according to the applied behavioral policy. Herein, a simple ε-greedy strategy is applied, as is often used in RL (Sutton & Barto, 1998).
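To make the representation concrete, the following Python sketch renders the classifier structure and the match-set and action-set formation described above. It is an illustration only under our own naming conventions (Classifier, matches, form_match_set, form_action_set), not the original ACS2 implementation, and it omits all learning machinery.

```python
from dataclasses import dataclass, field

@dataclass
class Classifier:
    condition: str            # e.g., "#1#0"; '#' is the don't-care symbol
    action: int               # one of the possible actions
    effect: str               # e.g., "##1#"; '#' is the pass-through symbol
    q: float = 0.5            # quality of the effect prediction
    r: float = 0.0            # (long-term) reward prediction
    ir: float = 0.0           # immediate reward prediction
    mark: list = field(default_factory=list)  # situations with incorrect predictions
    t_ga: int = 0             # genetic algorithm time stamp
    t_alp: int = 0            # anticipatory learning process time stamp
    aav: float = 0.0          # application average
    exp: int = 0              # experience counter
    num: int = 1              # numerosity of the macroclassifier

    def matches(self, situation: str) -> bool:
        """A '#' in the condition matches any value; other symbols must be equal."""
        return all(c == '#' or c == s for c, s in zip(self.condition, situation))


def form_match_set(population, situation):
    """[M]: all classifiers whose conditions are satisfied in the current situation."""
    return [cl for cl in population if cl.matches(situation)]


def form_action_set(match_set, action):
    """[A]: all classifiers of [M] whose action equals the chosen action."""
    return [cl for cl in match_set if cl.action == action]
```

Note that a condition consisting only of '#' symbols matches every situation, whereas an effect consisting only of '#' symbols predicts no change at all; classifiers of the latter kind are ignored in the reward update of Section 3.4.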
Figure 4 During one agent/environment interaction, ACS2 forms a match set representing the predictive knowledge with respect to the current perceptions. Next, it generates an action set representing the knowledge about the consequences of the chosen action in the given situation. Classifier parameters are updated by reinforcement learning (RL) and the anticipatory learning process (ALP). Moreover, new classifiers might be added and old classifiers might be deleted by genetic generalization and ALP.
With respect to α(t), an action set [A] is generated that consists of all classifiers in [M] whose action equals α(t). Thus, [A] comprises the predictive knowledge about the chosen action given the current situation. After the execution of α(t) and the reception of reinforcement ρ(t), classifier parameters are updated by the ALP and the applied RL technique, and new classifiers might be generated as well as old classifiers deleted by the ALP and the genetic-generalization process.

The basic learning mechanisms are two interacting model-learning mechanisms and one RL mechanism. The ALP is the specializing component of the model-learning mechanism. The ALP evaluates rules and detects which rules are over-general. Once an over-general rule is detected, specialized offspring is generated. Genetic generalization, on the other hand, is an indirect generalization procedure. Accurate classifiers are chosen for generating generalized offspring. In turn, over-specialized as well as inaccurate classifiers are deleted.

3.3.1 Anticipatory Learning Process

The ALP updates the quality q, the mark M, the ALP time stamp talp, the application average aav, and the experience counter exp. The quality q is updated according to the classifier's anticipation. If a classifier correctly specified changes and nonchanges, called the expected case, its quality is increased (q ← q + β(1 – q)). If the classifier specifies an incorrect effect, termed the unexpected case, its quality is decreased (q ← q – βq). Parameter β ∈ [0,1] denotes the learning rate of ACS2.

Additional to the parameter updates, the ALP generates specialized offspring and/or deletes inaccurate classifiers. Specialized classifiers are generated in two cases. In the expected case, a classifier might be generated if the mark M differs from the situation σ(t) in some attributes. This means that the classifier previously encountered situation(s) (characterized by the mark) in which its predictions were incorrect. Thus, the condition of the new classifier is specialized in those differing attributes. In the unexpected case, a classifier is generated if the effect part of the classifier can be further specialized (by changing pass-through symbols to specific values) to specify the perceived effect correctly. All positions in condition and effect part are specialized that change from σ(t) to σ(t + 1).

A classifier is also generated if there was no classifier in the actual action set [A] that anticipated the effect correctly. In this case, covering applies, in which a classifier is generated that is specialized in all attributes in condition and effect part that changed from σ(t) to σ(t + 1).

The attributes of the mark M of a new classifier are initially empty. Quality q is set to 0.5 in the covering case and is inherited from the parental classifier (minimally set to 0.5) in the other reproduction cases. Reward prediction r and immediate reward prediction ir are set to 0 in the covering case but are inherited from the parent in the other cases. For further details on the learning process please refer to Stolzmann (2000) and Butz (2002).
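The two quality updates above can be written down compactly. The following sketch is a simplified illustration that assumes the string-based Classifier sketched earlier: detecting the expected case literally follows the description of change and non-change anticipation, while marking, offspring generation, and covering are omitted. The value of β is an arbitrary placeholder.

```python
BETA = 0.05  # learning rate beta; an assumed placeholder value

def anticipates_correctly(cl, sigma_t, sigma_t1):
    """Expected case: every pass-through attribute stays unchanged and every
    specified effect attribute changes to exactly the predicted value."""
    for e, before, after in zip(cl.effect, sigma_t, sigma_t1):
        if e == '#':
            if before != after:          # a non-change was predicted but a change occurred
                return False
        elif after != e or before == after:  # predicted change missing or wrong
            return False
    return True

def alp_quality_update(cl, sigma_t, sigma_t1, beta=BETA):
    """ALP quality update for one classifier of the action set."""
    if anticipates_correctly(cl, sigma_t, sigma_t1):
        cl.q += beta * (1.0 - cl.q)      # expected case: q <- q + beta(1 - q)
    else:
        cl.q -= beta * cl.q              # unexpected case: q <- q - beta*q
```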
3.3.2 Genetic Generalization Mechanism

Although the ALP specializes classifiers in a quite competent way, over-specializations can occur sometimes, as Butz (2002) has studied. Since the over-specialization cases can be caused by various circumstances, a genetic generalization (GG) mechanism was applied that, interacting with the ALP, results in the evolution of a complete, accurate, and maximally general model. The basic framework of the genetic algorithm was derived from Wilson's accuracy-based learning classifier system XCS (Wilson, 1995). The mechanism works as follows.

After the application of the ALP, it is determined if the mechanism should be applied. Classifiers are reproduced in the action set [A] proportionally to their quality value q. Reproduced classifiers are crossed and mutated in the conditions. Hereby, a generalizing mutation is applied that randomly changes specialized attributes back to don't-care symbols. If a generated classifier already exists in the population, the new classifier is discarded and, if the existing classifier is not marked, its numerosity is increased by one. If no identical classifier exists, the quality q of the new classifier is decreased by 0.5 and it is inserted in the population. If an action set [A] exceeds the action set size threshold θas, excess classifiers are deleted in [A]. Deletion causes the extinction of low-quality as well as over-specialized classifiers.

3.3.3 Subsumption

To further emphasize a proper model convergence, subsumption is applied similarly to the subsumption method in XCS (Wilson, 1998). If a new classifier is generated, regardless of whether by ALP or GG, the set is searched for a subsuming classifier. The new classifier is subsumed if a classifier exists that is more general in the conditions, specifies the same effect, is reliable (its quality is higher than the threshold θr), is not marked, and is experienced (its experience counter exp is higher than the threshold θexp). If there exists more than one possible subsumer, the subsumer with the most don't-care symbols is chosen. In the case of a draw, the subsumer is chosen at random. If a subsumer was found, the new classifier is discarded and either quality or numerosity of the subsumer is increased, dependent on whether the new classifier was generated by ALP or GG, respectively.

3.3.4 Interaction of ALP and GG

Several distinct studies in various environments revealed that the interaction of ALP and GG is able to evolve a complete, accurate, and maximally general model in various environments in a competent way (cf. Butz, Goldberg, & Stolzmann, 2000; Butz, 2002). The basic idea behind the interacting model-learning processes is that the specialization process extracts as much information as possible from the encountered environment, continuously specializing over-general classifiers. The GG mechanism, on the other hand, randomly generalizes, exploiting the power of a genetic algorithm where no more additional information is available from the environment. The ALP ensures diversity and prevents the loss of information of a particular niche in the environment. Only GG generates identical classifiers and causes convergence in the population.

3.4 Behavioral Policy

The behavioral policy of ACS2 is directly represented in the evolving model. Each classifier specifies the reward prediction estimate r and the immediate reward prediction estimate ir, which control behavior. Thus, the reward estimates are dependent on the structure of the classifiers so that the environmental model as a whole needs to be specific enough to prevent misleading averaging of the estimates. Only if no averaging takes place is it assured that the classifier population can represent an optimal policy within the predictive model. If averaging takes place, model aliasing (Butz, 2002) might prevent the evolution of an optimal policy as previously identified in different contexts (e.g., Whitehead & Ballard, 1991; Dorigo & Colombetti, 1997).

As visualized in Figure 4, the reward-related parameters r and ir are updated after the action is executed, the next environmental situation perceived, and the subsequent match set formed. The update combines immediate reinforcement ρ(t) with discounted future reward:

r ← r + β( ρ(t) + γ max_{cl ∈ [M](t+1) ∧ cl.E ≠ {#}^L} (cl.q · cl.r) − r )    (1)
4 Stimulus-Dependent Response–Effect Relations

To validate the idea of anticipations controlling behavior, ACS2 is now and in the next section compared to results of three psychological experiments previously conducted with rats. The intention is to show that ACS2 is able to mimic animal behavior as well as that anticipatory behavioral control is necessary for simulating similar behavior. The performance of ACS2 is compared to other artificial learning frameworks as well. This section introduces a rat experiment published in Rescorla (1990). The experiment is simulated and the performance of ACS2 is evaluated.

4.1 The Rat Experiment: Hierarchical S–(R–O) Relations

The major intention of Rescorla (1990) was to evaluate whether hierarchical [S–(R–O)] relations are formed in rats. To evaluate this suspicion, Rescorla trained rats with a standard procedure teaching various R–O relations with respect to discriminative stimuli.

Figure 5 shows the experimental setup. The experiment was subdivided into three stages. During the first stage, each animal was trained with three stimuli [i.e., light (L), noise (N), and tone (T)] in which two different responses (i.e., pressing a lever or pulling a chain) were reinforced with one of two outcomes. In the presence of light (L), the associative R–O relations R1–O1 and R2–O2 were in effect, each of which was also in effect in one of the auditory stimuli. Thus, during the first stage, L shared the R1–O1 relation with N and it shared the R2–O2 relation with T. Furthermore, inter-trial intervals (ITI) were presented in which no stimulus was present and no action had any effect. During stage 2, neither action had any effect and only light stimuli were presented. Thus, the learned R1–O1 and R2–O2 were extinct during that stage. In the first and second stages, always either chain or lever was present but not both. Finally, in stage 3 the actions of the rats were monitored under the auditory stimuli in the presence of chain and lever. Actions again did not cause any effect. Rescorla (1990) supposed that only if the rats form hierarchical S–(R–O) relations could the extinction phase affect the preference during the test phase to execute that action that previously produced a not-extinct R–O relation. For further details on the rat experiment the interested reader is referred to the cited article.

The suspicion was confirmed. Rats significantly prefer that action that resulted in the R–O relation during phase 1 that was not extinct during phase 2, as depicted in Figure 6. Thus, R–O relations must have been formed that are extinct rather independently of situational context. It was also observed that the mean response time during stimulus presentation is significantly higher than during the ITI, which confirms that the rats learned that a stimulus is necessary to obtain reinforcement successfully. Moreover, during the second half of the test phase the response frequency declined significantly on stimulus presentation.

4.2 Simulating the Experiment with ACS2

To specify the simulation of the rat experiment with ACS2, we need to define perceptions, actions, reinforcement, and the length of each experimental stage.
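As an orientation for the simulation setup, the stage-1 contingencies described in Section 4.1 can be summarized in a small lookup table. This is purely illustrative; the names are ours, and the assumption that a response without a listed outcome has no effect under the given stimulus is also ours.

```python
# Stage-1 contingencies of the Rescorla (1990) experiment as described above:
# light shares R1-O1 with noise and R2-O2 with tone.
STAGE_1 = {
    "light": {"R1": "O1", "R2": "O2"},
    "noise": {"R1": "O1"},
    "tone":  {"R2": "O2"},
    "ITI":   {},                      # inter-trial interval: no action has any effect
}

def outcome(stimulus: str, response: str):
    """Outcome produced by a response under a stimulus during stage 1 (or None)."""
    return STAGE_1.get(stimulus, {}).get(response)

assert outcome("noise", "R1") == "O1"
assert outcome("tone", "R1") is None
```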
reinforcement representation in the classifiers. ACS2 generates classifiers that specify accurate action–effect relations with maximally general conditions. In this experiment, ACS2 forms a classifier that specifies that if either L or N is present, O1 will follow R1. Similarly, it forms a classifier that specifies that if either L or T is present, O2 will follow R2. When R1 and R2 are now devalued in the L condition, R1 is consequently also devalued in the N condition and R2 is devalued in the T condition. Thus, ACS2 makes the distinction.

For a classifier to represent L or N in its condition part in the chosen coding, it can only specify ¬T since an explicit or representation is not possible in the conditions of classifiers in ACS2 right now. Thus, the result is only obtainable if no ITI is simulated. In a simulation with ITI, ¬T is also applicable in the ITI and consequently not sufficient to represent the relation. This suspicion was confirmed in experiments with ITI in which ACS2 does not exhibit any differentiation between the same and different R–O relations. Moreover, when not applying genetic generalization in the setting without ITI, the result was not achievable, either. The anticipatory learning process usually generates the individual classifiers as well as the classifier with condition ¬T. In the test phase, the classifier that specifies N–R1–O1 overrules the more general but devalued classifier ¬T–R1–O1 so that the distinction does not apply.

Several important observations were made in this simulation. First, ACS2 exhibits an implicit S–(R–O) structure since it differentiates between same and different R–O relations depending on S in the test phase. Second, emergent behavior results from the interaction of the reinforcement representation in classifiers and the on-line generalized model. Although the generalized representation might not be comparable to the rats (the rats most probably did not specify that if not T then R1–O1, but rather if L or N, then R1–O1), it showed that the S–(R–O) structures can also be obtained without any explicit hierarchical structure. Finally, the results were obtained independent of parameter settings. Thus, the results point to the plausibility of the learning mechanism and the theory of anticipatory behavioral control.

As a final point it is interesting to see how other learning systems would behave. In model-free RL approaches as well as model-based RL approaches without on-line generalization the transfer would not be possible at all since training, extinction, and test phase differed in the setup structure (either lever or chain was present during training and extinction but both were present during testing). However, even if the simulation had been conducted in a way that both manipulanda were always present, model-based RL would not be able to show similar behavior since it would learn all situation–action–effect relations exemplarily. For on-line generalizing model-free RL mechanisms such as previous learning classifier systems (Holland, 1976; Lanzi et al., 2000), the system would not distinguish between the different outcomes and would backpropagate simple reinforcement. Thus, a learning classifier system would not distinguish between the outcomes. The comparison stresses the importance of a predictive model representation in combination with on-line generalization. Moreover, it points out the necessary distinction between conditions, actions, and effects. Only due to the conditionalized generation of action–effect associations could behavior match with the rat behavior.

5 Explicit Anticipations Influence Behavior

Whereas the previous section showed emergent anticipatory behavior in ACS2, this section shows how the evolving generalized environmental model can be used to distribute reinforcement internally. It is shown that reinforcement values can be adapted to draw conclusions that are appropriate but would not have been possible without the generalized anticipatory model. In more psychological terms, it is shown that ACS2 is able to use its internal generalized environmental model for distinct cognitive processes that allow a "mental" adaptation of behavior.

The study herein is mainly based on the work published in Stolzmann, Butz, Hoffmann, and Goldberg (2000). Due to the changes from ACS to ACS2, though, some parts of the additional mechanisms have changed. Moreover, genetic generalization is applied throughout. To evaluate the mental adaptation possibilities, ACS2 is tested in a simulation of the two rat experiments published by Colwill and Rescorla (1985, 1990) introduced in Section 2.1.

This section recapitulates the response–effect experiment by Colwill and Rescorla (1985) and stresses its peculiarity. Next, anticipatory mechanisms are introduced to ACS2 to enable the system to draw
Figure 7 In three different settings, rats preferred the action that previously led to the still-valued reinforcer to the one that led to the now less-valued one in the Colwill and Rescorla (1985) experiments. In the first and second setting, one reinforcer was devalued by pairing its consumption with LiCl; in the last experiment one reinforcer was sated.
mental conclusions. Finally, performance of ACS2 is revealed in the simulation of Colwill and Rescorla (1985) as well as in the simulation of the more difficult stimulus–response–effect experiment (Colwill & Rescorla, 1990).

5.1 Response–Effect Learning Task

The herein investigated response–effect learning task was originally done with rats by Colwill and Rescorla (1985). Section 2.1 already revealed the basic implications of the experiment. The intention was to investigate if and in what way rats evolve response–effect (R–O) relations.

Figure 1 gives an abstract view of the experiment. Rats were tested in a three-stage experiment. First, they were taught to execute two distinct possible actions R1 and R2 (pressing a lever and pulling a chain). One action led to one type of (positive) reinforcer (sucrose) and the other to a different (positive) reinforcer (food pellet). Next, without the presence of lever or chain, reinforcers were provided separately and one of the reinforcers was devalued. Finally, the rats were tested on whether they would choose to press the lever or pull the chain, which were simultaneously present during testing. All three slightly different experimental settings in the original work showed that the rats preferred the action that previously led to the non-devalued reinforcer during the test phase. Figure 7 shows the performance of the rats during the test phase in all three settings. Additional to the observed successful distinction during testing, the rats also showed a decrease in response frequency during testing. Moreover, sucrose was always more appealing than food pellets. Finally, also in the last experiment, in which one reinforcer was supplied until the rats were sated, the rats showed the basic distinction. Only motivational influences, that is, the motivation to go for the not-sated reinforcer, could have triggered the difference in this case.

The experiment shows that rats must have formed context-independent response–outcome associations that control behavior. Once an outcome is devalued, the associations that lead to the devalued outcome are (possibly implicitly) devalued as well so that the rats prefer to execute that action that led in phase 1 to the outcome that was not devalued in phase 2.

This outcome-dependent action selection can be obtained neither by any model-free RL mechanism, nor by model-based RL approaches without on-line generalization. Model-free RL fails since it relies on a direct interaction with the environment for learning but the connection "action (pressing or pulling) leads to the devalued reinforcer" is never encountered on-line. Model-based RL can learn this association since reinforcement can be propagated internally by means of the learned predictive model (e.g., Sutton, 1991b). However, only model-based approaches that general-
ize on-line over perceptual attributes are able to solve the transfer task since each experimental stage slightly differs in its setup. Note that on-line generalization is mandatory. Approaches that pregeneralize the input space before learning, such as tile coding approaches (e.g., Kuvayev & Sutton, 1996), cannot solve the problem since they would learn three different models for the three stages and consequently would not be able to draw the appropriate conclusion. (It is impossible to provide an identical coding for each stage in this experiment since it is essential that no manipulanda are present during the devaluation phase.)

Without any further enhancements, ACS2 is not able to solve the task, either. To this point, the reinforcement distribution is only done during interaction with the environment. Moreover, the policy is only based on the reward prediction and the quality of the evolving environmental model. The remainder of this section shows that ACS2 can be enhanced to adapt its behavioral policy further, exploiting the generalized, internal environmental model. Hereby, reinforcement is distributed internally, termed mental acting, or explicit anticipations influence the behavioral policy, termed lookahead action selection. ACS2 is able to solve the task with either anticipatory mechanism.

5.2 Mental Acting

In the mental acting approach, the classifier's reward prediction value r is updated internally (i.e., without environmental interaction). Anticipated events are formed in which reward predictions are evaluated and modified in the classifiers. Thus, the behavior of ACS2 is altered by executing mental actions.

Sutton (1991b) applied a similar approach to the Dyna architecture. He showed that it is possible to adapt behavior faster in static environments, and further, to achieve a faster adaptivity in dynamic environments. The environmental model was stored in a completely specialized, tabular form. The algorithm randomly updated state–action pairs by anticipating the next state and backpropagating the highest Q-value additional to the expected direct reward.

Due to the on-line generalized model in ACS2, the internal update process needs to be modified. First, since classifiers usually only specify parts of the perceptual attributes in their condition parts, classifiers usually predict a set of possible next states and not an exact situation–action–resulting-situation triple. Second, the prediction of the next state is only valid to a degree expressed in the quality of the classifier. Finally, transitions are often represented by more than one classifier. Thus, it is necessary to assure that the relation between the classifier whose reward prediction r is updated and the classifier(s) that cause the update is reliable.

A mental action is realized by comparing effect parts of classifiers with condition parts of other classifiers. Figure 8 shows the applied one-step mental acting algorithm in pseudo code. The algorithm forms a link set [L] that restricts the update to reliable classifier relations.

The algorithm only updates reliable classifiers that anticipate changes. This restricts the updates to meaningful ones and makes sure that only sufficiently stable action–outcome relations are modified. The link set [L] includes all classifiers that could take place after a successful execution of classifier cl. The restriction to only those classifiers that actually explicitly specify the attributes in C that are specified in cl.E is rather strong. However, this restriction proved to be necessary in the investigated tasks. Allowing looser connections did not result in the desired learning effect. The one-step mental acting algorithm is executed after each real executed action. The number of executions is specified in the experimental runs.
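Since the pseudocode of Figure 8 is not reproduced here, the following sketch only mirrors the textual description: it restricts updates to reliable, change-anticipating classifiers and forms the link set [L] from classifiers whose conditions explicitly specify the attributes given in cl.E. The internal backup rule itself (combining cl.ir with the best discounted q·r value of the link set) is our Dyna-style assumption, and all names are ours.

```python
import random

THETA_R = 0.9  # reliability threshold theta_r

def anticipates_change(cl):
    """A classifier anticipates a change if its effect part is not completely general."""
    return any(e != '#' for e in cl.effect)

def links_to(cl, other):
    """other may apply after cl: every attribute specified in cl's effect part must be
    explicitly specified with the same value in other's condition part."""
    return all(o == e for e, o in zip(cl.effect, other.condition) if e != '#')

def one_step_mental_act(population, beta=0.05, gamma=0.95):
    """One internal (environment-free) reward update over reliable classifiers."""
    reliable = [cl for cl in population if cl.q > THETA_R and anticipates_change(cl)]
    if not reliable:
        return
    cl = random.choice(reliable)
    link_set = [other for other in reliable if links_to(cl, other)]   # link set [L]
    if not link_set:
        return
    max_qr = max(other.q * other.r for other in link_set)
    # Dyna-style internal backup; the concrete rule of Figure 8 is not shown in the text.
    cl.r += beta * (cl.ir + gamma * max_qr - cl.r)
```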
In more cognitive terms, mental acting is comparable to a thought process that takes place independently of the current (outside) environment such as mental problem solving, the imagination of certain events, or even dreaming. Dreaming is recently more and more recognized as a fundamental consolidation process in learning (Stickgold, 1998), which is indeed what mental acting is doing. Mental acting causes the consolidation of memory, that is, the consolidation of utility measures represented in reward prediction values.

Before we validate mental acting, another approach to the problem is introduced that modifies the policy determination.

5.3 Lookahead Action Selection

While mental acting influences action selection only indirectly, lookahead action selection forms explicit outcome anticipations before action execution. With respect to the theory of anticipatory behavioral control (Section 2.3) this approach explicitly realizes the first point of the theory. All possible action outcome representations are formed when performing lookahead action selection. The reinforcement prediction in the outcome, then, influences action selection.

The actual algorithm is derived from the idea of a tag-mediated lookahead (Holland, 1990) and the successive implementation in CFSC2 (Riolo, 1991). Although ACS2 already demonstrated its capability of generating plans in the above section about model-learning improvement, the possibility of lookahead has not yet been combined with the reinforcement learning procedure. This is the aim of the process in this section. Instead of selecting an action according to the highest qr value in the current match set [M], an action is now selected according to the currently best qr value for each possible action combined with the best qr value in the anticipated resulting state. The action selection algorithm is specified in Figure 9.

First, the algorithm generates an action array of the usual values considered for action selection. Next, the result of each action is predicted, and the highest qr value in the consequent set of matching classifiers is used to update the action values in the action array. Note, as before for the best qr values, only classifiers that anticipate a change are considered. Finally, the algorithm chooses the consequent best action in the resulting action array.

In combination with the applied ε-greedy policy, instead of executing the best action, as considered previously during exploitation, the algorithm chooses the best lookahead action for execution. For now, the algorithm is a one-step lookahead procedure. Deeper versions are possible. An animat could, for example, determine how much time it can afford to invest in a deeper action selection consideration and act accordingly. However, the computational costs, which increase exponentially with the depth, need to be considered. In the experiments herein, we leave the question of scale-up on the side and concentrate on the general effect on behavior.
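The one-step lookahead procedure can be sketched as follows, reusing the matches and anticipates_change helpers from the earlier sketches. How exactly the current and the anticipated qr values are combined is specified in Figure 9, which is not reproduced here, so the additive, discounted combination below is an assumption; the function name is ours.

```python
def predict_next(cl, situation):
    """Apply the effect part: pass-through symbols keep the perceived value."""
    return ''.join(s if e == '#' else e for e, s in zip(cl.effect, situation))

def lookahead_action_selection(population, situation, actions, gamma=0.95):
    """One-step lookahead: combine the best qr value of each action with the best
    qr value reachable in the anticipated resulting situation."""
    match_set = [cl for cl in population if cl.matches(situation)]
    best_action, best_value = None, float('-inf')
    for a in actions:
        candidates = [cl for cl in match_set
                      if cl.action == a and anticipates_change(cl)]
        if not candidates:
            continue
        best_cl = max(candidates, key=lambda cl: cl.q * cl.r)
        next_situation = predict_next(best_cl, situation)
        next_qr = max((cl.q * cl.r for cl in population
                       if cl.matches(next_situation) and anticipates_change(cl)),
                      default=0.0)
        value = best_cl.q * best_cl.r + gamma * next_qr   # combination rule assumed
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```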
Figure 10 In the simulation of the Colwill and Rescorla (1985) experiment, ACS2 is able to exploit its on-line
generalized environmental model for an adaptive behavior beyond model-free RL and off-line generalizing
model-based RL architectures. Regardless if lookahead action selection or mental acting is applied, ACS2
prefers that action that previously led to the not-devalued outcome.
5.4 ACS2 in the Response–Effect Learning Task

To validate the two anticipatory behavior approaches, the above-described environment is simulated. During the first phase, ACS2 can act upon a manipulandum and consume the possible resulting reinforcer. The consumption leads to a reinforcement of 1,000, the perception of the environment without the food, and the generation of a new trial. Either lever or chain is present in each trial during this phase. In the second phase, the presence of one type of reinforcer is indicated at random. The consumption of the devalued reinforcer leads to a reinforcement of 0 while the reinforcement of the still-valued reinforcer stays at 1,000. After a consumption one trial ends. In the final phase, both manipulanda are present, no action leads to any effect, and the selected actions are recorded.

Environmental situations are coded by four bits. The first two bits indicate the presence of either type of reinforcer and the second two bits indicate the presence of lever or chain. The phases were executed for 204, 100, and 50 trials, which approximately corresponds to the number of trials the rats experienced. Parameter settings are identical to the ones above and the curves are again averaged over 1,000 runs.

Figure 10 shows that ACS2 is able to exploit its environmental model to simulate anticipation-controlled behavior. Regardless of whether mental acting, lookahead action selection, or both are applied, ACS2 consistently distinguishes the action that leads to the devalued reinforcer from the still-valued one. The results show that ACS2 sufficiently generalizes the model to make the appropriate conclusions.

In addition to the confirmation of the distinction, several behavioral characteristics can be observed. Similar to the rats, ACS2 decreases its distinction between the two actions during testing. In the mental acting applications with different steps, the distinction drops off faster. In the testing phase, the quality values q of the classifiers that specify the provision of one or the other reinforcer after pulling or pushing decrease under the reliability threshold since during testing no action has any effect. Thus, the mental updates no longer take place and the distinction between the two actions decreases faster than in the lookahead action selection case, in which anticipations are also formed with classifiers that are not reliable. In both cases, the distinction between better and worse action decreases as observed in the rats. Eventually, ACS2 does not distinguish between the two actions at all since it learns that the actions no longer have any effect. Again, we confirmed the consistent distinction in different parameter settings for ε and θga that always showed a similar distinction between the two actions. Thus, although the degree of distinction might be dependent on parameter settings, the distinction per se as well as the decrease in the distinction consistently applies throughout.
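The four-bit situation coding just described can be illustrated with a small helper; the bit order, the helper name, and the example perceptions are our assumptions.

```python
# Four-bit situation coding of the response-effect simulation (Section 5.4):
# the first two bits mark the presence of the two reinforcer types, the second
# two bits the presence of lever and chain.
def encode(sucrose, pellet, lever, chain):
    return ''.join('1' if flag else '0' for flag in (sucrose, pellet, lever, chain))

# A phase-1 trial with only the lever present and no reinforcer yet:
assert encode(False, False, True, False) == "0010"
# A test-phase trial with both manipulanda present:
assert encode(False, False, True, True) == "0011"
```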
Figure 11 The results in the simulation of the stimulus–response–effect experiment show that ACS2 is able
to further adapt its behavior, differentiating between different stimuli, similar to the differentiation observed in
rats. Again, adaptive behavior beyond model-free RL approaches or off-line generalizing model-based RL
architectures is achieved.
schematically. During the first phase, an additional ing application. Due to the additional situational
discriminative stimulus (noise or light) was presented dependencies, mental acting is not as effective as in
that altered the response–effect pairing. During the the first experiment since more connections can be
test phase one or the other discriminative stimulus was updated.
presented at random. Also, the first phase was altered The results confirm again the efficiency and use-
in that at first either one or the other manipulandum fulness of the evolving generalized environmental rep-
was present and later both manipulanda were present. resentation. Anticipation-influenced behavior is able
Although with a slightly lower effect, the rats again to mimic animal behavior, which would not be possi-
preferred the presumably better action during testing, ble with previous mechanisms or an ALCS without
as shown in Figure 11. processes similar to mental acting or lookahead action
To code the two additional discriminative stimuli, selection. Also the on-line generalization is manda-
two bits are added to the previously used coding that tory since otherwise the knowledge transfer from the
indicate the presence of either the noise or the light devaluation phase to the test phase would not have
stimulus. Moreover, the first phase is altered in accord- been possible at all. Moreover, it shows the necessary
ance with the rat experiment by executing 64 trials specialization of situational dependencies—the third
with either one or the other manipulandum present and point of the anticipatory behavioral control theory.
a further 174 with both manipulanda present (the num- Both simulations show that the representation of a
bers again roughly correspond to the number of trials predictive environmental model in combination with
the rats experienced). The second phase is executed for on-line generalization of the model is a prerequisite
100 trials and the test phase for 50 trials. for a successful simulation of rat behavior. Moreover,
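As an illustration of this kind of coding, the following minimal Python sketch assumes a hypothetical four-attribute layout with the two manipulanda followed by the two discriminative-stimulus bits; the attribute names and positions are assumptions made for clarity only, not the exact coding used in the reported simulations.

# Hypothetical illustration of a bit-string situation coding.
# The attribute layout (lever, chain, noise, light) is an assumption made
# for clarity; it is not the exact coding used in the reported simulations.

def encode_situation(lever: bool, chain: bool, noise: bool, light: bool) -> str:
    """Return a symbolic perceptual string for classifier matching.

    The first two attributes mark the presence of the two manipulanda;
    the last two are the additional discriminative-stimulus bits
    (noise and light) described above.
    """
    return "".join("1" if flag else "0" for flag in (lever, chain, noise, light))

# Example: a first-phase trial with only the lever present and the noise on.
print(encode_situation(lever=True, chain=False, noise=True, light=False))  # "1010"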
The behavior of ACS2 during testing is visualized in Figure 11. Results are averaged over 1,000 experiments and the parameters are set as specified above. The graphs confirm that ACS2 is able to distinguish discriminative stimuli, exploit the generalized model, and consequently adapt its behavior appropriately. In the results, the lookahead action winner method results in a much stronger effect than the mental acting application. Due to the additional situational dependencies, mental acting is not as effective as in the first experiment since more connections can be updated.
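The lookahead action winner principle can be pictured as a one-step use of the learned model: for each action the model anticipates the resulting outcome, and the action whose anticipated outcome is currently most desirable is selected. The following Python sketch is a schematic stand-in under assumed names (predict, outcome_value), not the actual ACS2 routine.

# Schematic sketch of one-step lookahead action selection, not the exact
# ACS2 routine. The learned model anticipates the effect of each action and
# the action with the most desirable anticipated outcome wins; a devalued
# outcome therefore loses its attraction immediately.

from typing import Callable, Iterable

def lookahead_action_winner(
    situation: str,
    actions: Iterable[str],
    predict: Callable[[str, str], str],     # learned model: (situation, action) -> anticipated outcome
    outcome_value: Callable[[str], float],  # current desirability of an anticipated outcome
) -> str:
    """Pick the action whose anticipated outcome is currently most desirable."""
    return max(actions, key=lambda a: outcome_value(predict(situation, a)))

# Usage sketch with hypothetical stand-ins: the pellet reinforcer is devalued,
# so the action predicted to produce it is no longer chosen.
model = {("1100", "press_lever"): "pellet", ("1100", "pull_chain"): "sucrose"}
values = {"pellet": 0.0, "sucrose": 1.0}
print(lookahead_action_winner(
    "1100", ["press_lever", "pull_chain"],
    predict=lambda s, a: model[(s, a)],
    outcome_value=lambda o: values[o],
))  # pull_chain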
The results confirm again the efficiency and usefulness of the evolving generalized environmental representation. Anticipation-influenced behavior is able to mimic animal behavior, which would not be possible with previous mechanisms or an ALCS without processes similar to mental acting or lookahead action selection. Also the on-line generalization is mandatory since otherwise the knowledge transfer from the devaluation phase to the test phase would not have been possible at all. Moreover, it shows the necessary specialization of situational dependencies—the third point of the anticipatory behavioral control theory.

Both simulations show that the representation of a predictive environmental model in combination with on-line generalization of the model is a prerequisite for a successful simulation of rat behavior. Moreover, an additional anticipatory mechanism is necessary that influences behavior in an anticipatory fashion. In our simulations the two distinct mechanisms can cause the same behavioral effect. Whether one, or the other, both, or a different mechanism might take place in the rats is certainly not derivable from the results. However, what can be derived is that some anticipatory mechanism that influences behavior must be present.
6 Summary and Conclusions

This article provided evidence for anticipation-controlled behavior from the psychological side in exemplar animal and human experiments. Latent learning in rats suggested learning beyond the basic stimulus–response assumption in behaviorism long ago. More recently, various outcome-devaluation experiments confirmed response–outcome representations in rats. In humans, anticipations have a definite influence on response speed. Other experiments were mentioned providing evidence for anticipatory influences in reasoning, learning, attention, and preparedness.

After the provision of evidence for anticipatory influences on behavior, we suggested a basic framework of anticipation-controlled behavior. It was suggested that (1) anticipations precede any voluntary act, (2) primarily action–outcome coincidences are learned, (3) situational dependencies are learned as a secondary process, (4) needs or desires of outcomes trigger action–outcome representations, and (5) certain stimuli cause the preparedness for action–outcome relations. The framework is partly realized in the anticipatory learning classifier system ACS2, whose performance was evaluated next. The behavioral evaluations in different rat experiments confirmed that anticipatory representations and on-line generalization are necessary to mimic rat behavior in various experimental setups. Not only was rat behavior mimicked but also behavior was achieved that is not possible with model-free reinforcement learning methods nor with model-based reinforcement learning approaches that do not generalize on-line. That is, a predictive environmental model needs to be learned while interacting with the environment and the model representation needs to be generalized over the provided sensory input while interacting with the environment.

The results allow the following conclusions. (1) To enable competent adaptive behavior, explicit anticipatory influences on behavior are necessary in certain tasks. (2) To be able to realize such behavioral influences, a predictive environmental model needs to be learned on-line. (3) Learning of such a model should primarily form action–effect relations that are conditionalized where necessary. (4) The predictive model representation needs to be generalized on-line over the provided perceptual input.
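Conclusion (3) can be illustrated with a schematic rule structure: an action–effect relation whose condition is only specialized where necessary, with a don't-care symbol ('#') elsewhere. The Python sketch below mirrors the general condition–action–effect form of rules in anticipatory classifier systems but is a simplified stand-in, not the actual ACS2 data structure.

# Schematic illustration of a condition-action-effect rule: the condition is
# only specialized ("conditionalized") where necessary, using '#' as a
# don't-care symbol, and the effect part predicts which attributes change.
# This is a simplified stand-in, not the actual ACS2 implementation.

from dataclasses import dataclass

@dataclass
class ActionEffectRule:
    condition: str   # e.g. "1###": '#' matches any attribute value
    action: str
    effect: str      # e.g. "0###": '#' means "attribute stays the same"

    def matches(self, situation: str) -> bool:
        return all(c in ("#", s) for c, s in zip(self.condition, situation))

    def anticipate(self, situation: str) -> str:
        return "".join(s if e == "#" else e for e, s in zip(self.effect, situation))

rule = ActionEffectRule(condition="1###", action="press_lever", effect="0###")
print(rule.matches("1010"), rule.anticipate("1010"))  # True "0010"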
In the future, it is necessary to evaluate the scaling behavior of the additional anticipatory approaches pursued herein. Mental acting might be rather expensive in larger tasks and also less effective since too many relations can be updated. Prioritized updates could be helpful as, for example, pursued in Moore and Atkeson (1993) or Kaelbling (1993). Salient situations (such as an unexpected result) could be remembered to further direct the internal reinforcement updates. Lookahead action selection looks only one step into the future, which could be insufficient in many cases. Longer chains of lookahead, on the other hand, cause exponential computational effort. Thus, other mechanisms seem necessary to speed up the lookahead possibilities, such as the formation of hierarchies in the model representation (e.g., Donnart & Meyer, 1994; Sutton, Precup, & Singh, 1999).
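As a rough picture of such prioritized updates, the following Python sketch applies a fixed budget of simulated one-step value updates to the model transitions with the largest current error, largest first. It is a generic illustration in the spirit of prioritized sweeping, with assumed names and structures, and is not part of ACS2.

# Generic sketch of prioritized "mental acting": simulated one-step updates
# are applied to the learned model, largest corrections first. It omits the
# predecessor bookkeeping of full prioritized sweeping and is an
# illustration only, not the ACS2 mechanism.

import heapq
from typing import Dict, List, Tuple

def prioritized_mental_acting(
    model: Dict[Tuple[str, str], Tuple[str, float]],  # (situation, action) -> (anticipated situation, reward)
    q: Dict[Tuple[str, str], float],                  # action values, updated in place
    actions: List[str],
    gamma: float = 0.95,
    steps: int = 10,
) -> None:
    """Apply a fixed budget of simulated updates, largest one-step errors first."""
    def one_step_error(s: str, a: str) -> float:
        s2, r = model[(s, a)]
        target = r + gamma * max(q.get((s2, b), 0.0) for b in actions)
        return target - q.get((s, a), 0.0)

    # Max-heap of candidate updates, keyed by the size of the needed correction.
    heap = [(-abs(one_step_error(s, a)), s, a) for (s, a) in model]
    heapq.heapify(heap)
    for _ in range(min(steps, len(heap))):
        _, s, a = heapq.heappop(heap)
        s2, r = model[(s, a)]
        q[(s, a)] = r + gamma * max(q.get((s2, b), 0.0) for b in actions)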
Although ACS2 proved to be a suitable learning mechanism for the implementation of anticipation-controlled behavior, many extensions seem possible. To name a few, ACS2 should be enhanced to be able to handle stochastic environments. ACS2 should be able to ignore attributes that are not influenced by its actions as well as attributes that are irrelevant for its goals. Essentially, the current goal of ACS2, that is, to learn a complete predictive model of the environment, should be relaxed to enable learning in more complex environments. Furthermore, more particular action- and task-dependent attentional processes could be included to improve and speed up behavior. Finally, the formation of behavioral hierarchies and subprograms could allow further scalability.

With respect to adaptive behavior in anticipatory learning systems in general, it seems necessary to use anticipatory mechanisms for the realization of other cognitive processes such as attentional processes, preparedness, intentional and motivational mechanisms, as well as emotions. Anticipations should prove to be helpful for further competence in adaptive behavior, as the diverse manifestations in animals and humans indicate. As a final point, it still remains to be shown in which problems anticipations are actually necessary for competent adaptive behavior. This article investigated small dynamic environments in which dynamic changes demanded backward conclusions. Although the demand for backward conclusions seems to be a general indicator for the utility of anticipations, future research must identify which dynamic changes demand backward conclusions, when the demand for backward conclusions actually requires anticipatory controlled behavior, and whether the demand for backward conclusions is the only one in which anticipation-controlled behavior is helpful.
Note

1. The parameters in ACS2 were set to: β = 0.05, umax = ∞, γ = 0.95, θga = 10, µ = 0.3, χ = 0.8, θas = 20, θexp = 20, ε = 0.4. The ACS2 results are averaged over 1,000 experiments. Similar results were obtained with variations in θga and ε, the two most influential parameters in ACS2.
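For convenience, the same settings written out as a plain Python configuration; the brief comments give the usual reading of these symbols in the ACS2 literature (Butz, 2002) and are not part of the footnote itself.

# The footnoted ACS2 parameter settings collected as a plain configuration.
# The brief comments reflect the usual reading of these symbols in the ACS2
# literature (Butz, 2002); they are added here for convenience only.
ACS2_PARAMETERS = {
    "beta": 0.05,          # learning rate
    "u_max": float("inf"),
    "gamma": 0.95,         # discount factor
    "theta_ga": 10,        # genetic generalization application threshold
    "mu": 0.3,             # mutation probability
    "chi": 0.8,            # crossover probability
    "theta_as": 20,
    "theta_exp": 20,
    "epsilon": 0.4,        # exploration probability
}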
Acknowledgments

The authors would like to thank Wolfgang Stolzmann for his contributions to this work. Moreover, the authors would like to thank David E. Goldberg for his support as well as the whole IlliGAL lab at the University of Illinois at Urbana-Champaign, including Martin Pelikan, Kumara Sastry, and others. Many thanks also to the three anonymous reviewers as well as Jason Noble for their great comments to improve this work and to make it accessible to a wider audience. Finally, the authors would like to thank their colleagues at the department of cognitive psychology at the University of Würzburg, including Andrea Kiesel, Wilfried Kunde, Albrecht Sebald, Armin Stock, and Christian Stöcker. The work was supported by the German Research Council (DFG) under grant DFG HO1301/4-.

References

Adams, C., & Dickinson, A. (1981). Instrumental responding following reinforcer devaluation. Quarterly Journal of Experimental Psychology B, 33, 109–121.
Butz, M. V. (2002). Anticipatory learning classifier systems. Boston, MA: Kluwer Academic.
Butz, M. V., Goldberg, D. E., & Stolzmann, W. (2000). Introducing a genetic generalization pressure to the anticipatory classifier system: Part 2—Performance analysis. In D. Whitley, D. E. Goldberg, E. Cantú-Paz, L. Spector, I. Parmee, & H.-G. Beyer (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000) (pp. 42–49). San Francisco: Morgan Kaufmann.
Colwill, R. M., & Rescorla, R. A. (1985). Postconditioning devaluation of a reinforcer affects instrumental learning. Journal of Experimental Psychology: Animal Behavior Processes, 11(1), 120–132.
Colwill, R. M., & Rescorla, R. A. (1990). Evidence for the hierarchical structure of instrumental learning. Animal Learning & Behavior, 18(1), 71–82.
Dickinson, A. (1994). Instrumental conditioning. In N. Mackintosh (Ed.), Animal learning and cognition (pp. 45–79). San Diego, CA: Academic Press.
Donnart, J.-Y., & Meyer, J.-A. (1994). A hierarchical classifier system implementing a motivationally autonomous animat. In D. Cliff, P. Husbands, J.-A. Meyer, & S. W. Wilson (Eds.), From animals to animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (pp. 144–153). Cambridge, MA: MIT Press.
Dorigo, M., & Colombetti, M. (1997). Robot shaping: An experiment in behavior engineering. Cambridge, MA: MIT Press.
Drescher, G. L. (1991). Made-up minds: A constructivist approach to artificial intelligence. Cambridge, MA: MIT Press.
Elsner, B., & Hommel, B. (2001). Effect anticipation and action control. Journal of Experimental Psychology: Human Perception and Performance, 27, 229–240.
Gérard, P., & Sigaud, O. (2001a). Adding a generalization mechanism to YACS. In L. Spector, E. D. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M. H. Garzon, & E. Burke (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001) (pp. 951–957). San Francisco: Morgan Kaufmann.
Gérard, P., & Sigaud, O. (2001b). YACS: Combining dynamic programming with generalization in classifier systems. In P. L. Lanzi, W. Stolzmann, & S. W. Wilson (Eds.), Advances in learning classifier systems: Third International Workshop, IWLCS 2000 (pp. 52–69). Berlin: Springer.
Greenwald, A. (1970). Sensory feedback mechanisms in performance control: With special reference to the ideo-motor mechanism. Psychological Review, 77, 73–99.
Harle, E. (1861). Der Apparat des Willens. Zeitschrift für Philosophie und philosophische Kritik, 38, 50–73.
Hoffmann, J. (1993). Vorhersage und Erkenntnis: Die Funktion von Antizipationen in der menschlichen Verhaltenssteuerung und Wahrnehmung [Anticipation and cognition: The function of anticipations in human behavioral control and perception]. Goettingen, Germany: Hogrefe.
Hoffmann, J., Sebald, A., & Stöcker, C. (2001). Irrelevant response effects improve serial learning in serial reaction time tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 470–482.
Holland, J. H. (1976). Adaptation. In R. Rosen & F. Snell (Eds.), Progress in theoretical biology (Vol. 4, pp. 263–293). New York: Academic Press.
Holland, J. H. (1990). Concerning the emergence of tag-mediated lookahead in classifier systems. In S. Forrest (Ed.), Emergent computation. Proceedings of the Ninth Annual International Conference of the Center for Nonlinear Studies on Self-organizing, Collective, and Cooperative Phenomena in Natural and Artificial Computing Networks. A special issue of Physica D (Vol. 42, pp. 188–201). New York: North-Holland.
Hommel, B. (1996). The cognitive representation of action: Automatic integration of perceived action effects. Psychological Research, 59, 176–186.
Hommel, B. (1998). Perceiving one's own action—and what it leads to. In J. S. Jordan (Ed.), Systems theory and a priori aspects of perception (pp. 143–179). Amsterdam: North-Holland.
James, W. (1981). The principles of psychology (Vol. 2). Cambridge, MA: Harvard University Press. (Original work published 1890)
Kaelbling, L. P. (1993). Learning in embedded systems. Cambridge, MA: MIT Press.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kunde, W. (2001). Response-effect compatibility in manual choice reaction tasks. Journal of Experimental Psychology: Human Perception and Performance, 27, 387–394.
Kuvayev, L., & Sutton, R. S. (1996). Model-based reinforcement learning with an approximate, learned model. In Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems (pp. 101–105). New Haven, CT: Yale University Press.
Lanzi, P. L., Stolzmann, W., & Wilson, S. W. (Eds.). (2000). Learning classifier systems: From foundations to applications. Berlin: Springer.
Lotze, H. (1852). Medizinische Psychologie oder Physiologie der Seele. Leipzig: Weidmann'sche Buchhandlung.
Moore, A. W., & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103–130.
Münsterberg, H. (1889). Beiträge zur Experimentalpsychologie (Vol. 1). Freiburg: J. C. B. Mohr.
Pashler, H., Johnston, J. C., & Ruthruff, E. (2001). Attention and performance. Annual Review of Psychology, 52, 629–651.
Pearce, J. M. (1997). Animal learning and cognition (2nd ed.). Hove: Psychology Press.
Prinz, W. (1990). A common coding approach to perception and action. In O. Neumann & W. Prinz (Eds.), Relationships between perception and action (pp. 167–201). Berlin: Springer.
Prinz, W. (1997). Perception and action planning. European Journal of Cognitive Psychology, 9, 129–154.
Rescorla, R. A. (1990). Evidence for an association between the discriminative stimulus and the response-outcome association in instrumental learning. Journal of Experimental Psychology: Animal Behavior Processes, 16(4), 326–334.
Rescorla, R. A. (1991). Associative relations in instrumental learning: The eighteenth Bartlett memorial lecture. Quarterly Journal of Experimental Psychology B, 43, 1–23.
Rescorla, R. A. (1995). Full preservation of a response-outcome association through training with a second outcome. Quarterly Journal of Experimental Psychology B, 48, 252–261.
Riolo, R. L. (1991). Lookahead planning and latent learning in a classifier system. In J.-A. Meyer & S. W. Wilson (Eds.), From animals to animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (pp. 316–326). Cambridge, MA: MIT Press.
Roitblat, H. L. (1994). Mechanism and process in animal behavior: Models of animals, animals as models. In D. Cliff, P. Husbands, J.-A. Meyer, & S. W. Wilson (Eds.), From animals to animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (pp. 12–21). Cambridge, MA: MIT Press.
Rosen, R. (1985). Anticipatory systems. Oxford: Pergamon Press.
Rosen, R. (1991). Life itself. New York: Columbia University Press.
Schubotz, R. I., & Cramon, D. Y. von (2001). Functional organization of the lateral premotor cortex: fMRI reveals different regions activated by anticipation of object properties, location and speed. Cognitive Brain Research, 11, 97–112.
Stickgold, R. (1998). Sleep: Off-line memory reprocessing. Trends in Cognitive Sciences, 2(12), 484–492.
Stock, A., & Hoffmann, J. (2002). Intentional fixation of behavioral learning or how R-E learning blocks S-R learning. European Journal of Cognitive Psychology, 14(1), 127–153.
Stolzmann, W. (1997). Antizipative Classifier Systems [Anticipatory classifier systems]. Aachen, Germany: Shaker.
Stolzmann, W. (2000). An introduction to anticipatory classifier systems. In P. L. Lanzi, W. Stolzmann, & S. W. Wilson (Eds.), Learning classifier systems: From foundations to applications (pp. 175–194). Berlin: Springer.
Stolzmann, W., Butz, M. V., Hoffmann, J., & Goldberg, D. E. (2000). First cognitive capabilities in the anticipatory classifier system. In J.-A. Meyer, A. Berthoz, D. Floreano, H. Roitblat, & S. W. Wilson (Eds.), From animals to animats 6: Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior (pp. 287–296). Cambridge, MA: MIT Press.
Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.
Sutton, R. S. (1991a). Dyna, an integrated architecture for learning, planning, and reacting. In Working Notes of the 1991 AAAI Spring Symposium on Integrated Intelligent Architectures (pp. 151–155). Palo Alto, CA: Stanford University.
Sutton, R. S. (1991b). Reinforcement learning architectures for animats. In J.-A. Meyer & S. W. Wilson (Eds.), From animals to animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (pp. 288–296). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tani, J. (1996). Model-based learning for mobile robot navigation from the dynamical systems perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B [Special Issue on Learning Autonomous Systems], 26(3), 421–436.
Thistlethwaite, D. (1951). A critical review of latent learning and related experiments. Psychological Bulletin, 48(2), 97–129.
Thorndike, E. L. (1911). Animal intelligence: Experimental studies. New York: Macmillan.
Tolman, E. C. (1932). Purposive behavior in animals and men. New York: Appleton.
Tolman, E. C. (1949). There is more than one kind of learning. Psychological Review, 56, 144–155.
Tolman, E. C., & Honzik, C. (1930). Introduction and removal of reward, and maze performance in rats. University of California Publications in Psychology, 4, 257–275.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral dissertation, King's College, Cambridge, UK.
Whitehead, S. D., & Ballard, D. H. (1991). Learning to perceive and act. Machine Learning, 7(1), 45–83.
Wilson, S. W. (1991). The animat path to AI. In J.-A. Meyer & S. W. Wilson (Eds.), From animals to animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (pp. 15–21). Cambridge, MA: MIT Press.
Wilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2), 149–175.
Wilson, S. W. (1998). Generalization in the XCS classifier system. In J. R. Koza, W. Banzhaf, K. Chellapilla, K. Deb, M. Dorigo, D. B. Fogel, M. H. Garzon, D. E. Goldberg, H. Iba, & R. L. Riolo (Eds.), Genetic programming 1998: Proceedings of the Third Annual Conference (pp. 665–674). San Francisco: Morgan Kaufmann.
Witkowski, C. M. (1997). Schemes for learning and behaviour: A new expectancy model. Doctoral dissertation, Department of Computer Science, Queen Mary Westfield College, University of London.
Witkowski, C. M. (2000). The role of behavioral extinction in animat action selection. In J.-A. Meyer, A. Berthoz, D. Floreano, H. Roitblat, & S. W. Wilson (Eds.), From animals to animats 6: Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior (pp. 177–186). Cambridge, MA: MIT Press.
Ziessler, M. (1998). Response-effect learning as a major component of implicit serial learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 962–978.
Ziessler, M., & Nattkemper, D. (2001). Learning of event sequences is based on response-effect learning: Further evidence from a serial reaction task. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 595–613.
Martin V. Butz

Martin V. Butz is a Ph.D. student in computer science at the University of Illinois at Urbana-Champaign. He received his diploma in computer science from the University of Würzburg in 2001. Butz is working at the Illinois Genetic Algorithms Laboratory (IlliGAL) as well as at the department of cognitive psychology at the University of Würzburg. His major research interest lies in the study of anticipatory learning and anticipatory behavior. Moreover, he is working on the relation of these mechanisms to general learning theories in machine learning as well as to cognitive mechanisms yielding competent adaptive behavior.