Learning and Decision Making in Monkeys During A Rock-Paper-Scissors Game
Research Report
Abstract
Game theory provides a solution to the problem of finding a set of optimal decision-making strategies in a group. However, people seldom play such optimal strategies and instead adjust their strategies based on their experience. Accordingly, many theories postulate a set of variables related to the probabilities of choosing various strategies and describe how such variables are dynamically updated. In reinforcement learning, these value functions are updated based on the outcome of the player's choice, whereas belief learning allows the value functions of all available choices to be updated according to the choices of other players. We investigated the nature of the learning process in monkeys playing a competitive game with ternary choices, a rock-paper-scissors game. During the baseline condition in which the computer selected its targets randomly, each animal displayed biases towards some targets. When the computer exploited the pattern of the animal's choice sequence but not its reward history, the animal's choice was still systematically biased by the previous choice of the computer. This bias was reduced when the computer exploited both the choice and reward histories of the animal. Compared to simple models of reinforcement learning or belief learning, these adaptive processes were better described by a model that incorporated the features of both models. These results suggest that stochastic decision-making strategies in primates during social interactions might be adjusted according to both actual and hypothetical payoffs.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Game theory; Mixed strategy; Motivation; Prefrontal cortex; Reward; Zero-sum game
according to the choices of all players. A solution of a game refers to a set of strategies that would be selected by "rational" players each trying to maximize his or her utility. Accordingly, it was an important discovery when Nash proved that any N-player game includes at least one such solution. This is known as Nash equilibrium and defined as a set of strategies from which no players can increase their payoffs by changing their strategies individually [25]. Unfortunately, this important concept has theoretical and practical limitations. First, a game can have multiple Nash equilibria, and it is difficult to determine which equilibrium should be preferred. Second, a large number of empirical studies have demonstrated that people deviate, often systematically, from such equilibrium. These limitations led to the proposals that learning might play an important role in optimizing decision-making strategies. In fact, many studies have shown that various learning models describe the observed pattern of decision making better than the equilibrium predictions [2,3,6,7,12,13,15,23,24,30,31]. In the present study, we have analyzed the choice behavior of monkeys during a simple zero-sum game with ternary choices, known as rock-paper-scissors. This was motivated by two considerations. First, rigorous comparative studies of choice behavior in non-human primates can potentially provide important insights into the evolutionary origins of human decision-making process. Second, such primate models of decision making would also provide important opportunities to understand the neural mechanisms of human decision making. For example, classical game theory and other standard economic models have always postulated certain variables, such as utility, that cannot be measured directly, making it difficult to test such theories rigorously. Recent advances in neuroscience, especially an emerging field of neuroeconomics, might make it possible to obtain precise measures of quantities that have been hitherto merely theoretical [16,41].

In our previous studies [1,20], we have examined the choices monkeys made during a binary zero-sum game, known as matching pennies. By training monkeys to play such a competitive game against a computer opponent, we showed that the animal's behavior can be modified by the strategies of its opponent. The Nash equilibrium in matching pennies requires a player to make two choices randomly with equal probabilities. Compared to a baseline condition, in which the animal was rewarded randomly, the choice of the animal became more random when the computer started exploiting statistical biases displayed by the animal in its choices. However, such biases did not disappear completely even when the computer analyzed the animal's choice as well as its reward history. These biases were consistent with the predictions of a reinforcement learning model, suggesting that the animals approximated the equilibrium strategy through experience. Due to the simplicity of the task used in our previous study, however, it was not possible to distinguish among alternative learning models.

Typically, models of adaptive decision making postulate a set of variables, one for each action, that are related to the probabilities of choosing different actions. These variables have been referred to as value functions [39], propensities [12], or attractions [6], and they are updated iteratively through the experience of the player [5,13,15,29]. In reinforcement learning models, value functions are updated strictly based on the outcome of a player's choice [12,39]. For example, in matching pennies, if a player selects the head, only the value function for the head is updated according to the outcome of his or her choice. In belief learning models, on the other hand, it is assumed that players choose their actions based on their beliefs as to how other players would behave [27]. At one extreme, this could be entirely based on the most recent choices of other players, which is referred to as Cournot dynamics [8]. In other words, decision makers may choose an option which is the best response to the most recent choices of other players they are interacting with. The other extreme is fictitious play, where the probability for a given choice of another player can be estimated based on its empirical frequency from the entire history that can be observed [27]. In weighted fictitious play, this approach was modified to give more weights to recent choices by other players [5,7]. Once the beliefs about the choices of other players are formed, they can be used to generate the expected payoffs for different choices of a given player. These expected payoffs can then be converted to the probability of choosing an action, and therefore play the role analogous to that of value functions in reinforcement learning [6,13]. It should be noted that the expected payoffs in a belief learning model are updated not only for a particular action chosen by a given player, but for all actions according to the hypothetical payoffs that the player would have received by choosing each action, given the choices of other players in previous trials. For example, if a player selects the head and wins in a matching pennies game, the value function for the head might increase and that for the tail might decrease. However, for games with binary choices, such as matching pennies, reinforcement learning and belief learning models make similar predictions and therefore are difficult to distinguish. If the player's choice depends on the difference between the value functions of two choices, these two models would become equivalent, since any increase in the value function for one choice would be equivalent to the decrease in the value function for the other choice by the same amount. These two models make distinct predictions, however, when the number of alternative choices is increased from two [24]. Thus, in order to understand the nature of learning in decision making, we examined in the present study the choice behavior of monkeys during a rock-paper-scissors game. The results showed that reinforcement learning models performed better than belief learning models. However, a hybrid model that incorporated the features of both models provided an even better fit to the data. In addition, in one animal, analysis of conditional probabilities revealed some features of belief learning model. These results suggest that a
learning process of monkeys in decision making might not be fully accounted for by simple reinforcement learning models.

2. Methods

2.1. Animal preparation and apparatus

Two male rhesus monkeys (Macaca mulatta, body weight = 7–12 kg) were used in this study. The animal was seated in a primate chair and faced a computer monitor located approximately 57 cm from its eyes. All visual stimuli were presented on the computer monitor. The animal's eye position was sampled at 250 Hz with a high-speed video-based eye tracker (ET49, Thomas Recording, Germany). All the procedures used in the present study were approved by the University of Rochester Committee on Animal Research, and conformed to the principles outlined in the Guide for the Care and Use of Laboratory Animals (NIH publications No. 80-23, revised 1996).

2.2. Behavioral task

The animals performed an oculomotor version of a rock-paper-scissors game, similar to the matching pennies game used in our previous studies [1,20], except that the present task included 3 different choices (Fig. 1). Three different visual targets were arbitrarily designated as rock, paper, and scissors, respectively. At the beginning of each trial, the computer opponent selected its target according to one of the algorithms described below, and the outcome of the animal's choice was classified as loss, tie, or win, according to the following rule: rock beats scissors, scissors beat paper, and paper beats rock. At the end of each completed trial, the animal was rewarded with one or two drops (a drop = approximately 0.23 ml) of juice for tie and win, respectively. No reward was given for the trial with a loss.

At the beginning of each trial, the animal was required to fixate a yellow square (0.9° × 0.9°; CIE x = 0.432, y = 0.494, Y = 62.9 cd/m²) presented at the center of the computer screen (Fig. 1). After a 0.5 s fore-period, three identical green disks (radius = 0.6°; CIE x = 0.286, y = 0.606, Y = 43.2 cd/m²) were presented on the circumference of an imaginary circle (radius = 5°). The animal maintained its fixation on the central square during the following 0.5 s delay period. At the end of this delay period, the central square was extinguished, and the animal was required to produce a saccadic eye movement towards one of the targets within 1 s and maintain its fixation for a 0.5 s hold period. At the end of the hold period, a yellow ring was displayed for 100 ms around the target that was selected by the computer. Simultaneously, a red ring (radius = 1.0°; CIE x = 0.632, y = 0.341, Y = 17.6 cd/m²) was also displayed around the target that would beat the computer's choice.

2.3. Algorithms of computer opponent

As in our previous study on a matching-pennies game [1], each animal was tested with 3 different algorithms with increasing levels of sophistication.

In algorithm 0, the computer selected three targets randomly with equal probabilities (i.e., p = 1/3). In a rock-paper-scissors game, this mixed strategy corresponds to the Nash equilibrium. Against the computer opponent with this strategy, any strategy adopted by the animal would produce the same expected payoff.

In algorithm 1, the computer stored the entire sequence of choices made by the animal in a given session. In each trial, the computer then used this information to calculate the conditional probabilities that the animal would choose each target given the animal's choices in the preceding N trials (N = 0 to 4). A null hypothesis that this probability is 1/3 was tested for each of these conditional probabilities (binomial test, p < 0.05). If none of these hypotheses was rejected, it was assumed that the animal had selected all three targets with equal probabilities independently from its previous choices, and the computer selected its target randomly as in algorithm 0. If one or more hypotheses were rejected, the
Fig. 1. Spatio-temporal sequence of a free-choice task used in a rock-paper-scissors game.
computer selected its target based on a particular order of conditional probabilities that includes the maximum probability which was significantly different from 1/3. Denoting this set of conditional probabilities for rock, paper, and scissors as p, q, and 1 − (p + q), the computer selected each of these three targets with the probabilities of 1 − (p + q), p, and q. For example, if the animal exclusively selects rock (i.e., p = 1), this would lead the computer opponent to choose paper with certainty. In algorithm 1, therefore, the animal was required to select the three targets randomly with equal probabilities and independently from its previous choices, in order to maximize its total reward.

In algorithm 2, the computer used the entire choice and reward history of the animal in a given session to predict the animal's choice in the next trial. To this end, a series of conditional probabilities that the animal would choose each target, given the animal's choices in the preceding N trials (N = 1 to 4) along with their payoffs, was calculated. As in algorithm 1, each of these conditional probabilities was tested against the null hypothesis that the corresponding conditional probability is 1/3. If none of these hypotheses was rejected, then the computer selected each target randomly with the probability of 1/3. Otherwise, the computer biased its target selection according to the same rule used in algorithm 1. In algorithm 2, therefore, the animal was required to select its targets not only with equal probabilities and independently from its previous choices, but also independently from the combination of its previous choices and their outcomes.

2.4. Data analysis

2.4.1. Analysis of choice probability and serial dependence

Probability that the animal would choose one of the targets according to a given strategy (e.g., choose rock) was estimated for successive blocks of 100 or 2000 trials in each algorithm. The statistical significance for rejecting the null hypothesis that each of these probabilities was equal to a particular value was evaluated using a binomial test. Whether the difference in a pair of such probabilities was statistically significant was determined with a Z test [36]. The tendency for such probabilities to increase or decrease throughout the course of a particular algorithm was tested with a regression model with the block number and the estimated probability as the independent and dependent variables, respectively. The statistical significance of a regression coefficient was determined with a t test.

2.4.2. Entropy and mutual information

The degree of randomness in the animal's choice sequence was quantified with entropy and mutual information. Both of these measures were evaluated using the choice sequence of the two players in 3 successive trials, since this made it possible to obtain relatively reliable estimates by limiting the number of possible outcomes. Specifically, if there are k possible outcomes and the i-th outcome has a probability p_i, the entropy H is defined by the following:

H = -\sum_{i=1}^{k} p_i \log_2 p_i \quad \text{(bits)}.

When the entropy was calculated based only on the animal's choice sequence in 3 successive trials, there were a total of 27 possible outcomes (k = 3^3 = 27), and the maximum entropy was 4.755 bits. Entropy was also calculated based on the animal's choice sequence in 3 successive trials and the choice of the computer opponent in the first two of these 3 trials (k = 3^5 = 243). The maximum entropy in this case was 7.925 bits. When the entropy is estimated using the probabilities estimated from a finite sample, the estimate for the entropy is biased [22]. To correct for this bias, the entropy was estimated by the following:

H = -\sum_{i=1}^{k} \hat{p}_i \log_2 \hat{p}_i + \frac{k-1}{1.3863\,N} \quad \text{(bits)},

where \hat{p}_i denotes the maximum likelihood estimate for p_i, and N the number of samples.

Mutual information was calculated between the animal's choice in 2 successive trials (input) and the animal's choice in the next trial (output), and between the choice sequence of both players in 2 successive trials and the animal's choice in the next trial. This was estimated as the following to correct for the bias due to a finite sample [22]:

I = \sum_{i=1}^{r} \sum_{j=1}^{c} \hat{p}_{ij} \log_2 \frac{\hat{p}_{ij}}{\hat{p}_i \hat{p}_j} - \frac{(r-1)(c-1)}{1.3863\,N} \quad \text{(bits)},

where \hat{p}_i is the probability of the i-th outcome in the input event (r = 3^2 = 9, or 3^4 = 81), \hat{p}_j is the probability of the j-th outcome in the output event (c = 3), and \hat{p}_{ij} is the joint probability for the i-th input event and j-th output event.

2.5. Learning models

In order to determine whether and how an animal's choice is influenced by the cumulative effects of its previous choices and their outcomes, a set of learning models were fit to the data. A common feature in all of these models is that a variable, referred to as the value function, is associated with each choice. How value functions for different choices are adjusted after each trial varies across different models. For example, in reinforcement learning, value functions are adjusted strictly according to the outcome of the animal's choice. In contrast, in belief learning models, value functions are adjusted strictly according to the choices of other players (the computer opponent in this case), regardless of the choice of the animal. These two different types of models can be considered as two special cases in a spectrum [6]. Therefore, one can also consider a model in which value functions are adjusted according to choices of all players. This is referred to as a general learning model. The
parameters of all models were estimated according to the maximum likelihood procedure [4] using a function minimization algorithm in Matlab (MathWorks Inc., MA).

2.5.1. Reinforcement learning model

In all of the models examined in the present study, the value function at trial t for a given target x (x = R, P, or S, for rock, paper, scissors, respectively), V_t(x), was updated after each trial according to the following:

V_{t+1}(x) = \alpha V_t(x) + \Delta_t(x),

where α is a decay rate, and Δ_t(x) reflects a change in the value function for target x. In the reinforcement learning model, Δ_t(x) = Δ_L if the animal selects the target x and loses (i.e., no reward), Δ_t(x) = Δ_T if the animal selects the target x and ties with the computer (i.e., small reward), and Δ_t(x) = Δ_W if the animal selects the target x and wins (i.e., large reward). Δ_t(x) is set to 0 if the animal does not select the target x. The probability that the animal would select a given target is then determined according to the softmax transformation. In other words,

p_t(x) = \frac{\exp V_t(x)}{\sum_{u \in \{R,P,S\}} \exp V_t(u)}.

2.5.2. Belief learning model

This model is similar to the reinforcement learning model, except that the value functions were updated entirely according to the choice of the computer opponent. Therefore, unlike the reinforcement learning model described above, Δ_t(x) = Δ_L for the target that would have been beaten by the computer's choice, Δ_t(x) = Δ_T for the target that would have resulted in a tie, and Δ_t(x) = Δ_W for the target that would have beaten the computer's choice. It should be noted that these adjustments are applied to all targets regardless of the animal's choice. Since value functions were converted to the probability of choosing different targets via softmax transformation, adding a constant offset to the value functions of all choices does not alter the resulting set of probabilities of choosing different targets. Therefore, Δ_L was set to 0, and the model was fit to the data by choosing the remaining 3 parameters (Δ_T, Δ_W, and α) according to the maximum likelihood procedure.

2.5.3. General learning model

Both the reinforcement learning and belief learning models described above can be generalized by allowing the changes in the value functions to be determined by a combination of the animal's choice and that of the computer opponent. For example, if the animal loses in a given trial, the value functions for all 3 targets might be adjusted simultaneously as in the belief learning model. In other words, Δ_t(x) = Δ_LL for the target chosen by the animal in a loss trial, Δ_t(x) = Δ_LT for the target that could have resulted in a tie, and Δ_t(x) = Δ_LW for the target that could have resulted in a win. In this general learning model, however, the changes applied to the value functions after a loss trial can differ from those applied following a tie or win trial. Therefore, the changes applied to the value functions after a tie trial are denoted as Δ_TL, Δ_TT, and Δ_TW, and they were estimated separately. The corresponding parameters for a win trial are denoted by Δ_WL, Δ_WT, and Δ_WW. As in the belief learning model, a constant offset can be subtracted from the value functions of all targets simultaneously without affecting the probability of choice. Therefore, Δ_LL, Δ_TL, and Δ_WL were set to 0, and the remaining parameters were estimated.

2.5.4. Model selection

In general, the performance of a model, as evaluated by the measures based on the sum of squared errors, improves with an increasing number of free parameters used to estimate the model. Therefore, in order to compare the performance of multiple models, it is necessary to correct for the improvement in the model fit expected from the difference in the number of free parameters. Two different methods, both based on the log-likelihood, were utilized in the present study. First, the Akaike's information criterion (AIC) was computed by the following:

\mathrm{AIC} = -2 \log L + 2k,

where k is the number of free parameters used in a given model [4]. Second, the Bayesian information criterion (BIC) was obtained according to the following:

\mathrm{BIC} = -2 \log L + k \log N,

where N denotes the number of data points. For a relatively large number of data points (N > 7.4), BIC penalizes complex models more than AIC [17].

3. Results

3.1. Database

A total of 5765, 82,479, and 81,627 choices of two monkeys were obtained for algorithms 0, 1, and 2, respectively. The number of days and that of trials in which each animal was tested for different algorithms are shown in Table 1.

3.2. Choice and reward probability

Each animal was tested with algorithm 0 for 2 days, and both animals selected rock in less than 8% of the trials and therefore displayed substantial deviations from the Nash equilibrium (Fig. 2). In addition, the probability that the animal would choose rock significantly decreased during algorithm 0, and a regression analysis showed that this trend was significant in both animals (t test, p < 10^-5). Whereas
monkey E selected paper most frequently (52.7%), monkey F selected scissors most frequently (71.3%; Table 2). These results are not surprising, since during algorithm 0, the animal would receive on average one drop of juice, regardless of its decision-making strategy. Indeed, each animal received approximately one drop of juice on average when tested with algorithm 0 (Table 3; Fig. 3).

Table 1
Number of days and trials tested in each animal and algorithm

Algorithm  Animal  Days  Trials   Trials/day ± SD
0          E       2     3011     1505 ± 371
           F       2     2754     1377 ± 42
1          E       20    42,598   2130 ± 616
           F       31    39,881   1287 ± 336
2          E       19    41,591   2189 ± 579
           F       19    40,036   2107 ± 326

Table 2
Probabilities of choosing rock, paper, and scissors

Algorithm  Animal  p(rock)  p(paper)  p(scissors)
0          E       0.0784   0.5271    0.3946
           F       0.0476   0.2389    0.7135
1          E       0.2820   0.3737    0.3443
           F       0.2448   0.3674    0.3878
2          E       0.2717   0.3838    0.3445
           F       0.2522   0.3790    0.3688

Following the introduction of algorithm 1, the probability that the animal would choose rock increased, although this change was somewhat more delayed in monkey E. The percentage of choosing rock in successive blocks of 100 trials remained below 5% for the first 1500 trials in monkey E, whereas this was the case only for the first 400 trials in monkey F (Fig. 2, left panels). Accordingly, there was a larger decrease in the average amount of reward received by monkey E during the corresponding period (Fig. 3, left panels). In both animals, the average reward increased gradually and reached a level relatively close to that expected for optimal performance by the end of the first day of algorithm 1 (3787 and 1507 trials for monkeys E and F, respectively). However, the probability of choosing rock remained slightly lower than the probability of choosing paper or that of choosing scissors throughout the duration of algorithm 1 (Fig. 2). Even after removing the first 10,000 trials, the percentage of choosing rock was 29.1% and 25.4% for the remaining trials in algorithm 1 for monkeys E and F, respectively, and both of these were significantly lower than 1/3 (binomial test, p < 10^-60). Accordingly, although the overall average number of rewards during algorithm 1 was larger than 0.95 for both animals, the percentage of loss trials was significantly higher than 1/3 in both animals (36.3% and 35.6%; p < 10^-10; Table 3, Fig. 4).

Overall, the introduction of algorithm 2 produced only relatively small changes in the probability of choosing different targets (Table 2; Fig. 2), the average amount of reward earned by the animal (Table 3; Fig. 3), or the probability of trials with different outcomes (e.g., win or loss; Table 3, Fig. 4).

3.3. Serial dependence and randomness in choice sequence

The null hypothesis that the choices in two successive trials were made independently was rejected for all animals and algorithms by analyzing the 3 × 3 contingency table (χ² test; χ² > 140, p < 10^-16, in all cases). To examine specifically how the successive choices deviated from the
Fig. 2. The frequency of choosing rock (green dots), and the frequency of choosing rock or paper (blue dots), in blocks of 100 (left) or 2000 (right) trials. Gray background indicates the results from the trials in which the computer opponent selected its targets according to algorithm 1.
Fig. 3. The average reward received by the animal. Same format as in Fig. 2.
Fig. 4. The frequency of win (green dots) and the frequency of win or tie (blue dots). Same format as in Fig. 2.
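The block-wise frequencies plotted in Figs. 2-4 follow the procedure of Section 2.4.1: choice (or outcome) probabilities are estimated in successive blocks of trials and tested against the equilibrium value of 1/3 with a binomial test. The following is a minimal sketch of that computation; the array names, the integer coding of the targets, and the use of SciPy's binomial test are illustrative assumptions, not the analysis code used for the published figures.

```python
import numpy as np
from scipy.stats import binomtest

def block_choice_frequency(choices, target, block_size=100):
    """Fraction of trials in each successive block on which `target` was chosen.

    choices: 1-D integer array coding the selected target on each trial
             (e.g., 0 = rock, 1 = paper, 2 = scissors).
    """
    n_blocks = len(choices) // block_size
    freqs = []
    for b in range(n_blocks):
        block = choices[b * block_size:(b + 1) * block_size]
        freqs.append(np.mean(block == target))
    return np.array(freqs)

def test_against_equilibrium(choices, target, p0=1.0 / 3.0):
    """Binomial test of the null hypothesis that `target` is chosen with probability p0."""
    k = int(np.sum(choices == target))
    return binomtest(k, n=len(choices), p=p0).pvalue

# Illustrative usage with a simulated choice sequence
rng = np.random.default_rng(0)
choices = rng.integers(0, 3, size=2000)          # hypothetical data, for illustration only
rock_frequency_per_block = block_choice_frequency(choices, target=0)
p_value = test_against_equilibrium(choices, target=0)
```

Averaging the reward or outcome codes over the same blocks, rather than the indicator of a particular target, would give curves analogous to those in Figs. 3 and 4.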
mean entropy and the theoretical maximum was less than 0.12 in all cases. The difference in the mean entropy between algorithms 1 and 2 was not significant in either animal. The mutual information between the two previous choices and the current choice was also quite small (Fig. 7), and the average mutual information remained below 0.025 bits in all animals and algorithms when the first block of 2000 trials in algorithm 1 was excluded. The average mutual information in algorithm 2 (0.021 and 0.005 for monkeys E and F) was lower than that in algorithm 1 (0.025 and 0.009), but this difference was significant only in monkey F (t test, p < 0.01). A regression analysis showed that, in most cases, the values of entropy or mutual information did not show any significant increase or decrease during the duration of a given algorithm, except that the value of mutual information decreased significantly during algorithm 1 in monkey F (p < 0.001), even after the first block of 2000 was removed.

Entropy and mutual information were also calculated between the choices of the animal and the computer opponent during the 2 successive trials and the animal's choice in the next trial. Compared to the entropy computed without taking into account the computer's choice, the mean entropy based on the choice patterns of both players displayed somewhat more substantial deviations from the maximum value. For example, the mean entropies for algorithm 1 were 7.318 and 7.656 bits for monkeys E and F, respectively, whereas the corresponding values for algorithm 2 were 7.694 and 7.730. The difference in the average entropy between the algorithms 1 and 2 was significant in monkey E (p < 10^-11), but not in monkey F
Fig. 5. Conditional probabilities of selecting rock (white), paper (gray), or scissors (black), after selecting rock, paper, or scissors in the previous trial (abscissa).
Fig. 6. The frequency of making the Cournot best response (CBR; green dots), and the frequency of making the Cournot best response or second best response
(CSBR; blue dots). Same format as in Fig. 2.
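The classification used in Figs. 6 and 8 labels each choice relative to the computer's choice in the previous trial. A short sketch of this labeling, under the assumption that rock, paper, and scissors are coded as the integers 0, 1, and 2, so that (x + 1) mod 3 beats x (the coding is an illustrative convention, not specified in the original task code):

```python
def cournot_response(choice, prev_computer_choice):
    """Classify a choice relative to the opponent's choice in the previous trial.

    Targets are coded 0 = rock, 1 = paper, 2 = scissors, so that (x + 1) % 3 beats x.
    Returns 'CBR' if the choice would have beaten the computer's previous target,
    'CWR' if it would have lost to it, and 'CSBR' if it would have tied with it.
    """
    if choice == (prev_computer_choice + 1) % 3:
        return "CBR"    # Cournot best response: beats the computer's previous choice
    if choice == (prev_computer_choice + 2) % 3:
        return "CWR"    # Cournot worst response: loses to the computer's previous choice
    return "CSBR"       # Cournot second best response: same target as the computer, a tie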
Fig. 7. Top: Entropy of the animal's choices in 3 successive trials (left), and entropy of the animal's choices in 3 successive trials combined with the computer's choices in the first two of such trials (right). Bottom: Mutual information between the animal's choices in 2 successive trials and its choice in the next trial (left), and mutual information between the choices of the animal and its opponent in 2 successive trials and the animal's choice in the next trial (right). Gray background corresponds to algorithm 1.
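The entropy and mutual information plotted in Fig. 7 were computed with the bias-corrected estimators of Section 2.4.2 (the Miller correction with 2 ln 2 ≈ 1.3863 in the denominator). A minimal sketch of those estimators, assuming the outcome counts have already been tabulated; the function and variable names are illustrative:

```python
import numpy as np

def corrected_entropy(counts):
    """Bias-corrected entropy estimate, in bits (Section 2.4.2, first correction).

    counts: observed frequencies of all k possible outcomes (zero counts included,
            so that len(counts) equals k).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    k = len(counts)
    return -np.sum(p * np.log2(p)) + (k - 1) / (1.3863 * n)

def corrected_mutual_information(joint_counts):
    """Bias-corrected mutual information estimate, in bits (Section 2.4.2, second correction).

    joint_counts: r-by-c table of joint frequencies of the input and output events.
    """
    joint = np.asarray(joint_counts, dtype=float)
    n = joint.sum()
    pij = joint / n
    pi = pij.sum(axis=1, keepdims=True)      # marginal of the input event
    pj = pij.sum(axis=0, keepdims=True)      # marginal of the output event
    nonzero = pij > 0
    raw = np.sum(pij[nonzero] * np.log2(pij[nonzero] / (pi @ pj)[nonzero]))
    r, c = joint.shape
    return raw - (r - 1) * (c - 1) / (1.3863 * n)
```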
trial in monkey E was similar to the probability of CSBR or that of CWR. In contrast, monkey F displayed a stronger tendency to select the same target regardless of the outcome in the previous trial, as reflected by a high probability of CWR and that of CSBR following a loss trial and a tie trial, respectively (Fig. 8). During algorithm 1, monkey E displayed a relatively high probability of CBR, regardless of the outcome in the previous trial. The probability of CBR following a win, tie, and loss trial was 0.479, 0.458, and 0.668, respectively. This suggests that monkey E might have updated its strategy according to the rules of belief learning. In contrast, monkey F displayed a relatively high probability of CBR only after a win trial, indicating that it might have performed the task according to the rules of reinforcement learning. In both animals, the probability of CWR was lower than the probabilities of the two remaining strategies after a loss trial, and this is consistent with either of the learning algorithms. Interestingly, following a tie trial, the probability of CSBR was lower than the probability of CBR and that of CWR in both animals. Following the introduction of algorithm 2, the probabilities of CBR, CSBR, and CWR became more similar for all outcomes in both animals. This is not surprising, since a frequent adoption of such systematic strategies could be exploited by the computer opponent in algorithm 2.

To examine quantitatively how the animal's choice in a given trial was influenced by the previous choices of the animal and the computer, 3 different learning models were
Fig. 8. Conditional probabilities of Cournot worst (white), second best (gray), and best (black) responses, computed separately for different outcomes of a
preceding trial (abscissa).
fit to the data as described in the Methods. For algorithm 1, parameters of reinforcement learning indicated that there was a relatively large increase in the value function for the target chosen by the animal in a win trial (Δ_W), relative to the changes in the value functions for targets selected in a tie or loss trials (Table 5). For algorithm 1, decay rate (α) was relatively small in both animals, indicating that the outcomes of the trials immediately prior to the current trial exerted relatively large influences. In contrast, changes in the value functions became smaller in algorithm 2, and the decay rate increased. These results indicate that relatively large influences of the outcomes in the most recent trials found in algorithm 1 were reduced in algorithm 2. This is not surprising, since during algorithm 2, the computer utilized more information to exploit the biases in the animal's choice patterns. The results from the belief learning model (Table 6) were similar to those of the reinforcement learning model. The value function for the target that would have resulted in a win trial increased more than those for the other targets. As in the reinforcement learning model, decay rates were also larger for algorithm 2 than in algorithm 1. However, the log-likelihood for reinforcement learning model was substantially larger than that for belief learning model (Table 8), indicating that reinforcement learning described the animal's choice better. The reinforcement learning model provided a better fit to the data even after correcting for the improvement expected from the use of an additional free parameter, as evaluated by AIC and BIC (Table 9).

Both learning models described above place certain restrictions on how the value functions are updated after each trial. Reinforcement learning model, for example, updates the value function only for the target selected by the animal in a given trial, whereas belief learning model updates the value functions for all targets but these changes are independent of the animal's choice. A more general approach would be to allow the changes in the value function for each target to vary according to the animal's choice in the previous trial as well as its outcome. This model provided a significantly better fit to the data in all animals and algorithms, as indicated by the log-likelihood (Table 8) as well as AIC and BIC (Table 9). In addition, a close examination of the model parameters for this general learning model reveals some features that are consistent with simpler learning models described above (Table 7). For example, during algorithm 1 in monkey E, all the parameters associated with win targets (i.e., Δ_LW, Δ_TW, and Δ_WW) were more positive than those associated with tie targets. This implies that the value functions for a win target increased regardless of the animal's choice, and therefore is consistent with the assumptions of belief learning model. However, the predictions of the belief learning model were not generally borne out in other cases. For example, the signs for the changes in the value functions associated with
Table 5
Parameters of reinforcement learning model

Algorithm  Animal  Δ_L     Δ_T     Δ_W     α
1          E       0.4764  0.7343  1.3652  0.2805
           F       0.0427  0.1722  0.5190  0.6148
2          E       0.0202  0.0531  0.1356  0.9706
           F       0.0009  0.0150  0.0174  0.9975

Table 6
Parameters of belief learning model

Algorithm  Animal  Δ_T     Δ_W     α
1          E       0.1553  0.7449  0.2025
           F       0.0883  0.1726  0.5142
2          E       0.0365  0.1226  0.7112
           F       0.0874  0.0600  0.7105
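The parameters in Tables 5 and 6 were obtained by maximizing the likelihood of the observed choice sequences under the update rules of Section 2.5. The sketch below illustrates the reinforcement-learning variant (value-function decay, an outcome-dependent increment for the chosen target only, and softmax choice probabilities). The integer coding of choices and outcomes and the function names are assumptions made for illustration; the published fits used a function-minimization routine in Matlab rather than this code.

```python
import numpy as np

def softmax_probs(values):
    """Softmax transformation of value functions into choice probabilities."""
    e = np.exp(values - values.max())    # subtracting the max improves numerical stability
    return e / e.sum()

def reinforcement_learning_nll(params, choices, outcomes):
    """Negative log-likelihood of a choice sequence under the reinforcement
    learning model of Section 2.5.1.

    params   : (alpha, d_loss, d_tie, d_win) -- decay rate and the increments
               applied to the chosen target's value after a loss, tie, or win.
    choices  : integer array, 0 = rock, 1 = paper, 2 = scissors.
    outcomes : integer array, 0 = loss, 1 = tie, 2 = win.
    """
    alpha, d_loss, d_tie, d_win = params
    delta = np.array([d_loss, d_tie, d_win])
    values = np.zeros(3)                 # V_t(x) for rock, paper, scissors
    nll = 0.0
    for choice, outcome in zip(choices, outcomes):
        probs = softmax_probs(values)
        nll -= np.log(probs[choice])
        values = alpha * values          # all value functions decay
        values[choice] += delta[outcome] # only the chosen target receives an increment
    return nll
```

Passing this negative log-likelihood to a numerical optimizer yields parameter estimates analogous to Table 5; letting the unchosen targets receive their own outcome-dependent increments gives the belief and general learning models of Sections 2.5.2 and 2.5.3.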
tie targets (Δ_LT, Δ_TT, and Δ_WT) were not consistent. Even monkey E, for which there was some evidence for belief learning, displayed a positive value for the tie target after a loss trial, but negative values in other cases (Table 7). In addition, the change for the win target was consistently positive only after a win trial, and was negative in most other cases. These results suggest that the most consistent bias in the animal's choice behavior was the tendency to select the same target again following a win trial.

Table 7
Parameters of general learning model

Algorithm  1                 2
Animal     E        F        E        F
Δ_LT       0.2066   0.0541   0.0019   0.0095
Δ_LW       0.7013   0.0240   0.0258   0.0141
Δ_TT       0.6266   0.1907   0.0321   0.0159
Δ_TW       0.2332   0.0556   0.0290   0.0048
Δ_WT       0.0522   0.1194   0.0086   0.0029
Δ_WW       1.3430   0.4470   0.1451   0.0282
α          0.2400   0.6423   0.9708   0.9933

To determine whether the above learning models account for the data better than simpler models that incorporate only the constant biases displayed by each animal, such as unequal probabilities to choose different targets or the conditional probabilities (i.e., probability of CBR, CSBR, or CWR), the log-likelihood was computed for two such models. The first model is referred to as the Bernoulli model, since each choice is treated in this model as an independent Bernoulli trial with a constant probability for each choice. The second model introduces constant probabilities for CBR, CSBR, and CWR, and is therefore referred to as the Cournot model. The Bernoulli model has no memory, since the animal's choice in a given trial is not affected by any other events in the past, whereas the memory in the Cournot model is restricted to the last trial. The values of log-likelihood for these simple models were substantially lower than any of the learning models (Table 8), indicating that the animal's choice was influenced by the cumulative effects of previous trials integrated over multiple trials as assumed in the above learning models. Similarly, the values of AIC and BIC for these simple models showed smaller improvement compared to other learning models (Table 9).

Table 8
Improvement in the log-likelihoods of different learning models, relative to the prediction of Nash equilibrium

Algorithm          1                  2
Animal             E        F         E        F
Equilibrium (0)    46,799   43,815    45,692   43,984
Bernoulli (2)      285.8    753.0     412.3    624.1
Cournot (2)        3651.2   248.6     67.9     85.1
Reinforcement (4)  4472.8   818.6     1684.1   875.9
Belief (3)         3768.3   342.5     116.4    153.9
General (7)        4908.1   850.8     1725.2   933.3

The number of free parameters in each model is shown in parentheses.

Table 9
Changes in the Akaike's information criterion (AIC) and Bayesian information criterion (BIC), relative to the values obtained for the equilibrium model

Algorithm            1                  2
Animal               E        F         E        F
AIC/BIC
  Equilibrium (0)    93,597   87,628    91,385   87,968
AIC
  Bernoulli (2)      567.7    1502.0    820.7    1244.2
  Cournot (2)        7298.5   493.2     131.8    166.1
  Reinforcement (4)  8937.6   1629.2    3360.3   1743.9
  Belief (3)         7530.7   679.0     226.7    301.7
  General (7)        9802.2   1687.5    3436.4   1852.6
BIC
  Bernoulli (2)      550.4    1484.8    803.4    1227.0
  Cournot (2)        7281.1   476.0     114.6    148.9
  Reinforcement (4)  8903.0   1594.8    3325.7   1709.5
  Belief (3)         7504.7   653.2     200.8    275.9
  General (7)        9741.6   1627.4    3375.9   1792.4

4. Discussion

4.1. Models of learning in competitive games

Decision making in a social group is characterized by the fact that, in order to make optimal choices, players must take into consideration the predicted behavior of other decision makers in the group. However, this process may not be explicit. For example, in reinforcement learning, value functions, hence the probability of making various choices, are adjusted only by the outcome of a particular choice. Therefore, for this type of learning, the player only needs to know the outcome of its own choices, but not the choices of other players. Nevertheless, if a given game is played repeatedly, value functions are ultimately influenced by the choices of other players that affect the outcome of one's choice. In belief learning, the choices of other players can influence one's choice behavior more directly, since value functions for all choices and therefore the probabilities for choosing them can be simultaneously adjusted after each choice. In the present study, the choice behavior of monkeys playing a competitive game with three alternative choices was examined in order to gain insights into the nature of learning during decision making in non-human primates.

As in our previous studies [1,20], we first examined the choice behavior of each animal in a non-interactive and hence non-competitive situation where the computer's choice was random and independent of the animal's choice. The computer selected each target with the probability of 1/3, which corresponds to the Nash equilibrium of the rock-paper-scissors game. Against this static strategy of the
computer, the animal's average payoff is fixed regardless of its decision-making strategy. This is a property of any Nash equilibrium in a zero-sum game. Therefore, it is not surprising that under this condition, each animal displayed an idiosyncratic pattern substantially deviating from the Nash equilibrium. Both animals displayed a bias against the target located directly above the fixation target, which was designated as rock. Nevertheless, the animal's average payoff was optimal for this game. When the computer began exploiting the statistical biases displayed by the animal in its choice sequences (algorithm 1), the probabilities for choosing different targets became much more similar, although there was still significant bias against one of the targets. Interestingly, the choice behavior of the two animals diverged during the period of algorithm 1. One of the animals (monkey E) gradually increased its tendency to select the target that would beat the computer's choice in the previous trial, whereas this tendency decreased in monkey F. These changes were probably driven by factors intrinsic to the animals, since there were no visible changes in the average payoff during this period. The strategy to choose the best response to the choices of other players in the previous trial is referred to as the Cournot best response (CBR) [6-8,29]. Similarly, the remaining two choices other than the CBR are referred to as the Cournot second best response (CSBR) and the Cournot worst response (CWR). During algorithm 1, the average probability of CBR for monkey E was 0.53 (Table 4), and the maximum value for a block of 2000 trials was 0.61 (Fig. 6). The probability of CBR was lower in monkey F, and did not exceed 0.44.

The choice behaviors of both animals displayed some features consistent with the predictions of reinforcement learning, especially during algorithm 1. For example, both animals were more likely to choose the same target again if they won in the preceding trial. A relatively low probability for choosing the same target after a loss trial is also consistent with reinforcement learning. Interestingly, the probability of choosing the same target as in the preceding trial was reduced after a tie trial. Within the framework of reinforcement learning, this implies that the value function for a given target is reduced after a tie, suggesting that there was a negative reward prediction error. In addition, one of the animals (monkey E) displayed some features associated with belief learning. For example, reinforcement learning predicts that following a loss trial, the probability for the CBR and CSBR would increase similarly. Although the behavior of monkey F was more or less consistent with this pattern, monkey E was substantially more likely to select the target that corresponds to the CBR in the next trial, as predicted by belief learning models. Since only two monkeys were tested in the present study, it is difficult to determine how often monkeys would adjust their decision-making strategies according to the rules of belief learning, or whether such a tendency is a stable personality trait of a given animal that might be preserved across different types of games. This remains to be investigated in future studies.

We also explored the possibility that the animal's choice might be influenced by the cumulative effects of multiple trials in the past, rather than only by the animal's choice and its outcome in the previous trial. The results showed that regardless of the type of learning rules examined, the models based on the temporal integration of value functions accounted for the biases in the animal's choice behavior better than the simple Cournot dynamics. In addition, consistent with the analyses of conditional probabilities, the results also showed that the reinforcement learning model provided a substantially better description for all algorithms and animals than the belief learning model. However, in all cases, the best fit was provided by a model that allows the value functions for different targets to be adjusted according to the previous choice of the animal and its outcome. The essential feature of reinforcement learning is that, in a given trial, only the value function for the action selected by the player is adjusted. In contrast, belief learning allows the value functions for every possible action to be adjusted after each choice, according to the hypothetical payoffs determined by the choices of other players. The general model we examined, therefore, incorporated the features of both reinforcement learning and belief learning models, similar to the experience-weighted attraction (EWA) model proposed by Camerer and Ho [6]. In the EWA model, the changes in the value functions or attractions are determined by the monetary payoffs available to a given player. In the present study, this constraint was removed, and we allowed the parameters of our general learning model to vary more freely, since we did not have any reason to believe that the changes in the value functions are proportional to the amount of juice given to the animal. For example, following a tie trial, the probability that the animal would choose the same target was reduced in algorithm 1, suggesting that the value function was reduced for a tie target. Consistent with this result, in our general learning model, the parameter for the tie target after a tie trial was negative for algorithm 1.

The results from the present study showed that the choice behavior of monkeys during a competitive game can be described better by a reinforcement learning model than by a belief learning model. Similar to the findings in the present study, human players often display systematic deviations from the predictions of Nash equilibrium even during relatively simple games. Results from previous studies suggest that human players might rely more on reinforcement learning, rather than belief learning, during a variety of constant-sum games [12,24], although in some cases, reinforcement learning and belief learning models performed more similarly [13]. As described above, these two different types of learning models share some common features, such as the use of intermediate variables that are related to the actual probabilities of choices. The finding that a more general learning model provides a better description of the choice behavior of monkeys in the present study is consistent with the EWA model of Camerer
and Ho [6]. These results suggest that models incorporating features of both learning models might better account for the animal's behavior. Thus, the choice behavior of both humans and monkeys during a competitive game may not conform to strict assumptions of reinforcement learning or belief learning model. Whether more complex learning models are required to describe choice behaviors of humans and other animals needs to be investigated further [5].

4.2. Implications for the neural correlates of learning during decision making

Reinforcement learning algorithms seek an optimal sequence of actions in a dynamic environment. In this framework, value functions are adjusted according to the reward prediction error [39] or actual payoff [12], and a given action is selected according to a probability that is monotonically related to the value function for the same action, for example, through the softmax transformation. Although belief learning models are based on a different set of assumptions as to how the value functions are adjusted, they share a common feature with reinforcement learning models in that the probability of choosing a particular action is based on a set of hypothetical values, such as expected payoffs [7,13,24] or attractions [6], that are adjusted according to the choices of other players. Therefore, in both reinforcement and belief learning models, signals related to the actual or hypothetical outcome must be generated after each choice, and they must be temporally integrated to compute value functions or attractions that are directly related to the choice probabilities. Therefore, reinforcement learning models and other learning theories of economic decision making provide a useful framework in which to investigate the underlying neural processes of decision making [32].

Indeed, transient neural signals related to the outcome of a behavioral choice as well as signals related to the amount of expected reward have been found in various regions of the primate brain. Transient activity of dopamine neurons in the ventral midbrain can signal reward prediction errors [33]. In addition, neurons in the medial frontal cortex, such as the supplementary eye field [34,37] or the anterior cingulate cortex [18], provide information about the outcome of a behavioral choice. Nevertheless, the function of the dopamine neurons and other neurons carrying transient signals related to the choice outcome is not yet fully understood. For example, it has been recently demonstrated that dopamine neurons transmit information about the uncertainty of upcoming reward in addition to the error in predicting upcoming reward, raising the possibility that uncertainty-related activity of dopamine neurons might facilitate learning in an unpredictable environment [14]. The results from the present study provide more specific predictions for the transient signals used to update the animal's decision-making strategy during a competitive game. In our study, the probability of selecting the same target was reduced after a tie trial in algorithm 1, suggesting that a relatively small reward might produce a negative reward prediction error even when it is quite close to the average value. The results from the present study also suggest that following a particular action, value functions for multiple actions may be adjusted simultaneously, as suggested in belief learning or a more general learning model. This might be achieved by at least two different mechanisms. One possibility is that changes in the value functions for multiple actions might be reflected in the heterogeneity of dopamine neurons signaling reward prediction errors. Alternatively, a particular pattern in the activity of dopamine neurons might be interpreted differently by different neurons in the striatum or the cortex to update the value functions of multiple choices. Further neurophysiological studies are required to distinguish between these alternative scenarios.

Another type of signal that plays a central role in reinforcement learning is the value function. A value function is an estimate for the temporally discounted sum of all future rewards resulting from a course of actions selected by the animal's current decision-making strategy, and therefore differs from the expected reward resulting immediately from a given action [39]. Nevertheless, they are closely related, since for a trial with a single action, they are equivalent. Therefore, a number of brain areas in which neurons often modulate their activity according to expected reward might be involved in computing value functions and/or using such signals to select the optimal sequence of actions. These areas include the prefrontal cortex [1,11,21,28,35], the posterior parietal cortex [10,26,38], and the basal ganglia [9,19]. However, the neural correlates of value functions and how these signals might be used for the purpose of selecting an optimal sequence of actions are still largely unknown. Nevertheless, the results from the present study suggest that value functions for unselected actions might be updated according to hypothetical payoffs expected from observing the behaviors of other players in a social group. Primate models of competitive interactions as utilized in the present study might, therefore, provide a useful tool to investigate these issues.

Acknowledgments

We thank Lindsay Carr and Ted Twietmeyer for their technical assistance and John Swan-Stone for computer programming. This study was supported by the National Institutes of Health.

References

[1] D.J. Barraclough, M.L. Conroy, D. Lee, Prefrontal cortex and decision making in a mixed-strategy game, Nat. Neurosci. 7 (2004) 404–410.
[2] K. Binmore, J. Swierzbinski, C. Proulx, Does minimax work? An experimental study, Econ. J. 111 (2001) 445–464.
[3] D.V. Budescu, A. Rapoport, Subjective randomization in one- and two-person games, J. Behav. Decis. Mak. 7 (1994) 261–278.
[4] K.P. Burnham, D.R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second ed., Springer-Verlag, New York, 2002.
[5] C.F. Camerer, Behavioral Game Theory: Experiments in Strategic Interaction, Princeton Univ. Press, Princeton, 2003.
[6] C.F. Camerer, T.-H. Ho, Experience-weighted attraction learning in normal form games, Econometrica 67 (1999) 827–874.
[7] Y.-W. Cheung, D. Friedman, Individual learning in normal form games: some laboratory results, Games Econ. Behav. 19 (1997) 46–76.
[8] A. Cournot, Recherches sur les principes mathematiques de la theorie des richesses, 1838, in: N. Bacon (Ed.), Researches into the Mathematical Principles of the Theory of Wealth, English edition, Macmillan, New York, 1897.
[9] H.C. Cromwell, W. Schultz, Effects of expectations for different reward magnitudes on neuronal activity in primate striatum, J. Neurophysiol. 89 (2003) 2823–2838.
[10] M.C. Dorris, P.W. Glimcher, Activity in posterior parietal cortex is correlated with the relative subjective desirability of action, Neuron 44 (2004) 365–378.
[11] R. Elliott, J.L. Newman, O.A. Longe, J.F.W. Deakin, Differential response patterns in the striatum and orbitofrontal cortex to financial reward in humans: a parametric functional magnetic resonance imaging study, J. Neurosci. 23 (2003) 303–307.
[12] I. Erev, A.E. Roth, Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibria, Am. Econ. Rev. 88 (1998) 848–881.
[13] N. Feltovich, Reinforcement-based vs. belief-based learning models in experimental asymmetric-information games, Econometrica 68 (2000) 605–641.
[14] C.D. Fiorillo, P.N. Tobler, W. Schultz, Discrete coding of reward probability and uncertainty by dopamine neurons, Science 299 (2003) 1898–1902.
[15] D. Fudenberg, D.K. Levine, The Theory of Learning in Games, MIT Press, Cambridge, 1998.
[16] P.W. Glimcher, Decisions, Uncertainty, and the Brain, MIT Press, Cambridge, 2003.
[17] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[18] S. Ito, V. Stuphorn, J.W. Brown, J.D. Schall, Performance monitoring by the anterior cingulate cortex during saccade countermanding, Science 302 (2003) 120–122.
[19] R. Kawagoe, Y. Takikawa, O. Hikosaka, Expectation of reward modulates cognitive signals in the basal ganglia, Nat. Neurosci. 1 (1998) 411–416.
[20] D. Lee, M.L. Conroy, B.P. McGreevy, D.J. Barraclough, Reinforcement learning and decision making in monkeys during a competitive game, Cognit. Brain Res. 22 (2004) 45–58.
[21] M.I. Leon, M.N. Shadlen, Effect of expected reward magnitude on the response of neurons in the dorsolateral prefrontal cortex of the macaque, Neuron 24 (1999) 415–425.
[22] G.A. Miller, Note on the bias of information estimates, in: H. Quastler (Ed.), Information Theory in Psychology, Free Press, Glencoe, 1955, pp. 95–100.
[23] D. Mookherjee, B. Sopher, Learning behavior in an experimental matching pennies game, Games Econ. Behav. 7 (1994) 62–91.
[24] D. Mookherjee, B. Sopher, Learning and decision costs in experimental constant sum games, Games Econ. Behav. 19 (1997) 97–132.
[25] J.F. Nash, Equilibrium points in n-person games, Proc. Natl. Acad. Sci. 36 (1950) 48–49.
[26] M.L. Platt, P.W. Glimcher, Neural correlates of decision variables in parietal cortex, Nature 400 (1999) 233–238.
[27] J. Robinson, An iterative method of solving a game, Ann. Math. 54 (1951) 296–301.
[28] M.R. Roesch, C.R. Olson, Impact of expected reward on neuronal activity in prefrontal cortex, frontal and supplementary eye fields and premotor cortex, J. Neurophysiol. 90 (2003) 1766–1789.
[29] T.C. Salmon, An evaluation of econometric models of adaptive learning, Econometrica 69 (2001) 1597–1628.
[30] R. Sarin, F. Vahid, Predicting how people play games: a simple dynamic model of choice, Games Econ. Behav. 34 (2001) 104–122.
[31] Y. Sato, E. Akiyama, J.D. Farmer, Chaos in learning a simple two-person game, Proc. Natl. Acad. Sci. 99 (2002) 4748–4751.
[32] W. Schultz, Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioral ecology, Curr. Opin. Neurobiol. 14 (2004) 139–147.
[33] W. Schultz, A. Dickinson, Neuronal coding of prediction errors, Annu. Rev. Neurosci. 23 (2000) 473–500.
[34] H. Seo, D.J. Barraclough, B.P. McGreevy, D. Lee, Role of supplementary eye field in decision making during a competitive game, Program No. 87.3, 2004 Abstract Viewer/Itinerary Planner, Society for Neuroscience, Washington, DC, 2004 (Online).
[35] M. Shidara, B.J. Richmond, Anterior cingulate: single neuronal signals related to degree of reward expectancy, Science 296 (2002) 1709–1711.
[36] G.W. Snedecor, W.G. Cochran, Statistical Methods, Eighth ed., Iowa State Univ. Press, Ames, 1989.
[37] V. Stuphorn, T.L. Taylor, J.D. Schall, Performance monitoring by the supplementary eye field, Nature 408 (2000) 857–860.
[38] L.P. Sugrue, G.S. Corrado, W.T. Newsome, Matching behavior and the representation of value in the parietal cortex, Science 304 (2004) 1782–1787.
[39] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998.
[40] J. von Neumann, O. Morgenstern, The Theory of Games and Economic Behavior, Princeton Univ. Press, Princeton, 1944.
[41] P. Zak, Neuroeconomics, Philos. Trans. R. Soc. London, B 359 (2004) 1737–1748.