Learning and Decision Making in Monkeys During A Rock-Paper-Scissors Game
Research Report
Abstract
Game theory provides a solution to the problem of finding a set of optimal decision-making strategies in a group. However, people seldom play such optimal strategies and instead adjust their strategies based on their experience. Accordingly, many theories postulate a set of variables related to the probabilities of choosing various strategies and describe how such variables are dynamically updated. In reinforcement learning, these value functions are updated based on the outcome of the player's choice, whereas belief learning allows the value functions of all available choices to be updated according to the choices of other players. We investigated the nature of the learning process in monkeys playing a competitive game with ternary choices, a rock-paper-scissors game. During the baseline condition in which the computer selected its targets randomly, each animal displayed biases towards some targets. When the computer exploited the pattern of the animal's choice sequence but not its reward history, the animal's choice was still systematically biased by the previous choice of the computer. This bias was reduced when the computer exploited both the choice and reward histories of the animal. Compared to simple models of reinforcement learning or belief learning, these adaptive processes were better described by a model that incorporated the features of both models. These results suggest that stochastic decision-making strategies in primates during social interactions might be adjusted according to both actual and hypothetical payoffs.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Game theory; Mixed strategy; Motivation; Prefrontal cortex; Reward; Zero-sum game
according to the choices of all players. A solution of a game refers to a set of strategies that would be selected by "rational" players each trying to maximize his or her utility. Accordingly, it was an important discovery when Nash proved that any N-player game includes at least one such solution. This is known as Nash equilibrium and defined as a set of strategies from which no players can increase their payoffs by changing their strategies individually [25]. Unfortunately, this important concept has theoretical and practical limitations. First, a game can have multiple Nash equilibria, and it is difficult to determine which equilibrium should be preferred. Second, a large number of empirical studies have demonstrated that people deviate, often systematically, from such equilibrium. These limitations led to the proposals that learning might play an important role in optimizing decision-making strategies. In fact, many studies have shown that various learning models describe the observed pattern of decision making better than the equilibrium predictions [2,3,6,7,12,13,15,23,24,30,31]. In the present study, we have analyzed the choice behavior of monkeys during a simple zero-sum game with ternary choices, known as rock-paper-scissors. This was motivated by two considerations. First, rigorous comparative studies of choice behavior in non-human primates can potentially provide important insights into the evolutionary origins of human decision-making process. Second, such primate models of decision making would also provide important opportunities to understand the neural mechanisms of human decision making. For example, classical game theory and other standard economic models have always postulated certain variables, such as utility, that cannot be measured directly, making it difficult to test such theories rigorously. Recent advances in neuroscience, especially an emerging field of neuroeconomics, might make it possible to obtain precise measures of quantities that have been hitherto merely theoretical [16,41].

In our previous studies [1,20], we have examined the choices monkeys made during a binary zero-sum game, known as matching pennies. By training monkeys to play such a competitive game against a computer opponent, we showed that the animal's behavior can be modified by the strategies of its opponent. The Nash equilibrium in matching pennies requires a player to make two choices randomly with equal probabilities. Compared to a baseline condition, in which the animal was rewarded randomly, the choice of the animal became more random when the computer started exploiting statistical biases displayed by the animal in its choices. However, such biases did not disappear completely even when the computer analyzed the animal's choice as well as its reward history. These biases were consistent with the predictions of a reinforcement learning model, suggesting that the animals approximated the equilibrium strategy through experience. Due to the simplicity of the task used in our previous study, however, it was not possible to distinguish among alternative learning models.

Typically, models of adaptive decision making postulate a set of variables, one for each action, that are related to the probabilities of choosing different actions. These variables have been referred to as value functions [39], propensities [12], or attractions [6], and they are updated iteratively through the experience of the player [5,13,15,29]. In reinforcement learning models, value functions are updated strictly based on the outcome of a player's choice [12,39]. For example, in matching pennies, if a player selects the head, only the value function for the head is updated according to the outcome of his or her choice. In belief learning models, on the other hand, it is assumed that players choose their actions based on their beliefs as to how other players would behave [27]. At one extreme, this could be entirely based on the most recent choices of other players, which is referred to as Cournot dynamics [8]. In other words, decision makers may choose an option which is the best response to the most recent choices of other players they are interacting with. The other extreme is fictitious play, where the probability for a given choice of another player can be estimated based on its empirical frequency from the entire history that can be observed [27]. In weighted fictitious play, this approach was modified to give more weights to recent choices by other players [5,7]. Once the beliefs about the choices of other players are formed, they can be used to generate the expected payoffs for different choices of a given player. These expected payoffs can then be converted to the probability of choosing an action, and therefore play the role analogous to that of value functions in reinforcement learning [6,13]. It should be noted that the expected payoffs in a belief learning model are updated not only for a particular action chosen by a given player, but for all actions according to the hypothetical payoffs that the player would have received by choosing each action, given the choices of other players in previous trials. For example, if a player selects the head and wins in a matching pennies game, the value function for the head might increase and that for the tail might decrease. However, for games with binary choices, such as matching pennies, reinforcement learning and belief learning models make similar predictions and therefore are difficult to distinguish. If the player's choice depends on the difference between the value functions of two choices, these two models would become equivalent, since any increase in the value function for one choice would be equivalent to the decrease in the value function for the other choice by the same amount. These two models make distinct predictions, however, when the number of alternative choices is increased from two [24]. Thus, in order to understand the nature of learning in decision making, we examined in the present study the choice behavior of monkeys during a rock-paper-scissors game. The results showed that reinforcement learning models performed better than belief learning models. However, a hybrid model that incorporated the features of both models provided an even better fit to the data. In addition, in one animal, analysis of conditional probabilities revealed some features of belief learning model. These results suggest that a
learning process of monkeys in decision making might not be fully accounted for by simple reinforcement learning models.

2. Methods

2.1. Animal preparation and apparatus

Two male rhesus monkeys (Macaca mulatta, body weight = 7–12 kg) were used in this study. The animal was seated in a primate chair and faced a computer monitor located approximately 57 cm from its eyes. All visual stimuli were presented on the computer monitor. The animal's eye position was sampled at 250 Hz with a high-speed video-based eye tracker (ET49, Thomas Recording, Germany). All the procedures used in the present study were approved by the University of Rochester Committee on Animal Research, and conformed to the principles outlined in the Guide for the Care and Use of Laboratory Animals (NIH publications No. 80-23, revised 1996).

2.2. Behavioral task

The animals performed an oculomotor version of a rock-paper-scissors game, similar to the matching pennies game used in our previous studies [1,20], except that the present task included 3 different choices (Fig. 1). Three different visual targets were arbitrarily designated as rock, paper, and scissors, respectively. At the beginning of each trial, the computer opponent selected its target according to one of the algorithms described below, and the outcome of the animal's choice was classified as loss, tie, or win, according to the following rule: rock beats scissors, scissors beat paper, and paper beats rock. At the end of each completed trial, the animal was rewarded with one or two drops (a drop = approximately 0.23 ml) of juice for tie and win, respectively. No reward was given for the trial with a loss.

At the beginning of each trial, the animal was required to fixate a yellow square (0.9° × 0.9°; CIE x = 0.432, y = 0.494, Y = 62.9 cd/m²) presented at the center of the computer screen (Fig. 1). After a 0.5 s fore-period, three identical green disks (radius = 0.6°; CIE x = 0.286, y = 0.606, Y = 43.2 cd/m²) were presented on the circumference of an imaginary circle (radius = 5°). The animal maintained its fixation on the central square during the following 0.5 s delay period. At the end of this delay period, the central square was extinguished, and the animal was required to produce a saccadic eye movement towards one of the targets within 1 s and maintain its fixation for a 0.5 s hold period. At the end of the hold period, a yellow ring was displayed for 100 ms around the target that was selected by the computer. Simultaneously, a red ring (radius = 1.0°; CIE x = 0.632, y = 0.341, Y = 17.6 cd/m²) was also displayed around the target that would beat the computer's choice.

2.3. Algorithms of computer opponent

As in our previous study on a matching-pennies game [1], each animal was tested with 3 different algorithms with increasing levels of sophistication.

In algorithm 0, the computer selected three targets randomly with equal probabilities (i.e., p = 1/3). In a rock-paper-scissors game, this mixed strategy corresponds to the Nash equilibrium. Against the computer opponent with this strategy, any strategy adopted by the animal would produce the same expected payoff.

In algorithm 1, the computer stored the entire sequence of choices made by the animal in a given session. In each trial, the computer then used this information to calculate the conditional probabilities that the animal would choose each target given the animal's choices in the preceding N trials (N = 0 to 4). A null hypothesis that this probability is 1/3 was tested for each of these conditional probabilities (binomial test, p < 0.05). If none of these hypotheses was rejected, it was assumed that the animal had selected all three targets with equal probabilities independently from its previous choices, and the computer selected its target randomly as in algorithm 0. If one or more hypotheses were rejected, the
Fig. 1. Spatio-temporal sequence of a free-choice task used in a rock-paper-scissors game.
computer selected its target based on a particular order of conditional probabilities that includes the maximum probability which was significantly different from 1/3. Denoting this set of conditional probabilities for rock, paper, and scissors as p, q, and 1 − (p + q), the computer selected each of these three targets with the probabilities of 1 − (p + q), p, and q. For example, if the animal exclusively selects rock (i.e., p = 1), this would lead the computer opponent to choose paper with certainty. In algorithm 1, therefore, the animal was required to select the three targets randomly with equal probabilities and independently from its previous choices, in order to maximize its total reward.

In algorithm 2, the computer used the entire choice and reward history of the animal in a given session to predict the animal's choice in the next trial. To this end, a series of conditional probabilities that the animal would choose each target, given the animal's choices in the preceding N trials (N = 1 to 4) along with their payoffs, was calculated. As in algorithm 1, each of these conditional probabilities was tested against the null hypothesis that the corresponding conditional probability is 1/3. If none of these hypotheses was rejected, then the computer selected each target randomly with the probability of 1/3. Otherwise, the computer biased its target selection according to the same rule used in algorithm 1. In algorithm 2, therefore, the animal was required to select its targets not only with equal probabilities and independently from its previous choices, but also independently from the combination of its previous choices and their outcomes.

2.4. Data analysis

2.4.1. Analysis of choice probability and serial dependence

Probability that the animal would choose one of the targets according to a given strategy (e.g., choose rock) was estimated for successive blocks of 100 or 2000 trials in each algorithm. The statistical significance for rejecting the null hypothesis that each of these probabilities was equal to a particular value was evaluated using a binomial test. Whether the difference in a pair of such probabilities was statistically significant was determined with a Z test [36]. The tendency for such probabilities to increase or decrease throughout the course of a particular algorithm was tested with a regression model with the block number and the estimated probability as the independent and dependent variables, respectively. The statistical significance of a regression coefficient was determined with a t test.

2.4.2. Entropy and mutual information

The degree of randomness in the animal's choice sequence was quantified with entropy and mutual information. Both of these measures were evaluated using the choice sequence of the two players in 3 successive trials, since this made it possible to obtain relatively reliable estimates by limiting the number of possible outcomes. Specifically, if there are k possible outcomes and the i-th outcome has a probability p_i, the entropy H is defined by the following:

H = -\sum_{i=1}^{k} p_i \log_2 p_i \quad \text{(bits)}.

When the entropy was calculated based only on the animal's choice sequence in 3 successive trials, there were a total of 27 possible outcomes (k = 3^3 = 27), and the maximum entropy was 4.755 bits. Entropy was also calculated based on the animal's choice sequence in 3 successive trials and the choice of the computer opponent in the first two of these 3 trials (k = 3^5 = 243). The maximum entropy in this case was 7.925 bits. When the entropy is estimated using the probabilities estimated from a finite sample, the estimate for the entropy is biased [22]. To correct for this bias, the entropy was estimated by the following:

H = -\sum_{i=1}^{k} \hat{p}_i \log_2 \hat{p}_i + \frac{k-1}{1.3863\,N} \quad \text{(bits)},

where \hat{p}_i denotes the maximum likelihood estimate for p_i, and N the number of samples.

Mutual information was calculated between the animal's choice in 2 successive trials (input) and the animal's choice in the next trial (output), and between the choice sequence of both players in 2 successive trials and the animal's choice in the next trial. This was estimated as the following to correct for the bias due to a finite sample [22]:

I = \sum_{i=1}^{r} \sum_{j=1}^{c} \hat{p}_{ij} \log_2 \frac{\hat{p}_{ij}}{\hat{p}_i \hat{p}_j} - \frac{(r-1)(c-1)}{1.3863\,N} \quad \text{(bits)},

where \hat{p}_i is the probability of the i-th outcome in the input event (r = 3^2 = 9, or 3^4 = 81), \hat{p}_j is the probability of the j-th outcome in the output event (c = 3), and \hat{p}_{ij} is the joint probability for the i-th input event and j-th output event.

2.5. Learning models

In order to determine whether and how an animal's choice is influenced by the cumulative effects of its previous choices and their outcomes, a set of learning models were fit to the data. A common feature in all of these models is that a variable, referred to as the value function, is associated with each choice. How value functions for different choices are adjusted after each trial varies across different models. For example, in reinforcement learning, value functions are adjusted strictly according to the outcome of the animal's choice. In contrast, in belief learning models, value functions are adjusted strictly according to the choices of other players (the computer opponent in this case), regardless of the choice of the animal. These two different types of models can be considered as two special cases in a spectrum [6]. Therefore, one can also consider a model in which value functions are adjusted according to choices of all players. This is referred to as a general learning model. The
parameters of all models were estimated according to the maximum likelihood procedure [4] using a function minimization algorithm in Matlab (MathWorks Inc., MA).

2.5.1. Reinforcement learning model

In all of the models examined in the present study, the value function at trial t for a given target x (x = R, P, or S, for rock, paper, scissors, respectively), V_t(x), was updated after each trial according to the following:

V_{t+1}(x) = \alpha V_t(x) + \Delta_t(x),

where α is a decay rate, and Δ_t(x) reflects a change in the value function for target x. In the reinforcement learning model, Δ_t(x) = Δ_L if the animal selects the target x and loses (i.e., no reward), Δ_t(x) = Δ_T if the animal selects the target x and ties with the computer (i.e., small reward), and Δ_t(x) = Δ_W if the animal selects the target x and wins (i.e., large reward). Δ_t(x) is set to 0 if the animal does not select the target x. The probability that the animal would select a given target is then determined according to the softmax transformation. In other words,

p_t(x) = \frac{\exp V_t(x)}{\sum_{u \in \{R,P,S\}} \exp V_t(u)}.

2.5.2. Belief learning model

This model is similar to the reinforcement learning model, except that the value functions were updated entirely according to the choice of the computer opponent. Therefore, unlike the reinforcement learning model described above, Δ_t(x) = Δ_L for the target that would have been beaten by the computer's choice, Δ_t(x) = Δ_T for the target that would have resulted in a tie, and Δ_t(x) = Δ_W for the target that would have beaten the computer's choice. It should be noted that these adjustments are applied to all targets regardless of the animal's choice. Since value functions were converted to the probability of choosing different targets via softmax transformation, adding a constant offset to the value functions of all choices does not alter the resulting set of probabilities of choosing different targets. Therefore, Δ_L was set to 0, and the model was fit to the data by choosing the remaining 3 parameters (Δ_T, Δ_W, and α) according to the maximum likelihood procedure.

2.5.3. General learning model

Both the reinforcement learning and belief learning models described above can be generalized by allowing the changes in the value functions to be determined by a combination of the animal's choice and that of the computer opponent. For example, if the animal loses in a given trial, the value functions for all 3 targets might be adjusted simultaneously as in the belief learning model. In other words, Δ_t(x) = Δ_LL for the target chosen by the animal in a loss trial, Δ_t(x) = Δ_LT for the target that could have resulted in a tie, and Δ_t(x) = Δ_LW for the target that could have resulted in a win. In this general learning model, however, the changes applied to the value functions after a loss trial can differ from those applied following a tie or win trial. Therefore, the changes applied to the value functions after a tie trial are denoted as Δ_TL, Δ_TT, and Δ_TW, and they were estimated separately. The corresponding parameters for a win trial are denoted by Δ_WL, Δ_WT, and Δ_WW. As in the belief learning model, a constant offset can be subtracted from the value functions of all targets simultaneously without affecting the probability of choice. Therefore, Δ_LL, Δ_TL, and Δ_WL were set to 0, and the remaining parameters were estimated.

2.5.4. Model selection

In general, the performance of a model, as evaluated by the measures based on the sum of squared errors, improves with an increasing number of free parameters used to estimate the model. Therefore, in order to compare the performance of multiple models, it is necessary to correct for the improvement in the model fit expected from the difference in the number of free parameters. Two different methods, both based on the log-likelihood, were utilized in the present study. First, the Akaike's information criterion (AIC) was computed by the following:

\mathrm{AIC} = -2 \log L + 2k,

where k is the number of free parameters used in a given model [4]. Second, the Bayesian information criterion (BIC) was obtained according to the following:

\mathrm{BIC} = -2 \log L + k \log N,

where N denotes the number of data points. For a relatively large number of data points (N > 7.4), BIC penalizes complex models more than AIC [17].

3. Results

3.1. Database

A total of 5765, 82,479, and 81,627 choices of two monkeys were obtained for algorithms 0, 1, and 2, respectively. The number of days and that of trials in which each animal was tested for different algorithms are shown in Table 1.

3.2. Choice and reward probability

Each animal was tested with algorithm 0 for 2 days, and both animals selected rock in less than 8% of the trials and therefore displayed substantial deviations from the Nash equilibrium (Fig. 2). In addition, the probability that the animal would choose rock significantly decreased during algorithm 0, and a regression analysis showed that this trend was significant in both animals (t test, p < 10^-5). Whereas
monkey E selected paper most frequently (52.7%), monkey F selected scissors most frequently (71.3%; Table 2). These results are not surprising, since during algorithm 0, the animal would receive on average one drop of juice, regardless of its decision-making strategy. Indeed, each animal received approximately one drop of juice on average when tested with algorithm 0 (Table 3; Fig. 3).

Table 1
Number of days and trials tested in each animal and algorithm

Algorithm  Animal  Days  Trials   Trials/day ± SD
0          E       2     3011     1505 ± 371
           F       2     2754     1377 ± 42
1          E       20    42,598   2130 ± 616
           F       31    39,881   1287 ± 336
2          E       19    41,591   2189 ± 579
           F       19    40,036   2107 ± 326

Table 2
Probabilities of choosing rock, paper, and scissors

Algorithm  Animal  p(rock)  p(paper)  p(scissors)
0          E       0.0784   0.5271    0.3946
           F       0.0476   0.2389    0.7135
1          E       0.2820   0.3737    0.3443
           F       0.2448   0.3674    0.3878
2          E       0.2717   0.3838    0.3445
           F       0.2522   0.3790    0.3688

Following the introduction of algorithm 1, the probability that the animal would choose rock increased, although this change was somewhat more delayed in monkey E. The percentage of choosing rock in successive blocks of 100 trials remained below 5% for the first 1500 trials in monkey E, whereas this was the case only for the first 400 trials in monkey F (Fig. 2, left panels). Accordingly, there was a larger decrease in the average amount of reward received by monkey E during the corresponding period (Fig. 3, left panels). In both animals, the average reward increased gradually and reached a level relatively close to that expected for optimal performance by the end of the first day of algorithm 1 (3787 and 1507 trials for monkeys E and F, respectively). However, the probability of choosing rock remained slightly lower than the probability of choosing paper or that of choosing scissors throughout the duration of algorithm 1 (Fig. 2). Even after removing the first 10,000 trials, the percentage of choosing rock was 29.1% and 25.4% for the remaining trials in algorithm 1 for monkeys E and F, respectively, and both of these were significantly lower than 1/3 (binomial test, p < 10^-60). Accordingly, although the overall average number of rewards during algorithm 1 was larger than 0.95 for both animals, the percentage of loss trials was significantly higher than 1/3 in both animals (36.3% and 35.6%; p < 10^-10; Table 3, Fig. 4).

Overall, the introduction of algorithm 2 produced only relatively small changes in the probability of choosing different targets (Table 2; Fig. 2), the average amount of reward earned by the animal (Table 3; Fig. 3), or the probability of trials with different outcomes (e.g., win or loss; Table 3, Fig. 4).

3.3. Serial dependence and randomness in choice sequence

The null hypothesis that the choices in two successive trials were made independently was rejected for all animals and algorithms by analyzing the 3 × 3 contingency table (χ² test; χ² > 140, p < 10^-16, in all cases). To examine specifically how the successive choices deviated from the
Fig. 2. The frequency of choosing rock (green dots), and the frequency of choosing rock or paper (blue dots), in blocks of 100 (left) or 2000 (right) trials. Gray background indicates the results from the trials in which the computer opponent selected its targets according to algorithm 1.
Fig. 3. The average reward received by the animal. Same format as in Fig. 2.
Fig. 4. The frequency of win (green dots) and the frequency of win or tie (blue dots). Same format as in Fig. 2.
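The block-wise frequencies plotted in Figs. 2-4 follow the procedure of Section 2.4.1: choice (or outcome) probabilities are estimated in successive blocks of trials and tested against the equilibrium value of 1/3 with a binomial test. The following is a minimal sketch of that computation; the array names, the integer coding of the targets, and the use of SciPy's binomial test are illustrative assumptions, not the analysis code used for the published figures.

```python
import numpy as np
from scipy.stats import binomtest

def block_choice_frequency(choices, target, block_size=100):
    """Fraction of trials in each successive block on which `target` was chosen.

    choices: 1-D integer array coding the selected target on each trial
             (e.g., 0 = rock, 1 = paper, 2 = scissors).
    """
    n_blocks = len(choices) // block_size
    freqs = []
    for b in range(n_blocks):
        block = choices[b * block_size:(b + 1) * block_size]
        freqs.append(np.mean(block == target))
    return np.array(freqs)

def test_against_equilibrium(choices, target, p0=1.0 / 3.0):
    """Binomial test of the null hypothesis that `target` is chosen with probability p0."""
    k = int(np.sum(choices == target))
    return binomtest(k, n=len(choices), p=p0).pvalue

# Illustrative usage with a simulated choice sequence
rng = np.random.default_rng(0)
choices = rng.integers(0, 3, size=2000)          # hypothetical data, for illustration only
rock_frequency_per_block = block_choice_frequency(choices, target=0)
p_value = test_against_equilibrium(choices, target=0)
```

Averaging the reward or outcome codes over the same blocks, rather than the indicator of a particular target, would give curves analogous to those in Figs. 3 and 4.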
mean entropy and the theoretical maximum was less than 0.12 in all cases. The difference in the mean entropy between algorithms 1 and 2 was not significant in either animal. The mutual information between the two previous choices and the current choice was also quite small (Fig. 7), and the average mutual information remained below 0.025 bits in all animals and algorithms when the first block of 2000 trials in algorithm 1 was excluded. The average mutual information in algorithm 2 (0.021 and 0.005 for monkeys E and F) was lower than that in algorithm 1 (0.025 and 0.009), but this difference was significant only in monkey F (t test, p < 0.01). A regression analysis showed that, in most cases, the values of entropy or mutual information did not show any significant increase or decrease during the duration of a given algorithm, except that the value of mutual information decreased significantly during algorithm 1 in monkey F (p < 0.001), even after the first block of 2000 was removed.

Entropy and mutual information were also calculated between the choices of the animal and the computer opponent during the 2 successive trials and the animal's choice in the next trial. Compared to the entropy computed without taking into account the computer's choice, the mean entropy based on the choice patterns of both players displayed somewhat more substantial deviations from the maximum value. For example, the mean entropies for algorithm 1 were 7.318 and 7.656 bits for monkeys E and F, respectively, whereas the corresponding values for algorithm 2 were 7.694 and 7.730. The difference in the average entropy between the algorithms 1 and 2 was significant in monkey E (p < 10^-11), but not in monkey F
Fig. 5. Conditional probabilities of selecting rock (white), paper (gray), or scissors (black), after selecting rock, paper, or scissors in the previous trial (abscissa).
Fig. 6. The frequency of making the Cournot best response (CBR; green dots), and the frequency of making the Cournot best response or second best response
(CSBR; blue dots). Same format as in Fig. 2.
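The classification used in Figs. 6 and 8 labels each choice relative to the computer's choice in the previous trial. A short sketch of this labeling, under the assumption that rock, paper, and scissors are coded as the integers 0, 1, and 2, so that (x + 1) mod 3 beats x (the coding is an illustrative convention, not specified in the original task code):

```python
def cournot_response(choice, prev_computer_choice):
    """Classify a choice relative to the opponent's choice in the previous trial.

    Targets are coded 0 = rock, 1 = paper, 2 = scissors, so that (x + 1) % 3 beats x.
    Returns 'CBR' if the choice would have beaten the computer's previous target,
    'CWR' if it would have lost to it, and 'CSBR' if it would have tied with it.
    """
    if choice == (prev_computer_choice + 1) % 3:
        return "CBR"    # Cournot best response: beats the computer's previous choice
    if choice == (prev_computer_choice + 2) % 3:
        return "CWR"    # Cournot worst response: loses to the computer's previous choice
    return "CSBR"       # Cournot second best response: same target as the computer, a tie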
Fig. 7. Top: Entropy of the animal's choices in 3 successive trials (left), and entropy of the animal's choices in 3 successive trials combined with the computer's choices in the first two of such trials (right). Bottom: Mutual information between the animal's choices in 2 successive trials and its choice in the next trial (left), and mutual information between the choices of the animal and its opponent in 2 successive trials and the animal's choice in the next trial (right). Gray background corresponds to algorithm 1.
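The entropy and mutual information plotted in Fig. 7 were computed with the bias-corrected estimators of Section 2.4.2 (the Miller correction with 2 ln 2 ≈ 1.3863 in the denominator). A minimal sketch of those estimators, assuming the outcome counts have already been tabulated; the function and variable names are illustrative:

```python
import numpy as np

def corrected_entropy(counts):
    """Bias-corrected entropy estimate, in bits (Section 2.4.2, first correction).

    counts: observed frequencies of all k possible outcomes (zero counts included,
            so that len(counts) equals k).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    k = len(counts)
    return -np.sum(p * np.log2(p)) + (k - 1) / (1.3863 * n)

def corrected_mutual_information(joint_counts):
    """Bias-corrected mutual information estimate, in bits (Section 2.4.2, second correction).

    joint_counts: r-by-c table of joint frequencies of the input and output events.
    """
    joint = np.asarray(joint_counts, dtype=float)
    n = joint.sum()
    pij = joint / n
    pi = pij.sum(axis=1, keepdims=True)      # marginal of the input event
    pj = pij.sum(axis=0, keepdims=True)      # marginal of the output event
    nonzero = pij > 0
    raw = np.sum(pij[nonzero] * np.log2(pij[nonzero] / (pi @ pj)[nonzero]))
    r, c = joint.shape
    return raw - (r - 1) * (c - 1) / (1.3863 * n)
```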
trial in monkey E was similar to the probability of CSBR or that of CWR. In contrast, monkey F displayed a stronger tendency to select the same target regardless of the outcome in the previous trial, as reflected by a high probability of CWR and that of CSBR following a loss trial and a tie trial, respectively (Fig. 8). During algorithm 1, monkey E displayed a relatively high probability of CBR, regardless of the outcome in the previous trial. The probability of CBR following a win, tie, and loss trial was 0.479, 0.458, and 0.668, respectively. This suggests that monkey E might have updated its strategy according to the rules of belief learning. In contrast, monkey F displayed a relatively high probability of CBR only after a win trial, indicating that it might have performed the task according to the rules of reinforcement learning. In both animals, the probability of CWR was lower than the probabilities of the two remaining strategies after a loss trial, and this is consistent with either of the learning algorithms. Interestingly, following a tie trial, the probability of CSBR was lower than the probability of CBR and that of CWR in both animals. Following the introduction of algorithm 2, the probabilities of CBR, CSBR, and CWR became more similar for all outcomes in both animals. This is not surprising, since a frequent adoption of such systematic strategies could be exploited by the computer opponent in algorithm 2.

To examine quantitatively how the animal's choice in a given trial was influenced by the previous choices of the animal and the computer, 3 different learning models were
Fig. 8. Conditional probabilities of Cournot worst (white), second best (gray), and best (black) responses, computed separately for different outcomes of a
preceding trial (abscissa).
fit to the data as described in the Methods. For algorithm 1, parameters of reinforcement learning indicated that there was a relatively large increase in the value function for the target chosen by the animal in a win trial (Δ_W), relative to the changes in the value functions for targets selected in a tie or loss trials (Table 5). For algorithm 1, decay rate (α) was relatively small in both animals, indicating that the outcomes of the trials immediately prior to the current trial exerted relatively large influences. In contrast, changes in the value functions became smaller in algorithm 2, and the decay rate increased. These results indicate that relatively large influences of the outcomes in the most recent trials found in algorithm 1 were reduced in algorithm 2. This is not surprising, since during algorithm 2, the computer utilized more information to exploit the biases in the animal's choice patterns. The results from the belief learning model (Table 6) were similar to those of the reinforcement learning model. The value function for the target that would have resulted in a win trial increased more than those for the other targets. As in the reinforcement learning model, decay rates were also larger for algorithm 2 than in algorithm 1. However, the log-likelihood for reinforcement learning model was substantially larger than that for belief learning model (Table 8), indicating that reinforcement learning described the animal's choice better. The reinforcement learning model provided a better fit to the data even after correcting for the improvement expected from the use of an additional free parameter, as evaluated by AIC and BIC (Table 9).

Both learning models described above place certain restrictions on how the value functions are updated after each trial. Reinforcement learning model, for example, updates the value function only for the target selected by the animal in a given trial, whereas belief learning model updates the value functions for all targets but these changes are independent of the animal's choice. A more general approach would be to allow the changes in the value function for each target to vary according to the animal's choice in the previous trial as well as its outcome. This model provided a significantly better fit to the data in all animals and algorithms, as indicated by the log-likelihood (Table 8) as well as AIC and BIC (Table 9). In addition, a close examination of the model parameters for this general learning model reveals some features that are consistent with simpler learning models described above (Table 7). For example, during algorithm 1 in monkey E, all the parameters associated with win targets (i.e., Δ_LW, Δ_TW, and Δ_WW) were more positive than those associated with tie targets. This implies that the value functions for a win target increased regardless of the animal's choice, and therefore is consistent with the assumptions of belief learning model. However, the predictions of the belief learning model were not generally borne out in other cases. For example, the signs for the changes in the value functions associated with
Table 5
Parameters of reinforcement learning model

Algorithm  Animal  Δ_L     Δ_T     Δ_W     α
1          E       0.4764  0.7343  1.3652  0.2805
           F       0.0427  0.1722  0.5190  0.6148
2          E       0.0202  0.0531  0.1356  0.9706
           F       0.0009  0.0150  0.0174  0.9975

Table 6
Parameters of belief learning model

Algorithm  Animal  Δ_T     Δ_W     α
1          E       0.1553  0.7449  0.2025
           F       0.0883  0.1726  0.5142
2          E       0.0365  0.1226  0.7112
           F       0.0874  0.0600  0.7105
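The parameters in Tables 5 and 6 were obtained by maximizing the likelihood of the observed choice sequences under the update rules of Section 2.5. The sketch below illustrates the reinforcement-learning variant (value-function decay, an outcome-dependent increment for the chosen target only, and softmax choice probabilities). The integer coding of choices and outcomes and the function names are assumptions made for illustration; the published fits used a function-minimization routine in Matlab rather than this code.

```python
import numpy as np

def softmax_probs(values):
    """Softmax transformation of value functions into choice probabilities."""
    e = np.exp(values - values.max())    # subtracting the max improves numerical stability
    return e / e.sum()

def reinforcement_learning_nll(params, choices, outcomes):
    """Negative log-likelihood of a choice sequence under the reinforcement
    learning model of Section 2.5.1.

    params   : (alpha, d_loss, d_tie, d_win) -- decay rate and the increments
               applied to the chosen target's value after a loss, tie, or win.
    choices  : integer array, 0 = rock, 1 = paper, 2 = scissors.
    outcomes : integer array, 0 = loss, 1 = tie, 2 = win.
    """
    alpha, d_loss, d_tie, d_win = params
    delta = np.array([d_loss, d_tie, d_win])
    values = np.zeros(3)                 # V_t(x) for rock, paper, scissors
    nll = 0.0
    for choice, outcome in zip(choices, outcomes):
        probs = softmax_probs(values)
        nll -= np.log(probs[choice])
        values = alpha * values          # all value functions decay
        values[choice] += delta[outcome] # only the chosen target receives an increment
    return nll
```

Passing this negative log-likelihood to a numerical optimizer yields parameter estimates analogous to Table 5; letting the unchosen targets receive their own outcome-dependent increments gives the belief and general learning models of Sections 2.5.2 and 2.5.3.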
tie targets (Δ_LT, Δ_TT, and Δ_WT) were not consistent. Even monkey E, for which there was some evidence for belief learning, displayed a positive value for the tie target after a loss trial, but negative values in other cases (Table 7). In addition, the change for the win target was consistently positive only after a win trial, and was negative in most other cases. These results suggest that the most consistent bias in the animal's choice behavior was the tendency to select the same target again following a win trial.

Table 7
Parameters of general learning model

Algorithm  1                 2
Animal     E        F        E        F
Δ_LT       0.2066   0.0541   0.0019   0.0095
Δ_LW       0.7013   0.0240   0.0258   0.0141
Δ_TT       0.6266   0.1907   0.0321   0.0159
Δ_TW       0.2332   0.0556   0.0290   0.0048
Δ_WT       0.0522   0.1194   0.0086   0.0029
Δ_WW       1.3430   0.4470   0.1451   0.0282
α          0.2400   0.6423   0.9708   0.9933

To determine whether the above learning models account for the data better than simpler models that incorporate only the constant biases displayed by each animal, such as unequal probabilities to choose different targets or the conditional probabilities (i.e., probability of CBR, CSBR, or CWR), the log-likelihood was computed for two such models. The first model is referred to as the Bernoulli model, since each choice is treated in this model as an independent Bernoulli trial with a constant probability for each choice. The second model introduces constant probabilities for CBR, CSBR, and CWR, and is therefore referred to as the Cournot model. The Bernoulli model has no memory, since the animal's choice in a given trial is not affected by any other events in the past, whereas the memory in the Cournot model is restricted to the last trial. The values of log-likelihood for these simple models were substantially lower than any of the learning models (Table 8), indicating that the animal's choice was influenced by the cumulative effects of previous trials integrated over multiple trials as assumed in the above learning models. Similarly, the values of AIC and BIC for these simple models showed smaller improvement compared to other learning models (Table 9).

Table 8
Improvement in the log-likelihoods of different learning models, relative to the prediction of Nash equilibrium

Algorithm          1                  2
Animal             E        F         E        F
Equilibrium (0)    46,799   43,815    45,692   43,984
Bernoulli (2)      285.8    753.0     412.3    624.1
Cournot (2)        3651.2   248.6     67.9     85.1
Reinforcement (4)  4472.8   818.6     1684.1   875.9
Belief (3)         3768.3   342.5     116.4    153.9
General (7)        4908.1   850.8     1725.2   933.3

The number of free parameters in each model is shown in parentheses.

Table 9
Changes in the Akaike's information criterion (AIC) and Bayesian information criterion (BIC), relative to the values obtained for the equilibrium model

Algorithm            1                  2
Animal               E        F         E        F
AIC/BIC
  Equilibrium (0)    93,597   87,628    91,385   87,968
AIC
  Bernoulli (2)      567.7    1502.0    820.7    1244.2
  Cournot (2)        7298.5   493.2     131.8    166.1
  Reinforcement (4)  8937.6   1629.2    3360.3   1743.9
  Belief (3)         7530.7   679.0     226.7    301.7
  General (7)        9802.2   1687.5    3436.4   1852.6
BIC
  Bernoulli (2)      550.4    1484.8    803.4    1227.0
  Cournot (2)        7281.1   476.0     114.6    148.9
  Reinforcement (4)  8903.0   1594.8    3325.7   1709.5
  Belief (3)         7504.7   653.2     200.8    275.9
  General (7)        9741.6   1627.4    3375.9   1792.4

4. Discussion

4.1. Models of learning in competitive games

Decision making in a social group is characterized by the fact that, in order to make optimal choices, players must take into consideration the predicted behavior of other decision makers in the group. However, this process may not be explicit. For example, in reinforcement learning, value functions, hence the probability of making various choices, are adjusted only by the outcome of a particular choice. Therefore, for this type of learning, the player only needs to know the outcome of its own choices, but not the choices of other players. Nevertheless, if a given game is played repeatedly, value functions are ultimately influenced by the choices of other players that affect the outcome of one's choice. In belief learning, the choices of other players can influence one's choice behavior more directly, since value functions for all choices and therefore the probabilities for choosing them can be simultaneously adjusted after each choice. In the present study, the choice behavior of monkeys playing a competitive game with three alternative choices was examined in order to gain insights into the nature of learning during decision making in non-human primates.

As in our previous studies [1,20], we first examined the choice behavior of each animal in a non-interactive and hence non-competitive situation where the computer's choice was random and independent of the animal's choice. The computer selected each target with the probability of 1/3, which corresponds to the Nash equilibrium of the rock-paper-scissors game. Against this static strategy of the
computer, the animal's average payoff is fixed regardless of its decision-making strategy. This is a property of any Nash equilibrium in a zero-sum game. Therefore, it is not surprising that under this condition, each animal displayed an idiosyncratic pattern substantially deviating from the Nash equilibrium. Both animals displayed a bias against the target located directly above the fixation target, which was designated as rock. Nevertheless, the animal's average payoff was optimal for this game. When the computer began exploiting the statistical biases displayed by the animal in its choice sequences (algorithm 1), the probabilities for choosing different targets became much more similar, although there was still significant bias against one of the targets. Interestingly, the choice behavior of the two animals diverged during the period of algorithm 1. One of the animals (monkey E) gradually increased its tendency to select the target that would beat the computer's choice in the previous trial, whereas this tendency decreased in monkey F. These changes were probably driven by factors intrinsic to the animals, since there were no visible changes in the average payoff during this period. The strategy to choose the best response to the choices of other players in the previous trial is referred to as the Cournot best response (CBR) [6-8,29]. Similarly, the remaining two choices other than the CBR are referred to as the Cournot second best response (CSBR) and the Cournot worst response (CWR). During algorithm 1, the average probability of CBR for monkey E was 0.53 (Table 4), and the maximum value for a block of 2000 trials was 0.61 (Fig. 6). The probability of CBR was lower in monkey F, and did not exceed 0.44.

The choice behaviors of both animals displayed some features consistent with the predictions of reinforcement learning, especially during algorithm 1. For example, both animals were more likely to choose the same target again if they won in the preceding trial. A relatively low probability for choosing the same target after a loss trial is also consistent with reinforcement learning. Interestingly, the probability of choosing the same target as in the preceding trial was reduced after a tie trial. Within the framework of reinforcement learning, this implies that the value function for a given target is reduced after a tie, suggesting that there was a negative reward prediction error. In addition, one of the animals (monkey E) displayed some features associated with belief learning. For example, reinforcement learning predicts that following a loss trial, the probability for the CBR and CSBR would increase similarly. Although the behavior of monkey F was more or less consistent with this pattern, monkey E was substantially more likely to select the target that corresponds to the CBR in the next trial, as predicted by belief learning models. Since only two monkeys were tested in the present study, it is difficult to determine how often monkeys would adjust their decision-making strategies according to the rules of belief learning, or whether such a tendency is a stable personality trait of a given animal that might be preserved across different types of games. This remains to be investigated in future studies.

We also explored the possibility that the animal's choice might be influenced by the cumulative effects of multiple trials in the past, rather than only by the animal's choice and its outcome in the previous trial. The results showed that regardless of the type of learning rules examined, the models based on the temporal integration of value functions accounted for the biases in the animal's choice behavior better than the simple Cournot dynamics. In addition, consistent with the analyses of conditional probabilities, the results also showed that the reinforcement learning model provided a substantially better description for all algorithms and animals than the belief learning model. However, in all cases, the best fit was provided by a model that allows the value functions for different targets to be adjusted according to the previous choice of the animal and its outcome. The essential feature of reinforcement learning is that, in a given trial, only the value function for the action selected by the player is adjusted. In contrast, belief learning allows the value functions for every possible action to be adjusted after each choice, according to the hypothetical payoffs determined by the choices of other players. The general model we examined, therefore, incorporated the features of both reinforcement learning and belief learning models, similar to the experience-weighted attraction (EWA) model proposed by Camerer and Ho [6]. In the EWA model, the changes in the value functions or attractions are determined by the monetary payoffs available to a given player. In the present study, this constraint was removed, and we allowed the parameters of our general learning model to vary more freely, since we did not have any reason to believe that the changes in the value functions are proportional to the amount of juice given to the animal. For example, following a tie trial, the probability that the animal would choose the same target was reduced in algorithm 1, suggesting that the value function was reduced for a tie target. Consistent with this result, in our general learning model, the parameter for the tie target after a tie trial was negative for algorithm 1.

The results from the present study showed that the choice behavior of monkeys during a competitive game can be described better by a reinforcement learning model than by a belief learning model. Similar to the findings in the present study, human players often display systematic deviations from the predictions of Nash equilibrium even during relatively simple games. Results from previous studies suggest that human players might rely more on reinforcement learning, rather than belief learning, during a variety of constant-sum games [12,24], although in some cases, reinforcement learning and belief learning models performed more similarly [13]. As described above, these two different types of learning models share some common features, such as the use of intermediate variables that are related to the actual probabilities of choices. The finding that a more general learning model provides a better description of the choice behavior of monkeys in the present study is consistent with the EWA model of Camerer
and Ho [6]. These results suggest that models incorporating features of both learning models might better account for the animal's behavior. Thus, the choice behavior of both humans and monkeys during a competitive game may not conform to strict assumptions of reinforcement learning or belief learning model. Whether more complex learning models are required to describe choice behaviors of humans and other animals needs to be investigated further [5].

4.2. Implications for the neural correlates of learning during decision making

Reinforcement learning algorithms seek an optimal sequence of actions in a dynamic environment. In this framework, value functions are adjusted according to the reward prediction error [39] or actual payoff [12], and a given action is selected according to a probability that is monotonically related to the value function for the same action, for example, through the softmax transformation. Although belief learning models are based on a different set of assumptions as to how the value functions are adjusted, they share a common feature with reinforcement learning models in that the probability of choosing a particular action is based on a set of hypothetical values, such as expected payoffs [7,13,24] or attractions [6], that are adjusted according to the choices of other players. Therefore, in both reinforcement and belief learning models, signals related to the actual or hypothetical outcome must be generated after each choice, and they must be temporally integrated to compute value functions or attractions that are directly related to the choice probabilities. Therefore, reinforcement learning models and other learning theories of economic decision making provide a useful framework in which to investigate the underlying neural processes of decision making [32].

Indeed, transient neural signals related to the outcome of a behavioral choice as well as signals related to the amount of expected reward have been found in various regions of the primate brain. Transient activity of dopamine neurons in the ventral midbrain can signal reward prediction errors [33]. In addition, neurons in the medial frontal cortex, such as the supplementary eye field [34,37] or the anterior cingulate cortex [18], provide information about the outcome of a behavioral choice. Nevertheless, the function of the dopamine neurons and other neurons carrying transient signals related to the choice outcome is not yet fully understood. For example, it has been recently demonstrated that dopamine neurons transmit information about the uncertainty of upcoming reward in addition to the error in predicting upcoming reward, raising the possibility that uncertainty-related activity of dopamine neurons might facilitate learning in an unpredictable environment [14]. The results from the present study provide more specific predictions for the transient signals used to update the animal's decision-making strategy during a competitive game. In our study, the probability of selecting the same target was reduced after a tie trial in algorithm 1, suggesting that a relatively small reward might produce a negative reward prediction error even when it is quite close to the average value. The results from the present study also suggest that following a particular action, value functions for multiple actions may be adjusted simultaneously, as suggested in belief learning or a more general learning model. This might be achieved by at least two different mechanisms. One possibility is that changes in the value functions for multiple actions might be reflected in the heterogeneity of dopamine neurons signaling reward prediction errors. Alternatively, a particular pattern in the activity of dopamine neurons might be interpreted differently by different neurons in the striatum or the cortex to update the value functions of multiple choices. Further neurophysiological studies are required to distinguish between these alternative scenarios.

Another type of signal that plays a central role in reinforcement learning is the value function. A value function is an estimate for the temporally discounted sum of all future rewards resulting from a course of actions selected by the animal's current decision-making strategy, and therefore differs from the expected reward resulting immediately from a given action [39]. Nevertheless, they are closely related, since for a trial with a single action, they are equivalent. Therefore, a number of brain areas in which neurons often modulate their activity according to expected reward might be involved in computing value functions and/or using such signals to select the optimal sequence of actions. These areas include the prefrontal cortex [1,11,21,28,35], the posterior parietal cortex [10,26,38], and the basal ganglia [9,19]. However, the neural correlates of value functions and how these signals might be used for the purpose of selecting an optimal sequence of actions are still largely unknown. Nevertheless, the results from the present study suggest that value functions for unselected actions might be updated according to hypothetical payoffs expected from observing the behaviors of other players in a social group. Primate models of competitive interactions as utilized in the present study might, therefore, provide a useful tool to investigate these issues.

Acknowledgments

We thank Lindsay Carr and Ted Twietmeyer for their technical assistance and John Swan-Stone for computer programming. This study was supported by the National Institutes of Health.

References

[1] D.J. Barraclough, M.L. Conroy, D. Lee, Prefrontal cortex and decision making in a mixed-strategy game, Nat. Neurosci. 7 (2004) 404–410.
[2] K. Binmore, J. Swierzbinski, C. Proulx, Does minimax work? An experimental study, Econ. J. 111 (2001) 445–464.
[3] D.V. Budescu, A. Rapoport, Subjective randomization in one- and two-person games, J. Behav. Decis. Mak. 7 (1994) 261–278.
[4] K.P. Burnham, D.R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second ed., Springer-Verlag, New York, 2002.
[5] C.F. Camerer, Behavioral Game Theory: Experiments in Strategic Interaction, Princeton Univ. Press, Princeton, 2003.
[6] C.F. Camerer, T.-H. Ho, Experience-weighted attraction learning in normal form games, Econometrica 67 (1999) 827–874.
[7] Y.-W. Cheung, D. Friedman, Individual learning in normal form games: some laboratory results, Games Econ. Behav. 19 (1997) 46–76.
[8] A. Cournot, Recherches sur les principes mathematiques de la theorie des richesses, 1838, in: N. Bacon (Ed.), Researches into the Mathematical Principles of the Theory of Wealth, English edition, Macmillan, New York, 1897.
[9] H.C. Cromwell, W. Schultz, Effects of expectations for different reward magnitudes on neuronal activity in primate striatum, J. Neurophysiol. 89 (2003) 2823–2838.
[10] M.C. Dorris, P.W. Glimcher, Activity in posterior parietal cortex is correlated with the relative subjective desirability of action, Neuron 44 (2004) 365–378.
[11] R. Elliott, J.L. Newman, O.A. Longe, J.F.W. Deakin, Differential response patterns in the striatum and orbitofrontal cortex to financial reward in humans: a parametric functional magnetic resonance imaging study, J. Neurosci. 23 (2003) 303–307.
[12] I. Erev, A.E. Roth, Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibria, Am. Econ. Rev. 88 (1998) 848–881.
[13] N. Feltovich, Reinforcement-based vs. belief-based learning models in experimental asymmetric-information games, Econometrica 68 (2000) 605–641.
[14] C.D. Fiorillo, P.N. Tobler, W. Schultz, Discrete coding of reward probability and uncertainty by dopamine neurons, Science 299 (2003) 1898–1902.
[15] D. Fudenberg, D.K. Levine, The Theory of Learning in Games, MIT Press, Cambridge, 1998.
[16] P.W. Glimcher, Decisions, Uncertainty, and the Brain, MIT Press, Cambridge, 2003.
[17] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[18] S. Ito, V. Stuphorn, J.W. Brown, J.D. Schall, Performance monitoring by the anterior cingulate cortex during saccade countermanding, Science 302 (2003) 120–122.
[19] R. Kawagoe, Y. Takikawa, O. Hikosaka, Expectation of reward modulates cognitive signals in the basal ganglia, Nat. Neurosci. 1 (1998) 411–416.
[20] D. Lee, M.L. Conroy, B.P. McGreevy, D.J. Barraclough, Reinforcement learning and decision making in monkeys during a competitive game, Cognit. Brain Res. 22 (2004) 45–58.
[21] M.I. Leon, M.N. Shadlen, Effect of expected reward magnitude on the response of neurons in the dorsolateral prefrontal cortex of the macaque, Neuron 24 (1999) 415–425.
[22] G.A. Miller, Note on the bias of information estimates, in: H. Quastler (Ed.), Information Theory in Psychology, Free Press, Glencoe, 1955, pp. 95–100.
[23] D. Mookherjee, B. Sopher, Learning behavior in an experimental matching pennies game, Games Econ. Behav. 7 (1994) 62–91.
[24] D. Mookherjee, B. Sopher, Learning and decision costs in experimental constant sum games, Games Econ. Behav. 19 (1997) 97–132.
[25] J.F. Nash, Equilibrium points in n-person games, Proc. Natl. Acad. Sci. 36 (1950) 48–49.
[26] M.L. Platt, P.W. Glimcher, Neural correlates of decision variables in parietal cortex, Nature 400 (1999) 233–238.
[27] J. Robinson, An iterative method of solving a game, Ann. Math. 54 (1951) 296–301.
[28] M.R. Roesch, C.R. Olson, Impact of expected reward on neuronal activity in prefrontal cortex, frontal and supplementary eye fields and premotor cortex, J. Neurophysiol. 90 (2003) 1766–1789.
[29] T.C. Salmon, An evaluation of econometric models of adaptive learning, Econometrica 69 (2001) 1597–1628.
[30] R. Sarin, F. Vahid, Predicting how people play games: a simple dynamic model of choice, Games Econ. Behav. 34 (2001) 104–122.
[31] Y. Sato, E. Akiyama, J.D. Farmer, Chaos in learning a simple two-person game, Proc. Natl. Acad. Sci. 99 (2002) 4748–4751.
[32] W. Schultz, Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioral ecology, Curr. Opin. Neurobiol. 14 (2004) 139–147.
[33] W. Schultz, A. Dickinson, Neuronal coding of prediction errors, Annu. Rev. Neurosci. 23 (2000) 473–500.
[34] H. Seo, D.J. Barraclough, B.P. McGreevy, D. Lee, Role of supplementary eye field in decision making during a competitive game, Program No. 87.3, 2004 Abstract Viewer/Itinerary Planner, Society for Neuroscience, Washington, DC, 2004 (Online).
[35] M. Shidara, B.J. Richmond, Anterior cingulate: single neuronal signals related to degree of reward expectancy, Science 296 (2002) 1709–1711.
[36] G.W. Snedecor, W.G. Cochran, Statistical Methods, Eighth ed., Iowa State Univ. Press, Ames, 1989.
[37] V. Stuphorn, T.L. Taylor, J.D. Schall, Performance monitoring by the supplementary eye field, Nature 408 (2000) 857–860.
[38] L.P. Sugrue, G.S. Corrado, W.T. Newsome, Matching behavior and the representation of value in the parietal cortex, Science 304 (2004) 1782–1787.
[39] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998.
[40] J. von Neumann, O. Morgenstern, The Theory of Games and Economic Behavior, Princeton Univ. Press, Princeton, 1944.
[41] P. Zak, Neuroeconomics, Philos. Trans. R. Soc. London, B 359 (2004) 1737–1748.