Learning in Extensive-Form Games I. Self-Confirming Equilibria
DREW FUDENBERG AND DAVID M. KREPS
1. INTRODUCTION
played in the past. It is easy to prove that if players observe the strategies
chosen by their opponents and their beliefs come to resemble the empirical
distribution, then if behavior converges to a steady state (at least, in pure
strategies), the steady state will be a Nash equilibrium. Thus the focus of
interest in the literature related to fictitious play has been on questions
of whether behavior will converge and, to a lesser extent, on the prospects
(and modes) of convergence to mixed strategy equilibria.1
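To fix ideas, the following is a minimal sketch (ours, not drawn from this paper) of fictitious play in a two-player strategic-form game: each player observes the opponent's realized pure strategy, keeps empirical counts, and plays a myopic best reply to the empirical frequencies. The matching-pennies payoffs and all function names are illustrative assumptions.

```python
import numpy as np

def fictitious_play(payoff1, payoff2, rounds=5000):
    """Two-player fictitious play: each player best-responds to the empirical
    frequency of the opponent's past pure strategies (observed after each round)."""
    n1, n2 = payoff1.shape
    counts1 = np.ones(n1)   # counts of player 1's past actions (kept by player 2)
    counts2 = np.ones(n2)   # counts of player 2's past actions (kept by player 1)
    for _ in range(rounds):
        belief_about_2 = counts2 / counts2.sum()
        belief_about_1 = counts1 / counts1.sum()
        a1 = int(np.argmax(payoff1 @ belief_about_2))   # player 1's myopic best reply
        a2 = int(np.argmax(belief_about_1 @ payoff2))   # player 2's myopic best reply
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Matching pennies: the empirical frequencies approach the mixed equilibrium (1/2, 1/2),
# even though play itself cycles.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(fictitious_play(A, -A))
```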
This paper studies learning processes in the general style of fictitious
play under the assumption that players observe only the actions that are
actually played in a given extensive-form game, and not the actions that
their opponents would have chosen at information sets that were not
reached in the course of play. 2 Thus repeated observations of opponents'
play need not lead to correct beliefs about their full strategies, which
prescribe actions at all information sets. All that can be expected if play
converges is that players come to have correct beliefs about behavior at
information sets that lie along the path of play. Two players might persis-
tently maintain different beliefs about how a third player would respond
to a deviation from the path of play, and one player might persist in
correlated beliefs concerning the actions of other players whose informa-
tion sets lie off the path of play.
Since both of these phenomena can support non-Nash outcomes, learn-
ing processes need not lead to Nash equilibrium absent some reason (such
as experimentation with off-path actions or restrictions on the prior beliefs)
for players to have correct beliefs about off-path play. Rather, the set of
possible stable points is the set of self-confirming equilibria. 3
Throughout we work in the style of the literature on bounded rationality.
That is, we exogenously specify behavior rules for the players, rather
1 There are many papers in this literature; see Fudenberg and Kreps (1993) for a partial
bibliography.
2 In this paper we restrict attention to extensive-form stage games and the problems raised
by off-the-path information sets, but similar issues arise whenever players observe something
less than the full (pure) strategies their opponents have chosen. For example, players might
observe only their own actions and payoffs.
3 The basic idea of a self-confirming equilibrium--that players need have correct beliefs
only about those elements of play that they observe--appears in the literature as early as
Hahn's (1977) notion of conjectural equilibrium. Recent formalizations and analyses in a
game-theoretic context include Battigalli (1987), Battigalli and Guaitoli (1988), Rubinstein
and Wolinsky (1990), Fudenberg and Levine (1993a,b), and Kalai and Lehrer (1993a,b,c).
Fudenberg and Levine (1993b) and Kalai and Lehrer (1993a) concern explicit learning models;
their results present an interesting contrast with the results given here.
Our specific definition, and the name we use, is taken from Fudenberg and Levine (1993a),
with one simplification: They study a learning model with a large number of players 1,
players 2, etc., which leads them to a definition of self-confirming equilibrium that allows
the (off-the-path) beliefs of different players 1 to differ. The definition we use corresponds
to what they call unitary beliefs.
2. PRELIMINARIES
4 As long as the distribution of nature's moves is known to players, putting all of nature's moves at the start of the tree
is without loss of generality. If we had players learning nature's probabilities, complications
arise; see footnote 6 following.
5 We reserve Σ^i for the space of mixed strategies for player i, i.e., Σ^i = Δ(S^i).
6 This is why the placement of nature's moves matters when players are learning nature's probabilities. Placing
nature's moves at the start implies that players will see all of nature's moves in each round;
if some of nature's moves are placed in an unreached portion of the tree, they will not be
observed.
The behavior of each player at any date t will depend on the history of
play up to that date and, more particularly, on what each player believes
to be the joint strategies being chosen by his rivals. While a very general
formulation would specify each player's probability assessment over strat-
egy selection rules of his opponents for each and every future date, we
will make do keeping track of each player's beliefs about the joint strategies
of his rivals for the current round of play, as a function of past play.
That is, player i's assessment is the marginal distribution over terminal nodes
induced by player i's beliefs and player i's intentions.
Suppose that i has a single rival, j, who must choose between two pure
strategies, and suppose that i assesses that it is equally likely that j will
choose either pure strategy. Having a formalism for i's beliefs allows us
to distinguish between the case where i believes with certainty that j is
7 In general, subscripts will denote time and superscripts will denote players. The excep-
tional case of Z to the power t − 1 is indicated by (Z)^(t−1).
playing the corresponding mixed strategy and the case where i believes
that there is probability 1/2 that j will play one or the other pure strategy.
These two situations are equivalent in a static setting, but may have very
different implications about what i will learn from observing j.8
γ^i_{t+1}(ζ_t, z)(A) = [∫_A ρ(z | s^i, π^{-i}) γ^i_t(ζ_t)(dπ^{-i})] / [∫_{Π^{-i}} ρ(z | s^i, π^{-i}) γ^i_t(ζ_t)(dπ^{-i})].   (3.2)
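For concreteness, the following is a minimal sketch (ours) of the update in (3.2) when beliefs are supported on a finite grid of opponent behavior strategies: each candidate strategy's prior weight is multiplied by the probability it assigns to the observed terminal node given i's own pure strategy, and the weights are renormalized. The toy game (a Left/Right move by i followed, after Left only, by an Up/Down move by the opponent) and all names are illustrative assumptions.

```python
import numpy as np

# Candidate opponent behavior strategies: the probability p of playing Up.
grid = np.linspace(0.0, 1.0, 101)
prior = np.full(grid.size, 1.0 / grid.size)   # a uniform prior over the grid

def outcome_prob(z, own_strategy, p_up):
    """rho(z | s^i, pi^{-i}) for the toy game: Right ends play at z_R;
    Left gives the opponent the move, who plays Up with probability p_up."""
    if own_strategy == "Right":
        return 1.0 if z == "z_R" else 0.0
    return {"z_LU": p_up, "z_LD": 1.0 - p_up, "z_R": 0.0}[z]

def update(prior, own_strategy, z):
    """Bayes' rule as in (3.2): reweight each candidate strategy by the likelihood
    of the observed terminal node, then renormalize."""
    likelihood = np.array([outcome_prob(z, own_strategy, p) for p in grid])
    posterior = prior * likelihood
    total = posterior.sum()
    return posterior / total if total > 0 else prior   # guard against a zero-probability observation

shifted = update(prior, "Left", "z_LU")      # observing Up shifts mass toward high p
unchanged = update(prior, "Right", "z_R")    # playing Right reveals nothing about p
print(shifted @ grid, unchanged @ grid)      # posterior means: roughly 0.667 and 0.5
```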
8 Compare with the analysis in Fudenberg and Kreps (1993), where we formalized (only)
i's joint probability assessment over the pure strategy profile of his rivals, a concept closest
to assessments as defined here.
9 That is, player i assesses the sequence of selections by his rivals as exchangeable.
times a is played in the t − 1 plays of the game recorded by ζ_t. Note that
Σ_{a∈A(h)} K(a; ζ_t) = K(h; ζ_t).
(2) For all ζ_t and h ∈ H such that K(h; ζ_t) > 0, define a probability
distribution π̂(h; ζ_t) on A(h) by
    π̂(h; ζ_t)(a) = K(a; ζ_t) / K(h; ζ_t)   for all a ∈ A(h).
(3) For all infinite histories ζ, let H_p.f.(ζ) be those information sets that are reached
a strictly positive fraction of the time along the history ζ, using a limit-
infimum test; i.e., h ∈ H_p.f.(ζ) if lim inf_{t→∞} K(h; ζ_t)/t > 0.
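The bookkeeping in (1)-(3) can be summarized in a short sketch (ours, with an illustrative finite-horizon proxy for the lim inf test): K(a; ζ_t) counts plays of action a, K(h; ζ_t) counts visits to information set h, π̂(h; ζ_t) is the empirical action frequency at h, and an information set is kept only if it is reached a nonvanishing fraction of the time.

```python
from collections import Counter

def empirical_frequencies(history, freq_threshold=0.05):
    """history: a list of plays; each play is a list of (information_set, action)
    pairs recording which information sets were reached and what was chosen there."""
    K_action = Counter()    # K(a; zeta_t), indexed by (h, a)
    K_infoset = Counter()   # K(h; zeta_t)
    for play in history:
        for h, a in play:
            K_action[(h, a)] += 1
            K_infoset[h] += 1
    # pi_hat(h; zeta_t)(a) = K(a; zeta_t) / K(h; zeta_t), defined only where K(h; zeta_t) > 0.
    pi_hat = {(h, a): k / K_infoset[h] for (h, a), k in K_action.items()}
    # Positive-frequency information sets; the threshold is a finite-horizon stand-in
    # for the requirement lim inf K(h; zeta_t) / t > 0.
    t = len(history)
    positive_freq = {h for h, k in K_infoset.items() if k / t > freq_threshold}
    return pi_hat, positive_freq

history = [[("h1", "Left"), ("h2", "Up")], [("h1", "Right")], [("h1", "Left"), ("h2", "Down")]]
print(empirical_frequencies(history))
```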
what j has been doing than on observations in the far distant past. So,
for this case, neither part of asymptotic empiricism is valid.
A behavior rule for player i specifies how i will act at each
date for each history. Formally, this is given by a sequence of functions
π̂^i = (π̂^i_1, π̂^i_2, ...), where π̂^i_t has as its domain the set of date-t histories
(elements of (Z)^(t−1)) and as its range Π^i.
Given beliefs γ^i, we denote player i's expected current payoff to strategy
π^i by u^i(π^i, γ^i), which is
    u^i(π^i, γ^i) = ∫_{Π^{-i}} u^i(π^i, π^{-i}) γ^i(dπ^{-i}).
We also write u^i(π^i, π^{-i}) for i's expected payoff if he plays π^i and his
rivals play according to π^{-i}.
In our earlier work on learning in strategic-form games, we assumed
that behavior rules were asymptotically myopic with respect to the player's
beliefs in the following sense: There exists a sequence of nonnegative
numbers {ε_t} such that lim_{t→∞} ε_t = 0 and, for each t and ζ_t,
10 Ellison (1993) provides conditions under which a patient rational player can improve
on myopic behavior for a fixed population size.
FIG. 1. An extensive-form game.
4.1. Ex ante or ex post Expectations?
The first complication concerns the stage game depicted in Fig. 1. Imag-
ine player 2 entertains beliefs that player 1 will play Right with probability
p close to one. Then the strategy Up is not at all costly to player 2 ex
ante: Choosing Up is suboptimal by the ex ante expected amount 1 − p.
Thus if player 1 plays Right increasingly often, player 2 (with asymptoti-
cally empirical beliefs) sees the choice of Up as vanishingly suboptimal.
And if player 2 persists with Up, then player 1 will (probably) be more
and more inclined to choose Right.
But if player 2 is ever called upon to move, it is clear ex post that the
choice of Up will cost her one unit of payoff. We are inclined to say that
"suboptimality cost calculations" should be formulated in terms of ex
post expected payoffs, so that player 2 may not persist with Up whenever
given the chance, even if player 1 chooses Right with a frequency that
approaches one.
Notwithstanding this inclination, in this paper we formulate asymptotic
myopia using ex ante expected payoff calculations. By so doing we are
using a weaker form of asymptotic myopia, which permits more behavior
rules to qualify. After seeing the consequences of this weak assumption on
behavior rules, we might wish later to explore what happens if asymptotic
myopia is formulated on the basis of ex post expected payoffs. But that
must await another paper (by ourselves or others).
4.2. Experimentation
The second complication can also be posed in terms of the stage game
in Fig. 1, with the emphasis now on the behavior of player 1. Imagine
that player 1 believes that player 2 will choose Up in each round indepen-
dently with some probability p and player 1's initial beliefs are that p is
uniformly distributed over [0, 1]. Then in the first round player 1's marginal
assessment is that player 2 will choose Up with probability 1/2, and player
1's immediate expectations favor the choice of Right. If player 1 chooses
Right, he does not receive any information about the value of p, because
player 2 is not given the opportunity to move. So in the second round,
1's beliefs remain his prior, and again short-run considerations lead to a
choice of Right. If player 1 acts myopically in the sense of maximizing
his expected payoff in each round, given his beliefs, he will choose Right
in each round. But if player 1 does not discount the future very heavily,
he may choose to play Left for some period of time to learn the value of
p; if the data lead to the conclusion that p < 1/3 (which has prior probability
1/3 according to player 1), Left becomes short-run optimal.
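A small simulation sketch (ours; the payoff numbers are illustrative assumptions chosen so that Left is myopically better exactly when p < 1/3, not a transcription of Fig. 1) makes the trade-off concrete: under the uniform prior, pure myopia plays Right forever and learns nothing, while a short burst of experimentation with Left updates the posterior on p and can permanently reverse the choice.

```python
import numpy as np

# Illustrative payoffs for player 1: Right ends the game with payoff 2; Left yields
# 0 if player 2 plays Up and 3 if she plays Down, so Left is better exactly when p < 1/3.
PAYOFF_RIGHT = 2.0

def left_value(p):
    return 3.0 * (1.0 - p)

def simulate(true_p, explore_rounds, total_rounds=200, seed=1):
    """Player 1 starts from a uniform (Beta(1, 1)) prior on p, experiments with Left
    for `explore_rounds` rounds, and otherwise plays myopically against his posterior mean."""
    rng = np.random.default_rng(seed)
    ups, downs = 1.0, 1.0        # Beta posterior parameters for p
    total_payoff = 0.0
    for t in range(total_rounds):
        posterior_mean = ups / (ups + downs)
        play_left = t < explore_rounds or left_value(posterior_mean) > PAYOFF_RIGHT
        if play_left:
            up = rng.random() < true_p          # player 2 moves only after Left
            ups, downs = ups + up, downs + (1 - up)
            total_payoff += 0.0 if up else 3.0
        else:
            total_payoff += PAYOFF_RIGHT        # Right reveals nothing, so beliefs stay put
    return total_payoff

# With true_p = 0.1, the purely myopic rule is locked into Right forever, while a brief
# period of experimentation discovers that Left is better.
print(simulate(true_p=0.1, explore_rounds=0), simulate(true_p=0.1, explore_rounds=10))
```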
In the context of learning to play a strategic-form game, in which each
player learns his rivals' pure strategies after each round of play, a player
who believes that his own actions do not affect the subsequent choices
of his rivals will wish to play whatever strategy maximizes his immediate
expected payoff. If a player believes that his own actions will asymptoti-
cally have no impact on the actions of his rivals, then asymptotic myopia
is mandated.11 But in the current context of learning to play an extensive
game, this argument fails because the player's immediate actions can
affect what he learns about his rivals' behavior.
11 We are not being precise about what is meant by "asymptotically has no impact," so
this is somewhat loose.
12 Note, however, that players need not take every action infinitely often, not even at
information sets that are reached infinitely often. To take a simple example, in the game in
Fig. 1, imagine that player 1 chooses Left in round t with probability 1/t, and player 2
chooses Up in round t with probability 1/t, independently of what player 1 has done. Then
player 2 is almost surely given infinitely many chances to act, but the combination Left-Up
is (jointly) chosen only finitely often.
for some sequence δ_n → 0,
    f^i(s^i) − f^i(ŝ^i) ≤ δ_{K(s^i)}   for all s^i and ŝ^i,   (4.2)
where K(s^i) is the number of times s^i has been attempted. Let η_t be any
nondecreasing sequence of positive integers with η_t → ∞ and η_t/t → 0,
and let ε_t = ε'_t + δ_{η_t}. Then i (choosing as we have imagined) will satisfy
asymptotic myopia for η_t and ε_t: If s^i has been tried η_t times or more by
time t, then it can be better in terms of future value than any other strategy
by at most δ_{η_t}. Thus if it is worse than some other strategy in terms of
current value by more than ε_t, then it must be worse than this other
strategy in terms of current plus future value by at least ε'_t, and s^i will
not (therefore) be chosen.
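A minimal numerical sketch (ours) of this calendar-time argument: with the illustrative choices η_t = ⌈√t⌉ (so η_t → ∞ and η_t/t → 0) and δ_n = 1/(1 + n) as the bound on the remaining value of experimentation after n trials, the combined tolerance ε_t = ε'_t + δ_{η_t} vanishes as t grows, which is what asymptotic myopia requires.

```python
import math

def eta(t):
    """An experimentation schedule with eta_t -> infinity and eta_t / t -> 0."""
    return math.ceil(math.sqrt(t))

def delta(n):
    """Illustrative bound on the value of further experimentation after n trials (cf. (4.2))."""
    return 1.0 / (1.0 + n)

def epsilon_prime(t):
    """Illustrative underlying myopia tolerance."""
    return 1.0 / t

def epsilon(t):
    """Combined tolerance: a strategy already tried eta_t times that is suboptimal in
    current value by more than epsilon_t cannot be justified by its remaining
    value of information."""
    return epsilon_prime(t) + delta(eta(t))

for t in (10, 100, 10_000, 1_000_000):
    print(t, eta(t), round(eta(t) / t, 6), round(epsilon(t), 6))
```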
Can we justify the uniform bound in (4.2)? Suppose player i believes
at the outset that his rivals are playing according to some fixed strategy
profile. If i discounts his payoffs using some discount rate less than 1,
f^i(s^i) is the future value function of a problem very much like the classic
multi-armed bandit problem, where each pure strategy s^i ∈ S^i that i might
choose corresponds to one arm of the bandit. The problem differs from
the standard bandit model in that the returns to the various arms may be
correlated, but the solution to this "extensive-form bandit" has many of
the features of the solution to the case of independent arms.13 In this
setting, (4.2) has appeal along the following intuitive lines: The more
"arm" s^i is tried, the more is learned about the consequences of trying
this strategy, and the less there is to learn.
This intuition (and the uniform bound in (4.2)) holds for standard multi-
armed bandit problems, as long as prior beliefs are non-doctrinaire. But
it fails in general for extensive-form bandit problems, as the following
example indicates. Consider the game depicted in Fig. 2. (Only player 1's
payoffs matter, so only they are given.) Imagine that player 1 believes at
the outset that players 2, 3, and 4 will repeatedly play mixed strategies,
with p the probability with which player 2 chooses Left, q the probability
that player 3 chooses left, and r the probability that player 4 chooses
gauche. Player 1 initially believes that (p, q, r) has uniform distribution
on the unit cube, which makes Out the short-run optimal strategy. But if
pqr is low enough, In would be better for player 1, and so with small
FIG. 2. A troublesome example. Only player 1's payoffs are given.
enough discount rate, player 1 would optimally choose In, to learn about
the values of p, q, and r. Now imagine that whenever player 1 chooses
In, either Left-right or Right-left is observed, each with limiting frequency
I/2. Assuming that player 1's beliefs are strongly asymptotically empirical,
player 1 comes to conclude that p = q = 1/2. By asymptotic independence,
player 1 believes that there is a 1/4 chance that, if he chooses In, he will
(finally) learn something about the value of r. Until something is learned
about r, In remains short-run suboptimal by an amount bounded away
from zero. But as long as l's discount rate is very small, the expected
value of information obtained from In more than makes up for this short-
run suboptimality. Along the path where 2 and 3 alternate between
Right-left and Left-right, player 1 never abandons In, despite the fact
that this strategy remains distinctly suboptimal. N.B., the strategy em-
ployed by player 1, which is an optimal strategy according to dynamic
programming given l's initial beliefs, will fail to meet our definition of
asymptotic myopia; hence the definition unduly limits the amount of experi-
mentation that player 1 may undertake.
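The mechanism can be reproduced in a few lines (a sketch of ours; we assume, purely for illustration, that player 4's information set is reached only after the combination Left-left): along the alternating history the empirical marginals are p̂ = q̂ = 1/2, so a believer in independence keeps predicting probability 1/4 for Left-left, yet that combination never occurs and r is never learned.

```python
def simulate_alternation(rounds=1000):
    """Players 2 and 3 alternate between (Left, right) and (Right, left) whenever player 1
    plays In; we assume player 4 moves (so that r could be learned) only after (Left, left)."""
    count_left_2 = count_left_3 = count_left_left = 0
    for t in range(rounds):
        a2, a3 = ("Left", "right") if t % 2 == 0 else ("Right", "left")
        count_left_2 += (a2 == "Left")
        count_left_3 += (a3 == "left")
        count_left_left += (a2 == "Left" and a3 == "left")
    p_hat = count_left_2 / rounds                   # empirical frequency of Left for player 2
    q_hat = count_left_3 / rounds                   # empirical frequency of left for player 3
    predicted_independent = p_hat * q_hat           # what doctrinaire independence predicts for (Left, left)
    actual = count_left_left / rounds               # what the correlated history actually delivers
    return p_hat, q_hat, predicted_independent, actual

print(simulate_alternation())   # (0.5, 0.5, 0.25, 0.0): the 1/4 assessment is never corrected by the data
```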
Comparing this example with the result we claim for standard, indepen-
dent-arms bandit problems, it is clear where the difficulty arises, viz., from
the players' doctrinaire belief that their opponents' play is uncorrelated
(despite their non-doctrinaire beliefs over the strategy of each individual
opponent), which they maintain no matter how strongly the data suggest
otherwise. This suggests that abandonment of asymptotic independence
will solve this problem. Alternatively, we can argue that if players 2 and
3 are using a fixed strategy, then the sort of correlated history that under-
lies this example is unlikely to occur. Either of these can provide a basis for
the bound (4.2) and thus justify calendar-time bounds on experimentation
along the lines sketched above; see the concluding remarks. But, taking
note of this example, we are forced to conclude that calendar-time limita-
tions on experimentation can be sharper than we would like, at least in
some (exceptional) circumstances.
4.6. A Comment on Asymptotic Empiricism
As a final comment, we return to the definition of asymptotic empiricism
and, in particular, to the reason why (3.3) is required only for information
sets that are reached a nonvanishing fraction of the time. The question
is, What credence do players give to evidence generated at information
sets visited infinitely often but a vanishing fraction of time? If a player
believes that his rivals are playing the same strategy profile repeatedly,
he ought to put a lot of credence in this evidence. But our formulation of
asymptotic myopia suggests two reasons that such evidence might be
considered to be of lesser quality than data generated at an information
set visited a nonvanishing fraction of the time.
First, we assume players are asymptotically myopic using ex ante evalu-
ation of expected payoffs. Insofar as players assess vanishingly small
probability of reaching an information set that has been visited a vanishing
fraction of the time,15 their behavior at those information sets is relatively
unconstrained by asymptotic myopia. Thus a player may believe that the
actions of his rivals at information sets visited a vanishing frequency of
time could be capricious and hence are too irregular to be predicted by
the empirical frequencies of previous actions. 16
14 In bandit problems, any strategy that picks the short-run optimal action a fraction of
the time that approaches one, while picking each action infinitely often, will be average-
payoff optimal almost surely. Of course, maximizing average payoffs is a notoriously weak
criterion, admitting many optimal strategies.
15 This "insofar" has a purpose; this is not an implication of asymptotic empiricism. As the
example in the previous subsection shows, asymptotic independence may cause a player
to assess nonvanishing probability for reaching an information set that is never reached in
the course of play.
16 Having introduced the notion that behavior might be capricious or (more to the point)
irregular when it does not have much effect on expected payoffs, we should note that this
poses problems as well for actions taken where players are close to indifferent, e.g., in
situations where they are meant to be randomizing. Noisy payoffs, in the sense of Harsanyi's
(1973) work on purification, can be a device for avoiding this sort of problem; see, for
example, Section 7 of Fudenberg and Kreps (1993).
Note that in both definitions, the "target" profile π_* is compared with
the nonexperimental parts of each player's behavior rules and not with the
behavior rules themselves. For these definitions to have some empirical
content, and in particular for the definition of local stability to have con-
tent, we will want to show that the strategies actually played (given by
the behavior rules) resemble to some extent the target strategy.
Compared to the corresponding definitions from Fudenberg and Kreps
(1993), three things are noteworthy.
(1) In the definition of unstable profiles given here, ε must work
uniformly for all conforming models. In Fudenberg and Kreps (1993), ε
is permitted to vary with the model of behavior and beliefs. But (as in
fact noted in Fudenberg and Kreps (1993)) all the results in the earlier
paper go through for the stronger definition here.17
(2) On the other hand, here we require only that staying in the
ε-neighborhood of π_* have prior probability zero; previously
we required that this be true conditional on any partial history of previous
play. But it is easy to see, given the uniformity of e over all conforming
models, that this seemingly weaker requirement is equivalent: The dynam-
ics beginning at any partial history of play in a conforming model are
precisely the same as the dynamics beginning at date 1 in a different
conforming model.
(3) In the definition of local stability given here, there must be positive
probability of the nonexperimental part of behavior converging to the
target profile ex ante, in some conforming model. In Fudenberg and Kreps
(1993), we required that for a fixed conforming model, for every e > 0
we could find a partial history such that convergence to the target strategy
profile had conditional probability at least 1 - e, conditional on the partial
history. These are in fact equivalent; cf. Lemma A.1 of Fudenberg and
Kreps (1993).
6. SELF-CONFIRMING EQUILIBRIA
17 In fact, the proofs given in the earlier paper are entirely adequate as given.
[Figure: a three-player extensive-form game; the surviving payoff vectors are (1,1,1), (0,3,3), (3,0,0), (0,3,0), and (3,0,2).]
7. BASIC RESULTS
∫_0^1 100 q^{100} dq = 100/101
that player 3 will choose L given the opportunity, and that player 2 will
choose a with probability 100/101. So myopic optimization leads 1 to
choose A. Given his initial beliefs, player 2 assesses probability 100/101
that 1 will choose A and 100/101 that 3 will choose R, so player 2 chooses
a. The initial outcome is (A, a).
When players update their beliefs given this initial outcome, player 1
increases the mass on strategies in which 2 is likely to pick a, and player
2 increases the mass on strategies in which 1 is likely to pick A. The exact
calculations are both easy and unimportant. The important point is that,
in the second round of play, neither 1 nor 2 changes beliefs about 3, hence
neither changes her assessment of what 3 would do if given the chance
to move. Because no evidence was produced about the play of player 3,
and players 1 and 2 believe that the strategies used by their rivals are drawn
independently, there is no information in 2's play of a, for example, about
what 3 might do.
Hence in round 2, 1 chooses A and 2 chooses a. And so on forever.
The outcome in each round is (A, a). Supposing that 3's behavior is fixed,
behavior profiles have converged (trivially) to a non-Nash self-confirming
equilibrium profile. The point is very simple. Players 1 and 2 begin with
disparate beliefs on what strategy 3 is likely to use. This leads them to
behavior that keeps 3 from moving. And if 3 never moves, then 1 and 2
have no opportunity to learn what 3 would in fact do, so that their disparate
beliefs can persist.
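The following sketch (ours) mimics this belief dynamic with stylized numbers: the payoffs and decision rules are illustrative assumptions consistent with the story in the text, not a transcription of the figure. Each player updates counts only for what is observed; since the outcome (A, a) ends play before player 3 moves, the conflicting point beliefs about player 3 are never revised and (A, a) recurs forever.

```python
import numpy as np

# Illustrative payoffs: the outcome (A, a) gives players 1 and 2 a payoff of 1 each;
# a deviator sends play to player 3 and receives 3 if 3 picks the action the deviator
# hopes for, and 0 otherwise.  (These numbers are our assumption, not the figure's.)
def simulate(rounds=50):
    belief1_3_plays_L = 0.99    # player 1 is nearly sure 3 would play L
    belief2_3_plays_L = 0.01    # player 2 is nearly sure 3 would play R
    counts_2_plays_a = np.array([100.0, 1.0])   # player 1's counts for player 2's (a, d)
    counts_1_plays_A = np.array([100.0, 1.0])   # player 2's counts for player 1's (A, D)
    outcomes = []
    for _ in range(rounds):
        p_a = counts_2_plays_a[0] / counts_2_plays_a.sum()
        p_A = counts_1_plays_A[0] / counts_1_plays_A.sum()
        u1_A = p_a * 1.0                              # (simplification: ignore what happens if 2 deviates)
        u1_dev = 3.0 * (1.0 - belief1_3_plays_L)      # deviating pays only if 3 plays R, which 1 doubts
        u2_a = p_A * 1.0
        u2_dev = 3.0 * belief2_3_plays_L              # deviating pays only if 3 plays L, which 2 doubts
        a1 = "A" if u1_A >= u1_dev else "D"
        a2 = "a" if u2_a >= u2_dev else "d"
        outcomes.append((a1, a2))
        # Each player sees only the realized path; player 3 never moves after (A, a),
        # so the beliefs about player 3 are never revised.
        counts_1_plays_A += [1.0, 0.0] if a1 == "A" else [0.0, 1.0]
        counts_2_plays_a += [1.0, 0.0] if a2 == "a" else [0.0, 1.0]
    return outcomes[-5:]

print(simulate())   # (A, a) in every round: the disagreement about player 3 persists
```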
When confronted with this example, colleagues often have asked the
following question. Suppose that player 2 knows player 1's payoff function
and knows that player 1 knows his own payoffs. Then when player 2 sees
player 1 play A, she can infer that player 1 expects player 3 to play L
with substantial probability. Should this not lead player 2 to revise her
beliefs about player 3 in the direction of increasing the probability that
player 3 plays L? In the spirit of the literature on the impossibility of
players "agreeing to disagree," should players 1 and 2 not end up with
the same beliefs about player 3? While we do not preclude this sort of
indirect learning in our model, it need not take place. First, the indirect
learning supposes that players know (or have strong beliefs about) one
another's payoffs, which is consistent with our model but is not necessary
for it. If player 2 is unaware of player 1's payoffs (and vice versa), then
2 would not find it particularly surprising that 1 chooses A. Second, even
if player 2 knows player 1's payoffs (and knows that player 1 knows them),
and hence is able to infer that player 1 believes player 3 is likely to play
L, it is not clear that this will lead player 2 to revise her own beliefs. It
is true that player 2 will revise her beliefs if she views the discrepancy
between her own beliefs and player 1's as due to information that player
1 has received but player 2 has not. But player 2 might also believe that
1 has no objective reason for her beliefs and has simply made a mistake.
The "agreeing to disagree" literature ensures that all differences in beliefs
are attributable to differences in information by supposing that the players'
beliefs are consistent with Bayesian updating from a common prior distri-
bution. But assuming a common prior assumes away the key question of
learning outside of equilibrium. Indeed, the question of whether learning
leads to Nash equilibrium would seem to be a special case of the question
of whether (and when) learning leads to common posterior beliefs starting
from arbitrary priors. To emphasize this point, recall that assuming players
have a common prior distribution over one another's strategies is equiva-
18 Where might the initial correlated beliefs of player 1 come from? Suppose that before
the first play of the three-player game described above, players 2 and 3 have repeatedly
played a 2 x 2, two-player coordination game whose payoffs are exactly as in Fig. 4. That
is, initially players 2 and 3 play a game without a player 1, and then later on player 1 is
added. Suppose further that player 1 does not observe play in the initial two-player game.
It seems natural to suppose that players 2 and 3 will view their part of the game in Fig. 4
as the same as the two-player game that preceded it, and hence they will use their previous
experience to guide their play in the current game. Player 1 (and we) might assess high
probability that play in the initial two-player game has converged to one of the pure-strategy
equilibria, without being able to predict which of those two equilibria has emerged.
γ^{i'}({π^{-i'} : max_{i≠i', h∈H^i∩H(π_*)} ‖π^i(h) − π^i_*(h)‖ < ε'}) > 1 − ε',
tation will have no impact on the lim inf and lim sup of the empirical
frequencies.
To show that for information sets h ∈ H(π_*) the frequency of visits to
h has strictly positive lim inf (almost surely on A) involves an induction
on the length of the shortest path of positive probability (under π_*) from
the initial node (or, an initial node) to h. The initial information set is
certainly reached with nonvanishing frequency. Take any one-action path
of positive probability, starting at the initial node. (Let a be this action,
and let h be the information set reached.) The lim inf of the occurrence
of a in the nonexperimental portion of the behavior rule is at least
π_*(a) − ε, which is strictly positive, and since the initial information set
is reached a nonvanishing frequency of the time, the lim inf of the occur-
rence of a in the actual strategy (with experiments) is the same as the
lim inf of the occurrence of a in the nonexperimental portions. Thus the
lim inf frequency with which h is reached is at least π_*(a) − ε, which is
strictly positive. The induction step should now be apparent.
8. UNSTABLE OUTCOMES
where Z(x) is the set of all terminal successors of x and Z(x, a) is the set
of all terminal successors of (x, a). The following are easily established.
(1) For a general outcome ρ, there may exist nodes x and x' from
the same information set h and a ∈ A(h) such that ψ(ρ)(x, a) ≠
ψ(ρ)(x', a).
(2) If ρ = ρ(π) for some legitimate strategy π, then ψ(ρ)(x, a) = π(a)
for all x ∈ X(ρ) and a ∈ A(h(x)). Thus, for a given outcome ρ, if there is
a strategy π with ρ(π) = ρ, then for all nodes x, x' ∈ X(ρ) such that x
and x' come from the same information set h, and for all actions a available
at that information set, ψ(ρ)(x, a) = ψ(ρ)(x', a).
(3) Conversely, suppose that ψ(ρ)(x, a) = ψ(ρ)(x', a) for all nodes x,
x' ∈ X(ρ) such that x and x' are in the same information set and for all
a ∈ A(h(x)). Then any strategy π such that π(a) = ψ(ρ)(x, a) for
x ∈ X(ρ) and a ∈ A(h(x)) satisfies ρ = ρ(π).
(4) ψ(ρ) is continuous in ρ (on its domain of definition).
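A small sketch (ours) of the construction behind these properties: reading the definition through Z(x) and Z(x, a), ψ(ρ)(x, a) = ρ(Z(x, a))/ρ(Z(x)) is the conditional probability of continuing with a given that node x is reached, defined only at nodes that ρ reaches with positive probability. The toy tree and names are illustrative.

```python
def psi(rho, Z_x, Z_xa):
    """rho: dict mapping terminal node -> probability (an outcome).
    Z_x: dict mapping node x -> set of terminal successors of x.
    Z_xa: dict mapping (x, a) -> set of terminal successors of (x, a).
    Returns psi(rho)(x, a) = rho(Z(x, a)) / rho(Z(x)) wherever rho(Z(x)) > 0."""
    result = {}
    for (x, a), successors in Z_xa.items():
        mass_x = sum(rho.get(z, 0.0) for z in Z_x[x])
        if mass_x > 0:
            result[(x, a)] = sum(rho.get(z, 0.0) for z in successors) / mass_x
    return result

# Toy tree: from x0, action L leads to x1 (then terminal z1 or z2); action R leads to terminal z3.
rho = {"z1": 0.2, "z2": 0.3, "z3": 0.5}
Z_x = {"x0": {"z1", "z2", "z3"}, "x1": {"z1", "z2"}}
Z_xa = {("x0", "L"): {"z1", "z2"}, ("x0", "R"): {"z3"}, ("x1", "u"): {"z1"}, ("x1", "d"): {"z2"}}
print(psi(rho, Z_x, Z_xa))   # e.g. psi(x0, L) = 0.5 and psi(x1, u) = 0.4
```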
Note that Π(ρ) is the empty set for a given ρ when the antecedent of (3) is
violated; when the antecedent of (3) is satisfied, then (3) gives an alternate
characterization of Π(ρ). It is clear from this alternate characterization
that Π(ρ) is closed. Moreover, if π, π' ∈ Π(ρ), then π is identical to π'
at all h ∈ H(ρ). Finally, Π(ρ) has a product structure: If π, π' ∈ Π(ρ)
and we construct a strategy which composes π^i for some players and π'^j
for the rest, this third strategy will also lie in Π(ρ).
Fix an outcome ρ_*, whose stability (more precisely, whose unstability)
is to be investigated. If Π(ρ_*) is empty, then there exists some ε > 0 such
that ‖ρ(π) − ρ_*‖ > ε for every strategy π, and thus ρ_* must be unstable
by definition.19 Thus we can assume w.l.o.g. that, for the given ρ_*, Π(ρ_*)
is nonempty. Let π_* be any (arbitrarily selected) member of Π(ρ_*). Note
that π_* is completely determined by ρ_* at information sets from H(ρ_*).
We are done if we show that ρ_* is unstable under the assumption
that Π(ρ_*) contains no self-confirming equilibrium strategy profile. First,
Lemma 7.1 is extended:
LEMMA 8.1. If Π(ρ_*) contains no self-confirming equilibrium profiles,
there exists an ε' > 0 and a player i such that for all beliefs γ^i such that
there exists an s^i such that u^i(s^i, γ^i) ≥ u^i(π^i, γ^i) + ε' for all π^i such that
In other words, this says that if player i believes that others are likely
to play in a manner that would give the outcome ρ_*, then i will prefer
some strategy that causes the outcome to differ from ρ_*.
Proof of Lemma 8.1. Suppose to the contrary that for each integer n,
for each player i there exist beliefs γ^i_n and a strategy π^i_n such that (8.1)
and (8.2) hold for ε' = 1/n and such that u^i(s^i, γ^i_n) < u^i(π^i_n, γ^i_n) + 1/n for
all s^i. Let π_n be the profile where each player plays π^i_n. Since the probabil-
19 Suppose there exists π_n such that ‖ρ(π_n) − ρ_*‖ ≤ 1/n for each n. Take a subsequence
along which π_n converges to, say, π_*, and use the continuity of ρ(·) and ψ to derive a
contradiction.
9. CONCLUDING REMARKS
20 Moreover, Nash equilibria of the constrained game are exactly the ε-constrained equilib-
ria that Selten uses to define perfection. A trembling-hand perfect equilibrium is the limit
point of ε-constrained equilibria as ε converges to zero. Thus for small ε, Nash profiles that
are not approximately perfect would also be unstable.
21 For the criterion of unstability, play lies within a small neighborhood of such behavior.
where in the expression on the left-hand side, the second argument π̂^{-i}
is shorthand for beliefs that put a unit mass on −i using the strategy π̂^{-i},
and (2) π̂ agrees with π_* at all information sets h ∈ H(π_*).
To find π̂, we use Kuhn's theorem (Kuhn, 1953), which establishes a
correspondence between behaviorally mixed strategies and mixed strat-
egies.
Specifically, recalling that Π^i is the set of behavior strategies of player
i and letting Δ(S^i) be the space of mixed strategies for player i, define
Y^i : Π^i → Δ(S^i) by
    Y^i(π^i)(s^i) = ∏_{h∈H^i} π^i(s^i(h)).
For every π^i ∈ Π^i, Y^i(π^i) is one among many mixed strategies equivalent
to π^i in the sense that, whatever −i does, the distribution over endpoints
if i uses Y^i(π^i) is identical with the distribution if i uses π^i. (This specific
choice of Y^i(π^i) corresponds to independent randomizations by a player
at each of his information sets.)
We also define Φ^i : Δ(S^i) → Π^i such that for every σ^i ∈ Δ(S^i), Φ^i(σ^i)
is equivalent to σ^i. This takes a bit more work.
For each information set h ∈ H^i, let H^i_<(h) = {h' ∈ H^i : h' < h} and let
H^i_≮(h) = {h' ∈ H^i : h' ≠ h, h' ≮ h}. (Because the game has perfect recall,
the notion of precedence among information sets of a single player is well
defined.) Let S^i(h) be all strategies by i that do not preclude h. That is,
s^i ∈ S^i(h) if, for every h' ∈ H^i_<(h), s^i(h') specifies the single action in
A(h'), denoted by a(h', h), that allows play to continue to h. Otherwise,
s^i is unrestricted. That is, if we define S̄^i(h) = ∏_{h'∈H^i_≮(h)} A(h'), then there
is an obvious one-to-one correspondence between S^i(h) and A(h) × S̄^i(h).
For a ∈ A(h) for h ∈ H^i, define
    Φ^i(σ^i)(a) = [Σ_{s^i : s^i∈S^i(h), s^i(h)=a} σ^i(s^i)] / [Σ_{s^i : s^i∈S^i(h)} σ^i(s^i)].   (A.2)
for i = 1, 2. That is, we construct player i's strategy out of −i's beliefs;
for each π^i in the support of γ^{-i}, we pass to the corresponding mixed
strategy, average over γ^{-i}, and then reconvert to a behaviorally mixed
strategy.
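A minimal sketch (ours) of the two conversions for a player with two information sets (an illustrative game shape, not one from the paper): Y^i maps a behavior strategy to the mixed strategy given by independent randomization across information sets, and the map in (A.2) recovers a behavior strategy from a mixed strategy by conditioning, at each information set, on the pure strategies that do not preclude it.

```python
from itertools import product

# Player i has two information sets: h1 with actions {A, B}, and h2, reached only
# after A at h1, with actions {C, D}.  A pure strategy assigns an action to each.
ACTIONS = {"h1": ["A", "B"], "h2": ["C", "D"]}
PURE = [dict(zip(ACTIONS, combo)) for combo in product(*ACTIONS.values())]

def Y(behavior):
    """Kuhn map behavior -> mixed: independent randomization at each information set."""
    return {tuple(s.values()): behavior["h1"][s["h1"]] * behavior["h2"][s["h2"]] for s in PURE}

def phi(mixed):
    """The map of (A.2), mixed -> behavior.  The strategies not precluding h2 are
    those choosing A at h1; h1 itself is never precluded."""
    behavior = {}
    for h, acts in ACTIONS.items():
        allowed = [s for s in PURE if h == "h1" or s["h1"] == "A"]
        denom = sum(mixed[tuple(s.values())] for s in allowed)
        behavior[h] = {a: sum(mixed[tuple(s.values())] for s in allowed if s[h] == a) / denom
                       for a in acts}
    return behavior

pi = {"h1": {"A": 0.6, "B": 0.4}, "h2": {"C": 0.25, "D": 0.75}}
sigma = Y(pi)
print(sigma)        # probabilities of the pure strategies (A,C), (A,D), (B,C), (B,D)
print(phi(sigma))   # recovers the original behavior strategy
```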
We claim that π̂^i agrees with π_* at information sets h ∈ H(π_*). To this
end, fix some information set h ∈ H(π_*) and assume that player i moves
at h.
Let Π̄^i(h) = ∏_{h'∈H^i_≮(h)} Δ(A(h')); that is, π̄^i specifies behavior by i
at information sets in H^i_≮(h). Since h ∈ H(π_*), so is h' for every
h' ∈ H^i_<(h). Since beliefs γ^{-i}_* are not disconfirmed (see part (b) of the
definition of a self-confirming equilibrium), every π^i in the support of
γ^{-i}_* agrees with π^i_* on h and on h' ∈ H^i_<(h). We can therefore think of
γ^{-i}_* as the product of a probability distribution γ̄^{-i} on Π̄^i(h) and a degenerate
measure (at π_*) on the other components of a full behavior strategy. With
this definition, for any s^i ∈ S^i(h) we can write
    Y^i(π^i)(s^i) = [∏_{h'∈H^i_<(h)} π^i_*(a(h', h))] · π^i_*(s^i(h)) · ∏_{h'∈H^i_≮(h)} π̄^i(s^i(h')),
which, letting K be the constant ∏_{h'∈H^i_<(h)} π^i_*(a(h', h)), is
    K · π^i_*(s^i(h)) · ∏_{h'∈H^i_≮(h)} π̄^i(s^i(h')).
Averaging over the beliefs γ̄^{-i} and summing over the strategies in S^i(h), the denominator of (A.2) becomes
    K Σ_{(a', s̄^i)∈A(h)×S̄^i(h)} ∫_{Π̄^i(h)} π^i_*(a') ∏_{h'∈H^i_≮(h)} π̄^i(s̄^i(h')) γ̄^{-i}[dπ̄^i]
    = K Σ_{a'∈A(h)} π^i_*(a') {Σ_{s̄^i∈S̄^i(h)} ∫_{Π̄^i(h)} ∏_{h'∈H^i_≮(h)} π̄^i(s̄^i(h')) γ̄^{-i}[dπ̄^i]},
or K · K' · Σ_{a'∈A(h)} π^i_*(a') = K · K', where K' denotes the term in braces; the numerator of (A.2) is, by the same computation, K · K' · π^i_*(a).
Dividing the numerator by the denominator cancels the K · K' terms, and
we are left with π̂^i(a) = π^i_*(a).
The rest is easy. Because π_* is a self-confirming equilibrium relative
to the γ^i_*,
Thus by (A.1),
Since π^i_* is identical to π̂^i at all information sets that are hit with positive
probability (under π_*, hence under π̂), we know that
ACKNOWLEDGMENTS
We are grateful to Robert Anderson, Robert Aumann, Ehud Kalai, David Levine, two
referees, and the associate editor for helpful comments. We thank IDEI, Toulouse, and the
Institute for Advanced Studies, Tel-Aviv University, for their hospitality while this research
was being conducted. The financial assistance of the National Science Foundation (Grants
SES 88-08204, SES 90-08770, SES 89-08402, and SES 92-08954) and the John Simon Guggen-
heim Foundation is gratefully acknowledged.
REFERENCES