Lecture 14
Outline:
• Multiagent learning
• Regret matching
• Fictitious play
Single Agent Learning
• Setup:
– Two players: Player 1 vs. Nature
– Action sets: A1 and AN
– Payoffs: U : A1 × AN → R
                    Nature
                 Rain   No Rain
P1  Umbrella       1       0
    No umbrella    0       1

Player 1’s Payoff
Single agent learning (cont)
                    Nature
                 Rain   No Rain   Thunder
P1  Umbrella       1       0         0
    No umbrella    0       1         0
    Jacket        0.1     0.1       0.1

Player 1’s Payoff
– What is the security strategy?
– What is the security level?
– How would the answers change if there were never any thunder? (A small LP sketch follows this list.)
• Fact: Security strategies and values can be highly influenced by “rare” actions
• Are there “online” policies that can provide potentially better performance guarantees?
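One way to make the security-strategy questions above concrete is to solve the maxmin linear program numerically. The sketch below is illustrative only (it is not from the lecture); it assumes Python with numpy/scipy and uses the 3-action payoff table above.

```python
import numpy as np
from scipy.optimize import linprog

# Player 1's payoffs from the table above:
# rows = {Umbrella, No umbrella, Jacket}, columns = {Rain, No Rain, Thunder}
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.1, 0.1, 0.1]])
n, m = U.shape

# Maxmin LP: maximize v subject to sum_i p_i * U[i, j] >= v for every column j,
# with p a probability vector.  Variables x = (p_1, ..., p_n, v); linprog
# minimizes, so the objective is -v.
c = np.concatenate([np.zeros(n), [-1.0]])
A_ub = np.hstack([-U.T, np.ones((m, 1))])   # encodes v - sum_i p_i U[i, j] <= 0
b_ub = np.zeros(m)
A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
b_eq = [1.0]
bounds = [(0, 1)] * n + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, v = res.x[:n], res.x[n]
print("security strategy:", np.round(p, 3), " security level:", round(v, 3))
```

For this table the worst case is the Thunder column, so the maxmin strategy puts all its weight on Jacket and the security level is 0.1; deleting the Thunder column and re-solving gives the 50/50 Umbrella/No-umbrella mix with value 0.5, which is the sense in which a “rare” action drags the security level down.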
What about regret?
• Definition: the player’s perceived average payoff at day t had it committed to a fixed action a_1 while Nature’s choices stayed the same:

  \bar{v}^{a_1}(t) = \frac{1}{t} \sum_{\tau=1}^{t} U(a_1, a_N(\tau))
• Example:
Day 1 2 3 4 5 6 ...
Player’s Decision NU U NU U NU NU ...
Nature’s Decision R NR R R NR R ...
Payoff 0 0 0 1 1 0 ...
– Ū(6)?
– v̄^U(6)?
– v̄^{NU}(6)?
– R̄^U(6)?
– R̄^{NU}(6)?
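A sketch of the arithmetic behind these questions, reading the payoffs off the table and writing R̄^a(t) = v̄^a(t) − Ū(t) as on the regret matching slide below:

  \bar{U}(6) = \tfrac{1}{6}(0+0+0+1+1+0) = \tfrac{1}{3}
  \bar{v}^{U}(6) = \tfrac{1}{6}(1+0+1+1+0+1) = \tfrac{2}{3}, \qquad
  \bar{v}^{NU}(6) = \tfrac{1}{6}(0+1+0+0+1+0) = \tfrac{1}{3}
  \bar{R}^{U}(6) = \bar{v}^{U}(6) - \bar{U}(6) = \tfrac{1}{3}, \qquad
  \bar{R}^{NU}(6) = \bar{v}^{NU}(6) - \bar{U}(6) = 0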
Regret Matching
• Example revisited:
Day 1 2 3 4 5 6 ...
Player’s Decision NU U NU U NU NU ...
Nature’s Decision R NR R R NR R ...
Payoff 0 0 0 1 1 0 ...
– Regret matching strategy day 2?
– Regret matching strategy day 3?
– Regret matching strategy day 4?
– Regret matching strategy day 5?
– Regret matching strategy day 6?
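A possible way to compute these day-by-day strategies is sketched below. It assumes the standard Hart–Mas-Colell regret matching rule (play each action with probability proportional to the positive part of its average regret, and uniformly when no regret is positive); the lecture’s tie-breaking convention may differ.

```python
import numpy as np

# Player 1's payoffs: rows = {U, NU}, columns = Nature's {R, NR}
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
actions = ["U", "NU"]
p_idx = {"U": 0, "NU": 1}
n_idx = {"R": 0, "NR": 1}

# History from the example (days 1..6)
player = ["NU", "U", "NU", "U", "NU", "NU"]
nature = ["R", "NR", "R", "R", "NR", "R"]

for t in range(1, len(player)):          # strategy for day t+1 uses days 1..t
    past_p = [p_idx[a] for a in player[:t]]
    past_n = [n_idx[a] for a in nature[:t]]
    avg_payoff = np.mean([U[i, j] for i, j in zip(past_p, past_n)])
    # average payoff had a fixed action been played against the same Nature sequence
    v_bar = np.array([np.mean([U[a, j] for j in past_n]) for a in range(2)])
    regret = v_bar - avg_payoff
    pos = np.maximum(regret, 0.0)
    prob = pos / pos.sum() if pos.sum() > 0 else np.full(2, 0.5)
    print(f"day {t + 1}: regrets={dict(zip(actions, regret.round(3)))}, "
          f"strategy={dict(zip(actions, prob.round(3)))}")
```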
Learning in games
Regret matching
  \bar{v}_i^{a_i}(t) = \frac{1}{t} \sum_{\tau=1}^{t} U_i(a_i, a_{-i}(\tau)) = U_i(a_i, z_{-i}(t))

  \bar{R}_i^{a_i}(t) = \bar{v}_i^{a_i}(t) - \bar{U}_i(t) = U_i(a_i, z_{-i}(t)) - U_i(z(t))

where z(t) denotes the empirical frequency of joint play up to time t.
• Characteristic of a no-regret point:

  \bar{R}_i^{a_i}(t) \le 0 \iff U_i(a_i, z_{-i}(t)) \le U_i(z(t))

• No-regret point restated (see the check sketched after this list): for every player i and every action a_i,

  U_i(a_i, z_{-i}(t)) \le U_i(z(t))
• No-regret point = Coarse correlated equilibrium (slightly weaker notion than correlated
equilibrium)
• A slightly modified (and more complex) version of regret matching ensures convergence to the set of correlated equilibria.
• Theorem: If all players follow the regret matching strategy, then the empirical frequency of play converges to the set of coarse correlated equilibria.
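To make the no-regret condition concrete, here is a small sketch that checks U_i(a_i, z_{-i}(t)) ≤ U_i(z(t)) for a two-player game; the payoff matrices and the joint distribution z are made-up illustrative numbers, not from the lecture.

```python
import numpy as np

# Matching-pennies-style payoffs (illustrative): payoff[i][a1][a2] is player i's payoff.
payoff = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # player 1
          np.array([[0.0, 1.0], [1.0, 0.0]])]   # player 2
# A candidate empirical joint distribution z(t) over action profiles.
z = np.array([[0.25, 0.25],
              [0.25, 0.25]])

def is_cce(payoff, z, tol=1e-9):
    """Check U_i(a_i, z_{-i}) <= U_i(z) for every player i and action a_i."""
    for i in (0, 1):
        expected = float(np.sum(payoff[i] * z))   # U_i(z)
        z_other = z.sum(axis=i)                   # opponent's marginal z_{-i}
        for a in range(z.shape[i]):
            fixed = payoff[i][a, :] if i == 0 else payoff[i][:, a]
            if float(fixed @ z_other) > expected + tol:
                return False                      # profitable unilateral deviation
    return True

print(is_cce(payoff, z))   # True: the uniform joint distribution passes the check here
```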
Convergence to NE?
• Recap: If all players follow the regret matching strategy then the empirical frequency
converges to the set of coarse correlated equilibria.
• This result holds irrespective of the underlying game!
• Problems:
– Predictability: behavior will not necessarily settle down; the result only guarantees that the empirical frequency of play approaches the set of CCE
– Efficiency: the set of CCE is much larger than the set of NE. Are CCE worse than NE in terms of efficiency?
• Revised goal: Are there learning rules that converge to NE (as opposed to CCE) for any
game?
• Answer: No
• Theorem: There are no “natural” dynamics that lead to NE in every game (Hart, 2009).
– Natural = adaptive, simple, efficient (e.g., regret matching, Cournot best response, ...)
– Not natural = exhaustive search, mediator, ...
• Question: Are there natural dynamics that converge to NE for special game structures?
(e.g., zero-sum games?)
Fictitious Play
• Fictitious play: a learning rule in which the strategy p_i(t + 1) is a best response to the scenario where all players j ≠ i select their actions independently according to the empirical frequencies of their past decisions.
• Define the empirical frequencies q_i(t) as

  q_i^{a_i}(t) = \frac{1}{t} \sum_{\tau=1}^{t} I\{a_i(\tau) = a_i\}

and the expected payoff of a mixed strategy p_i against the opponents’ empirical frequencies as

  u_i(p_i, q_{-i}(t)) = \sum_{a \in A} u_i(a_1, a_2, \ldots, a_n) \, p_i^{a_i} \prod_{j \neq i} q_j^{a_j}(t)

so that fictitious play selects p_i(t + 1) \in \arg\max_{p_i} u_i(p_i, q_{-i}(t)).
Fictitious play example
       L    C    R
  T   −1    0    1
  M    1   −1    0
  B    0    1   −1
• What is a(2)?
• What is a(3)?
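A simulation sketch of fictitious play on this example. Two assumptions beyond what the slide states: the table is read as the row player’s payoff in a zero-sum game (so the column player minimizes it), and the initial joint action a(1) = (T, L) is chosen arbitrarily since the slide does not specify it; ties in the best response are broken toward the first action.

```python
import numpy as np

# Row player's payoffs from the example table (assumed zero-sum; the column
# player's payoff is taken to be the negative of this matrix).
A = np.array([[-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0],
              [ 0.0,  1.0, -1.0]])
rows, cols = ["T", "M", "B"], ["L", "C", "R"]

row_counts = np.zeros(3)
col_counts = np.zeros(3)
r, c = 0, 0                      # assumed initial action a(1) = (T, L)
for t in range(1, 11):
    row_counts[r] += 1           # record a(t)
    col_counts[c] += 1
    q_col = col_counts / t       # column player's empirical frequencies
    q_row = row_counts / t       # row player's empirical frequencies
    # Each player best-responds to the opponent's empirical frequencies.
    r = int(np.argmax(A @ q_col))    # row player maximizes expected payoff
    c = int(np.argmin(q_row @ A))    # column player minimizes it (zero-sum)
    print(f"a({t + 1}) = ({rows[r]}, {cols[c]})")
```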