
Game Theory Lecture #14

Outline:

• Multiagent learning
• Regret matching
• Fictitious play
Single Agent Learning

• Setup:
– Two players: Player 1 vs. Nature
– Action sets: A1 and AN
– Payoffs: U: A1 × AN → R

                     Nature
                 Rain   No Rain
P1  Umbrella       1       0
    No umbrella    0       1

            Player 1’s Payoff

• Player repeatedly interacts with nature


– Player’s action on day t: a1(t)
– Nature’s action on day t: aN(t)
– Payoff on day t: U(a1(t), aN(t))
• Goal: Implement strategy that provides desirable guarantees with regard to average
performance
• Case 1: Stationary environment

– Nature’s choice according to a non-adaptive (fixed) prob distribution pN ∈ ∆(AN)


– Theory available to optimize average performance, e.g., reinforcement learning

• Case 2: Non-stationary environment

– Nature’s choice according to adaptive prob distribution, i.e., pN(t) ≠ pN(t − 1)


– In general, pN (t) = f (a1 (0), ..., a1 (t − 1), aN (0), ..., aN (t − 1))
– One choice: aN (t) = βN (a1 (t − 1)) (assume zero sum game)

• Question: Is a player’s environment stationary or non-stationary in a game?

Single agent learning (cont)

• Challenge: Hard to predict what nature is going to do


• Previous direction: Optimize worst-case payoffs (e.g., security strategies)
• Problem: Derived strategies might be highly inefficient given behavior of nature
• Example:

                        Nature
                 Rain   No Rain   Thunder
    Umbrella       1       0         0
P1  No umbrella    0       1         0
    Jacket        0.1     0.1       0.1

              Player 1’s Payoff
– What is security strategy?
– What is security level?
– How would answers change if there was never any thunder?

• Fact: Security strategies and values can be highly influenced by “rare” actions
• Are there “online” policies that can provide potentially better performance guarantees?
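
• As a check on the questions above, here is a minimal sketch (plain Python) that brute-forces a security strategy for the table just given by searching a grid over the player's mixed strategies. The action ordering and the grid step are choices made for this sketch, not part of the notes.

```python
import itertools

# Player 1's payoffs from the table above:
# rows = (Umbrella, No umbrella, Jacket), columns = (Rain, No Rain, Thunder).
U = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.1, 0.1, 0.1]]

def worst_case(p, payoff):
    """Expected payoff of mixed strategy p against nature's worst choice."""
    return min(sum(p[i] * payoff[i][j] for i in range(len(p)))
               for j in range(len(payoff[0])))

def security_by_grid(payoff, step=0.01):
    """Grid search over the simplex for an (approximate) max-min strategy."""
    n = int(round(1 / step))
    best_p, best_v = None, float("-inf")
    for i, j in itertools.product(range(n + 1), repeat=2):
        if i + j > n:
            continue
        p = (i * step, j * step, 1.0 - (i + j) * step)
        v = worst_case(p, payoff)
        if v > best_v:
            best_p, best_v = p, v
    return best_p, best_v

print(security_by_grid(U))                       # with the Thunder column
print(security_by_grid([row[:2] for row in U]))  # if there were never any thunder
```

Running it once with the Thunder column and once without shows how much the security strategy and security level can shift when a "rare" action is removed.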

What about regret?

• New direction: Can a player optimize “what if” scenarios?


• Definition: Player’s average payoff at day t

  Ū(t) = (1/t) ∑_{τ=1}^{t} U(a1(τ), aN(τ))

• Definition: Player’s perceived average payoff at day t if committed to fixed action and
nature was unchanged

  v̄^{a1}(t) = (1/t) ∑_{τ=1}^{t} U(a1, aN(τ))

• Definition: Player’s regret at day t for not having used action a1

  R̄^{a1}(t) = v̄^{a1}(t) − Ū(t)

• Example:

Day 1 2 3 4 5 6 ...
Player’s Decision NU U NU U NU NU ...
Nature’s Decision R NR R R NR R ...
Payoff 0 0 0 1 1 0 ...

– Ū(6)?
– v̄^U(6)?
– v̄^{NU}(6)?
– R̄^U(6)?
– R̄^{NU}(6)?
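
• A minimal sketch (plain Python) that evaluates these definitions on the six-day history above; the payoff table is the Rain/No Rain umbrella example from the start of the lecture.

```python
# Payoff table U[(player action, nature action)] and the six-day history above.
U = {("U", "R"): 1, ("U", "NR"): 0, ("NU", "R"): 0, ("NU", "NR"): 1}
player = ["NU", "U", "NU", "U", "NU", "NU"]
nature = ["R", "NR", "R", "R", "NR", "R"]

def avg_payoff(t):
    """Ū(t): average realized payoff over the first t days."""
    return sum(U[(player[k], nature[k])] for k in range(t)) / t

def perceived(a, t):
    """v̄^a(t): average payoff had the player always chosen action a."""
    return sum(U[(a, nature[k])] for k in range(t)) / t

def regret(a, t):
    """R̄^a(t) = v̄^a(t) − Ū(t)."""
    return perceived(a, t) - avg_payoff(t)

t = 6
print(avg_payoff(t), perceived("U", t), perceived("NU", t))
print(regret("U", t), regret("NU", t))
```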

Regret Matching

• Positive regret = Player could have done something better in hindsight


• Q: Is it possible to make positive regret vanish asymptotically “irrespective” of nature?
• Consider the strategy Regret Matching: At day t play strategy p(t) ∈ ∆(A1)

  p^U(t+1) = [R̄^U(t)]+ / ([R̄^U(t)]+ + [R̄^{NU}(t)]+)

  p^{NU}(t+1) = [R̄^{NU}(t)]+ / ([R̄^U(t)]+ + [R̄^{NU}(t)]+)

• Notation: [·]+ is projection to positive orthant, i.e., [x]+ = max{x, 0}


• Strategy generalizes to more than two actions
• Fact: Positive regret asymptotically vanishes irrespective of nature

  [R̄^U(t)]+ → 0 and [R̄^{NU}(t)]+ → 0

• Example revisited:

Day 1 2 3 4 5 6 ...
Player’s Decision NU U NU U NU NU ...
Nature’s Decision R NR R R NR R ...
Payoff 0 0 0 1 1 0 ...
– Regret matching strategy day 2?
– Regret matching strategy day 3?
– Regret matching strategy day 4?
– Regret matching strategy day 5?
– Regret matching strategy day 6?
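
• A minimal sketch of the two-action regret matching update, replayed on the example history above (plain Python). The uniform fallback when neither regret is positive is a convention assumed in this sketch; the notes do not specify the strategy in that case.

```python
# Same payoff table and six-day history as in the previous sketch.
U = {("U", "R"): 1, ("U", "NR"): 0, ("NU", "R"): 0, ("NU", "NR"): 1}
player = ["NU", "U", "NU", "U", "NU", "NU"]
nature = ["R", "NR", "R", "R", "NR", "R"]

def regrets(t):
    """Average regrets {a: R̄^a(t)} after the first t days."""
    avg = sum(U[(player[k], nature[k])] for k in range(t)) / t
    return {a: sum(U[(a, nature[k])] for k in range(t)) / t - avg
            for a in ("U", "NU")}

def regret_matching(t):
    """Strategy p(t+1): normalize the positive parts of the day-t regrets."""
    pos = {a: max(r, 0.0) for a, r in regrets(t).items()}
    total = sum(pos.values())
    if total == 0.0:           # no positive regret: fall back to uniform (a convention)
        return {a: 0.5 for a in pos}
    return {a: r / total for a, r in pos.items()}

for t in range(1, 6):
    print(f"day {t + 1}:", regret_matching(t))
```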

Learning in games

• Consider the following one-shot game


– Players N
– Actions Ai
– Utility functions Ui : A → R
• Consider a repeated version of the above one-shot game where at each time t ∈ {1, 2, ...},
each player i ∈ N simultaneously
– Selects a strategy pi (t) ∈ ∆(Ai )
– Selects an action ai (t) randomly according to strategy pi (t)
– Receives utility Ui (ai (t), a−i (t))
– Each player updates strategy using available information

pi(t+1) = f(a(1), a(2), ..., a(t); Ui)

• The strategy update function f (·) is referred to as the learning rule

– Ex: Cournot adjustment process

• Concern: How much information do players have access to?

– Structural form of utility function, i.e., Ui (·)?


– Action of other players, i.e., a−i (t)?
– Perceived reward for alternative actions, i.e., Ui (ai , a−i (t)) for any ai
– Utility received, Ui (a(t))

• Informational restrictions limit the class of admissible learning rules


• Goal: Provide asymptotic guarantees if all players follow a specific f (·)
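
• The repeated-game protocol above can be written as a short simulation skeleton (plain Python). This is only a sketch: the interface f(history, Ui) → pi(t+1) and the dictionary representation of mixed strategies are assumptions made here to make the structure concrete.

```python
import random

def play_repeated_game(action_sets, utilities, learning_rules, T, seed=0):
    """Repeated play: sample actions from current strategies, collect utilities,
    and let each player's learning rule f map the history (and Ui) to pi(t+1)."""
    rng = random.Random(seed)
    # Start every player at the uniform strategy pi(1).
    strategies = [{a: 1 / len(A) for a in A} for A in action_sets]
    history = []                                   # joint actions a(1), ..., a(t)
    for t in range(1, T + 1):
        joint = tuple(rng.choices(list(p), weights=list(p.values()))[0]
                      for p in strategies)
        history.append(joint)
        rewards = [u(joint) for u in utilities]    # Ui(a(t)); not every rule uses it
        # pi(t+1) = f(a(1), ..., a(t); Ui)
        strategies = [f(history, u) for f, u in zip(learning_rules, utilities)]
    return history, strategies
```

Cournot adjustment, regret matching, and fictitious play are all particular choices of the learning rule f in this loop.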

Regret matching

• Consider the learning rule f(·) where

  pi^{ai}(t+1) = [R̄i^{ai}(t)]+ / ∑_{ãi ∈ Ai} [R̄i^{ãi}(t)]+

– pi^{ai}(t+1) = Probability player i plays action ai at time t + 1
– R̄i^{ai}(t) = Regret of player i for action ai at time t
• Fact: Max regret of all players goes to 0 (think of other players as “nature”)

  [R̄i^{ai}(t)]+ → 0 for every player i and action ai
• Result restated: The behavior converges to a “no-regret” point
• Question: Where are we? Is this a NE?
• Rewrite regret in terms of the empirical frequency z(t) ∈ ∆(A):

  Ūi(t) = (1/t) ∑_{τ=1}^{t} Ui(a(τ)) = Ui(z(t))

  v̄i^{ai}(t) = (1/t) ∑_{τ=1}^{t} Ui(ai, a−i(τ)) = Ui(ai, z−i(t))

  R̄i^{ai}(t) = v̄i^{ai}(t) − Ūi(t) = Ui(ai, z−i(t)) − Ui(z(t))
• Characteristic of no-regret point
  R̄i^{ai}(t) ≤ 0 ⇔ Ui(ai, z−i(t)) ≤ Ui(z(t))
• No-regret point restated: For any player i and action ai
  Ui(ai, z−i(t)) ≤ Ui(z(t))
• No-regret point = Coarse correlated equilibrium (slightly weaker notion than correlated
equilibrium)
• A slightly modified (and more complex) version of regret matching ensures convergence to
the set of correlated equilibria.
• Theorem: If all players follow the regret matching strategy, then the empirical frequency
converges to the set of coarse correlated equilibria.
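
• A sketch of the player-i update above for an arbitrary finite action set (plain Python). The average regrets R̄i^{ai}(t) are taken as inputs, and the uniform fallback when no regret is positive is again a convention assumed here, not something stated in the notes.

```python
def regret_matching_update(avg_regrets):
    """pi(t+1): normalize the positive parts of player i's average regrets.

    avg_regrets maps each action ai in Ai to R̄i^{ai}(t).
    """
    pos = {a: max(r, 0.0) for a, r in avg_regrets.items()}
    total = sum(pos.values())
    if total == 0.0:
        # No positive regret: the formula above is undefined (0/0);
        # playing uniformly is one common convention.
        return {a: 1 / len(pos) for a in pos}
    return {a: r / total for a, r in pos.items()}

# Hypothetical regrets for a three-action player: only the positive ones get weight.
print(regret_matching_update({"x": 0.2, "y": -0.1, "z": 0.6}))  # {'x': 0.25, 'y': 0.0, 'z': 0.75}
```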

Convergence to NE?

• Recap: If all players follow the regret matching strategy then the empirical frequency
converges to the set of coarse correlated equilibria.
• This result holds irrespective of the underlying game!
• Problems:
– Predictability: Behavior will not necessarily settle down, i.e., the result only guarantees
that the empirical frequency of play will be in the set of CCE
– Efficiency: The set of CCE is much larger than the set of NE. Are CCE worse than NE in
terms of efficiency?
• Revised goal: Are there learning rules that converge to NE (as opposed to CCE) for any
game?
• Answer: No
• Theorem: There are no “natural” dynamics that lead to NE in every game (Hart, 2009).
– Natural = adaptive, simple, efficient (e.g., regret matching, Cournot, ...)
– Not natural = exhaustive search, mediator, ...
• Question: Are there natural dynamics that converge to NE for special game structures?
(e.g., zero-sum games?)

Fictitious Play

• Recall: A learning rule is of the form

pi (t + 1) = f (a(1), a(2), ..., a(t); Ui )

• Fictitious play: A learning rule where the strategy pi(t+1) is a best response to the
scenario where all players j ≠ i are selecting their action independently according to the
empirical frequency of their past decisions.
• Define empirical frequencies qi(t) as follows:

  qi^{ai}(t) = (1/t) ∑_{τ=1}^{t} I{ai(τ) = ai}

• Fictitious play: Each player best responds to empirical frequencies

  pi(t+1) ∈ arg max_{pi ∈ ∆(Ai)} Ui(pi, q−i(t))

  where

  Ui(pi, q−i(t)) = ∑_{a∈A} Ui(a1, a2, ..., an) pi^{ai} ∏_{j≠i} qj^{aj}(t)

• FP facts: Beliefs (i.e., empirical frequencies) converge to NE for

– 2-player games with 2 moves per player
– Zero-sum games with arbitrary moves per player
– Other game structures as well (more to come on this)

Fictitious play example

• Consider the following two-player zero-sum game

       L    C    R
  T   −1    0    1
  M    1   −1    0
  B    0    1   −1

• Suppose a(1) = {T, L}

– What is qrow (1)?


– What is qcol (1)?

• What is a(2)?

– What is qrow (2)?


– What is qcol (2)?

• What is a(3)?
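
• A minimal sketch that runs fictitious play on this game and can be used to check the questions above (plain Python). Assumptions made for the sketch: the matrix entries are the row player's payoffs (the column player receives the negatives, since the game is zero-sum), play starts from a(1) = (T, L) as stated, and best-response ties are broken in favor of the first action in the listed order.

```python
ROWS, COLS = ["T", "M", "B"], ["L", "C", "R"]
# Row player's payoffs; the column player receives the negative (zero-sum).
A = {("T", "L"): -1, ("T", "C"): 0, ("T", "R"): 1,
     ("M", "L"): 1, ("M", "C"): -1, ("M", "R"): 0,
     ("B", "L"): 0, ("B", "C"): 1, ("B", "R"): -1}

def empirical(history):
    """Empirical frequency of each action in the history so far."""
    return {a: history.count(a) / len(history) for a in set(history)}

def best_response(expected_payoff, actions):
    """Action maximizing expected payoff (ties broken by listed order)."""
    return max(actions, key=lambda a: (expected_payoff(a), -actions.index(a)))

def fictitious_play(T):
    row_hist, col_hist = ["T"], ["L"]              # a(1) = (T, L) as in the notes
    for t in range(2, T + 1):
        q_row, q_col = empirical(row_hist), empirical(col_hist)
        r = best_response(lambda a: sum(q_col.get(c, 0) * A[(a, c)] for c in COLS), ROWS)
        c = best_response(lambda a: sum(q_row.get(b, 0) * -A[(b, a)] for b in ROWS), COLS)
        row_hist.append(r)
        col_hist.append(c)
    return row_hist, col_hist

rows, cols = fictitious_play(6)
print(list(zip(rows, cols)))                # a(1), a(2), a(3), ...
print(empirical(rows), empirical(cols))     # beliefs q_row(6), q_col(6)
```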
