
Lecture 10: Reinforcement Learning and Links with Games
Reminders
● Feedback on your project.

References for this lecture:

1. Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in Neural Information Processing Systems 30 (2017).
2. Slides on RL: https://dpmd.ai/DeepMindxUCL21
3. Notes on the repeated Prisoner's Dilemma: https://sites.math.northwestern.edu/~clark/364/handouts/repeated.pdf

Today
1. RL setting
2. The policy gradient method:
   1. Policy gradient 101
   2. A game perspective on actor-critic!
3. Q-learning
   1. Tabular Q-learning 101
   2. A variational inequality perspective on Q-learning
Disclaimer
- Not a full course on RL!!!!
- More like a crash course.
- Check the course linked in these slides.
The RL setting
Agent and Environment
At each step t the agent:
• Receives observation o_t and reward r_t
• Executes action a_t

The environment:
• Receives action a_t
• Emits observation o_{t+1} and reward r_{t+1}

Usually, the observation is the state s_t of the environment.
In the following: o_t = s_t (we move from state to state).
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
The Goal in RL
A reward r_t is a scalar feedback signal:
- It indicates how well the agent is doing at step t.
- The agent's job is to maximize its future cumulative reward

R_t := r_t + γ r_{t+1} + γ² r_{t+2} + …

R_t is called the return and γ is called the discount factor.

Reinforcement learning is based on the reward hypothesis:

Our goal can be formalized as the outcome of maximizing a cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
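To make the return concrete, here is a minimal sketch in plain Python (no RL library assumed) that computes a discounted return from a finite list of rewards; truncating at the end of the list is an illustrative assumption, not part of the definition above.

def discounted_return(rewards, gamma):
    # Compute sum_k gamma^k * rewards[k] by backward recursion R = r + gamma * R.
    # `rewards` is the (finite, illustrative) list of rewards observed from step t on.
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: rewards [1, 0, 2] with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], 0.9))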
The Discounted reward
Why do we consider an infinite horizon with a discount factor:

R_t := r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = ∑_{k=0}^{∞} γ^k r_{t+k+1}

Why not a finite horizon T?
We will show that if T is a random variable, then the finite formulation actually recovers the infinite-horizon formulation.

R := r_1 + … + r_T = ∑_{k=0}^{T−1} r_{k+1}
The Discounted reward
Why not a finite horizon T?

R := r_1 + … + r_T = ∑_{k=0}^{T−1} r_{k+1}

Answer: a fixed finite horizon is not realistic (you do not know when you will die).
If you do know it, your behaviour will change drastically (e.g., people who are diagnosed with cancer).
Solution: a random finite horizon
ℙ(T = t) = γ^t (1 − γ)
The Discounted reward
Solution: a random finite horizon
ℙ(T = t) = γ^t (1 − γ)
Why geometric?
1. Simple.
2. Memoryless: ℙ(T = t | T ≥ t_0) = ℙ(T = t − t_0)
• Conditioned on the fact that you are still “alive”, the end of the episode does not become more likely as time passes.
3. It recovers the infinite time horizon:

𝔼_T[R] = 𝔼_T[ ∑_{t=0}^{T} r_{t+1} ] = ∑_{t=0}^{∞} r_{t+1} ℙ(T ≥ t) ∝ ∑_{t=0}^{∞} γ^t r_{t+1}
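A short worked version of the key step, as a sketch (the exact constant depends on the indexing convention for R, which is why the last relation above is stated as a proportionality):

\[
\mathbb{P}(T \ge t) \;=\; \sum_{k=t}^{\infty} \gamma^{k}(1-\gamma) \;=\; \gamma^{t},
\qquad
\mathbb{E}_T\Big[\sum_{t=0}^{T} r_{t+1}\Big]
\;=\; \sum_{t=0}^{\infty} r_{t+1}\,\mathbb{P}(T \ge t)
\;=\; \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}.
\]
% If R instead stops at r_T (sum up to T-1), the same computation yields
% gamma * sum_t gamma^t r_{t+1}: the discounted return up to a constant factor.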
Connection with games
• Similar issues in game theory:
• The equilibria are not the same when the horizon is known or not. (See next week.)
• When the horizon is known:
• you will defect at the last time step (best strategy),
• hence defect at the previous time step (and so on).
• When the horizon is not known:
• no backward induction!
• new subgame perfect equilibria (cf. next lecture),
• you can cooperate!
Conclusion on Finite Vs. Infinite Horizon
- A finite, known horizon can lead to pathological behaviour.

- Because they are rational, agents both in RL and in games can act differently close to the end of the episode (if they know the horizon).

- These undesired behaviours (in real life we usually do not know the exact horizon) are avoided with an infinite horizon and a discounted reward.

- Such an infinite horizon with discounted reward is equivalent to the cumulative reward with a random end of the episode.

Rational agent: an agent that aims at maximizing its reward.


Back to RL: Maximizing value by taking actions

Goal: select actions to maximise cumulative reward.


• Actions may have long term consequences.
• Reward may be delayed.
• Better to sacrifice immediate reward to gain more long-term
reward.
• Examples:
• Refuelling a helicopter (might prevent a crash in several hours)
• Defensive moves in a game (may help chances of winning later)
• Learning a new skill (can be costly & time-consuming at first)

A mapping from states to actions is called a policy


Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Agent, Environment, and Policy
At each step t the agent:
• Receives state s_t and reward r_t
• Executes action a_t ∼ π_θ(a | s_t)

The environment:
• Receives action a_t
• Moves to state s_{t+1} and emits reward r_{t+1}

Goal: learn a policy π_θ that maximizes the cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Policy Gradient
Policy Objective Functions
Goal: given a policy π_θ(a | s), find the best parameters θ.

θ ∈ argmax_θ 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a|s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

r_{t+1}: reward obtained for picking a_t in s_t.
d_0: distribution over the initial state.
s_{t+1} is sampled by the environment, as a function of a_t ∼ π_θ(a | s_t).

Usually one writes

J(θ) := 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a|s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Policy Objective Functions
Goal: given a policy π_θ(a | s), find the best parameters θ.

J(θ) := 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a|s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Policy-based reinforcement learning is an optimization problem:
Find θ that maximises J(θ).

Focus on stochastic gradient ascent:
- Efficient.
- Easy to implement with deep nets.
Gradients on parameterized policies
Goal: compute ∇_θ J(θ) where

J(θ) := 𝔼_{π_θ}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Problem: the trajectories depend on θ! Thus, we cannot simply switch expectation and differentiation.

∇_θ 𝔼_{r_{t+1} ∼ π_θ(a|s_t)}[r_{t+1}] = ∇_θ ∫_a r_{t+1}(a, s_t) π_θ(a | s_t) da
= ∫_a r_{t+1}(a, s_t) ∇_θ π_θ(a | s_t) da

The trick:
= ∫_a r_{t+1}(a, s_t) (∇_θ π_θ(a | s_t) / π_θ(a | s_t)) π_θ(a | s_t) da
= ∫_a r_{t+1}(a, s_t) ∇_θ log π_θ(a | s_t) ⋅ π_θ(a | s_t) da
= 𝔼_{r_{t+1} ∼ π_θ(a|s_t)}[ r_{t+1} ∇_θ log π_θ(a | s_t) ]
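As a sanity check on the log-derivative trick, here is a minimal NumPy sketch (the three-armed softmax bandit is an illustrative assumption) comparing the exact gradient of 𝔼_{a∼π_θ}[r(a)] with its score-function estimate:

import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 2.0, 0.5])        # reward of each discrete action (illustrative)
theta = np.array([0.1, -0.3, 0.2])   # logits of a softmax policy pi_theta

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

p = pi(theta)
exact = p * (r - p @ r)              # exact gradient of E_{a~pi}[r(a)] = p . r

# Score-function estimate: average of r(a) * grad_theta log pi(a) over samples
n = 200_000
actions = rng.choice(3, size=n, p=p)
grad_log_pi = np.eye(3)[actions] - p # grad_theta log pi(a) for a softmax policy
estimate = (r[actions][:, None] * grad_log_pi).mean(axis=0)

print(exact, estimate)               # should agree up to Monte-Carlo noise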
Policy Gradient Theorem
Theorem: under certain regularity assumptions on π_θ (for instance, it has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{π_θ}[ ∑_{t=1}^{∞} Q^{π_θ}(s_t, a_t) ∇ log π_θ(a_t | s_t) ] = 𝔼_{s ∼ p^{π_θ}, a ∼ π_θ}[ Q^{π_θ}(s, a) ∇ log π_θ(a | s) ]

where Q^π(s_t, a_t) := 𝔼_π[ ∑_{t=0}^{∞} γ^t r_{t+1} | S_0 = s_t, A_0 = a_t ] is called the Q-value.

Note that Q^π(s_t, a_t) = 𝔼_π[ r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | S_0 = s_t, A_0 = a_t ]
(Bellman's fixed point equation)

We start the trajectories at s_t by picking the action a_t.
Policy Gradient Theorem
Corollary: under certain regularity assumptions on π_θ (for instance, it has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{π_θ}[ ∑_{t=1}^{∞} (Q^{π_θ}(s_t, a_t) − B(s_t)) ∇ log π_θ(a_t | s_t) ]

where Q^π(s_t, a_t) := 𝔼_π[ ∑_{t=0}^{∞} γ^t r_{t+1} | S_0 = s_t, A_0 = a_t ] is called the Q-value,

and B(s_t) is any function independent of the action a_t. This function is called the baseline.
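Why subtracting a baseline does not bias the gradient, as a short sketch (it only uses that B does not depend on the action):

\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ B(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \big]
= B(s) \int_a \nabla_\theta \pi_\theta(a \mid s)\, da
= B(s)\, \nabla_\theta \int_a \pi_\theta(a \mid s)\, da
= B(s)\, \nabla_\theta 1 = 0 .
\]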
Actor Critic
Given a policy π_θ.

Problem: it is not easy to estimate its Q-value Q^{π_θ}.

Idea: use another network (the critic) to do it!

Q^{π_θ}(s_t, a_t) ≈ r_t + γ V_w(s_{t+1})

V_w is the critic (parametrised by w).

r_t: reward for picking a_t ∼ π_θ(a | s_t).   V_w(s_{t+1}): value of the next state, estimated by the critic.

Last idea: also use V_w as a baseline.


Actor-Critic Meta Algorithm
Two “players”:
• Policy π_θ, learnt by policy gradient on J(θ).
• Q-value of π_θ, estimated with the critic V_w.

Learning steps:
1. Take an action according to the policy: a_t ∼ π_θ(a | s_t).
2. Update the critic using the reward r_{t+1} caused by a_t.
3. Estimate Q^{π_θ}(s_t, a_t) using the critic V_w and update π_θ.

Interactions between the critic and the policy: a GAME!


Connecting Generative Adversarial Networks and Actor-Critic Methods
Paper by David Pfau, Oriol Vinyals (2017)

Connection between GANs and actor-critic:

Policy <-> Generator & Critic <-> Discriminator


Standard Actor-Critic
Critic: update the parameters w of V_w by TD.
Actor: update θ by policy gradient.

Standard actor-critic:
• Initialise s_0, θ, w
• for t = 0, 1, 2, … do:
  • Sample a_t ∼ π_θ(a | s_t)
  • Sample r_{t+1} and s_{t+1}
  • δ_t := r_{t+1} + γ V_w(s_{t+1}) − V_w(s_t)   [one-step TD error, or advantage]
  • w ← w + η δ_t ∇_w V_w(s_t)   [TD(0) critic update]
  • θ ← θ + α δ_t ∇_θ log π_θ(a_t | s_t)   [policy gradient step]
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
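A minimal sketch of this loop in Python (assuming a Gymnasium-style environment with discrete states and actions; the tabular critic and softmax policy are illustrative choices, not the lecture's prescription):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic(env, n_states, n_actions, gamma=0.99, eta=0.1, alpha=0.01, steps=10_000):
    # One-step actor-critic: tabular critic V_w and softmax policy pi_theta.
    theta = np.zeros((n_states, n_actions))   # policy logits
    w = np.zeros(n_states)                    # state values V_w(s)
    rng = np.random.default_rng(0)

    s, _ = env.reset()
    for _ in range(steps):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)                  # a_t ~ pi_theta(. | s_t)
        s_next, r, terminated, truncated, _ = env.step(a)   # r_{t+1}, s_{t+1}

        # One-step TD error (advantage estimate): delta = r + gamma * V(s') - V(s)
        delta = r + (0.0 if terminated else gamma * w[s_next]) - w[s]

        w[s] += eta * delta                                 # TD(0) critic update
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                               # grad of log softmax at action a
        theta[s] += alpha * delta * grad_log_pi             # policy gradient step

        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return theta, w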
Another Actor Critic
Another idea to approximate Q^{π_θ}(s_t, a_t):

directly use a neural network, Q^{π_θ}(s_t, a_t) ≈ Q_w(s_t, a_t).

Learn it with DQN:

ℒ(w) = 𝔼_{s,a,r ∼ π_θ}[ (Q_w(s, a) − y(s′, r))² ]

where y(s′, r) = r + γ max_{a′} Q_w̄(s′, a′)

and w̄ is a copy of w (no differentiation through it!).
Intuition: learn the target y.
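A minimal sketch of that loss with a frozen copy of the parameters, in plain NumPy on a tabular Q for illustration (the replay-buffer transition format is an assumption; with a neural network, the table is simply replaced by Q_w):

import numpy as np

def dqn_loss(Q, Q_frozen, batch, gamma=0.99):
    # Squared TD loss; Q_frozen is a copy of Q that receives no gradient/update.
    # batch: list of transitions (s, a, r, s_next, done) -- an assumed replay format.
    loss, grad = 0.0, np.zeros_like(Q)
    for s, a, r, s_next, done in batch:
        y = r + (0.0 if done else gamma * Q_frozen[s_next].max())  # target from frozen copy
        err = Q[s, a] - y
        loss += err ** 2
        grad[s, a] += 2.0 * err   # gradient only flows through Q_w(s, a), not through y
    return loss / len(batch), grad / len(batch)

# Periodically refresh the frozen copy: Q_frozen = Q.copy()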




Q-learning
Q-Learning: Tabular Setting
Goal: learn the Q-value of each state-action pair.

Given a state s and a policy π, we want to estimate the cumulative reward for each action:

Q^π(s, a) = 𝔼[ r_{t+1} + γ r_{t+2} + … | s_t = s, a_t = a, π ]

Fixed point equation:

Q^π(s, a) = 𝔼[ r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a, a_{t+1} ∼ π(⋅ | s_{t+1}) ]
Q-Learning: Tabular Setting
Thm: the best Q-value corresponds to the optimal policy, π*(s) = argmax_a Q*(s, a), where

Q*(s, a) = 𝔼[ r_{t+1} + γ max_a Q*(s_{t+1}, a) | s_t = s, a_t = a ]
Q-Learning: Tabular Setting
Goal: learn the optimal Q-value of each state-action pair.
Given a state s, we want to estimate the cumulative reward for each action (assuming subsequent optimal play):

Q*(s, a) = 𝔼_{(s′,r) ∼ P(⋅|a,s)}[ r + γ max_{a′} Q*(s′, a′) ]

Fixed point equation F(Q*) = 0, where

F(Q)(s, a) := Q(s, a) − 𝔼_{(s′,r) ∼ P(⋅|a,s)}[ r + γ max_{a′} Q(s′, a′) ]

Note: the max over a′ is a non-linear operation. In the tabular case Q is indexed by (s, a), and the expectation could be rewritten with a transition matrix.
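A minimal tabular sketch of the operator F with an explicit transition matrix (the array shapes and the names P, R are assumptions for illustration):

import numpy as np

def F(Q, P, R, gamma):
    # Bellman-optimality residual F(Q)(s,a) = Q(s,a) - E[ r + gamma * max_a' Q(s',a') ].
    # P[s, a, s']: transition probabilities, shape (S, A, S)
    # R[s, a]    : expected immediate reward, shape (S, A)
    # Q          : current Q table, shape (S, A)
    target = R + gamma * P @ Q.max(axis=1)   # E_{s'}[ r + gamma * max_a' Q(s', a') ]
    return Q - target

# Q* is the zero of F; e.g. value iteration repeats Q <- Q - F(Q) until convergence.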




Standard Q-Learning algorithm
• Initialize Q_0(s, a), ∀ s, a
• Initialize s_0
• For t = 0, …, T:
  • Sample a_t ∼ π(s_t) (exploration + exploitation)
  • Get r_{t+1}, s_{t+1} from the environment
  • Compute δ_t := r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)
  • Update Q: Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α δ_t

Usually π(s_t) = argmax_a Q_t(s_t, a) for exploitation and π(s_t) = U({a}) (uniform over actions) for exploration.
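A minimal sketch of this loop with ε-greedy exploration (which mixes the two behaviour-policy choices above; the Gymnasium-style environment interface is an assumption):

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1, eps=0.1, steps=50_000):
    # Tabular Q-learning with an epsilon-greedy behaviour policy.
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    s, _ = env.reset()
    for _ in range(steps):
        # Exploit argmax_a Q(s, a) with prob. 1 - eps, explore uniformly with prob. eps
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)

        # TD error: delta_t = r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)
        target = r + (0.0 if terminated else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])

        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return Q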
Q-learning as a Variational inequality

Standard Q-learning:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) − α[ Q_t(s_t, a_t) − (r_{t+1} + γ max_a Q_t(s_{t+1}, a)) ]

Q-learning as a VIP:

Q_{t+1} = Q_t − α F̃(Q_t)

with F̃(Q) a stochastic estimate of F(Q):

F̃(Q)(s, a) = { F(Q)(s_t, a_t)   if s = s_t and a = a_t
             { 0                otherwise
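Continuing the tabular sketches above, a single-sample version of this update might look as follows (here the expectation inside F is also replaced by the observed transition, which recovers the standard Q-learning update; the names are hypothetical):

import numpy as np

def F_tilde(Q, s_t, a_t, r_next, s_next, gamma):
    # Stochastic estimate of F(Q): zero everywhere except at the visited pair (s_t, a_t),
    # where the expectation is replaced by the sampled transition (r_next, s_next).
    G = np.zeros_like(Q)
    G[s_t, a_t] = Q[s_t, a_t] - (r_next + gamma * Q[s_next].max())
    return G

# VIP-style update: Q_{t+1} = Q_t - alpha * F_tilde(Q_t)
# Q -= alpha * F_tilde(Q, s_t, a_t, r_next, s_next, gamma)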
Questions
What to do with that? What is challenging in this VIP?

Q_{t+1} = Q_t − α F̃(Q_t)

- Analysis of Q-learning using the tools from the last 4 classes.

However:
- It is a stochastic VIP.
- The stochasticity depends on the policy π (which may itself depend on Q).
- Is this VIP monotone?
Conclusion
1. The discounted reward can be seen as a random finite horizon.

2. Actor-critic looks like a GAN for the RL framework:
   1. Actor <—> Generator
   2. Critic <—> Discriminator

3. Tabular Q-learning can be seen as a stochastic variational inequality!

4. We are not even multi-agent yet!
