Lecture 10 - Overview of RL With A VIP Perspective
Reinforcement Learning and Links with Games
Reminders
● Feedback on your project.
References for this lecture:
Today
1. RL setting
2. Policy gradient
3. Q-learning
   1. Tabular Q-learning 101
   2. A variational inequality perspective on Q-learning
Disclaimer
- Not a full course on RL!!!!
- More like a crash course.
- Check the course linked in these slides.
The RL setting
Agent and Environment
At each step t the agent:
• Receives observation o_t and reward r_t
• Executes action a_t
The environment:
• Receives action a_t
• Emits observation o_{t+1} and reward r_{t+1}
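A minimal sketch of this interaction loop, assuming a gym-style environment with a simplified interface (env.reset() returns an observation, env.step(a) returns (observation, reward, done)) and a random policy; the names env, RandomPolicy, and run_episode are illustrative and not part of the lecture.

import random

class RandomPolicy:
    """Stands in for the agent's policy: picks an action uniformly at random."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, observation):
        return random.randrange(self.n_actions)

def run_episode(env, policy, max_steps=1000):
    """Agent-environment loop: observe, act, receive reward, repeat."""
    obs = env.reset()                          # o_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy.act(obs)               # agent executes a_t
        obs, reward, done = env.step(action)   # env emits o_{t+1}, r_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward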
The discounted return from step t: R_t := r_t + γ r_{t+1} + γ^2 r_{t+2} + …

The total return over a horizon T: R := r_1 + … + r_T = ∑_{k=0}^{T−1} r_{k+1}
The Discounted reward
Why not a finite horizon T?

R := r_1 + … + r_T = ∑_{k=0}^{T−1} r_{k+1}

Answer: A fixed finite horizon is not realistic (you do not know when you will die).
If you did know it, your behaviour would drastically change (e.g., people who are diagnosed with cancer).

Solution: A random finite horizon
ℙ(T = t) = γ^t (1 − γ)
The Discounted reward
Solution: A random finite horizon
ℙ(T = t) = γ^t (1 − γ)
Why geometric?
1. Simple.
2. Memoryless: ℙ(T = t | T ≥ t_0) = ℙ(T = t − t_0)
   • Conditioned on the fact that you are still “alive”, the end of the episode does not become more likely as time passes.
3. It recovers the infinite time horizon (see the check below):

   𝔼_T[R] = 𝔼_T[ ∑_{t=0}^{T} r_{t+1} ] = ∑_{t=0}^{∞} r_{t+1} ℙ(T ≥ t) = ∑_{t=0}^{∞} γ^t r_{t+1}

   (note the upper limit of the first sum is T, not infinity; depending on the convention used for ℙ(T = t), the last step may only hold up to a proportionality constant)
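A quick Monte Carlo check of point 3, under the assumptions above: sampling T with ℙ(T = t) = γ^t (1 − γ) and summing the undiscounted rewards up to T reproduces, in expectation, the discounted sum. The reward sequence below is arbitrary and only for illustration.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
rewards = np.array([1.0, 0.5, 2.0, 0.0, 1.5, 3.0, 0.2, 0.8])  # arbitrary r_1, ..., r_8

# Discounted return: sum_t gamma^t * r_{t+1}
discounted = float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# Monte Carlo estimate of E_T[ sum_{t=0}^{T} r_{t+1} ] with a geometric horizon T
n_samples = 200_000
estimates = np.empty(n_samples)
for i in range(n_samples):
    # numpy's geometric counts trials until the first success (support {1, 2, ...});
    # subtracting 1 gives P(T = t) = gamma^t (1 - gamma) on {0, 1, ...}
    T = rng.geometric(1.0 - gamma) - 1
    estimates[i] = rewards[: T + 1].sum()   # undiscounted sum up to the random horizon

print(discounted, estimates.mean())  # the two numbers should be close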
Connection with games
• Similar issues in game theory:
   • The equilibria are not the same if the horizon is known or not. (See next week.)
• When the horizon is known:
   • You will defect at the last timestep (best strategy),
   • hence defect at the previous timestep (and so on).
• When the horizon is not known:
   • No backward induction!
   • New subgame-perfect equilibria (cf. next lecture).
   • You can cooperate!
Conclusion on Finite Vs. Infinite Horizon
- A finite, known horizon can lead to pathological behaviour.
- Because they are rational, agents in both RL and games can act differently close to the end of the episode (if they know the horizon).
- These undesired behaviours (in real life we usually do not know the exact horizon) are avoided with an infinite horizon and a discounted reward.
The environment:
• Receives action a_t
• Moves to state s_{t+1} and emits reward r_{t+1}
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Policy Gradient
Policy Objective Functions
Goal: given a policy π_θ(a | s), find the best parameters θ:

θ⋆ ∈ argmax_θ 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a | s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

where r_{t+1} is the reward obtained for picking a_t in s_t,
d_0 is the distribution over the initial state,
and s_{t+1} is sampled by the environment as a function of a_t ∼ π_θ(a | s_t).

Usually one writes

J(θ) := 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a | s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Policy Objective Functions
Goal: given a policy π_θ(a | s), find the best parameters θ.

Policy-based reinforcement learning is an optimization problem: find θ that maximises

J(θ) := 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a | s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Focus on stochastic gradient ascent:
- Efficient.
- Easy to implement with deep nets.
Gradients on parameterized policies
Goal: Compute ∇_θ J(θ) where

J(θ) := 𝔼_{π_θ}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Problem: the trajectories depend on θ! Thus, one cannot simply switch expectation and differentiation.

The trick:

∇_θ 𝔼_{r_{t+1} ∼ π_θ(a | s_t)}[r_{t+1}]
  = ∇_θ ∫_a r_{t+1}(a, s_t) π_θ(a | s_t) da
  = ∫_a r_{t+1}(a, s_t) ∇_θ π_θ(a | s_t) da
  = ∫_a r_{t+1}(a, s_t) (∇_θ π_θ(a | s_t) / π_θ(a | s_t)) π_θ(a | s_t) da
  = ∫_a r_{t+1}(a, s_t) ∇_θ log π_θ(a | s_t) · π_θ(a | s_t) da
  = 𝔼_{r_{t+1} ∼ π_θ(a | s_t)}[ r_{t+1} ∇_θ log π_θ(a | s_t) ]
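A numerical sanity check of the trick, assuming a softmax policy over three discrete actions and a fixed reward per action (both made up for illustration): the score-function estimate of ∇_θ 𝔼[r] is compared to the exact gradient.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.7, 1.2])   # logits of a softmax policy over 3 actions
r = np.array([1.0, 0.0, 2.0])        # reward r(a) for each action (illustrative)

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Exact gradient of E_{a ~ pi_theta}[r(a)] = sum_a r(a) pi_theta(a)
p = pi(theta)
jacobian = np.diag(p) - np.outer(p, p)      # d pi / d theta for a softmax
exact_grad = jacobian @ r

# Score-function (log-derivative) estimate: E[ r(a) * grad_theta log pi_theta(a) ]
n = 200_000
actions = rng.choice(len(p), size=n, p=p)
grad_log_pi = np.eye(len(p))[actions] - p   # grad_theta log softmax, one row per sample
estimate = (r[actions, None] * grad_log_pi).mean(axis=0)

print(exact_grad)
print(estimate)   # should match the exact gradient up to Monte Carlo noise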
Policy Gradient Theorem
Theorem: under certain regularity assumptions on π_θ (for instance, π_θ has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{π_θ}[ ∑_{t=1}^{∞} Q^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ] = 𝔼_{s ∼ p^{π_θ}, a ∼ π_θ}[ Q^{π_θ}(s, a) ∇_θ log π_θ(a | s) ]

where Q^π(s_t, a_t) := 𝔼_π[ ∑_{t=0}^{∞} γ^t r_{t+1} | S_0 = s_t, A_0 = a_t ] is called the Q-value.

Note that Q^π(s_t, a_t) = 𝔼_π[ r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | S_0 = s_t, A_0 = a_t ]: Bellman’s fixed-point equation.
The same result holds with a baseline:

∇J(θ) = 𝔼_{π_θ}[ ∑_{t=1}^{∞} ( Q^{π_θ}(s_t, a_t) − B(s_t) ) ∇_θ log π_θ(a_t | s_t) ]

where Q^π(s_t, a_t) := 𝔼_π[ ∑_{t=0}^{∞} γ^t r_{t+1} | S_0 = s_t, A_0 = a_t ] is called the Q-value,
and B(s_t) is any function independent of the action a_t. This function is called the baseline.
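A minimal REINFORCE-with-baseline sketch, assuming a tabular softmax policy and a gym-style discrete environment (env.reset() / env.step(a) returning (next_state, reward, done), with terminating episodes); here Q^{π_θ}(s_t, a_t) is replaced by the sampled return G_t, and the baseline B(s) is a running average of returns per state. None of these names come from the slides.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_with_baseline(env, n_states, n_actions,
                            episodes=2000, gamma=0.99, lr=0.1):
    theta = np.zeros((n_states, n_actions))   # logits of the tabular policy
    baseline = np.zeros(n_states)             # B(s): running average of returns
    counts = np.zeros(n_states)

    for _ in range(episodes):
        # Roll out one episode with the current policy.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(n_actions, p=softmax(theta[s]))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Compute returns G_t backwards and apply the policy-gradient update.
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            s, a = states[t], actions[t]
            counts[s] += 1
            baseline[s] += (G - baseline[s]) / counts[s]    # update B(s)
            advantage = G - baseline[s]                     # G_t - B(s_t)
            grad_log_pi = -softmax(theta[s])                # d log pi(a|s) / d theta[s]
            grad_log_pi[a] += 1.0
            theta[s] += lr * advantage * grad_log_pi        # gradient ascent on J(theta)
    return theta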
Actor Critic
Given a policy π_θ:
• r_t: reward for picking a_t ∼ π_θ(a | s_t).
• V_w(s_{t+1}): value of the next state, estimated by the critic.

Learning steps (a sketch follows the list):
1. Take an action according to the policy, a_t ∼ π_θ(a | s_t).
2. Update the critic using the reward r_{t+1} caused by a_t.
3. Estimate Q^{π_θ}(s_t, a_t) using the critic V_w and update π_θ.
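A sketch of one learning step matching the three steps above, under the same tabular assumptions as the previous sketch; the TD error r_{t+1} + γ V_w(s_{t+1}) − V_w(s_t) is used both to update the critic and as an estimate of Q^{π_θ}(s_t, a_t) − V_w(s_t) for the actor. Function and variable names are illustrative.

import numpy as np

def actor_critic_step(theta, V, s, env, gamma=0.99, lr_actor=0.01, lr_critic=0.1):
    """One step of a tabular one-step actor-critic update."""
    pi_s = np.exp(theta[s] - theta[s].max())
    pi_s /= pi_s.sum()

    # 1. Take an action according to the policy: a_t ~ pi_theta(a | s_t).
    a = np.random.choice(len(pi_s), p=pi_s)
    s_next, r, done = env.step(a)

    # 2. Update the critic using r_{t+1}: TD(0) target r + gamma * V(s_{t+1}).
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += lr_critic * td_error

    # 3. Use the TD error as an estimate of Q(s_t, a_t) - V(s_t)
    #    and update the actor with the policy gradient.
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log_pi

    return s_next, done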
Q-learning

Standard Q-learning:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) − α [ Q_t(s_t, a_t) − ( r_t + γ max_a Q_t(s_{t+1}, a) ) ]
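A minimal tabular Q-learning loop implementing the update above, again assuming a gym-style discrete environment with terminating episodes; epsilon-greedy exploration is added here since the slides do not specify how actions are chosen.

import numpy as np

def q_learning(env, n_states, n_actions,
               episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection (this choice is ours, not the slides').
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q_{t+1}(s, a) = Q_t(s, a) - alpha * [Q_t(s, a) - (r + gamma * max_a' Q_t(s', a'))]
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] -= alpha * (Q[s, a] - target)
            s = s_next
    return Q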
Q-learning as a VIP:

Q_{t+1} = Q_t − α F̃(Q_t)

F̃(Q)(s, a) = { F(Q)(s_t, a_t)   if s = s_t and a = a_t
             { 0                otherwise
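The tabular update can be written in this operator form; a sketch assuming F(Q)(s_t, a_t) = Q(s_t, a_t) − (r_t + γ max_a Q(s_{t+1}, a)), which is how the standard update above reads (this identification of F is ours, not stated explicitly on the slide).

import numpy as np

def F_tilde(Q, s, a, r, s_next, gamma=0.99, done=False):
    """Stochastic operator: nonzero only at the visited pair (s_t, a_t)."""
    out = np.zeros_like(Q)
    target = r + (0.0 if done else gamma * Q[s_next].max())
    out[s, a] = Q[s, a] - target          # F(Q)(s_t, a_t), assumed as above
    return out

# One Q-learning step in operator form: Q_{t+1} = Q_t - alpha * F_tilde(Q_t)
# Q -= alpha * F_tilde(Q, s, a, r, s_next)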
Questions
What to do with that? What is challenging in this VIP?

Q_{t+1} = Q_t − α F̃(Q_t)