
Lecture 10: Reinforcement Learning and Links with Games
Reminders
● Feedback on your project.

References for this lecture:

1. Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in Neural Information Processing Systems 30 (2017).
2. Slides on RL: https://dpmd.ai/DeepMindxUCL21
3. Notes on the repeated Prisoner's Dilemma: https://sites.math.northwestern.edu/~clark/364/handouts/repeated.pdf

Today
1. RL setting
2. The policy gradient method:
   1. Policy gradient 101
   2. A game perspective on actor-critic!
3. Q-learning
   1. Tabular Q-learning 101
   2. A variational inequality perspective on Q-learning
Disclaimer
- Not a full course on RL!!!!
- More like a crash course.
- Check the course linked in these slides.
The RL setting
Agent and Environment
At each step t the agent:
• Receives observation o_t and reward r_t
• Executes action a_t

The environment:
• Receives action a_t
• Emits observation o_{t+1} and reward r_{t+1}

Usually, the observation is the state s_t of the environment.
In the following: o_t = s_t (we move from state to state).
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
The Goal in RL
A reward r_t is a scalar feedback signal:
- It indicates how well the agent is doing at step t.
- The agent's job is to maximize its future cumulative reward

R_t := r_t + γ r_{t+1} + γ² r_{t+2} + …

R_t is called the return and γ is called the discount factor.

Reinforcement learning is based on the reward hypothesis:

Our goal can be formalized as the outcome of maximizing a cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
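To make the return concrete, here is a minimal sketch in plain Python (no RL library assumed) that computes a discounted return from a finite list of rewards; truncating at the end of the list is an illustrative assumption, not part of the definition above.

def discounted_return(rewards, gamma):
    # Compute sum_k gamma^k * rewards[k] by backward recursion R = r + gamma * R.
    # `rewards` is the (finite, illustrative) list of rewards observed from step t on.
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: rewards [1, 0, 2] with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], 0.9))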
The Discounted reward
Why do we consider an infinite horizon with a discount factor:

R_t := r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = ∑_{k=0}^{∞} γ^k r_{t+k+1}

Why not a finite horizon T?
We will show that if T is a random variable, then the finite formulation actually recovers the infinite-horizon formulation.

R := r_1 + … + r_T = ∑_{k=0}^{T−1} r_{k+1}
The Discounted reward
Why not a finite horizon T?

R := r_1 + … + r_T = ∑_{k=0}^{T−1} r_{k+1}

Answer: a fixed finite horizon is not realistic (you do not know when you will die).
If you do know it, your behaviour will change drastically (e.g., people who are diagnosed with cancer).
Solution: a random finite horizon
ℙ(T = t) = γ^t (1 − γ)
The Discounted reward
Solution: a random finite horizon
ℙ(T = t) = γ^t (1 − γ)
Why geometric?
1. Simple.
2. Memoryless: ℙ(T = t | T ≥ t_0) = ℙ(T = t − t_0)
• Conditioned on the fact that you are still “alive”, the end of the episode does not become more likely as time passes.
3. It recovers the infinite time horizon:

𝔼_T[R] = 𝔼_T[ ∑_{t=0}^{T} r_{t+1} ] = ∑_{t=0}^{∞} r_{t+1} ℙ(T ≥ t) ∝ ∑_{t=0}^{∞} γ^t r_{t+1}
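A short worked version of the key step, as a sketch (the exact constant depends on the indexing convention for R, which is why the last relation above is stated as a proportionality):

\[
\mathbb{P}(T \ge t) \;=\; \sum_{k=t}^{\infty} \gamma^{k}(1-\gamma) \;=\; \gamma^{t},
\qquad
\mathbb{E}_T\Big[\sum_{t=0}^{T} r_{t+1}\Big]
\;=\; \sum_{t=0}^{\infty} r_{t+1}\,\mathbb{P}(T \ge t)
\;=\; \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}.
\]
% If R instead stops at r_T (sum up to T-1), the same computation yields
% gamma * sum_t gamma^t r_{t+1}: the discounted return up to a constant factor.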
Connection with games
• Similar issues in game theory:
• The equilibria are not the same when the horizon is known or not. (See next week.)
• When the horizon is known:
• you will defect at the last time step (best strategy),
• hence defect at the previous time step (and so on).
• When the horizon is not known:
• no backward induction!
• new subgame perfect equilibria (cf. next lecture),
• you can cooperate!
Conclusion on Finite Vs. Infinite Horizon
- A finite, known horizon can lead to pathological behaviour.

- Because they are rational, agents both in RL and in games can act differently close to the end of the episode (if they know the horizon).

- These undesired behaviours (in real life we usually do not know the exact horizon) are avoided with an infinite horizon and a discounted reward.

- Such an infinite horizon with discounted reward is equivalent to the cumulative reward with a random end of the episode.

Rational agent: an agent that aims at maximizing its reward.


Back to RL: Maximizing value by taking actions

Goal: select actions to maximise cumulative reward.


• Actions may have long term consequences.
• Reward may be delayed.
• Better to sacrifice immediate reward to gain more long-term
reward.
• Examples:
• Refuelling a helicopter (might prevent a crash in several hours)
• Defensive moves in a game (may help chances of winning later)
• Learning a new skill (can be costly & time-consuming at first)

A mapping from states to actions is called a policy


Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Agent, Environment, and Policy
At each step t the agent:
• Receives state s_t and reward r_t
• Executes action a_t ∼ π_θ(a | s_t)

The environment:
• Receives action a_t
• Moves to state s_{t+1} and emits reward r_{t+1}

Goal: learn a policy π_θ that maximizes the cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Policy Gradient
Policy Objective Functions
Goal: given a policy π_θ(a | s), find the best parameters θ.

θ ∈ argmax_θ 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a|s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

r_{t+1}: reward obtained for picking a_t in s_t.
d_0: distribution over the initial state.
s_{t+1} is sampled by the environment, as a function of a_t ∼ π_θ(a | s_t).

Usually one writes

J(θ) := 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a|s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Policy Objective Functions
Goal: given a policy π_θ(a | s), find the best parameters θ.

J(θ) := 𝔼_{s_0 ∼ d_0, a_t ∼ π_θ(a|s_t)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Policy-based reinforcement learning is an optimization problem:
Find θ that maximises J(θ).

Focus on stochastic gradient ascent:
- Efficient.
- Easy to implement with deep nets.
Gradients on parameterized policies
Goal: compute ∇_θ J(θ) where

J(θ) := 𝔼_{π_θ}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Problem: the trajectories depend on θ! Thus, we cannot simply switch expectation and differentiation.

∇_θ 𝔼_{r_{t+1} ∼ π_θ(a|s_t)}[r_{t+1}] = ∇_θ ∫_a r_{t+1}(a, s_t) π_θ(a | s_t) da
= ∫_a r_{t+1}(a, s_t) ∇_θ π_θ(a | s_t) da

The trick:
= ∫_a r_{t+1}(a, s_t) (∇_θ π_θ(a | s_t) / π_θ(a | s_t)) π_θ(a | s_t) da
= ∫_a r_{t+1}(a, s_t) ∇_θ log π_θ(a | s_t) ⋅ π_θ(a | s_t) da
= 𝔼_{r_{t+1} ∼ π_θ(a|s_t)}[ r_{t+1} ∇_θ log π_θ(a | s_t) ]
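As a sanity check on the log-derivative trick, here is a minimal NumPy sketch (the three-armed softmax bandit is an illustrative assumption) comparing the exact gradient of 𝔼_{a∼π_θ}[r(a)] with its score-function estimate:

import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 2.0, 0.5])        # reward of each discrete action (illustrative)
theta = np.array([0.1, -0.3, 0.2])   # logits of a softmax policy pi_theta

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

p = pi(theta)
exact = p * (r - p @ r)              # exact gradient of E_{a~pi}[r(a)] = p . r

# Score-function estimate: average of r(a) * grad_theta log pi(a) over samples
n = 200_000
actions = rng.choice(3, size=n, p=p)
grad_log_pi = np.eye(3)[actions] - p # grad_theta log pi(a) for a softmax policy
estimate = (r[actions][:, None] * grad_log_pi).mean(axis=0)

print(exact, estimate)               # should agree up to Monte-Carlo noise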
Policy Gradient Theorem
Theorem: under certain regularity assumptions on π_θ (for instance, it has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{π_θ}[ ∑_{t=1}^{∞} Q^{π_θ}(s_t, a_t) ∇ log π_θ(a_t | s_t) ] = 𝔼_{s ∼ p^{π_θ}, a ∼ π_θ}[ Q^{π_θ}(s, a) ∇ log π_θ(a | s) ]

where Q^π(s_t, a_t) := 𝔼_π[ ∑_{t=0}^{∞} γ^t r_{t+1} | S_0 = s_t, A_0 = a_t ] is called the Q-value.

Note that Q^π(s_t, a_t) = 𝔼_π[ r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | S_0 = s_t, A_0 = a_t ]
(Bellman's fixed point equation)

We start the trajectories at s_t by picking the action a_t.
Policy Gradient Theorem
Corollary: under certain regularity assumptions on π_θ (for instance, it has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{π_θ}[ ∑_{t=1}^{∞} (Q^{π_θ}(s_t, a_t) − B(s_t)) ∇ log π_θ(a_t | s_t) ]

where Q^π(s_t, a_t) := 𝔼_π[ ∑_{t=0}^{∞} γ^t r_{t+1} | S_0 = s_t, A_0 = a_t ] is called the Q-value,

and B(s_t) is any function independent of the action a_t. This function is called the baseline.
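Why subtracting a baseline does not bias the gradient, as a short sketch (it only uses that B does not depend on the action):

\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ B(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \big]
= B(s) \int_a \nabla_\theta \pi_\theta(a \mid s)\, da
= B(s)\, \nabla_\theta \int_a \pi_\theta(a \mid s)\, da
= B(s)\, \nabla_\theta 1 = 0 .
\]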
Actor Critic
Given a policy π_θ.

Problem: it is not easy to estimate its Q-value Q^{π_θ}.

Idea: use another network (the critic) to do it!

Q^{π_θ}(s_t, a_t) ≈ r_t + γ V_w(s_{t+1})

V_w is the critic (parametrised by w).

r_t: reward for picking a_t ∼ π_θ(a | s_t).   V_w(s_{t+1}): value of the next state, estimated by the critic.

Last idea: also use V_w as a baseline.


Actor-Critic Meta Algorithm
Two “players”:
• Policy π_θ, learnt by policy gradient on J(θ).
• Q-value of π_θ, estimated with the critic V_w.

Learning steps:
1. Take an action according to the policy: a_t ∼ π_θ(a | s_t).
2. Update the critic using the reward r_{t+1} caused by a_t.
3. Estimate Q^{π_θ}(s_t, a_t) using the critic V_w and update π_θ.

Interactions between the critic and the policy: a GAME!


Connecting Generative Adversarial Networks and Actor-Critic Methods
Paper by David Pfau, Oriol Vinyals (2017)

Connection between GANs and actor-critic:

Policy <-> Generator & Critic <-> Discriminator


Standard Actor-Critic
Critic: update the parameters w of V_w by TD.
Actor: update θ by policy gradient.

Standard actor-critic:
• Initialise s_0, θ, w
• for t = 0, 1, 2, … do:
  • Sample a_t ∼ π_θ(a | s_t)
  • Sample r_{t+1} and s_{t+1}
  • δ_t := r_{t+1} + γ V_w(s_{t+1}) − V_w(s_t)   [one-step TD error, or advantage]
  • w ← w + η δ_t ∇_w V_w(s_t)   [TD(0) critic update]
  • θ ← θ + α δ_t ∇_θ log π_θ(a_t | s_t)   [policy gradient step]
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
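A minimal sketch of this loop in Python (assuming a Gymnasium-style environment with discrete states and actions; the tabular critic and softmax policy are illustrative choices, not the lecture's prescription):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic(env, n_states, n_actions, gamma=0.99, eta=0.1, alpha=0.01, steps=10_000):
    # One-step actor-critic: tabular critic V_w and softmax policy pi_theta.
    theta = np.zeros((n_states, n_actions))   # policy logits
    w = np.zeros(n_states)                    # state values V_w(s)
    rng = np.random.default_rng(0)

    s, _ = env.reset()
    for _ in range(steps):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)                  # a_t ~ pi_theta(. | s_t)
        s_next, r, terminated, truncated, _ = env.step(a)   # r_{t+1}, s_{t+1}

        # One-step TD error (advantage estimate): delta = r + gamma * V(s') - V(s)
        delta = r + (0.0 if terminated else gamma * w[s_next]) - w[s]

        w[s] += eta * delta                                 # TD(0) critic update
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                               # grad of log softmax at action a
        theta[s] += alpha * delta * grad_log_pi             # policy gradient step

        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return theta, w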
Another Actor Critic
Another idea to approximate Q^{π_θ}(s_t, a_t):

directly use a neural network, Q^{π_θ}(s_t, a_t) ≈ Q_w(s_t, a_t).

Learn it with DQN:

ℒ(w) = 𝔼_{s,a,r ∼ π_θ}[ (Q_w(s, a) − y(s′, r))² ]

where y(s′, r) = r + γ max_{a′} Q_w̄(s′, a′)

and w̄ is a copy of w (no differentiation through it!).
Intuition: learn the target y.
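A minimal sketch of that loss with a frozen copy of the parameters, in plain NumPy on a tabular Q for illustration (the replay-buffer transition format is an assumption; with a neural network, the table is simply replaced by Q_w):

import numpy as np

def dqn_loss(Q, Q_frozen, batch, gamma=0.99):
    # Squared TD loss; Q_frozen is a copy of Q that receives no gradient/update.
    # batch: list of transitions (s, a, r, s_next, done) -- an assumed replay format.
    loss, grad = 0.0, np.zeros_like(Q)
    for s, a, r, s_next, done in batch:
        y = r + (0.0 if done else gamma * Q_frozen[s_next].max())  # target from frozen copy
        err = Q[s, a] - y
        loss += err ** 2
        grad[s, a] += 2.0 * err   # gradient only flows through Q_w(s, a), not through y
    return loss / len(batch), grad / len(batch)

# Periodically refresh the frozen copy: Q_frozen = Q.copy()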




Q-learning
Q-Learning: Tabular Setting
Goal: learn the Q-value of each state-action pair.

Given a state s and a policy π, we want to estimate the cumulative reward for each action:

Q^π(s, a) = 𝔼[ r_{t+1} + γ r_{t+2} + … | s_t = s, a_t = a, π ]

Fixed point equation:

Q^π(s, a) = 𝔼[ r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a, a_{t+1} ∼ π(⋅ | s_{t+1}) ]
Q-Learning: Tabular Setting
Thm: the best Q-value corresponds to the optimal policy, π*(s) = argmax_a Q*(s, a), where

Q*(s, a) = 𝔼[ r_{t+1} + γ max_a Q*(s_{t+1}, a) | s_t = s, a_t = a ]
Q-Learning: Tabular Setting
Goal: learn the optimal Q-value of each state-action pair.
Given a state s, we want to estimate the cumulative reward for each action (assuming subsequent optimal play):

Q*(s, a) = 𝔼_{(s′,r) ∼ P(⋅|a,s)}[ r + γ max_{a′} Q*(s′, a′) ]

Fixed point equation F(Q*) = 0, where

F(Q)(s, a) := Q(s, a) − 𝔼_{(s′,r) ∼ P(⋅|a,s)}[ r + γ max_{a′} Q(s′, a′) ]

Note: the max over a′ is a non-linear operation. In the tabular case Q is indexed by (s, a), and the expectation could be rewritten with a transition matrix.
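A minimal tabular sketch of the operator F with an explicit transition matrix (the array shapes and the names P, R are assumptions for illustration):

import numpy as np

def F(Q, P, R, gamma):
    # Bellman-optimality residual F(Q)(s,a) = Q(s,a) - E[ r + gamma * max_a' Q(s',a') ].
    # P[s, a, s']: transition probabilities, shape (S, A, S)
    # R[s, a]    : expected immediate reward, shape (S, A)
    # Q          : current Q table, shape (S, A)
    target = R + gamma * P @ Q.max(axis=1)   # E_{s'}[ r + gamma * max_a' Q(s', a') ]
    return Q - target

# Q* is the zero of F; e.g. value iteration repeats Q <- Q - F(Q) until convergence.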




Standard Q-Learning algorithm
• Initialize Q_0(s, a), ∀ s, a
• Initialize s_0
• For t = 0, …, T:
  • Sample a_t ∼ π(s_t) (exploration + exploitation)
  • Get r_{t+1}, s_{t+1} from the environment
  • Compute δ_t := r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)
  • Update Q: Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α δ_t

Usually π(s_t) = argmax_a Q_t(s_t, a) for exploitation and π(s_t) = U({a}) (uniform over actions) for exploration.
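A minimal sketch of this loop with ε-greedy exploration (which mixes the two behaviour-policy choices above; the Gymnasium-style environment interface is an assumption):

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1, eps=0.1, steps=50_000):
    # Tabular Q-learning with an epsilon-greedy behaviour policy.
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    s, _ = env.reset()
    for _ in range(steps):
        # Exploit argmax_a Q(s, a) with prob. 1 - eps, explore uniformly with prob. eps
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)

        # TD error: delta_t = r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)
        target = r + (0.0 if terminated else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])

        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return Q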
Q-learning as a Variational inequality

Standard Q-learning:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) − α[ Q_t(s_t, a_t) − (r_{t+1} + γ max_a Q_t(s_{t+1}, a)) ]

Q-learning as a VIP:

Q_{t+1} = Q_t − α F̃(Q_t)

with F̃(Q) a stochastic estimate of F(Q):

F̃(Q)(s, a) = { F(Q)(s_t, a_t)   if s = s_t and a = a_t
             { 0                otherwise
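Continuing the tabular sketches above, a single-sample version of this update might look as follows (here the expectation inside F is also replaced by the observed transition, which recovers the standard Q-learning update; the names are hypothetical):

import numpy as np

def F_tilde(Q, s_t, a_t, r_next, s_next, gamma):
    # Stochastic estimate of F(Q): zero everywhere except at the visited pair (s_t, a_t),
    # where the expectation is replaced by the sampled transition (r_next, s_next).
    G = np.zeros_like(Q)
    G[s_t, a_t] = Q[s_t, a_t] - (r_next + gamma * Q[s_next].max())
    return G

# VIP-style update: Q_{t+1} = Q_t - alpha * F_tilde(Q_t)
# Q -= alpha * F_tilde(Q, s_t, a_t, r_next, s_next, gamma)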
Questions
What to do with that? What is challenging in this VIP?

Q_{t+1} = Q_t − α F̃(Q_t)

- Analysis of Q-learning using the tools from the last 4 classes.

However:
- It is a stochastic VIP.
- The stochasticity depends on the policy π (which may itself depend on Q).
- Is this VIP monotone?
Conclusion
1. The discounted reward can be seen as a random finite horizon.

2. Actor-critic looks like a GAN for the RL framework:
   1. Actor <—> Generator
   2. Critic <—> Discriminator

3. Tabular Q-learning can be seen as a stochastic variational inequality!

4. We are not even multi-agent yet!
