drl_v5
Policy Gradient
Actor-Critic
Reward Shaping
Example: Playing Video Game
• Space Invaders
• Termination: all the aliens are killed, or your spaceship is destroyed.
[Game screen: the score is the reward; kill the aliens; shields; the "fire" action]
Example: Playing Video Game
[Diagram: the Environment sends an observation (the game screen) to the Actor; the Actor outputs the action "right"; the Environment returns reward = 0]
Example: Playing Video Game
Find an actor maximizing expected reward.
[Diagram: the Actor outputs the action "fire"; the Environment returns reward = 5 if an alien is killed]
Example: Learning to play Go
[Diagram: the observation is the board position; the action is the next move]
Example: Learning to play Go
Find an actor maximizing expected reward.
reward = 0 in most cases
If win, reward = 1
If loss, reward = -1
Machine Learning is so simple ……
[Network diagram: input = the game pixels; output = a distribution over the actions, e.g., fire 0.1]
Classification Task!!!
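To make the "classification task" view concrete, here is a minimal sketch of such an actor network, assuming PyTorch, a flattened pixel vector as input, and three actions; the layer sizes and action count are illustrative choices, not from the slides.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: maps an observation (game pixels) to scores over the actions."""
    def __init__(self, obs_dim, n_actions=3):   # e.g. left / right / fire
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)                     # unnormalized action scores (logits)

# Turning the scores into a distribution and sampling (rather than taking the arg-max)
# keeps the actor exploratory:
# probs = torch.softmax(policy(obs), dim=-1)
# action = torch.distributions.Categorical(probs).sample()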
After many turns, the game is over (the spaceship is destroyed); taking action a_T and obtaining reward r_T ends the episode. This is an episode.
Total reward (return): R = Σ_{t=1}^{T} r_t
What we want: an actor that maximizes the total reward.
Step 3: Optimization
Trajectory: τ = {s_1, a_1, s_2, a_2, ⋯}
[Diagram: given s_1 the Network outputs a_1, the environment returns s_2, the Network outputs a_2, ⋯]
R(τ) = Σ_{t=1}^{T} r_t
How to do the optimization here is the main challenge in RL. (c.f. GAN)
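As a sketch of one episode and its total reward R(τ) = Σ_{t=1}^{T} r_t, assuming a hypothetical environment object with reset() and step(action) methods (a generic interface, not a specific library API):

def run_episode(env, policy):
    """Roll out one episode and return the trajectory and the total reward R(tau)."""
    trajectory, total_reward = [], 0.0
    s = env.reset()                  # s_1
    done = False
    while not done:
        a = policy(s)                # actor picks a_t from s_t
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        total_reward += r            # R(tau) = sum over t of r_t
        s = s_next
    return trajectory, total_reward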
To learn more about policy gradient: https://youtu.be/W8XF3ME8G2I
Outline
Policy Gradient
Actor-Critic
Reward Shaping
How to control your actor
To make the actor take a desired action for one observation and avoid an undesired action for another, minimize L = e_1 − e_2, where e_1 and e_2 are the corresponding cross-entropies:  θ* = arg min_θ L
Version 0
Collect training data over many episodes by letting the actor interact with the environment:
[Diagram: Env gives s_1, the Actor takes a_1 = "right" and gets reward r_1 = 0; Env gives s_2, the Actor takes a_2 = "fire" and gets reward r_2 = +5; ……]
Training Data:
{s_1, a_1}   A_1 = r_1
{s_2, a_2}   A_2 = r_2
……
{s_N, a_N}   A_N = r_N
Short-sighted version! Each action is judged only by the reward obtained right after it, ignoring its effect on later rewards.
Version 1
[Trajectory: s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, ……, s_N, a_N, r_N]
Training Data:
{s_1, a_1}   A_1 = G_1
{s_2, a_2}   A_2 = G_2
{s_3, a_3}   A_3 = G_3
……
{s_N, a_N}   A_N = G_N
Cumulated reward: G_t = Σ_{n=t}^{N} r_n
G_1 = r_1 + r_2 + r_3 + …… + r_N
G_2 = r_2 + r_3 + …… + r_N
G_3 = r_3 + …… + r_N
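A minimal sketch of Version 1's cumulated reward, computed with a single backward pass over one episode's rewards:

def cumulated_rewards(rewards):
    """Version 1: G_t = r_t + r_{t+1} + ... + r_N (reward-to-go)."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

# cumulated_rewards([0, 5, 0, 1]) -> [6.0, 6.0, 1.0, 1.0]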
Version 2
Training Data:
{s_1, a_1}   A_1 = G_1'
{s_2, a_2}   A_2 = G_2'
{s_3, a_3}   A_3 = G_3'
……
{s_N, a_N}   A_N = G_N'
Should the distant r_N really count as full credit of a_1 (as in G_1 = r_1 + r_2 + r_3 + …… + r_N)?
Discount the later rewards: G_1' = r_1 + γ r_2 + γ² r_3 + ……
Discounted cumulated reward: G_t' = Σ_{n=t}^{N} γ^{n−t} r_n, with discount factor γ < 1
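Version 2 only changes the accumulation, since G_t' = r_t + γ G_{t+1}'; a sketch (the default γ = 0.99 is just an illustrative choice):

def discounted_rewards(rewards, gamma=0.99):
    """Version 2: G'_t = sum_{n=t}^{N} gamma^(n-t) * r_n, via G'_t = r_t + gamma * G'_{t+1}."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G   # gamma = 1.0 recovers Version 1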
Version 3
Training Data:
{s_1, a_1}   A_1 = G_1' − b
{s_2, a_2}   A_2 = G_2' − b
{s_3, a_3}   A_3 = G_3' − b
……
{s_N, a_N}   A_N = G_N' − b
Good or bad reward is "relative": if all the r_n ≥ 10, then r_n = 10 is actually a bad outcome …
Subtract a baseline b (but how to choose b?) so that G_t' − b takes both positive and negative values.
G_t' = Σ_{n=t}^{N} γ^{n−t} r_n
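One simple, illustrative choice for Version 3 is to use the batch mean of the G_t' as the baseline b, which already gives the A_t both signs; the slides leave the choice of b open until Version 3.5.

def advantages_with_baseline(discounted, baseline=None):
    """Version 3: A_t = G'_t - b. The batch mean is one common, simple choice of b."""
    if baseline is None:
        baseline = sum(discounted) / len(discounted)
    return [g - baseline for g in discounted]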
Policy Gradient
• Initialize actor network parameters θ^0
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact with the environment
  • Obtain data {s_1, a_1}, {s_2, a_2}, …, {s_N, a_N}
  • Compute A_1, A_2, …, A_N
  • Compute loss L
  • θ^i ← θ^{i−1} − η∇L
Data collection is in the "for loop" of training iterations.
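A minimal sketch of the loop above; collect_episode, compute_advantages, and update_actor are hypothetical callbacks (names invented here, not from the slides), so the sketch only fixes the structure of the algorithm.

def policy_gradient(env, actor, collect_episode, compute_advantages, update_actor,
                    num_iterations, learning_rate):
    """Skeleton of the algorithm: data collection sits inside the training loop."""
    for i in range(1, num_iterations + 1):
        # use actor theta^{i-1} to interact with the environment
        states, actions, rewards = collect_episode(env, actor)
        # compute A_1, ..., A_N (Versions 0-4 differ only in this step)
        advantages = compute_advantages(rewards)
        # compute loss L and take one gradient step: theta^i <- theta^{i-1} - eta * grad L
        update_actor(actor, states, actions, advantages, learning_rate)
    return actor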
Policy Gradient
Training Data (collected by the actor θ^{i−1}; the actor θ maps s to a):
{s_1, a_1}   A_1
{s_2, a_2}   A_2
{s_3, a_3}   A_3
……
{s_N, a_N}   A_N
L = Σ_n A_n e_n, where e_n is the cross-entropy between the actor's output for s_n and the action a_n
θ^i ← θ^{i−1} − η∇L
Only update once: the trajectory of θ^{i−1} is used to train θ^i only; the experience of θ^{i−1} may not apply to every actor.
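A sketch of the loss L = Σ_n A_n e_n in PyTorch, assuming the actor returns unnormalized action scores and that states, actions, and advantages are batched tensors (shapes are assumptions, not from the slides):

import torch
import torch.nn.functional as F

def policy_gradient_loss(actor, states, actions, advantages):
    """L = sum_n A_n * e_n; positive A_n push the actor toward a_n, negative A_n away from it."""
    logits = actor(states)                                    # shape: (N, n_actions)
    e = F.cross_entropy(logits, actions, reduction="none")    # e_n for each pair {s_n, a_n}
    return (advantages * e).sum()

# After one gradient step on this loss, collect new data with the updated actor
# (on-policy: the old trajectories no longer match the new actor).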
Policy Gradient
Actor-Critic
Reward Shaping
Critic: value function V^θ(s)
Given actor θ, V^θ(s) is the (discounted) cumulated reward expected to be obtained after seeing s.
[Diagram: s → V^θ → a scalar V^θ(s); one game screen where V^θ(s) is large, another where V^θ(s) is smaller]
How to estimate V^θ(s)
• Monte-Carlo (MC) approach
After seeing s_a, until the end of the episode, the cumulated reward is G_a'; train V^θ(s_a) to be close to G_a'.
After seeing s_b, until the end of the episode, the cumulated reward is G_b'; train V^θ(s_b) to be close to G_b'.
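A sketch of the MC idea: the critic is just a regression network trained so that V^θ(s_a) matches the observed cumulated reward G_a'; the architecture and squared-error loss are illustrative choices.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Critic: maps an observation s to a single scalar V(s)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def mc_critic_loss(critic, states, returns):
    """Monte-Carlo target: V(s_t) should match the observed G'_t of that episode."""
    return ((critic(states) - returns) ** 2).mean()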
How to estimate V^θ(s)
• Temporal-difference (TD) approach
⋯ s_t, a_t, r_t, s_{t+1} ⋯   (ignore the expectation here)
V^θ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯
V^θ(s_{t+1}) = r_{t+1} + γ r_{t+2} + ⋯
⇒ V^θ(s_t) = γ V^θ(s_{t+1}) + r_t
[Diagram: feed s_t and s_{t+1} into V^θ and train so that V^θ(s_t) − γ V^θ(s_{t+1}) is close to r_t]
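A sketch of the TD version, training V^θ(s_t) − γ V^θ(s_{t+1}) toward r_t; detaching the bootstrapped value as a fixed target is a common choice, not something the slide specifies.

import torch

def td_critic_loss(critic, s_t, r_t, s_next, gamma=0.99):
    """Temporal-difference target: V(s_t) should match r_t + gamma * V(s_{t+1})."""
    with torch.no_grad():                        # treat the bootstrapped value as a fixed target
        target = r_t + gamma * critic(s_next)
    return ((critic(s_t) - target) ** 2).mean()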
MC v.s. TD: the two approaches can give different estimates from the same data [Sutton, v2, Example 6.4]
Back to Version 3: how should we set the baseline b in A_t = G_t' − b?
Training Data:
{s_1, a_1}   A_1 = G_1' − b
{s_2, a_2}   A_2 = G_2' − b
{s_3, a_3}   A_3 = G_3' − b
……
{s_N, a_N}   A_N = G_N' − b
Use the critic: s → V^θ → V^θ(s)
Version 3.5
Training Data:
{s_1, a_1}   A_1 = G_1' − V^θ(s_1)
{s_2, a_2}   A_2 = G_2' − V^θ(s_2)
{s_3, a_3}   A_3 = G_3' − V^θ(s_3)
……
{s_N, a_N}   A_N = G_N' − V^θ(s_N)
The baseline is the critic's output: s → V^θ → V^θ(s)
Version 3.5:  for {s_t, a_t}, A_t = G_t' − V^θ(s_t)
V^θ(s_t) is the average cumulated reward over many possible continuations from s_t (e.g., samples G = 100, 3, 1, 2, −10), not necessarily taking a_t; the actions are sampled from a distribution.
G_t' is the cumulated reward of the one continuation in which a_t was actually taken; it is just a sample.
A_t > 0: a_t is better than average.
A_t < 0: a_t is worse than average.
Version 4: Advantage Actor-Critic
A_t = r_t + V^θ(s_{t+1}) − V^θ(s_t)    (replacing A_t = G_t' − V^θ(s_t))
V^θ(s_t): the average cumulated reward from s_t (e.g., samples G = 100, 3, 1, 2, −10), not necessarily taking a_t.
After actually taking a_t at s_t, we obtain r_t and reach s_{t+1}; instead of one sampled G_t', use r_t + V^θ(s_{t+1}), where V^θ(s_{t+1}) averages the possible cumulated rewards from s_{t+1} (e.g., samples G = 101, 4, 3, 1, −5).
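A sketch of the Version 4 advantage for a batch of steps, keeping the slide's form (no discount on V^θ(s_{t+1}); adding one would be the more common variant); states, rewards, and next_states are assumed to be tensors.

import torch

def advantage_v4(critic, states, rewards, next_states):
    """Advantage Actor-Critic: A_t = r_t + V(s_{t+1}) - V(s_t), both terms from the critic."""
    with torch.no_grad():
        v_t = critic(states)            # V(s_t) for every step in the batch
        v_next = critic(next_states)    # V(s_{t+1}); use 0 for the terminal state
    return rewards + v_next - v_t

# Version 3.5 instead used A_t = G'_t - V(s_t), where G'_t is a single sampled return.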
Tip of Actor-Critic
• The parameters of the actor and the critic can be shared.
[Diagram: a shared backbone network; the actor head outputs the action distribution (left, right, fire, …) and the critic head outputs the scalar V^θ(s)]
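A sketch of the sharing tip: one backbone with two heads, where the actor head outputs action scores and the critic head outputs the scalar V^θ(s); layer sizes are illustrative.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared backbone; the actor head and the critic head reuse the same features."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.actor_head = nn.Linear(128, n_actions)   # action scores (e.g. left / right / fire)
        self.critic_head = nn.Linear(128, 1)          # scalar V(s)

    def forward(self, obs):
        h = self.backbone(obs)
        return self.actor_head(h), self.critic_head(h).squeeze(-1)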
Video:
https://youtu.be/o_g9JUMw1Oc
https://youtu.be/2-zGCx4iv_k
https://arxiv.org/abs/1710.02298
Outline
Policy Gradient
Actor-Critic
Reward Shaping
Sparse reward: if r_t = 0 in most cases, most of the A_n in the training data carry no useful signal, and the actor rarely sees a positive reward to learn from.
Reward shaping: define extra rewards to guide the actor.
See also (reverse curriculum generation): https://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/
Reward Shaping - Curiosity: the actor obtains an extra reward when it sees something new. https://arxiv.org/abs/1705.05363
Policy Gradient
Actor-Critic
Reward Shaping
Imitation learning
The actor can interact with the environment, but the reward function is not available.
[Trajectory: s_1, a_1, s_2, a_2, s_3, ⋯ with unknown rewards (?????)]
Instead, we have demonstrations from an expert: for observation s_i, the expert takes action â_i.
Example: https://www.youtube.com/watch?v=j2FSB3bseek
Inverse Reinforcement Learning
[Diagram: from the expert's demonstrations τ̂_1, τ̂_2, ⋯, τ̂_K and the Environment, learn a Reward Function R; then find an actor π based on R by reinforcement learning]
Principle: the expert's trajectories should always score higher, Σ_{n=1}^{K} R(τ̂_n) > Σ_{n=1}^{K} R(τ_n), where τ_1, τ_2, ⋯, τ_K are the actor's trajectories.
GAN analogy: the Actor = Generator (it produces the trajectories τ_1, τ_2, ⋯, τ_K); the Reward Function = Discriminator (high score for the expert's trajectories, low score for the generated ones).
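A sketch of one reward-function update in this alternating scheme, assuming trajectories are summarized as feature tensors and the reward is a learnable linear model (both assumptions for illustration); the actor-improvement step is a full reinforcement-learning run of its own and is only indicated by a comment.

import torch
import torch.nn as nn

reward_fn = nn.Linear(16, 1)      # R(tau) from a 16-dim trajectory feature (illustrative size)
optimizer = torch.optim.Adam(reward_fn.parameters(), lr=1e-3)

def update_reward_function(expert_feats, actor_feats):
    """Push R so that expert trajectories score higher than the actor's (discriminator-like step)."""
    expert_score = reward_fn(expert_feats).sum()   # sum over n of R(tau_hat_n)
    actor_score = reward_fn(actor_feats).sum()     # sum over n of R(tau_n)
    loss = actor_score - expert_score              # minimizing puts the expert above the actor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Alternate with: run reinforcement learning to find an actor that maximizes the current R,
# then collect new actor trajectories tau_1..tau_K and update R again (cf. the GAN analogy).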
Policy Gradient
Actor-Critic
Sparse Reward