CS230: Lecture 9 Deep Reinforcement Learning: Kian Katanforoosh Menti Code: 80 24 08
I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics
AlphaGo
Human Level Control through Deep Reinforcement Learning
[Silver et al. (2017): Mastering the game of Go without human knowledge]
[Mnih et al. (2015): Human Level Control through Deep Reinforcement Learning]
I. Motivation
Why RL?
• Delayed labels
• Making sequences of decisions
What is RL?
• Automatically learn to make good sequences of decisions
Source: https://deepmind.com/blog/alphago-zero-learning-scratch/
Examples of RL applications
I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics
Problem statement
[Figure: 5 states in a row - State 1, State 2 (initial), State 3, State 4, State 5]
Goal: maximize the return (rewards)
Types of states:
Define reward “r” in every state: +2, 0, 0, +1, +10 for States 1-5
Agent’s possible actions: move left (←), move right (→)
Additional rule: the garbage collector is coming in 3 min, and it takes 1 min to move between states
Best strategy to follow if γ = 1?
How to define the long-term return?

Discounted return: R = ∑_{t≥0} γ^t r_t = r_0 + γ r_1 + γ² r_2 + ...

How? Work backwards from the +10 reward, assuming γ = 0.9:
- return starting from S4 (then S5): 1 + 0.9 × 10 = 10
- return starting from S3 (then S4, S5): 0 + 0.9 × 10 = 9
- return starting from S2 (then S3, S4, S5): 0 + 0.9 × 9 = 8.1
[Figure: tree of possible moves from the initial state over S1...S5 (rewards +2, 0, 0, +1, +10), each node annotated with its discounted return]
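A small sketch, not from the slides, checking these numbers in Python; the `discounted_return` helper and the reward lists (read off the path heading right toward the +10 state) are illustrative assumptions.

```python
# Checking the worked example: R = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(round(discounted_return([1, 10]), 2))        # 10.0  (from S4: 1 + 0.9*10)
print(round(discounted_return([0, 1, 10]), 2))     # 9.0   (from S3: 0 + 0.9*10)
print(round(discounted_return([0, 0, 1, 10]), 2))  # 8.1   (from S2: 0 + 0.9*9)
```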
II. Recycling is good: an introduction to RL
Bellman equation (optimality equation):

Q*(s, a) = r + γ · max_{a'} Q*(s', a')

[Figure: the 5-state example with rewards +2, 0, 0, +1, +10]

Q* is the function telling us our best strategy: in every state, take the action a with the largest Q*(s, a).
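A small sketch, not from the slides, that iterates this equation to a fixed point (value iteration) on the recycling line. Assumptions for illustration: S1 and S5 are treated as terminal, the reward of a state is collected on arrival, and γ = 0.9 as in the worked example.

```python
import numpy as np

# Value iteration on the recycling line, iterating the Bellman optimality equation.
rewards = np.array([2.0, 0.0, 0.0, 1.0, 10.0])    # r(S1..S5), as on the slide
terminal = np.array([True, False, False, False, True])   # assumption: S1, S5 terminal
gamma, actions = 0.9, (-1, +1)                    # -1 = move left, +1 = move right

Q = np.zeros((5, 2))
for _ in range(50):
    for s in range(5):
        if terminal[s]:
            continue                              # no actions from a terminal state
        for i, a in enumerate(actions):
            s_next = s + a
            future = 0.0 if terminal[s_next] else gamma * Q[s_next].max()
            Q[s, i] = rewards[s_next] + future    # Q*(s,a) = r + gamma * max_a' Q*(s',a')

print(np.round(Q, 2))                             # e.g. Q(S2) = [2, 9], Q(S4) = [9, 10]
print(["left" if i == 0 else "right" for i in Q[1:4].argmax(axis=1)])  # greedy policy
```

With γ = 0.9 every non-terminal state ends up preferring to move right toward the +10 reward, matching the worked example above.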
What we’ve learned so far: states, actions, rewards, the discounted return, and the Q-function / Bellman optimality equation.
I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics
Deep Q-Network

[Figure: the 5-state line (START at the initial state) feeding a fully connected network with hidden activations a^[1], a^[2] and output layer a^[3]]

Input: one-hot encoding of the state, e.g. s = (0, 1, 0, 0, 0)^T when the agent is in State 2
Output: one Q-value per action, Q(s,←) and Q(s,→)

Loss function (regression): L = (y − Q(s,←))² when learning the value of ←, and L = (y − Q(s,→))² when learning the value of →
Target value y: computed for two cases, Q(s,←) > Q(s,→) and Q(s,←) < Q(s,→)

Backpropagation: compute ∂L/∂W and update W using stochastic gradient descent.
- Compute targets y by forward propagating state s’ in the Q-network, then compute loss.
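A minimal sketch of one such update in tf.keras; this is not the lecture's code, and the layer sizes, the optimizer, and the example target y = 9 are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Q-network for the 5-state example: one-hot state in, one Q-value per action out.
q_network = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),                  # [Q(s, left), Q(s, right)]
])
optimizer = tf.keras.optimizers.Adam(1e-3)

s = np.array([[0., 1., 0., 0., 0.]], dtype=np.float32)  # agent in State 2
a = 1                                                    # index of the action taken (right)
y = np.float32(9.0)                                      # target value for (s, a)

# Regression loss L = (y - Q(s, a))^2 on the action taken, then one
# gradient step on the weights W (backpropagation + optimizer update).
with tf.GradientTape() as tape:
    q_values = q_network(s)                    # shape (1, 2): Q(s, left), Q(s, right)
    loss = (y - q_values[0, a]) ** 2
grads = tape.gradient(loss, q_network.trainable_variables)
optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
```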
I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics
Demo

Input of Q-network: φ(s), a preprocessed version of the game screen s
Output of Q-network: the Q-values (Q(s,←), Q(s,→), Q(s,−))^T, one per action (move left, move right, do nothing)

For each transition:
- Use s’ to create φ(s’)
- Check if s’ is a terminal state. Compute targets y by forward propagating state φ(s’) in the Q-network, then compute loss.
- Can be used with mini-batch gradient descent.

Advantages of experience replay?
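One possible way to implement the replay memory D, as a hedged sketch; the deque-based buffer, its capacity, and the batch size are assumptions rather than anything prescribed by the lecture.

```python
import random
from collections import deque

# Replay memory D: store transitions (phi(s), a, r, phi(s'), done) and sample
# random mini-batches, so consecutive highly correlated frames are not used
# back-to-back for the gradient step.
class ReplayMemory:
    def __init__(self, capacity=100_000):       # capacity is an arbitrary choice
        self.buffer = deque(maxlen=capacity)     # oldest transitions drop out automatically

    def store(self, phi_s, a, r, phi_s_next, done):
        self.buffer.append((phi_s, a, r, phi_s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```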
Recap’ (+ experience replay)

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - ... (a full loop sketch is given below)

Some training challenges:
- Keep track of terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)
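Putting the recap together, a hedged skeleton of that loop, reusing the ReplayMemory sketch above; `env`, `preprocess` (the φ function), `q_network`, `train_step`, and all hyper-parameters are hypothetical placeholders, not the lecture's exact choices.

```python
import numpy as np

# Skeleton of the DQN loop from the recap (illustrative only).
memory = ReplayMemory()                       # from the sketch above
epsilon, gamma, num_episodes = 0.1, 0.99, 500

for episode in range(num_episodes):
    phi_s = preprocess(env.reset())           # start from initial state phi(s)
    terminal = False                          # boolean to detect terminal states
    while not terminal:
        # Epsilon-greedy action choice (exploration / exploitation tradeoff)
        if np.random.rand() < epsilon:
            a = env.sample_random_action()    # placeholder env interface
        else:
            a = int(np.argmax(q_network(phi_s[None])[0]))

        s_next, r, terminal = env.step(a)     # placeholder env interface
        phi_s_next = preprocess(s_next)       # use s' to create phi(s')
        memory.store(phi_s, a, r, phi_s_next, terminal)

        # Experience replay: sample a mini-batch and take one gradient step;
        # targets follow the terminal-state rule shown further below.
        if len(memory) >= 32:
            train_step(q_network, memory.sample(32), gamma)

        phi_s = phi_s_next
```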
Terminal state

Example: S4 is a terminal state with R = +1000.

- Check if s’ is a terminal state. Compute targets y by forward propagating state φ(s’) in the Q-network, then compute loss. When s’ is terminal there are no future rewards to bootstrap from, so the target reduces to y = r.
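A short sketch of that target rule; the function name, `q_network`, and the default gamma are placeholders for illustration.

```python
import numpy as np

def compute_target(r, phi_s_next, done, q_network, gamma=0.99):
    """Regression target y for one transition (q_network is a placeholder)."""
    if done:                                   # s' is terminal (e.g. S4 with R = +1000)
        return r                               # no future rewards, so y = r
    q_next = q_network(phi_s_next[None])[0]    # forward propagate phi(s')
    return r + gamma * float(np.max(q_next))   # y = r + gamma * max_a' Q(phi(s'), a')
```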
[https://www.youtube.com/watch?v=NirMkC5uvWU]
[https://www.youtube.com/watch?v=p88R2_3yWPA]
[https://www.youtube.com/watch?v=W2CAghUiofY&t=2s]
Today’s outline
I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics
AlphaGo
[DeepMind Blog]
[Silver et al. (2017): Mastering the game of Go without human knowledge]
VI - Advanced topics Competitive self-play
Meta learning
[Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]
VI - Advanced topics Imitation learning
[Ho et al. (2016): Generative Adversarial Imitation Learning]
VI - Advanced topics Auxiliary task
This Friday:
• TA Sections:
• How to have a great final project write-up.
• Advice on: How to write a great report.
• Advice on: How to build a super poster.
• Advice on: Final project grading criteria.
• Going through examples of great projects and why they were great.
• Small competitive quiz in section.