
CS230: Lecture 9

Deep Reinforcement Learning


Kian Katanforoosh
Menti code: 80 24 08

Kian Katanforoosh, Andrew Ng, Younes Bensouda Mourri


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



I. Motivation

Two landmark examples:
• AlphaGo [Silver et al. (2017): Mastering the game of Go without human knowledge]
• Human-Level Control through Deep Reinforcement Learning [Mnih et al. (2015)]

Why RL?
• Delayed labels
• Making sequences of decisions

What is RL?
• Automatically learn to make good sequences of decisions
(Source: https://deepmind.com/blog/alphago-zero-learning-scratch/)

Examples of RL applications: robotics, games, advertisement


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



II. Recycling is good: an introduction to RL

Problem statement
States: State 1, State 2 (initial), State 3, State 4, State 5 (number of states: 5)
Types of states: initial, normal, terminal
Reward "r" defined in every state: +2 (S1), 0 (S2), 0 (S3), +1 (S4), +10 (S5)
Agent's possible actions: move left (←) or move right (→)
Additional rule: the garbage collector comes in 3 minutes, and it takes 1 minute to move between states (so at most 3 moves).
Goal: maximize the return (the rewards). What is the best strategy to follow if γ = 1?

How to define the long-term return?
Discounted return:  R = Σ_{t≥0} γ^t r_t = r_0 + γ r_1 + γ² r_2 + ...
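To make the discounted return concrete, here is a minimal Python sketch. It assumes, as in the worked example that follows, that a state's reward is collected when the agent enters it:

```python
def discounted_return(rewards, gamma):
    """R = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Starting from State 2: going left ends in terminal State 1 (+2);
# going right visits States 3, 4, 5 and collects rewards 0, +1, +10.
print(discounted_return([2], 1.0))          # 2.0
print(discounted_return([0, 1, 10], 1.0))   # 11.0 -> going right is the best strategy when gamma = 1
print(discounted_return([0, 1, 10], 0.9))   # 9.0  (this is Q(S2, right) in the worked example below)
```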


II. Recycling is good: an introduction to RL

What do we want to learn?
A Q-table: a matrix Q with one row per state (#states = 5) and one column per action (#actions = 2),
where entry Q(s,a) measures "how good is it to take action a in state s"
(e.g. Q21 = how good it is to take action 1 in State 2).

        ⎛ Q11  Q12 ⎞
        ⎜ Q21  Q22 ⎟
    Q = ⎜ Q31  Q32 ⎟
        ⎜ Q41  Q42 ⎟
        ⎝ Q51  Q52 ⎠

How? Work backwards from the terminal rewards, assuming γ = 0.9 (rewards +2, 0, 0, +1, +10 for S1..S5):
    Q(S4, →) = 10    (reward +10 for reaching terminal State 5)
    Q(S3, →) = 10    (= 1 + 0.9 × 10)
    Q(S4, ←) = 9     (= 0 + 0.9 × 10)
    Q(S2, →) = 9     (= 0 + 0.9 × 10)
    Q(S3, ←) = 8.1   (= 0 + 0.9 × 9)
    Q(S2, ←) = 2     (reward +2 for reaching terminal State 1)
    Terminal states S1 and S5 allow no further action, so their entries are 0.

Filled-in Q-table (rows = states S1..S5, columns = actions ←, →):

        ⎛ 0    0  ⎞
        ⎜ 2    9  ⎟
    Q = ⎜ 8.1  10 ⎟
        ⎜ 9    10 ⎟
        ⎝ 0    0  ⎠
II. Recycling is good: an introduction to RL

Best strategy to follow if γ = 0.9: in each state, take the action with the largest Q-value.

Bellman equation (optimality equation):

    Q*(s, a) = r + γ max_{a'} Q*(s', a')

Policy (the function telling us our best strategy):

    π*(s) = argmax_a Q*(s, a)

When the state and action spaces are too big, this tabular method has a huge memory cost.
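As a sanity check, here is a minimal Python/NumPy sketch that recovers the Q-table above by repeatedly applying the Bellman optimality equation to the recycling example. The conventions are assumptions taken from the worked example: action 0 = move left, action 1 = move right, a state's reward is collected on entering it, terminal states have Q = 0, and γ = 0.9.

```python
import numpy as np

rewards = np.array([2.0, 0.0, 0.0, 1.0, 10.0])           # rewards of States 1..5
terminal = np.array([True, False, False, False, True])   # States 1 and 5 are terminal
gamma = 0.9

Q = np.zeros((5, 2))  # Q-table: rows = states, columns = actions (left, right)
for _ in range(100):  # sweep the Bellman optimality equation until the values stop changing
    for s in range(5):
        if terminal[s]:
            continue  # no action is taken from a terminal state
        for a, s_next in enumerate([s - 1, s + 1]):       # 0 = left, 1 = right
            Q[s, a] = rewards[s_next] + gamma * (0.0 if terminal[s_next] else Q[s_next].max())

print(np.round(Q, 1))
# [[ 0.   0. ]
#  [ 2.   9. ]
#  [ 8.1 10. ]
#  [ 9.  10. ]
#  [ 0.   0. ]]
print(np.argmax(Q, axis=1))   # greedy policy pi*(s) = argmax_a Q(s, a)
```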
What we’ve learned so far:

- Vocabulary: environment, agent, state, action, reward, total return, discount factor.
- Q-table: matrix of entries representing "how good is it to take action a in state s".
- Policy: function telling us the best strategy to adopt.
- Bellman equation: satisfied by the optimal Q-table.


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



III. Deep Q-Networks

Main idea: find a Q-function (a neural network) to replace the Q-table.

The state is encoded as a one-hot vector, e.g. s = (0, 1, 0, 0, 0)ᵀ for State 2 (the initial state).
This vector is fed through a small fully connected network whose two outputs are Q(s,←) and Q(s,→),
one Q-value per action, replacing the #states × #actions Q-table.
Then compute a loss and backpropagate.

How to compute the loss?
III. Deep Q-Networks

Recall the Bellman optimality equation:  Q*(s, a) = r + γ max_{a'} Q*(s', a')

Loss function (regression)
For the action a that was taken in state s, the network is trained with the squared error

    L = (y − Q(s, a))²

where the target value y is held fixed during backprop:
- Case Q(s,←) > Q(s,→): the agent goes left,  y = r← + γ max_{a'} Q(s_next, a'),  and L = (y − Q(s,←))²
- Case Q(s,←) < Q(s,→): the agent goes right, y = r→ + γ max_{a'} Q(s_next, a'),  and L = (y − Q(s,→))²

In both cases, y is the immediate reward for taking the action in state s, plus the discounted maximum
future reward once you are in the next state s_next.

Backpropagation: compute ∂L/∂W and update W using stochastic gradient descent.

[Francisco S. Melo: Convergence of Q-learning: a simple proof]
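A minimal PyTorch sketch of this regression loss (PyTorch is an assumption; the lecture does not prescribe a framework). `q_net` is assumed to map a batch of states to a (batch, num_actions) tensor of Q-values; wrapping the target in `torch.no_grad()` is what "hold fixed for backprop" means in code:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, gamma=0.9):
    """L = (y - Q(s, a))^2  with  y = r + gamma * max_a' Q(s_next, a'),  y held fixed."""
    # Q(s, a) for the action that was actually taken in each state of the batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target value y: no gradient flows through it (it is held fixed during backprop)
    with torch.no_grad():
        y = rewards + gamma * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, y)

# Backpropagation: loss = dqn_loss(...); loss.backward(); optimizer.step()
```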


Recap’

DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state s
  - Loop over time-steps:
    - Forward propagate s in the Q-network
    - Execute the action a that has the maximum Q(s,a) output of the Q-network
    - Observe reward r and next state s'
    - Compute the target y by forward propagating state s' in the Q-network, then compute the loss
    - Update parameters with gradient descent


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



IV. Deep Q-Networks application: Breakout (Atari)

Goal: play Breakout, i.e. destroy all the bricks.

Input of the Q-network: the game screen s.
Output of the Q-network: one Q-value per action, (Q(s,←), Q(s,→), Q(s,−)).

Would feeding the raw frames directly work? In practice the input is preprocessed first.

What is done in preprocessing φ(s)?
- Convert to grayscale
- Reduce dimensions (h, w)
- Keep a history of the last 4 frames

[Demo: https://www.youtube.com/watch?v=V1eYniJ0Rnk]
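A minimal NumPy sketch of such a preprocessing step. The downsampling method (keeping every other pixel) and the normalization are assumptions; real pipelines typically crop and resize more carefully:

```python
from collections import deque
import numpy as np

def to_grayscale(frame):
    """frame: (H, W, 3) RGB image as a uint8 array -> (H, W) grayscale."""
    return frame.mean(axis=2)

def downsample(frame, factor=2):
    """Crude dimensionality reduction: keep every `factor`-th pixel."""
    return frame[::factor, ::factor]

class FrameHistory:
    """Builds phi(s): a stack of the last 4 preprocessed frames, shape (4, H, W)."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        processed = downsample(to_grayscale(frame)) / 255.0
        if not self.frames:                          # at episode start, repeat the first frame
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        return np.stack(self.frames, axis=0)
```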
IV. Deep Q-Networks application: Breakout (Atari)

Input of the Q-network: φ(s), the stack of preprocessed frames.

Deep Q-network architecture:

    φ(s) → CONV → ReLU → CONV → ReLU → CONV → ReLU → FC (ReLU) → FC (linear) → (Q(s,←), Q(s,→), Q(s,−))
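A minimal PyTorch sketch of a network with this shape. The filter sizes, strides, and channel counts are assumptions in the spirit of Mnih et al. (2015); the slide only specifies the CONV/ReLU ×3 → FC (ReLU) → FC (linear) structure:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """phi(s) -> [CONV -> ReLU] x3 -> FC (ReLU) -> FC (linear) -> one Q-value per action."""
    def __init__(self, num_actions=3, in_frames=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature maps when phi(s) is 4 x 84 x 84
            nn.Linear(512, num_actions),             # linear output layer: Q(s, a) for each action
        )

    def forward(self, phi_s):                        # phi_s: (batch, 4, 84, 84) float tensor
        return self.head(self.conv(phi_s))

q_net = QNetwork()
print(q_net(torch.zeros(1, 4, 84, 84)).shape)        # torch.Size([1, 3])
```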


Recap’ (+ preprocessing + terminal state)

Some training challenges:
- Keep track of the terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)

DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Check if s' is a terminal state. Compute the target y by forward propagating φ(s') in the Q-network, then compute the loss:
        if terminal = False:  y = r + γ max_{a'} Q(φ(s'), a')
        if terminal = True:   y = r   (and break out of the time-step loop)
    - Update parameters with gradient descent
IV - DQN training challenges: experience replay

One experience (one transition) leads to one iteration of gradient descent:
    E1:  φ(s)   → a   → r   → φ(s')
    E2:  φ(s')  → a'  → r'  → φ(s'')
    E3:  φ(s'') → a'' → r'' → φ(s''')
    ...

Current method: start from the initial state s and train on E1, then E2, then E3, ... in order.

With experience replay: store every experience in a replay memory D and, at each step, train on a random
sample drawn from D (E1, then sample(E1, E2), then sample(E1, E2, E3), then sample(E1, E2, E3, E4), ...).
This can be used with mini-batch gradient descent. Advantages of experience replay?
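A minimal sketch of a replay memory D, storing each transition as a (φ(s), a, r, φ(s'), terminal) tuple; the capacity and batch size are assumptions:

```python
import random
from collections import deque

class ReplayMemory:
    """Replay memory D: store transitions, then sample random mini-batches from them."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)     # the oldest experiences are dropped when full

    def add(self, phi_s, action, reward, phi_s_next, terminal):
        self.memory.append((phi_s, action, reward, phi_s_next, terminal))

    def sample(self, batch_size=32):
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        return tuple(zip(*batch))                # (states, actions, rewards, next_states, terminals)

    def __len__(self):
        return len(self.memory)
```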
Recap’ (+ experience replay)

Some training challenges:
- Keep track of the terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add experience (φ(s), a, r, φ(s')) to replay memory D
      (the transition resulting from this step is added to D, and will not always be used in this iteration's update!)
    - Sample a random mini-batch of transitions from D
    - Check if s' is a terminal state. Compute the targets y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent, using the sampled transitions
Exploration vs. Exploitation

Setup: from the initial state S1, three actions lead to three terminal states:
    a1 → S2, terminal, R = +0
    a2 → S3, terminal, R = +1
    a3 → S4, terminal, R = +1000

Just after initializing the Q-network, we get:
    Q(S1, a1) = 0.5
    Q(S1, a2) = 0.4
    Q(S1, a3) = 0.3

Acting greedily, the agent first picks a1, observes R = 0, and Q(S1, a1) is updated towards 0.
It then picks a2, observes R = +1, and Q(S1, a2) is updated towards 1.
From then on it keeps picking a2: S4 and its reward of +1000 will never be visited,
because Q(S1, a3) < Q(S1, a2). Hence the need for exploration (epsilon-greedy action choice).
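A minimal sketch of the epsilon-greedy action choice (the fixed epsilon is an assumption; in practice epsilon is usually annealed from 1.0 to a small value during training):

```python
import random
import torch

def epsilon_greedy_action(q_net, phi_s, num_actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (max-Q action)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)          # explore: random action
    with torch.no_grad():
        q_values = q_net(phi_s.unsqueeze(0))          # add a batch dimension
    return int(q_values.argmax(dim=1).item())         # exploit: action with maximum Q(phi(s), a)
```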


Recap’ (+ epsilon-greedy action)

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a. Otherwise:
      - Forward propagate φ(s) in the Q-network
      - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add experience (φ(s), a, r, φ(s')) to replay memory D
    - Sample a random mini-batch of transitions from D
    - Check if s' is a terminal state. Compute the targets y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent
Overall recap’

Key ingredients: preprocessing, detecting the terminal state, experience replay, epsilon-greedy action choice.

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a. Otherwise:
      - Forward propagate φ(s) in the Q-network
      - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add experience (φ(s), a, r, φ(s')) to replay memory D
    - Sample a random mini-batch of transitions from D
    - Check if s' is a terminal state. Compute the targets y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent
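Pulling the pieces together, a compact sketch of the whole loop. It reuses the hypothetical helpers sketched earlier (QNetwork, ReplayMemory, FrameHistory, epsilon_greedy_action) and assumes an environment where env.reset() returns the first frame and env.step(action) returns (frame, reward, terminal); these interfaces and the hyperparameters are illustrative, not prescribed by the lecture.

```python
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, memory, num_episodes=1000, gamma=0.99,
              epsilon=0.1, batch_size=32, lr=1e-4, num_actions=3):
    """DQN training loop: epsilon-greedy acting, replay memory, Bellman targets, SGD updates."""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(num_episodes):
        history = FrameHistory()
        phi_s = torch.as_tensor(history.push(env.reset()), dtype=torch.float32)
        terminal = False
        while not terminal:
            # Epsilon-greedy action choice, then one step in the environment
            a = epsilon_greedy_action(q_net, phi_s, num_actions, epsilon)
            frame, r, terminal = env.step(a)
            phi_s_next = torch.as_tensor(history.push(frame), dtype=torch.float32)
            memory.add(phi_s, a, r, phi_s_next, terminal)

            # Sample a random mini-batch of transitions from D and take one gradient step
            states, actions, rewards, next_states, dones = memory.sample(batch_size)
            states, next_states = torch.stack(states), torch.stack(next_states)
            actions = torch.as_tensor(actions)
            rewards = torch.as_tensor(rewards, dtype=torch.float32)
            done = torch.as_tensor(dones, dtype=torch.float32)

            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                    # target held fixed; y = r at terminal states
                y = rewards + gamma * (1.0 - done) * q_net(next_states).max(dim=1).values
            loss = F.mse_loss(q_sa, y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            phi_s = phi_s_next
```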
Results

[https://www.youtube.com/watch?v=TmPfTpjtdgg]


Other Atari games

Pong SeaQuest Space Invaders

[https://www.youtube.com/watch?v=NirMkC5uvWU]
[https://www.youtube.com/watch?v=p88R2_3yWPA]
[https://www.youtube.com/watch?v=W2CAghUiofY&t=2s]
Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



VI - Advanced topics

AlphaGo

[DeepMind Blog]
[Silver et al. (2017): Mastering the game of Go without human knowledge]
VI - Advanced topics: Competitive self-play

[Bansal et al. (2017): Emergent Complexity via multi-agent competition]
[OpenAI Blog: Competitive self-play]
VI - Advanced topics

Meta learning

[Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]
VI - Advanced topics: Imitation learning

[Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation]
[Ho et al. (2016): Generative Adversarial Imitation Learning]
VI - Advanced topics: Auxiliary task


Announcements

For Tuesday 06/05, 9am:

This Friday:
• TA Sections: how to have a great final project write-up.
  • Advice on how to write a great report.
  • Advice on how to build a super poster.
  • Advice on final project grading criteria.
  • Going through examples of great projects and why they were great.
  • Small competitive quiz in section.
