
CS230: Lecture 9

Deep Reinforcement Learning


Kian Katanforoosh
Menti code: 80 24 08

Kian Katanforoosh, Andrew Ng, Younes Bensouda Mourri


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



I. Motivation

Two landmark examples:
• AlphaGo [Silver et al. (2017): Mastering the game of Go without human knowledge]
• Human-Level Control through Deep Reinforcement Learning [Mnih et al. (2015)]

Why RL?
• Delayed labels
• Making sequences of decisions

What is RL?
• Automatically learn to make good sequences of decisions
(Source: https://deepmind.com/blog/alphago-zero-learning-scratch/)

Examples of RL applications: robotics, games, advertisement


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



II. Recycling is good: an introduction to RL

Problem statement
States: State 1, State 2 (initial), State 3, State 4, State 5 (number of states: 5)
Types of states: initial, normal, terminal
Reward "r" defined in every state: +2 (S1), 0 (S2), 0 (S3), +1 (S4), +10 (S5)
Agent's possible actions: move left (←) or move right (→)
Additional rule: the garbage collector comes in 3 minutes, and it takes 1 minute to move between states (so at most 3 moves).
Goal: maximize the return (the rewards). What is the best strategy to follow if γ = 1?

How to define the long-term return?
Discounted return:  R = Σ_{t≥0} γ^t r_t = r_0 + γ r_1 + γ² r_2 + ...
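To make the discounted return concrete, here is a minimal Python sketch. It assumes, as in the worked example that follows, that a state's reward is collected when the agent enters it:

```python
def discounted_return(rewards, gamma):
    """R = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Starting from State 2: going left ends in terminal State 1 (+2);
# going right visits States 3, 4, 5 and collects rewards 0, +1, +10.
print(discounted_return([2], 1.0))          # 2.0
print(discounted_return([0, 1, 10], 1.0))   # 11.0 -> going right is the best strategy when gamma = 1
print(discounted_return([0, 1, 10], 0.9))   # 9.0  (this is Q(S2, right) in the worked example below)
```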


II. Recycling is good: an introduction to RL

What do we want to learn?
A Q-table: a matrix Q with one row per state (#states = 5) and one column per action (#actions = 2),
where entry Q(s,a) measures "how good is it to take action a in state s"
(e.g. Q21 = how good it is to take action 1 in State 2).

        ⎛ Q11  Q12 ⎞
        ⎜ Q21  Q22 ⎟
    Q = ⎜ Q31  Q32 ⎟
        ⎜ Q41  Q42 ⎟
        ⎝ Q51  Q52 ⎠

How? Work backwards from the terminal rewards, assuming γ = 0.9 (rewards +2, 0, 0, +1, +10 for S1..S5):
    Q(S4, →) = 10    (reward +10 for reaching terminal State 5)
    Q(S3, →) = 10    (= 1 + 0.9 × 10)
    Q(S4, ←) = 9     (= 0 + 0.9 × 10)
    Q(S2, →) = 9     (= 0 + 0.9 × 10)
    Q(S3, ←) = 8.1   (= 0 + 0.9 × 9)
    Q(S2, ←) = 2     (reward +2 for reaching terminal State 1)
    Terminal states S1 and S5 allow no further action, so their entries are 0.

Filled-in Q-table (rows = states S1..S5, columns = actions ←, →):

        ⎛ 0    0  ⎞
        ⎜ 2    9  ⎟
    Q = ⎜ 8.1  10 ⎟
        ⎜ 9    10 ⎟
        ⎝ 0    0  ⎠
II. Recycling is good: an introduction to RL

Best strategy to follow if γ = 0.9: in each state, take the action with the largest Q-value.

Bellman equation (optimality equation):

    Q*(s, a) = r + γ max_{a'} Q*(s', a')

Policy (the function telling us our best strategy):

    π*(s) = argmax_a Q*(s, a)

When the state and action spaces are too big, this tabular method has a huge memory cost.
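As a sanity check, here is a minimal Python/NumPy sketch that recovers the Q-table above by repeatedly applying the Bellman optimality equation to the recycling example. The conventions are assumptions taken from the worked example: action 0 = move left, action 1 = move right, a state's reward is collected on entering it, terminal states have Q = 0, and γ = 0.9.

```python
import numpy as np

rewards = np.array([2.0, 0.0, 0.0, 1.0, 10.0])           # rewards of States 1..5
terminal = np.array([True, False, False, False, True])   # States 1 and 5 are terminal
gamma = 0.9

Q = np.zeros((5, 2))  # Q-table: rows = states, columns = actions (left, right)
for _ in range(100):  # sweep the Bellman optimality equation until the values stop changing
    for s in range(5):
        if terminal[s]:
            continue  # no action is taken from a terminal state
        for a, s_next in enumerate([s - 1, s + 1]):       # 0 = left, 1 = right
            Q[s, a] = rewards[s_next] + gamma * (0.0 if terminal[s_next] else Q[s_next].max())

print(np.round(Q, 1))
# [[ 0.   0. ]
#  [ 2.   9. ]
#  [ 8.1 10. ]
#  [ 9.  10. ]
#  [ 0.   0. ]]
print(np.argmax(Q, axis=1))   # greedy policy pi*(s) = argmax_a Q(s, a)
```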
What we’ve learned so far:

- Vocabulary: environment, agent, state, action, reward, total return, discount factor.
- Q-table: matrix of entries representing "how good is it to take action a in state s".
- Policy: function telling us the best strategy to adopt.
- Bellman equation: satisfied by the optimal Q-table.


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



III. Deep Q-Networks

Main idea: find a Q-function (a neural network) to replace the Q-table.

The state is encoded as a one-hot vector, e.g. s = (0, 1, 0, 0, 0)ᵀ for State 2 (the initial state).
This vector is fed through a small fully connected network whose two outputs are Q(s,←) and Q(s,→),
one Q-value per action, replacing the #states × #actions Q-table.
Then compute a loss and backpropagate.

How to compute the loss?
III. Deep Q-Networks

Recall the Bellman optimality equation:  Q*(s, a) = r + γ max_{a'} Q*(s', a')

Loss function (regression)
For the action a that was taken in state s, the network is trained with the squared error

    L = (y − Q(s, a))²

where the target value y is held fixed during backprop:
- Case Q(s,←) > Q(s,→): the agent goes left,  y = r← + γ max_{a'} Q(s_next, a'),  and L = (y − Q(s,←))²
- Case Q(s,←) < Q(s,→): the agent goes right, y = r→ + γ max_{a'} Q(s_next, a'),  and L = (y − Q(s,→))²

In both cases, y is the immediate reward for taking the action in state s, plus the discounted maximum
future reward once you are in the next state s_next.

Backpropagation: compute ∂L/∂W and update W using stochastic gradient descent.

[Francisco S. Melo: Convergence of Q-learning: a simple proof]
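A minimal PyTorch sketch of this regression loss (PyTorch is an assumption; the lecture does not prescribe a framework). `q_net` is assumed to map a batch of states to a (batch, num_actions) tensor of Q-values; wrapping the target in `torch.no_grad()` is what "hold fixed for backprop" means in code:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, gamma=0.9):
    """L = (y - Q(s, a))^2  with  y = r + gamma * max_a' Q(s_next, a'),  y held fixed."""
    # Q(s, a) for the action that was actually taken in each state of the batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target value y: no gradient flows through it (it is held fixed during backprop)
    with torch.no_grad():
        y = rewards + gamma * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, y)

# Backpropagation: loss = dqn_loss(...); loss.backward(); optimizer.step()
```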


Recap’

DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state s
  - Loop over time-steps:
    - Forward propagate s in the Q-network
    - Execute the action a that has the maximum Q(s,a) output of the Q-network
    - Observe reward r and next state s'
    - Compute the target y by forward propagating state s' in the Q-network, then compute the loss
    - Update parameters with gradient descent


Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



IV. Deep Q-Networks application: Breakout (Atari)

Goal: play Breakout, i.e. destroy all the bricks.

Input of the Q-network: the game screen s.
Output of the Q-network: one Q-value per action, (Q(s,←), Q(s,→), Q(s,−)).

Would feeding the raw frames directly work? In practice the input is preprocessed first.

What is done in preprocessing φ(s)?
- Convert to grayscale
- Reduce dimensions (h, w)
- Keep a history of the last 4 frames

[Demo: https://www.youtube.com/watch?v=V1eYniJ0Rnk]
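A minimal NumPy sketch of such a preprocessing step. The downsampling method (keeping every other pixel) and the normalization are assumptions; real pipelines typically crop and resize more carefully:

```python
from collections import deque
import numpy as np

def to_grayscale(frame):
    """frame: (H, W, 3) RGB image as a uint8 array -> (H, W) grayscale."""
    return frame.mean(axis=2)

def downsample(frame, factor=2):
    """Crude dimensionality reduction: keep every `factor`-th pixel."""
    return frame[::factor, ::factor]

class FrameHistory:
    """Builds phi(s): a stack of the last 4 preprocessed frames, shape (4, H, W)."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        processed = downsample(to_grayscale(frame)) / 255.0
        if not self.frames:                          # at episode start, repeat the first frame
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        return np.stack(self.frames, axis=0)
```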
IV. Deep Q-Networks application: Breakout (Atari)

Input of the Q-network: φ(s), the stack of preprocessed frames.

Deep Q-network architecture:

    φ(s) → CONV → ReLU → CONV → ReLU → CONV → ReLU → FC (ReLU) → FC (linear) → (Q(s,←), Q(s,→), Q(s,−))
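A minimal PyTorch sketch of a network with this shape. The filter sizes, strides, and channel counts are assumptions in the spirit of Mnih et al. (2015); the slide only specifies the CONV/ReLU ×3 → FC (ReLU) → FC (linear) structure:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """phi(s) -> [CONV -> ReLU] x3 -> FC (ReLU) -> FC (linear) -> one Q-value per action."""
    def __init__(self, num_actions=3, in_frames=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature maps when phi(s) is 4 x 84 x 84
            nn.Linear(512, num_actions),             # linear output layer: Q(s, a) for each action
        )

    def forward(self, phi_s):                        # phi_s: (batch, 4, 84, 84) float tensor
        return self.head(self.conv(phi_s))

q_net = QNetwork()
print(q_net(torch.zeros(1, 4, 84, 84)).shape)        # torch.Size([1, 3])
```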


Recap’ (+ preprocessing + terminal state)

Some training challenges:
- Keep track of the terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)

DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Check if s' is a terminal state. Compute the target y by forward propagating φ(s') in the Q-network, then compute the loss:
        if terminal = False:  y = r + γ max_{a'} Q(φ(s'), a')
        if terminal = True:   y = r   (and break out of the time-step loop)
    - Update parameters with gradient descent
IV - DQN training challenges: experience replay

One experience (one transition) leads to one iteration of gradient descent:
    E1:  φ(s)   → a   → r   → φ(s')
    E2:  φ(s')  → a'  → r'  → φ(s'')
    E3:  φ(s'') → a'' → r'' → φ(s''')
    ...

Current method: start from the initial state s and train on E1, then E2, then E3, ... in order.

With experience replay: store every experience in a replay memory D and, at each step, train on a random
sample drawn from D (E1, then sample(E1, E2), then sample(E1, E2, E3), then sample(E1, E2, E3, E4), ...).
This can be used with mini-batch gradient descent. Advantages of experience replay?
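A minimal sketch of a replay memory D, storing each transition as a (φ(s), a, r, φ(s'), terminal) tuple; the capacity and batch size are assumptions:

```python
import random
from collections import deque

class ReplayMemory:
    """Replay memory D: store transitions, then sample random mini-batches from them."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)     # the oldest experiences are dropped when full

    def add(self, phi_s, action, reward, phi_s_next, terminal):
        self.memory.append((phi_s, action, reward, phi_s_next, terminal))

    def sample(self, batch_size=32):
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        return tuple(zip(*batch))                # (states, actions, rewards, next_states, terminals)

    def __len__(self):
        return len(self.memory)
```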
Recap’ (+ experience replay)

Some training challenges:
- Keep track of the terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add experience (φ(s), a, r, φ(s')) to replay memory D
      (the transition resulting from this step is added to D, and will not always be used in this iteration's update!)
    - Sample a random mini-batch of transitions from D
    - Check if s' is a terminal state. Compute the targets y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent, using the sampled transitions
Exploration vs. Exploitation

Setup: from the initial state S1, three actions lead to three terminal states:
    a1 → S2, terminal, R = +0
    a2 → S3, terminal, R = +1
    a3 → S4, terminal, R = +1000

Just after initializing the Q-network, we get:
    Q(S1, a1) = 0.5
    Q(S1, a2) = 0.4
    Q(S1, a3) = 0.3

Acting greedily, the agent first picks a1, observes R = 0, and Q(S1, a1) is updated towards 0.
It then picks a2, observes R = +1, and Q(S1, a2) is updated towards 1.
From then on it keeps picking a2: S4 and its reward of +1000 will never be visited,
because Q(S1, a3) < Q(S1, a2). Hence the need for exploration (epsilon-greedy action choice).
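A minimal sketch of the epsilon-greedy action choice (the fixed epsilon is an assumption; in practice epsilon is usually annealed from 1.0 to a small value during training):

```python
import random
import torch

def epsilon_greedy_action(q_net, phi_s, num_actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (max-Q action)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)          # explore: random action
    with torch.no_grad():
        q_values = q_net(phi_s.unsqueeze(0))          # add a batch dimension
    return int(q_values.argmax(dim=1).item())         # exploit: action with maximum Q(phi(s), a)
```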


Recap’ (+ epsilon-greedy action)

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a. Otherwise:
      - Forward propagate φ(s) in the Q-network
      - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add experience (φ(s), a, r, φ(s')) to replay memory D
    - Sample a random mini-batch of transitions from D
    - Check if s' is a terminal state. Compute the targets y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent
Overall recap’

Key ingredients: preprocessing, detecting the terminal state, experience replay, epsilon-greedy action choice.

DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a. Otherwise:
      - Forward propagate φ(s) in the Q-network
      - Execute the action a that has the maximum Q(φ(s),a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add experience (φ(s), a, r, φ(s')) to replay memory D
    - Sample a random mini-batch of transitions from D
    - Check if s' is a terminal state. Compute the targets y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent
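Pulling the pieces together, a compact sketch of the whole loop. It reuses the hypothetical helpers sketched earlier (QNetwork, ReplayMemory, FrameHistory, epsilon_greedy_action) and assumes an environment where env.reset() returns the first frame and env.step(action) returns (frame, reward, terminal); these interfaces and the hyperparameters are illustrative, not prescribed by the lecture.

```python
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, memory, num_episodes=1000, gamma=0.99,
              epsilon=0.1, batch_size=32, lr=1e-4, num_actions=3):
    """DQN training loop: epsilon-greedy acting, replay memory, Bellman targets, SGD updates."""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(num_episodes):
        history = FrameHistory()
        phi_s = torch.as_tensor(history.push(env.reset()), dtype=torch.float32)
        terminal = False
        while not terminal:
            # Epsilon-greedy action choice, then one step in the environment
            a = epsilon_greedy_action(q_net, phi_s, num_actions, epsilon)
            frame, r, terminal = env.step(a)
            phi_s_next = torch.as_tensor(history.push(frame), dtype=torch.float32)
            memory.add(phi_s, a, r, phi_s_next, terminal)

            # Sample a random mini-batch of transitions from D and take one gradient step
            states, actions, rewards, next_states, dones = memory.sample(batch_size)
            states, next_states = torch.stack(states), torch.stack(next_states)
            actions = torch.as_tensor(actions)
            rewards = torch.as_tensor(rewards, dtype=torch.float32)
            done = torch.as_tensor(dones, dtype=torch.float32)

            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                    # target held fixed; y = r at terminal states
                y = rewards + gamma * (1.0 - done) * q_net(next_states).max(dim=1).values
            loss = F.mse_loss(q_sa, y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            phi_s = phi_s_next
```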
Results

[https://www.youtube.com/watch?v=TmPfTpjtdgg]


Other Atari games

Pong SeaQuest Space Invaders

[https://www.youtube.com/watch?v=NirMkC5uvWU]
[https://www.youtube.com/watch?v=p88R2_3yWPA]
[https://www.youtube.com/watch?v=W2CAghUiofY&t=2s]
Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics



VI - Advanced topics

AlphaGo

[DeepMind Blog]
[Silver et al. (2017): Mastering the game of Go without human knowledge]
VI - Advanced topics: Competitive self-play

[Bansal et al. (2017): Emergent Complexity via multi-agent competition]
[OpenAI Blog: Competitive self-play]
VI - Advanced topics

Meta learning

[Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]
VI - Advanced topics: Imitation learning

[Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation]
[Ho et al. (2016): Generative Adversarial Imitation Learning]
VI - Advanced topics: Auxiliary task


Announcements

For Tuesday 06/05, 9am:

This Friday:
• TA Sections: how to have a great final project write-up.
  • Advice on how to write a great report.
  • Advice on how to build a super poster.
  • Advice on final project grading criteria.
  • Going through examples of great projects and why they were great.
  • Small competitive quiz in section.
