drl_v5
Policy Gradient
Actor-Critic
Reward Shaping
Example: Playing Video Game
• Space Invaders
• Termination: all the aliens are killed, or your spaceship is destroyed.
[Game screen: the score is the reward; kill the aliens; shields; the "fire" action]
Example: Playing Video Game
[Diagram: the Environment sends an observation (the game screen) to the Actor; the Actor outputs the action "right"; the Environment returns reward = 0]
Example: Playing Video Game
Find an actor maximizing expected reward.
[Diagram: the Actor outputs the action "fire"; the Environment returns reward = 5 if an alien is killed]
Example: Learning to play Go
[Diagram: the observation is the board position; the action is the next move]
Example: Learning to play Go
Find an actor maximizing expected reward.
reward = 0 in most cases
If win, reward = 1
If loss, reward = -1
Machine Learning is so simple ……
[Network diagram: input = the game pixels; output = a distribution over the actions, e.g., fire 0.1]
Classification Task!!!
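To make the "classification task" view concrete, here is a minimal sketch of such an actor network, assuming PyTorch, a flattened pixel vector as input, and three actions; the layer sizes and action count are illustrative choices, not from the slides.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: maps an observation (game pixels) to scores over the actions."""
    def __init__(self, obs_dim, n_actions=3):   # e.g. left / right / fire
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)                     # unnormalized action scores (logits)

# Turning the scores into a distribution and sampling (rather than taking the arg-max)
# keeps the actor exploratory:
# probs = torch.softmax(policy(obs), dim=-1)
# action = torch.distributions.Categorical(probs).sample()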
After many turns, the game is over (the spaceship is destroyed); taking action a_T and obtaining reward r_T ends the episode. This is an episode.
Total reward (return): R = Σ_{t=1}^{T} r_t
What we want: an actor that maximizes the total reward.
Step 3: Optimization
Trajectory: τ = {s_1, a_1, s_2, a_2, ⋯}
[Diagram: given s_1 the Network outputs a_1, the environment returns s_2, the Network outputs a_2, ⋯]
R(τ) = Σ_{t=1}^{T} r_t
How to do the optimization here is the main challenge in RL. (c.f. GAN)
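As a sketch of one episode and its total reward R(τ) = Σ_{t=1}^{T} r_t, assuming a hypothetical environment object with reset() and step(action) methods (a generic interface, not a specific library API):

def run_episode(env, policy):
    """Roll out one episode and return the trajectory and the total reward R(tau)."""
    trajectory, total_reward = [], 0.0
    s = env.reset()                  # s_1
    done = False
    while not done:
        a = policy(s)                # actor picks a_t from s_t
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        total_reward += r            # R(tau) = sum over t of r_t
        s = s_next
    return trajectory, total_reward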
To learn more about policy gradient: https://youtu.be/W8XF3ME8G2I
Outline
Policy Gradient
Actor-Critic
Reward Shaping
How to control your actor
To make the actor take a desired action for one observation and avoid an undesired action for another, minimize L = e_1 − e_2, where e_1 and e_2 are the corresponding cross-entropies:  θ* = arg min_θ L
Version 0
Collect training data over many episodes by letting the actor interact with the environment:
[Diagram: Env gives s_1, the Actor takes a_1 = "right" and gets reward r_1 = 0; Env gives s_2, the Actor takes a_2 = "fire" and gets reward r_2 = +5; ……]
Training Data:
{s_1, a_1}   A_1 = r_1
{s_2, a_2}   A_2 = r_2
……
{s_N, a_N}   A_N = r_N
Short-sighted version! Each action is judged only by the reward obtained right after it, ignoring its effect on later rewards.
Version 1
[Trajectory: s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, ……, s_N, a_N, r_N]
Training Data:
{s_1, a_1}   A_1 = G_1
{s_2, a_2}   A_2 = G_2
{s_3, a_3}   A_3 = G_3
……
{s_N, a_N}   A_N = G_N
Cumulated reward: G_t = Σ_{n=t}^{N} r_n
G_1 = r_1 + r_2 + r_3 + …… + r_N
G_2 = r_2 + r_3 + …… + r_N
G_3 = r_3 + …… + r_N
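A minimal sketch of Version 1's cumulated reward, computed with a single backward pass over one episode's rewards:

def cumulated_rewards(rewards):
    """Version 1: G_t = r_t + r_{t+1} + ... + r_N (reward-to-go)."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

# cumulated_rewards([0, 5, 0, 1]) -> [6.0, 6.0, 1.0, 1.0]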
Version 2
Training Data:
{s_1, a_1}   A_1 = G_1'
{s_2, a_2}   A_2 = G_2'
{s_3, a_3}   A_3 = G_3'
……
{s_N, a_N}   A_N = G_N'
Should the distant r_N really count as full credit of a_1 (as in G_1 = r_1 + r_2 + r_3 + …… + r_N)?
Discount the later rewards: G_1' = r_1 + γ r_2 + γ² r_3 + ……
Discounted cumulated reward: G_t' = Σ_{n=t}^{N} γ^{n−t} r_n, with discount factor γ < 1
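Version 2 only changes the accumulation, since G_t' = r_t + γ G_{t+1}'; a sketch (the default γ = 0.99 is just an illustrative choice):

def discounted_rewards(rewards, gamma=0.99):
    """Version 2: G'_t = sum_{n=t}^{N} gamma^(n-t) * r_n, via G'_t = r_t + gamma * G'_{t+1}."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G   # gamma = 1.0 recovers Version 1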
Version 3
Training Data:
{s_1, a_1}   A_1 = G_1' − b
{s_2, a_2}   A_2 = G_2' − b
{s_3, a_3}   A_3 = G_3' − b
……
{s_N, a_N}   A_N = G_N' − b
Good or bad reward is "relative": if all the r_n ≥ 10, then r_n = 10 is actually a bad outcome …
Subtract a baseline b (but how to choose b?) so that G_t' − b takes both positive and negative values.
G_t' = Σ_{n=t}^{N} γ^{n−t} r_n
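One simple, illustrative choice for Version 3 is to use the batch mean of the G_t' as the baseline b, which already gives the A_t both signs; the slides leave the choice of b open until Version 3.5.

def advantages_with_baseline(discounted, baseline=None):
    """Version 3: A_t = G'_t - b. The batch mean is one common, simple choice of b."""
    if baseline is None:
        baseline = sum(discounted) / len(discounted)
    return [g - baseline for g in discounted]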
Policy Gradient
• Initialize actor network parameters θ^0
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact with the environment
  • Obtain data {s_1, a_1}, {s_2, a_2}, …, {s_N, a_N}
  • Compute A_1, A_2, …, A_N
  • Compute loss L
  • θ^i ← θ^{i−1} − η∇L
Data collection is in the "for loop" of training iterations.
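A minimal sketch of the loop above; collect_episode, compute_advantages, and update_actor are hypothetical callbacks (names invented here, not from the slides), so the sketch only fixes the structure of the algorithm.

def policy_gradient(env, actor, collect_episode, compute_advantages, update_actor,
                    num_iterations, learning_rate):
    """Skeleton of the algorithm: data collection sits inside the training loop."""
    for i in range(1, num_iterations + 1):
        # use actor theta^{i-1} to interact with the environment
        states, actions, rewards = collect_episode(env, actor)
        # compute A_1, ..., A_N (Versions 0-4 differ only in this step)
        advantages = compute_advantages(rewards)
        # compute loss L and take one gradient step: theta^i <- theta^{i-1} - eta * grad L
        update_actor(actor, states, actions, advantages, learning_rate)
    return actor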
Policy Gradient
Training Data (collected by the actor θ^{i−1}; the actor θ maps s to a):
{s_1, a_1}   A_1
{s_2, a_2}   A_2
{s_3, a_3}   A_3
……
{s_N, a_N}   A_N
L = Σ_n A_n e_n, where e_n is the cross-entropy between the actor's output for s_n and the action a_n
θ^i ← θ^{i−1} − η∇L
Only update once: the trajectory of θ^{i−1} is used to train θ^i only; the experience of θ^{i−1} may not apply to every actor.
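A sketch of the loss L = Σ_n A_n e_n in PyTorch, assuming the actor returns unnormalized action scores and that states, actions, and advantages are batched tensors (shapes are assumptions, not from the slides):

import torch
import torch.nn.functional as F

def policy_gradient_loss(actor, states, actions, advantages):
    """L = sum_n A_n * e_n; positive A_n push the actor toward a_n, negative A_n away from it."""
    logits = actor(states)                                    # shape: (N, n_actions)
    e = F.cross_entropy(logits, actions, reduction="none")    # e_n for each pair {s_n, a_n}
    return (advantages * e).sum()

# After one gradient step on this loss, collect new data with the updated actor
# (on-policy: the old trajectories no longer match the new actor).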
Policy Gradient
Actor-Critic
Reward Shaping
Critic: value function V^θ(s)
Given actor θ, V^θ(s) is the (discounted) cumulated reward expected to be obtained after seeing s.
[Diagram: s → V^θ → a scalar V^θ(s); one game screen where V^θ(s) is large, another where V^θ(s) is smaller]
How to estimate V^θ(s)
• Monte-Carlo (MC) approach
After seeing s_a, until the end of the episode, the cumulated reward is G_a'; train V^θ(s_a) to be close to G_a'.
After seeing s_b, until the end of the episode, the cumulated reward is G_b'; train V^θ(s_b) to be close to G_b'.
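A sketch of the MC idea: the critic is just a regression network trained so that V^θ(s_a) matches the observed cumulated reward G_a'; the architecture and squared-error loss are illustrative choices.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Critic: maps an observation s to a single scalar V(s)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def mc_critic_loss(critic, states, returns):
    """Monte-Carlo target: V(s_t) should match the observed G'_t of that episode."""
    return ((critic(states) - returns) ** 2).mean()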
How to estimate V^θ(s)
• Temporal-difference (TD) approach
⋯ s_t, a_t, r_t, s_{t+1} ⋯   (ignore the expectation here)
V^θ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯
V^θ(s_{t+1}) = r_{t+1} + γ r_{t+2} + ⋯
⇒ V^θ(s_t) = γ V^θ(s_{t+1}) + r_t
[Diagram: feed s_t and s_{t+1} into V^θ and train so that V^θ(s_t) − γ V^θ(s_{t+1}) is close to r_t]
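A sketch of the TD version, training V^θ(s_t) − γ V^θ(s_{t+1}) toward r_t; detaching the bootstrapped value as a fixed target is a common choice, not something the slide specifies.

import torch

def td_critic_loss(critic, s_t, r_t, s_next, gamma=0.99):
    """Temporal-difference target: V(s_t) should match r_t + gamma * V(s_{t+1})."""
    with torch.no_grad():                        # treat the bootstrapped value as a fixed target
        target = r_t + gamma * critic(s_next)
    return ((critic(s_t) - target) ** 2).mean()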
MC v.s. TD: the two approaches can give different estimates from the same data [Sutton, v2, Example 6.4]
Back to Version 3: how should we set the baseline b in A_t = G_t' − b?
Training Data:
{s_1, a_1}   A_1 = G_1' − b
{s_2, a_2}   A_2 = G_2' − b
{s_3, a_3}   A_3 = G_3' − b
……
{s_N, a_N}   A_N = G_N' − b
Use the critic: s → V^θ → V^θ(s)
Version 3.5
Training Data:
{s_1, a_1}   A_1 = G_1' − V^θ(s_1)
{s_2, a_2}   A_2 = G_2' − V^θ(s_2)
{s_3, a_3}   A_3 = G_3' − V^θ(s_3)
……
{s_N, a_N}   A_N = G_N' − V^θ(s_N)
The baseline is the critic's output: s → V^θ → V^θ(s)
Version 3.5:  for {s_t, a_t}, A_t = G_t' − V^θ(s_t)
V^θ(s_t) is the average cumulated reward over many possible continuations from s_t (e.g., samples G = 100, 3, 1, 2, −10), not necessarily taking a_t; the actions are sampled from a distribution.
G_t' is the cumulated reward of the one continuation in which a_t was actually taken; it is just a sample.
A_t > 0: a_t is better than average.
A_t < 0: a_t is worse than average.
Version 4: Advantage Actor-Critic
A_t = r_t + V^θ(s_{t+1}) − V^θ(s_t)    (replacing A_t = G_t' − V^θ(s_t))
V^θ(s_t): the average cumulated reward from s_t (e.g., samples G = 100, 3, 1, 2, −10), not necessarily taking a_t.
After actually taking a_t at s_t, we obtain r_t and reach s_{t+1}; instead of one sampled G_t', use r_t + V^θ(s_{t+1}), where V^θ(s_{t+1}) averages the possible cumulated rewards from s_{t+1} (e.g., samples G = 101, 4, 3, 1, −5).
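A sketch of the Version 4 advantage for a batch of steps, keeping the slide's form (no discount on V^θ(s_{t+1}); adding one would be the more common variant); states, rewards, and next_states are assumed to be tensors.

import torch

def advantage_v4(critic, states, rewards, next_states):
    """Advantage Actor-Critic: A_t = r_t + V(s_{t+1}) - V(s_t), both terms from the critic."""
    with torch.no_grad():
        v_t = critic(states)            # V(s_t) for every step in the batch
        v_next = critic(next_states)    # V(s_{t+1}); use 0 for the terminal state
    return rewards + v_next - v_t

# Version 3.5 instead used A_t = G'_t - V(s_t), where G'_t is a single sampled return.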
Tip of Actor-Critic
• The parameters of the actor and the critic can be shared.
[Diagram: a shared backbone network; the actor head outputs the action distribution (left, right, fire, …) and the critic head outputs the scalar V^θ(s)]
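A sketch of the sharing tip: one backbone with two heads, where the actor head outputs action scores and the critic head outputs the scalar V^θ(s); layer sizes are illustrative.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared backbone; the actor head and the critic head reuse the same features."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.actor_head = nn.Linear(128, n_actions)   # action scores (e.g. left / right / fire)
        self.critic_head = nn.Linear(128, 1)          # scalar V(s)

    def forward(self, obs):
        h = self.backbone(obs)
        return self.actor_head(h), self.critic_head(h).squeeze(-1)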
Video:
https://youtu.be/o_g9JUMw1Oc
https://youtu.be/2-zGCx4iv_k
https://arxiv.org/abs/1710.02298
Outline
Policy Gradient
Actor-Critic
Reward Shaping
Sparse reward: if r_t = 0 in most cases, most of the A_n in the training data carry no useful signal, and the actor rarely sees a positive reward to learn from.
Reward shaping: define extra rewards to guide the actor.
See also (reverse curriculum generation): https://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/
Reward Shaping - Curiosity: the actor obtains an extra reward when it sees something new. https://arxiv.org/abs/1705.05363
Policy Gradient
Actor-Critic
Reward Shaping
Imitation learning
The actor can interact with the environment, but the reward function is not available.
[Trajectory: s_1, a_1, s_2, a_2, s_3, ⋯ with unknown rewards (?????)]
Instead, we have demonstrations from an expert: for observation s_i, the expert takes action â_i.
Example: https://www.youtube.com/watch?v=j2FSB3bseek
Inverse Reinforcement Learning
[Diagram: from the expert's demonstrations τ̂_1, τ̂_2, ⋯, τ̂_K and the Environment, learn a Reward Function R; then find an actor π based on R by reinforcement learning]
Principle: the expert's trajectories should always score higher, Σ_{n=1}^{K} R(τ̂_n) > Σ_{n=1}^{K} R(τ_n), where τ_1, τ_2, ⋯, τ_K are the actor's trajectories.
GAN analogy: the Actor = Generator (it produces the trajectories τ_1, τ_2, ⋯, τ_K); the Reward Function = Discriminator (high score for the expert's trajectories, low score for the generated ones).
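A sketch of one reward-function update in this alternating scheme, assuming trajectories are summarized as feature tensors and the reward is a learnable linear model (both assumptions for illustration); the actor-improvement step is a full reinforcement-learning run of its own and is only indicated by a comment.

import torch
import torch.nn as nn

reward_fn = nn.Linear(16, 1)      # R(tau) from a 16-dim trajectory feature (illustrative size)
optimizer = torch.optim.Adam(reward_fn.parameters(), lr=1e-3)

def update_reward_function(expert_feats, actor_feats):
    """Push R so that expert trajectories score higher than the actor's (discriminator-like step)."""
    expert_score = reward_fn(expert_feats).sum()   # sum over n of R(tau_hat_n)
    actor_score = reward_fn(actor_feats).sum()     # sum over n of R(tau_n)
    loss = actor_score - expert_score              # minimizing puts the expert above the actor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Alternate with: run reinforcement learning to find an actor that maximizes the current R,
# then collect new actor trajectories tau_1..tau_K and update R again (cf. the GAN analogy).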
Policy Gradient
Actor-Critic
Sparse Reward