
Introduction to Deep Reinforcement Learning (RL)

Hung-yi Lee
Supervised Learning → RL
• Supervised learning needs human labels, e.g., labeling an image as "cat", or labeling the best next move (such as "3-3") in Go.
• It is challenging to label data in some tasks, but the machine can know whether its results are good or not.
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic

Reward Shaping

No Reward: Learning from Demonstration


Machine Learning ≈ Looking for a Function
• The actor is a function: Action = f(Observation).
  • Function input: the Observation from the Environment.
  • Function output: the Action, which is sent back to the Environment.
• The Environment returns a Reward.
• Goal: find a policy (actor) maximizing the total reward.
Example: Playing Video Game
• Space Invaders
• Termination: all the aliens are killed, or your spaceship is destroyed.
• Reward: the score, obtained by killing the aliens.
• Screen elements: aliens, shields, and your spaceship, which can move and fire.
Example: Playing Video Game
• The Environment gives the Actor an Observation (the game screen).
• The Actor outputs an Action, e.g., "right".
• Reward = 0 (simply moving does not score).
Example: Playing Video Game
• Find an actor maximizing expected reward.
• The Actor outputs the Action "fire"; reward = 5 if an alien is killed.
Example: Learning to Play Go
• Observation: the board position. Action: the next move. Environment: the opponent.
• Find an actor maximizing expected reward.
• Reward = 0 in most cases; if the actor wins, reward = 1; if it loses, reward = -1.
Machine Learning is so simple ……
• Step 1: define a function with unknown parameters
• Step 2: define loss from training data
• Step 3: optimization
Step 1: Function with Unknown
• The policy network (actor) takes the game pixels as input and outputs scores of actions, e.g., left 0.7, right 0.2, fire 0.1.
• The action is sampled based on these scores.
• This is a classification task!
• Input of the neural network: the observation of the machine, represented as a vector or a matrix.
• Output of the neural network: each action corresponds to a neuron in the output layer. (A minimal sketch follows below.)
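As a rough illustration of this slide (not code from the lecture), a minimal policy network in PyTorch might look as follows; the observation size, hidden width, and the three actions are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: maps an observation vector to scores over actions."""
    def __init__(self, obs_dim=128, n_actions=3):  # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one output neuron per action
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)  # scores of actions, e.g. [0.7, 0.2, 0.1]

# Sample an action based on the scores (not argmax).
actor = PolicyNetwork()
obs = torch.randn(1, 128)                 # stand-in for flattened pixels
probs = actor(obs)
action = torch.distributions.Categorical(probs=probs).sample()
```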
Step 2: Define “Loss”
• Start with observation s_1 → take action a_1 = “right” → obtain reward r_1 = 0 → observation s_2 → take action a_2 = “fire” (kill an alien) → obtain reward r_2 = 5 → observation s_3 → ……
Step 2: Define “Loss”
• Start with observation s_1 and keep interacting; after many turns the game is over (spaceship destroyed) and the final action a_T obtains reward r_T. This is an episode.
• Total reward (return) of the episode: R = Σ_{t=1}^{T} r_t.
• R is what we want to maximize, so the “loss” is defined as −R.
Step 3: Optimization
• A trajectory is τ = {s_1, a_1, s_2, a_2, ⋯}.
• Env → s_1 → Actor (network) → a_1 → Env → s_2 → Actor → a_2 → ……
• Each reward r_t is computed from s_t and a_t; the total reward is R(τ) = Σ_{t=1}^{T} r_t.
• The environment and the reward are black boxes with randomness, and the actor samples its actions, so this is not ordinary gradient descent.
• How to do the optimization here is the main challenge in RL. (c.f. GAN) A rollout sketch follows below.
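To make the trajectory and R(τ) concrete, here is a minimal rollout sketch; the DummyEnv and random_actor are stand-ins invented for illustration, not the lecture's environment.

```python
import random

class DummyEnv:
    """Stand-in for the black-box environment (with randomness)."""
    def reset(self):
        self.t = 0
        return [0.0]                      # initial observation s1
    def step(self, action):
        self.t += 1
        reward = random.choice([0, 5])    # e.g. 0, or +5 for killing an alien
        done = self.t >= 10               # episode ends after a few turns
        return [float(self.t)], reward, done

def random_actor(obs):
    return random.choice(["left", "right", "fire"])

# Roll out one episode: trajectory tau = {s1, a1, s2, a2, ...}, return R(tau) = sum_t r_t
env = DummyEnv()
obs, done, total_reward, trajectory = env.reset(), False, 0, []
while not done:
    action = random_actor(obs)
    next_obs, reward, done = env.step(action)
    trajectory.append((obs, action, reward))
    total_reward += reward
    obs = next_obs
print("R(tau) =", total_reward)
```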
(To learn more about policy gradient: https://youtu.be/W8XF3ME8G2I)

Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic

Reward Shaping

No Reward: Learning from Demonstration


How to control your actor
• Make it take (or not take) a specific action â given a specific observation s.
• Feed s into the actor θ; compare its output distribution a with the target action â (e.g., left 1, right 0, fire 0) using the cross-entropy e.
• To make the actor take action â: define L = e and find θ* = arg min_θ L.
• To make the actor avoid action â: define L = −e.
How to control your actor
• To take action â given s and avoid action â′ given s′:
  • e_1: cross-entropy between the actor’s output at s and â (e.g., left 1, right 0, fire 0).
  • e_2: cross-entropy between the actor’s output at s′ and â′ (e.g., left 0, right 1, fire 0).
• L = e_1 − e_2,  θ* = arg min_θ L.
How to control your actor
• Training data: pairs {s_n, â_n} labeled “Yes” (+1, take it) or “No” (−1, don’t take it), e.g., {s_1, â_1} +1, {s_2, â_2} −1, {s_3, â_3} +1, ……, {s_N, â_N} −1.
• L = +e_1 − e_2 + e_3 ⋯ − e_N,  θ* = arg min_θ L.
How to control your actor
• Instead of only ±1, each pair {s_n, â_n} gets a real-valued weight A_n describing how much we want that action, e.g., A_1 = +1.5, A_2 = −0.5, A_3 = +0.5, ……, A_N = −10.
• L = Σ_n A_n e_n,  θ* = arg min_θ L.  (A sketch of this loss follows below.)
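A minimal sketch of this weighted loss in PyTorch, assuming cross-entropy e_n between the actor's logits at s_n and the recorded action â_n; the toy actor and data are made up for illustration.

```python
import torch
import torch.nn.functional as F

def control_loss(actor, states, taken_actions, weights):
    """L = sum_n A_n * e_n, where e_n is the cross-entropy between the
    actor's output distribution at s_n and the recorded action a_hat_n.
    A_n > 0 encourages the action, A_n < 0 discourages it."""
    logits = actor(states)                                        # (N, n_actions)
    e = F.cross_entropy(logits, taken_actions, reduction="none")  # e_1 ... e_N
    return (weights * e).sum()

# Toy usage with an assumed linear actor and made-up data.
actor = torch.nn.Linear(4, 3)                    # 4-dim observation, 3 actions
states = torch.randn(5, 4)
taken_actions = torch.tensor([0, 2, 1, 0, 2])    # indices of a_hat_n
weights = torch.tensor([+1.5, -0.5, +0.5, +1.0, -10.0])  # the A_n
loss = control_loss(actor, states, taken_actions, weights)
loss.backward()
```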
How should we define A_n?

Version 0
• Use the current actor to interact with the environment over many episodes, recording {s_1, a_1}, {s_2, a_2}, … and the rewards r_1, r_2, …
• Training data: each {s_n, a_n} with A_n = r_n, the reward obtained immediately after the action (A_1 = r_1, A_2 = r_2, …, A_N = r_N).
Version 0 is Short-sighted!
• Taking a_1 = “right” at s_1 gets r_1 = 0, while a_2 = “fire” at s_2 gets r_2 = +5; judging each action only by its immediate reward is misleading.
• An action affects the subsequent observations and thus subsequent rewards.
• Reward delay: the actor has to sacrifice immediate reward to gain more long-term reward.
• In Space Invaders, only “fire” yields positive reward, so Version 0 will learn an actor that always fires.
Version 1
• Evaluate a_t by the cumulated reward of the rest of the episode: A_t = G_t, where
  G_t = Σ_{n=t}^{N} r_n
  e.g., G_1 = r_1 + r_2 + r_3 + …… + r_N, G_2 = r_2 + r_3 + …… + r_N, G_3 = r_3 + …… + r_N.
• Training data: {s_1, a_1} with A_1 = G_1, {s_2, a_2} with A_2 = G_2, …, {s_N, a_N} with A_N = G_N.
Version 2
• Is r_N really also the credit of a_1? Rewards far in the future should count less, so discount them:
  G′_t = Σ_{n=t}^{N} γ^{n−t} r_n, with discount factor γ < 1
  e.g., G′_1 = r_1 + γ r_2 + γ² r_3 + ……
• Training data: {s_n, a_n} with A_n = G′_n.
Version 3
• Good or bad reward is “relative”: if all r_n ≥ 10, then r_n = 10 is actually negative (worse than usual).
• Subtract a baseline b so that the weights take both positive and negative values: A_t = G′_t − b.
• How to choose b? (See Version 3.5 later; a sketch of the return computation follows below.)
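A minimal sketch of Versions 1-3 in plain Python; the choice of the mean return as the baseline b is an assumption (the slide leaves b open), used here only so that the A_t take both signs.

```python
def cumulated_rewards(rewards, gamma=0.99):
    """Version 1/2: G'_t = sum_{n=t}^{N} gamma^(n-t) * r_n (gamma=1 gives Version 1)."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def advantages_with_baseline(rewards, gamma=0.99):
    """Version 3: A_t = G'_t - b, with the mean return as one simple choice of
    baseline so that the A_t take both positive and negative values."""
    G = cumulated_rewards(rewards, gamma)
    b = sum(G) / len(G)
    return [g - b for g in G]

# e.g. all rewards >= 10: without the baseline, every action would look "good".
print(advantages_with_baseline([10, 10, 12, 10], gamma=1.0))
```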
Policy Gradient
• Initialize actor network parameters θ⁰
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact with the environment
  • Obtain data {s_1, a_1}, {s_2, a_2}, …, {s_N, a_N}
  • Compute A_1, A_2, …, A_N
  • Compute loss L
  • Update θ^i ← θ^{i−1} − η∇L
• Note: data collection is inside the “for loop” of training iterations.
Policy Gradient
• From the collected training data {s_n, a_n} with weights A_n, compute L = Σ_n A_n e_n and update θ^i ← θ^{i−1} − η∇L — but only update once.
• Each time you update the model parameters, you need to collect the whole training set again. (A training-loop sketch follows below.)
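A sketch of the training loop, emphasizing that data collection sits inside the iteration loop and the data is discarded after a single update; collect_episodes is a made-up stub standing in for actual interaction with the environment.

```python
import torch
import torch.nn.functional as F

# Assumed helper (not from the lecture): collect_episodes() would interact with the
# environment using the *current* actor and return states, taken actions and A_n.
def collect_episodes(actor, n_steps=200):
    states = torch.randn(n_steps, 4)
    actions = torch.randint(0, 3, (n_steps,))
    advantages = torch.randn(n_steps)
    return states, actions, advantages

actor = torch.nn.Linear(4, 3)                      # theta^0
optimizer = torch.optim.SGD(actor.parameters(), lr=1e-3)

for i in range(1, 11):                             # training iterations
    # Data collection is *inside* the loop: this is the experience of theta^{i-1} ...
    states, actions, A = collect_episodes(actor)
    e = F.cross_entropy(actor(states), actions, reduction="none")
    L = (A * e).sum()                              # L = sum_n A_n e_n
    optimizer.zero_grad()
    L.backward()
    optimizer.step()                               # theta^i <- theta^{i-1} - eta * grad L
    # ... and it is then thrown away: the next iteration collects fresh data (on-policy).
```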
Policy Gradient
• Initialize actor network parameters θ⁰
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact and obtain data {s_1, a_1}, …, {s_N, a_N} — this is the experience of θ^{i−1}
  • Compute A_1, …, A_N, compute loss L, and update θ^i ← θ^{i−1} − η∇L
• The experience of θ^{i−1} may not be good for θ^i. (Reference: Hikaru no Go, Episode 8 / 棋魂第八集)
Policy Gradient
• (Same loop as above: in iteration i, actor θ^{i−1} interacts, we obtain {s_1, a_1}, …, {s_N, a_N}, compute A_1, …, A_N and the loss L, then update θ^i ← θ^{i−1} − η∇L.)
• The trajectory of θ^{i−1} (s_1, a_1, r_1, …, s_N, a_N, r_N) may not even be observed by θ^i, so its experience does not directly transfer.
On-policy v.s. Off-policy
• On-policy: the actor to train and the actor used for interacting are the same.
• Off-policy: can the actor to train and the actor for interacting be different? Then the trajectory of θ^{i−1} (s_1, a_1, r_1, …, s_N, a_N, r_N) could be reused to train θ^i.
• In this way, we do not have to collect data after each update.


Off-policy → Proximal Policy Optimization (PPO)
• The actor to train has to know its difference from the actor used to interact: experience that works for the interacting actor may not apply to the actor being trained.
• Video: https://youtu.be/OAKAZhFmYoI
• (Meme illustration: https://disp.cc/b/115-bLHe)
• A sketch of the standard PPO objective follows below.
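The slides only state the idea; the sketch below shows the standard clipped-surrogate loss from the PPO paper (not derived in the lecture), with variable names chosen for illustration.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO (to be minimized).
    logp_new: log-prob of the taken actions under the actor being trained.
    logp_old: log-prob under the actor that collected the data (kept fixed).
    The ratio measures how different the two actors are; clipping keeps the
    update from exploiting data the new actor would never have collected."""
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```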
Collecting Training Data: Exploration
• The actor needs to have randomness during data collection; this is a major reason why we sample actions. ☺
• Suppose your actor always takes “left”: we would never know what would happen if it took “fire”.
• Common tricks: enlarge the entropy of the output distribution, or add noise onto the parameters. (A sketch follows below.)
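A small sketch, assuming a categorical policy: actions are sampled rather than taken greedily, and an entropy bonus (a common trick, not prescribed by the slides) keeps the output distribution from collapsing; the weight 0.01 is arbitrary.

```python
import torch
from torch.distributions import Categorical

logits = torch.nn.Linear(4, 3)(torch.randn(8, 4))  # actor outputs for a batch of states
dist = Categorical(logits=logits)

actions = dist.sample()             # sample instead of argmax -> randomness in the data
entropy_bonus = dist.entropy().mean()

# Subtract a small entropy term from the loss to keep exploring.
policy_loss = torch.zeros(())       # placeholder for the usual sum_n A_n e_n term
loss = policy_loss - 0.01 * entropy_bonus
```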
PPO demos: DeepMind https://youtu.be/gn4nRCC9TwQ, OpenAI https://blog.openai.com/openai-baselines-ppo/
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic

Reward Shaping

No Reward: Learning from Demonstration


Critic
• Critic: given an actor θ, it evaluates how good the situation is when observing s (and possibly taking action a).
• Value function V^θ(s): when using actor θ, the discounted cumulated reward (G′ = r_1 + γ r_2 + γ² r_3 + ……) expected to be obtained after seeing s.
• V^θ takes s as input and outputs a scalar; V^θ(s) is large for promising states and smaller for bad ones.
• The output values of a critic depend on the actor being evaluated.
How to estimate V^θ(s)
• Monte-Carlo (MC) based approach: the critic watches actor θ interact with the environment.
  • After seeing s_a, the cumulated reward until the end of the episode is G′_a, so train V^θ(s_a) to be close to G′_a.
  • After seeing s_b, the cumulated reward until the end of the episode is G′_b, so train V^θ(s_b) to be close to G′_b.
How to estimate V^θ(s)
• Temporal-difference (TD) approach: use a single transition ⋯ s_t, a_t, r_t, s_{t+1} ⋯ (ignoring the expectation here):
  V^θ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯
  V^θ(s_{t+1}) = r_{t+1} + γ r_{t+2} + ⋯
  ⇒ V^θ(s_t) = γ V^θ(s_{t+1}) + r_t
• Train the network so that V^θ(s_t) − γ V^θ(s_{t+1}) is close to r_t. (A sketch follows below.)
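A minimal TD-update sketch for V^θ in PyTorch; the network size, learning rate, and use of an MSE loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # V_theta
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(s_t, r_t, s_t1):
    """Push V(s_t) towards r_t + gamma * V(s_{t+1}), i.e. make
    V(s_t) - gamma * V(s_{t+1}) close to r_t."""
    v_t = value_net(s_t)
    with torch.no_grad():                     # the TD target is treated as a constant
        target = r_t + gamma * value_net(s_t1)
    loss = F.mse_loss(v_t, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

td_update(torch.randn(1, 4), torch.tensor([[1.0]]), torch.randn(1, 4))
```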
MC v.s. TD [Sutton, v2, Example 6.4]
• The critic has observed the following 8 episodes (assume γ = 1; actions are ignored here):
  • s_a, r = 0, s_b, r = 0, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 0, END
• V^θ(s_b) = 3/4 (six of the eight visits to s_b are followed by reward 1).
• V^θ(s_a) = ? 0? 3/4?
  • Monte-Carlo: V^θ(s_a) = 0 (the only episode containing s_a has cumulated reward 0).
  • Temporal-difference: V^θ(s_a) = V^θ(s_b) + r = 3/4 + 0 = 3/4.
• (The small script below reproduces both estimates.)
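The following few lines reproduce the numbers of this example; mc_value is a helper written for this sketch.

```python
def mc_value(state, episodes):
    """Monte-Carlo: average the cumulated reward observed after visiting `state`."""
    returns = []
    for ep in episodes:
        for i, (s, _) in enumerate(ep):
            if s == state:
                returns.append(sum(r for _, r in ep[i:]))
    return sum(returns) / len(returns)

# The 8 observed episodes (gamma = 1, actions ignored).
episodes = ([[("sa", 0), ("sb", 0)]]        # sa, r=0, sb, r=0, END
            + [[("sb", 1)]] * 6             # sb, r=1, END  (6 times)
            + [[("sb", 0)]])                # sb, r=0, END

v_sb = mc_value("sb", episodes)             # 6/8 = 3/4
v_sa_mc = mc_value("sa", episodes)          # 0 (the only sa episode returns 0)
v_sa_td = 0 + v_sb                          # TD: V(sa) = r + V(sb) = 3/4
print(v_sb, v_sa_mc, v_sa_td)
```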
Version 3.5
• Same training data as Version 3: {s_n, a_n} with A_n = G′_n − b.
• How to choose the baseline b? Use a critic: a network V^θ that maps s to V^θ(s).
Version 3.5
• Set the baseline to the value of the state: A_n = G′_n − V^θ(s_n), i.e.
  {s_1, a_1}: A_1 = G′_1 − V^θ(s_1), {s_2, a_2}: A_2 = G′_2 − V^θ(s_2), …, {s_N, a_N}: A_N = G′_N − V^θ(s_N).
Version 3.5: why it works
• A_t = G′_t − V^θ(s_t).
• V^θ(s_t) is the average of the cumulated rewards that could follow s_t (e.g., samples G = 100, 3, 1, 2, −10), since actions are sampled from a distribution and the actor does not necessarily take a_t.
• G′_t is the cumulated reward of the one action a_t actually sampled.
• A_t > 0: a_t is better than average. A_t < 0: a_t is worse than average.
• But G′_t is just a single sample — can we do better?
Version 4: Advantage Actor-Critic
• Replace the single sample G′_t by r_t + V^θ(s_{t+1}): after taking a_t at s_t, we obtain r_t and reach s_{t+1}, whose average future return is V^θ(s_{t+1}) (e.g., averaging samples G = 101, 4, 3, 1, −5).
• A_t = r_t + V^θ(s_{t+1}) − V^θ(s_t): the expected return of taking a_t at s_t minus the average return over all actions at s_t. (A sketch follows below.)
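A minimal sketch of the Version 4 advantage; the γ factor is included for generality (the slide writes the γ = 1 form), and value_net is assumed to be an already-trained critic.

```python
import torch

def advantage(value_net, s_t, r_t, s_t1, gamma=0.99):
    """Version 4: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    r_t + V(s_{t+1}) replaces the single sample G'_t by the expected
    return of taking a_t at s_t, and V(s_t) is the baseline."""
    with torch.no_grad():
        return r_t + gamma * value_net(s_t1) - value_net(s_t)

value_net = torch.nn.Linear(4, 1)   # stand-in for a trained critic
print(advantage(value_net, torch.randn(1, 4), 5.0, torch.randn(1, 4)))
```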
Tip of Actor-Critic
• The parameters of the actor and the critic can be shared: a common network processes s; one head outputs the action scores (left / right / fire) for the actor, and another head outputs a scalar for the critic. (A sketch follows below.)
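A minimal sketch of parameter sharing, assuming a small fully connected trunk; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic sharing the front layers of the network."""
    def __init__(self, obs_dim=128, n_actions=3):   # sizes are illustrative
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.actor_head = nn.Linear(64, n_actions)  # scores for left / right / fire
        self.critic_head = nn.Linear(64, 1)         # scalar V(s)

    def forward(self, obs):
        h = self.shared(obs)
        return self.actor_head(h), self.critic_head(h)

logits, value = ActorCritic()(torch.randn(1, 128))
```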


Outlook: Deep Q Network (DQN)
• Video: https://youtu.be/o_g9JUMw1Oc and https://youtu.be/2-zGCx4iv_k
• Paper: https://arxiv.org/abs/1710.02298
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic

Reward Shaping

No Reward: Learning from Demonstration


Sparse Reward
• With A_t = r_t + V^θ(s_{t+1}) − V^θ(s_t), the training data {s_n, a_n}, A_n is only informative if rewards are observed.
• If r_t = 0 in most cases (e.g., a robot arm learning to bolt on screws), we don’t know whether the actions are good or bad.
• Reward shaping: the developers define extra rewards to guide the agents.
Reward Shaping: VizDoom
• https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg
• Visual Doom AI Competition @ CIG 2016: https://www.youtube.com/watch?v=94EPSjQH38Y
• Reverse curriculum: https://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/
Reward Shaping: Curiosity (https://arxiv.org/abs/1705.05363)
• The agent obtains extra reward when it sees something new (but meaningful).
• Source of video: https://pathak22.github.io/noreward-rl/


Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic

Reward Shaping

No Reward: Learning from Demonstration


Motivation
• Even defining the reward can be challenging in some tasks.
• Hand-crafted rewards can lead to uncontrolled behavior. Consider the Three Laws of Robotics:
  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.
• A machine optimizing only these rules might conclude that restraining individual human behavior and sacrificing some humans will ensure humanity's survival.
Imitation Learning
• The actor can interact with the environment (Env → s_1 → Actor → a_1 → Env → s_2 → ……), but the reward function is not available.
• Instead, we have demonstrations of an expert: each τ̂_k is a trajectory of the expert, giving {τ̂_1, τ̂_2, ⋯, τ̂_K}.
  • Self-driving: recordings of human drivers.
  • Robot: a human grabs the arm of the robot to demonstrate.
Isn’t it Supervised Learning?
• Yes, this is also known as Behavior Cloning.
• Self-driving cars as an example: from an expert trajectory τ̂ = {s_1, â_1, s_2, â_2, ⋯}, train the actor so that its action a_i at s_i matches the expert’s â_i (e.g., “forward”).
• Problem: the experts only sample limited observations; the agent never learns what to do in states the expert never visits.
• More problems: the agent will copy every behavior, even irrelevant actions.
  https://www.youtube.com/watch?v=j2FSB3bseek
• (A behavior-cloning sketch follows below.)
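A minimal behavior-cloning sketch: plain supervised learning on (s_i, â_i) pairs; the random "expert" data is a stand-in for real demonstrations.

```python
import torch
import torch.nn.functional as F

# Expert demonstrations: pairs (s_i, a_hat_i), here randomly generated stand-ins.
expert_states = torch.randn(256, 4)
expert_actions = torch.randint(0, 3, (256,))

actor = torch.nn.Linear(4, 3)
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Behavior cloning = ordinary supervised learning on the expert's actions.
for epoch in range(20):
    loss = F.cross_entropy(actor(expert_states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Caveats from the slide: the expert covers limited observations, and the
# clone copies every behavior, including irrelevant actions.
```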
Inverse Reinforcement Learning (IRL)
• Ordinary RL: Environment + Reward Function → (Reinforcement Learning) → Optimal Actor.
• Inverse RL: Environment + demonstrations of the expert {τ̂_1, τ̂_2, ⋯, τ̂_K} → (Inverse Reinforcement Learning) → Reward Function.
• Then use the learned reward function to find the optimal actor.
Inverse Reinforcement Learning
• Principle: the teacher is always the best.
• Basic idea:
  • Initialize an actor.
  • In each iteration:
    • The actor interacts with the environment to obtain some trajectories.
    • Define a reward function which makes the trajectories of the teacher better than those of the actor.
    • The actor learns to maximize the reward based on the new reward function.
  • Output the reward function and the actor learned from the reward function.
Framework of IRL
• The expert π̂ provides demonstrations {τ̂_1, τ̂_2, ⋯, τ̂_K}; the actor π generates trajectories {τ_1, τ_2, ⋯, τ_K}.
• Obtain a reward function R such that Σ_{n=1}^{K} R(τ̂_n) > Σ_{n=1}^{K} R(τ_n): larger reward for the expert’s τ̂, lower reward for the actor’s τ.
• Find an actor that obtains large reward based on R, by reinforcement learning.
• Analogy with GAN: the actor is the generator and the reward function is the discriminator. In GAN, the discriminator gives high scores to real examples and low scores to generated ones, and we find a generator whose output obtains a large score from the discriminator. (A sketch of the IRL loop follows below.)
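A minimal sketch of the IRL loop described above, assuming a linear reward on state features; collect_actor_trajectories and the omitted actor-update step are stubs, and a practical implementation would regularize the reward (e.g., GAN-style) rather than maximize the gap without bound.

```python
import torch

reward_fn = torch.nn.Linear(4, 1)                 # R(s): learned reward function
reward_opt = torch.optim.Adam(reward_fn.parameters(), lr=1e-3)

def trajectory_return(traj):                      # R(tau) = sum of R(s) over the trajectory
    return reward_fn(traj).sum()

def collect_actor_trajectories(k=8, length=20):   # stub: the actor interacts with the env
    return [torch.randn(length, 4) for _ in range(k)]

expert_trajs = [torch.randn(20, 4) for _ in range(8)]   # demonstrations tau_hat_1..K

for iteration in range(100):
    actor_trajs = collect_actor_trajectories()
    # Update R so that sum_n R(tau_hat_n) > sum_n R(tau_n)  ("the teacher is the best").
    reward_gap = (sum(trajectory_return(t) for t in expert_trajs)
                  - sum(trajectory_return(t) for t in actor_trajs))
    loss = -reward_gap                            # maximize the gap (unregularized sketch)
    reward_opt.zero_grad()
    loss.backward()
    reward_opt.step()
    # Then the actor learns (by RL) to maximize the new reward function -- omitted here.
```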
Robot
• How to teach robots? Demo: https://www.youtube.com/watch?v=DEGbtjTOIB0
• Chelsea Finn, Sergey Levine, Pieter Abbeel, “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization”, ICML 2016. http://rll.berkeley.edu/gcl/
To Learn More …
• Visual Reinforcement Learning with Imagined Goals, NIPS 2018. https://arxiv.org/abs/1807.04742
• Skew-Fit: State-Covering Self-Supervised Reinforcement Learning, ICML 2020. https://arxiv.org/abs/1903.03698
• Reinforcement Learning with Imagined Goals (RIG)


Concluding Remarks

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic

Sparse Reward

No Reward: Learning from Demonstration
