Reinforcement Learning
John Schulman
Goal of the Course
■ Understand how deep reinforcement learning can be applied in various domains
■ Learn about three classes of RL algorithm and how to implement them with neural networks
  ■ policy gradient methods
  ■ approximate dynamic programming
  ■ search + supervised learning
■ Understand the state of deep RL as a research topic
Outline of Lecture
Sequential Decision Making
[Diagram: the agent sends an action to the environment; the environment returns an observation and a reward.]
Goal: maximize the expected total reward, with respect to the policy: a function from observation history to next action
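As a concrete illustration of this interaction loop, here is a minimal Python sketch. The `Environment` class, its reward, and the simple threshold policy are toy stand-ins invented for this example, not anything from the lecture; the only structure taken from the slide is that each action yields an observation and a reward, and that the policy maps the observation history to the next action.

```python
import random

class Environment:
    """Toy stand-in for the environment: hides its internal state, emits observations and rewards."""
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += action                              # hidden state evolves with the action
        observation = self.state + random.gauss(0, 0.1)   # noisy sensor reading
        reward = -abs(self.state - 10.0)                  # e.g. reward for being near a target
        return observation, reward

def policy(history):
    """A policy: a function from the observation history to the next action."""
    last_obs = history[-1] if history else 0.0
    return 1.0 if last_obs < 10.0 else -1.0

env = Environment()
history, total_reward = [], 0.0
for t in range(20):
    action = policy(history)
    observation, reward = env.step(action)
    history.append(observation)
    total_reward += reward
print(total_reward)
```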
Applications
■ Robotics:
  ■ Actions: torque at joints
  ■ Observations: sensor readings
  ■ Rewards:
    ■ navigate to target location
    ■ complete manipulation task
Applications
■ Business operations
  ■ Inventory management: how much inventory and spare parts to purchase
  ■ Resource allocation: e.g., in a call center, who to service first
  ■ Routing problems: e.g., for management of a shipping fleet, which trucks/truckers to assign to which cargo
Applications
■ Finance
  ■ Investment decisions
  ■ Portfolio design
  ■ Option/asset pricing
Applications
■ E-commerce / media
  ■ What content to present to users (using click-through / visit time as reward)
  ■ What ads to present to users (avoiding ad fatigue)
Applications
■ Medicine
  ■ What tests to perform, what treatments to provide
Applications
■ Structured prediction: the algorithm has to make a sequence of predictions, which are fed back into the predictor (see the sketch below)
  ■ in NLP: text generation & translation, parsing [1,2]
  ■ multi-step pipelines in vision [3]
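A minimal sketch of that feedback structure, using a hypothetical `predict_next` function as a stand-in for a learned model (not from the lecture): each prediction becomes part of the input for the next one, so early mistakes propagate.

```python
def predict_next(prefix):
    """Toy stand-in for a learned model: predict the next token from the current prefix."""
    return "b" if prefix and prefix[-1] == "a" else "a"

prefix = []
for t in range(6):
    token = predict_next(prefix)   # the prediction depends on earlier predictions...
    prefix.append(token)           # ...and is fed back in as input for the next step
print(prefix)                      # ['a', 'b', 'a', 'b', 'a', 'b']
```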
RL vs. Other Learning Problems
■ Contextual bandits
  ■ given observation, output action, receive reward, with unknown and stochastic dependence on action and observation
  ■ e.g., advertising
RL vs. Other Learning Problems
■ Reinforcement learning
  ■ given observation, output action, receive reward, with unknown and stochastic dependence on action and observation
  ■ AND we perform a sequence of actions, and states depend on previous actions
RL vs. Other Learning Problems

[Diagram: a timeline of alternating observations (o), actions (a), and rewards (r) over many steps.]

Supervised learning ⊂ Contextual bandits ⊂ Reinforcement learning
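To make the distinction concrete, here is a toy Python sketch contrasting the two interaction patterns described above. Both the bandit's reward rule and the `ToyEnv` dynamics are invented for illustration and are not from the lecture; the point is only that in the bandit setting each context is drawn independently of past actions, while in RL the state carries the effect of every earlier action.

```python
import random

random.seed(0)

# Contextual bandit: each context is drawn independently of anything we did before.
def bandit_round():
    context = random.random()                          # observation, unaffected by past actions
    action = 1 if context > 0.5 else 0                 # some fixed policy
    return context if action == 1 else 1.0 - context   # reward

# Reinforcement learning: later observations and rewards depend on the whole action sequence.
class ToyEnv:
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += action                   # earlier actions shape what comes later
        reward = -abs(self.state - 5.0)
        return self.state, reward              # observation, reward

bandit_return = sum(bandit_round() for _ in range(10))

env, rl_return = ToyEnv(), 0.0
for t in range(10):
    observation, reward = env.step(1.0)
    rl_return += reward

print(bandit_return, rl_return)
```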
How is RL different from Supervised Learning, in Practice?
What is “Deep RL”?

[Diagram: the agent sends an action to the environment; the environment returns an observation and a reward.]
What is “Deep RL”?

[Diagram: the agent is replaced by a learned function fθ(history), which maps the interaction history to an action; the environment returns an observation and a reward.]
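In deep RL, fθ is typically a neural network. Below is a minimal numpy sketch of such a function, mapping a fixed-length window of recent observations to scores over a discrete set of actions. The layer sizes, window length, and architecture are arbitrary choices for illustration, not anything specified in the lecture.

```python
import numpy as np

np.random.seed(0)

# f_theta: a tiny two-layer network from a window of recent observations to action scores.
obs_window, n_actions, hidden = 4, 3, 16
theta = {
    "W1": 0.1 * np.random.randn(obs_window, hidden),
    "b1": np.zeros(hidden),
    "W2": 0.1 * np.random.randn(hidden, n_actions),
    "b2": np.zeros(n_actions),
}

def f_theta(history, theta):
    x = np.asarray(history[-obs_window:], dtype=float)   # most recent observations as input
    h = np.tanh(x @ theta["W1"] + theta["b1"])           # hidden layer
    return h @ theta["W2"] + theta["b2"]                 # action scores (e.g. softmax these)

history = [0.0, 0.1, -0.2, 0.3]
print(f_theta(history, theta))
```

Everything in `theta` is a plain array, so any loss built from the output is differentiable with respect to θ, which is the property the next slide relies on.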
Deep RL: Algorithm Design Criteria
■ Algorithm learns a parameterized function fθ
■ Algorithm does not depend on the parameterization, just that the loss is differentiable w.r.t. θ
■ Optimize using gradient-based algorithms, using gradient estimators ∇θ Loss (one such estimator is sketched below)
  ■ the actual objective E[total reward] is an expectation over random variables of an unknown system
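Because E[total reward] is an expectation over the randomness of a system we do not know, its gradient with respect to θ is estimated from samples rather than computed by ordinary backpropagation through the environment. Here is a minimal sketch of one such estimator, the score-function (likelihood-ratio) estimator, on a made-up one-step problem with a Gaussian policy; the reward function, step size, and batch size are all illustrative assumptions, not the course's setup.

```python
import numpy as np

np.random.seed(0)

# Made-up one-step problem: Gaussian policy with mean theta; reward peaks at action = 3.
def reward(action):
    return -(action - 3.0) ** 2

theta, sigma, lr = 0.0, 1.0, 0.05
for iteration in range(200):
    actions = theta + sigma * np.random.randn(64)       # sample actions from the policy
    rewards = reward(actions)
    # Score-function estimator: grad_theta E[R] ≈ mean( R * grad_theta log p_theta(a) )
    grad_log_p = (actions - theta) / sigma ** 2
    grad_estimate = np.mean(rewards * grad_log_p)
    theta += lr * grad_estimate                          # gradient ascent on estimated E[reward]
print(theta)                                             # should end up near 3.0
```

The same trick extends to multi-step trajectories, which is where the policy gradient methods listed in the course goals pick up.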
Deep RL Allows Unified Treatment of Problem Classes
Deep RL Frontier
■ Opportunity for theoretical / conceptual advances
  ■ How to explore the state space
  ■ How to have a policy that involves actions with different timescales, or has subgoals (hierarchy)
  ■ How to combine reinforcement learning with unsupervised learning
Deep RL Frontier
■ Opportunity for empirical/engineering advances
  ■ The entire computer vision field now uses deep neural networks for feature extraction, and is moving towards end-to-end optimization of the entire pipeline
[Figure: the CNN architecture from [KSH2012] Krizhevsky, Sutskever, & Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012.]
Where is RL Deployed Today
■ Operations research (see, e.g., [1])
  ■ Inventory / storage
    ■ Power grid: when to buy new transformers. Each costs $5M, but failure leads to much bigger costs
    ■ How much of each item to purchase and keep in stock
  ■ Resource allocation
    ■ Fleet management: assign cargos to truck drivers, locomotives to trains
    ■ Queueing problems: which customers to serve first in a call center
RL in Robotics
■ Most industrial robotic systems perform a fixed motion repeatedly, with simple or no perception.
■ Iterative Learning Control [1] is used in some robotic systems: using a model of the dynamics, it corrects errors in trajectories. But these systems still use simple or no perception.

[1] Bristow, Douglas, Marina Tharayil, and Andrew G. Alleyne. A survey of iterative learning control.
Classic Paradigm for Vision-Based Robotics

[Diagram: a processing pipeline from visual sensor data to motor commands.]
Future paradigm?

[Diagram: sensor data (images / lidar) → deep neural net → motor commands; the network illustration is again the CNN architecture from [KSH2012].]
Frontiers in Robotic Manipulation
Frontiers in Robotic Locomotion

Mordatch, Igor, Kendall Lowrey, and Emanuel Todorov. Ensemble-CIO: Full-Body Dynamic Motion Planning that Transfers to Physical Humanoids.
Frontiers in Locomotion

Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust Region Policy Optimization.
Where Else Could Deep RL Be Applied?
Outline for Next Lectures
■ Mon 8/31: MDPs
■ Weds 9/2: neural nets and backprop
■ Mon 9/9: policy gradients
Brushing up on RL: refs
■ MDP review
  ■ Sutton and Barto, ch. 3 and 4
  ■ See Andrew Ng’s thesis, ch. 1-2, for a nice concise review of MDPs
Reinforcement Learning Textbooks
■ Sutton & Barto, Reinforcement Learning: An Introduction
■ Bertsekas, Dynamic Programming and Optimal Control
  ■ Vol. 1, ch. 6: survey of some of the most useful practical approaches for control, e.g. MPC, rollout algorithms
  ■ Vol. 2 (Approximate Dynamic Programming, 3rd ed.): linear and otherwise tractable methods for solving for value functions, policy iteration algorithms