
Deep Reinforcement Learning

Lecture 1: Introduction

John Schulman
Goal of the Course

■ Understand how deep reinforcement learning can be applied in various domains
■ Learn about three classes of RL algorithms and how to implement them with neural networks
■ policy gradient methods
■ approximate dynamic programming
■ search + supervised learning
■ Understand the state of deep RL as a research topic

2
Outline of Lecture

■ What is "deep reinforcement learning"?
■ Where is reinforcement learning deployed?
■ Where is reinforcement learning NOT deployed? (but could be…)

3
Sequential Decision Making

[diagram: Agent → Action → Environment → Observation, Reward → Agent]

Goal: maximize expected total reward with respect to the policy: a function from observation history to next action

4
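To make the loop above concrete, here is a minimal Python sketch (my own illustration, not from the slides); env and policy are hypothetical stand-ins for an environment and for a policy that maps the observation history to the next action:

def run_episode(env, policy, max_steps=1000):
    # Minimal sketch of the agent-environment loop. `env` and `policy`
    # are hypothetical objects, not any particular library's API.
    history = []                      # observation history seen so far
    obs = env.reset()                 # initial observation
    total_reward = 0.0
    for _ in range(max_steps):
        history.append(obs)
        action = policy(history)      # policy: observation history -> next action
        obs, reward, done = env.step(action)
        total_reward += reward        # RL objective: maximize E[total_reward]
        if done:
            break
    return total_reward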
Applications
■ Robotics:
■ Actions: torque at joints
■ Observations: sensor readings
■ Rewards:
■ navigate to target location

5
Applications
■ Robotics:
■ Actions: torque at joints
■ Observations: sensor readings
■ Rewards:
■ navigate to target location
■ complete manipulation task

6
Applications
■ Business operations
■ Inventory management: how much inventory and how many spare parts to purchase
■ Resource allocation: e.g. in a call center, who to service first
■ Routing problems: e.g. for management of a shipping fleet, which trucks/truckers to assign to which cargo

7
Applications
■ Finance
■ Investment decisions
■ Portfolio design
■ Option/asset pricing

8
Applications
■ E-commerce / media
■ What content to present to users (using click-through / visit time as reward)
■ What ads to present to users (avoiding ad fatigue)

9
Applications
■ Medicine
■ What tests to perform, what treatments to provide

10
Applications
■ Structured prediction: the algorithm has to make a sequence of predictions, which are fed back into the predictor
■ in NLP: text generation & translation, parsing [1,2]
■ multi-step pipelines in vision [3]

[figure from [3]: inference/decoding in structured prediction for image labeling; a predictor is iterated many times over a graph, combining local image features at each pixel/segment with neighbors' previous predictions]

[1] Daumé, Hal, et al. Search-based Structured Prediction (2009)
[2] Shi, T., et al. Learning Where to Sample in Structured Prediction (2015)
[3] Ross, S. Interactive Learning for Sequential Decisions and Predictions (2013)

11
RL vs Other Learning Problems
■ Supervised learning: classification / regression
■ given observation, predict label, maximize reward function R(observation, label)

[example images: object detection, speech recognition]

12
RL vs Other Learning Problems
■ Contextual Bandits
■ given observation, output action, receive reward, with unknown and stochastic dependence on action and observation
■ e.g., advertising

13
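As an illustration of the bandit setting (a sketch of my own, with hypothetical sample_context, policy, and sample_reward functions), note that each round stands alone: the chosen action does not affect future observations.

def bandit_round(sample_context, policy, sample_reward):
    # One contextual-bandit interaction: the context is drawn independently
    # of past actions, and the reward depends (stochastically, in an unknown
    # way) on the observation and the chosen action.
    obs = sample_context()
    action = policy(obs)                 # uses only the current observation
    reward = sample_reward(obs, action)
    return reward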
RL vs Other Learning Problems
■ Reinforcement learning
■ given observation, output action, receive reward, with unknown and stochastic dependence on action and observation
■ AND we perform a sequence of actions, and states depend on previous actions

14
RL vs. Other Learning Problems

[diagram: graphical models of the three settings, with observation (o), action (a), and reward (r) variables; deterministic, decision, and stochastic nodes distinguished]

Supervised learning ⊂ Contextual bandits ⊂ Reinforcement learning

15
How is RL Different from Supervised Learning, in Practice?

■ State distribution is affected by the policy
■ Need for exploration
■ Leads to instability in many algorithms
■ Can't use past data; online learning is not straightforward

16
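As one concrete illustration of the "need for exploration" above (epsilon-greedy is a standard technique, not something this slide prescribes), a policy can occasionally take a random action instead of the current best guess:

import numpy as np

def epsilon_greedy(action_values, epsilon=0.1, rng=np.random):
    # action_values: 1-D array of estimated values, one entry per action.
    if rng.rand() < epsilon:
        return rng.randint(len(action_values))    # explore: random action
    return int(np.argmax(action_values))          # exploit: current best action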
What is "Deep RL"?

[diagram: Agent → Action → Environment → Observation, Reward → Agent]

17
What is "Deep RL"?

[diagram: fθ(history) → Action → Environment → Observation, Reward → fθ(history)]

18
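To make fθ concrete, here is a minimal numpy sketch (my own, not from the slides) of a two-layer network mapping a fixed-size observation vector to a discrete action; in practice fθ would typically consume the full observation history, e.g. via recurrence:

import numpy as np

def init_theta(obs_dim, hidden_dim, n_actions, rng=np.random):
    # theta: the learnable parameters, here just a dict of numpy arrays.
    return {
        "W1": 0.01 * rng.randn(obs_dim, hidden_dim), "b1": np.zeros(hidden_dim),
        "W2": 0.01 * rng.randn(hidden_dim, n_actions), "b2": np.zeros(n_actions),
    }

def f_theta(theta, obs):
    # A tiny feedforward policy: observation vector -> greedy discrete action.
    h = np.tanh(obs @ theta["W1"] + theta["b1"])
    scores = h @ theta["W2"] + theta["b2"]
    return int(np.argmax(scores))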
Deep RL: Algorithm Design Criteria
■ Algorithm learns a parameterized function fθ
■ Algorithm does not depend on the parameterization, only that the loss is differentiable wrt θ
■ Optimize using gradient-based algorithms, using gradient estimators ∇θ Loss
■ computational complexity is linear in θ
■ sample complexity is (in a sense) independent of θ
19
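A small sketch of the design criterion above (hypothetical names, assuming only that some estimator of ∇θ Loss is available): the optimizer never looks inside fθ, it just applies gradient steps to θ, stored here as a dict of arrays as in the earlier sketch.

def sgd_step(theta, grad_estimator, stepsize=1e-3):
    # grad_estimator(theta) returns a dict of arrays, one (possibly noisy)
    # gradient estimate per parameter array in theta.
    grads = grad_estimator(theta)
    return {name: value - stepsize * grads[name] for name, value in theta.items()}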
Nonlinear/Nonconvex Learning
■ Supervised learning: just an unconstrained minimization of a differentiable objective
■ minimizeθ Loss(Xtrain, ytrain)
■ easy to get convergence to a local minimum
■ Reinforcement learning: no differentiable objective to optimize!
■ the actual objective E[total reward] is an expectation over random variables of an unknown system
■ Approximate Dynamic Programming methods, e.g. Q-learning: NOT gradient descent on a fixed objective, NOT guaranteed to converge

20
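For intuition (a preview of the policy gradient methods named in the course goals, not part of this slide), the expectation E[total reward] can still be differentiated without a model of the system, via the score-function identity for a stochastic policy πθ:

\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid o_t) \Big]

where τ is a trajectory and R(τ) its total reward; the unknown system dynamics drop out of ∇θ log pθ(τ) because they do not depend on θ.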
Deep RL Allows Unified Treatment of Problem Classes

■ No difference between Markov Decision Process (MDP) and Partially Observed Markov Decision Process (POMDP)
■ Not much difference between discrete and continuous state/action settings
■ No difference between finite-horizon, infinite-horizon discounted, and average-cost settings
■ we're always just ignoring long-term dependencies

21
Deep RL Frontier
■ Opportunity for theoretical / conceptual advances
■ How to explore state space
■ How to have a policy that involves actions with different timescales, or has subgoals (hierarchy)
■ How to combine reinforcement learning with unsupervised learning

22
Deep RL Frontier
■ Opportunity for empirical/engineering advances
■ Pre-2012, object recognition state-of-the-art used hand-engineered features + learned linear classifiers + hand-engineered grouping mechanisms
■ Now the entire computer vision field uses deep neural networks for feature extraction, and is moving towards end-to-end optimization of the entire pipeline

[figure from [KSH2012]: the two-GPU CNN architecture, with the network split across GPUs that communicate only at certain layers]

[KSH2012] Krizhevsky, Sutskever, & Hinton. ImageNet Classification with Deep Convolutional Neural Networks, 2012
23
Where is RL Deployed Today
■ Operations research (see, e.g., [1])
■ Inventory / storage
■ Power grid: when to buy new transformers. Each costs $5M, but failure leads to much bigger costs
■ How much of each item to purchase and keep in stock
■ Resource allocation
■ Fleet management: assign cargos to truck drivers, locomotives to trains
■ Queueing problems: which customers to serve first in a call center

24
RL in Robotics
■ Most industrial robotic systems perform a fixed motion repeatedly, with simple or no perception.
■ Iterative Learning Control [1] is used in some robotic systems: using a model of the dynamics, correct errors in trajectories. But these systems still use simple or no perception.

[1] Bristow, Douglas, Marina Tharayil, and Andrew G. Alleyne. A survey of iterative learning control
25
Classic Paradigm for Vision-Based Robotics

[diagram: Sensor data (images / lidar) → state estimation, integration → World model → Motion planning + control → Motor commands]

26
Future paradigm?

[diagram: Sensor data (images / lidar) → Deep neural net → Motor commands; illustrated with the [KSH2012] CNN architecture]

27
Frontiers in Robotic Manipulation

28
Frontiers in Robotic Locomotion

Mordatch et al., Interactive Control of Diverse Complex Characters with Neural Networks, under review (2015)
29
Frontiers in Robotic Locomotion

Mordatch, Igor, Kendall Lowrey, and Emanuel Todorov. Ensemble-CIO: Full-Body Dynamic Motion Planning that Transfers to Physical Humanoids.
30
Frontiers in Locomotion

Schulman, Moritz, Levine, Jordan, Abbeel (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation
31
Atari Games

Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust Region Policy Optimization

32
Where Else Could Deep RL Be Applied?

33
Outline for Next Lectures
■ Mon 8/31: MDPs
■ Weds 9/2: neural nets and backprop
■ Weds 9/9: policy gradients

34
Brushing up on RL: refs
■ MDP review
■ Sutton and Barto, ch. 3 and 4
■ See Andrew Ng's thesis, ch. 1-2, for a nice concise review of MDPs

35
Reinforcement Learning Textbooks
■ Sutton & Barto, Reinforcement Learning: An Introduction
■ informal, prefers online algorithms
■ Bertsekas, Dynamic Programming and Optimal Control
■ Vol. 1, ch. 6: survey of some of the most useful practical approaches for control, e.g. MPC, rollout algorithms
■ Vol. 2 (Approximate Dynamic Programming, 3rd ed.): linear and otherwise tractable methods for solving for value functions, policy iteration algorithms
■ Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
■ Exact methods for solving MDPs, including modified policy iteration
■ Szepesvári, Algorithms for Reinforcement Learning
■ Theory on online algorithms
■ Powell, Approximate Dynamic Programming
■ great on OR applications


36

Thanks
■ Next class is Monday, August 31st
■ We'll cover MDPs
■ First homework will be released
■ uses python+numpy+ipython

37
