RL Intro-2
Joshua Achiam
March 3, 2018
RL can...
Play video games from raw pixels
Control robots in simulation and in the real world
Play Go and Dota 1v1 at superhuman levels
# Standard agent-environment interaction loop (Gym-style API).
# Assumes `env` is a Gym environment and `agent` exposes a get_action(obs) method.
obs = env.reset()
done = False
while not done:
    act = agent.get_action(obs)                    # agent picks an action from the current observation
    next_obs, reward, done, info = env.step(act)   # environment returns the resulting transition
    obs = next_obs
A trajectory is a sequence of states and actions in the environment: τ = (s_0, a_0, s_1, a_1, ...).
The reward at each timestep depends on the current state and action: r_t = R(s_t, a_t).
Example: if you want a robot to run forwards but use minimal energy,
R(s, a) = v − α‖a‖₂².
The return of a trajectory is a measure of cumulative reward along it. There are two
main ways to compute return:
Finite horizon undiscounted sum of rewards:
R(\tau) = \sum_{t=0}^{T} r_t
Infinite horizon discounted sum of rewards:
R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t
where γ ∈ (0, 1). This makes rewards less valuable if they are further in the future.
(Why would we ever want this? Think about cash: it’s valuable to have it sooner
rather than later!)
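As a quick sketch of both definitions (assuming the rewards from one trajectory are stored in a Python list; all names here are illustrative):

def undiscounted_return(rewards):
    # finite-horizon undiscounted return: plain sum of the rewards
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # discounted return: sum of gamma^t * r_t over the trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards))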
Policies
The goal in RL is to learn a policy which maximizes expected return. The optimal policy
π* is:
\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right],
where by τ ∼ π we mean that trajectories are generated by sampling actions from the
policy, a_t ∼ π(·|s_t), and next states from the environment.
Value functions tell you the expected return starting from a state or state-action pair.
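Concretely, the standard definitions (not spelled out above) are:

V^{\pi}(s)    = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]
Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]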
Core idea: push up the probabilities of good actions and push down the probabilities
of bad actions
Definition: the sum of rewards after time t is the reward-to-go at time t:
\hat{R}_t = \sum_{t'=t}^{T} r_{t'}
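A minimal sketch of computing the reward-to-go at every timestep of one trajectory (assuming `rewards` is a list of per-step rewards; the function name is illustrative):

def rewards_to_go(rewards):
    # rtg[t] = r_t + r_{t+1} + ... + r_T, built in a single backwards pass
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg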
# Simple policy gradient loss in TensorFlow (1.x graph API).
# Assumes `logits` (policy network output), `n_acts` (number of discrete actions),
# and `lr` (learning rate) are already defined.
import tensorflow as tf

# make loss
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32)   # rewards-to-go / advantages
act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)     # actions that were taken
action_one_hots = tf.one_hot(act_ph, n_acts)
log_probs = tf.reduce_sum(action_one_hots * tf.nn.log_softmax(logits), axis=1)
loss = -tf.reduce_mean(adv_ph * log_probs)                 # minimizing -E[adv * log pi(a|s)]

# make train op
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
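A hypothetical usage of the ops above; `obs_ph` is assumed to be the observation placeholder the logits were built from, and the `batch_*` arrays are data collected by running the current policy:

# one policy gradient update on a collected batch (all names illustrative)
sess.run(train_op, feed_dict={obs_ph: batch_obs,
                              act_ph: batch_acts,
                              adv_ph: batch_advs})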
Experience replay:
Data distribution changes over time: as your Q-function gets better and you exploit
it, you visit different (s, a, s′, r) transitions than you did earlier
Stabilize learning by keeping old transitions in a replay buffer, and taking minibatch
gradient descent steps on a mix of old and new transitions
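A minimal replay buffer sketch (illustrative, not from the slides): keep transitions in a fixed-size deque and sample uniform minibatches for updates.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random minibatch over old and new transitions
        return random.sample(self.buffer, batch_size)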
Target networks:
Minimizing Bellman error directly is unstable!
It’s like regression, but it’s not: the targets depend on the same parameters θ you are
optimizing, so they move with every update,
\min_{\theta} \sum_{(s,a,s',r) \in D} \left( Q_{\theta}(s,a) - y(s', r) \right)^2,
y(s', r) = r + \gamma \max_{a'} Q_{\theta}(s', a').
Target networks fix this by computing the targets with a separate, slowly-updated copy of
the parameters θ_targ, so y(s', r) = r + γ max_{a'} Q_{θ_targ}(s', a') stays fixed between
target updates.
if t mod t_update == 0 then
    Set θ_targ ← θ
end if
end for
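A small sketch of how the targets are computed with the frozen copy (assuming `q_theta` and `q_target` are callables mapping a state to a vector of Q-values; all names are illustrative):

import numpy as np

def td_targets(batch, q_target, gamma=0.99):
    # y = r + gamma * (1 - done) * max_a' Q_targ(s', a'), with Q_targ held fixed
    return [r + gamma * (0.0 if done else float(np.max(q_target(s_next))))
            for (s, a, r, s_next, done) in batch]

# every t_update steps, copy the online parameters into the target network,
# e.g. q_target.set_weights(q_theta.get_weights()) in a Keras-style API (illustrative)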
Recommended Reading: Deep RL Algorithms