
Introduction to Reinforcement Learning

Joshua Achiam (OpenAI, UC Berkeley)

March 3, 2018




What can RL do?

RL can...
Play video games from raw pixels
Control robots in simulation and in the real world
Play Go and Dota 1v1 at superhuman levels



What is RL?

An agent interacts with an environment.

obs = env.reset()
done = False
while not(done):
    act = agent.get_action(obs)
    next_obs, reward, done, info = env.step(act)
    obs = next_obs

The goal of the agent is to maximize cumulative reward (called return).


Reinforcement learning (RL) is a field of study for algorithms that do that.
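For concreteness, here is a runnable version of the interaction loop above. It is only a sketch: it assumes the Gym CartPole environment and a trivial random agent (neither appears on the slide), and it uses the classic Gym reset/step API shown in the loop.

import gym

class RandomAgent:
    # stand-in agent: ignores the observation and samples actions uniformly
    def __init__(self, action_space):
        self.action_space = action_space

    def get_action(self, obs):
        return self.action_space.sample()

env = gym.make('CartPole-v0')
agent = RandomAgent(env.action_space)

obs = env.reset()
done = False
ret = 0.0
while not done:
    act = agent.get_action(obs)
    next_obs, reward, done, info = env.step(act)
    ret += reward
    obs = next_obs
print('return:', ret)   # the cumulative reward the agent tries to maximize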



Key Concepts in RL

Before we can talk about algorithms, we have to talk about:


Trajectories
Return
Policies
The RL optimization problem
Value and Action-Value Functions
Note: For this talk, we will talk about all of these things in the context of deep RL,
where we use neural networks to represent them.



Trajectories

A trajectory τ is a sequence of states and actions in an environment:

τ = (s_0, a_0, s_1, a_1, ...).

The initial state s_0 is sampled from a start state distribution µ:

s_0 ∼ µ(·).

State transitions depend only on the most recent state and action. They could be deterministic:

s_{t+1} = f(s_t, a_t),

or stochastic:

s_{t+1} ∼ P(·|s_t, a_t).
A trajectory is sometimes also called an episode or rollout.
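To make the notation concrete, here is a toy sketch (entirely made up, not from the slides) of sampling a trajectory in a two-state MDP with start-state distribution µ and a stochastic transition kernel P:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.3])                   # start-state distribution mu over 2 states
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] = transition probability
              [[0.5, 0.5], [0.1, 0.9]]])

def sample_trajectory(policy, horizon=5):
    # tau = (s_0, a_0, s_1, a_1, ...), with s_0 ~ mu and s_{t+1} ~ P(.|s_t, a_t)
    s = rng.choice(2, p=mu)
    tau = []
    for _ in range(horizon):
        a = policy(s)
        tau += [s, a]
        s = rng.choice(2, p=P[s, a])
    return tau

print(sample_trajectory(policy=lambda s: int(rng.integers(2))))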




Reward and Return
The reward function of an environment measures how good state-action pairs are:

r_t = R(s_t, a_t).

Example: if you want a robot to run forwards but use minimal energy,

R(s, a) = v − α‖a‖_2^2,

where v is the forward velocity.

The return of a trajectory is a measure of cumulative reward along it. There are two main ways to compute return:

Finite-horizon undiscounted sum of rewards:

R(τ) = Σ_{t=0}^{T} r_t

Infinite-horizon discounted sum of rewards:

R(τ) = Σ_{t=0}^{∞} γ^t r_t,

where γ ∈ (0, 1). This makes rewards less valuable if they are further in the future. (Why would we ever want this? Think about cash: it’s valuable to have it sooner rather than later!)
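A small illustration (not from the slides) of the two returns for a short list of rewards; the function names are hypothetical:

import numpy as np

def undiscounted_return(rewards):
    # finite-horizon undiscounted return: R(tau) = sum_t r_t
    return float(np.sum(rewards))

def discounted_return(rewards, gamma):
    # discounted return, truncated at the trajectory length: R(tau) = sum_t gamma^t r_t
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

rewards = [1.0, 1.0, 1.0]
print(undiscounted_return(rewards))      # 3.0
print(discounted_return(rewards, 0.9))   # 1 + 0.9 + 0.81 = 2.71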

Policies

A policy π is a rule for selecting actions. It can be either

stochastic, which means that it gives a probability distribution over actions, and actions are selected randomly based on that distribution (a_t ∼ π(·|s_t)),
or deterministic, which means that π directly maps to an action (a_t = π(s_t)).
Examples of policies:
Stochastic policy over discrete actions:
obs = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
net = mlp(obs, hidden_dims=(64,64), activation=tf.tanh)
logits = tf.layers.dense(net, units=num_actions, activation=None)
actions = tf.squeeze(tf.multinomial(logits=logits,num_samples=1), axis=1)

Deterministic policy for a vector-valued continuous action:


obs = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
net = mlp(obs, hidden_dims=(64,64), activation=tf.tanh)
actions = tf.layers.dense(net, units=act_dim, activation=None)



The Reinforcement Learning Problem

The goal in RL is to learn a policy which maximizes expected return. The optimal policy π* is:

π* = arg max_π E_{τ∼π}[ R(τ) ],

where by τ ∼ π, we mean

s_0 ∼ µ(·),   a_t ∼ π(·|s_t),   s_{t+1} ∼ P(·|s_t, a_t).

There are two main approaches for solving this problem:

policy optimization
and Q-learning.



Value Functions and Action-Value Functions

Value functions tell you the expected return after a state or state-action pair.

V^π(s)   = E_{τ∼π}[ R(τ) | s_0 = s ]                  (start in s, then sample actions from π)
Q^π(s,a) = E_{τ∼π}[ R(τ) | s_0 = s, a_0 = a ]         (start in s, take action a, then sample from π)
V*(s)    = max_π E_{τ∼π}[ R(τ) | s_0 = s ]            (start in s, then act optimally)
Q*(s,a)  = max_π E_{τ∼π}[ R(τ) | s_0 = s, a_0 = a ]   (start in s, take action a, then act optimally)

The value functions satisfy recursive Bellman equations:

V^π(s)   = E_{a∼π, s'∼P}[ r(s,a) + γ V^π(s') ]
Q^π(s,a) = E_{s'∼P}[ r(s,a) + γ E_{a'∼π}[ Q^π(s',a') ] ]
V*(s)    = max_a E_{s'∼P}[ r(s,a) + γ V*(s') ]
Q*(s,a)  = E_{s'∼P}[ r(s,a) + γ max_{a'} Q*(s',a') ]
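To make the Bellman equation for V^π concrete, here is a toy sketch (all numbers made up, not from the slides) that iterates the backup in a two-state, two-action tabular MDP with known transition probabilities and a fixed stochastic policy:

import numpy as np

# P[s, a, s'] = transition probability, R[s, a] = reward, pi[s, a] = policy probability
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])
gamma = 0.9

# V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)        # Q[s, a] = R(s,a) + gamma * E_{s' ~ P}[ V(s') ]
    V = (pi * Q).sum(axis=1)       # expectation over a ~ pi
print(V)                           # fixed point approximates V^pi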



Q∗

The optimal Q-function, Q*, is especially important because it gives us a policy. In any state s, the optimal action is

a* = arg max_a Q*(s, a).

We can measure how good a Q*-approximator, Q_θ, is by measuring its mean-squared Bellman error:

ℓ(θ) = (1/|D|) Σ_{(s,a,s',r)∈D} ( Q_θ(s,a) − (r + γ max_{a'} Q_θ(s',a')) )²

This (roughly) says how well it satisfies the Bellman equation

Q*(s,a) = E_{s'∼P}[ r(s,a) + γ max_{a'} Q*(s',a') ].
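A minimal sketch of computing this mean-squared Bellman error for a batch of transitions, assuming a tabular Q stored as a NumPy array; the names and numbers are illustrative, not from the slides:

import numpy as np

def mean_squared_bellman_error(Q, batch, gamma=0.99):
    # Q: array of shape [num_states, num_actions]
    # batch: list of (s, a, s_next, r) transitions with integer states and actions
    errors = []
    for s, a, s_next, r in batch:
        target = r + gamma * Q[s_next].max()      # r + gamma * max_a' Q(s', a')
        errors.append((Q[s, a] - target) ** 2)
    return float(np.mean(errors))

Q = np.zeros((3, 2))
batch = [(0, 1, 2, 1.0), (2, 0, 1, 0.0)]
print(mean_squared_bellman_error(Q, batch))       # 0.5 for this made-up batch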




Deep RL Algorithms
There are many different kinds of RL algorithms! A non-exhaustive taxonomy splits them broadly into model-free methods (such as policy optimization and Q-learning) and model-based methods.

We will talk about two of them: Policy Gradient and DQN.


A Few Notes

Using Model-Free RL Algorithms:

Algorithm              Discrete actions    Continuous actions
Policy optimization    Yes                 Yes
DQN / C51 / QR-DQN     Yes                 No
DDPG                   No                  Yes

Using Model-Based RL Algorithms:
Learning the model means learning to generate the next state and/or reward:

ŝ_{t+1}, r̂_t = f̂_φ(s_t, a_t)

Some algorithms may only work with an exact model of the environment
AlphaZero uses the rules of the game to build its search tree
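As a sketch of what learning the model can look like, here is a hypothetical TF1-style network that regresses the next state and reward from (s, a). It is not part of the deck; obs_dim and act_dim are illustrative, and the small mlp helper stands in for the one the other slides assume.

import tensorflow as tf

obs_dim, act_dim = 8, 2

def mlp(x, hidden_dims=(64, 64), activation=tf.tanh):
    # small fully-connected network, standing in for the mlp helper used on other slides
    for h in hidden_dims:
        x = tf.layers.dense(x, units=h, activation=activation)
    return x

obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
act_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32)
net = mlp(tf.concat([obs_ph, act_ph], axis=-1))
next_obs_pred = tf.layers.dense(net, units=obs_dim, activation=None)             # s_hat_{t+1}
rew_pred = tf.squeeze(tf.layers.dense(net, units=1, activation=None), axis=1)    # r_hat_t

# train by regression on observed transitions (s, a, s', r)
next_obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
rew_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
model_loss = tf.reduce_mean(tf.reduce_sum((next_obs_pred - next_obs_ph)**2, axis=1)
                            + (rew_pred - rew_ph)**2)
train_model = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(model_loss)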



Policy Gradients

An algorithm for training stochastic policies:


Run current policy in the environment to collect rollouts
Take stochastic gradient ascent on policy performance using the policy gradient:
" T #
X
g = ∇θ τ ∼π
E rt
θ
t=0
T T
" !#
X X
= τ ∼π
E ∇θ log πθ (at |st ) rt 0
θ
t=0 t 0 =t
T T
!
1 X X X
≈ ∇θ log πθ (at |st ) rt 0
|D| τ ∈D t=0
t 0 =t

Core idea: push up the probabilities of good actions and push down the probabilities
of bad actions
Definition: sum of rewards after time t is the reward-to-go at time t:
T
X
R̂t = rt 0
t 0 =t
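The example implementation a couple of slides later calls a discount_cumsum helper to compute these reward-to-go values. It is not defined in the deck; a minimal sketch might look like this (γ = 1 recovers the undiscounted reward-to-go above):

import numpy as np

def discount_cumsum(rewards, gamma):
    # discounted reward-to-go: out[t] = sum_{t' >= t} gamma^(t'-t) * rewards[t']
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discount_cumsum([1.0, 1.0, 1.0], 1.0))   # [3. 2. 1.]
print(discount_cumsum([1.0, 1.0, 1.0], 0.5))   # [1.75 1.5  1.  ]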



Example Implementation

Make the model, loss function, and optimizer:


# make model
with tf.variable_scope('model'):
    obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
    net = mlp(obs_ph, hidden_sizes=[hidden_dim]*n_layers)
    logits = tf.layers.dense(net, units=n_acts, activation=None)
    actions = tf.squeeze(tf.multinomial(logits=logits, num_samples=1), axis=1)

# make loss
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)
action_one_hots = tf.one_hot(act_ph, n_acts)
log_probs = tf.reduce_sum(action_one_hots * tf.nn.log_softmax(logits), axis=1)
loss = -tf.reduce_mean(adv_ph * log_probs)

# make train op
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())



Example Implementation (Continued)

One iteration of training:


# train model for one iteration
batch_obs, batch_acts, batch_rtgs, batch_rets, batch_lens = [], [], [], [], []
obs, rew, done, ep_rews = env.reset(), 0, False, []
while True:
    batch_obs.append(obs.copy())
    act = sess.run(actions, {obs_ph: obs.reshape(1,-1)})[0]
    obs, rew, done, _ = env.step(act)
    batch_acts.append(act)
    ep_rews.append(rew)
    if done:
        batch_rets.append(sum(ep_rews))
        batch_lens.append(len(ep_rews))
        batch_rtgs += list(discount_cumsum(ep_rews, gamma))
        obs, rew, done, ep_rews = env.reset(), 0, False, []
        if len(batch_obs) > batch_size:
            break

# normalize advantages trick:
batch_advs = np.array(batch_rtgs)
batch_advs = (batch_advs - np.mean(batch_advs)) / (np.std(batch_advs) + 1e-8)
batch_loss, _ = sess.run([loss, train_op], feed_dict={obs_ph: np.array(batch_obs),
                                                      act_ph: np.array(batch_acts),
                                                      adv_ph: batch_advs})



Q-Learning

Core idea: learn Q* and use it to get the optimal actions

Way to do it:
Collect experience in the environment using a policy which trades off between acting randomly and acting according to the current Q_θ
Interleave data collection with updates to Q_θ to minimize the Bellman error:

min_θ Σ_{(s,a,s',r)∈D} ( Q_θ(s,a) − (r + γ max_{a'} Q_θ(s',a')) )²

...sort of! This actually won’t work!




Getting Q-Learning to Work (DQN)

Experience replay:
Data distribution changes over time: as your Q-function gets better and you exploit this, you visit different (s, a, s', r) transitions than you did earlier
Stabilize learning by keeping old transitions in a replay buffer, and taking minibatch gradient descent on a mix of old and new transitions

Target networks:
Minimizing Bellman error directly is unstable!
It’s like regression, but it’s not:

min_θ Σ_{(s,a,s',r)∈D} ( Q_θ(s,a) − y(s',r) )²,

where the target y(s',r) is

y(s',r) = r + γ max_{a'} Q_θ(s',a').

Targets depend on the parameters θ, so an update to Q changes the target!

Stabilize it by holding the target fixed for a while: keep a separate target network, Q_{θ_targ}, and every k steps update θ_targ ← θ.
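A hypothetical TF1-style sketch of this regression loss with a separate target network. It is not from the deck: the network sizes, placeholder names, and the terminal-state mask are illustrative (the mask matches the case split in the pseudocode on the next slide).

import tensorflow as tf

obs_dim, n_acts, gamma = 4, 2, 0.99

def q_net(obs, scope):
    # small Q-network: maps a state to a vector of action values
    with tf.variable_scope(scope):
        net = tf.layers.dense(obs, units=64, activation=tf.tanh)
        net = tf.layers.dense(net, units=64, activation=tf.tanh)
        return tf.layers.dense(net, units=n_acts, activation=None)

obs_ph      = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
next_obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
act_ph      = tf.placeholder(shape=(None,), dtype=tf.int32)
rew_ph      = tf.placeholder(shape=(None,), dtype=tf.float32)
done_ph     = tf.placeholder(shape=(None,), dtype=tf.float32)

q_vals      = q_net(obs_ph, 'main')        # Q_theta(s, .)
q_targ_vals = q_net(next_obs_ph, 'targ')   # Q_theta_targ(s', .)

# y(s', r) = r + gamma * max_a' Q_targ(s', a'), with no bootstrap on terminal states
y = rew_ph + gamma * (1.0 - done_ph) * tf.reduce_max(q_targ_vals, axis=1)
q_sa = tf.reduce_sum(tf.one_hot(act_ph, n_acts) * q_vals, axis=1)
loss = tf.reduce_mean((q_sa - tf.stop_gradient(y))**2)

main_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='main')
targ_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='targ')
update_target = tf.group(*[tf.assign(vt, vm) for vt, vm in zip(targ_vars, main_vars)])

train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss, var_list=main_vars)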



DQN Pseudocode

Algorithm 1 Deep Q-Learning


Randomly generate Q-function parameters θ
Set target Q-network parameters θ_targ ← θ
Make empty replay buffer D
Receive observation s_0 from environment
for t = 0, 1, 2, ... do
    With probability ε, select a random action a_t; otherwise select a_t = arg max_a Q_θ(s_t, a)
    Step environment to get s_{t+1}, r_t, and end-of-episode signal d_t
    Linearly decay ε until it reaches its final value ε_f
    Store (s_t, a_t, r_t, s_{t+1}, d_t) → D
    Sample mini-batch of transitions B = {(s, a, r, s', d)_i} from D
    For each transition in B, compute the target

        y = r                                      if the transition is terminal (d = True)
        y = r + γ max_{a'} Q_{θ_targ}(s', a')      otherwise

    Update Q by gradient descent on the regression loss:

        θ ← θ − α ∇_θ Σ_{(s,a,y)∈B} (Q_θ(s,a) − y)²

    if t mod t_update == 0 then
        Set θ_targ ← θ
    end if
end for
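For completeness, a minimal sketch (not from the deck) of the replay buffer D and the ε-greedy action selection used above; the class and function names are made up:

import random
import numpy as np
from collections import deque

class ReplayBuffer:
    # fixed-capacity buffer of (s, a, r, s_next, done) transitions
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def epsilon_greedy(q_values, epsilon):
    # q_values: array of Q(s, a) for every action in the current state
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
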
Recommended Reading: Deep RL Algorithms

A2C / A3C: Mnih et al, 2016 (https://arxiv.org/abs/1602.01783)
PPO: Schulman et al, 2017 (https://arxiv.org/abs/1707.06347)
TRPO: Schulman et al, 2015 (https://arxiv.org/abs/1502.05477)
DQN: Mnih et al, 2013 (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
C51: Bellemare et al, 2017 (https://arxiv.org/abs/1707.06887)
QR-DQN: Dabney et al, 2017 (https://arxiv.org/abs/1710.10044)
DDPG: Lillicrap et al, 2015 (https://arxiv.org/abs/1509.02971)
SVG: Heess et al, 2015 (https://arxiv.org/abs/1510.09142)
I2A: Weber et al, 2017 (https://arxiv.org/abs/1707.06203)
MBMF: Nagabandi et al, 2017 (https://sites.google.com/view/mbmf)
AlphaZero: Silver et al, 2017 (https://arxiv.org/abs/1712.01815)

