RL Intro-2
Joshua Achiam
March 3, 2018
RL can...
Play video games from raw pixels
Control robots in simulation and in the real world
Play Go and Dota 1v1 at superhuman levels
# Standard agent-environment interaction loop (Gym-style API).
# Assumes `env` is a Gym environment and `agent` exposes a get_action(obs) method.
obs = env.reset()
done = False
while not done:
    act = agent.get_action(obs)                    # agent picks an action from the current observation
    next_obs, reward, done, info = env.step(act)   # environment returns the resulting transition
    obs = next_obs
A trajectory is a sequence of states and actions in the environment: τ = (s_0, a_0, s_1, a_1, ...).
The reward at each timestep depends on the current state and action: r_t = R(s_t, a_t).
Example: if you want a robot to run forwards but use minimal energy,
R(s, a) = v − α‖a‖₂².
The return of a trajectory is a measure of cumulative reward along it. There are two
main ways to compute return:
Finite horizon undiscounted sum of rewards:
R(\tau) = \sum_{t=0}^{T} r_t
Infinite horizon discounted sum of rewards:
R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t
where γ ∈ (0, 1). This makes rewards less valuable if they are further in the future.
(Why would we ever want this? Think about cash: it’s valuable to have it sooner
rather than later!)
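As a quick sketch of both definitions (assuming the rewards from one trajectory are stored in a Python list; all names here are illustrative):

def undiscounted_return(rewards):
    # finite-horizon undiscounted return: plain sum of the rewards
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # discounted return: sum of gamma^t * r_t over the trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards))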
Policies
The goal in RL is to learn a policy which maximizes expected return. The optimal policy
π* is:
\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right],
where by τ ∼ π we mean that trajectories are generated by sampling actions from the
policy, a_t ∼ π(·|s_t), and next states from the environment.
Value functions tell you the expected return starting from a state or state-action pair.
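Concretely, the standard definitions (not spelled out above) are:

V^{\pi}(s)    = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]
Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]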
Core idea: push up the probabilities of good actions and push down the probabilities
of bad actions
Definition: the sum of rewards after time t is the reward-to-go at time t:
\hat{R}_t = \sum_{t'=t}^{T} r_{t'}
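A minimal sketch of computing the reward-to-go at every timestep of one trajectory (assuming `rewards` is a list of per-step rewards; the function name is illustrative):

def rewards_to_go(rewards):
    # rtg[t] = r_t + r_{t+1} + ... + r_T, built in a single backwards pass
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg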
# Simple policy gradient loss in TensorFlow (1.x graph API).
# Assumes `logits` (policy network output), `n_acts` (number of discrete actions),
# and `lr` (learning rate) are already defined.
import tensorflow as tf

# make loss
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32)   # rewards-to-go / advantages
act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)     # actions that were taken
action_one_hots = tf.one_hot(act_ph, n_acts)
log_probs = tf.reduce_sum(action_one_hots * tf.nn.log_softmax(logits), axis=1)
loss = -tf.reduce_mean(adv_ph * log_probs)                 # minimizing -E[adv * log pi(a|s)]

# make train op
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
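A hypothetical usage of the ops above; `obs_ph` is assumed to be the observation placeholder the logits were built from, and the `batch_*` arrays are data collected by running the current policy:

# one policy gradient update on a collected batch (all names illustrative)
sess.run(train_op, feed_dict={obs_ph: batch_obs,
                              act_ph: batch_acts,
                              adv_ph: batch_advs})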
Experience replay:
Data distribution changes over time: as your Q-function gets better and you exploit
it, you visit different (s, a, s′, r) transitions than you did earlier
Stabilize learning by keeping old transitions in a replay buffer, and taking minibatch
gradient descent steps on a mix of old and new transitions
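A minimal replay buffer sketch (illustrative, not from the slides): keep transitions in a fixed-size deque and sample uniform minibatches for updates.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random minibatch over old and new transitions
        return random.sample(self.buffer, batch_size)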
Target networks:
Minimizing Bellman error directly is unstable!
It’s like regression, but it’s not: the targets depend on the same parameters θ you are
optimizing, so they move with every update,
\min_{\theta} \sum_{(s,a,s',r) \in D} \left( Q_{\theta}(s,a) - y(s', r) \right)^2,
y(s', r) = r + \gamma \max_{a'} Q_{\theta}(s', a').
Target networks fix this by computing the targets with a separate, slowly-updated copy of
the parameters θ_targ, so y(s', r) = r + γ max_{a'} Q_{θ_targ}(s', a') stays fixed between
target updates.
if t mod t_update == 0 then
    Set θ_targ ← θ
end if
end for
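A small sketch of how the targets are computed with the frozen copy (assuming `q_theta` and `q_target` are callables mapping a state to a vector of Q-values; all names are illustrative):

import numpy as np

def td_targets(batch, q_target, gamma=0.99):
    # y = r + gamma * (1 - done) * max_a' Q_targ(s', a'), with Q_targ held fixed
    return [r + gamma * (0.0 if done else float(np.max(q_target(s_next))))
            for (s, a, r, s_next, done) in batch]

# every t_update steps, copy the online parameters into the target network,
# e.g. q_target.set_weights(q_theta.get_weights()) in a Keras-style API (illustrative)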
Recommended Reading: Deep RL Algorithms