
Practical RL

Episode t+1

Definitely not reinforcement learning

Today’s menu

Imitation learning: a brief intro

Inverse reinforcement learning: a bit less brief
Problem formulation

You have a decision process, but no reward function.

Instead, there are example trajectories produced by an “expert” agent.

You want to learn an optimal policy by imitation.
Why bother

“natural” reward           No natural reward

toy tasks, videogames      real-world problems
Robot gait @ race track    Robot gait @ public space
Online advertising         Recommendation systems
Image captioning           Conversation systems (e.g. a dialogue system)
...
Problem formulation

We have a dataset of sessions D = {τ1, τ2, τ3}
collected under an expert policy π*(a|s), where
τ = <s, a, s’, a’, …, s_T>

You want to learn a policy π(a|s) that mimics π*(a|s).

Q: how can we solve this?
Behavior cloning

We have a dataset of sessions D = {τ1, τ2, τ3}
collected under an expert policy π*(a|s), where
τ = <s, a, s’, a’, …, s_T>

Supervised learning way:

\pi_\theta(a \mid s) = \arg\max_{\pi_\theta} \sum_{\tau \in D} \sum_{s, a \in \tau} \log \pi_\theta(a \mid s)
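To make the objective concrete, here is a minimal behavior-cloning sketch, assuming a PyTorch policy network over discrete actions; `policy_net`, `expert_states`, and `expert_actions` are illustrative names for the aggregated (s, a) pairs from D, not something defined on the slides.

```python
# Minimal behavior-cloning step (PyTorch, discrete actions); policy_net,
# expert_states and expert_actions are illustrative names, not from the slides.
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy_net, optimizer, expert_states, expert_actions):
    """One gradient step on -sum log pi_theta(a|s) over expert (s, a) pairs."""
    logits = policy_net(expert_states)              # shape: (batch, n_actions)
    # Cross-entropy is exactly the negative log-likelihood of the expert actions
    loss = F.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```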
Behavior cloning

[Figure: state-visitation frequency of the expert data over the state space, from s_0 to s_T.
The expert's "perfect trajectory" stays in the high-frequency region; just one error sends the
learned trajectory into a region the expert data never covers ("you never get here").]

Once the agent leaves the expert's state distribution, it reaches states it was never
trained to act in, so its errors compound.
DAGGER

Idea: let's ask the expert what to do in the new area we've reached,
and add those (state, expert action) pairs to the training set.

[Figure: the same state-frequency plot; expert labels are collected for the states the
learned policy visits outside the expert's distribution.]
DAGGER

Initial dataset D = {τ1, τ2, τ3},
initial policy πθ(a|s)

Forever:

  Update policy
  \pi_\theta(a \mid s) := \arg\max_{\pi_\theta} \sum_{\tau \in D} \sum_{s, a \in \tau} \log \pi_\theta(a \mid s)

  Sample a session under the current πθ(a|s):
  τ = <s, a, s’, a’, …, s_T>

  Update dataset
  D := D \cup \{ (s, \text{get\_optimal\_action}(s)) : \forall s \in \tau \}
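A sketch of this loop, assuming three hypothetical helpers: `fit_policy` (a supervised fit on (state, action) pairs, e.g. the behavior-cloning step above), `rollout` (collects the states visited under the current policy), and `get_optimal_action` (the expert query from the slide).

```python
# DAGGER loop sketch; fit_policy, rollout and get_optimal_action are assumed
# helpers (supervised fit, environment rollout, expert query respectively).
def dagger(env, expert_trajectories, get_optimal_action, n_iterations=10):
    # Start from the expert demonstrations as (state, expert_action) pairs
    dataset = [(s, a) for tau in expert_trajectories for (s, a) in tau]
    policy = fit_policy(dataset)                    # behavior cloning on D
    for _ in range(n_iterations):
        visited_states = rollout(env, policy)       # states reached by pi_theta
        # Relabel visited states with expert actions and aggregate the dataset
        dataset += [(s, get_optimal_action(s)) for s in visited_states]
        policy = fit_policy(dataset)                # refit on the grown dataset
    return policy
```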
DAGGER

Benefits
+ fixes the state-space error

Drawbacks
- you need expert feedback on new sessions
- degrades if expert actions are noisy

Idea: first infer which rewards the expert is optimizing,
then train "regular" RL to maximize them.
Inverse Reinforcement Learning

          “regular” RL        inverse RL

given:    Environment,        Environment,
          Reward function     Optimal policy

find:     Optimal policy      Reward function
Is it even possible?

[Figure: a gridworld with a start tile and several colored tiles; the expert's path is shown.]

The agent is rewarded the first time it enters a tile, and it can exit the session at will.
Also R( ■ ) = 0 for the plain tiles.

Q: what is the “cost of living” for one step? (+1 / -1)
A: the agent gets -1 for each turn (cost of living).

From the expert's behavior we can bound some of the tile rewards:
R( ■ ) >> 0,   0 < R( ■ ) <= 2,   R( ■ ) << 0
while for other tiles R( ■ ) remains undetermined.

So, is it even possible? Yes, to some extent.
Maximum Entropy Inverse RL
D. Ziebart et al.

We have a dataset of sessions D = {τ1, τ2, τ3}
collected under an expert policy π*(a|s), where
τ = <s, a, s’, a’, …, s_T>

Assumption: assume that \pi^*(\tau) \sim e^{R(\tau)}, where

R(\tau) = \sum_{s_\tau, a_\tau} r(s_\tau, a_\tau)

(alt: use gamma)

Sketch: learn r(s_\tau, a_\tau) to maximize the likelihood of D
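As a sanity check on the assumption, here is a toy sketch that turns per-step rewards into the implied trajectory distribution for a small, enumerable set of trajectories (each a list of (s, a) pairs); `trajectories` and `reward_fn` are hypothetical names.

```python
# Toy illustration of the MaxEnt assumption pi*(tau) ~ exp(R(tau)) on a small,
# enumerable set of trajectories; trajectories and reward_fn are illustrative.
import numpy as np

def trajectory_distribution(trajectories, reward_fn):
    """Softmax over total trajectory rewards R(tau) = sum_t r(s_t, a_t)."""
    returns = np.array([sum(reward_fn(s, a) for s, a in tau) for tau in trajectories])
    exp_r = np.exp(returns - returns.max())   # subtract max for numerical stability
    return exp_r / exp_r.sum()                # pi*(tau) proportional to e^{R(tau)}
```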
Maximum Entropy Inverse RL

How it works: \pi^*(\tau) \sim e^{R(\tau)}

\log P(D \mid \theta) = \sum_{\tau \in D} \log \pi^*(\tau; \theta)
                      = \sum_{\tau \in D} \log \frac{e^{R_\theta(\tau)}}{\sum_{\tau'} e^{R_\theta(\tau')}}

Do you see the problem? The denominator is a sum over all trajectories.
Maximum Entropy Inverse RL

Let's simplify:

llh = \sum_{\tau \in D} \log \frac{e^{R_\theta(\tau)}}{\sum_{\tau'} e^{R_\theta(\tau')}}
    = \sum_{\tau \in D} \left[ R_\theta(\tau) - \log \sum_{\tau'} e^{R_\theta(\tau')} \right]
    = \sum_{\tau \in D} R_\theta(\tau) - N \cdot \log \sum_{\tau'} e^{R_\theta(\tau')}

\nabla_\theta llh = \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \, \nabla_\theta \log \sum_{\tau'} e^{R_\theta(\tau')}
                  = \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \, \frac{1}{\sum_{\tilde\tau} e^{R_\theta(\tilde\tau)}} \sum_{\tau'} e^{R_\theta(\tau')} \cdot \nabla_\theta R_\theta(\tau')

Reminds you of something? The weights e^{R_\theta(\tau')} / \sum_{\tilde\tau} e^{R_\theta(\tilde\tau)} are a softmax over
trajectories, i.e. exactly \pi^*(\tau'; R_\theta). Hence

\nabla_\theta llh = N \cdot \left[ \mathbb{E}_{\tau \in D} \nabla_\theta R_\theta(\tau) - \mathbb{E}_{\tau' \sim \pi^*(\tau'; R_\theta)} \nabla_\theta R_\theta(\tau') \right]

where \pi^*(\tau'; R_\theta) \sim e^{R_\theta(\tau)}
Tabular, model-based

Replace the sum over trajectories...

\nabla_\theta llh = \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \cdot \mathbb{E}_{\tau' \sim \pi^*(\tau'; \theta)} \nabla_\theta R_\theta(\tau')

… with a sum over states:

= \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \cdot \mathbb{E}_{s \sim d_\theta(s)} \, \mathbb{E}_{a \sim \pi^*_\theta(a \mid s)} \nabla_\theta r_\theta(s, a)

where d_\theta(s) is the state visitation frequency (stationary distribution).
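In the tabular case this can be computed exactly. Below is a sketch under the assumption of a linear reward r_θ(s) = θ[s] with one-hot state features and known transition probabilities; `P`, `policy` (the current soft-optimal policy), `start_dist` and `expert_state_counts` are hypothetical names.

```python
# Tabular MaxEnt IRL gradient sketch, assuming a linear reward r_theta(s) = theta[s]
# (one parameter per state) and known dynamics; all names here are illustrative.
import numpy as np

def state_visitation_frequency(P, policy, start_dist, horizon):
    """Expected per-trajectory state visitations d_theta(s) under `policy`.
    P: (S, A, S) transition probabilities, policy: (S, A), start_dist: (S,)."""
    n_states, n_actions, _ = P.shape
    d_t = start_dist.copy()
    total = start_dist.copy()
    for _ in range(horizon - 1):
        d_next = np.zeros(n_states)
        for a in range(n_actions):
            # probability mass flowing through (s, a) into each next state s'
            d_next += (d_t * policy[:, a]) @ P[:, a, :]
        total += d_next
        d_t = d_next
    return total

def maxent_gradient(expert_state_counts, P, policy, start_dist, horizon, n_trajectories):
    """grad_theta llh = sum over expert visits of grad r  -  N * E_{d_theta} grad r.
    With one-hot state features this is expert state counts minus N times the
    model's expected per-trajectory visitation frequency."""
    model_svf = state_visitation_frequency(P, policy, start_dist, horizon)
    return expert_state_counts - n_trajectories * model_svf
```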
Model-free case

\nabla_\theta llh = N \cdot \left[ \mathbb{E}_{\tau \in D} \nabla_\theta R_\theta(\tau) - \mathbb{E}_{\tau' \sim \pi^*(\tau'; R_\theta)} \nabla_\theta R_\theta(\tau') \right]

The first expectation can be sampled from data; the second is hard to even sample from.

To sample from \pi^*(\tau'; R_\theta) \sim e^{R_\theta(\tau)}
we need to estimate \sum_{\tau'} e^{R_\theta(\tau')}
Guided Cost Learning
C. Finn et al.

[Diagram: two models trained together. A reward model R_θ(τ) computes the reward; a policy
model π̃_φ(a|s) maximizes that reward and, in turn, helps the reward model estimate the
denominator (partition function).]

Guided Cost Learning, Finn et al., arXiv:1603.00448


Training policy

L_{\tilde\pi_\phi} = KL\!\left( \tilde\pi_\phi(\tau) \,\|\, \pi^*(\tau; R_\theta) \right)
                   = \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \frac{\tilde\pi_\phi(\tau)}{\pi^*(\tau; R_\theta)}
                   = \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \tilde\pi_\phi(\tau) - \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \pi^*(\tau; R_\theta)
                   = \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \tilde\pi_\phi(\tau) - \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} R_\theta(\tau) + \log \sum_{\tau'} e^{R_\theta(\tau')}

(the last step expands \log \pi^*(\tau; R_\theta) into its numerator \log e^{R_\theta(\tau)} = R_\theta(\tau) and its denominator)

Anything peculiar? The first term is a negative entropy, the second is the main stuff
(expected reward), and the third is const(φ): it does not depend on the policy parameters.
Training them both

Update R_\theta(\tau) under the current \tilde\pi_\phi(a \mid s):

\nabla_\theta llh = N \cdot \left[ \mathbb{E}_{\tau \in D} \nabla_\theta R_\theta(\tau) - \mathbb{E}_{\tau' \sim \tilde\pi(\tau'; R_\theta)} \nabla_\theta R_\theta(\tau') \right]

Update \tilde\pi_\phi(a \mid s) under the current R_\theta(\tau):

\nabla_\phi KL = - \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \nabla \log \tilde\pi_\phi(\tau) \cdot (1 + R_\theta(\tau))
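A hedged sketch of this alternating loop, in the spirit of guided cost learning rather than a faithful reimplementation: `reward_net` (a per-step reward network), `policy`, `sample_trajectories` and `policy_gradient_step` are all assumed helpers, and the importance-sampling correction used in the full method is left out for brevity.

```python
# Alternating-update sketch; reward_net, policy, sample_trajectories and
# policy_gradient_step are assumed helpers, and the importance-sampling
# correction of the full guided cost learning method is omitted.
import torch

def guided_cost_learning_step(reward_net, reward_opt, policy, env, expert_batch):
    # 1) Reward update: push expert returns up, sampled returns down
    #    (the sampler policy stands in for pi*(tau; R_theta) in the llh gradient)
    sampled_batch = sample_trajectories(env, policy)        # tau ~ pi_phi
    expert_R = torch.stack([sum(reward_net(s, a) for s, a in tau) for tau in expert_batch])
    sampled_R = torch.stack([sum(reward_net(s, a) for s, a in tau) for tau in sampled_batch])
    reward_loss = -(expert_R.mean() - sampled_R.mean())     # negative llh (up to N)
    reward_opt.zero_grad()
    reward_loss.backward()
    reward_opt.step()

    # 2) Policy update: an entropy-regularized policy-gradient step toward high
    #    R_theta, i.e. a step that decreases KL(pi_phi || pi*(tau; R_theta))
    policy_gradient_step(policy, sampled_batch, reward_net)
    return reward_loss.item()
```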
See also

Generative Adversarial Imitation Learning
– arXiv:1606.03476 , Ho et al.

Model-based Adversarial Imitation learning
– arXiv:1612.02179 , Baram et al.

Cooperative Inverse Reinforcement Learning
– arXiv:1606.03137 , Hadfield-Menell et al.
– critical analysis: arXiv:1709.06275, Carey
– Agent learns to understand human’s goal and assist

RLHF reward modeling is a (special case of) inverse RL
… and a ton of other stuff, just google it
