
Practical RL

Episode t+1

Definitely not reinforcement learning

Today’s menu

Imitation learning: a brief intro

Inverse reinforcement learning: a bit less brief
Problem formulation

You have a decision process, but no reward function.

Instead, there are example trajectories produced by an “expert” agent.

You want to learn an optimal policy by imitation.
Why bother

“natural” reward           No natural reward

toy tasks, videogames      real-world problems
Robot gait @ race track    Robot gait @ public space
Online advertising         Recommendation systems
Image captioning           Conversation systems (e.g. a dialogue system)
...
Problem formulation

We have a dataset of sessions D = {τ1, τ2, τ3}
collected under an expert policy π*(a|s), where
τ = <s, a, s’, a’, …, s_T>

You want to learn a policy π(a|s) that mimics π*(a|s).

Q: how can we solve this?
Behavior cloning

We have a dataset of sessions D = {τ1, τ2, τ3}
collected under an expert policy π*(a|s), where
τ = <s, a, s’, a’, …, s_T>

Supervised learning way:

\pi_\theta(a \mid s) = \arg\max_{\pi_\theta} \sum_{\tau \in D} \sum_{s, a \in \tau} \log \pi_\theta(a \mid s)
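To make the objective concrete, here is a minimal behavior-cloning sketch, assuming a PyTorch policy network over discrete actions; `policy_net`, `expert_states`, and `expert_actions` are illustrative names for the aggregated (s, a) pairs from D, not something defined on the slides.

```python
# Minimal behavior-cloning step (PyTorch, discrete actions); policy_net,
# expert_states and expert_actions are illustrative names, not from the slides.
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy_net, optimizer, expert_states, expert_actions):
    """One gradient step on -sum log pi_theta(a|s) over expert (s, a) pairs."""
    logits = policy_net(expert_states)              # shape: (batch, n_actions)
    # Cross-entropy is exactly the negative log-likelihood of the expert actions
    loss = F.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```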
Behavior cloning

[Figure: state-visitation frequency of the expert data over the state space, from s_0 to s_T.
The expert's "perfect trajectory" stays in the high-frequency region; just one error sends the
learned trajectory into a region the expert data never covers ("you never get here").]

Once the agent leaves the expert's state distribution, it reaches states it was never
trained to act in, so its errors compound.
DAGGER

Idea: let's ask the expert what to do in the new area we've reached,
and add those (state, expert action) pairs to the training set.

[Figure: the same state-frequency plot; expert labels are collected for the states the
learned policy visits outside the expert's distribution.]
DAGGER

Initial dataset D = {τ1, τ2, τ3},
initial policy πθ(a|s)

Forever:

  Update policy
  \pi_\theta(a \mid s) := \arg\max_{\pi_\theta} \sum_{\tau \in D} \sum_{s, a \in \tau} \log \pi_\theta(a \mid s)

  Sample a session under the current πθ(a|s):
  τ = <s, a, s’, a’, …, s_T>

  Update dataset
  D := D \cup \{ (s, \text{get\_optimal\_action}(s)) : \forall s \in \tau \}
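A sketch of this loop, assuming three hypothetical helpers: `fit_policy` (a supervised fit on (state, action) pairs, e.g. the behavior-cloning step above), `rollout` (collects the states visited under the current policy), and `get_optimal_action` (the expert query from the slide).

```python
# DAGGER loop sketch; fit_policy, rollout and get_optimal_action are assumed
# helpers (supervised fit, environment rollout, expert query respectively).
def dagger(env, expert_trajectories, get_optimal_action, n_iterations=10):
    # Start from the expert demonstrations as (state, expert_action) pairs
    dataset = [(s, a) for tau in expert_trajectories for (s, a) in tau]
    policy = fit_policy(dataset)                    # behavior cloning on D
    for _ in range(n_iterations):
        visited_states = rollout(env, policy)       # states reached by pi_theta
        # Relabel visited states with expert actions and aggregate the dataset
        dataset += [(s, get_optimal_action(s)) for s in visited_states]
        policy = fit_policy(dataset)                # refit on the grown dataset
    return policy
```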
DAGGER

Benefits
+ fixes the state-space error

Drawbacks
- you need expert feedback on new sessions
- degrades if expert actions are noisy

Idea: first infer which rewards the expert is optimizing,
then train "regular" RL to maximize them.
Inverse Reinforcement Learning

          “regular” RL        inverse RL

given:    Environment,        Environment,
          Reward function     Optimal policy

find:     Optimal policy      Reward function
Is it even possible?

[Figure: a gridworld with a start tile and several colored tiles; the expert's path is shown.]

The agent is rewarded the first time it enters a tile, and it can exit the session at will.
Also R( ■ ) = 0 for the plain tiles.

Q: what is the “cost of living” for one step? (+1 / -1)
A: the agent gets -1 for each turn (cost of living).

From the expert's behavior we can bound some of the tile rewards:
R( ■ ) >> 0,   0 < R( ■ ) <= 2,   R( ■ ) << 0
while for other tiles R( ■ ) remains undetermined.

So, is it even possible? Yes, to some extent.
Maximum Entropy Inverse RL
D. Ziebart et al.

We have a dataset of sessions D = {τ1, τ2, τ3}
collected under an expert policy π*(a|s), where
τ = <s, a, s’, a’, …, s_T>

Assumption: assume that \pi^*(\tau) \sim e^{R(\tau)}, where

R(\tau) = \sum_{s_\tau, a_\tau} r(s_\tau, a_\tau)

(alt: use gamma)

Sketch: learn r(s_\tau, a_\tau) to maximize the likelihood of D
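As a sanity check on the assumption, here is a toy sketch that turns per-step rewards into the implied trajectory distribution for a small, enumerable set of trajectories (each a list of (s, a) pairs); `trajectories` and `reward_fn` are hypothetical names.

```python
# Toy illustration of the MaxEnt assumption pi*(tau) ~ exp(R(tau)) on a small,
# enumerable set of trajectories; trajectories and reward_fn are illustrative.
import numpy as np

def trajectory_distribution(trajectories, reward_fn):
    """Softmax over total trajectory rewards R(tau) = sum_t r(s_t, a_t)."""
    returns = np.array([sum(reward_fn(s, a) for s, a in tau) for tau in trajectories])
    exp_r = np.exp(returns - returns.max())   # subtract max for numerical stability
    return exp_r / exp_r.sum()                # pi*(tau) proportional to e^{R(tau)}
```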
Maximum Entropy Inverse RL

How it works: \pi^*(\tau) \sim e^{R(\tau)}

\log P(D \mid \theta) = \sum_{\tau \in D} \log \pi^*(\tau; \theta)
                      = \sum_{\tau \in D} \log \frac{e^{R_\theta(\tau)}}{\sum_{\tau'} e^{R_\theta(\tau')}}

Do you see the problem? The denominator is a sum over all trajectories.
Maximum Entropy Inverse RL

Let's simplify:

llh = \sum_{\tau \in D} \log \frac{e^{R_\theta(\tau)}}{\sum_{\tau'} e^{R_\theta(\tau')}}
    = \sum_{\tau \in D} \left[ R_\theta(\tau) - \log \sum_{\tau'} e^{R_\theta(\tau')} \right]
    = \sum_{\tau \in D} R_\theta(\tau) - N \cdot \log \sum_{\tau'} e^{R_\theta(\tau')}

\nabla_\theta llh = \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \, \nabla_\theta \log \sum_{\tau'} e^{R_\theta(\tau')}
                  = \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \, \frac{1}{\sum_{\tilde\tau} e^{R_\theta(\tilde\tau)}} \sum_{\tau'} e^{R_\theta(\tau')} \cdot \nabla_\theta R_\theta(\tau')

Reminds you of something? The weights e^{R_\theta(\tau')} / \sum_{\tilde\tau} e^{R_\theta(\tilde\tau)} are a softmax over
trajectories, i.e. exactly \pi^*(\tau'; R_\theta). Hence

\nabla_\theta llh = N \cdot \left[ \mathbb{E}_{\tau \in D} \nabla_\theta R_\theta(\tau) - \mathbb{E}_{\tau' \sim \pi^*(\tau'; R_\theta)} \nabla_\theta R_\theta(\tau') \right]

where \pi^*(\tau'; R_\theta) \sim e^{R_\theta(\tau)}
Tabular, model-based

Replace the sum over trajectories...

\nabla_\theta llh = \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \cdot \mathbb{E}_{\tau' \sim \pi^*(\tau'; \theta)} \nabla_\theta R_\theta(\tau')

… with a sum over states:

= \sum_{\tau \in D} \nabla_\theta R_\theta(\tau) - N \cdot \mathbb{E}_{s \sim d_\theta(s)} \, \mathbb{E}_{a \sim \pi^*_\theta(a \mid s)} \nabla_\theta r_\theta(s, a)

where d_\theta(s) is the state visitation frequency (stationary distribution).
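In the tabular case this can be computed exactly. Below is a sketch under the assumption of a linear reward r_θ(s) = θ[s] with one-hot state features and known transition probabilities; `P`, `policy` (the current soft-optimal policy), `start_dist` and `expert_state_counts` are hypothetical names.

```python
# Tabular MaxEnt IRL gradient sketch, assuming a linear reward r_theta(s) = theta[s]
# (one parameter per state) and known dynamics; all names here are illustrative.
import numpy as np

def state_visitation_frequency(P, policy, start_dist, horizon):
    """Expected per-trajectory state visitations d_theta(s) under `policy`.
    P: (S, A, S) transition probabilities, policy: (S, A), start_dist: (S,)."""
    n_states, n_actions, _ = P.shape
    d_t = start_dist.copy()
    total = start_dist.copy()
    for _ in range(horizon - 1):
        d_next = np.zeros(n_states)
        for a in range(n_actions):
            # probability mass flowing through (s, a) into each next state s'
            d_next += (d_t * policy[:, a]) @ P[:, a, :]
        total += d_next
        d_t = d_next
    return total

def maxent_gradient(expert_state_counts, P, policy, start_dist, horizon, n_trajectories):
    """grad_theta llh = sum over expert visits of grad r  -  N * E_{d_theta} grad r.
    With one-hot state features this is expert state counts minus N times the
    model's expected per-trajectory visitation frequency."""
    model_svf = state_visitation_frequency(P, policy, start_dist, horizon)
    return expert_state_counts - n_trajectories * model_svf
```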
Model-free case

\nabla_\theta llh = N \cdot \left[ \mathbb{E}_{\tau \in D} \nabla_\theta R_\theta(\tau) - \mathbb{E}_{\tau' \sim \pi^*(\tau'; R_\theta)} \nabla_\theta R_\theta(\tau') \right]

The first expectation can be sampled from data; the second is hard to even sample from.

To sample from \pi^*(\tau'; R_\theta) \sim e^{R_\theta(\tau)}
we need to estimate \sum_{\tau'} e^{R_\theta(\tau')}
Guided Cost Learning
C. Finn et al.

[Diagram: two models trained together. A reward model R_θ(τ) computes the reward; a policy
model π̃_φ(a|s) maximizes that reward and, in turn, helps the reward model estimate the
denominator (partition function).]

Guided Cost Learning, Finn et al., arXiv:1603.00448


Training policy

L_{\tilde\pi_\phi} = KL\!\left( \tilde\pi_\phi(\tau) \,\|\, \pi^*(\tau; R_\theta) \right)
                   = \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \frac{\tilde\pi_\phi(\tau)}{\pi^*(\tau; R_\theta)}
                   = \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \tilde\pi_\phi(\tau) - \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \pi^*(\tau; R_\theta)
                   = \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \log \tilde\pi_\phi(\tau) - \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} R_\theta(\tau) + \log \sum_{\tau'} e^{R_\theta(\tau')}

(the last step expands \log \pi^*(\tau; R_\theta) into its numerator \log e^{R_\theta(\tau)} = R_\theta(\tau) and its denominator)

Anything peculiar? The first term is a negative entropy, the second is the main stuff
(expected reward), and the third is const(φ): it does not depend on the policy parameters.
Training them both

Update R_\theta(\tau) under the current \tilde\pi_\phi(a \mid s):

\nabla_\theta llh = N \cdot \left[ \mathbb{E}_{\tau \in D} \nabla_\theta R_\theta(\tau) - \mathbb{E}_{\tau' \sim \tilde\pi(\tau'; R_\theta)} \nabla_\theta R_\theta(\tau') \right]

Update \tilde\pi_\phi(a \mid s) under the current R_\theta(\tau):

\nabla_\phi KL = - \mathbb{E}_{\tau \sim \tilde\pi_\phi(\tau)} \nabla \log \tilde\pi_\phi(\tau) \cdot (1 + R_\theta(\tau))
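A hedged sketch of this alternating loop, in the spirit of guided cost learning rather than a faithful reimplementation: `reward_net` (a per-step reward network), `policy`, `sample_trajectories` and `policy_gradient_step` are all assumed helpers, and the importance-sampling correction used in the full method is left out for brevity.

```python
# Alternating-update sketch; reward_net, policy, sample_trajectories and
# policy_gradient_step are assumed helpers, and the importance-sampling
# correction of the full guided cost learning method is omitted.
import torch

def guided_cost_learning_step(reward_net, reward_opt, policy, env, expert_batch):
    # 1) Reward update: push expert returns up, sampled returns down
    #    (the sampler policy stands in for pi*(tau; R_theta) in the llh gradient)
    sampled_batch = sample_trajectories(env, policy)        # tau ~ pi_phi
    expert_R = torch.stack([sum(reward_net(s, a) for s, a in tau) for tau in expert_batch])
    sampled_R = torch.stack([sum(reward_net(s, a) for s, a in tau) for tau in sampled_batch])
    reward_loss = -(expert_R.mean() - sampled_R.mean())     # negative llh (up to N)
    reward_opt.zero_grad()
    reward_loss.backward()
    reward_opt.step()

    # 2) Policy update: an entropy-regularized policy-gradient step toward high
    #    R_theta, i.e. a step that decreases KL(pi_phi || pi*(tau; R_theta))
    policy_gradient_step(policy, sampled_batch, reward_net)
    return reward_loss.item()
```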
See also

Generative Adversarial Imitation Learning
– arXiv:1606.03476 , Ho et al.

Model-based Adversarial Imitation learning
– arXiv:1612.02179 , Baram et al.

Cooperative Inverse Reinforcement Learning
– arXiv:1606.03137 , Hadfield-Menell et al.
– critical analysis: arXiv:1709.06275, Carey
– Agent learns to understand human’s goal and assist

RLHF reward modeling is a (special case of) inverse RL
… and a ton of other stuff, just google it
