Lecture 11
Today’s menu
● Imitation learning: a brief intro
[Figure: a dialogue system]
τ = <s, a, s', a', …, s_T>
Behavior cloning
Idea: fit πθ(a|s) to the expert's state-action pairs by supervised learning.
[Figure: state frequency (for expert data) plotted over the state space from s0 to sT. The perfect expert trajectory stays where the expert data is; the learned trajectory accumulates a small error at each step and drifts into states the expert data never covers ("you never get here").]
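For concreteness, a minimal behavior-cloning sketch (my own illustration, not from the slides): a discrete-action policy network fit with cross-entropy to expert (state, action) pairs; the network sizes and names below are placeholder assumptions.

```python
# Minimal behavior-cloning sketch: supervised learning on expert (s, a) pairs.
# Assumed setup: discrete actions, expert states as float tensors, expert
# actions as integer tensors; dimensions are placeholders.
import torch
import torch.nn as nn

n_state_dims, n_actions = 8, 4
policy = nn.Sequential(
    nn.Linear(n_state_dims, 64), nn.ReLU(),
    nn.Linear(64, n_actions),              # logits of pi_theta(a|s)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(expert_states, expert_actions):
    """One supervised step: maximize log pi_theta(a|s) on expert pairs."""
    logits = policy(expert_states)                               # [batch, n_actions]
    loss = nn.functional.cross_entropy(logits, expert_actions)   # = -mean log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```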
DAGGER
Idea: let's ask the expert what to do in the new area we've reached.
[Figure: the same state-frequency plot; the expert is queried in the region that the learned trajectory drifts into ("you never get here").]
DAGGER
Initial dataset D = {τ1, τ2, τ3}, initial policy πθ(a|s)
Repeat forever:
● Update policy: πθ(a|s) := argmax_{πθ} Σ_{τ∈D} Σ_{(s,a)∈τ} log π(a|s)
● Sample a session under the current πθ(a|s): τ = <s, a, s', a', …, s_T>
● Update dataset: D := D ∪ { (s, get_optimal_action(s)) ∀ s ∈ τ }
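A minimal sketch of this loop, reusing the policy network and bc_step from the behavior-cloning sketch above; it assumes the classic gym step API and that get_optimal_action(s) stands for querying the expert.

```python
# DAGGER sketch: aggregate expert-relabeled states from the current policy's rollouts.
import torch

def rollout(env, policy, max_steps=1000):
    """Sample one session under the current policy; return the visited states."""
    states, s = [], env.reset()
    for _ in range(max_steps):
        states.append(s)
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        a = torch.distributions.Categorical(logits=logits).sample().item()
        s, _, done, _ = env.step(a)       # classic gym API assumed
        if done:
            break
    return states

def dagger(env, policy, get_optimal_action, D, n_iters=100):
    """D starts as a list of (state, expert_action) pairs from the initial demos."""
    for _ in range(n_iters):
        # 1. update the policy on everything aggregated so far (supervised step)
        states = torch.as_tensor([s for s, a in D], dtype=torch.float32)
        actions = torch.as_tensor([a for s, a in D])
        bc_step(states, actions)
        # 2. sample a session under the current policy
        visited = rollout(env, policy)
        # 3. ask the expert what to do in the newly visited states; aggregate
        D += [(s, get_optimal_action(s)) for s in visited]
    return policy
```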
DAGGER
Benefits
+ fixes the state-space error (distribution shift)
Drawbacks
- you need expert feedback on new sessions
- degrades if expert actions are noisy
Inverse RL
● RL: given the environment and the reward function, find the optimal policy.
● Inverse RL: given the environment and an optimal policy (expert demonstrations), find the reward function.
[Figure: several gridworlds with the same start state and demonstrated behavior but unknown rewards R(■) = ?]
Is it even possible? Yes, to some extent.
Maximum Entropy Inverse RL
D. Ziebart et al.
τ = <s, a, s', a', …, s_T>
Assumption: π*(τ) ∝ e^{R(τ)}, where R(τ) = Σ_{(s,a)∈τ} r(s, a)   (alt: use a discount factor γ)
Sketch: learn r(s, a) to maximize the likelihood of D.
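A toy illustration of this assumption (my own sketch, not from the slides): if there were only a small finite set of candidate trajectories, π*(τ) would simply be a softmax over their returns. Here r is assumed to be a dict over (state, action) pairs.

```python
# MaxEnt assumption on a finite candidate set: pi*(tau) is a softmax over returns.
import numpy as np

def trajectory_return(tau, r):
    """R(tau) = sum of r(s, a) along the trajectory (no discounting)."""
    return sum(r[(s, a)] for s, a in tau)

def maxent_trajectory_probs(trajectories, r):
    """pi*(tau) proportional to exp(R(tau)), normalized over the candidate set."""
    returns = np.array([trajectory_return(tau, r) for tau in trajectories])
    exp_r = np.exp(returns - returns.max())   # subtract max for numerical stability
    return exp_r / exp_r.sum()

# tiny usage example with made-up states/actions:
r = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
trajs = [[("s0", "left")], [("s0", "right")]]
print(maxent_trajectory_probs(trajs, r))      # higher-return trajectory is more likely
```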
Maximum Entropy Inverse RL
How it works: π*(τ) ∝ e^{Rθ(τ)}

log P(D|θ) = Σ_{τ∈D} log π*(τ; θ) = Σ_{τ∈D} log [ e^{Rθ(τ)} / Σ_{τ'} e^{Rθ(τ')} ] = Σ_{τ∈D} Rθ(τ) − Σ_{τ∈D} log Σ_{τ'} e^{Rθ(τ')}

∇θ llh = ?
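Continuing the toy example above, this log-likelihood can be sanity-checked numerically whenever the candidate set is small enough to enumerate the denominator exactly (helper names are mine, not from the slides).

```python
# Toy check of log P(D | theta) using maxent_trajectory_probs from the sketch above.
import numpy as np

def log_likelihood(expert_trajs, candidate_trajs, r):
    """Sum over expert trajectories of log pi*(tau; r); the partition function
    is computed exactly over the finite candidate set."""
    probs = maxent_trajectory_probs(candidate_trajs, r)
    index = {tuple(map(tuple, t)): i for i, t in enumerate(candidate_trajs)}
    return sum(np.log(probs[index[tuple(map(tuple, t))]]) for t in expert_trajs)
```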
Maximum Entropy Inverse RL
Let's simplify:
llh = Σ_{τ∈D} log [ e^{Rθ(τ)} / Σ_{τ'} e^{Rθ(τ')} ],   with N = |D|

∇θ llh = Σ_{τ∈D} ∇θ Rθ(τ) − N · (1 / Σ_{τ̃} e^{Rθ(τ̃)}) · Σ_{τ'} e^{Rθ(τ')} · ∇θ Rθ(τ') =

Reminds you of something? The weights e^{Rθ(τ')} / Σ_{τ̃} e^{Rθ(τ̃)} are exactly π*(τ'; Rθ), so the second term is an expectation:

= Σ_{τ∈D} ∇θ Rθ(τ) − N · E_{s∼dθ(s)} E_{a∼π*θ(a|s)} ∇θ rθ(s, a)

(dθ(s) is the state visitation distribution under π*θ)
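In practice this gradient is estimated from samples and handed to autograd via a surrogate loss. The sketch below is my own illustration: it assumes a reward_net(states, actions) network that outputs per-step rewards, and that we can draw trajectories approximately from π*(τ; Rθ) (which is exactly what the policy model on the next slide is for); importance weights are ignored for simplicity.

```python
# Sampled surrogate for the MaxEnt IRL reward update.
import torch

def traj_return(reward_net, traj):
    """R_theta(tau) = sum of r_theta(s, a) over the trajectory."""
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, a in traj])
    actions = torch.as_tensor([a for s, a in traj])
    return reward_net(states, actions).sum()

def reward_loss(reward_net, expert_trajs, sampled_trajs):
    """Surrogate whose gradient approximates -grad llh:
       -sum_{tau in D} R_theta(tau) + N * mean over sampled tau' of R_theta(tau')."""
    n = len(expert_trajs)
    expert_term = torch.stack([traj_return(reward_net, t) for t in expert_trajs]).sum()
    sample_term = torch.stack([traj_return(reward_net, t) for t in sampled_trajs]).mean()
    return -(expert_term - n * sample_term)
```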
Training policy
To sample from π*(τ; Rθ) ∝ e^{Rθ(τ)}, train a policy model π̃φ(a|s) that maximizes the reward; it also helps us compute the denominator.
[Diagram: reward Rθ(τ), the thing to maximize; policy model π̃φ(a|s), maximizes the reward and helps compute the denominator.]

L_π̃φ = KL( π̃φ(τ) ∥ π*(τ; Rθ) ) = E_{τ∼π̃φ(τ)} log [ π̃φ(τ) / π*(τ; Rθ) ] =
= E_{τ∼π̃φ(τ)} log π̃φ(τ) − E_{τ∼π̃φ(τ)} log π*(τ; Rθ) =
= E_{τ∼π̃φ(τ)} log π̃φ(τ) − E_{τ∼π̃φ(τ)} Rθ(τ) + log Σ_{τ'} e^{Rθ(τ')}

Anything peculiar? The first term is the negative entropy of π̃φ, the second is the main stuff (expected reward), and the last one is const(φ)! So minimizing this KL is just maximum-entropy RL under the current Rθ.
Training them both
● Update Rθ(τ) under the current π̃φ(a|s)
● Update π̃φ(a|s) under the current Rθ(τ):
∇φ KL = − E_{τ∼π̃φ(τ)} [ ∇φ log π̃φ(τ) · (1 + Rθ(τ)) ]
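A sketch of one round of this alternation, reusing reward_loss / traj_return from the earlier sketch and the discrete-action policy network from the behavior-cloning sketch; the policy step uses a generic REINFORCE-style surrogate for the entropy-regularized objective rather than the exact gradient expression on the slide, and all helper names are illustrative assumptions.

```python
# Alternating MaxEnt IRL updates (guided-cost-learning flavour), one round.
# Trajectories are lists of (state, action) pairs sampled under the current pi~_phi.
import torch

def traj_log_prob(policy, traj):
    """log pi~_phi(tau) = sum of log pi~_phi(a|s); dynamics terms do not depend on phi."""
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, a in traj])
    actions = torch.as_tensor([a for s, a in traj])
    dist = torch.distributions.Categorical(logits=policy(states))
    return dist.log_prob(actions).sum()

def maxent_irl_step(reward_net, reward_opt, policy, policy_opt,
                    expert_trajs, sampled_trajs):
    # 1. update R_theta under the current pi~_phi
    r_loss = reward_loss(reward_net, expert_trajs, sampled_trajs)
    reward_opt.zero_grad(); r_loss.backward(); reward_opt.step()

    # 2. update pi~_phi under the current R_theta: REINFORCE-style surrogate for
    #    KL(pi~_phi || pi*) = E[log pi~_phi(tau) - R_theta(tau)] + const(phi)
    p_loss = 0.0
    for traj in sampled_trajs:
        logp = traj_log_prob(policy, traj)
        advantage = (traj_return(reward_net, traj) - logp).detach()
        p_loss = p_loss - logp * advantage
    p_loss = p_loss / len(sampled_trajs)
    policy_opt.zero_grad(); p_loss.backward(); policy_opt.step()
```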
See also
● Generative Adversarial Imitation Learning
  – arXiv:1606.03476, Ho et al.
● Model-based Adversarial Imitation Learning
  – arXiv:1612.02179, Baram et al.
● Cooperative Inverse Reinforcement Learning
  – arXiv:1606.03137, Hadfield-Menell et al.
  – critical analysis: arXiv:1709.06275, Carey
  – the agent learns to understand the human's goal and assist
● RLHF reward modeling is a (special case of) inverse RL
… and a ton of other stuff, just google it