DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
09 March 2021
1 Core idea
2
2.1 Off-Policy
2.1.1 Discounted Stationary Distribution
2.1.2 Learning Stationary Distribution Corrections
2.1.3 Off-Policy Estimation with Multiple Unknown Behavior Policies
3 DualDICE
3.1 The Key Idea
3.2 Derivation
3.2.1 Change of Variables
3.2.2 Fenchel Duality
3.2.3 Extension to General Convex Functions
3.3 DualDICE
3.4
4 Proof
4.1
4.2
4.2.1 Bounding ϵr
4.2.2 Bounding ϵest(F)
4.2.3 Bounding ϵstat
4.3
1 Core idea
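This paper is by Bo Dai and co-authors at Google Brain (NIPS 2019).
It is about estimating discounted stationary distribution ratios (DSDR).
DualDICE estimates the DSDR of each state-action pair $(s, a)$ directly from off-policy data.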
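2
Consider a Markov decision process $\langle S, A, R, T, \beta \rangle$, where $\beta$ is the initial state distribution. A terminal state $s_T$ is absorbing and yields no reward: $P(s_T \mid s_T, a) = 1$, $R(s_T, a) = 0$.
2.1 Off-Policy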
$$
\rho(\pi) := (1-\gamma) \cdot E\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 \sim \beta,\ \forall t,\ a_t \sim \pi(s_t),\ r_t \sim R(s_t, a_t),\ s_{t+1} \sim T(s_t, a_t)\right] \tag{1}
$$
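The normalization in (1) can be illustrated on a single finite reward sequence. The following is a minimal sketch; the function name and the example reward sequences are assumptions made for illustration, and the infinite sum is truncated at the length of the list.

```python
def normalized_discounted_return(rewards, gamma):
    """Return (1 - gamma) * sum_t gamma**t * r_t for one reward sequence."""
    total = 0.0
    discount = 1.0  # gamma**t, starting at t = 0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return (1.0 - gamma) * total


# A constant reward of 1 has normalized value close to 1 for a long horizon.
constant_value = normalized_discounted_return([1.0] * 2000, gamma=0.9)
# A single reward at t = 0 is scaled by (1 - gamma).
first_step_value = normalized_discounted_return([1.0, 0.0, 0.0], gamma=0.5)
```

The factor $(1-\gamma)$ keeps the value on the same scale as a single reward, which is why a constant reward of 1 maps to a value of about 1.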
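In the off-policy setting we only have access to a fixed dataset $D: \{(s, a, r, s')\}$ and cannot interact with the environment. If $D$ was collected by a known behavior policy $\mu$, importance sampling (IS) over trajectories of length $H$ gives
$$
\rho(\pi) = (1-\gamma)\, E_{\tau \sim \mu}\left[\left(\prod_{t=0}^{H-1} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}\right)\left(\sum_{t=0}^{H-1} \gamma^t r_t\right)\right] \tag{2}
$$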
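A trajectory-wise IS estimate can be sketched as below. The data layout (lists of `(state, action, reward)` tuples, policies as nested dicts) and the toy numbers are assumptions chosen for illustration, not part of the original notes.

```python
def importance_sampling_estimate(trajectories, pi, mu, gamma):
    """Average of (1 - gamma) * prod_t pi/mu * sum_t gamma**t r_t over trajectories.

    trajectories: list of lists of (state, action, reward) tuples drawn under mu.
    pi, mu: dicts mapping state -> {action: probability}.
    """
    estimates = []
    for trajectory in trajectories:
        ratio = 1.0       # running product of pi(a_t|s_t) / mu(a_t|s_t)
        ret = 0.0         # running discounted return
        discount = 1.0
        for state, action, reward in trajectory:
            ratio *= pi[state][action] / mu[state][action]
            ret += discount * reward
            discount *= gamma
        estimates.append((1.0 - gamma) * ratio * ret)
    return sum(estimates) / len(estimates)


mu = {0: {0: 0.5, 1: 0.5}}   # hypothetical behavior policy
pi = {0: {0: 1.0, 1: 0.0}}   # hypothetical target policy
data = [[(0, 0, 1.0), (0, 0, 1.0)], [(0, 1, 0.0), (0, 0, 1.0)]]
on_policy = importance_sampling_estimate(data, mu, mu, gamma=0.5)
off_policy = importance_sampling_estimate(data, pi, mu, gamma=0.5)
```

Even with two steps, the per-trajectory weight is either 4 or 0 here. The weight is a product over the horizon, so its spread grows with $H$, which is the motivation for working with stationary distribution ratios instead.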
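IS is a standard tool in offline RL.
2.1.1 Discounted Stationary Distribution
The return $R(\tau)$ of a trajectory $\tau = \tau_{0:T}$ is defined for the average case ($\gamma = 1$) and the discounted case ($\gamma \in (0, 1)$):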
$$
\text{Average: } R(\tau) := \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} r_t, \qquad \text{Discounted: } R(\tau) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t r_t \tag{3}
$$
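Here $(1-\gamma) = 1 / \sum_{t=0}^{\infty} \gamma^t$ is the normalization factor. The off-policy problem is to estimate the expected return $R_\pi$ of $\pi$ from trajectories $\tau^i = \{s_t^i, a_t^i, r_t^i\}_{t=0}^{T}$ collected by another policy $\pi_0$.
Let $d^{\pi,t}(\cdot)$ denote the distribution of the state $s_t$ reached after starting from $s_0$ and following $\pi$ for $t$ steps. The value $\rho(\pi)$ in (1) depends on the trajectory only through the states $s_t$ and rewards $r_t$, which leads to the following definition: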
$$
d^{\pi}(s) = \lim_{T \to \infty} \left(\sum_{t=0}^{T} \gamma^t d^{\pi,t}(s)\right) \Big/ \left(\sum_{t=0}^{T} \gamma^t\right) \tag{4}
$$
$$
d^{\pi}(s) = (1-\gamma) \cdot \sum_{t=0}^{\infty} \gamma^t \Pr\left(s_t = s \mid s_0 \sim \beta,\ \forall t,\ a_t \sim \pi(s_t),\ r_t \sim R(s_t, a_t),\ s_{t+1} \sim T(s_t, a_t)\right) \tag{5}
$$
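Definition (4) covers $\gamma \in (0, 1]$. For $\gamma < 1$, taking $T \to \infty$ gives $d^{\pi}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t d^{\pi,t}(s)$, which matches (5). For $\gamma = 1$, $d^{\pi}(s)$ is the stationary distribution of $s_t$:
$$
d^{\pi}(s) = \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} d^{\pi,t}(s) = \lim_{t \to \infty} d^{\pi,t}(s)
$$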
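With this distribution, (1) can be rewritten in terms of the ratio
$$
w_{\pi/\mu}(s, a) = \frac{d^{\pi}(s, a)}{d^{\mu}(s, a)}
$$
which is called the Discounted Stationary Distribution Correction $w_{\pi/\mu}(s, a)$.
Following Bo Dai et al., given trajectories $\{s_t^i, a_t^i, r_t^i\}_{i=1}^{m} \sim \mu$, the value is estimated as
$$
\rho(\pi) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{T} w_t^i r_t^i, \qquad w_t^i := \gamma^t \frac{w_{\pi/\mu}(s_t^i, a_t^i)}{\sum_{t', i'} w_{\pi/\mu}(s_{t'}^{i'}, a_{t'}^{i'})}
$$
The weights depend on individual pairs $(s, a)$ rather than on whole trajectories $\tau = \{(s_t, a_t)\}_{t=0}^{T}$. See Qiang Liu et al., "Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation".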
2.1.2 Learning Stationary Distribution Corrections
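Write $T_\pi(s' \mid s) := \sum_a T(s' \mid s, a)\, \pi(a \mid s)$, so that $d^{\pi,t+1}(s') = \sum_s T_\pi(s' \mid s)\, d^{\pi,t}(s)$. Then
$$
\begin{aligned}
d^{\pi}(s') &= (1-\gamma) \sum_{t=0}^{\infty} \gamma^t d^{\pi,t}(s') \\
&= (1-\gamma)\beta(s') + (1-\gamma) \sum_{t=1}^{\infty} \gamma^t d^{\pi,t}(s') \\
&= (1-\gamma)\beta(s') + (1-\gamma)\gamma \sum_{t=0}^{\infty} \gamma^t d^{\pi,t+1}(s') \\
&= (1-\gamma)\beta(s') + (1-\gamma)\gamma \sum_{t=0}^{\infty} \gamma^t \sum_{s} T_\pi(s' \mid s)\, d^{\pi,t}(s) \\
&= (1-\gamma)\beta(s') + \gamma \sum_{s} T_\pi(s' \mid s) \left((1-\gamma) \sum_{t=0}^{\infty} \gamma^t d^{\pi,t}(s)\right) \\
&= (1-\gamma)\beta(s') + \gamma \sum_{s} T_\pi(s' \mid s)\, d^{\pi}(s) \\
&= (1-\gamma)\beta(s') + \gamma \sum_{s,a} T(s' \mid s, a)\, \pi(a \mid s)\, d^{\pi}(s)
\end{aligned}
\tag{8}
$$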
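The recursion (8) can be checked numerically on a small chain. This is a minimal sketch under assumptions: a hypothetical two-state chain, a policy-induced transition table `P_pi[s][s2]` standing for $T_\pi(s_2 \mid s)$, and the series in (5) truncated at a finite horizon.

```python
def discounted_state_distribution(beta, P_pi, gamma, horizon=2000):
    """Truncated series (1 - gamma) * sum_t gamma**t * d_t, with d_0 = beta."""
    n = len(beta)
    d_t = list(beta)              # distribution of s_t, starting at t = 0
    d = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        d = [d[s] + (1 - gamma) * discount * d_t[s] for s in range(n)]
        # One step of the chain: d_{t+1}(s2) = sum_s T_pi(s2 | s) d_t(s)
        d_t = [sum(P_pi[s][s2] * d_t[s] for s in range(n)) for s2 in range(n)]
        discount *= gamma
    return d


def flow_residual(d, beta, P_pi, gamma):
    """Largest violation of d(s2) = (1-gamma) beta(s2) + gamma sum_s T_pi(s2|s) d(s)."""
    n = len(beta)
    return max(
        abs(d[s2] - (1 - gamma) * beta[s2]
            - gamma * sum(P_pi[s][s2] * d[s] for s in range(n)))
        for s2 in range(n)
    )


beta = [1.0, 0.0]                      # hypothetical initial distribution
P_pi = [[0.7, 0.3], [0.4, 0.6]]        # hypothetical T_pi, rows sum to 1
d_pi = discounted_state_distribution(beta, P_pi, gamma=0.9)
residual = flow_residual(d_pi, beta, P_pi, gamma=0.9)
```

The series definition and the one-step recursion agree up to truncation and floating-point error, so `residual` is numerically zero.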
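For $\gamma = 1$ this reduces to the usual stationarity condition of a Markov chain, written $\pi P = \pi$ in the standard notation where $\pi$ is a stationary distribution and $P$ a transition matrix, which corresponds to the average case in (3).
Summing (8) over $s' \in S$:
$$
\sum_{s'} d^{\pi}(s') = \sum_{s'} (1-\gamma)\beta(s') + \gamma \sum_{s,a,s'} T(s' \mid s, a)\, \pi(a \mid s)\, d^{\pi}(s)
$$
Multiplying and dividing by the joint distribution under $\mu$, $d(s', a, s) = T(s' \mid s, a)\, \mu(a \mid s)\, d^{\mu}(s)$:
$$
E_{s' \sim d^{\mu}}\left[w_{\pi/\mu}(s')\right] = (1-\gamma)\beta(s') + \gamma \sum_{s,a,s'} \frac{T(s' \mid s, a)\, \pi(a \mid s)\, d^{\pi}(s)}{T(s' \mid s, a)\, \mu(a \mid s)\, d^{\mu}(s)}\, d(s', a, s) \tag{9}
$$
$$
E_{s' \sim d^{\mu}}\left[w_{\pi/\mu}(s')\right] = (1-\gamma)\beta(s') + \gamma\, E_{(s,a,s') \sim d^{\mu}}\left[w_{\pi/\mu}(s)\, \frac{\pi(a \mid s)}{\mu(a \mid s)}\right]
$$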
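$$
E_{(s_t, a_t, s_{t+1}) \sim d^{\mu}}\left[\mathrm{TD}\left(s_t, a_t, s_{t+1} \mid w_{\pi/\mu}\right) \mid s_{t+1} = s'\right] = 0 \tag{10}
$$
$$
\mathrm{TD}\left(s, a, s' \mid w_{\pi/\mu}\right) := -w_{\pi/\mu}(s') + (1-\gamma)\beta(s') + \gamma\, w_{\pi/\mu}(s) \cdot \frac{\pi(a \mid s)}{\mu(a \mid s)} \tag{11}
$$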
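The TD error of $w_{\pi/\mu}$ therefore has zero conditional expectation under $d^{\mu}$.
It plays the role of a Bellman residual for $w_{\pi/\mu}$, obtained from (8) by relating $s$ and $s'$.
2.1.3 Off-Policy Estimation with Multiple Unknown Behavior Policies
When the dataset $D$ is not generated by a single known behavior policy, write $d^{D}$ for the distribution of $D$. The value is then
$$
\rho(\pi) = E_{(s,a,r) \sim D}\left[w_{\pi/D}(s, a) \cdot r\right] \tag{12}
$$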
Assumption 1: $d^{\pi}(s, a) > 0$ implies $d^{\mu}(s, a) > 0$, and there is a constant $C$ such that $\|w_{\pi/D}\|_\infty \le C$.
3 DualDICE
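DualDICE estimates the Discounted Stationary Distribution Corrections $w_{\pi/D}(s, a) = \frac{d^{\pi}(s, a)}{d^{D}(s, a)}$ from a dataset $D = \{(s, a, r, s')\} \sim d^{D}$.
It only needs samples from $d^{D}$, $\beta$ and $\pi$, and never requires $d^{D}$ explicitly.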
3.2 Derivation
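For $m \in R_{>0}$, $n \in R_{\ge 0}$, the scalar problem $\min_x J(x) := \frac{1}{2} m x^2 - n x$ has the solution $x^* = \frac{n}{m}$.
Let $\mathcal{C}$ denote a bounded subset of $R$ containing $[0, C]$. Applying the same idea pointwise gives
$$
\begin{aligned}
\min_{x: S \times A \to \mathcal{C}} J_1(x) &:= \frac{1}{2} \sum_{s,a} d^{D}(s, a)\, x(s, a)^2 - \sum_{s,a} d^{\pi}(s, a)\, [x(s, a)] \\
&= \frac{1}{2} E_{(s,a) \sim d^{D}}\left[x(s, a)^2\right] - E_{(s,a) \sim d^{\pi}}\left[x(s, a)\right]
\end{aligned}
\tag{15}
$$
whose minimizer is, for all $(s, a) \in S \times A$, $x^*(s, a) = w_{\pi/D}(s, a) = \frac{d^{\pi}(s, a)}{d^{D}(s, a)}$.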
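The claim that the minimizer of (15) is the density ratio can be checked with plain gradient descent on a tiny table. This is a sketch under assumptions: three hypothetical state-action pairs with hand-picked probabilities, and a step size chosen only so that the iteration contracts.

```python
def minimize_j1(d_D, d_pi, learning_rate=1.0, steps=500):
    """Gradient descent on J1(x) = 0.5 * sum d_D * x**2 - sum d_pi * x."""
    x = [0.0] * len(d_D)
    for _ in range(steps):
        # dJ1/dx_k = d_D[k] * x[k] - d_pi[k]
        grad = [dD * xk - dp for dD, xk, dp in zip(d_D, x, d_pi)]
        x = [xk - learning_rate * g for xk, g in zip(x, grad)]
    return x


d_D = [0.5, 0.3, 0.2]     # hypothetical data distribution over (s, a) pairs
d_pi = [0.2, 0.2, 0.6]    # hypothetical target distribution over the same pairs
x_star = minimize_j1(d_D, d_pi)
ratios = [dp / dD for dp, dD in zip(d_pi, d_D)]
```

Each coordinate is the scalar problem with $m = d^{D}(s,a)$ and $n = d^{\pi}(s,a)$, so `x_star` matches `ratios` entry by entry.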
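The second term of (15) needs samples $(s, a) \sim d^{\pi}(s, a)$, which are not available off-policy.
To remove it, introduce an auxiliary function $\nu: S \times A \to R$.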
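Define $\nu$ through $\nu(s, a) := x(s, a) + \gamma E_{s' \sim T(s,a), a' \sim \pi(s')}\left[\nu(s', a')\right]$. Since $x(s, a) \in \mathcal{C}$ is bounded and $\gamma < 1$, $\nu(s, a)$ is well defined. Let $\beta_t$ denote the state distribution at step $t$ under $\pi$. Then
$$
\begin{aligned}
E_{(s,a) \sim d^{\pi}}[x(s, a)] &= E_{(s,a) \sim d^{\pi}}\left[\nu(s, a) - \gamma E_{s' \sim T(s,a), a' \sim \pi(s')}[\nu(s', a')]\right] \\
&= (1-\gamma) \sum_{t=0}^{\infty} \gamma^t E_{s \sim \beta_t, a \sim \pi(s)}\left[\nu(s, a) - \gamma E_{s' \sim T(s,a), a' \sim \pi(s')}[\nu(s', a')]\right] \\
&= (1-\gamma) \sum_{t=0}^{\infty} \gamma^t E_{s \sim \beta_t, a \sim \pi(s)}[\nu(s, a)] - (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t+1} E_{s \sim \beta_{t+1}, a \sim \pi(s)}[\nu(s, a)]
\end{aligned}
\tag{18}
$$
The two sums telescope, leaving only the $t = 0$ term $(1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)]$.
The right-hand side no longer involves $d^{\pi}(s, a)$. Substituting it into (15) gives the objective (13) over $\nu$.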
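Let $\nu^*$ be the minimizer. Two difficulties remain:
1. The term $(\nu - B^{\pi}\nu)(s, a)^2$ places the expectation inside $B^{\pi}$ within a square, so it cannot be estimated without bias from single transitions.
2. Even with $\nu^*$, recovering the ratio $(\nu^* - B^{\pi}\nu^*)(s, a)$ requires evaluating $B^{\pi}$.
Here $B^{\pi}\nu(s, a) := \gamma E_{s' \sim T(s,a), a' \sim \pi(s')}[\nu(s', a')]$. Both difficulties are handled with Fenchel duality.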
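A convex function $f(x)$ can be written as $f(x) = \max_{\zeta}\, x \cdot \zeta - f^*(\zeta)$, where $f^*$ is the Fenchel conjugate of $f$.
For $f(x) = \frac{1}{2} x^2$, the Fenchel conjugate is $f^*(\zeta) = \frac{1}{2} \zeta^2$.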
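The conjugate representation of the quadratic can be checked numerically. The sketch below replaces the exact maximization over $\zeta$ with a grid search; the grid range and step are assumptions chosen for illustration.

```python
def quadratic_via_conjugate(x, radius=5.0, step=1e-3):
    """Approximate max_zeta x*zeta - 0.5*zeta**2 on a grid over [-radius, radius]."""
    count = int(round(2 * radius / step))
    grid = (-radius + i * step for i in range(count + 1))
    return max(x * zeta - 0.5 * zeta * zeta for zeta in grid)


x = 1.3
dual_value = quadratic_via_conjugate(x)   # max over zeta of x*zeta - f*(zeta)
primal_value = 0.5 * x * x                # f(x) = x**2 / 2
```

The maximum is attained near $\zeta = x$, which is the same fact that later identifies the optimal $\zeta$ with $\nu - B^{\pi}\nu$.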
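$$
\min_{\nu: S \times A \to R} J(\nu) := E_{(s,a) \sim d^{D}}\left[\max_{\zeta}\ (\nu - B^{\pi}\nu)(s, a) \cdot \zeta - \frac{1}{2}\zeta^2\right] - (1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)] \tag{20}
$$
The inner maximization over $\zeta$ turns this into a min-max problem in which $\zeta$ is chosen separately for each $(s, a)$.
By the interchangeability principle, the maximum can be moved outside the expectation by letting $\zeta$ be a function of $(s, a)$.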
Figure 1: interchangeability principle: http://www.optimization-online.org/DB_FILE/2017/04/5983.pdf
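$$
\begin{aligned}
\min_{\nu: S \times A \to R}\ \max_{\zeta: S \times A \to R} J(\nu, \zeta) := {} & E_{(s,a,s') \sim d^{D}, a' \sim \pi(s')}\left[(\nu(s, a) - \gamma \nu(s', a'))\, \zeta(s, a) - \zeta(s, a)^2 / 2\right] \\
& - (1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)].
\end{aligned}
\tag{21}
$$
By the KKT condition, for any $\nu$ the optimal $\zeta$ is $\zeta_\nu^* = \nu - B^{\pi}\nu$.
At the saddle point $(\nu^*, \zeta^*)$ of the minimax problem (21), $\zeta^* = \nu^* - B^{\pi}\nu^* = w_{\pi/D}$. This formulation has two properties:
1. It only requires samples from $d^{D}$, $\pi$ and $\beta$.
2. (21) is a min-max problem over the two functions $\nu$ and $\zeta$.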
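One way to see why the saddle point of (21) recovers the ratio: $J(\nu, \zeta)$ is linear in $\nu$, so a saddle point requires the gradient with respect to $\nu$ to vanish, and this happens exactly when $d^{D}\zeta$ satisfies the flow equation (8). The sketch below checks this on a hypothetical 2-state, 2-action MDP. All numbers are assumptions, $d^{D}$ is deliberately taken uniform rather than generated by any single behavior policy, and $d^{D}(s, a, s') = d^{D}(s, a)\, T(s' \mid s, a)$ is assumed.

```python
def occupancy(beta, T, policy, gamma, iters=3000):
    """Discounted state-action occupancy d(s, a) of `policy`, by fixed-point iteration."""
    n_s, n_a = len(beta), len(policy[0])
    d_s = list(beta)
    for _ in range(iters):
        d_s = [
            (1 - gamma) * beta[s2]
            + gamma * sum(T[s][a][s2] * policy[s][a] * d_s[s]
                          for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)
        ]
    return [[d_s[s] * policy[s][a] for a in range(n_a)] for s in range(n_s)]


def nu_gradient(zeta, d_D, beta, T, pi, gamma):
    """Gradient of J(nu, zeta) in eq. (21) with respect to each nu(s, a).

    J is linear in nu, so this gradient does not depend on nu itself.
    """
    n_s, n_a = len(beta), len(pi[0])
    grad = [[0.0] * n_a for _ in range(n_s)]
    for s in range(n_s):
        for a in range(n_a):
            weight = d_D[s][a] * zeta[s][a]
            grad[s][a] += weight                       # from nu(s, a) * zeta(s, a)
            for s2 in range(n_s):
                for a2 in range(n_a):                  # from -gamma * nu(s', a') * zeta(s, a)
                    grad[s2][a2] -= gamma * weight * T[s][a][s2] * pi[s2][a2]
            grad[s][a] -= (1 - gamma) * beta[s] * pi[s][a]   # initial-state term
    return grad


gamma = 0.9
beta = [1.0, 0.0]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]   # T[s][a][s_next]
pi = [[0.3, 0.7], [0.6, 0.4]]
d_D = [[0.25, 0.25], [0.25, 0.25]]
d_pi = occupancy(beta, T, pi, gamma)
ratio = [[d_pi[s][a] / d_D[s][a] for a in range(2)] for s in range(2)]
ones = [[1.0, 1.0], [1.0, 1.0]]
grad_at_ratio = max(abs(g) for row in nu_gradient(ratio, d_D, beta, T, pi, gamma) for g in row)
grad_at_ones = max(abs(g) for row in nu_gradient(ones, d_D, beta, T, pi, gamma) for g in row)
```

The gradient vanishes at $\zeta = w_{\pi/D}$ and not at an arbitrary $\zeta$, without any reference to how $d^{D}$ was produced.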
1.
https://www.zhihu.com/question/268862097/answer/371323504
2.
3.
https://www.zhihu.com/question/263754316/answer/1290371489
3.2.3 Extension to General Convex Functions
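The square $(\nu - B^{\pi}\nu)(s, a)^2$ can be replaced by a more general convex function.
Let $f: R \to R$ be convex, and let $\mathcal{C}$ be a bounded subset of $R$ containing $[0, C]$, with $\nabla f \in \mathcal{C}$. The objective becomes
$$
\min_{\nu: S \times A \to R} J(\nu) := E_{(s,a) \sim d^{D}}\left[f\left((\nu - B^{\pi}\nu)(s, a)\right)\right] - (1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)] \tag{25}
$$
Applying Fenchel duality to $f$:
$$
\begin{aligned}
\min_{\nu: S \times A \to R}\ \max_{\zeta: S \times A \to R} J(\nu, \zeta) := {} & E_{(s,a,s') \sim d^{D}, a' \sim \pi(s')}\left[(\nu(s, a) - \gamma \nu(s', a'))\, \zeta(s, a) - f^*(\zeta(s, a))\right] \\
& - (1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)]
\end{aligned}
\tag{26}
$$
By the KKT condition, for all $\nu$ the inner optimum $\zeta_\nu^*$ is determined pointwise. At the optimum,
$$
\zeta^*(s, a) = f'\left((\nu^* - B^{\pi}\nu^*)(s, a)\right) = f'(x^*(s, a)) = w_{\pi/D}(s, a) \tag{28}
$$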
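As a worked instance of (26) and (28), take the convex function $f(x) = \frac{2}{3}|x|^{3/2}$. This particular choice is an assumption made here for illustration; the notes above only state the general form.

```latex
\begin{aligned}
f(x) &= \tfrac{2}{3}\,|x|^{3/2}, \qquad f'(x) = \operatorname{sign}(x)\,|x|^{1/2}, \\
f^*(\zeta) &= \max_x \; x\zeta - \tfrac{2}{3}|x|^{3/2}
  \quad\text{attained at } x = \operatorname{sign}(\zeta)\,\zeta^2, \\
           &= |\zeta|^3 - \tfrac{2}{3}|\zeta|^3 = \tfrac{1}{3}|\zeta|^3, \\
\zeta^*(s,a) &= f'\big(x^*(s,a)\big) = w_{\pi/D}(s,a) \ge 0
  \;\Longrightarrow\; x^*(s,a) = w_{\pi/D}(s,a)^2 .
\end{aligned}
```

With a non-quadratic $f$, the primal quantity $(\nu^* - B^{\pi}\nu^*)(s, a)$ is no longer the ratio itself, but the dual variable $\zeta^*$ still is.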
3.3 DualDICE
Figure 2: the DualDICE algorithm, over $\nu, \zeta$.
3.4
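Given samples $\{s_i, a_i, r_i, s_i'\}_{i=1}^{N} \sim d^{D}$, $\{s_0^i\}_{i=1}^{N} \sim \beta$, with $a_i' \sim \pi(s_i')$ and $a_0^i \sim \pi(s_0^i)$ for $i = 1, \cdots, N$, let $\hat{E}_{d^{D}}$ denote the empirical expectation over these samples.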
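The empirical objective built from such samples can be sketched as follows, for the quadratic case $f(x) = \frac{1}{2}x^2$ of (21). The function names, the tuple layouts, and the stand-in callables for $\nu$ and $\zeta$ are assumptions for illustration; in practice $\nu$ and $\zeta$ would be learned function approximators.

```python
def empirical_objective(nu, zeta, transitions, initial_pairs, gamma):
    """Sample average of J(nu, zeta) in eq. (21) for f(x) = x**2 / 2.

    transitions: list of (s, a, s_next, a_next) with a_next drawn from pi(s_next).
    initial_pairs: list of (s0, a0) with s0 ~ beta and a0 ~ pi(s0).
    nu, zeta: callables taking (state, action).
    """
    main = sum(
        (nu(s, a) - gamma * nu(s2, a2)) * zeta(s, a) - 0.5 * zeta(s, a) ** 2
        for s, a, s2, a2 in transitions
    ) / len(transitions)
    init = sum(nu(s0, a0) for s0, a0 in initial_pairs) / len(initial_pairs)
    return main - (1.0 - gamma) * init


def ope_estimate(zeta, rewarded_pairs):
    """Sample average of zeta(s, a) * r, the plug-in value estimate."""
    return sum(zeta(s, a) * r for s, a, r in rewarded_pairs) / len(rewarded_pairs)


nu = lambda s, a: float(s + a)      # hypothetical stand-in for a learned nu
zeta = lambda s, a: 1.0             # hypothetical stand-in for a learned zeta
objective = empirical_objective(nu, zeta, [(1, 1, 0, 0)], [(1, 0)], gamma=0.5)
value = ope_estimate(zeta, [(0, 0, 2.0), (1, 1, 4.0)])
```

Only sampled transitions, sampled initial states, and actions drawn from $\pi$ enter the computation; no behavior-policy probabilities appear.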
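Take $f(x) = \frac{1}{2}x^2$ and let $F, H$ be the function classes used for $\nu, \zeta$ in (21).
Let $\hat{\nu}, \hat{\zeta}$ be the output of an optimization algorithm OPT. The off-policy estimate (OPE) built from $\hat{\nu}, \hat{\zeta}$ satisfies
$$
E\left[\left(\hat{E}_{d^{D}}[\hat{\zeta}(s, a) \cdot r] - \rho(\pi)\right)^2\right] = \tilde{O}\left(\epsilon_{approx}(F, H) + \epsilon_{opt} + \frac{1}{\sqrt{N}}\right) \tag{29}
$$
where $\epsilon_{opt}$ is the optimization error and $\epsilon_{approx}(F, H)$ is the approximation error of $F, H$.
As $N$ grows, $\frac{1}{\sqrt{N}} \to 0$, leaving only $\epsilon_{approx}(F, H)$ and $\epsilon_{opt}$.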
4 Proof
This section proves (29).
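The proof uses the bounds $\|\nu\|_\infty \le C$, $\|\nu - B^{\pi}\nu\|_\infty \le \frac{1+\gamma}{1-\gamma} C$, $\|w\|_\infty \le C$ and $\|\hat{r}(s, a)\|_\infty \le C_r$, so that rewards lie in $[-C_r, C_r]$.
Recall the DualDICE objective: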
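$$
\begin{aligned}
J(\nu, \zeta) = {} & E_{(s,a,s') \sim d^{D}, a' \sim \pi(s')}\left[(\nu(s, a) - \gamma \nu(s', a'))\, \zeta(s, a) - \zeta(s, a)^2 / 2\right] \\
& - (1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)]
\end{aligned}
\tag{30}
$$
Maximizing over $\zeta$ gives
$$
J(\nu) = \frac{1}{2} E_{(s,a) \sim d^{D}}\left[(\nu - B^{\pi}\nu)(s, a)^2\right] - (1-\gamma) E_{s_0 \sim \beta, a_0 \sim \pi(s_0)}[\nu(s_0, a_0)] \tag{31}
$$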
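1. $\hat{J}(\nu, \zeta)$ is the empirical version of $J(\nu, \zeta)$, with solution $(\hat{\nu}^*, \hat{\zeta}^*)$.
2. $\nu_F^* = \arg\min_{\nu \in F} J(\nu)$ and $\nu^* = \arg\min_{\nu \in S \times A \to R} J(\nu)$.
3. $L(\nu) = \max_{\zeta \in H} J(\nu, \zeta)$, $\hat{L}(\nu) = \max_{\zeta \in H} \hat{J}(\nu, \zeta)$, $\ell(\zeta) = \min_{\nu \in F} J(\nu, \zeta)$, $\hat{\ell}(\zeta) = \min_{\nu \in F} \hat{J}(\nu, \zeta)$.
4. $\hat{J}(\nu, \zeta)$ is computed from the samples $\{s_i, a_i, r_i, s_i'\}_{i=1}^{N} \sim d^{D}$, $\{s_0^i\}_{i=1}^{N} \sim \beta$, $a_i' \sim \pi(s_i')$, $a_0^i \sim \pi(s_0^i)$, and $(\hat{\nu}, \hat{\zeta})$ is the solution returned on these samples.
5. The mean reward and the value are
$$
\bar{R}(s, a) = E_{\cdot \mid s, a}[r], \qquad \rho(\pi) = E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot \bar{R}(s, a)\right]. \tag{32}
$$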
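Here $(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a)$ approximates $w_{\pi/D}(s, a)$, where $\hat{B}^{\pi}$ is the empirical Bellman backup under $d^{D}, \pi$. DualDICE itself uses $\hat{\zeta}(s, a)$ as the estimate of $w_{\pi/D}(s, a)$.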
4.1
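First bound the error when $(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a)$ is used in place of $w_{\pi/D}(s, a)$:
$$
\begin{aligned}
& \left(\hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot r\right] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot \bar{R}(s, a)\right]\right)^2 \\
= {} & \Big(\hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot r\right] - \hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot \bar{R}(s, a)\right] \\
& + \hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot \bar{R}(s, a)\right] - \hat{E}_{d^{D}}\left[(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)(s, a) \cdot \bar{R}(s, a)\right] \\
& + \hat{E}_{d^{D}}\left[(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)(s, a) \cdot \bar{R}(s, a)\right] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot \bar{R}(s, a)\right]\Big)^2 \\
\le {} & 4 \underbrace{\left(\hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot r\right] - \hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot \bar{R}(s, a)\right]\right)^2}_{\epsilon_r} \\
& + 4 \underbrace{\left(\hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot \bar{R}(s, a)\right] - \hat{E}_{d^{D}}\left[(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)(s, a) \cdot \bar{R}(s, a)\right]\right)^2}_{\epsilon_1} \\
& + 4 \underbrace{\left(\hat{E}_{d^{D}}\left[(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)(s, a) \cdot \bar{R}(s, a)\right] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot \bar{R}(s, a)\right]\right)^2}_{\epsilon_2}.
\end{aligned}
\tag{33}
$$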
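The inequality follows from
$$
\begin{aligned}
(a - b)^2 &= (a - c + c - d + d - b)^2 \\
&= (a - c)^2 + (c - d)^2 + (d - b)^2 + 2(a - c)(c - d) + 2(a - c)(d - b) + 2(c - d)(d - b) \\
&\le 3(a - c)^2 + 3(c - d)^2 + 3(d - b)^2 \\
&\le 4(a - c)^2 + 4(c - d)^2 + 4(d - b)^2
\end{aligned}
\tag{34}
$$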
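For $\epsilon_r$:
$$
\epsilon_r = \left(\hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot (r(s, a) - \bar{R}(s, a))\right]\right)^2 \le \left(\frac{1+\gamma}{1-\gamma}\right)^2 C^2 \left(\hat{E}_{d^{D}}[r(s, a)] - \hat{E}_{d^{D}}[\bar{R}(s, a)]\right)^2 \tag{35}
$$
For $\epsilon_1$:
$$
\epsilon_1 \le C_r^2 \left\|(\hat{\nu} - \hat{B}^{\pi}\hat{\nu}) - (\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)\right\|_{\hat{D}}^2 \le C_r^2 \underbrace{\left(\left\|\hat{\zeta} - \hat{\zeta}^*\right\|_{\hat{D}}^2 + \left\|(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*) - (\hat{\nu} - \hat{B}^{\pi}\hat{\nu})\right\|_{\hat{D}}^2\right)}_{\hat{\epsilon}_{opt}} \tag{36}
$$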
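For $\epsilon_2$:
$$
\begin{aligned}
\epsilon_2 \le {} & 2 \underbrace{\left(\hat{E}_{d^{D}}\left[(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)(s, a) \cdot r(s, a)\right] - E_{d^{D}}\left[(\hat{\nu}^* - B^{\pi}\hat{\nu}^*)(s, a) \cdot r(s, a)\right]\right)^2}_{\epsilon_{stat}} \\
& + 2 \left(E_{d^{D}}\left[(\hat{\nu}^* - B^{\pi}\hat{\nu}^*)(s, a) \cdot r(s, a)\right] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot r(s, a)\right]\right)^2 \\
\le {} & 2\epsilon_{stat} + 2 \left(E_{d^{D}}\left[(\hat{\nu}^* - B^{\pi}\hat{\nu}^*)(s, a) \cdot r(s, a)\right] - E_{d^{D}}\left[(\nu^* - B^{\pi}\nu^*)(s, a) \cdot r(s, a)\right]\right)^2
\end{aligned}
\tag{37}
$$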
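The last step of (37) uses (14). The second term of (37) is bounded as
$$
\begin{aligned}
& \left(E_{d^{D}}\left[(\hat{\nu}^* - B^{\pi}\hat{\nu}^*)(s, a) \cdot r(s, a)\right] - E_{d^{D}}\left[(\nu^* - B^{\pi}\nu^*)(s, a) \cdot r(s, a)\right]\right)^2 \\
\le {} & E_{d^{D}}\left[r(s, a)^2 \cdot \left((\hat{\nu}^* - B^{\pi}\hat{\nu}^*)(s, a) - (\nu^* - B^{\pi}\nu^*)(s, a)\right)^2\right] \\
\le {} & C_r^2 \left\|(\hat{\nu}^* - B^{\pi}\hat{\nu}^*) - (\nu^* - B^{\pi}\nu^*)\right\|_{D}^2 \\
\le {} & \frac{2 C_r^2}{\eta} \left(J(\hat{\nu}^*) - J(\nu^*)\right)
\end{aligned}
\tag{38}
$$
The last inequality follows from the strong convexity of $f$ (with $\eta = 1$ for $f(x) = \frac{1}{2}x^2$) and the gradient of $J$ at $\nu^*$ being 0.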
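It remains to bound $J(\hat{\nu}^*) - J(\nu^*)$. The term $L(\hat{\nu}^*) - L(\nu_F^*)$ satisfies
$$
L(\hat{\nu}^*) - L(\nu_F^*) \le 2 \sup_{\nu \in F, \zeta \in H} \left|\hat{J}(\nu, \zeta) - J(\nu, \zeta)\right| = 2 \cdot \epsilon_{est}(F)
$$
using $\hat{L}(\hat{\nu}^*) - \hat{L}(\nu_F^*) \le 0$. The remaining term $J(\hat{\nu}^*) - L(\hat{\nu}^*)$ is handled separately.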
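The OPE estimate actually used by DualDICE is $\hat{E}_{d^{D}}[\hat{\zeta}(s, a) \cdot r]$. Its error decomposes as
$$
\begin{aligned}
& \left(\hat{E}_{d^{D}}[\hat{\zeta}(s, a) \cdot r] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot \bar{R}(s, a)\right]\right)^2 \\
\le {} & 2 \left(\hat{E}_{d^{D}}[\hat{\zeta}(s, a) \cdot r] - \hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot r\right]\right)^2 \\
& + 2 \left(\hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot r\right] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot \bar{R}(s, a)\right]\right)^2
\end{aligned}
\tag{45}
$$
The second term is bounded by (44). For the first term,
$$
\left(\hat{E}_{d^{D}}[\hat{\zeta}(s, a) \cdot r] - \hat{E}_{d^{D}}\left[(\hat{\nu} - \hat{B}^{\pi}\hat{\nu})(s, a) \cdot r\right]\right)^2 \le C_r^2 \left\|\hat{\zeta} - (\hat{\nu} - \hat{B}^{\pi}\hat{\nu})\right\|_{\hat{D}}^2 \tag{46}
$$
and
$$
\begin{aligned}
\left\|\hat{\zeta} - (\hat{\nu} - \hat{B}^{\pi}\hat{\nu})\right\|_{\hat{D}}^2 &= \left\|\hat{\zeta} - \hat{\zeta}^* + \hat{\zeta}^* - (\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*) + (\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*) - (\hat{\nu} - \hat{B}^{\pi}\hat{\nu})\right\|_{\hat{D}}^2 \\
&\le 4 \left\|\hat{\zeta} - \hat{\zeta}^*\right\|_{\hat{D}}^2 + 4 \left\|(\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*) - (\hat{\nu} - \hat{B}^{\pi}\hat{\nu})\right\|_{\hat{D}}^2 + 4 \left\|\hat{\zeta}^* - (\hat{\nu}^* - \hat{B}^{\pi}\hat{\nu}^*)\right\|_{\hat{D}}^2
\end{aligned}
\tag{47}
$$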
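The resulting bound involves $\epsilon_{opt}$, the factor $\frac{16 C_r^2}{\eta}$, and the combination
$$
\max\left(\kappa + \kappa \|B^{\pi}\|_{D,1},\ 1\right) \epsilon_{approx}(F) + \left(L + \frac{1+\gamma}{1-\gamma} C\right) \epsilon_{approx}(H),
$$
which is summarized as $\epsilon_{approx}(F, H)$. The remaining terms $\epsilon_r$, $\epsilon_{stat}$ and $\epsilon_{est}(F)$ are bounded next.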
4.2
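4.2.1 Bounding ϵr
$$
\begin{aligned}
E[\epsilon_r] &\le \left(\frac{1+\gamma}{1-\gamma}\right)^2 C^2\, E\left[\left(\frac{1}{N} \sum_{i=1}^{N} r_i - E\left[\frac{1}{N} \sum_{i=1}^{N} r_i\right]\right)^2\right] \\
&= \left(\frac{1+\gamma}{1-\gamma}\right)^2 C^2\, V\left(\frac{1}{N} \sum_{i=1}^{N} r_i\right) \\
&\le \frac{1}{N} \left(\frac{1+\gamma}{1-\gamma}\right)^2 C^2 \sup_{s,a} V(r \mid s, a) = O\left(\frac{1}{N}\right)
\end{aligned}
\tag{48}
$$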
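The last step of (48) relies on the variance of a sample mean shrinking like $1/N$. A minimal exact check is sketched below, assuming i.i.d. Bernoulli rewards purely for illustration (the bound itself only needs bounded conditional variance).

```python
from math import comb


def variance_of_sample_mean(p, n):
    """Exact variance of the mean of n i.i.d. Bernoulli(p) rewards."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * (k / n - p) ** 2
        for k in range(n + 1)
    )


p = 0.3                                  # hypothetical reward probability
var_10 = variance_of_sample_mean(p, 10)
var_40 = variance_of_sample_mean(p, 40)
```

Both values equal $p(1-p)/N$, so quadrupling $N$ divides the variance by four.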
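4.2.2 Bounding ϵest(F)
Define
$$
h_{\nu,\zeta}(s, a, s', a', s_0, a_0) = (\nu(s, a) - \gamma \nu(s', a'))\, \zeta(s, a) - f^*(\zeta(s, a)) - (1-\gamma)\nu(s_0, a_0) \tag{49}
$$
on the sample space $Z = \underbrace{S \times A \times S}_{d^{D}} \times \underbrace{A}_{\pi} \times \underbrace{S \times A}_{\beta \pi}$, with samples $Z_i = (s_i, a_i, s_i', a_i', s_0^i, a_0^i)$ and function class $G = h_{F \times H}$, the set of all $h_{\nu,\zeta}$ with $\nu \in F$, $\zeta \in H$.
Functions $\nu \in F$ and $\zeta \in H$ are bounded by $\frac{1}{1-\gamma} C$ and $C$ respectively, and $f^*(\cdot)$ is $L$-Lipschitz, so $h_{\nu,\zeta}$ is bounded by a constant $M_1$.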
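$$
\begin{aligned}
P\left(\sup_{\nu \in F, \zeta \in H} \left|\hat{J}(\nu, \zeta) - J(\nu, \zeta)\right| \ge \epsilon\right) &= P\left(\sup_{\nu \in F, \zeta \in H} \left|\frac{1}{N} \sum_{i=1}^{N} h_{\nu,\zeta}(Z_i) - E[h_{\nu,\zeta}]\right| \ge \epsilon\right) \\
&\le 8\, E\left[N_1\left(\frac{\epsilon}{8}, G, \{Z_i\}_{i=1}^{N}\right)\right] \exp\left(\frac{-N \epsilon^2}{512 M_1^2}\right).
\end{aligned}
\tag{51}
$$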
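To bound the covering number of $G$, note that
$$
\begin{aligned}
& \frac{1}{N} \sum_{i=1}^{N} \left|h_{\nu_1,\zeta_1}(Z_i) - h_{\nu_2,\zeta_2}(Z_i)\right| \\
\le {} & \frac{L + \frac{1+\gamma}{1-\gamma} C}{N} \sum_{i=1}^{N} \left|\zeta_1(s_i, a_i) - \zeta_2(s_i, a_i)\right| + \frac{C}{N} \sum_{i=1}^{N} \left|\nu_1(s_i, a_i) - \nu_2(s_i, a_i)\right| \\
& + \frac{\gamma C}{N} \sum_{i=1}^{N} \left|\nu_1(s_i', a_i') - \nu_2(s_i', a_i')\right| + \frac{(1-\gamma)}{N} \sum_{i=1}^{N} \left|\nu_1(s_0^i, a_0^i) - \nu_2(s_0^i, a_0^i)\right|
\end{aligned}
\tag{52}
$$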
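$$
\begin{aligned}
& N_1\left(\left(L + \frac{2 + \gamma - \gamma^2}{1-\gamma} C + (1-\gamma)\right) \epsilon', G, \{Z_i\}_{i=1}^{N}\right) \\
\le {} & N_1\left(\epsilon', H, \{s_i, a_i\}_{i=1}^{N}\right) N_1\left(\epsilon', F, \{s_i, a_i\}_{i=1}^{N}\right) N_1\left(\epsilon', F, \{s_i', a_i'\}_{i=1}^{N}\right) N_1\left(\epsilon', F, \{s_0^i, a_0^i\}_{i=1}^{N}\right)
\end{aligned}
\tag{53}
$$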
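The constant multiplying $\epsilon'$ in (53) is the sum of the four coefficients on the right-hand side of (52). The algebra can be spelled out as follows:

```latex
\begin{aligned}
&\Big(L + \tfrac{1+\gamma}{1-\gamma}\,C\Big) + C + \gamma C + (1-\gamma) \\
&\quad= L + \frac{(1+\gamma) + (1+\gamma)(1-\gamma)}{1-\gamma}\,C + (1-\gamma) \\
&\quad= L + \frac{2+\gamma-\gamma^2}{1-\gamma}\,C + (1-\gamma).
\end{aligned}
```

So an $\epsilon'$-cover of each of the four component classes yields a cover of $G$ at the scale appearing in (53).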
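$$
N_1\left(\left(L + \frac{2 + \gamma - \gamma^2}{1-\gamma} C + (1-\gamma)\right) \epsilon', G, \{Z_i\}_{i=1}^{N}\right) \le e^4 (D_F + 1)^3 (D_H + 1) \left(\frac{4 e M_1}{\epsilon'}\right)^{3 D_F + D_H} \tag{54}
$$
Choosing $\epsilon'$ so that the scale equals $\frac{\epsilon}{8}$:
$$
\begin{aligned}
N_1\left(\frac{\epsilon}{8}, G, \{Z_i\}_{i=1}^{N}\right) &\le e^4 (D_F + 1)^3 (D_H + 1) \left(\frac{32 \left(L + \frac{2 + \gamma - \gamma^2}{1-\gamma} C + (1-\gamma)\right) e M_1}{\epsilon}\right)^{3 D_F + D_H} \\
&:= C_1 \left(\frac{1}{\epsilon}\right)^{D_1}
\end{aligned}
\tag{55}
$$
where $C_1 = e^4 (D_F + 1)^3 (D_H + 1) \left(32 \left(L + \frac{2 + \gamma - \gamma^2}{1-\gamma} C + (1-\gamma)\right) e M_1\right)^{D_1}$ and $D_1 = 3 D_F + D_H$.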
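Substituting into (51):
$$
P\left(\sup_{\nu \in F, \zeta \in H} \left|\hat{J}(\nu, \zeta) - J(\nu, \zeta)\right| \ge \epsilon\right) \le 8 C_1 \left(\frac{1}{\epsilon}\right)^{D_1} \exp\left(\frac{-N \epsilon^2}{512 M_1^2}\right) \tag{56}
$$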
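Setting $\epsilon = \sqrt{\frac{C_2 \left(\log N + \log \frac{1}{\delta}\right)}{N}}$ with $C_2 = \max\left((8 C_1)^{\frac{2}{D_1}},\ 512 M_1 D_1,\ 512 M_1,\ 1\right)$ gives
$$
8 C_1 \left(\frac{1}{\epsilon}\right)^{D_1} \exp\left(\frac{-N \epsilon^2}{512 M_1^2}\right) \le \delta \tag{57}
$$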
4.2.3 Bounding ϵstat
$$
\epsilon_{stat} = O\left(\frac{\log N + \log \frac{1}{\delta}}{N}\right) \tag{58}
$$
4.3
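Combining (44) with the bounds above:
$$
E\left[\left(\hat{E}_{d^{D}}[\hat{\zeta}(s, a) \cdot \hat{r}(s, a)] - E_{d^{D}}\left[w_{\pi/D}(s, a) \cdot r(s, a)\right]\right)^2\right] = \tilde{O}\left(\epsilon_{approx}(F, H) + \epsilon_{opt} + \frac{1}{\sqrt{N}}\right) \tag{59}
$$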