2021_03_09_DualDice__Behavior_Agnostic_Estimation_of_Discounted_Stationary_Distribution_Corrections

The document presents DualDICE, a method for estimating discounted stationary distribution corrections in off-policy reinforcement learning. It discusses the core idea, derivation, and proof of the method, emphasizing its behavior-agnostic nature and its ability to handle multiple unknown behavior policies. The framework aims to improve off-policy estimation by addressing the challenges associated with discounted stationary distributions.

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections


Chen Gong

09 March 2021

Contents

1  Core idea
2  Background: Off-Policy Estimation
   2.1  Off-Policy
        2.1.1  Discounted Stationary Distribution
        2.1.2  Learning Stationary Distribution Corrections
        2.1.3  Off-Policy Estimation with Multiple Unknown Behavior Policies
3  DualDICE
   3.1  The Key Idea
   3.2  Derivation
        3.2.1  Change of Variables
        3.2.2  Fenchel Duality
        3.2.3  Extension to General Convex Functions
   3.3  The DualDICE Algorithm
   3.4  Theoretical Guarantee
4  Proof
   4.1  Error Decomposition
   4.2  Bounding the Error Terms
        4.2.1  Bounding ε_r
        4.2.2  Bounding ε_est(F)
        4.2.3  Bounding ε_stat
1 Core idea

DualDICE (Bo Dai and co-authors, Google Brain, NeurIPS 2019) estimates discounted stationary distribution ratio (DSDR) corrections: for each state-action pair (s, a), the ratio between its likelihood under the target policy's discounted occupancy and its likelihood under the distribution that generated the off-policy dataset. With these corrections in hand, reweighting the per-step rewards in the dataset yields an estimate of the target policy's value. The method is behavior-agnostic: it never needs to know, or estimate, the behavior policies that produced the data — it only needs samples of (s, a) pairs from the off-policy dataset.

2 Background: Off-Policy Estimation

We work with an MDP ⟨S, A, R, T, β⟩, where β is the initial-state distribution. Episode termination is modeled by an absorbing state s_T with T(s_T | s_T, a) = 1 and R(s_T, a) = 0.

2.1 Off-Policy

The value of a policy π is its normalized expected discounted reward:

$$\rho(\pi) := (1-\gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 \sim \beta,\ \forall t,\ a_t \sim \pi(s_t),\ r_t \sim R(s_t, a_t),\ s_{t+1} \sim T(s_t, a_t)\Big] \tag{1}$$

In the off-policy setting we only have a fixed dataset D : {(s, a, r, s′)} and cannot interact with the environment. If D consists of trajectories from a single known behavior policy µ, importance sampling (IS) reweights each trajectory's return by the product of per-step likelihood ratios:

$$\rho(\pi) \approx (1-\gamma)\,\mathbb{E}_{\tau\sim\mu}\Big[\Big(\prod_{t=0}^{H-1} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}\Big)\Big(\sum_{t=0}^{H-1} \gamma^t r_t\Big)\Big]$$

The variance of the product of ratios grows exponentially with the horizon H, which makes plain IS impractical for long-horizon offline RL.
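The exponential variance of the cumulative importance weight is easy to see numerically. Below is a minimal sketch on a toy repeated two-action problem (the policies and probabilities are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target policy pi and behavior policy mu over two actions; data comes from mu.
pi, mu = np.array([0.9, 0.1]), np.array([0.5, 0.5])

def weight_variance(H, n=100_000):
    """Variance of the trajectory-level IS weight prod_t pi(a_t)/mu(a_t)
    for horizon H, estimated from n trajectories simulated under mu."""
    acts = rng.choice(2, size=(n, H), p=mu)
    w = np.prod(pi[acts] / mu[acts], axis=1)
    return w.var()

# Each per-step ratio has mean 1, but E[(pi/mu)^2] = 1.64 > 1, so the
# variance grows like 1.64^H - 1: the "curse of horizon".
variances = [weight_variance(H) for H in (1, 5, 10)]
```

Even with mean-one weights at every step, the product's variance compounds multiplicatively with the horizon.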

To break this "curse of horizon", Qiang Liu et al. (Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation) apply importance weights to individual state-action pairs rather than to whole trajectories.

Let p_π(·) denote the distribution over trajectories τ = {s_t, a_t, r_t}_{t=0}^∞ induced by π. The value can be written

$$\rho(\pi) := \lim_{T\to\infty} \mathbb{E}_{\tau\sim p_\pi}\big[R^T(\tau)\big], \qquad R^T(\tau) := \Big(\sum_{t=0}^{T} \gamma^t r_t\Big)\Big/\Big(\sum_{t=0}^{T}\gamma^t\Big) \tag{2}$$

Here R^T(τ) is the normalized reward of the truncated trajectory τ_{0:T}; γ = 1 gives the average-reward criterion and γ ∈ (0, 1) the discounted one:

$$\text{Average: } R(\tau) := \lim_{T\to\infty}\frac{1}{T+1}\sum_{t=0}^{T} r_t, \qquad \text{Discounted: } R(\tau) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t r_t \tag{3}$$

where the factor (1−γ) = 1/∑_{t=0}^∞ γ^t normalizes the discounted sum. In off-policy estimation we want R^π but only observe trajectories τ_i = {s_t^i, a_t^i, r_t^i}_{t=0}^{T} collected by a behavior policy π_0.

2.1.1 Discounted Stationary Distribution

Let d^{π,t}(·) denote the distribution of s_t when s_0 ∼ β and actions are taken according to π. The states s_t whose rewards r_t contribute to ρ(π) in (1) are distributed according to the discounted average of these per-step distributions:

$$d^\pi(s) = \lim_{T\to\infty}\Big(\sum_{t=0}^{T}\gamma^t d^{\pi,t}(s)\Big)\Big/\Big(\sum_{t=0}^{T}\gamma^t\Big) \tag{4}$$

For γ ∈ (0, 1) the limit exists and equals

$$d^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr\big(s_t = s \mid s_0\sim\beta,\ \forall t,\ a_t\sim\pi(s_t),\ r_t\sim R(s_t,a_t),\ s_{t+1}\sim T(s_t,a_t)\big) \tag{5}$$

i.e. $d^\pi(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t d^{\pi,t}(s)$. For γ = 1, d^π(s) reduces to the usual stationary distribution over states s_t:

$$d^\pi(s) = \lim_{T\to\infty}\frac{1}{T+1}\sum_{t=0}^{T} d^{\pi,t}(s) = \lim_{t\to\infty} d^{\pi,t}(s)$$

With this definition, (1) becomes an expectation over a single state-action pair:

$$\rho(\pi) = \mathbb{E}_{(s,a)\sim d^\pi,\ r\sim R(s,a)}[r] \tag{6}$$

where (s, a) ∼ d^π means sampling from the state-action occupancy d^π(s, a) := d^π(s) π(a|s).

If the dataset D is instead distributed according to the behavior policy µ, the mismatch can be corrected:

$$\rho(\pi) = \mathbb{E}_{(s,a)\sim d^\mu,\ r\sim R(s,a)}\big[w_{\pi/\mu}(s,a)\cdot r\big] \tag{7}$$

where w_{π/µ}(s, a) = d^π(s,a) / d^µ(s,a) is the Discounted Stationary Distribution Correction. Given w_{π/µ} and m trajectories {s_t^i, a_t^i, r_t^i} ∼ µ (as in Bo Dai's setting), the self-normalized sample estimate is

$$\hat\rho(\pi) = \sum_{i=1}^{m}\sum_{t=0}^{T} w_t^i\, r_t^i, \qquad w_t^i := \frac{\gamma^t\, w_{\pi/\mu}(s_t^i, a_t^i)}{\sum_{t',i'} \gamma^{t'}\, w_{\pi/\mu}(s_{t'}^{i'}, a_{t'}^{i'})}$$

This weights individual state-action pairs rather than whole trajectories τ = {(s_t, a_t)}_{t=0}^{T}, which is exactly how Qiang Liu et al. (Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation) avoid the exponential variance of trajectory-level importance sampling.
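A sample version of the correction-weighted estimator above can be sketched as follows (the helper, its inputs, and the numbers are illustrative, not from the paper):

```python
import numpy as np

def ratio_ope(steps, gamma, w):
    """Self-normalized off-policy estimate of rho(pi): each logged step
    (t, s, a, r) gets weight gamma^t * w(s, a), normalized to sum to one."""
    weights = np.array([gamma**t * w[(s, a)] for (t, s, a, _) in steps])
    rewards = np.array([r for (*_, r) in steps])
    return weights @ rewards / weights.sum()

# Sanity check: if w is constant, the estimate reduces to the
# discount-weighted average reward of the logged steps.
steps = [(0, 0, 0, 1.0), (1, 0, 1, 2.0), (2, 1, 0, 3.0)]
w = {(0, 0): 2.0, (0, 1): 2.0, (1, 0): 2.0}
est = ratio_ope(steps, gamma=0.5, w=w)   # (1 + 0.5*2 + 0.25*3) / 1.75
```

Because the weights are normalized, any constant factor in w cancels; only the relative correction across state-action pairs matters.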
2.1.2 Learning Stationary Distribution Corrections

The occupancy d^π satisfies a Bellman-like recursion. Splitting off the t = 0 term and shifting the time index:

$$
\begin{aligned}
d^\pi(s') &= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t d^{\pi,t}(s')\\
&= (1-\gamma)\beta(s') + (1-\gamma)\sum_{t=1}^{\infty}\gamma^t d^{\pi,t}(s')\\
&= (1-\gamma)\beta(s') + (1-\gamma)\gamma\sum_{t=0}^{\infty}\gamma^t d^{\pi,t+1}(s')\\
&= (1-\gamma)\beta(s') + (1-\gamma)\gamma\sum_{t=0}^{\infty}\gamma^t\sum_{s} T^\pi(s'\mid s)\, d^{\pi,t}(s) \qquad //\; d^{\pi,t+1}(s') = \textstyle\sum_s T^\pi(s'\mid s)\, d^{\pi,t}(s)\\
&= (1-\gamma)\beta(s') + \gamma\sum_{s} T^\pi(s'\mid s)\Big((1-\gamma)\sum_{t=0}^{\infty}\gamma^t d^{\pi,t}(s)\Big)\\
&= (1-\gamma)\beta(s') + \gamma\sum_{s} T^\pi(s'\mid s)\, d^\pi(s)\\
&= (1-\gamma)\beta(s') + \gamma\sum_{s,a} T(s'\mid s,a)\,\pi(a\mid s)\, d^\pi(s)
\end{aligned} \tag{8}
$$

where T^π(s′|s) := Σ_a T(s′|s, a) π(a|s) is the state-to-state kernel under π. Using d^π(s) π(a|s) = w_{π/µ}(s) · (π(a|s)/µ(a|s)) · d^µ(s, a), the recursion (8) becomes a condition on the correction:

$$d^\mu(s')\,w_{\pi/\mu}(s') = (1-\gamma)\beta(s') + \gamma\sum_{s,a} T(s'\mid s,a)\,\frac{\pi(a\mid s)}{\mu(a\mid s)}\,w_{\pi/\mu}(s)\,d^\mu(s,a) \tag{9}$$

Equivalently, w_{π/µ} is characterized by a TD-style conditional-expectation constraint under the data distribution (up to normalization by d^µ(s′)):

$$\mathbb{E}_{(s_t,a_t,s_{t+1})\sim d^\mu}\big[\mathrm{TD}\big(s_t,a_t,s_{t+1}\mid w_{\pi/\mu}\big)\;\big|\;s_{t+1}=s'\big] = 0 \tag{10}$$

$$\mathrm{TD}\big(s,a,s'\mid w_{\pi/\mu}\big) := -\,w_{\pi/\mu}(s') + (1-\gamma)\beta(s') + \gamma\, w_{\pi/\mu}(s)\cdot\frac{\pi(a\mid s)}{\mu(a\mid s)} \tag{11}$$

So w_{π/µ} is the fixed point of a Bellman-like equation under d^µ — but note that, unlike the usual Bellman equation for values, the recursion (8) propagates probability mass forward from s to s′. The practical drawback is visible in (11): evaluating the TD residual requires the per-step ratios π(a|s)/µ(a|s), i.e. explicit knowledge of the behavior policy µ.
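The recursion (8) is easy to verify numerically in the tabular case, where the discounted occupancy has the closed form d^π = (1−γ)(I − γ(T^π)^⊤)^{−1}β. A sketch on a random kernel (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma = 4, 0.9

P = rng.random((S, S)); P /= P.sum(1, keepdims=True)   # T^pi(s'|s), rows sum to 1
beta = rng.random(S); beta /= beta.sum()               # initial distribution

# Closed form of Eq. (5): d = (1-gamma) * sum_t gamma^t (P^T)^t beta
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P.T, beta)

# Right-hand side of the recursion (8): (1-gamma) beta(s') + gamma sum_s P(s'|s) d(s)
rhs = (1 - gamma) * beta + gamma * P.T @ d
```

The two sides agree, and d is a proper distribution (it sums to one) because the geometric series is normalized by (1−γ).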

2.1.3 Off-Policy Estimation with Multiple Unknown Behavior Policies

In practice the dataset D may be collected by several behavior policies, none of which is known. Let d^D denote the (unknown) distribution from which the transitions in D are drawn. The goal is still

$$\rho(\pi) = \mathbb{E}_{(s,a,r)\sim D}\big[w_{\pi/D}(s,a)\cdot r\big] \tag{12}$$

but now the correction w_{π/D} must be estimated without access to any behavior policy.

Assumption 1. Whenever d^π(s, a) > 0 we also have d^D(s, a) > 0, and there exists C < ∞ such that ∥w_{π/D}∥_∞ ≤ C.

3 DualDICE

DualDICE estimates the Discounted Stationary Distribution Corrections w_{π/D}(s,a) = d^π(s,a) / d^D(s,a) given only a dataset of transitions D = {(s, a, r, s′)} ∼ d^D, with no knowledge of how D was generated.

3.1 The Key Idea

Consider the following optimization over functions ν : S × A → ℝ:

$$\min_{\nu: S\times A\to\mathbb{R}} J(\nu) := \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^D}\big[(\nu - \mathcal{B}^\pi\nu)(s,a)^2\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big] \tag{13}$$

where B^π is the policy backup operator with zero reward, B^πν(s,a) = γ E_{s′∼T(s,a), a′∼π(s′)}[ν(s′,a′)]. Without the second (linear) term, the minimizer of (13) would simply be ν* ≡ 0; the linear term pushes ν* to be positive where the target policy puts its mass. At the optimum,

$$(\nu^* - \mathcal{B}^\pi\nu^*)(s,a) = w_{\pi/D}(s,a) \tag{14}$$

All expectations in (13) are over distributions we can actually sample: d^D (the dataset), β (initial states), and π (the target policy). In particular, no behavior policy appears anywhere.
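In the tabular case, (13) is a quadratic in ν and can be minimized exactly, which lets us check (14) directly. A small sketch on a random MDP (for the purpose of computing a ground-truth ratio, d^D is taken to be the occupancy of a known behavior policy µ — an assumption made only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 3, 2, 0.9

T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)
beta = rng.random(S); beta /= beta.sum()
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
mu = rng.random((S, A)); mu /= mu.sum(-1, keepdims=True)

def kernel(policy):
    # State-action kernel P[(s,a),(s',a')] = T(s'|s,a) * policy(a'|s')
    return np.einsum('xay,yb->xayb', T, policy).reshape(S * A, S * A)

def occupancy(policy):
    b = (beta[:, None] * policy).ravel()
    return (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * kernel(policy).T, b)

d_pi, d_D = occupancy(pi), occupancy(mu)
w = d_pi / d_D                              # ground-truth correction ratio

# J(nu) of Eq. (13) is 0.5 nu^T A^T D A nu - (1-gamma) b_pi^T nu, with
# A = I - gamma P_pi and D = diag(d_D); its minimizer solves the normal equations.
Amat = np.eye(S * A) - gamma * kernel(pi)   # nu  ->  nu - B^pi nu
b_pi = (beta[:, None] * pi).ravel()
nu_star = np.linalg.solve(Amat.T @ (d_D[:, None] * Amat), (1 - gamma) * b_pi)

w_hat = Amat @ nu_star                      # Eq. (14): should equal w_pi/D
```

The check works because the occupancy recursion gives A^⊤ d^π = (1−γ) b_π, so the stationarity condition A^⊤ D A ν* = (1−γ) b_π forces D A ν* = d^π, i.e. (ν* − B^πν*) = d^π / d^D.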

3.2 Derivation

Start from a scalar observation: for m ∈ ℝ_{>0} and n ∈ ℝ_{≥0}, the problem min_x J(x) := ½ m x² − n x has the unique minimizer x* = n/m. Apply this pointwise, with m = d^D(s,a) and n = d^π(s,a), over a compact range C ⊂ ℝ (e.g. [0, C]):

$$\min_{x: S\times A\to C} J_1(x) := \frac{1}{2}\sum_{s,a} d^D(s,a)\,x(s,a)^2 - \sum_{s,a} d^\pi(s,a)\,x(s,a) = \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^D}\big[x(s,a)^2\big] - \mathbb{E}_{(s,a)\sim d^\pi}\big[x(s,a)\big] \tag{15}$$

whose minimizer is exactly the correction: x*(s,a) = w_{π/D}(s,a) = d^π(s,a) / d^D(s,a) for all (s,a) ∈ S × A. The difficulty is the second expectation: it requires samples (s, a) ∼ d^π, which we do not have.

3.2.1 Change of Variables

Define ν : S × A → ℝ implicitly by

$$\nu(s,a) := x(s,a) + \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\big[\nu(s',a')\big], \quad \forall (s,a)\in S\times A \tag{16}$$

Since x(s,a) ∈ C is bounded and γ ∈ [0, 1), this recursion has a unique bounded solution: ν is the "value function" of the pseudo-reward x under π. Let β_t denote the state distribution at step t under π:

$$\beta_t(s) := \Pr\big(s = s_t \mid s_0\sim\beta,\ a_k\sim\pi(s_k),\ s_{k+1}\sim T(s_k,a_k)\ \text{for } 0\le k<t\big) \tag{17}$$

Then the troublesome expectation telescopes:

$$
\begin{aligned}
\mathbb{E}_{(s,a)\sim d^\pi}[x(s,a)] &= \mathbb{E}_{(s,a)\sim d^\pi}\big[\nu(s,a) - \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}[\nu(s',a')]\big]\\
&= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{s\sim\beta_t,\,a\sim\pi(s)}\big[\nu(s,a) - \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}[\nu(s',a')]\big]\\
&= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{s\sim\beta_t,\,a\sim\pi(s)}[\nu(s,a)] - (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t+1}\,\mathbb{E}_{s\sim\beta_{t+1},\,a\sim\pi(s)}[\nu(s,a)]\\
&= (1-\gamma)\,\mathbb{E}_{s\sim\beta,\,a\sim\pi(s)}[\nu(s,a)]
\end{aligned} \tag{18}
$$

so the expectation over d^π(s,a) is replaced by one over initial states. In operator notation (16) reads (ν − B^πν)(s,a) = x(s,a), hence at the optimum

$$(\nu^* - \mathcal{B}^\pi\nu^*)(s,a) = x^*(s,a) = w_{\pi/D}(s,a) \tag{19}$$

Substituting x = ν − B^πν into (15) yields exactly (13). Two issues remain with the objective in ν:

1. The squared term (ν − B^πν)(s,a)² contains a conditional expectation inside the square, so a single-transition estimate of its gradient is biased (the double-sampling problem).
2. Even once ν* is found, reading off the correction (ν* − B^πν*)(s,a) again involves an expectation over the transition kernel.
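The telescoping identity (18) can also be checked numerically: in the tabular case, (16) reads ν = (I − γP_π)^{−1} x, and E_{d^π}[x] should equal (1−γ) E_{β,π}[ν]. A sketch on a random MDP (all quantities illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 3, 2, 0.8

T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)
beta = rng.random(S); beta /= beta.sum()
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)

# State-action kernel under pi and the discounted occupancy d^pi
P = np.einsum('xay,yb->xayb', T, pi).reshape(S * A, S * A)
b = (beta[:, None] * pi).ravel()               # initial (s,a) distribution
d_pi = (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * P.T, b)

x = rng.random(S * A)                          # an arbitrary function x(s,a)
nu = np.linalg.solve(np.eye(S * A) - gamma * P, x)   # Eq. (16)

lhs = d_pi @ x                                 # E_{d^pi}[x]
rhs = (1 - gamma) * (b @ nu)                   # (1-gamma) E_{beta,pi}[nu]
```

The two sides agree for any x, which is exactly what lets (15) be rewritten as (13).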

3.2.2 Fenchel Duality

To remove the bias caused by B^π inside the square, replace the square by its variational (Fenchel) form. Any convex f satisfies f(x) = max_ζ (x·ζ − f*(ζ)), where f* is the Fenchel conjugate of f. For f(x) = ½x² the conjugate is f*(ζ) = ½ζ², so (13) becomes

$$\min_{\nu: S\times A\to\mathbb{R}} J(\nu) := \mathbb{E}_{(s,a)\sim d^D}\Big[\max_{\zeta}\ (\nu - \mathcal{B}^\pi\nu)(s,a)\cdot\zeta - \frac{1}{2}\zeta^2\Big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big] \tag{20}$$

The inner max is taken per state-action pair; by the interchangeability principle [1] it can be pulled outside the expectation as a max over a function ζ : S × A → ℝ, giving a min-max problem:

$$
\begin{aligned}
\min_{\nu: S\times A\to\mathbb{R}}\ \max_{\zeta: S\times A\to\mathbb{R}}\ J(\nu,\zeta) :=\ & \mathbb{E}_{(s,a,s')\sim d^D,\,a'\sim\pi(s')}\big[(\nu(s,a) - \gamma\,\nu(s',a'))\,\zeta(s,a) - \zeta(s,a)^2/2\big]\\
&- (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big]
\end{aligned} \tag{21}
$$

Crucially, in (21) the next-state expectation no longer sits inside a square, so unbiased stochastic gradients are available from single transitions. By the KKT (first-order) conditions, for any fixed ν the inner maximizer is ζ_ν* = ν − B^πν, and at the saddle point (ν*, ζ*) of (21):

$$\zeta^*(s,a) = (\nu^* - \mathcal{B}^\pi\nu^*)(s,a) = w_{\pi/D}(s,a) \tag{22}$$

To summarize:

1. All expectations in (21) are over d^D, π, and β, so the objective can be estimated from data without knowing any behavior policy.
2. (21) is a min-max problem solved jointly over ν and ζ.
3. The saddle-point solution ζ*(s,a) is the desired correction; plugging ζ*(s,a) into (12) gives the off-policy estimate.

Related reading (Zhihu answers, in Chinese):
https://www.zhihu.com/question/268862097/answer/371323504
https://www.zhihu.com/question/263754316/answer/1290371489

[1] Interchangeability principle: http://www.optimization-online.org/DB_FILE/2017/04/5983.pdf
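The variational identity f(x) = max_ζ (xζ − f*(ζ)) used above is easy to check numerically for f(x) = ½x², f*(ζ) = ½ζ². An illustrative grid search (the grid and test points are arbitrary):

```python
import numpy as np

zgrid = np.linspace(-10.0, 10.0, 200_001)        # dense grid of dual variables
xs = np.array([-2.0, 0.5, 3.0])

# f(x) = max_zeta (x*zeta - f*(zeta)) with f*(zeta) = zeta^2 / 2
vals = xs[:, None] * zgrid - zgrid**2 / 2
f_dual = vals.max(axis=1)                        # should equal f(x) = x^2 / 2
zeta_star = zgrid[vals.argmax(axis=1)]           # maximizer: zeta* = f'(x) = x
```

Note that the maximizing ζ recovers f′(x) — this is precisely the mechanism by which ζ* in (22) recovers the correction ratio.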

3.2.3 Extension to General Convex Functions

The square in (ν − B^πν)(s,a)² can be replaced by any strictly convex, differentiable f : ℝ → ℝ whose gradient is bounded on the relevant range [0, C]. The scalar problem becomes

$$\min_x J(x) := m\cdot f(x) - n\,x \tag{23}$$

whose first-order condition ∇_x J(x) = 0 gives f′(x*) = n/m. The analogue of (15) is

$$\min_{x: S\times A\to C} J_1(x) := \mathbb{E}_{(s,a)\sim d^D}\big[f(x(s,a))\big] - \mathbb{E}_{(s,a)\sim d^\pi}\big[x(s,a)\big] \tag{24}$$

with optimality condition f′(x*(s,a)) = w_{π/D}(s,a). The same change of variables ν := x + B^πν turns (24) into

$$\min_{\nu: S\times A\to\mathbb{R}} J(\nu) := \mathbb{E}_{(s,a)\sim d^D}\big[f\big((\nu - \mathcal{B}^\pi\nu)(s,a)\big)\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big] \tag{25}$$

and applying Fenchel duality to f gives

$$
\begin{aligned}
\min_{\nu: S\times A\to\mathbb{R}}\ \max_{\zeta: S\times A\to\mathbb{R}}\ J(\nu,\zeta) :=\ & \mathbb{E}_{(s,a,s')\sim d^D,\,a'\sim\pi(s')}\big[(\nu(s,a) - \gamma\,\nu(s',a'))\,\zeta(s,a) - f^*(\zeta(s,a))\big]\\
&- (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big]
\end{aligned} \tag{26}
$$

By the KKT conditions, for any ν the inner maximizer ζ_ν* satisfies

$$f^{*\prime}\big(\zeta_\nu^*(s,a)\big) = (\nu - \mathcal{B}^\pi\nu)(s,a) \tag{27}$$

Since f′ and the conjugate derivative f*′ are inverse functions of each other, at the saddle point (ζ*, ν*) the dual variable directly recovers w_{π/D}(s,a):

$$\zeta^*(s,a) = f'\big((\nu^* - \mathcal{B}^\pi\nu^*)(s,a)\big) = f'\big(x^*(s,a)\big) = w_{\pi/D}(s,a) \tag{28}$$

So after solving (26), w_{π/D}(s,a) can be read off in two ways: from ν* via f′((ν* − B^πν*)(s,a)), or directly from ζ*(s,a); in practice the latter is used.

3.3 The DualDICE Algorithm

[Algorithm: DualDICE. Initialize ν and ζ; repeatedly sample minibatches of transitions from D, initial states from β, and actions from π, and perform simultaneous stochastic gradient descent on ν and ascent on ζ for the saddle-point objective (21)/(26); return ζ̂ as the estimated correction.]
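A tabular sketch of the saddle-point computation (not the paper's neural-network implementation): in the tabular quadratic case the inner maximization over ζ has the closed form ζ = ν − B^πν, so we can alternate an exact ζ update with a gradient step on ν; DualDICE proper uses stochastic gradients for both players. All quantities below are illustrative, and d^D is taken as the occupancy of a known µ only to obtain a ground truth.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.7

T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)
beta = rng.random(S); beta /= beta.sum()
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
mu = rng.random((S, A)); mu /= mu.sum(-1, keepdims=True)

def kernel(policy):
    return np.einsum('xay,yb->xayb', T, policy).reshape(S * A, S * A)

def occupancy(policy):
    b = (beta[:, None] * policy).ravel()
    return (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * kernel(policy).T, b)

d_pi, d_D = occupancy(pi), occupancy(mu)   # d_D: data distribution (from mu)
w = d_pi / d_D                             # ground truth, for checking only

Amat = np.eye(S * A) - gamma * kernel(pi)  # nu  ->  nu - B^pi nu
b_pi = (beta[:, None] * pi).ravel()

nu = np.zeros(S * A)
lr = 0.3
for _ in range(300_000):
    zeta = Amat @ nu                                          # exact inner max
    nu -= lr * (Amat.T @ (d_D * zeta) - (1 - gamma) * b_pi)   # descent on J(nu, zeta)
zeta = Amat @ nu                                              # Eq. (22): zeta* = w
```

Because J is quadratic here, the alternating scheme reduces to gradient descent on J(ν), which converges to ν*; reading off ζ then yields the correction ratio without ever touching µ inside the updates.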

3.4 Theoretical Guarantee

Given N samples {(s_i, a_i, r_i, s′_i)}_{i=1}^N ∼ d^D and {s_0^i}_{i=1}^N ∼ β, sample a′_i ∼ π(s′_i) and a_0^i ∼ π(s_0^i) for i = 1, …, N, and let Ê_{d^D} denote the empirical expectation over these samples. Take f(x) = ½x², restrict ν and ζ to function classes F and H, and let (ν̂, ζ̂) be the output of an optimization algorithm OPT applied to the empirical version of (21). The resulting off-policy estimate (OPE) Ê_{d^D}[ζ̂(s,a)·r] satisfies

$$\mathbb{E}\Big[\big(\hat{\mathbb{E}}_{d^D}[\hat\zeta(s,a)\cdot r] - \rho(\pi)\big)^2\Big] = \tilde{O}\Big(\epsilon_{\mathrm{approx}}(F,H) + \epsilon_{\mathrm{opt}} + \frac{1}{\sqrt{N}}\Big) \tag{29}$$

where ε_opt is the optimization error of OPT and ε_approx(F, H) measures how well F and H approximate the true solutions. As N → ∞ the statistical term 1/√N vanishes, leaving only the approximation error ε_approx(F, H) and the optimization error ε_opt.

4 Proof

This section sketches the proof of (29). Throughout, assume: (i) every ν ∈ F satisfies ∥ν∥_∞ ≤ C, so ∥ν − B^πν∥_∞ ≤ ((1+γ)/(1−γ))·C; (ii) ∥w_{π/D}∥_∞ ≤ C; (iii) rewards are bounded, ∥r̂(s,a)∥_∞ ≤ C_r, i.e. r ∈ [−C_r, C_r].

Recall the DualDICE saddle objective

$$J(\nu,\zeta) = \mathbb{E}_{(s,a,s')\sim d^D,\,a'\sim\pi(s')}\big[(\nu(s,a)-\gamma\,\nu(s',a'))\,\zeta(s,a) - \zeta(s,a)^2/2\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}[\nu(s_0,a_0)] \tag{30}$$

Maximizing over ζ gives back

$$J(\nu) = \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^D}\big[(\nu - \mathcal{B}^\pi\nu)(s,a)^2\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}[\nu(s_0,a_0)] \tag{31}$$

Notation:

1. Ĵ(ν, ζ) is the empirical version of J(ν, ζ); (ν̂*, ζ̂*) denotes the saddle point of J restricted to F × H.
2. ν_F* = argmin_{ν∈F} J(ν) and ν* = argmin_{ν: S×A→ℝ} J(ν).
3. L(ν) = max_{ζ∈H} J(ν, ζ), L̂(ν) = max_{ζ∈H} Ĵ(ν, ζ), ℓ(ζ) = min_{ν∈F} J(ν, ζ), ℓ̂(ζ) = min_{ν∈F} Ĵ(ν, ζ).
4. Ĵ(ν, ζ) is built from the N samples {(s_i, a_i, r_i, s′_i)}_{i=1}^N ∼ d^D and {s_0^i}_{i=1}^N ∼ β with a′_i ∼ π(s′_i), a_0^i ∼ π(s_0^i); (ν̂, ζ̂) denotes the optimizer's output on Ĵ.
5. R̄(s,a) = E[r | s, a], so that ρ(π) = E_{d^D}[w_{π/D}(s,a)·R̄(s,a)].  (32)

The proof first shows that (ν̂ − B̂^πν̂)(s,a) is close to w_{π/D}(s,a), where B̂^π is the empirical Bellman backup built from samples of d^D and π; DualDICE actually outputs ζ̂(s,a) as the estimate of w_{π/D}(s,a), and the gap between the two is handled at the end.

4.1 Error Decomposition

We first bound the error of (ν̂ − B̂^πν̂)(s,a) as an estimate of w_{π/D}(s,a). Adding and subtracting intermediate terms,

$$
\begin{aligned}
&\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot r\big] - \mathbb{E}_{d^D}\big[w_{\pi/D}(s,a)\cdot\bar R(s,a)\big]\Big)^2\\
&\le 4\underbrace{\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot r\big] - \hat{\mathbb{E}}_{d^D}\big[(\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot\bar R(s,a)\big]\Big)^2}_{\epsilon_r}\\
&\quad + 4\underbrace{\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot\bar R(s,a)\big] - \hat{\mathbb{E}}_{d^D}\big[(\hat\nu^* - \hat{\mathcal{B}}^\pi\hat\nu^*)(s,a)\cdot\bar R(s,a)\big]\Big)^2}_{\epsilon_1}\\
&\quad + 4\underbrace{\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu^* - \hat{\mathcal{B}}^\pi\hat\nu^*)(s,a)\cdot\bar R(s,a)\big] - \mathbb{E}_{d^D}\big[w_{\pi/D}(s,a)\cdot\bar R(s,a)\big]\Big)^2}_{\epsilon_2}
\end{aligned} \tag{33}
$$

using the elementary inequality

$$
\begin{aligned}
(a-b)^2 &= (a-c+c-d+d-b)^2\\
&= (a-c)^2+(c-d)^2+(d-b)^2 + 2(a-c)(c-d) + 2(a-c)(d-b) + 2(c-d)(d-b)\\
&\le 3(a-c)^2 + 3(c-d)^2 + 3(d-b)^2\\
&\le 4(a-c)^2 + 4(c-d)^2 + 4(d-b)^2
\end{aligned} \tag{34}
$$

The reward-noise term ε_r is controlled by the boundedness of ν̂ − B̂^πν̂:

$$\epsilon_r = \Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu-\hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot\big(r(s,a)-\bar R(s,a)\big)\big]\Big)^2 \le \Big(\frac{1+\gamma}{1-\gamma}\Big)^2 C^2\,\Big(\hat{\mathbb{E}}_{d^D}[r(s,a)] - \hat{\mathbb{E}}_{d^D}[\bar R(s,a)]\Big)^2 \tag{35}$$

The term ε_1 is an optimization error:

$$\epsilon_1 \le C_r^2\,\big\|(\hat\nu-\hat{\mathcal{B}}^\pi\hat\nu) - (\hat\nu^*-\hat{\mathcal{B}}^\pi\hat\nu^*)\big\|^2_{\hat D} \le \underbrace{C_r^2\Big(\big\|\hat\zeta - \hat\zeta^*\big\|_{\hat D} + \big\|(\hat\nu^*-\hat{\mathcal{B}}^\pi\hat\nu^*) - (\hat\nu-\hat{\mathcal{B}}^\pi\hat\nu)\big\|_{\hat D}\Big)^2}_{=:\ \hat\epsilon_{\mathrm{opt}}} \tag{36}$$

The term ε_2 splits into a statistical part and an approximation part:

$$
\begin{aligned}
\epsilon_2 &\le 2\underbrace{\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu^*-\hat{\mathcal{B}}^\pi\hat\nu^*)(s,a)\cdot\bar R(s,a)\big] - \mathbb{E}_{d^D}\big[(\hat\nu^*-\mathcal{B}^\pi\hat\nu^*)(s,a)\cdot\bar R(s,a)\big]\Big)^2}_{\epsilon_{\mathrm{stat}}}\\
&\quad + 2\Big(\mathbb{E}_{d^D}\big[(\hat\nu^*-\mathcal{B}^\pi\hat\nu^*)(s,a)\cdot\bar R(s,a)\big] - \mathbb{E}_{d^D}\big[(\nu^*-\mathcal{B}^\pi\nu^*)(s,a)\cdot\bar R(s,a)\big]\Big)^2
\end{aligned} \tag{37}
$$

where the second line uses (14), i.e. (ν* − B^πν*)(s,a) = w_{π/D}(s,a). Its second term is controlled by strong convexity:

$$
\begin{aligned}
&\Big(\mathbb{E}_{d^D}\big[(\hat\nu^*-\mathcal{B}^\pi\hat\nu^*)(s,a)\cdot\bar R(s,a)\big] - \mathbb{E}_{d^D}\big[(\nu^*-\mathcal{B}^\pi\nu^*)(s,a)\cdot\bar R(s,a)\big]\Big)^2\\
&\le \mathbb{E}_{d^D}\Big[\bar R(s,a)^2\cdot\big((\hat\nu^*-\mathcal{B}^\pi\hat\nu^*)(s,a) - (\nu^*-\mathcal{B}^\pi\nu^*)(s,a)\big)^2\Big]\\
&\le C_r^2\,\big\|(\hat\nu^*-\mathcal{B}^\pi\hat\nu^*) - (\nu^*-\mathcal{B}^\pi\nu^*)\big\|_{D}^2\\
&\le \frac{2C_r^2}{\eta}\,\big(J(\hat\nu^*) - J(\nu^*)\big)
\end{aligned} \tag{38}
$$

where the last step uses the η-strong convexity of f (η = 1 for f(x) = ½x²) and the fact that the gradient of J vanishes at ν*. To bound J(ν̂*) − J(ν*), decompose:

$$J(\hat\nu^*) - J(\nu^*) = \big(J(\hat\nu^*) - L(\hat\nu^*)\big) + \big(L(\hat\nu^*) - L(\nu_F^*)\big) + \big(L(\nu_F^*) - J(\nu_F^*)\big) + \big(J(\nu_F^*) - J(\nu^*)\big) \tag{39}$$

Approximation error of F (using the κ-Lipschitz property of f on the relevant range):

$$
\begin{aligned}
J(\nu_F^*) - J(\nu^*) &= \mathbb{E}_{D}\big[f(\nu_F^* - \mathcal{B}^\pi\nu_F^*) - f(\nu^* - \mathcal{B}^\pi\nu^*)\big] - \mathbb{E}_{\beta\pi}\big[\nu_F^* - \nu^*\big]\\
&\le \kappa\,\|\nu_F^* - \nu^*\|_{D,1} + \kappa\,\|\mathcal{B}^\pi(\nu_F^* - \nu^*)\|_{D,1} + \|\nu_F^* - \nu^*\|_{\beta\pi,1}\\
&\le \max\big(\kappa + \kappa\|\mathcal{B}^\pi\|_{D,1},\, 1\big)\big(\|\nu_F^* - \nu^*\|_{D,1} + \|\nu_F^* - \nu^*\|_{\beta\pi,1}\big)\\
&\le \max\big(\kappa + \kappa\|\mathcal{B}^\pi\|_{D,1},\, 1\big)\cdot\epsilon_{\mathrm{approx}}(F)
\end{aligned} \tag{40}
$$

where ε_approx(F) := inf_{ν_F∈F}(∥ν_F − ν*∥_{D,1} + ∥ν_F − ν*∥_{βπ,1}). Next, since H is a subset of all functions S×A→ℝ, restricting the max can only decrease the value:

$$L(\nu_F^*) - J(\nu_F^*) = \max_{\zeta\in H} J(\nu_F^*, \zeta) - \max_{\zeta: S\times A\to\mathbb{R}} J(\nu_F^*, \zeta) \le 0 \tag{41}$$

The estimation-error term:

$$
\begin{aligned}
L(\hat\nu^*) - L(\nu_F^*) &= \big(L(\hat\nu^*) - \hat L(\hat\nu^*)\big) + \big(\hat L(\hat\nu^*) - \hat L(\nu_F^*)\big) + \big(\hat L(\nu_F^*) - L(\nu_F^*)\big)\\
&\le \big(L(\hat\nu^*) - \hat L(\hat\nu^*)\big) + \big(\hat L(\nu_F^*) - L(\nu_F^*)\big)\\
&\le 2\sup_{\nu\in F}\big|L(\nu) - \hat L(\nu)\big|\\
&= 2\sup_{\nu\in F}\Big|\max_{\zeta\in H} J(\nu,\zeta) - \max_{\zeta\in H}\hat J(\nu,\zeta)\Big|\\
&\le 2\sup_{\nu\in F,\,\zeta\in H}\big|\hat J(\nu,\zeta) - J(\nu,\zeta)\big|\\
&= 2\,\epsilon_{\mathrm{est}}(F)
\end{aligned} \tag{42}
$$

using L̂(ν̂*) − L̂(ν_F*) ≤ 0 (ν̂* minimizes L̂ over F). Finally,

$$J(\hat\nu^*) - L(\hat\nu^*) = \max_{\zeta: S\times A\to\mathbb{R}} J(\hat\nu^*,\zeta) - \max_{\zeta\in H} J(\hat\nu^*,\zeta) \le \Big(L + \frac{1+\gamma}{1-\gamma}\,C\Big)\underbrace{\|\zeta_H - \zeta^*\|_{D,1}}_{\le\,\epsilon_{\mathrm{approx}}(H)} \tag{43}$$

where ε_approx(H) := inf_{ζ_H∈H}(∥ζ_H − ζ*∥_{D,1} + ∥ζ_H − ζ*∥_{βπ,1}). Combining (35)–(43):

$$
\begin{aligned}
\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot \hat r(s,a)\big] - \rho(\pi)\Big)^2
\le\ & \frac{16 C_r^2}{\eta}\Big(\max\big(\kappa+\kappa\|\mathcal{B}^\pi\|_{D,1},\,1\big)\,\epsilon_{\mathrm{approx}}(F) + \Big(L+\frac{1+\gamma}{1-\gamma}\,C\Big)\epsilon_{\mathrm{approx}}(H)\Big)\\
&+ 4\epsilon_r + 8\epsilon_{\mathrm{stat}} + \frac{32 C_r^2}{\eta}\,\epsilon_{\mathrm{est}}(F) + 4\hat\epsilon_{\mathrm{opt}}
\end{aligned} \tag{44}
$$

The OPE estimate actually returned by DualDICE is Ê_{d^D}[ζ̂(s,a)·r], so we also need to relate ζ̂ to ν̂ − B̂^πν̂:

$$
\begin{aligned}
\Big(\hat{\mathbb{E}}_{d^D}[\hat\zeta(s,a)\cdot r] - \mathbb{E}_{d^D}\big[w_{\pi/D}(s,a)\cdot\bar R(s,a)\big]\Big)^2 \le\ & 2\Big(\hat{\mathbb{E}}_{d^D}[\hat\zeta(s,a)\cdot r] - \hat{\mathbb{E}}_{d^D}\big[(\hat\nu-\hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot r\big]\Big)^2\\
&+ 2\Big(\hat{\mathbb{E}}_{d^D}\big[(\hat\nu-\hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot r\big] - \mathbb{E}_{d^D}\big[w_{\pi/D}(s,a)\cdot\bar R(s,a)\big]\Big)^2
\end{aligned} \tag{45}
$$

The second term is exactly the quantity bounded in (44); the first is bounded by

$$\Big(\hat{\mathbb{E}}_{d^D}[\hat\zeta(s,a)\cdot r] - \hat{\mathbb{E}}_{d^D}\big[(\hat\nu-\hat{\mathcal{B}}^\pi\hat\nu)(s,a)\cdot r\big]\Big)^2 \le C_r^2\,\big\|\hat\zeta - (\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)\big\|^2_{\hat D} \tag{46}$$

and, by the same add-and-subtract trick as in (34),

$$\big\|\hat\zeta - (\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)\big\|^2_{\hat D} \le 4\big\|\hat\zeta - \hat\zeta^*\big\|^2_{\hat D} + 4\big\|(\hat\nu^* - \hat{\mathcal{B}}^\pi\hat\nu^*) - (\hat\nu - \hat{\mathcal{B}}^\pi\hat\nu)\big\|^2_{\hat D} + 4\big\|\hat\zeta^* - (\hat\nu^* - \hat{\mathcal{B}}^\pi\hat\nu^*)\big\|^2_{\hat D} \tag{47}$$

The first two terms on the right are optimization error (they vanish as the optimizer converges, contributing to ε_opt → 0), and the last is controlled by the approximation power of H. Altogether the error splits into the approximation terms max(κ + κ∥B^π∥_{D,1}, 1)·ε_approx(F) + (L + ((1+γ)/(1−γ))C)·ε_approx(H), collected as ε_approx(F, H), the optimization error ε_opt, and the statistical terms ε_r, ε_stat, ε_est(F), which are bounded next.

4.2 Bounding the Error Terms

4.2.1 Bounding ε_r

ε_r involves only the deviation of the sampled rewards from their means, so it is a variance-of-the-sample-mean term:

$$
\begin{aligned}
\mathbb{E}[\epsilon_r] &\le \Big(\frac{1+\gamma}{1-\gamma}\Big)^2 C^2\ \mathbb{E}\bigg[\Big(\frac{1}{N}\sum_{i=1}^N r_i - \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^N r_i\Big]\Big)^2\bigg]\\
&= \Big(\frac{1+\gamma}{1-\gamma}\Big)^2 C^2\ \mathbb{V}\Big(\frac{1}{N}\sum_{i=1}^N r_i\Big)\\
&\le \frac{1}{N}\Big(\frac{1+\gamma}{1-\gamma}\Big)^2 C^2 \sup_{s,a}\mathbb{V}(r\mid s,a) = O\Big(\frac{1}{N}\Big)
\end{aligned} \tag{48}
$$
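The O(1/N) rate in (48) is just the variance of a sample mean. A quick illustrative check with i.i.d. uniform "rewards" (the distribution is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

def var_of_mean(N, reps=20_000):
    """Empirical variance of the mean of N i.i.d. Uniform(0,1) rewards,
    estimated over `reps` independent repetitions."""
    return rng.random((reps, N)).mean(axis=1).var()

# Var(mean of N) = (1/12)/N, so each 10x increase in N cuts it ~10x.
v10, v100, v1000 = (var_of_mean(N) for N in (10, 100, 1000))
```

The measured ratios between successive variances cluster around 10, matching the 1/N rate.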

4.2.2 Bounding ε_est(F)

Recall ε_est(F) = sup_{ν∈F, ζ∈H} |Ĵ(ν,ζ) − J(ν,ζ)|. Define, for a single sample,

$$h_{\nu,\zeta}(s,a,s',a',s_0,a_0) = (\nu(s,a) - \gamma\,\nu(s',a'))\,\zeta(s,a) - f^*(\zeta(s,a)) - (1-\gamma)\,\nu(s_0,a_0) \tag{49}$$

on Z = (S × A × S × A) × (S × A), with samples Z_i = (s_i, a_i, s′_i, a′_i, s_0^i, a_0^i) — the first factor distributed as d^D (with a′ ∼ π(s′)), the second as βπ — and function class G = h_{F×H} = {h_{ν,ζ} : ν ∈ F, ζ ∈ H}. Assume ∥ν∥_∞ ≤ C/(1−γ) for ν ∈ F, ∥ζ∥_∞ ≤ C for ζ ∈ H, and that f*(·) is L-Lipschitz. Then

$$
\begin{aligned}
\|h_{\nu,\zeta}\|_\infty &\le (1+\gamma)\|\nu\|_\infty\|\zeta\|_\infty + (1-\gamma)\|\nu\|_\infty + \|f^*(\zeta)\|_\infty\\
&\le \frac{1+\gamma}{1-\gamma}\,C^2 + C + \|f^*(\zeta) - f^*(0)\|_\infty + |f^*(0)|\\
&\le \frac{1+\gamma}{1-\gamma}\,C^2 + C + L\|\zeta\|_\infty + |f^*(0)|\\
&\le \frac{1+\gamma}{1-\gamma}\,C^2 + C + LC + |f^*(0)| =: M_1
\end{aligned} \tag{50}
$$

By Pollard's tail inequality for empirical processes,

$$P\Big(\sup_{\nu\in F,\zeta\in H}\big|\hat J(\nu,\zeta) - J(\nu,\zeta)\big| \ge \epsilon\Big) = P\Big(\sup_{\nu\in F,\zeta\in H}\Big|\frac{1}{N}\sum_{i=1}^N h_{\nu,\zeta}(Z_i) - \mathbb{E}[h_{\nu,\zeta}]\Big| \ge \epsilon\Big) \le 8\,\mathbb{E}\Big[\mathcal{N}_1\Big(\frac{\epsilon}{8},\, G,\, \{Z_i\}_{i=1}^N\Big)\Big]\exp\Big(\frac{-N\epsilon^2}{512\,M_1^2}\Big) \tag{51}$$

The covering number of G reduces to those of F and H: for any (ν₁, ζ₁), (ν₂, ζ₂),

$$
\begin{aligned}
\frac{1}{N}\sum_{i=1}^N \big|h_{\nu_1,\zeta_1}(Z_i) - h_{\nu_2,\zeta_2}(Z_i)\big| \le\ & \Big(L + \frac{1+\gamma}{1-\gamma}\,C\Big)\frac{1}{N}\sum_{i=1}^N\big|\zeta_1(s_i,a_i) - \zeta_2(s_i,a_i)\big| + \frac{C}{N}\sum_{i=1}^N\big|\nu_1(s_i,a_i) - \nu_2(s_i,a_i)\big|\\
&+ \frac{\gamma C}{N}\sum_{i=1}^N\big|\nu_1(s'_i,a'_i) - \nu_2(s'_i,a'_i)\big| + \frac{1-\gamma}{N}\sum_{i=1}^N\big|\nu_1(s_0^i,a_0^i) - \nu_2(s_0^i,a_0^i)\big|
\end{aligned} \tag{52}
$$

so that

$$\mathcal{N}_1\Big(\Big(L + \tfrac{2+\gamma-\gamma^2}{1-\gamma}\,C + (1-\gamma)\Big)\epsilon',\, G,\, \{Z_i\}_{i=1}^N\Big) \le \mathcal{N}_1\big(\epsilon', H, \{s_i,a_i\}_{i=1}^N\big)\,\mathcal{N}_1\big(\epsilon', F, \{s_i,a_i\}_{i=1}^N\big)\,\mathcal{N}_1\big(\epsilon', F, \{s'_i,a'_i\}_{i=1}^N\big)\,\mathcal{N}_1\big(\epsilon', F, \{s_0^i,a_0^i\}_{i=1}^N\big) \tag{53}$$

With pseudo-dimensions D_F, D_H of F and H, standard covering-number bounds give

$$\mathcal{N}_1\Big(\Big(L+\tfrac{2+\gamma-\gamma^2}{1-\gamma}\,C+(1-\gamma)\Big)\epsilon',\, G,\, \{Z_i\}_{i=1}^N\Big) \le e^4\,(D_F+1)^3(D_H+1)\Big(\frac{4eM_1}{\epsilon'}\Big)^{3D_F + D_H} \tag{54}$$

hence

$$\mathcal{N}_1\Big(\frac{\epsilon}{8},\, G,\, \{Z_i\}_{i=1}^N\Big) \le e^4\,(D_F+1)^3(D_H+1)\Bigg(\frac{32\big(L + \tfrac{2+\gamma-\gamma^2}{1-\gamma}\,C + (1-\gamma)\big)eM_1}{\epsilon}\Bigg)^{D_1} =: C_1\Big(\frac{1}{\epsilon}\Big)^{D_1} \tag{55}$$

with C₁ = e⁴(D_F+1)³(D_H+1)·(32(L + ((2+γ−γ²)/(1−γ))C + (1−γ))eM₁)^{D₁} and D₁ = 3D_F + D_H. Plugging into (51):

$$P\Big(\sup_{\nu\in F,\zeta\in H}\big|\hat J(\nu,\zeta) - J(\nu,\zeta)\big| \ge \epsilon\Big) \le 8\,C_1\Big(\frac{1}{\epsilon}\Big)^{D_1}\exp\Big(\frac{-N\epsilon^2}{512\,M_1^2}\Big) \tag{56}$$

Choosing $\epsilon = \sqrt{C_2(\log N + \log\frac{1}{\delta})/N}$ with $C_2 = \max\big((8C_1)^{2/D_1},\, 512M_1^2 D_1,\, 512M_1^2,\, 1\big)$ makes the right-hand side at most δ:

$$8\,C_1\Big(\frac{1}{\epsilon}\Big)^{D_1}\exp\Big(\frac{-N\epsilon^2}{512\,M_1^2}\Big) \le \delta \tag{57}$$

so ε_est(F) = Õ(1/√N) with probability at least 1 − δ.

4.2.3 Bounding ε_stat

A similar (in fact simpler, since the function is fixed) concentration argument gives

$$\epsilon_{\mathrm{stat}} = \tilde O\Big(\frac{\log N + \log\frac{1}{\delta}}{N}\Big) \tag{58}$$

4.3 Combining the Bounds

Substituting the bounds on ε_r, ε_est(F), ε_stat and the optimization error into (44)–(45) yields the claimed guarantee (29):

$$\mathbb{E}\Big[\big(\hat{\mathbb{E}}_{d^D}[\hat\zeta(s,a)\cdot\hat r(s,a)] - \mathbb{E}_{d^D}[w_{\pi/D}(s,a)\cdot r(s,a)]\big)^2\Big] = \tilde O\Big(\epsilon_{\mathrm{approx}}(F,H) + \epsilon_{\mathrm{opt}} + \frac{1}{\sqrt N}\Big) \tag{59}$$
