Stochastic Optimal Control
&
Reinforcement Learning
Jinwon Choi
Contents
01 Reinforcement Learning
02 Stochastic Optimal Control
03 Stochastic Control to Reinforcement Learning
04 Large Scale Reinforcement Learning
05 Summary
Reinforcement Learning

(Figure: Ivan Pavlov)

(Figure: Agent-Environment interaction loop: the Agent sends an Action to the Environment, and the Environment returns a Reward and the next State to the Agent)
Markov Decision Process

Markov? "The future is independent of the past given the present":
$\mathbb{P}(s_{t+1} \mid s_1, \dots, s_t) = \mathbb{P}(s_{t+1} \mid s_t)$
A memoryless process!

Markov "Decision" Process:
$\mathbb{P}(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t)$
The future state depends only on the current state and action, and the policy also depends only on the current state:
$\pi(a_t \mid s_1, a_1, \dots, s_t) = \pi(a_t \mid s_t)$
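A minimal sketch (not from the slides) of what the Markov property means in code: the next state is sampled from a transition kernel that depends only on the current state and action, and the policy only on the current state. The toy kernel P, the policy pi, and the state/action sizes are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([  # P[a, s, s']: transition probabilities for 2 actions, 3 states
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],
])
pi = np.array([0, 1, 0])               # a deterministic policy: state -> action

s = 0
for t in range(5):
    a = pi[s]                          # action depends only on the current state
    s_next = rng.choice(3, p=P[a, s])  # next state depends only on (s, a): Markov property
    print(t, s, a, s_next)
    s = s_next
```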
Reinforcement Learning

(Figure: Agent-Environment loop: Action, Reward, State)

• State $s \in S \subset \mathbb{R}^n$
• Action $a \in A \subset \mathbb{R}^m$; action sequence $a_0, a_1, \dots$ with each $a_i \in A$
• Reward $r: S \times A \to \mathbb{R}$
• Discounting factor $\gamma \in (0, 1)$
• Transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
• Total reward $R_{tot} = E_{s \sim p}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
• Policy $\pi: S \to A$
• Total reward w.r.t. $\pi$: $R^{\pi} = E_{s \sim p,\, a \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$

Objective function: $\max_{\pi \in \Pi} R^{\pi}$
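As a small illustration of the total-reward objective above, a hedged sketch (not from the slides) that computes the discounted return of one sampled trajectory; the reward list and the discount value are made-up examples.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum_t gamma^t * r_t for one sampled trajectory (finite truncation of the infinite sum)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# hypothetical reward sequence from one rollout
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))  # 1 + 0 + 2*0.81 + 1*0.729
```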
Terminology of RL and Optimal Control

RL                             | Optimal Control
State                          | State
Action                         | Control Input
Agent                          | Controller
Environment                    | System
Reward of a stage              | Cost of a stage
Reward (or value) function     | Value (or cost) function
Maximizing the value function  | Minimizing the value function
Bellman operator               | DP mapping or operator
Greedy policy w.r.t. J         | Minimizing policy w.r.t. J
Stochastic Optimal Control

System Dynamics
• Deterministic, continuous: $\dot{x} = f(x, u)$
• Deterministic, discrete: $x_{k+1} = f(x_k, u_k)$
• Stochastic, continuous: $dx = f(x, u)\,dt + \sigma(x, u)\,dW$
• Stochastic, discrete: $x_{k+1} = f(x_k, u_k, w_k)$ (where $w_k$ is a random Gaussian noise), or $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$

Control Input and Policy
• Deterministic (control input), continuous: $u(x)$
• Deterministic (control input), discrete: $\{u_0, u_1, u_2, \dots\}$
• Stochastic (policy), continuous: $u(x) \sim \pi(u \mid x)$
• Stochastic (policy), discrete: $u_k \sim \pi(u_k \mid x_k)$
Stochastic Optimal Control

Value function
• Finite-horizon, continuous: $\inf_{u \in U} E_{x \sim p}\left[ \int_{0}^{T} r(x(t), u(t))\,dt + q(x(T)) \right]$
• Infinite-horizon, continuous: $\inf_{u \in U} E_{x \sim p}\left[ \int_{0}^{\infty} e^{-\gamma t}\, r(x(t), u(t))\,dt \right]$
• Finite-horizon, discrete: $\inf_{u_k \in U} E_{x \sim p}\left[ \sum_{k=0}^{N} r(x_k, u_k) + q(x_N) \right]$
• Infinite-horizon, discrete: $\inf_{u_k \in U} E_{x \sim p}\left[ \sum_{k=0}^{\infty} \gamma^k r(x_k, u_k) \right]$
Stochastic Optimal Control

Dynamic Programming (value function $V(x(t))$ in continuous time, $V(x_k)$ in discrete time)
• Finite-horizon, continuous: $V(x(t)) = \inf_{u \in U} E_{x \sim p}\left[ \int_{t}^{t+\Delta t} r(x(s), u(s))\,ds + V(x(t + \Delta t)) \right]$, with $V(x(T)) = q(x(T))$
• Infinite-horizon, continuous: $V(x(t)) = \inf_{u \in U} E_{x \sim p}\left[ \int_{t}^{t+\Delta t} e^{-\gamma s}\, r(x(s), u(s))\,ds + V(x(t + \Delta t)) \right]$
• Finite-horizon, discrete: $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})] \right]$, with terminal condition $V(x_N) = q(x_N)$
• Infinite-horizon, discrete: $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$
Stochastic Optimal Control

Dynamic Programming: HJB equation (continuous time) and Bellman equation (discrete time)

HJB equation, finite-horizon:
$\frac{\partial V}{\partial t} + \inf_{u \in U}\left[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\,\sigma^{T}(x(t), u(t))\, \frac{\partial^2 V}{\partial x^2}\, \sigma(x(t), u(t)) \right] = 0$, with $V(x(T)) = q(x(T))$

HJB equation, infinite-horizon (discounted):
$-\gamma V + \inf_{u \in U}\left[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\,\sigma^{T}(x(t), u(t))\, \frac{\partial^2 V}{\partial x^2}\, \sigma(x(t), u(t)) \right] = 0$

Bellman equation, finite-horizon:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})] \right]$, with terminal condition $V(x_N) = q(x_N)$

Bellman equation, infinite-horizon:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$
Stochastic Optimal Control

Dynamic Programming
$\inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$

How do we solve this infinite-horizon, discrete-time stochastic optimal control problem?
(Note: there is another approach based on a different dynamic programming equation, the average-reward formulation.)
Value Iteration & Policy Iteration
Bellman Operator

Definition. Given a policy $\pi$, the state-value function $V^{\pi}: \mathbb{R}^n \to \mathbb{R}$ is defined by
$V^{\pi}(x_0) := E_{x \sim p,\, \pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \;\middle|\; x = x_0 \text{ at } t = 0 \right] = r(x_0, \pi(x_0)) + E_{x \sim p,\, \pi}\left[ \sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \right]$

and the state-input value function $Q^{\pi}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is defined by
$Q^{\pi}(x_0, u_0) := E_{x \sim p,\, \pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \;\middle|\; x = x_0,\ u = u_0 \text{ at } t = 0 \right] = r(x_0, u_0) + E_{x \sim p,\, \pi}\left[ \sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \right]$

The optimal value function is
$V(x_0) = \inf_{u_k \in U} E_{x \sim p}\left[ \sum_{k=0}^{\infty} \gamma^k r(x_k, u_k) \right]$
Bellman Operator

Dynamic programming form of these value functions:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + \gamma E_{x \sim p}[V(x_{k+1})] \right]$
$V^{\pi}(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[V^{\pi}(x_{k+1})]$
$Q^{\pi}(x_k, u_k) = r(x_k, u_k) + \gamma E_{x \sim p,\, \pi}[Q^{\pi}(x_{k+1}, \pi(x_{k+1}))]$
Bellman Operator

Let $(\mathbb{B}, \|\cdot\|_{\infty}, d_{\infty})$ be a metric space where $\mathbb{B} = \{\psi: \Omega \to \mathbb{R} \mid \psi \text{ continuous and bounded}\}$, $\|\psi\|_{\infty} := \sup_{x \in X} |\psi(x)|$, and $d_{\infty}(\psi, \psi') = \sup_{x \in X} |\psi(x) - \psi'(x)|$.

Definition. Given a policy $\pi$, the Bellman operator $T^{\pi}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{\pi}\psi)(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[\psi(x_{k+1})]$
and the Bellman optimality operator $T^{*}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{*}\psi)(x_k) = \min_{u_k \in U(x_k)}\left[ r(x_k, u_k) + \gamma E_{x \sim p}[\psi(x_{k+1})] \right]$
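For concreteness, a minimal numpy sketch (not from the slides) of the two operators on a finite MDP, using the slides' cost-minimization convention; P[a, s, s'], r[s, a], pi, and gamma are placeholder inputs.

```python
import numpy as np

def T_pi(psi, r, P, pi, gamma):
    """(T^pi psi)(x) = r(x, pi(x)) + gamma * E_{x' ~ p(.|x, pi(x))}[psi(x')]."""
    n = len(psi)
    return np.array([r[s, pi[s]] + gamma * P[pi[s], s] @ psi for s in range(n)])

def T_star(psi, r, P, gamma):
    """(T* psi)(x) = min_u [ r(x, u) + gamma * E_{x' ~ p(.|x, u)}[psi(x')] ]."""
    # backed-up value for every (state, action) pair, then minimize over actions
    q = r + gamma * np.einsum('ast,t->sa', P, psi)
    return q.min(axis=1)
```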
Bellman Operator

Proposition 1. (Monotonicity) The Bellman operators $T^{\pi}$, $T^{*}$ are monotone, i.e. if
$\psi(x) \le \psi'(x) \quad \forall x \in X$,
then
$(T^{\pi}\psi)(x) \le (T^{\pi}\psi')(x) \quad \forall x \in X$
$(T^{*}\psi)(x) \le (T^{*}\psi')(x) \quad \forall x \in X$.

Proposition 2. (Constant shift property) For any scalar $r$ and the unit function $e$,
$(T^{\pi}(\psi + r e))(x) = (T^{\pi}\psi)(x) + \gamma r \quad \forall x \in X$
$(T^{*}(\psi + r e))(x) = (T^{*}\psi)(x) + \gamma r \quad \forall x \in X$.

Proposition 3. (Contraction) The Bellman operators $T^{\pi}$, $T^{*}$ are contractions with modulus $\gamma$ with respect to the sup norm $\|\cdot\|_{\infty}$, i.e.
$\|T^{\pi}\psi - T^{\pi}\psi'\|_{\infty} \le \gamma \|\psi - \psi'\|_{\infty} \quad \forall \psi, \psi' \in \mathbb{B}$
$\|T^{*}\psi - T^{*}\psi'\|_{\infty} \le \gamma \|\psi - \psi'\|_{\infty} \quad \forall \psi, \psi' \in \mathbb{B}$.
Bellman Operator

Theorem 2.3. (Contraction Mapping Theorem) Let $(\mathbb{B}, \|\cdot\|_{\infty}, d_{\infty})$ be a metric space and $T: \mathbb{B} \to \mathbb{B}$ a contraction mapping with modulus $\gamma$. Then,
1) $T$ has a unique fixed point in $\mathbb{B}$, i.e. there exists a unique $f^{*} \in \mathbb{B}$ s.t. $Tf^{*} = f^{*}$.
2) For any $f_0 \in \mathbb{B}$, the sequence $\{f_n\}$ with $f_{n+1} = T f_n$ satisfies $\lim_{n \to \infty} T^{n} f_0 = f^{*}$.
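A small numerical illustration (not from the slides) of the theorem: iterating a made-up contraction T on R^2 converges to its unique fixed point. The matrix A and offset b are arbitrary, chosen so that T is a contraction in the sup norm.

```python
import numpy as np

A = np.array([[0.3, 0.4], [-0.4, 0.3]])    # max absolute row sum = 0.7 < 1, so T is a contraction
b = np.array([1.0, 2.0])
T = lambda f: A @ f + b                    # T f = A f + b

f = np.zeros(2)
for n in range(50):
    f = T(f)                               # f_{n+1} = T f_n

f_star = np.linalg.solve(np.eye(2) - A, b) # exact fixed point: f* = A f* + b
print(np.allclose(f, f_star))              # True: T^n f_0 -> f*
```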
Value Iteration

Algorithm: Value Iteration
Input: r, p, γ, Δ
Output: π*
1. V(x) ← initialize arbitrarily for all x ∈ X
2. repeat
3.   for all x ∈ X do
4.     V_{k+1} ← T* V_k
5. until ||V_{k+1} − V_k|| < Δ
6. π*(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V_k(x') ]
7. return π*
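A minimal tabular implementation sketch (not from the slides) of the value-iteration pseudocode above, with cost minimization; P[a, s, s'], r[s, a], gamma, and delta are placeholders supplied by the caller.

```python
import numpy as np

def value_iteration(r, P, gamma, delta=1e-8):
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                              # 1. arbitrary initialization
    while True:                                         # 2. repeat
        Q = r + gamma * np.einsum('ast,t->sa', P, V)    # 3.-4. V_{k+1} = T* V_k
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < delta:           # 5. until ||V_{k+1} - V_k|| < delta
            V = V_new
            break
        V = V_new
    Q = r + gamma * np.einsum('ast,t->sa', P, V)
    return Q.argmin(axis=1), V                          # 6.-7. minimizing (greedy) policy and V
```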
Policy Iteration

Algorithm: Policy Iteration
Input: r, p, γ, Δ
Output: π
1. V(x), π(x) ← initialize arbitrarily for all x ∈ X
2. repeat
     1) Policy Evaluation
3.   for x ∈ X with fixed π_k do
4.     V^{π_k}_{k+1} ← T^π V^{π_k}_k
5.   until ||V^{π_k}_{k+1} − V^{π_k}_k|| < Δ
     2) Policy Improvement
6.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V^{π_k}_{k+1}(x') ]
7. until ||V^{π_{k+1}}_{k+1} − V^{π_k}_k|| < Δ
8. return π
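A corresponding sketch (not from the slides) of tabular policy iteration. One deviation from the pseudocode above: policy evaluation is done exactly by solving the linear system V = r_pi + gamma * P_pi V rather than by iterating T^pi until convergence.

```python
import numpy as np

def policy_iteration(r, P, gamma):
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)                  # 1. arbitrary initial policy
    while True:                                         # 2. repeat
        # 1) Policy evaluation: solve (I - gamma * P_pi) V = r_pi
        P_pi = P[pi, np.arange(n_states)]               # row s is p(. | s, pi(s))
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2) Policy improvement: minimizing (greedy) policy w.r.t. V
        Q = r + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):                  # stop once the policy is stable
            return pi, V
        pi = pi_new
```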
Stochastic Control to RL
Learning-based approach

Q: If the system dynamics $p$ and the reward function $r$ are unknown, how do we solve the DP equation?

1. Estimate the model ($r$ and $p$) from simulation data and use the previous methods
   → Model-based approach (model learning)
2. Without system identification, obtain the value function and policy directly from simulation data
   → Model-free approach
$\inf_{\pi \in \Pi} E_{\pi}\left[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$

• Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient

(Figure: approximation landscape)
• Approximation in Value space: TD, SARSA, Q-learning; function approximation
• Approximation in Policy space: policy search; policy gradient
• Actor-Critic (combining both): DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
• Approximate Expectation: Monte-Carlo search; certainty equivalence
Approximation in Value Space

DP algorithms sweep over all states at each step.
Use Monte-Carlo search instead: $E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f_i$
But with $N \sim 14{,}000{,}605$ samples, even this is impractical.
Stochastic Approximation

Consider the fixed-point problem
$x = L(x)$.
It can be solved by the iterative algorithm
$x_{k+1} = L(x_k)$
or
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k L(x_k)$.
If $L(x)$ is of the form $E[f(x, w)]$, where $w$ is a random noise, then $L(x)$ can be approximated by
$L(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x, w_i)$,
which becomes inefficient when $N$ is large.
Stochastic Approximation

Use a single sample as an estimate of the expectation in each update:
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k f(x_k, w_k)$.
This update can be seen as a stochastic approximation of the form
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k \left( E[f(x_k, w_k)] + \varepsilon_k \right) = (1 - \alpha_k) x_k + \alpha_k \left( L(x_k) + \varepsilon_k \right)$,
where $\varepsilon_k = f(x_k, w_k) - E[f(x_k, w_k)]$.
Robbins-Monro stochastic approximation guarantees convergence under contraction or monotonicity assumptions on the mapping $L$, together with the step-size conditions
$\sum_{k=0}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha_k^2 < \infty$.
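A scalar sketch (not from the slides) of the Robbins-Monro iteration: a made-up contraction L(x) = 0.5 x + 1 (fixed point x* = 2) is only observed through noisy samples f(x, w) = L(x) + w, and a single sample is used per update with step sizes alpha_k = 1/(k+1), which satisfy both step-size conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x, w: 0.5 * x + 1.0 + w          # E[f(x, w)] = L(x); fixed point x* = 2

x = 0.0
for k in range(20000):
    alpha = 1.0 / (k + 1)                   # sum alpha_k = inf, sum alpha_k^2 < inf
    w = rng.normal(scale=1.0)
    x = (1 - alpha) * x + alpha * f(x, w)   # single-sample stochastic approximation
print(x)                                    # approaches 2.0
```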
Policy Iteration

Algorithm: Policy Iteration (Classical Dynamic Programming)
Input: r, p, γ, Δ
Output: π
1. V(x), π(x) ← initialize arbitrarily for all x ∈ X
2. repeat
     1) Policy Evaluation
3.   for x ∈ X with fixed π_k do
4.     V^{π_k}_{k+1} ← T^π V^{π_k}_k
5.   until ||V^{π_k}_{k+1} − V^{π_k}_k|| < Δ
     2) Policy Improvement
6.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V^{π_k}_{k+1}(x') ]
7. until ||V^{π_{k+1}}_{k+1} − V^{π_k}_k|| < Δ
8. return π
Policy Iteration

Algorithm: Policy Iteration (Temporal Difference)
Input: r, p, γ, Δ
Output: π
1. V(x), π(x) ← initialize arbitrarily for all x ∈ X
2. repeat
3.   given x_0, simulate π_k and store 𝒟 = {x_i, π_k(x_i), r(x_i, π_k(x_i))}
     1) Policy Evaluation
4.   for x_i ∈ 𝒟 with fixed π_k do      (instead of sweeping over all x ∈ X)
5.     V^{π_k}_{k+1}(x_i) ← (1 − α_k) V^{π_k}_k(x_i) + α_k [ r(x_i, π_k(x_i)) + γ V^{π_k}_k(x_{i+1}) ]
       (a sampled update replacing the exact backup V^{π_k}_{k+1} ← T^π V^{π_k}_k)
6.   until ||V^{π_k}_{k+1} − V^{π_k}_k|| < Δ
     2) Policy Improvement
7.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V^{π_k}_{k+1}(x') ]
8. until ||V^{π_{k+1}}_{k+1} − V^{π_k}_k|| < Δ
9. return π
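A minimal sketch (not from the slides) of the sampled policy-evaluation update in step 5, written for a stored trajectory of (state, reward, next-state) tuples; V, alpha, and the trajectory format are placeholders.

```python
def td0_evaluation(V, trajectory, gamma, alpha):
    """TD(0) policy evaluation on a stored rollout; V is a table (array or dict)."""
    for x, r, x_next in trajectory:
        target = r + gamma * V[x_next]              # one-sample estimate of (T^pi V)(x)
        V[x] = (1 - alpha) * V[x] + alpha * target  # stochastic-approximation update
    return V
```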
Policy Iteration

Algorithm: Policy Iteration (SARSA)
Input: r, p, γ, Δ
Output: π
1. Q(x, u), π(x) ← initialize arbitrarily for all x ∈ X, u ∈ U
2. repeat
3.   given x_0, simulate π_k and store 𝒟 = {x_i, π_k(x_i), r(x_i, π_k(x_i))}
     1) Policy Evaluation
4.   for x_i ∈ 𝒟 with fixed π_k do
5.     Q^{π_k}_{k+1}(x_i, u_i) ← (1 − α_k) Q^{π_k}_k(x_i, u_i) + α_k [ r(x_i, π_k(x_i)) + γ Q^{π_k}_k(x_{i+1}, π_k(x_{i+1})) ]
       (a sampled update replacing the exact backup Q^{π_k}_{k+1} ← T^π Q^{π_k}_k)
6.   until ||Q^{π_k}_{k+1} − Q^{π_k}_k|| < Δ
     2) Policy Improvement
7.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} Q^{π_k}_{k+1}(x, u)
8. until ||Q^{π_{k+1}}_{k+1} − Q^{π_k}_k|| < Δ
9. return π
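A minimal sketch (not from the slides) of the SARSA evaluation update in step 5, assuming the stored rollout provides on-policy tuples (x_i, u_i, r_i, x_{i+1}, u_{i+1}); Q is a dict keyed by (state, action), and all names are placeholders.

```python
def sarsa_evaluation(Q, transitions, gamma, alpha):
    """On-policy Q evaluation; u_next is the action pi_k(x_{i+1}) taken in the rollout."""
    for x, u, r, x_next, u_next in transitions:
        target = r + gamma * Q[(x_next, u_next)]        # one-sample estimate of (T^pi Q)(x, u)
        Q[(x, u)] = (1 - alpha) * Q[(x, u)] + alpha * target
    return Q
```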
Value Iteration

Algorithm: Value Iteration (Classical Dynamic Programming)
Input: r, p, γ, Δ
Output: π*
1. V(x) ← initialize arbitrarily for all x ∈ X
2. repeat
3.   given x_0, simulate and store 𝒟 = {x_i, r(x_i, u_i)}   (data collection, used by the Q-learning variant below)
4.   for all x ∈ X do
5.     V_{k+1}(x) ← T* V_k
6. until ||V_{k+1} − V_k|| < Δ
7. π*(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V_k(x') ]
8. return π*
Value Iteration

Algorithm: Value Iteration (Q-learning)
Input: r, p, γ, Δ
Output: π*
1. Q(x, u) ← initialize arbitrarily for all x ∈ X, u ∈ U
2. repeat
3.   given x_0, simulate and store 𝒟 = {(x_i, u_i, r(x_i, u_i), x_{i+1})}
4.   for (x_i, u_i, r(x_i, u_i), x_{i+1}) ∈ 𝒟 do      (instead of sweeping over all x ∈ X)
5.     Q_{k+1}(x_i, u_i) ← (1 − α_k) Q_k(x_i, u_i) + α_k [ r(x_i, u_i) + γ min_{u' ∈ U} Q_k(x_{i+1}, u') ]
       (a sampled update replacing the exact backup Q_{k+1}(x, u) ← T* Q_k)
6. until ||Q_{k+1} − Q_k|| < Δ
7. π*(x) ∈ argmin_{u ∈ U(x)} Q_k(x, u)
8. return π*
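A minimal sketch (not from the slides) of the tabular Q-learning update in step 5, written for cost minimization as in the slides; Q is a 2-D array indexed by [state, action] and the remaining names are placeholders.

```python
import numpy as np

def q_learning_update(Q, x, u, r, x_next, gamma, alpha):
    """One Q-learning backup on a single transition (x, u, r, x_next)."""
    target = r + gamma * np.min(Q[x_next])          # one-sample estimate of (T* Q)(x, u)
    Q[x, u] = (1 - alpha) * Q[x, u] + alpha * target
    return Q
```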
Approximation in Value Space

Policy evaluation with sampled Bellman backups:
• $T^{\pi}$ applied to $V^{\pi}$: TD(0), TD(λ)
• $T^{\pi}$ applied to $Q^{\pi}$: SARSA(0), SARSA(λ)
• $T^{*}$ applied to $Q$: Q-learning
Large Scale RL

Large-scale RL: the number of states is greater than $2^{200}$ (for a 10 × 20 board) → "Function approximation"
(Figure: approximation landscape, as before. Value space: TD, SARSA, Q-learning, function approximation; Policy space: policy search, policy gradient; Actor-Critic: DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …; Approximate expectation: Monte-Carlo search, certainty equivalence)
Approximate Dynamic Programming

Direct method (gradient methods)

$\min_{\theta} \sum_{i=1}^{N} \left( V^{\pi}(x_i) - \hat{V}(x_i; \theta) \right)^2 \;\approx\; \min_{\theta} \sum_{x_i \in X} \sum_{m=1}^{M} \left( J(x_i, m) - \hat{V}(x_i; \theta) \right)^2$

where
• $\hat{V}(x; \theta)$: approximated value function (e.g. polynomial approximation, neural network, etc.)
• $V^{\pi}(x)$: state-value function
• $J(x_i, m)$: $m$-th sample of the cost function at $x_i$, $m = 1, 2, \dots, M$

Gradient-descent update:
$\theta_{k+1} = \theta_k - \eta \sum_{x_i \in X} \sum_{m=1}^{M} \nabla \hat{V}(x_i; \theta) \left( \hat{V}(x_i; \theta) - J(x_i, m) \right)$
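A hedged sketch (not from the slides) of the direct method for a linear approximation V_hat(x; theta) = phi(x)^T theta, for which the gradient is simply phi(x); the feature matrix, sampled costs J, step size eta, and iteration count are placeholders, and a neural network would replace the linear model.

```python
import numpy as np

def fit_value_direct(phi, J, eta=0.01, iters=1000):
    """phi: (n_samples, n_features) feature matrix; J: (n_samples,) sampled costs."""
    theta = np.zeros(phi.shape[1])
    for _ in range(iters):
        v_hat = phi @ theta                      # V_hat(x_i; theta) for every sample
        grad = phi.T @ (v_hat - J) / len(J)      # gradient of the mean squared error
        theta -= eta * grad                      # theta_{k+1} = theta_k - eta * grad
    return theta
```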
Approximate Dynamic Programming

Indirect method (projected equation)

Solve the projected Bellman equation: $\Phi\theta = \Pi T(\Phi\theta)$

(Figure: Direct method: project the value function $J$ onto the subspace $S = \{\Phi\theta \mid \theta \in \mathbb{R}^s\}$, giving $\Pi J$. Indirect method: find $\Phi\theta \in S$ satisfying $\Phi\theta = \Pi T(\Phi\theta)$, i.e. the projection of $T(\Phi\theta)$ back onto $S$.)
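A hedged sketch (not from the slides) of one standard way to solve the projected equation for a fixed policy with linear features: LSTD(0), which replaces $\Phi\theta = \Pi T^{\pi}(\Phi\theta)$ by the sampled linear system A theta = b. The transition format and feature vectors are placeholders.

```python
import numpy as np

def lstd0(transitions, gamma, n_features):
    """transitions: iterable of (phi_x, r, phi_x_next) feature-vector tuples from one policy."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)   # sampled A = sum phi (phi - gamma phi')^T
        b += phi * r                                 # sampled b = sum phi r
    return np.linalg.solve(A, b)                     # theta solving the (sampled) projected equation
```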
Function Approximation and Policy Evaluation

Policy evaluation (tabular):
• $T^{\pi}$ on $V^{\pi}$: TD(0), TD(λ)
• $T^{\pi}$ on $Q^{\pi}$: SARSA(0), SARSA(λ)
• $T^{*}$ on $Q$: Q-learning

With function approximation, direct (gradient methods):
• $V^{\pi}$: TD
• $Q^{\pi}$: SARSA
• $Q$: DQN

With function approximation, indirect (projected DP):
• $V^{\pi}$: TD, LSTD
• $Q^{\pi}$: TD, LSTD
• $Q$: LSPE
Summary

RL is a toolbox for solving the infinite-horizon, discrete-time DP problem
$\inf_{\pi \in \Pi} E_{\pi}\left[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$

• Approximation in value space: approximate $E[\cdot]$ (Monte-Carlo search, certainty equivalence), parametric approximation, problem approximation, rollout, MPC; TD, SARSA, Q-learning with function approximation
• Approximation in policy space: policy search, policy gradient
• Actor-Critic (combining both): DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
• Policy evaluation: TD(0)/TD(λ) for $V^{\pi}$, SARSA(0)/SARSA(λ) for $Q^{\pi}$, Q-learning for $Q$
• With function approximation: direct (gradient) methods (TD, SARSA, DQN) and indirect (projected DP) methods (TD, LSTD, LSPE)
Q&A
Thank you