Stochastic Optimal Control
&
Reinforcement Learning
Jinwon Choi
Contents
01 Reinforcement Learning
02 Stochastic Optimal Control
03 Stochastic Control to Reinforcement Learning
04 Large Scale Reinforcement Learning
05 Summary
Reinforcement Learning

(Figure: Ivan Pavlov)

(Figure: Agent-Environment interaction loop: the Agent sends an Action to the Environment, and the Environment returns a Reward and the next State to the Agent)
Markov Decision Process

Markov? "The future is independent of the past given the present":
$\mathbb{P}(s_{t+1} \mid s_1, \dots, s_t) = \mathbb{P}(s_{t+1} \mid s_t)$
A memoryless process!

Markov "Decision" Process:
$\mathbb{P}(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t)$
The future state depends only on the current state and action, and the policy also depends only on the current state:
$\pi(a_t \mid s_1, a_1, \dots, s_t) = \pi(a_t \mid s_t)$
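A minimal sketch (not from the slides) of what the Markov property means in code: the next state is sampled from a transition kernel that depends only on the current state and action, and the policy only on the current state. The toy kernel P, the policy pi, and the state/action sizes are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([  # P[a, s, s']: transition probabilities for 2 actions, 3 states
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],
])
pi = np.array([0, 1, 0])               # a deterministic policy: state -> action

s = 0
for t in range(5):
    a = pi[s]                          # action depends only on the current state
    s_next = rng.choice(3, p=P[a, s])  # next state depends only on (s, a): Markov property
    print(t, s, a, s_next)
    s = s_next
```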
Reinforcement Learning

(Figure: Agent-Environment loop: Action, Reward, State)

• State $s \in S \subset \mathbb{R}^n$
• Action $a \in A \subset \mathbb{R}^m$; action sequence $a_0, a_1, \dots$ with each $a_i \in A$
• Reward $r: S \times A \to \mathbb{R}$
• Discounting factor $\gamma \in (0, 1)$
• Transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
• Total reward $R_{tot} = E_{s \sim p}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
• Policy $\pi: S \to A$
• Total reward w.r.t. $\pi$: $R^{\pi} = E_{s \sim p,\, a \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$

Objective function: $\max_{\pi \in \Pi} R^{\pi}$
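As a small illustration of the total-reward objective above, a hedged sketch (not from the slides) that computes the discounted return of one sampled trajectory; the reward list and the discount value are made-up examples.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum_t gamma^t * r_t for one sampled trajectory (finite truncation of the infinite sum)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# hypothetical reward sequence from one rollout
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))  # 1 + 0 + 2*0.81 + 1*0.729
```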
Terminology of RL and Optimal Control

RL                             | Optimal Control
State                          | State
Action                         | Control Input
Agent                          | Controller
Environment                    | System
Reward of a stage              | Cost of a stage
Reward (or value) function     | Value (or cost) function
Maximizing the value function  | Minimizing the value function
Bellman operator               | DP mapping or operator
Greedy policy w.r.t. J         | Minimizing policy w.r.t. J
Stochastic Optimal Control

System Dynamics
• Deterministic, continuous: $\dot{x} = f(x, u)$
• Deterministic, discrete: $x_{k+1} = f(x_k, u_k)$
• Stochastic, continuous: $dx = f(x, u)\,dt + \sigma(x, u)\,dW$
• Stochastic, discrete: $x_{k+1} = f(x_k, u_k, w_k)$ (where $w_k$ is a random Gaussian noise), or $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$

Control Input and Policy
• Deterministic (control input), continuous: $u(x)$
• Deterministic (control input), discrete: $\{u_0, u_1, u_2, \dots\}$
• Stochastic (policy), continuous: $u(x) \sim \pi(u \mid x)$
• Stochastic (policy), discrete: $u_k \sim \pi(u_k \mid x_k)$
Stochastic Optimal Control

Value function
• Finite-horizon, continuous: $\inf_{u \in U} E_{x \sim p}\left[ \int_{0}^{T} r(x(t), u(t))\,dt + q(x(T)) \right]$
• Infinite-horizon, continuous: $\inf_{u \in U} E_{x \sim p}\left[ \int_{0}^{\infty} e^{-\gamma t}\, r(x(t), u(t))\,dt \right]$
• Finite-horizon, discrete: $\inf_{u_k \in U} E_{x \sim p}\left[ \sum_{k=0}^{N} r(x_k, u_k) + q(x_N) \right]$
• Infinite-horizon, discrete: $\inf_{u_k \in U} E_{x \sim p}\left[ \sum_{k=0}^{\infty} \gamma^k r(x_k, u_k) \right]$
Stochastic Optimal Control

Dynamic Programming (value function $V(x(t))$ in continuous time, $V(x_k)$ in discrete time)
• Finite-horizon, continuous: $V(x(t)) = \inf_{u \in U} E_{x \sim p}\left[ \int_{t}^{t+\Delta t} r(x(s), u(s))\,ds + V(x(t + \Delta t)) \right]$, with $V(x(T)) = q(x(T))$
• Infinite-horizon, continuous: $V(x(t)) = \inf_{u \in U} E_{x \sim p}\left[ \int_{t}^{t+\Delta t} e^{-\gamma s}\, r(x(s), u(s))\,ds + V(x(t + \Delta t)) \right]$
• Finite-horizon, discrete: $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})] \right]$, with terminal condition $V(x_N) = q(x_N)$
• Infinite-horizon, discrete: $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$
Stochastic Optimal Control

Dynamic Programming: HJB equation (continuous time) and Bellman equation (discrete time)

HJB equation, finite-horizon:
$\frac{\partial V}{\partial t} + \inf_{u \in U}\left[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\,\sigma^{T}(x(t), u(t))\, \frac{\partial^2 V}{\partial x^2}\, \sigma(x(t), u(t)) \right] = 0$, with $V(x(T)) = q(x(T))$

HJB equation, infinite-horizon (discounted):
$-\gamma V + \inf_{u \in U}\left[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\,\sigma^{T}(x(t), u(t))\, \frac{\partial^2 V}{\partial x^2}\, \sigma(x(t), u(t)) \right] = 0$

Bellman equation, finite-horizon:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})] \right]$, with terminal condition $V(x_N) = q(x_N)$

Bellman equation, infinite-horizon:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$
Stochastic Optimal Control

Dynamic Programming
$\inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$

How do we solve this infinite-horizon, discrete-time stochastic optimal control problem?
(Note: there is another approach based on a different dynamic programming equation, the average-reward formulation.)
Value Iteration & Policy Iteration
Bellman Operator

Definition. Given a policy $\pi$, the state-value function $V^{\pi}: \mathbb{R}^n \to \mathbb{R}$ is defined by
$V^{\pi}(x_0) := E_{x \sim p,\, \pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \;\middle|\; x = x_0 \text{ at } t = 0 \right] = r(x_0, \pi(x_0)) + E_{x \sim p,\, \pi}\left[ \sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \right]$

and the state-input value function $Q^{\pi}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is defined by
$Q^{\pi}(x_0, u_0) := E_{x \sim p,\, \pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \;\middle|\; x = x_0,\ u = u_0 \text{ at } t = 0 \right] = r(x_0, u_0) + E_{x \sim p,\, \pi}\left[ \sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \right]$

The optimal value function is
$V(x_0) = \inf_{u_k \in U} E_{x \sim p}\left[ \sum_{k=0}^{\infty} \gamma^k r(x_k, u_k) \right]$
Bellman Operator

Dynamic programming form of these value functions:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + \gamma E_{x \sim p}[V(x_{k+1})] \right]$
$V^{\pi}(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[V^{\pi}(x_{k+1})]$
$Q^{\pi}(x_k, u_k) = r(x_k, u_k) + \gamma E_{x \sim p,\, \pi}[Q^{\pi}(x_{k+1}, \pi(x_{k+1}))]$
Bellman Operator

Let $(\mathbb{B}, \|\cdot\|_{\infty}, d_{\infty})$ be a metric space where $\mathbb{B} = \{\psi: \Omega \to \mathbb{R} \mid \psi \text{ continuous and bounded}\}$, $\|\psi\|_{\infty} := \sup_{x \in X} |\psi(x)|$, and $d_{\infty}(\psi, \psi') = \sup_{x \in X} |\psi(x) - \psi'(x)|$.

Definition. Given a policy $\pi$, the Bellman operator $T^{\pi}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{\pi}\psi)(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[\psi(x_{k+1})]$
and the Bellman optimality operator $T^{*}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{*}\psi)(x_k) = \min_{u_k \in U(x_k)}\left[ r(x_k, u_k) + \gamma E_{x \sim p}[\psi(x_{k+1})] \right]$
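For concreteness, a minimal numpy sketch (not from the slides) of the two operators on a finite MDP, using the slides' cost-minimization convention; P[a, s, s'], r[s, a], pi, and gamma are placeholder inputs.

```python
import numpy as np

def T_pi(psi, r, P, pi, gamma):
    """(T^pi psi)(x) = r(x, pi(x)) + gamma * E_{x' ~ p(.|x, pi(x))}[psi(x')]."""
    n = len(psi)
    return np.array([r[s, pi[s]] + gamma * P[pi[s], s] @ psi for s in range(n)])

def T_star(psi, r, P, gamma):
    """(T* psi)(x) = min_u [ r(x, u) + gamma * E_{x' ~ p(.|x, u)}[psi(x')] ]."""
    # backed-up value for every (state, action) pair, then minimize over actions
    q = r + gamma * np.einsum('ast,t->sa', P, psi)
    return q.min(axis=1)
```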
Bellman Operator

Proposition 1. (Monotonicity) The Bellman operators $T^{\pi}$, $T^{*}$ are monotone, i.e. if
$\psi(x) \le \psi'(x) \quad \forall x \in X$,
then
$(T^{\pi}\psi)(x) \le (T^{\pi}\psi')(x) \quad \forall x \in X$
$(T^{*}\psi)(x) \le (T^{*}\psi')(x) \quad \forall x \in X$.

Proposition 2. (Constant shift property) For any scalar $r$ and the unit function $e$,
$(T^{\pi}(\psi + r e))(x) = (T^{\pi}\psi)(x) + \gamma r \quad \forall x \in X$
$(T^{*}(\psi + r e))(x) = (T^{*}\psi)(x) + \gamma r \quad \forall x \in X$.

Proposition 3. (Contraction) The Bellman operators $T^{\pi}$, $T^{*}$ are contractions with modulus $\gamma$ with respect to the sup norm $\|\cdot\|_{\infty}$, i.e.
$\|T^{\pi}\psi - T^{\pi}\psi'\|_{\infty} \le \gamma \|\psi - \psi'\|_{\infty} \quad \forall \psi, \psi' \in \mathbb{B}$
$\|T^{*}\psi - T^{*}\psi'\|_{\infty} \le \gamma \|\psi - \psi'\|_{\infty} \quad \forall \psi, \psi' \in \mathbb{B}$.
Bellman Operator

Theorem 2.3. (Contraction Mapping Theorem) Let $(\mathbb{B}, \|\cdot\|_{\infty}, d_{\infty})$ be a metric space and $T: \mathbb{B} \to \mathbb{B}$ a contraction mapping with modulus $\gamma$. Then,
1) $T$ has a unique fixed point in $\mathbb{B}$, i.e. there exists a unique $f^{*} \in \mathbb{B}$ s.t. $Tf^{*} = f^{*}$.
2) For any $f_0 \in \mathbb{B}$, the sequence $\{f_n\}$ with $f_{n+1} = T f_n$ satisfies $\lim_{n \to \infty} T^{n} f_0 = f^{*}$.
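A small numerical illustration (not from the slides) of the theorem: iterating a made-up contraction T on R^2 converges to its unique fixed point. The matrix A and offset b are arbitrary, chosen so that T is a contraction in the sup norm.

```python
import numpy as np

A = np.array([[0.3, 0.4], [-0.4, 0.3]])    # max absolute row sum = 0.7 < 1, so T is a contraction
b = np.array([1.0, 2.0])
T = lambda f: A @ f + b                    # T f = A f + b

f = np.zeros(2)
for n in range(50):
    f = T(f)                               # f_{n+1} = T f_n

f_star = np.linalg.solve(np.eye(2) - A, b) # exact fixed point: f* = A f* + b
print(np.allclose(f, f_star))              # True: T^n f_0 -> f*
```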
Value Iteration

Algorithm: Value Iteration
Input: r, p, γ, Δ
Output: π*
1. V(x) ← initialize arbitrarily for all x ∈ X
2. repeat
3.   for all x ∈ X do
4.     V_{k+1} ← T* V_k
5. until ||V_{k+1} − V_k|| < Δ
6. π*(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V_k(x') ]
7. return π*
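A minimal tabular implementation sketch (not from the slides) of the value-iteration pseudocode above, with cost minimization; P[a, s, s'], r[s, a], gamma, and delta are placeholders supplied by the caller.

```python
import numpy as np

def value_iteration(r, P, gamma, delta=1e-8):
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                              # 1. arbitrary initialization
    while True:                                         # 2. repeat
        Q = r + gamma * np.einsum('ast,t->sa', P, V)    # 3.-4. V_{k+1} = T* V_k
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < delta:           # 5. until ||V_{k+1} - V_k|| < delta
            V = V_new
            break
        V = V_new
    Q = r + gamma * np.einsum('ast,t->sa', P, V)
    return Q.argmin(axis=1), V                          # 6.-7. minimizing (greedy) policy and V
```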
Policy Iteration

Algorithm: Policy Iteration
Input: r, p, γ, Δ
Output: π
1. V(x), π(x) ← initialize arbitrarily for all x ∈ X
2. repeat
     1) Policy Evaluation
3.   for x ∈ X with fixed π_k do
4.     V^{π_k}_{k+1} ← T^π V^{π_k}_k
5.   until ||V^{π_k}_{k+1} − V^{π_k}_k|| < Δ
     2) Policy Improvement
6.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V^{π_k}_{k+1}(x') ]
7. until ||V^{π_{k+1}}_{k+1} − V^{π_k}_k|| < Δ
8. return π
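A corresponding sketch (not from the slides) of tabular policy iteration. One deviation from the pseudocode above: policy evaluation is done exactly by solving the linear system V = r_pi + gamma * P_pi V rather than by iterating T^pi until convergence.

```python
import numpy as np

def policy_iteration(r, P, gamma):
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)                  # 1. arbitrary initial policy
    while True:                                         # 2. repeat
        # 1) Policy evaluation: solve (I - gamma * P_pi) V = r_pi
        P_pi = P[pi, np.arange(n_states)]               # row s is p(. | s, pi(s))
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2) Policy improvement: minimizing (greedy) policy w.r.t. V
        Q = r + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):                  # stop once the policy is stable
            return pi, V
        pi = pi_new
```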
Stochastic Control to RL
Learning-based approach

Q: If the system dynamics $p$ and the reward function $r$ are unknown, how do we solve the DP equation?

1. Estimate the model ($r$ and $p$) from simulation data and use the previous methods
   → Model-based approach (model learning)
2. Without system identification, obtain the value function and policy directly from simulation data
   → Model-free approach
$\inf_{\pi \in \Pi} E_{\pi}\left[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$

• Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient

(Figure: approximation landscape)
• Approximation in Value space: TD, SARSA, Q-learning; function approximation
• Approximation in Policy space: policy search; policy gradient
• Actor-Critic (combining both): DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
• Approximate Expectation: Monte-Carlo search; certainty equivalence
Approximation in Value Space

DP algorithms sweep over all states at each step.
Use Monte-Carlo search instead: $E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f_i$
But with $N \sim 14{,}000{,}605$ samples, even this is impractical.
Stochastic Approximation

Consider the fixed-point problem
$x = L(x)$.
It can be solved by the iterative algorithm
$x_{k+1} = L(x_k)$
or
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k L(x_k)$.
If $L(x)$ is of the form $E[f(x, w)]$, where $w$ is a random noise, then $L(x)$ can be approximated by
$L(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x, w_i)$,
which becomes inefficient when $N$ is large.
Stochastic Approximation

Use a single sample as an estimate of the expectation in each update:
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k f(x_k, w_k)$.
This update can be seen as a stochastic approximation of the form
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k \left( E[f(x_k, w_k)] + \varepsilon_k \right) = (1 - \alpha_k) x_k + \alpha_k \left( L(x_k) + \varepsilon_k \right)$,
where $\varepsilon_k = f(x_k, w_k) - E[f(x_k, w_k)]$.
Robbins-Monro stochastic approximation guarantees convergence under contraction or monotonicity assumptions on the mapping $L$, together with the step-size conditions
$\sum_{k=0}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha_k^2 < \infty$.
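A scalar sketch (not from the slides) of the Robbins-Monro iteration: a made-up contraction L(x) = 0.5 x + 1 (fixed point x* = 2) is only observed through noisy samples f(x, w) = L(x) + w, and a single sample is used per update with step sizes alpha_k = 1/(k+1), which satisfy both step-size conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x, w: 0.5 * x + 1.0 + w          # E[f(x, w)] = L(x); fixed point x* = 2

x = 0.0
for k in range(20000):
    alpha = 1.0 / (k + 1)                   # sum alpha_k = inf, sum alpha_k^2 < inf
    w = rng.normal(scale=1.0)
    x = (1 - alpha) * x + alpha * f(x, w)   # single-sample stochastic approximation
print(x)                                    # approaches 2.0
```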
Policy Iteration

Algorithm: Policy Iteration (Classical Dynamic Programming)
Input: r, p, γ, Δ
Output: π
1. V(x), π(x) ← initialize arbitrarily for all x ∈ X
2. repeat
     1) Policy Evaluation
3.   for x ∈ X with fixed π_k do
4.     V^{π_k}_{k+1} ← T^π V^{π_k}_k
5.   until ||V^{π_k}_{k+1} − V^{π_k}_k|| < Δ
     2) Policy Improvement
6.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V^{π_k}_{k+1}(x') ]
7. until ||V^{π_{k+1}}_{k+1} − V^{π_k}_k|| < Δ
8. return π
Policy Iteration

Algorithm: Policy Iteration (Temporal Difference)
Input: r, p, γ, Δ
Output: π
1. V(x), π(x) ← initialize arbitrarily for all x ∈ X
2. repeat
3.   given x_0, simulate π_k and store 𝒟 = {x_i, π_k(x_i), r(x_i, π_k(x_i))}
     1) Policy Evaluation
4.   for x_i ∈ 𝒟 with fixed π_k do      (instead of sweeping over all x ∈ X)
5.     V^{π_k}_{k+1}(x_i) ← (1 − α_k) V^{π_k}_k(x_i) + α_k [ r(x_i, π_k(x_i)) + γ V^{π_k}_k(x_{i+1}) ]
       (a sampled update replacing the exact backup V^{π_k}_{k+1} ← T^π V^{π_k}_k)
6.   until ||V^{π_k}_{k+1} − V^{π_k}_k|| < Δ
     2) Policy Improvement
7.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V^{π_k}_{k+1}(x') ]
8. until ||V^{π_{k+1}}_{k+1} − V^{π_k}_k|| < Δ
9. return π
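A minimal sketch (not from the slides) of the sampled policy-evaluation update in step 5, written for a stored trajectory of (state, reward, next-state) tuples; V, alpha, and the trajectory format are placeholders.

```python
def td0_evaluation(V, trajectory, gamma, alpha):
    """TD(0) policy evaluation on a stored rollout; V is a table (array or dict)."""
    for x, r, x_next in trajectory:
        target = r + gamma * V[x_next]              # one-sample estimate of (T^pi V)(x)
        V[x] = (1 - alpha) * V[x] + alpha * target  # stochastic-approximation update
    return V
```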
Policy Iteration

Algorithm: Policy Iteration (SARSA)
Input: r, p, γ, Δ
Output: π
1. Q(x, u), π(x) ← initialize arbitrarily for all x ∈ X, u ∈ U
2. repeat
3.   given x_0, simulate π_k and store 𝒟 = {x_i, π_k(x_i), r(x_i, π_k(x_i))}
     1) Policy Evaluation
4.   for x_i ∈ 𝒟 with fixed π_k do
5.     Q^{π_k}_{k+1}(x_i, u_i) ← (1 − α_k) Q^{π_k}_k(x_i, u_i) + α_k [ r(x_i, π_k(x_i)) + γ Q^{π_k}_k(x_{i+1}, π_k(x_{i+1})) ]
       (a sampled update replacing the exact backup Q^{π_k}_{k+1} ← T^π Q^{π_k}_k)
6.   until ||Q^{π_k}_{k+1} − Q^{π_k}_k|| < Δ
     2) Policy Improvement
7.   π_{k+1}(x) ∈ argmin_{u ∈ U(x)} Q^{π_k}_{k+1}(x, u)
8. until ||Q^{π_{k+1}}_{k+1} − Q^{π_k}_k|| < Δ
9. return π
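A minimal sketch (not from the slides) of the SARSA evaluation update in step 5, assuming the stored rollout provides on-policy tuples (x_i, u_i, r_i, x_{i+1}, u_{i+1}); Q is a dict keyed by (state, action), and all names are placeholders.

```python
def sarsa_evaluation(Q, transitions, gamma, alpha):
    """On-policy Q evaluation; u_next is the action pi_k(x_{i+1}) taken in the rollout."""
    for x, u, r, x_next, u_next in transitions:
        target = r + gamma * Q[(x_next, u_next)]        # one-sample estimate of (T^pi Q)(x, u)
        Q[(x, u)] = (1 - alpha) * Q[(x, u)] + alpha * target
    return Q
```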
Value Iteration

Algorithm: Value Iteration (Classical Dynamic Programming)
Input: r, p, γ, Δ
Output: π*
1. V(x) ← initialize arbitrarily for all x ∈ X
2. repeat
3.   given x_0, simulate and store 𝒟 = {x_i, r(x_i, u_i)}   (data collection, used by the Q-learning variant below)
4.   for all x ∈ X do
5.     V_{k+1}(x) ← T* V_k
6. until ||V_{k+1} − V_k|| < Δ
7. π*(x) ∈ argmin_{u ∈ U(x)} [ r(x, u) + γ Σ_{x' ∈ X} p(x' | x, u) V_k(x') ]
8. return π*
Value Iteration

Algorithm: Value Iteration (Q-learning)
Input: r, p, γ, Δ
Output: π*
1. Q(x, u) ← initialize arbitrarily for all x ∈ X, u ∈ U
2. repeat
3.   given x_0, simulate and store 𝒟 = {(x_i, u_i, r(x_i, u_i), x_{i+1})}
4.   for (x_i, u_i, r(x_i, u_i), x_{i+1}) ∈ 𝒟 do      (instead of sweeping over all x ∈ X)
5.     Q_{k+1}(x_i, u_i) ← (1 − α_k) Q_k(x_i, u_i) + α_k [ r(x_i, u_i) + γ min_{u' ∈ U} Q_k(x_{i+1}, u') ]
       (a sampled update replacing the exact backup Q_{k+1}(x, u) ← T* Q_k)
6. until ||Q_{k+1} − Q_k|| < Δ
7. π*(x) ∈ argmin_{u ∈ U(x)} Q_k(x, u)
8. return π*
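A minimal sketch (not from the slides) of the tabular Q-learning update in step 5, written for cost minimization as in the slides; Q is a 2-D array indexed by [state, action] and the remaining names are placeholders.

```python
import numpy as np

def q_learning_update(Q, x, u, r, x_next, gamma, alpha):
    """One Q-learning backup on a single transition (x, u, r, x_next)."""
    target = r + gamma * np.min(Q[x_next])          # one-sample estimate of (T* Q)(x, u)
    Q[x, u] = (1 - alpha) * Q[x, u] + alpha * target
    return Q
```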
Approximation in Value Space

Policy evaluation with sampled Bellman backups:
• $T^{\pi}$ applied to $V^{\pi}$: TD(0), TD(λ)
• $T^{\pi}$ applied to $Q^{\pi}$: SARSA(0), SARSA(λ)
• $T^{*}$ applied to $Q$: Q-learning
Large Scale RL

Large-scale RL: the number of states is greater than $2^{200}$ (for a 10 × 20 board) → "Function approximation"
(Figure: approximation landscape, as before. Value space: TD, SARSA, Q-learning, function approximation; Policy space: policy search, policy gradient; Actor-Critic: DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …; Approximate expectation: Monte-Carlo search, certainty equivalence)
Approximate Dynamic Programming

Direct method (gradient methods)

$\min_{\theta} \sum_{i=1}^{N} \left( V^{\pi}(x_i) - \hat{V}(x_i; \theta) \right)^2 \;\approx\; \min_{\theta} \sum_{x_i \in X} \sum_{m=1}^{M} \left( J(x_i, m) - \hat{V}(x_i; \theta) \right)^2$

where
• $\hat{V}(x; \theta)$: approximated value function (e.g. polynomial approximation, neural network, etc.)
• $V^{\pi}(x)$: state-value function
• $J(x_i, m)$: $m$-th sample of the cost function at $x_i$, $m = 1, 2, \dots, M$

Gradient-descent update:
$\theta_{k+1} = \theta_k - \eta \sum_{x_i \in X} \sum_{m=1}^{M} \nabla \hat{V}(x_i; \theta) \left( \hat{V}(x_i; \theta) - J(x_i, m) \right)$
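A hedged sketch (not from the slides) of the direct method for a linear approximation V_hat(x; theta) = phi(x)^T theta, for which the gradient is simply phi(x); the feature matrix, sampled costs J, step size eta, and iteration count are placeholders, and a neural network would replace the linear model.

```python
import numpy as np

def fit_value_direct(phi, J, eta=0.01, iters=1000):
    """phi: (n_samples, n_features) feature matrix; J: (n_samples,) sampled costs."""
    theta = np.zeros(phi.shape[1])
    for _ in range(iters):
        v_hat = phi @ theta                      # V_hat(x_i; theta) for every sample
        grad = phi.T @ (v_hat - J) / len(J)      # gradient of the mean squared error
        theta -= eta * grad                      # theta_{k+1} = theta_k - eta * grad
    return theta
```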
Approximate Dynamic Programming

Indirect method (projected equation)

Solve the projected Bellman equation: $\Phi\theta = \Pi T(\Phi\theta)$

(Figure: Direct method: project the value function $J$ onto the subspace $S = \{\Phi\theta \mid \theta \in \mathbb{R}^s\}$, giving $\Pi J$. Indirect method: find $\Phi\theta \in S$ satisfying $\Phi\theta = \Pi T(\Phi\theta)$, i.e. the projection of $T(\Phi\theta)$ back onto $S$.)
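A hedged sketch (not from the slides) of one standard way to solve the projected equation for a fixed policy with linear features: LSTD(0), which replaces $\Phi\theta = \Pi T^{\pi}(\Phi\theta)$ by the sampled linear system A theta = b. The transition format and feature vectors are placeholders.

```python
import numpy as np

def lstd0(transitions, gamma, n_features):
    """transitions: iterable of (phi_x, r, phi_x_next) feature-vector tuples from one policy."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)   # sampled A = sum phi (phi - gamma phi')^T
        b += phi * r                                 # sampled b = sum phi r
    return np.linalg.solve(A, b)                     # theta solving the (sampled) projected equation
```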
Function Approximation and Policy Evaluation

Policy evaluation (tabular):
• $T^{\pi}$ on $V^{\pi}$: TD(0), TD(λ)
• $T^{\pi}$ on $Q^{\pi}$: SARSA(0), SARSA(λ)
• $T^{*}$ on $Q$: Q-learning

With function approximation, direct (gradient methods):
• $V^{\pi}$: TD
• $Q^{\pi}$: SARSA
• $Q$: DQN

With function approximation, indirect (projected DP):
• $V^{\pi}$: TD, LSTD
• $Q^{\pi}$: TD, LSTD
• $Q$: LSPE
Summary

RL is a toolbox for solving the infinite-horizon, discrete-time DP problem
$\inf_{\pi \in \Pi} E_{\pi}\left[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})] \right]$

• Approximation in value space: approximate $E[\cdot]$ (Monte-Carlo search, certainty equivalence), parametric approximation, problem approximation, rollout, MPC; TD, SARSA, Q-learning with function approximation
• Approximation in policy space: policy search, policy gradient
• Actor-Critic (combining both): DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
• Policy evaluation: TD(0)/TD(λ) for $V^{\pi}$, SARSA(0)/SARSA(λ) for $Q^{\pi}$, Q-learning for $Q$
• With function approximation: direct (gradient) methods (TD, SARSA, DQN) and indirect (projected DP) methods (TD, LSTD, LSPE)
Q&A
Thank you