Lecture	2:	Sampling-based	Approximations
And
Function	Fitting
Yan	(Rocky)	Duan
Berkeley	AI	Research	Lab
Many	slides	made	with	John	Schulman,	Xi	(Peter)	Chen	and	Pieter	Abbeel
n Optimal Control = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
Quick	One-Slide	Recap
n Exact	Methods:
n Value	Iteration
n Policy	Iteration
Limitations:
• Update equations require access to dynamics model
• Iteration over / Storage for all states and actions: requires small, discrete state-action space
->	sampling-based	approximations
->	Q/V	function	fitting
n Q	Value	Iteration
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Sampling-Based	Approximation
Recap	Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally
Bellman Equation:
$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$
Q-Value Iteration:
$$Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a')\right]$$
(Tabular) Q-Learning
n Q-value iteration:
$$Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$$
n Rewrite as expectation:
$$Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)}\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$$
n (Tabular) Q-Learning: replace the expectation by samples
n For a state-action pair (s,a), receive a sample next state: $s' \sim P(s'|s,a)$
n Consider your old estimate: $Q_k(s,a)$
n Consider your new sample estimate: $\text{target}(s') = R(s,a,s') + \gamma \max_{a'} Q_k(s',a')$
n Incorporate the new estimate into a running average:
$$Q_{k+1}(s,a) \leftarrow (1-\alpha)\,Q_k(s,a) + \alpha\,[\text{target}(s')]$$
(Tabular) Q-Learning
Algorithm:
Start with $Q_0(s,a)$ for all s, a.
Get initial state s
For k = 1, 2, … till convergence:
    Sample action a, get next state s'
    If s' is terminal:
        $\text{target} = R(s,a,s')$
        Sample new initial state s'
    else:
        $\text{target} = R(s,a,s') + \gamma \max_{a'} Q_k(s',a')$
    $Q_{k+1}(s,a) \leftarrow (1-\alpha)\,Q_k(s,a) + \alpha\,[\text{target}]$
    $s \leftarrow s'$
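A minimal Python sketch of the loop above (not from the slides), assuming a Gym-style environment with integer-indexed discrete states and actions whose step(a) returns (s', r, done); the ε-greedy action choice anticipates the next slide.

import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       alpha=0.5, gamma=0.99, epsilon=0.1, n_steps=100_000):
    """Tabular Q-learning sketch: Q[s, a] <- (1 - alpha) * Q[s, a] + alpha * target."""
    Q = np.zeros((n_states, n_actions))    # start with Q_0(s, a) for all s, a
    s = env.reset()                        # get initial state s
    for _ in range(n_steps):
        # sample an action (epsilon-greedy, see next slide), observe reward and next state
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        if done:                           # s' is terminal
            target = r
            s_next = env.reset()           # sample new initial state
        else:
            target = r + gamma * np.max(Q[s_next])
        # incorporate the new sample estimate into a running average
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next
    return Q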
How to sample actions?
n Choose random actions?
n Choose the action that maximizes $Q_k(s,a)$ (i.e. act greedily)?
n ɛ-Greedy: choose a random action with prob. ɛ, otherwise choose the action greedily
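A minimal sketch of the ε-greedy rule, assuming a tabular Q stored as a NumPy array indexed by [state, action]:

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With prob. epsilon take a random action, otherwise act greedily w.r.t. Q_k(s, .)."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[s]))               # exploit: argmax_a Q_k(s, a)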
n Amazing	result:	Q-learning	converges	to	optimal	policy	--
even	if	you’re	acting	suboptimally!
n This	is	called	off-policy	learning
n Caveats:
n You	have	to	explore	enough
n You	have	to	eventually	make	the	learning	rate
small	enough
n …	but	not	decrease	it	too	quickly
Q-Learning	Properties
n Technical	requirements.	
n All	states	and	actions	are	visited	infinitely	often
n Basically,	in	the	limit,	it	doesn’t	matter	how	you	select	actions	(!)
n Learning	rate	schedule	such	that	for	all	state	and	action	
pairs	(s,a):
Q-Learning	Properties
For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6(6), November 1994.
$$\sum_{t=0}^{\infty} \alpha_t(s,a) = \infty \qquad\qquad \sum_{t=0}^{\infty} \alpha_t^2(s,a) < \infty$$
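One schedule that satisfies both conditions (a standard choice, not stated on the slide) is to set the learning rate for (s,a) to the inverse of its visit count $N_t(s,a)$; summing over the steps at which (s,a) is actually updated gives
$$\alpha_t(s,a) = \frac{1}{N_t(s,a)} \quad\Rightarrow\quad \sum_{t=0}^{\infty} \alpha_t(s,a) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2(s,a) = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$$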
Q-Learning	Demo:	Gridworld
• States:	11	cells
• Actions:	{up,	down,	left,	right}
• Deterministic	transition	function
• Learning	rate:	0.5
• Discount:	1
• Reward:	+1	for	getting	diamond,	-1	for	falling	into	trap
Q-Learning	Demo:	Crawler
• States:	discretized	value	of	2d	state:	(arm	angle,	hand	angle)
• Actions:	Cartesian	product	of	{arm	up,	arm	down}	and	{hand	up,	hand	down}
• Reward:	speed	in	the	forward	direction
Sampling-Based	Approximation
n Q Value Iteration -> (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Value Iteration w/ Samples?
n Value Iteration:
$$V^*_{i+1}(s) \leftarrow \max_a \; \mathbb{E}_{s' \sim P(s'|s,a)}\left[R(s,a,s') + \gamma V^*_i(s')\right]$$
n Unclear how to draw samples through the max…
n Q Value Iteration -> (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Sampling-Based	Approximation
Recap: Policy Iteration
One iteration of policy iteration:
n Policy evaluation for current policy $\pi_k$:
n Iterate until convergence:
$$V^{\pi_k}_{i+1}(s) \leftarrow \mathbb{E}_{s' \sim P(s'|s,\pi_k(s))}\left[R(s,\pi_k(s),s') + \gamma V^{\pi_k}_i(s')\right]$$
-> Can be approximated by samples. This is called Temporal Difference (TD) Learning.
n Policy improvement: find the best action according to one-step look-ahead:
$$\pi_{k+1}(s) \leftarrow \arg\max_a \; \mathbb{E}_{s' \sim P(s'|s,a)}\left[R(s,a,s') + \gamma V^{\pi_k}(s')\right]$$
-> Unclear what to do with the max (for now).
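A minimal Python sketch of sample-based policy evaluation (TD(0)) for a fixed policy, under the same assumed Gym-style interface as the Q-learning sketch above:

import numpy as np

def td0_policy_evaluation(env, policy, n_states,
                          alpha=0.1, gamma=0.99, n_steps=100_000):
    """TD(0): V(s) <- V(s) + alpha * (R(s, pi(s), s') + gamma * V(s') - V(s))."""
    V = np.zeros(n_states)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)                        # follow the current policy pi_k
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])      # sample-based approximation of the expectation
        s = env.reset() if done else s_next
    return V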
n Q Value Iteration -> (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy Evaluation -> (Tabular) TD-learning
n Policy	Improvement	(for	now)
Sampling-Based	Approximation
n Optimal Control = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
Quick	One-Slide	Recap
n Exact	Methods:
n Value	Iteration
n Policy	Iteration
Limitations:
• Update equations require access to dynamics model
• Iteration over / Storage for all states and actions: requires small, discrete state-action space
->	sampling-based	approximations
->	Q/V	function	fitting
Can tabular methods scale?
n Discrete environments:
• Tetris: 10^60
• Atari: 10^308 (ram), 10^16992 (pixels)
• Gridworld: 10^1
Can tabular methods scale?
n Continuous environments (by crude discretization):
• Crawler: 10^2
• Hopper: 10^10
• Humanoid: 10^100
Generalizing	Across	States
n Basic	Q-Learning	keeps	a	table	of	all	q-values
n In	realistic	situations,	we	cannot	possibly	learn	
about	every	single	state!
n Too	many	states	to	visit	them	all	in	training
n Too	many	states	to	hold	the	q-tables	in	memory
n Instead,	we	want	to	generalize:
n Learn	about	some	small	number	of	training	states	from	
experience
n Generalize	that	experience	to	new,	similar	situations
n This	is	a	fundamental	idea	in	machine	learning,	and	
we’ll	see	it	over	and	over	again
Approximate Q-Learning
n Instead of a table, we have a parametrized Q function: $Q_\theta(s,a)$
n Can be a linear function in features:
$$Q_\theta(s,a) = \theta_0 f_0(s,a) + \theta_1 f_1(s,a) + \cdots + \theta_n f_n(s,a)$$
n Or a complicated neural net
n Learning rule:
n Remember: $\text{target}(s') = R(s,a,s') + \gamma \max_{a'} Q_{\theta_k}(s',a')$
n Update:
$$\theta_{k+1} \leftarrow \theta_k - \alpha \left. \nabla_\theta \left[\tfrac{1}{2}\left(Q_\theta(s,a) - \text{target}(s')\right)^2\right] \right|_{\theta=\theta_k}$$
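A minimal sketch of one approximate Q-learning update with a linear-in-features Q function; features(s, a), actions, and the transition tuple are illustrative placeholders, not an API from the slides:

import numpy as np

def approx_q_update(theta, features, actions, transition, gamma=0.99, alpha=0.01):
    """One semi-gradient step on 1/2 * (Q_theta(s, a) - target(s'))^2."""
    s, a, r, s_next, done = transition
    q_sa = theta @ features(s, a)                  # Q_theta(s, a) = theta . f(s, a)
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ features(s_next, a2) for a2 in actions)
    # d/dtheta of 1/2 (Q_theta - target)^2 is (Q_theta - target) * f(s, a), target held fixed
    return theta - alpha * (q_sa - target) * features(s, a)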
Connection to Tabular Q-Learning
n Suppose $\theta \in \mathbb{R}^{|S| \times |A|}$, $Q_\theta(s,a) \equiv \theta_{sa}$
n Plug into update:
$$\nabla_{\theta_{sa}} \left[\tfrac{1}{2}\left(Q_\theta(s,a) - \text{target}(s')\right)^2\right] = \nabla_{\theta_{sa}} \left[\tfrac{1}{2}\left(\theta_{sa} - \text{target}(s')\right)^2\right] = \theta_{sa} - \text{target}(s')$$
$$\theta_{sa} \leftarrow \theta_{sa} - \alpha\left(\theta_{sa} - \text{target}(s')\right) = (1-\alpha)\,\theta_{sa} + \alpha\,[\text{target}(s')]$$
n Compare with Tabular Q-Learning update:
$$Q_{k+1}(s,a) \leftarrow (1-\alpha)\,Q_k(s,a) + \alpha\,[\text{target}(s')]$$
n state: naïve board configuration + shape of the falling piece: ~10^60 states!
n action:	rotation	and	translation	applied	to	the	falling	piece
n 22	features	aka	basis	functions	
n Ten	basis	functions,	0,	.	.	.	,	9,	mapping	the	state	to	the	height	h[k]	of	each	column.
n Nine	basis	functions,	10,	.	.	.	,	18,	each	mapping	the	state	to	the	absolute	difference	
between	heights	of	successive	columns:	|h[k+1]	−	h[k]|,	k	=	1,	.	.	.	,	9.
n One	basis	function,	19,	that	maps	state	to	the	maximum	column	height:	maxk h[k]
n One	basis	function,	20,	that	maps	state	to	the	number	of	‘holes’	in	the	board.
n One	basis	function,	21,	that	is	equal	to	1	in	every	state.
[Bertsekas &	Ioffe,	1996	(TD);	Bertsekas &	Tsitsiklis 1996	(TD);	Kakade 2002	(policy	gradient);	Farias &	Van	Roy,	2006	(approximate	LP)]
$$\hat{V}(s) = \sum_{i=0}^{21} \theta_i \phi_i(s) = \theta^\top \phi(s)$$
Engineered	Approximation	Example:	Tetris
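A sketch of how the 22 basis functions could be computed, assuming the board is given as a binary NumPy array of shape (n_rows, 10) with row 0 at the top; the exact encoding used by Bertsekas & Ioffe may differ:

import numpy as np

def tetris_features(board):
    """22 basis functions phi_0 .. phi_21 for a binary Tetris board (rows x 10 columns)."""
    n_rows, n_cols = board.shape
    heights = np.zeros(n_cols, dtype=int)
    holes = 0
    for k in range(n_cols):
        filled = np.flatnonzero(board[:, k])
        if filled.size > 0:
            heights[k] = n_rows - filled[0]                      # column height h[k]
            holes += int(np.sum(board[filled[0]:, k] == 0))      # empty cells below the column top
    phi = np.concatenate([
        heights,                              # phi_0..phi_9: column heights
        np.abs(np.diff(heights)),             # phi_10..phi_18: |h[k+1] - h[k]|
        [heights.max(), holes, 1.0],          # phi_19: max height, phi_20: holes, phi_21: constant
    ])
    return phi                                # V_hat(s) = theta @ phi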
Deep	Reinforcement	Learning
Pong Enduro Beamrider Q*bert
• From	pixels	to	actions
• Same	algorithm	(with	effective	tricks)
• CNN	function	approximator,	w/	3M	free	parameters
n We	have	now	covered	enough	materials	for	Lab	1.
n Will	be	released	on	Piazza	by	this	afternoon.
n Covers	value	iteration,	policy	iteration,	and	tabular	Q-learning.
Lab	1
n The	bad:	it	is	not	guaranteed	to	converge…
n Even	if	the	function	approximation	is	expressive	enough	to	
represent	the	true	Q	function
Convergence	of	Approximate	Q-Learning
Simple Example**
• Two states x1, x2; all rewards are r = 0
• Function approximator: $\hat{V} = [1\;\;2]\,\theta$, i.e. value $\theta$ for x1 and $2\theta$ for x2
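A minimal numerical sketch of the blow-up, assuming (beyond what the slide shows) a deterministic transition x1 -> x2 and a semi-gradient TD update applied at x1; each update multiplies θ by (1 + α(2γ − 1)), so for γ > 0.5 the parameter diverges instead of converging to the true value 0:

# Hypothetical dynamics (not on the slide): x1 -> x2 deterministically, reward 0.
# Approximator V_hat = [1 2] * theta, i.e. V(x1) = theta, V(x2) = 2 * theta.
alpha, gamma = 0.1, 0.99
theta = 1.0
for _ in range(100):
    td_error = 0.0 + gamma * (2 * theta) - theta   # r + gamma * V(x2) - V(x1)
    theta += alpha * td_error * 1.0                # semi-gradient step at x1; dV(x1)/dtheta = 1
print(theta)  # grows geometrically instead of converging to 0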
n Definition. An operator G is a non-expansion with respect to a norm || . || if $\|GQ - G\bar{Q}\| \le \|Q - \bar{Q}\|$ for all $Q, \bar{Q}$.
n Fact. If the operator F is a γ-contraction with respect to a norm || . || and the operator G is a non-expansion with respect to the same norm, then the sequential application of the operators G and F is a γ-contraction, i.e., $\|GFQ - GF\bar{Q}\| \le \gamma\,\|Q - \bar{Q}\|$ (a one-line argument is sketched after this slide).
n Corollary. If the supervised learning step is a non-expansion, then each iteration of value iteration with function approximation is a γ-contraction, and in this case we have a convergence guarantee.
Composing	Operators**
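The fact above follows in one line from the two definitions:
$$\|G(FQ) - G(F\bar{Q})\| \;\le\; \|FQ - F\bar{Q}\| \;\le\; \gamma\,\|Q - \bar{Q}\|$$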
n Examples:	
n nearest	neighbor	(aka	state	aggregation)
n linear	interpolation	over	triangles	
(tetrahedrons,	…)
Averager Function	Approximators Are	Non-Expansions**
Example	taken	from	Gordon,	1995
Linear Regression ☹ **
n I.e.,	if	we	pick	a	non-expansion	function	approximator which	can	approximate	
J*	well,	then	we	obtain	a	good	value	function	estimate.
n To	apply	to	discretization:	use	continuity	assumptions	to	show	that	J*	can	be	
approximated	well	by	chosen	discretization	scheme.
Guarantees	for	Fixed	Point**