RL chap 4
Function Value Approximation; On-Policy Approximation of Action Values; Off-Policy Approximation of Action Values.
To a large extent we need only combine reinforcement learning methods with existing generalization methods. The
kind of generalization we require is often called function approximation because it takes examples from a desired
function (e.g., a value function) and attempts to generalize from them to construct an approximation of the entire
function. Function approximation is an instance of supervised learning, the primary topic studied in machine learning,
artificial neural networks, and pattern recognition.
In reinforcement learning, Function Value Approximation is used when it's impractical to store exact values for every state
or state-action pair due to the vastness of the environment. Instead of using a table (which is only feasible for small
problems), we use an approximate function to estimate the value function V(s) or the action-value function Q(s,a).
There are two main types of value functions we might want to approximate:
• State-value function V(s): Predicts how good it is to be in a given state while following a policy.
• Action-value function Q(s,a): Predicts how good it is to take a specific action in a specific state.
The general idea is to represent the value function using a parameterized function, such as:
v^(s, w) ≈ vπ(s)
where w is a vector of weights that defines the function. This can be linear (e.g., v^(s, w) = w⊤x(s)) or nonlinear (e.g., a
neural network).
The goal is to tune the parameters w so that the approximation closely matches the true value function. The process
typically uses gradient descent or stochastic gradient descent to minimize a loss function that measures the error
between predicted and target values (such as the TD error in TD(0)).
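In symbols, the generic stochastic-gradient update takes the standard form (with Ut standing for whatever target is used, e.g., a Monte Carlo return or a TD target, and α the step size):
w ← w + α [ Ut - v^(St, w) ] ∇ v^(St, w)
Each update nudges the weights in the direction that most reduces the squared error between the prediction and the sampled target.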
Function approximation allows RL to scale to large or continuous state spaces, making it a cornerstone in modern RL
applications such as AlphaGo and deep Q-networks (DQNs).
The goal remains the same: estimate the state-value function vπ(s) under a given policy π, but now with function
approximation. Instead of using a lookup table (which doesn't scale), we approximate vπ(s) with a parameterized
function v^(s, w).
The number of possible states in most real-world problems is too large (or infinite). Function approximators (such as neural
networks or decision trees) generalize from seen states to unseen ones. Updating one parameter affects many state
estimates, enabling learning even from sparse data.
A backup is an operation that shifts the estimate v^(s, w) toward a target value based on experience.
Examples include the Monte Carlo return Gt and the TD(0) target Rt+1 + γ v^(St+1, w).
Function approximators learn from these examples, using methods like gradient descent to minimize the difference
between the predicted value and the backup target, and sophisticated architectures (e.g., deep networks) that can represent
complex mappings from state to value. This shifts reinforcement learning closer to supervised learning, where backups
play the role of labeled training data, even though they are generated from experience rather than from a fixed dataset.
In this setting:
• v^(s, w) is a smooth function of w, meaning it is differentiable and suitable for gradient-based updates.
• At each time step t, you see a training example (St, vπ(St)), where vπ(St) is the true value of state St under the policy π (for
now, assumed to be known or observed).
Learning Objective: minimize the error between the approximate values v^(s, w) and the true values vπ(s), weighted by how often each state is encountered under the policy.
A gradient method is a learning approach where the parameters of the value function are updated using the
true gradient of a loss function that compares predicted values with known targets.
It considers the entire loss function, including how the target depends on the weights.
A semi-gradient method is used when the target value depends on the current value function itself (as in
bootstrapping). In this case, the update uses only the part of the gradient that affects the prediction, and
ignores the gradient of the target.
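For concreteness, with the bootstrapped TD(0) target Rt+1 + γ v^(St+1, w), the semi-gradient update differentiates only the prediction:
w ← w + α [ Rt+1 + γ v^(St+1, w) - v^(St, w) ] ∇ v^(St, w)
A true gradient of the squared TD error would also carry the term -γ ∇ v^(St+1, w) contributed by the target; semi-gradient methods drop that term, which simplifies the update and still converges reliably in the on-policy linear case.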
Linear Function Approximation: The approximate value function v^(s, w) is linear in the parameter vector w.
Features x(s) represent states, and the approximation is a dot product: v^(s, w) = w⊤x(s). Gradient descent simplifies to
updating weights based on feature vectors. Linear methods are mathematically tractable and often converge to a global
optimum.
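As a minimal sketch (assuming a feature function that returns a NumPy vector for each state; the surrounding environment loop is not shown), one semi-gradient TD(0) update with a linear value function looks like this:

import numpy as np

def td0_linear_update(w, x_s, x_s_next, reward, done, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update for a linear value function v(s, w) = w @ x(s)."""
    v_s = w @ x_s                                # current prediction for S_t
    v_next = 0.0 if done else w @ x_s_next       # bootstrapped value; zero beyond a terminal state
    td_error = reward + gamma * v_next - v_s     # TD(0) error
    return w + alpha * td_error * x_s            # gradient of a linear v is just the feature vector x(s)

Because the gradient of a linear v^(s, w) is simply x(s), the whole update reduces to adding a scaled copy of the feature vector to the weights.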
Convergence of Linear TD(λ): Linear TD(λ) converges with decreasing step sizes (under the standard conditions). The asymptotic
error is bounded by (1 - γλ)/(1 - γ) times the best possible error (where γ is the discount factor). Convergence requires backups
under the on-policy state distribution; other distributions may cause divergence.
Advantages of Linear Methods: Efficient in computation and data usage. Performance depends heavily on feature
selection, which encodes domain knowledge. Features should capture task-relevant generalizations (e.g., interactions
represented via conjunctions of features).
Coarse Coding
• Represents states with binary features that have overlapping receptive fields (e.g., circles in state space); a state activates every feature whose field contains it.
Tile Coding
• A form of coarse coding in which the receptive fields are grouped into partitions (tilings) of the state space, each offset from the others (a minimal sketch follows this list).
• Advantages: Fixed number of active features per state (one per tiling); efficient computation (summing the weights of the
active tiles).
• Generalization is controlled by tile shape and offset (Figure 9.5).
• Centers and widths can be adapted for a better fit (but this is more computationally complex).
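A minimal sketch of tile coding for a one-dimensional state, assuming evenly offset tilings (real implementations often add hashing to bound memory):

def active_tiles(state, num_tilings=8, tiles_per_tiling=10, low=0.0, high=1.0):
    """Return the index of the single active tile in each tiling for a scalar state in [low, high]."""
    tile_width = (high - low) / tiles_per_tiling
    indices = []
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings          # each tiling is shifted by a fraction of a tile
        idx = int((state - low + offset) / tile_width)
        idx = min(idx, tiles_per_tiling)               # clamp the extra tile created by the offset
        indices.append(t * (tiles_per_tiling + 1) + idx)  # flatten (tiling, tile) into one feature index
    return indices

The approximate value of a state is then just the sum of the weights stored at these indices, which is the "summing weights of active tiles" computation mentioned above.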
Kanerva Coding
• Features activate based on similarity (e.g., Hamming distance for binary states).
4.2 ON POLICY APPROXIMATION OF ACTION VALUES:
Goal: Adapt value prediction methods like TD(λ) and Monte Carlo to control problems using action-value functions.
Approach: Use Generalized Policy Iteration (GPI), combining action-value approximation with policy improvement. Training
examples shift from (state, value) pairs to (state, action, action-value) triples.
Update Rule: Action-values are updated using a gradient descent approach, with Sarsa(λ) using eligibility traces for
backward updates.
• Discrete Actions: Action-values are computed for all actions, with greedy or ε-greedy selection.
• On-Policy: Uses ε-greedy for exploration and policy improvement, updating weights with linear function
approximation.
• Off-Policy: Uses an arbitrary behavior policy but learns the greedy policy, clearing traces for non-selected actions.
5. Eligibility Traces with Function Approximation
• Replacing Traces: For binary features, traces are reset to 1 on revisit (rather than incremented) for stability; see the sketch after the example below.
• Task (Mountain Car): Drive an underpowered car up a steep hill using strategic back-and-forth motion.
• Results: Sarsa(λ) with replacing traces learned a near-optimal policy in roughly 100 episodes. Optimistic initialization
encouraged exploration even without ε-greedy.
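A minimal sketch of one Sarsa(λ) step with linear action values over binary features (e.g., tile-coding indices for the chosen state-action pair) and replacing traces; the feature function, ε-greedy action selection, and optimistic initialization are assumed and not shown:

import numpy as np

def sarsa_lambda_step(w, z, active_idx, next_active_idx, reward, done,
                      alpha=0.1, gamma=1.0, lam=0.9):
    """One semi-gradient Sarsa(lambda) update with binary features and replacing traces.
    active_idx: feature indices active for (S_t, A_t);
    next_active_idx: feature indices active for (S_{t+1}, A_{t+1}) chosen by the policy."""
    q_sa = w[active_idx].sum()                         # q(S_t, A_t, w) is a sum of active-feature weights
    q_next = 0.0 if done else w[next_active_idx].sum()
    delta = reward + gamma * q_next - q_sa             # Sarsa TD error
    z *= gamma * lam                                   # decay every trace
    z[active_idx] = 1.0                                # replacing traces: reset to 1 instead of accumulating
    w += alpha * delta * z                             # update all weights with nonzero traces
    return w, z

The off-policy variant described above differs mainly in how traces are cleared for non-selected actions.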
• Non-Bootstrapping Methods: These methods tend to have lower asymptotic error and can be implemented
online using eligibility traces. However, empirical comparisons show that bootstrapping methods usually perform
better in practice.
• Bootstrapping Methods: Despite having higher asymptotic error, bootstrapping methods often perform much
better in real-world tasks. This is due to their ability to learn faster and more effectively in most scenarios.
• Using TD methods with eligibility traces, varying λ from 0 (pure bootstrapping) to 1 (pure nonbootstrapping),
reveals that performance worsens as the method approaches the nonbootstrapping case (λ = 1).
• Empirical Results: The performance of bootstrapping methods is consistently better across tasks such as
MountainCar, Puddle World, and Random Walk, as shown in Figure 9.12. Nonbootstrapping methods typically
perform worse before reaching their asymptote.
3. Understanding the Performance: The reason bootstrapping methods outperform nonbootstrapping methods is still
unclear, but it could be due to their faster learning or their ability to converge to a better policy. Nonbootstrapping
methods may reduce the RMSE from the true value function, but that does not necessarily correlate with finding the optimal
policy; bootstrapping methods may be learning something more beneficial than just minimizing RMSE.
4.3 OFF POLICY APPROXIMATION OF ACTION VALUES:
Off-policy reinforcement learning becomes more challenging when moving from tabular to approximate settings. Semi-
gradient methods, while stable in tabular settings, may not converge in approximate cases due to the complex interaction
between the target and behavior policies. The main issue is the mismatch between the update distribution and the
on-policy distribution, which is critical for stability. Two remedies are considered:
1. Importance Sampling: Adjusts the update distribution to match the on-policy distribution.
2. True Gradient Methods: Provide more robust solutions without relying on a specific distribution.
Semi-gradient Methods
Semi-gradient methods extend off-policy algorithms to function approximation. While simple and commonly used, they
may diverge in complex settings. For example, the TD(0) algorithm updates the weight vector based on the semi-gradient
of the value function, using the importance sampling ratio to adjust the weights at each step.
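A minimal sketch of that per-step update, assuming linear features and that the probabilities of the taken action under the target policy and the behavior policy are available:

import numpy as np

def off_policy_td0_update(w, x_s, x_s_next, reward, pi_prob, b_prob,
                          alpha=0.01, gamma=0.99):
    """Off-policy semi-gradient TD(0) with a per-step importance-sampling ratio."""
    rho = pi_prob / b_prob                             # importance-sampling ratio for the action taken
    td_error = reward + gamma * (w @ x_s_next) - (w @ x_s)
    return w + alpha * rho * td_error * x_s            # the ratio rescales the whole semi-gradient step

Even though each step is a small correction, the sequence of such updates can diverge when the behavior distribution is far from the on-policy distribution, which is exactly what Baird's counterexample below demonstrates.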
Baird’s Counterexample
Baird’s counterexample shows that semi-gradient methods can fail to converge, even with small step sizes. In this
example, using linear function approximation with TD(0) leads to divergence because the update distribution does not
match the on-policy distribution. However, using on-policy updates guarantees convergence.
Off-policy learning with function approximation faces challenges related to update distributions and convergence. While
semi-gradient methods work in tabular cases, they may not converge in more complex settings, as shown by Baird’s
counterexample. Methods like importance sampling and true gradient approaches are being explored to address these
issues.
Actor-Critic Methods in reinforcement learning combine the strengths of value-based and policy-based methods by
separating the learning process into two components: the actor and the critic.
Actor and Critic Components: Actor-Critic methods involve two components: the actor, which selects actions based on
the current state, and the critic, which evaluates those actions using the TD error to guide the actor's decision-making.
On-Policy Learning: The learning process is on-policy; the critic evaluates the actor's current actions, and the TD error
helps the actor improve its policy continuously.
Efficient Action Selection: Actor-Critic methods are computationally efficient, particularly in large or continuous action
spaces, as they minimize the computation needed to select actions.
Stochastic Policy Learning: These methods can learn stochastic policies, enabling a balance between exploration and
exploitation, which is useful in complex or competitive environments.
Iterative Feedback Loop: Actor-Critic methods rely on a feedback loop in which the critic evaluates actions, computes the
TD error, and updates both the actor and the critic, leading to continuous improvement of the policy over time.
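A minimal sketch of one actor-critic step, assuming a linear critic, a softmax (Gibbs) actor over linear action preferences, and feature vectors supplied by some feature function not shown here:

import numpy as np

def softmax_policy(theta, x_s):
    """Action probabilities from linear preferences h(s, a) = theta[a] @ x(s)."""
    prefs = theta @ x_s
    prefs = prefs - prefs.max()                    # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def actor_critic_step(w, theta, x_s, x_s_next, action, reward, done,
                      alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic: the critic's TD error drives both updates."""
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    delta = reward + gamma * v_next - v_s          # TD error computed by the critic
    w = w + alpha_w * delta * x_s                  # critic: semi-gradient TD(0) on state values
    probs = softmax_policy(theta, x_s)
    for a in range(theta.shape[0]):                # actor: gradient of log pi(action | s) w.r.t. theta[a]
        grad = ((1.0 if a == action else 0.0) - probs[a]) * x_s
        theta[a] = theta[a] + alpha_theta * delta * grad
    return w, theta

A positive TD error makes the selected action more likely; a negative one makes it less likely, which is the feedback loop described above.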
Eligibility Traces for Actor-Critic Methods: These enhance learning by allowing updates not just for the current state-action pair,
but also for past experiences that influenced the outcome.
Critic's Trace: The critic uses a TD(λ) approach, where λ determines how much past states influence the value-function
update, leading to faster learning by taking past experience into account.
Actor's Trace: The actor updates its policy using eligibility traces for state-action pairs, with traces decaying over time so
that recent experience is prioritized while earlier experience is still accounted for.
TD(λ) Algorithm: The TD(λ) method adjusts state-action values using eligibility traces, improving the policy and its
performance, especially in tasks that depend on sequences of states and rewards.
Decay Factor: The decay factor controls how strongly past actions influence learning, making updates more efficient.
Efficient Learning: By using eligibility traces, both the actor and critic update rules become more efficient, particularly in
complex environments.
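In one common formulation (with the same TD error δt driving both components; the trace parameters λw and λθ need not be equal), the two trace vectors are updated as:
zw ← γ λw zw + ∇ v^(St, w),    w ← w + αw δt zw
zθ ← γ λθ zθ + ∇ ln π(At | St, θ),    θ ← θ + αθ δt zθ
Setting both λ values to zero recovers the one-step updates sketched earlier.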
R-Learning and the Average-Reward Setting focus on optimizing the long-term average reward per time step, rather than
cumulative discounted rewards, making it ideal for continuous tasks.
Average-Reward Optimization: R-learning aims to maximize the average reward per time step rather than the total
discounted reward, especially in continuing tasks.
Value Functions: In R-learning, value functions are based on average reward rather than total reward, guiding the agent in
maximizing long-term average rewards.
Two Policies: The algorithm uses two policies: the behavior policy (for experience generation) and the estimation policy
(for selecting optimal actions).
Action-Value and Reward Function: The agent updates its action-value functions (Q-values) and average reward estimate
to maximize the average reward over time.
Off-Policy Control: R-learning is an off-policy method, similar to Q-learning, but designed for undiscounted environments,
making it useful for tasks like queuing systems where rewards are consistent across time steps.
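A minimal tabular sketch of the R-learning update, assuming the action-value table Q is a NumPy array indexed by (state, action), rho is the current average-reward estimate, and the action was chosen by some behavior policy (e.g., ε-greedy); with function approximation the table would be replaced by a parameterized q^:

import numpy as np

def r_learning_update(Q, rho, s, a, reward, s_next, alpha=0.1, beta=0.01):
    """One R-learning step: Q-values are corrected by (reward - rho) with no discounting,
    and rho is adjusted only when the action taken is greedy (the estimation policy)."""
    delta = reward - rho + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    if Q[s, a] == Q[s].max():        # the chosen action is greedy in state s
        rho += beta * (reward - rho + Q[s_next].max() - Q[s].max())
    return Q, rho

Because rho tracks the reward per time step, the quantity being maximized is the long-run reward rate rather than a discounted sum, matching the average-reward setting described above.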