RL chap 4
Function Value Approximation; On-Policy Approximation of Action Values; Off-Policy Approximation of Action Values.
To a large extent we need only combine reinforcement learning methods with existing generalization methods. The
kind of generalization we require is often called function approximation because it takes examples from a desired
function (e.g., a value function) and attempts to generalize from them to construct an approximation of the entire
function. Function approximation is an instance of supervised learning, the primary topic studied in machine learning,
artificial neural networks, and pattern recognition.
In reinforcement learning, Function Value Approximation is used when it's impractical to store exact values for every state
or state-action pair due to the vastness of the environment. Instead of using a table (which is only feasible for small
problems), we use an approximate function to estimate the value function V(s) or the action-value function Q(s,a).
There are two main types of value functions we might want to approximate:
• State-value function V(s): Predicts how good it is to be in a given state while following a policy.
• Action-value function Q(s,a): Predicts how good it is to take a specific action in a specific state.
The general idea is to represent the value function using a parameterized function, such as:
v^(s, w) ≈ vπ(s)
where w is a vector of weights that defines the function. This can be linear (e.g., v^(s, w) = w⊤x(s)) or nonlinear (e.g., a
neural network).
The goal is to tune the parameters w so that the approximation closely matches the true value function. The process
typically uses gradient descent or stochastic gradient descent to minimize a loss function that measures the error
between predicted and target values (such as the TD error in TD(0)).
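In symbols, the generic stochastic-gradient update takes the standard form (with Ut standing for whatever target is used, e.g., a Monte Carlo return or a TD target, and α the step size):
w ← w + α [ Ut - v^(St, w) ] ∇ v^(St, w)
Each update nudges the weights in the direction that most reduces the squared error between the prediction and the sampled target.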
Function approximation allows RL to scale to large or continuous state spaces, making it a cornerstone in modern RL
applications such as AlphaGo and deep Q-networks (DQNs).
The goal remains the same: estimate the state-value function vπ(s) under a given policy π, but now with function
approximation. Instead of using a lookup table (which doesn't scale), we approximate vπ(s) with a parameterized
function v^(s, w).
The number of possible states in most real-world problems is too large (or infinite). Function approximators (such as neural
networks or decision trees) generalize from seen states to unseen ones. Updating one parameter affects many state
estimates, enabling learning even from sparse data.
A backup is an operation that shifts the estimate v^(s, w) toward a target value based on experience.
Examples include the Monte Carlo return Gt and the TD(0) target Rt+1 + γ v^(St+1, w).
Function approximators learn from these examples, using methods like gradient descent to minimize the difference
between the predicted value and the backup target, and sophisticated architectures (e.g., deep networks) that can represent
complex mappings from state to value. This shifts reinforcement learning closer to supervised learning, where backups
play the role of labeled training data, even though they are generated from experience rather than from a fixed dataset.
In this setting:
• v^(s, w) is a smooth function of w, meaning it is differentiable and suitable for gradient-based updates.
• At each time step t, you see a training example (St, vπ(St)), where vπ(St) is the true value of state St under the policy π (for
now, assumed to be known or observed).
Learning Objective: minimize the error between the approximate values v^(s, w) and the true values vπ(s), weighted by how often each state is encountered under the policy.
A gradient method is a learning approach where the parameters of the value function are updated using the
true gradient of a loss function that compares predicted values with known targets.
It considers the entire loss function, including how the target depends on the weights.
A semi-gradient method is used when the target value depends on the current value function itself (as in
bootstrapping). In this case, the update uses only the part of the gradient that affects the prediction, and
ignores the gradient of the target.
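For concreteness, with the bootstrapped TD(0) target Rt+1 + γ v^(St+1, w), the semi-gradient update differentiates only the prediction:
w ← w + α [ Rt+1 + γ v^(St+1, w) - v^(St, w) ] ∇ v^(St, w)
A true gradient of the squared TD error would also carry the term -γ ∇ v^(St+1, w) contributed by the target; semi-gradient methods drop that term, which simplifies the update and still converges reliably in the on-policy linear case.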
Linear Function Approximation: The approximate value function v^(s, w) is linear in the parameter vector w.
Features x(s) represent states, and the approximation is a dot product: v^(s, w) = w⊤x(s). Gradient descent simplifies to
updating weights based on feature vectors. Linear methods are mathematically tractable and often converge to a global
optimum.
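As a minimal sketch (assuming a feature function that returns a NumPy vector for each state; the surrounding environment loop is not shown), one semi-gradient TD(0) update with a linear value function looks like this:

import numpy as np

def td0_linear_update(w, x_s, x_s_next, reward, done, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update for a linear value function v(s, w) = w @ x(s)."""
    v_s = w @ x_s                                # current prediction for S_t
    v_next = 0.0 if done else w @ x_s_next       # bootstrapped value; zero beyond a terminal state
    td_error = reward + gamma * v_next - v_s     # TD(0) error
    return w + alpha * td_error * x_s            # gradient of a linear v is just the feature vector x(s)

Because the gradient of a linear v^(s, w) is simply x(s), the whole update reduces to adding a scaled copy of the feature vector to the weights.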
Convergence of Linear TD(λ): Linear TD(λ) converges with decreasing step sizes (under the standard conditions). The asymptotic
error is bounded by (1 - γλ)/(1 - γ) times the best possible error (where γ is the discount factor). Convergence requires backups
under the on-policy state distribution; other distributions may cause divergence.
Advantages of Linear Methods: Efficient in computation and data usage. Performance depends heavily on feature
selection, which encodes domain knowledge. Features should capture task-relevant generalizations (e.g., interactions
represented via conjunctions of features).
Coarse Coding
• Represents states with binary features that have overlapping receptive fields (e.g., circles in state space); a state activates every feature whose field contains it.
Tile Coding
• A form of coarse coding in which the receptive fields are grouped into partitions (tilings) of the state space, each offset from the others (a minimal sketch follows this list).
• Advantages: Fixed number of active features per state (one per tiling); efficient computation (summing the weights of the
active tiles).
• Generalization is controlled by tile shape and offset (Figure 9.5).
• Centers and widths can be adapted for a better fit (but this is more computationally complex).
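A minimal sketch of tile coding for a one-dimensional state, assuming evenly offset tilings (real implementations often add hashing to bound memory):

def active_tiles(state, num_tilings=8, tiles_per_tiling=10, low=0.0, high=1.0):
    """Return the index of the single active tile in each tiling for a scalar state in [low, high]."""
    tile_width = (high - low) / tiles_per_tiling
    indices = []
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings          # each tiling is shifted by a fraction of a tile
        idx = int((state - low + offset) / tile_width)
        idx = min(idx, tiles_per_tiling)               # clamp the extra tile created by the offset
        indices.append(t * (tiles_per_tiling + 1) + idx)  # flatten (tiling, tile) into one feature index
    return indices

The approximate value of a state is then just the sum of the weights stored at these indices, which is the "summing weights of active tiles" computation mentioned above.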
Kanerva Coding
• Features activate based on similarity (e.g., Hamming distance for binary states).
4.2 ON POLICY APPROXIMATION OF ACTION VALUES:
Goal: Adapt value prediction methods like TD(λ) and Monte Carlo to control problems using action-value functions.
Approach: Use Generalized Policy Iteration (GPI), combining action-value approximation with policy improvement. Training
examples shift from (state, value) pairs to (state, action, action-value) triples.
Update Rule: Action-values are updated using a gradient descent approach, with Sarsa(λ) using eligibility traces for
backward updates.
• Discrete Actions: Action-values are computed for all actions, with greedy or ε-greedy selection.
• On-Policy: Uses ε-greedy for exploration and policy improvement, updating weights with linear function
approximation.
• Off-Policy: Uses an arbitrary behavior policy but learns the greedy policy, clearing traces for non-selected actions.
5. Eligibility Traces with Function Approximation
• Replacing Traces: For binary features, traces are reset to 1 on revisit (rather than incremented) for stability; see the sketch after the example below.
• Task (Mountain Car): Drive an underpowered car up a steep hill using strategic back-and-forth motion.
• Results: Sarsa(λ) with replacing traces learned a near-optimal policy in roughly 100 episodes. Optimistic initialization
encouraged exploration even without ε-greedy.
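A minimal sketch of one Sarsa(λ) step with linear action values over binary features (e.g., tile-coding indices for the chosen state-action pair) and replacing traces; the feature function, ε-greedy action selection, and optimistic initialization are assumed and not shown:

import numpy as np

def sarsa_lambda_step(w, z, active_idx, next_active_idx, reward, done,
                      alpha=0.1, gamma=1.0, lam=0.9):
    """One semi-gradient Sarsa(lambda) update with binary features and replacing traces.
    active_idx: feature indices active for (S_t, A_t);
    next_active_idx: feature indices active for (S_{t+1}, A_{t+1}) chosen by the policy."""
    q_sa = w[active_idx].sum()                         # q(S_t, A_t, w) is a sum of active-feature weights
    q_next = 0.0 if done else w[next_active_idx].sum()
    delta = reward + gamma * q_next - q_sa             # Sarsa TD error
    z *= gamma * lam                                   # decay every trace
    z[active_idx] = 1.0                                # replacing traces: reset to 1 instead of accumulating
    w += alpha * delta * z                             # update all weights with nonzero traces
    return w, z

The off-policy variant described above differs mainly in how traces are cleared for non-selected actions.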
• Non-Bootstrapping Methods: These methods tend to have lower asymptotic error and can be implemented
online using eligibility traces. However, empirical comparisons show that bootstrapping methods usually perform
better in practice.
• Bootstrapping Methods: Despite having higher asymptotic error, bootstrapping methods often perform much
better in real-world tasks. This is due to their ability to learn faster and more effectively in most scenarios.
• Using TD methods with eligibility traces, varying λ from 0 (pure bootstrapping) to 1 (pure nonbootstrapping),
reveals that performance worsens as the method approaches the nonbootstrapping case (λ = 1).
• Empirical Results: The performance of bootstrapping methods is consistently better across tasks such as
MountainCar, Puddle World, and Random Walk, as shown in Figure 9.12. Nonbootstrapping methods typically
perform worse before reaching their asymptote.
3. Understanding the Performance: The reason bootstrapping methods outperform nonbootstrapping methods is still
unclear, but it could be due to their faster learning or their ability to converge to a better policy. Nonbootstrapping
methods may reduce the RMSE from the true value function, but that does not necessarily correlate with finding the optimal
policy; bootstrapping methods may be learning something more beneficial than just minimizing RMSE.
4.3 OFF POLICY APPROXIMATION OF ACTION VALUES:
Off-policy reinforcement learning becomes more challenging when moving from tabular to approximate settings. Semi-
gradient methods, while stable in tabular settings, may not converge in approximate cases due to the complex interaction
between the target and behavior policies. The main issue is the mismatch between the update distribution and the
on-policy distribution, which is critical for stability. Two remedies are considered:
1. Importance Sampling: Adjusts the update distribution to match the on-policy distribution.
2. True Gradient Methods: Provide more robust solutions without relying on a specific distribution.
Semi-gradient Methods
Semi-gradient methods extend off-policy algorithms to function approximation. While simple and commonly used, they
may diverge in complex settings. For example, the TD(0) algorithm updates the weight vector based on the semi-gradient
of the value function, using the importance sampling ratio to adjust the weights at each step.
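A minimal sketch of that per-step update, assuming linear features and that the probabilities of the taken action under the target policy and the behavior policy are available:

import numpy as np

def off_policy_td0_update(w, x_s, x_s_next, reward, pi_prob, b_prob,
                          alpha=0.01, gamma=0.99):
    """Off-policy semi-gradient TD(0) with a per-step importance-sampling ratio."""
    rho = pi_prob / b_prob                             # importance-sampling ratio for the action taken
    td_error = reward + gamma * (w @ x_s_next) - (w @ x_s)
    return w + alpha * rho * td_error * x_s            # the ratio rescales the whole semi-gradient step

Even though each step is a small correction, the sequence of such updates can diverge when the behavior distribution is far from the on-policy distribution, which is exactly what Baird's counterexample below demonstrates.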
Baird’s Counterexample
Baird’s counterexample shows that semi-gradient methods can fail to converge, even with small step sizes. In this
example, using linear function approximation with TD(0) leads to divergence because the update distribution does not
match the on-policy distribution. However, using on-policy updates guarantees convergence.
Off-policy learning with function approximation faces challenges related to update distributions and convergence. While
semi-gradient methods work in tabular cases, they may not converge in more complex settings, as shown by Baird’s
counterexample. Methods like importance sampling and true gradient approaches are being explored to address these
issues.
Actor-Critic Methods in reinforcement learning combine the strengths of value-based and policy-based methods by
separating the learning process into two components: the actor and the critic.
Actor and Critic Components: Actor-Critic methods involve two components: the actor, which selects actions based on
the current state, and the critic, which evaluates those actions using the TD error to guide the actor's decision-making.
On-Policy Learning: The learning process is on-policy; the critic evaluates the actor's current actions, and the TD error
helps the actor improve its policy continuously.
Efficient Action Selection: Actor-Critic methods are computationally efficient, particularly in large or continuous action
spaces, as they minimize the computation needed to select actions.
Stochastic Policy Learning: These methods can learn stochastic policies, enabling a balance between exploration and
exploitation, which is useful in complex or competitive environments.
Iterative Feedback Loop: Actor-Critic methods rely on a feedback loop in which the critic evaluates actions, computes the
TD error, and updates both the actor and the critic, leading to continuous improvement of the policy over time.
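A minimal sketch of one actor-critic step, assuming a linear critic, a softmax (Gibbs) actor over linear action preferences, and feature vectors supplied by some feature function not shown here:

import numpy as np

def softmax_policy(theta, x_s):
    """Action probabilities from linear preferences h(s, a) = theta[a] @ x(s)."""
    prefs = theta @ x_s
    prefs = prefs - prefs.max()                    # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def actor_critic_step(w, theta, x_s, x_s_next, action, reward, done,
                      alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic: the critic's TD error drives both updates."""
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    delta = reward + gamma * v_next - v_s          # TD error computed by the critic
    w = w + alpha_w * delta * x_s                  # critic: semi-gradient TD(0) on state values
    probs = softmax_policy(theta, x_s)
    for a in range(theta.shape[0]):                # actor: gradient of log pi(action | s) w.r.t. theta[a]
        grad = ((1.0 if a == action else 0.0) - probs[a]) * x_s
        theta[a] = theta[a] + alpha_theta * delta * grad
    return w, theta

A positive TD error makes the selected action more likely; a negative one makes it less likely, which is the feedback loop described above.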
Eligibility Traces for Actor-Critic Methods: These enhance learning by allowing updates not just for the current state-action pair,
but also for past experiences that influenced the outcome.
Critic's Trace: The critic uses a TD(λ) approach, where λ determines how much past states influence the value-function
update, leading to faster learning by taking past experience into account.
Actor's Trace: The actor updates its policy using eligibility traces for state-action pairs, with traces decaying over time so
that recent experience is prioritized while earlier experience is still accounted for.
TD(λ) Algorithm: The TD(λ) method adjusts state-action values using eligibility traces, improving the policy and its
performance, especially in tasks that depend on sequences of states and rewards.
Decay Factor: The decay factor controls how strongly past actions influence learning, making updates more efficient.
Efficient Learning: By using eligibility traces, both the actor and critic update rules become more efficient, particularly in
complex environments.
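In one common formulation (with the same TD error δt driving both components; the trace parameters λw and λθ need not be equal), the two trace vectors are updated as:
zw ← γ λw zw + ∇ v^(St, w),    w ← w + αw δt zw
zθ ← γ λθ zθ + ∇ ln π(At | St, θ),    θ ← θ + αθ δt zθ
Setting both λ values to zero recovers the one-step updates sketched earlier.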
R-Learning and the Average-Reward Setting focus on optimizing the long-term average reward per time step, rather than
cumulative discounted rewards, making it ideal for continuous tasks.
Average-Reward Optimization: R-learning aims to maximize the average reward per time step rather than the total
discounted reward, especially in continuing tasks.
Value Functions: In R-learning, value functions are based on average reward rather than total reward, guiding the agent in
maximizing long-term average rewards.
Two Policies: The algorithm uses two policies: the behavior policy (for experience generation) and the estimation policy
(for selecting optimal actions).
Action-Value and Reward Function: The agent updates its action-value functions (Q-values) and average reward estimate
to maximize the average reward over time.
Off-Policy Control: R-learning is an off-policy method, similar to Q-learning, but designed for undiscounted environments,
making it useful for tasks like queuing systems where rewards are consistent across time steps.
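A minimal tabular sketch of the R-learning update, assuming the action-value table Q is a NumPy array indexed by (state, action), rho is the current average-reward estimate, and the action was chosen by some behavior policy (e.g., ε-greedy); with function approximation the table would be replaced by a parameterized q^:

import numpy as np

def r_learning_update(Q, rho, s, a, reward, s_next, alpha=0.1, beta=0.01):
    """One R-learning step: Q-values are corrected by (reward - rho) with no discounting,
    and rho is adjusted only when the action taken is greedy (the estimation policy)."""
    delta = reward - rho + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    if Q[s, a] == Q[s].max():        # the chosen action is greedy in state s
        rho += beta * (reward - rho + Q[s_next].max() - Q[s].max())
    return Q, rho

Because rho tracks the reward per time step, the quantity being maximized is the long-run reward rate rather than a discounted sum, matching the average-reward setting described above.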