20AI903_RL_UNIT 4
20AI903 REINFORCEMENT LEARNING (L T P C: 3 0 0 3)
OBJECTIVES:
To know about the concepts of Reinforcement Learning
To understand Markov Decision Processes, Dynamic Programming, and Monte Carlo methods.
To understand temporal difference learning.
To know how function approximation is used in reinforcement learning.
To study basics of Deep Reinforcement Learning.
UNIT I INTRODUCTION TO REINFORCEMENT LEARNING 9
Introduction - Elements of RL, History of RL - Limitations and Scope - Examples – Multi-armed Bandits
– n-armed Bandit Problem – Action-Value Methods – Incremental Implementation –
Nonstationary Problem – Optimistic Initial Values – Upper Confidence Bound Action Selection –
Gradient Bandits – Contextual Bandits.
UNIT II MARKOV DECISION PROCESSES AND DYNAMIC PROGRAMMING 9
Finite Markov Decision Processes - The Agent-Environment Interface - Goals and Rewards –
Returns – Episodic and Continuing Tasks – Markov Property – MDPs - Value Functions – Optimality
and Approximation - Dynamic Programming - Policy Evaluation - Policy Improvement - Policy
Iteration - Value Iteration - Asynchronous DP - Efficiency of DP - Monte Carlo Prediction - Monte
Carlo Estimation of Action Values - Monte Carlo Control - Off-policy Monte Carlo Prediction –
Off-Policy Monte Carlo Control.
UNIT III TEMPORAL DIFFERENCE LEARNING 9
Temporal-Difference prediction – Advantages of TD Prediction Methods -Optimality of TD(0) –
Sarsa: On-policy TD Control – Q-Learning: Off-Policy TD Control – n-step TD Prediction – Forward
View – Backward View – Sarsa(λ) – Watkins's Q(λ) – Off-policy Eligibility Traces.
UNIT IV FUNCTION APPROXIMATION METHODS 9
Value Prediction with function Approximation – Gradient Descent Methods – Linear Methods –
Control with Function Approximation – Off-Policy Approximation of Action Values – Policy
Approximation – Actor-Critic Methods – Eligibility Traces – R-Learning and the Average-Reward
Setting.
UNIT V DEEP REINFORCEMENT LEARNING 9
Deep Q-Learning – Rainbow DQN – DQN Improvements –
Policy Gradient Methods – Benefits – Calculation – Theorem – Policy Functions – Implementation
– Hierarchical RL – Multi-Agent RL
TOTAL: 45 PERIODS
OUTCOMES:
At the end of this course, the students will be able to:
CO1: Learn about the concepts of Reinforcement Learning.
CO2: Understand the concepts of Dynamic Programming and Markov Decision Processes.
CO3: Understand the concept of Temporal Difference Learning.
CO4: Describe how function approximation is used in reinforcement learning.
CO5: Learn the basics of Deep Reinforcement Learning.
TEXT BOOKS:
1. Sutton R. S. and Barto A. G., "Reinforcement Learning: An Introduction", MIT Press, Second
Edition, 2020.
2. Phil Winder, "Reinforcement Learning: Industrial Applications of Intelligent Agents",
O'Reilly, 2021.
REFERENCES:
1. Kevin Murphy, "Machine Learning - A Probabilistic Perspective", MIT Press, 2012.
2. Christopher Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.
UNIT – IV
FUNCTION APPROXIMATION METHODS
UNIT-4 FUNCTION APPROXIMATION METHODS

Contents
1. Value Prediction with Function Approximation
2. Gradient-Descent Methods
3. Linear Methods
4. Control with Function Approximation
5. Off-Policy Approximation of Action Values
6. Policy Approximation
7. Actor-Critic Methods
8. Eligibility Traces
9. R-Learning and the Average-Reward Setting
UNIT-4 FUNCTION
APPROXIMATION METHODS
Value Prediction with function Approximation – Gradient Descent Methods –
Linear Methods – Control with Function Approximation – Off-Policy
Approximation of Action Values – Policy Approximation – Actor-Critic Methods –
Eligibility Traces – R-Learning and the Average-Reward Setting.
4.1 VALUE PREDICTION WITH FUNCTION APPROXIMATION
4.2 GRADIENT-DESCENT METHODS
We now develop in detail one class of learning methods for function approximation in
value prediction, those based on gradient descent. Gradient-descent methods are
among the most widely used of all function approximation methods and are
particularly well suited to reinforcement learning.
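As a concrete illustration of the idea, the following is a minimal sketch, not taken from these notes, of gradient-descent value prediction: semi-gradient TD(0) with a linear approximator v̂(s, w) = w · x(s). The five-state random walk environment, the one-hot features, and the constants are assumptions made purely for the example.

import numpy as np

N_STATES = 5      # non-terminal states 0..4; episodes start in the middle
ALPHA = 0.1       # step-size parameter
GAMMA = 1.0       # undiscounted episodic task

def features(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0    # one-hot features, the simplest linear feature choice
    return x

def v_hat(s, w):
    return w @ features(s)

def semi_gradient_td0(episodes=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(episodes):
        s = N_STATES // 2
        while True:
            s_next = s + rng.choice([-1, 1])   # random-walk behaviour
            if s_next < 0:                     # terminate on the left, reward 0
                target = 0.0
            elif s_next >= N_STATES:           # terminate on the right, reward +1
                target = 1.0
            else:                              # reward 0, bootstrap from v_hat
                target = GAMMA * v_hat(s_next, w)
            # gradient-descent update: w <- w + alpha*[target - v_hat(s,w)]*grad v_hat(s,w)
            w += ALPHA * (target - v_hat(s, w)) * features(s)
            if s_next < 0 or s_next >= N_STATES:
                break
            s = s_next
    return w

print(semi_gradient_td0())   # tends toward the true values [1/6, 2/6, 3/6, 4/6, 5/6]

With one-hot features the method reduces to tabular TD(0); with richer features the same update generalizes across states.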
4.3 LINEAR METHODS
4.4 CONTROL WITH FUNCTION APPROXIMATION
We now extend value prediction methods using function approximation to control
methods, following the pattern of GPI. First we extend the state-value prediction
methods to action-value prediction methods, then we combine them with policy
improvement and action selection techniques. As usual, the problem of ensuring
exploration is solved by pursuing either an on-policy or an off-policy approach.
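As a sketch of this pattern, and not a complete algorithm from these notes, the code below shows one episodic semi-gradient Sarsa step with linear action-value features q̂(s, a, w) = w · x(s, a) and ε-greedy action selection. The feature function x(s, a), the action set, and the step sizes are placeholders assumed to be supplied by the surrounding problem.

import numpy as np

def q_hat(x_sa, w):
    # linear action-value estimate q_hat(s, a, w) = w . x(s, a)
    return w @ x_sa

def epsilon_greedy(s, w, actions, x, epsilon, rng):
    # exploration is handled on-policy by acting epsilon-greedily w.r.t. q_hat
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    values = [q_hat(x(s, a), w) for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_update(w, x, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    # one semi-gradient Sarsa step: w <- w + alpha*[target - q_hat]*grad q_hat
    x_sa = x(s, a)
    target = r if done else r + gamma * q_hat(x(s_next, a_next), w)
    w += alpha * (target - q_hat(x_sa, w)) * x_sa   # gradient of w.x(s,a) is x(s,a)
    return w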
All the methods we have discussed above have used accumulating eligibility
traces. Although replacing traces (Section 3.8) are known to have
advantages in tabular methods, they do not extend directly to the case of
function approximation.
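The contrast between the two kinds of traces can be made concrete with a small sketch, not from these notes, for a linear approximator with binary features x(s): an accumulating trace adds the feature vector on every visit, while a replacing trace resets the traced components to the feature value.

import numpy as np

def accumulating_trace(e, x_s, gamma, lam):
    # accumulate: e <- gamma*lambda*e + x(s)
    return gamma * lam * e + x_s

def replacing_trace(e, x_s, gamma, lam):
    # decay, then replace the components of active features (binary features assumed)
    e = gamma * lam * e
    e[x_s > 0] = x_s[x_s > 0]
    return e

e = np.zeros(3)
x = np.array([1.0, 0.0, 0.0])            # the same state visited three times in a row
for _ in range(3):
    e = accumulating_trace(e, x, gamma=1.0, lam=0.9)
print(e[0])                               # 2.71: accumulating traces grow past 1,
                                          # whereas a replacing trace would stay at 1.0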
Example 4.2: Mountain-Car Task Consider the task of driving an
underpowered car up a steep mountain road, as suggested by the diagram in
the upper left of Figure 4.10. The difficulty is that gravity is stronger than the
car's engine, and even at full throttle the car cannot accelerate up the steep
slope. The only solution is to first move away from the goal and up the
opposite slope on the left. Then, by applying full throttle the car can build up
enough inertia to carry it up the steep slope even though it is slowing down
the whole way. This is a simple example of a continuous control task where
things have to get worse in a sense (farther from the goal) before they can
get better. Many control methodologies have great difficulties with tasks of
this kind unless explicitly aided by a human designer.
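For reference, the dynamics commonly used for this task (following Sutton and Barto) can be written in a few lines; the exact constants below are the standard textbook choices and should be read as assumptions of this sketch rather than as part of these notes.

import math

POS_MIN, POS_MAX = -1.2, 0.5      # the goal is at the right edge, position 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07

def step(position, velocity, action):
    # action is -1 (full throttle reverse), 0 (coast) or +1 (full throttle forward)
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(VEL_MIN, min(VEL_MAX, velocity))
    position += velocity
    position = max(POS_MIN, min(POS_MAX, position))
    if position == POS_MIN:       # hitting the left wall stops the car
        velocity = 0.0
    reached_goal = position >= POS_MAX
    reward = -1.0                 # -1 per time step until the goal is reached
    return position, velocity, reward, reached_goal

Because the reward is -1 on every step, value estimates far from the goal become strongly negative, which is what eventually drives a learned policy to back up the left slope first.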
Exercise 4.8 Describe how the actor-critic control method can be combined with
gradient-descent function approximation.
4.5 OFF-POLICY APPROXIMATION OF ACTION VALUES
The extension of function approximation turns out to be significantly different and harder for
off-policy learning than it is for on-policy learning.
The tabular off-policy methods readily extend to semi-gradient algorithms, but these
algorithms do not converge as robustly as they do under on-policy training.
Recall that in off-policy learning we seek to learn a value function for a target policy π, given
data due to a different behavior policy b.
In the control case, action values are learned, and both policies typically change during
learning: π being the greedy policy with respect to q̂, and b being something more exploratory
such as the ε-greedy policy with respect to q̂.
The challenge of off-policy learning can be divided into two parts:
• The target of the update, which arises already in the tabular case; importance sampling is
used to deal with this (sketched below).
• The distribution of the updates, which arises only with function approximation, because
updates are made according to the behavior policy's state distribution rather than the
on-policy distribution.
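As a sketch of the first part of the challenge, and not of any complete algorithm from these notes, the update below applies the per-step importance-sampling ratio ρ = π(a|s)/b(a|s) to an off-policy semi-gradient TD(0) state-value update with linear features; the probability functions pi_prob and b_prob are placeholders assumed to be supplied.

import numpy as np

def off_policy_td0_update(w, x_s, x_s_next, r, a, pi_prob, b_prob,
                          alpha=0.1, gamma=0.99, done=False):
    rho = pi_prob(a) / b_prob(a)            # importance-sampling ratio for this step
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    td_error = r + gamma * v_next - v_s
    w += alpha * rho * td_error * x_s       # semi-gradient update weighted by rho
    return w

Note that weighting the update in this way corrects its target but not the distribution of states at which updates are made, which is the second part of the challenge above.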
4.6 POLICY APPROXIMATION
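In policy approximation the policy itself is represented as a parameterized function, for example a softmax over action preferences or a Gaussian policy (see Question 26 in Part A). As a minimal illustrative sketch, not taken from these notes, the code below assumes linear action preferences h(s, a) = θ · x(s, a) and shows the resulting softmax policy together with the gradient of log π(a|s), the quantity used by policy-gradient and actor-critic updates.

import numpy as np

def softmax_policy(theta, x):
    # x[a] is the feature vector x(s, a) for each action; returns pi(.|s)
    prefs = np.array([theta @ x_a for x_a in x])    # action preferences h(s, a)
    prefs -= prefs.max()                            # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def log_policy_gradient(theta, x, a):
    # grad_theta log pi(a|s) = x(s,a) - sum_b pi(b|s) x(s,b) for linear preferences
    probs = softmax_policy(theta, x)
    expected_x = sum(p * x_b for p, x_b in zip(probs, x))
    return x[a] - expected_x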
4.7 ACTOR-CRITIC METHODS
Actor-critic methods are TD methods that have a separate memory structure to explicitly
represent the policy independent of the value function. The policy structure is known as
the actor, because it is used to select actions, and the estimated value function is known as
the critic, because it criticizes the actions made by the actor. Learning is always on-policy:
the critic must learn about and critique whatever policy is currently being followed by the
actor. The critique takes the form of a TD error. This scalar signal is the sole output of the
critic and drives all learning in both actor and critic, as suggested by Figure 4.15.
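The following minimal sketch, not from these notes, shows how the single TD error produced by the critic can drive both updates. Linear critic features x(s), a separate actor parameter vector θ, and the gradient of log π(a|s) supplied by the policy parameterization (for example the softmax sketch in Section 4.6 above) are assumptions for the example.

import numpy as np

def actor_critic_step(w, theta, x_s, x_s_next, r, grad_log_pi,
                      alpha_w=0.1, alpha_theta=0.01, gamma=0.99, done=False):
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    delta = r + gamma * v_next - v_s                     # TD error produced by the critic
    w = w + alpha_w * delta * x_s                        # critic update (semi-gradient TD)
    theta = theta + alpha_theta * delta * grad_log_pi    # actor update, scaled by the same delta
    return w, theta, delta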
These issues were explored early on, primarily for the immediate reward case
(Sutton, 1984; Williams, 1992) and have not been brought fully up to date.
Many of the earliest reinforcement learning systems that used TD methods were
actor-critic methods (Witten, 1977; Barto, Sutton, and Anderson, 1983). Since
then, more attention has been devoted to methods that learn action-value
functions and determine a policy exclusively from the estimated values (such as
Sarsa and Q-learning).
This divergence may be just historical accident. For example, one could imagine
intermediate architectures in which both an action-value function and an
independent policy would be learned. In any event, actor-critic methods are likely
to remain of current interest because of two significant apparent advantages:
• They require minimal computation in order to select actions. Consider a case
where there are an infinite number of possible actions--for example, a
continuous-valued action. Any method learning just action values must search
through this infinite set in order to pick an action. If the policy is explicitly
stored, then this extensive computation may not be needed for each action
selection.
• They can learn an explicitly stochastic policy; that is, they can learn the optimal
probabilities of selecting various actions. This ability turns out to be useful in
competitive and non-Markov cases (e.g., see Singh, Jaakkola, and Jordan,
1994).
In addition, the separate actor in actor-critic methods makes them more
appealing in some respects as psychological and biological models. In some cases
it may also make it easier to impose domain-specific constraints on the set of
allowed policies.
4.8 ELIGIBILITY TRACES
4.9 R-LEARNING AND THE AVERAGE-REWARD SETTING
We now introduce a third classical setting—alongside the episodic and discounted settings— for
formulating the goal in Markov decision problems (MDPs). Like the discounted setting, the
average reward setting applies to continuing problems, problems for which the interaction
between agent and environment goes on and on forever without termination or start states.
Unlike that setting, however, there is no discounting—the agent cares just as much about
delayed rewards as it does about immediate reward. The average-reward setting is one of the
major settings commonly considered in the classical theory of dynamic programming and less
commonly in reinforcement learning. As we discuss in the next section, the discounted setting is
problematic with function approximation, and thus the average-reward setting is needed to
replace it.
In the average-reward setting, the quality of a policy π is defined as the average rate of reward,
or simply average reward, while following that policy, which we denote as r(π):
r(π) = lim_{h→∞} (1/h) Σ_{t=1}^{h} E[ R_t | S_0, A_0, ..., A_{t−1} ~ π ]        (4.9)

r(π) = lim_{t→∞} E[ R_t | S_0, A_0, ..., A_{t−1} ~ π ]
     = Σ_s μ_π(s) Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) r                          (4.10)
where the expectations are conditioned on the initial state, S0, and on the subsequent
actions, A0, A1, ..., A_{t−1}, being taken according to π, and where
μ_π(s) = lim_{t→∞} Pr{S_t = s | A_0, ..., A_{t−1} ~ π} is the steady-state distribution under π,
which is assumed to exist for any π and to be independent of S0. This assumption about the
MDP is known as ergodicity. It means that where the MDP starts or any early decision made by
the agent can have only a temporary effect; in the long run the expectation of being in a state
depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to
guarantee the existence of the limits in the equations above.
There are subtle distinctions that can be drawn between different kinds of optimality in the
undiscounted continuing case.
Nevertheless, for most practical purposes it may be adequate simply to order policies
according to their average reward per time step, in other words, according to their r(π).
This quantity is essentially the average reward under π, as suggested by (4.10). In
particular, we consider all policies that attain the maximal value of r(π) to be optimal. Note
that the steady state distribution is the special distribution under which, if you select
actions according to π, you remain in the same distribution. That is, for which
Σ_s μ_π(s) Σ_a π(a|s) p(s' | s, a) = μ_π(s')                                     (4.11)

In the average-reward setting, returns are defined in terms of differences between rewards and
the average reward:

G_t = R_{t+1} − r(π) + R_{t+2} − r(π) + R_{t+3} − r(π) + ...                     (4.12)
This is known as the differential return, and the corresponding value functions
are known as differential value functions. They are defined in the same way and
we will use the same notation for them as we have all along:
v_π(s) = E_π[ G_t | S_t = s ]   and   q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
(similarly for v* and q*). Differential value functions also have Bellman
equations, just slightly different from those we have seen earlier. We simply
remove all γs and replace all rewards by the difference between the reward
and the true average reward:
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r − r(π) + v_π(s') ]

q_π(s, a) = Σ_{s',r} p(s', r | s, a) [ r − r(π) + Σ_{a'} π(a'|s') q_π(s', a') ]

v*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r − max_π r(π) + v*(s') ]

q*(s, a) = Σ_{s',r} p(s', r | s, a) [ r − max_π r(π) + max_{a'} q*(s', a') ]
There is also a differential form of the two TD errors:
δ_t = R_{t+1} − R̄_t + v̂(S_{t+1}, w_t) − v̂(S_t, w_t)

δ_t = R_{t+1} − R̄_t + q̂(S_{t+1}, A_{t+1}, w_t) − q̂(S_t, A_t, w_t)

where R̄_t is an estimate at time t of the average reward r(π).
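The following is a hedged sketch, not from these notes, of how the differential action-value TD error can be used in an average-reward control update with linear features; both the weights w and the running average-reward estimate r_bar are driven by the same error, in the spirit of R-learning.

import numpy as np

def differential_sarsa_step(w, r_bar, x_sa, x_sa_next, r, alpha=0.1, beta=0.01):
    q = w @ x_sa
    q_next = w @ x_sa_next
    delta = r - r_bar + q_next - q      # differential TD error
    r_bar = r_bar + beta * delta        # update the average-reward estimate
    w = w + alpha * delta * x_sa        # semi-gradient action-value update
    return w, r_bar

# example call with made-up one-hot features
w, r_bar = np.zeros(4), 0.0
x1, x2 = np.eye(4)[0], np.eye(4)[1]
w, r_bar = differential_sarsa_step(w, r_bar, x1, x2, r=2.0)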
Part A – Questions
& Answers
Unit – IV
Part A - Questions & Answers
1. What is the main objective of value prediction with function approximation? [K2, CO4]
The objective is to estimate the value function with a parameterized function approximator
whose parameters are adjusted, for example by gradient descent, to minimize the error between
the predicted values and the target values, so that the estimate generalizes across states.
7. Describe actor-critic methods in reinforcement learning. [K3, CO4]
Actor-critic methods are TD methods with a separate memory structure for the policy. The actor
selects actions, while the critic estimates the value function and criticizes the actor's actions
through a TD error, which drives learning in both components.
11. What are the advantages of using function approximation over tabular methods in
reinforcement learning? [K2, CO4]
Function approximation generalizes across states, so it can handle large or continuous state
spaces where storing a separate value for every state is impractical, and it typically requires far
less memory and experience than tabular methods.
16. What are the main advantages of policy approximation over value-based methods? [K2, CO4]
A parameterized policy can select actions with minimal computation even when the action space
is large or continuous, and it can represent explicitly stochastic policies, learning the optimal
probabilities of selecting the various actions.
21. What are some drawbacks of using function approximation in
reinforcement learning? [K2, CO4]
Function approximation can introduce approximation errors, stability issues, and
difficulties in convergence, especially in high-dimensional and non-linear environments.
24. What are some common techniques to ensure stability and convergence
in control with function approximation? [K3, CO4]
Common techniques include using function approximators with bounded outputs,
employing appropriate exploration strategies, and incorporating stability constraints into
the learning algorithm.
26. What are some typical policy parameterizations used in policy
approximation methods? [K3, CO4]
Common parameterizations include softmax policies, Gaussian policies, and deterministic
policies parameterized by neural networks.
28. What is the purpose of the eligibility trace in TD(λ) algorithms? [K3, CO4]
The eligibility trace captures the influence of past states and actions on the current
estimate, allowing for more efficient updates of the function approximator's parameters.
30. How does the choice of reward function affect the performance of R-
learning algorithms? [K3, CO4]
The choice of reward function directly impacts the estimation of the average reward and
can significantly influence the learning process and final performance of R-learning
algorithms.
31. In what scenarios would you prefer to use Q-learning over R-learning? [K3,
CO4]
Q-learning is typically preferred when the discount factor is significant or when the
environment has a well-defined terminal state, making the discounted sum of rewards a
more appropriate measure of performance.
Part B – Questions
Unit – IV
Real-time Applications
Applications in self-driving cars
Various papers have proposed Deep Reinforcement Learning for
autonomous driving. In self-driving cars, there are various aspects to
consider, such as speed limits at various places, drivable zones, avoiding
collisions—just to mention a few.
The Tiger Problem
The Tiger Problem is a renowned POMDP problem wherein a decision
maker stands between two closed doors on the left and right. A tiger is
put behind one of the doors with equal probability, and treasure is
placed behind the other. Therefore, the problem has two states tiger-left
and tiger-right. The decision maker can either listen to the roars of the
tiger or can open one of the doors. Therefore, the possible actions are
listen, open-left, and open-right. However, the observation capability of
the decision-maker is not 100% accurate. Therefore, there is a 15%
probability that the decision maker listens and interprets wrongly the
side on which the tiger is present. Naturally, if the decision maker opens
the door with the tiger, then they will get hurt. Therefore, we assign a
reward of -100 in this case. On the other hand, the decision maker will
gain a reward of 10 if they choose the door with treasure. Once a door
is opened, the tiger is again randomly assigned to one of the doors, and
the decision maker gets to choose again.
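The listening probabilities above translate directly into a Bayesian belief update over the two states. The following small sketch (not the grid or point-based solver referred to below) encodes the numbers given in the text and shows how two consistent observations make the decision maker confident enough to open the opposite door.

# Tiger Problem parameters as described above
P_CORRECT = 0.85                 # listening identifies the correct side 85% of the time
R_TIGER, R_TREASURE = -100.0, 10.0

def update_belief(b_left, obs):
    # Bayes update of P(tiger-left) after hearing obs in {"left", "right"}
    like_left = P_CORRECT if obs == "left" else 1 - P_CORRECT
    like_right = (1 - P_CORRECT) if obs == "left" else P_CORRECT
    posterior = like_left * b_left
    return posterior / (posterior + like_right * (1 - b_left))

b = 0.5                          # initial belief: tiger behind either door with equal probability
b = update_belief(b, "left")     # hear a roar on the left
print(round(b, 3))               # 0.85
b = update_belief(b, "left")     # hear it on the left again
print(round(b, 3))               # about 0.97, confident enough to open the right door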
Now, the above Tiger Problem can be solved with the default grid method, which
uses point-based value iteration. The resulting policy behaves as follows.
Starting from the "Initial Belief" state, the decision maker first listens. If the
observation is "tiger-left" but the next observation is
different, then the decision maker returns to the "Initial
Belief" state. However, if the same observation is made twice,
the decision maker chooses to open the right door. A similar
case happens for the observation "tiger-right". Once the
reward is obtained, the problem is reset, and the decision
maker returns to the "Initial Belief" state. In the subsequent
sections, we present extensions to the POMDP problem and
its applications.