
Syllabus

20AI903 REINFORCEMENT LEARNING    L T P C: 3 0 0 3
OBJECTIVES:
• To know about the concepts of Reinforcement Learning.
• To understand Markov Decision Processes, Monte Carlo methods, and Dynamic Programming.
• To understand temporal difference learning.
• To know how function approximation is used in reinforcement learning.
• To study the basics of Deep Reinforcement Learning.
UNIT I INTRODUCTION TO REINFORCEMENT LEARNING 9
Introduction - Elements of RL, History of RL- Limitation and Scope - Examples – Multi-arm Bandits
– n-armed Bandit Problem – Action-Value Methods – Incremental Implementation –
Nonstationary Problem – Optimistic Initial Values – Upper Confidence Bound Action Selection –
Gradient Bandits – Contextual Bandits.
UNIT II MARKOV DECISION PROCESSES AND DYNAMIC PROGRAMMING 9
Finite Markov Decision Processes - The Agent–Environment Interface - Goals and Rewards –
Returns – Episodic and Continuing Tasks – Markov Property – MDP - Value Functions – Optimality
and Approximation - Dynamic Programming - Policy Evaluation - Policy Improvement – Policy
Iteration - Value Iteration - Asynchronous DP - Efficiency of DP - Monte Carlo Prediction - Monte Carlo
Estimation of Action Values - Monte Carlo Control - Off-policy Monte Carlo Prediction – Off-Policy
Monte Carlo Control.
UNIT III TEMPORAL DIFFERENCE LEARNING 9
Temporal-Difference prediction – Advantages of TD Prediction Methods -Optimality of TD(0) –
Sarsa: On-policy TD Control – Q-Learning: Off-Policy TD Control – n-step TD Prediction – Forward
View – Backward View – Sarsa(λ) – Watkins's Q(λ) – Off-policy Eligibility Traces.
UNIT IV FUNCTION APPROXIMATION METHODS 9
Value Prediction with function Approximation – Gradient Descent Methods – Linear Methods –
Control with Function Approximation – Off-Policy Approximation of Action Values – Policy
Approximation – Actor-Critic Methods – Eligibility Traces – R-Learning and the Average-Reward
Setting.
UNIT V DEEP REINFORCEMENT LEARNING 9
Deep Q-Learning – Rainbow DQN – DQN Improvements –
Policy Gradient Methods – Benefits – Calculation – Theorem – Policy Functions – Implementation
– Hierarchical RL – Multi-Agent RL
TOTAL: 45 PERIODS
OUTCOMES:
At the end of this course, the students will be able to:
CO1: Learn about the concepts of Reinforcement Learning.
CO2: Understand the concepts of Dynamic Programming, Markov Decision Processes, and Monte Carlo methods.
CO3: Understand the concept of Temporal Difference Learning (TD learning).
CO4: Describe how function approximation is used in reinforcement learning.
CO5: Learn the basics of Deep Reinforcement Learning.
TEXT BOOKS:
1. Sutton R. S. and Barto A. G., "Reinforcement Learning: An Introduction", MIT Press, Second
Edition, 2020.
2. Phil Winder, "Reinforcement Learning: Industrial Applications of Intelligent Agents", O'Reilly, 2021.
REFERENCES:
1. Kevin Murphy, “Machine Learning - A Probabilistic Perspective” , MIT press, 2012.
2. Christopher Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.

UNIT – IV FUNCTION APPROXIMATION METHODS

Contents

1. Function Approximation Methods
2. Value Prediction with Function Approximation
3. Gradient Descent Methods
4. Linear Methods
5. Control with Function Approximation
6. Off-Policy Approximation of Action Values
7. Policy Approximation
8. Actor-Critic Methods
9. Eligibility Traces
10. R-Learning and the Average-Reward Setting

UNIT-4 FUNCTION APPROXIMATION METHODS
Value Prediction with function Approximation – Gradient Descent Methods –
Linear Methods –Control with Function Approximation – Off-Policy
Approximation of Action Values – Policy Approximation – Actor-Critic Methods –
Eligibility Traces – R-Learning and the Average-Reward Setting.

4. FUNCTION APPROXIMATION METHODS

Consider finite state–action spaces. If the table representation of the Q-function is very large, the Q-function can be approximated to cope with this problem. Function approximation can also be applied in the case of infinite state–action spaces.
4.1 VALUE PREDICTION WITH FUNCTION APPROXIMATION
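In value prediction with function approximation, the state-value function v_π is no longer stored as a table; it is represented by a parameterized function v̂(s, w) with weight vector w, and learning adjusts w so that v̂(s, w) ≈ v_π(s). As a minimal sketch (not taken from the original notes), the following gradient Monte Carlo prediction routine assumes a linear approximator; "episodes" and "features" are placeholder interfaces introduced here for illustration.

import numpy as np

def gradient_mc_prediction(episodes, features, alpha=0.01, gamma=1.0, n_weights=8):
    """Gradient Monte Carlo value prediction with a linear approximator (sketch).

    episodes : iterable of trajectories, each a list of (S_t, R_{t+1}) pairs
    features : function mapping a state to a feature vector x(s) of length n_weights
    """
    w = np.zeros(n_weights)
    for episode in episodes:
        # Compute the return G_t for every step of the episode, working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        # Stochastic gradient update: w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat(S_t, w).
        # For a linear approximator v_hat(s, w) = w . x(s), the gradient is just x(s).
        for state, G in returns:
            x = features(state)
            w += alpha * (G - w.dot(x)) * x
    return w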

4.2 GRADIENT-DESCENT METHODS

We now develop in detail one class of learning methods for function approximation in
value prediction, those based on gradient descent. Gradient-descent methods are
among the most widely used of all function approximation methods and are
particularly well suited to reinforcement learning.
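As a hedged illustration of such a method (the helper names env_step, features, and policy are assumptions, not part of the notes), the following sketch implements semi-gradient TD(0) prediction, in which the update w ← w + α [R + γ v̂(S′, w) − v̂(S, w)] ∇v̂(S, w) takes the gradient only through the current estimate, not through the bootstrapped target.

import numpy as np

def semi_gradient_td0(env_step, features, policy, start_state,
                      alpha=0.01, gamma=0.99, n_weights=8, n_steps=10_000):
    """Semi-gradient TD(0) value prediction sketch (assumed helper interfaces).

    env_step(state, action) -> (reward, next_state, done)
    features(state)         -> feature vector x(s)
    policy(state)           -> action
    """
    w = np.zeros(n_weights)
    state = start_state
    for _ in range(n_steps):
        action = policy(state)
        reward, next_state, done = env_step(state, action)
        target = reward if done else reward + gamma * w.dot(features(next_state))
        # Semi-gradient update: the gradient is taken only through v_hat(S_t, w),
        # not through the bootstrapped target above.
        x = features(state)
        w += alpha * (target - w.dot(x)) * x
        state = start_state if done else next_state
    return w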

4.3 LINEAR METHODS
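Linear methods represent the value function as v̂(s, w) = wᵀx(s), a linear combination of features or basis functions, so the gradient with respect to w is simply the feature vector x(s). The sketch below is illustrative only; the state-aggregation feature constructor is an assumed example rather than the notes' own choice of features.

import numpy as np

def linear_value(state, w, features):
    """Linear approximation: v_hat(s, w) = w . x(s); its gradient w.r.t. w is x(s)."""
    x = features(state)
    return w.dot(x), x   # value and gradient

# Example feature construction (illustrative assumption): simple state aggregation,
# where each of n_states state indices is mapped to one of n_groups one-hot features.
def aggregation_features(state_index, n_states=1000, n_groups=10):
    x = np.zeros(n_groups)
    x[state_index * n_groups // n_states] = 1.0
    return x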
4.4 CONTROL WITH FUNCTION APPROXIMATION
We now extend value prediction methods using function approximation to control
methods, following the pattern of GPI. First we extend the state-value prediction
methods to action-value prediction methods, then we combine them with policy
improvement and action selection techniques. As usual, the problem of ensuring
exploration is solved by pursuing either an on-policy or an off-policy approach.
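A minimal sketch of on-policy control with function approximation follows, assuming an episodic semi-gradient Sarsa formulation with ε-greedy action selection; env_reset, env_step, and features(state, action) are assumed interfaces, not code from the notes.

import numpy as np

def semi_gradient_sarsa(env_reset, env_step, features, n_actions,
                        alpha=0.1, gamma=1.0, epsilon=0.1,
                        n_weights=4096, n_episodes=500):
    """Episodic semi-gradient Sarsa sketch (assumed env_reset/env_step/features helpers).

    features(state, action) -> feature vector x(s, a) of length n_weights
    """
    w = np.zeros(n_weights)

    def q(s, a):
        return w.dot(features(s, a))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        state = env_reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            reward, next_state, done = env_step(state, action)
            x = features(state, action)
            if done:
                # Terminal update: the target is just the final reward.
                w += alpha * (reward - w.dot(x)) * x
            else:
                next_action = epsilon_greedy(next_state)
                target = reward + gamma * q(next_state, next_action)
                w += alpha * (target - w.dot(x)) * x
                state, action = next_state, next_action
    return w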

All the methods we have discussed above have used accumulating eligibility traces. Although replacing traces (Section 3.8) are known to have advantages in the tabular case, they do not carry over directly to function approximation, where there is no single trace per state but rather one trace component per weight.

Example 4.2: Mountain-Car Task Consider the task of driving an
underpowered car up a steep mountain road, as suggested by the diagram in
the upper left of Figure 4.10. The difficulty is that gravity is stronger than the
car's engine, and even at full throttle the car cannot accelerate up the steep
slope. The only solution is to first move away from the goal and up the
opposite slope on the left. Then, by applying full throttle the car can build up
enough inertia to carry it up the steep slope even though it is slowing down
the whole way. This is a simple example of a continuous control task where
things have to get worse in a sense (farther from the goal) before they can
get better. Many control methodologies have great difficulties with tasks of
this kind unless explicitly aided by a human designer.
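For concreteness, the standard mountain-car dynamics (as formulated in Sutton and Barto) can be sketched as below; the exact constants of the version pictured in Figure 4.10 may differ, so treat this as an assumed reference implementation of the task rather than the notes' own.

import math

def mountain_car_step(position, velocity, action):
    """One step of the classic mountain-car dynamics. action is -1, 0, or +1 for
    reverse throttle, coast, and full throttle. A sketch of the standard
    formulation; constants are the commonly used ones, not taken from the notes."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:            # left boundary acts as an inelastic wall
        position, velocity = -1.2, 0.0
    done = position >= 0.5         # goal at the top of the right slope
    reward = -1.0                  # -1 per step until the goal is reached
    return position, velocity, reward, done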

Exercise 4.8 Describe how the actor-critic control method can be combined with
gradient-descent function approximation.

4.5 OFF-POLICY APPROXIMATION OF ACTION VALUES

The extension to function approximation turns out to be significantly different and harder for off-policy learning than it is for on-policy learning.
The tabular off-policy methods readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly as they do under on-policy training.
Recall that in off-policy learning we seek to learn a value function for a target policy π, given data generated by a different behavior policy b.
In the control case, action values are learned, and both policies typically change during learning: π is the greedy policy with respect to q̂, and b is something more exploratory such as the ε-greedy policy with respect to q̂.
The challenge of off-policy learning comes from two sources:
• The target of the update, which already arises in the tabular case; importance sampling is used to deal with this.
• The distribution of the updates, which arises only with function approximation.

Dealing with the first challenge, we have semi-gradient off-policy TD(0):

w_{t+1} = w_t + α ρ_t δ_t ∇v̂(S_t, w_t),    where ρ_t = π(A_t | S_t) / b(A_t | S_t)

and δ_t is defined appropriately depending on whether the problem is episodic and discounted,

δ_t = R_{t+1} + γ v̂(S_{t+1}, w_t) − v̂(S_t, w_t),

or continuing and undiscounted using the average reward:

δ_t = R_{t+1} − R̄_t + v̂(S_{t+1}, w_t) − v̂(S_t, w_t).
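A hedged sketch of this semi-gradient off-policy TD(0) update with per-step importance sampling follows; target_prob, behavior_policy, behavior_prob, env_step, and features are assumed helper interfaces.

import numpy as np

def off_policy_semi_gradient_td0(env_step, features, target_prob, behavior_policy,
                                 behavior_prob, start_state,
                                 alpha=0.01, gamma=0.99, n_weights=8, n_steps=10_000):
    """Semi-gradient off-policy TD(0) sketch with per-step importance sampling.

    target_prob(a, s)    -> pi(a|s), probability under the target policy
    behavior_policy(s)   -> action sampled from the behavior policy b
    behavior_prob(a, s)  -> b(a|s)
    """
    w = np.zeros(n_weights)
    state = start_state
    for _ in range(n_steps):
        action = behavior_policy(state)
        rho = target_prob(action, state) / behavior_prob(action, state)
        reward, next_state, done = env_step(state, action)
        x = features(state)
        v_next = 0.0 if done else w.dot(features(next_state))
        delta = reward + gamma * v_next - w.dot(x)
        # The importance-sampling ratio rho scales the whole update.
        w += alpha * rho * delta * x
        state = start_state if done else next_state
    return w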

4.6 POLICY APPROXIMATION
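Policy approximation parameterizes the policy itself, for example as a softmax over linear action preferences, π(a | s, θ) ∝ exp(θᵀx(s, a)). As an illustrative sketch (the feature function x(s, a) is an assumption introduced here), the following computes such a policy and the score function ∇ log π(a | s, θ) used by policy-gradient updates.

import numpy as np

def softmax_policy(theta, state, features, n_actions):
    """Softmax-in-action-preferences policy: pi(a|s,theta) proportional to exp(theta . x(s,a))."""
    prefs = np.array([theta.dot(features(state, a)) for a in range(n_actions)])
    prefs -= prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def log_policy_gradient(theta, state, action, features, n_actions):
    """Score function: grad log pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b)."""
    probs = softmax_policy(theta, state, features, n_actions)
    expected_x = sum(probs[b] * features(state, b) for b in range(n_actions))
    return features(state, action) - expected_x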

4.7 ACTOR-CRITIC METHODS
Actor-critic methods are TD methods that have a separate memory structure to explicitly
represent the policy independent of the value function. The policy structure is known as
the actor, because it is used to select actions, and the estimated value function is known as
the critic, because it criticizes the actions made by the actor. Learning is always on-policy:
the critic must learn about and critique whatever policy is currently being followed by the
actor. The critique takes the form of a TD error. This scalar signal is the sole output of the
critic and drives all learning in both actor and critic, as suggested by Figure 4.15.

Actor-critic methods are the natural extension of the idea of reinforcement


comparison methods (Section 3.8) to TD learning and to the full reinforcement
learning problem. Typically, the critic is a state-value function. After each
action selection, the critic evaluates the new state to determine whether things
have gone better or worse than expected. That evaluation is the TD error:

δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t),

where V is the current value function implemented by the critic.
These issues were explored early on, primarily for the immediate reward case
(Sutton, 1984; Williams, 1992) and have not been brought fully up to date.

Many of the earliest reinforcement learning systems that used TD methods were
actor-critic methods (Witten, 1977; Barto, Sutton, and Anderson, 1983). Since
then, more attention has been devoted to methods that learn action-value
functions and determine a policy exclusively from the estimated values (such as
Sarsa and Q-learning).

This divergence may be just historical accident. For example, one could imagine
intermediate architectures in which both an action-value function and an
independent policy would be learned. In any event, actor-critic methods are likely
to remain of current interest because of two significant apparent advantages:
• They require minimal computation in order to select actions. Consider a case
where there are an infinite number of possible actions--for example, a
continuous-valued action. Any method learning just action values must search
through this infinite set in order to pick an action. If the policy is explicitly
stored, then this extensive computation may not be needed for each action
selection.
• They can learn an explicitly stochastic policy; that is, they can learn the optimal
probabilities of selecting various actions. This ability turns out to be useful in
competitive and non-Markov cases (e.g., see Singh, Jaakkola, and Jordan,
1994).
In addition, the separate actor in actor-critic methods makes them more
appealing in some respects as psychological and biological models. In some cases
it may also make it easier to impose domain-specific constraints on the set of
allowed policies.
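Putting the pieces together, a one-step actor-critic update can be sketched as follows, with a linear critic and a softmax actor. This is a minimal illustration under assumed interfaces (env_reset, env_step, state_features, sa_features), not the algorithm as printed in the notes.

import numpy as np

def one_step_actor_critic(env_reset, env_step, state_features, sa_features,
                          n_actions, alpha_w=0.05, alpha_theta=0.01, gamma=0.99,
                          n_w=64, n_theta=64, n_episodes=500):
    """One-step actor-critic sketch: linear critic v_hat(s, w) and softmax actor
    pi(a|s, theta). Interface names are assumptions for illustration."""
    w = np.zeros(n_w)          # critic weights
    theta = np.zeros(n_theta)  # actor (policy) parameters

    def action_probs(s):
        prefs = np.array([theta.dot(sa_features(s, a)) for a in range(n_actions)])
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    for _ in range(n_episodes):
        state, done = env_reset(), False
        while not done:
            probs = action_probs(state)
            action = np.random.choice(n_actions, p=probs)
            reward, next_state, done = env_step(state, action)
            # Critic: TD error delta = R + gamma * v(S') - v(S).
            v_s = w.dot(state_features(state))
            v_next = 0.0 if done else w.dot(state_features(next_state))
            delta = reward + gamma * v_next - v_s
            # Critic update (semi-gradient TD(0)).
            w += alpha_w * delta * state_features(state)
            # Actor update along grad log pi(A|S, theta), scaled by the TD error.
            grad_log_pi = sa_features(state, action) - sum(
                probs[b] * sa_features(state, b) for b in range(n_actions))
            theta += alpha_theta * delta * grad_log_pi
            state = next_state
    return w, theta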

4.8 ELIGIBILITY TRACES
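With function approximation, the eligibility trace becomes a vector z with one component per weight: at every step it is decayed by γλ and incremented by the gradient ∇v̂(S_t, w), and the TD error is then applied along z. The following is a brief sketch of semi-gradient TD(λ) with accumulating traces, using the same assumed placeholder helpers as earlier.

import numpy as np

def semi_gradient_td_lambda(env_step, features, policy, start_state,
                            alpha=0.01, gamma=0.99, lam=0.9,
                            n_weights=8, n_steps=10_000):
    """Semi-gradient TD(lambda) with accumulating eligibility traces (sketch;
    env_step, features, and policy are assumed helper interfaces)."""
    w = np.zeros(n_weights)
    z = np.zeros(n_weights)        # eligibility trace vector, one component per weight
    state = start_state
    for _ in range(n_steps):
        action = policy(state)
        reward, next_state, done = env_step(state, action)
        x = features(state)
        v_next = 0.0 if done else w.dot(features(next_state))
        delta = reward + gamma * v_next - w.dot(x)
        # Accumulating trace: decay by gamma*lambda, then add the current gradient x(S_t).
        z = gamma * lam * z + x
        w += alpha * delta * z
        if done:
            z = np.zeros(n_weights)   # traces are reset at episode boundaries
            state = start_state
        else:
            state = next_state
    return w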
4.9 R-LEARNING AND THE AVERAGE-REWARD SETTING

We now introduce a third classical setting—alongside the episodic and discounted settings— for
formulating the goal in Markov decision problems (MDPs). Like the discounted setting, the
average reward setting applies to continuing problems, problems for which the interaction
between agent and environment goes on and on forever without termination or start states.
Unlike that setting, however, there is no discounting—the agent cares just as much about
delayed rewards as it does about immediate reward. The average-reward setting is one of the
major settings commonly considered in the classical theory of dynamic programming, and less commonly in reinforcement learning. As we discuss in the next section, the discounted setting is
problematic with function approximation, and thus the average-reward setting is needed to
replace it.
In the average-reward setting, the quality of a policy π is defined as the average rate of reward,
or simply average reward, while following that policy, which we denote as r(π):

r(π) = lim_{h→∞} (1/h) Σ_{t=1}^{h} E[ R_t | S_0, A_{0:t−1} ~ π ]                (4.9)
     = lim_{t→∞} E[ R_t | S_0, A_{0:t−1} ~ π ]
     = Σ_s μ_π(s) Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) r                          (4.10)

where the expectations are conditioned on the initial state, S0, and on the subsequent actions, A0, A1, ..., A_{t−1}, being taken according to π, and where μ_π(s) = lim_{t→∞} Pr{ S_t = s | A_{0:t−1} ~ π } is the steady-state distribution, which is assumed to exist for any π and to be independent of S0. This assumption about the MDP is known as ergodicity. It means that where the MDP starts, or any early decision made by the agent, can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to guarantee the existence of the limits in the equations above.
There are subtle distinctions that can be drawn between different kinds of optimality in the undiscounted continuing case.

Nevertheless, for most practical purposes it may be adequate simply to order policies
according to their average reward per time step, in other words, according to their r(π).
This quantity is essentially the average reward under π, as suggested by (4.10). In
particular, we consider all policies that attain the maximal value of r(π) to be optimal. Note
that the steady-state distribution μ_π is the special distribution under which, if you select actions according to π, you remain in the same distribution. That is, it is the distribution for which

Σ_s μ_π(s) Σ_a π(a|s) p(s′ | s, a) = μ_π(s′).        (4.11)

In the average-reward setting, returns are defined in terms of differences between rewards and the average reward:

G_t = R_{t+1} − r(π) + R_{t+2} − r(π) + R_{t+3} − r(π) + · · ·        (4.12)

This is known as the differential return, and the corresponding value functions are known as differential value functions. They are defined in the same way, and we will use the same notation for them as we have all along:

v_π(s) = E[ G_t | S_t = s ],    q_π(s, a) = E[ G_t | S_t = s, A_t = a ]

(similarly for v_* and q_*). Differential value functions also have Bellman equations, just slightly different from those we have seen earlier. We simply remove all γs and replace all rewards by the difference between the reward and the true average reward:

v_π(s) = Σ_a π(a|s) Σ_{r,s′} p(s′, r | s, a) [ r − r(π) + v_π(s′) ]
q_π(s, a) = Σ_{r,s′} p(s′, r | s, a) [ r − r(π) + Σ_{a′} π(a′|s′) q_π(s′, a′) ]
There is also a differential form of the two TD errors:

δ_t = R_{t+1} − R̄_t + v̂(S_{t+1}, w_t) − v̂(S_t, w_t)
δ_t = R_{t+1} − R̄_t + q̂(S_{t+1}, A_{t+1}, w_t) − q̂(S_t, A_t, w_t)

where R̄_t is the estimate at time t of the average reward r(π).
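A hedged sketch of learning in this setting, in the spirit of R-learning, is differential semi-gradient Sarsa: it maintains a running estimate of the average reward and applies the differential TD error above; env_step and features are assumed helper interfaces.

import numpy as np

def differential_semi_gradient_sarsa(env_step, features, n_actions, start_state,
                                     alpha=0.1, beta=0.01, epsilon=0.1,
                                     n_weights=1024, n_steps=100_000):
    """Differential semi-gradient Sarsa sketch for the average-reward (continuing)
    setting; env_step and features are assumed interfaces, not the notes' own code."""
    w = np.zeros(n_weights)
    avg_reward = 0.0               # running estimate R_bar of r(pi)

    def q(s, a):
        return w.dot(features(s, a))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    state = start_state
    action = epsilon_greedy(state)
    for _ in range(n_steps):
        reward, next_state = env_step(state, action)   # continuing task: no terminal state
        next_action = epsilon_greedy(next_state)
        # Differential TD error: no discounting; subtract the average-reward estimate.
        delta = reward - avg_reward + q(next_state, next_action) - q(state, action)
        avg_reward += beta * delta
        w += alpha * delta * features(state, action)
        state, action = next_state, next_action
    return w, avg_reward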
Part A – Questions & Answers (Unit IV)
1. What is the main objective of value prediction with function approximation? [K2, CO4]
The main objective is to estimate the value function of a reinforcement learning problem using a function approximator instead of tabular methods.

2. How are gradient descent methods utilized in value prediction with function approximation? [K3, CO4]
Gradient descent methods are used to update the parameters of the function approximator to minimize the error between predicted and actual values.

3. Define linear methods in the context of function approximation. [K2, CO4]
Linear methods approximate the value function using a linear combination of features or basis functions.

4. What is the key challenge in controlling systems using function approximation? [K2, CO4]
The key challenge is to ensure stability and convergence of the control algorithm, especially in complex and high-dimensional environments.

5. Explain the concept of off-policy approximation of action values. [K3, CO4]
Off-policy approximation involves learning action values from data generated by following a different policy than the one being evaluated.

6. How does policy approximation differ from value prediction with function approximation? [K3, CO4]
Policy approximation focuses on directly approximating the policy function, while value prediction approximates the value function.
7. Describe actor-critic methods in reinforcement learning. [K3, CO4]
Actor-critic methods combine policy approximation (actor) with value function approximation (critic) to improve learning efficiency.

8. What is the role of eligibility traces in reinforcement learning? [K2, CO4]
Eligibility traces help in credit assignment by tracking the influence of past actions on future rewards.

9. What distinguishes R-learning from traditional Q-learning approaches? [K3, CO4]
R-learning is a variant of Q-learning that focuses on estimating the average reward instead of the discounted sum of rewards.

10. How does the average-reward setting differ from the discounted reward setting in reinforcement learning? [K3, CO4]
In the average-reward setting, the goal is to maximize the average reward per time step, while in the discounted reward setting, the goal is to maximize the discounted sum of rewards.
11. What are the advantages of using function approximation over tabular methods in reinforcement learning? [K2, CO4]
Function approximation allows handling of continuous state and action spaces, as well as reducing the memory requirements for storing value functions.

12. How do you update the parameters of a linear function approximator using gradient descent? [K3, CO4]
Parameters are updated in the direction of the negative gradient of the error with respect to the parameters, scaled by a learning rate.

13. Can linear methods handle non-linear relationships between state features and values? [K3, CO4]
No, linear methods cannot capture non-linear relationships between state features and values without additional transformation or feature engineering.

14. What are some common control objectives in reinforcement learning tasks? [K3, CO4]
Common control objectives include stabilizing unstable systems, tracking reference trajectories, and optimizing performance metrics such as cost or reward.

15. How do off-policy methods handle the discrepancy between the behavior policy and the target policy?
Off-policy methods use importance sampling techniques to re-weight samples generated by the behavior policy to match the target policy.
16. What are the main advantages of policy approximation over value function approximation? [K2, CO4]
Policy approximation can directly output actions without needing an explicit value function, and it can handle continuous action spaces more naturally.

17. In actor-critic methods, what role does the actor play? [K3, CO4]
The actor selects actions based on the policy approximation, while the critic evaluates the selected actions using the value function approximation.

18. How are eligibility traces used in TD(λ) algorithms for function approximation? [K3, CO4]
Eligibility traces accumulate the gradients over multiple time steps to update the parameters of the function approximator more efficiently.

19. What distinguishes R-learning from Q-learning in terms of the learning objective? [K3, CO4]
R-learning aims to estimate the average reward per time step, while Q-learning estimates the discounted sum of rewards.

20. Why is it important to consider the average-reward setting in certain reinforcement learning tasks? [K3, CO4]
The average-reward setting is important in tasks where the environment does not have a terminal state or where the discount factor approaches one, as it provides a stable and consistent measure of performance.
21. What are some drawbacks of using function approximation in reinforcement learning? [K2, CO4]
Function approximation can introduce approximation errors, stability issues, and difficulties in convergence, especially in high-dimensional and non-linear environments.

22. What is the primary disadvantage of using linear methods for function approximation? [K3, CO4]
Linear methods may not be expressive enough to capture complex relationships between states and values, especially in high-dimensional spaces.

23. How can function approximation help in dealing with the curse of dimensionality in reinforcement learning? [K3, CO4]
Function approximation allows for compact representations of value functions, reducing the computational and memory requirements in high-dimensional state spaces.

24. What are some common techniques to ensure stability and convergence in control with function approximation? [K3, CO4]
Common techniques include using function approximators with bounded outputs, employing appropriate exploration strategies, and incorporating stability constraints into the learning algorithm.

25. Explain the concept of the exploration-exploitation trade-off in reinforcement learning. [K3, CO4]
The exploration-exploitation trade-off refers to the dilemma of choosing between exploiting known knowledge to maximize immediate rewards and exploring unknown regions to discover potentially better actions or states.
26. What are some typical policy parameterizations used in policy approximation methods? [K3, CO4]
Common parameterizations include softmax policies, Gaussian policies, and deterministic policies parameterized by neural networks.

27. How do actor-critic methods combine value function approximation and policy approximation? [K3, CO4]
Actor-critic methods use the critic to evaluate the actions selected by the actor, providing feedback on the quality of the policy while simultaneously updating the policy parameters.

28. What is the purpose of the eligibility trace in TD(λ) algorithms? [K3, CO4]
The eligibility trace captures the influence of past states and actions on the current estimate, allowing for more efficient updates of the function approximator's parameters.

29. What are some drawbacks of using off-policy approximation methods? [K3, CO4]
Off-policy methods may suffer from high variance due to importance sampling, and they may require careful tuning of exploration strategies to ensure effective learning.

30. How does the choice of reward function affect the performance of R-learning algorithms? [K3, CO4]
The choice of reward function directly impacts the estimation of the average reward and can significantly influence the learning process and final performance of R-learning algorithms.

31. In what scenarios would you prefer to use Q-learning over R-learning? [K3, CO4]
Q-learning is typically preferred when the discount factor is significant or when the environment has a well-defined terminal state, making the discounted sum of rewards a more appropriate measure of performance.
Part B – Questions (Unit IV)
Real-time Applications

Applications in self-driving cars
Various papers have proposed Deep Reinforcement Learning for autonomous driving. In self-driving cars, there are various aspects to consider, such as speed limits at various places, drivable zones, and collision avoidance, to mention a few.

Some of the autonomous driving tasks where reinforcement learning could be applied include trajectory optimization, motion planning, dynamic pathing, controller optimization, and scenario-based learning policies for highways.

For example, parking can be achieved by learning automatic parking policies. Lane changing can be achieved using Q-Learning, while overtaking can be implemented by learning an overtaking policy that avoids collision and maintains a steady speed thereafter.

AWS DeepRacer is an autonomous racing car that has been designed to test out RL on a physical track. It uses cameras to visualize the runway and a reinforcement learning model to control the throttle and direction.
The Tiger Problem
The Tiger Problem is a renowned POMDP problem wherein a decision
maker stands between two closed doors on the left and right. A tiger is
put behind one of the doors with equal probability, and treasure is
placed behind the other. Therefore, the problem has two states: tiger-left
and tiger-right. The decision maker can either listen to the roars of the
tiger or can open one of the doors. Therefore, the possible actions are
listen, open-left, and open-right. However, the observation capability of
the decision-maker is not 100% accurate: there is a 15% probability that, when listening, the decision maker misinterprets the side on which the tiger is present. Naturally, if the decision maker opens
the door with the tiger, then they will get hurt. Therefore, we assign a
reward of -100 in this case. On the other hand, the decision maker will
gain a reward of 10 if they choose the door with treasure. Once a door
is opened, the tiger is again randomly assigned to one of the doors, and
the decision maker gets to choose again.
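The problem data can be written down compactly as below. This encoding is an assumption for illustration (in particular, the -1 cost for listening is the value used in the classical formulation and is not stated in the notes), together with the Bayes belief update performed after each listen.

# Hedged specification of the Tiger POMDP described above.
TIGER_POMDP = {
    "states": ["tiger-left", "tiger-right"],
    "actions": ["listen", "open-left", "open-right"],
    "observations": ["hear-left", "hear-right"],
    # Probability of hearing the tiger on the correct side when listening (85%).
    "observation_accuracy": 0.85,
    "rewards": {
        "listen": -1,              # assumed listening cost (classical value, not in the notes)
        "open-tiger-door": -100,
        "open-treasure-door": 10,
    },
    # After any door is opened the tiger is re-placed uniformly at random,
    # so the belief resets to (0.5, 0.5).
    "initial_belief": {"tiger-left": 0.5, "tiger-right": 0.5},
}

def belief_update_after_listen(belief, observation, accuracy=0.85):
    """Bayes update of the belief over the tiger's position after a 'listen' observation."""
    p_left = belief["tiger-left"]
    if observation == "hear-left":
        likelihood_left, likelihood_right = accuracy, 1 - accuracy
    else:
        likelihood_left, likelihood_right = 1 - accuracy, accuracy
    unnorm_left = likelihood_left * p_left
    unnorm_right = likelihood_right * (1 - p_left)
    total = unnorm_left + unnorm_right
    return {"tiger-left": unnorm_left / total, "tiger-right": unnorm_right / total}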
The Tiger Problem can be solved using the default grid method, which uses point-based value iteration; the resulting policy can be summarized as follows.

As can be interpreted from the resulting policy graph, the decision maker starts at the "Initial Belief" state, wherein there is an equal probability of the tiger being behind the left or right door. Once an observation of "tiger-left" is made, the decision maker again chooses to listen. If the second observation is different, then the decision maker returns to the "Initial Belief" state. However, if the same observation is made twice, the decision maker chooses to open the right door. A similar case happens for the observation "tiger-right". Once the reward is obtained, the problem is reset, and the decision maker returns to the "Initial Belief" state. In the subsequent sections, we present extensions to the POMDP problem and its applications.
