20AI903_RL_UNIT 4
20AI903 REINFORCEMENT LEARNING (L T P C: 3 0 0 3)
OBJECTIVES:
To know about the concepts of Reinforcement Learning
To understand Markov Decision Processes, Dynamic Programming, and Monte Carlo methods.
To understand temporal difference learning.
To know how function approximation is used in reinforcement learning.
To study basics of Deep Reinforcement Learning.
UNIT I INTRODUCTION TO REINFORCEMENT LEARNING 9
Introduction - Elements of RL, History of RL - Limitations and Scope - Examples – Multi-armed Bandits
– n-armed Bandit Problem – Action-Value Methods – Incremental Implementation –
Nonstationary Problem – Optimistic Initial Values – Upper Confidence Bound Action Selection –
Gradient Bandits – Contextual Bandits.
UNIT II MARKOV DECISION PROCESSES AND DYNAMIC PROGRAMMING 9
Finite Markov Decision Processes - The Agent-Environment Interface - Goals and Rewards –
Returns – Episodic and Continuing Tasks – Markov Property – MDPs - Value Functions – Optimality
and Approximation - Dynamic Programming - Policy Evaluation - Policy Improvement - Policy
Iteration - Value Iteration - Asynchronous DP - Efficiency of DP - Monte Carlo Prediction - Monte
Carlo Estimation of Action Values - Monte Carlo Control - Off-policy Monte Carlo Prediction –
Off-Policy Monte Carlo Control.
UNIT III TEMPORAL DIFFERENCE LEARNING 9
Temporal-Difference prediction – Advantages of TD Prediction Methods -Optimality of TD(0) –
Sarsa: On-policy TD Control – Q-Learning: Off-Policy TD Control – n-step TD Prediction – Forward
View – Backward View – Sarsa(λ) – Watkins's Q(λ) – Off-policy Eligibility Traces.
UNIT IV FUNCTION APPROXIMATION METHODS 9
Value Prediction with function Approximation – Gradient Descent Methods – Linear Methods –
Control with Function Approximation – Off-Policy Approximation of Action Values – Policy
Approximation – Actor-Critic Methods – Eligibility Traces – R-Learning and the Average-Reward
Setting.
UNIT V DEEP REINFORCEMENT LEARNING 9
Deep Q-Learning – Rainbow DQN – DQN Improvements –
Policy Gradient Methods – Benefits – Calculation – Theorem – Policy Functions – Implementation
– Hierarchical RL – Multi-Agent RL
TOTAL: 45 PERIODS
OUTCOMES:
At the end of this course, the students will be able to:
CO1: Learn about the concepts of Reinforcement Learning.
CO2: Understand the concepts of Dynamic Programming and Markov Decision Processes.
CO3: Understand the concept of Temporal Difference Learning.
CO4: Describe how function approximation is used in reinforcement learning.
CO5: Learn the basics of Deep Reinforcement Learning.
TEXT BOOKS:
1. Sutton R. S. and Barto A. G., "Reinforcement Learning: An Introduction", MIT Press, Second
Edition, 2020.
2. Phil Winder, "Reinforcement Learning: Industrial Applications of Intelligent Agents",
O'Reilly, 2021.
REFERENCES:
1. Kevin Murphy, "Machine Learning - A Probabilistic Perspective", MIT Press, 2012.
2. Christopher Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.
UNIT – IV
FUNCTION APPROXIMATION METHODS
UNIT-4 FUNCTION APPROXIMATION METHODS

Contents
1. Value Prediction with Function Approximation
2. Gradient-Descent Methods
3. Linear Methods
4. Control with Function Approximation
5. Off-Policy Approximation of Action Values
6. Policy Approximation
7. Actor-Critic Methods
8. Eligibility Traces
9. R-Learning and the Average-Reward Setting
UNIT-4 FUNCTION
APPROXIMATION METHODS
Value Prediction with function Approximation – Gradient Descent Methods –
Linear Methods – Control with Function Approximation – Off-Policy
Approximation of Action Values – Policy Approximation – Actor-Critic Methods –
Eligibility Traces – R-Learning and the Average-Reward Setting.
4.1 VALUE PREDICTION WITH FUNCTION APPROXIMATION
4.2 GRADIENT-DESCENT METHODS
We now develop in detail one class of learning methods for function approximation in
value prediction, those based on gradient descent. Gradient-descent methods are
among the most widely used of all function approximation methods and are
particularly well suited to reinforcement learning.
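As a concrete illustration of the idea, the following is a minimal sketch, not taken from these notes, of gradient-descent value prediction: semi-gradient TD(0) with a linear approximator v̂(s, w) = w · x(s). The five-state random walk environment, the one-hot features, and the constants are assumptions made purely for the example.

import numpy as np

N_STATES = 5      # non-terminal states 0..4; episodes start in the middle
ALPHA = 0.1       # step-size parameter
GAMMA = 1.0       # undiscounted episodic task

def features(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0    # one-hot features, the simplest linear feature choice
    return x

def v_hat(s, w):
    return w @ features(s)

def semi_gradient_td0(episodes=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(episodes):
        s = N_STATES // 2
        while True:
            s_next = s + rng.choice([-1, 1])   # random-walk behaviour
            if s_next < 0:                     # terminate on the left, reward 0
                target = 0.0
            elif s_next >= N_STATES:           # terminate on the right, reward +1
                target = 1.0
            else:                              # reward 0, bootstrap from v_hat
                target = GAMMA * v_hat(s_next, w)
            # gradient-descent update: w <- w + alpha*[target - v_hat(s,w)]*grad v_hat(s,w)
            w += ALPHA * (target - v_hat(s, w)) * features(s)
            if s_next < 0 or s_next >= N_STATES:
                break
            s = s_next
    return w

print(semi_gradient_td0())   # tends toward the true values [1/6, 2/6, 3/6, 4/6, 5/6]

With one-hot features the method reduces to tabular TD(0); with richer features the same update generalizes across states.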
4.3 LINEAR METHODS
4.4 CONTROL WITH FUNCTION APPROXIMATION
We now extend value prediction methods using function approximation to control
methods, following the pattern of GPI. First we extend the state-value prediction
methods to action-value prediction methods, then we combine them with policy
improvement and action selection techniques. As usual, the problem of ensuring
exploration is solved by pursuing either an on-policy or an off-policy approach.
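As a sketch of this pattern, and not a complete algorithm from these notes, the code below shows one episodic semi-gradient Sarsa step with linear action-value features q̂(s, a, w) = w · x(s, a) and ε-greedy action selection. The feature function x(s, a), the action set, and the step sizes are placeholders assumed to be supplied by the surrounding problem.

import numpy as np

def q_hat(x_sa, w):
    # linear action-value estimate q_hat(s, a, w) = w . x(s, a)
    return w @ x_sa

def epsilon_greedy(s, w, actions, x, epsilon, rng):
    # exploration is handled on-policy by acting epsilon-greedily w.r.t. q_hat
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    values = [q_hat(x(s, a), w) for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_update(w, x, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    # one semi-gradient Sarsa step: w <- w + alpha*[target - q_hat]*grad q_hat
    x_sa = x(s, a)
    target = r if done else r + gamma * q_hat(x(s_next, a_next), w)
    w += alpha * (target - q_hat(x_sa, w)) * x_sa   # gradient of w.x(s,a) is x(s,a)
    return w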
All the methods we have discussed above have used accumulating eligibility
traces. Although replacing traces (Section 3.8) are known to have
advantages in tabular methods, they do not extend directly to the case of
function approximation.
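The contrast between the two kinds of traces can be made concrete with a small sketch, not from these notes, for a linear approximator with binary features x(s): an accumulating trace adds the feature vector on every visit, while a replacing trace resets the traced components to the feature value.

import numpy as np

def accumulating_trace(e, x_s, gamma, lam):
    # accumulate: e <- gamma*lambda*e + x(s)
    return gamma * lam * e + x_s

def replacing_trace(e, x_s, gamma, lam):
    # decay, then replace the components of active features (binary features assumed)
    e = gamma * lam * e
    e[x_s > 0] = x_s[x_s > 0]
    return e

e = np.zeros(3)
x = np.array([1.0, 0.0, 0.0])            # the same state visited three times in a row
for _ in range(3):
    e = accumulating_trace(e, x, gamma=1.0, lam=0.9)
print(e[0])                               # 2.71: accumulating traces grow past 1,
                                          # whereas a replacing trace would stay at 1.0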
Example 4.2: Mountain-Car Task Consider the task of driving an
underpowered car up a steep mountain road, as suggested by the diagram in
the upper left of Figure 4.10. The difficulty is that gravity is stronger than the
car's engine, and even at full throttle the car cannot accelerate up the steep
slope. The only solution is to first move away from the goal and up the
opposite slope on the left. Then, by applying full throttle the car can build up
enough inertia to carry it up the steep slope even though it is slowing down
the whole way. This is a simple example of a continuous control task where
things have to get worse in a sense (farther from the goal) before they can
get better. Many control methodologies have great difficulties with tasks of
this kind unless explicitly aided by a human designer.
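For reference, the dynamics commonly used for this task (following Sutton and Barto) can be written in a few lines; the exact constants below are the standard textbook choices and should be read as assumptions of this sketch rather than as part of these notes.

import math

POS_MIN, POS_MAX = -1.2, 0.5      # the goal is at the right edge, position 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07

def step(position, velocity, action):
    # action is -1 (full throttle reverse), 0 (coast) or +1 (full throttle forward)
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(VEL_MIN, min(VEL_MAX, velocity))
    position += velocity
    position = max(POS_MIN, min(POS_MAX, position))
    if position == POS_MIN:       # hitting the left wall stops the car
        velocity = 0.0
    reached_goal = position >= POS_MAX
    reward = -1.0                 # -1 per time step until the goal is reached
    return position, velocity, reward, reached_goal

Because the reward is -1 on every step, value estimates far from the goal become strongly negative, which is what eventually drives a learned policy to back up the left slope first.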
Exercise 4.8 Describe how the actor-critic control method can be combined with
gradient-descent function approximation.
4.5 OFF-POLICY APPROXIMATION OF ACTION VALUES
The extension of function approximation turns out to be significantly different and harder for
off-policy learning than it is for on-policy learning.
The tabular off-policy methods readily extend to semi-gradient algorithms, but these
algorithms do not converge as robustly as they do under on-policy training.
Recall that in off-policy learning we seek to learn a value function for a target policy π, given
data due to a different behavior policy b.
In the control case, action values are learned, and both policies typically change during
learning: π being the greedy policy with respect to q̂, and b being something more exploratory
such as the ε-greedy policy with respect to q̂.
The challenge of off-policy learning can be divided into two parts:
• The target of the update, which arises already in the tabular case; importance sampling is
used to deal with this (sketched below).
• The distribution of the updates, which arises only with function approximation, because
updates are made according to the behavior policy's state distribution rather than the
on-policy distribution.
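As a sketch of the first part of the challenge, and not of any complete algorithm from these notes, the update below applies the per-step importance-sampling ratio ρ = π(a|s)/b(a|s) to an off-policy semi-gradient TD(0) state-value update with linear features; the probability functions pi_prob and b_prob are placeholders assumed to be supplied.

import numpy as np

def off_policy_td0_update(w, x_s, x_s_next, r, a, pi_prob, b_prob,
                          alpha=0.1, gamma=0.99, done=False):
    rho = pi_prob(a) / b_prob(a)            # importance-sampling ratio for this step
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    td_error = r + gamma * v_next - v_s
    w += alpha * rho * td_error * x_s       # semi-gradient update weighted by rho
    return w

Note that weighting the update in this way corrects its target but not the distribution of states at which updates are made, which is the second part of the challenge above.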
4.6 POLICY APPROXIMATION
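In policy approximation the policy itself is represented as a parameterized function, for example a softmax over action preferences or a Gaussian policy (see Question 26 in Part A). As a minimal illustrative sketch, not taken from these notes, the code below assumes linear action preferences h(s, a) = θ · x(s, a) and shows the resulting softmax policy together with the gradient of log π(a|s), the quantity used by policy-gradient and actor-critic updates.

import numpy as np

def softmax_policy(theta, x):
    # x[a] is the feature vector x(s, a) for each action; returns pi(.|s)
    prefs = np.array([theta @ x_a for x_a in x])    # action preferences h(s, a)
    prefs -= prefs.max()                            # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def log_policy_gradient(theta, x, a):
    # grad_theta log pi(a|s) = x(s,a) - sum_b pi(b|s) x(s,b) for linear preferences
    probs = softmax_policy(theta, x)
    expected_x = sum(p * x_b for p, x_b in zip(probs, x))
    return x[a] - expected_x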
4.7 ACTOR-CRITIC METHODS
Actor-critic methods are TD methods that have a separate memory structure to explicitly
represent the policy independent of the value function. The policy structure is known as
the actor, because it is used to select actions, and the estimated value function is known as
the critic, because it criticizes the actions made by the actor. Learning is always on-policy:
the critic must learn about and critique whatever policy is currently being followed by the
actor. The critique takes the form of a TD error. This scalar signal is the sole output of the
critic and drives all learning in both actor and critic, as suggested by Figure 4.15.
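The following minimal sketch, not from these notes, shows how the single TD error produced by the critic can drive both updates. Linear critic features x(s), a separate actor parameter vector θ, and the gradient of log π(a|s) supplied by the policy parameterization (for example the softmax sketch in Section 4.6 above) are assumptions for the example.

import numpy as np

def actor_critic_step(w, theta, x_s, x_s_next, r, grad_log_pi,
                      alpha_w=0.1, alpha_theta=0.01, gamma=0.99, done=False):
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    delta = r + gamma * v_next - v_s                     # TD error produced by the critic
    w = w + alpha_w * delta * x_s                        # critic update (semi-gradient TD)
    theta = theta + alpha_theta * delta * grad_log_pi    # actor update, scaled by the same delta
    return w, theta, delta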
These issues were explored early on, primarily for the immediate reward case
(Sutton, 1984; Williams, 1992) and have not been brought fully up to date.
Many of the earliest reinforcement learning systems that used TD methods were
actor-critic methods (Witten, 1977; Barto, Sutton, and Anderson, 1983). Since
then, more attention has been devoted to methods that learn action-value
functions and determine a policy exclusively from the estimated values (such as
Sarsa and Q-learning).
This divergence may be just historical accident. For example, one could imagine
intermediate architectures in which both an action-value function and an
independent policy would be learned. In any event, actor-critic methods are likely
to remain of current interest because of two significant apparent advantages:
• They require minimal computation in order to select actions. Consider a case
where there are an infinite number of possible actions--for example, a
continuous-valued action. Any method learning just action values must search
through this infinite set in order to pick an action. If the policy is explicitly
stored, then this extensive computation may not be needed for each action
selection.
• They can learn an explicitly stochastic policy; that is, they can learn the optimal
probabilities of selecting various actions. This ability turns out to be useful in
competitive and non-Markov cases (e.g., see Singh, Jaakkola, and Jordan,
1994).
In addition, the separate actor in actor-critic methods makes them more
appealing in some respects as psychological and biological models. In some cases
it may also make it easier to impose domain-specific constraints on the set of
allowed policies.
4.8 ELIGIBILITY TRACES
4.9 R-LEARNING AND THE AVERAGE-REWARD SETTING
We now introduce a third classical setting—alongside the episodic and discounted settings— for
formulating the goal in Markov decision problems (MDPs). Like the discounted setting, the
average reward setting applies to continuing problems, problems for which the interaction
between agent and environment goes on and on forever without termination or start states.
Unlike that setting, however, there is no discounting—the agent cares just as much about
delayed rewards as it does about immediate reward. The average-reward setting is one of the
major settings commonly considered in the classical theory of dynamic programming and less
commonly in reinforcement learning. As we discuss in the next section, the discounted setting is
problematic with function approximation, and thus the average-reward setting is needed to
replace it.
In the average-reward setting, the quality of a policy π is defined as the average rate of reward,
or simply average reward, while following that policy, which we denote as r(π):
r(π) = lim_{h→∞} (1/h) Σ_{t=1}^{h} E[ R_t | S_0, A_0, ..., A_{t−1} ~ π ]        (4.9)

r(π) = lim_{t→∞} E[ R_t | S_0, A_0, ..., A_{t−1} ~ π ]
     = Σ_s μ_π(s) Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) r                          (4.10)
where the expectations are conditioned on the initial state, S0, and on the subsequent
actions, A0, A1, ..., A_{t−1}, being taken according to π, and where
μ_π(s) = lim_{t→∞} Pr{S_t = s | A_0, ..., A_{t−1} ~ π} is the steady-state distribution under π,
which is assumed to exist for any π and to be independent of S0. This assumption about the
MDP is known as ergodicity. It means that where the MDP starts or any early decision made by
the agent can have only a temporary effect; in the long run the expectation of being in a state
depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to
guarantee the existence of the limits in the equations above.
There are subtle distinctions that can be drawn between different kinds of optimality in the
undiscounted continuing case.
Nevertheless, for most practical purposes it may be adequate simply to order policies
according to their average reward per time step, in other words, according to their r(π).
This quantity is essentially the average reward under π, as suggested by (4.10). In
particular, we consider all policies that attain the maximal value of r(π) to be optimal. Note
that the steady state distribution is the special distribution under which, if you select
actions according to π, you remain in the same distribution. That is, for which
Σ_s μ_π(s) Σ_a π(a|s) p(s' | s, a) = μ_π(s')                                     (4.11)

In the average-reward setting, returns are defined in terms of differences between rewards and
the average reward:

G_t = R_{t+1} − r(π) + R_{t+2} − r(π) + R_{t+3} − r(π) + ...                     (4.12)
This is known as the differential return, and the corresponding value functions
are known as differential value functions. They are defined in the same way and
we will use the same notation for them as we have all along:
v_π(s) = E_π[ G_t | S_t = s ]   and   q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
(similarly for v* and q*). Differential value functions also have Bellman
equations, just slightly different from those we have seen earlier. We simply
remove all γs and replace all rewards by the difference between the reward
and the true average reward:
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r − r(π) + v_π(s') ]

q_π(s, a) = Σ_{s',r} p(s', r | s, a) [ r − r(π) + Σ_{a'} π(a'|s') q_π(s', a') ]

v*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r − max_π r(π) + v*(s') ]

q*(s, a) = Σ_{s',r} p(s', r | s, a) [ r − max_π r(π) + max_{a'} q*(s', a') ]
There is also a differential form of the two TD errors:
δ_t = R_{t+1} − R̄_t + v̂(S_{t+1}, w_t) − v̂(S_t, w_t)

δ_t = R_{t+1} − R̄_t + q̂(S_{t+1}, A_{t+1}, w_t) − q̂(S_t, A_t, w_t)

where R̄_t is an estimate at time t of the average reward r(π).
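The following is a hedged sketch, not from these notes, of how the differential action-value TD error can be used in an average-reward control update with linear features; both the weights w and the running average-reward estimate r_bar are driven by the same error, in the spirit of R-learning.

import numpy as np

def differential_sarsa_step(w, r_bar, x_sa, x_sa_next, r, alpha=0.1, beta=0.01):
    q = w @ x_sa
    q_next = w @ x_sa_next
    delta = r - r_bar + q_next - q      # differential TD error
    r_bar = r_bar + beta * delta        # update the average-reward estimate
    w = w + alpha * delta * x_sa        # semi-gradient action-value update
    return w, r_bar

# example call with made-up one-hot features
w, r_bar = np.zeros(4), 0.0
x1, x2 = np.eye(4)[0], np.eye(4)[1]
w, r_bar = differential_sarsa_step(w, r_bar, x1, x2, r=2.0)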
Part A – Questions
& Answers
Unit – IV
Part A - Questions & Answers
1. What is the main objective of value prediction with function approximation? [K2, CO4]
The objective is to estimate the value function with a parameterized function approximator
whose parameters are adjusted, for example by gradient descent, to minimize the error between
the predicted values and the target values, so that the estimate generalizes across states.
7. Describe actor-critic methods in reinforcement learning. [K3, CO4]
Actor-critic methods are TD methods with a separate memory structure for the policy. The actor
selects actions, while the critic estimates the value function and criticizes the actor's actions
through a TD error, which drives learning in both components.
11. What are the advantages of using function approximation over tabular methods in
reinforcement learning? [K2, CO4]
Function approximation generalizes across states, so it can handle large or continuous state
spaces where storing a separate value for every state is impractical, and it typically requires far
less memory and experience than tabular methods.
16. What are the main advantages of policy approximation over value-based methods? [K2, CO4]
A parameterized policy can select actions with minimal computation even when the action space
is large or continuous, and it can represent explicitly stochastic policies, learning the optimal
probabilities of selecting the various actions.
21. What are some drawbacks of using function approximation in
reinforcement learning? [K2, CO4]
Function approximation can introduce approximation errors, stability issues, and
difficulties in convergence, especially in high-dimensional and non-linear environments.
24. What are some common techniques to ensure stability and convergence
in control with function approximation? [K3, CO4]
Common techniques include using function approximators with bounded outputs,
employing appropriate exploration strategies, and incorporating stability constraints into
the learning algorithm.
26. What are some typical policy parameterizations used in policy
approximation methods? [K3, CO4]
Common parameterizations include softmax policies, Gaussian policies, and deterministic
policies parameterized by neural networks.
28. What is the purpose of the eligibility trace in TD(λ) algorithms? [K3, CO4]
The eligibility trace captures the influence of past states and actions on the current
estimate, allowing for more efficient updates of the function approximator's parameters.
30. How does the choice of reward function affect the performance of R-
learning algorithms? [K3, CO4]
The choice of reward function directly impacts the estimation of the average reward and
can significantly influence the learning process and final performance of R-learning
algorithms.
31. In what scenarios would you prefer to use Q-learning over R-learning? [K3,
CO4]
Q-learning is typically preferred when the discount factor is significant or when the
environment has a well-defined terminal state, making the discounted sum of rewards a
more appropriate measure of performance.
Part B – Questions
Unit – IV
Real-time Applications
Applications in self-driving cars
Various papers have proposed Deep Reinforcement Learning for
autonomous driving. In self-driving cars, there are various aspects to
consider, such as speed limits at various places, drivable zones, avoiding
collisions—just to mention a few.
The Tiger Problem
The Tiger Problem is a renowned POMDP problem wherein a decision
maker stands between two closed doors on the left and right. A tiger is
put behind one of the doors with equal probability, and treasure is
placed behind the other. Therefore, the problem has two states tiger-left
and tiger-right. The decision maker can either listen to the roars of the
tiger or can open one of the doors. Therefore, the possible actions are
listen, open-left, and open-right. However, the observation capability of
the decision-maker is not 100% accurate. Therefore, there is a 15%
probability that the decision maker listens and interprets wrongly the
side on which the tiger is present. Naturally, if the decision maker opens
the door with the tiger, then they will get hurt. Therefore, we assign a
reward of -100 in this case. On the other hand, the decision maker will
gain a reward of 10 if they choose the door with treasure. Once a door
is opened, the tiger is again randomly assigned to one of the doors, and
the decision maker gets to choose again.
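The listening probabilities above translate directly into a Bayesian belief update over the two states. The following small sketch (not the grid or point-based solver referred to below) encodes the numbers given in the text and shows how two consistent observations make the decision maker confident enough to open the opposite door.

# Tiger Problem parameters as described above
P_CORRECT = 0.85                 # listening identifies the correct side 85% of the time
R_TIGER, R_TREASURE = -100.0, 10.0

def update_belief(b_left, obs):
    # Bayes update of P(tiger-left) after hearing obs in {"left", "right"}
    like_left = P_CORRECT if obs == "left" else 1 - P_CORRECT
    like_right = (1 - P_CORRECT) if obs == "left" else P_CORRECT
    posterior = like_left * b_left
    return posterior / (posterior + like_right * (1 - b_left))

b = 0.5                          # initial belief: tiger behind either door with equal probability
b = update_belief(b, "left")     # hear a roar on the left
print(round(b, 3))               # 0.85
b = update_belief(b, "left")     # hear it on the left again
print(round(b, 3))               # about 0.97, confident enough to open the right door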
Now, the above Tiger Problem can be solved with the default grid method, which
uses point-based value iteration. The resulting policy behaves as follows.
Starting from the "Initial Belief" state, the decision maker first listens. If the
observation is "tiger-left" but the next observation is
different, then the decision maker returns to the "Initial
Belief" state. However, if the same observation is made twice,
the decision maker chooses to open the right door. A similar
case happens for the observation "tiger-right". Once the
reward is obtained, the problem is reset, and the decision
maker returns to the "Initial Belief" state. In the subsequent
sections, we present extensions to the POMDP problem and
its applications.