
Assignment 6 (Sol.): Reinforcement Learning
Prof. B. Ravindran

1. In the search procedure listed in the lecture for Monte-Carlo tree search, what is/are the uses
of the depth parameter?

(a) allows us to identify leaf states


(b) allows us to identify terminal states
(c) can be used to impact the choice of action selection
(d) allows us to specialise value functions based on the number of steps that have been taken

Sol. (a), (c), (d)


2. Suppose you are given a finite set of transition data. Assuming that the Markov model that
can be formed with the given data is the actual MDP from which the data is generated, will
the value functions calculated by the MC and TD methods (in a manner similar to what we
saw in the lectures) necessarily agree?

(a) no
(b) yes

Sol. (a)
In our lecture example, we saw that the value functions calculated by the MC and TD methods
did not agree. The same would still hold if the MDP that generated the data was the MDP
that we formed from the data in applying the TD method.
3. In the iterative policy evaluation process, we have seen the use of different update equations
in DP, MC, and TD methods. With regard to these update equations,

(a) DP and TD make use of estimates but not MC


(b) TD makes use of estimates but not DP and MC
(c) MC and TD make use of estimates but not DP
(d) all three methods make use of estimates

Sol. (d)
DP methods update estimates based on other learned estimates, i.e., they bootstrap. MC
methods, while they do not bootstrap, make use of estimates because the target in MC meth-
ods, i.e., the sample return, is an estimate of the actual expected return; the expected value
of course, is not known. TD methods use both the above, i.e., they make use of samples of the
expected values as well as bootstrap.
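For reference, the standard forms of the three updates (constant-step-size versions for MC and TD(0); the notation may differ slightly from the lectures) are:

V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V(s')]    (DP)

V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]    (MC)

V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]    (TD(0))

The DP target is built from the current estimates V(s'), the MC target G_t is a single sample of the expected return, and the TD(0) target uses both a sampled reward and the current estimate V(S_{t+1}).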

4. Is it necessary for the behaviour policy of an off-policy learning method to have non-zero
probability of selecting all actions?

(a) no
(b) yes

Sol. (a)
If the probability of selecting certain actions in the estimation policy, i.e., the policy which
is being evaluated and/or improved, is zero, then the corresponding probability of selecting
those same actions in the behaviour policy can also be zero without causing any problem to
the off-policy learning procedure.
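To make this concrete (assuming, as in the lectures, that off-policy evaluation is carried out with importance sampling), the importance-sampling ratio for a trajectory segment is

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},

which only requires b(a \mid s) > 0 wherever \pi(a \mid s) > 0 (coverage); any trajectory containing an action with \pi(a \mid s) = 0 receives zero weight regardless of the behaviour policy.
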
5. With respect to the Expected SARSA algorithm, is exploration (using, for example, ε-greedy
action selection) required as it is in the normal SARSA and Q-learning algorithms?

(a) no
(b) yes

Sol. (b)
The difference in the update rule that differentiates Expected SARSA from the SARSA algorithm does not obviate the need for exploration in the former: without exploration the algorithm would, in general, miss out on large parts of the state space, preventing it from correctly converging (in the limit) to an optimal policy.
6. Assume that we have available a simulation model for a particular problem. To learn an optimal
policy, instead of following trajectories end-to-end, in each iteration we randomly supply a state
and an action to the model and receive the corresponding reward. This information is used for
updating the value function. Which method among the following would you expect to work
in this scenario?

(a) SARSA
(b) Expected SARSA
(c) Q-learning
(d) none of the above

Sol. (d)
Note that each of the three algorithms listed above requires, in addition to the reward, the next-state information, which is not provided by the described simulation model.
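For instance, the Q-learning update (the other two updates have the same structure in this respect) is

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)],

which cannot be computed without knowing the next state S_{t+1}.
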
7. Consider the following transitions observed for an undiscounted MDP with two states P and
Q.
P, +3, P, +2, Q, -4, P, +4, Q, -3
Q, -2, P, +3, Q, -3
Estimate the state value function using first-visit Monte-Carlo evaluation.

(a) v(P) = 2, v(Q) = -5/2


(b) v(P) = 2, v(Q) = 0

(c) v(P) = 1, v(Q) = -5/2
(d) v(P) = 1, v(Q) = 0

Sol. (c)
For first-visit MC, we consider only the first occurrence of each state in each episode (each row of the data above is one episode). The undiscounted return from the first occurrence of P is 3 + 2 - 4 + 4 - 3 = 2 in the first episode and 3 - 3 = 0 in the second; from the first occurrence of Q it is -4 + 4 - 3 = -3 and -2 + 3 - 3 = -2, respectively. Thus, we have
v(P) = (2 + 0)/2 = 1
v(Q) = (-3 - 2)/2 = -5/2

8. Considering the same transition data as above, estimate the state value function using the
every-visit Monte-Carlo evaluation.

(a) v(P) = 2, v(Q) = -5/2


(b) v(P) = 2, v(Q) = -11/4
(c) v(P) = 1/2, v(Q) = -11/4
(d) v(P) = 1/4, v(Q) = -5/2

Sol. (c)
In the every-visit case we consider every occurrence of each state in each episode. The returns from the successive occurrences of P are 2, -1 and 1 in the first episode and 0 in the second; for Q they are -3 and -3 in the first episode and -2 and -3 in the second. Thus, we have
v(P) = (2 - 1 + 1 + 0)/4 = 1/2
v(Q) = (-3 - 3 - 2 - 3)/4 = -11/4
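As a quick check, both estimates can be reproduced with a short script. This is a minimal sketch; the episode encoding below, in which each (state, reward) pair means that the reward is received on leaving that state and each row of the data above is one episode, is an assumption about how the transition data is to be read.

from collections import defaultdict

# Each row of the transition data in question 7, encoded as (state, reward) pairs;
# the reward is received on leaving the state, and the episode ends after the last pair.
episodes = [
    [("P", 3), ("P", 2), ("Q", -4), ("P", 4), ("Q", -3)],
    [("Q", -2), ("P", 3), ("Q", -3)],
]

def returns(episode):
    # Undiscounted return from every time step, computed by a backward sweep.
    G, out = 0, []
    for state, reward in reversed(episode):
        G += reward
        out.append((state, G))
    return out[::-1]

def mc_estimate(first_visit):
    samples = defaultdict(list)
    for ep in episodes:
        seen = set()
        for state, G in returns(ep):
            if first_visit and state in seen:
                continue
            seen.add(state)
            samples[state].append(G)
    return {s: sum(g) / len(g) for s, g in samples.items()}

print(mc_estimate(first_visit=True))   # {'P': 1.0, 'Q': -2.5}
print(mc_estimate(first_visit=False))  # {'P': 0.5, 'Q': -2.75}
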
9. Construct a Markov model that best explains the observations given in question 7. In this
model, what is the probability of transitioning from state P to itself? What is the expected
reward received on transitioning from state Q to state P?

(a) 1/4, -4
(b) 1/4, -3
(c) 1/2, -4
(d) 1/2, -3

Sol. (b)
The following model can be constructed from the available data (maximum-likelihood estimates based on the four transitions observed out of each state):
From P: to P with probability 1/4 (reward +3); to Q with probability 3/4 (rewards +2, +4, +3, mean +3).
From Q: to P with probability 1/2 (rewards -4, -2, mean -3); to the terminal state with probability 1/2 (rewards -3, -3, mean -3).
Hence the probability of transitioning from P to itself is 1/4, and the expected reward on transitioning from Q to P is -3.

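The counts behind these estimates can be obtained with a few lines of code (a minimal sketch, using the same assumed episode encoding as in the Monte-Carlo sketch above):

from collections import defaultdict

# Same episode encoding as before: (state, reward) pairs, reward received on leaving the state.
episodes = [
    [("P", 3), ("P", 2), ("Q", -4), ("P", 4), ("Q", -3)],
    [("Q", -2), ("P", 3), ("Q", -3)],
]

counts = defaultdict(lambda: defaultdict(int))   # counts[s][s_next]: number of observed s -> s_next transitions
rewards = defaultdict(list)                      # rewards[(s, s_next)]: rewards observed on s -> s_next

for ep in episodes:
    for i, (s, r) in enumerate(ep):
        s_next = ep[i + 1][0] if i + 1 < len(ep) else "END"   # "END" marks termination
        counts[s][s_next] += 1
        rewards[(s, s_next)].append(r)

for s, successors in counts.items():
    n = sum(successors.values())
    for s_next, c in successors.items():
        mean_r = sum(rewards[(s, s_next)]) / c
        print(f"P({s_next} | {s}) = {c}/{n}, mean reward = {mean_r}")
# P(P | P) = 1/4, mean reward = 3.0
# P(Q | P) = 3/4, mean reward = 3.0
# P(P | Q) = 2/4, mean reward = -3.0
# P(END | Q) = 2/4, mean reward = -3.0
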
10. What would be the value function estimate if batch TD(0) were applied to the above transition data?

(a) 1, -1/2
(b) 1, -2
(c) 2, -1/2
(d) 2, -2

Sol. (d)
We can solve the Bellman equations directly based on the above model (batch TD(0) converges to the value function of this model) to get
v(P) = 3 + (1/4) v(P) + (3/4) v(Q)
v(Q) = -3 + (1/2) v(P)
which give v(P) = 2 and v(Q) = -2.
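Equivalently, the two equations can be solved as a small linear system (a minimal numerical check, assuming NumPy is available; the terminal state's value is fixed at 0):

import numpy as np

# v = r + P v  =>  (I - P) v = r
P = np.array([[0.25, 0.75],    # from P: to P, to Q
              [0.50, 0.00]])   # from Q: to P, to Q (the remaining 0.5 goes to the terminal state)
r = np.array([3.0, -3.0])      # expected immediate reward from P and from Q under the model
v = np.linalg.solve(np.eye(2) - P, r)
print(v)                       # [ 2. -2.]  ->  v(P) = 2, v(Q) = -2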
