
CS 6300 Q-Learning Practice March 9, 2021

Consider the grid-world given below and an agent who is trying to learn the optimal policy.
Rewards are only awarded for taking the Exit action from one of the shaded states. Taking this
action moves the agent to the Done state, and the MDP terminates. Assume γ = 1 and α = 0.5 for
all calculations. Write your equations with γ and α shown explicitly wherever they apply.

1. The agent starts from the top-left corner, and you are given the following episodes from runs
of the agent through this grid-world. Each line in an episode is a tuple (s, a, s′, r).

Episode 1:
(3,1), S, (2,1), 0
(2,1), E, (2,2), 0
(2,2), E, (2,3), 0
(2,3), N, (3,3), +50

Episode 2:
(3,1), S, (2,1), 0
(2,1), E, (2,2), 0
(2,2), S, (1,2), -100

Episode 3:
(3,1), S, (2,1), 0
(2,1), E, (2,2), 0
(2,2), E, (2,3), 0
(2,3), S, (1,3), +30

Episode 4:
(3,1), S, (2,1), 0
(2,1), E, (2,2), 0
(2,2), E, (2,3), 0
(2,3), N, (3,3), +50

Episode 5:
(3,1), S, (2,1), 0
(2,1), E, (2,2), 0
(2,2), E, (2,3), 0
(2,3), S, (1,3), +30

Fill in the following Q-values obtained by direct evaluation from the samples:

Q((2,3), N) =
Q((2,3), S) =
Q((2,2), E) =
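To check these by machine rather than by hand, here is a small Python sketch (not part of the original exercise) that performs direct evaluation: it averages the discounted return observed after each visit to a state–action pair. The `episodes` list simply transcribes the five episodes above; all variable names are my own.

```python
from collections import defaultdict

# Episodes from part 1; each step is (s, a, s', r).
episodes = [
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'N',(3,3),50)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'S',(1,2),-100)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'S',(1,3),30)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'N',(3,3),50)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'S',(1,3),30)],
]

GAMMA = 1.0  # discount factor given in the problem

# Direct evaluation: Q(s, a) is the average of the discounted returns
# observed after taking a in s, over all visits in the episodes.
sample_returns = defaultdict(list)
for episode in episodes:
    for t in range(len(episode)):
        s, a, _, _ = episode[t]
        # Discounted return accumulated from step t to the end of the episode.
        g = sum(GAMMA ** (k - t) * episode[k][3] for k in range(t, len(episode)))
        sample_returns[(s, a)].append(g)

q_direct = {sa: sum(gs) / len(gs) for sa, gs in sample_returns.items()}
for sa in [((2, 3), 'N'), ((2, 3), 'S'), ((2, 2), 'E')]:
    print(sa, q_direct[sa])
```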

2. Q-learning is an online algorithm to learn optimal Q-values in an MDP with unknown rewards
and transition function. The update equation is:

Q(s_t, a_t) = (1 − α) Q(s_t, a_t) + α (R(s_t, a_t, s_{t+1}) + γ max_{a′} Q(s_{t+1}, a′))

where γ is the discount factor, α is the learning rate, and the sequence of observations is
(…, s_t, a_t, s_{t+1}, r_t, …).
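As a concrete reading of this update rule, the following minimal Python sketch applies a single Q-learning update with the α = 0.5 and γ = 1 given in the problem. The `ACTIONS` list is an assumed placeholder for the grid-world's action set, and `Q` is a table initialized to all zeros.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0                 # learning rate and discount factor from the problem
ACTIONS = ['N', 'S', 'E', 'W', 'Exit']  # assumed action set for this grid-world

Q = defaultdict(float)                  # Q-table, initialized to all zeros

def q_learning_update(s, a, s_next, r):
    """Apply one Q-learning update for the observed transition (s, a, s', r)."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)   # max_{a'} Q(s', a')
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

# Example: the last step of Episode 1, applied to an all-zero Q-table.
q_learning_update((2, 3), 'N', (3, 3), 50)
print(Q[((2, 3), 'N')])                 # (1 - 0.5)*0 + 0.5*(50 + 1*0) = 25.0
```

Because every Q-value starts at zero, the first non-trivial update can only come from a transition with a non-zero reward; later updates can then propagate that value backwards through the max_{a′} Q(s_{t+1}, a′) term.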

(a) Particularize the Q-learning equation for this problem.

Q((2, 1), E) =

Q((2, 2), E) =

Q((2, 2), S) =

Q((2, 3), N) =

Q((2, 3), S) =

Q((3, 1), S) =

(b) Given the episodes in part 1, fill in the time at which the following Q-values first become
non-zero. Your answer should be of the form (episode#, iter#), where iter# is the Q-learning
update iteration within that episode.

Q((2,1), E) =
Q((2,2), E) =
Q((2,3), S) =
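One way to sanity-check answers of this form is to replay the five episodes in order, apply the Q-learning update at every step, and record the first (episode#, iter#) at which each queried Q-value leaves zero. The sketch below does exactly that; it reuses the same transcription of the episodes and the same assumed action set as the earlier sketches.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0
ACTIONS = ['N', 'S', 'E', 'W', 'Exit']  # assumed action set

# The five episodes from part 1, in the order they were run.
episodes = [
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'N',(3,3),50)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'S',(1,2),-100)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'S',(1,3),30)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'N',(3,3),50)],
    [((3,1),'S',(2,1),0), ((2,1),'E',(2,2),0), ((2,2),'E',(2,3),0), ((2,3),'S',(1,3),30)],
]

Q = defaultdict(float)
first_nonzero = {}  # (s, a) -> (episode#, iter#), both 1-indexed

for ep_num, episode in enumerate(episodes, start=1):
    for it_num, (s, a, s_next, r) in enumerate(episode, start=1):
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
        if Q[(s, a)] != 0 and (s, a) not in first_nonzero:
            first_nonzero[(s, a)] = (ep_num, it_num)

for sa in [((2, 1), 'E'), ((2, 2), 'E'), ((2, 3), 'S')]:
    print(sa, first_nonzero.get(sa))
```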

3. Repeat with SARSA. The update equation is:

Q(s_t, a_t) = (1 − α) Q(s_t, a_t) + α (R(s_t, a_t, s_{t+1}) + γ Q(s_{t+1}, a_{t+1}))
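For comparison with the Q-learning sketch above, here is a minimal SARSA update in the same style: the only change is that the backup uses Q(s_{t+1}, a_{t+1}) for the action actually taken next, rather than a maximum over actions. For the final transition of an episode, one common convention is to treat the successor (the Done state) as having Q-value zero, which an all-zero table gives automatically.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0  # values given in the problem

def sarsa_update(Q, s, a, r, s_next, a_next):
    """Apply one SARSA update for (s_t, a_t, r_t, s_{t+1}, a_{t+1}).

    Unlike Q-learning, the backup uses Q(s', a') for the action a' that was
    actually taken next, not the maximum over all actions in s'.
    """
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * Q[(s_next, a_next)])

# Example: step 3 of Episode 1, where the action taken next (at (2,3)) was N.
Q = defaultdict(float)
sarsa_update(Q, (2, 2), 'E', 0, (2, 3), 'N')
print(Q[((2, 2), 'E')])  # stays 0.0 until Q((2,3), N) itself becomes non-zero
```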

(a) Particularize the SARSA equation for this problem.

Q((2, 1), E) =

Q((2, 2), E) =

Q((2, 2), S) =

Q((2, 3), N) =

Q((2, 3), S) =

Q((3, 1), S) =

(b) Given the episodes in part 1, fill in the time at which the following Q-values first become
non-zero.

Q((2,1), E) =
Q((2,2), E) =
Q((2,3), S) =
