Q-learning
Consider the grid-world given below and an agent who is trying to learn the optimal policy.
Rewards are only awarded for taking the Exit action from one of the shaded states. Taking this
action moves the agent to the Done state, and the MDP terminates. Assume γ = 1 and α = 0.5 for
all calculations. Where necessary, write your equations explicitly in terms of γ and α.
1. The agent starts from the top-left corner, and you are given the following episodes from runs
of the agent through this grid-world. Each line in an episode is a tuple (s, a, s′, r).
Fill in the following Q-values, obtained by direct evaluation from the samples:
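As a rough illustration of what direct evaluation computes, here is a minimal Python sketch: it averages the observed returns following each (state, action) pair over the sampled episodes. The episode data in the sketch is a made-up placeholder, not the episodes from this worksheet.

```python
from collections import defaultdict

gamma = 1.0

# Each episode is a list of (s, a, s_next, r) tuples, as in part 1.
# Hypothetical placeholder data only, NOT the worksheet's episodes:
episodes = [
    [((1, 1), "E", (1, 2), 0),
     ((1, 2), "Exit", "Done", 10)],
]

returns = defaultdict(list)
for episode in episodes:
    g = 0.0
    # Walk backwards so g accumulates the return from each step onward.
    for (s, a, s_next, r) in reversed(episode):
        g = r + gamma * g
        returns[(s, a)].append(g)

# Direct evaluation: Q(s, a) is the average observed return after taking a in s.
Q = {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
print(Q)
```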
2. Q-learning is an online algorithm for learning optimal Q-values in an MDP with unknown rewards
and transition function. The update equation is:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_t + γ max_{a′} Q(s_{t+1}, a′) ],

where γ is the discount factor, α is the learning rate, and the sequence of observations is
(…, s_t, a_t, s_{t+1}, r_t, …).
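A minimal Python sketch of this update with γ = 1 and α = 0.5, assuming a Q-table that defaults to 0; the action set and the sample transition at the end are assumptions for illustration, not facts from the worksheet.

```python
from collections import defaultdict

gamma, alpha = 1.0, 0.5
Q = defaultdict(float)                      # Q-values default to 0
actions = ["N", "S", "E", "W", "Exit"]      # assumed action set for this grid-world

def q_learning_update(s, a, s_next, r, terminal):
    # The terminal Done state contributes no future value.
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Hypothetical sample transition, not taken from the worksheet's episodes:
q_learning_update((3, 3), "Exit", "Done", 100, terminal=True)
```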
(a) Fill in the following Q-values obtained after running Q-learning on the episodes from part 1:

Q((2, 1), E) =
Q((2, 2), E) =
Q((2, 2), S) =
Q((2, 3), N ) =
Q((2, 3), S) =
Q((3, 1), S) =
(b) Given the episodes in part 1, fill in the time at which the following Q-values first become
non-zero. Your answer should be of the form (episode#, iter#), where iter# is the Q-learning
update iteration in that episode.
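One way to answer questions like this mechanically is to replay the episodes through Q-learning and record when each Q-value first becomes non-zero. The sketch below is illustrative only; the episode format and action set are assumptions, not taken verbatim from part 1.

```python
from collections import defaultdict

gamma, alpha = 1.0, 0.5
actions = ["N", "S", "E", "W", "Exit"]      # assumed action set

def first_nonzero_times(episodes):
    """Replay episodes with Q-learning updates and record (episode#, iter#)
    for the first update that makes each Q(s, a) non-zero (both 1-indexed)."""
    Q = defaultdict(float)
    first = {}
    for ep, episode in enumerate(episodes, start=1):
        for it, (s, a, s_next, r) in enumerate(episode, start=1):
            best_next = 0.0 if s_next == "Done" else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            if Q[(s, a)] != 0 and (s, a) not in first:
                first[(s, a)] = (ep, it)
    return first
```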
3. Repeat with SARSA. The update equation is:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_t + γ Q(s_{t+1}, a_{t+1}) ],

where a_{t+1} is the action actually taken from state s_{t+1}.
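A minimal sketch of the SARSA update with the same γ and α, mirroring the Q-learning sketch above but bootstrapping from the action actually taken next rather than the max over actions.

```python
from collections import defaultdict

gamma, alpha = 1.0, 0.5
Q = defaultdict(float)  # Q-values default to 0

def sarsa_update(s, a, s_next, r, a_next, terminal):
    # Bootstrap from the action a_next actually taken in s_next, not the max.
    next_q = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * next_q)
```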
(a) Fill in the following Q-values obtained after running SARSA on the episodes from part 1:

Q((2, 1), E) =
Q((2, 2), E) =
Q((2, 2), S) =
Q((2, 3), N ) =
Q((2, 3), S) =
Q((3, 1), S) =
(b) Given the episodes in part 1, fill in the time at which the following Q-values first become
non-zero.