A12 Spring 2024
3. Which of the following are shortcomings of TD learning that Q-learning resolves?
(a) TD learning cannot provide values for (state, action) pairs, limiting the ability to extract
an optimal policy directly.
(b) TD learning requires knowledge of the reward and transition functions, which is not
always available.
(c) TD learning is computationally expensive and slow compared to Q-learning.
(d) TD learning often suffers from high variance in value estimation, leading to unstable
learning.
(e) TD learning cannot handle environments with continuous state and action spaces effec-
tively.
Sol. (a), (b), (d)
Refer to the lectures.
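For reference, here is a minimal sketch of the two update rules being contrasted, written in Python. The table layouts and the names alpha (step size) and gamma (discount) are illustrative assumptions, not part of the question.

```python
# TD(0) vs. Q-learning updates (standard textbook forms).

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): update a state-value table V after observing (s, r, s_next).
    V only scores states, so reading off a policy still needs a model of T and R."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: update an action-value table Q after observing (s, a, r, s_next).
    Q scores (state, action) pairs, so a greedy policy can be read off directly."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```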
4. Given 100 hypothesis functions, each trained with 10^6 samples, what is the lower bound on
the probability that there does not exist a hypothesis function with error greater than 0.1?
(a) 1 − 200e^(−2·10^4)
(b) 1 − 100e^(10^4)
(c) 1 − 200e^(10^2)
(d) 1 − 200e^(−2·10^2)
Sol. (a)
k = 100
m = 10^6
γ = 0.1
P(∄ h_i s.t. |E(h_i) − Ẽ(h_i)| > 0.1) ≥ 1 − 2·100·e^(−2·(0.1)^2·10^6)
≥ 1 − 200e^(−2·10^4)
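As a quick numerical sanity check of this bound, the expression 1 − 2k·e^(−2γ^2·m) can be evaluated directly; the snippet below is our own illustration (the function name and printout are not part of the assignment).

```python
import math

def hoeffding_union_lower_bound(k, m, gamma):
    """Lower bound on P(no hypothesis deviates from its true error by more than gamma),
    via Hoeffding's inequality combined with a union bound over k hypotheses."""
    return 1 - 2 * k * math.exp(-2 * (gamma ** 2) * m)

# k = 100 hypotheses, m = 10^6 samples, gamma = 0.1: the exponent is -2 * 10^4,
# so the bound is numerically indistinguishable from 1.
print(hoeffding_union_lower_bound(k=100, m=10**6, gamma=0.1))  # 1.0
```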
COMPREHENSION:
For the rest of the questions, we will follow a simple game and see how a Reinforcement
Learning agent can learn to behave optimally in it.
This is our game:
At the start of the game, the agent is on the Start state and can choose to move left or right
at each turn. If it reaches the right end (RE), it wins, and if it reaches the left end (LE), it loses.
Because we love maths so much, instead of saying the agent wins or loses, we will say that the
agent gets a reward of +1 at RE and a reward of -1 at LE. The objective of the agent is then
simply to maximize the reward it obtains!
For each state, we define a variable that will store its value. The value of a state will help
the agent determine how to behave later. First, we will learn this value.
(a) 1
(b) 0.9
(c) 0.81
(d) 0
Sol. (b)
V(X4) = 0.9 × max(V(X3), V(RE))
= 0.9 × max(0, 1)
= 0.9
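The same computation in code, as a minimal sketch. The chain layout LE–X1–X2–Start–X3–X4–RE is inferred from the solutions in this section (the original figure is not reproduced here), and the dictionary names are ours.

```python
# One value update for a single state, using V(s) = 0.9 * max over neighbouring states.
GAMMA = 0.9

# Each non-terminal state maps to its (left neighbour, right neighbour).
NEIGHBOURS = {
    "X1": ("LE", "X2"),
    "X2": ("X1", "Start"),
    "Start": ("X2", "X3"),
    "X3": ("Start", "X4"),
    "X4": ("X3", "RE"),
}

# Terminal values: +1 for winning at RE, -1 for losing at LE; all other values start at 0.
V = {s: 0.0 for s in NEIGHBOURS}
V["RE"], V["LE"] = 1.0, -1.0

def update(state):
    """Set V(state) to the discounted value of its best neighbour."""
    left, right = NEIGHBOURS[state]
    V[state] = GAMMA * max(V[left], V[right])

update("X4")
print(V["X4"])  # 0.9, matching the computation above
```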
(a) -1
(b) -0.9
(c) -0.81
(d) 0
Sol. (d)
V(X1) = 0.9 × max(V(LE), V(X2))
= 0.9 × max(−1, 0)
= 0
8. What is V(X1) after V converges?
(a) 0.54
(b) -0.9
(c) 0.63
(d) 0
Sol. (a)
This is the sequence of changes in V:
V(X4) = 0.9 → V(X3) = 0.81 → V(Start) = 0.72 → V(X2) = 0.63 → V(X1) = 0.54
Final value for X1 is 0.54.
9. The behavior of an agent is called a policy. Formally, a policy is a mapping from states to
actions. In our case, we have two actions: left and right. We will denote the action for our
policy as A.
Clearly, the optimal policy would be to choose action right in every state. Which of the
following can we use to mathematically describe our optimal policy using the learnt V?
For options (c) and (d), T is the transition function defined as T(state, action) = next state.
(more than one option may apply)
(a) A = Left if V(S_L) > V(S_R), Right otherwise
(b) A = Left if V(S_R) > V(S_L), Right otherwise
(c) A = arg max_a V(T(S, a))
(d) A = arg min_a V(T(S, a))
Sol. (a), (c)
(a) always selects Right here, since the right-hand neighbour always has the higher learnt value, and (c) selects the action whose successor state has the highest value, which is again Right in every state.
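Option (c) becomes a one-line policy once T is available. Below is a small sketch; the action list and the callable T are illustrative stand-ins for whatever the game provides.

```python
# Greedy policy over the learnt state values, assuming the transition function T is known.
ACTIONS = ["Left", "Right"]

def policy(state, V, T):
    """Pick the action whose successor state T(state, action) has the highest learnt value."""
    return max(ACTIONS, key=lambda a: V[T(state, a)])
```

For this game it returns Right in every state, which is the optimal policy.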
10. In games like Chess or Ludo, the transition function is known to us. But what about Counter-
Strike, Mortal Kombat, or Super Mario? In games where we do not know T, we can only
query the game simulator with the current state and action, and it returns the next state. This
means we cannot directly take the argmax or argmin of V(T(S, a)). Therefore, learning the value
function V is not sufficient to construct a policy. Which of these could we do to overcome this?
(more than one may apply)
Assume there exists a method to do each option. You have to judge whether doing it solves
the stated problem.
(a) Directly learn the policy.
(b) Learn a different function which stores value for state-action pairs (instead of only state
like V does).
(c) Learn T along with V.
(d) Run a random agent repeatedly till it wins. Use this as the winning policy.
Sol. (a), (b), (c)
(a) - If we learn the policy itself, problem solved.
(b) - Given a function Q(s, a), we can use the policy A = arg max_a Q(S, a) (see the sketch at the end of this solution).
(c) - If we have T and V, we can do what we saw in the previous question.
(d) - If the agent learns a single sequence of actions as its policy, it will fail as soon as any state
it saw along that sequence changes, which can easily happen in a stochastic environment (i.e.,
transitions are probabilistic).
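To make option (b) concrete, here is a small tabular Q-learning sketch that only ever queries a simulator-style step function and never looks at T directly. The simulator reuses the chain layout assumed earlier; the hyperparameters and episode count are illustrative.

```python
import random

# Illustrative simulator for the chain game: it only exposes step(state, action),
# mirroring the "query the game simulator" setting described in the question.
NEIGHBOURS = {"X1": ("LE", "X2"), "X2": ("X1", "Start"), "Start": ("X2", "X3"),
              "X3": ("Start", "X4"), "X4": ("X3", "RE")}
ACTIONS = ["Left", "Right"]

def step(state, action):
    """Return (next_state, reward, done) without ever revealing T to the agent."""
    nxt = NEIGHBOURS[state][0] if action == "Left" else NEIGHBOURS[state][1]
    if nxt == "RE":
        return nxt, 1.0, True
    if nxt == "LE":
        return nxt, -1.0, True
    return nxt, 0.0, False

# Tabular Q-learning: learn values for (state, action) pairs, then act greedily on them.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in NEIGHBOURS for a in ACTIONS}

for _ in range(5000):  # episodes
    s, done = "Start", False
    while not done:
        # Epsilon-greedy action selection from the current Q table.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[(s_next, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s_next

greedy_policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in NEIGHBOURS}
print(greedy_policy)  # every state should come out preferring "Right"
```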