20AI903_RL_UNIT 2
MARKOV DECISION PROCESS AND DYNAMIC PROGRAMMING
UNIT 2 MARKOV DECISION PROCESS AND DYNAMIC PROGRAMMING

Sl. No.  Contents
1   Finite Markov Decision Processes
2   The Agent Environment Interface
4   Returns
9   Dynamic Programming
10  Policy Evaluation
11  Policy Improvement
12  Policy Iteration
13  Value Iteration
14  Asynchronous DP
15  Efficiency of DP
16  Monte Carlo Prediction
17  Monte Carlo Estimation of Action Values
18  Monte Carlo Control
19  Off-policy Monte Carlo Prediction
20  Off-Policy Monte Carlo Controls
UNIT 2 - MARKOV DECISION PROCESS AND DYNAMIC PROGRAMMING
Finite Markov Decision Processes - The Agent Environment Interface - Goals and Rewards - Returns - Episodic and Continuing Tasks - Markov Property - MDP - Value Functions - Optimality and Approximation - Dynamic Programming - Policy Evaluation - Policy Improvement - Policy Iteration - Value Iteration - Asynchronous DP - Efficiency of DP - Monte Carlo Prediction - Monte Carlo Estimation of Action Values - Monte Carlo Control - Off-policy Monte Carlo Prediction - Off-Policy Monte Carlo Controls.
E.g. if we are controlling a robot, the voltages or stresses in its structure are part of the environment, not the agent. Indeed, reward signals are part of the environment, despite very possibly being produced within the agent, e.g. dopamine.
2.2 Goals and Rewards
The reward hypothesis:
All we mean by goals and purposes can be well thought of as the
maximization of the expected value of the cumulative sum of a received
scalar signal (called reward).
The reward signal is our way of communicating to the agent what we want to achieve, not how we want to achieve it.
2.3 Returns and Episodes
The return Gt is the sum of future rewards:
$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$   (1)
• This approach makes sense in applications that finish, or are periodic. That is, the agent-environment interaction breaks into episodes.
• We call these episodic tasks, e.g. playing a board game or completing a trip through a maze.
• Notation for the state space in an episodic task varies from the conventional case (s ∈ S) to (s ∈ S+).
• By contrast, applications that go on continually without a terminal state are called continuing tasks.
• For these tasks we use discounted returns to stop the sum of rewards growing to infinity.
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$   (2)

$G_t = R_{t+1} + \gamma G_{t+1}$   (3)
Discounting is a crucial topic in RL. It allows us to store a finite value for any state (summarised by its expected cumulative reward) in continuing tasks, where the undiscounted value would run to infinity.
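To make the discounting concrete, here is a minimal Python sketch (the function name and example rewards are illustrative, not from the source) that computes a discounted return from a list of observed rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_1 + gamma*r_2 + gamma^2*r_3 + ... for the rewards
    observed after time t. gamma < 1 keeps the sum finite even for very
    long (continuing) reward sequences."""
    g = 0.0
    # Iterate backwards so each step applies one extra factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```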
(4)
$\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\}$   (5)

If the state is Markov, however, then the state and reward right now completely characterise the history, and the above can be reduced to:

$\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t, A_t\}$   (6)
cooling power and we may not be tracking that temperature. It is hard
for a process to be Markov without sensing all possible variables.
2.6 Markov Decision Process (MDP)
Given any state and action, s and a, the probability of each possible pair of next state and reward, s', r, is denoted:

$p(s', r \mid s, a) \doteq \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t=s, A_t=a\}$   (7)
We can think of p(s', r|s, a) as the dynamics of our MDP, often called the transition function; it defines how we move from state to state given our actions.
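As an illustration, the dynamics p(s', r|s, a) of a small MDP can be written down explicitly. The following is a minimal sketch assuming a hypothetical two-state MDP; the state and action names are invented for the example:

```python
# A tiny, hypothetical two-state MDP: dynamics[s][a] is a list of
# (probability, next_state, reward) triples, i.e. p(s', r | s, a).
dynamics = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "s1", 2.0)],
    },
}

# The probabilities for each (s, a) pair must sum to one.
for s, actions in dynamics.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```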
2.7 Policies and Value Functions
Value functions are functions of states, or functions of state-action pairs.
They estimate how good it is to be in a given state, or how good it is
to perform a given action in a given state.
Since future rewards depend on future actions, value functions are defined with respect to particular policies: the value of a state depends on the actions the agent takes from that state.
A policy is a mapping from states to probabilities of selecting each
possible action.
RL methods specify how the agent’s policy changes as a result of its
experience.
For MDPs, we can define v_π(s) formally as:

$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$   (8)

i.e. the expected future (discounted) rewards, given state S_t and policy π. We call v_π(s) the state-value function for policy π. Similarly, we can define the value of taking action a in state s under policy π as:
$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$   (9)

i.e. the expected return from taking action a in state s and then following policy π. We call q_π the action-value function for policy π.
Both value functions are estimated from experience.
A fundamental property of value functions, used throughout reinforcement learning and dynamic programming, is that they satisfy recursive relationships similar to the one we have already established for the return. This recursive relationship is characterised by the Bellman equation:

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]$   (10)

This recursion looks ahead from one state to all possible next states, weighted by our policy and by the dynamics p(s', r|s, a) of (7).
2.8 Optimal Policies and Value Functions
A policy π' is defined as better than a policy π if its expected return is higher for all states.
There is always at least one policy that is better than or equal to all
other policies – this is the optimal policy.
Optimal policies are denoted π∗
Optimal state-value functions are denoted v∗
Optimal action-value functions are denoted q∗
$v_*(s) \doteq \max_\pi v_\pi(s) \quad \text{for all } s \in S$   (11)

$q_*(s, a) \doteq \max_\pi q_\pi(s, a) \quad \text{for all } s \in S,\, a \in A(s)$   (12)

$q_*(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr]$   (13)
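Once q∗ is known, an optimal policy can be read off by acting greedily with respect to it. A minimal sketch, assuming a hypothetical q-table stored as nested dictionaries:

```python
# Hypothetical optimal action values q*(s, a) stored as a nested dict.
q_star = {
    "s0": {"left": 1.2, "right": 3.4},
    "s1": {"left": 0.5, "right": 2.0},
}

# An optimal policy simply acts greedily with respect to q*.
pi_star = {s: max(q_star[s], key=q_star[s].get) for s in q_star}
print(pi_star)  # {'s0': 'right', 's1': 'right'}
```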
2.10 Dynamic Programming
Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP. DP algorithms can rarely be used in practice because of their great computational cost, but they are nonetheless important theoretically, as all other approaches to computing the value function are, in effect, approximations of DP. DP algorithms are obtained by turning the Bellman equations into assignments, that is, into update rules for improving approximations of the desired value functions.
2.11 Policy Evaluation (Prediction)
We know that the value function can be written recursively as follows:

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]$   (14)

Turning this Bellman equation into an update rule gives iterative policy evaluation:

$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr]$   (15)
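A minimal sketch of iterative policy evaluation based on update (15), assuming the dynamics are stored in the (probability, next_state, reward) format used in the earlier example and that policy[s][a] gives π(a|s):

```python
def policy_evaluation(dynamics, policy, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman expectation backup until the value
    function changes by less than theta."""
    V = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s in dynamics:
            # Sum over actions under pi(a|s), then over outcomes under p(s', r|s, a).
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in dynamics[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```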
2.12 Policy Improvement
We can obtain a value function for an arbitrary policy π as per the policy
evaluation algorithm discussed above. We may then want to know if
there is a policy π’ that is better than our current policy. A way of
evaluating this is by taking a new action a in state s that is not in our
current policy, running our policy thereafter and seeing how the value
function changes. Formally that looks like:
$q_\pi(s, a) \doteq \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]$   (16)

The policy improvement theorem then tells us that if this new choice is at least as good in every state, the new policy π' is at least as good as π:

If $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in S$, then $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in S$   (17)
4. Repeat until the new policy is no longer better than the old policy, at which point we have obtained the optimal policy (this termination is guaranteed only for finite MDPs); a sketch of this loop is given below.
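A minimal sketch of the policy iteration loop described above, reusing the policy_evaluation function from the previous sketch (the data format is the same one assumed earlier):

```python
def policy_iteration(dynamics, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the
    greedy policy stops changing (possible only for finite MDPs)."""
    # Start from an arbitrary deterministic policy: pick the first action.
    greedy = {s: next(iter(dynamics[s])) for s in dynamics}
    while True:
        # Represent the deterministic policy as action probabilities pi(a|s).
        policy = {s: {a: 1.0 if a == greedy[s] else 0.0 for a in dynamics[s]}
                  for s in dynamics}
        V = policy_evaluation(dynamics, policy, gamma)
        stable = True
        for s in dynamics:
            # One-step lookahead q(s, a) under the current value function.
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                 for a, outcomes in dynamics[s].items()}
            best = max(q, key=q.get)
            if best != greedy[s]:
                greedy[s] = best
                stable = False
        if stable:
            return greedy, V
```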
2.16 Generalised Policy Iteration
Generalised Policy Iteration (GPI) is the process of letting policy evaluation and policy improvement interact, independent of granularity. That is to say, improvement and evaluation can be performed by doing complete sweeps of the state space, or after every visit to a state (as is the case with value iteration). The level of granularity does not affect the final outcome: convergence to the optimal policy and optimal value function. This process can be illustrated as two converging lines (Figure 3). Policy improvement and policy evaluation work both in opposition and in cooperation: each time we act greedily we move away from our true value function, and each time we evaluate our value function our policy is likely no longer greedy.
2.18 Monte Carlo Policy Prediction
Recall that the value of a state is the expected discounted future reward from that state. One way of estimating that value is by averaging the returns observed after visiting the state; in the limit this average converges to the true value.

We can therefore run a policy in an environment for an episode. When the episode ends, we compute the return and assign it to each of the states visited en route to the terminal state.

Whereas DP algorithms perform one-step lookaheads to every possible next state, Monte Carlo methods sample only one trajectory (episode) at a time. This can be summarised in a new backup diagram.
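A minimal sketch of first-visit Monte Carlo prediction, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging the return observed after the first
    visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit = {}  # state -> index of its first occurrence
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards accumulating the return; record it at first visits.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:
                returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```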
2.19 Monte Carlo Estimation of Action Values
With a model we only need to estimate the state value function v as, paired with our
model, we can evaluate the rewards and next states for each of our actions and pick the
best one.
With model free methods we need to estimate the state-action value function q as we
must explicitly estimate the value of each action in order for the values to be useful in
suggesting a policy. (If we only have the values of states, and don’t know how states are
linked through a model, then selecting the optimal action is impossible)
One serious complication arises when we do not visit every state, as can be the case if
our policy is deterministic. If we do not visit states then we do not observe returns from
these states and cannot estimate their value. We therefore need to maintain
exploration of the state space. One way of doing so is to stochastically select a state-action pair to start each episode, giving every state-action pair a non-zero probability of being selected. In this case, we are said to be utilising exploring starts.
Exploring starts falls down when learning from real experience because we cannot
guarantee that we start in a new state-action pair in the real world.
An alternative approach is to use stochastic policies that have a non-zero probability of selecting every action in every state.
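A minimal sketch of the exploring-starts idea described above, where each episode begins from a randomly chosen state-action pair (the state and action names are illustrative):

```python
import random

def exploring_start(states, actions):
    """Pick a random (state, action) pair to begin an episode so that every
    pair has a non-zero probability of being the starting point."""
    return random.choice(states), random.choice(actions)

s0, a0 = exploring_start(["s0", "s1"], ["left", "right"])
```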
2.21 Monte Carlo Control without Exploring Starts
To avoid having to use exploring starts we can use either on-policy or off-policy methods; the only way to ensure we visit every state-action pair is to keep selecting them while learning.

On-policy methods attempt to evaluate or improve the policy that is being used to make decisions.

On-policy control methods are generally soft, meaning that they assign non-zero probability to each possible action in a state, e.g. ε-greedy policies.

We take actions in the environment using an ε-greedy policy; after each episode we propagate the returns back to obtain the value function for that ε-greedy policy. Then we perform policy improvement by updating our policy to take the new greedy action in each state. Note: based on our new value function, the greedy action may have changed in some states. Then we perform policy evaluation using our new ε-greedy policy and repeat (as per generalised policy iteration).

The idea of on-policy Monte Carlo control is still that of GPI. We use first-visit MC methods to estimate the action-value function, i.e. to do policy evaluation, but we cannot then improve our policy merely by acting greedily with respect to our value function, because that would prevent further exploration of non-greedy actions. We must maintain exploration, and so we improve the ε-greedy version of our policy. That is to say, when we find the greedy action (the action that maximises our return under the current value function) we assign it probability 1 − ε + ε/|A(s)| of being selected, so that the policy remains stochastic.
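A minimal sketch of the two ingredients described above: ε-greedy action selection and a first-visit Monte Carlo update of the action-value table after each episode. The data structures (a Q dictionary keyed by (state, action) and a visit counter) are assumptions made for the example:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Greedy action with probability 1 - epsilon + epsilon/|A(s)|,
    otherwise a uniformly random action: the policy stays stochastic."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def mc_control_update(Q, counts, episode, gamma=1.0):
    """First-visit MC evaluation step for the e-greedy policy; episode is
    a list of (state, action, reward) triples."""
    g = 0.0
    first = {}
    for t, (s, a, _) in enumerate(episode):
        first.setdefault((s, a), t)
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        g = r + gamma * g
        if first[(s, a)] == t:
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]  # incremental mean

# Usage: Q = defaultdict(float); counts = defaultdict(int); then call
# mc_control_update(Q, counts, episode) after each generated episode.
```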
2.22 Off-policy Prediction via Importance Sampling
We face a dilemma when learning control: we want to find the optimal policy, but we can only find it by behaving sub-optimally in order to explore a sufficient range of actions. What we saw with on-policy learning was a compromise: it learns action values not for the optimal policy but for a near-optimal policy that still explores. An alternative is off-policy control, where we have two policies: one used to generate the data (the behaviour policy) and one that is learned for control (the target policy). This is called off-policy learning.
Off-policy learning methods are powerful and more general than on-
policy methods (on-policy methods being a special case of off-policy
where target and behaviour policies are the same). They can be used
to learn from data generated by a conventional non-learning controller
or from a human expert.
If we consider an example of finding target policy π using episodes
collected through a behaviour policy b, we require that every action in
π must also be taken, at least occasionally, by b i.e. π(a|s) > 0 implies
b(a|s) > 0. This is called the assumption of coverage.
Almost all off-policy methods utilize importance sampling, a general
technique for estimating expected values under one distribution given
samples from another. Given a starting state S_t, the probability of the subsequent state-action trajectory occurring under any policy π is:
$\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t,\ A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$   (19)

The relative probability of the trajectory under the target and behaviour policies, the importance sampling ratio, is then:

$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$   (20)
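A minimal sketch of the importance sampling ratio in (20), assuming an episode is stored as (state, action, reward) triples and that the target and behaviour policies are given as callables returning π(a|s) and b(a|s):

```python
def importance_sampling_ratio(episode, target_pi, behaviour_b):
    """rho = product over the episode of pi(a|s) / b(a|s). The coverage
    assumption guarantees b(a|s) > 0 whenever pi(a|s) > 0."""
    rho = 1.0
    for s, a, _ in episode:  # (state, action, reward) triples
        rho *= target_pi(a, s) / behaviour_b(a, s)
    return rho
```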
(21)
(22)
(23)
Where,
(24)
2.24 Off-policy Monte Carlo Control
Part A - Questions & Answers
1. What is the Agent-Environment interface in the context of Markov Decision Processes (MDP)? [K2, CO2]
The agent-environment interface describes the continual interaction in which the agent selects actions and the environment responds with new states and rewards, defining the boundary between the learner and everything it interacts with.
3. How are goals and rewards related in the context of MDPs? [K2, CO2]
Rewards serve as a measure of the goal achievement in MDPs; they guide the
agent's decision-making towards desirable outcomes.
4. Define returns in the context of MDPs. [K2, CO2]
Returns represent the cumulative sum of rewards obtained by the agent over a
sequence of time steps.
5. What is the difference between episodic and continuing tasks in
reinforcement learning? [K3, CO2]
Episodic tasks have a finite duration with a terminal state, while continuing tasks
continue indefinitely.
The Markov property states that the future state depends only on the current
state and action, not on the sequence of events that led to the current state.
The value function estimates the expected cumulative future rewards for being in
a certain state (or state-action pair) and following a specific policy.
8. What does it mean for a policy to be optimal in reinforcement learning? [K3, CO2]
Policy improvement involves making the policy more greedy with respect
to the current value function, leading to better decision-making.
12. What is the key idea behind the Value Iteration algorithm?
[K3, CO2]
14. Discuss the efficiency of dynamic programming in solving
MDPs. [K3, CO2]
Off-policy Monte Carlo controls aim to find the optimal policy even when
the agent is following a different behavior policy.
20. Why is the exploration-exploitation trade-off important in reinforcement
learning? [K3, CO2]
Balancing exploration (trying new actions) and exploitation (choosing known good
actions) is crucial for discovering optimal policies.
21. What role does the discount factor play in Markov Decision Processes?
[K3, CO2]
The discount factor determines the importance of future rewards, with values closer to 1
giving more weight to distant rewards.
22. Contrast Value Iteration and Policy Iteration in dynamic programming.
[K3, CO2]
Value Iteration combines policy evaluation and improvement in each iteration, while
Policy Iteration alternates between policy evaluation and improvement.
23. Define the state-action value function (Q-function) in reinforcement
learning. [K3, CO2]
The state-action value function represents the expected cumulative rewards of taking a
specific action in a given state and following a certain policy.
24. What is the key idea behind Temporal Difference learning in
reinforcement learning? [K3, CO2]
TD learning updates value estimates based on the difference between the current
estimate and a target estimate, combining ideas from both Monte Carlo and dynamic
programming.
25. Why might function approximation be used in reinforcement learning? [K3,
CO2]
Function approximation is employed to handle large state spaces by approximating value
functions or policies using parameterized functions like neural networks.
26. Briefly explain the difference between on-policy and off-policy methods
in reinforcement learning. [K3, CO2]
On-policy methods optimize the current policy, while off-policy methods optimize a
different target policy using a separate behavior policy.
27. What does the Bellman equation express in the
context of MDPs? [K3, CO2]
The Bellman equation expresses a recursive relationship between the
value of a state (or state-action pair) and the values of its successor
states.
28. Name two common exploration strategies used in
reinforcement learning. [K3, CO2]
Epsilon-greedy and softmax exploration are two common strategies
balancing exploration and exploitation.
29. Explain the difference between policy space and
value space in reinforcement learning. [K3, CO2]
Policy space refers to the set of all possible policies, while value
space refers to the range of possible values that the value function
can take.
30. How does the Markov property contribute to the concept of memoryless systems in reinforcement learning? [K3, CO2]
The Markov property implies that the current state encapsulates all relevant information, making the system memoryless in the sense that past states can be ignored for decision-making.
Part B Questions
Q. No. Questions K Level CO Mapping
Real-time Applications
1. Smart Home Automation
Application: Energy Management
Description: MDPs can be used to optimize energy consumption
in smart homes. States may include occupancy and weather
conditions, actions may be adjusting thermostat settings or
turning off/on appliances, and rewards could be energy savings.
The system learns and adapts to user preferences over time.
2. Online Advertising:
Description: MDPs can optimize the placement of online
advertisements. States may include user profiles and website
content, actions may be different ad placements, and rewards
could be user engagement. The system learns over time to
display ads that maximize user interaction.
The Tiger Problem
The Tiger Problem is a renowned POMDP problem wherein a decision
maker stands between two closed doors on the left and right. A tiger is
put behind one of the doors with equal probability, and treasure is
placed behind the other. Therefore, the problem has two states: tiger-left
and tiger-right. The decision maker can either listen to the roars of the
tiger or can open one of the doors. Therefore, the possible actions are
listen, open-left, and open-right. However, the observation capability of
the decision maker is not 100% accurate: there is a 15% probability that, when listening, they misidentify the side on which the tiger is present. Naturally, if the decision maker opens
the door with the tiger, then they will get hurt. Therefore, we assign a
reward of -100 in this case. On the other hand, the decision maker will
gain a reward of 10 if they choose the door with treasure. Once a door
is opened, the tiger is again randomly assigned to one of the doors, and
the decision maker gets to choose again.
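The problem data above can be written out explicitly. A minimal sketch in Python, using only the numbers stated in the text; the cost of the listen action is not given here, so it is left at 0.0 as an assumption (it is often taken to be -1 in the literature):

```python
# States, actions, and observations of the Tiger Problem as plain data.
states = ["tiger-left", "tiger-right"]
actions = ["listen", "open-left", "open-right"]
observations = ["hear-left", "hear-right"]

# P(observation | state) when the agent listens: 85% correct, 15% wrong.
observation_model = {
    "tiger-left":  {"hear-left": 0.85, "hear-right": 0.15},
    "tiger-right": {"hear-left": 0.15, "hear-right": 0.85},
}

# R(state, action): -100 for opening the tiger's door, +10 for the treasure.
rewards = {
    ("tiger-left", "open-left"):  -100.0,
    ("tiger-left", "open-right"):   10.0,
    ("tiger-right", "open-left"):   10.0,
    ("tiger-right", "open-right"): -100.0,
    ("tiger-left", "listen"):        0.0,  # listening cost not specified in the text
    ("tiger-right", "listen"):       0.0,
}
```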
Now, the above Tiger Problem can be solved using the default grid method, which uses point-based value iteration.
different, then the decision maker returns to the "Initial
Belief" state. However, if the same observation is made twice,
the decision maker chooses to open the right door. A similar
case happens for the observation "tiger-right". Once the
reward is obtained, the problem is reset, and the decision
maker returns to the "Initial Belief" state. In the subsequent
sections, we present extensions to the POMDP problem and
its applications.