20AI903_RL_UNIT 2

This document discusses Monte Carlo methods and Markov Decision Processes (MDPs) in reinforcement learning, covering key concepts such as policies, value functions, and dynamic programming. It explains the agent-environment interface, goals and rewards, and the importance of the Markov property in decision-making. Additionally, it outlines algorithms for policy evaluation and improvement, including policy iteration and value iteration, as well as the role of Monte Carlo methods in estimating action values.

UNIT – II

MARKOV DECISION PROCESS AND DYNAMIC PROGRAMMING

Contents
1. Finite Markov Decision Processes
2. The Agent-Environment Interface
3. Goals and Rewards
4. Returns
5. Episodic and Continuing Tasks, Markov Property
6. MDP
7. Value Functions
8. Optimality and Approximation
9. Dynamic Programming
10. Policy Evaluation
11. Policy Improvement
12. Policy Iteration
13. Value Iteration
14. Asynchronous DP
15. Efficiency of DP
16. Monte Carlo Prediction
17. Monte Carlo Estimation of Action Values
18. Monte Carlo Control
19. Off-policy Monte Carlo Prediction
20. Off-Policy Monte Carlo Controls
UNIT 2 – MARKOV DECISION PROCESS AND DYNAMIC PROGRAMMING

Finite Markov Decision Processes - The Agent-Environment Interface -
Goals and Rewards - Returns - Episodic and Continuing Tasks - Markov
Property - MDP - Value Functions - Optimality and Approximation -
Dynamic Programming - Policy Evaluation - Policy Improvement - Policy
Iteration - Value Iteration - Asynchronous DP - Efficiency of DP - Monte
Carlo Prediction - Monte Carlo Estimation of Action Values - Monte Carlo
Control - Off-policy Monte Carlo Prediction - Off-Policy Monte Carlo
Controls.

2. Finite Markov Decision Processes

2.1 The Agent-Environment Interface

Figure 2.1: The agent-environment interface in reinforcement learning

 At each timestep the agent implements a mapping from states to
probabilities of selecting each possible action. This mapping is called the
agent's policy, denoted π, where π(a|s) is the probability of the agent
selecting action a in state s.
 In general, actions can be any decision we want to learn how to
make, and states can be any interpretation of the world that might
inform those actions.
 The boundary between agent and environment is much closer to the
agent than is first intuitive.

E.g. if we are controlling a robot, the voltages or stresses in its structure
are part of the environment, not the agent. Indeed, reward signals are
part of the environment, despite very possibly being produced within the
agent, e.g. dopamine.
2.2 Goals and Rewards
The reward hypothesis:
All we mean by goals and purposes can be well thought of as the
maximization of the expected value of the cumulative sum of a received
scalar signal (called reward).
The reward signal is our way of communicating to the agent what we
want achieved, not how we want it achieved.
2.3 Returns and Episodes
The return Gt is the sum of future rewards:
Gt = Rt+1 + Rt+2 + Rt+3 + · · · + RT,  where T is the final time step    (1)
• This approach makes sense in applications that finish, or are periodic.
That is, the agent-environment interaction breaks into episodes.
• We call these systems episodic tasks, e.g. playing a board game, trips
through a maze, etc.
• Notation for the state space in an episodic task varies from the conventional
case (s ∈ S) to (s ∈ S+), where S+ includes the terminal state.
• The opposite, continuing applications, are called continuing tasks.
• For these tasks we use discounted returns to avoid the sum of rewards
going to infinity.

Gt = Rt+1 + γRt+2 + γ²Rt+3 + · · · = Σ_{k=0}^{∞} γ^k Rt+k+1,  where 0 ≤ γ ≤ 1 is the discount rate    (2)

If the reward is a constant +1 at each timestep, the cumulative discounted
return Gt becomes:

Gt = Σ_{k=0}^{∞} γ^k = 1 / (1 − γ)    (3)

Discounting is a crucial topic in RL. It allows us to store a finite value
for any state (summarised by its expected cumulative reward) in
continuing tasks, where the non-discounted value would run to infinity.
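As a quick numerical check of equations (2) and (3), the short sketch below (Python) sums a discounted constant reward of +1 and compares it with the closed form 1/(1 − γ); the function and variable names are purely illustrative.

# Minimal sketch: discounted return for a reward sequence, and the
# constant-reward special case Gt = sum_k gamma^k = 1 / (1 - gamma).

def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

gamma = 0.9
print(discounted_return([1.0] * 1000, gamma))  # ~10.0 for a long run of +1 rewards
print(1.0 / (1.0 - gamma))                     # 10.0, the closed-form limit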

2.4 Unified Notation for Episodic and Continuing Tasks

Gt = Σ_{k=t+1}^{T} γ^{k−t−1} Rk,  including the possibility that T = ∞ or γ = 1 (but not both)    (4)

2.5 The Markov Property

A state signal that succeeds in retaining all relevant information about


the past is Markov. Examples include:
• A cannonball with known position, velocity and acceleration
• All positions of chess pieces on a chess board.
In normal causal processes, we would think that our
expectation of the state and reward at the next timestep is a function of
all previous states, rewards and actions, as follows:

Pr{St+1 = s′, Rt+1 = r | S0, A0, R1, . . . , St−1, At−1, Rt, St, At}    (5)

If the state is Markov, however, then the state and reward right now
completely characterizes the history, and the above can be reduced to:

Pr{St+1 = s′, Rt+1 = r | St, At}    (6)

 Even for non-Markov states, it is appropriate to think of all states as
at least an approximation of a Markov state.
 The Markov property is important in RL because decisions and values are
assumed to be a function only of the current state.
 Most real scenarios are unlikely to be Markov. In the example of
controlling HVAC, the HVAC motor might heat up, which affects
cooling power, and we may not be tracking that temperature. It is hard
for a process to be Markov without sensing all possible variables.
2.6 Markov Decision Process (MDP)
Given any state s and action a, the probability of each possible pair
of next state and reward, s′ and r, is denoted:

p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}    (7)

We can think of p(s’, r|s, a) as the dynamics of our MDP, often called
the transition function–it defines how we move from state to state
given actions.
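To make p(s′, r | s, a) concrete, here is a minimal sketch (Python) of MDP dynamics stored as a lookup table and sampled from; the two-state MDP and every name in it are invented purely for illustration.

import random

# dynamics[s][a] is a list of (probability, next_state, reward) triples,
# i.e. an explicit table of p(s', r | s, a) for a tiny invented MDP.
dynamics = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 0.0)]},
}

def step(state, action):
    """Sample (next_state, reward) from the dynamics p(., . | state, action)."""
    triples = dynamics[state][action]
    probs = [p for p, _, _ in triples]
    _, s_next, r = random.choices(triples, weights=probs, k=1)[0]
    return s_next, r

print(step("s0", "go"))   # e.g. ('s1', 1.0) with probability 0.8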
2.7 Policies and Value Functions
 Value functions are functions of states, or of state-action pairs.
 They estimate how good it is to be in a given state, or how good it is
to perform a given action in a given state.
 Because future rewards depend on future actions, value
functions are defined with respect to particular policies: the value of
a state depends on the actions the agent takes in that state.
 A policy is a mapping from states to probabilities of selecting each
possible action.
 RL methods specify how the agent's policy changes as a result of its
experience.
 For MDPs, we can define vπ(s) formally as:

vπ(s) = Eπ[ Gt | St = s ] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ]    (8)

i.e. the expected future rewards, given state St = s and policy π. We call
vπ(s) the state-value function for policy π. Similarly, we can define
the value of taking action a in state s under policy π as:

qπ(s, a) = Eπ[ Gt | St = s, At = a ] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s, At = a ]    (9)

i.e. the expected value, taking action a in state s then following policy π
 We call qπ the action-value function for policy π.
 Both value functions are estimated from experience.
A fundamental property of value functions used throughout
reinforcement learning and dynamic programming is that they satisfy
recursive relationships similar to that which we have
already established for the return. This recursive relationship is
characterised by the Bellman Equation:

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]    (10)

This recursion looks from one state through to all possible next states,
weighted by our policy and the dynamics, as illustrated by the backup
diagrams in Figure 2.
2.8 Optimal Policies and Value Functions
 A policy π′ is defined as better than policy π if its expected return is
higher for all states.
 There is always at least one policy that is better than or equal to all
other policies – this is the optimal policy.
 Optimal policies are denoted π∗
 Optimal state-value functions are denoted v∗
 Optimal action-value functions are denoted q∗

Figure 2: Backup diagrams for vπ and qπ


We can write q∗ in terms of v∗ :

q∗(s, a) = E[ Rt+1 + γ v∗(St+1) | St = s, At = a ]    (11)

We can adapt the Bellman equation to achieve the Bellman optimality


equation, which takes two forms. Firstly for v∗ :

v∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v∗(s′) ]    (12)

and secondly for q∗ :

q∗(s, a) = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q∗(s′, a′) ]    (13)

 Using v∗, the optimal expected long-term return is turned into a
quantity that is immediately available for each state. Hence a one-
step-ahead search, acting greedily, yields the optimal long-term
actions.
 Fully solving the Bellman optimality equations can be hugely
expensive, especially if the number of states is huge, as is the case
with most interesting problems.
 Solving the Bellman optimality equation is akin to exhaustive search.
We play out every possible scenario until the terminal state and
collect their expected reward. Our policy then defines the action that
maximises this expected reward.
 In the continuing case the Bellman optimality equation cannot be solved
this way, as the recursion on the next state's value function would never end.
2.9 Optimality and Approximation
 We must approximate because calculation of optimality is too
expensive.
 A nice way of doing this is allowing the agent to make sub-optimal
decisions in scenarios it has low probability of encountering. This is a
trade off for being optimal in situations that occur frequently.

2.10 Dynamic Programming
Dynamic Programming (DP) refers to the collection of algorithms that
can be used to compute optimal policies given a perfect model of the
environment as an MDP. DP algorithms can rarely be used in practice because of
their great cost, but they are nonetheless important theoretically, as all other
approaches to computing the value function are, in effect,
approximations of DP. DP algorithms are obtained by turning the
Bellman equations into assignments, that is, into update rules for
improving approximations of the desired value functions.
2.11 Policy Evaluation (Prediction)
We know from (10) that the value function can be represented as follows:

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]    (14)

If the dynamics are known perfectly, this becomes a system of |S|
simultaneous linear equations in |S| unknowns, where the unknowns are
vπ(s), s ∈ S. Alternatively, we can consider an iterative sequence of value function
approximations v0, v1, v2, . . ., with the initial approximation v0 chosen
arbitrarily, e.g. v0(s) = 0 ∀s (ensuring the terminal state has value 0). We can
update it using the Bellman equation:

vk+1(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vk(s′) ]    (15)

Eventually this update converges to vk = vπ, the value function for our policy,
after infinite sweeps of the state space. This algorithm is
called iterative policy evaluation. We call this update an expected update
because it is based on the expectation over all possible next states,
rather than a sample of reward/value from the next state. We think of the
updates as occurring through sweeps of the state space.
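The following is a minimal sketch (Python) of iterative policy evaluation with expected updates and full sweeps, written for a tiny invented MDP; the dynamics table, the equiprobable policy, the threshold theta, and all names are assumptions made for illustration.

# Minimal sketch of iterative policy evaluation (expected updates, full sweeps).
# dynamics[s][a] -> list of (probability, next_state, reward); "T" is terminal.
dynamics = {
    "A": {"left": [(1.0, "T", 0.0)], "right": [(1.0, "B", 1.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "T", 20.0)]},
}
policy = {  # pi(a|s): the equiprobable random policy
    "A": {"left": 0.5, "right": 0.5},
    "B": {"left": 0.5, "right": 0.5},
}
gamma, theta = 0.9, 1e-8          # discount rate and convergence threshold
V = {s: 0.0 for s in dynamics}
V["T"] = 0.0                      # terminal state value stays zero

while True:
    delta = 0.0
    for s in dynamics:            # one sweep of the state space
        v_new = 0.0
        for a, pi_sa in policy[s].items():
            for p, s_next, r in dynamics[s][a]:
                v_new += pi_sa * p * (r + gamma * V[s_next])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new              # in-place update, as in equation (15)
    if delta < theta:
        break

print(V)                          # approximate v_pi for the random policy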

2.12 Policy Improvement
We can obtain a value function for an arbitrary policy π as per the policy
evaluation algorithm discussed above. We may then want to know if
there is a policy π’ that is better than our current policy. A way of
evaluating this is by taking a new action a in state s that is not in our
current policy, running our policy thereafter and seeing how the value
function changes. Formally that looks like:

qπ(s, a) = E[ Rt+1 + γ vπ(St+1) | St = s, At = a ]    (16)

Note the mixing of action-value and state-value functions. If taking this


new action in state s produces a value function that is greater than or
equal to the previous value function for all states then we say the policy
π’ is an improvement over π:

qπ(s, π′(s)) ≥ vπ(s) for all s ∈ S   ⟹   vπ′(s) ≥ vπ(s) for all s ∈ S    (17)

This is known as the policy improvement theorem. Critically, the new
action's value must be greater than or equal to the previous value function for all states.
One way of choosing new actions for policy improvement is by acting
greedily w.r.t. the value function. Acting greedily will always produce a
new policy π′ ≥ π, but it is not necessarily the optimal policy immediately.
2.13 Policy Iteration
By flipping between policy evaluation and improvement we can achieve a
sequence of monotonically increasing policies and value functions. The
algorithm is roughly:
1. Evaluate policy π to obtain value function Vπ
2. Improve policy π by acting greedily with respect to Vπ to obtain new
policy π’
3. Evaluate new policy π’ to obtain new value function Vπ’

4. Repeat until the new policy is no longer better than the old policy, at
which point we have obtained the optimal policy (guaranteed for finite MDPs).

This process can be illustrated as:

Figure 3: Iterative policy evaluation and improvement
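A minimal sketch (Python) of this evaluate-improve loop on a tiny invented MDP follows; the environment, the convergence threshold, and all names are assumptions made for illustration only.

# Minimal sketch of policy iteration: evaluate the current deterministic policy,
# then act greedily w.r.t. its value function, until the policy is stable.
dynamics = {
    "A": {"left": [(1.0, "T", 0.0)], "right": [(1.0, "B", 1.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "T", 20.0)]},
}
gamma, theta = 0.9, 1e-8

def evaluate(policy):
    """Iterative policy evaluation for a deterministic policy."""
    V = {s: 0.0 for s in dynamics}
    V["T"] = 0.0
    while True:
        delta = 0.0
        for s in dynamics:
            v_new = sum(p * (r + gamma * V[s2])
                        for p, s2, r in dynamics[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(V):
    """Greedy policy improvement via one-step lookahead."""
    return {s: max(dynamics[s],
                   key=lambda a: sum(p * (r + gamma * V[s2])
                                     for p, s2, r in dynamics[s][a]))
            for s in dynamics}

policy = {"A": "left", "B": "left"}          # arbitrary deterministic start
while True:
    V = evaluate(policy)                     # step 1/3: policy evaluation
    new_policy = improve(V)                  # step 2: greedy improvement
    if new_policy == policy:                 # step 4: stop when the policy is stable
        break
    policy = new_policy

print(policy, V)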

2.14 Value Iteration


Above, we discussed policy iteration which requires full policy
evaluation at each iteration step, an often expensive process which
(formally) requires infinite sweeps of the state space to approach the true
value function. In value iteration, the policy evaluation is stopped after
one visit to each s ∈ S, or one sweep of the state space. Value iteration
is achieved by turning the Bellman optimality equation into an update
rule:
vk+1(s) = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ vk(s′) ]    (18)

for all s ∈ S. Value iteration effectively combines, in each of its sweeps,


one sweep of policy evaluation and one sweep of policy improvement.
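Here is a minimal sketch (Python) of value iteration for the same style of tiny invented MDP used above, sweeping until the largest update falls below a threshold and then extracting a greedy policy; all names and parameters are illustrative assumptions.

# Minimal sketch of value iteration: the Bellman optimality equation as an update.
dynamics = {
    "A": {"left": [(1.0, "T", 0.0)], "right": [(1.0, "B", 1.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "T", 20.0)]},
}
gamma, theta = 0.9, 1e-8
V = {s: 0.0 for s in dynamics}
V["T"] = 0.0

while True:
    delta = 0.0
    for s in dynamics:
        v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                    for a in dynamics[s])      # max over actions, as in equation (18)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# Extract a greedy policy from the converged values.
policy = {s: max(dynamics[s],
                 key=lambda a: sum(p * (r + gamma * V[s2])
                                   for p, s2, r in dynamics[s][a]))
          for s in dynamics}
print(V, policy)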
2.15 Asynchronous Dynamic Programming
Each of the above methods has required full sweeps of the state space to
first evaluate a policy then improve it. Asynchronous dynamic
programming does not require us to evaluate the entire state space each
sweep. Instead, we can perform in-place updates to our value function as
states are experienced, focusing only on the states most relevant to our
problem initially, then working to less relevant states elsewhere. This can
mean that our agent can learn to act well more quickly and save
optimality for later.

2.16 Generalised Policy Iteration
Generalised Policy Iteration is the process of letting policy evaluation and policy
improvement interact, independent of granularity. That is to say,
improvement/evaluation can be performed by doing complete sweeps of the state
space, or it can be performed after every visit to a state (as is the case with value
iteration). The level of granularity is independent of our final outcome:
convergence to the optimal policy and optimal value function. This process can be
illustrated as two converging lines - Figure 3. We can see that policy
improvement and policy evaluation work both in opposition and in cooperation -
each time we act greedily we get further away from our true value function; and
each time we evaluate our value function our policy is likely no longer greedy.

Figure 3: Generalised policy iteration leading to optimality

2.17 Monte Carlo Methods


If we do not have knowledge of the transition probabilities (a model of the
environment) then we must learn directly from experience. To do so, we use
Monte Carlo methods. Monte Carlo methods are most effective in episodic tasks,
where there is a terminal state and the value of the states visited en route to
the terminal state can be updated based on the return observed by the end of the
episode. We use generalised policy iteration, but this time instead of computing the
value function from a model we learn it from samples. We first consider the prediction
problem to obtain vπ and/or qπ for a fixed policy, then we look to improve using
policy improvement, then we use it for control.

2.18 Monte Carlo Policy Prediction
 Recall that the value of a state is the expected discounted future
reward from that state. One way of estimating that value is by
averaging the returns observed after visiting the state; we would
expect that in the limit this average converges toward the true value.
 We can therefore run a policy in an environment for an episode.
When the episode ends, we observe a return and we assign that
return to each of the states visited en route to the terminal state.
 Where DP algorithms perform one-step expected updates over every possible
next state, Monte Carlo methods only sample one trajectory/episode.
This can be summarised in a new backup diagram as follows:

Figure 4: Monte Carlo backup diagram for one episode

 Importantly, Monte Carlo methods do not bootstrap in the way DP
methods do. They use the return observed at the end of an episode, rather than
an estimate based on the value of the next state.
 Because of this lack of bootstrapping, the expense of estimating the value
of one state is independent of the number of states, unlike DP. This is a significant
advantage, in addition to the other advantages of being able to learn from real
experience without a model, or from simulated experience.
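The sketch below (Python) shows first-visit Monte Carlo prediction for a fixed (uniform random) policy on a small invented corridor task; the environment, the number of episodes, and all names are assumptions made for illustration.

# Minimal sketch of first-visit Monte Carlo prediction for a fixed policy.
import random
from collections import defaultdict

def run_episode():
    """Uniform random policy on a corridor of states 0..3, starting at 2.
    Stepping right from 3 terminates with reward +1; stepping left from 0
    terminates with reward 0; all other steps give reward 0."""
    state, trajectory = 2, []
    while True:
        move = random.choice([-1, +1])
        next_state = state + move
        if next_state == 4:
            trajectory.append((state, 1.0))
            return trajectory
        if next_state == -1:
            trajectory.append((state, 0.0))
            return trajectory
        trajectory.append((state, 0.0))
        state = next_state

gamma = 1.0
returns = defaultdict(list)

for _ in range(5000):
    episode = run_episode()               # list of (state, reward) pairs
    states = [s for s, _ in episode]
    g = 0.0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        g = gamma * g + r                 # return following time t
        if s not in states[:t]:           # first visit to s in this episode
            returns[s].append(g)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print({s: round(v, 2) for s, v in sorted(V.items())})  # roughly 0.2, 0.4, 0.6, 0.8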

2.19 Monte Carlo Estimation of Action Values

 With a model we only need to estimate the state value function v as, paired with our
model, we can evaluate the rewards and next states for each of our actions and pick the
best one.
 With model free methods we need to estimate the state-action value function q as we
must explicitly estimate the value of each action in order for the values to be useful in
suggesting a policy. (If we only have the values of states, and don’t know how states are
linked through a model, then selecting the optimal action is impossible)
 One serious complication arises when we do not visit every state, as can be the case if
our policy is deterministic. If we do not visit states then we do not observe returns from
them and cannot estimate their value. We therefore need to maintain
exploration of the state space. One way of doing so is to stochastically select a state-
action pair to start each episode, giving every state-action pair a non-zero probability of
being selected. In this case, we are said to be utilising exploring starts.
 Exploring starts falls down when learning from real experience because we cannot
guarantee that we start in a new state-action pair in the real world.
 An alternative approach is to use stochastic policies that have non-zero probability of

selecting each state-action pair.

2.20 Monte Carlo Control


 Much like we did with value iteration, we do not need to fully evaluate the value function
for a given policy in Monte Carlo control. Instead we can merely move the value estimate toward
the correct value and then switch to policy improvement thereafter. It is natural to do this
episodically, i.e. evaluate the policy using one episode of experience, then act greedily
w.r.t. the resulting value function to improve the policy for the next episode.
 If we use a deterministic policy for control, we must use exploring starts to ensure
sufficient exploration. This creates the Monte Carlo ES algorithm.
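A minimal sketch (Python) of Monte Carlo ES on a small invented corridor task is given below: each episode begins from a randomly chosen state-action pair, the current greedy policy is followed thereafter, and Q and the policy are updated from first-visit returns. The environment, the episode-length cap (a practical guard against non-terminating intermediate greedy policies), and all names are assumptions, not the textbook's pseudocode verbatim.

# Minimal sketch of Monte Carlo ES (exploring starts) control.
import random
from collections import defaultdict

STATES, ACTIONS, GAMMA = [0, 1, 2, 3], ["L", "R"], 1.0

def step(s, a):
    """Deterministic corridor: 'R' from 3 terminates with +10, 'L' from 0
    terminates with +1, every other move costs -1."""
    if s == 3 and a == "R":
        return None, 10.0, True
    if s == 0 and a == "L":
        return None, 1.0, True
    return s + (1 if a == "R" else -1), -1.0, False

Q = defaultdict(float)
counts = defaultdict(int)
policy = {s: "L" for s in STATES}            # arbitrary initial deterministic policy

for _ in range(10000):
    s, a = random.choice(STATES), random.choice(ACTIONS)   # exploring start
    episode = []
    for _ in range(200):                     # cap length: greedy policies may loop
        s2, r, done = step(s, a)
        episode.append((s, a, r))
        if done:
            break
        s, a = s2, policy[s2]                # follow the current greedy policy
    pairs = [(s, a) for s, a, _ in episode]
    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = GAMMA * g + r
        if (s, a) not in pairs[:t]:          # first visit to (s, a) in this episode
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]          # incremental mean
            policy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])  # greedy improvement

print(policy)   # should settle on 'R' everywhere for this particular task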

2.21 Monte Carlo Control without Exploring Starts

 To avoid having to use exploring starts we can use either on-policy or off-policy
methods; the only general way to ensure every state-action pair is visited is to keep
selecting them.
 On-policy methods attempt to evaluate or improve the policy that is
making the decisions.
 On-policy control methods are generally soft, meaning that they assign non-zero
probability to each possible action in a state, e.g. ε-greedy policies.
 We take actions in the environment using an ε-greedy policy; after each episode we
propagate the returns backwards to obtain the value function for our ε-greedy policy.
Then we perform policy improvement by updating our policy to take the new greedy
action in each state. Note: based on our new value function, the new greedy action
may have changed in some states. Then we perform policy evaluation using our new
ε-greedy policy and repeat (as per generalised policy iteration).
 The idea of on-policy Monte Carlo control is still that of GPI. We use first-visit MC
methods to estimate the action-value function, i.e. to do policy evaluation, but we
cannot then improve our policy merely by acting greedily w.r.t. our value
function, because that would prevent further exploration of non-greedy actions. We
must maintain exploration and so improve only to an ε-greedy version of our policy. That is
to say, when we find the greedy action (the action that maximises our reward for
our given value function) we assign it probability 1 − ε + ε/|A(s)| of being selected,
and each non-greedy action probability ε/|A(s)|, so that the policy remains stochastic.
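The short sketch below (Python) shows how an ε-greedy policy assigns those probabilities and samples an action given action-value estimates; the Q values, ε, and the function names are illustrative assumptions.

# Minimal sketch of an epsilon-greedy policy derived from action values Q[s][a].
import random

def epsilon_greedy_probs(q_s, epsilon):
    """Return {action: probability}: the greedy action gets 1 - eps + eps/|A|,
    every other action gets eps/|A|."""
    n = len(q_s)
    greedy = max(q_s, key=q_s.get)
    return {a: (1 - epsilon + epsilon / n if a == greedy else epsilon / n) for a in q_s}

def sample_action(q_s, epsilon):
    probs = epsilon_greedy_probs(q_s, epsilon)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

Q_s = {"L": 0.2, "R": 1.5}                 # action values for one state (invented)
print(epsilon_greedy_probs(Q_s, 0.1))      # {'L': 0.05, 'R': 0.95}
print(sample_action(Q_s, 0.1))             # usually 'R', occasionally 'L'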
2.22 Off-policy Prediction via Importance Sampling
We face a dilemma when learning control: we want to find the optimal policy, but we
can only find the optimal policy by acting suboptimally to explore sufficient actions.
What we saw with on-policy learning was a compromise - it learns action values not for
the optimal policy but for a near-optimal policy that still explores. An alternative is off-
policy control where we have two policies: one used to generate the data (behaviour
policy) and one that is learned for control (target policy). This is called off-policy
learning.

 Off-policy learning methods are powerful and more general than on-
policy methods (on-policy methods being a special case of off-policy
where target and behaviour policies are the same). They can be used
to learn from data generated by a conventional non-learning controller
or from a human expert.
 If we consider an example of finding target policy π using episodes
collected through a behaviour policy b, we require that every action in
π must also be taken, at least occasionally, by b i.e. π(a|s) > 0 implies
b(a|s) > 0. This is called the assumption of coverage.
 Almost all off-policy methods utilize importance sampling, a general
technique for estimating expected values under one distribution given
samples from another. Given a starting state St, the probability of the
subsequent state-action trajectory occurring under any policy π is:

Pr{At, St+1, At+1, . . . , ST | St, At:T−1 ∼ π} = Π_{k=t}^{T−1} π(Ak|Sk) p(Sk+1|Sk, Ak)    (19)

where p is the state transition probability function. We can then obtain


the relative probability of the trajectory under the target and behaviour
policies (the importance sampling ratio) as:

ρt:T−1 = Π_{k=t}^{T−1} [ π(Ak|Sk) p(Sk+1|Sk, Ak) ] / [ b(Ak|Sk) p(Sk+1|Sk, Ak) ] = Π_{k=t}^{T−1} π(Ak|Sk) / b(Ak|Sk)    (20)

We see here that the state transition probability function helpfully


cancels.
 We want to estimate the expected returns under the target policy
but we only have returns from the behaviour policy. To address this
we simply multiply expected returns from the behaviour policy by
the importance sampling ratio to get the value function for our
target policy

V(s) = Σ_{t∈T(s)} ρt:T(t)−1 Gt / |T(s)|    (21)

where T(s) is the set of time steps at which state s is visited and T(t) is the
time of termination of the episode containing time step t. This is the ordinary
importance-sampling estimate.
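A minimal sketch (Python) of the ratio in (20) and the ordinary importance-sampling estimate in (21) follows; the two policies, the observed trajectories, and the returns are all invented for illustration.

# Minimal sketch: importance-sampling ratio and ordinary importance sampling.

def is_ratio(trajectory, target, behaviour):
    """rho = prod_k target(a_k | s_k) / behaviour(a_k | s_k) along the trajectory."""
    rho = 1.0
    for s, a in trajectory:
        rho *= target[s][a] / behaviour[s][a]
    return rho

# Two actions per state; the behaviour policy is uniform, so coverage holds.
target    = {"s0": {"L": 0.9, "R": 0.1}, "s1": {"L": 0.2, "R": 0.8}}
behaviour = {"s0": {"L": 0.5, "R": 0.5}, "s1": {"L": 0.5, "R": 0.5}}

# Pretend these (state, action) trajectories and returns were observed under b.
episodes = [([("s0", "L"), ("s1", "R")], 3.0),
            ([("s0", "R"), ("s1", "L")], 1.0)]

# Ordinary importance sampling: a simple average of the rho-scaled returns.
estimate = sum(is_ratio(traj, target, behaviour) * g for traj, g in episodes) / len(episodes)
print(round(estimate, 3))   # 4.36 for these made-up numbers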

2.23 Incremental Implementation

We perform Monte Carlo policy evaluation (prediction) incrementally, in
the same way as was done for the bandit problem. Generally an
incremental implementation follows this formula:

NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ]    (22)

With on-policy Monte Carlo methods, this update is performed exactly,
after each episode, for each visit to a state given the observed returns;
with off-policy methods the update is slightly more complex. With
ordinary importance sampling, the step size is 1/n, where n is the
number of visits to that state, and so it acts as an average of the scaled
returns. For weighted importance sampling, we have to form a weighted
average of the returns, which requires us to keep track of the weights. If
the weight takes the form Wi = ρt:T(t)−1 then our update rule is:

Vn+1 = Vn + (Wn / Cn) [ Gn − Vn ],   n ≥ 1    (23)

Where,

Cn+1 = Cn + Wn+1    (24)

with C0 = 0. This allows us to keep track of the corrected weighted-
average term as each update is made. Note here that the
weighted average gives more weight to returns from trajectories under b
that are more probable under π, i.e. those we would expect to see more often.
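A minimal sketch (Python) of the incremental weighted importance-sampling update in (23) and (24) is given below; the (return, weight) data are invented, and the result equals the batch weighted average Σ WiGi / Σ Wi.

# Minimal sketch of the weighted importance-sampling incremental update:
# V_{n+1} = V_n + (W_n / C_n)(G_n - V_n), with C_{n+1} = C_n + W_{n+1}, C_0 = 0.

def weighted_is_estimate(returns_and_weights):
    v, c = 0.0, 0.0
    for g, w in returns_and_weights:      # (return G_n, importance weight W_n)
        if w == 0.0:
            continue                      # zero-weight returns contribute nothing
        c += w                            # C_n accumulates the weights
        v += (w / c) * (g - v)            # incremental weighted average
    return v

data = [(3.0, 2.88), (1.0, 0.08), (5.0, 1.0)]      # made-up (G, W) pairs
print(round(weighted_is_estimate(data), 3))        # 3.465, same as sum(W*G)/sum(W)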

2.24 Off-policy Monte Carlo Control

Using incremental implementation (updates to the value function) and
importance sampling we can now discuss off-policy Monte Carlo
control - the algorithm for obtaining the optimal policy π∗ using returns
obtained through a behaviour policy b. This works in much the same way
as in previous sections; b must be soft to ensure the entire state space is
explored in the limit; updates are only made to our estimate of qπ, Q, if
the sequence of states and actions produced by b could have been
produced by π. This algorithm is also based on GPI: we update our
estimate of Q using Equation (23), then update π by acting greedily w.r.t.
our value function. If this policy improvement changes our policy such
that the trajectory we are in from b no longer obeys our policy, then we
exit the episode and start again. The full algorithm is shown in Figure 5.

Figure 5: Off-policy Monte Carlo control
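As an illustration of this scheme, here is a minimal sketch (Python) of off-policy Monte Carlo control with weighted importance sampling on the same invented corridor task used earlier: the behaviour policy is uniform random and the target policy is greedy w.r.t. Q. All names and parameters are assumptions, not the textbook's pseudocode verbatim.

# Minimal sketch of off-policy Monte Carlo control (weighted importance sampling).
import random
from collections import defaultdict

STATES, ACTIONS, GAMMA = [0, 1, 2, 3], ["L", "R"], 1.0

def step(s, a):
    """Corridor: 'R' from 3 terminates with +10, 'L' from 0 terminates with +1,
    every other move costs -1."""
    if s == 3 and a == "R":
        return None, 10.0, True
    if s == 0 and a == "L":
        return None, 1.0, True
    return s + (1 if a == "R" else -1), -1.0, False

Q = defaultdict(float)
C = defaultdict(float)
target = {s: "L" for s in STATES}        # greedy target policy (arbitrary start)

for _ in range(20000):
    # Generate an episode with the soft behaviour policy b (uniform random).
    s, episode = random.choice(STATES), []
    while True:
        a = random.choice(ACTIONS)       # b(a|s) = 0.5 for both actions
        s2, r, done = step(s, a)
        episode.append((s, a, r))
        if done:
            break
        s = s2
    # Process the episode backwards, weighting by the importance-sampling ratio.
    g, w = 0.0, 1.0
    for s, a, r in reversed(episode):
        g = GAMMA * g + r
        C[(s, a)] += w
        Q[(s, a)] += (w / C[(s, a)]) * (g - Q[(s, a)])
        target[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
        if a != target[s]:
            break                        # the rest of the episode is off the target policy
        w *= 1.0 / 0.5                   # W *= pi(a|s)/b(a|s) = 1/b(a|s) for greedy pi

print(target, {k: round(v, 1) for k, v in sorted(Q.items())})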

Part A - Questions & Answers
1. What is the Agent-Environment interface in the context of
Markov Decision Processes (MDP)? [K2, CO2]

The Agent-Environment interface represents the interaction between an agent


and its environment in an MDP, where the agent makes decisions based on the
environment's state.

2. Name two components of an MDP. [K3, CO2]

State space and action space

3. How are goals and rewards related in the context of MDPs? [K2, CO2]
Rewards serve as a measure of the goal achievement in MDPs; they guide the
agent's decision-making towards desirable outcomes.
4. Define returns in the context of MDPs. [K2, CO2]
Returns represent the cumulative sum of rewards obtained by the agent over a
sequence of time steps.
5. What is the difference between episodic and continuing tasks in
reinforcement learning? [K3, CO2]

Episodic tasks have a finite duration with a terminal state, while continuing tasks
continue indefinitely.

6. Explain the Markov property in the context of Markov Decision


Processes. [K3, CO2]

The Markov property states that the future state depends only on the current
state and action, not on the sequence of events that led to the current state.

7. Define the value function in the context of MDPs. [K3, CO2]

The value function estimates the expected cumulative future rewards for being in
a certain state (or state-action pair) and following a specific policy.

8. What does it mean for a policy to be optimal in
reinforcement learning? [K3, CO2]

An optimal policy is one that maximizes the expected cumulative
rewards in the long run.

9. What is the primary idea behind dynamic programming in


the context of MDPs? [K2, CO2]

Dynamic programming involves breaking down a complex problem into


smaller sub-problems and solving them iteratively.

10. Describe the process of policy evaluation in dynamic


programming. [K3, CO2]

Policy evaluation computes the value function for a given policy by


iteratively updating the estimated values based on the Bellman
equation.

11. How does policy improvement contribute to solving Markov


Decision Processes? [K2, CO2]

Policy improvement involves making the policy more greedy with respect
to the current value function, leading to better decision-making.

12. What is the key idea behind the Value Iteration algorithm?
[K3, CO2]

Value Iteration combines policy evaluation and policy improvement in a


single step to find the optimal value function and policy.

13. How does asynchronous dynamic programming differ from


synchronous dynamic programming? [K3, CO2]

Asynchronous DP updates the value function for specific states rather


than iterating through all states in a synchronized manner.

14. Discuss the efficiency of dynamic programming in solving
MDPs. [K3, CO2]

Dynamic programming is efficient for small to moderately sized


problems but may face scalability challenges for large state spaces.

15. What does Monte Carlo Prediction aim to estimate in


reinforcement learning? [K2, CO2]

Monte Carlo Prediction estimates the value function or state values by


averaging the returns obtained from multiple episodes.

16. How does Monte Carlo estimate action values in


reinforcement learning? [K3, CO2]

Monte Carlo estimates action values by averaging the returns specifically


for state-action pairs.

17. What is the main objective of Monte Carlo Control in


reinforcement learning? [K3, CO2]

Monte Carlo Control aims to find the optimal policy by iteratively


improving the current policy based on Monte Carlo estimates of action
values.

18. How does off-policy Monte Carlo prediction differ from on-
policy prediction? [K3, CO2]

Off-policy prediction estimates the value function for a target policy


using samples generated by a different behavior policy.

19. Explain the concept of off-policy Monte Carlo controls. [K3,


CO2]

Off-policy Monte Carlo controls aim to find the optimal policy even when
the agent is following a different behavior policy.

20. Why is the exploration-exploitation trade-off important in reinforcement
learning? [K3, CO2]
Balancing exploration (trying new actions) and exploitation (choosing known good
actions) is crucial for discovering optimal policies.
21. What role does the discount factor play in Markov Decision Processes?
[K3, CO2]
The discount factor determines the importance of future rewards, with values closer to 1
giving more weight to distant rewards.
22. Contrast Value Iteration and Policy Iteration in dynamic programming.
[K3, CO2]
Value Iteration combines policy evaluation and improvement in each iteration, while
Policy Iteration alternates between policy evaluation and improvement.
23. Define the state-action value function (Q-function) in reinforcement
learning. [K3, CO2]
The state-action value function represents the expected cumulative rewards of taking a
specific action in a given state and following a certain policy.
24. What is the key idea behind Temporal Difference learning in
reinforcement learning? [K3, CO2]
TD learning updates value estimates based on the difference between the current
estimate and a target estimate, combining ideas from both Monte Carlo and dynamic
programming.
25. Why might function approximation be used in reinforcement learning? [K3,
CO2]
Function approximation is employed to handle large state spaces by approximating value
functions or policies using parameterized functions like neural networks.
26. Briefly explain the difference between on-policy and off-policy methods
in reinforcement learning. [K3, CO2]
On-policy methods optimize the current policy, while off-policy methods optimize a
different target policy using a separate behavior policy.

27. What does the Bellman equation express in the
context of MDPs? [K3, CO2]
The Bellman equation expresses a recursive relationship between the
value of a state (or state-action pair) and the values of its successor
states.
28. Name two common exploration strategies used in
reinforcement learning. [K3, CO2]
Epsilon-greedy and softmax exploration are two common strategies
balancing exploration and exploitation.
29. Explain the difference between policy space and
value space in reinforcement learning. [K3, CO2]
Policy space refers to the set of all possible policies, while value
space refers to the range of possible values that the value function
can take.
30. How does the Markov property contribute to the
concept of memoryless systems in reinforcement learning?
[K3, CO2]
The Markov property implies that the current state encapsulates all
relevant information, making the system memoryless in the sense
that past states can be ignored for decision-making.

Part B Questions

1. Explain the components of the Agent-Environment Interface in a Markov Decision Process (MDP). [K3, CO1]
2. How does the agent interact with the environment, and what information is exchanged between them in each time step? [K4, CO1]
3. Define the goals in the context of a Markov Decision Process. [K4, CO1]
4. Discuss the role of rewards in shaping the behavior of an agent in an MDP. [K3, CO1]
5. Differentiate between episodic and continuing tasks, and explain how returns are calculated in each case. [K4, CO1]
6. Explain why the Markov Property is crucial for modeling decision-making problems. [K3, CO1]
7. Describe the main components of a Markov Decision Process (MDP). [K4, CO1]
8. Explain how value functions are used to evaluate the desirability of states and state-action pairs. [K3, CO1]
9. Provide an overview of Dynamic Programming and its relevance in solving MDPs. [K4, CO1]
10. Discuss the key steps involved in dynamic programming for solving MDPs. [K3, CO1]
11. Explain the process of policy evaluation in the context of MDPs. [K4, CO1]
12. Discuss how policy improvement is carried out based on the results of policy evaluation. [K4, CO1]
13. Describe the Value Iteration algorithm for finding optimal policies in MDPs. [K4, CO1]
14. Discuss the convergence properties of the Value Iteration algorithm. [K3, CO1]
15. Explain the concept of Monte Carlo prediction in the context of MDPs. [K4, CO1]
16. Explain how off-policy Monte Carlo methods can be used for prediction and control. [K4, CO1]

Real-time Applications
1. Smart Home Automation
Application: Energy Management
Description: MDPs can be used to optimize energy consumption
in smart homes. States may include occupancy and weather
conditions, actions may be adjusting thermostat settings or
turning off/on appliances, and rewards could be energy savings.
The system learns and adapts to user preferences over time.

2. Online Advertising:
Description: MDPs can optimize the placement of online
advertisements. States may include user profiles and website
content, actions may be different ad placements, and rewards
could be user engagement. The system learns over time to
display ads that maximize user interaction.

The Tiger Problem
The Tiger Problem is a renowned POMDP problem wherein a decision
maker stands in front of two closed doors, one on the left and one on the right. A tiger is
put behind one of the doors with equal probability, and treasure is
placed behind the other. Therefore, the problem has two states: tiger-left
and tiger-right. The decision maker can either listen for the roars of the
tiger or open one of the doors. Therefore, the possible actions are
listen, open-left, and open-right. However, the observation capability of
the decision maker is not 100% accurate: there is a 15%
probability that the decision maker listens and misidentifies the
side on which the tiger is present. Naturally, if the decision maker opens
the door with the tiger, then they will get hurt; therefore, we assign a
reward of -100 in this case. On the other hand, the decision maker
gains a reward of 10 if they choose the door with the treasure. Once a door
is opened, the tiger is again randomly assigned to one of the doors, and
the decision maker gets to choose again.
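Since the decision maker never observes the true state directly, it must maintain a belief, i.e. a probability that the tiger is behind the left door, and update it with Bayes' rule after each listen. A minimal sketch (Python) of that update, assuming the 85%-accurate observation model described above, is shown below; the function and variable names are illustrative.

# Minimal sketch of the Bayes belief update for the Tiger problem after a
# "listen" action, assuming observations are correct with probability 0.85.

def update_belief(b_tiger_left, heard_left, accuracy=0.85):
    """Posterior P(tiger-left) after hearing a roar on the left (or right)."""
    p_obs_given_left = accuracy if heard_left else 1.0 - accuracy
    p_obs_given_right = (1.0 - accuracy) if heard_left else accuracy
    numerator = p_obs_given_left * b_tiger_left
    evidence = numerator + p_obs_given_right * (1.0 - b_tiger_left)
    return numerator / evidence

b = 0.5                                  # initial belief: tiger equally likely on either side
b = update_belief(b, heard_left=True)    # first observation "tiger-left"
print(round(b, 3))                       # 0.85
b = update_belief(b, heard_left=True)    # same observation heard twice
print(round(b, 3))                       # ~0.97: confident enough to open the right door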
The Tiger Problem can be solved using the default grid method,
which uses point-based value iteration. The resulting policy can be
interpreted as follows: the decision


maker starts at the "Initial Belief" state, wherein there is an
equal probability of the tiger being behind the left or right
door. Once an observation of "tiger-left" is made, the decision
maker again chooses to listen. If the second observation is

different, then the decision maker returns to the "Initial
Belief" state. However, if the same observation is made twice,
the decision maker chooses to open the right door. A similar
case happens for the observation "tiger-right". Once the
reward is obtained, the problem is reset, and the decision
maker returns to the "Initial Belief" state. In the subsequent
sections, we present extensions to the POMDP problem and
its applications.
