Unit 1, 2 RL
Regret is a feeling of sorrow or disappointment over something that has happened or something one
has done, often related to a perceived mistake or missed opportunity. It can stem from the realization
that a different action or decision could have led to a better outcome.
Example:
Imagine you're deciding between two job offers. One job offers a higher salary, but the other offers
more work-life balance and aligns better with your long-term goals. After much thought, you decide to
take the higher-paying job. A few months later, you realize you're working long hours, feeling stressed,
and have less time for family or personal activities. In this situation, you might feel regret, wishing you
had chosen the job that offered better work-life balance instead.
Regret typically involves reflecting on past choices and wishing for a different result, often coupled with
the belief that a better outcome could have been achieved with another decision.
Regret is when you feel sad or disappointed about a decision you made because you wish you had done
something differently.
Simple Example:
You have two ice cream flavors to choose from—chocolate and vanilla. You pick chocolate, but after
tasting it, you realize you would have liked vanilla more. Later, you feel regret because you wish you had
chosen vanilla instead.
UNIT - II
Markov Decision Problem, policy, and value function, Reward models (infinite discounted, total,
finite horizon, and average), Episodic & continuing tasks, Bellman's optimality operator, and Value
iteration & policy iteration
Unit II covers foundational concepts in Markov Decision Processes (MDPs), which are used to model
decision-making problems where outcomes are partly random and partly under the control of an agent.
Below is a breakdown of the key concepts in this unit:
1. Markov Decision Process (MDP)
An MDP is a mathematical framework for modeling decision-making in situations where outcomes are
uncertain. An MDP is defined by:
States (S): The set of all possible states the system can be in.
Actions (A): The set of actions the agent can take.
Transition Model (P): The probability distribution over the next state given the current state
and action, i.e., P(s'|s, a).
Reward Model (R): A function that gives the immediate reward received after taking action a
in state s, transitioning to state s′ .
Discount Factor (γ): A value between 0 and 1 that represents the importance of future
rewards compared to immediate rewards.
2. Policy
A policy is a strategy used by the agent to decide which action to take in each state. It can be
either:
Deterministic Policy: A specific action is chosen for each state.
Stochastic Policy: The action chosen for each state is probabilistic.
3. Value Function
The value function measures the expected long-term reward an agent can achieve starting from a
given state, following a particular policy. It is denoted as:
V (s) = Expected return from state s, following the policy.
The value of a state is important in decision-making because it helps the agent estimate the
desirability of being in a state.
4. Reward Models
Reward models define how rewards are calculated in MDPs. The four types are:
Infinite Discounted Reward: The sum of rewards over time, discounted by a factor γ for each time step, i.e., R = ∑_{t=0}^{∞} γ^t r_t. This model is used when an agent is interested in long-term rewards but values immediate rewards more.
Total Reward: The sum of all rewards received over time, without any discounting.
Finite Horizon Reward: The sum of rewards over a fixed number of time steps T, after which the process ends.
Average Reward: The average reward per time step over an infinite number of steps, useful
when the goal is to maximize long-term steady rewards.
5. Episodic & Continuing Tasks
Episodic Tasks: In episodic tasks, the decision-making process is divided into episodes (each episode ends when a goal is reached or after a fixed number of steps), and each episode starts fresh. The goal is to maximize the reward within each episode.
Continuing Tasks: These tasks have no natural endpoint. The agent continually interacts with the
environment, and the goal is to maximize long-term rewards over an indefinite period.
6. Bellman’s Optimality Equation
Bellman’s Optimality Equation expresses the relationship between the value function of a state and the value function of its possible successor states. For the optimal policy π*, the Bellman equation is:
V*(s) = max_a [ R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) ]
This equation helps find the optimal value function, V ∗ (s), by considering all possible actions and
their corresponding rewards and value estimates.
7. Value Iteration
Value Iteration is an algorithm used to compute the optimal value function for an MDP. The idea
is to iteratively update the value of each state using the Bellman optimality equation until the
values converge to the optimal values.
Start with an initial guess for the values of all states.
Iteratively update the value of each state using the Bellman equation.
Repeat this until the values converge (i.e., changes in the values are smaller than a threshold).
8. Policy Iteration
Policy Iteration is another algorithm to solve MDPs, which alternates between two steps:
1. Policy Evaluation: Given a policy, calculate the value function for all states.
2. Policy Improvement: Based on the value function, improve the policy by choosing actions
that maximize the expected reward.
Repeat these steps until the policy converges (i.e., no further improvements can be made).
These methods form the core of solving MDPs and help in finding the best policy to maximize long-term
rewards in decision-making scenarios.
If you need a more detailed explanation of any of these topics, feel free to ask!
1. States (S): All the possible situations or conditions the system can be in.
2. Actions (A): The set of choices the agent can make to affect the system.
3. Transition Model (P): The probability of moving from one state to another after taking an action.
4. Reward (R): The immediate benefit the agent gets from taking an action in a state.
5. Discount Factor (γ): A value between 0 and 1 that represents how much future rewards are valued
compared to immediate rewards.
Simple Example:
Imagine you're playing a robot navigation game in a grid world. The robot is in a 3x3 grid and can
move in four directions: up, down, left, or right. The goal is to get to the target position (bottom-right
corner) from the starting position (top-left corner).
1. States (S): Each cell in the grid is a state. For example, S = { (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0),
(2,1), (2,2) } where (0,0) is the top-left and (2,2) is the target.
2. Actions (A): The robot can move in one of four directions: up, down, left, or right. A = {up, down,
left, right}.
3. Transition Model (P): The transition model defines the probability of moving from one state to
another after taking an action. For example, if the robot is at (0,0) and it moves right, it will
transition to (0,1). If the robot moves down from (0,0), it will go to (1,0), and so on.
4. Reward (R): The reward is given based on the action taken. For example:
Moving to the target (2,2) gives a reward of +10 (this is the goal).
Moving to any other cell gives a reward of -1 (cost of movement).
If the robot hits a wall or tries to move outside the grid, it stays in the same position and gets
a reward of -1.
5. Discount Factor (γ): If the robot should value immediate rewards somewhat more than future ones, a discount factor of γ = 0.9 might be used. This means the robot values a reward received one step in the future at 90% of an immediate reward.
Example Process:
At each time step, the robot observes its current cell (state), chooses a move (action), receives a reward, and lands in a new cell. The robot continues this process until it reaches the target (2,2), and its objective is to maximize the total
reward.
The robot needs to learn a policy, which is a strategy for choosing actions that will help it reach the
target with the maximum reward. The policy could be, for example, "always move towards the right
unless blocked," or more complex, depending on the environment and the goal.
In this scenario, the MDP helps the robot decide which actions to take to achieve the highest possible
total reward in the long run, considering the uncertainty in the outcomes (e.g., when moving left or right
might lead to different states).
This is a basic MDP setup, but it can be expanded with more complex rewards, states, and actions to
represent more intricate decision-making problems.
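To make the setup concrete, here is a minimal Python sketch of this 3x3 grid world as an MDP, assuming deterministic moves, the -1 step cost, the +10 reward for entering the target (2,2), the stay-in-place rule at walls, and γ = 0.9 described above (the function and variable names are illustrative, not from any particular library):

```python
# Minimal sketch of the 3x3 grid-world MDP described above.
GAMMA = 0.9
STATES = [(r, c) for r in range(3) for c in range(3)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (2, 2)

def step(state, action):
    """Return (next_state, reward) for one deterministic transition."""
    dr, dc = ACTIONS[action]
    nr, nc = state[0] + dr, state[1] + dc
    # Hitting a wall: the robot stays where it is.
    next_state = (nr, nc) if 0 <= nr < 3 and 0 <= nc < 3 else state
    reward = 10 if next_state == GOAL else -1   # +10 for the target, -1 otherwise
    return next_state, reward
```

For example, step((0, 0), "right") returns ((0, 1), -1), matching the transitions described above.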
Explain policy and value functions with simple formulas and examples
In the context of a Markov Decision Process (MDP), the policy and value function are essential for
understanding how the agent behaves and how good the states are, respectively.
1. Policy (π)
A policy is a strategy or rule that the agent follows to decide which action to take in each state. It can be
deterministic or stochastic.
Deterministic Policy: The agent always chooses the same action in a given state.
Stochastic Policy: The agent chooses actions probabilistically based on the state.
The policy π(s) defines the action the agent should take in state s.
For example, if the agent is in state s = (0, 0), the policy might suggest π(0, 0) = right, meaning the agent should move right when it is in state (0, 0).
Example of a Policy:
Consider the grid world example from before (3x3 grid). Here's a simple deterministic policy:
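Since the original policy table is not reproduced here, below is one plausible deterministic policy written as a Python dictionary (an illustrative assumption: head right along each row, then down the last column to the goal at (2, 2)):

```python
# One possible deterministic policy for the 3x3 grid (illustrative only).
policy = {
    (0, 0): "right", (0, 1): "right", (0, 2): "down",
    (1, 0): "right", (1, 1): "right", (1, 2): "down",
    (2, 0): "right", (2, 1): "right", (2, 2): None,  # goal: no action needed
}
```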
This policy guides the agent on which action to take in each state.
2. Value Function (V)
The value function V (s) gives the expected return (or reward) from a state s, following a certain policy.
It represents how good it is to be in a particular state, considering the long-term rewards.
V^π(s) = E_π [ ∑_{t=0}^{∞} γ^t R_t ∣ s_0 = s ]
Where:
R_t is the reward received at time step t,
γ is the discount factor,
E_π denotes the expectation when the agent follows policy π.
In simpler terms, the value function tells us how much future rewards we can expect from a given state,
assuming the agent follows the policy π .
Let’s consider the grid world example and use a simple reward system. We will assume:
Each move gives a reward of -1.
The goal state (2, 2) has a value of V (2, 2) = 10.
The discount factor is γ = 0.9.
If the agent follows the deterministic policy we defined earlier, we can estimate the value of state (0, 0).
The agent moves right to (0, 1), and it receives -1 as the reward.
It continues until it reaches the goal at (2, 2), getting rewards at each step.
Let's assume the value of state (2, 2) (the goal) is V (2, 2) = 10.
We can calculate the value of state (1, 1) as:
V (1, 1) = −1 + γ ⋅ V (2, 1)
where γ is the discount factor. If we assume γ = 0.9, and let's say V (2, 1) = 8 (a previously computed
value based on the policy), then:
V (1, 1) = −1 + 0.9 × 8 = −1 + 7.2 = 6.2
Similarly, you would continue calculating values for other states based on the rewards and their
subsequent states, updating each state's value until all values converge.
A simpler example with just two states:
State 0 (start): The agent starts here. The value is based on expected future rewards.
State 1 (goal): The goal state with a reward of +10.
Assume the agent will always move towards the goal with a reward of -1 for each move.
Policy: Always move from State 0 to State 1 (the goal).
V (0) = −1 + γ ⋅ V (1)
V (0) = −1 + 0.9 ⋅ 10 = −1 + 9 = 8
Summary:
Policy (π): A strategy that tells the agent which action to take in each state.
Value Function (V): A measure of how good it is to be in a given state, based on expected future
rewards.
These functions guide the agent in making decisions that maximize long-term rewards in an MDP.
We use Markov Decision Processes (MDP) to model situations where an agent (like a robot, player, or
system) needs to make decisions in environments that are both uncertain and sequential. In simple
words, an MDP helps us figure out the best actions for an agent to take in order to achieve a goal,
especially when the future is not completely predictable.
1. Decision-Making Under Uncertainty: In real life, many decisions are made in environments
where we don't know exactly what will happen next. MDP allows us to make decisions even when
outcomes are uncertain.
2. Long-Term Planning: MDP helps us focus not just on immediate rewards, but also on the long-
term rewards (like future benefits), so that the agent can plan better.
3. Optimal Strategy: MDP helps the agent learn or find the best strategy (called a policy) to achieve
the goal, considering both current and future situations.
Simple Example:
Imagine you are a robot in a maze. You don't know exactly where the obstacles are (uncertainty), and
each move you make affects your position in the maze. You want to get to the goal in the shortest path
possible, but some moves might take you away from the goal. MDP helps you figure out the best way to
move, considering both the immediate rewards (like avoiding obstacles) and long-term rewards (like
reaching the goal).
In short, MDP provides a way to make smart decisions when you have to choose between different
actions, considering both current and future outcomes.
In a Markov Decision Process (MDP), the next move and future actions are predicted based on
probabilities and rewards associated with the current state, action taken, and subsequent states. The
agent doesn't have perfect knowledge of the future but relies on a framework that helps make the best
decisions based on the current information.
Here's how the MDP predicts the next move and future:
1. Transition Probabilities (P):
The agent uses the transition probabilities to predict the likelihood of moving from one state to
another after performing an action.
For example, if the agent is in state s and takes action a, the transition model tells the agent the
probability of ending up in a new state s′ after taking that action.
2. Rewards (R):
After each move, the agent receives a reward based on the action it took. This reward helps the agent
understand whether the action was good or bad, guiding it towards better decisions in the future.
The agent uses these rewards to evaluate actions. A positive reward signals a good decision, while a
negative reward signals a bad one.
3. Value Function (V):
The value function estimates how good it is to be in a particular state, considering not just the
immediate rewards but the future rewards as well. The value function is updated based on the agent's
experience.
For state s, the value function V (s) predicts the expected return (total reward) from that state,
considering both current and future rewards.
V (s) = E [ ∑_{t=0}^{∞} γ^t R(s_t) ∣ s_0 = s ]
Where:
γ is the discount factor, which decides how much future rewards are worth compared to
immediate rewards (0 for no future value, 1 for full future value).
The summation represents the total expected reward over time from that state, considering future
actions.
4. Policy (π):
The policy is a strategy that tells the agent what action to take in each state. The goal is to learn the
optimal policy, which maximizes long-term rewards.
The policy is typically represented as π(s) = a for a deterministic policy, or as π(a|s), the probability of taking action a in state s, for a stochastic policy.
5. Choosing the Next Action:
The agent uses the policy to decide what action to take in its current state. The action is chosen based
on the value of future states, as predicted by the value function. The agent picks the action that leads
to the state with the highest predicted value (highest reward).
For example:
If the agent is in state s0 , the policy π(s0 ) will tell it the best action to take (say, move right).
The agent will then predict the next state s1 using the transition model P (s′ ∣s0 , a), and calculate the expected value of s1 to confirm that the chosen action leads to the highest future reward.
6. Future Predictions:
To predict the future and decide on the best course of action, the agent looks at all possible actions and
their corresponding outcomes: the states those actions are likely to lead to (from the transition model), the immediate rewards they yield, and the values of those next states.
Example of Prediction:
Let’s say the agent is in a grid world and wants to reach the goal (top-right corner). For each possible move, the transition model predicts which cell the agent would end up in, and the value function estimates the future reward from that cell.
The agent uses this information to choose the best action that maximizes its future reward.
Summary:
The transition model predicts where each action is likely to lead, the reward signal evaluates each move, the value function estimates the long-term return, and the policy picks the action with the highest expected value.
By following these steps, the agent can make decisions that lead to optimal outcomes, even in uncertain
environments.
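As a sketch of this one-step lookahead, the snippet below picks the greedy action for the current state given a value estimate V. The dictionaries P, R, and V and their layouts are assumptions for illustration, not part of the notes above:

```python
def greedy_action(state, actions, P, R, V, gamma=0.9):
    """Pick the action with the highest expected one-step value:
    R[(s, a)] + gamma * sum over s' of P[(s, a)][s'] * V[s'].
    P[(s, a)] is assumed to map each possible next state to its probability."""
    def lookahead(a):
        return R[(state, a)] + gamma * sum(
            prob * V[s2] for s2, prob in P[(state, a)].items())
    return max(actions, key=lookahead)
```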
Explain Reward models (infinite discounted, total, finite horizon, and average) with simple example
In the context of Markov Decision Processes (MDP), reward models are used to define how the rewards
are accumulated over time. These models determine how the agent evaluates and maximizes the
rewards it receives in different environments. There are different types of reward models:
1. Infinite Discounted Reward
In this model, the agent receives rewards over time, but future rewards are discounted by a factor γ (0 ≤
γ ≤ 1). The idea is that immediate rewards are more valuable than future rewards.
Formula:
R_discounted = ∑_{t=0}^{∞} γ^t R_t
Where:
R_t is the reward received at time step t,
γ is the discount factor (0 ≤ γ ≤ 1).
Example:
Suppose an agent receives the following rewards in each time step: 10, 5, 3, 2, ...
R0 = 10, R1 = 5, R2 = 3, R3 = 2
If we assume γ = 0.9, the discounted return is R = 10 + 0.9 × 5 + 0.9² × 3 + 0.9³ × 2 ≈ 18.39.
Here, the agent values immediate rewards more than future rewards because of the discount factor γ.
2. Total Reward
In this model, the agent receives rewards over time without any discount. It simply sums up all the
rewards received, regardless of when they occur.
Formula:
R_total = ∑_{t=0}^{∞} R_t
Where:
R_t is the reward received at time step t.
Example:
Suppose the agent receives the following rewards in each time step: 10, 5, 3, 2, ...
R0 = 10, R1 = 5, R2 = 3, R3 = 2
Rtotal = 10 + 5 + 3 + 2 = 20
In this case, there is no discounting, so the agent simply adds up all the rewards it receives.
3. Finite Horizon Reward
In the finite horizon model, the agent receives rewards over a finite number of time steps. After a
certain number of steps, the process ends, and no future rewards are received.
Formula:
R_finite horizon = ∑_{t=0}^{T−1} R_t
Where:
R_t is the reward received at time step t,
T is the horizon (the fixed number of time steps).
Example:
Suppose the agent receives rewards: 10, 5, 3, 2, and we have a finite horizon of 3 steps.
R0 = 10, R1 = 5, R2 = 3
Rfinite horizon = 10 + 5 + 3 = 18
In this case, even if the agent would have continued to receive rewards beyond time step 2, it stops after
3 steps, and no future rewards are considered.
4. Average Reward
The average reward model focuses on the long-term average reward per time step. It looks at how
much reward the agent gets on average for each step it takes over an indefinite period.
Formula:
R_average = lim_{T→∞} (1/T) ∑_{t=0}^{T−1} R_t
Where:
R_t is the reward received at time step t,
T is the number of time steps.
Example:
Suppose the agent receives rewards: 10, 5, 3, 2, and so on. To calculate the average reward, we take the
average of all rewards received over time.
R0 = 10, R1 = 5, R2 = 3, R3 = 2
R_average = (10 + 5 + 3 + 2) / 4 = 20 / 4 = 5
If the agent keeps receiving rewards in a similar pattern as it takes more steps, the long-run average settles around this value.
Summary:
Infinite Discounted Reward: Rewards are discounted over time. The agent values immediate
rewards more than future rewards.
Total Reward: The agent simply adds up all rewards over time without any discounting.
Finite Horizon Reward: Rewards are considered over a limited number of steps, after which the
process ends.
Average Reward: The agent considers the long-term average reward per time step, focusing on
sustained performance over time.
Each of these reward models is useful in different types of decision-making scenarios, depending on
how we want the agent to prioritize immediate versus future rewards, and whether we want to limit or
focus on long-term behavior.
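As a quick check of the four models on the example reward stream 10, 5, 3, 2, the short Python sketch below computes each quantity (γ = 0.9 is an assumed discount factor; T = 3 matches the finite-horizon example above):

```python
rewards = [10, 5, 3, 2]
gamma = 0.9

discounted = sum(gamma**t * r for t, r in enumerate(rewards))   # ≈ 18.39
total = sum(rewards)                                            # 20
finite_horizon = sum(rewards[:3])                               # 18 (horizon T = 3)
average = sum(rewards) / len(rewards)                           # 5.0

print(discounted, total, finite_horizon, average)
```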
In the finite horizon model, why is there no reward after the final step? Has the agent reached the goal?
In the finite horizon reward model, the idea is that the agent operates within a fixed period of time
(the horizon), which is set in advance. This means that after a certain number of steps (the horizon T ),
the agent's decision-making process ends—there are no more rewards considered beyond that point,
regardless of whether the agent has reached its goal or not.
The finite horizon simply means that the agent's environment or task is constrained by a limited time
frame, and the model does not account for events or decisions beyond that horizon. This is a modeling
choice that helps simplify the problem. In reality, you could extend the time frame, but for the finite
horizon model, the task is considered completed after the set number of steps.
So, no rewards after the finite horizon because the process is defined to end at time T , regardless of
whether the goal is reached. The focus is on maximizing rewards within that fixed time period.
Whether or not the agent reaches its goal within the finite horizon depends on the task and how the
agent behaves. The goal could be to reach a specific state, accumulate a certain amount of reward, or
complete a particular action within the time horizon.
If the agent reaches the goal before the horizon ends, it will still receive rewards for the actions
taken during that time.
If the agent doesn't reach the goal by the end of the horizon, the task simply stops at time T , and
there are no further rewards considered.
Example to Clarify:
Imagine a simple task where an agent is trying to reach a goal in a grid. The agent can move up, down,
left, or right, and it receives a reward of +10 if it reaches the goal (bottom-right corner), or a reward of -1
for each move.
Suppose the horizon is T = 4. At time t = 0, the agent moves right and gets a reward of -1.
At time t = 1, the agent moves down and gets another reward of -1.
At time t = 2, the agent moves right again and gets a reward of -1.
At time t = 3, the agent moves up and gets another reward of -1.
After time step t = 3, the task ends (because we defined a finite horizon of 4), and the agent has not
reached the goal. No further rewards are given after time step 3, even though the agent could
potentially have reached the goal if more time was available.
Key Takeaways:
In a finite horizon, the agent's task ends after a fixed number of steps, and no rewards are
considered beyond that point.
The goal may or may not be reached by the end of the horizon.
The agent only cares about maximizing rewards within the finite period (horizon T ).
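A tiny sketch of the finite-horizon idea: the return simply ignores anything after step T − 1, whether or not the goal was reached (the reward list below mirrors the -1-per-move example and is illustrative only):

```python
def finite_horizon_return(rewards, T):
    # Only the first T rewards count; the task is defined to end at time T.
    return sum(rewards[:T])

# Four moves at -1 each, goal never reached within the horizon:
print(finite_horizon_return([-1, -1, -1, -1, -1, -1], T=4))  # -4
```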
The use of negative rewards (like -1) is a way to penalize the agent for undesirable actions or to
encourage it to avoid certain behaviors. In reinforcement learning (RL), rewards guide the agent's
learning process, helping it distinguish between actions that lead to good outcomes and those that
don't.
Here’s why negative rewards like -1 might be used:
Negative rewards can be used to discourage the agent from taking actions that are not aligned with its
goals. If the agent performs an action that doesn't help it reach its objective (like moving away from the
goal), it can be penalized with a negative reward to steer it in the right direction.
Example:
In a grid world, where the agent’s goal is to reach a specific location, each time step the agent spends
away from its goal could be penalized with a negative reward (e.g., -1 per step). This discourages the
agent from aimlessly wandering and encourages it to find the quickest path.
In many problems, especially those involving navigation or tasks with multiple time steps, the agent
might be penalized for each action it takes. This reflects the idea that the agent should aim to complete
its task as quickly as possible, reducing unnecessary steps.
Example:
If you're solving a maze, each move might cost the agent -1 to encourage it to find the shortest path to
the goal.
If the agent makes a mistake, such as moving into an invalid state or not following the rules, negative
rewards can be assigned to signal that the action was a bad decision.
Example:
In a game where the agent has to avoid obstacles, running into an obstacle might result in a negative
reward, like -1, to indicate that it's a poor decision.
In some cases, the agent might receive both positive and negative rewards. Positive rewards (like +10)
are given when the agent performs the right action (like reaching the goal), and negative rewards (like
-1) are given for every other action that doesn’t help the agent achieve its goal.
This balance helps the agent learn that not all actions are equally beneficial. It teaches the agent that
action costs (negative rewards) should be minimized and that goal-reaching actions (positive rewards)
should be maximized.
Negative rewards can also be part of balancing exploration (trying new things) and exploitation
(sticking to what’s known to work). If the agent explores the environment and gets penalized, it learns to
focus on exploiting the actions that lead to higher rewards, but it also avoids costly mistakes by learning
from negative penalties.
Let’s use a simple grid world where the agent must move from the top-left corner to the bottom-right
corner.
Positive reward (+10): When the agent reaches the goal (bottom-right corner).
Negative reward (-1): For each step taken (to encourage the agent to minimize the number of
steps).
Big negative reward (-5): If the agent steps into an obstacle or goes in the wrong direction.
Situation:
At each time step, the agent can move in one of four directions.
If it moves towards the goal, it gets -1, but when it reaches the goal, it gets +10.
If it moves into an obstacle, it gets a penalty of -5.
In this case, the negative rewards penalize the agent for actions that don’t bring it closer to the goal or
involve mistakes (like hitting an obstacle), while the positive reward of +10 is given for successfully
reaching the goal. The agent then learns to avoid unnecessary steps and mistakes, aiming to reach the
goal in the fewest moves.
Summary:
By carefully balancing positive and negative rewards, the agent learns to make decisions that maximize
its long-term cumulative reward.
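A minimal sketch of such a reward function for the grid example, assuming the +10 / -1 / -5 scheme described above (the goal position and obstacle set are illustrative parameters):

```python
def reward(state, action, next_state, goal=(2, 2), obstacles=frozenset()):
    """Hypothetical reward function mirroring the scheme described above."""
    if next_state == goal:
        return 10     # reaching the goal
    if next_state in obstacles:
        return -5     # stepping into an obstacle / a clearly wrong move
    return -1         # cost of every other move
```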
In reinforcement learning (RL), tasks can be broadly classified into two categories based on whether
they have a fixed endpoint or continue indefinitely. These categories are called episodic tasks and
continuing tasks.
1. Episodic Tasks
An episodic task is one where the process is divided into episodes. Each episode consists of a sequence
of actions and rewards, and the task ends after a fixed number of steps or when the agent reaches a
specific goal. After an episode ends, the environment is reset, and the agent starts a new episode.
Key Characteristics:
Fixed duration: The task ends after a set number of steps or when a goal is reached.
Resetting: After each episode ends, the environment is reset, and the agent starts over.
Goal-oriented: These tasks are usually structured so the agent has a goal to reach by the end of
the episode.
Gridworld Navigation: Suppose the agent is in a 5x5 grid, and its goal is to reach the bottom-right
corner. The agent starts at the top-left corner. It moves in one of four directions (up, down, left,
right), and the task ends when the agent reaches the goal.
Start: The agent starts at (0, 0).
End: The agent reaches the goal at (4, 4).
The task is episodic because once the agent reaches the goal, the environment is reset, and
the agent starts a new episode from (0, 0).
Each time the agent reaches the goal, an episode is considered completed, and the environment is reset
to begin a new episode.
2. Continuing Tasks
A continuing task is one where there is no fixed endpoint, and the agent interacts with the
environment continuously over an indefinite period. There is no concept of episodes; the agent keeps
acting and receiving rewards until some external stopping condition is met (like a time limit or a
specified condition to stop).
Key Characteristics:
No fixed duration: The task doesn't have an end after a specific number of steps.
Ongoing process: The task continues indefinitely, with no reset after each action or step.
No clear goal: These tasks usually do not have a clear endpoint or goal to reach; instead, the agent
aims to keep maximizing rewards over time.
Robotic Vacuum Cleaning: Imagine a robotic vacuum cleaner that moves around a house,
cleaning the floor. The task doesn't end unless the vacuum is turned off, or the battery runs out. It
continues to receive rewards (e.g., -0.1 for each step to encourage efficiency and +10 for
completing a cleaning zone).
The robot moves, cleans, and keeps working continuously.
There is no fixed end to the task. The robot keeps performing actions, and its objective is to
maximize its efficiency while cleaning the house.
This is a continuing task because there is no fixed point at which the task ends, and the
agent keeps interacting with the environment.
Summary of Differences:
Ends after a fixed number of steps or when a goal is Continues indefinitely without a fixed
Duration
reached. endpoint.
Environment does not reset; agent continues
Reset Environment resets after each episode.
interacting.
Often goal-oriented, with the task ending once the No clear goal, focuses on continuous actions
Goal
goal is reached. and rewards.
Gridworld navigation, games (e.g., chess, tic-tac- Robotic vacuum cleaning, stock market
Examples
toe). trading.
Both types of tasks have their own unique challenges and require different strategies for the agent to
learn and maximize rewards effectively.
Explain Bellman's optimality with a simple example
Bellman’s Optimality Equation is a key concept in reinforcement learning and dynamic programming.
It provides a recursive way of calculating the optimal policy (the best way to act) in a Markov Decision
Process (MDP). It tells us how to calculate the value function of a state, which represents the total
expected reward that can be obtained from that state, following the best possible actions.
The Bellman Optimality Equation for the value function V*(s) is given by:
V*(s) = max_a [ R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) ]
Where:
V*(s) is the optimal value of state s,
R(s, a) is the immediate reward for taking action a in state s,
γ is the discount factor,
P(s′|s, a) is the probability of reaching state s′ from state s by taking action a.
This equation tells us that the value of a state is the maximum over all possible actions a, of the reward
from the action plus the expected value of the next state, weighted by the transition probabilities.
Simple Example:
Imagine a grid world where an agent is trying to get to a goal. The grid looks like this:
+---+---+---+
| S |   | G |
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
Steps:
1. Start State S : The agent can take one of the four actions (up, down, left, or right). Each action
leads to a different state, and each state has a corresponding value.
2. Action Calculation: For each action, we calculate the expected value using the formula V*(s) = max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) ).
3. Value of the Goal State (G): Since the goal gives a reward of +10, and once the agent reaches the
goal, there are no more steps to take (it’s an endpoint), the value of the goal state is:
V ∗ (G) = 10
4. Value of State S : Now, let’s compute the value of state S . The agent can move right towards the
goal, and if it reaches the goal, it gets a reward of +10. Alternatively, it could move in other
directions, getting a reward of -1 for each step.
V*(S) = max [ reward for moving right + γ × V*(G), reward for moving down + γ × V*(next state below), … ]
For simplicity, let's treat the move right as leading straight to the goal and assume all other directions lead to non-optimal paths. The calculation will be:
V*(S) = −1 + 0.9 × 10 = 8
So, the value of the start state S is 8, which means that the optimal policy is to move right and
eventually reach the goal, maximizing the cumulative reward.
Bellman’s Optimality Equation helps calculate the value of each state by considering the rewards
and the future values of the next states.
The agent aims to maximize the expected value by choosing the best possible action.
Recursive calculation: The value of a state is recursively defined by the rewards for taking actions
and the values of subsequent states.
Does the policy store the previous game moves and tell the agent which action to take now to get the maximum reward? Is that right?
Yes, you're on the right track!
In reinforcement learning, a policy is a strategy or function that determines the action the agent should
take at each state to maximize its cumulative reward. However, it’s important to clarify that a policy
doesn't directly store all previous game moves. Instead, it typically either maps the current state to an
action or, in more advanced scenarios, it might be state-dependent, meaning it decides on actions
based on the current state and sometimes the history of actions.
Key points:
1. Policy: A policy π(s) defines the action to take for a given state s. In simpler terms, it’s like a set of
rules or a guide that tells the agent what to do next.
Deterministic policy: Always takes the same action for the same state (e.g., π(s) = a).
Stochastic policy: Chooses actions probabilistically for a given state (e.g., π(s) = a1 with probability p and a2 with probability 1 − p).
2. Memory of Past Moves: In many reinforcement learning problems, the agent does not explicitly
store the entire history of moves. The agent only focuses on the current state (in most standard
problems like Markov Decision Processes). However, in some advanced settings like Partially
Observable Markov Decision Processes (POMDPs) or when dealing with recurrent neural
networks (RNNs), the agent can maintain memory of past actions or states. This allows the agent
to take past actions into account when making decisions.
3. Maximizing Reward: The policy’s goal is to maximize the expected cumulative reward. It does so
by choosing actions based on the current state in such a way that future rewards (often discounted
by a factor γ ) are maximized. The agent doesn't need to store all previous game moves explicitly
unless the problem requires it (e.g., for non-Markovian problems).
Let’s consider a simple grid world where the agent’s goal is to reach a goal state (G) from a starting
position (S). The agent can move up, down, left, or right.
At each state, the policy will determine the best action for the agent to take.
In a deterministic policy, for example:
If the agent is at position (1, 1), the policy might decide: "Move right to (1, 2)."
If the agent is at (1, 2), it might decide: "Move right to (1, 3)."
The policy is designed to guide the agent step-by-step based on the current position (state) to
reach the goal while maximizing the rewards.
If the agent receives a penalty for each step (say, -1 for each move), the policy might choose the most
direct route to minimize the total steps and reach the goal.
Summary:
A policy maps the current state to an action; it does not store the entire history of moves. In standard MDPs the current state is enough to decide, and the policy is chosen (or learned) so that following it maximizes the long-term cumulative reward.
Explain Value Iteration
Value Iteration
Value Iteration is an algorithm used to compute the optimal policy and value function in a Markov
Decision Process (MDP). It is a method to find the optimal value function V ∗ (s) for each state, which
can then be used to derive the optimal policy π ∗ (s) for selecting the best actions.
The algorithm iterates through all states, updating the value of each state based on the Bellman
Optimality Equation until the values converge (i.e., changes in the value function become very small).
1. Initialize Values: Start by assigning an initial value to all states. This can be done arbitrarily (e.g.,
setting all states' values to zero).
2. Bellman Update: For each state, update its value by using the Bellman Optimality Equation:
V(s) ← max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V(s′) )
Where:
V (s) is the value of state s,
a is an action taken in state s,
R(s, a) is the reward for taking action a in state s,
γ is the discount factor (between 0 and 1, which determines the importance of future
rewards),
P (s′ ∣s, a) is the transition probability (the probability of reaching state s′ from state s by
taking action a).
Essentially, the update finds the action a that maximizes the expected value, considering the
rewards and the expected future values of the next states.
3. Repeat: Repeat the update step for all states until the values converge. Convergence means that
the changes in state values are smaller than a pre-defined threshold ϵ (e.g., 0.01), and further
iterations won’t change the values much.
4. Extract the Optimal Policy: Once the value function converges, you can derive the optimal policy
by selecting the action that maximizes the expected value for each state:
π*(s) = arg max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) )
This step gives you the optimal action to take at each state based on the optimal value function.
Example: Grid World
States: The grid cells are states (e.g., positions like (1, 1), (1, 2), etc.).
Actions: The agent can move up, down, left, or right.
Rewards: A reward of +10 for reaching the goal state (G), and -1 for every step taken.
Discount factor: γ = 0.9.
Grid Layout:
+---+---+---+
| S |   | G |
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
S = Start state.
G = Goal state.
The agent receives a +10 reward for reaching G, and -1 for each move it makes.
1. Initialization: Set the initial values of all states to 0 (or any arbitrary value).
2. First Iteration:
For state S , the agent can move right, and the next state is (1, 2). The reward is -1, and the
next value is V (1, 2) = 0.
V (S) = max(−1 + 0.9 × 0) = −1.
For state (1, 2), the agent can move right toward the goal. The reward for the move is -1, and V (G) = 10 (since the goal state holds its final reward of 10).
V (1, 2) = max(−1 + 0.9 × 10) = 8.
For the goal G, no further moves are made, so:
V (G) = 10 (since it’s the final reward).
After the first iteration:
V (S) = −1, V (1, 2) = 8, V (G) = 10.
3. Subsequent Iterations: The values will continue to update, iterating through all states, and
gradually converging to their final optimal values.
4. Convergence: The values will eventually stabilize (i.e., the updates will become very small), and at
that point, the value iteration process ends.
Key Takeaways:
Value Iteration computes the optimal value function for each state, and from this, the optimal
policy can be extracted.
It works by iterating over all states and updating their values based on the Bellman Optimality
Equation until the values converge.
After convergence, the agent can use the optimal value function to decide the best action (optimal
policy) for each state.
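Below is a minimal value-iteration sketch in Python for a grid world like the one above. It follows the bookkeeping used in the worked example (every move costs -1, the goal keeps a fixed value of 10, γ = 0.9); the layout with S at (0, 0) and G at (0, 2) and the threshold are assumptions for illustration:

```python
GAMMA, THETA = 0.9, 0.01
ROWS, COLS, GOAL = 3, 3, (0, 2)
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_state(s, a):
    nr, nc = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s  # wall: stay put

V = {s: 0.0 for s in STATES}
V[GOAL] = 10.0                       # the goal's value is its final reward

while True:
    delta = 0.0
    for s in STATES:
        if s == GOAL:
            continue                 # no further moves from the goal
        best = max(-1 + GAMMA * V[next_state(s, a)] for a in MOVES)
        delta, V[s] = max(delta, abs(best - V[s])), best
    if delta < THETA:                # values have converged
        break

# Extract the optimal policy: pick the action with the highest one-step value.
policy = {s: max(MOVES, key=lambda a: -1 + GAMMA * V[next_state(s, a)])
          for s in STATES if s != GOAL}
print(V[(0, 1)], V[(0, 0)])          # ≈ 8.0 next to the goal, ≈ 6.2 at the start
```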
Policy Iteration
Policy Iteration is another algorithm used to find the optimal policy in a Markov Decision Process
(MDP). Unlike Value Iteration, which directly updates the value function, Policy Iteration alternates
between policy evaluation (calculating the value function for a given policy) and policy improvement
(improving the policy based on the calculated value function) until convergence.
1. Initialize a Policy: Start with an initial policy. This can be arbitrary, such as randomly selecting an
action for each state or using a simple default policy (e.g., always moving in a certain direction).
2. Policy Evaluation: Given the current policy, evaluate its value function. This means calculating the
state values V π (s) for each state under the current policy. The goal is to determine the expected
reward when following the policy from each state.
V^π(s) = R(s, π(s)) + γ ∑_{s′} P(s′|s, π(s)) V^π(s′)
Where:
V π (s) is the value of state s under the policy π ,
π(s) is the action selected by the policy for state s,
R(s, π(s)) is the immediate reward for taking the action π(s) in state s,
γ is the discount factor,
P (s′ ∣s, π(s)) is the probability of transitioning to state s′ from state s by taking action π(s).
3. Policy Improvement: Once the value function V π (s) for the current policy is calculated, update
the policy. For each state, choose the action that maximizes the expected reward given the current
value function V π (s):
π′(s) = arg max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V^π(s′) )
This step improves the policy by choosing the best action for each state based on the current value
function.
4. Repeat: Repeat the Policy Evaluation and Policy Improvement steps until the policy stabilizes,
i.e., the policy no longer changes. When the policy stops changing, the process has converged, and
you have the optimal policy.
Let’s consider the same gridworld example with a goal (G) and a start state (S), where the agent needs
to reach the goal while minimizing penalties.
Grid Layout:
+---+---+---+
| S |   | G |
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
S = Start state.
G = Goal state.
The agent receives a +10 reward for reaching G, and -1 for each move it makes.
Discount factor γ = 0.9.
1. Initialize Policy: Start with an initial policy. Let’s say the policy tells the agent to move right at
every state.
Policy:
S → Right
(1,1) → Right
(1,2) → Right
G → Stay (because it's the goal)
2. Policy Evaluation: Using the current policy, evaluate the value function V π (s). We will calculate
the value of each state based on the policy. For example, if the agent is at (1, 1) and the policy tells
it to move right, we calculate the value for this state by considering the reward for the action and
the value of the next state (since the agent moves to (1, 2)).
V π (S) = −1 + 0.9 × V π (1, 2)
V π (1, 2) = −1 + 0.9 × V π (G)
V π (G) = 10
Plugging in V π (G) = 10 gives V π (1, 2) = −1 + 0.9 × 10 = 8 and V π (S) = −1 + 0.9 × 8 = 6.2.
After running this evaluation for all states, we get the values of each state.
3. Policy Improvement: Once the value function is calculated, improve the policy by selecting the
best action at each state. The goal is to maximize the expected reward given the value function.
For each state, the policy will be updated. For example:
At (1, 1), the policy might keep moving right (because it already maximizes the reward).
At (1, 2), the policy might stay the same or change, depending on the value of other states.
4. Repeat:
After improving the policy, re-evaluate it and update it again.
Continue until the policy converges (i.e., the policy does not change anymore).
5. Convergence: When the policy stops changing, the process is complete, and the agent now has
the optimal policy.
Key Points:
Policy Iteration alternates between policy evaluation (calculating the value function for the
current policy) and policy improvement (improving the policy based on the current value
function).
The algorithm repeats this process until the policy converges to the optimal policy, where further
policy updates will not change it.
Value Iteration updates the value function directly, while Policy Iteration improves the policy by
evaluating it and refining it in each step.
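A matching policy-iteration sketch under the same assumptions (every move costs -1, the goal at (0, 2) keeps value 10, γ = 0.9, and the initial policy moves right everywhere, as in the listing above):

```python
GAMMA, THETA = 0.9, 0.01
ROWS, COLS, GOAL = 3, 3, (0, 2)
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_state(s, a):
    nr, nc = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s

policy = {s: "right" for s in STATES if s != GOAL}   # initial policy: always move right
V = {s: 0.0 for s in STATES}
V[GOAL] = 10.0

while True:
    # 1. Policy evaluation: compute V^pi for the current policy.
    while True:
        delta = 0.0
        for s in policy:
            new_v = -1 + GAMMA * V[next_state(s, policy[s])]
            delta, V[s] = max(delta, abs(new_v - V[s])), new_v
        if delta < THETA:
            break
    # 2. Policy improvement: act greedily with respect to V^pi.
    stable = True
    for s in policy:
        best = max(MOVES, key=lambda a: -1 + GAMMA * V[next_state(s, a)])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:          # the policy stopped changing -> it is optimal
        break
```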
UNIT - III
The Reinforcement Learning problem, prediction and control problems, Model-based algorithm,
Monte Carlo methods for prediction, and Online implementation of Monte Carlo policy evaluation
This unit focuses on Reinforcement Learning (RL), covering key topics such as prediction and control
problems, model-based algorithms, Monte Carlo methods for prediction, and online Monte Carlo
policy evaluation. Below is a breakdown of each concept with simple explanations and examples.
1. The Reinforcement Learning Problem
In the RL problem, an agent interacts with an environment: it observes the current state, takes an action, receives a reward, and tries to learn a policy that maximizes the total reward over time.
Example: Self-driving Car
A self-driving car (the agent) observes the road and traffic (the state), chooses actions such as turning, braking, or stopping at a red light, and receives rewards for safe and efficient driving.
2. Prediction and Control Problems
Prediction Problem: Given a fixed policy, estimate how good it is, i.e., estimate its value function.
Example:
In a chess game, if we have a policy that always moves the queen aggressively, the prediction problem is to estimate how good this strategy is in terms of winning chances.
Control Problem:
The goal is to find the best policy that maximizes the expected reward.
The agent improves its policy over time using exploration and exploitation.
Example:
In game playing, the agent tries different strategies and improves its moves to win more games.
3. Model-Based Algorithm
What is a Model-Based Algorithm?
The agent knows or learns the environment dynamics (state transition probabilities).
It simulates future steps using the learned model to improve decision-making.
Example: Chess AI
A chess-playing AI like AlphaZero first learns how the game works (legal moves, opponent
reactions).
It simulates moves and outcomes before making a decision.
4. Monte Carlo Methods for Prediction
What is Monte Carlo Prediction?
Monte Carlo (MC) methods estimate value functions using experience (sampled episodes).
The agent learns by playing many episodes and averaging the rewards received.
V (s) ≈ (1/N) ∑_{i=1}^{N} G_i
Where:
N is the number of episodes in which state s was visited,
G_i is the return (total reward) observed after visiting s in episode i.
5. Online Implementation of Monte Carlo Policy Evaluation
Instead of collecting all the returns and averaging them only at the very end, we update the value function incrementally after each episode.
It allows the agent to learn faster in dynamic environments.
Instead of waiting until the end of the year, an AI trading agent updates its strategy daily based
on new data.
This helps the agent quickly adapt to market changes.
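A minimal sketch of this incremental (online) form of Monte Carlo policy evaluation, assuming a run_episode() helper that follows the policy being evaluated and returns the visited (state, reward) pairs (the helper and the α and γ values are illustrative assumptions):

```python
def mc_policy_evaluation(run_episode, num_episodes=1000, gamma=0.9, alpha=0.1):
    """Every-visit Monte Carlo prediction with an incremental update:
    after each episode, nudge V(s) toward the observed return G instead of
    storing every return and averaging at the end."""
    V = {}
    for _ in range(num_episodes):
        episode = run_episode()          # list of (state, reward) pairs
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G       # return that followed this visit
            V.setdefault(state, 0.0)
            V[state] += alpha * (G - V[state])
    return V
```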
Summary
Reinforcement Learning helps an agent learn the best actions by trial and error.
Prediction problems estimate how good a policy is, while control problems improve the policy.
Model-based RL learns the environment’s structure, while Monte Carlo methods learn from
sampled experiences.
Monte Carlo prediction helps estimate value functions, and online updates make learning faster.
Does the value function estimate the future reward?
Yes! The value function estimates the expected future reward from a given state or state-action pair.
1. State-Value Function (V(s)):
Estimates the expected reward starting from state s and following a policy π.
Formula: V^π(s) = E_π [ ∑_{t=0}^{∞} γ^t R_t ∣ s_0 = s ]
Example: In chess, if you're in a certain board position, V (s) tells you the expected future
reward if you continue playing optimally.
2. Action-Value Function (Q(s, a)):
Estimates the expected reward if you take action a in state s and follow policy π .
Formula: Q^π(s, a) = E_π [ ∑_{t=0}^{∞} γ^t R_t ∣ s_0 = s, a_0 = a ]
Example: In a self-driving car, Q(s, turn left) tells you how good turning left is in a specific
situation.
Key Takeaway:
Yes, the value function predicts future rewards by estimating the total return the agent expects to receive from a state or action. 🚀
Reinforcement Learning (RL) is about an agent learning to make the best decisions by interacting with
an environment and receiving rewards based on its actions. The goal is to find the best strategy
(policy) that maximizes the total reward over time.
RL Components:
1. Agent – The learner or decision maker (e.g., the self-driving car).
2. Environment – The world the agent interacts with (e.g., the roads and traffic).
3. State (s) – The current situation the agent observes (e.g., approaching a red light).
4. Action (a) – A choice the agent makes in a state (e.g., stop or keep driving).
5. Reward (r) – The feedback (positive or negative) based on the action (e.g., stopping at a red light =
+10 points, running a red light = -50 points).
6. Policy (π ) – The strategy the agent follows to choose actions.
Simple Example: Training a Dog
1. Agent = Dog
2. Environment = Your home
3. State = Dog is sitting
4. Action = Dog jumps
5. Reward = You give a treat (+1) if the dog jumps when told, or no treat (0) if it doesn’t.
Goal: The dog learns over time that jumping on command gets a treat and not jumping doesn’t. It will
adjust its behavior to maximize treats (rewards).
Key Takeaway:
RL is about learning from experience through trial and error. The agent tries different actions, receives
rewards or penalties, and improves its strategy to achieve the best long-term results. 🚀