Unit 1, 2 RL
Regret is a feeling of sorrow or disappointment over something that has happened or something one
has done, often related to a perceived mistake or missed opportunity. It can stem from the realization
that a different action or decision could have led to a better outcome.
Example:
Imagine you're deciding between two job offers. One job offers a higher salary, but the other offers
more work-life balance and aligns better with your long-term goals. After much thought, you decide to
take the higher-paying job. A few months later, you realize you're working long hours, feeling stressed,
and have less time for family or personal activities. In this situation, you might feel regret, wishing you
had chosen the job that offered better work-life balance instead.
Regret typically involves reflecting on past choices and wishing for a different result, often coupled with
the belief that a better outcome could have been achieved with another decision.
Regret is when you feel sad or disappointed about a decision you made because you wish you had done
something differently.
Simple Example:
You have two ice cream flavors to choose from—chocolate and vanilla. You pick chocolate, but after
tasting it, you realize you would have liked vanilla more. Later, you feel regret because you wish you had
chosen vanilla instead.
UNIT - II
Markov Decision Problem, policy, and value function, Reward models (infinite discounted, total,
finite horizon, and average), Episodic & continuing tasks, Bellman's optimality operator, and Value
iteration & policy iteration
Unit II covers foundational concepts in Markov Decision Processes (MDPs), which are used to model
decision-making problems where outcomes are partly random and partly under the control of an agent.
Below is a breakdown of the key concepts in this unit:
1. Markov Decision Process (MDP)
An MDP is a mathematical framework for modeling decision-making in situations where outcomes are
uncertain. An MDP is defined by:
States (S): The set of all possible states the system can be in.
Actions (A): The set of actions the agent can take.
Transition Model (P): The probability distribution over the next state given the current state
and action, i.e., P(s'|s, a).
Reward Model (R): A function that gives the immediate reward received after taking action a
in state s, transitioning to state s′ .
Discount Factor (γ): A value between 0 and 1 that represents the importance of future
rewards compared to immediate rewards.
2. Policy
A policy is a strategy used by the agent to decide which action to take in each state. It can be
either:
Deterministic Policy: A specific action is chosen for each state.
Stochastic Policy: The action chosen for each state is probabilistic.
3. Value Function
The value function measures the expected long-term reward an agent can achieve starting from a
given state, following a particular policy. It is denoted as:
V (s) = Expected return from state s, following the policy.
The value of a state is important in decision-making because it helps the agent estimate the
desirability of being in a state.
4. Reward Models
Reward models define how rewards are calculated in MDPs. The four types are:
Infinite Discounted Reward: The sum of rewards over time, discounted by a factor γ for each time step, i.e., R = ∑_{t=0}^{∞} γ^t r_t. This model is used when an agent is interested in long-term rewards but values immediate rewards more.
Total Reward: The sum of all rewards received over time, without any discounting.
Finite Horizon Reward: The sum of rewards over a fixed number of time steps T, after which the process ends.
Average Reward: The average reward per time step over an infinite number of steps, useful
when the goal is to maximize long-term steady rewards.
5. Episodic & Continuing Tasks
Episodic Tasks: In episodic tasks, the decision-making process is divided into episodes (each episode ends when a goal is reached or after a fixed number of steps), and each episode starts fresh. The goal is to maximize the reward within each episode.
Continuing Tasks: These tasks have no natural endpoint. The agent continually interacts with the
environment, and the goal is to maximize long-term rewards over an indefinite period.
6. Bellman’s Optimality Equation
Bellman’s Optimality Equation expresses the relationship between the value function of a state and the value function of its possible successor states. For the optimal policy π*, the Bellman equation is:
V*(s) = max_a [ R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) ]
This equation helps find the optimal value function, V ∗ (s), by considering all possible actions and
their corresponding rewards and value estimates.
7. Value Iteration
Value Iteration is an algorithm used to compute the optimal value function for an MDP. The idea
is to iteratively update the value of each state using the Bellman optimality equation until the
values converge to the optimal values.
Start with an initial guess for the values of all states.
Iteratively update the value of each state using the Bellman equation.
Repeat this until the values converge (i.e., changes in the values are smaller than a threshold).
8. Policy Iteration
Policy Iteration is another algorithm to solve MDPs, which alternates between two steps:
1. Policy Evaluation: Given a policy, calculate the value function for all states.
2. Policy Improvement: Based on the value function, improve the policy by choosing actions
that maximize the expected reward.
Repeat these steps until the policy converges (i.e., no further improvements can be made).
These methods form the core of solving MDPs and help in finding the best policy to maximize long-term
rewards in decision-making scenarios.
If you need a more detailed explanation of any of these topics, feel free to ask!
1. States (S): All the possible situations or conditions the system can be in.
2. Actions (A): The set of choices the agent can make to affect the system.
3. Transition Model (P): The probability of moving from one state to another after taking an action.
4. Reward (R): The immediate benefit the agent gets from taking an action in a state.
5. Discount Factor (γ): A value between 0 and 1 that represents how much future rewards are valued
compared to immediate rewards.
Simple Example:
Imagine you're playing a robot navigation game in a grid world. The robot is in a 3x3 grid and can
move in four directions: up, down, left, or right. The goal is to get to the target position (bottom-right
corner) from the starting position (top-left corner).
1. States (S): Each cell in the grid is a state. For example, S = { (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0),
(2,1), (2,2) } where (0,0) is the top-left and (2,2) is the target.
2. Actions (A): The robot can move in one of four directions: up, down, left, or right. A = {up, down,
left, right}.
3. Transition Model (P): The transition model defines the probability of moving from one state to
another after taking an action. For example, if the robot is at (0,0) and it moves right, it will
transition to (0,1). If the robot moves down from (0,0), it will go to (1,0), and so on.
4. Reward (R): The reward is given based on the action taken. For example:
Moving to the target (2,2) gives a reward of +10 (this is the goal).
Moving to any other cell gives a reward of -1 (cost of movement).
If the robot hits a wall or tries to move outside the grid, it stays in the same position and gets
a reward of -1.
5. Discount Factor (γ): If the robot should value immediate rewards somewhat more than future ones, a discount factor of γ = 0.9 might be used. This means the robot values a reward received one step in the future at 90% of an immediate reward.
Example Process:
At each time step, the robot observes its current cell (state), chooses a move (action), receives a reward, and lands in a new cell. The robot continues this process until it reaches the target (2,2), and its objective is to maximize the total
reward.
The robot needs to learn a policy, which is a strategy for choosing actions that will help it reach the
target with the maximum reward. The policy could be, for example, "always move towards the right
unless blocked," or more complex, depending on the environment and the goal.
In this scenario, the MDP helps the robot decide which actions to take to achieve the highest possible
total reward in the long run, considering the uncertainty in the outcomes (e.g., when moving left or right
might lead to different states).
This is a basic MDP setup, but it can be expanded with more complex rewards, states, and actions to
represent more intricate decision-making problems.
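To make the setup concrete, here is a minimal Python sketch of this 3x3 grid world as an MDP, assuming deterministic moves, the -1 step cost, the +10 reward for entering the target (2,2), the stay-in-place rule at walls, and γ = 0.9 described above (the function and variable names are illustrative, not from any particular library):

```python
# Minimal sketch of the 3x3 grid-world MDP described above.
GAMMA = 0.9
STATES = [(r, c) for r in range(3) for c in range(3)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (2, 2)

def step(state, action):
    """Return (next_state, reward) for one deterministic transition."""
    dr, dc = ACTIONS[action]
    nr, nc = state[0] + dr, state[1] + dc
    # Hitting a wall: the robot stays where it is.
    next_state = (nr, nc) if 0 <= nr < 3 and 0 <= nc < 3 else state
    reward = 10 if next_state == GOAL else -1   # +10 for the target, -1 otherwise
    return next_state, reward
```

For example, step((0, 0), "right") returns ((0, 1), -1), matching the transitions described above.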
Explain policy and value functions with simple formulas and examples
In the context of a Markov Decision Process (MDP), the policy and value function are essential for
understanding how the agent behaves and how good the states are, respectively.
1. Policy (π)
A policy is a strategy or rule that the agent follows to decide which action to take in each state. It can be
deterministic or stochastic.
Deterministic Policy: The agent always chooses the same action in a given state.
Stochastic Policy: The agent chooses actions probabilistically based on the state.
The policy π(s) defines the action the agent should take in state s.
For example, if the agent is in state s = (0, 0), the policy might suggest π(0, 0) = right, meaning the agent should move right when it is in state (0, 0).
Example of a Policy:
Consider the grid world example from before (3x3 grid). Here's a simple deterministic policy:
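Since the original policy table is not reproduced here, below is one plausible deterministic policy written as a Python dictionary (an illustrative assumption: head right along each row, then down the last column to the goal at (2, 2)):

```python
# One possible deterministic policy for the 3x3 grid (illustrative only).
policy = {
    (0, 0): "right", (0, 1): "right", (0, 2): "down",
    (1, 0): "right", (1, 1): "right", (1, 2): "down",
    (2, 0): "right", (2, 1): "right", (2, 2): None,  # goal: no action needed
}
```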
This policy guides the agent on which action to take in each state.
2. Value Function (V)
The value function V (s) gives the expected return (or reward) from a state s, following a certain policy.
It represents how good it is to be in a particular state, considering the long-term rewards.
V^π(s) = E_π [ ∑_{t=0}^{∞} γ^t R_t ∣ s_0 = s ]
Where:
R_t is the reward received at time step t,
γ is the discount factor,
E_π denotes the expectation when the agent follows policy π.
In simpler terms, the value function tells us how much future rewards we can expect from a given state,
assuming the agent follows the policy π .
Let’s consider the grid world example and use a simple reward system. We will assume:
Each move gives a reward of -1.
The goal state (2, 2) has a value of V (2, 2) = 10.
The discount factor is γ = 0.9.
If the agent follows the deterministic policy we defined earlier, we can estimate the value of state (0, 0).
The agent moves right to (0, 1), and it receives -1 as the reward.
It continues until it reaches the goal at (2, 2), getting rewards at each step.
Let's assume the value of state (2, 2) (the goal) is V (2, 2) = 10.
We can calculate the value of state (1, 1) as:
V (1, 1) = −1 + γ ⋅ V (2, 1)
where γ is the discount factor. If we assume γ = 0.9, and let's say V (2, 1) = 8 (a previously computed
value based on the policy), then:
V (1, 1) = −1 + 0.9 × 8 = −1 + 7.2 = 6.2
Similarly, you would continue calculating values for other states based on the rewards and their
subsequent states, updating each state's value until all values converge.
A simpler example with just two states:
State 0 (start): The agent starts here. The value is based on expected future rewards.
State 1 (goal): The goal state with a reward of +10.
Assume the agent will always move towards the goal with a reward of -1 for each move.
Policy: Always move from State 0 to State 1 (the goal).
V (0) = −1 + γ ⋅ V (1)
V (0) = −1 + 0.9 ⋅ 10 = −1 + 9 = 8
Summary:
Policy (π): A strategy that tells the agent which action to take in each state.
Value Function (V): A measure of how good it is to be in a given state, based on expected future
rewards.
These functions guide the agent in making decisions that maximize long-term rewards in an MDP.
We use Markov Decision Processes (MDP) to model situations where an agent (like a robot, player, or
system) needs to make decisions in environments that are both uncertain and sequential. In simple
words, an MDP helps us figure out the best actions for an agent to take in order to achieve a goal,
especially when the future is not completely predictable.
1. Decision-Making Under Uncertainty: In real life, many decisions are made in environments
where we don't know exactly what will happen next. MDP allows us to make decisions even when
outcomes are uncertain.
2. Long-Term Planning: MDP helps us focus not just on immediate rewards, but also on the long-
term rewards (like future benefits), so that the agent can plan better.
3. Optimal Strategy: MDP helps the agent learn or find the best strategy (called a policy) to achieve
the goal, considering both current and future situations.
Simple Example:
Imagine you are a robot in a maze. You don't know exactly where the obstacles are (uncertainty), and
each move you make affects your position in the maze. You want to get to the goal in the shortest path
possible, but some moves might take you away from the goal. MDP helps you figure out the best way to
move, considering both the immediate rewards (like avoiding obstacles) and long-term rewards (like
reaching the goal).
In short, MDP provides a way to make smart decisions when you have to choose between different
actions, considering both current and future outcomes.
In a Markov Decision Process (MDP), the next move and future actions are predicted based on
probabilities and rewards associated with the current state, action taken, and subsequent states. The
agent doesn't have perfect knowledge of the future but relies on a framework that helps make the best
decisions based on the current information.
Here's how the MDP predicts the next move and future:
1. Transition Probabilities (P):
The agent uses the transition probabilities to predict the likelihood of moving from one state to
another after performing an action.
For example, if the agent is in state s and takes action a, the transition model tells the agent the
probability of ending up in a new state s′ after taking that action.
2. Rewards (R):
After each move, the agent receives a reward based on the action it took. This reward helps the agent
understand whether the action was good or bad, guiding it towards better decisions in the future.
The agent uses these rewards to evaluate actions. A positive reward signals a good decision, while a
negative reward signals a bad one.
3. Value Function (V):
The value function estimates how good it is to be in a particular state, considering not just the
immediate rewards but the future rewards as well. The value function is updated based on the agent's
experience.
For state s, the value function V (s) predicts the expected return (total reward) from that state,
considering both current and future rewards.
V (s) = E [ ∑_{t=0}^{∞} γ^t R(s_t) ∣ s_0 = s ]
Where:
γ is the discount factor, which decides how much future rewards are worth compared to
immediate rewards (0 for no future value, 1 for full future value).
The summation represents the total expected reward over time from that state, considering future
actions.
4. Policy (π):
The policy is a strategy that tells the agent what action to take in each state. The goal is to learn the
optimal policy, which maximizes long-term rewards.
The policy is typically represented as π(s) = a for a deterministic policy, or as π(a|s), the probability of taking action a in state s, for a stochastic policy.
5. Choosing the Next Action:
The agent uses the policy to decide what action to take in its current state. The action is chosen based
on the value of future states, as predicted by the value function. The agent picks the action that leads
to the state with the highest predicted value (highest reward).
For example:
If the agent is in state s0 , the policy π(s0 ) will tell it the best action to take (say, move right).
The agent will then predict the next state s1 using the transition model P (s′ ∣s0 , a), and calculate the expected value of s1 to confirm that the chosen action leads to the highest future reward.
6. Future Predictions:
To predict the future and decide on the best course of action, the agent looks at all possible actions and
their corresponding outcomes: the states those actions are likely to lead to (from the transition model), the immediate rewards they yield, and the values of those next states.
Example of Prediction:
Let’s say the agent is in a grid world and wants to reach the goal (top-right corner). For each possible move, the transition model predicts which cell the agent would end up in, and the value function estimates the future reward from that cell.
The agent uses this information to choose the best action that maximizes its future reward.
Summary:
The transition model predicts where each action is likely to lead, the reward signal evaluates each move, the value function estimates the long-term return, and the policy picks the action with the highest expected value.
By following these steps, the agent can make decisions that lead to optimal outcomes, even in uncertain
environments.
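As a sketch of this one-step lookahead, the snippet below picks the greedy action for the current state given a value estimate V. The dictionaries P, R, and V and their layouts are assumptions for illustration, not part of the notes above:

```python
def greedy_action(state, actions, P, R, V, gamma=0.9):
    """Pick the action with the highest expected one-step value:
    R[(s, a)] + gamma * sum over s' of P[(s, a)][s'] * V[s'].
    P[(s, a)] is assumed to map each possible next state to its probability."""
    def lookahead(a):
        return R[(state, a)] + gamma * sum(
            prob * V[s2] for s2, prob in P[(state, a)].items())
    return max(actions, key=lookahead)
```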
Explain Reward models (infinite discounted, total, finite horizon, and average) with simple example
In the context of Markov Decision Processes (MDP), reward models are used to define how the rewards
are accumulated over time. These models determine how the agent evaluates and maximizes the
rewards it receives in different environments. There are different types of reward models:
1. Infinite Discounted Reward
In this model, the agent receives rewards over time, but future rewards are discounted by a factor γ (0 ≤
γ ≤ 1). The idea is that immediate rewards are more valuable than future rewards.
Formula:
R_discounted = ∑_{t=0}^{∞} γ^t R_t
Where:
R_t is the reward received at time step t,
γ is the discount factor (0 ≤ γ ≤ 1).
Example:
Suppose an agent receives the following rewards in each time step: 10, 5, 3, 2, ...
R0 = 10, R1 = 5, R2 = 3, R3 = 2
If we assume γ = 0.9, the discounted return is R = 10 + 0.9 × 5 + 0.9² × 3 + 0.9³ × 2 ≈ 18.39.
Here, the agent values immediate rewards more than future rewards because of the discount factor γ.
2. Total Reward
In this model, the agent receives rewards over time without any discount. It simply sums up all the
rewards received, regardless of when they occur.
Formula:
R_total = ∑_{t=0}^{∞} R_t
Where:
R_t is the reward received at time step t.
Example:
Suppose the agent receives the following rewards in each time step: 10, 5, 3, 2, ...
R0 = 10, R1 = 5, R2 = 3, R3 = 2
Rtotal = 10 + 5 + 3 + 2 = 20
In this case, there is no discounting, so the agent simply adds up all the rewards it receives.
3. Finite Horizon Reward
In the finite horizon model, the agent receives rewards over a finite number of time steps. After a
certain number of steps, the process ends, and no future rewards are received.
Formula:
R_finite horizon = ∑_{t=0}^{T−1} R_t
Where:
R_t is the reward received at time step t,
T is the horizon (the fixed number of time steps).
Example:
Suppose the agent receives rewards: 10, 5, 3, 2, and we have a finite horizon of 3 steps.
R0 = 10, R1 = 5, R2 = 3
Rfinite horizon = 10 + 5 + 3 = 18
In this case, even if the agent would have continued to receive rewards beyond time step 2, it stops after
3 steps, and no future rewards are considered.
4. Average Reward
The average reward model focuses on the long-term average reward per time step. It looks at how
much reward the agent gets on average for each step it takes over an indefinite period.
Formula:
R_average = lim_{T→∞} (1/T) ∑_{t=0}^{T−1} R_t
Where:
R_t is the reward received at time step t,
T is the number of time steps.
Example:
Suppose the agent receives rewards: 10, 5, 3, 2, and so on. To calculate the average reward, we take the
average of all rewards received over time.
R0 = 10, R1 = 5, R2 = 3, R3 = 2
R_average = (10 + 5 + 3 + 2) / 4 = 20 / 4 = 5
If the agent keeps receiving rewards in a similar pattern as it takes more steps, the long-run average settles around this value.
Summary:
Infinite Discounted Reward: Rewards are discounted over time. The agent values immediate
rewards more than future rewards.
Total Reward: The agent simply adds up all rewards over time without any discounting.
Finite Horizon Reward: Rewards are considered over a limited number of steps, after which the
process ends.
Average Reward: The agent considers the long-term average reward per time step, focusing on
sustained performance over time.
Each of these reward models is useful in different types of decision-making scenarios, depending on
how we want the agent to prioritize immediate versus future rewards, and whether we want to limit or
focus on long-term behavior.
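As a quick check of the four models on the example reward stream 10, 5, 3, 2, the short Python sketch below computes each quantity (γ = 0.9 is an assumed discount factor; T = 3 matches the finite-horizon example above):

```python
rewards = [10, 5, 3, 2]
gamma = 0.9

discounted = sum(gamma**t * r for t, r in enumerate(rewards))   # ≈ 18.39
total = sum(rewards)                                            # 20
finite_horizon = sum(rewards[:3])                               # 18 (horizon T = 3)
average = sum(rewards) / len(rewards)                           # 5.0

print(discounted, total, finite_horizon, average)
```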
In the finite horizon model, why is there no reward after the final step? Has the agent reached the goal?
In the finite horizon reward model, the idea is that the agent operates within a fixed period of time
(the horizon), which is set in advance. This means that after a certain number of steps (the horizon T ),
the agent's decision-making process ends—there are no more rewards considered beyond that point,
regardless of whether the agent has reached its goal or not.
The finite horizon simply means that the agent's environment or task is constrained by a limited time
frame, and the model does not account for events or decisions beyond that horizon. This is a modeling
choice that helps simplify the problem. In reality, you could extend the time frame, but for the finite
horizon model, the task is considered completed after the set number of steps.
So, no rewards after the finite horizon because the process is defined to end at time T , regardless of
whether the goal is reached. The focus is on maximizing rewards within that fixed time period.
Whether or not the agent reaches its goal within the finite horizon depends on the task and how the
agent behaves. The goal could be to reach a specific state, accumulate a certain amount of reward, or
complete a particular action within the time horizon.
If the agent reaches the goal before the horizon ends, it will still receive rewards for the actions
taken during that time.
If the agent doesn't reach the goal by the end of the horizon, the task simply stops at time T , and
there are no further rewards considered.
Example to Clarify:
Imagine a simple task where an agent is trying to reach a goal in a grid. The agent can move up, down,
left, or right, and it receives a reward of +10 if it reaches the goal (bottom-right corner), or a reward of -1
for each move.
Suppose the horizon is T = 4. At time t = 0, the agent moves right and gets a reward of -1.
At time t = 1, the agent moves down and gets another reward of -1.
At time t = 2, the agent moves right again and gets a reward of -1.
At time t = 3, the agent moves up and gets another reward of -1.
After time step t = 3, the task ends (because we defined a finite horizon of 4), and the agent has not
reached the goal. No further rewards are given after time step 3, even though the agent could
potentially have reached the goal if more time was available.
Key Takeaways:
In a finite horizon, the agent's task ends after a fixed number of steps, and no rewards are
considered beyond that point.
The goal may or may not be reached by the end of the horizon.
The agent only cares about maximizing rewards within the finite period (horizon T ).
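A tiny sketch of the finite-horizon idea: the return simply ignores anything after step T − 1, whether or not the goal was reached (the reward list below mirrors the -1-per-move example and is illustrative only):

```python
def finite_horizon_return(rewards, T):
    # Only the first T rewards count; the task is defined to end at time T.
    return sum(rewards[:T])

# Four moves at -1 each, goal never reached within the horizon:
print(finite_horizon_return([-1, -1, -1, -1, -1, -1], T=4))  # -4
```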
The use of negative rewards (like -1) is a way to penalize the agent for undesirable actions or to
encourage it to avoid certain behaviors. In reinforcement learning (RL), rewards guide the agent's
learning process, helping it distinguish between actions that lead to good outcomes and those that
don't.
Here’s why negative rewards like -1 might be used:
Negative rewards can be used to discourage the agent from taking actions that are not aligned with its
goals. If the agent performs an action that doesn't help it reach its objective (like moving away from the
goal), it can be penalized with a negative reward to steer it in the right direction.
Example:
In a grid world, where the agent’s goal is to reach a specific location, each time step the agent spends
away from its goal could be penalized with a negative reward (e.g., -1 per step). This discourages the
agent from aimlessly wandering and encourages it to find the quickest path.
In many problems, especially those involving navigation or tasks with multiple time steps, the agent
might be penalized for each action it takes. This reflects the idea that the agent should aim to complete
its task as quickly as possible, reducing unnecessary steps.
Example:
If you're solving a maze, each move might cost the agent -1 to encourage it to find the shortest path to
the goal.
If the agent makes a mistake, such as moving into an invalid state or not following the rules, negative
rewards can be assigned to signal that the action was a bad decision.
Example:
In a game where the agent has to avoid obstacles, running into an obstacle might result in a negative
reward, like -1, to indicate that it's a poor decision.
In some cases, the agent might receive both positive and negative rewards. Positive rewards (like +10)
are given when the agent performs the right action (like reaching the goal), and negative rewards (like
-1) are given for every other action that doesn’t help the agent achieve its goal.
This balance helps the agent learn that not all actions are equally beneficial. It teaches the agent that
action costs (negative rewards) should be minimized and that goal-reaching actions (positive rewards)
should be maximized.
Negative rewards can also be part of balancing exploration (trying new things) and exploitation
(sticking to what’s known to work). If the agent explores the environment and gets penalized, it learns to
focus on exploiting the actions that lead to higher rewards, but it also avoids costly mistakes by learning
from negative penalties.
Let’s use a simple grid world where the agent must move from the top-left corner to the bottom-right
corner.
Positive reward (+10): When the agent reaches the goal (bottom-right corner).
Negative reward (-1): For each step taken (to encourage the agent to minimize the number of
steps).
Big negative reward (-5): If the agent steps into an obstacle or goes in the wrong direction.
Situation:
At each time step, the agent can move in one of four directions.
If it moves towards the goal, it gets -1, but when it reaches the goal, it gets +10.
If it moves into an obstacle, it gets a penalty of -5.
In this case, the negative rewards penalize the agent for actions that don’t bring it closer to the goal or
involve mistakes (like hitting an obstacle), while the positive reward of +10 is given for successfully
reaching the goal. The agent then learns to avoid unnecessary steps and mistakes, aiming to reach the
goal in the fewest moves.
Summary:
By carefully balancing positive and negative rewards, the agent learns to make decisions that maximize
its long-term cumulative reward.
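A minimal sketch of such a reward function for the grid example, assuming the +10 / -1 / -5 scheme described above (the goal position and obstacle set are illustrative parameters):

```python
def reward(state, action, next_state, goal=(2, 2), obstacles=frozenset()):
    """Hypothetical reward function mirroring the scheme described above."""
    if next_state == goal:
        return 10     # reaching the goal
    if next_state in obstacles:
        return -5     # stepping into an obstacle / a clearly wrong move
    return -1         # cost of every other move
```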
In reinforcement learning (RL), tasks can be broadly classified into two categories based on whether
they have a fixed endpoint or continue indefinitely. These categories are called episodic tasks and
continuing tasks.
1. Episodic Tasks
An episodic task is one where the process is divided into episodes. Each episode consists of a sequence
of actions and rewards, and the task ends after a fixed number of steps or when the agent reaches a
specific goal. After an episode ends, the environment is reset, and the agent starts a new episode.
Key Characteristics:
Fixed duration: The task ends after a set number of steps or when a goal is reached.
Resetting: After each episode ends, the environment is reset, and the agent starts over.
Goal-oriented: These tasks are usually structured so the agent has a goal to reach by the end of
the episode.
Gridworld Navigation: Suppose the agent is in a 5x5 grid, and its goal is to reach the bottom-right
corner. The agent starts at the top-left corner. It moves in one of four directions (up, down, left,
right), and the task ends when the agent reaches the goal.
Start: The agent starts at (0, 0).
End: The agent reaches the goal at (4, 4).
The task is episodic because once the agent reaches the goal, the environment is reset, and
the agent starts a new episode from (0, 0).
Each time the agent reaches the goal, an episode is considered completed, and the environment is reset
to begin a new episode.
2. Continuing Tasks
A continuing task is one where there is no fixed endpoint, and the agent interacts with the
environment continuously over an indefinite period. There is no concept of episodes; the agent keeps
acting and receiving rewards until some external stopping condition is met (like a time limit or a
specified condition to stop).
Key Characteristics:
No fixed duration: The task doesn't have an end after a specific number of steps.
Ongoing process: The task continues indefinitely, with no reset after each action or step.
No clear goal: These tasks usually do not have a clear endpoint or goal to reach; instead, the agent
aims to keep maximizing rewards over time.
Robotic Vacuum Cleaning: Imagine a robotic vacuum cleaner that moves around a house,
cleaning the floor. The task doesn't end unless the vacuum is turned off, or the battery runs out. It
continues to receive rewards (e.g., -0.1 for each step to encourage efficiency and +10 for
completing a cleaning zone).
The robot moves, cleans, and keeps working continuously.
There is no fixed end to the task. The robot keeps performing actions, and its objective is to
maximize its efficiency while cleaning the house.
This is a continuing task because there is no fixed point at which the task ends, and the
agent keeps interacting with the environment.
Summary of Differences:
Ends after a fixed number of steps or when a goal is Continues indefinitely without a fixed
Duration
reached. endpoint.
Environment does not reset; agent continues
Reset Environment resets after each episode.
interacting.
Often goal-oriented, with the task ending once the No clear goal, focuses on continuous actions
Goal
goal is reached. and rewards.
Gridworld navigation, games (e.g., chess, tic-tac- Robotic vacuum cleaning, stock market
Examples
toe). trading.
Both types of tasks have their own unique challenges and require different strategies for the agent to
learn and maximize rewards effectively.
Explain Bellman's optimality with a simple example
Bellman’s Optimality Equation is a key concept in reinforcement learning and dynamic programming.
It provides a recursive way of calculating the optimal policy (the best way to act) in a Markov Decision
Process (MDP). It tells us how to calculate the value function of a state, which represents the total
expected reward that can be obtained from that state, following the best possible actions.
The Bellman Optimality Equation for the value function V*(s) is given by:
V*(s) = max_a [ R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) ]
Where:
V*(s) is the optimal value of state s,
R(s, a) is the immediate reward for taking action a in state s,
γ is the discount factor,
P(s′|s, a) is the probability of reaching state s′ from state s by taking action a.
This equation tells us that the value of a state is the maximum over all possible actions a, of the reward
from the action plus the expected value of the next state, weighted by the transition probabilities.
Simple Example:
Imagine a grid world where an agent is trying to get to a goal. The grid looks like this:
+---+---+---+
| S |   | G |
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
Steps:
1. Start State S : The agent can take one of the four actions (up, down, left, or right). Each action
leads to a different state, and each state has a corresponding value.
2. Action Calculation: For each action, we calculate the expected value using the formula V*(s) = max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) ).
3. Value of the Goal State (G): Since the goal gives a reward of +10, and once the agent reaches the
goal, there are no more steps to take (it’s an endpoint), the value of the goal state is:
V ∗ (G) = 10
4. Value of State S : Now, let’s compute the value of state S . The agent can move right towards the
goal, and if it reaches the goal, it gets a reward of +10. Alternatively, it could move in other
directions, getting a reward of -1 for each step.
V*(S) = max [ reward for moving right + γ × V*(G), reward for moving down + γ × V*(next state below), … ]
For simplicity, let's treat the move right as leading straight to the goal and assume all other directions lead to non-optimal paths. The calculation will be:
V*(S) = −1 + 0.9 × 10 = 8
So, the value of the start state S is 8, which means that the optimal policy is to move right and
eventually reach the goal, maximizing the cumulative reward.
Bellman’s Optimality Equation helps calculate the value of each state by considering the rewards
and the future values of the next states.
The agent aims to maximize the expected value by choosing the best possible action.
Recursive calculation: The value of a state is recursively defined by the rewards for taking actions
and the values of subsequent states.
Does the policy store the previous game moves and tell the agent which action to take now to get the maximum reward? Is that right?
Yes, you're on the right track!
In reinforcement learning, a policy is a strategy or function that determines the action the agent should
take at each state to maximize its cumulative reward. However, it’s important to clarify that a policy
doesn't directly store all previous game moves. Instead, it typically either maps the current state to an
action or, in more advanced scenarios, it might be state-dependent, meaning it decides on actions
based on the current state and sometimes the history of actions.
Key points:
1. Policy: A policy π(s) defines the action to take for a given state s. In simpler terms, it’s like a set of
rules or a guide that tells the agent what to do next.
Deterministic policy: Always takes the same action for the same state (e.g., π(s) = a).
Stochastic policy: Chooses actions probabilistically for a given state (e.g., π(s) = a1 with probability p and a2 with probability 1 − p).
2. Memory of Past Moves: In many reinforcement learning problems, the agent does not explicitly
store the entire history of moves. The agent only focuses on the current state (in most standard
problems like Markov Decision Processes). However, in some advanced settings like Partially
Observable Markov Decision Processes (POMDPs) or when dealing with recurrent neural
networks (RNNs), the agent can maintain memory of past actions or states. This allows the agent
to take past actions into account when making decisions.
3. Maximizing Reward: The policy’s goal is to maximize the expected cumulative reward. It does so
by choosing actions based on the current state in such a way that future rewards (often discounted
by a factor γ ) are maximized. The agent doesn't need to store all previous game moves explicitly
unless the problem requires it (e.g., for non-Markovian problems).
Let’s consider a simple grid world where the agent’s goal is to reach a goal state (G) from a starting
position (S). The agent can move up, down, left, or right.
At each state, the policy will determine the best action for the agent to take.
In a deterministic policy, for example:
If the agent is at position (1, 1), the policy might decide: "Move right to (1, 2)."
If the agent is at (1, 2), it might decide: "Move right to (1, 3)."
The policy is designed to guide the agent step-by-step based on the current position (state) to
reach the goal while maximizing the rewards.
If the agent receives a penalty for each step (say, -1 for each move), the policy might choose the most
direct route to minimize the total steps and reach the goal.
Summary:
A policy maps the current state to an action; it does not store the entire history of moves. In standard MDPs the current state is enough to decide, and the policy is chosen (or learned) so that following it maximizes the long-term cumulative reward.
Explain Value Iteration
Value Iteration
Value Iteration is an algorithm used to compute the optimal policy and value function in a Markov
Decision Process (MDP). It is a method to find the optimal value function V ∗ (s) for each state, which
can then be used to derive the optimal policy π ∗ (s) for selecting the best actions.
The algorithm iterates through all states, updating the value of each state based on the Bellman
Optimality Equation until the values converge (i.e., changes in the value function become very small).
1. Initialize Values: Start by assigning an initial value to all states. This can be done arbitrarily (e.g.,
setting all states' values to zero).
2. Bellman Update: For each state, update its value by using the Bellman Optimality Equation:
V(s) ← max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V(s′) )
Where:
V (s) is the value of state s,
a is an action taken in state s,
R(s, a) is the reward for taking action a in state s,
γ is the discount factor (between 0 and 1, which determines the importance of future
rewards),
P (s′ ∣s, a) is the transition probability (the probability of reaching state s′ from state s by
taking action a).
Essentially, the update finds the action a that maximizes the expected value, considering the
rewards and the expected future values of the next states.
3. Repeat: Repeat the update step for all states until the values converge. Convergence means that
the changes in state values are smaller than a pre-defined threshold ϵ (e.g., 0.01), and further
iterations won’t change the values much.
4. Extract the Optimal Policy: Once the value function converges, you can derive the optimal policy
by selecting the action that maximizes the expected value for each state:
π*(s) = arg max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V*(s′) )
This step gives you the optimal action to take at each state based on the optimal value function.
Example: Grid World
States: The grid cells are states (e.g., positions like (1, 1), (1, 2), etc.).
Actions: The agent can move up, down, left, or right.
Rewards: A reward of +10 for reaching the goal state (G), and -1 for every step taken.
Discount factor: γ = 0.9.
Grid Layout:
+---+---+---+
| S |   | G |
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
S = Start state.
G = Goal state.
The agent receives a +10 reward for reaching G, and -1 for each move it makes.
1. Initialization: Set the initial values of all states to 0 (or any arbitrary value).
2. First Iteration:
For state S , the agent can move right, and the next state is (1, 2). The reward is -1, and the
next value is V (1, 2) = 0.
V (S) = max(−1 + 0.9 × 0) = −1.
For state (1, 2), the agent can move right toward the goal. The reward for the move is -1, and V (G) = 10 (since the goal state holds its final reward of 10).
V (1, 2) = max(−1 + 0.9 × 10) = 8.
For the goal G, no further moves are made, so:
V (G) = 10 (since it’s the final reward).
After the first iteration:
V (S) = −1, V (1, 2) = 8, V (G) = 10.
3. Subsequent Iterations: The values will continue to update, iterating through all states, and
gradually converging to their final optimal values.
4. Convergence: The values will eventually stabilize (i.e., the updates will become very small), and at
that point, the value iteration process ends.
Key Takeaways:
Value Iteration computes the optimal value function for each state, and from this, the optimal
policy can be extracted.
It works by iterating over all states and updating their values based on the Bellman Optimality
Equation until the values converge.
After convergence, the agent can use the optimal value function to decide the best action (optimal
policy) for each state.
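Below is a minimal value-iteration sketch in Python for a grid world like the one above. It follows the bookkeeping used in the worked example (every move costs -1, the goal keeps a fixed value of 10, γ = 0.9); the layout with S at (0, 0) and G at (0, 2) and the threshold are assumptions for illustration:

```python
GAMMA, THETA = 0.9, 0.01
ROWS, COLS, GOAL = 3, 3, (0, 2)
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_state(s, a):
    nr, nc = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s  # wall: stay put

V = {s: 0.0 for s in STATES}
V[GOAL] = 10.0                       # the goal's value is its final reward

while True:
    delta = 0.0
    for s in STATES:
        if s == GOAL:
            continue                 # no further moves from the goal
        best = max(-1 + GAMMA * V[next_state(s, a)] for a in MOVES)
        delta, V[s] = max(delta, abs(best - V[s])), best
    if delta < THETA:                # values have converged
        break

# Extract the optimal policy: pick the action with the highest one-step value.
policy = {s: max(MOVES, key=lambda a: -1 + GAMMA * V[next_state(s, a)])
          for s in STATES if s != GOAL}
print(V[(0, 1)], V[(0, 0)])          # ≈ 8.0 next to the goal, ≈ 6.2 at the start
```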
Policy Iteration
Policy Iteration is another algorithm used to find the optimal policy in a Markov Decision Process
(MDP). Unlike Value Iteration, which directly updates the value function, Policy Iteration alternates
between policy evaluation (calculating the value function for a given policy) and policy improvement
(improving the policy based on the calculated value function) until convergence.
1. Initialize a Policy: Start with an initial policy. This can be arbitrary, such as randomly selecting an
action for each state or using a simple default policy (e.g., always moving in a certain direction).
2. Policy Evaluation: Given the current policy, evaluate its value function. This means calculating the
state values V π (s) for each state under the current policy. The goal is to determine the expected
reward when following the policy from each state.
V^π(s) = R(s, π(s)) + γ ∑_{s′} P(s′|s, π(s)) V^π(s′)
Where:
V π (s) is the value of state s under the policy π ,
π(s) is the action selected by the policy for state s,
R(s, π(s)) is the immediate reward for taking the action π(s) in state s,
γ is the discount factor,
P (s′ ∣s, π(s)) is the probability of transitioning to state s′ from state s by taking action π(s).
3. Policy Improvement: Once the value function V π (s) for the current policy is calculated, update
the policy. For each state, choose the action that maximizes the expected reward given the current
value function V π (s):
π′(s) = arg max_a ( R(s, a) + γ ∑_{s′} P(s′|s, a) V^π(s′) )
This step improves the policy by choosing the best action for each state based on the current value
function.
4. Repeat: Repeat the Policy Evaluation and Policy Improvement steps until the policy stabilizes,
i.e., the policy no longer changes. When the policy stops changing, the process has converged, and
you have the optimal policy.
Let’s consider the same gridworld example with a goal (G) and a start state (S), where the agent needs
to reach the goal while minimizing penalties.
Grid Layout:
+---+---+---+
| S |   | G |
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
S = Start state.
G = Goal state.
The agent receives a +10 reward for reaching G, and -1 for each move it makes.
Discount factor γ = 0.9.
1. Initialize Policy: Start with an initial policy. Let’s say the policy tells the agent to move right at
every state.
Policy:
S → Right
(1,1) → Right
(1,2) → Right
G → Stay (because it's the goal)
2. Policy Evaluation: Using the current policy, evaluate the value function V π (s). We will calculate
the value of each state based on the policy. For example, if the agent is at (1, 1) and the policy tells
it to move right, we calculate the value for this state by considering the reward for the action and
the value of the next state (since the agent moves to (1, 2)).
V π (S) = −1 + 0.9 × V π (1, 2)
V π (1, 2) = −1 + 0.9 × V π (G)
V π (G) = 10
Plugging in V π (G) = 10 gives V π (1, 2) = −1 + 0.9 × 10 = 8 and V π (S) = −1 + 0.9 × 8 = 6.2.
After running this evaluation for all states, we get the values of each state.
3. Policy Improvement: Once the value function is calculated, improve the policy by selecting the
best action at each state. The goal is to maximize the expected reward given the value function.
For each state, the policy will be updated. For example:
At (1, 1), the policy might keep moving right (because it already maximizes the reward).
At (1, 2), the policy might stay the same or change, depending on the value of other states.
4. Repeat:
After improving the policy, re-evaluate it and update it again.
Continue until the policy converges (i.e., the policy does not change anymore).
5. Convergence: When the policy stops changing, the process is complete, and the agent now has
the optimal policy.
Key Points:
Policy Iteration alternates between policy evaluation (calculating the value function for the
current policy) and policy improvement (improving the policy based on the current value
function).
The algorithm repeats this process until the policy converges to the optimal policy, where further
policy updates will not change it.
Value Iteration updates the value function directly, while Policy Iteration improves the policy by
evaluating it and refining it in each step.
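A matching policy-iteration sketch under the same assumptions (every move costs -1, the goal at (0, 2) keeps value 10, γ = 0.9, and the initial policy moves right everywhere, as in the listing above):

```python
GAMMA, THETA = 0.9, 0.01
ROWS, COLS, GOAL = 3, 3, (0, 2)
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_state(s, a):
    nr, nc = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s

policy = {s: "right" for s in STATES if s != GOAL}   # initial policy: always move right
V = {s: 0.0 for s in STATES}
V[GOAL] = 10.0

while True:
    # 1. Policy evaluation: compute V^pi for the current policy.
    while True:
        delta = 0.0
        for s in policy:
            new_v = -1 + GAMMA * V[next_state(s, policy[s])]
            delta, V[s] = max(delta, abs(new_v - V[s])), new_v
        if delta < THETA:
            break
    # 2. Policy improvement: act greedily with respect to V^pi.
    stable = True
    for s in policy:
        best = max(MOVES, key=lambda a: -1 + GAMMA * V[next_state(s, a)])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:          # the policy stopped changing -> it is optimal
        break
```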
UNIT - III
The Reinforcement Learning problem, prediction and control problems, Model-based algorithm,
Monte Carlo methods for prediction, and Online implementation of Monte Carlo policy evaluation
This unit focuses on Reinforcement Learning (RL), covering key topics such as prediction and control
problems, model-based algorithms, Monte Carlo methods for prediction, and online Monte Carlo
policy evaluation. Below is a breakdown of each concept with simple explanations and examples.
1. The Reinforcement Learning Problem
In the RL problem, an agent interacts with an environment: it observes the current state, takes an action, receives a reward, and tries to learn a policy that maximizes the total reward over time.
Example: Self-driving Car
A self-driving car (the agent) observes the road and traffic (the state), chooses actions such as turning, braking, or stopping at a red light, and receives rewards for safe and efficient driving.
2. Prediction and Control Problems
Prediction Problem: Given a fixed policy, estimate how good it is, i.e., estimate its value function.
Example:
In a chess game, if we have a policy that always moves the queen aggressively, the prediction problem is to estimate how good this strategy is in terms of winning chances.
Control Problem:
The goal is to find the best policy that maximizes the expected reward.
The agent improves its policy over time using exploration and exploitation.
Example:
In game playing, the agent tries different strategies and improves its moves to win more games.
3. Model-Based Algorithm
What is a Model-Based Algorithm?
The agent knows or learns the environment dynamics (state transition probabilities).
It simulates future steps using the learned model to improve decision-making.
Example: Chess AI
A chess-playing AI like AlphaZero first learns how the game works (legal moves, opponent
reactions).
It simulates moves and outcomes before making a decision.
4. Monte Carlo Methods for Prediction
What is Monte Carlo Prediction?
Monte Carlo (MC) methods estimate value functions using experience (sampled episodes).
The agent learns by playing many episodes and averaging the rewards received.
V (s) ≈ (1/N) ∑_{i=1}^{N} G_i
Where:
N is the number of episodes in which state s was visited,
G_i is the return (total reward) observed after visiting s in episode i.
5. Online Implementation of Monte Carlo Policy Evaluation
Instead of collecting all the returns and averaging them only at the very end, we update the value function incrementally after each episode.
It allows the agent to learn faster in dynamic environments.
Instead of waiting until the end of the year, an AI trading agent updates its strategy daily based
on new data.
This helps the agent quickly adapt to market changes.
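A minimal sketch of this incremental (online) form of Monte Carlo policy evaluation, assuming a run_episode() helper that follows the policy being evaluated and returns the visited (state, reward) pairs (the helper and the α and γ values are illustrative assumptions):

```python
def mc_policy_evaluation(run_episode, num_episodes=1000, gamma=0.9, alpha=0.1):
    """Every-visit Monte Carlo prediction with an incremental update:
    after each episode, nudge V(s) toward the observed return G instead of
    storing every return and averaging at the end."""
    V = {}
    for _ in range(num_episodes):
        episode = run_episode()          # list of (state, reward) pairs
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G       # return that followed this visit
            V.setdefault(state, 0.0)
            V[state] += alpha * (G - V[state])
    return V
```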
Summary
Reinforcement Learning helps an agent learn the best actions by trial and error.
Prediction problems estimate how good a policy is, while control problems improve the policy.
Model-based RL learns the environment’s structure, while Monte Carlo methods learn from
sampled experiences.
Monte Carlo prediction helps estimate value functions, and online updates make learning faster.
Does the value function estimate the future reward?
Yes! The value function estimates the expected future reward from a given state or state-action pair.
1. State-Value Function (V(s)):
Estimates the expected reward starting from state s and following a policy π.
Formula: V^π(s) = E_π [ ∑_{t=0}^{∞} γ^t R_t ∣ s_0 = s ]
Example: In chess, if you're in a certain board position, V (s) tells you the expected future
reward if you continue playing optimally.
2. Action-Value Function (Q(s, a)):
Estimates the expected reward if you take action a in state s and follow policy π .
Formula: Q^π(s, a) = E_π [ ∑_{t=0}^{∞} γ^t R_t ∣ s_0 = s, a_0 = a ]
Example: In a self-driving car, Q(s, turn left) tells you how good turning left is in a specific
situation.
Key Takeaway:
Yes, the value function predicts future rewards by estimating the total return the agent expects to receive from a state or action. 🚀
Reinforcement Learning (RL) is about an agent learning to make the best decisions by interacting with
an environment and receiving rewards based on its actions. The goal is to find the best strategy
(policy) that maximizes the total reward over time.
RL Components:
1. Agent – The learner or decision maker (e.g., the self-driving car).
2. Environment – The world the agent interacts with (e.g., the roads and traffic).
3. State (s) – The current situation the agent observes (e.g., approaching a red light).
4. Action (a) – A choice the agent makes in a state (e.g., stop or keep driving).
5. Reward (r) – The feedback (positive or negative) based on the action (e.g., stopping at a red light =
+10 points, running a red light = -50 points).
6. Policy (π ) – The strategy the agent follows to choose actions.
Simple Example: Training a Dog
1. Agent = Dog
2. Environment = Your home
3. State = Dog is sitting
4. Action = Dog jumps
5. Reward = You give a treat (+1) if the dog jumps when told, or no treat (0) if it doesn’t.
Goal: The dog learns over time that jumping on command gets a treat and not jumping doesn’t. It will
adjust its behavior to maximize treats (rewards).
Key Takeaway:
RL is about learning from experience through trial and error. The agent tries different actions, receives
rewards or penalties, and improves its strategy to achieve the best long-term results. 🚀