What is Temporal Difference (TD) Learning?
V(s_t) \leftarrow V(s_t) + \alpha \left[ R_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
where α is the learning rate, γ is the discount factor, and the bracketed term
R_{t+1} + γV(s_{t+1}) − V(s_t) is the TD error.
5. Updating Value Estimates:
o The agent updates its value estimates based on the TD error.
If the TD error is positive, it means things went better than
expected, so the agent will increase the value estimate for the
state. If the TD error is negative, it will decrease the value
estimate.
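As a concrete illustration of this update, here is a minimal tabular TD(0) sketch in Python. The small random-walk chain, the learning rate (ALPHA), and the discount factor (GAMMA) are assumptions made up for this sketch, not part of the notes above.

```python
import random
from collections import defaultdict

# Minimal tabular TD(0) sketch (illustrative): estimate V for a random policy
# on a small 5-state chain. Reaching the last state gives reward +1 and ends
# the episode. The environment is a made-up assumption for this sketch.
N_STATES = 5          # states 0..4; state 4 is terminal
ALPHA = 0.1           # learning rate
GAMMA = 0.9           # discount factor

V = defaultdict(float)  # value estimates, initialized to 0

def step(state):
    """Random-walk dynamics: move left or right; +1 reward on reaching the last state."""
    next_state = max(0, min(N_STATES - 1, state + random.choice([-1, 1])))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for _ in range(1000):
    s = 0
    done = False
    while not done:
        s_next, r, done = step(s)
        # TD error: how much better or worse the outcome was than expected
        td_error = r + GAMMA * (0.0 if done else V[s_next]) - V[s]
        # TD(0) update: move V(s) a fraction ALPHA toward the TD target
        V[s] += ALPHA * td_error
        s = s_next

print({s: round(V[s], 2) for s in range(N_STATES)})
```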
6. TD Learning vs. Monte Carlo Methods:
o In Monte Carlo methods, the agent only learns at the end of
an episode (e.g., after finishing a game), while TD learning
updates at every step. This lets TD learning revise its estimates
in real time, without waiting for the episode to finish.
7. TD Learning Algorithms:
o TD(0): The simplest form of TD learning, where the agent
updates its value estimates after each step.
o SARSA: An on-policy TD method where the agent learns an
action-value function Q(s, a) (what action to take in a given
state) from the state, action, reward, next state, and next
action, which give the algorithm its name.
o Q-Learning: An off-policy TD method where the agent learns the
value of the best possible action in a given state, regardless of
the action it actually takes while exploring, which allows it to
act optimally in future decisions.
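To make the contrast concrete, below is a minimal sketch of the two update rules in Python. The Q table, ALPHA, GAMMA, and the made-up transition data at the end are assumptions of this sketch; only the form of the targets comes from the definitions above.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
Q = defaultdict(float)   # tabular action-value estimates, keyed by (state, action)

def sarsa_update(s, a, r, s_next, a_next, done):
    """On-policy: the target uses the action the agent actually takes next."""
    target = r + (0.0 if done else GAMMA * Q[(s_next, a_next)])
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, actions, done):
    """Off-policy: the target uses the best action available in the next state."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + GAMMA * best_next
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Example: one update of each kind with made-up transition data
actions = ["left", "right"]
sarsa_update("s0", "right", 1.0, "s1", "left", done=False)
q_learning_update("s0", "right", 1.0, "s1", actions, done=False)
print(dict(Q))
```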
1. Policy:
o A policy is a strategy or rule that the agent follows to decide
which action to take in each state. The goal of DP in RL is to
find the optimal policy that maximizes the long-term reward.
2. Value Functions:
o State-Value Function V(s): Represents the expected return
(sum of future rewards) starting from state s and following a
particular policy.
o Action-Value Function Q(s, a): Represents the expected return
starting from state s, taking action a, and then following a
particular policy.
3. Bellman Equations:
o The Bellman equations are central to DP in RL. They express
the relationship between the value of a state and the values of
its successor states. The equations are recursive and form the
foundation for calculating the value functions.
o Bellman Equation for State-Value Function:
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V(s') \right]
Where:
π(a|s) is the probability of taking action a in state s under policy π.
P(s'|s, a) is the probability of moving to state s' from state s after
taking action a.
R(s, a, s') is the reward received after transitioning from state s to
state s' using action a.
γ is the discount factor, which represents the
importance of future rewards.
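As a sketch of how this equation turns into computation, the backup for a single state can be written directly from the double sum above. The tiny policy, transition model, and rewards below are made-up assumptions for illustration.

```python
# One Bellman expectation backup for a single state, written directly from the
# equation above: V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
GAMMA = 0.9

# pi(a|s): action probabilities under the current policy (made-up example)
policy = {"s0": {"stay": 0.4, "go": 0.6}}

# P and R: transitions and rewards, keyed by (s, a) -> list of
# (next_state, probability, reward) triples (made-up example)
model = {
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"):   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)],
}

V = {"s0": 0.0, "s1": 0.0}

def bellman_backup(s):
    """Expected return from s under the policy, using current estimates of V."""
    total = 0.0
    for a, p_a in policy[s].items():
        for s_next, p_next, r in model[(s, a)]:
            total += p_a * p_next * (r + GAMMA * V[s_next])
    return total

V["s0"] = bellman_backup("s0")
print(round(V["s0"], 2))  # 0.4*0 + 0.6*(0.8*(1 + 0.9*0) + 0.2*(0 + 0.9*0)) = 0.48
```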
4. Key Dynamic Programming Algorithms:
o Policy Evaluation:
This algorithm computes the state-value function
V(s) for a given policy π. It involves iterating
over all states and updating the value function using the
Bellman equation until it converges.
o Policy Improvement:
Given a value function V(s), this algorithm
improves the policy by choosing actions that maximize
the expected return. The improved policy is determined
by selecting actions that yield the highest action-value.
o Policy Iteration:
This is a combination of policy evaluation and policy
improvement. The algorithm alternates between
evaluating the current policy and improving it. The
process continues until the policy converges to the
optimal policy.
o Value Iteration:
Value iteration combines policy evaluation and policy
improvement into a single step. Instead of fully
evaluating a policy before improving it, value iteration
updates the value function using the Bellman optimality
equation and directly approximates the optimal value
function. Once the value function converges, the optimal
policy can be derived.
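A compact value-iteration sketch over a known model may help make this loop concrete. The two-state model, GAMMA, and the convergence threshold THETA below are made-up assumptions; the greedy policy is derived at the end, as described above.

```python
# Value iteration sketch over a known model (made-up two-state example).
# model[(s, a)] = list of (next_state, probability, reward) triples
GAMMA, THETA = 0.9, 1e-6

states = ["s0", "s1"]
actions = ["stay", "go"]
model = {
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 1.0, 0.0)],
    ("s1", "go"):   [("s0", 1.0, 0.0)],
}

V = {s: 0.0 for s in states}

def q_value(s, a):
    """Expected return of taking action a in s, then acting on current V."""
    return sum(p * (r + GAMMA * V[s_next]) for s_next, p, r in model[(s, a)])

# Repeatedly apply the Bellman optimality backup until the values stop changing.
while True:
    delta = 0.0
    for s in states:
        best = max(q_value(s, a) for a in actions)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Derive the greedy (optimal) policy from the converged value function.
policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
print(V, policy)
```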
5. Applications of Dynamic Programming in RL:
o DP methods are used in environments where the dynamics
are well-understood, and an exact model is available. Some
classic examples include grid-world environments, inventory
management, and navigation problems where the transition
probabilities and rewards are known.
6. Limitations of Dynamic Programming:
o Scalability: DP algorithms require knowledge of the complete
environment, which may not be feasible in large or complex
environments where the state and action spaces are vast (this
is often referred to as the "curse of dimensionality").
o Model Dependency: DP relies on a perfect model of the
environment, which is not always available in real-world
applications.
The Monte Carlo value update moves the estimate toward the average of the observed returns:
V(s_t) \leftarrow V(s_t) + \frac{1}{N(s_t)} \left( G_t - V(s_t) \right)
where G_t is the total return from time step t and N(s_t) keeps
track of the number of visits to state s_t.
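A minimal every-visit Monte Carlo sketch of this running-average update is shown below; the episode format (a list of (s_t, R_{t+1}) pairs) and the hand-written example episode are assumptions of the sketch.

```python
from collections import defaultdict

GAMMA = 0.9
V = defaultdict(float)       # value estimates
N = defaultdict(int)         # visit counts per state

def mc_update(episode):
    """episode: list of (s_t, R_{t+1}) pairs from one complete rollout."""
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return G_t,
    # and nudge V(s_t) toward G_t by 1/N(s_t) (a running average).
    for state, reward in reversed(episode):
        G = reward + GAMMA * G
        N[state] += 1
        V[state] += (G - V[state]) / N[state]

# Example usage with a hand-written episode (made up for illustration):
mc_update([("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
print(dict(V))
```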
Monte Carlo methods are important because they provide a way to learn
optimal policies without requiring a model of the environment. They are
particularly useful in situations where the environment is too complex to
model or where the agent only has access to experience, not to the full
dynamics of the environment. They also lay the groundwork for more
advanced techniques that combine Monte Carlo sampling with ideas from
Temporal Difference learning, as well as for planning methods such as
Monte Carlo Tree Search (MCTS).
1. States (S):
Definition: The state represents the current situation or
configuration of the environment. At any given time, the
environment is in a specific state from a set of possible states S.
Example: In a chess game, a state could be the current
arrangement of all pieces on the board.
2. Actions (A):
4. Rewards (R):
5. Policy (π):
7. Value Functions: