qp ans
Artificial intelligence & Data Science (Thakur College of Engineering and Technology)
Module: 04
1 Define Monte Carlo methods in the context of model-free reinforcement learning. 2
Monte Carlo (MC) methods are model-free reinforcement learning techniques that learn value functions and policies directly from complete episodes of experience, by averaging the returns observed after each state or state-action pair, without requiring a model of the environment's dynamics. Monte Carlo control builds on this idea to find the optimal policy by improving it iteratively.
• On-Policy Methods: Methods like Monte Carlo Exploring Starts (MCES) evaluate and improve the same policy that is used to generate the episodes.
• Off-Policy Methods: Methods like importance sampling allow learning about one policy (the target policy) while following another (the behaviour policy). Useful for reusing data generated by one policy to improve another.
Double Deep Q-Networks (DDQN) are an improvement over the original Deep Q-Network (DQN) that addresses the overestimation bias inherent in Q-learning. This bias occurs because the maximum Q-value used in the Q-learning update is often an overestimate of the true maximum due to noise or inaccuracies in the value estimates. DDQN mitigates this issue by decoupling action selection from Q-value evaluation.
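For reference, the difference between the two update targets can be written as follows (a standard formulation given here for comparison; the notation is an assumption, not quoted from the text above):

y_t^{DQN} = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a')
y_t^{DDQN} = r_t + \gamma \, Q_{\theta^-}\big(s_{t+1}, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\big)

In the DDQN target, the online network (parameters θ) selects the action while the target network (parameters θ⁻) evaluates it, which breaks the overestimation feedback loop.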
9 Explain the process of Monte Carlo control for optimal policy determination. 5
1. Initialization:
• Initialize the action-value estimates Q(s, a) arbitrarily and start from an arbitrary (typically ε-soft) policy.
2. Generate Episodes:
• Episode Execution: Follow the current policy to generate episodes of experience.
Each episode consists of a sequence of states, actions, and rewards, ending
when a terminal state is reached.
• Episode Recording: Record the state, action, and reward sequences from each
episode.
3. Policy Evaluation:
• For each state-action pair visited in an episode, compute the return that followed it and update Q(s, a) by averaging these returns.
4. Policy Improvement:
• Make the policy greedy (or ε-greedy) with respect to the updated action-value estimates.
5. Iterate:
• Repeat: Continue generating episodes, updating value functions, and improving
the policy iteratively. Over time, this process converges to the optimal policy as
the value functions and policy stabilize.
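A minimal sketch of these steps in Python, assuming a small episodic environment exposed through hypothetical reset()/step()/actions members (names chosen here for illustration, not taken from the text above):

import random
from collections import defaultdict

def mc_control(env, n_episodes=5000, gamma=0.9, epsilon=0.1):
    """On-policy every-visit Monte Carlo control with an epsilon-greedy policy."""
    Q = defaultdict(float)     # Q[(state, action)] -> value estimate
    counts = defaultdict(int)  # number of returns averaged so far

    def policy(state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        # Generate one episode by following the current policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Policy evaluation: work backwards through the episode, averaging returns.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            counts[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]
        # Policy improvement is implicit: policy() is always greedy w.r.t. the latest Q.
    return Q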
Definition: On-policy learns about the policy currently being followed; off-policy learns about the optimal target policy while following a different behaviour policy.
Action selection vs. learning behaviour: On-policy uses the same policy for action selection and learning; off-policy uses a different behaviour policy for action selection but learns a separate target policy.
Stability: On-policy is more stable, since updates are based on actions taken by the current policy; off-policy can be less stable and may diverge if the behaviour policy and target policy differ significantly.
Importance Sampling:
Importance sampling is a method used in reinforcement learning (RL) when an agent
learns using off-policy data. It allows the agent to update the value of one policy (target
policy) while following another policy (behaviour policy) by adjusting the contribution
of each experience to account for the difference between these two policies. This
adjustment is done through importance sampling weights.
Mathematical Concept:
For a single step, the importance sampling ratio is

ρt = π(at ∣ st) / μ(at ∣ st)

where:
• π(at ∣ st) is the probability of taking action at in state st under the target policy.
• μ(at ∣ st) is the probability of taking action at in state st under the behaviour policy.
The return G is then weighted by the importance sampling ratio (the product of the per-step ratios along the trajectory) to correct the bias introduced by using the behaviour policy instead of the target policy.
Example:
Let’s say we have a self-driving car learning to navigate intersections. The behaviour
policy μ is a safer policy that the car is using to avoid risky decisions during learning.
The target policy π is a more optimal policy that the agent aims to learn, which might
include riskier but faster decisions.
Scenario:
• The car follows the behaviour policy μ and takes action a1 at state s1 with a
probability of 0.7.
• However, under the target policy π, the probability of taking action a1 in state s1 is
0.9.
The importance sampling ratio is therefore π/μ = 0.9 / 0.7 ≈ 1.29, so the car gives slightly more weight to this experience because the action it took is more likely under the target policy than under the behaviour policy.
Return Adjustment:
• Suppose the return Gt for this episode is 10 (i.e., the reward the car gets for successfully navigating the intersection).
• The weighted return would be:

ρ × Gt = (0.9 / 0.7) × 10 ≈ 12.86

so this experience contributes a return of roughly 12.86 when updating the target policy's value estimates.
1. Overview of DQN:
DQN is based on Q-learning, which seeks to learn a policy that maximizes the
cumulative future rewards by estimating a Q-value function Q(s, a). The Q-value
function represents the expected future reward for taking action a in state s. In
traditional Q-learning, Q-values are stored in a table, but DQN approximates them
using a neural network, allowing it to scale to larger and more complex
environments.
2. Architecture:
The core component of a DQN is a deep neural network, which takes the state as
input and outputs the Q-values for all possible actions.
• Input Layer:
o The input to the network is the state of the environment. In complex
environments (like video games), this can be high-dimensional data
such as an image.
• Hidden Layers:
o Multiple hidden layers process the input data using non-linear activation
functions (e.g., ReLU). These hidden layers enable the network to
capture the complex relationships between states and actions.
• Output Layer:
o The output layer provides a Q-value for each possible action. If the
environment has n possible actions, the output layer will have n
neurons, each representing Q(s, a) for a particular action a.
3. Functioning of DQN:
The functioning of DQN consists of several key steps:
A. Q-Value Approximation:
The DQN uses a deep neural network Qθ(s, a) to approximate the Q-values. The
network’s weights θ are updated based on the difference between the predicted
Q-value and the target Q-value, using the Bellman equation:

yt = rt + γ max_{a′} Qθ(st+1, a′)

where:
• rt is the reward at time t,
• γ is the discount factor (to balance immediate vs. future rewards),
• Q(st+1, a′) is the Q-value for the next state-action pair,
and the weights θ are adjusted to minimize the squared difference between Qθ(st, at) and the target yt.
B. Experience Replay:
DQN uses experience replay to improve learning stability and efficiency. Instead of
updating the network based on consecutive experiences, DQN stores the agent's
experiences (st, at, rt, st+1) in a replay buffer and samples random minibatches from it
to train the network. This breaks the correlation between consecutive updates and
reduces overfitting.
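A minimal sketch of such a replay buffer in Python (the class and method names here are illustrative assumptions, not taken from the text above):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples random minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        # Store one transition (s_t, a_t, r_t, s_{t+1}, done flag).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive updates.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)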
C. Target Network:
One major issue in Q-learning is instability due to the fact that the Q-values are
constantly being updated. To address this, DQN introduces a target network, which
is a copy of the main network but updated less frequently. The target network Qθ′ provides stable target Q-values during training:

yt = rt + γ max_{a′} Qθ′(st+1, a′)

The target network is periodically synchronized with the main network to improve stability.
D. Action Selection (Epsilon-Greedy):
To balance exploration and exploitation, DQN uses the epsilon-greedy strategy:
• With probability ϵ, the agent takes a random action (exploration).
• With probability 1−ϵ, it takes the action with the highest Q-value (exploitation).
This allows the agent to explore the environment and avoid getting stuck in local optima.
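Putting steps A-D together, the following PyTorch sketch shows one possible training step (network sizes, hyperparameters, and a replay buffer like the one sketched earlier are assumptions for illustration, not details from the text above):

import random
import torch
import torch.nn as nn

def build_q_network(state_dim, n_actions):
    # Simple fully connected network: state in, one Q-value per action out.
    return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

def select_action(q_net, state, n_actions, epsilon):
    # Epsilon-greedy action selection (step D).
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

def dqn_training_step(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    # Sample a random minibatch from the replay buffer (step B).
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Predicted Q-values for the actions actually taken (step A).
    q_pred = q_net(states).gather(1, actions).squeeze(1)

    # Stable targets from the target network (step C).
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Periodically copying the online weights into the target network, e.g. target_net.load_state_dict(q_net.state_dict()), completes the loop.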
Prioritization Mechanism:
In prioritized experience replay, each experience in the replay buffer is assigned a priority value based on its TD-error. There are two main methods of prioritization:
1. Proportional Prioritization: The probability of selecting an experience is directly proportional to the magnitude of its TD-error (plus a small constant so every experience remains selectable).
2. Rank-Based Prioritization: Experiences are ranked by the magnitude of their TD-error, and the selection probability depends on the rank rather than on the raw error.
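For reference, the proportional variant (notation assumed here, not quoted from the text above) samples experience i with probability

P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}

where p_i is the priority (e.g. the absolute TD-error plus a small constant) and α controls how strongly prioritization is applied (α = 0 recovers uniform sampling).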
Limitations:
A. Over-Fitting on High-Priority Experiences:
B. Bias Introduction
Exploration is typically maintained with an ε-greedy policy: selecting a random action with probability ε and the action with the highest Q(s, a) value otherwise.
Steps in on-policy Monte Carlo control:
1. Policy Evaluation: Estimate Q(s, a) by averaging returns for each state-action
pair under the current policy.
2. Policy Improvement: Update the policy to be greedy with respect to the
current action-value estimates.
This iterative process converges to an optimal policy if all state-action pairs are
visited sufficiently often.
B. Off-Policy Monte Carlo Control:
In off-policy learning, the agent learns about one policy (the target policy) while
following a different policy (the behaviour policy) to generate experiences. This is
useful when we want to learn an optimal policy while exploring widely using a
different policy.
• Importance sampling is used to correct for the difference between the
behaviour policy and the target policy, ensuring unbiased value estimates.
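A minimal sketch of ordinary (per-episode) importance sampling for off-policy Monte Carlo evaluation, assuming the behaviour and target policies are supplied as functions returning action probabilities (names are illustrative, not from the text above):

def off_policy_mc_value(episodes, target_prob, behaviour_prob, gamma=0.9):
    """Estimate the value of the start state using importance-sampling-weighted returns.

    episodes: list of trajectories, each a list of (state, action, reward) tuples
              generated by the behaviour policy.
    target_prob(s, a), behaviour_prob(s, a): action probabilities under the two policies.
    """
    total, n = 0.0, 0
    for episode in episodes:
        G, rho = 0.0, 1.0
        # Accumulate the discounted return and the product of per-step ratios.
        for t, (state, action, reward) in enumerate(episode):
            G += (gamma ** t) * reward
            rho *= target_prob(state, action) / behaviour_prob(state, action)
        total += rho * G  # weight the behaviour-policy return by the importance ratio
        n += 1
    return total / n if n else 0.0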
Unlike dynamic programming, which requires a full model, TD methods learn incrementally like MC but without waiting for episode completion, combining the strengths of both approaches.
16 Examine the mechanisms of Monte Carlo control and discuss its impact on policy optimization in reinforcement learning. 10
Monte Carlo (MC) control is a key approach in reinforcement learning (RL) used to
optimize policies in environments modelled as Markov Decision Processes (MDPs).
Monte Carlo control is well suited to episodic tasks such as games. However, it may not be ideal for continuing tasks without clear episode terminations.
17 Critically evaluate the differences between on-policy and off-policy learning, and their respective advantages and challenges in real-world applications. 10
18 Discuss the importance sampling method, its mathematical foundation, and its relevance in reinforcement learning scenarios. 10
2. Mathematical Foundation:
The essence of importance sampling lies in the adjustment of expectations. Suppose
we want to estimate an expectation under a target distribution π, but we only have
samples from a behaviour distribution μ. The expected value of a function f(x) under the target policy π is:

E_{x∼π}[f(x)] = Σx π(x) f(x)

Since we only have samples from μ(x), we rewrite the expectation using importance sampling:

E_{x∼π}[f(x)] = E_{x∼μ}[ (π(x) / μ(x)) f(x) ]

The ratio π(x)/μ(x) adjusts the value estimates based on how much the actions under the behaviour policy differ from those of the target policy.
19 Analyze the architecture of Deep Q-networks (DQN) and its variants (DDQN, Dueling DQN) with a focus on their improvements and applications. 10
Architecture:
1. DQN
2. DDQN:
Problem: DQN suffers from overestimation of Q-values because it uses the same network to select and evaluate actions. This can lead to suboptimal policies, especially in stochastic environments.
Solution: DDQN decouples the two roles: the online network selects the greedy action for the next state, while the target network evaluates that action, reducing the overestimation bias.
3. Dueling DQN:
Problem: In many states, the choice of action has little effect on the value, yet DQN
still computes a Q-value for every action.
Solution: Dueling DQN introduces two separate streams within the network
architecture to estimate:
• Value function V(s): Represents how good it is to be in state s.
• Advantage function A(s, a): Measures the value of taking a specific action a in state s.
Impact: This architecture allows the network to more efficiently learn state values
and action advantages, particularly when some actions are irrelevant in specific
states, leading to faster convergence.
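The two streams are usually recombined into Q-values with the standard aggregation (stated here for reference; the exact form is not given in the text above):

Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big)

Subtracting the mean advantage keeps the value/advantage decomposition identifiable.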
Improvement:
DQN: Introduced stable deep learning techniques to RL and became a foundation for
subsequent methods.
DDQN: Addressed the overestimation problem, improving performance in
environments with stochastic rewards.
Dueling DQN: Optimized learning efficiency by separating value and advantage
functions, leading to faster and more efficient policy learning.
Applications:
DQN: Used extensively in Atari games and game-playing AI due to its ability to
handle high-dimensional input spaces (e.g., raw pixel data).
DDQN: Applied in environments where Q-value overestimation is critical, such as robotics and complex control tasks.
Dueling DQN: Used in scenarios with many irrelevant actions in certain states, such
as navigation tasks, resource management, and traffic control systems.
2. Tabular Methods
Contributions:
• Tabular methods (like Q-learning and SARSA) represent one of the earliest
and simplest approaches to RL. They store values for each state or state-action
pair in a table, allowing agents to directly update these values based on the
rewards they observe.
• Exploration-exploitation balance: Q-learning’s off-policy nature allows it to
balance exploration and exploitation well, leading to the development of
algorithms that are still widely used in RL, especially in discrete
environments.
• Optimal control: Tabular methods enabled practical applications like robotic
control, maze-solving tasks, and classic benchmark problems like the
GridWorld and CartPole.
• Theoretical foundation: These methods laid the groundwork for more advanced
RL algorithms, providing theoretical insights into value function
approximation and policy learning.
Future Prospects:
Module: 05
1 Define Temporal Difference (TD) methods. 2
TD(0) is a special case of TD(λ) that looks only one step ahead and is the most basic method of Temporal Difference (TD) learning.
Key Points:
1. One-Step Update: TD(0) uses the value of the next state to update the current
state’s value immediately after each step (one-step lookahead).
2. TD Error: The update is based on the TD error, which is the difference
between the predicted value of the current state and the value of the next state
plus the reward.
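For reference, the standard TD(0) update rule (not reproduced in the text above) is:

V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]

where the bracketed term is the TD error.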
Definition:
SARSA is an on-policy reinforcement learning algorithm that updates the value of state-action pairs based on the current policy. It updates the Q-value using the reward received and the Q-value of the next state-action pair chosen by the same policy.
Key Formula:

Q(st, at) ← Q(st, at) + α [ rt+1 + γ Q(st+1, at+1) − Q(st, at) ]
Advantage:
Q-Learning is an off-policy reinforcement learning algorithm that can learn the optimal
policy regardless of the actions taken by the current policy. This means it can learn the
best possible action-value function using the best available actions (greedy approach),
even if the agent is exploring suboptimal actions during learning.
Benefit: This allows Q-Learning to converge to the optimal policy more effectively, as
it improves learning from all experiences, not just those dictated by the current policy.
8 Explain the key differences between Monte Carlo methods and Temporal Difference methods. 5
Data usage: Monte Carlo uses the actual returns from complete episodes to update values; TD uses estimated future values (based on the next state) for updates.
Complexity: Monte Carlo is simpler to implement but can be less efficient when learning from large state spaces; TD is more complex but can handle large state spaces and ongoing learning more efficiently.
1. Initialization:
• Objective: Set up the initial value estimates for all states or state-action pairs.
• Process: Initialize the value function (e.g., V(s) or Q(s, a)) arbitrarily, often to zero or a small random value. This phase establishes the starting point for learning.
2. Policy Execution:
• Objective: Define and execute the policy used for selecting actions in the
environment.
• Process: Choose actions according to the current policy (e.g., ε-greedy policy).
This involves exploration (trying new actions) and exploitation (choosing the
best-known actions).
3. Transition and Reward Collection:
• Objective: Interact with the environment to collect data for learning.
• Process: Execute the chosen action, observe the resulting reward r and the next state s′. This phase collects experiences that will be used for updating value estimates.
4. Value Update:
• Objective: Update the value function based on observed rewards and transitions.
• Process: Apply the TD update rule to adjust the value estimate for the current state or state-action pair. For TD(0), this involves:

V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]

For TD(λ), this involves updating values using eligibility traces, combining information from multiple steps. (A TD(0) sketch in code follows this list.)
5. Policy Improvement:
• Objective: Enhance the policy based on updated value estimates.
• Process: Adjust the policy to favor actions with higher value estimates. This phase is often part of methods like SARSA(λ) or Q-learning, where the policy evolves as the value function improves.
6. Termination or Convergence Check:
• Objective: Determine if the learning process should be stopped or if further
iterations are needed.
• Process: Check for convergence criteria, such as stability in value estimates or a
predefined number of episodes. This phase ensures that the learning process
has adequately captured the value function.
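A minimal sketch of these phases for TD(0) prediction in Python, assuming an episodic environment with reset()/step() and a fixed policy function (illustrative names, not from the text above):

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.95):
    """Estimate the state-value function V under a fixed policy using TD(0)."""
    V = defaultdict(float)  # phase 1: arbitrary initialization (zero here)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)                        # phase 2: policy execution
            next_state, reward, done = env.step(action)   # phase 3: transition and reward
            # Phase 4: one-step TD update using the bootstrapped estimate of the next state.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
        # Phases 5-6 (policy improvement, convergence checks) wrap this loop
        # in control methods such as SARSA or Q-learning.
    return V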
3. Computational Complexity
• Issue: High-fidelity simulations can be computationally expensive and time
consuming, especially for complex environments (e.g., robotics, traffic
systems). Running a large number of simulation episodes to train agents can
require significant computational resources and time.
• Impact: The cost of computation limits the scalability of RL, making it difficult
to train agents in environments that require real-time interactions or have
many variables, like simulating multiple agents or entities interacting with
each other.
1. Learning Efficiency:
• TD Methods: TD methods update the value function after every step based on
the current reward and the estimated value of the next state. This incremental
learning allows the agent to update its value estimates continuously as it
interacts with the environment.
• Monte Carlo Methods: MC methods wait until the end of an episode to
calculate the total return before updating value estimates. This can be
inefficient, especially for long episodes.
• Advantage: TD methods can learn faster than MC methods because they don’t
need to wait until an episode finishes to start learning. This makes TD
methods particularly useful in ongoing tasks or tasks with long episodes.
3. Sample Efficiency:
• TD Methods: TD methods use bootstrapping, which means they update value
estimates based not only on actual rewards but also on estimates of future
rewards (i.e., they update based on estimates from other estimates). This
makes TD methods more sample efficient as they make better use of the
available data.
• Monte Carlo Methods: MC methods use only the actual returns from complete
episodes to update value estimates, which can be less efficient in terms of data
utilization.
• Advantage: TD methods can learn more effectively with fewer interactions
with the environment, which is particularly important in environments where
gathering samples is expensive or slow.
4. Lower Variance:
• TD Methods: TD methods generally have lower variance because they rely on incremental estimates rather than the total returns from an entire episode, which can be noisy.
• Monte Carlo Methods: MC methods often have higher variance because they
rely on the total return from complete episodes, which can vary significantly,
especially in tasks with long or stochastic episodes.
• Advantage: The lower variance of TD methods often leads to more stable
learning and faster convergence in practice compared to MC methods.
Q-Learning Algorithm:
1. Initialization:
• Initialize the Q-value function Q(s, a) for all state-action pairs arbitrarily,
typically starting with zeros or small random values.
• Choose the learning rate α (step size), the discount factor γ (which determines how future rewards are valued), and an exploration strategy (e.g., ε-greedy) to balance exploration and exploitation.
2. Action Selection (ε-greedy policy):
• In a given state st, select an action at using an ε-greedy policy, where ε is the probability of exploring a random action, and 1−ε is the probability of exploiting the action with the highest current Q-value.
3. Observe Outcome:
• Take action at and observe the reward rt+1 and the next state st+1.
4. Q-Value Update:
• Update Q(st, at) towards the target rt+1 + γ max_a Q(st+1, a) using the learning rate α.
5. Repeat:
• Set st←st+1 and repeat steps 2–4 until the episode ends or the learning process converges (see the sketch below).
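A compact tabular Q-learning sketch in Python, again assuming a simple episodic environment interface (reset(), step(), actions) chosen here for illustration:

import random
from collections import defaultdict

def q_learning(env, n_episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Off-policy tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)  # step 1: initialize Q(s, a) to zero

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Step 2: epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            # Step 3: act and observe the outcome.
            next_state, reward, done = env.step(action)

            # Step 4: update towards the greedy (off-policy) target.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            # Step 5: move on.
            state = next_state
    return Q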
TD methods are learning algorithms that update value estimates for states or
state-action pairs based on the difference between successive predictions of the
value function. They estimate the value of a state (or state-action pair) using
information from the next state, rather than waiting for the final outcome of the entire
episode.
The TD error is the difference between the current estimate and the more recent, bootstrapped estimate of the return:

δt = rt+1 + γ V(st+1) − V(st)
• Learning Process: Making updates based on the action taken and the policy
being used for learning.
d) Q-Learning:
• Q-Learning is an off-policy TD algorithm that learns the optimal policy.
• Learning Process: Q-learning updates the Q-value of the state-action pair using the maximum Q-value of the next state, making it optimal-policy focused.
• Monte Carlo methods can suffer from high variance in the return estimates, especially in stochastic environments where the outcomes can vary widely.
• Limitation: This high variance can slow down learning and make it difficult to converge to accurate value estimates. Bootstrapping methods like TD can address this by using more frequent updates.
e) Incompatibility with Non-Episodic Tasks:
• Monte Carlo methods are ineffective in tasks that lack episode boundaries or
have ongoing interactions without a clear termination point.
• Limitation: This restricts their use to a narrower set of applications compared to
bootstrapping methods like TD(0) or Q-Learning, which can work in both
episodic and non-episodic tasks.
Significance:
3. Policy Improvement:
After estimating the value functions, the agent can improve its policy—the set of
actions it chooses in each state. In on-policy TD methods (e.g., SARSA), the agent
updates the policy based on the estimated value of the current policy. In off-policy
TD methods (e.g., Q-Learning), the agent improves its policy by learning from the
optimal actions, even if it is currently following a different policy.
Example in Chess:
In chess, once the agent learns that certain board positions lead to higher chances of
winning (value function estimation), it can adjust its policy by favoring moves that
lead to those positions. This phase is where the agent's decision-making process
improves over time, leading to better strategies.
Significance:
• Advantage: TD methods gradually improve the agent’s policy, leading to
increasingly optimal actions in the long run.
• Challenge: In complex environments like chess, where the state-action space is
enormous, finding the optimal policy can take a significant amount of time,
especially with limited exploration.
4. Bootstrapping and Convergence:
A critical feature of TD methods is the use of bootstrapping, which involves
updating the value of a state based on the estimated value of the subsequent state.
This allows the agent to improve its estimates without needing complete information.
Over time, these updates lead to the convergence of the value function, meaning the
estimates become more accurate as the agent continues learning.
Example in Chess:
In chess, bootstrapping allows the agent to improve its evaluation of a board position
after each move, based on the estimated value of the resulting board position. For
example, if the agent moves into a checkmate position, it can immediately update the
value of the previous state as a bad move.
Significance:
• Advantage: Bootstrapping makes TD methods highly efficient in tasks where
waiting for full episodes is impractical (such as long chess games).
• Challenge: The convergence of value estimates relies on sufficiently frequent
visits to all important states, which can be difficult to ensure in large state
spaces like chess.
5. Function Approximation:
Instead of learning the value of each individual state, the agent uses a function approximator, such as a neural network, to estimate values for groups of similar states.
Example in Chess:
Instead of assigning a unique value to every possible board configuration (which is
impractical due to the vast number of possible states), the agent can use a neural
network to estimate the value of a given board based on features like material
balance, king safety, and piece activity.
Significance:
• Advantage: Function approximation allows the agent to handle large and
continuous state spaces by learning from similar states.
• Challenge: Overfitting can occur if the agent focuses too much on specific
states without learning generalizable patterns. In chess, this could mean the
agent performs well in certain positions but poorly in others.
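To make the idea concrete, here is a minimal sketch of a TD(0) update with a linear function approximator over hand-crafted board features (the feature names and board interface are illustrative assumptions, not from the text above):

import numpy as np

def board_features(board):
    # Hypothetical feature extractor: material balance, king safety, piece activity.
    return np.array([board.material_balance(), board.king_safety(), board.piece_activity()])

def td0_linear_update(weights, board, next_board, reward, alpha=0.01, gamma=0.99):
    """One TD(0) update for a linear value function V(s) = w . features(s)."""
    x, x_next = board_features(board), board_features(next_board)
    v, v_next = weights @ x, weights @ x_next
    td_error = reward + gamma * v_next - v
    # For a linear value function, the gradient w.r.t. the weights is the feature vector.
    return weights + alpha * td_error * x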
18 Discuss in detail the SARSA algorithm, its working, applications, and how it differs from other TD methods. 10
2. Applications of SARSA
SARSA is widely used in tasks where on-policy learning is required or where it's
necessary to learn about the current behaviour of the agent. Its applications span
various domains:
a) Game Playing:
Q-Learning is a model-free, off-policy algorithm that does not require a model of the environment. It learns by interacting with the environment and updating the estimated Q-values of state-action pairs. The algorithm seeks to approximate the optimal action-value function, denoted by Q*(s, a), which represents the expected cumulative reward when starting from state s, taking action a, and then following the optimal policy.
Algorithm Steps:
1. Initialize the Q-values Q(s, a) arbitrarily for all state-action pairs. If no prior
knowledge is available, they are typically set to zero.
2. For each episode:
o Start in an initial state s.
o Choose an action a using a policy derived from the current Q-values
(often using an ε-greedy policy, which balances exploration and
exploitation).
o Take the action a and observe the reward r and the next state s′.
o Update the Q-value of the state-action pair (s, a) using the Q-Learning update rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

3. Repeat step 2 until convergence, meaning the Q-values no longer change significantly.
2. Benefits of Q-Learning
Q-Learning has several key advantages that have contributed to its widespread
adoption in reinforcement learning.
a) Model-Free Approach:
• Q-Learning does not require a model of the environment (i.e., it does not need to
know the transition probabilities or reward structure beforehand). This makes
it highly adaptable to a wide range of environments where the dynamics are
unknown or too complex to model.
b) Guaranteed Convergence:
• With an appropriate learning rate and sufficient exploration, Q-Learning is
guaranteed to converge to the optimal policy, provided the environment
satisfies certain conditions (i.e., it must be a Markov Decision Process with
a finite number of states and actions).
c) Flexibility and Simplicity:
• The algorithm is relatively simple to implement and can be applied to both small
and large state-action spaces.
• The off-policy nature of Q-Learning allows it to decouple learning from the behavior policy. This means the agent can explore using one policy (like ε-greedy) but update its Q-values based on the optimal policy (choosing actions that maximize future rewards).
d) Online Learning:
• Q-Learning is an online learning algorithm, meaning it updates its knowledge as it interacts with the environment. It can start with no prior knowledge and learn from experience, making it useful for applications where training must occur on the fly.
e) Handling Exploration vs Exploitation:
• Through policies like ε-greedy, Q-Learning provides a mechanism to balance
exploration (trying new actions) and exploitation (taking actions that are
known to yield high rewards), which is critical in RL.
3. Applications of Q-Learning
Q-Learning has been applied in various fields, ranging from gaming to robotics, due
to its effectiveness in solving complex decision-making problems.
a) Game AI:
• Video Games: Q-Learning is used to train AI agents to play games by learning
from the environment. Notable examples include learning strategies for games
like tic-tac-toe, Atari games, and puzzle games.
• Chess: Q-Learning can be used to help an agent improve its chess-playing skills
by learning optimal moves through trial and error.
b) Robotics:
• Path Planning: In robotics, Q-Learning helps robots learn to navigate
environments, avoiding obstacles and reaching a goal efficiently.
• Autonomous Systems: Q-Learning is employed in autonomous systems where
robots must interact with the physical world, making decisions based on
feedback. For instance, a robot vacuum cleaner could use Q-Learning to
efficiently map out a cleaning strategy for a room.
c) Autonomous Vehicles:
• Self-driving Cars: Q-Learning has been used in the development of decision
making systems for autonomous vehicles, helping cars learn to navigate roads,
avoid collisions, and make real-time driving decisions.
d) Dynamic Pricing:
• In e-commerce, Q-Learning can be applied to dynamic pricing strategies
where the goal is to maximize profits by adjusting prices based on customer
demand and competitor actions.
e) Healthcare:
• Treatment Optimization: Q-Learning is applied in personalized medicine to
optimize treatment strategies for patients based on historical data and
real-time feedback. For example, it could be used to tailor drug dosages for
chronic disease management based on patient response.
20 A tourist car operator finds that during the past few months, the car’s use has varied so much that the cost of maintaining the car varied considerably. During the past 200 days the demand for the car fluctuated as below. 10
Simulate the demand for a 10-week period. Use the random numbers: 82, 96, 18, 96, 20, 84, 56, 11, 52, 03.