ML-10
ColleGPT
Reinforcement Learning
Prepared and Edited by: Mayank Yadav | Designed by: Kussh Prajapati
www.collegpt.com | [email protected]
Reinforcement Learning
● In Reinforcement Learning, the agent learns automatically using feedback without any labeled
data, unlike supervised learning.
● Since there is no labeled data, the agent is bound to learn by its experience only.
● RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.
● The agent interacts with the environment and explores it by itself. The primary goal of an agent
in reinforcement learning is to improve the performance by getting the maximum positive
rewards.
● The agent learns through a process of trial and error, and based on that experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of
machine learning method where an intelligent agent (a computer program) interacts with the
environment and learns how to act within it."
● It is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement
learning. Here we do not need to pre-program the agent, as it learns from its own experience
without any human intervention.
● The agent keeps doing these three things (take an action, change state or remain in the same
state, and get feedback), and by repeating them, it learns and explores the environment.
Key Terms in Reinforcement Learning:
● Agent(): An entity that can perceive/explore the environment and act upon it.
● Environment(): The situation in which the agent is present or by which it is surrounded. In RL, we
typically assume a stochastic environment, which means it is random in nature.
● Action(): Actions are the moves taken by an agent within the environment.
● State(): State is a situation returned by the environment after each action taken by the agent.
● Reward(): A feedback returned to the agent from the environment to evaluate the action of the
agent.
● Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
● Value(): The expected long-term return from a state, including the discount factor, as opposed to
the short-term reward.
● Q-value(): Similar to the value, but it takes one additional parameter: the current action (a).
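For clarity, the Value and Q-value can be written as expected discounted returns. This standard notation is added here for reference and is not part of the original notes:

    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_{0}=s \right]
    \qquad
    Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_{0}=s,\ a_{0}=a \right]

where \gamma \in [0,1) is the discount factor, so rewards received sooner count more than rewards received later.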
Approaches to Implement Reinforcement Learning:
● Value-Based – The main goal of this method is to maximize a value function: the agent, acting
through a policy, learns the long-term return it can expect from the current state.
● Policy-Based – In policy-based methods, you learn a strategy that helps gain maximum reward in
the future through the possible actions performed in each state. The two types of policy-based
methods are deterministic and stochastic (a small sketch follows this list).
● Model-Based – In this method, a virtual model of the environment is created, and the agent learns
to perform within that specific environment.
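A minimal sketch of the deterministic vs. stochastic policies mentioned under the policy-based approach. The states, actions, and probabilities below are made up purely for illustration:

    import random

    # Deterministic policy: each state maps to exactly one action.
    deterministic_policy = {"s0": "left", "s1": "right"}

    # Stochastic policy: each state maps to a probability distribution over actions.
    stochastic_policy = {
        "s0": {"left": 0.8, "right": 0.2},
        "s1": {"left": 0.3, "right": 0.7},
    }

    def sample_action(policy, state):
        # Draw an action according to the probabilities assigned by a stochastic policy.
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

Usage: deterministic_policy["s0"] always gives "left", while sample_action(stochastic_policy, "s0") gives "left" only about 80% of the time.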
Common RL Techniques:
● Q-Learning: A value-based method where the agent learns a Q-value for each state-action pair,
representing the expected future reward for taking an action in a given state (see the sketch after this list).
● Policy Gradient Methods: Focus on directly improving the policy by estimating the gradient of
the expected reward with respect to the policy parameters.
● Deep Reinforcement Learning: Combines RL with deep neural networks for powerful agents
capable of handling complex environments and high-dimensional states.
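To make the Q-learning bullet concrete, here is a minimal tabular sketch. The state/action counts and hyperparameters are assumptions chosen only for illustration:

    # Tabular Q-learning update (illustrative sketch).
    n_states, n_actions = 16, 4
    alpha, gamma = 0.1, 0.99          # learning rate and discount factor (assumed values)

    # Q-table: expected future reward for each (state, action) pair, initialised to zero.
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def q_update(state, action, reward, next_state):
        # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        best_next = max(Q[next_state])
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

Usage: after observing a transition (s, a, r, s'), call q_update(s, a, r, s') to move the estimate toward the observed return.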
Rewards
-> Positive reinforcement: Positive reinforcement occurs when an event, produced by a specific
behavior, increases the strength and frequency of that behavior. It has a positive impact on behavior.
Advantages:
● Maximizes the performance of an action
● Sustains change for a longer period
Disadvantage:
● Excess reinforcement can lead to an overload of states, which can diminish the results.
Example: In a game of Pong, the agent receives a reward for hitting the ball past the opponent's paddle.
-> Shaping rewards: Complex tasks might require breaking them down into smaller sub-goals with
associated rewards. These intermediate rewards can guide the agent's learning towards the final goal.
● Example: In a robot learning to walk, initial rewards might be given for taking a step, then for
maintaining balance, and finally for achieving forward movement.
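A hedged sketch of the shaped rewards described above for the walking-robot example. The state fields and weights are hypothetical, chosen only to show how intermediate sub-goals can be rewarded:

    def shaped_reward(state):
        # Intermediate rewards guide learning toward the final goal of walking forward.
        reward = 0.0
        if state["took_step"]:
            reward += 1.0                               # sub-goal: taking a step
        if state["is_balanced"]:
            reward += 2.0                               # sub-goal: maintaining balance
        reward += 5.0 * state["forward_velocity"]       # final goal: forward movement
        return reward

    # Usage: shaped_reward({"took_step": True, "is_balanced": True, "forward_velocity": 0.3}) -> 4.5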
Penalties
Advantages:
● Helps maximize the desired behavior
● Establishes at least a minimum standard of performance
Disadvantage:
● It only motivates the agent to meet that minimum standard of behavior
Example: In a maze navigation task, the agent receives a penalty for hitting a wall.
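A tiny sketch of how such a penalty could be combined with rewards in the maze example. The numeric values are assumptions for illustration:

    def maze_reward(reached_goal, hit_wall):
        # Combine a goal reward with penalties so the agent learns to avoid walls.
        if reached_goal:
            return 10.0      # large positive reward for solving the maze
        if hit_wall:
            return -1.0      # penalty for hitting a wall
        return -0.01         # small per-step cost to discourage aimless wandering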
● Magnitude: The magnitude of rewards and penalties can influence the learning speed and
effectiveness.
○ Larger rewards can encourage faster learning towards desired behavior, but smaller,
more frequent rewards can provide more continuous feedback.
● Sparsity: Rewards and penalties might not be received after every action, especially in complex
environments. The agent needs to learn to deal with delayed rewards and sparse feedback.
● Exploration vs. Exploitation: A balance needs to be struck between exploration (trying new
actions to learn about the environment) and exploitation (repeating actions known to yield
rewards); see the epsilon-greedy sketch after this list.
● Reward Engineering: Defining appropriate reward signals is crucial for effective RL. Rewards
should be clear, consistent, and aligned with the desired goals.
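The exploration/exploitation balance is often handled with an epsilon-greedy rule. A minimal sketch, where the decay schedule and values are assumptions:

    import random

    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995   # start fully exploratory, decay over time

    def epsilon_greedy(q_values):
        # q_values: estimated returns for each action in the current state.
        if random.random() < epsilon:
            return random.randrange(len(q_values))                       # explore: random action
        return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit: best-known action

    def decay_epsilon():
        # Gradually shift the agent from exploration toward exploitation.
        global epsilon
        epsilon = max(epsilon_min, epsilon * epsilon_decay)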
The Reinforcement Learning (RL) framework provides a structured approach for training agents to learn
and make decisions in an interactive environment. Unlike supervised learning with labeled data, RL
agents learn by trial and error, receiving rewards for desired actions and penalties for undesirable ones.
Core Elements:
● Agent: The learning entity that interacts with the environment and aims to maximize its
long-term reward.
● Environment: The system or world the agent operates in. It provides the agent with observations
(state) and rewards based on its actions.
● State: The representation of the environment relevant to the current situation (e.g., game board
configuration, robot's sensor readings). It captures the information necessary for the agent to
make decisions.
● Action: The choices the agent can make in a given state. These actions influence the
environment and the agent's future state.
● Reward: A signal indicating the desirability of an action. Positive rewards encourage repetition,
while negative rewards (penalties) discourage it. Rewards provide feedback to the agent about
the consequences of its actions.
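A hedged sketch of how these core elements map onto code for a toy environment. The class name, method names, and reward values are illustrative assumptions, loosely following common RL conventions rather than any specific library:

    class GridEnvironment:
        # Toy environment: positions 0..4 on a line; the goal is position 4.
        def __init__(self):
            self.state = 0

        def reset(self):
            # Start a new episode and return the initial observation (state).
            self.state = 0
            return self.state

        def step(self, action):
            # action: 0 = move left, 1 = move right.
            self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
            reward = 1.0 if self.state == 4 else -0.1   # reward signal from the environment
            done = self.state == 4                      # episode ends at the goal
            return self.state, reward, done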
The RL Loop:
● Perception: The agent observes the current state of the environment through sensors or other
information sources.
● Decision-Making: Based on the perceived state, the agent selects an action using its policy
(strategy). This policy defines the mapping between states and actions.
● Action: The agent takes the chosen action in the environment, potentially changing the
environment's state.
● Reward: The environment provides a reward signal based on the outcome of the action. This
reward reflects the desirability of the chosen action.
● Update: The agent updates its policy (learning) based on the observed reward and the transition
from the previous state to the current state. The goal is to adjust the policy to favor actions that
lead to higher rewards in the long run.
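Putting these steps together as a minimal sketch. This continues the GridEnvironment sketched under Core Elements; the hyperparameters and the tabular epsilon-greedy Q-update are assumptions used only to illustrate the loop:

    import random

    env = GridEnvironment()                     # toy environment sketched earlier (illustrative)
    Q = [[0.0, 0.0] for _ in range(5)]          # Q-values for 5 states x 2 actions
    alpha, gamma, epsilon = 0.1, 0.99, 0.1      # assumed hyperparameters

    for episode in range(50):
        state = env.reset()                                       # Perception: observe the initial state
        done = False
        while not done:
            if random.random() < epsilon:                         # Decision-Making: epsilon-greedy policy
                action = random.randrange(2)
            else:
                action = max(range(2), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)           # Action taken, Reward returned
            Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])  # Update
            state = next_state                                    # move on to the new state

Over repeated episodes the updates favor actions that lead to higher long-run reward, which is exactly the goal of the Update step above.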
Visit: www.collegpt.com