Introduction to Reinforcement Learning
Abstract
Reinforcement Learning (RL), a subfield of Artificial Intelligence (AI), focuses on training agents to
make decisions by interacting with their environment to maximize cumulative rewards. This paper pro-
vides an overview of RL, covering its core concepts, methodologies, and resources for further learning.
It offers a thorough explanation of fundamental components such as states, actions, policies, and reward
signals, ensuring readers develop a solid foundational understanding. Additionally, the paper presents
a variety of RL algorithms, categorized by key characteristics such as model-free versus model-based and value-based versus policy-based. Resources for learning and implementing RL, such as books, courses, and online communities, are also provided. By offering a clear, structured introduction, this paper aims to simplify the complexities of RL for beginners, providing a straightforward pathway to understanding and applying RL techniques.
1 Introduction
Reinforcement Learning (RL) is a subfield of Artificial Intelligence (AI) that focuses on training agents by interacting with the environment, aiming to maximize cumulative reward over time [1]. In contrast to supervised
learning, where the objective is to learn from labeled examples, or unsupervised learning, which is based
on detecting patterns in the data, RL deals with an autonomous agent that must make its own decisions and consequently learn from its actions, often without pre-existing data. The key idea is to learn how the world
works (e.g., what action gets a (positive) reward and which does not) to maximize cumulative rewards over
time through trial-and-error exploration. RL revolves around several key concepts: States, Actions, Policy,
Rewards, Transition Dynamics(Probabilities), and Environment Model. Each of these components plays a
crucial role in defining the agent’s interaction with its environment and the overall learning process, and
are defined over the next paragraphs. This paper assumes readers have basic knowledge of Machine Learning (ML) approaches such as Supervised and Unsupervised Learning.
A State (s ∈ S) represents a specific condition or configuration of the environment at a given time
as perceived by the agent. A state sets the scene for the agent to make choices and select actions, and the state space S contains all states the agent can encounter. For example, in chess, a state might be one specific layout of pieces on the board. In brief, states are the situations that an agent can be in and observe to make decisions. States can be discrete (as in the chess example) or continuous (e.g.,
the position of a robot, in terms of x and y coordinates). Actions (a ∈ A) are the set of possible moves
or decisions an agent can make while interacting with the environment. A selected action is part of the
strategy followed by an agent to reach its desired goals according to its current states and policy. In the
chess example, moving a piece from one square to another is an action. Similar to states, actions can be
discrete or continuous. The chess example has a discrete action set: the moves permitted by the rules form a finite set. For continuous actions, consider the robot example: a robot can change both its coordinates and its movement speed, which are continuous quantities. A policy (π) guides
the behavior of a learning agent by mapping perceived states of the environment into actions. This could
be a simple function, a lookup table, or a complex computation, e.g., approximated by Neural Networks.
Policies can be stochastic, defining the likelihood of taking certain actions. A stochastic policy (π(a|s)) defines
a probability distribution over possible actions for a given state. Instead of selecting a single action, it
samples actions based on the probabilities defined by the policy. On the other hand, a deterministic policy
(π(s) = a) directly maps a state to a specific action. For a given state, the action taken is fixed and does not
involve any randomness. In the game of chess, an agent’s policy determines the move it makes given the
current board configuration (state). Imagine an agent evaluating its options in a given state of the board.
It assigns probabilities to each legal move based on their estimated utility. For instance: π(a1 |s) = 0.5,
π(a2 |s) = 0.3, π(a3 |s) = 0.2. Here, a1 , a2 , and a3 are possible moves, and the policy specifies a 50% chance
of selecting a1 , 30% for a2 , and 20% for a3 . The agent chooses its move by sampling from this distribution.
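As a concrete illustration, the short Python sketch below samples a move from the distribution above; the action labels and probabilities are purely illustrative and are not produced by a real chess engine.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stochastic policy pi(a|s) for one particular board state s:
# three legal moves with probabilities 0.5, 0.3, and 0.2.
actions = ["a1", "a2", "a3"]
probs = [0.5, 0.3, 0.2]

# Stochastic policy: sample a move from the distribution pi(.|s).
sampled_move = rng.choice(actions, p=probs)

# Deterministic policy: always return the same fixed move for this state,
# here taken to be the most probable one.
fixed_move = actions[int(np.argmax(probs))]

print("stochastic policy chose:", sampled_move)
print("deterministic policy chose:", fixed_move)
```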
This approach introduces diversity in decision-making, which can be particularly useful when exploring
novel strategies or dealing with uncertain opponents. Conversely, a deterministic policy maps the current
state s to a specific move a (e.g., π(s) = a1 ). In this case, the agent always chooses a1 (e.g., moving the
queen to a specific square) when the board configuration matches s. Deterministic policies are efficient
in well-defined and predictable scenarios, such as when the agent has already learned the optimal moves
for most board configurations. Rewards (r ∈ R) are a critical factor in RL since they provide the agent with an objective at each time step, defining both local and global goals that the agent aims to achieve over time. Rewards differentiate positive from negative events and help update policies according to the outcomes of actions. Rewards depend on the state of the environment and the actions taken. For instance, in chess, winning the game leads to a positive reward, while losing leads to a negative reward; the game might also end in a draw, which leads to no reward. The reward function defines the immediate reward obtained when taking action a in state s. It can be deterministic (R(s, a) = r)
or stochastic (P(r|s, a)). The transition dynamics function defines the probability of reaching a new state s′ given the current state s and action a; it can be written mathematically as P(s′|s, a). An environment model is an approximation of the environment that predicts what will happen next (the next state and reward) given a particular state and action. Such models help with planning by identifying
what actions should be taken based on possible future events. RL approaches that use models are called
model-based methods, whereas those relying solely on trial-and-error learning are model-free methods. This
characteristic is an important factor in choosing an algorithm suitable for the problem. The
environment’s dynamics can be represented as a transition function and a reward function. Together, these
form the Markov Decision Process (MDP):
M = (S, A, P, R, γ)
where S is the state space, A is the action space, P(s′|s, a) is the transition dynamics, R(s, a) is the reward function, and γ ∈ [0, 1] is the discount factor, a parameter that determines how much importance is given to future rewards compared to immediate rewards.
Figure 1 illustrates the interaction between an RL agent and its environment. The agent observes the
current state (St ), selects an action (At ) based on its policy (π), and receives a reward based on the reward
distribution (Rt ) along with the next state (St+1 ). This feedback loop is critical for the agent to learn and
update its policy to maximize cumulative rewards.
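This interaction loop can be written in a few lines of Python. The sketch below is a minimal, hypothetical version: the `env.reset()` / `env.step(action)` interface (returning the next state, reward, and a done flag) and the `policy(state)` function are assumptions in the style of common RL toolkits, not definitions taken from this paper.

```python
def run_episode(env, policy, gamma=0.99):
    """Minimal agent-environment loop: observe S_t, act A_t, receive R_{t+1} and S_{t+1}."""
    state = env.reset()                                # initial state S_0
    discounted_return, t, done = 0.0, 0, False
    while not done:
        action = policy(state)                         # A_t selected by the agent's policy
        next_state, reward, done = env.step(action)    # environment feedback: R_{t+1}, S_{t+1}
        discounted_return += (gamma ** t) * reward     # accumulate the discounted return
        state = next_state                             # continue from S_{t+1}
        t += 1
    return discounted_return
```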
The rest of the paper is organized as follows: Section 2 introduces background and key concepts in RL, starting with the multi-armed bandit. Bandits are a great place to start learning RL's foundational concepts, such as value functions and the Bellman equations. A complete introduction to core RL methods is given in section 3. Section 4 analyzes essential RL algorithms, categorizing them in a comprehensive manner. In section 5, useful resources for learning RL are provided for readers who would like to delve deeper into the realm of RL. Finally, section 6 concludes the paper.
Figure 1: Overview of Reinforcement Learning [1]
of most RL algorithms. Together, these topics lay the groundwork for exploring the core RL methods in the
next section.
where 1{A_i=a} is an indicator function that equals 1 if action a was chosen at step i, and 0 otherwise. Over time, Qt(a) converges to q∗(a) by the law of large numbers [4], provided all actions are sampled sufficiently often.
$$\mathbb{1}_{\{A_i = a\}} = \begin{cases} 1 & \text{if action } a \text{ is taken at time step } i \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
Equation 2 can be written in a different way as follows, which represents the action value Qt (a) as the
average of the rewards received from that action up to time step t.
where Nt (a) is the number of times action a has been chosen, and c > 0 controls the exploration rate. UCB
prioritizes actions with high uncertainty or fewer samples. Optimistic initialization assigns high initial
values to Q1 (a), encouraging exploration of all actions early on. For example, setting Q1 (a) = +5 makes
untried actions appear attractive initially.
These methods ensure all actions are sampled enough to estimate their true value accurately, making the
selection of the best action almost certain over time. However, these are theoretical long-term benefits and
may not directly indicate practical effectiveness [6]. To begin with, estimating action values by storing and re-averaging all observed rewards becomes inefficient as the number of samples grows. A more efficient, incremental approach starts from the estimate after n − 1 rewards,
$$Q_n \equiv \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1}. \tag{7}$$
We can express the update rule for action value recursively
$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right] \tag{8}$$
This recursive equation requires memory only for Qn and n, with minimal computation for each reward [7].
Sample averages are appropriate for stationary bandit problems. In non-stationary environments, recent rewards are more relevant, and using a constant step-size parameter (which weights recent rewards more heavily than older ones) mitigates issues related to non-stationarity [1].
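As a small illustration of the constant step-size idea, the sketch below tracks a slowly drifting reward mean with a fixed α; the reward stream and the drift are synthetic and chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
alpha = 0.1        # constant step size: recent rewards are weighted exponentially more
q_estimate = 0.0   # current action-value estimate

for t in range(1, 2001):
    true_mean = 0.001 * t                          # non-stationary: the true value drifts upward
    reward = true_mean + rng.normal()              # noisy reward sample
    q_estimate += alpha * (reward - q_estimate)    # Q <- Q + alpha * (R - Q)

print(f"estimate {q_estimate:.2f} vs. current true mean {0.001 * 2000:.2f}")
```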
The initial estimates of action values, Q1(a), play a crucial role in the learning process. These initial guesses influence the early decisions made by the agent. While sample-average methods can reduce this initial bias after each action is chosen at least once, methods using a constant step-size parameter, α, tend to mitigate this bias more gradually over time.
Algorithm 1 A simple bandit algorithm
1: Initialize:
2: for a = 1 to k do
3:     Q(a) ← 0
4:     N(a) ← 0
5: end for
6: Repeat forever:
7:     A ← argmax_a Q(a) with probability 1 − ϵ, or a random action with probability ϵ
8:     R ← bandit(A)
9:     N(A) ← N(A) + 1
10:    Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
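As a rough Python translation of Algorithm 1, the sketch below uses Gaussian rewards as a stand-in for the unspecified bandit(A) call; the number of arms, reward means, and hyperparameters are illustrative.

```python
import numpy as np

def simple_bandit(true_means, steps=10_000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit with incremental sample-average updates (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)          # action-value estimates Q(a)
    N = np.zeros(k)          # visit counts N(a)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore: random action
        else:
            a = int(np.argmax(Q))            # exploit: greedy action
        r = rng.normal(true_means[a], 1.0)   # stand-in for bandit(A)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # Q(A) <- Q(A) + (1/N(A)) [R - Q(A)]
    return Q

print(np.round(simple_bandit([0.2, 0.8, 0.5]), 2))   # estimates approach the true means
```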
Setting optimistic initial values can be particularly advan-
tageous. By assigning higher initial estimates (e.g., +5), the agent is encouraged to explore more actions
early on. This is because the initial optimism makes untried actions appear more attractive, thus promoting
exploration even when the agent uses a greedy strategy. This approach helps ensure that the agent thor-
oughly investigates the action space before converging to a final policy. However, this strategy requires
careful consideration in defining the initial values, which are often set to zero in standard practice. The
choice of initial values should reflect an informed guess about the potential rewards, and overly optimistic
values can prevent the agent from converging efficiently if not properly managed. Overall, optimistic ini-
tial values can be a useful technique to balance exploration and exploitation in RL, encouraging broader
exploration and potentially leading to more optimal long-term policies [8].
In the next subsection, another fundamental concept in RL, the Markov Decision Process (MDP), is introduced.
$$r(s, a, s') \equiv \mathbb{E}\{R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'\} = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)} \tag{12}$$
The concept of actions encompasses any decisions relating to learning, and the concept of states encom-
passes any information that is available in order to inform those decisions. As part of the MDP framework,
goal-directed behavior is abstracted through interaction. Any learning problem can be reduced to three
signals between an agent and its environment: actions, states, and rewards. A wide range of applications
have been demonstrated for this framework [1]. We are now able to formally define and solve RL problems.
We have defined rewards, objectives, probability distributions, the environment, and the agent. Some con-
cepts, however, were defined informally. As stated above, the agent seeks to maximize future
rewards, but how can this be mathematically expressed? The return, denoted Gt , is the cumulative sum of
rewards received from time step t onwards. For episodic tasks it is the (possibly discounted) sum of the rewards up to the terminal time step T; for continuing tasks it is the discounted sum
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},$$
where γ is the discount rate (0 ≤ γ ≤ 1). The discount rate affects the current worth of future rewards.
When γ < 1, the infinite sum converges to a finite value. With γ = 0, the agent maximizes immediate re-
wards. As γ approaches 1, future rewards carry more weight. We can also express the return recursively as G_t = R_{t+1} + γG_{t+1}. Even if the reward is non-zero and constant, the return remains finite as long as γ < 1. Equation 16 gives a unified notation that covers both episodic and continuing tasks, allowing either T = ∞ or γ = 1 (but not both).
$$G_t \equiv \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \tag{16}$$
The value of a state s under a policy π, vπ(s), is the expected return when starting in s and following π thereafter (Equation 17). On the other hand, the value of taking action a in state s under policy π, qπ(s, a), is the expected return starting from s, taking action a, and following π thereafter (Equation 18).
$$v_\pi(s) \equiv \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s \in \mathcal{S} \tag{17}$$
$$q_\pi(s, a) \equiv \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] \tag{18}$$
It is important to note the difference between v and q, namely that q depends on the actions taken in each
state. With ten states and eight actions per state, q requires estimating 80 values, while v requires only 10.
Following policy π, if an agent averages returns from each state, the average converges to vπ (s). Averaging
returns from each action converges to qπ (s, a) [1]. vπ (s) can be written recursively:
$$v_\pi(s) \equiv \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] \tag{19}$$
Equation 19 is the Bellman equation for vπ. The Bellman equation relates the value of a state to the values of its potential successor states: looking ahead from a state to its possible successors, the value of the starting state equals the expected reward plus the discounted value of the expected next state [11, 1].
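To see the Bellman equation at work, the sketch below repeatedly applies Equation 19 as an update on a tiny, invented two-state MDP until the values stop changing. It uses the expected-reward form R(s, a) instead of the full p(s′, r | s, a) distribution, which is an equivalent but more compact parameterization; all numbers are made up for illustration.

```python
import numpy as np

# Invented two-state, two-action MDP.
# P[s, a, s'] = p(s' | s, a); R[s, a] = expected immediate reward for taking a in s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],       # pi(a|s): the equiprobable random policy
               [0.5, 0.5]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman expectation backup:
    # v(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]
    v_new = (pi * (R + gamma * (P @ v))).sum(axis=1)
    converged = np.max(np.abs(v_new - v)) < 1e-10
    v = v_new
    if converged:
        break

print("v_pi =", np.round(v, 3))
```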
vπ (s) and qπ (s, a) serve different purposes in RL. In the evaluation of deterministic policies or when
understanding the value of being in a particular state is required, state-value functions are used. In policy
evaluation and policy iteration methods, where a policy is explicitly defined and it is necessary to evaluate
the performance of being in a particular state under the policy, these methods are highly useful. The use of
state-value functions is beneficial when there are many actions, since they reduce complexity by requiring
only an evaluation of state values. Action-value functions, on the other hand, are used to evaluate and compare different actions taken in the same state. They are crucial for action selection, where the goal is to determine the most appropriate action for each situation.
As action-value functions take into account the expected return of different actions, they are particularly
useful in environments with stochastic policies. Moreover, when dealing with continuous action spaces,
action-value functions can provide a more detailed understanding of the impact of actions, aiding in the
fine-tuning of policy implementation.
Example: Consider a gambling scenario where a player starts with $10 and faces decisions regarding the
amount to bet. This game illustrates state and action value functions in RL. The state value function (vπ(s))
quantifies expected cumulative future rewards for a state s, given policy π. Suppose the player has $5:
• With a consistent $1 bet, vπ (5) = 0.5 indicates an expected gain of $0.5.
• With a consistent $2 bet, vπ (5) = −1 indicates an expected loss of $1.
Action Value function (qπ (s, a)) assesses expected cumulative future rewards for action a in state s. For
instance:
• qπ (5, 1) = 1 suggests a $1 bet from $5 results in a cumulative reward of $1.
• qπ (5, 2) = −0.5 indicates a loss of $0.5 for a $2 bet from $5.
This gambling game scenario highlights the role of state and action value functions in RL, guiding optimal
decision-making in dynamic environments.
A policy is optimal if its expected return is greater than or equal to that of every other policy. Optimal policies, denoted π∗, share the same optimal state-value function v∗, which is defined as the maximum value function over all possible policies. Optimal policies also share the same optimal action-value function q∗, which is defined as the maximum action-value function over all possible policies. The two are directly related: given the optimal action-value function q∗(s, a), the optimal state-value function follows as
$$v_*(s) = \max_{a} q_*(s, a). \tag{22}$$
Optimal value functions and policies represent an ideal state in RL. It is however rare to find truly
optimal policies in computationally demanding tasks due to practical challenges [1]. RL agents strive to
approximate optimal policies. Dynamic Programming (DP) helps identify optimal values, assuming a perfect model of the environment, which is difficult to obtain in real-world cases. DP methods are also not sample-efficient, even though they are theoretically sound. The fundamental idea of DP and RL is using
value functions to organize the search for good policies. For finite MDPs, the environment’s dynamics are
given by probabilities p(s′ , r | s, a). The Bellman optimality equations for the optimal state value function
v∗ (s) and the optimal action value function q∗ (s, a) are Equations 23 and 24, respectively:
$$v_*(s) = \max_{a} \mathbb{E}\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right] = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right] \tag{23}$$
$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right] \tag{24}$$
In these equations, π(a | s) denotes the probability of taking action a in state s under policy π. The exis-
tence and uniqueness of vπ are guaranteed if γ < 1 or if all states eventually terminate under π. Dynamic
Programming (DP) algorithm updates are termed "expected updates" because they rely on the expectation
over all potential next states, rather than just a sample [1].
2.6 Policy Improvement
The purpose of calculating the value function for a policy is to identify improved policies. Suppose we have determined vπ for a deterministic policy π: for some state s, should we alter the policy to select an action a ≠ π(s)? We know how effective it is to adhere to the existing policy from state s (this is vπ(s)), but would switching to a new policy yield a superior outcome? We can answer this by selecting action a in s and thereafter following π. To determine whether the policy can be improved, we compare the value of taking a different action a in state s with the value under the current policy. This is done using the action-value function qπ(s, a):
$$q_\pi(s, a) \equiv \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] \tag{27}$$
The key criterion is whether this value exceeds vπ(s). If qπ(s, a) > vπ(s), consistently choosing action a in s is more advantageous than following π, leading to an improved policy π′. If there is strict inequality at any state, π′ is strictly better than π. Extending this idea to all states, the new policy π′ is obtained by selecting, in each state, the action that maximizes the action-value function qπ(s, a):
$$\pi'(s) \equiv \arg\max_{a} q_\pi(s, a) = \arg\max_{a} \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] \tag{30}$$
Policy improvement creates a new policy that enhances an initial policy by acting greedily with respect to its value function. Suppose the new policy π′ is as good as, but not better than, the original policy π. Then vπ = vπ′, and it follows that for all states s ∈ S:
$$v_{\pi'}(s) = \max_{a} \mathbb{E}\left[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a\right] = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_{\pi'}(s')\right] \tag{31}$$
Policy improvement yields a superior policy unless the initial policy is already optimal. This concept ex-
tends to stochastic policies. Stochastic policies introduce a set of probabilities for actions, with the action
most aligned with the greedy policy assigned the highest probability.
Alternating policy evaluation (E) and policy improvement (I) in this way produces a sequence of monotonically improving policies and value functions, π0 → vπ0 → π1 → vπ1 → π2 → · · · → π∗ → v∗. Each policy in this sequence is a marked improvement over its predecessor unless the preceding one is al-
ready optimal. Given a finite MDP, this iterative process converges to an optimal policy and value function
in a finite number of iterations. This method is called Policy Iteration. Policy iteration involves two pro-
cesses: policy evaluation aligns the value function with the current policy, and policy improvement makes
the policy greedier based on the value function. These processes iteratively reinforce each other until an
optimal policy is obtained.
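A compact sketch of policy iteration on a small tabular MDP (with invented dynamics, as in the earlier sketch) is shown below; it performs exact policy evaluation by solving the linear Bellman system and then improves the policy greedily, repeating until the policy no longer changes.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for a finite MDP with dynamics P[s, a, s'] and expected rewards R[s, a]."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly.
        P_pi = P[np.arange(n_states), policy]           # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]           # (n_states,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q_pi(s, a).
        q = R + gamma * (P @ v)                         # (n_states, n_actions)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):          # stable policy => done
            return policy, v
        policy = new_policy

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
print(policy_iteration(P, R))
```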
In value iteration, the key advantage is its efficiency, as it reduces the computational burden by merging
policy evaluation and improvement into a single update step. This method is particularly useful for large
state spaces where full policy evaluation at each step of policy iteration is computationally prohibitive [1].
Additionally, value iteration can be implemented using a synchronous update approach, where all state val-
ues are updated simultaneously, or an asynchronous update approach, where state values are updated one
at a time, potentially allowing for faster convergence in practice. Another notable aspect of value iteration
is its robustness to initial conditions. Starting from an arbitrary value function, value iteration iteratively
refines the value estimates until convergence, making it a reliable method for finding optimal policies even
when the initial policy is far from optimal [14]. Furthermore, value iteration provides a foundation for more
advanced algorithms by illustrating the principle of bootstrapping, where the value of a state is updated
based on the estimated values of successor states. This principle is central to many RL algorithms that seek
to balance exploration and exploitation in dynamic and uncertain environments [15].
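Under the same toy-MDP assumptions, value iteration reduces to a repeated Bellman optimality backup, as in the short sketch below; the dynamics are again invented purely for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Repeatedly apply v(s) <- max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    v = np.zeros(R.shape[0])                 # arbitrary initial value function
    while True:
        q = R + gamma * (P @ v)              # q[s, a] under the current value estimates
        v_new = q.max(axis=1)                # max backup merges evaluation and improvement
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)   # approximate optimal values and a greedy policy
        v = v_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, greedy_policy = value_iteration(P, R)
print(np.round(v_star, 3), greedy_policy)
```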
The concepts introduced in this section—ranging from bandits and MDPs to value functions and poli-
cies—provide the mathematical and conceptual tools required to understand RL methods. In the next
section, we explore how these foundational ideas evolve into core RL algorithms, bridging theory and ap-
plication.
3 Core RL Methods
Understanding the various methodologies and concepts within RL is essential for the effective design and
implementation of RL algorithms. Methods in RL can be classified as either off-policy or on-policy, and as either model-free or model-based. These categories offer different approaches and techniques for learning from
interactions with the environment.
Policy-based methods (which will be discussed in section 4) work by directly learning the policy without explicitly learning a value function. These meth-
ods adjust the policy parameters directly by following the gradient of the expected reward. This approach is
particularly useful in environments with high-dimensional action spaces where value-based methods may
not be effective. Policy-based methods are also capable of handling stochastic policies, providing a natural
framework for dealing with uncertainty in action selection. In addition to these primary types, there are
also hybrid approaches that combine value-based and policy-based methods, such as Actor-Critic algo-
rithms (which will be discussed in section 4). These methods consist of two main components: an actor
that updates the policy parameters in a direction suggested by the critic, and a critic that evaluates the
action-value function. Combining both types of learning is intended to provide more stable and efficient
learning [16].
Another significant advancement in model-free methods is the development of Deep RL (DRL). By inte-
grating deep neural networks with traditional RL algorithms, methods such as Deep Q-Networks (DQN)
[17] and Proximal Policy Optimization (PPO) [18] have achieved remarkable success in complex, high-
dimensional environments, including games and robotic control tasks. The advancement of these tech-
nologies has opened up new possibilities for the application of RL to real-world problems, enabling the
demonstration of robust performance in domains which were previously intractable. It is beyond the scope
of this paper to discuss these algorithms, and we refer the reader to [19, 20, 21, 22] to understand DRL deeply
and effectively.
It is possible to predict the outcomes of actions using model-based methods, which facilitate strategic
planning and decision-making. The use of these methods enhances learning efficiency by providing op-
portunities for virtual experimentation, despite the complexity of developing and refining accurate models
[7]. Autonomous driving systems are an example of how model-based methods can be applied in the real
world. As autonomous vehicles navigate in dynamic environments, obstacle avoidance, and optimal rout-
ing must be made in real time. Autonomous vehicles create detailed models of their environment. These
models include static elements, such as roads and buildings, as well as dynamic elements, such as other
vehicles and pedestrians. Sensor data, including cameras, LIDAR, and radar, are used to build this model.
Through the use of the environmental model, the vehicle is capable of predicting the outcome of various
actions. For instance, when a vehicle considers changing lanes, it uses its model to predict the behavior of
surrounding vehicles to determine the safest and most efficient way to make the change. The model assists
the vehicle in planning its route and making strategic decisions. To minimize travel time, avoid congestion,
and enhance safety, it evaluates different routes and actions. Simulation allows the vehicle to select the
best course of action by simulating various scenarios before implementation in the real world. The vehicle,
for example, may use the model to simulate different actions in the event of a busy intersection, such as
waiting for a gap in traffic or taking an alternate route. Considering the potential outcomes of each action,
the vehicle can make an informed decision that balances efficiency with safety. In addition to improving the
ability of autonomous vehicles to navigate safely and efficiently in real-world conditions, this model-based
approach enables them to make complex decisions with a high level of accuracy. As a result of continuously
refining the model based on new data, the vehicle is able to enhance its decision-making capabilities over
time, thereby improving performance and enhancing safety on the road.
There are several advantages to using model-based methods over methods that do not use models.
By simulating future states and rewards, they can plan and evaluate different action sequences without
interacting directly with the environment. It is believed that this capability may lead to a faster convergence
to an optimal policy, since learning can be accelerated by leveraging the model’s predictions. A model-
based approach can also adapt more quickly to changes in the environment, since it enables the model to
be updated and re-planned accordingly. Although model-based methods have many advantages, they also
face a number of challenges, primarily with regard to accuracy and computational cost. Building a sufficiently accurate, high-fidelity model of the environment is difficult. Moreover, the planning
process may be computationally expensive, especially in environments with a large number of states and
actions. However, advances in computing power and algorithms continue to improve the feasibility and
performance of model-based methods, making them a valuable approach in RL [23].
3.2 Off-Policy and On-Policy Methods
On-policy and off-policy learning are methodologies within model-free learning approaches, not relying
on environment transition probabilities. They are classified based on the relationship between the behavior
policy and the updated policy [1]. On-policy methods evaluate and improve the policy used to make
decisions, intertwining exploration and learning. These methods update the policy based on the actions
taken and the rewards received while following the current policy (π). This ensures that the policy being
optimized is the one actually used to interact with the environment, allowing for a coherent learning process
where exploration and policy improvement are naturally integrated.
Off-policy methods, on the other hand, involve learning the value of the optimal policy independently
of the agent’s actions. In these methods, we distinguish between two types of policies: the behavior policy
(b) and the target policy (π). The behavior policy explores the environment, while the target policy aims
to improve performance based on the gathered experience. This allows for a more exploratory behavior
policy while learning an optimal target policy. A significant advantage of off-policy methods is that they
can learn from data generated by any policy, not just the one currently being followed, making them highly
flexible and sample-efficient. The decoupling of the behavior and target policies allows off-policy methods
to reuse experiences more effectively. For instance, experiences collected using a behavior policy that ex-
plores the environment broadly can be used to improve the target policy, which aims to maximize rewards.
This characteristic makes off-policy methods particularly powerful in dynamic and complex environments
where extensive exploration is required [24, 25].
The relationship between the target policy and the behavior policy determines if a method is on-policy
or off-policy. Identical policies indicate on-policy, while differing policies indicate off-policy. Implementa-
tion details and objectives also influence classification. To better distinguish these methods, we must first understand the different policies involved. The behavior policy b is the strategy the agent uses to determine which actions to take at each time step. In a movie recommendation system, for example, the behavior policy might recommend a variety of movies in order to explore user preferences. The target policy π governs how the agent updates its value estimates in response to observed outcomes. Depending on the feedback received from the recommended movies, the target policy of the recommendation system may update the estimated user preferences. A thorough understanding of the interactions between these policies
is essential for the implementation of effective learning systems. An agent’s behavior policy determines
how it explores an environment, balancing exploration with exploitation to gather useful information. Al-
ternatively, the target policy determines how the agent learns from these experiences in order to improve its
estimates of value. When using on-policy methods, the behavior policy and the target policy are the same,
meaning that the actions taken to interact with the environment are also used to update the value estimates.
The result is stable learning, but it can be less efficient because the policy may not sufficiently explore the
state space [26]. There is a difference between the behavior policy and the target policy in off-policy meth-
ods. As opposed to the behavior policy, the target policy focuses on optimizing the value estimates by
taking the most appropriate action. Despite the fact that this separation can make learning more efficient,
it can also introduce instability if the behavior policy diverges too far from the optimal policy [15, 25]. Fur-
thermore, advanced methods such as Actor-Critic algorithms separate decision-making from evaluation: the actor selects actions according to the current policy, while the critic evaluates these decisions and provides feedback to improve the policy, thus combining the stability of on-policy methods with the efficiency of off-policy methods [27, 28].
Understanding the core methodologies of RL, such as model-free and model-based approaches, as well
as the distinction between off-policy and on-policy methods, provides a foundational framework for ex-
ploring the diverse landscape of RL algorithms. These methodologies not only shape how agents learn
from their environments but also influence their adaptability and efficiency in complex, real-world scenar-
ios. Building on this understanding, the next section delves deeper into the specific essential algorithms
underpinning these methods, focusing on the policy-based, value-based, and hybrid approaches, to pro-
vide a clearer picture of their mechanisms and applications in RL.
4 Essential Algorithms
This section aims to present a concise overview of key algorithms discussed thus far, accompanied by
references to the original research papers for further exploration. Each algorithm is briefly described, and
real-world examples are included to enhance understanding. Readers seeking detailed information are
encouraged to consult the cited references, which serve as a gateway to the primary sources and support
a deeper learning experience. We categorize algorithms into three types: Value-based, Policy-based, and
Hybrid Algorithms. For each type, we analyze one to two widely-used algorithms, acknowledging that
there are more algorithms to discover.
4.1 Value-based
We introduced value-based methods and analyzed how they work. Here, we first introduce tabular value-based algorithms and then discuss value-based methods that use Deep Learning, such as Deep Q-Networks.
A significant breakthrough was made by [29] with the introduction of Q-learning, a model-free algorithm classified as off-policy Temporal Difference (TD) control. TD learning is arguably the most fundamental and novel idea in RL. It combines elements of Monte Carlo (MC) methods and Dynamic Programming (DP). On one hand, similar to MC approaches, TD learning can be used
to acquire knowledge from unprocessed experience without the need for a model that describes the dy-
namics of the environment. On the other hand, TD algorithms are also similar to DP in that they refine
predictions using previously learned estimates instead of requiring a definitive outcome in order to pro-
ceed (known as bootstrapping). Q-learning enables an agent to learn the value of an action in a particular
state through experience, without requiring a model of the environment. It operates on the principle of
learning an action-value function that gives the expected utility of taking a given action in each state and
following a fixed policy thereafter. The core of the Q-learning algorithm involves updating the Q-values
(action-value pairs), where the learned action-value function, denoted as Q, approximates q∗ , the optimal
action-value function, regardless of the policy being followed. This significantly simplifies the algorithm’s
analysis and has facilitated early proofs of convergence. However, the policy still influences the process by
determining which state-action pairs are visited and subsequently updated.
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right] \tag{34}$$
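Equation 34 maps directly onto a tabular update. The sketch below assumes a Gym-style episodic environment interface (`reset()`, `step(action)` returning the next state, reward, and a done flag) and illustrative hyperparameters; these are not part of the original Q-learning formulation.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning: off-policy TD control with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Equation 34: bootstrap from the greedy action in the next state.
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```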
Other variants of Q-learning have been introduced in the literature with slight changes and improvements, such as Double Q-learning [30], which addresses the overestimation bias of Q-learning, and Distributional Q-learning [31], which models the distribution of returns instead of estimating only the mean Q-value, providing richer information for decision-making, among many others [32, 33, 34].
Another widely used value-based algorithm is Deep Q-Networks (DQN), which merges Q-learning
with Neural Networks to learn control policies directly from raw pixel inputs. It uses Convolutional Neu-
ral Networks (CNN) to process these inputs and an experience replay mechanism to stabilize learning by
breaking correlations between consecutive experiences. The target network, updated less frequently, aids
in stabilizing training. DQN achieved state-of-the-art performance on various Atari 2600 games, surpassing
previous methods and, in some cases, human experts, using a consistent network architecture and hyperpa-
rameters across different games [17]. DQN combines the introduced Bellman Equation with DL approaches
like Loss Function and Gradient Descent to find the optimal policy as below:
$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right] \tag{35}$$
where
$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta^-) \tag{36}$$
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right] \tag{37}$$
Similar to Q-learning, there have been updates made to DQN. Some of the variations are [35, 36, 37].
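As an illustration of Equations 35 and 36, a single DQN loss computation on a mini-batch sampled from the replay memory D might be written in PyTorch as follows; the network sizes, input dimension, and batch format are placeholders and do not reproduce the convolutional architecture of [17].

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """DQN loss (Eq. 35) with the frozen-target bootstrap value y_i (Eq. 36)."""
    s, a, r, s_next, done = batch            # a: int64 actions; done: 0/1 float flags
    # Q(s, a; theta_i) for the actions actually taken in the batch.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                    # theta^- (target network) is held fixed
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)

# Placeholder networks for a 4-dimensional state and 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # periodically synced copy of q_net
```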
Table 1: Essential RL Algorithms
1 Q-Learning [29] - Model-free, Off-policy, Value-based
2 SARSA (State-Action-Reward-State-Action) [26] - Model-free, On-policy, Value-based
3 Expected SARSA [38] - Model-free, On-policy, Value-based
4 REINFORCE [39] - Model-free, On-policy, Policy-based
5 Dyna-Q [40] - Model-based, Off-policy, Hybrid
6 DQN [17] - Model-free, Off-policy, Value-based
7 TRPO [41] - Model-free, On-policy, Policy-based
8 PPO [18] - Model-free, On-policy, Policy-based
9 SAC (Soft Actor-Critic) [33] - Model-free, Off-policy, Hybrid
10 A3C [42] - Model-free, On-policy, Hybrid
11 A2C [28] - Model-free, On-policy, Hybrid
12 DDPG (Deep Deterministic Policy Gradient) [28] - Model-free, Off-policy, Policy-based
13 TD3 (Twin Delayed Deep Deterministic Policy Gradient) [43] - Model-free, Off-policy, Policy-based
4.2 Policy-based
Moving on from value-based methods, we analyze some policy-based algorithms in this section. Policy-
based methods are another fundamental RL method that more strongly emphasizes direct policy optimiza-
tion in the process of choosing actions for an agent. In contrast to Value-based methods, which search for
the value function implicit in the task, and then derive an optimal policy, Policy-based methods directly
parameterize and optimize the policy. This approach offers several advantages, particularly in challenging environments with high-dimensional action spaces or where policies are inherently stochastic. At their core, policy-based methods operate on a parameterized policy, usually denoted π(a|s; θ), where θ denotes the parameters of the policy, s the state, and a the action. The goal is to find the optimal parameters θ∗ that maximize the expected cumulative reward. This is generally done with gradient ascent techniques, and more specifically with Policy Gradient methods, which explicitly compute the gradient of the expected reward with respect to the policy parameters and modify the parameters in the direction of increasing reward [1, 10, 44].
REINFORCE is one of the widely-used policy-based algorithms. The REINFORCE algorithm is a sem-
inal contribution to RL, particularly within the context of policy gradient methods. The algorithm is de-
signed to optimize the expected cumulative reward by adjusting the policy parameters in the direction of
the gradient of the expected reward. It is rooted in the stochastic policy framework, where the policy, pa-
rameterized by θ, defines a probability distribution over actions given the current state. The key insight of
the REINFORCE algorithm is to use the log-likelihood gradient estimator to update the policy parameters
[39]. The gradient of the expected reward with respect to the policy parameters θ is given by
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\right],$$
where G_t is the return from time step t. In more recent policy gradient methods, such as TRPO and PPO, policies are optimized through stochastic gradient ascent by estimating the gradient of the policy. One of the most commonly used policy gradient estimators is:
$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t\right], \tag{40}$$
where πθ represents the policy parameterized by θ, and Ât is an estimator of the advantage function at time step t. This estimator helps construct an objective function whose gradient corresponds to the policy gradient estimator:
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t) \, \hat{A}_t\right]. \tag{41}$$
PPO simplifies TRPO by using a surrogate objective with a clipped probability ratio, allowing multiple epochs of mini-batch updates. Clipping discourages large policy updates, which helps keep learning stable.
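A sketch of this clipped surrogate, assuming that log-probabilities under the new and old policies and the advantage estimates have already been computed, might look as follows in PyTorch; the clipping threshold of 0.2 is a commonly used value, not a requirement of the method.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate: L = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
    ratio = torch.exp(log_prob_new - log_prob_old)   # probability ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negated because optimizers minimize; maximizing the surrogate = minimizing its negative.
    return -torch.min(unclipped, clipped).mean()
```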
Table 2: Reinforcement Learning Resources
Books
1 "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
2 "Deep Reinforcement Learning Hands-On" by Maxim Lapan
3 "Grokking Deep Reinforcement Learning" by Miguel Morales
4 "Algorithms for Reinforcement Learning" by Csaba Szepesvári
Online Courses
5 Coursera: Reinforcement Learning Specialization by University of Alberta
6 Udacity: Deep Reinforcement Learning Nanodegree
7 edX: Fundamentals of Reinforcement Learning by University of Alberta
8 Reinforcement Learning Winter 2019 (Stanford)
Video Lectures
9 DeepMind x UCL — Reinforcement Learning Lecture Series
10 David Silver's Reinforcement Learning Course
11 Pascal Poupart's Reinforcement Learning Course - CS885
12 Sarath Chandar's Reinforcement Learning Course
Tutorials and Articles
13 OpenAI Spinning Up in Deep RL
14 Deep Reinforcement Learning Course by PyTorch
15 RL Adventure by Denny Britz
Online Communities and Forums
16 Reddit: r/reinforcementlearning
17 Stack Overflow
18 AI Alignment Forum
6 Conclusion
This paper presents an introductory exploration of the fundamental concepts and methodologies of Re-
inforcement Learning (RL), tailored to beginners. It establishes a foundational understanding of how RL
agents learn and make decisions by thoroughly examining key components such as states, actions, policies,
and reward signals. By analyzing the multi-armed bandit problem, it introduces the background of RL in an accessible and easy-to-understand way. The primary objective is to offer an overview of a wide range
of RL algorithms, encompassing both model-free and model-based approaches, thereby highlighting the
diversity within the field. Through this guide, we aim to equip new learners with the essential knowledge
and confidence to begin their journey into RL.
References
[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[2] M. Li, C. Shi, Z. Wu, and P. Fryzlewicz, “Testing stationarity and change point detection in reinforce-
ment learning,” arXiv preprint arXiv:2203.01707, 2022.
[3] S. B. Thrun, Efficient exploration in reinforcement learning. Carnegie Mellon University, 1992.
[4] P. Erdős and A. Rényi, “On a new law of large numbers,” Journal d'Analyse Mathématique, vol. 23, pp. 103–111, 1970.
[5] A. Garivier and E. Moulines, “On upper-confidence bound policies for switching bandit problems,” in
International conference on algorithmic learning theory, pp. 174–188, Springer, 2011.
[6] P. Ladosz, L. Weng, M. Kim, and H. Oh, “Exploration in deep reinforcement learning: A survey,”
Information Fusion, vol. 85, pp. 1–22, 2022.
[7] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu, “A survey on model-based reinforcement
learning,” Science China Information Sciences, vol. 67, no. 2, p. 121101, 2024.
[8] M. A. Wiering and M. Van Otterlo, “Reinforcement learning,” Adaptation, learning, and optimization,
vol. 12, no. 3, p. 729, 2012.
[9] M. Van Otterlo and M. Wiering, “Reinforcement learning and markov decision processes,” in Rein-
forcement learning: State-of-the-art, pp. 3–42, Springer, 2012.
[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of artificial
intelligence research, vol. 4, pp. 237–285, 1996.
[11] B. O’Donoghue, I. Osband, R. Munos, and V. Mnih, “The uncertainty bellman equation and explo-
ration,” in International conference on machine learning, pp. 3836–3845, 2018.
[12] D. P. Bertsekas, “Approximate policy iteration: A survey and some new methods,” Journal of Control
Theory and Applications, vol. 9, no. 3, pp. 310–335, 2011.
[13] M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg, “Value iteration in continuous actions, states and
time,” arXiv preprint arXiv:2105.04682, 2021.
[14] D. Bertsekas, Dynamic programming and optimal control: Volume I, vol. 4. Athena scientific, 2012.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller,
A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” na-
ture, vol. 518, no. 7540, pp. 529–533, 2015.
[16] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learn-
ing: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, part C
(applications and reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing
atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algo-
rithms,” arXiv preprint arXiv:1707.06347, 2017.
[19] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint arXiv:1701.07274, 2017.
[20] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “A brief survey of deep reinforce-
ment learning,” arXiv preprint arXiv:1708.05866, 2017.
[21] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau, et al., “An introduction to deep
reinforcement learning,” Foundations and Trends® in Machine Learning, vol. 11, no. 3-4, pp. 219–354,
2018.
[22] S. E. Li, “Deep reinforcement learning,” in Reinforcement learning for sequential decision and optimal con-
trol, pp. 365–402, Springer, 2023.
[23] T. M. Moerland, J. Broekens, A. Plaat, C. M. Jonker, et al., “Model-based reinforcement learning: A
survey,” Foundations and Trends® in Machine Learning, vol. 16, no. 1, pp. 1–118, 2023.
[24] P. Dayan and C. Watkins, “Q-learning,” Machine learning, vol. 8, no. 3, pp. 279–292, 1992.
[25] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar,
and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Proceedings of
the AAAI conference on artificial intelligence, vol. 32, 2018.
[26] G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, vol. 37. University of
Cambridge, Department of Engineering Cambridge, UK, 1994.
[27] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” Advances in neural information processing systems,
vol. 12, 1999.
[28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous
control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[29] C. J. C. H. Watkins, “Learning from delayed rewards,” 1989.
[30] H. Hasselt, “Double q-learning,” Advances in neural information processing systems, vol. 23, 2010.
[31] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,”
in International conference on machine learning, pp. 449–458, PMLR, 2017.
[32] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Hysteretic q-learning: an algorithm for decentralized
reinforcement learning in cooperative multi-agent teams,” in 2007 IEEE/RSJ International Conference on
Intelligent Robots and Systems, pp. 64–69, IEEE, 2007.
[33] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor,” in International conference on machine learning, pp. 1861–
1870, PMLR, 2018.
[34] J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang, “Qplex: Duplex dueling multi-agent q-learning,” arXiv
preprint arXiv:2008.01062, 2020.
[35] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in
Proceedings of the AAAI conference on artificial intelligence, vol. 30, 2016.
[36] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures
for deep reinforcement learning,” in International conference on machine learning, pp. 1995–2003, PMLR,
2016.
[37] T. Schaul, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
[38] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, “A theoretical and empirical analysis
of expected sarsa,” in 2009 ieee symposium on adaptive dynamic programming and reinforcement learning,
pp. 177–184, IEEE, 2009.
[39] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learn-
ing,” Machine learning, vol. 8, pp. 229–256, 1992.
[40] R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating
dynamic programming,” in Machine learning proceedings 1990, pp. 216–224, Elsevier, 1990.
[41] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in
International conference on machine learning, pp. 1889–1897, PMLR, 2015.
[42] V. Mnih, “Asynchronous methods for deep reinforcement learning,” arXiv preprint arXiv:1602.01783,
2016.
[43] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic meth-
ods,” in International conference on machine learning, pp. 1587–1596, PMLR, 2018.
[44] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International
Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[45] H. Wei, X. Liu, L. Mashayekhy, and K. Decker, “Mixed-autonomy traffic control with proximal policy
optimization,” in 2019 IEEE Vehicular Networking Conference (VNC), pp. 1–8, IEEE, 2019.
[46] B. Zhang, X. Lu, R. Diao, H. Li, T. Lan, D. Shi, and Z. Wang, “Real-time autonomous line flow control
using proximal policy optimization,” in 2020 IEEE Power & Energy Society General Meeting (PESGM),
pp. 1–5, IEEE, 2020.
[47] J. Jin and Y. Xu, “Optimal policy characterization enhanced proximal policy optimization for multitask
scheduling in cloud computing,” IEEE Internet of Things Journal, vol. 9, no. 9, pp. 6418–6433, 2021.
[48] L. Zhang, Y. Zhang, X. Zhao, and Z. Zou, “Image captioning via proximal policy optimization,” Image
and Vision Computing, vol. 108, p. 104126, 2021.
[49] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, and K. Li, “Centralized cooperation for connected and auto-
mated vehicles at intersections by proximal policy optimization,” IEEE Transactions on Vehicular Tech-
nology, vol. 69, no. 11, pp. 12597–12608, 2020.
[50] E. Bøhn, E. M. Coates, S. Moe, and T. A. Johansen, “Deep reinforcement learning attitude control
of fixed-wing uavs using proximal policy optimization,” in 2019 international conference on unmanned
aircraft systems (ICUAS), pp. 523–533, IEEE, 2019.
[51] G. C. Lopes, M. Ferreira, A. da Silva Simões, and E. L. Colombini, “Intelligent control of a quadrotor
with proximal policy optimization reinforcement learning,” in 2018 Latin American Robotic Symposium,
2018 Brazilian Symposium on Robotics (SBR) and 2018 Workshop on Robotics in Education (WRE), pp. 503–
508, IEEE, 2018.
[52] F. Ye, X. Cheng, P. Wang, C.-Y. Chan, and J. Zhang, “Automated lane change strategy using proximal
policy optimization-based deep reinforcement learning,” in 2020 IEEE Intelligent Vehicles Symposium
(IV), pp. 1746–1752, IEEE, 2020.
[53] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement
learning with function approximation,” Advances in neural information processing systems, vol. 12, 1999.