RL
Created by :
Mohamed MAHDI
Raed BOUSSAA
Class :
1ATEL 1
Supervised by :
Mme. Wafa MEFTEH
2023-2024
Table of Contents
Acknowledgments
General Introduction
1 Tabular Techniques in RL
1.1 Multi-armed Bandit
1.1.1 A k-armed Bandit Problem
1.1.2 Action-value Methods
1.2 Finite Markov Decision Process
1.3 Dynamic Programming
1.3.1 Policy Evaluation
1.3.2 Policy Improvement
1.3.3 Policy Iteration
1.3.4 Value Iteration
1.4 Monte Carlo Methods
1.5 Temporal Difference Learning
1.5.1 On-Policy Learning with SARSA
1.5.2 Off-policy Learning with Q-Learning
2 Approximation Techniques in RL
2.1 On-policy Prediction with Approximation
2.1.1 Value Function Approximation
2.1.2 The Prediction Objective (VE)
2.1.3 Linear Methods
2.1.4 Stochastic Gradient-descent (SGD) for Linear Methods
2.2 On-policy Control with Approximation
2.2.1 Episodic Semi-gradient Control
2.2.2 Average Reward : A new problem Setting for Continuing Tasks
2.3 Off-policy Methods with Approximation
2.3.1 Semi-gradient Methods
2.3.2 Learnability
General Conclusion
List of Figures
1.1 A gambler playing a multi-armed bandit. Image sourced from the website Towards Data Science article 'Solving the Multi-Armed Bandit Problem' [15].
1.2 The agent–environment interaction in a Markov decision process. Image sourced from the book 'Reinforcement Learning : An Introduction' [14].
2.1 The Mountain Car task. Image sourced from the book 'Reinforcement Learning : An Introduction' [14].
2.2 Markov reward process diagram. Image sourced from the book 'Reinforcement Learning : An Introduction' [14].
3.1 The illustration of SeqGAN. Image retrieved from SeqGAN : Sequence Generative Adversarial Nets with Policy Gradient [1].
3.2 A sentence and its representation of the action. Image captured from Controllable Neural Story Plot Generation via Reward Shaping [5].
3.3 The reward predictor learns separately from comparing different parts of the environment, and then the agent tries to get the highest reward predicted by this predictor. Image captured from : Deep Reinforcement Learning from Human Preferences [4].
4.1 Cliff walking game. Image retrieved from OpenAI's gymnasium documentation [12].
4.2 Frozen Lake game. Image retrieved from OpenAI's gymnasium documentation [12].
4.3 Evolution of average reward per episode.
4.4 Evolution of Bandits Selection.
List of Tables
4.1 Average Convergence Time for Policy Improvement and Value Iteration
4.2 Average runtime of reinforcement learning algorithms
4.3 Performance of training algorithms based on average winning time
4.4 Probabilities of multi-armed bandits
Acknowledgments
Firstly, we are very thankful to God for guiding us through this journey and granting us the
strength and wisdom to overcome challenges along the way.
We are deeply grateful to ENIT for providing us with the opportunity to pursue our studies and undertake this project. The support and resources provided by the institution have been invaluable in shaping our academic journey.
We would like to express our gratitude to all the individuals who supported us throughout
this end of year project.
We would like to extend our gratitude and acknowledge the entire community of researchers
and developers in the field of Reinforcement Learning (RL) whose collective efforts have signi-
ficantly enriched and advanced the domain.
We would like to thank Professor Wafa Mefteh for accepting us into her team. We also thank Mr. Ali Frihida, the head of the TIC department of ENIT.
We are grateful to them for their patience and invaluable guidance throughout this project.
Their availability and commitment were essential in helping us overcome all the difficulties and
achieve our goals.
We would also like to express our appreciation to our colleagues for their collaboration, support, and the enriching exchanges we had throughout this project.
Finally, and most importantly, we would like to thank ourselves for our commitment, hard work, and the collaborative spirit we shared throughout this project. Despite the challenges we encountered, we persevered with determination and resilience. It is through our collective effort and dedication that we have successfully brought this project to completion. Let us take a moment to celebrate our accomplishments and the invaluable contribution each of us has made to its success.
General Introduction
We can identify seven main elements of an RL model: an agent, an environment, a state, an action, a policy, a reward signal, and a value function.
An agent is an autonomous entity that interacts with the environment to achieve a well defined
goal. It learns by trial and error, receiving feedback in the form of rewards based on its actions.
The agent’s objective is to maximize cumulative rewards over time by selecting the highest
rewarded actions. Through iterative exploration and exploitation, the agent refines its decision-
making policy, aiming to achieve optimal performance.
The environment is the external system with which the agent interacts. It may be a physical environment, a simulated environment implemented in a computer program, or a digital environment.
A state represents the current situation or configuration of the environment at a specific time step. It encapsulates all of the relevant information the agent needs to make decisions.
An action is a choice made by the agent in a specific state. Actions affect the following states and the rewards obtained by the agent.
The policy is the strategy or rule that the agent follows to pick actions at every state. It maps states to actions, or to probabilities of choosing actions. Policies may be deterministic, always choosing the same action for a given state, or stochastic, choosing actions probabilistically.
The reward signal in reinforcement learning serves as a crucial component, guiding the agent in its decision-making process. This numerical feedback, provided by the environment after each action, indicates how desirable the agent's choices are. By associating actions with rewards, the agent learns to improve its behavior over time, aiming to maximize the total reward.
The value function estimates the expected cumulative reward an agent can obtain from a given state under a certain policy. It reflects the long-term desirability of being in a specific state. Value functions help the agent estimate the quality of different states and make choices accordingly.
Reinforcement Learning (RL) pursues several critical objectives and tackles numerous problems in different areas. One of its main objectives is to enable agents to learn good decision-making strategies through interaction with the environment; this includes tasks such as robot control and resource optimization. RL seeks to solve complex problems involving uncertainty, dynamic situations, and the need for adaptation. In fields like healthcare and logistics, RL is used to improve resource allocation and enhance decision-making processes. Moreover, RL encourages the development of intelligent agents that can learn from experience, thereby decreasing dependence on explicit programming. Therefore, the main objective of RL is to build agents that can successfully explore and succeed in real-world scenarios by continually learning, adapting, and optimizing their behavior based on feedback and rewards from the environment.
CHAPTER 1
Tabular Techniques in RL
Introduction
In this chapter, we’re diving into different ways to solve problems in reinforcement learning.
We’ll start by exploring something called Multi-armed Bandit problems. These are like choosing
between different slot machines, each with its own rewards. We’ll learn methods to make smart
choices and get the most rewards.
After that, we’ll talk about Finite Markov Decision Processes (MDPs). These are about
making decisions in situations where the outcome is uncertain. We’ll learn how to figure out the
best strategies for these situations using techniques like Policy Evaluation, Policy Improvement,
and others.
Then, we’ll look at Monte Carlo Methods and Temporal Difference Learning. These are
ways to learn from experience and improve decision-making over time. We’ll explore different
approaches like SARSA and Q-Learning.
By the end of this chapter, we’ll have a good understanding of these methods, which will
help us solve a wide range of problems in reinforcement learning.
1.1 Multi-armed Bandit
The multi-armed bandit is a slot machine featuring multiple levers that a gambler operates
independently. Each lever provides a distinct reward, and the probabilities of rewards associated
with each lever are independently distributed and typically unknown to the gambler. Figure 1.1 illustrates the multi-armed bandit in action.
Figure 1.1 – A gambler playing a multi-armed bandit. Image sourced from the website Towards
Data Science article ’Solving the Multi-Armed Bandit Problem’[15].
If we knew the value of each action, solving the k-armed bandit problem would be trivial: we would simply always select the action with the highest value. However, this is generally not the case, so we must rely on statistics and estimation. An effective solution method makes the estimated value of action a at time step t, denoted Qt(a), approximate the true value q∗(a) as closely as possible.
At any time step t, there is generally at least one action with the highest estimated value, called the greedy action. Choosing greedy actions constitutes exploitation of the knowledge learned so far. Alternatively, opting for non-greedy actions is called exploration. At the beginning of training, exploring non-greedy actions may be advantageous in order to discover more rewarding actions: although we might not obtain the highest immediate rewards while exploring, we learn better actions that pay off in the long run. The decision to explore or to exploit depends on the precision of the estimates, the uncertainties, and the remaining number of steps.
\[
Q_t(a) = \frac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}
       = \frac{\sum_{i=1}^{t-1} R_i \cdot I(A_i = a)}{\sum_{i=1}^{t-1} I(A_i = a)}
\]
In our case, I evaluates to 1 if the predicate (Ai = a) is true and 0 otherwise. As the denomina-
tor approaches infinity, Qt (a) tends toward q ∗ (a). This method is known as the sample-average
method for estimating action values since each estimate represents the average of a relevant
reward sample.
We define our action selection rule as selecting one of the greedy actions (one of the actions with the highest estimated value); if there are several greedy actions, we pick one of them at random. We denote this greedy action selection rule as
\[
A_t = \operatorname*{arg\,max}_a Q_t(a),
\]
where argmax_a denotes the action a for which Q_t(a) is maximal, with ties broken randomly. Greedy action selection always exploits current knowledge to maximize immediate reward; it does not consider seemingly inferior actions to determine whether they
might actually be better. An alternative approach involves behaving greedily most of the time,
but occasionally, with a small probability ϵ, we select randomly from among all actions with
equal probability, regardless of the action-value estimates. We refer to methods employing this
near-greedy action selection rule as ϵ-greedy methods.
Following this method, every action will be sampled an ever-increasing number of times as the number of steps grows, which ensures that all the estimates Qt(a) converge to q∗(a).
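As an illustration of the sample-average estimates and the ε-greedy selection rule described above, the following Python sketch simulates a simple k-armed bandit; the Gaussian reward noise, the hyperparameter values, and the function name are illustrative assumptions.

```python
import numpy as np

def run_bandit(true_action_values, steps=1000, epsilon=0.1, seed=0):
    """epsilon-greedy agent using incremental sample-average action-value estimates."""
    rng = np.random.default_rng(seed)
    k = len(true_action_values)
    Q = np.zeros(k)   # estimated action values Q_t(a)
    N = np.zeros(k)   # number of times each action has been taken
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore: random action
        else:
            a = int(np.argmax(Q))             # exploit: greedy action
        reward = true_action_values[a] + rng.normal()   # noisy reward around q*(a)
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]        # incremental sample-average update
        total_reward += reward
    return Q, total_reward

Q, total = run_bandit([0.2, 0.8, 0.5])
```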
The agent interacts with the environment in discrete steps. At each step, the agent observes a state and, based on it, selects an action from the set of all possible actions. At the next step, the agent receives a new state and a reward (see Figure 1.2).
Figure 1.2 – The agent–environment interaction in a Markov decision process. Image sourced
from the book ’Reinforcement Learning : An Introduction’ [14].
In cases where the agent's interaction with the environment naturally breaks into episodes, the goal is to maximize the expected return; each episode ends in a terminal state, after which the environment is reset to a starting state. For continuing tasks, where the interaction goes on indefinitely without a final time step, the simple notion of the return becomes problematic: if the agent receives a reward of +1 at each time step, the return would be infinite since there is no final time step. To address this issue, we introduce discounting, governed by the parameter γ with 0 ≤ γ ≤ 1. The agent's objective is then to select actions that maximize the expected discounted return, in which a reward received k time steps in the future is worth only γ^{k−1} times what it would be worth if received immediately.
\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
\]
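A tiny numerical illustration of this return; the reward sequence and the value of γ are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):   # rewards = [R_{t+1}, R_{t+2}, ...]
        g += (gamma ** k) * r
    return g

print(discounted_return([1, 1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```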
Finding an optimal policy is necessary to solve complex RL problems. For finite Markov decision processes, an optimal policy is defined as follows: a policy π is better than or equal to a policy π′ when its expected return is greater than or equal to that of π′ in every state, that is, vπ(s) ≥ vπ′(s) for all s ∈ S, where vπ(s) is the state value under policy π. We define the optimal state-value function as
\[
v_*(s) = \max_\pi v_\pi(s).
\]
Dynamic programming alternates between two processes: policy evaluation computes the value function for a given policy, while policy improvement refines the policy based on the current value function.
In order to build and organize good policies, the use of the value function is essential. The optimal value function v∗ and the optimal action-value function q∗ satisfy the Bellman optimality equations:
\[
v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr],
\qquad
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr].
\]
Let us examine a sequence of approximate value functions v0, v1, v2, . . ., each mapping S+ to R. The initial approximation, v0, is chosen arbitrarily (with the condition that the terminal state, if it exists, must be assigned the value 0). Subsequent approximations are derived using the Bellman equation for vπ as the update rule:
\[
v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_k(s')\bigr], \qquad \text{for all } s \in S,
\]
where π(a|s) is the probability of taking action a in state s under the policy π. The sequence of value functions {vk} converges to vπ as k → ∞. This is known as the iterative policy evaluation algorithm.
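A minimal Python sketch of iterative policy evaluation; the nested-dictionary model format (transitions[s][a] as a list of (probability, next state, reward) tuples) and the tolerance values are assumptions made purely for illustration.

```python
def policy_evaluation(states, actions, policy, transitions, gamma=0.99, theta=1e-8):
    """Iterative policy evaluation.

    policy[s][a]      -- probability pi(a|s)
    transitions[s][a] -- list of (prob, next_state, reward) tuples (assumed model format)
    Terminal states can be given an empty transition list so their value stays 0.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                for prob, s_next, reward in transitions[s][a]:
                    v_new += policy[s][a] * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```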
The crucial concept is to determine whether qπ (s, a) > vπ (s) or qπ (s, a) < vπ (s). If qπ (s, a) >
vπ (s), we can assume that selecting action a every time we encounter the state s would lead to
a more optimal policy on the whole.
This theorem, known as the policy improvement theorem, states that for any pair of deterministic policies (π, π′), if for all states s ∈ S the action value of following π′ for one step is at least the state value under π,
\[
q_\pi(s, \pi'(s)) \ge v_\pi(s),
\]
then π′ must be as good as, or better than, π, i.e. vπ′(s) ≥ vπ(s) for all s ∈ S. Repeating evaluation and improvement in this way yields a sequence of monotonically improving policies and value functions:
\[
\pi \xrightarrow{E} v_\pi \xrightarrow{I} \pi' \xrightarrow{E} v_{\pi'} \xrightarrow{I} \pi'' \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*,
\]
where \xrightarrow{E} signifies policy evaluation and \xrightarrow{I} represents policy improvement. Every policy represents a clear enhancement over the preceding one, except when the previous policy is already optimal. Since a finite MDP has a finite number of policies, this iterative process must converge to an optimal policy and optimal value function within a finite number of iterations. This method of identifying an optimal policy is referred to as policy iteration.
However, there are various ways to truncate the policy evaluation step in policy iteration
without compromising its convergence guarantees. A notable instance is when policy evaluation
is terminated after a single update of each state. This approach is known as value iteration. It
can be expressed as a straightforward update operation that integrates both policy improvement
and truncated policy evaluation steps :
\[
v_{k+1}(s) = \max_a \mathbb{E}\bigl[R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s, A_t = a\bigr]
           = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_k(s')\bigr]
\]
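A corresponding sketch of value iteration, using the same assumed model format as the policy evaluation sketch above:

```python
def value_iteration(states, actions, transitions, gamma=0.99, theta=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality backup shown above."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in transitions[s][a])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```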
A model of the environment is crucial in order to compute the state values, so when the model is unavailable, it becomes necessary for us to estimate action values rather than state values. For this reason, Monte Carlo methods for action values focus on visits to state-action pairs rather than states alone.
Monte Carlo methods can also be used to determine an approximation of the optimal policy through a process named generalized policy iteration (GPI), which alternates between updating the approximate policy and the approximate value function. This process helps both of these estimates reach optimality over time.
The policy improvement is executed by ensuring that the policy becomes greedy with regard
to the current value function. As a result, the policy improvement theorem is applicable to πk
in the following manner :
\[
q_{\pi_k}\bigl(s, \pi_{k+1}(s)\bigr) = q_{\pi_k}\bigl(s, \operatorname*{arg\,max}_a q_{\pi_k}(s, a)\bigr)
\]
In order to avoid unrealistic assumptions in Monte Carlo methods, we use on-policy methods. An on-policy method generally assigns positive probability to all actions in all states. One of the best-known on-policy approaches is the ϵ-greedy policy, which chooses the action with the highest estimated action value but, with a small probability ϵ, takes a random action instead, striking a balance between exploration and exploitation.
The policy improvement theorem guarantees that any ϵ-greedy policy with respect to qπ re-
presents an enhancement over any ϵ-soft policy π. Denote π ′ as the ϵ-greedy policy. The policy
improvement theorem conditions are satisfied due to the fact that for every state s ∈ S,
\[
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a) \\
&\ge \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|A(s)|}}{1 - \epsilon}\, q_\pi(s, a) \\
&= \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) - \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) + \sum_a \pi(a \mid s)\, q_\pi(s, a) \\
&= v_\pi(s).
\end{aligned}
\]
Thus, by the policy improvement theorem, π′ ≥ π, i.e. vπ′(s) ≥ vπ(s) for all s ∈ S. Equality can hold only when π′ and π are both optimal among all the ϵ-soft policies, meaning that they are better than or equal to all other ϵ-soft policies.
It has been shown that the TD(0) method converges to the correct value function for any fixed policy, so convergence is guaranteed under suitable step-size parameters.
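The TD(0) update rule itself is not reproduced above; its standard tabular form, V(s) ← V(s) + α[r + γV(s′) − V(s)], can be sketched as follows (the parameter values are illustrative):

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) step on the state-value table V."""
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V
```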
The SARSA update rule is
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\bigl[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\bigr].
\]
Here, α represents the learning rate, which typically ranges between 0 and 1, and R denotes the reward received. The discount factor γ determines the extent to which future rewards are considered.
SARSA’s on-policy approach ensures that it learns from the current policy. This characte-
ristic makes SARSA particularly suitable for scenarios where stability and safety are crucial,
such as in robotics or autonomous systems.
The Q-learning update rule is
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\bigl[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\bigr],
\]
where α is the learning rate and has a value between 0 and 1, R is the reward, and γ is the discount rate that reduces the value of rewards as time passes.
This flexibility and adaptability make Q-learning suitable for a wide range of problems and
environments, including offline training on pre-collected datasets.
Conclusion
In this chapter, we delve into the methodologies employed in reinforcement learning problem-
solving, particularly focusing on tabular methods. Beginning with the rudimentary Multi-armed
Bandit problem, we elucidate its fundamental version and the strategies employed to tackle it.
Progressing further, we encounter more intricate scenarios known as Finite Markov Decision
Processes, pivotal in the realm of reinforcement learning.
Within this framework, our discourse gravitates towards Dynamic Programming techniques,
akin to systematic blueprints aimed at determining optimal solutions. We expound upon four
primary methodologies : Policy Evaluation, Policy Improvement, Policy Iteration, and Value
Iteration, instrumental in facilitating informed decision-making within these processes. Sub-
sequently, we embark on an exploration of Monte Carlo Methods, offering avenues to resolve
predicaments through simulating numerous plausible scenarios, especially beneficial in naviga-
ting uncertain terrains.
This chapter also delves into Temporal Difference Learning, a methodology predicated on
iterative learning through hypothesis and information assimilation. We elucidate two prominent
approaches within this paradigm : SARSA and Q-Learning, devised to glean insights from ex-
periences and refine decision-making prowess over time. In essence, Chapter 1 serves as a com-
prehensive primer on the diverse methodologies harnessed to tackle challenges in reinforcement
learning, traversing a spectrum from elementary scenarios to more intricate predicaments.
CHAPTER 2
Approximation Techniques in RL
Introduction
In this chapter, we delve into approximate solution methods in reinforcement learning, which
offer practical approaches for handling large-scale problems. We start by exploring On-policy
Prediction with Approximation, where we seek to estimate the value function using approxi-
mation techniques. This involves methods like Value Function Approximation, where we aim
to represent the value function in a compact form, and Linear Methods, which provide efficient
ways to approximate complex functions.
We then delve into On-policy Control with Approximation, focusing on techniques for fin-
ding optimal policies while approximating the value function. Through Episodic Semi-gradient
Control, we navigate the challenge of balancing exploration and exploitation to converge to-
wards optimal policies. Additionally, we explore the concept of Average Reward, introducing a
new problem setting for continuing tasks that offers fresh insights into reinforcement learning.
Transitioning to Off-policy Methods with Approximation, we explore semi-gradient methods
that enable learning from experiences generated by different policies. We also discuss the concept
of Learnability, shedding light on the theoretical aspects of approximation in reinforcement
learning and its implications for practical applications.
Through these discussions, we aim to equip readers with a deeper understanding of approxi-
mate solution methods and their applications in addressing complex reinforcement learning
problems.
2.1 On-policy Prediction with Approximation
In this chapter, a novel approach is introduced wherein the approximate value function is represented by a parameterized functional form characterized by a weight vector w ∈ R^d. We denote the approximate value of state s as v̂(s, w) ≈ vπ(s), given the weight vector w.
By extending reinforcement learning to function approximation, it also becomes applicable to partially observable problems, where the full state is not visible to the agent.
Let us denote an individual update by s ↦ u, where s is the state whose estimated value is updated and u is the update target toward which that estimate is shifted. In the Dynamic Programming policy-evaluation update s ↦ Eπ[Rt+1 + γ v̂(St+1, wt) | St = s], an arbitrary state s is updated, whereas in the other cases it is the state St actually encountered in experience that is updated. In essence, the update s ↦ u means that the estimated value for state s should resemble the update target u to a greater extent.
Machine learning methods that learn to replicate input-output examples are known as supervised learning methods. When the outputs are numbers, such as u, the process is also referred to as function approximation. Function approximation methods rely on examples of the desired input-output behaviour of the function they are trying to approximate.
In reinforcement learning, it is vital for learning to take place online, allowing the agent to interact dynamically with its environment.
Weighting the squared error by the state distribution µ over the state space, we obtain a natural objective function, the mean squared value error, denoted VE:
\[
\overline{VE}(w) = \sum_{s \in S} \mu(s)\,\bigl[v_\pi(s) - \hat v(s, w)\bigr]^2 .
\]
A global optimum of this objective is occasionally attainable for straightforward approximations like linear models, but seldom for intricate functions such as artificial neural networks or decision trees. Under such circumstances, complex function approximators may instead converge towards a local optimum, represented by a weight vector w∗ for which VE(w∗) ≤ VE(w) for all w within some neighbourhood of w∗. This is typically the best that can be guaranteed for nonlinear function approximators, and it frequently proves adequate. However, in numerous scenarios relevant to reinforcement learning, there is no guarantee of reaching an optimum, or even of remaining close to one.
In the linear case, ∇v̂(s, w) = x(s). Let us also assume that the target output of the t-th training example, St ↦ Ut with Ut ∈ R, is not the exact value vπ(St) but a random approximation to it; in practice we cannot use the exact value of vπ(St). If Ut is an unbiased estimate, meaning E[Ut | St = s] = vπ(s) for each t, then wt is guaranteed to converge to a local optimum (under the standard conditions on the step size α). The stochastic gradient-descent update at each training time t is
\[
w_{t+1} = w_t + \alpha\bigl[U_t - \hat v(S_t, w_t)\bigr]\nabla \hat v(S_t, w_t) = w_t + \alpha\bigl[U_t - \hat v(S_t, w_t)\bigr]\, x(S_t).
\]
When the bootstrapped TD(0) target is used instead (a semi-gradient method), the weight vector no longer converges to a local optimum of VE but to a nearby point wTD called the temporal-difference fixed point. At the TD fixed point, it has been proved that
\[
\overline{VE}(w_{TD}) \le \frac{1}{1-\gamma}\, \min_{w} \overline{VE}(w).
\]
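A minimal sketch of semi-gradient TD(0) with a linear value function; the trajectory format, the feature function, and the step size are illustrative assumptions.

```python
import numpy as np

def semi_gradient_td0(episodes, feature_fn, d, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value function v_hat(s, w) = w . x(s).

    episodes   -- iterable of trajectories [(s, reward, s_next, done), ...]
    feature_fn -- maps a state to a NumPy feature vector x(s) of length d
    """
    w = np.zeros(d)
    for episode in episodes:
        for s, reward, s_next, done in episode:
            x = feature_fn(s)
            v = w @ x
            v_next = 0.0 if done else w @ feature_fn(s_next)
            td_error = reward + gamma * v_next - v
            # Gradient of v_hat w.r.t. w is x(s); the target's dependence on w is
            # ignored, which is exactly what makes this a *semi*-gradient method.
            w += alpha * td_error * x
    return w
```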
Let us examine the challenge of navigating a low-powered car up a steep mountain road, as illustrated in Figure 2.1 below.
Figure 2.1 – The Mountain Car task. Image sourced from the book ’Reinforcement Learning :
An Introduction’ [14].
The difficulty in our example is that gravity pulls more strongly than the car's engine can push, so even at full throttle the car cannot simply accelerate up to its final goal. The only solution is to first move up the slope to the left, away from the goal. Then, by applying full throttle, the car can build up enough inertia to reach the goal, even though gravity slows it down the whole time. This is an example of a continuous control task in which things must get worse (moving farther from the destination) before they can get better (getting closer to the goal).
The reward is −1 on all time steps except when the car reaches its final goal, which ends the episode. There are three possible actions: full throttle forward (+1), full throttle reverse (−1), and zero throttle (0).
Let us model the car's movement according to simplified classical mechanics. Its position x_t and its velocity ẋ_t follow the rules
\[
x_{t+1} = x_t + \dot{x}_{t+1}, \qquad \dot{x}_{t+1} = \dot{x}_t + \alpha A_t + \beta \cos(\omega x_t),
\]
where A_t is the action taken at time t and α, β and ω are fixed constants.
The feature vector x(s, a) is combined linearly with the weight vector to approximate the action-value function
\[
\hat q(s, a, w) = w^\top x(s, a) = \sum_{i=1}^{d} w_i\, x_i(s, a)
\]
for each state s and action a.
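A sketch of the corresponding episodic semi-gradient SARSA weight update built on this linear action-value form; the feature dimension, feature encoding, and step-size values are illustrative assumptions.

```python
import numpy as np

def q_hat(w, x_sa):
    """Linear action-value estimate q_hat(s, a, w) = w . x(s, a)."""
    return w @ x_sa

def semi_gradient_sarsa_update(w, x_sa, reward, x_next_sa, done, alpha=0.01, gamma=1.0):
    """One episodic semi-gradient SARSA step: in the linear case the gradient is x(s, a)."""
    target = reward if done else reward + gamma * q_hat(w, x_next_sa)
    w += alpha * (target - q_hat(w, x_sa)) * x_sa
    return w

d = 8                 # illustrative feature dimension
w = np.zeros(d)
```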
Here expectations are conditioned on the initial state S0, while the subsequent actions A0, A1, . . . , At−1 are taken according to our policy π. In practice, we order policies according to their average reward per time step, r(π).
In the average-reward setting, returns are defined in terms of differences between rewards and the average reward:
\[
G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots
\]
This quantity is referred to as the differential return, and the associated value functions are termed differential value functions.
The per-step importance sampling ratio is
\[
\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)},
\]
where b(a|s) denotes the behavior policy, used to select actions while learning about the target policy π. For example, in the one-step, state-value semi-gradient algorithm, the update is simply weighted by ρt:
\[
w_{t+1} = w_t + \alpha\, \rho_t\, \delta_t\, \nabla \hat v(S_t, w_t),
\qquad
\delta_t = R_{t+1} + \gamma\, \hat v(S_{t+1}, w_t) - \hat v(S_t, w_t).
\]
2.3.2 Learnability
In the realm of reinforcement learning, the concept of ”learnability” pertains to the ability to
be learned with a certain level of experience. However, it is discovered that numerous concepts
cannot be learned in reinforcement learning, even with an infinite amount of experimental
data. While these quantities are well-defined and computable with knowledge of the internal
environment structure, they remain uncomputable or unestimable from the observable sequence
of features, actions, or reward sets. As a result, we classify them as being non-learnable.
Figure 2.2 – Markov reward process diagram. Image sourced from the book ’Reinforcement
Learning : An Introduction’ [14].
Consider a scenario where a state has two outgoing transitions, each occurring with equal probability; the numerical values denote the rewards associated with these transitions. In one process, the Markov Reward Process (MRP) remains in the same state and yields either 0 or 2 at random at each step, each with probability 1/2. Conversely, in the other process, the MRP either remains in its current state or transitions to the other state with equal probability at each step. Although the rewards in this second MRP are deterministically 0 or 2, the observable data consist of a randomly generated sequence of 0s and 2s, just like that produced by the first process. Thus, despite having access to an infinite amount of data, discerning between the two processes is infeasible.
Conclusion
In wrapping up this chapter, let’s reflect on what we’ve learned about approximate solution
methods in reinforcement learning. These methods offer us practical ways to deal with big and
complex problems, making it easier to find solutions. We started by learning how to estimate
the value function using approximation techniques. This helps us handle situations where the
number of possible states is too large to deal with directly. Techniques like Linear Methods
simplify the representation of the value function, making it more manageable. Moving on to
finding optimal policies, we explored methods like Episodic Semi-gradient Control. These tech-
niques help us balance between exploring new options and exploiting what we already know to
find the best policies efficiently. We also discussed the concept of Average Reward, which gives
us a new perspective on tasks that continue indefinitely.
We then looked into Off-policy Methods, which allow us to learn from different policies. This
flexibility is valuable in real-world scenarios where we might have data from various sources.
Understanding the theory behind Learnability helped us grasp the potential and limitations
of approximation in reinforcement learning. Overall, these approximate solution methods are
powerful tools for tackling real-world challenges in reinforcement learning. By mastering these
techniques, we can handle complex problems more effectively and drive progress in artificial
intelligence.
As we move forward, let’s keep exploring and experimenting with these methods, pushing
the boundaries of what’s possible in reinforcement learning. With dedication and creativity,
we can continue to advance the field and make meaningful contributions to AI research and
applications.
CHAPTER 3
Reinforcement Learning for Generative AI
Introduction
In this chapter, we’ll explore how reinforcement learning (RL) techniques can be used in
generative artificial intelligence (AI). We’ll discuss different methods and applications where
RL helps in training models to generate various outputs, optimize objectives, and improve
non-quantifiable features.
We’ll start by looking at RL for simple generation tasks like sequences and codes. Techniques
like SeqGAN and Proximal Policy Optimization are used here. Then, we’ll move on to RL for
maximizing specific objectives, such as improving translation quality or generating narrative
plots with specific characteristics.
Finally, we’ll discuss how RL can enhance subjective features in generated outputs. One
method we’ll explore is Reinforcement Learning from Human Preferences, where RL learns
from human feedback to improve outputs.
A wide variety of real-world applications rely on generation, including machine translation, text summarization, image captioning, and many more. From a machine learning (ML) perspective, data generation is the problem of producing syntactically and semantically correct data given some context: for example, given an image, generate an appropriate caption, or given a sentence in English, translate it to Arabic. Generative AI (GenAI) is the class of machine learning algorithms that can learn from data such as text, images, and audio in order to generate new content. One major recent development in GenAI is the introduction of OpenAI's GPT-3 and GPT-4 models, which can generate human-like language output. Despite all these achievements, there are many open questions about how to make GenAI more user friendly. For these reasons, it is crucial to use reinforcement learning to mitigate biases and align GenAI models with human values.
3.1 RL for Mere Generation
Interpreting this problem within the context of reinforcement learning, we consider a timestep t where the state s is represented by the currently generated sequence (y1, . . . , yt−1) and the action a corresponds to selecting the next token yt. We also train a φ-parameterized discriminative model Dφ to provide guidance for improving the generator Gθ. Here, Dφ(Y1:T) is the probability that a sequence Y1:T comes from real sequence data rather than from the generator.
Figure 3.1 – The illustration of SeqGAN. Image retrieved from SeqGAN : Sequence Generative
Adversarial Nets with Policy Gradient[1]
In SeqGAN, there are two main parts: the Discriminator (D) and the Generator (G) (see Figure 3.1). The Discriminator is trained on real and generated data and learns to tell which sequences are real and which were produced by the Generator. The Generator receives feedback from the Discriminator to improve its output and make it more like the real data. In short, SeqGAN works by training both parts together, with the Discriminator (D) guiding the Generator (G) to create more realistic data.
3.2 RL for Objective Maximization
We are given a dataset of parallel text containing N input-output sequence pairs, denoted D ≡ {(X^{(i)}, Y^{*(i)})}_{i=1}^{N}. Our goal is to estimate the parameters θ of a statistical model in a way that maximizes the probability of observing the correct outputs given the corresponding inputs in the dataset:
\[
L_{ML}(\theta) = \sum_{i=1}^{N} \log P_\theta\bigl(Y^{*(i)} \mid X^{(i)}\bigr).
\]
We then consider model refinement using the expected reward objective, which can be expressed as
\[
O_{RL}(\theta) = \sum_{i=1}^{N} \sum_{Y \in \mathcal{Y}} P_\theta\bigl(Y \mid X^{(i)}\bigr)\, r\bigl(Y, Y^{*(i)}\bigr).
\]
Here, r(Y, Y^{*(i)}) denotes the per-sentence score, and we are computing an expectation over all possible output sentences Y (see Figure 3.2).
In the preceding objective, we subtract the mean reward from r(Y, Y^{*(i)}); the mean is estimated by the sample mean of m sequences drawn randomly from the distribution P_θ(Y | X^{(i)}). To further stabilise training, we optimise a linear mixture of the ML and RL objectives. Using human-rated side-by-side comparison as a metric, the refined model approaches the accuracy obtained by typical bilingual human translators.
At each step, a language model predicts which event should come next in order to add it to the story. However, language-model techniques cannot take guidance from the user toward a specific goal, which leads to stories that lack coherence. We therefore apply a technique called reward shaping to help our model learn better: this technique looks at a corpus of stories and gives the model rewards along the way, and these rewards help the model get closer to our final goal.
We generate stories by deciding which events happen next and how the story state changes so as to reach a desired outcome. The aim of this work is to make sure that a certain goal verb (like waddle, doodle, or munch) is the last thing that happens in the story. We use reinforcement learning to decide what should happen next in a story, and we learn a model of how to act using policy gradients. We begin by training a language model on a corpus of stories.
Figure 3.2 – A sentence and its representation of the action. Image captured from Controllable
Neural Story Plot Generation via Reward Shaping [5]
A language model P(x_n | x_{n−1} . . . x_{n−k}; θ) gives a distribution over the possible next events x_n given a history of events x_{n−1} . . . x_{n−k} and the parameters θ of the model.
By repeatedly sampling from this language model, we can use it to generate a plot. However, there is no guarantee that the plot will reach a desired goal. To fix this issue, we apply policy gradient learning. This method does require a reward signal at every step to provide feedback, but in return the reward shaping described below ensures that these rewards are not sparse.
We choose our policy model θ such that P(e_{t+1} | e_t; θ) both matches the distribution over events in the story corpus and increases the probability of reaching a future goal event. An action selects the next event, sampled from the probability distribution of the language model. The reward is then computed from the distance between the event e_{t+1} and our goal event, and the resulting gradient is used to adjust the network's parameters, shifting the language model's distribution toward the goal.
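As a sketch of how such an update can be written, the following is a REINFORCE-style form consistent with this description; the exact expression is an assumption here and may differ from the formulation used in [5]:
\[
\nabla_\theta J(\theta) \approx R\bigl(v(e_{t+1})\bigr)\, \nabla_\theta \log P\bigl(e_{t+1} \mid e_t;\, \theta\bigr),
\]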
where e_{t+1} and R(v(e_{t+1})) are, respectively, the event at timestep t + 1 and the reward for the verb in that event. The policy gradient technique gives an advantage to highly rewarding events by shifting more probability mass toward predicting these events than toward events with a lower reward.
Reward Shaping
To ensure controllability in plot generation, we chose to design the reward system so that the network is rewarded every time it generates an event that increases its probability of achieving the goal.
The reward function is created by pre-processing the training stories and determining two important factors:
— the distance of each verb to the target (goal) verb, and
— the frequency of each verb in the existing stories.
Distance
The distance measures how close the verb v of an event is to the goal verb g. It is used to reward the model for generating events whose verbs are closer to the goal verb. For a verb v, this metric is calculated as
\[
r_1(v) = \log \sum_{s \in S_{v,g}} \bigl(l_s - d_s(v, g)\bigr),
\]
where S_{v,g} is the subset of stories in the corpus in which v appears before the target verb g, l_s is the length of story s, and d_s(v, g) is the number of events between the event containing v and the event containing g. The shorter the distance between the events involving v and g within a story, the greater the reward.
Story-Verb Frequency
The story-verb frequency reward reflects which verbs are most likely to appear in a story before the goal verb. It measures the frequency with which a given verb v occurs before the goal verb g across all stories in the corpus. This deters the model from producing events built around verbs that rarely appear before the target verb in stories. The story-verb frequency measure is computed as
\[
r_2(v) = \log \frac{k_{v,g}}{N_v},
\]
where N_v is the number of occurrences of the verb v in the corpus, and k_{v,g} is the number of times v appears prior to the target verb g in any story.
Final Reward
The final reward for a verb is determined by the sum of the distance and frequency metrics, normalised over every verb in the corpus; the normalisation constant is denoted by α. Together, these metrics favor verbs that
— appear close to the target verb, and
— appear before the target verb in stories frequently enough to be significant.
This reward-shaping approach yields a policy that creates stories probabilistically similar to the training corpus, whereas purely language-model-based narrative and plot generation systems produce stories that are aimless.
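A small Python sketch of the two reward components as reconstructed above; representing each story as a list of verbs (one per event), the distance measure used, and the toy corpus are illustrative assumptions.

```python
import math

def distance_reward(corpus, v, g):
    """r1(v): log of the summed (story length - distance) over stories where v precedes g."""
    total = 0.0
    for story in corpus:                      # a story is a list of verbs, one per event
        if v in story and g in story and story.index(v) < story.index(g):
            l_s = len(story)
            d_s = story.index(g) - story.index(v)   # events separating v and g (simplified)
            total += l_s - d_s
    return math.log(total) if total > 0 else float("-inf")

def frequency_reward(corpus, v, g):
    """r2(v) = log(k_{v,g} / N_v): how often v appears before g, relative to all uses of v."""
    n_v = sum(story.count(v) for story in corpus)
    k_vg = sum(1 for story in corpus
               if v in story and g in story and story.index(v) < story.index(g))
    return math.log(k_vg / n_v) if n_v and k_vg else float("-inf")

corpus = [["walk", "admire", "marry"], ["walk", "talk", "marry"], ["walk", "leave"]]
print(distance_reward(corpus, "walk", "marry"), frequency_reward(corpus, "walk", "marry"))
```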
3.3 Enhancing Non-Quantifiable Features
Furthermore, there may be desirable characteristics that have no corresponding metric because they are subjective, hard to define, or not directly measurable. Typically, users only have an implicit understanding of the task objective, making it nearly impossible to design a suitable reward function; this issue is often called the agent alignment problem. A promising path is to use rewards as a way to guide the model.
Our method is to obtain a reward function from human feedback and then optimize the function we obtained. The reward function should:
1. allow us to take on tasks where we can merely identify the required behavior rather than necessarily demonstrate it,
2. enable beginner (non-expert) users to teach agents,
3. scale to complex problems, and
4. be economical with user feedback.
The introduced algorithm fits a reward function to the human's preferences while simultaneously training a policy to maximise the reward function's current predictions. Rather than asking for an exact score, we ask the human to compare short video clips of the agent's behavior (see Figure 3.3). We found that such comparisons are just as helpful for understanding human preferences, while being more straightforward for humans to provide in some domains.
Figure 3.3 – The reward predictor learns separately from comparing different parts of the
environment, and then the agent tries to get the highest reward predicted by this predictor.
Image captured from : Deep Reinforcement Learning from Human Preferences[4]
Consider an agent that interacts with an environment over a sequence of steps. At each time t, the agent receives an observation o_t ∈ O from the environment and then communicates an action a_t ∈ A back to the environment. In classical reinforcement learning, the environment also provides a reward r_t ∈ R, and the agent's objective is to maximize the discounted sum of rewards. Here, instead of the environment feeding back a reward signal, we assume the existence of a human user who can express preferences between sequences of observations and actions, σ ∈ (O × A)^k. To indicate that the human favored sequence σ1 over sequence σ2, we write σ1 ≻ σ2.
Informally, the agent's objective is to generate sequences that the human prefers, while making as few queries as possible to the human.
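A minimal sketch of the preference model used to fit the reward predictor in [4]: the probability that the human prefers σ1 is a Bradley-Terry-style softmax over the summed predicted rewards of the two segments, and the predictor is trained with the cross-entropy between this prediction and the human labels. The function names and example numbers are illustrative.

```python
import numpy as np

def preference_probability(r_hat_sigma1, r_hat_sigma2):
    """P[sigma1 > sigma2]: exponentiated sums of predicted rewards, normalised."""
    z1 = np.exp(np.sum(r_hat_sigma1))
    z2 = np.exp(np.sum(r_hat_sigma2))
    return z1 / (z1 + z2)

def preference_loss(r_hat_sigma1, r_hat_sigma2, mu1):
    """Cross-entropy between the predicted preference and the human label.
    mu1 is 1.0 if the human preferred sigma1, 0.0 if sigma2, 0.5 if indifferent."""
    p1 = preference_probability(r_hat_sigma1, r_hat_sigma2)
    return -(mu1 * np.log(p1) + (1.0 - mu1) * np.log(1.0 - p1))

# Example: predicted per-step rewards for two short clips, human preferred the first.
loss = preference_loss(np.array([0.2, 0.4, 0.1]), np.array([0.0, -0.1, 0.3]), mu1=1.0)
```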
We will evaluate our algorithms' behavior in two ways:
Quantitative
If the human's preferences are determined by some reward function r, then our agent should obtain a large total reward under r. Thus, whenever we know the reward function r, we can assess the agent quantitatively. Ideally, the agent will achieve a reward nearly as high as if it had been optimizing its behavior directly with reinforcement learning (RL) on r.
Qualitative
This approach is practically useful when we do not have a way to measure behavior quantitatively. In such cases, all we can do is qualitatively evaluate how well the agent satisfies the human's preferences. We begin by defining a goal expressed in natural language, and then ask a human to evaluate the agent's behavior based on how well it achieves that goal.
Once rewards are computed with r̂, we are left with a classical reinforcement learning problem, and any RL algorithm that suits the domain can be used to solve it. We favor techniques such as Policy Gradient methods or Temporal Difference (TD) learning, since they are robust to changes in the reward function; this matters because of a subtlety: the learned reward function r̂ may be non-stationary.
Conclusion
In conclusion, this chapter has shed light on the diverse applications of reinforcement lear-
ning (RL) in generative artificial intelligence (AI). We’ve explored various methods and use
cases where RL techniques play a crucial role in training models to generate outputs, optimize
objectives, and enhance non-quantifiable features. Starting with RL for Mere Generation, we
discussed techniques like SeqGAN and Proximal Policy Optimization, which enable the gene-
ration of high-quality sequences and codes. The CodeRL algorithm showcased the specialized
application of RL in generating code snippets, demonstrating its effectiveness in various do-
mains. Moving on to RL for Objective Maximization, we explored how RL is used to optimize
specific objectives, such as translation quality in Google’s Neural Machine Translation System.
We also discussed narrative plot generation with specific characteristics, highlighting the po-
tential of RL in creative content generation. Finally, we examined Enhancing Non-Quantifiable
Features, where RL techniques are employed to improve subjective aspects of generated out-
puts. Reinforcement Learning from Human Preferences emerged as a promising approach to
incorporate human feedback in the training process, leading to outputs that better align with
human preferences.
Overall, the applications discussed in this chapter illustrate the versatility and effective-
ness of RL in generative AI. By leveraging RL techniques, researchers and practitioners can
tackle a wide range of tasks, from generating diverse outputs to optimizing subjective features.
As the field continues to evolve, we anticipate further innovations and advancements in the
intersection of RL and generative AI, driving progress in artificial intelligence and enriching
human-computer interaction.
CHAPTER 4
Reinforcement Learning Implementation
Introduction
In this chapter, we apply reinforcement learning algorithms to various scenarios using OpenAI Gym environments. Gymnasium, a maintained fork of OpenAI's Gym library, provides a collection of Python environments featuring well-defined observation spaces, action spaces, and predefined reward functions for different models. Using this library, we tackle a range of reinforcement learning problems, including both model-based and model-free scenarios. Additionally, we conduct benchmark tests and performance comparisons among different algorithms to assess their effectiveness in solving these problems.
4.1 Model-based Reinforcement Learning
Figure 4.1 – Cliff walking game. Image retrieved from OpenAI’s gymnasium
documentation[12]
The player starts at the bottom-left corner of a 4 × 12 grid and must reach the goal at the bottom-right corner without falling off the cliff that runs along the bottom edge. The determinism of the agent's movements and the known transition dynamics let us treat this environment as model-based.
Action Space
The action shape is (1, ) in the range {0, 3} indicating which direction to move the player.
— 0 : Move up
— 1 : Move right
— 2 : Move down
— 3 : Move left
Observation Space
There are 3 × 12 + 1 possible states. The player can’t be in a cliff, also he can’t be at the goal
. the only remainig positions are the first 3 upper rows as well as the bottom left cell.
The observation is a number, where the row and col start at 0, that represents the player’s
current location as current row × nrows + current col.
For example we can calculate the positions beginning as follows : 3 × 12 + 0 = 36.
Starting State
The player is in state [36] (location [3, 0]) at the beginning of the episode.
Reward
Each time step incurs a reward of −1, unless the player steps into the cliff, in which case the reward is −100.
Episode End
The episode ends when the player enters state [47] (location [3, 11]).
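A short sketch of interacting with this environment through Gymnasium's standard API; the random policy and the step limit are only for illustration.

```python
import gymnasium as gym

env = gym.make("CliffWalking-v0")
obs, info = env.reset()          # obs starts at 36, i.e. location [3, 0]

total_reward = 0
for _ in range(100):
    action = env.action_space.sample()                        # random action in {0, ..., 3}
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:                               # terminated when state 47 is reached
        break
env.close()
print(obs, total_reward)
```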
After repeating the policy improvement process for several iterations, we observed success-
ful convergence to the optimal policy. Each iteration refined the policy, gradually enhancing its
performance until reaching a constant policy, which is, in fact, the optimal policy. For detailed
numerical values and further insights into the algorithmic implementation, readers are encou-
raged to refer to the code implementation available in our GitHub repository (see Appendix A
A.1.1).
Value iteration represents an advancement over the policy improvement method by directly
focusing on optimizing the value function. Unlike policy improvement, this approach compresses
the process, bypassing the need for separate policy evaluation and improvement steps, resulting
in faster convergence and often more efficient outcomes.
Pseudo code for the training loop will be provided in Appendix A (A.1.2).
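The following sketch performs value iteration directly on the Cliff Walking model exposed by Gymnasium's toy-text environments (env.unwrapped.P[state][action] yields (probability, next state, reward, terminated) tuples) and then extracts the greedy policy; the discount factor and tolerance are illustrative choices.

```python
import numpy as np
import gymnasium as gym

env = gym.make("CliffWalking-v0")
P = env.unwrapped.P                  # P[s][a] -> list of (prob, s_next, reward, terminated)
n_states, n_actions = env.observation_space.n, env.action_space.n
gamma, theta = 0.99, 1e-8

def q_value(s, a, V):
    """Expected one-step return of action a in state s under the model P."""
    return sum(p * (r + gamma * V[s_next] * (not done)) for p, s_next, r, done in P[s][a])

V = np.zeros(n_states)
while True:                          # value iteration: repeated Bellman optimality backups
    delta = 0.0
    for s in range(n_states):
        best = max(q_value(s, a, V) for a in range(n_actions))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Greedy policy extracted from the converged value function.
policy = [int(np.argmax([q_value(s, a, V) for a in range(n_actions)])) for s in range(n_states)]
```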
4.2 Model-free Reinforcement Learning
Figure 4.2 – Frozen Lake game. Image retrieved from OpenAI’s gymnasium documentation[12]
Action Space
The player's direction of movement is indicated by the action shape (1,) in the range {0, 3}:
— 0 : Move left
— 1 : Move down
— 2 : Move right
— 3 : Move up
In a randomly generated world, there is always a path to the goal.
Observation Space
The observation is a number calculated as current row × ncols + current col, where both the row and column indices start at 0; this value represents the player's current location.
For instance, the goal position in the 4×4 map is 3 × 4 + 3 = 15. The number of possible observations depends on the size of the map.
Initial Condition
The player is in state [0] at location [0, 0] at the beginning of the episode.
Rewards
Reward schedule :
— Reach goal : +1
— Reach hole : 0
— Reach frozen : 0
Episode End
The episode ends when the player moves into a hole or reaches the goal.
SARSA updates the action-value table after each transition according to
\[
Q(s, a) = (1 - \alpha)\cdot Q(s, a) + \alpha\cdot\bigl(r + \gamma\, Q(s', a')\bigr),
\]
where:
— s is the current state,
— a is the action taken,
— r is the received reward,
— s′ is the next state,
— a′ is the next action,
— α is the learning rate, and
— γ is the discount factor.
The pseudo code for this update rule can be found in Appendix A (A.3.1).
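A minimal sketch of the tabular SARSA update just described; the table shape and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy SARSA update: the bootstrap uses the action a' actually chosen next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((16, 4))   # e.g. 4x4 Frozen Lake: 16 states, 4 actions
```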
Expected SARSA
Expected SARSA is one of the algorithms that we didn’t cover in the previous chapter, as
it is considered an improvement of the SARSA algorithm.
In this approach, we compute the expected value of the Q-values over all possible next actions, weighted by their probabilities. In our particular case, all actions are equiprobable since we are picking actions randomly. By using this expectation rather than a single sampled next action at each timestep, Expected SARSA provides a more accurate estimate of the Q-value for the current state-action pair. It updates the Q-value using the Bellman equation
\[
Q(s, a) = (1 - \alpha)\cdot Q(s, a) + \alpha\cdot\Bigl(r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a')\Bigr).
\]
The pseudo code for the update method in Expected SARSA can be found in Appendix A
(A.3.2).
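A sketch of the corresponding Expected SARSA update; the uniform next-action distribution mirrors the equiprobable random policy described above, while the parameter values are illustrative.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, alpha=0.1, gamma=0.99):
    """Expected SARSA: bootstrap with the expectation of Q over the next-action distribution.

    policy_probs -- probabilities of each action in s_next; with a purely random policy,
                    as in the experiment above, this is simply 1/|A| for every action.
    """
    expected_q = np.dot(policy_probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q

n_actions = 4
uniform = np.full(n_actions, 1.0 / n_actions)   # equiprobable actions, as described above
```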
We have seen in 1.5.2 that the Q-learning algorithm is an off-policy temporal difference
reinforcement learning technique used for estimating the Q-value function. Unlike SARSA,
which is an on-policy method, Q-learning updates the Q-table by considering the maximum
Q-value of the next state regardless of the action taken. This approach allows Q-learning to
learn the optimal policy while following a different policy for exploration.
However, this off-policy approach can lead to sub-optimal solutions in environments where
exploration is necessary. Since Q-learning always assumes the greedy action selection strategy
for future actions, it may not explore the environment sufficiently, leading to a bias towards
exploiting known actions rather than exploring new ones.
The Q-learning update rule is based on the Bellman equation:
\[
Q(s, a) = (1 - \alpha)\cdot Q(s, a) + \alpha\cdot\Bigl(r + \gamma\cdot \max_{a'} Q(s', a')\Bigr)
\]
The pseudo code for the Q-learning update method can be found in Appendix A (A.4.1).
To improve the Q-learning algorithm used previously, we apply the epsilon-greedy exploration-exploitation balance strategy. In each episode, we balance exploration and exploitation by alternating between greedy actions learned so far and random actions. By decaying epsilon, we make sure that our agent exponentially shifts from exploration to exploitation as it gains more experience. This avoids converging prematurely to a sub-optimal solution, which would prevent the agent from discovering better policies.
We set the following constants to test the performance of this improved approach:
— Initial epsilon (ϵ) : 0.9
— Epsilon decay rate : 0.999
— Minimum epsilon : 0.01
Q-learning with the epsilon-greedy strategy updates the Q-values using the same Bellman equation as the basic Q-learning algorithm:
\[
Q(s, a) = (1 - \alpha)\cdot Q(s, a) + \alpha\cdot\Bigl(r + \gamma\cdot \max_{a'} Q(s', a')\Bigr)
\]
a
The pseudo code for Q-learning with the epsilon-greedy strategy can be found in Appendix
A (A.4.2).
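A sketch of the training loop using the constants listed above (initial ε = 0.9, decay rate 0.999, minimum 0.01); the environment name, number of episodes, and the remaining hyperparameters are illustrative assumptions.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
epsilon, eps_decay, eps_min = 0.9, 0.999, 0.01      # constants listed above
alpha, gamma = 0.1, 0.99
rng = np.random.default_rng(0)

for _ in range(10_000):
    s, _ = env.reset()
    done = False
    while not done:
        if rng.random() < epsilon:
            a = env.action_space.sample()            # explore
        else:
            a = int(np.argmax(Q[s]))                 # exploit
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])        # Q-learning update
        s = s_next
    epsilon = max(eps_min, epsilon * eps_decay)      # shift from exploration to exploitation
```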
To address the issue of overestimation of Q-values, we will use Double Q-learning. In fact,
instead of using a single Q-function to estimate the Q-values, we will use two separate Q-
functions. During the training, one Q-function is used to select the best action, while the other
one is used to evaluate the Q-value of that action. This way, we mitigate the overestimation
bias present in the basic method, resulting in convergence towards a more accurate policy.
The update method for Double Q-learning updates one Q-function using the Bellman equation while using the other Q-function to evaluate the action selected as best. The Bellman equation for Double Q-learning is:
\[
Q_i(s, a) = (1 - \alpha)\cdot Q_i(s, a) + \alpha\cdot\Bigl(r + \gamma\cdot Q_j\bigl(s', \operatorname*{arg\,max}_{a'} Q_i(s', a')\bigr)\Bigr), \qquad j \neq i.
\]
This equation shows how the Q-value of the i-th Q-function is updated, using the other Q-function Q_j for evaluation.
The pseudo code for Double Q-learning can be found in Appendix A (A.4.3).
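A minimal sketch of the Double Q-learning update with two tables; the coin-flip choice of which table to update and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(Q1, Q2, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """Double Q-learning: one table selects the best next action, the other evaluates it."""
    if rng.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2      # update Q1 this time
    else:
        Q_sel, Q_eval = Q2, Q1      # update Q2 this time
    best_next = int(np.argmax(Q_sel[s_next]))
    target = r + (0.0 if terminated else gamma * Q_eval[s_next, best_next])
    Q_sel[s, a] += alpha * (target - Q_sel[s, a])
```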
To address the issue of estimating state values without using a model, we will use the First-
Visit Monte Carlo algorithm. In First-Visit Monte Carlo, we estimate the value of each state by
averaging the returns observed after the first visit to that state during an episode. This method
averages over multiple episodes to obtain a more accurate estimation of state values, making it
suitable for environments where we can simulate episodes without a model.
The pseudo code for First-Visit Monte Carlo can be found in Appendix A (A.5.1).
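A sketch of First-Visit Monte Carlo prediction over a batch of recorded episodes; the trajectory format (lists of (state, reward) pairs) is an assumption made for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte Carlo: average returns following the first visit to each state.

    episodes -- list of trajectories, each a list of (state, reward) pairs.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        returns_at = {}
        for t in reversed(range(len(episode))):       # compute returns backwards
            state, reward = episode[t]
            g = reward + gamma * g
            returns_at[state] = g                      # overwritten so the earliest visit's return remains
        for state, g_first in returns_at.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```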
To further explore methods for estimating state values in reinforcement learning without
relying on a model, we introduce the Every-Visit Monte Carlo algorithm. Unlike the First-Visit
Monte Carlo algorithm, which considers only the first visit to each state within an episode,
Every-Visit Monte Carlo computes the value of each state by averaging the returns obtained
from all visits to that state across all episodes. This approach provides a comprehensive esti-
mation of state values and can be particularly useful in environments with non-deterministic
dynamics.
The pseudo code detailing the implementation of Every-Visit Monte Carlo can be found in
Appendix A (A.5.2).
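For comparison, here is a sketch of the every-visit variant under the same assumed interface as the first-visit sketch above; the only difference is that every occurrence of a state contributes its return.

from collections import defaultdict

def every_visit_mc(episodes, gamma=0.99):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so the discounted return G is available at each step
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G        # no first-visit check here
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_count}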
To model the multi-armed bandits in our simulation, we represent them as arms in a computer
model, where each arm corresponds to a bandit. Our goal is to estimate the reward
probabilities associated with each bandit. The k-armed bandit is modeled by an array with
length k containing the probability of winning of each bandit.
The pseudo code detailing the implementation of the epsilon-greedy algorithm for multi-
armed bandits can be found in Appendix A (A.6).
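As a small hedged sketch of this model (the names are illustrative; the full epsilon-greedy loop is given in A.6), the k-armed bandit reduces to an array of hidden win probabilities and a pull function returning a binary reward :

import numpy as np

rng = np.random.default_rng()
k = 5
true_bandit_probs = rng.uniform(size=k)    # hidden probability of winning per bandit

def pull(arm):
    # A pull of the given arm wins (reward 1) with its true probability
    return int(rng.random() < true_bandit_probs[arm])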
4.3 Benchmarking and Performance Evaluation
Table 4.1 – Average Convergence Time for Policy Improvement and Value Iteration

Method               Average convergence time (s)
Policy Improvement   0.037
Value Iteration      0.0035
Our benchmarking results demonstrate that Value Iteration outperforms Policy Improvement
in terms of convergence time on an Intel Core i7 11th generation processor. With an
average time of 0.037 s, Policy Improvement exhibits a significantly longer convergence time
compared to Value Iteration, which achieves convergence in just 0.0035 s on average. These
averages were calculated over 10 repetitions, providing a reliable assessment of the methods’
performance. The superior efficiency of Value Iteration makes it a more favorable choice for
scenarios where rapid convergence is crucial.
The results reveal that the SARSA and Q-Learning algorithms exhibit similar runtime
performance, with SARSA being slightly faster on average. The Double Q-Learning and Expected SARSA
algorithms take longer to execute compared to SARSA and Q-Learning. This is expected as
these algorithms involve additional computations to maintain and update two Q-functions or
compute expected values.
Monte Carlo methods, both first visit and every visit, show comparable runtime performance,
indicating that the difference in computational complexity between these methods is minimal.
However, the Q-Learning algorithm with the Epsilon Greedy strategy demonstrates significantly
higher runtime. This is due to the exploration-exploitation trade-off inherent in the epsilon-greedy
strategy, which requires additional computations to balance exploration and exploitation during
learning.
These runtime results provide insights into the computational efficiency of different
reinforcement learning algorithms, which can be crucial for real-world applications where
computational resources are limited.
We conducted an evaluation by playing 10,000 games and calculating the average winning
rate for each training algorithm. Our agents were initially trained on 1,000,000 episodes to
learn optimal policies. The results are summarized in Table 4.3.
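Before discussing these results, here is a hedged sketch of this evaluation loop; the environment name, the greedy policy extraction, and the win condition are illustrative assumptions rather than our exact setup.

import gymnasium as gym
import numpy as np

def evaluate(Q, n_games=10_000):
    env = gym.make("FrozenLake-v1")              # assumed environment for illustration
    wins = 0
    for _ in range(n_games):
        state, _ = env.reset()
        done = False
        while not done:
            action = int(np.argmax(Q[state]))    # act greedily w.r.t. the learned Q-table
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        wins += int(reward > 0)                  # a positive final reward counts as a win
    return wins / n_games                        # average winning rate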
The evaluation indicates that the SARSA, First Visit Monte Carlo, and Every Visit Monte Carlo
algorithms achieved high average winning rates, demonstrating their effectiveness in learning
strong policies. SARSA showed a respectable performance, maintaining a high success rate of
71.043%. First Visit Monte Carlo and Every Visit Monte Carlo exhibited even better performance,
with average winning rates exceeding 73%. Interestingly, both Monte Carlo methods
showed a tendency to stay in the first positions, maximizing their immediate reward but failing
to explore further and potentially obtain a higher long-term reward.
On the other hand, Q-Learning, Double Q-Learning, and Q-Learning with Epsilon Greedy
exhibited suboptimal performance, with average winning rates close to zero. This indicates
convergence to suboptimal solutions and suggests that these models were over-trained. The
visualisation of the policies learned by these agents shows that they tend to avoid exploring new
actions and prefer staying in their initial states, maximizing their long-term reward by avoiding
losses. However, when Q-Learning was trained on 2000 episodes and Double Q-Learning on
1000 episodes, their performance improved significantly, demonstrating the impact of training
duration on the effectiveness of Q-learning algorithms.
Figure 4.3 illustrates the evolution of the average reward per episode for the three algorithms.
Initially, all algorithms explore the environment, resulting in a gradual increase in average
rewards. Notably, the Q-learning epsilon-greedy algorithm demonstrates the most substantial
improvement, with its average reward steadily increasing throughout the episodes. For instance,
at episode 1000, the average reward stands at approximately 0.003. Its performance peaks at
episode 2738, where the average reward reaches approximately 0.207. However, following this
peak, the average reward
for Q-learning epsilon-greedy starts to decline, indicating convergence towards sub-optimal
policies. This decline suggests that the agent may have started to exploit certain actions or states
that yield immediate but sub-optimal rewards, ultimately hindering further improvement in
performance.
In contrast, both the standard Q-learning and Double Q-learning algorithms exhibit a different
pattern. The average rewards for these algorithms remain consistently low and almost
constant throughout the training period. This pattern suggests that, unlike Q-learning epsilon-
greedy, which balances exploration and exploitation, these algorithms predominantly focus on
exploration without effectively leveraging the knowledge gained. The consistently low average
rewards indicate that while exploration is occurring, there is no exploitation of the learned
knowledge, leading to stagnation in performance.
After 1 million iterations, the multi-armed bandit algorithm demonstrated remarkable efficiency
in selecting the three bandits with the highest probabilities. The algorithm’s exploration-exploitation
balance allowed it to focus primarily on the more rewarding options. This efficient
selection process highlights the effectiveness of the chosen hyperparameters in guiding the
algorithm towards optimal decision-making.
Bandit     Probability
Bandit 1   0.91
Bandit 2   0.93
Bandit 3   0.98
Bandit 4   0.23
Bandit 5   0.08
For a visual representation of the bandit selection process and the convergence towards the
optimal bandits, we refer to Figure 4.4.
This plot provides insight into how the algorithm’s selections evolve over time, demonstrating
its ability to identify and exploit the highest-probability bandits.
Conclusion
Model-based algorithms exhibit remarkable performance, swiftly converging to the optimal
policy with relatively fewer training episodes. Our experiments underscored their efficiency and
ability to learn rapidly. However, despite their effectiveness, the practicality of model-based
approaches may be limited in scenarios where computational efficiency is paramount. While
they offer fast convergence, their reliance on explicit models can restrict their applicability,
particularly when compared to model-free methods.
On the other hand, our experiments highlight the importance of selecting appropriate
hyperparameters in reinforcement learning algorithms. We observed that different algorithms can
achieve remarkably close performances when tuned correctly. For instance, while SARSA demands
a high number of training episodes to converge, Q-learning epsilon-greedy demonstrates
the opposite behavior, achieving optimal performance with fewer training iterations. Despite
its complexity, this approach proves advantageous as it accelerates training, saving time and
computational resources. Thus, the judicious selection of hyperparameters plays a crucial role
in maximizing the efficiency and effectiveness of reinforcement learning algorithms.
General Conclusion
In this project, we aimed to explore the world of Reinforcement Learning, surveying its main
methods and its ability to solve complex decision-making problems. We discussed the fundamental
concepts of RL, including the agent, environment, policy, states, actions, and rewards.
Additionally, we introduced a wide range of Reinforcement Learning techniques, starting with
tabular techniques such as multi-armed bandits, finite Markov decision processes, dynamic
programming, temporal difference learning and Monte Carlo methods. These methods are best
suited for problems with small, discrete state spaces. Policy iteration and value iteration are
two examples of such strategies : policy iteration improves an initial policy by alternately
evaluating and improving it until convergence, while value iteration computes the optimal value
function via iterative updates. These methods offer simplicity and convergence guarantees ;
however, they face scalability challenges in large or continuous state and action spaces because
of their tabular representation. We also introduced Approximation Techniques in Reinforcement
Learning, covering on-policy prediction, on-policy control, and off-policy methods. These
methods provide the scalability needed to deal with massive or continuous state and action
spaces. Instead of tabular representations, they use function approximation strategies such as
neural networks or linear functions. By learning compact representations of value functions or
policies, they allow RL algorithms to handle high-dimensional input spaces efficiently. However,
approximation errors can pose challenges for convergence and stability.
The integration of generative AI impacts Reinforcement Learning by helping both in mere data
generation and in objective-maximization tasks. In RL for mere generation, tools like Generative
Adversarial Networks (GANs) help create diverse and realistic data, improving the agent’s
learning process and efficiency. In addition, combining RL with generative models allows agents
to tackle objective-maximization tasks, for example optimizing hard-to-quantify features like
creativity or aesthetics.
However, training huge models such as Large Language Models (LLMs) or agents for open-world
video games presents a major difficulty, because single computers, even high-performance
ones, can hardly handle the task. In practice, it can only be accomplished by large IT companies
with access to supercomputers and numerous GPUs, which have therefore benefited greatly
from the development of cutting-edge AI technology. Smaller companies and individuals without
access to such resources find it difficult to compete in this race. Encouraging collaboration
across the various businesses in the AI field and making
those resources more widely available are imperative if we are to ensure that everyone has a
fair chance of contributing to AI developments. In this way, we can make certain that the
development of AI is not restricted to large businesses and that everyone can benefit from it.
In conclusion, Reinforcement Learning is a rapidly growing domain within the field of artificial
intelligence. With the introduction of new techniques and algorithms, RL continues to expand,
opening up new opportunities across multiple domains. From a general perspective, RL is being
adopted in many sectors such as robotics, finance, and healthcare. Companies are now exploring
the capabilities offered by RL to automate complex tasks and find optimal solutions.
APPENDIX A
Tabular-solution Algorithms Pseudo-code
In this appendix, we present the pseudo-code for all the algorithms we have covered.
Ideally, we do not want any part of the main content of our work to become obsolete in the
future : technology changes, standards change, and eventually the code could become unrunnable.
That is why we present pseudo-code instead of the actual implementation.
Pseudo-code provides a high-level representation of an algorithm’s logic without being tied to
a specific programming language or implementation details. However, for those interested in
the full implementation, the Python code can be found on GitHub at the following link :
https://github.com/Mahdyy02/reinforcement-learning-implementation
By separating the pseudo-code from the main content, we ensure that this work remains
relevant and adaptable to changes in technology and standards over time.
FUNCTION compute_state_value ( state )
    # Estimate the state's worth
    IF state equals 47 THEN
        RETURN 0
    END IF
    action = policy[state]
    FOR each state_tuple in env.unwrapped.P[state][action]
        probability , next_state , reward , info = state_tuple
        # Recursively compute the value of the next state
        RETURN reward + gamma * compute_state_value ( next_state )
    END FOR
END FUNCTION
Listing A.1 – State Value Computation
    # Update the improved policy with the action that maximizes the Q-value
    improved_policy[state] = max_action
END FOR
FUNCTION value_iteration () :
    # Initialize value function v and policy
    v = { state : 0 for state in range ( env.observation_space.n ) }
    policy = { state : 0 for state in range ( env.observation_space.n - 1 ) }
    threshold = 0.001
In our approach, we aim to maintain the same training loop structure while adapting the
update_q_table function to each algorithm. This strategy allows us to preserve the
integrity of the training process while accommodating the unique requirements and updates of
each algorithm. By isolating the algorithm-specific logic within the update_q_table function,
we ensure modularity and flexibility in our implementation. This approach also streamlines
the development process, making it easier to manage and debug individual components. As a
result, we can efficiently experiment with different algorithms without overhauling the entire
training pipeline, enabling rapid iteration and exploration of our different algorithms.
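As an illustration of this structure, here is a hedged Python sketch of such a training loop built on Gymnasium; the environment name, the episode count, and the placeholder update rule are assumptions and would be replaced by the algorithm under test.

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")          # assumed environment for illustration
Q = np.zeros((env.observation_space.n, env.action_space.n))

def update_q_table(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Placeholder update rule : swapped for SARSA, Q-learning, Double Q-learning, ...
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target

for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()               # exploration policy of choice
        next_state, reward, terminated, truncated, _ = env.step(action)
        update_q_table(Q, state, action, reward, next_state)
        state = next_state
        done = terminated or truncated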
# Calculate the average returns for each state-action pair and update the Q-table
FOR EACH ( state , action ) IN returns_count DO :
    IF returns_count[state, action] != 0 THEN :
        Q[state, action] = returns_sum[state, action] / returns_count[state, action]
# Calculate the average returns for each state-action pair and update the Q-table
FOR EACH ( state , action ) IN returns_count DO :
    IF returns_count[state, action] != 0 THEN :
        Q[state, action] = returns_sum[state, action] / returns_count[state, action]
FUNCTION epsilon_greedy () :
    r = random_uniform ()
    IF r < epsilon THEN
        arm = random_integer ( 0 , n_bandits )
    ELSE
        arm = argmax ( values )
    END IF
    RETURN arm
END FUNCTION

n_bandits = 5
true_bandit_probs = random_uniform ( n_bandits )
n_iterations = 500000
epsilon = 0.99
min_epsilon = 0.01
epsilon_decay = 0.999

FOR i = 1 TO n_iterations DO :
    arm = epsilon_greedy ()
    reward = random_uniform () < true_bandit_probs[arm]
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    values[arm] += ( reward - values[arm] ) / counts[arm]
    epsilon = max ( min_epsilon , epsilon * epsilon_decay )

FOR i = 1 TO n_bandits DO :
    PRINT ( "Bandit #" + i + " -> " + format ( true_bandit_probs[i] , ".2f" ) )
END FOR
Listing A.13 – Epsilon-greedy Multi-armed Bandits
References
1. Yu, L., Zhang, W., Wang, J., Yu, Y. (2016). ”SeqGAN : Sequence Generative
Adversarial Nets with Policy Gradient.”
2. Franceschelli, G., Musolesi, M. (2023). ”Reinforcement Learning for Generative
AI : State of the Art, Opportunities and Open Research Challenges.”
3. Le, H., Wang, Y., Gotmare, A. D., Savarese, S., Hoi, S. C. H. (2022). ”CodeRL :
Mastering Code Generation through Pretrained Models and Deep Reinforcement
Learning.”
4. Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D.
(2023). ”Deep Reinforcement Learning from Human Preferences.” OpenAI, DeepMind.
5. Tambwekar, P., Dhuliawala, M., Martin, L. J., Mehta, A., Harrison, B., & Riedl,
M. O. (2023). ”Controllable Neural Story Plot Generation via Reward Shaping.”
School of Interactive Computing, Georgia Institute of Technology ; Department
of Computer Science, University of Kentucky.
6. Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., Amodei, D. (2017).
”Deep Reinforcement Learning from Human Preferences.”
7. Ramamurthy, R., Ammanabrolu, P., Brantley, K., Hessel, J., Sifa, R., Bauckhage,
C., Hajishirzi, H., Choi, Y. (2022). ”Is Reinforcement Learning (Not) for
Natural Language Processing : Benchmarks, Baselines, and Building Blocks for
Natural Language Policy Optimization.”
8. Sutton, R. S., Mahmood, A. R., White, M. (2016). ”An Emphatic Approach to
the Problem of Off-policy Temporal-difference Learning.” Journal of Machine
Learning Research.
9. Howard, R. (1960). ”Dynamic Programming and Markov Processes.” MIT Press,
Cambridge, MA.
10. Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). ”Learning to Act Using Real-
time Dynamic Programming.” Artificial Intelligence.
11. Hugging Face.
12. OpenAI Gymnasium. (2017).
13. OpenAI ChatGPT-3.5. (2023).