
Reinforcement learning and dynamic programming for control

Lecture notes
Control prin învățare (Learning-based Control)
Master ICAF, 2012 Sem. 2
Lucian Busoniu
Contents

List of algorithms

I Reinforcement learning framework
1 Introduction
  1.1 The RL problem from the control perspective
  1.2 The artificial intelligence perspective
  1.3 The machine learning perspective
  1.4 RL and dynamic programming
  1.5 Importance of RL
2 The reinforcement learning problem
  2.1 Markov decision processes
  2.2 Optimality
  2.3 Value functions and Bellman equations
  2.4 Stochastic case

II Classical reinforcement learning and dynamic programming
3 Dynamic programming
  3.1 Value iteration
  3.2 Policy iteration
  3.3 Policy search
4 Monte Carlo methods
  4.1 Monte Carlo policy iteration
  4.2 The need for exploration
  4.3 Optimistic Monte Carlo learning
5 Temporal difference methods
  5.1 Q-learning
  5.2 SARSA
6 Accelerating temporal-difference learning
  6.1 Eligibility traces
  6.2 Experience replay
  6.3 Model learning and Dyna

III Approximate reinforcement learning
7 Function approximation
  7.1 Approximation architectures
  7.2 Approximation in the context of DP and RL
  7.3 Comparison of approximators
8 Offline approximate reinforcement learning
  8.1 Approximate value iteration
  8.2 Approximate policy iteration
  8.3 Theoretical properties. Comparison
9 Online approximate reinforcement learning
  9.1 Approximate Q-learning
  9.2 Approximate SARSA
  9.3 Actor-critic
  9.4 Discussion
10 Accelerating online approximate RL
  10.1 Eligibility traces
  10.2 Experience replay

Glossary

List of Algorithms
3.1 Q-iteration
3.2 Q-iteration for the stochastic case
3.3 Policy iteration
3.4 Iterative policy evaluation
3.5 Iterative policy evaluation for the stochastic case
3.6 Optimistic planning for online control
4.1 Monte Carlo policy iteration
4.2 Optimistic Monte Carlo
5.1 Q-learning
5.2 SARSA
6.1 Q(λ)
6.2 SARSA(λ)
6.3 ER Q-learning
6.4 Dyna-Q
8.1 Fuzzy Q-iteration
8.2 Fitted Q-iteration
8.3 Approximate policy iteration
8.4 Least-squares policy iteration
9.1 Approximate Q-learning
9.2 Approximate SARSA
9.3 Actor-critic
10.1 Approximate Q(λ)
10.2 Approximate ER Q-learning
Part I
Reinforcement learning framework
Chapter 1
Introduction
Reinforcement learning (RL) is a class of algorithms for solving problems in which
actions (decisions) are applied to a system over an extended period of time, in order
to achieve a desired goal. The time variable is usually discrete and actions are taken
at every discrete time step, leading to a sequential decision-making problem. The
actions are taken in closed loop, which means that the outcome of earlier actions is
monitored and taken into account when choosing new actions. Rewards are provided
that evaluate the one-step decision-making performance, and the goal is to optimize
the long-term performance, measured by the total reward accumulated over the course
of interaction.
Such decision-making problems appear in a wide variety of fields, including automatic control, computer science, artificial intelligence, operations research, economics, and medicine.
nomics, and medicine. In these notes, we primarily adopt a control-theoretic point of
view, and therefore employ control-theoretical notation and terminology, and choose
control systems as examples. Nevertheless, to provide a higher-level image of the
field, in this first chapter we also introduce the artificial-intelligence and machine-learning points of view on RL.
1.1 The RL problem from the control perspective
The main elements of the RL problem, together with their flow of interaction, are
represented in Figure 1.1: a controller interacts with a process by measuring states
and applying actions, and receives rewards according to a reward function.
Figure 1.1: The elements of RL and their flow of interaction (controller, process, reward function; signals: state, action, reward). The elements related to the reward are depicted in gray.

To clarify the meaning of these elements, consider the conceptual robotic navigation example from Figure 1.2, in which the robot shown in the bottom region must navigate to the goal on the top-right, while avoiding the obstacle represented by a gray block. (For instance, in the field of rescue robotics, the goal might represent the location of a victim to be rescued.) The controller is the robot's software, and the process consists of the robot's environment (the surface on which it moves, the obstacle, and the goal) together with the body of the robot itself. It should be emphasized
that in RL, the physical body of the decision-making entity (if it has one), its sensors
and actuators, as well as any fixed lower-level controllers, are all considered to be a
part of the process, whereas the controller is taken to be only the decision-making
algorithm.
Figure 1.2: A robotic navigation example (labels: state (position) x_k, action (step) u_k, next state x_{k+1}, reward r_{k+1}, goal). An example transition is also shown, in which the current and next states are indicated by black dots, the action by a black arrow, and the reward by a gray arrow. The dotted silhouette represents the robot in the next state.
In the navigation example, the state x is the position of the robot on the surface,
given, e.g., in Cartesian coordinates, and the action u is a step taken by the robot,
similarly given in Cartesian coordinates. The discrete time step is denoted by k. As
a result of taking a step from the current position, the next position is obtained, ac-
cording to a transition function. Because both the positions and steps are represented
in Cartesian coordinates, the transitions are typically additive: the next position is
the sum of the current position and the step taken. More complicated transitions are
obtained if the robot collides with the obstacle. Note that for simplicity, most of the
dynamics of the robot, such as the motion of the wheels, have not been taken into ac-
count here. For instance, if the wheels can slip on the surface, the transitions become
stochastic, in which case the next state is a random variable.
The quality of every transition is measured by a reward r, generated according
to the reward function. For instance, the reward could have a positive value such as
10 if the robot reaches the goal, a negative value such as -1, representing a penalty,
if the robot collides with the obstacle, and a neutral value of 0 for any other transi-
tion. Alternatively, more informative rewards could be constructed, using, e.g., the
distances to the goal and to the obstacle. We follow the convention that the reward obtained as a result of the transition from x_k to x_{k+1} has time index k + 1.
The goal is to maximize the return, consisting of the cumulative reward over the
course of interaction. We mainly consider discounted infinite-horizon returns, which accumulate rewards obtained along (possibly infinitely long) trajectories starting at
the initial time step k = 0, and weigh the rewards by a factor that decreases exponen-
tially as the time step increases:

γ^0 r_1 + γ^1 r_2 + γ^2 r_3 + ...    (1.1)
The discount factor γ ∈ [0, 1) can be seen as a measure of how far-sighted the
controller is in considering its rewards. Figure 1.3 illustrates the computation of the
discounted return for the navigation problem of Figure 1.2.

Figure 1.3: The discounted return along a trajectory of the robot, with terms γ^0 r_1, γ^1 r_2, γ^2 r_3, γ^3 r_4. The decreasing heights of the gray vertical bars indicate the exponentially diminishing nature of the discounting applied to the rewards.
The core challenge is therefore to arrive at a solution that optimizes the long-
term performance given by the return, using only reward information that describes
the immediate performance.
1.2 The artificial intelligence perspective

In artificial intelligence, RL is useful to learn optimal behavior for intelligent agents, which monitor their environment through perceptions and influence it by applying actions. This view of the RL problem is shown in Figure 1.4.
Figure 1.4: RL from the AI perspective (agent, environment, reward function; signals: perception (state), action, reward).
As an example, we can consider again the robotic navigation problem above;
autonomous mobile robotics is an application domain where automatic control and
AI meet in a natural way. We can simply view the robot as the artificial agent that
must accomplish a task in its environment.
Note that in AI, the rewards are viewed as produced by the environment, and the
reward function is considered as a (possibly unknown) part of this environment. In
contrast, in control the reward function is simply a performance evaluation compo-
nent, under the complete control of the experimenter.
1.3 The machine learning perspective
Machine learning is the subfield of computer science and AI concerned with algo-
rithms that analyze data in order to achieve various types of results. From the per-
spective of machine learning, RL sits in-between two other paradigms: supervised
and unsupervised learning, see Figure 1.5.
Figure 1.5: RL on the machine learning spectrum, between supervised learning (more informative feedback) and unsupervised learning (less informative feedback).
Supervised learning is about generalizing input-output relationships from exam-
ples. In each learning example, an input is associated with the correct corresponding
output, so in a sense there is a teacher guiding the learning process step-by-step.
Once learning is completed, the algorithm receives new inputs that it has not nec-
essarily seen before, and must (approximately) determine outputs corresponding to
these inputs. In RL, the "teacher" (reward function) never provides the correct, optimal actions, but only the less informative reward signal, which the algorithm must then use to find the correct actions by itself, a significantly harder problem.
As an example of supervised learning, consider a robot arm equipped with a grip-
per and a camera sensor, which disassembles used electrical motors. The robot has to
sort motor parts transported on a conveyor belt into several classes, after recognizing
them by their features, such as shape, color, size or texture. In this case, a supervised
learning algorithm would rst be provided with a training set of known objects with
their features (inputs) and classes (outputs), and would then have to find the class of
each new object arriving on the belt.
Unsupervised learning is concerned with finding patterns and relationships in data that does not have any well-defined outputs. As there are no outputs, there can be no information at all about which output is correct, and in this sense, we say
that in unsupervised learning there is less informative feedback than in RL. As an
example, a retailer could look at the purchasing habits of their customers (prices of
items purchased, frequency of purchases, etc.) and try to organize them into several
market segments without knowing in advance what these segments should be, and
perhaps not even knowing how many segments there should be.
This view is useful to understand reinforcement learning, but should be consid-
ered in light of the following non-obvious fact. Both supervised and unsupervised
learning are concerned only with making certain statements about (often static) data.
RL uses the data to learn how to control a dynamical system, which adds an extra
dimension and new challenges that make RL different from the rest of the machine
learning field.
1.4 RL and dynamic programming
If a mathematical model of the process-to-be-controlled is available, a class of model-
based methods called dynamic programming (DP) can be applied. A key benet of
DP methods is that they make few assumptions on the system, which can have very
general, nonlinear and stochastic dynamics. This is in contrast to more classical tech-
niques from automatic control, many of which require restrictive assumptions on the
system, such as linearity or determinism. Moreover, some DP methods do not require
an analytical expression of the model, but are able to work with a simulation model
instead: a model that does not expose the mathematical expressions for the functions f and ρ, but can be used to generate next states and rewards for any state-action pair.
Constructing a simulation model is often easier than deriving an analytical model,
especially when the system behavior is stochastic.
DP is not usually seen as being a part of RL, but most RL algorithms have their theoretical foundation in DP, so we will devote significant space to explaining DP in
these notes.
Sometimes, a model of the system cannot be obtained, e.g., because the system is
not fully known beforehand, is insufficiently understood, or obtaining a model is too costly. RL methods are helpful in this case, since they work using only data obtained from the system, without requiring a model of its behavior. Offline RL methods are applicable if sufficient data can be obtained in advance. Online RL algorithms learn
a solution by interacting with the system, and can therefore be applied when data is
not available in advance. Note that RL methods can, of course, also be applied when
a model is available, simply by using the model in place of the real system. It is very
common to benchmark RL methods on simulation models, possibly as a preliminary
step to applying them in real-life problems.
1.5 Importance of RL
We close this chapter by summarizing the benefits of RL in control and AI, and briefly mentioning some applications. In automatic control, RL algorithms learn how
to optimally control very general systems that may be unknown beforehand, which
makes them extremely useful. In AI, RL provides a way to build learning agents
that optimize their behavior in initially unknown environments. RL has obtained
impressive successes in applications such as backgammon playing (Tesauro, 1995),
elevator scheduling (Crites and Barto, 1998), simulated treatment of HIV infections
(Ernst et al., 2006), autonomous helicopter flight (Abbeel et al., 2007), robot control
(Peters and Schaal, 2008), interfacing an animal brain with a robot arm (DiGiovanna
et al., 2009), etc.¹

¹ The bibliography can be found at the end of Part I. Separate bibliographies are provided per part.
Chapter 2
The reinforcement learning
problem
We will now consider in more detail the reinforcement learning problem and its solution, delving more into the mathematical foundations of the field.
2.1 Markov decision processes
RL problems can be formalized with the help of Markov decision processes (MDPs).
We focus on the case of MDPs with deterministic state transitions.
A deterministic MDP is defined by the state space X of the process, the action space U of the controller, the transition function f of the process (which describes how the state changes as a result of control actions), and the reward function ρ (which evaluates the immediate control performance). As a result of the action u_k applied in the state x_k at the discrete time step k, the state changes to x_{k+1}, according to the transition function f : X × U → X:

x_{k+1} = f(x_k, u_k)

At the same time, the controller receives the scalar reward signal r_{k+1}, according to the reward function ρ : X × U → R:

r_{k+1} = ρ(x_k, u_k)

where we assume that ||ρ||_∞ = max_{x,u} |ρ(x, u)| is finite.¹ The reward evaluates the immediate effect of action u_k, namely the transition from x_k to x_{k+1}, but in general does not say anything about its long-term effects.

¹ If a maximum does not exist, the supremum (sup) should be used instead. Also note that since the domains of the variables over which we are maximizing are obvious from the context (X and U), they were omitted from the formula. To avoid clutter, these types of simplifications will also be performed in the sequel.
The controller chooses actions according to its policy h : X → U, using:

u_k = h(x_k)

In control theory such a control law is called a state feedback.

Given f and ρ, the current state x_k and the current action u_k are sufficient to determine both the next state x_{k+1} and the reward r_{k+1}. This is called the Markov property, which is necessary to provide theoretical guarantees about DP/RL algorithms.
Some MDPs have terminal states that, once reached, can no longer be left; all
the rewards received in terminal states are 0. In this case, a trial or episode is a
trajectory starting from some initial state and ending in a terminal state. MDPs with
terminal states are called episodic, while MDPs without terminal states are called
continuing.
Example 2.1 The cleaning-robot MDP. Consider the problem depicted in Figure 2.1: a cleaning robot has to collect a used can and also has to recharge its batteries.

Figure 2.1: The cleaning-robot problem (states x = 0, 1, ..., 5; actions u = -1 and u = 1; reward r = 1 for reaching the left end and r = 5 for reaching the right end).

In this problem, the state x describes the position of the robot, and the action u describes the direction of its motion. The state space is discrete and contains six distinct states, denoted by integers 0 to 5: X = {0, 1, 2, 3, 4, 5}. The robot can move to the left (u = -1) or to the right (u = 1); the discrete action space is therefore U = {-1, 1}. States 0 and 5 are terminal, meaning that once the robot reaches either of them it can no longer leave, regardless of the action. The corresponding transition function is:

f(x, u) = { x + u   if 1 ≤ x ≤ 4
          { x       if x = 0 or x = 5 (regardless of u)

In state 5, the robot finds a can and the transition into this state is rewarded with 5. In state 0, the robot can recharge its batteries and the transition into this state is rewarded with 1. All other rewards are 0. In particular, taking any action while in a terminal state results in a reward of 0, which means that the robot will not accumulate (undeserved) rewards in the terminal states. The corresponding reward function is:

ρ(x, u) = { 5   if x = 4 and u = 1
          { 1   if x = 1 and u = -1
          { 0   otherwise
∎
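To make the example concrete, the MDP above can be written down in a few lines of code. The following is a minimal Python sketch (not part of the original text; the names X, U, f and rho are chosen here for illustration):

# Deterministic cleaning-robot MDP of Example 2.1 (illustrative sketch).
X = [0, 1, 2, 3, 4, 5]      # state space
U = [-1, 1]                 # action space: move left, move right

def f(x, u):
    # Transition function: terminal states 0 and 5 are absorbing.
    if x == 0 or x == 5:
        return x
    return x + u

def rho(x, u):
    # Reward function: 5 for entering state 5, 1 for entering state 0.
    if x == 4 and u == 1:
        return 5.0
    if x == 1 and u == -1:
        return 1.0
    return 0.0

This representation is reused in later sketches when the DP algorithms of Chapter 3 are illustrated.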
2.2 Optimality
In RL and DP, the goal is to find an optimal policy that maximizes the return from any initial state x_0. The return is a cumulative aggregation of rewards along a trajectory starting at x_0. It concisely represents the reward obtained by the controller in the long run. Several types of return exist, depending on the way in which the rewards are accumulated. We will mostly be concerned with the infinite-horizon discounted return, given by:

R^h(x_0) = Σ_{k=0}^∞ γ^k r_{k+1} = Σ_{k=0}^∞ γ^k ρ(x_k, h(x_k))    (2.1)

where γ ∈ [0, 1) is the discount factor and x_{k+1} = f(x_k, h(x_k)) for k ≥ 0. The discount factor can be interpreted intuitively as a measure of how "far-sighted" the controller is in considering its rewards, or as a way of taking into account increasing uncertainty about future rewards. From a mathematical point of view, discounting ensures that the return will always be finite if the rewards are finite. The goal is therefore to maximize the long-term performance (return), while only using feedback about the immediate, one-step performance (reward). This leads to the so-called challenge of delayed rewards: actions taken in the present affect the potential to achieve good rewards far in the future, but the immediate reward provides no information about these long-term effects.
Other types of return can also be defined. For example, the infinite-horizon average return is:

lim_{K→∞} (1/K) Σ_{k=0}^{K} ρ(x_k, h(x_k))

Finite-horizon returns can be obtained by accumulating rewards along trajectories of a fixed, finite length K (the horizon), instead of along infinitely long trajectories. For instance, the finite-horizon discounted return can be defined as:

Σ_{k=0}^{K} γ^k ρ(x_k, h(x_k))

We will mainly use the infinite-horizon discounted return (2.1), because it has useful theoretical properties. In particular, for this type of return, under certain technical assumptions, there always exists at least one stationary optimal policy h* : X → U. (In contrast, in the finite-horizon case, optimal policies depend in general on the time step k, i.e., they are nonstationary.)

In practice, a good value of γ has to be chosen. Choosing γ often involves a trade-off between the quality of the solution and the convergence rate of the DP/RL algorithm. Some important DP/RL algorithms converge faster when γ is smaller, but if γ is too small, the solution may be unsatisfactory because it does not sufficiently take into account rewards obtained after a large number of steps. There is no generally valid procedure for choosing γ, but a rough guideline would be that good discount factors are often in the range [0.9, 1), and the value should increase with the length of transients in typical trajectories of the system.
Example 2.2 Choosing the discount factor for the stabilization of a simple system. Consider a linear first-order system that must be stabilized, for which we know from prior knowledge that the typical (stable) trajectory of the state looks like in Figure 2.2. The sampling time is T_s = 0.25 s.

Figure 2.2: Typical trajectory of a simple system (state x versus time t = k T_s, for t from 0 to 20 s).

Presumably, the rewards in the time interval t ∈ [15, 20] indicate that stabilization has been achieved. For the learning performance, it is important that information from these rewards is sufficiently visible in the discounted return of the initial state at t = 0; that is, that the exponential discounting corresponding to k = 15/T_s = 60 is not too small. Choosing this discounting to be at least 0.05, we get:

γ^60 ≥ 0.05  ⟺  log γ ≥ (log 0.05)/60  ⟺  γ ≥ exp((log 0.05)/60) ≈ 0.9513

Indeed, choosing γ = 0.96, we get the shape in Figure 2.3 for the discounting curve γ^k, which is still above 0.05 at t = 15 s (k = 60). ∎
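The bound used in this example is easy to check numerically. A small illustrative Python sketch, using the values of the example:

import math

T_s = 0.25                          # sampling time [s]
k = int(15 / T_s)                   # 60 steps until t = 15 s
# Require gamma**k >= 0.05, i.e. gamma >= exp(log(0.05) / k)
gamma_min = math.exp(math.log(0.05) / k)
print(gamma_min)                    # approximately 0.9513
print(0.96 ** k)                    # approximately 0.086, indeed above 0.05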

2.3 Value functions and Bellman equations

A convenient way to characterize policies and optimality is by using value functions. Two types of value functions exist: state-action value functions (Q-functions) and state value functions (V-functions). We will first define and characterize Q-functions, and then turn our attention to V-functions.

The Q-function Q^h : X × U → R of a policy h gives the return obtained when starting from a given state, applying a given action, and following h thereafter:

Q^h(x, u) = ρ(x, u) + γ R^h(f(x, u))    (2.2)
Figure 2.3: The evolution of the discounting γ^k for γ = 0.96 (plotted against t = k T_s).
Here, R^h(f(x, u)) is the return from the next state f(x, u). This concise formula can be obtained by first writing Q^h(x, u) explicitly as the discounted sum of rewards obtained by taking u in x and then following h:

Q^h(x, u) = Σ_{k=0}^∞ γ^k ρ(x_k, u_k)

where (x_0, u_0) = (x, u), x_{k+1} = f(x_k, u_k) for k ≥ 0, and u_k = h(x_k) for k ≥ 1. Then, the first term is separated from the sum:

Q^h(x, u) = ρ(x, u) + Σ_{k=1}^∞ γ^k ρ(x_k, u_k)
          = ρ(x, u) + γ Σ_{k=1}^∞ γ^{k-1} ρ(x_k, h(x_k))
          = ρ(x, u) + γ R^h(f(x, u))    (2.3)

where the definition (2.1) of the return was used in the last step. So, (2.2) has been obtained.
The optimal Q-function is defined as the best Q-function that can be obtained by any policy:

Q*(x, u) = max_h Q^h(x, u)    (2.4)

Any policy h* that selects at each state an action with the largest optimal Q-value, i.e., that satisfies:

h*(x) ∈ argmax_u Q*(x, u)    (2.5)

is optimal (it maximizes the return).

In general, for a given Q-function Q, a policy h that satisfies:

h(x) ∈ argmax_u Q(x, u)    (2.6)

is said to be greedy in Q. So, finding an optimal policy can be done by first finding Q*, and then using (2.5) to compute a greedy policy in Q*.
When there are multiple greedy actions for a given state, we say there is a tie, which must be resolved by selecting one of these actions (e.g., by randomly picking one of them, or by picking the first action in some natural order over the action space). When Q = Q*, no matter which greedy action is chosen, the policy is still optimal.
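For a finite MDP stored as a table, computing a greedy policy (2.6) with such a tie-breaking rule is straightforward. A minimal sketch, assuming the Q-function is a Python dict indexed by state-action pairs (the names are illustrative, not taken from the text):

def greedy_policy(Q, X, U):
    # Returns a policy h with h(x) in argmax_u Q(x, u).
    # Ties are broken by keeping the first maximizing action in U.
    h = {}
    for x in X:
        best_u, best_q = None, float("-inf")
        for u in U:
            if Q[(x, u)] > best_q:      # strict '>' keeps the first maximizer
                best_u, best_q = u, Q[(x, u)]
        h[x] = best_u
    return h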


The Q-functions Q^h and Q* are recursively characterized by the Bellman equations, which are of central importance for RL algorithms. The Bellman equation for Q^h states that the value of taking action u in state x under the policy h equals the sum of the immediate reward and the discounted value achieved by h in the next state:

Q^h(x, u) = ρ(x, u) + γ Q^h(f(x, u), h(f(x, u)))    (2.7)
This Bellman equation can be derived from the second step in (2.3), as follows:

Q^h(x, u) = ρ(x, u) + γ Σ_{k=1}^∞ γ^{k-1} ρ(x_k, h(x_k))
          = ρ(x, u) + γ [ ρ(f(x, u), h(f(x, u))) + γ Σ_{k=2}^∞ γ^{k-2} ρ(x_k, h(x_k)) ]
          = ρ(x, u) + γ Q^h(f(x, u), h(f(x, u)))

where (x_0, u_0) = (x, u), x_{k+1} = f(x_k, u_k) for k ≥ 0, and u_k = h(x_k) for k ≥ 1.
The Bellman optimality equation characterizes Q*, and states that the optimal value of action u taken in state x equals the sum of the immediate reward and the discounted optimal value obtained by the best action in the next state:

Q*(x, u) = ρ(x, u) + γ max_{u'} Q*(f(x, u), u')    (2.8)
The V-function V^h : X → R of a policy h is the return obtained by starting from a particular state and following h. This V-function can be computed from the Q-function of policy h:

V^h(x) = R^h(x) = Q^h(x, h(x))    (2.9)

The optimal V-function is the best V-function that can be obtained by any policy, and can be computed from the optimal Q-function:

V*(x) = max_h V^h(x) = max_u Q*(x, u)    (2.10)

An optimal policy h* can be computed from V*, by using the fact that it satisfies:

h*(x) ∈ argmax_u [ρ(x, u) + γ V*(f(x, u))]    (2.11)

Using this formula is more difficult than using (2.5); in particular, a model of the MDP is required in the form of the dynamics f and the reward function ρ. Because
the Q-function also depends on the action, it already includes information about the
quality of transitions. In contrast, the V-function only describes the quality of the
states; in order to infer the quality of transitions, they must be explicitly taken into
account. This is what happens in (2.11), and this also explains why it is more difficult
to compute policies from V-functions. Because of this difference, Q-functions will
be preferred to V-functions throughout this book, even though they are more costly
to represent than V-functions, as they depend both on x and u.
The V-functions V^h and V* satisfy the following Bellman equations, which can be interpreted similarly to (2.7) and (2.8):

V^h(x) = ρ(x, h(x)) + γ V^h(f(x, h(x)))    (2.12)

V*(x) = max_u [ρ(x, u) + γ V*(f(x, u))]    (2.13)
2.4 Stochastic case
In closing the first part of these notes, we briefly outline the stochastic MDP case.
In a stochastic MDP, the next state is not deterministically given by the current state
and action. Instead, the next state is a random variable, with a probability distribution
depending on the current state and action.
More formally, the deterministic transition function f is replaced by a transition probability function f̃ : X × U × X → [0, 1]. After action u_k is taken in state x_k, the probability that the next state x_{k+1} equals x' is:

P(x_{k+1} = x' | x_k, u_k) = f̃(x_k, u_k, x')

For any x and u, f̃(x, u, ·) must define a valid probability distribution, where the dot stands for the random variable x'; that is, Σ_{x'} f̃(x, u, x') = 1. We assume here that the next state x' can only take a finite number of possible values (see e.g. the cleaning-robot example below).
Because rewards are associated with transitions, and the transitions are no longer fully determined by the current state and action, the reward function also has to depend on the next state, ρ̃ : X × U × X → R. After each transition to a state x_{k+1}, a reward r_{k+1} is received according to:

r_{k+1} = ρ̃(x_k, u_k, x_{k+1})

where we assume that ||ρ̃||_∞ = max_{x,u,x'} |ρ̃(x, u, x')| is finite. Note that we consider ρ̃ to be a deterministic function of the transition (x_k, u_k, x_{k+1}). This means that, once x_{k+1} has been generated, the reward r_{k+1} is fully determined.
Example 2.3 The stochastic cleaning-robot MDP. Consider again the cleaning-robot problem of Example 2.1. Assume that, due to uncertainties in the environment, such as a slippery floor, state transitions are no longer deterministic. When trying to move in a certain direction, the robot succeeds with a probability of only 0.8. With a probability of 0.15 it remains in the same state, and it may even move in the opposite direction with a probability of 0.05 (see also Figure 2.4).

Figure 2.4: The stochastic cleaning-robot problem (for the intended action u: probability 0.8 of moving as intended, 0.15 of standing still, 0.05 of moving in the opposite direction). The robot intends to move right, but it may instead end up standing still or moving left, with different probabilities.

The transition function f̃ that models the probabilistic transitions described above is shown in Table 2.1. In this table, the rows correspond to combinations of current states and actions taken, while the columns correspond to future states. Note that the transitions from any terminal state still lead deterministically to the same terminal state, regardless of the action.
Table 2.1: Dynamics of the stochastic cleaning-robot MDP.

(x, u)     f̃(x,u,0)  f̃(x,u,1)  f̃(x,u,2)  f̃(x,u,3)  f̃(x,u,4)  f̃(x,u,5)
(0, -1)    1         0         0         0         0         0
(1, -1)    0.8       0.15      0.05      0         0         0
(2, -1)    0         0.8       0.15      0.05      0         0
(3, -1)    0         0         0.8       0.15      0.05      0
(4, -1)    0         0         0         0.8       0.15      0.05
(5, -1)    0         0         0         0         0         1
(0, 1)     1         0         0         0         0         0
(1, 1)     0.05      0.15      0.8       0         0         0
(2, 1)     0         0.05      0.15      0.8       0         0
(3, 1)     0         0         0.05      0.15      0.8       0
(4, 1)     0         0         0         0.05      0.15      0.8
(5, 1)     0         0         0         0         0         1
The robot receives rewards as in the deterministic case: upon reaching state 5, it is rewarded with 5, and upon reaching state 0, it is rewarded with 1. The corresponding reward function, in the form ρ̃ : X × U × X → R, is:

ρ̃(x, u, x') = { 5   if x ≠ 5 and x' = 5
              { 1   if x ≠ 0 and x' = 0
              { 0   otherwise
∎
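For computations, Table 2.1 can be stored directly as a three-dimensional array indexed by (x, u, x'). A minimal NumPy sketch (the array name and the convention that action index 0 stands for u = -1 and index 1 for u = 1 are ours, not the text's):

import numpy as np

n = 6                                   # number of states
f_tilde = np.zeros((n, 2, n))           # f_tilde[x, a, x'] = P(x' | x, u)
f_tilde[0, :, 0] = 1.0                  # terminal states are absorbing
f_tilde[5, :, 5] = 1.0
for x in range(1, 5):
    # action index 0: intended move left (u = -1)
    f_tilde[x, 0, x - 1] = 0.8
    f_tilde[x, 0, x] = 0.15
    f_tilde[x, 0, x + 1] = 0.05
    # action index 1: intended move right (u = +1)
    f_tilde[x, 1, x + 1] = 0.8
    f_tilde[x, 1, x] = 0.15
    f_tilde[x, 1, x - 1] = 0.05

assert np.allclose(f_tilde.sum(axis=2), 1.0)    # each row is a valid distribution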

The deterministic return from (2.1) is no longer well defined in the stochastic case. Instead, the expected infinite-horizon discounted return of an initial state x_0 under a (deterministic) policy h is defined as:

R^h(x_0) = E_{x_{k+1} ~ f̃(x_k, h(x_k), ·)} { Σ_{k=0}^∞ γ^k r_{k+1} }
         = E_{x_{k+1} ~ f̃(x_k, h(x_k), ·)} { Σ_{k=0}^∞ γ^k ρ̃(x_k, h(x_k), x_{k+1}) }    (2.14)

where E denotes expectation and the notation x_{k+1} ~ f̃(x_k, h(x_k), ·) means that the next state x_{k+1} is drawn from the probability distribution f̃(x_k, h(x_k), ·) at each step k. Thus, the expectation is taken over all the stochastic transitions, or equivalently over the entire stochastic trajectory. Intuitively, the expected return may be seen as the sum of the returns of all possible trajectories, where each return is weighted by the probability of its respective trajectory; the sum of these weights is 1. An important property of stochastic MDPs is that there always exists a deterministic optimal policy that maximizes (2.14); so even though the dynamics are stochastic, we only need to consider deterministic policies when characterizing the solution.²
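The expectation in (2.14) can also be approximated by simulation: generate many trajectories under h, compute the (truncated) discounted return of each, and average. A minimal sketch, in which sample_next is assumed to draw x' from f̃(x, u, ·) and rho_tilde to return the transition reward (both are placeholders introduced here for illustration):

def estimate_return(x0, h, sample_next, rho_tilde, gamma, n_traj=1000, K=100):
    # Monte Carlo estimate of the expected discounted return R^h(x0).
    total = 0.0
    for _ in range(n_traj):
        x, ret, discount = x0, 0.0, 1.0
        for _ in range(K):                  # truncate the infinite sum at K steps
            u = h(x)
            x_next = sample_next(x, u)      # draw x' ~ f_tilde(x, u, .)
            ret += discount * rho_tilde(x, u, x_next)
            discount *= gamma
            x = x_next
        total += ret
    return total / n_traj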
In the stochastic case, the Q-function is the expected return under the one-step stochastic transitions, when starting in a particular state, applying a particular action, and following the policy h thereafter:

Q^h(x, u) = E_{x' ~ f̃(x,u,·)} { ρ̃(x, u, x') + γ R^h(x') }    (2.15)

The definition of the optimal Q-function Q* remains unchanged from the deterministic case: Q*(x, u) = max_h Q^h(x, u), and optimal policies can still be computed as greedy policies in Q*: h*(x) ∈ argmax_u Q*(x, u).

Like the return, the Bellman equations for Q^h and Q* must now be given using expectations over the stochastic transitions. Because the next state x' can only take a finite number of values, these expectations can be written as sums:
Q^h(x, u) = E_{x' ~ f̃(x,u,·)} { ρ̃(x, u, x') + γ Q^h(x', h(x')) }
          = Σ_{x'} f̃(x, u, x') [ ρ̃(x, u, x') + γ Q^h(x', h(x')) ]    (2.16)

Q*(x, u) = E_{x' ~ f̃(x,u,·)} { ρ̃(x, u, x') + γ max_{u'} Q*(x', u') }
         = Σ_{x'} f̃(x, u, x') [ ρ̃(x, u, x') + γ max_{u'} Q*(x', u') ]    (2.17)

Thus, the weighted-sum interpretation of the expectation becomes explicit in these equations.

² We use probability theory rather informally in this section; a mathematically complete formalism would require significant additional development. We point the interested reader to (Bertsekas and Shreve, 1978) for a complete development, and do not consider these difficulties further here.
In the remainder of the lecture notes, we focus mainly on deterministic systems.
Nevertheless, when the extensions of certain algorithms and results to the stochastic
case are important, we separately present these extensions. Whenever the stochastic
case is considered, this is explicitly mentioned in the text.
Bibliographical notes for Part I
The following textbooks provide detailed descriptions of RL and DP: (Bertsekas and
Tsitsiklis, 1996; Bertsekas, 2007) are optimal-control oriented, (Sutton and Barto,
1998; Szepesvári, 2010; Sigaud and Buffet, 2010) have an artificial-intelligence per-
spective, (Powell, 2007) is operations-research oriented, while our recent book
(Busoniu et al., 2010) focuses explicitly on approximate RL and DP for control prob-
lems. Markov decision processes are described by (Puterman, 1994), and the mathe-
matical conditions under which MDPs have solutions for the various types of return
can be found in (Bertsekas and Shreve, 1978). The books (Busoniu et al., 2010; Sut-
ton and Barto, 1998; Bertsekas, 2007) are most useful to supplement the information
in these lecture notes.
Bibliography
Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Schölkopf, B., Platt, J. C., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 1-8. MIT Press.

Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition.

Bertsekas, D. P. and Shreve, S. E. (1978). Stochastic Optimal Control: The Discrete Time Case. Academic Press.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

Busoniu, L., Babuška, R., De Schutter, B., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. Automation and Control Engineering. Taylor & Francis CRC Press.

Crites, R. H. and Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235-262.

DiGiovanna, J., Mahmoudi, B., Fortes, J., Principe, J. C., and Sanchez, J. C. (2009). Coadaptive brain-machine interface via reinforcement learning. IEEE Transactions on Biomedical Engineering, 56(1):54-64.

Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. (2006). Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. In Proceedings 45th IEEE Conference on Decision & Control, pages 667-672, San Diego, US.

Peters, J. and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21:682-697.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

Sigaud, O. and Buffet, O., editors (2010). Markov Decision Processes in Artificial Intelligence. Wiley.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Szepesvári, Cs. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68.
Part II
Classical reinforcement learning and dynamic programming
Introduction and outline
In this part, we present classical reinforcement learning (RL) algorithms, starting
from their dynamic programming (DP) roots.
Classical algorithms require exact representations of value functions (Q-functions
Q(x, u) or V-functions V(x)) and/or policies h(x). In general, this can only be achieved
by storing separate values for each state or state-action pair, a representation called
tabular because it can be seen as a table (say, for the Q-function, with x on the
rows, u on the columns, and Q-values in the cells). This means that, in practice, the
classical algorithms only work when the state space X and the action space U are dis-
crete and contain a relatively small number of elements. This class of problems does
include interesting examples, such as some high-level decision-making problems and
certain board games; however, most automatic control problems are ruled out, as they
have continuous state and action spaces. We will discuss methods to deal with such
problems in Part III of these notes; those methods are all extensions of the classical
DP and RL algorithms discussed here.
Figure 2.5: Classification of the classical DP and RL algorithms and enhancements discussed: dynamic programming (value iteration, policy iteration, policy search & planning); reinforcement learning with Monte Carlo methods (Monte Carlo policy iteration, optimistic Monte Carlo) and temporal difference methods (Q-learning, SARSA); and techniques for accelerating temporal difference learning (eligibility traces: Q(λ), SARSA(λ); experience replay: ER Q-learning; model learning: Dyna-Q).
Figure 2.5 classifies the methods discussed in the present part. We start by introducing two classical DP methods for finding the optimal solution of a Markov
decision process (MDP): value iteration and policy iteration. Afterwards, we discuss
techniques that search directly in the space of policies or control actions. All these
techniques are model-based, that is, they use a model of the MDP in the form of the
transition dynamics f and reward function ρ. (Note that other textbooks may exclude
direct policy search from the DP class, restricting DP to value- and policy-iteration
algorithms. We use the term DP to generically mean any model-based method.)
We then move on to RL, model-free methods. First, Monte Carlo methods are dis-
cussed; these methods learn on a trajectory-by-trajectory basis, only making changes
to their value function and policy at the end of trajectories. After Monte Carlo meth-
ods, temporal-difference learning is introduced: a fully incremental class of methods
that learn on a sample-by-sample basis. Two major temporal difference methods are
presented, Q-learning and SARSA, which can respectively be viewed as online, in-
cremental variants of value and policy iteration. Both MC and temporal difference
methods learn online, by interacting with the system.
Additionally, we describe several ways to increase the learning speed of temporal
difference methods: using so-called eligibility traces, reusing raw data, and learning
a model of the MDP which is then used to generate new data. Modified variants of
the standard algorithms are introduced exploiting all of these enhancements.
Throughout this part, we introduce algorithms from the deterministic perspective,
going into the stochastic setting only when necessary. Despite this, all RL algorithms that we introduce work in general problems, deterministic as well as stochastic. The
DP algorithms typically do not; so for value and policy iteration, we present exten-
sions to the stochastic case.
We describe algorithms that employ Q-functions rather than V-functions. The
cleaning robot task of Example 2.1 is employed to illustrate the behavior of several
representative algorithms.
Chapter 3
Dynamic programming
DP algorithms can be broken down into three subclasses, according to the path taken to find an optimal policy:

• Value iteration algorithms search for the optimal value function (recall this consists of the maximal returns from every state or from every state-action pair). The optimal value function is used to compute an optimal policy.

• Policy iteration algorithms evaluate policies by constructing their value functions, and use these value functions to find new, improved policies.

• Policy search algorithms use optimization and related techniques to directly search for an optimal policy. From this class, we also discuss a planning algorithm that ingeniously combines policy search with features of value iteration.
3.1 Value iteration

Value iteration techniques use the Bellman optimality equation to iteratively compute an optimal value function, from which an optimal policy is derived. To solve the Bellman optimality equation, knowledge of the transition and reward functions is employed.

In particular, we introduce the Q-iteration algorithm, which computes Q-functions. Let the set of all the Q-functions be denoted by Q. Then, the Q-iteration mapping T : Q → Q computes the right-hand side of the Bellman optimality equation (2.8) for any Q-function:¹

[T(Q)](x, u) = ρ(x, u) + γ max_{u'} Q(f(x, u), u')    (3.1)

¹ The term "mapping" is used to refer to functions that work with other functions as inputs and/or outputs. The term is used to differentiate mappings from ordinary functions, which only have numerical scalars, vectors, or matrices as inputs and/or outputs.
The Q-iteration algorithm starts from an arbitrary Q-function Q_0 and at each iteration ℓ updates the Q-function using:

Q_{ℓ+1} = T(Q_ℓ)    (3.2)

When rewritten using the Q-iteration mapping, the Bellman optimality equation (2.8) states that Q* is a fixed point of T, i.e.:

Q* = T(Q*)    (3.3)

It can be shown that T is a contraction with factor γ < 1 in the infinity norm, i.e., for any pair of functions Q and Q', it is true that:

||T(Q) - T(Q')||_∞ ≤ γ ||Q - Q'||_∞

Because T is a contraction, it has a unique fixed point, and by (3.3) this point is actually Q*. Due to its contraction nature, Q-iteration asymptotically converges to Q* as ℓ → ∞, at a rate of γ, in the sense that ||Q_{ℓ+1} - Q*||_∞ ≤ γ ||Q_ℓ - Q*||_∞. An optimal policy can be computed from Q* with (2.5).

Algorithm 3.1 presents Q-iteration in an explicit, procedural form, wherein T is computed using (3.1).
Algorithm 3.1 Q-iteration.
Input: dynamics f, reward function ρ, discount factor γ
1: initialize Q-function, e.g., Q_0(x, u) ← 0 for all x, u
2: repeat at every iteration ℓ = 0, 1, 2, ...
3:   for every (x, u) do
4:     Q_{ℓ+1}(x, u) ← ρ(x, u) + γ max_{u'} Q_ℓ(f(x, u), u')
5:   end for
6: until stopping criterion is satisfied
Output: Q_{ℓ+1}
As a stopping criterion, we could require that Q_{ℓ+1} = Q_ℓ, i.e., that we have reached the convergence point Q*. However, this is only guaranteed to happen as ℓ → ∞, so a more practical solution is to stop when the difference between two consecutive Q-functions decreases below a given threshold ε_QI > 0, i.e., when ||Q_{ℓ+1} - Q_ℓ||_∞ ≤ ε_QI. (The subscript QI stands for Q-iteration.) This can be guaranteed to happen after a finite number of iterations, due to the convergence rate of γ. It is also possible to derive a stopping condition that guarantees the solution returned is within a given distance of the optimum.
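Algorithm 3.1 translates almost line by line into code. A minimal sketch for a finite deterministic MDP, reusing the sets X, U and the functions f, rho from the earlier cleaning-robot sketch (the stopping threshold eps_qi plays the role of ε_QI):

def q_iteration(X, U, f, rho, gamma, eps_qi=1e-8, max_iter=1000):
    # Q-iteration (Algorithm 3.1) with an infinity-norm stopping criterion.
    Q = {(x, u): 0.0 for x in X for u in U}
    for _ in range(max_iter):
        Q_new = {}
        for x in X:
            for u in U:
                x_next = f(x, u)
                Q_new[(x, u)] = rho(x, u) + gamma * max(Q[(x_next, up)] for up in U)
        if max(abs(Q_new[k] - Q[k]) for k in Q) <= eps_qi:
            return Q_new
        Q = Q_new
    return Q

Running this on the cleaning-robot MDP with gamma = 0.5 should reproduce the converged Q-values of the example that follows.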
Computational cost of Q-iteration

Next, we investigate the computational cost of Q-iteration when applied to an MDP with a finite number of states and actions. Denote by |·| the cardinality of the argument set, so that |X| denotes the finite number of states and |U| denotes the finite number of actions.

Assume that, when updating the Q-value for a given state-action pair (x, u), the maximization over the action space U is solved by enumeration over its |U| elements, and f(x, u) is computed once and then stored and reused. Updating the Q-value then requires 2 + |U| function evaluations, where the functions being evaluated are f, ρ, and the current Q-function Q_ℓ. Since at every iteration, the Q-values of |X| · |U| state-action pairs have to be updated, the cost per iteration is |X| · |U| · (2 + |U|). So, the total cost of L Q-iterations is:

L · |X| · |U| · (2 + |U|)    (3.4)
Example 3.1 Q-iteration for the cleaning robot. In this example, we apply Q-iteration to the cleaning-robot problem of Example 2.1. The discount factor is set to γ = 0.5. The discount factor is small compared to the recommended values in Chapter 2 because the state space is very small and thus interesting trajectories are short.

Starting from an identically zero initial Q-function, Q_0 = 0, Q-iteration produces the sequence of Q-functions given in the first part of Table 3.1 (above the dashed line), where each cell shows the Q-values of the two actions in a certain state, separated by a semicolon (first u = -1, then u = 1). For instance:

Q_3(2, 1) = ρ(2, 1) + γ max_u Q_2(f(2, 1), u) = 0 + 0.5 · max_u Q_2(3, u) = 0 + 0.5 · 2.5 = 1.25

Table 3.1: Q-iteration results for the cleaning robot.

       x = 0    x = 1       x = 2       x = 3        x = 4     x = 5
Q_0    0; 0     0; 0        0; 0        0; 0         0; 0      0; 0
Q_1    0; 0     1; 0        0; 0        0; 0         0; 5      0; 0
Q_2    0; 0     1; 0        0.5; 0      0; 2.5       0; 5      0; 0
Q_3    0; 0     1; 0.25     0.5; 1.25   0.25; 2.5    1.25; 5   0; 0
Q_4    0; 0     1; 0.625    0.5; 1.25   0.625; 2.5   1.25; 5   0; 0
Q_5    0; 0     1; 0.625    0.5; 1.25   0.625; 2.5   1.25; 5   0; 0
--------------------------------------------------------------------
h*     *        -1          1           1            1         *
V*     0        1           1.25        2.5          5         0

The algorithm fully converges after 5 iterations; Q_5 = Q_4 = Q*. The last two rows of the table (below the dashed line) also give the optimal policies, computed from Q* with (2.5), and the optimal V-function V*, computed from Q* with (2.10). In the policy representation, the symbol * means that any action can be taken in that state without changing the quality of the policy. The total number of function evaluations required by the algorithm is:

5 · |X| · |U| · (2 + |U|) = 5 · 6 · 2 · 4 = 240
∎
Q-iteration in the stochastic case

In the stochastic case, one must simply use a stochastic variant of the Q-iteration mapping, derived from the Bellman equation (2.17):

[T(Q)](x, u) = Σ_{x'} f̃(x, u, x') [ ρ̃(x, u, x') + γ max_{u'} Q(x', u') ]    (3.5)

Algorithm 3.2 is obtained. The remarks above regarding the contraction property, convergence rate, and stopping criterion directly apply to this algorithm.

Algorithm 3.2 Q-iteration for the stochastic case.
Input: dynamics f̃, reward function ρ̃, discount factor γ
1: initialize Q-function, e.g., Q_0 ← 0
2: repeat at every iteration ℓ = 0, 1, 2, ...
3:   for every (x, u) do
4:     Q_{ℓ+1}(x, u) ← Σ_{x'} f̃(x, u, x') [ ρ̃(x, u, x') + γ max_{u'} Q_ℓ(x', u') ]
5:   end for
6: until stopping criterion is satisfied
Output: Q_{ℓ+1}
3.2 Policy iteration

We now consider policy iteration algorithms, which evaluate policies by constructing their value functions, and use these value functions to find new, improved policies.

Consider that policies are evaluated using their Q-functions. Policy iteration starts with an arbitrary policy h_0. At every iteration ℓ, the Q-function Q^{h_ℓ} of the current policy h_ℓ is determined; this step is called policy evaluation. Policy evaluation is performed by solving the Bellman equation (2.7). When policy evaluation is complete, a new policy h_{ℓ+1} that is greedy in Q^{h_ℓ} is found:

h_{ℓ+1}(x) ∈ argmax_u Q^{h_ℓ}(x, u)    (3.6)

(breaking ties among greedy actions where necessary). This step is called policy improvement. The entire procedure is summarized in Algorithm 3.3. The sequence of Q-functions produced by policy iteration converges to Q*, and at the same time, an optimal policy h* is obtained. When the number of state-action pairs is finite, convergence will happen in a finite number of iterations. This is because in that case each policy improvement is guaranteed to find a strictly better policy unless it is already optimal, and there is a finite number of possible policies. These two facts imply that the algorithm will reach the optimal policy after a finite number of improvements.
Algorithm 3.3 Policy iteration.
1: initialize policy h_0
2: repeat at every iteration ℓ = 0, 1, 2, ...
3:   find Q^{h_ℓ}, the Q-function of h_ℓ                    (policy evaluation)
4:   h_{ℓ+1}(x) ∈ argmax_u Q^{h_ℓ}(x, u) for all x          (policy improvement)
5: until h_{ℓ+1} = h_ℓ
Output: h* = h_ℓ, Q* = Q^{h_ℓ}
The crucial component of policy iteration is policy evaluation, while policy improvement is comparatively easy to perform. Thus, we pay special attention to policy evaluation.

An iterative policy evaluation algorithm can be given that is similar to Q-iteration. Analogously to the Q-iteration mapping T (3.1), a policy evaluation mapping T^h : Q → Q is defined, which computes the right-hand side of the Bellman equation for an arbitrary Q-function:

[T^h(Q)](x, u) = ρ(x, u) + γ Q(f(x, u), h(f(x, u)))    (3.7)

The algorithm starts from some initial Q-function Q^h_0 and at each iteration τ updates the Q-function using:²

Q^h_{τ+1} = T^h(Q^h_τ)    (3.8)
Like the Q-iteration mapping T, the policy evaluation mapping T^h is a contraction with factor γ < 1 in the infinity norm, i.e., for any pair of functions Q and Q':

||T^h(Q) - T^h(Q')||_∞ ≤ γ ||Q - Q'||_∞

So, T^h has a unique fixed point. Written in terms of the mapping T^h, the Bellman equation (2.7) states that this unique fixed point is actually Q^h:

Q^h = T^h(Q^h)    (3.9)

Therefore, iterative policy evaluation (3.8) converges to Q^h as τ → ∞. Like Q-iteration, it converges at a rate of γ: ||Q^h_{τ+1} - Q^h||_∞ ≤ γ ||Q^h_τ - Q^h||_∞.
Algorithm 3.4 presents this iterative policy evaluation procedure. In practice, the algorithm can be stopped when the difference between consecutive Q-functions decreases below a given threshold: ||Q^h_{τ+1} - Q^h_τ||_∞ ≤ ε_PE, where ε_PE > 0. Here, the subscript PE stands for policy evaluation.

² A different iteration index, τ, is used for policy evaluation, because it runs in the inner loop of every policy iteration ℓ.
Algorithm 3.4 Iterative policy evaluation.
Input: policy h to be evaluated, dynamics f, reward function ρ, discount factor γ
1: initialize Q-function, e.g., Q^h_0(x, u) ← 0 for all x, u
2: repeat at every iteration τ = 0, 1, 2, ...
3:   for every (x, u) do
4:     Q^h_{τ+1}(x, u) ← ρ(x, u) + γ Q^h_τ(f(x, u), h(f(x, u)))
5:   end for
6: until stopping criterion satisfied
Output: Q^h_{τ+1}
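Algorithms 3.3 and 3.4 combine into a short program. A minimal sketch for the deterministic case, again reusing X, U, f and rho from the earlier cleaning-robot sketch (illustrative only, with the policy stored as a dict):

def evaluate_policy(h, X, U, f, rho, gamma, eps_pe=1e-8, max_iter=1000):
    # Iterative policy evaluation (Algorithm 3.4).
    Q = {(x, u): 0.0 for x in X for u in U}
    for _ in range(max_iter):
        Q_new = {(x, u): rho(x, u) + gamma * Q[(f(x, u), h[f(x, u)])]
                 for x in X for u in U}
        if max(abs(Q_new[k] - Q[k]) for k in Q) <= eps_pe:
            return Q_new
        Q = Q_new
    return Q

def policy_iteration(X, U, f, rho, gamma):
    # Policy iteration (Algorithm 3.3); with U = [-1, 1] the initial
    # policy U[0] is "always move left", as in Example 3.2 below.
    h = {x: U[0] for x in X}
    while True:
        Q = evaluate_policy(h, X, U, f, rho, gamma)                  # evaluation
        h_new = {x: max(U, key=lambda u: Q[(x, u)]) for x in X}      # improvement
        if h_new == h:
            return h, Q
        h = h_new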
There are also other ways to compute Q^h, for example, by using the fact that the Bellman equation (3.9) can be written as a linear system of equations in the Q-values Q^h(x, u), and solving this system directly.
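The direct solution mentioned above is also easy to write down: enumerate the state-action pairs, build the linear system Q^h = ρ + γ P Q^h, and solve it. A sketch under our own indexing assumptions (deterministic MDP, policy h stored as a dict):

import numpy as np

def evaluate_policy_direct(h, X, U, f, rho, gamma):
    # P maps each pair (x, u) to the successor pair (f(x,u), h(f(x,u))).
    pairs = [(x, u) for x in X for u in U]
    idx = {p: i for i, p in enumerate(pairs)}
    P = np.zeros((len(pairs), len(pairs)))
    r = np.zeros(len(pairs))
    for i, (x, u) in enumerate(pairs):
        x_next = f(x, u)
        P[i, idx[(x_next, h[x_next])]] = 1.0
        r[i] = rho(x, u)
    q = np.linalg.solve(np.eye(len(pairs)) - gamma * P, r)
    return {p: q[i] for i, p in enumerate(pairs)}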
More generally, the linearity of the Bellman equation for Q^h makes policy evaluation easier than Q-iteration (the Bellman optimality equation (2.8) is highly nonlinear, due to the maximization at the right-hand side). Moreover, in practice, policy iteration often converges in a smaller number of iterations than value iteration. However, this does not mean that policy iteration is computationally less costly than value iteration, because every single policy iteration requires a complete policy evaluation.
Computational cost of iterative policy evaluation

We next investigate the computational cost of iterative policy evaluation for an MDP with a finite number of states and actions. The computational cost of one iteration of Algorithm 3.4, measured by the number of function evaluations, is:

4 · |X| · |U|

where the functions being evaluated are ρ, f, h, and the current Q-function Q^h_τ. The total cost of an entire policy evaluation consisting of T iterations is T · 4 · |X| · |U|. Recall that a single Q-iteration requires |X| · |U| · (2 + |U|) function evaluations, and is therefore more computationally expensive than an iteration of policy evaluation whenever |U| > 2.
Example 3.2 Policy iteration for the cleaning robot. In this example, we apply
policy iteration to the cleaning-robot problem. Recall that every single policy iter-
ation requires a complete execution of policy evaluation for the current policy, to-
gether with a policy improvement. The iterative policy evaluation Algorithm 3.4
is employed, starting from identically zero Q-functions. Each policy evaluation is
run until the Q-function fully converges. The same discount factor is used as for
Q-iteration in Example 3.1, namely γ = 0.5.
Starting from a policy that always moves left (h_0(x) = -1 for all x), policy iteration produces the sequence of Q-functions and policies given in Table 3.2. In this table, the sequence of Q-functions produced by a given execution of policy evaluation is separated by dashed lines from the policy being evaluated (shown above the sequence of Q-functions) and from the improved policy (shown below the sequence). The policy iteration algorithm converges after 4 iterations (h_3 is already optimal).
Table 3.2: Policy iteration results for the cleaning robot. Q-values are rounded to 3 decimal places (in each cell, the first value is for u = -1, the second for u = 1).

       x = 0    x = 1       x = 2        x = 3          x = 4       x = 5
Q_0    0; 0     0; 0        0; 0         0; 0           0; 0        0; 0
Q_1    0; 0     1; 0        0; 0         0; 0           0; 5        0; 0
Q_2    0; 0     1; 0        0.5; 0       0; 0           0; 5        0; 0
Q_3    0; 0     1; 0.25     0.5; 0       0.25; 0        0; 5        0; 0
Q_4    0; 0     1; 0.25     0.5; 0.125   0.25; 0        0.125; 5    0; 0
Q_5    0; 0     1; 0.25     0.5; 0.125   0.25; 0.0625   0.125; 5    0; 0
Q_6    0; 0     1; 0.25     0.5; 0.125   0.25; 0.0625   0.125; 5    0; 0
-------------------------------------------------------------------------
h_1    *        -1          -1           -1             1           *

Q_0    0; 0     0; 0        0; 0         0; 0           0; 0        0; 0
Q_1    0; 0     1; 0        0; 0         0; 0           0; 5        0; 0
Q_2    0; 0     1; 0        0.5; 0       0; 2.5         0; 5        0; 0
Q_3    0; 0     1; 0.25     0.5; 0       0.25; 2.5      0; 5        0; 0
Q_4    0; 0     1; 0.25     0.5; 0.125   0.25; 2.5      0.125; 5    0; 0
Q_5    0; 0     1; 0.25     0.5; 0.125   0.25; 2.5      0.125; 5    0; 0
-------------------------------------------------------------------------
h_2    *        -1          -1           1              1           *

Q_0    0; 0     0; 0        0; 0         0; 0           0; 0        0; 0
Q_1    0; 0     1; 0        0; 0         0; 0           0; 5        0; 0
Q_2    0; 0     1; 0        0.5; 0       0; 2.5         0; 5        0; 0
Q_3    0; 0     1; 0.25     0.5; 1.25    0.25; 2.5      1.25; 5     0; 0
Q_4    0; 0     1; 0.25     0.5; 1.25    0.25; 2.5      1.25; 5     0; 0
-------------------------------------------------------------------------
h_3    *        -1          1            1              1           *

Q_0    0; 0     0; 0        0; 0         0; 0           0; 0        0; 0
Q_1    0; 0     1; 0        0; 0         0; 0           0; 5        0; 0
Q_2    0; 0     1; 0        0.5; 0       0; 2.5         0; 5        0; 0
Q_3    0; 0     1; 0        0.5; 1.25    0; 2.5         1.25; 5     0; 0
Q_4    0; 0     1; 0.625    0.5; 1.25    0.625; 2.5     1.25; 5     0; 0
Q_5    0; 0     1; 0.625    0.5; 1.25    0.625; 2.5     1.25; 5     0; 0
-------------------------------------------------------------------------
h_4    *        -1          1            1              1           *
Policy evaluation takes, respectively, a number T of 6, 5, 4, and 5 iterations for the four evaluated policies. Recall that the computational cost of an entire policy evaluation is T · 4|X||U|. Assuming that the maximization over U in the policy improvement is solved by enumeration, the computational cost of every policy improvement is |X||U|. Thus, the entire policy iteration algorithm has a cost of:
(6 + 5 + 4 + 5) · 4|X||U| + 4 · |X||U| = 84 |X||U| = 84 · 6 · 2 = 1008
In the first expression, the first term corresponds to policy evaluations, and the second to policy improvements. Compared to the cost 240 of Q-iteration in Example 3.1, policy iteration is in this case more computationally expensive.
Policy iteration in the stochastic case
Only the policy evaluation component must be changed in the stochastic case. Policy improvement remains the same, because the expression of the greedy policy is also unchanged, see Section 2.4.
Similarly to the case of Q-iteration, a stochastic variant of the policy evaluation must be used, derived from the Bellman equation (2.16):
[T^h(Q)](x, u) = Σ_{x′} f̃(x, u, x′) [ρ̃(x, u, x′) + γ Q(x′, h(x′))]   (3.10)
Algorithm 3.5 shows the resulting policy evaluation method.

Algorithm 3.5 Iterative policy evaluation for the stochastic case.
Input: policy h to be evaluated, dynamics f̃, reward function ρ̃, discount factor γ
1: initialize Q-function, e.g., Q^h_0 ← 0
2: repeat at every iteration ℓ = 0, 1, 2, . . .
3:   for every (x, u) do
4:     Q^h_{ℓ+1}(x, u) ← Σ_{x′} f̃(x, u, x′) [ρ̃(x, u, x′) + γ Q^h_ℓ(x′, h(x′))]
5:   end for
6: until stopping criterion satisfied
Output: Q^h_{ℓ+1}
3.3 Policy search
Policy search algorithms use optimization techniques to directly search for a (near)-
optimal policy. Since the optimal policy maximizes the return from every initial
state, the optimization criterion should be a combination (e.g., average) of the returns
from every initial state. Note that gradient-based optimization methods will typically
not be applicable because the optimization criterion is not differentiable, and there
may exist local optima. Instead, more general (global, gradient-free) methods are
required, such as genetic algorithms.
3.3. POLICY SEARCH 33
Consider the return estimation procedure. While the returns are infinite sums of discounted rewards (2.1), in practice they have to be estimated in a finite number of K steps. This means that the infinite sum in the return is approximated with a finite sum over the first K steps. To guarantee that the approximation obtained in this way makes an error of at most ε_MC > 0, K can be chosen with:
K = ⌈ log_γ ( ε_MC (1 - γ) / ‖ρ‖_∞ ) ⌉   (3.11)
where ‖ρ‖_∞ denotes the maximum absolute reward.
The return estimation has to be done for every initial state, for every policy that must be evaluated, which means that policy search algorithms will be computationally expensive, usually more so than value iteration and policy iteration.
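As an illustration of the truncation rule (3.11), the following short Python sketch (our own helper functions, under the assumption that the maximum absolute reward ‖ρ‖_∞ is known) computes the horizon K and a truncated return estimate:

    import math

    def truncation_horizon(eps_mc, gamma, max_abs_reward):
        # K = ceil( log_gamma( eps_mc * (1 - gamma) / ||rho||_inf ) ), cf. (3.11)
        return math.ceil(math.log(eps_mc * (1.0 - gamma) / max_abs_reward, gamma))

    def truncated_return(x0, policy, f, rho, gamma, K):
        # Sum the discounted rewards over the first K steps only
        ret, x = 0.0, x0
        for k in range(K):
            u = policy(x)
            ret += (gamma ** k) * rho(x, u)
            x = f(x, u)
        return ret

For instance, truncation_horizon(0.01, 0.5, 5) returns 10, matching the horizon used in Example 3.3 below.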
Computational cost of exhaustive policy search
We next investigate the computational cost of a policy search algorithm for an MDP with a finite number of states and actions. For simplicity, we consider an algorithm that exhaustively searches the entire policy space.
The number of possible policies is |U|^|X| and the return has to be evaluated for all the |X| initial states, by using a K-step trajectory. It follows that the total number of simulation steps that have to be performed to find an optimal policy is at most K |U|^|X| |X|. Since f, ρ, and h are each evaluated once at every simulation step, the computational cost, measured by the number of function evaluations, is:
3 K |U|^|X| |X|
Compared e.g. to the cost L|X||U|(2 + |U|) of Q-iteration (3.4), this implementation of policy search is, in most cases, clearly more costly.
Of course, more efficient optimization techniques than exhaustive search are available, and the estimation of the expected returns can also be accelerated. For instance, after the return of a state has been estimated, this estimate can be reused at every occurrence of that state along subsequent trajectories, thereby reducing the computational cost. Nevertheless, the costs derived above can be seen as worst-case values that illustrate the inherently large complexity of policy search.
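For completeness, a brute-force sketch of this exhaustive search is given below (again with our own names, and ignoring any special handling of terminal states); it enumerates all |U|^|X| policies, stored as dictionaries over the list of states X, and scores each one by the average truncated return over all initial states:

    import itertools

    def exhaustive_policy_search(X, U, f, rho, gamma, K):
        def truncated_return(x0, policy):
            ret, x = 0.0, x0
            for k in range(K):
                u = policy[x]
                ret += (gamma ** k) * rho(x, u)
                x = f(x, u)
            return ret

        best_policy, best_score = None, float("-inf")
        for assignment in itertools.product(U, repeat=len(X)):   # all |U|^|X| policies
            policy = dict(zip(X, assignment))
            score = sum(truncated_return(x0, policy) for x0 in X) / len(X)
            if score > best_score:
                best_policy, best_score = policy, score
        return best_policy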
Example 3.3 Exhaustive policy search for the cleaning robot. Consider again the cleaning-robot problem and assume that the exhaustive policy search described above is applied. Take the approximation tolerance in the evaluation of the return to be ε_MC = 0.01. Using ε_MC, maximum absolute reward ‖ρ‖_∞ = 5, and discount factor γ = 0.5 in (3.11), a time horizon of K = 10 steps is obtained. Therefore, the computational cost of the algorithm, measured by the number of function evaluations, is:
3 K |U|^|X| |X| = 3 · 10 · 2^6 · 6 = 11520
(Note that, in fact, the cost will be smaller as many trajectories will reach a terminal
state in under 10 steps, and additionally it is useless to search for optimal actions or
estimate returns in the terminal states.)
For the cleaning-robot problem, the exhaustive implementation of direct policy
search is very likely to be more expensive than both Q-iteration (which had a cost of
240) and policy iteration (which had a cost of 1008).
Optimistic planning
In what follows, we introduce an algorithm for the local search of control actions, at a specific state of the system. By applying this search algorithm at every state encountered while interacting with the system, an online control method is obtained. Although the algorithm is classified under policy search, it exploits the ideas of dynamic programming to efficiently search the space of possible actions, so that an ingenious combination of dynamic programming and policy search is obtained. We only consider the deterministic case here, although stochastic variants of the algorithm exist.
The algorithm is called optimistic planning (OP). It builds a planning (lookahead) tree starting from a root node that contains the state where an action must be chosen. At each iteration, the algorithm selects a leaf node (a state) and expands it, by generating the next states for all the possible actions. Denote by M = |U| the number of discrete actions; thus M new nodes are created at each expansion. The algorithm stops growing the tree after n expansions and returns an action chosen on the basis of the final tree.
Figure 3.1: An example of OP tree. In this example there are only two actions, denoted a and b, and 3 expansions have been performed so far. Each inner node has two children, one for a and one for b, and the arcs to these children also store the actions and rewards associated to the transitions; such a relationship has been exemplified for node x, whose children x_a and x_b are reached with rewards ρ(x, a) and ρ(x, b). Values V are indicated for two nodes.
The following notation and conventions are necessary, see also Figure 3.1:
– The entire tree is denoted by T, and the set of leaf (unexpanded) nodes by L. The inner nodes are T \ L.
– A node of the tree is labeled by its associated state x. Note that several nodes may have the same state label. These nodes will in fact be distinct; while keeping that in mind, we denote for simplicity a node directly by its associated state x. A child node, generically denoted x′, has the meaning of next state; the child of x corresponding to action u is f(x, u). The actions and rewards associated with the transitions from parent to child nodes are stored on the tree.
– Because everything happens at the current time step, the time index k is dropped and the subscript of x is reused to indicate the depth d of a node in the tree, whenever this depth is relevant. So, x_0 is the root node, where an action must eventually be chosen, and x_d is a node at depth d.
As a first step to develop an understanding of the algorithm, some fixed tree will be considered, such as the one in Figure 3.1, and the procedure to choose a final action at x_0 will be explained. For each x ∈ T, define the values V(x) recursively, starting from the leaf nodes, as follows:
V(x) = 0,   ∀x ∈ L
V(x) = max_u [ρ(x, u) + γ V(f(x, u))],   ∀x ∈ T \ L   (3.12)
Note that children nodes x′ = f(x, u) exist for all inner nodes x and actions u, by the way in which the tree is built, so the values V can indeed be computed using the information on the tree. Then, an action is chosen at the root with:
u_0 ∈ argmax_u [ρ(x_0, u) + γ V(f(x_0, u))]   (3.13)
It is no mistake that the notation V has been used for the values; they have the same meaning as the V-values of the states in the tree. To understand this, say that the optimal V-function V* is available. If the leaf node values are initialized using V* instead of 0, then a single node expansion suffices to find optimal actions, because then (3.13) exactly implements the greedy policy in V*, see (2.11):
u_0 = h*(x_0) ∈ argmax_u [ρ(x_0, u) + γ V*(f(x_0, u))]
This situation is illustrated in Figure 3.2, top.
The optimal V-function V* is of course not available, so let us look at the 0 initialization, but in a uniform tree up to some very large depth D, as illustrated in Figure 3.2, bottom. Then (3.12) actually implements a local form of value iteration, in which the Bellman equation (2.13):
V*(x) = max_u [ρ(x, u) + γ V*(f(x, u))]
is turned into an update starting from the 0 leaf values, and continuing backwards along the tree down to the root. By similar convergence arguments as for Q-iteration, as D → ∞, the values of the states at depth 1 (and also, in fact, at any finite depth) converge to the optimal V-values, and (3.13) yet again recovers the optimal action.
When the depth D is finite, the updates (3.12) actually compute an approximation of V*; in fact, the tree does not need to be uniform, and a valid approximation will be computed with a tree of any shape, such as the one in Figure 3.1. The idea of OP is to judiciously choose which nodes to expand so that V(x) is as close as possible to V*(x) after a limited number of expansions.

Figure 3.2: Two extreme cases of the OP tree. Top: tree of depth 1, but where the leaf values are initialized using V*. Bottom: uniform tree of depth D, for very large D. Like in the previous figure, there are only two actions. Unlike there, however, node depth has been indicated using subscripts.
To this end, an optimistic procedure is applied to select which leaf node to expand at each iteration of the algorithm. First, translate and rescale the reward function so that all possible rewards are now in [0, 1] (this can always be done without changing the solution when there are no terminal states; if there are terminal states, care must be taken). For each x ∈ T, define b-values b(x) in a similar way to the V-values, but starting with 1/(1 - γ) at the leaves:
b(x) = 1/(1 - γ),   ∀x ∈ L
b(x) = max_u [ρ(x, u) + γ b(f(x, u))],   ∀x ∈ T \ L   (3.14)
Each b-value b(x) is an upper bound on V*(x), i.e. b(x) ≥ V*(x) ∀x ∈ T. This is immediately clear at the leaves, where the value 1/(1 - γ) is an upper bound for any V-value, because the rewards are in [0, 1] and a discount factor γ < 1 is used. By backward induction, it is true at any inner node: since the b-values of the node's children are upper bounds on their V-values, the resulting b-value of the node is an upper bound on its own V-value.
To select a node to expand, the tree is traversed starting from the root and always following optimistic actions, i.e., those greedy in the b-values:
x_{d+1} = f(x_d, u†(x_d)),   where u†(x) ∈ argmax_u [ρ(x, u) + γ b(f(x, u))]   (3.15)
The leaf reached using this procedure is the one chosen for expansion. Ties in the maximization can be broken in any way, but to make the algorithm predictable, they should preferably be broken deterministically, e.g., always in favor of the first child. This procedure is optimistic because it uses b-values (upper bounds) as if they were optimal Q-values; said another way, it assumes the best possible optimal values compatible with the planning tree available so far.
By analyzing the algorithm, it turns out that choosing nodes in this way is a good strategy: the quality of the action chosen at the root increases quickly with the number of node expansions, at a rate that takes the complexity of the problem into account in a specific way.
Algorithm 3.6 summarizes the OP algorithm, placing it in an online control loop.

Algorithm 3.6 Optimistic planning for online control.
Input: dynamics f, reward function ρ, discount factor γ, number of allowed expansions n
1: for every time step k = 0, 1, 2, . . . do
2:   measure x_k, relabel time so that the current step is 0
3:   initialize tree: T ← {x_0}
4:   for ℓ = 1, . . . , n do
5:     find the optimistic leaf x†, navigating the tree with (3.15)
6:     for j = 1, . . . , M do
7:       simulate transition: compute f(x†, u_j), ρ(x†, u_j)
8:       add the corresponding node to the tree as a child of x†
9:     end for
10:    update the b-values upwards in the tree with (3.14)
11:  end for
12:  compute the V-values with (3.12)
13:  apply u_0, chosen with (3.13)
14: end for
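The following Python sketch (our own, simplified implementation) follows Algorithm 3.6 for a single call at state x_0, assuming the rewards have already been rescaled to [0, 1] and that at least one expansion is performed; for clarity it recomputes the b- and V-values recursively instead of updating them upwards in the tree:

    class Node:
        # One node of the OP tree: its state, the reward of the transition into it, and its children
        def __init__(self, x, reward=0.0):
            self.x, self.reward = x, reward
            self.children = {}                    # action -> child Node

    def op_action(x0, f, rho, gamma, actions, n_expansions):
        root = Node(x0)

        def b_value(node):
            # Upper bound (3.14): 1/(1 - gamma) at leaves, optimistic backup at inner nodes
            if not node.children:
                return 1.0 / (1.0 - gamma)
            return max(c.reward + gamma * b_value(c) for c in node.children.values())

        def v_value(node):
            # Lower estimate (3.12): 0 at leaves, Bellman backup at inner nodes
            if not node.children:
                return 0.0
            return max(c.reward + gamma * v_value(c) for c in node.children.values())

        for _ in range(n_expansions):
            # Navigate optimistically from the root to a leaf, following (3.15);
            # max() keeps the first maximizer, which breaks ties deterministically
            node = root
            while node.children:
                node = max(node.children.values(), key=lambda c: c.reward + gamma * b_value(c))
            # Expand the leaf: create one child per action
            for u in actions:
                node.children[u] = Node(f(node.x, u), reward=rho(node.x, u))

        # Choose the root action greedily in the V-values, cf. (3.13)
        return max(actions, key=lambda u: root.children[u].reward + gamma * v_value(root.children[u]))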
Chapter 4
Monte Carlo methods
When a mathematical model of the system is not available, DP methods are not ap-
plicable, and we must resort to RL algorithms instead. We start by presenting Monte
Carlo algorithms.
4.1 Monte Carlo policy iteration
MC methods are easy to understand initially as a trajectory-based variant of policy iteration. For simplicity, we assume that the MDP is episodic, and that every trajectory (trial) eventually ends in a terminal state, so the i-th trial has the form:
x_{i,0}, u_{i,0}, r_{i,1}, x_{i,1}, u_{i,1}, r_{i,2}, x_{i,2}, . . . , r_{i,K_i}, x_{i,K_i}
with x_{i,K_i} terminal. If all the actions u_{i,k} for k > 0 are chosen by the policy h, i.e. if u_{i,k} = h(x_{i,k}) for k > 0, then by definition the returns obtained along this trial, for all the state-action pairs that occur along it, are in fact the Q-values of these pairs under h:
Q^h(x_{i,k}, u_{i,k}) = Σ_{l=k}^{K_i - 1} γ^{l-k} r_{i,l+1}
for k ≥ 0. Note that the first action does not have to be taken with the policy h, but can be chosen freely. Once we have enough trajectories to find the Q-values of all the state-action pairs, we can perform a policy improvement with (3.6), and continue to evaluate the new policy using the MC technique. Overall, we obtain a model-free, RL policy iteration algorithm.
To give the general form of MC, which also works for stochastic MDPs, we must briefly consider the stochastic case. In fact, for stochastic MDPs the assumption that every trajectory ends up in a terminal state with some positive probability is easier to justify than in the deterministic case, where it is often easy to construct policies that loop over a never-ending cycle of states (just consider the cleaning robot and a policy that assigns h(2) = 1, h(3) = -1). In the stochastic case, a single trial does not suffice to find the Q-value of the state-action pairs occurring along it, but does provide a sample of this Q-value. The Q-value is estimated as an average of many such samples:
Q^h(x, u) ≈ A(x, u) / C(x, u)
where:
A(x, u) = Σ_{(i,k) s.t. x_{i,k} = x, u_{i,k} = u}  Σ_{l=k}^{K_i - 1} γ^{l-k} r_{i,l+1}
and C(x, u) counts how many times the pair (x, u) occurs:
C(x, u) = Σ_{(i,k) s.t. x_{i,k} = x, u_{i,k} = u} 1
The name Monte Carlo is more generally applied to this type of sample-based estimation for any random quantity, not necessarily returns. If a state-action pair occurs twice along the same trial, the formulas above take it into account both times; such an MC method is called every-visit (since we take the pair into account every time we visit it). A first-visit alternative, which takes a pair into account only the first time it occurs in each trial, can also be given. First-visit and every-visit MC have slightly different theoretical properties, but they both obtain the correct Q-values in the limit, as the number of visits approaches infinity.
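The estimation formulas above translate directly into code. The sketch below (our own notation; each trial is assumed to be stored as a list of (x, u, r) triples, with r the reward obtained after taking u in x) computes the MC estimate A(x, u)/C(x, u), in either the every-visit or first-visit variant:

    def mc_q_estimate(trials, gamma, first_visit=False):
        A, C = {}, {}
        for trial in trials:
            # Compute the return from each step backwards: G_k = r_{k+1} + gamma * G_{k+1}
            G, returns = 0.0, []
            for (x, u, r) in reversed(trial):
                G = r + gamma * G
                returns.append((x, u, G))
            returns.reverse()                     # back to chronological order
            seen = set()
            for (x, u, G) in returns:
                if first_visit and (x, u) in seen:
                    continue                      # skip repeated visits in first-visit MC
                seen.add((x, u))
                A[(x, u)] = A.get((x, u), 0.0) + G
                C[(x, u)] = C.get((x, u), 0) + 1
        return {xu: A[xu] / C[xu] for xu in A}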
In practice, we can of course only use a finite number of trials to estimate the Q-function of a policy; denote this number by N_MC. Algorithm 4.1 summarizes this practical variant of MC learning. The trajectories could be obtained by interacting with the real system, or with a simulation model of it.
Note that an idealized, infinite-time setting is considered for this algorithm, in which no termination condition is specified and no explicit output is produced. Instead, the result of the algorithm is the improvement of the control performance achieved while interacting with the process. A similar setting will be considered for other online learning algorithms described in this book, with the implicit understanding that, in practice, the algorithms will of course be stopped after a finite number of steps. When MC policy iteration is stopped, the resulting policy and Q-function can be interpreted as outputs and reused.
Algorithm 4.1 Monte Carlo policy iteration.
Input: discount factor γ, algorithm type (first-visit or every-visit)
1: initialize policy h_0
2: for every iteration ℓ = 0, 1, 2, . . . do
3:   A(x, u) ← 0, C(x, u) ← 0, ∀x, u   ▷ start policy evaluation
4:   for i = 1, . . . , N_MC do
5:     initialize x_{i,0}, u_{i,0}
6:     execute trial x_{i,0}, u_{i,0}, r_{i,1}, x_{i,1}, u_{i,1}, r_{i,2}, x_{i,2}, . . . , r_{i,K_i}, x_{i,K_i}, using policy h_ℓ for every k > 0
7:     for k = 0, . . . , K_i - 1 do
8:       if first-visit and (x_{i,k}, u_{i,k}) already encountered in trial i then
9:         ignore this pair and continue to next k
10:      else
11:        A(x_{i,k}, u_{i,k}) ← A(x_{i,k}, u_{i,k}) + Σ_{l=k}^{K_i - 1} γ^{l-k} r_{i,l+1}
12:        C(x_{i,k}, u_{i,k}) ← C(x_{i,k}, u_{i,k}) + 1
13:      end if
14:    end for
15:    Q_ℓ(x, u) ← A(x, u)/C(x, u), ∀x, u   ▷ complete policy evaluation
16:  end for
17:  h_{ℓ+1}(x) ← argmax_u Q_ℓ(x, u), ∀x   ▷ policy improvement
18: end for

4.2 The need for exploration
Until now, we have tacitly assumed that all the state-action pairs are visited a sufficient number of times to estimate their Q-values. This requires two things: (i) that each state is sufficiently visited and (ii) that for any given state, each action is sufficiently visited. Assuming that the selection procedure for the initial states of the trajectories ensures (i) (e.g., random states are selected from the entire state space), we focus on (ii) in this section.
If actions would always be chosen according to the policy h that is being evaluated, only pairs of the form (x, h(x)) would be visited, and no information about pairs (x, u) with u ≠ h(x) would be available. As a result, the Q-values of such pairs could not be estimated and relied upon for policy improvement. To alleviate this problem, exploration is necessary: sometimes, actions different from h(x) have to be selected. In Algorithm 4.1, the only opportunity to do this is for the first action of each trial. This action could e.g. be chosen uniformly randomly from U. Together with (i), this ensures that the initial state-action pairs sufficiently cover the state-action space; this is called exploring starts in the literature. With exploring starts and an ideal version of MC policy evaluation that employs an infinite number N_MC of trajectories, the Q-function accurately evaluates the policy and thus MC policy iteration converges to h*.
An alternative to exploring starts is to use an exploratory policy throughout the trajectories, that is, a policy that sometimes takes exploratory actions instead of greedy ones, at the initial as well as subsequent steps. This is also done using random action selections, so that the algorithm has a nonzero probability of selecting any action in every encountered state. A classical exploratory policy is the ε-greedy policy,
which selects actions according to:
u = { u ∈ argmax_ū Q(x, ū)               with probability 1 - ε
    { a uniformly random action in U     with probability ε          (4.1)
where ε ∈ (0, 1) is the exploration probability. Another option is to use Boltzmann exploration, which selects an action u with probabilities dependent on the Q-values, as follows:
P(u | x) = e^{Q(x,u)/τ} / Σ_ū e^{Q(x,ū)/τ}   (4.2)
where the temperature τ ≥ 0 controls the randomness of the exploration. When τ → 0, (4.2) is equivalent to greedy action selection, while for τ → ∞, action selection is uniformly random. For nonzero, finite values of τ, higher-valued actions have a greater chance of being selected than lower-valued ones.
In the extreme case, fully random actions could be applied at every step. However, this is not desirable when interacting with a real system, as it can be damaging and will lead to poor control performance. Instead, the algorithm also has to exploit its current knowledge in order to obtain good performance, by selecting greedy actions in the current Q-function. This is a typical illustration of the exploration-exploitation trade-off in online RL.
Classically, this trade-off is resolved by diminishing the exploration over time, so that the policy gets closer and closer to the greedy policy, and thus (as the algorithm hopefully also converges to the optimal solution) to the optimal policy. This can be achieved by reducing ε or τ close to 0 over time, e.g., as the iteration number grows.
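Both exploration strategies are easy to implement; the following small Python sketch (our own helper functions, with Q stored as a dictionary over state-action pairs) shows ε-greedy selection (4.1) and Boltzmann selection (4.2):

    import math, random

    def epsilon_greedy(Q, x, actions, eps):
        # With probability eps pick a uniformly random action, otherwise a greedy one, cf. (4.1)
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda u: Q[(x, u)])

    def boltzmann(Q, x, actions, tau):
        # Softmax over the Q-values with temperature tau, cf. (4.2)
        weights = [math.exp(Q[(x, u)] / tau) for u in actions]
        total = sum(weights)
        return random.choices(actions, weights=[w / total for w in weights])[0]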
4.3 Optimistic Monte Carlo learning
Algorithm 4.1 only improves the policy once every N_MC trajectories. In-between, the policy remains unchanged, and possibly performs badly, for long periods of time (we say that the algorithm learns slowly). To avoid this, policy improvements could be performed more often, before an accurate evaluation of the current policy can be completed. In particular, we consider the case when they are performed after each trial. In the context of policy iteration, such methods are called optimistic.
Algorithm 4.2 shows this variant of MC learning, where ε-greedy exploration is used. A core difference from Algorithm 4.1 is that we no longer afford to reset A and C, as there is not enough information in a single trial to rebuild them. Instead, we just keep updating them, disregarding the fact that the policy is changing. Furthermore, several notation simplifications have been made: a policy is no longer explicitly considered, but the policy improvements are implicit in the greedy action choices; the trial index is now identical to the iteration index ℓ, since the two have the same meaning; and the iteration index has been dropped from the Q-function, to reflect its nature of a continually evolving learned object.
Algorithm 4.2 Optimistic Monte Carlo.
Input: discount factor γ, algorithm type (first-visit or every-visit)
1: A(x, u) ← 0, C(x, u) ← 0, ∀x, u
2: initialize Q-function, e.g., Q(x, u) ← 0 ∀x, u
3: for every iteration ℓ = 0, 1, 2, . . . do
4:   execute trial x_{ℓ,0}, u_{ℓ,0}, r_{ℓ,1}, x_{ℓ,1}, u_{ℓ,1}, r_{ℓ,2}, x_{ℓ,2}, . . . , r_{ℓ,K_ℓ}, x_{ℓ,K_ℓ},
     where u_{ℓ,k} = { u ∈ argmax_ū Q(x_{ℓ,k}, ū)         with probability 1 - ε
                     { a uniformly random action in U     with probability ε
5:   for k = 0, . . . , K_ℓ - 1 do
6:     if first-visit and (x_{ℓ,k}, u_{ℓ,k}) already encountered in trial ℓ then
7:       ignore this pair and continue to next k
8:     else
9:       A(x_{ℓ,k}, u_{ℓ,k}) ← A(x_{ℓ,k}, u_{ℓ,k}) + Σ_{l=k}^{K_ℓ - 1} γ^{l-k} r_{ℓ,l+1}
10:      C(x_{ℓ,k}, u_{ℓ,k}) ← C(x_{ℓ,k}, u_{ℓ,k}) + 1
11:    end if
12:  end for
13:  Q(x, u) ← A(x, u)/C(x, u), ∀x, u
14: end for
In closing, we note that the assumption of terminal states can be removed by estimating infinite-horizon returns from finitely long trajectories, as in policy search, see Section 3.3 and Equation (3.11). Conversely, MC methods can be used to estimate the quality of policies in policy search methods.
Example 4.1 Monte Carlo policy iteration for the cleaning robot. As an example of Monte Carlo learning, we apply Algorithm 4.1 to the cleaning-robot problem. The discount factor is the same as before, γ = 0.5. We use exploring starts, where both the initial state and the initial action are chosen fully randomly among the possible states and actions. A number N_MC = 10 of trials is used to evaluate each policy, and the algorithm is allowed to run for a total number of 100 trials, corresponding to 10 policy improvements.
Starting from a policy that always moves left (h_0(x) = -1 for all x), a representative run of MC policy iteration produces the sequence of Q-functions and policies given in Table 4.1. (A representative run must be chosen because the results of the algorithm depend on the random state and action choices, so they will change on every run.) Note that each policy h_ℓ in the table, for ℓ > 0, is greedy in the Q-function Q_{ℓ-1}. The algorithm finds the optimal policy for the first time at iteration 7. Then, however, the action for x = 1 changes to the suboptimal value of 1 (going right), before reverting back to the optimal. This behavior is less predictable than that of classical policy iteration (Algorithm 3.3), because of the approximate nature of Monte Carlo policy evaluation.
Table 4.1: Monte Carlo policy iteration results for the cleaning robot. Each cell of a Q row shows Q(x, -1); Q(x, 1); * marks the terminal states.

         x = 0   x = 1        x = 2         x = 3          x = 4      x = 5
  h_0    *       -1           -1            -1             -1         *
  Q_0    0; 0    1; 0.25      0.5; 0        0.25; 0        0.125; 0   0; 0
  h_1    *       -1           -1            -1             -1         *
  Q_1    0; 0    1; 0         0.5; 0.125    0.25; 0.0625   0.125; 0   0; 0
  h_2    *       -1           -1            -1             -1         *
  Q_2    0; 0    1; 0         0.5; 0.125    0.25; 0.0625   0.125; 0   0; 0
  h_3    *       -1           -1            -1             -1         *
  Q_3    0; 0    1; 0.25      0.5; 0        0.25; 0.0625   0.125; 5   0; 0
  h_4    *       -1           -1            -1             1          *
  Q_4    0; 0    1; 0.25      0.5; 0.125    0.25; 2.01     1.02; 5    0; 0
  h_5    *       -1           -1            1              1          *
  Q_5    0; 0    1; 0         0.5; 0.875    0.5; 2.5       0; 5       0; 0
  h_6    *       -1           1             1              1          *
  Q_6    0; 0    1; 0.4       0.425; 1.25   0.625; 2.5     1.25; 5    0; 0
  h_7    *       -1           1             1              1          *
  Q_7    0; 0    0; 0.625     0.313; 1.25   0.625; 2.5     1.25; 5    0; 0
  h_8    *       1            1             1              1          *
  Q_8    0; 0    1; 0.625     0.313; 1.25   0.625; 2.5     0; 5       0; 0
  h_9    *       -1           1             1              1          *
  Q_9    0; 0    1; 0.438     0.406; 1.25   0.625; 2.5     1.25; 5    0; 0
  h_10   *       -1           1             1              1          *
Chapter 5
Temporal difference methods
Temporal difference methods can be intuitively understood as a combination of DP,
in particular value and policy iteration, and MC methods. Like MC, they learn from
trajectories instead of using a model. However, unlike MC, they update the solution
in a sample-by-sample, incremental and fully online way, without waiting until the
end of the trajectory (or indeed, without requiring that trajectories end at all). This
corresponds to the iterative way in which Q-iteration (Algorithm 3.1) and iterative
policy evaluation (Algorithm 3.4) work, estimating new Q-functions on the basis of
old Q-functions, which is called bootstrapping in the literature.
By far the most popular methods from the temporal difference class are Q-learning
and SARSA, which we present next.
5.1 Q-learning
Q-learning starts from an arbitrary initial Q-function Q_0 and updates it using observed state transitions and rewards, i.e., data tuples of the form (x_k, u_k, x_{k+1}, r_{k+1}). After each transition, the Q-function is updated using such a data tuple, as follows:
Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α_k [r_{k+1} + γ max_{u′} Q_k(x_{k+1}, u′) - Q_k(x_k, u_k)]   (5.1)
where α_k ∈ (0, 1] is the learning rate. The term between square brackets is the temporal difference, i.e., the difference between the updated estimate r_{k+1} + γ max_{u′} Q_k(x_{k+1}, u′) of the optimal Q-value of (x_k, u_k), and the current estimate Q_k(x_k, u_k). This new estimate is actually the Q-iteration mapping (3.1) applied to Q_k in the state-action pair (x_k, u_k), where ρ(x_k, u_k) has been replaced by the observed reward r_{k+1}, and f(x_k, u_k) by the observed next state x_{k+1}. Thus Q-learning can be seen as a sample-based, incremental variant of Q-iteration.
As the number of transitions k approaches infinity, Q-learning asymptotically converges to Q* if the state and action spaces are discrete and finite, and under the following conditions:
– The sum Σ_{k=0}^∞ α_k^2 produces a finite value, whereas the sum Σ_{k=0}^∞ α_k produces an infinite value.
– All the state-action pairs are (asymptotically) visited infinitely often.
The first condition is not difficult to satisfy. For instance, a satisfactory standard choice is:
α_k = 1/k   (5.2)
In practice, the learning rate schedule may require tuning, because it influences the number of transitions required by Q-learning to obtain a good solution. A good choice for the learning rate schedule depends on the problem at hand.
The second condition can be satisfied using an exploratory policy, e.g. ε-greedy (4.1) or Boltzmann (4.2). As with MC, there is a need to balance exploration and exploitation. Usually, the exploration parameters are time-dependent (i.e., using a k-dependent ε_k for ε-greedy, τ_k for Boltzmann), and decrease over time. For instance, an ε-greedy exploration schedule of the form ε_k = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., allowing infinitely many visits to all the state-action pairs. Notice the similarity of this exploration schedule with the learning rate schedule (5.2). Like the learning rate schedule, the exploration schedule has a significant effect on the performance of Q-learning.
Algorithm 5.1 presents Q-learning with ε-greedy exploration.
Algorithm 5.1 Q-learning.
Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞
1: initialize Q-function, e.g., Q_0(x, u) ← 0 ∀x, u
2: measure initial state x_0
3: for every time step k = 0, 1, 2, . . . do
4:   u_k ← { u ∈ argmax_ū Q_k(x_k, ū)            with probability 1 - ε_k (exploit)
            { a uniformly random action in U      with probability ε_k (explore)
5:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
6:   Q_{k+1}(x_k, u_k) ← Q_k(x_k, u_k) + α_k [r_{k+1} + γ max_{u′} Q_k(x_{k+1}, u′) - Q_k(x_k, u_k)]
7: end for
Note that no difference has been made between episodic and continuing tasks. In
episodic tasks, whenever a terminal state is reached, the process must be somehow
reset to a new, nonterminal initial state; the change from the terminal state to this
initial state is not a valid transition and should therefore not be used for learning. The
algorithm remains unchanged. This also holds for the algorithms presented in the
sequel.
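As an illustration, here is a compact Python sketch of Algorithm 5.1. The environment interface env_reset/env_step is our own assumption: env_reset() returns an initial state, and env_step(x, u) returns the next state, the reward, and a flag indicating whether a terminal state was reached:

    import random

    def q_learning(env_reset, env_step, actions, gamma, num_steps,
                   alpha=lambda k: 1.0 / (k + 1), eps=lambda k: 1.0 / (k + 1)):
        Q = {}                                    # missing entries are treated as 0
        x = env_reset()
        for k in range(num_steps):
            if random.random() < eps(k):
                u = random.choice(actions)                              # explore
            else:
                u = max(actions, key=lambda a: Q.get((x, a), 0.0))      # exploit
            x_next, r, terminal = env_step(x, u)
            target = r if terminal else r + gamma * max(Q.get((x_next, a), 0.0) for a in actions)
            Q[(x, u)] = Q.get((x, u), 0.0) + alpha(k) * (target - Q.get((x, u), 0.0))   # update (5.1)
            x = env_reset() if terminal else x_next   # reset without learning across the terminal transition
        return Q

The default schedules α_k = ε_k = 1/(k+1) follow (5.2); in practice they would be tuned, as discussed above.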
5.2 SARSA
SARSA was proposed as an alternative to Q-learning, and is very similar to it. The name SARSA is obtained by joining together the initials of every element in the data tuples employed by the algorithm, namely: State, Action, Reward, next State, next Action. Formally, such a tuple is denoted by (x_k, u_k, r_{k+1}, x_{k+1}, u_{k+1}). SARSA starts with an arbitrary initial Q-function Q_0 and updates it at each step using tuples of this form, as follows:
Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α_k [r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) - Q_k(x_k, u_k)]   (5.3)
where α_k ∈ (0, 1] is the learning rate. The term between square brackets is the temporal difference, obtained as the difference between the updated estimate r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) of the Q-value for (x_k, u_k), and the current estimate Q_k(x_k, u_k).
This is not the same as the temporal difference used in Q-learning (5.1). While the Q-learning temporal difference includes the maximal Q-value in the next state, the SARSA temporal difference includes the Q-value of the action actually taken in this next state. This means that SARSA performs online, model-free policy evaluation steps for the policy that is currently being followed. In particular, the new estimate r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) of the Q-value for (x_k, u_k) is actually the policy evaluation mapping (3.7) applied to Q_k in the state-action pair (x_k, u_k). Here, ρ(x_k, u_k) has been replaced by the observed reward r_{k+1}, and f(x_k, u_k) by the observed next state x_{k+1}.
Because it works online, SARSA cannot afford to wait until the Q-function has converged before it improves the policy. Instead, to select actions, SARSA combines a greedy policy in the current Q-function with exploration. Because of the greedy component, SARSA implicitly performs a policy improvement at every time step, and is therefore a type of optimistic, online policy iteration. Policy improvements are optimistic, like in the MC variant of Algorithm 4.2, because they are done on the basis of Q-functions that may not be accurate evaluations of the current policy.
In fact, if the policy is kept fixed (while still exploratory), SARSA becomes a true online policy evaluation algorithm.
Algorithm 5.2 presents SARSA with ε-greedy exploration. In this algorithm, because the update at step k involves the action u_{k+1}, this action has to be chosen prior to updating the Q-function.
In order to converge to the optimal Q-function Q*, SARSA requires conditions similar to those of Q-learning, which demand exploration, and additionally that the exploratory policy being followed asymptotically becomes greedy. Such a policy can be obtained e.g. by decreasing ε to 0 in an ε-greedy policy.
Algorithm 5.2 SARSA.
Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞
1: initialize Q-function, e.g., Q_0(x, u) ← 0 ∀x, u
2: measure initial state x_0
3: u_0 ← { u ∈ argmax_ū Q_0(x_0, ū)             with probability 1 - ε_0 (exploit)
          { a uniformly random action in U       with probability ε_0 (explore)
4: for every time step k = 0, 1, 2, . . . do
5:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
6:   u_{k+1} ← { u ∈ argmax_ū Q_k(x_{k+1}, ū)    with probability 1 - ε_{k+1}
               { a uniformly random action in U  with probability ε_{k+1}
7:   Q_{k+1}(x_k, u_k) ← Q_k(x_k, u_k) + α_k [r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) - Q_k(x_k, u_k)]
8: end for

Algorithms like SARSA, which evaluate the policy they are currently using to control the process, are also called on-policy in the RL literature (Sutton and Barto, 1998). In contrast, algorithms like Q-learning, which act on the process using one policy and evaluate another policy, are called off-policy. In Q-learning, the policy used to control the system typically includes exploration, whereas the algorithm implicitly evaluates a policy that is greedy in the current Q-function, since maximal Q-values are used in the Q-function updates (5.1). Note that, by this definition, both of the MC algorithms presented are on-policy.
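For comparison with the Q-learning sketch given earlier, here is a corresponding SARSA sketch (same assumed env_reset/env_step interface); the only substantive difference is that the update target uses the action actually selected in the next state, cf. (5.3):

    import random

    def sarsa(env_reset, env_step, actions, gamma, num_steps,
              alpha=lambda k: 1.0 / (k + 1), eps=lambda k: 1.0 / (k + 1)):
        def select(x, k):                          # eps-greedy action selection
            if random.random() < eps(k):
                return random.choice(actions)
            return max(actions, key=lambda a: Q.get((x, a), 0.0))

        Q = {}
        x = env_reset()
        u = select(x, 0)
        for k in range(num_steps):
            x_next, r, terminal = env_step(x, u)
            u_next = select(x_next, k + 1)
            target = r if terminal else r + gamma * Q.get((x_next, u_next), 0.0)   # cf. (5.3)
            Q[(x, u)] = Q.get((x, u), 0.0) + alpha(k) * (target - Q.get((x, u), 0.0))
            if terminal:
                x = env_reset()
                u = select(x, k + 1)
            else:
                x, u = x_next, u_next
        return Q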
Computational cost of temporal difference methods
Consider the computational cost of a single Q-learning step for an MDP with a finite number of states and actions, measured by the number of Q-function evaluations. We assume, as before, that the maximization of the Q-function is performed by enumerating over the |U| actions. With the ε-greedy policy, the cost is:
1+2|U|
at steps where a greedy action is chosen (where the algorithm does not explore). This is because the Q-function must be maximized twice (once when choosing actions, and once in the updates), and Q_k(x_k, u_k) must additionally be evaluated. When action selection is random (the algorithm explores), the cost is only 1 + |U|. Note that unlike in DP algorithms, f and ρ do not have to be evaluated to run the algorithm; instead, the state and reward data are obtained (measured) from the process. (Things change if a simulation model is used and the computational cost of the simulations must be considered.)
For similar reasons, the cost of a SARSA step is:
2 + |U|
when a greedy action is chosen, and just 2 (constant!) when a random action is selected. Overall, it can be said that temporal-difference methods are computationally cheap.
5.2. SARSA 49
Example 5.1 Q-learning for the cleaning robot. As an example, we apply Q-learning with ε-greedy exploration (Algorithm 5.1) to the cleaning robot problem. The problem is episodic, so learning is separated into trials. At the beginning of each trial, the state is reset to a random, non-terminal value. The learning rate α is initially 1, remains constant along trials, and decays with a factor 0.99 at the end of each trial: α ← 0.99 α. The exact same schedule is used for the exploration probability ε. These schedules were tuned for a good performance of the algorithm. Note that this type of exponential decay does not satisfy the theoretical conditions for convergence (Σ_{k=0}^∞ α_k is finite, and infinite exploration is not achieved), but in practice, for an experiment having a finite duration, this is not important, as the decay rates can be chosen large enough to get sufficiently large values for α_k and ε_k.
We examine the results of Q-learning by looking at the evolution of two quantities with the number of trials. The first quantity is the distance between the Q-function found at the end of the trial and the optimal Q-function, computed with the formula:
max_{x,u} |Q(x, u) - Q*(x, u)|
This indicates how close learning has got to the optimal Q-function. The second quantity indicates how close we are to an optimal policy, and is computed as the number of states where the greedy action in Q is not optimal. (If the Q-values of the two actions are the same, then we have a tie, which we break in favor of the first action.) In what follows, we call this quantity simply the distance to the optimal policy.
The distance to the optimal policy is more important in practice, as it directly influences the control performance. Errors in the Q-function are harmful only if they are large enough to influence the ranking of the actions, which is what dictates the policy.
Figure 5.1 shows the evolution of these two quantities for the first 50 learning trials. Because the results depend on the random exploration and the random initialization of the state, the experiment is run 30 times, and the mean across these runs is reported together with a statistical confidence interval.¹ Q-learning finds relatively stable Q-values and policies after around 42 trials, and although the distance to Q* is never exactly zero, the policy does become fully optimal, illustrating the difference discussed above between policy and Q-function convergence.
¹ Of course, because the distances are always non-negative, the lower confidence bound should in fact be understood as being equal to 0 whenever shown as negative in the graphs.
Figure 5.1: Results of Q-learning for the cleaning robot problem: (a) evolution of the distance to the optimal Q-function; (b) corresponding distance to the optimal policy. The shaded regions show 95% confidence intervals on the mean.
Chapter 6
Accelerating temporal-difference learning
Temporal difference methods have very desirable properties: they are model-free, fully online, easy to understand and implement (even without grasping all the theory behind them), and computationally cheap. Unfortunately, in their original form presented above they also tend to learn slowly, requiring a large amount of transition samples to reach a good control performance. That is because they use every sample just once to update the Q-function, and then they discard it. Slow learning is damaging in practice, as it effectively means the system is controlled poorly for a very long time.
Therefore, ways to accelerate learning in temporal difference methods are needed; more specifically, ways to increase the efficiency with which they use data. We describe next three possibilities of achieving this, namely: using so-called eligibility traces, reusing raw data, and learning a model of the MDP which is used to generate new data.
6.1 Eligibility traces
Learning can be sped up by using the fact that the latest transition is the causal result of the entire preceding trajectory. To this end, recently visited state-action pairs are made eligible for updating by using an eligibility trace e : X × U → [0, ∞). For example, in SARSA, instead of using the temporal difference
δ^SARSA_k = r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) - Q_k(x_k, u_k)
to update only the Q-value of the latest state-action pair (x_k, u_k), all pairs are updated proportionally to their eligibility:
Q_{k+1}(x, u) = Q_k(x, u) + α_k δ^SARSA_k e_{k+1}(x, u),   ∀x, u   (6.1)
There are two variants of eligibility trace, both of which start with zero initial traces: e_0(x, u) = 0 ∀x, u. In the replacing traces variant, the trace is set to 1 for the latest state-action pair and decayed by a factor γλ for all other pairs:
e_{k+1}(x, u) = { 1                if (x, u) = (x_k, u_k)
               { γλ e_k(x, u)      otherwise
We call λ ∈ [0, 1] the decay rate of the trace. State-action pairs become exponentially less eligible as they are farther in the past: looking from step k, the pair at step k - l has eligibility (γλ)^l.
In the accumulating traces variant, a value of 1 is added to the trace of the latest pair, rather than just setting the trace to 1:
e_{k+1}(x, u) = { γλ e_k(x, u) + 1   if (x, u) = (x_k, u_k)
               { γλ e_k(x, u)        otherwise
In Q-learning, the temporal difference is
δ^QL_k = r_{k+1} + γ max_{u′} Q_k(x_{k+1}, u′) - Q_k(x_k, u_k),
and the Q-function update is otherwise similar to SARSA:
Q_{k+1}(x, u) = Q_k(x, u) + α_k δ^QL_k e_{k+1}(x, u),   ∀x, u   (6.2)
There is another difference from SARSA: the theory behind eligibility traces (which we did not go into here) indicates that the trace should be reset whenever a non-greedy action is taken due to exploration. The reason for this can be intuitively understood as an interruption of causality in the trajectory. However, it is not clear whether resetting the trace in this way actually leads to a better algorithm in practice.
For both Q-learning and SARSA, the eligibility trace must be reset whenever a new episode begins.
When λ = 0, the original algorithms are recovered. When λ = 1, SARSA becomes very similar to an MC variant (see Algorithm 4.2) that improves the policy after every step rather than after every trajectory. In fact, λ can be viewed as a tuning parameter that smoothly changes the algorithm from fully incremental (λ = 0) to fully MC (λ = 1), and this interpretation also holds for Q-learning. However, it should be noted that in practice using λ equal or close to 1 gives bad results, and intermediary values between 0 and 1 usually perform better.
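To make the trace mechanics explicit, the following Python sketch (our own helper; Q and e are dictionaries defined over all state-action pairs) performs one SARSA(λ) update with replacing traces, combining the trace decay with the eligibility-weighted update (6.1):

    def sarsa_lambda_update(Q, e, x, u, r, x_next, u_next, alpha, gamma, lam):
        delta = r + gamma * Q[(x_next, u_next)] - Q[(x, u)]   # temporal difference
        for xu in e:
            e[xu] *= gamma * lam                              # decay all traces
        e[(x, u)] = 1.0                                       # replacing trace for the latest pair
        for xu in Q:
            Q[xu] += alpha * delta * e.get(xu, 0.0)           # eligibility-weighted update

An accumulating-traces variant would use e[(x, u)] += 1.0 instead of setting the trace to 1.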
Example 6.1 The influence of eligibility traces. To illustrate the benefits of using eligibility traces, consider the simple maze (gridworld) problem shown in Figure 6.1(a), in which a robot must find the shortest path from the start state S to the goal state G, while only receiving a positive reward upon reaching the goal state; all the other rewards are zero.
Consider that the robot has traveled the trajectory in Figure 6.1(a), and that it learns with a temporal difference algorithm, such as SARSA or Q-learning (ignoring any trace resets due to exploration). To propagate the maximum amount of information, set the learning rate to α = 1. If no eligibility traces are used, the reward received upon reaching G is propagated only to the state just before G, see Figure 6.1(b). If eligibility traces with λ < 1 are used, the reward is propagated to all the states along the trajectory, with an eligibility that decreases exponentially while moving back along the trajectory, see Figure 6.1(c). When Monte Carlo updates are used, or equivalently λ = 1, the reward fully propagates to all the states along the trajectory (properly discounted, of course), see Figure 6.1(d).
Figure 6.1: The influence of eligibility traces: (a) trajectory followed; (b) incremental updates without eligibility traces; (c) incremental updates with eligibility traces, λ < 1; (d) Monte Carlo updates, and incremental updates with eligibility traces, λ = 1. The length of the arrows symbolizes the amount of information propagated from the goal to each state along the trajectory.
Algorithm 6.1 presents Q-learning with replacing eligibility traces, called Q(λ) due to the dependence on the λ parameter. In this algorithm (the second line of the trace update formula), the trace is reset on exploratory actions, as discussed above. Algorithm 6.2 presents SARSA with replacing eligibility traces, SARSA(λ) for short.
In a naive implementation, eligibility traces increase the computational cost of Q-learning and SARSA roughly by a factor of |X| |U|, since all the Q-values must now be updated, instead of just one as in the standard algorithms. However, efficient implementations can be given by skipping over state-action pairs with zero eligibility, and by setting traces to 0 whenever they fall below a small threshold.
Algorithm 6.1 Q(λ).
Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞, eligibility trace decay rate λ
1: initialize Q-function, e.g., Q_0(x, u) ← 0 ∀x, u
2: e_0(x, u) ← 0 ∀x, u
3: measure initial state x_0
4: for every time step k = 0, 1, 2, . . . do
5:   u_k ← { u ∈ argmax_ū Q_k(x_k, ū)            with probability 1 - ε_k (exploit)
            { a uniformly random action in U      with probability ε_k (explore)
6:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
7:   e_{k+1}(x, u) ← { 1               if (x, u) = (x_k, u_k)
                      { 0               if (x, u) ≠ (x_k, u_k) and u_k ∉ argmax_ū Q_k(x_k, ū)
                      { γλ e_k(x, u)    in all other cases
8:   δ^QL_k ← r_{k+1} + γ max_{u′} Q_k(x_{k+1}, u′) - Q_k(x_k, u_k)
9:   Q_{k+1}(x, u) ← Q_k(x, u) + α_k δ^QL_k e_{k+1}(x, u), ∀x, u
10: end for

Algorithm 6.2 SARSA(λ).
Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞, eligibility trace decay rate λ
1: initialize Q-function, e.g., Q_0(x, u) ← 0 ∀x, u
2: e_0(x, u) ← 0 ∀x, u
3: measure initial state x_0
4: u_0 ← { u ∈ argmax_ū Q_0(x_0, ū)              with probability 1 - ε_0 (exploit)
          { a uniformly random action in U        with probability ε_0 (explore)
5: for every time step k = 0, 1, 2, . . . do
6:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
7:   u_{k+1} ← { u ∈ argmax_ū Q_k(x_{k+1}, ū)     with probability 1 - ε_{k+1}
               { a uniformly random action in U   with probability ε_{k+1}
8:   e_{k+1}(x, u) ← { 1               if (x, u) = (x_k, u_k)
                      { γλ e_k(x, u)    otherwise
9:   δ^SARSA_k ← r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) - Q_k(x_k, u_k)
10:  Q_{k+1}(x, u) ← Q_k(x, u) + α_k δ^SARSA_k e_{k+1}(x, u), ∀x, u
11: end for

Example 6.2 Q(λ) for the cleaning robot. In this example, we apply Q-learning with eligibility traces (Algorithm 6.1) to the cleaning robot problem, setting λ = 0.5. Except for λ, the rest of the parameters are exactly the same as in Example 5.1.
Figure 6.2 shows the results obtained, and also repeats the graphs for standard Q-learning (Figure 5.1) for an easy comparison. While the effect on the Q-function is not obvious, the difference in policies is clearer: the algorithm finds a fully optimal policy after at most 22 trials in all 30 runs, whereas it does that in 42 trials for standard Q-learning. (We know the policies are optimal in all 30 runs because the confidence interval reduces to a line.) While this difference is illustrative, the two curves are not statistically significantly different.
Figure 6.2: Results of Q(λ) for the cleaning robot problem, compared to standard Q-learning: (a) Q(λ), distance to the optimal Q-function; (b) Q(λ), distance to the optimal policy; (c) standard Q-learning, distance to the optimal Q-function; (d) standard Q-learning, distance to the optimal policy.
6.2 Experience replay
A straightforward approach to increase the data efficiency of RL is experience replay (ER), in which the data acquired during the online learning process are stored and presented repeatedly to the RL algorithm.
We illustrate the ER approach for Q-learning. At every time step k, after learning from the observed transition sample (x_k, u_k, r_{k+1}, x_{k+1}), this sample is stored in a database D. Then, a number N of samples from D are replayed: each sample is retrieved from the database and used to update the Q-function with the regular Q-learning rule. Algorithm 6.3 presents this procedure. Note that the index k has been dropped from the Q-function, since this function is now updated several times per step. It may be useful to refine the learning rate so that it also changes with the replay steps; this possibility is not made explicit in Algorithm 6.3 to avoid notation clutter.
Algorithm 6.3 ER Q-learning.
Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞, number of replays per time step N
1: initialize Q-function, e.g., Q(x, u) ← 0 ∀x, u
2: initialize sample database D ← ∅
3: measure initial state x_0
4: for every time step k = 0, 1, 2, . . . do
5:   u_k ← { u ∈ argmax_ū Q(x_k, ū)              with probability 1 - ε_k (exploit)
            { a uniformly random action in U      with probability ε_k (explore)
6:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
7:   Q(x_k, u_k) ← Q(x_k, u_k) + α_k [r_{k+1} + γ max_{u′} Q(x_{k+1}, u′) - Q(x_k, u_k)]
8:   add sample to database: D ← D ∪ {(x_k, u_k, r_{k+1}, x_{k+1})}
9:   loop N times   ▷ experience replay
10:    retrieve a sample (x, u, r, x′) from D
11:    Q(x, u) ← Q(x, u) + α_k [r + γ max_{u′} Q(x′, u′) - Q(x, u)]
12:  end loop
13: end for
The rule by which the N samples are chosen is left open in this algorithm. Many choices are possible, for example:
– Forward: samples are replayed in the order in which they were observed.
– Backward: samples are replayed in reverse order.
– Random: samples are replayed in random order.
We discuss the effect of these three choices in the example below. When there are terminal states and forward or backward replays are used, an additional problem is selecting which trials to replay.
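A sketch of the replay loop supporting these three orderings is given below (our own code; the stored samples are assumed to carry a terminal flag in addition to the (x, u, r, x′) tuple, and "forward" and "backward" are interpreted over the most recent N samples):

    import random

    def replay(Q, database, actions, alpha, gamma, N, order="backward"):
        if order == "forward":
            samples = database[-N:]                           # most recent N, oldest first
        elif order == "backward":
            samples = list(reversed(database[-N:]))           # most recent N, newest first
        else:                                                 # random
            samples = random.sample(database, min(N, len(database)))
        for (x, u, r, x_next, terminal) in samples:
            target = r if terminal else r + gamma * max(Q.get((x_next, a), 0.0) for a in actions)
            Q[(x, u)] = Q.get((x, u), 0.0) + alpha * (target - Q.get((x, u), 0.0))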
Another degree of freedom is in the way of interleaving normal updates with ER
updates. ER updates can be performed once every several steps rather than at each
step, e.g. at the end of trials. The algorithm may even restrict itself to performing only
ER updates at given times, without any regular updates. This can make its behavior
more predictable, by keeping the Q-function (and therefore greedy policy) constant
for known intervals of time.
6.2. EXPERIENCE REPLAY 57
An ER variant of SARSA can be given in a similar way. Since SARSA is on-
policy, it seems reasonable to only replay samples that correspond to the current
policy, that is, that contain actions equal to those given by the current policy. This is
not a strict requirement however: even regular SARSA updates the Q-function using
samples from many policies (as the policy is updated at every time step). It may
be sufficient to just give preference to samples from the current policy, by replaying
them more often, or to only replay samples that are close enough to the current
policy in some sense.
Example 6.3 Effects of experience replay. In this example, we illustrate the effects of ER, and the differences between forward, backward, and random ER. We use again the gridworld problem of Example 6.1, and like in that example, we use a temporal difference algorithm such as Q-learning with α = 1, to propagate the maximum information.
Assume the robot has traveled the trajectory in Figure 6.3(a). This trajectory has six transitions, which we denote by (x_0, u_0, r_1, x_1), . . . , (x_5, u_5, r_6, x_6); x_6 is terminal. Recall that all rewards are 0 except when reaching the goal, where a single positive reward is obtained: r_6. Thus, the first informative update is the one using transition (x_5, u_5, r_6, x_6); before that, all the Q-values remained 0 because all the rewards were 0. After this update, the positive reward propagates to the Q-value of state x_5, that is, just before the goal: the situation in Figure 6.3(b) is obtained.
Figure 6.3: The difference between forward and backward ER: (a) trajectory followed; (b) normal update; (c) forward replay; (d) backward replay.
Now the algorithm switches to ER mode. Consider for illustration purposes that N = 6, equal to the length of the trajectory. If forward replay is used, then replaying samples (x_0, u_0, r_1, x_1), . . . , (x_3, u_3, r_4, x_4) is not useful, as all the Q-values up to x_4 are still 0. There are two useful samples: (x_4, u_4, r_5, x_5), which propagates the Q-value of x_5, updated above, to x_4, and (x_5, u_5, r_6, x_6), which again propagates the nonzero reward r_6. So information has now been propagated to an additional state compared to the regular algorithm, obtaining the situation in Figure 6.3(c). In contrast, with backward replay all the samples are useful, since each update using sample (x_k, u_k, r_{k+1}, x_{k+1}) propagates the Q-value of x_{k+1}, which was itself updated in the previous ER step, see Figure 6.3(d). In fact, we obtain a result similar to a Monte Carlo update, or to using eligibility traces with a large λ. This means that backward replay is generally better than forward replay, and can emulate the effects of eligibility traces.
Random replay can help in another way, by allowing the algorithm to more easily make connections between states that have been encountered far apart in time, but are actually close to each other in the state-action space. Assume the robot has traveled the two trajectories in Figure 6.4(a). Using eligibility traces or replaying a single trajectory, only the rewards encountered over the dashed trajectory can be propagated back to the state in the gray square, see Figure 6.4(b). In contrast, with random ER, there is a good chance that samples from both trajectories are interleaved, allowing the state in the gray square to benefit from information along both trajectories, see Figure 6.4(c). So, the robot may find out that from the gray square to the goal there exists a trajectory shorter than the dashed line.
Figure 6.4: Random ER aggregates information from multiple trajectories, allowing the algorithm to make connections between these trajectories: (a) trajectories; (b) single-trajectory updates; (c) random ER updates. The gray square receives information from the squares crossed by thick lines.
Regardless of the particular type of ER used, as the number of replays increases many samples will be replayed many times, following each other in different ways, so eventually all the beneficial effects illustrated above will be obtained. In the short term and for small N, however, the differences above can prove to be very important, by significantly affecting the learning speed.
Regarding computational cost, it is clear that replaying N samples at each step, such as in Algorithm 6.3, will increase the cost per step roughly N + 1 times, as now N + 1 updates must be performed instead of just 1. Fortunately, ER is useful for any value of N, and it generally improves the Q-function more with larger N; together, these two facts make ER a so-called anytime algorithm. This means N can be chosen adaptively to satisfy computational constraints (such as running in an interval shorter than the sample time in real-time control), and it will nevertheless provide any benefits it can within these constraints.
ER can also be combined with eligibility traces, in particular in its forward variant. However, as illustrated above, this may be unnecessary as backward ER can emulate the effects of eligibility traces.
Example 6.4 ER-Q-learning for the cleaning robot. We apply Q-learning with ER (Algorithm 6.3) to the cleaning robot problem, setting N = 10 and replaying trials in backward order. The trials themselves are selected randomly.
Figure 6.5 shows the results obtained, also comparing to Figure 6.2. ER gives qualitatively slightly better results than eligibility traces, but not by much (and, again, the differences are largely not statistically significant).
6.3 Model learning and Dyna
Both eligibility traces and ER provide direct ways of reusing data. In this section, we
consider a method for indirect data reuse: model learning. In this method, the data is
used to learn about the dynamics f and the reward function ρ of the MDP: together,
these functions form the model of the MDP, hence the name model learning. Once
a (partial) model is available, it can be exploited to simulate new transitions, which
are used for learning just like in ER. Note here we consider the more general variant
of model learning that also constructs ρ; if ρ is known, as is often the case in control,
then it is simply not necessary to learn about it and the rest of the algorithm remains
valid.
The interaction between the learning algorithm, model, and process can be represented
as in Figure 6.6. Experience (transition samples) from the process is used to
update the value function and, implicitly or explicitly depending on the algorithm,
the policy, as well as the model. Simulated experience generated by the model is
used to update the value function and policy.
Algorithm 6.4 exemplifies a model-learning variant of Q-learning, which is called
Dyna-Q in the literature. Note the strong similarity between the updates that use simulated
experience (line 11) and the Q-iteration updates of Algorithm 3.1. Indeed, line
11 is simply an asynchronous, incremental variant of Q-iteration using the learned
model: asynchronous because it may update state-action pairs in any order, and incremental
because it is parameterized by the learning rate α.
Regarding the order in which state-action pairs are chosen for simulation, the
same discussion as for ER applies.
[Figure 6.5, panels (each plots a mean distance against the trial number, 0 to 50): (a) ER-Q-learning: distance to the optimal Q-function. (b) ER-Q-learning: distance to the optimal policy. (c) Q(λ): distance to the optimal Q-function. (d) Q(λ): distance to the optimal policy.]
Figure 6.5: Results of ER-Q-learning for the cleaning robot problem, compared to Q(λ).
[Figure 6.6, diagram: the process sends experience to a value or policy learning block and to a model learning block; the model learning block updates the model; the model generates simulated experience for the value or policy learning block, which in turn applies actions to the process.]
Figure 6.6: RL with model learning.
In fact, the careful reader has probably noticed that in the form given above,
Dyna-Q is little different from ER-Q-learning: storing and retrieving the next states
and rewards in the model is little different than storing and retrieving the whole tran-
sitions in a database. In general, Dyna-Q becomes more interesting in two different
ways:
Algorithm 6.4 Dyna-Q.
Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞, number of simulated samples N
1: initialize Q-function, e.g., Q(x, u) ← 0 ∀x, u
2: initialize model f_m(x, u), ρ_m(x, u) ∀x, u
3: measure initial state x_0
4: for every time step k = 0, 1, 2, . . . do
5:   u_k ← u ∈ argmax_ū Q(x_k, ū)   with probability 1 − ε_k (exploit),
        a uniformly random action in U   with probability ε_k (explore)
6:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
7:   Q(x_k, u_k) ← Q(x_k, u_k) + α_k [r_{k+1} + γ max_{u'} Q(x_{k+1}, u') − Q(x_k, u_k)]
8:   update model: f_m(x_k, u_k) = x_{k+1}, ρ_m(x_k, u_k) = r_{k+1}
9:   loop N times   ▹ simulated experience
10:    choose a previously observed pair (x, u)
11:    Q(x, u) ← Q(x, u) + α_k [ρ_m(x, u) + γ max_{u'} Q(f_m(x, u), u') − Q(x, u)]
12:  end loop
13: end for
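As a rough illustration only (finite state-action space, deterministic dynamics, names invented for this sketch), the core of one Dyna-Q time step could look as follows; the line references point to Algorithm 6.4.

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, x, u, r, x_next, actions, alpha, gamma, n_sim):
    """One Dyna-Q step: direct update, model update, then N simulated updates."""
    def td_update(xs, us, rs, xs_next):
        best_next = max(Q[(xs_next, a)] for a in actions)
        Q[(xs, us)] += alpha * (rs + gamma * best_next - Q[(xs, us)])

    td_update(x, u, r, x_next)          # line 7: learn from real experience
    model[(x, u)] = (x_next, r)         # line 8: memorize f_m and rho_m

    for _ in range(n_sim):              # lines 9-12: simulated experience
        xs, us = random.choice(list(model.keys()))   # previously observed pair
        xs_next, rs = model[(xs, us)]
        td_update(xs, us, rs, xs_next)

# Illustrative containers: a Q-table defaulting to 0 and an empty model.
Q = defaultdict(float)
model = {}
```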
• Stochastic systems: in the deterministic case above, model learning means
simply memorizing the next state and reward. In stochastic settings, a distribution
over next states must be learned from the samples.
• Generalization: rather than just memorizing and recalling, the model could
be used to generalize. To understand this, consider e.g. the cleaning robot
problem of Example 2.1. If it was observed that taking action 1 in states 2
and 4 leads to moving right, it may seem reasonable to assume that the same
outcome will be obtained in state 3. Model generalization is very powerful
because it allows simulating and learning about state-action pairs that are new,
unseen so far. We return to this point in more detail in the continuous-variable,
approximate RL setting, in Part III of these notes.
Bibliographical notes for Part II
The origins of DP methods date back to the beginnings of the dynamic programming
field (Bellman, 1957). However, here we follow the recent texts (Busoniu et al.,
2010; Bertsekas, 2007). Optimistic planning was introduced in the form given here
by Hren and Munos (2008), but it is in fact just a form of the classical A* planning
algorithm applied to MDPs, see (La Valle, 2006, Ch. 2). Our presentation of
Monte Carlo is based on (Sutton and Barto, 1998, Chapter 5). The concept of optimistic
policy iteration was introduced by (Bertsekas and Tsitsiklis, 1996, Section
6.4). The presentation of basic Q-learning and SARSA follows our book (Busoniu
et al., 2010), but it is important to note that Q-learning and Q(λ) were introduced by
(Watkins, 1989; Watkins and Dayan, 1992), while SARSA and SARSA(λ) are due
to (Rummery and Niranjan, 1994). Definitive convergence results for Q-learning can
be found in (Watkins and Dayan, 1992; Tsitsiklis, 1994; Jaakkola et al., 1994) and
for SARSA in (Singh et al., 2000). ER was introduced by Lin (1992) and Dyna by
Sutton (1990).
Bibliography
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Busoniu, L., Babuška, R., De Schutter, B., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. Automation and Control Engineering. Taylor & Francis CRC Press.
Hren, J.-F. and Munos, R. (2008). Optimistic planning of deterministic systems. In Proceedings 8th European Workshop on Reinforcement Learning (EWRL-08), pages 151–164, Villeneuve d'Ascq, France.
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201.
La Valle, S. M. (2006). Planning Algorithms. Cambridge University Press.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4):293–321. Special issue on reinforcement learning.
Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR166, Engineering Department, Cambridge University, UK. Available at http://mi.eng.cam.ac.uk/reports/svr-ftp/rummery_tr166.ps.Z.
Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, Cs. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings 7th International Conference on Machine Learning (ICML-90), pages 216–224, Austin, US.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(1):185–202.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Part III
Approximate reinforcement
learning
Introduction and outline
In control problems, as well as most realistic problems from other fields, state and
action spaces are too large to allow the exact representations of the solution that
classical reinforcement learning (RL) algorithms require. Instead, approximate representations
of the solution (value function and/or policy) are needed. In this third and
final part of the lecture notes, we present a representative selection of methods for
approximate RL.
[Figure 6.7, diagram: approximation techniques (parametric and nonparametric, linear and nonlinear); offline approximate reinforcement learning (approximate value iteration, e.g. fitted Q-iteration, and approximate policy iteration, e.g. least-squares policy iteration); online approximate reinforcement learning (approximate Q-learning, approximate SARSA, actor-critic); accelerating online reinforcement learning (eligibility traces, experience replay).]
Figure 6.7: Classification of approximation techniques, algorithms for approximate RL, and enhancements discussed.
Figure 6.7 shows the methods that will be discussed. We start by presenting the
function approximation techniques underpinning the whole field; these are shown
within a dashed outline in the figure. Parametric approximators are mappings having
a fixed form and that (linearly or nonlinearly) depend on a given number of parameters,
which can be tuned to approximate the function of interest. The form and
number of parameters of nonparametric approximators are derived from the data.
Once the function approximation framework is in place, we describe approximate
RL algorithms. First, approximate variants of the offline value and policy iteration
methods are presented. The exemplified algorithms are the so-called fitted Q-iteration
(which belongs to the value iteration class), and least-squares policy iteration.
To introduce fitted Q-iteration, we present as an intermediate step a model-based,
approximate DP method called fuzzy Q-iteration.
Then, three online algorithms for approximate RL are introduced. Two of them
are derived from the classical Q-learning and SARSA algorithms, whereas a third
represents a new class of methods called actor-critic.
Finally, we extend to the approximate case two ways to increase the learning
speed of online RL: eligibility traces and experience replay.
As we did in Part II, we describe the algorithms in the deterministic case, but they
also work in stochastic problems. Q-functions are used in all the algorithms except
the actor-critic method, which approximates a V-function instead.
Chapter 7
Function approximation
Most of the algorithms for exact dynamic programming (DP) and reinforcement
learning (RL) that we presented in Part II use Q-functions, and thus require the storage
of distinct Q-values for each state-action pair. If we consider V-functions, distinct
values for each state must be stored. Policy search algorithms, even when not using
Q-functions or V-functions, need to store distinct actions for each state in order to
represent the policy.
All this is only possible if the states vary in a relatively small set of discrete
values. When using Q-functions, the same must be true for the actions. However,
when some of the state or action variables have a very large or infinite number of
possible values, exact storage is no longer possible, and the functions of interest must
be represented approximately. Infinite spaces arise easily when the state-action space
is continuous, which is typical in problems from automatic control.
In this chapter, we introduce a number of basic function approximation techniques
that can be used in RL. We start by presenting methods for function approximation
in the general case, and then we will discuss specifics in the case of approximation
for DP and RL.
7.1 Approximation architectures
Throughout this section, we will be concerned with the representation of a generic
function g(x), where x ∈ X is a generic D-dimensional vector of variables (it may be
the state vector of some system as in RL, but this is not necessary). The function is
scalar (i.e. it has only one output), as this suffices to introduce the necessary concepts.
Two major classes of approximators can be identified, namely parametric and
nonparametric approximators.
Parametric approximation
Parametric approximators depend on a vector of parameters. The functional form
of the dependence, as well as the number of parameters, are typically established
in advance and do not depend on the data. The parameters of the approximator are
then tuned using data about the target function. For example, we denote a parametric
approximator of g(x) by:
  ĝ(x; θ)
where θ ∈ R^n is a vector of n parameters. So, instead of storing distinct values for
every x, which would be impossible e.g. when x is continuous, it is only necessary to
store n parameters. Approximation can even be useful when X contains a finite, but
large number of values |X|, as in that case n is usually much smaller than |X|, thereby
providing a compact representation (recall that, when applied to sets, the notation |·|
stands for cardinality). Of course, not every function g can be exactly represented
by some parameter vector; in general, there will be some approximation error
between g(x) and ĝ(x; θ), which is why ĝ is called an approximator.
In general, the approximator can depend nonlinearly on the parameters. A typical
example of such a nonlinear approximator is a feed-forward neural network.
However, linearly parameterized approximators are often preferred in DP and RL,
because they make it easier to analyze the theoretical properties of the resulting algorithms.
A linearly parameterized approximator employs n basis functions (BFs)
φ_1, . . . , φ_n : X → R, and computes approximate values with:
  ĝ(x; θ) = Σ_{i=1}^{n} φ_i(x) θ_i = φ^T(x) θ        (7.1)
where φ(x) = [φ_1(x), . . . , φ_n(x)]^T is the vector of BFs.
Note that while this approximator is linear in the parameters, it can (and usually
does) depend nonlinearly on x: this is achieved by using BFs that are nonlinear in x.
Consider now that we are given a set of input-output samples for the function g,
and we have to use these samples to derive θ so that it leads to a good approximation of g.
Denote the set of samples by {(x_{i_s}, g_{i_s}) | i_s = 1, . . . , n_s}, where x_{i_s} is the i_s-th input, and
g_{i_s} = g(x_{i_s}) the corresponding output (note this is a scalar number, not a function).
Then, a natural method to find the parameters is by minimizing the squared errors on
the provided samples:
  θ ∈ argmin_θ Σ_{i_s=1}^{n_s} [g_{i_s} − ĝ(x_{i_s}; θ)]^2
In the linear case, this least-squares criterion turns into:
  θ ∈ argmin_θ Σ_{i_s=1}^{n_s} [g_{i_s} − φ^T(x_{i_s}) θ]^2        (7.2)
Let us put the basis function values for each input vector in a matrix A, and the output
values in a vector b:
  A = [ φ^T(x_1); . . . ; φ^T(x_{n_s}) ] = [ φ_1(x_1) . . . φ_n(x_1); . . . ; φ_1(x_{n_s}) . . . φ_n(x_{n_s}) ],   b = [ g_1; . . . ; g_{n_s} ]
(rows are separated by semicolons). Then, solving (7.2) corresponds to solving the system of equations:
  A θ = b
Usually, n_s > n, so the system is overdetermined, and can be solved in a least-squares
sense. By linear algebra arguments, if the columns of A are linearly independent, one
such solution can be found by using the pseudoinverse A^+ of A:
  θ = A^+ b = (A^T A)^{−1} A^T b
A more efficient solution is provided, in Matlab, by the matrix left division operator
(backslash, \), which computes the least-squares solution without explicitly forming the pseudoinverse.
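As a rough sketch (not from the original notes), the same linear least-squares fit (7.2) can be computed with NumPy; the BFs are left abstract here as a user-supplied function phi returning the vector of BF values.

```python
import numpy as np

def lstsq_fit(phi, xs, gs):
    """Fit theta minimizing sum_i (g_i - phi(x_i)^T theta)^2, cf. Eq. (7.2).

    phi -- function mapping an input x to the n-vector of BF values phi(x)
    xs  -- sequence of input samples x_i
    gs  -- corresponding target values g(x_i)
    """
    A = np.array([phi(x) for x in xs])             # n_s x n matrix of BF values
    b = np.asarray(gs, dtype=float)                # n_s-vector of outputs
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares solution of A theta = b
    return theta

def approximate(phi, theta, x):
    """Evaluate the linear approximator g_hat(x; theta) = phi(x)^T theta."""
    return phi(x) @ theta
```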
Example 7.1 Linear approximation for the Rosenbrock function. In this example,
we will compute least-squares fits for the following, slightly modified version of
the Rosenbrock function, also known as the banana function:
  g(x) = (1 − x_1)^2 + 100 [(x_2 + 1.5) − x_1^2]^2,   x = [x_1, x_2]^T
over the domain [−2, 2] × [−1.5, 1.5]. A graph of this function is shown in Figure 7.1.
Figure 7.1: Rosenbrock function over the domain [−2, 2] × [−1.5, 1.5].
To approximate this function, we will use 200 uniformly, randomly distributed
samples over the domain [−2, 2] × [−1.5, 1.5], and the least-squares criterion (7.2).
The parameters will be found with matrix left division in Matlab, as described above.
We will use two types of BFs: pyramidal BFs and Gaussian RBFs.
Pyramidal BFs can be obtained as follows, in the general case of a D-dimensional
variable (for the Rosenbrock function D = 2). For each state variable x_d, where
d ∈ {1, . . . , D}, a number n_d of triangular BFs are defined as follows:
  φ_{d,1}(x_d) = max( 0, (c_{d,2} − x_d) / (c_{d,2} − c_{d,1}) )
  φ_{d,i}(x_d) = max( 0, min( (x_d − c_{d,i−1}) / (c_{d,i} − c_{d,i−1}), (c_{d,i+1} − x_d) / (c_{d,i+1} − c_{d,i}) ) ),  for i = 2, . . . , n_d − 1
  φ_{d,n_d}(x_d) = max( 0, (x_d − c_{d,n_d−1}) / (c_{d,n_d} − c_{d,n_d−1}) )
where c_{d,1}, . . . , c_{d,n_d} are the centers along dimension d and must satisfy c_{d,1} < . . . <
c_{d,n_d}. These centers fully determine the shape of the BFs. Adjacent BFs always
intersect at a level of 0.5. The product of each combination of single-dimensional
BFs gives a pyramidal D-dimensional BF. Examples of single-dimensional and two-dimensional
such BFs are given in Figure 7.2.
dimensional such BFs are given in Figure 7.2. Note such an approximator corre-
2 0 2
0
0.5
1
x
1

(
x
1
)
(a) A set of single-dimensional
triangular BFs, each shown in a
different line style.
2
0
2
2
0
2
0
0.5
1
x
1
x
2

(
x
)
(b) Two-dimensional BFs, obtained by
combining two sets of single-dimensional
BFs, each identical to the set in Fig-
ure 7.2(a).
Figure 7.2: Examples of triangular and pyramidal BFs.
Note that such an approximator corresponds to multilinear interpolation on the grid of BF centers. The parameter corresponding
to each BF can therefore be seen as the approximate value of the function
at the center of the BF.
The results of approximating the Rosenbrock function with a 6×6 grid of pyramidal
BFs are shown in Figure 7.3. The number MSE is the mean squared error
for a grid of 31×31 samples (computed as the sum of squared errors in (7.2) divided
by the number of samples). It represents the quality of the approximation: the lower
the MSE, the better the approximation.
[Figure 7.3 shows a surface plot of the pyramidal-BF (multilinear interpolation) approximation ĝ(x); MSE = 4035.]
Figure 7.3: Pyramidal-BF approximation of the Rosenbrock function.
The general form of (normalized) Gaussian radial BFs (RBFs) can be given as
follows:
  φ_i(x) = φ̄_i(x) / Σ_{i'=1}^{N} φ̄_{i'}(x),   φ̄_i(x) = exp( −(1/2) [x − c_i]^T B_i^{−1} [x − c_i] )        (7.3)
Here, φ̄_i are the nonnormalized RBFs, the vector c_i = [c_{i,1}, . . . , c_{i,D}]^T ∈ R^D is the
center of the i-th RBF, and the symmetric positive-definite matrix B_i ∈ R^{D×D} is its
width. Depending on the structure of the width matrix, RBFs of various shapes can
be obtained. For a general width matrix, the RBFs are elliptical and can be oriented
in any way. Axis-aligned RBFs are obtained if the width matrix is diagonal, i.e., if
B_i = diag(b_{i,1}, . . . , b_{i,D}). In this case, the width of an RBF can also be expressed
using a vector b_i = [b_{i,1}, . . . , b_{i,D}]^T. Furthermore, spherical RBFs are obtained if,
in addition, b_{i,1} = . . . = b_{i,D}. Figure 7.4 shows some spherical normalized RBFs
distributed on an equidistant 4 × 4 grid. The RBFs along the margins and in the
corners reach larger values due to the normalization (they have to compensate for
the smaller values of the other RBFs in these regions).
The results of approximating the Rosenbrock function with a 6×6 grid of equidistant
RBFs, shaped similarly to Figure 7.4, are shown in Figure 7.5. Because of the
wide RBFs, the approximation is smoother than that obtained with pyramidal BFs.
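As an illustrative sketch (the axis-aligned width convention and all names are choices made here), the normalized RBFs (7.3) can be evaluated as below; the resulting feature function can then be plugged into the least-squares fit sketched earlier.

```python
import numpy as np

def normalized_rbfs(x, centers, widths):
    """Evaluate the normalized Gaussian RBFs (7.3) at a point x.

    centers -- array of shape (N, D), one row per RBF center c_i
    widths  -- array of shape (N, D), diagonal widths b_i (axis-aligned RBFs)
    Returns the N-vector of normalized RBF values, which sums to 1.
    """
    diff = np.asarray(x) - centers                          # (N, D) differences x - c_i
    raw = np.exp(-0.5 * np.sum(diff**2 / widths, axis=1))   # nonnormalized RBFs
    return raw / np.sum(raw)                                # normalization
```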

Nonparametric approximation
Nonparametric approximators, despite their name, still have parameters. However,
unlike in the parametric case, the number of parameters, as well as the form of the
nonparametric approximator, are derived from the available data.
Figure 7.4: Two-dimensional RBFs.
[Figure 7.5 shows a surface plot of the Gaussian RBF approximation ĝ(x); MSE = 4263.]
Figure 7.5: Gaussian RBF approximation of the Rosenbrock function.
As an example of a nonparametric approximator, we consider local linear regression
(LLR), which is representative for the larger class of so-called memory-based
methods. With LLR, the user does not need to specify a global structure for the
approximator. Instead, observations of the function of interest are simply stored in a
memory. A stored observation is an input-output sample (x_{i_s}, g_{i_s}), consisting of a
D-dimensional input vector x_{i_s} and the scalar output g_{i_s}, where the sample index
i_s = 1, . . . , n_s. The samples are stored in a matrix called the memory M with size
(D+1) × n_s, whose columns each contain one sample (D inputs and 1 output).
Consider now that a query is made, that is, that the approximator must compute
ĝ(x_q) for some given x_q. This approximate value is computed by first finding a local
neighborhood of x_q in the samples stored in memory. This neighborhood is found by
applying a weighted distance metric to the query point x_q and the input data x_{i_s} of all
samples in M. An example is the weighted Euclidean distance √( [x_q − x_{i_s}]^T W [x_q − x_{i_s}] ).
The weighting W is used to scale and rotate the input space; it has a large influence
on the resulting neighborhood shape, and thus on the accuracy of the approximator.
The method selects a limited number K of samples with the smallest distances.
Only these K nearest neighbors are then used to approximate g. In particular, the
approximation is computed by fitting an affine (linear + constant) parametrization to
these nearest neighbors; that is, a local parametric approximator of the form:
  [x^T, 1] β
where the additional 1 provides the constant term. Note that β ∈ R^{D+1}.
Denote the indices of the nearest neighbors by i_1, . . . , i_K. First, a matrix A and a
vector b need to be constructed using the nearest neighbors:
  A = [ x^T_{i_1}  1; . . . ; x^T_{i_K}  1 ],   b = [ g_{i_1}; . . . ; g_{i_K} ]
Like for linear parametrizations above, A and b give a (typically) over-determined
set of equations for the parameters β:
  A β = b
which can be solved by the method of least squares. Finally, the parameters found
are used to compute the approximation for the query point:
  ĝ(x_q) = [x^T_q, 1] β
As a result, the globally nonlinear function is approximated locally by a linear function.
The approximator is nevertheless globally nonlinear.
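A hedged sketch of the LLR query just described (all names invented for illustration):

```python
import numpy as np

def llr_query(x_q, memory_x, memory_g, W, K):
    """Local linear regression: approximate g(x_q) from the K nearest samples.

    memory_x -- array (n_s, D) of stored inputs
    memory_g -- array (n_s,) of stored outputs
    W        -- (D, D) weighting matrix of the distance metric
    """
    diff = memory_x - np.asarray(x_q)
    dist = np.sqrt(np.einsum('nd,de,ne->n', diff, W, diff))  # weighted distances
    idx = np.argsort(dist)[:K]                               # K nearest neighbors
    A = np.hstack([memory_x[idx], np.ones((K, 1))])          # rows [x^T, 1]
    b = memory_g[idx]
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)             # local affine fit
    return np.append(x_q, 1.0) @ beta                        # [x_q^T, 1] beta
```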
Example 7.2 LLR approximation for the Rosenbrock function. We apply LLR
approximation to the function of Example 7.1, setting the number of nearest neighbors
K = 5 and using as memory the 200 random samples that we used before to
fit the linear approximators. The distance metric is weighted Euclidean, with the
weights selected to bring the input domain [−2, 2] × [−1.5, 1.5] into a unit range
[−1, 1] × [−1, 1]: W = diag(1/2, 1/1.5).
The resulting approximation is shown in Figure 7.6. Note that, due to a more
direct dependence on the random samples, the shape is less predictable than for the
parametric approximators.
7.2 Approximation in the context of DP and RL
Until now, we have looked at the representation dimension of function approxima-
tion: how to compactly represent Q-functions, V-functions, or policies. However,
[Figure 7.6 shows a surface plot of the LLR approximation ĝ(x); MSE = 4018.]
Figure 7.6: LLR approximation of the Rosenbrock function.
approximation in DP/RL is not only a problem of representation, and the algorithms
must also consider other challenges. We present next the core challenge of finding
greedy actions by solving maximization problems over the action variables.
Recall for example the Q-iteration update from Algorithm 3.1:
  Q_{ℓ+1}(x, u) ← ρ(x, u) + γ max_{u'} Q_ℓ(f(x, u), u')
and the policy improvement formula from policy iteration:
  h_{ℓ+1}(x) ∈ argmax_u Q^{h_ℓ}(x, u)
Both these formulas require searching for maximal Q-values over the action space. In
large or continuous action spaces, these maximization problems can only be solved
approximately. Since greedy actions have to be found for many values of the state,
the search must also be computationally efficient. To simplify this problem, many
algorithms discretize the action space into a small number of values, compute the
value function for all the discrete actions, and find the maximum among these values
using enumeration.
Q-iteration and policy iteration are just examples, and these maximizations appear
in most of the RL algorithms that have been presented in Part II. One exception
is given by policy search algorithms, which search the space of policies by using classical
optimization methods, rather than greedy policy improvements.
Example 7.3 Approximating Q-functions with state-dependent BFs and action
discretization. In this example we consider a discrete-action approximator, which
employs state-dependent BFs to approximate over the state space.
A discrete, finite set of actions u_1, . . . , u_M is chosen from the original action space
U. The resulting discretized action space is denoted by U_d = {u_1, . . . , u_M}. A number
N of state-dependent BFs φ̄_1, . . . , φ̄_N : X → R are defined and replicated for each
discrete action in U_d. Approximate Q-values are computed for any state-discrete
action pair with:
  Q̂(x, u_j; θ) = φ^T(x, u_j) θ        (7.4)
where, in the state-action BF vector φ(x, u_j), all the BFs that do not correspond to
the current discrete action are taken to be equal to 0:
  φ(x, u_j) = [0, . . . , 0, . . . , φ̄_1(x), . . . , φ̄_N(x), . . . , 0, . . . , 0]^T ∈ R^{NM}        (7.5)
where the block φ̄_1(x), . . . , φ̄_N(x) occupies the positions corresponding to u_j, and the blocks for all other discrete actions u_1, . . . , u_M are zero.
The parameter vector θ therefore also has NM elements. This type of approximator
can be seen as representing M distinct state-dependent slices through the Q-function,
one slice for each of the M discrete actions. Note that it is only meaningful to use such
an approximator for the discrete actions in U_d; for any other actions, the approximator
outputs 0.
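As an illustrative sketch of (7.4) and (7.5) (all names assumed), the state-action feature vector can be assembled as follows.

```python
import numpy as np

def state_action_features(state_bfs, x, j, M):
    """Build phi(x, u_j) as in (7.5): the state BFs placed in the block of
    discrete action u_j, zeros everywhere else.

    state_bfs -- function returning the N-vector [phi_bar_1(x), ..., phi_bar_N(x)]
    j         -- index of the discrete action (0-based here)
    M         -- number of discrete actions
    """
    phi_x = np.asarray(state_bfs(x))
    N = phi_x.size
    phi = np.zeros(N * M)
    phi[j * N:(j + 1) * N] = phi_x     # block corresponding to u_j
    return phi

# Q_hat(x, u_j; theta) is then the dot product phi(x, u_j) @ theta, as in (7.4).
```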
A different view on the benefits of approximation can be taken in the setting
of RL. Consider an algorithm that estimates Q-functions, such as Q-learning (Algorithm
5.1). Without approximation, the Q-value of every state-action pair would be
estimated separately (assuming here that it is possible to do so). If little or no data
were available for some states, their Q-values would be poorly estimated, and the
algorithm would make poor control decisions in those states. However, when approximation
is used, the approximator can be designed so that the Q-values of each state
influence the Q-values of other, usually nearby, states. Then, if good estimates of
the Q-values of a certain state are available, the algorithm can also make reasonable
control decisions in nearby states. This is called generalization, and can help algorithms
work well despite using only a limited number of samples. All this assumes,
of course, that the Q-function has a certain degree of smoothness.
7.3 Comparison of approximators
Because they are designed in advance, parametric approximators have to be flexible
enough to accurately model the target functions by only tuning the parameters.
Highly flexible, nonlinearly parameterized approximators are available, such as neural
networks. However, when approximating value functions, linear approximators
are often preferred, because they make it easier to analyze the resulting RL algorithms
(e.g. in order to provide convergence guarantees).
Linear approximators are specified by their BFs. When prior knowledge is not
available to guide the BF selection, which is usually the case in RL, a large number of
BFs must be defined to evenly cover the state-action space. This becomes problematic
in high-dimensional problems.
Nonparametric approximators are more flexible. For example, they adapt their
complexity to the amount of available data. However, because their shape depends
on the data, it may change while the DP/RL algorithm is running, which increases
the difficulty of providing convergence guarantees.
Chapter 8
Offline approximate reinforcement learning
In this chapter, offline algorithms for approximate RL are considered, in particular,
model-free, sample-based versions of the value iteration and policy iteration methods
introduced in Chapter 3.
8.1 Approximate value iteration
To make the model-free algorithms for approximate value iteration easier to under-
stand, we will introduce them gradually, starting from a model-based variant of Q-
iteration.
Model-based approximate Q-iteration
The model-based approximate Q-iteration that we introduce represents the Q-function
using a discrete-action approximator (Example 7.3). The action discretization is com-
bined with pyramidal BFs over the state space (Example 7.1). Because the pyramidal
BFs can be seen as membership functions in a fuzzy rule base, this algorithm has
sometimes been called fuzzy Q-iteration, and we will use this name here.
Algorithm 8.1 shows fuzzy Q-iteration.¹ Beyond the use of the approximate
Q-function, notice two other important differences from the original Q-iteration in
Algorithm 3.1. The first difference is that the maximization is only performed over
the discrete actions. The second difference is that parameter values are updated,
instead of directly updating the Q-function. All the NM parameters are updated in
one iteration of the algorithm.

¹ In this algorithm, the notation [i, j] represents the scalar index corresponding to i and j, which can
be computed as [i, j] = i + (j − 1)N. If the n elements of the BF vector were arranged into an N × M
matrix, by first filling in the first column with the first N elements, then the second column with the
subsequent N elements, and so on, then the element at index [i, j] of the vector would be placed at row i and
column j of the matrix.
Algorithm 8.1 Fuzzy Q-iteration.
Input: dynamics f, reward function ρ, discount factor γ,
  interpolation centers x_i, i = 1, . . . , N, set of discrete actions U_d, threshold ε_QI
1: initialize parameter vector, e.g., θ_0 ← 0
2: repeat at every iteration ℓ = 0, 1, 2, . . .
3:   for i = 1, . . . , N, j = 1, . . . , M do
4:     θ_{ℓ+1,[i,j]} ← ρ(x_i, u_j) + γ max_{j'} Q̂(f(x_i, u_j), u_{j'}; θ_ℓ)
5:   end for
6: until ||θ_{ℓ+1} − θ_ℓ|| ≤ ε_QI
Output: θ_{ℓ+1}
By carefully examining the parameter update, we notice that θ_{ℓ+1,[i,j]} is interpreted
as the updated Q-value of the state-action pair (x_i, u_j). Indeed, this interpretation
is valid due to the special nature of the Q-function approximator. First, we have
for the pyramidal BFs that φ̄_i(x_i) = 1 and φ̄_{i'}(x_i) = 0 for any i' ≠ i: that is, each BF
takes the value 1 at its center, and all the other BFs are zero there. After combining
with action discretization as in Example 7.3, a similar property holds for the resulting
state-action BFs: φ_{[i,j]}(x_i, u_j) = 1, and φ_{[i',j']}(x_i, u_j) = 0 whenever [i', j'] ≠ [i, j].
This property also means that the parameter-by-parameter update from Algorithm
8.1 is in fact the explicit solution of a least-squares problem:
  θ_{ℓ+1} ∈ argmin_θ Σ_{i=1,...,N; j=1,...,M} [ Q_{ℓ+1}(x_i, u_j) − Q̂(x_i, u_j; θ) ]^2        (8.1)
which requires minimizing the error between the approximate Q-function and the
updated Q-values, for the pairs (x_i, u_j):
  Q_{ℓ+1}(x_i, u_j) = ρ(x_i, u_j) + γ max_{j'} Q̂(f(x_i, u_j), u_{j'}; θ_ℓ)
In fact, the error in (8.1) is brought to exactly 0 for this particular combination of
approximator and set of samples. Among others, this is because the number n = NM
of free parameters is equal to the number of samples.
Finally, note that the stopping condition of fuzzy Q-iteration is specified in terms
of the difference between consecutive parameter vectors.
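To connect Algorithm 8.1 to code, a compact sketch follows; it assumes the dynamics f, the reward function rho, and the pyramidal state BFs are available as Python callables, and all names are illustrative. It uses the discrete-action approximator of Example 7.3, with Q̂(x, u_j; θ) = Σ_i φ̄_i(x) θ[i, j].

```python
import numpy as np

def fuzzy_q_iteration(f, rho, gamma, centers, actions, state_bfs, eps=1e-6):
    """Model-based fuzzy Q-iteration (cf. Algorithm 8.1).

    centers   -- list of N interpolation centers x_i
    actions   -- list of M discrete actions u_j
    state_bfs -- function x -> N-vector of pyramidal BF values
    Returns the parameter matrix theta of shape (N, M).
    """
    N, M = len(centers), len(actions)
    theta = np.zeros((N, M))

    def q_hat(th, x, j):                         # Q_hat(x, u_j; theta)
        return state_bfs(x) @ th[:, j]

    while True:
        new_theta = np.empty_like(theta)
        for i, x in enumerate(centers):
            for j, u in enumerate(actions):
                x_next = f(x, u)
                best = max(q_hat(theta, x_next, jp) for jp in range(M))
                new_theta[i, j] = rho(x, u) + gamma * best
        if np.max(np.abs(new_theta - theta)) <= eps:   # stopping condition
            return new_theta
        theta = new_theta
```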
Model-free approximate Q-iteration
In the offline model-free case, the transition dynamics f and the reward function ρ
are unknown.² Instead, only a batch of transition samples is available:
  {(x_{i_s}, u_{i_s}, x'_{i_s}, r_{i_s}) | i_s = 1, . . . , n_s}
where for every i_s, the next state x'_{i_s} and the reward r_{i_s} have been obtained as a result
of taking action u_{i_s} in the state x_{i_s}.
We present fitted Q-iteration, a model-free version of approximate Q-iteration
that employs such a batch of samples. To obtain fitted Q-iteration, two changes are
made in fuzzy Q-iteration. For the first change, we make use of the insight that
the fuzzy Q-iteration update is just a special case of a least-squares problem (8.1).
We generalize this problem to consider the samples (x_{i_s}, u_{i_s}) rather than the center-discrete
action pairs:
  θ_{ℓ+1} ∈ argmin_θ Σ_{i_s=1}^{n_s} [ Q_{ℓ+1}(x_{i_s}, u_{i_s}) − Q̂(x_{i_s}, u_{i_s}; θ) ]^2        (8.2)
Note we are no longer restricted to any particular approximator.
Second, because f and ρ are not available, the updated Q-function samples
Q_{ℓ+1}(x_{i_s}, u_{i_s}) are computed using the available data, making use of the fact that
ρ(x_{i_s}, u_{i_s}) = r_{i_s} and that f(x_{i_s}, u_{i_s}) = x'_{i_s}. We get:
  Q_{ℓ+1}(x_{i_s}, u_{i_s}) = r_{i_s} + γ max_{u'} Q̂(x'_{i_s}, u'; θ_ℓ)        (8.3)
Hence the updated Q-value can be computed exactly from the transition sample
(x_{i_s}, u_{i_s}, x'_{i_s}, r_{i_s}), without using f or ρ. Note the formula (8.3) can also directly be
used in the case of stochastic transitions, although there the next state and reward replacements
are no longer exact (they are merely samples from a random distribution).
Algorithm 8.2 presents fitted Q-iteration. Note that, if it were to use the approximator
and samples employed by fuzzy Q-iteration, fitted Q-iteration would in fact
reduce to this simpler algorithm.
We have introduced fitted Q-iteration in the parametric case, to clearly establish
its link with fuzzy Q-iteration. Neural networks are one class of parametric approximators
that have been combined with fitted Q-iteration, leading to the so-called neural
fitted Q-iteration. However, fitted Q-iteration is just as popular in combination with
nonparametric approximators.
In the discussion above, the samples were assumed given, and the sample collection
process was not considered explicitly. An important observation is that collecting
samples by using any deterministic policy h is insufficient, for the following reason.
² In control problems, of course, the reward function is usually under the control of the experimenter.
We take here the more general (AI) view that ρ is unknown.
Algorithm 8.2 Fitted Q-iteration.
Input: discount factor γ,
  Q-function approximator, samples {(x_{i_s}, u_{i_s}, x'_{i_s}, r_{i_s}) | i_s = 1, . . . , n_s}
1: initialize parameter vector, e.g., θ_0 ← 0
2: repeat at every iteration ℓ = 0, 1, 2, . . .
3:   for i_s = 1, . . . , n_s do
4:     Q_{ℓ+1}(x_{i_s}, u_{i_s}) ← r_{i_s} + γ max_{u'} Q̂(x'_{i_s}, u'; θ_ℓ)
5:   end for
6:   θ_{ℓ+1} ∈ argmin_θ Σ_{i_s=1}^{n_s} [ Q_{ℓ+1}(x_{i_s}, u_{i_s}) − Q̂(x_{i_s}, u_{i_s}; θ) ]^2
7: until θ_{ℓ+1} is satisfactory
Output: θ_{ℓ+1}
If only state-action pairs of the form (x, h(x)) were collected, no information about
pairs (x, u) with u ≠ h(x) would be available. As a result, the approximate Q-values
of such pairs would be poorly estimated. To alleviate this problem, exploration is
necessary, so that even if sample collection relies on a policy h, actions different
from h(x) are sometimes selected, e.g., in a random fashion.
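A generic sketch of Algorithm 8.2 follows (illustrative only, not the original implementation); the regression step is abstracted into a user-supplied fit function, so either a parametric or a nonparametric approximator can be plugged in.

```python
def fitted_q_iteration(samples, gamma, actions, fit, n_iter):
    """Fitted Q-iteration on a batch of transition samples (x, u, x_next, r).

    fit(inputs, targets) must return a callable q(x, u) approximating the
    targets in a least-squares sense (e.g. a regressor's predict method).
    """
    q = lambda x, u: 0.0                       # Q_hat at iteration 0
    inputs = [(x, u) for (x, u, x_next, r) in samples]
    for _ in range(n_iter):
        targets = [r + gamma * max(q(x_next, up) for up in actions)
                   for (x, u, x_next, r) in samples]      # Eq. (8.3)
        q = fit(inputs, targets)                          # Eq. (8.2)
    return q
```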
Example 8.1 Fitted Q-iteration for a DC motor. To show how approximate value
iteration can be used in practice, we provide an example involving the control of a
DC (direct current) motor. The second-order, discrete-time model of the DC motor
is:
  x_{k+1} = f(x_k, u_k) = A x_k + B u_k
  A = [ 1  0.0049; 0  0.9540 ],   B = [ 0.0021; 0.8505 ]        (8.4)
This was obtained by discretizing with the zero-order-hold method a continuous-time
model, for a sampling time of T_s = 0.005 s. Using saturation, the shaft angle x_{1,k} = α
is bounded to [−π, π] rad, the angular velocity x_{2,k} = α̇ to [−16π, 16π] rad/s, and the
control input u_k to [−10, 10] V. The control goal is to stabilize the DC motor in the
zero equilibrium (x = 0), and is expressed by the quadratic reward function:
  r_{k+1} = ρ(x_k, u_k) = −x_k^T Q_rew x_k − R_rew u_k^2
  Q_rew = [ 5  0; 0  0.01 ],   R_rew = 0.01        (8.5)
The discount factor is chosen to be γ = 0.95. A (near-)optimal policy will drive
the state (close) to 0, while also minimizing the magnitude of the states along the
trajectory and the control effort.
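A minimal simulation sketch of the model (8.4) and reward (8.5), using the bounds stated above (illustrative code, not from the original notes):

```python
import numpy as np

A = np.array([[1.0, 0.0049], [0.0, 0.9540]])
B = np.array([0.0021, 0.8505])
Q_REW = np.diag([5.0, 0.01])
R_REW = 0.01
X_MAX = np.array([np.pi, 16 * np.pi])   # saturation bounds on angle and velocity
U_MAX = 10.0                            # input bound [V]

def dc_motor_step(x, u):
    """One discrete-time step of the DC motor: returns next state and reward."""
    u = np.clip(u, -U_MAX, U_MAX)
    x_next = np.clip(A @ x + B * u, -X_MAX, X_MAX)   # Eq. (8.4) with saturation
    r = -(x @ Q_REW @ x) - R_REW * u**2              # quadratic reward, Eq. (8.5)
    return x_next, r
```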
Figure 8.1 presents a near-optimal solution, including a representative state-dependent
slice through the Q-function (obtained by setting the action argument u to 0),
a greedy policy in the Q-function, and a representative trajectory that is controlled by
this policy.³

[Figure 8.1, panels: (a) Slice through a near-optimal Q-function, for u = 0. (b) A near-optimal policy. (c) Controlled trajectory from x_0 = [−π, 0]^T.]
Figure 8.1: A near-optimal solution for the DC motor.
We apply fitted Q-iteration to the DC motor problem. The approximator consists
of an equidistant discretization of the action space into 5 elements, and LLR nonparametric
approximation over the state space (similarly to the parametric case, this
can be seen as maintaining a separate approximator for each discrete action). The
memory consists of 15000 random samples, uniformly distributed over the continuous-state,
discretized-action space. The distance metric is weighted Euclidean, with
the weights selected to bring the state values into the same range. Rather than solving
a least-squares problem to find parameters, like in Algorithm 8.2, here the Q-value
sample q_{i_s} for each state-action sample (x_{i_s}, u_{i_s}) in the LLR memory is directly updated
using the formula:
  q_{i_s} ← r_{i_s} + γ max_{u'} Q̂(x'_{i_s}, u')

³ To find this solution, fuzzy Q-iteration was applied with a very accurate approximator. Due to the
theoretical properties of fuzzy Q-iteration, this solution is guaranteed to be near-optimal.
Figure 8.2 shows the solution obtained. The structure of the policy is captured
well, although there is noise due to the random nature of the samples. There is significant
chattering of the control action.
[Figure 8.2, panels: (a) Slice through the Q-function for u = 0. (b) Policy. (c) Controlled trajectory from x_0 = [−π, 0]^T.]
Figure 8.2: Fitted Q-iteration solution for the DC motor.
The execution time of fitted Q-iteration was approximately 90 s.⁴ It is important
to note that the algorithm did not converge to a fixed solution: instead, after 63 iterations,
it reached an oscillatory regime between two nearby solutions to a precision of
0.001, which means that:
  max_{i_s} | q_{i_s, ℓ} − q_{i_s, ℓ−2} | ≤ 0.001
for ℓ = 62 and 63.
⁴ The execution times were recorded while running the algorithms in Matlab 7 on a PC with an Intel
Core 2 Duo T9550 2.66 GHz CPU and with 3 GB RAM.
8.2 Approximate policy iteration
Policy iteration algorithms evaluate policies by constructing their value functions,
and use these value functions to find new, improved policies. In large or continuous
spaces, at least the value function has to be approximated. Rather than also explicitly
representing policies (which by necessity would mean we have to approximate
them), we will assume that improved policies are computed on demand from the
value function, for every state where an action is required:
  h_{ℓ+1}(x) ∈ argmax_u Q̂^{h_ℓ}(x, u)        (8.6)
The approximate Q-function of the previous policy h_ℓ, Q̂^{h_ℓ}, implicitly defines via
(8.6) the new, improved policy. Such policy improvements can be solved efficiently
e.g. in the case of discrete-action Q-function approximators.
Algorithm 8.3 outlines a general template for approximate policy iteration in this
setting. The policy improvement step is made explicit for clarity, although in practice
the policy will be computed on-demand as explained above.

Algorithm 8.3 Approximate policy iteration.
1: initialize policy h_0
2: repeat at every iteration ℓ = 0, 1, 2, . . .
3:   find Q̂^{h_ℓ}, an approximate Q-function of h_ℓ   ▹ policy evaluation
4:   find h_{ℓ+1}(x) ∈ argmax_u Q̂^{h_ℓ}(x, u), ∀x ∈ X   ▹ policy improvement
5: until h_{ℓ+1} is satisfactory
Output: h_{ℓ+1}
approximate policy evaluation: nding an approximate value function for each policy
considered. This problem is similar in complexity to approximate value iteration.
In fact, a class of algorithms for approximate policy evaluation can be derived
along entirely similar lines to approximate value iteration, by simply using policy
evaluation updates (see e.g. Algorithm 3.4) instead of value iteration updates. So
we could derive, for example, fuzzy policy evaluation and tted policy evaluation
algorithms.
However, a more efcient, specialized framework for approximate policy eval-
uation can be developed when linearly parameterized approximators are used. By
exploiting the linearity of the approximator in combination with the linearity of the
Bellman equation for Q
h
, it is possible to derive a specic approximate form of this
Bellman equation, called the projected Bellman equation, which is linear in the
parameter vector. Efcient algorithms can be developed to solve this equation. In
contrast, in approximate value iteration, the maximum operator in the Bellman opti-
mality equation (for Q

) leads to nonlinearity even when the approximator is linearly


parameterized.
We next introduce the projected Bellman equation, along with a model-free pro-
cedure that approximately solves it. Finally, we present the policy iteration algorithm
that is obtained when using this procedure to evaluate policies.
Projected Bellman equation
Recall the policy evaluation mapping T^h (Section 3.2):
  [T^h(Q)](x, u) = ρ(x, u) + γ Q(f(x, u), h(f(x, u)))        (8.7)
using which the Bellman equation can be written:
  Q^h = T^h(Q^h)
A linear parametrization of the Q-function is employed:
  Q̂^h(x, u) = Q̂(x, u; θ^h) = φ^T(x, u) θ^h
We will require that the approximate Q-function satisfies the following approximate
version of the Bellman equation, called the projected Bellman equation:
  Q̂^h = (P_w ∘ T^h)(Q̂^h)        (8.8)
where P_w performs a weighted least-squares projection onto the space of approximate
Q-functions:
  [P_w(Q)](x, u) = φ^T(x, u) θ⁺,  where θ⁺ ∈ argmin_θ Σ_{(x,u) ∈ X×U} w(x, u) [ φ^T(x, u) θ − Q(x, u) ]^2        (8.9)
The weight function w : X × U → [0, 1] controls the distribution of the approximation
error, and must satisfy Σ_{x,u} w(x, u) = 1. Under appropriate conditions, the projected
Bellman equation has a unique solution.
Of course, the sum in the projection above only works if X and U are discrete.
However, the final resulting algorithms will be applicable without any change in continuous
state-action spaces, as well.
Figure 8.3 illustrates the projected Bellman equation.
Model-free solution and the least-squares policy iteration algorithm
By exploiting the linearity of the approximator in combination with the linearity of
T^h, the projected Bellman equation can eventually be written as a linear equation in
the parameter vector:
  A θ^h = b        (8.10)
[Figure 8.3, diagram: within the space of all Q-functions, the subspace of approximate Q-functions is shown; T^h maps a Q-function out of this subspace and P_w projects it back, with Q̂^h the fixed point of the composition.]
Figure 8.3: A conceptual illustration of the projected Bellman equation. Applying
T^h and then P_w to an ordinary approximate Q-function Q̂ leads to a different point
in the space of approximate Q-functions (left). In contrast, applying T^h and then P_w
to the fixed point Q̂^h of the projected Bellman equation leads back to the same point
(right).
We will not go into the derivations as they are quite involved; we direct the interested
reader to the survey (Busoniu et al., 2011) for a complete derivation. So, instead of
the original, high-dimensional Bellman equation, approximate policy evaluation only
needs to solve the low-dimensional system (8.10). A solution θ^h of this system leads
to an approximate Q-function Q̂(x, u; θ^h).
The matrix A ∈ R^{n×n} and the vector b ∈ R^n have a special structure which makes
it possible to estimate them incrementally, from transition samples. Consider a set of
transition samples: {(x_{i_s}, u_{i_s}, x'_{i_s}, r_{i_s}) | i_s = 1, . . . , n_s}. Using these samples, estimates
of A and b can be constructed as follows:
  A_0 = 0,  b_0 = 0
  A_{i_s} = A_{i_s−1} + φ(x_{i_s}, u_{i_s}) [ φ^T(x_{i_s}, u_{i_s}) − γ φ^T(x'_{i_s}, h(x'_{i_s})) ]
  b_{i_s} = b_{i_s−1} + φ(x_{i_s}, u_{i_s}) r_{i_s}        (8.11)
The so-called least-squares temporal difference algorithm processes the samples using
(8.11) and then solves the equation:
  (1/n_s) A_{n_s} θ̂^h = (1/n_s) b_{n_s}        (8.12)
to find an approximate parameter vector θ̂^h. The division by n_s does not change the
solution, but in practice it helps to increase the numerical stability of the algorithm
(the elements in A_{n_s} and b_{n_s} can be very large when n_s is large).
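The estimates (8.11) and the solve (8.12) amount to a few lines of linear algebra; the sketch below assumes a feature function phi(x, u) returning an n-vector and a policy h given as a function (all names are illustrative).

```python
import numpy as np

def lstd_q(samples, phi, h, gamma, n):
    """Least-squares temporal difference for Q^h, cf. (8.11)-(8.12).

    samples -- iterable of transitions (x, u, x_next, r)
    phi     -- feature function; phi(x, u) is an n-vector
    h       -- policy being evaluated; h(x) returns an action
    """
    A = np.zeros((n, n))
    b = np.zeros(n)
    n_s = 0
    for (x, u, x_next, r) in samples:
        f = phi(x, u)
        A += np.outer(f, f - gamma * phi(x_next, h(x_next)))   # Eq. (8.11)
        b += f * r
        n_s += 1
    theta = np.linalg.lstsq(A / n_s, b / n_s, rcond=None)[0]   # Eq. (8.12)
    return theta
```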
Placing least-squares temporal difference in the policy iteration framework (Al-
gorithm 8.3) results in the very popular least-squares policy iteration (LSPI) algo-
rithm. Algorithm 8.4 shows LSPI, in a simple variant that uses the same set of tran-
sition samples at every policy evaluation (note the sample index was dropped from
the estimates of A and b, for simplicity). In general, different sets of samples can be
used at different iterations.
Algorithm 8.4 Least-squares policy iteration.
Input: discount factor γ,
  BFs φ_1, . . . , φ_n : X × U → R, samples {(x_{i_s}, u_{i_s}, x'_{i_s}, r_{i_s}) | i_s = 1, . . . , n_s}
1: initialize policy h_0
2: repeat at every iteration ℓ = 0, 1, 2, . . .
3:   A ← 0, b ← 0   ▹ begin policy evaluation
4:   for i_s = 1, . . . , n_s do
5:     A ← A + φ(x_{i_s}, u_{i_s}) [ φ^T(x_{i_s}, u_{i_s}) − γ φ^T(x'_{i_s}, h_ℓ(x'_{i_s})) ]
6:     b ← b + φ(x_{i_s}, u_{i_s}) r_{i_s}
7:   end for
8:   solve (1/n_s) A θ_ℓ = (1/n_s) b, finding θ_ℓ   ▹ end policy evaluation
9:   h_{ℓ+1}(x) ∈ argmax_u φ^T(x, u) θ_ℓ, ∀x ∈ X   ▹ policy improvement
10: until h_{ℓ+1} is satisfactory
Output: h_{ℓ+1}
Similar remarks with respect to exploration while generating samples apply as in
the approximate value iteration case.
Example 8.2 Least-squares policy iteration for a DC motor. In this example, we
apply LSPI to the DC motor problem introduced in Example 8.1. The Q-function
is approximated by combining the same 5-action discretization as in Example 8.1
with Gaussian RBFs over the state space. The centers of the RBFs are arranged on a
15 × 9 grid equidistantly spaced in each dimension, they are identical in shape, and
their width along each dimension is proportional to the distance between adjacent
RBFs along that dimension (the grid step). These RBFs yield a smooth interpolation
of the Q-function over the state space. A similar set of n_s = 15000 samples is used as
for fitted Q-iteration. The initial policy h_0 is identically equal to 10 throughout the
state space.
Figure 8.4 shows the solution found by LSPI. In comparison to the nonparametric,
fitted Q-iteration solution of Figure 8.2, the RBF approximator enforces a smoother
policy, but one that also suffers from the limitations imposed by the shape of the
chosen RBFs. The controlled trajectory from [−π, 0]^T does not exhibit chattering as
in Figure 8.2, but it does have a small steady-state error.
In this problem, LSPI fully converged in 12 iterations (in general, it may oscillate,
like fitted Q-iteration). This number is significantly smaller than the 63 iterations
required by fitted Q-iteration to converge to an oscillatory regime. Such a
convergence rate advantage of policy iteration over value iteration is often observed
in practice. The execution time was approximately 118 s, larger than the 90 s of fitted
Q-iteration, despite the smaller number of iterations. This is because each iteration of
LSPI is more computationally intensive, consisting of an entire approximate policy
evaluation.

[Figure 8.4, panels: (a) Slice through the Q-function for u = 0. (b) Policy. (c) Controlled trajectory from x_0 = [−π, 0]^T.]
Figure 8.4: Results of LSPI for the DC motor.
8.3 Theoretical properties. Comparison of approximate
value and policy iteration
Under appropriate conditions, both approximate value iteration and approximate policy
iteration converge to a sequence of solutions that are within a bounded distance
from the optimum. This does not guarantee their convergence to a fixed point, as was
illustrated in Example 8.1, where fitted QI with LLR approximation converged to a
limit cycle. However, under additional restrictions on the approximator, approximate
value iteration can be guaranteed to converge to a fixed solution.
The (model-based) fuzzy Q-iteration algorithm guarantees this type of convergence,
and we will provide some details as an example. We introduce a number ε*
that describes the quality of the chosen approximator for the problem at hand, by
measuring the distance between the optimal Q-function and its best representation
that the approximator could potentially provide:
  ε* = min_θ || Q*(x, u) − Q̂(x, u; θ) ||_∞
Then, fuzzy Q-iteration converges to a parameter θ* providing an approximate Q-function
with a sub-optimality that depends on ε* (but can be larger than ε*!):
  || Q*(x, u) − Q̂(x, u; θ*) ||_∞ ≤ 2ε* / (1 − γ)
Fuzzy Q-iteration is also consistent: as the accuracy of the approximator increases
(more specifically, as the centers of the BFs and the discrete actions become infinitely
close together), the solution obtained converges to the true optimum Q*.
The results outlined above are asymptotic: they hold in the ideal case when the
number of iterations (and, for consistency, the power of the approximator) approach
infinity. A different class of results looks at what happens when the number of iterations and
samples provided to the algorithm are finite. Such finite-sample results are available
for some types of approximate value and policy iteration.
In practice, approximate policy iteration often converges in a smaller number
of iterations than approximate value iteration. However, this does not mean that
approximate policy iteration is computationally less demanding than approximate
value iteration, since approximate policy evaluation is a difficult problem by itself,
which must be solved at every single policy iteration. This relationship was illustrated
for the DC motor examples above.
Chapter 9
Online approximate reinforcement
learning
The previous chapter presented offline algorithms for approximate RL. Here, we will
describe some representative online algorithms. We will mainly be concerned with
temporal-difference-based methods, including Q-learning and SARSA, as well as a
representative algorithm from a new class of techniques: actor-critic learning.
9.1 Approximate Q-learning
Recall from Section 5.1 that classical Q-learning updates the Q-function with:
  Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α_k [ r_{k+1} + γ max_{u'} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ]
after observing the next state x_{k+1} and reward r_{k+1}, as a result of taking action u_k in
state x_k. A straightforward way to integrate approximation in Q-learning is by using
gradient descent. To use gradient updates, we need an approximator Q̂(x, u; θ) that is
differentiable in the parameters θ.
Assume for now that after taking action u_k in state x_k, the algorithm is provided
with the true optimal Q-value of the current state-action pair, Q*(x_k, u_k), in addition to
the next state x_{k+1} and reward r_{k+1}. Under these circumstances, the algorithm could
aim to minimize the squared error between this optimal value and the current Q-value
(divided by 2 to simplify the formula after differentiation):
  θ_{k+1} = θ_k − (1/2) α_k (∂/∂θ) [ Q*(x_k, u_k) − Q̂(x_k, u_k; θ_k) ]^2
          = θ_k + α_k [ Q*(x_k, u_k) − Q̂(x_k, u_k; θ_k) ] (∂/∂θ) Q̂(x_k, u_k; θ_k)
In this equation, the expression (∂/∂θ) Q̂(x_k, u_k; θ_k) denotes the gradient of the approximate
Q-function with respect to θ, evaluated for the value θ_k of the parameter vector.
Now, Q*(x_k, u_k) is not available, but it can be replaced by a Bellman estimate derived
from the data:
  r_{k+1} + γ max_{u'} Q̂(x_{k+1}, u'; θ_k)
Note the similarity with the Q-function samples (8.3) used in fitted Q-iteration. The
substitution leads to the approximate Q-learning update:
  θ_{k+1} = θ_k + α_k [ r_{k+1} + γ max_{u'} Q̂(x_{k+1}, u'; θ_k) − Q̂(x_k, u_k; θ_k) ] (∂/∂θ) Q̂(x_k, u_k; θ_k)        (9.1)
We have actually obtained, in the square brackets, an approximation of the temporal
difference. With a linearly parameterized approximator Q̂(x, u; θ) = φ^T(x, u) θ, the
gradient is very easy to compute: (∂/∂θ) [φ^T(x, u) θ] = φ(x, u), and the approximate Q-learning
update simplifies to:
  θ_{k+1} = θ_k + α_k [ r_{k+1} + γ max_{u'} φ^T(x_{k+1}, u') θ_k − φ^T(x_k, u_k) θ_k ] φ(x_k, u_k)        (9.2)
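As a small illustration (names assumed), the linear-features update (9.2) takes only a few lines of code:

```python
def approx_q_learning_update(theta, phi, x, u, r, x_next, actions, alpha, gamma):
    """Gradient-based Q-learning update (9.2) with linear features phi(x, u)."""
    q_next = max(phi(x_next, up) @ theta for up in actions)   # max_u' phi^T theta
    td = r + gamma * q_next - phi(x, u) @ theta               # approximate temporal difference
    return theta + alpha * td * phi(x, u)                     # Eq. (9.2)
```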
Of course, like any RL algorithm, approximate Q-learning requires exploration.
Algorithm 9.1 presents gradient-based Q-learning with a linear parametrization and
ε-greedy exploration. For an explanation of the exploration and learning rate variables,
see Sections 4.2 and 5.1 of Part II.

Algorithm 9.1 Approximate Q-learning.
Input: discount factor γ,
  BFs φ_1, . . . , φ_n : X × U → R,
  exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞
1: initialize parameter vector, e.g., θ_0 ← 0
2: measure initial state x_0
3: for every time step k = 0, 1, 2, . . . do
4:   u_k ← u ∈ argmax_ū φ^T(x_k, ū) θ_k   with probability 1 − ε_k (exploit),
        a uniformly random action in U   with probability ε_k (explore)
5:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
6:   θ_{k+1} ← θ_k + α_k [ r_{k+1} + γ max_{u'} φ^T(x_{k+1}, u') θ_k − φ^T(x_k, u_k) θ_k ] φ(x_k, u_k)
7: end for
9.2 Approximate SARSA
An approximate version of SARSA can be derived just like approximate Q-learning.
We start from classical SARSA, which uses data tuples (x_k, u_k, r_{k+1}, x_{k+1}, u_{k+1}) to
update the Q-function as follows:
  Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α_k [ r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) − Q_k(x_k, u_k) ]        (9.3)
where α_k is the learning rate. Linearly parameterized approximation is considered,
and by a derivation similar to that given for gradient-based Q-learning, the following
update rule is obtained:
  θ_{k+1} = θ_k + α_k [ r_{k+1} + γ φ^T(x_{k+1}, u_{k+1}) θ_k − φ^T(x_k, u_k) θ_k ] φ(x_k, u_k)        (9.4)
where the quantity in square brackets is an approximation of the temporal difference.
Algorithm 9.2 shows approximate SARSA with a linear parametrization and ε-greedy
exploration. As in the classical case, because the update at step k involves the
action u_{k+1}, this action has to be chosen prior to updating the Q-function.
Algorithm 9.2 Approximate SARSA.
Input: discount factor γ,
  BFs φ_1, . . . , φ_n : X × U → R,
  exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞
1: initialize parameter vector, e.g., θ_0 ← 0
2: measure initial state x_0
3: u_0 ← u ∈ argmax_ū φ^T(x_0, ū) θ_0   with probability 1 − ε_0 (exploit),
       a uniformly random action in U   with probability ε_0 (explore)
4: for every time step k = 0, 1, 2, . . . do
5:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
6:   u_{k+1} ← u ∈ argmax_ū φ^T(x_{k+1}, ū) θ_k   with probability 1 − ε_{k+1}
           a uniformly random action in U   with probability ε_{k+1}
7:   θ_{k+1} ← θ_k + α_k [ r_{k+1} + γ φ^T(x_{k+1}, u_{k+1}) θ_k − φ^T(x_k, u_k) θ_k ] φ(x_k, u_k)
8: end for
Recall that when u_k is chosen according to a fixed policy h, SARSA actually performs
policy evaluation. This means that a similar restriction of approximate SARSA
could be used in a policy iteration scheme, to perform approximate policy evaluation.
In fact, approximate SARSA can even be interpreted as an optimistic variant of approximate
policy iteration, which implicitly improves the policy after every sample
(the improvements are implicit in the application of the greedy policy).
9.3 Actor-critic
Until now, we have considered algorithms that represent policies implicitly, by finding
greedy actions on-demand for every state where an action choice is necessary. To
do this, these algorithms relied on efficient maximizations of the Q-function over the
action variable, typically achieved by using action discretization.
In this section, we focus on an algorithm that represents policies explicitly. This
algorithm belongs to the policy search class: it uses a parametric policy approximator,
and optimizes the parameters using classical optimization techniques. The policy
search approach has two important advantages in control:
• Since the action is computed directly by the policy, Q-function maximizations
(and thus discretization) are no longer required. This means continuous actions
are much easier to handle, which often helps in control problems.
• Prior knowledge about the policy can be incorporated in the chosen parametrization.
This is important because policy knowledge is often available (e.g. from
experienced operators) whereas value function knowledge is comparatively
difficult and unintuitive to obtain. As a simple example, if it is known that
the system can be reasonably controlled by a policy that is linear in the state
variables, a linear parametrization of the form x^T ϑ can be used.
Actor-critic algorithms are a specific type of policy search that use value functions in the search process. The name "actor-critic" comes from calling the approximate policy an "actor" and the approximate value function a "critic". Note that because greedy action choices are not needed, a V-function can be used instead of the Q-function. An advantage of V-functions is that they are easier to approximate, because they only depend on x and not on u.
Denote by ĥ(x; ϑ) the (deterministic) approximate policy, parameterized by ϑ ∈ R^N, and by V̂(x; θ) the approximate V-function, parameterized by θ ∈ R^N. We will perform gradient-based updates of ϑ and θ, so we require differentiable parametrizations. The algorithm does not distinguish between the value functions of different policies, so the value function notation is not superscripted by the policy. In the sequel, for simplicity, the action signal is assumed to be scalar, but the method can be extended to multiple action variables.
At each time step, an action u_k is chosen by adding a random, exploratory term to the action recommended by the policy ĥ(x; ϑ). This term could be drawn, e.g., from a zero-mean Gaussian distribution. After the transition from x_k to x_{k+1}, an approximate temporal difference (expressed in terms of the V-function) is computed with:

δ^{AC}_k = r_{k+1} + γ V̂(x_{k+1}; θ_k) − V̂(x_k; θ_k)
This temporal difference can be obtained from the Bellman equation for the policy V-function (2.12). It is analogous to the temporal difference for Q-functions, used in approximate SARSA above. Once the temporal difference is computed, the policy and V-function parameters are updated with the following gradient formulas:

ϑ_{k+1} = ϑ_k + α^A_k · ∂ĥ(x_k; ϑ_k)/∂ϑ · [u_k − ĥ(x_k; ϑ_k)] · δ^{AC}_k     (9.5)

θ_{k+1} = θ_k + α^C_k · ∂V̂(x_k; θ_k)/∂θ · δ^{AC}_k     (9.6)
where α^A_k and α^C_k are the (possibly time-varying) step sizes for the actor and the critic, respectively.
In the actor update (9.5), due to exploration, the actual action u_k applied at step k can be different from the action recommended by the policy. When the exploratory action u_k leads to a positive temporal difference, the policy is adjusted towards this action. Conversely, when δ^{AC}_k is negative, the policy is adjusted away from u_k. This is because the temporal difference is interpreted as a correction of the predicted performance, so that, e.g., if the temporal difference is positive, the obtained performance is considered to be better than the predicted one. In the critic update (9.6), the temporal difference takes the place of the prediction error V^h(x_k) − V̂(x_k; θ_k), where V^h(x_k) is the exact value of x_k, given the current policy h. Since this exact value is not available, it is replaced by the estimate r_{k+1} + γ V̂(x_{k+1}; θ_k) suggested by the Bellman equation (2.12), thus leading to the temporal difference.
This actor-critic method is summarized in Algorithm 9.3, which generates exploratory actions using a Gaussian density N(0, σ_k), with a standard deviation σ_k that can vary over time.
Algorithm 9.3 Actor-critic.
Input: discount factor γ,
    policy parametrization ĥ, V-function parametrization V̂,
    exploration schedule {σ_k}_{k=0}^∞, step size schedules {α^A_k}_{k=0}^∞, {α^C_k}_{k=0}^∞
1: initialize parameter vectors, e.g., ϑ_0 ← 0, θ_0 ← 0
2: measure initial state x_0
3: for every time step k = 0, 1, 2, . . . do
4:     u_k ← ĥ(x_k; ϑ_k) + ũ, where ũ ∼ N(0, σ_k)
5:     apply u_k, measure next state x_{k+1} and reward r_{k+1}
6:     δ^{AC}_k = r_{k+1} + γ V̂(x_{k+1}; θ_k) − V̂(x_k; θ_k)
7:     ϑ_{k+1} = ϑ_k + α^A_k · ∂ĥ(x_k; ϑ_k)/∂ϑ · [u_k − ĥ(x_k; ϑ_k)] · δ^{AC}_k
8:     θ_{k+1} = θ_k + α^C_k · ∂V̂(x_k; θ_k)/∂θ · δ^{AC}_k
9: end for
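The following Python sketch illustrates one actor-critic transition under simplifying assumptions that are not prescribed by the algorithm above: a scalar action, a policy linear in the state, ĥ(x; ϑ) = ϑ^T x (so that ∂ĥ/∂ϑ = x), and a linear critic V̂(x; θ) = θ^T ψ(x) with hypothetical state basis functions psi; all function and argument names are illustrative.

import numpy as np

# Minimal actor-critic sketch: linear actor h(x) = vartheta^T x, linear critic
# V(x) = theta^T psi(x), where psi(x) -> np.ndarray is a user-supplied BF vector.

def explore_action(x, vartheta, sigma, rng):
    """Policy action plus zero-mean Gaussian exploration (step 4 of the algorithm)."""
    return float(vartheta @ x) + sigma * rng.standard_normal()

def actor_critic_update(vartheta, theta, x, u, r, x_next, psi,
                        alpha_actor, alpha_critic, gamma):
    """Temporal difference (step 6) followed by the actor (9.5) and critic (9.6) updates."""
    delta = r + gamma * psi(x_next) @ theta - psi(x) @ theta
    vartheta = vartheta + alpha_actor * x * (u - vartheta @ x) * delta   # d h / d vartheta = x
    theta = theta + alpha_critic * psi(x) * delta                        # d V / d theta = psi(x)
    return vartheta, theta

The caller applies the exploratory action to the system, observes x_{k+1} and r_{k+1}, and then performs the two gradient updates with their separate step sizes.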
9.4 Discussion
Convergence guarantees are available for modified forms of approximate Q-learning,
SARSA, and actor-critic. For example, due to their gradient-based nature, actor-critic
algorithms will converge to a locally optimal policy parameter (which may neverthe-
less be far from the global optimum). Approximate Q-learning is proven to converge
under the severe restriction that the policy used to choose actions remains unchanged
during the learning process (which means the usual greedy action selection in the
current Q-function cannot be used).
All three online algorithms that we presented update the solution incrementally, using learning rates. The learning rate schedule significantly influences performance and can be difficult to tune. In particular, the actor-critic algorithm requires two learning rates, which must be carefully set considering their interplay. As in any RL algorithm, the exploration schedule is also very important.
It is important to note that offline algorithms can also be adapted to work online. For example, batch updates could be interspersed with episodes of online sample collection, which are used to refresh the sample database. In the context of policy iteration, optimistic policy improvements can be used, without waiting until the policy evaluation can find an accurate approximation for the value function of the current policy. Some of these adapted algorithms do not require learning rates and can therefore be easier to tune.
Recent research efforts focus on novel variants of actor-critic methods that exploit
better gradient updates to learn faster, as well as new temporal difference methods
that converge under less restrictive conditions.
Chapter 10
Accelerating online approximate
reinforcement learning
Like in the classical case, the temporal-difference methods for approximate RL pre-
sented above learn slowly due to their inefficient use of data. In this chapter, we
present two ways to accelerate learning in approximate RL, paralleling similar meth-
ods we presented in the classical case: eligibility traces and experience replay.
10.1 Eligibility traces
Consider an algorithm that learns Q-functions, such as Q-learning. In the classical
case, eligibility traces associated a trace e(x, u) to every state-action pair, and at every
step the enhanced algorithms updated the Q-values of all pairs proportionally to their
trace (Section 6.1).
In the approximate case, say for a parametric Q-function approximator Q̂(x, u; θ) with θ ∈ R^n, it is no longer possible to associate eligibility traces to all state-action pairs, for the same reason for which we cannot store their Q-values: the number of pairs is too large or infinite. Instead, we associate a trace to each parameter, so that e is a vector of the same size as θ: e ∈ R^n.
Let us combine the trace with gradient-based, approximate Q-learning. Recall that this algorithm adds to the parameter vector, at each step, the approximate temporal difference multiplied by the gradient at the last state-action pair:

θ_{k+1} = θ_k + α_k [r_{k+1} + γ max_{u'} Q̂(x_{k+1}, u'; θ_k) − Q̂(x_k, u_k; θ_k)] ∂Q̂(x_k, u_k; θ_k)/∂θ
Instead of using only the gradient for the last state-action pair, now we multiply the temporal difference with the eligibility trace:

θ_{k+1} = θ_k + α_k [r_{k+1} + γ max_{u'} Q̂(x_{k+1}, u'; θ_k) − Q̂(x_k, u_k; θ_k)] e_{k+1}     (10.1)
where the trace incrementally accumulates the gradients of all the pairs previously
encountered:
e_{k+1} = γλ e_k + ∂Q̂(x_k, u_k; θ_k)/∂θ     (10.2)
Here, λ ∈ [0, 1] is the decay rate of the eligibility trace.
The resulting algorithm, approximate Q(λ), is summarized in Algorithm 10.1, for the case of linearly parameterized approximation and ε-greedy exploration.
Algorithm 10.1 Approximate Q(λ).
Input: discount factor γ,
    BFs φ_1, . . . , φ_n : X × U → R,
    exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞, decay rate λ
1: initialize parameter vector, e.g., θ_0 ← 0, and eligibility trace, e_0 ← 0
2: measure initial state x_0
3: for every time step k = 0, 1, 2, . . . do
4:     u_k ← ū ∈ arg max_u [φ^T(x_k, u) θ_k]      with probability 1 − ε_k (exploit)
            a uniform random action in U           with probability ε_k (explore)
5:     apply u_k, measure next state x_{k+1} and reward r_{k+1}
6:     e_{k+1} = γλ e_k + φ(x_k, u_k)
7:     θ_{k+1} ← θ_k + α_k [r_{k+1} + γ max_{u'} φ^T(x_{k+1}, u') θ_k − φ^T(x_k, u_k) θ_k] e_{k+1}
8: end for
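A compact Python sketch of steps 6–7 of the algorithm above is given next, reusing the hypothetical conventions of the earlier Q-learning example (phi, actions); the function and argument names are assumptions made for illustration.

def q_lambda_update(theta, e, x, u, r, x_next, actions, phi, alpha, gamma, lam):
    """Trace accumulation (10.2) followed by the traced parameter update (10.1)."""
    e = gamma * lam * e + phi(x, u)                         # eligibility trace
    q_next = max(phi(x_next, u2) @ theta for u2 in actions)
    td = r + gamma * q_next - phi(x, u) @ theta
    theta = theta + alpha * td * e                          # every traced parameter is updated
    return theta, e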
Eligibility traces can be used in a similar way in other temporal-difference based algorithms, such as approximate SARSA or the policy evaluation component of actor-critic.
10.2 Experience replay
Approximate RL algorithms can be enhanced with experience replay (ER) exactly like in the classical case (Section 6.2): the data acquired during learning are stored and presented repeatedly to the underlying RL algorithm. Algorithm 10.2 exemplifies ER for approximate Q-learning. Note that the index k has been dropped from the parameter vector, since this vector is now updated several times per step.
The discussion from Section 6.2 regarding the order of replaying samples and its effect on the propagation of information remains valid. Here, information propagation is additionally helped by the generalization of the approximator. For example, to aggregate information across trajectories, they do not need to exactly intersect at some state. Instead, if they get close enough so that, through generalization, the Q-values of states in one trajectory influence the Q-values of states in other trajectories, information propagation takes place.
Algorithm 10.2 Approximate ER Q-learning.
Input: discount factor γ,
    BFs φ_1, . . . , φ_n : X × U → R,
    exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞,
    number of replays per time step N
1: initialize parameter vector, e.g., θ ← 0
2: initialize sample database, D ← ∅
3: measure initial state x_0
4: for every time step k = 0, 1, 2, . . . do
5:     u_k ← ū ∈ arg max_u [φ^T(x_k, u) θ]       with probability 1 − ε_k (exploit)
            a uniform random action in U          with probability ε_k (explore)
6:     apply u_k, measure next state x_{k+1} and reward r_{k+1}
7:     θ ← θ + α_k [r_{k+1} + γ max_{u'} φ^T(x_{k+1}, u') θ − φ^T(x_k, u_k) θ] φ(x_k, u_k)
8:     add sample to database: D ← D ∪ {(x_k, u_k, r_{k+1}, x_{k+1})}
9:     loop N times                    {experience replay}
10:        retrieve a sample (x, u, r, x′) from D
11:        θ ← θ + α_k [r + γ max_{u'} φ^T(x′, u′) θ − φ^T(x, u) θ] φ(x, u)
12:    end loop
13: end for
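One possible way to organize steps 7–12 of Algorithm 10.2 in Python is sketched below. It reuses the hypothetical q_learning_update function from the earlier Q-learning sketch, stores samples in a plain list, and replays them uniformly at random, which is only one of the replay orderings discussed in Section 6.2; all names are illustrative assumptions.

def er_q_learning_step(theta, database, x, u, r, x_next, actions, phi,
                       alpha, gamma, num_replays, rng):
    """One online update plus num_replays replayed updates from stored samples."""
    database.append((x, u, r, x_next))                      # store the new sample
    theta = q_learning_update(theta, x, u, r, x_next, actions, phi, alpha, gamma)
    for _ in range(num_replays):                            # experience replay
        xs, us, rs, xs_next = database[rng.integers(len(database))]
        theta = q_learning_update(theta, xs, us, rs, xs_next, actions, phi, alpha, gamma)
    return theta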
Bibliographical notes for Part III
The presentation in this part follows in part Chapter 3 of our book (Busoniu et al.,
2010a). For more material on neural networks, see e.g. (Bertsekas and Tsitsiklis,
1996, Chapter 3), while for nonparametric approximation, (Shawe-Taylor and Cris-
tianini, 2004) is recommended. Local linear regression is a special case of locally
weighted regression, surveyed by Atkeson et al. (1997).
Fuzzy Q-iteration is described in (Busoniu et al., 2010b) and fitted Q-iteration
experienced a resurgence after being combined with nonparametric approximation
by Ernst et al. (2005); however, the approximate value iteration principles underlying
both methods have been known for a long time. Least-squares policy iteration was
described by (Lagoudakis and Parr, 2003), and is based on the least-squares tempo-
ral difference method given by (Bradtke and Barto, 1996). To derive gradient-based
approximate Q-learning and SARSA, as well as their eligibility-trace variants, we
followed (Sutton and Barto, 1998, Chapter 8). While we introduced experience re-
play for the classical case, it was in fact proposed in the context of approximate RL,
by Lin (1992).
Bibliography
Atkeson, C. G., Moore, A. W., and Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57.
Busoniu, L., Babuška, R., De Schutter, B., and Ernst, D. (2010a). Reinforcement Learning and Dynamic Programming Using Function Approximators. Automation and Control Engineering. Taylor & Francis CRC Press.
Busoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2010b). Approximate dynamic programming with a fuzzy parameterization. Automatica, 46(5):804–814.
Busoniu, L., Lazaric, A., Ghavamzadeh, M., Munos, R., Babuška, R., and De Schutter, B. (2011). Least-squares methods for policy iteration. In Wiering, M. and van Otterlo, M., editors, Reinforcement Learning: State of the Art. Springer. Submitted.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556.
Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4):293–321. Special issue on reinforcement learning.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis.
Cambridge University Press.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction.
MIT Press.
Glossary
List of abbreviations
The following list collects, in alphabetical order, the abbreviations used in these lecture notes.
AI      artificial intelligence
BF basis function
DP dynamic programming
ER experience replay
LLR local linear regression
LSPI least-squares policy iteration
MDP Markov decision process
MC Monte Carlo
OP optimistic planning
RBF radial basis function
RL reinforcement learning
List of symbols and notations
The most important mathematical symbols and notations used in these lecture notes
are listed below, organized by topic.
General notations
|·|       absolute value (for numeric arguments); cardinality (for sets)
‖·‖_p     p-norm of the argument
⌊·⌋       the largest integer smaller than or equal to the argument (floor)
⌈·⌉       the smallest integer larger than or equal to the argument (ceiling)
Classical reinforcement learning and dynamic programming
x; X       state; state space
u; U       control action; action space
r          reward
f; f̃       deterministic transition function; stochastic transition function
ρ; ρ̃       reward function for deterministic transitions; reward function for stochastic transitions
h; h̃       deterministic control policy; stochastic control policy
R          return
γ          discount factor
k; K       discrete time index; discrete time horizon or index of final time step
T_s        sampling time
Q; V       Q-function; V-function
Q^h; V^h   Q-function of policy h; V-function of policy h
Q^*; V^*   optimal Q-function; optimal V-function
h^*        optimal policy
Q          set of all Q-functions
T          Q-iteration mapping
T^h        policy evaluation mapping for policy h
ℓ; L       primary iteration index; number of primary iterations
τ          secondary iteration index
α          learning rate (step size)
ε          exploration probability
λ          eligibility trace decay rate
Approximate dynamic programming and reinforcement learning
Q̂; V̂; ĥ    approximate Q-function; V-function; and policy
d; D       index of dimension (variable); number of dimensions
P          projection mapping
θ; φ       parameter vector; basis functions
n          number of parameters and basis functions
N          number of state-dependent basis functions
i          index of parameter and basis function (also, in the context of Q-function approximation, index of state-dependent basis function)
U_d        set of discrete actions
M          number of discrete actions
j          index of discrete action
ϑ          policy parameter vector
N          number of policy parameters
n_s        number of samples
i_s        sample index
A          matrix on the left-hand side of a linear system (such as the projected Bellman equation)
b          vector on the right-hand side of a linear system (such as the projected Bellman equation)