A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
Abstract
Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their search-directions in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.
1 Introduction
The objective is made finite through the introduction of a discount factor, γ ∈ [0, 1), which scales the rewards of the various time-points in a geometric manner. Writing the objective function and trajectory distribution directly in terms of the parameter vector then, for any w ∈ W, the objective function takes the form

$$U(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;w)}\bigl[\gamma^{t-1} R(a,s)\bigr], \qquad (1)$$
where we have denoted the parameter space by W and have used the notation p_t(a, s; w) to represent the marginal p(s_t = s, a_t = a; w) of the joint state-action trajectory distribution

$$p(a_{1:H}, s_{1:H}; w) = \pi(a_H|s_H; w)\left[\prod_{t=1}^{H-1} p(s_{t+1}|a_t, s_t)\,\pi(a_t|s_t; w)\right] p_1(s_1), \qquad H \in \mathbb{N}. \qquad (2)$$
Note that the policy is now written in terms of its parametric representation, π(a|s; w).
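To make the objective concrete, (1) can be estimated by Monte-Carlo rollouts under the parametric policy. The following is a minimal sketch (not from the original paper), assuming user-supplied callables policy_sample, env_step, env_reset and reward, and truncating the infinite sum at a finite horizon:

```python
def estimate_return(policy_sample, env_step, env_reset, reward,
                    w, gamma=0.95, n_traj=100, horizon=200):
    """Monte-Carlo estimate of the discounted objective U(w) in (1).

    policy_sample(s, w) -> action sampled from pi(a|s; w)
    env_step(s, a)      -> next state sampled from p(s'|a, s)
    env_reset()         -> initial state sampled from p_1(s_1)
    reward(a, s)        -> immediate reward R(a, s)
    All four callables are assumed to be supplied by the user.
    """
    total = 0.0
    for _ in range(n_traj):
        s = env_reset()
        discount = 1.0                    # gamma^(t-1), with t starting at 1
        for _ in range(horizon):          # truncate the infinite sum in (1)
            a = policy_sample(s, w)
            total += discount * reward(a, s)
            s = env_step(s, a)
            discount *= gamma
    return total / n_traj
```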
It is well known that the global optimum of (1) can be obtained through dynamic programming, see
e.g. [5]. However, due to various issues, such as prohibitively large state-action spaces or highly
non-linear transition dynamics, it is not possible to find the global optimum of (1) in most real-world
problems of interest. Instead most research in this area focuses on obtaining approximate solutions,
for which there exist numerous techniques, such as approximate dynamic programming methods [6],
Monte-Carlo tree search methods [19] and policy search methods, both parametric [27, 21, 16, 18]
and non-parametric [2, 25].
This work is focused solely on parametric policy search methods, by which we mean gradient-based
methods, such as steepest and natural gradient ascent [23, 1], along with Expectation Maximisation
[11], which is a bound optimisation technique from the statistics literature. Since their introduction
[14, 31, 10, 16] these methods have been the centre of a large amount of research, with much of it
focusing on gradient estimation [21, 4], variance reduction techniques [30, 15], function approximation techniques [27, 8, 20] and real-world applications [18, 26]. While steepest gradient ascent has enjoyed some success it is known to suffer from some substantial issues that often make it unattractive in practice, such as slow convergence and susceptibility to poor scaling of the objective function
[23]. Various optimisation methods have been introduced as an alternative, most notably natural
gradient ascent [16, 24, 3] and Expectation Maximisation [18, 28], which are currently the methods
of choice among parametric policy search algorithms. In this paper our primary focus is on the
search-direction (in the parameter space) of these two methods.
where we use the expectation notation E[·] to denote the integral/summation w.r.t. a non-negative
function. The term p_γ(z; w) is a geometrically weighted average of state-action occupancy marginals, given by

$$p_\gamma(z; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(z; w),$$
while the term Q(z; w) is referred to as the state-action value function and is equal to the total expected future reward from the current time-point onwards, given the current state-action pair, z, and parameter vector, w, i.e.

$$Q(z; w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(z'; w)}\bigl[\gamma^{t-1} R(z') \,\big|\, z_1 = z\bigr].$$
This is a standard result and due to reasons of space we have omitted the details, but see e.g. [27] or
section(6.1) of the supplementary material for more details.
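For a small tabular MDP both quantities can be computed exactly by solving linear systems: stacking the state-action pairs z = (s, a) gives p_γ = (I − γ P_z^T)^{-1} p_1(z) and Q = (I − γ P_z)^{-1} r, where P_z is the state-action transition matrix induced by the policy and p_1(z) is the initial state-action distribution. The sketch below is illustrative only; the array layout and the names P, pi, R, p1 are assumptions rather than anything specified in the paper.

```python
import numpy as np

def occupancy_and_Q(P, pi, R, p1, gamma):
    """Exact p_gamma(z; w) and Q(z; w) for a small tabular MDP.

    P  : (S, A, S) array, P[s, a, s'] = p(s' | a, s)   (assumed layout)
    pi : (S, A) array, pi[s, a] = pi(a | s; w)
    R  : (S, A) array, R[s, a] = R(a, s)
    p1 : (S,)  array, initial state distribution p_1
    """
    S, A = pi.shape
    # Transition matrix on state-action pairs z = (s, a):
    # Pz[(s,a),(s',a')] = p(s'|a,s) * pi(a'|s')
    Pz = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    r = R.reshape(S * A)
    # Q solves (I - gamma * Pz) Q = r
    Q = np.linalg.solve(np.eye(S * A) - gamma * Pz, r)
    # p_gamma = sum_t gamma^(t-1) p_t = (I - gamma * Pz^T)^(-1) p_1(z)
    p1z = (p1[:, None] * pi).reshape(S * A)
    p_gamma = np.linalg.solve(np.eye(S * A) - gamma * Pz.T, p1z)
    return p_gamma.reshape(S, A), Q.reshape(S, A)
```

With these quantities in hand the objective can be recovered either as Σ_z p_γ(z; w) R(z) or as Σ_z p_1(z) Q(z; w), which provides a useful consistency check.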
An immediate issue concerning updates of the form (3) is in the selection of the ‘optimal’ choice
of the matrix M(w), which clearly depends on the sense in which optimality is defined. There
are numerous reasonable properties that are desirable of such an update, including the numerical
stability and computational complexity of the parameter update, as well as the rate of convergence
of the overall algorithm resulting from these updates. While these are all reasonable criteria, the rate of convergence is of such importance in an optimisation algorithm that it is a logical starting point in our analysis. For this reason we concern ourselves with relating these two parametric policy search algorithms to the Newton method, which has the highly desirable property of having a quadratic rate
of convergence in the vicinity of a local optimum. The Newton method is well-known to suffer from
problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for
theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the
Newton method in more detail in section(3). In the Newton method the matrix M(w) is set to the
negative inverse Hessian, i.e.
$$M(w) = -H^{-1}(w), \quad \text{where } H(w) = \nabla_w \nabla_w^{\mathsf{T}} U(w)$$

denotes the Hessian. Using methods similar to those used to calculate the
gradient, it can be shown that the Hessian takes the form
$$H(w) = H_1(w) + H_2(w), \qquad (5)$$

where

$$H_1(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p(z_{1:t}; w)}\bigl[\gamma^{t-1} R(z_t)\, \nabla_w \log p(z_{1:t}; w)\, \nabla_w^{\mathsf{T}} \log p(z_{1:t}; w)\bigr], \qquad (6)$$

$$H_2(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p(z_{1:t}; w)}\bigl[\gamma^{t-1} R(z_t)\, \nabla_w \nabla_w^{\mathsf{T}} \log p(z_{1:t}; w)\bigr]. \qquad (7)$$
We have omitted the details of the derivation, but these can be found in section(6.2) of the sup-
plementary material, with a similar derivation of a sample-based estimate of the Hessian given in
[4].
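A sample-based estimate of these two terms can be accumulated from rollouts in the same way as the likelihood-ratio gradient estimate, using the fact that ∇_w log p(z_{1:t}; w) = Σ_{τ=1}^{t} ∇_w log π(a_τ|s_τ; w) because the transition dynamics do not depend on w. The sketch below is a hypothetical implementation; grad_log_pi and hess_log_pi are assumed, user-supplied derivatives of the log-policy.

```python
import numpy as np

def hessian_terms(trajectories, grad_log_pi, hess_log_pi, w, gamma):
    """Sample-based estimates of H1(w) and H2(w) in (6) and (7).

    trajectories : list of [(s_1, a_1, r_1), (s_2, a_2, r_2), ...]
    grad_log_pi(s, a, w) -> (n_w,)      gradient of log pi(a|s; w)   (assumed)
    hess_log_pi(s, a, w) -> (n_w, n_w)  Hessian of log pi(a|s; w)    (assumed)
    """
    n_w = len(w)
    H1 = np.zeros((n_w, n_w))
    H2 = np.zeros((n_w, n_w))
    for traj in trajectories:
        g_sum = np.zeros(n_w)           # grad of log p(z_{1:t}; w): policy terms only
        h_sum = np.zeros((n_w, n_w))    # Hessian of log p(z_{1:t}; w)
        discount = 1.0
        for (s, a, r) in traj:
            g_sum += grad_log_pi(s, a, w)
            h_sum += hess_log_pi(s, a, w)
            H1 += discount * r * np.outer(g_sum, g_sum)
            H2 += discount * r * h_sum
            discount *= gamma
    return H1 / len(trajectories), H2 / len(trajectories)
```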
To overcome some of the issues that can hinder steepest gradient ascent an alternative, natural
gradient, was introduced in [16]. Natural gradient ascent techniques originated in the neural network
and blind source separation literature, see e.g. [1], and take the perspective that the parameter space
has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest
ascent direction of U (w) w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t.
the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent.
We denote the quadratic form that induces this local norm on the parameter manifold by G(w), i.e.
d(w)^2 = w^T G(w) w. The derivation for natural gradient ascent is well-known, see e.g. [1], and its application to the objective (1) results in a parameter update of the form

$$w_{k+1} = w_k + \alpha_k G^{-1}(w_k)\, \nabla_w U(w_k).$$

In terms of (3) this corresponds to M(w) = G^{-1}(w). In the case of MDPs the most commonly used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. [3, 24], and due to the Markovian structure of the dynamics it is given by

$$G(w) = -\mathbb{E}_{p_\gamma(z; w)}\bigl[\nabla_w \nabla_w^{\mathsf{T}} \log \pi(a|s; w)\bigr]. \qquad (8)$$
We note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see
e.g. [24], but we do not use it in this work.
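For completeness, a single natural gradient ascent update built from (8) might look as follows in the tabular setting sketched earlier; grad_log_pi, hess_log_pi and the step-size alpha are again assumptions rather than part of the paper.

```python
import numpy as np

def natural_gradient_step(w, p_gamma, Q, grad_log_pi, hess_log_pi,
                          states, actions, alpha=0.1):
    """One natural gradient ascent update using the Fisher matrix (8).

    p_gamma, Q : (S, A) arrays, e.g. from the tabular routine sketched earlier
    grad_log_pi(s, a, w), hess_log_pi(s, a, w) : assumed user-supplied
    derivatives of the log-policy; alpha is the step-size.
    """
    n_w = len(w)
    grad = np.zeros(n_w)
    G = np.zeros((n_w, n_w))
    for s in states:
        for a in actions:
            # policy gradient: expectation of grad log pi w.r.t. p_gamma(z; w) Q(z; w)
            grad += p_gamma[s, a] * Q[s, a] * grad_log_pi(s, a, w)
            # Fisher matrix (8): minus the expectation of the Hessian of log pi
            # w.r.t. p_gamma(z; w)
            G -= p_gamma[s, a] * hess_log_pi(s, a, w)
    return w + alpha * np.linalg.solve(G, grad)
```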
In order to relate natural gradient ascent to the Newton method we first rewrite the matrix (7) into
the following form
$$H_2(w) = \mathbb{E}_{p_\gamma(z; w) Q(z; w)}\bigl[\nabla_w \nabla_w^{\mathsf{T}} \log \pi(a|s; w)\bigr]. \qquad (9)$$
For reasons of space the details of this reformulation of (7) are left to section(6.2) of the supplemen-
tary material. Comparing the Fisher information matrix (8) with the form of H2 (w) given in (9) it is
clear that natural gradient ascent has a relationship with the approximate Newton method that uses
H_2(w) in place of H(w). In terms of (3) this approximate Newton method corresponds to setting M(w) = -H_2^{-1}(w). In particular it can be seen that the difference between the two methods lies in the non-negative function w.r.t. which the expectation is taken in (8) and (9). (It also appears that there is a difference in sign, but observing the form of M(w) for each algorithm shows that this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. the geometrically weighted summation of state-action occupancy marginals of the trajectory distribution, while in H_2(w) there is an additional weighting from the state-action value function. Hence, H_2(w)
incorporates information about the reward structure of the objective function, whereas the Fisher
information matrix does not, and so it will generally contain more information about the curvature
of the objective function.
where H_entropy is the entropy function and q(z_{1:t}, t) is known as the 'variational distribution'. Further details of the EM-algorithm for MDPs and a derivation of (10) are given in section(6.3) of the supplementary material or can be found in e.g. [18, 28]. The parameter update of the EM-algorithm is given by the maximum (w.r.t. w) of the 'energy' term,

$$Q(w, w_k) = \mathbb{E}_{p_\gamma(z; w_k) Q(z; w_k)}\bigl[\log \pi(a|s; w)\bigr]. \qquad (11)$$

Note that Q is a two-parameter function, where the first parameter occurs inside the expectation and the second parameter defines the non-negative function w.r.t. which the expectation is taken. This decoupling allows the maximisation over w to be performed explicitly in many cases of interest. For example, when the log-policy is quadratic in w the maximisation problem is equivalent to a least-squares problem and the optimum can be found through solving a linear system of equations.
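As an illustration of such an explicit maximisation (not taken from the paper), consider a Gaussian policy whose mean is linear in state features, as used later in section(3), with a fixed covariance. Maximising the energy term over the gain matrix K is then a weighted least-squares problem; the sample weights standing in for p_γ(z; w_k)Q(z; w_k) and the variable names below are assumptions.

```python
import numpy as np

def em_mstep_gaussian_mean(phis, actions, weights):
    """Explicit M-step of (11) for a Gaussian policy with mean K phi(s) and
    fixed covariance: a weighted least-squares problem.

    phis    : (N, d) array of feature vectors phi(s_i)
    actions : (N, k) array of actions a_i
    weights : (N,)   non-negative weights approximating p_gamma(z; w_k) Q(z; w_k)
              (e.g. gamma^(t-1) times a return estimate; assumed non-negative)
    Returns the new gain matrix K maximising the energy term.
    """
    W = weights[:, None]
    A = phis.T @ (W * phis)          # sum_i w_i phi_i phi_i^T
    B = actions.T @ (W * phis)       # sum_i w_i a_i phi_i^T
    # Solve K A = B  =>  K = B A^{-1}
    return np.linalg.solve(A.T, B.T).T
```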
It has previously been noted, again see e.g. [18], that the parameter update of steepest gradient
ascent and the EM-algorithm can be related through this ‘energy’ term. In particular the gradient
(4) evaluated at w_k can also be written as follows: ∇_{w|w=w_k} U(w) = ∇^{10}_{w|w=w_k} Q(w, w_k), where we use the notation ∇^{10}_w to denote the first derivative w.r.t. the first parameter, while the update of the EM-algorithm is given by w_{k+1} = argmax_{w∈W} Q(w, w_k). In other words, steepest gradient ascent moves in the direction that most rapidly increases Q(w, w_k) w.r.t. the first variable, while the EM-algorithm maximises Q(w, w_k) w.r.t. the first variable. While this relationship is true, it is also quite a negative result. It states that, in situations where it is not possible to explicitly perform the maximisation over w in (11), the alternative, in terms of the EM-algorithm, is a generalised EM-algorithm that is equivalent to steepest gradient ascent. Considering that algorithms such as
EM are typically considered because of the negative aspects related to steepest gradient ascent this
is an undesirable alternative. It is possible to find the optimum of (11) numerically, but this is also
undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally,
this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of
its parameter update, when the maximisation over w in (11) can be performed explicitly.
Instead we provide the following result, which shows that the step-direction of the EM-algorithm
has an underlying relationship with the Newton method. In particular we show that, under suitable regularity conditions, the direction of the EM-update, i.e. w_{k+1} − w_k, is the same, up to first order,
as the direction of an approximate Newton method that uses H2 (w) in place of H(w).
Theorem 1. Suppose we are given a Markov Decision Process with objective (1) and Markovian
trajectory distribution (2). Consider the update of the parameter through Expectation Maximisation
at the k-th iteration of the algorithm, i.e.

$$w_{k+1} = \operatorname{argmax}_{w \in W} Q(w, w_k).$$

Provided that Q(w, w_k) is twice continuously differentiable in the first parameter we have that

$$w_{k+1} - w_k = -H_2^{-1}(w_k)\, \nabla_{w|w=w_k} U(w) + O(\|w_{k+1} - w_k\|^2). \qquad (12)$$
Additionally, in the case where the log-policy is quadratic the relation to the approximate Newton method is exact, i.e. the second term on the r.h.s. of (12) is zero.
Proof. The idea of the proof is simple and only involves performing a Taylor expansion of ∇^{10}_w Q(w, w_k). As Q is assumed to be twice continuously differentiable in the first component this Taylor expansion is possible and gives

$$\nabla^{10}_w Q(w_{k+1}, w_k) = \nabla^{10}_w Q(w_k, w_k) + \nabla^{20}_w Q(w_k, w_k)(w_{k+1} - w_k) + O(\|w_{k+1} - w_k\|^2). \qquad (13)$$

As w_{k+1} = argmax_{w∈W} Q(w, w_k) it follows that ∇^{10}_w Q(w_{k+1}, w_k) = 0. This means that, upon ignoring higher order terms in w_{k+1} − w_k, the Taylor expansion (13) can be rewritten into the form

$$w_{k+1} - w_k = -\nabla^{20}_w Q(w_k, w_k)^{-1}\, \nabla^{10}_w Q(w_k, w_k). \qquad (14)$$

The proof is completed by observing that ∇^{10}_w Q(w_k, w_k) = ∇_{w|w=w_k} U(w) and ∇^{20}_w Q(w_k, w_k) = H_2(w_k). The second statement follows because in the case where the log-policy is quadratic the higher order terms in the Taylor expansion vanish.
2.3 Summary
In this section we have provided a novel analysis of both natural gradient ascent and Expectation
Maximisation when applied to the MDP framework. Previously, while both of these algorithms have
proved popular methods for MDP optimisation, there has been little understanding of them in terms
of their search-direction in the parameter space or their relation to the Newton method. Firstly, our
analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar
to H2 (w) in (5) with the exception that the information about the reward structure of the problem
is not contained in the Fisher information matrix, while such information is contained in H2 (w).
Additionally we have shown that the step-direction of the EM-algorithm is, up to first order, that of an approximate Newton method that uses H_2(w) in place of H(w) and employs a constant step-size of one.
3 An Approximate Newton Method

A natural follow-on from the analysis in section(2) is the consideration of using M(w) = −H_2^{-1}(w)
in (3), a method we call the full approximate Newton method from this point onwards. In this section
we show that this method has many desirable properties that make it an attractive alternative to other
parametric policy search methods. Additionally, denoting the diagonal matrix formed from the
diagonal elements of H_2(w) by D_2(w), we shall also consider the method that uses M(w) = −D_2^{-1}(w) in (3). We call this second method the diagonal approximate Newton method.
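In sketch form, and assuming that the gradient of U and an estimate of H_2(w) at the current parameters are available, the two updates look as follows (the function and argument names are illustrative only):

```python
import numpy as np

def approx_newton_step(w, grad, H2, alpha=1.0, diagonal=False):
    """One parameter update of the form (3) with M(w) = -H2(w)^{-1} (full) or
    M(w) = -D2(w)^{-1} (diagonal), where H2 is the approximate Hessian (7).

    grad : gradient of U at w;  H2 : approximate Hessian at w (assumed given,
    e.g. from a sample-based estimator).
    """
    if diagonal:
        d2 = np.diag(H2)                              # D2(w): diagonal of H2(w)
        return w - alpha * grad / d2                  # step  -alpha * D2^{-1} grad
    return w - alpha * np.linalg.solve(H2, grad)      # step  -alpha * H2^{-1} grad
```

With H_2(w) negative-definite, −H_2^{-1}(w) and −D_2^{-1}(w) are positive-definite, so both directions are ascent directions for a sufficiently small step-size.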
Recall that in (3) it is necessary that M(w) is positive-definite (in the Newton method this corre-
sponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In
general the objective (1) is not concave, which means that the Hessian will not be negative-definite
over the entire parameter space. In such cases the Newton method can actually lower the objective
and this is an undesirable aspect of the Newton method. An attractive property of the approximate
Hessian, H_2(w), is that it is always negative-definite when the policy is log-concave in the policy parameters. This fact follows from the observation that in such cases H_2(w) is a non-negative mixture of negative-definite matrices, which again is negative-definite [9]. Additionally, the diagonal terms of a negative-definite matrix are negative and so D_2(w) is also negative-definite when the
controller is log-concave.
To motivate this result we now briefly consider some widely used policies that are either log-concave or blockwise log-concave. Firstly, consider the Gibbs policy, π(a|s; w) ∝ exp(w^T φ(a, s)), where φ(a, s) ∈ R^{n_w} is a feature vector. This policy is widely used in discrete systems and is log-concave in w, which can be seen from the fact that log π(a|s; w) is the sum of a linear term and a negative log-sum-exp term, both of which are concave [9]. In systems with a continuous state-action space a common choice of controller is π(a|s; w_mean, Σ) = N(a | Kφ(s) + m, Σ(s)), where w_mean = {K, m} and φ(s) ∈ R^{n_w} is a feature vector. The notation Σ(s) is used because there are cases where it is beneficial to have state-dependent noise in the controller. This controller is not jointly log-concave in w_mean and Σ, but it is blockwise log-concave in w_mean and Σ^{-1}. In terms of w_mean the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of Σ^{-1} the log-policy consists of a linear term and a log-determinant term, both of which are concave.
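This log-concavity can also be checked numerically: for the Gibbs policy the Hessian of log π(a|s; w) w.r.t. w equals minus the covariance of the feature vectors under the policy, which is negative semi-definite for every w. A small sketch (the feature array Phi is an assumed input) is given below.

```python
import numpy as np

def gibbs_logpolicy_hessian(w, Phi):
    """Hessian of log pi(a|s; w) for the Gibbs policy pi(a|s; w) ~ exp(w^T phi(a, s)).

    Phi : (n_actions, n_w) array whose rows are the feature vectors phi(a, s)
          for a fixed state s (assumed given).
    The result is minus the covariance of the features under the policy,
    hence negative semi-definite for every w.
    """
    logits = Phi @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    mean = p @ Phi
    cov = (Phi - mean).T @ (p[:, None] * (Phi - mean))
    return -cov

# Quick numerical check: eigenvalues are non-positive for random features.
rng = np.random.default_rng(0)
H = gibbs_logpolicy_hessian(rng.normal(size=5), rng.normal(size=(8, 5)))
assert np.all(np.linalg.eigvalsh(H) <= 1e-10)
```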
In terms of evaluating the search direction it is clear from the forms of D2 (w) and H2 (w) that
many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton
framework with little difficulty. In particular, gradient evaluation requires calculating the expectation
of the derivative of the log-policy w.r.t. pγ (z; w)Q(z; w). In terms of inference the only additional
calculation necessary to implement either the full or diagonal approximate Newton methods is the
calculation of the expectation (w.r.t. the same function) of the Hessian of the log-policy, or its
diagonal terms. As an example in section(6.5) of the supplementary material we detail the extension
of the recurrent state formulation of gradient evaluation in the average reward framework, see e.g.
[31], to the approximate Newton method. We use this extension in the Tetris experiment that we
consider in section(4). Given n_s samples and n_w parameters the complexity of these extensions scales as O(n_s n_w) for the diagonal approximate Newton method, while it scales as O(n_s n_w^2) for the full approximate Newton method.
An issue with the Newton method is the inversion of the Hessian matrix, which scales as O(n_w^3) in
the worst case. In the standard application of the Newton method this inversion has to be performed
at each iteration and in large parameter systems this becomes prohibitively costly. In general H(w)
will be dense and no computational savings will be possible when performing this matrix inversion.
The same is not true, however, of the matrices D2 (w) and H2 (w). Firstly, as D2 (w) is a diagonal
matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from
taking the second derivative of the log-trajectory distribution in (7). This property ensures that any
(product) sparsity over the control parameters in the log-trajectory distribution will correspond to
sparsity in H_2(w). For example, in a partially observable Markov Decision Process where the
policy is modeled through a finite state controller, see e.g. [22], there are three functions to be
optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When
the parameters of these three functions are independent H2 (w) will be block-diagonal (across the
parameters of the three functions) and the matrix inversion can be performed more efficiently by
inverting each of the block matrices individually. The reason that H(w) does not exhibit any such
sparsity properties is due to the term H1 (w) in (5), which consists of the non-negative mixture of
outer-product matrices. The vector in these outer-products is the derivative of the log-trajectory
distribution and this typically produces a dense matrix.
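A sketch of how such block structure might be exploited when computing the search direction −H_2^{-1}(w)∇_w U(w) is given below; the particular partition of the parameters into blocks is an assumption supplied by the user.

```python
import numpy as np

def blockwise_newton_direction(grad, H2_blocks, block_slices):
    """Exploit block-diagonal structure in H2(w) when computing -H2^{-1} grad.

    H2_blocks    : list of the dense diagonal blocks of H2(w)
    block_slices : list of slices picking out the corresponding entries of grad
    (Both describe an assumed partition of the parameters, e.g. the three
    independent parameter groups of a finite state controller.)
    """
    direction = np.zeros_like(grad)
    for block, sl in zip(H2_blocks, block_slices):
        # invert each block separately instead of the full matrix
        direction[sl] = -np.linalg.solve(block, grad[sl])
    return direction
```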
An undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of
basis used to represent the parameter space. An important and desirable property of the Newton
method is that it is invariant to non-singular linear (affine) transformations of the parameter space,
see e.g. [9]. This means that given a non-singular linear (affine) mapping T ∈ R^{n_w × n_w}, the Newton update of the objective Ũ(w) = U(Tw) is related to the Newton update of the original objective through the same linear (affine) mapping, i.e. v + ∆v_nt = T(w + ∆w_nt), where v = Tw and ∆v_nt and ∆w_nt denote the respective Newton steps. In other words running the Newton method on U(w)
and Ũ(T^{-1}w) will give identical results. An important point to note is that this desirable property is maintained when using H_2(w) in an approximate Newton method, while using D_2(w) results
in a method that is invariant to rescaling of the parameters, i.e. where T is a diagonal matrix with
non-zero elements along the diagonal. This can be seen by using the linearity of the expectation
operator and a proof of this statement is provided in section(6.4) of the supplementary material.
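For reference, the underlying argument is short; the following sketch (not reproduced from the paper's supplementary material) writes the objective in the transformed parameters v = Tw as V(v) := U(T^{-1}v), so that by the chain rule

\begin{align*}
\nabla_v V(v) &= T^{-\mathsf{T}} \nabla_w U(w), &
\nabla_v \nabla_v^{\mathsf{T}} V(v) &= T^{-\mathsf{T}} H(w)\, T^{-1}, \\
\Delta v_{\mathrm{nt}} &= -\bigl(T^{-\mathsf{T}} H(w) T^{-1}\bigr)^{-1} T^{-\mathsf{T}} \nabla_w U(w) = T\, \Delta w_{\mathrm{nt}}, &
v + \Delta v_{\mathrm{nt}} &= T(w + \Delta w_{\mathrm{nt}}).
\end{align*}

The same algebra goes through with H(w) replaced by H_2(w), since H_2 also transforms as T^{-T} H_2(w) T^{-1}; the corresponding relation between the diagonal matrices D_2 only holds when T is itself diagonal, which is the weaker rescaling invariance stated above.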
[Figure 1 appears here: (a) Policy Trace, plotting θ2 against θ1; (b) Tetris Problem, plotting completed lines against training iterations.]
Figure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method,
performed on the two state MDP of [16]. The plot shows the trace of the policy during training
for the two different parameter spaces, where the results of the latter have been mapped back into
the original parameter space for comparison. The plot shows the two steepest gradient ascent traces
(blue cross and blue circle) and the two traces of the full approximate Newton method (red cross
and red circle). (b) Results of the tetris problem for steepest gradient ascent (black), natural gradient
ascent (green), the diagonal approximate Newton method (blue) and the full approximate Newton
method (red).
4 Experiments
The first experiment we performed was an empirical illustration that the full approximate Newton
method is invariant to linear transformations of the parameter space. We considered the simple two
state example of [16] as it allows us to plot the trace of the policy during training, since the policy
has only two parameters. The policy was trained using both steepest gradient ascent and the full
approximate Newton method and in both the original and linearly transformed parameter space. The
policy traces of the two algorithms are plotted in figure(1.a). As expected steepest gradient ascent is
affected by such mappings, whilst the full approximate Newton method is invariant to them.
The second experiment was aimed at demonstrating the scalability of the full and diagonal approxi-
mate Newton methods to problems with a large state space. We considered the tetris domain, which
is a popular computer game designed by Alexey Pajitnov in 1985. See [12] for more details. Firstly,
we compared the performance of the full and diagonal approximate Newton methods to other para-
metric policy search methods. Tetris is typically played on a 20 × 10 grid, but due to computational
costs we considered a 10 × 10 grid in the experiment. This results in a state space with roughly 7 × 2^100 states. We modelled the policy through a Gibbs distribution, where we considered a feature vector with the following features: the heights of each column, the difference in heights between
adjacent columns, the maximum height and the number of ‘holes’. Under this policy it is not possi-
ble to obtain the explicit maximum over w in (11) and so a straightforward application of EM is not
possible in this problem. We therefore compared the diagonal and full approximate Newton methods
with steepest and natural gradient ascent. Due to reasons of space the exact implementation of the
experiment is detailed in section(6.6) of the supplementary material. We ran 100 repetitions of the
experiment, each consisting of 100 training iterations, and the mean and standard error of the results
are given in figure(1.b). It can be seen that the full approximate Newton method outperforms all of
the other methods, while the performance of the diagonal approximate Newton method is compa-
rable to natural gradient ascent. We also ran several training runs of the full approximate Newton
method on the full-sized 20 × 10 board and were able to obtain a score in the region of 14,000
completed lines, which was obtained after roughly 40 training iterations. An approximate dynamic
programming based method has previously been applied to the Tetris domain in [7]. The same set
of features were used and a score of roughly 4,500 completed lines was obtained after around 6
training iterations, after which the solution then deteriorated.
In the third experiment we considered a finite horizon (controlled) linear dynamical system. This
allowed the search-directions of the various algorithms to be computed exactly using [13] and re-
moved any issues of approximate inference from the comparison. In particular we considered a
3-link rigid manipulator, linearised through feedback linearisation, see e.g. [17].
Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the
3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue),
natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the
mean and standard error of the results. (b) The normalised total expected reward plotted against
training iterations for the synthetic non-linear system of [29]. The plot shows the results for EM
(blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton
method (red), where the plot shows the mean and standard error of the results.
This system has a 6-dimensional state space, 3-dimensional action space and a 22-dimensional parameter space. Further details of the system can be found in section(6.7) of the supplementary material. We ran the
experiment 100 times and the mean and standard error of the results are plotted in figure(2.a). In this
experiment the approximate Newton method found substantially better solutions than either steep-
est gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the
results in comparison to either steepest or natural gradient ascent can be explained by the fact that
H2 (w) gives a better estimate of the curvature of the objective function. Expectation Maximisation
performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent
performed 3684 ± 314 training iterations in this experiment which, in comparison to the 203 ± 34
and 310 ± 40 iterations of natural gradient ascent and the approximate Newton method respectively,
illustrates the susceptibility of this method to poor scaling. In the final experiment we considered the
synthetic non-linear system considered in [29]. Full details of the system and the experiment can be
found in section(6.8) of the supplementary material. We ran the experiment 100 times and the mean
and standard error of the results are plotted in figure(2.b). Again the approximate Newton method
outperforms both steepest and natural gradient ascent. In this example only the mean parameters of
the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means
that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm
is a particular (less general) version of the approximate Newton method, where a fixed step-size
of one is used throughout. The marked difference in performance between the EM-algorithm and
the approximate Newton method shows the benefit of being able to tune the step-size sequence.
In this experiment we considered five different step-size sequences for the approximate Newton
method and all of them obtained superior results to the EM-algorithm. In contrast only one of
the seven step-size sequences considered for steepest and natural gradient ascent outperformed the
EM-algorithm.
5 Conclusion
The contributions of this paper are twofold: Firstly we have given a novel analysis of Expectation
Maximisation and natural gradient ascent when applied to the MDP framework, showing that both
have close connections to an approximate Newton method; Secondly, prompted by this analysis
we have considered the direct application of this approximate Newton method to the optimisation of
MDPs, showing that it has numerous desirable properties that are not present in the naive application
of the Newton method. In terms of empirical performance we have found the approximate Newton
method to perform consistently well in comparison to EM and natural gradient ascent, highlighting
its viability as an alternative to either of these methods. At present we have only considered actor
type implementations of the approximate Newton method and the extension to actor-critic methods
is a point of future research.
References
[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.
[2] M. Azar, V. Gómez, and H. Kappen. Dynamic policy programming with function approximation. Journal
of Machine Learning Research - Proceedings Track, 15:119–127, 2011.
[3] J. Bagnell and J. Schneider. Covariant Policy Search. IJCAI, 18:1019–1024, 2003.
[4] J. Baxter and P. Bartlett. Infinite Horizon Policy Gradient Estimation. Journal of Artificial Intelligence
Research, 15:319–350, 2001.
[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2000.
[6] D. P. Bertsekas. Approximate Policy Iteration: A Survey and Some New Methods. Research report,
Massachusetts Institute of Technology, 2010.
[7] D. P. Bertsekas and S. Ioffe. Temporal Differences-Based Policy Iteration and Applications in Neuro-
Dynamic Programming. Research report, Massachusetts Institute of Technology, 1997.
[8] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural Actor-Critic Algorithms. Automatica,
45:2471–2482, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Com-
putation, 9:271–278, 1997.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM
Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
[12] C. Fahey. Tetris AI, Computers Play Tetris. http://colinfahey.com/tetris/tetris_en.html, 2003.
[13] T. Furmston and D. Barber. Efficient Inference for Markov Control Problems. UAI, 29:221–229, 2011.
[14] P. W. Glynn. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM,
33:75–84, 1990.
[15] E. Greensmith, P. Bartlett, and J. Baxter. Variance Reduction Techniques For Gradient Based Estimates
in Reinforcement Learning. Journal of Machine Learning Research, 5:1471–1530, 2004.
[16] S. Kakade. A Natural Policy Gradient. NIPS, 14:1531–1538, 2002.
[17] H. Khalil. Nonlinear Systems. Prentice Hall, 2001.
[18] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2):171–
203, 2011.
[19] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine
Learning (ECML), 17:282–293, 2006.
[20] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 42(4):1143–1166,
2003.
[21] P. Marbach and J. Tsitsiklis. Simulation-Based Optimisation of Markov Reward Processes. IEEE Trans-
actions on Automatic Control, 46(2):191–209, 2001.
[22] N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning Finite-State Controllers for Partially Ob-
servable Environments. UAI, 15:427–436, 1999.
[23] J. Nocedal and S. Wright. Numerical Optimisation. Springer, 2006.
[24] J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[25] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learn-
ing by Approximate Inference. International Conference on Robotics Science and Systems, 2012.
[26] S. Richter, D. Aberdeen, and J. Yu. Natural Actor-Critic for Road Traffic Optimisation. NIPS, 19:1169–
1176, 2007.
[27] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning
with Function Approximation. NIPS, 13:1057–1063, 2000.
[28] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic Inference for Solving (PO)MDPs. Research
Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[29] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning Model-Free Robot Control by a Monte
Carlo EM Algorithm. Autonomous Robots, 27(2):123–130, 2009.
[30] L. Weaver and N. Tao. The Optimal Reward Baseline for Gradient Based Reinforcement Learning. UAI,
17(29):538–545, 2001.
[31] R. Williams. Simple Statistical Gradient Following Algorithms for Connectionist Reinforcement Learn-
ing. Machine Learning, 8:229–256, 1992.