Journal of Machine Learning Research 22 (2021) 1-76 Submitted 9/19; Published 2/21

On the Theory of Policy Gradient Methods:
Optimality, Approximation, and Distribution Shift

Alekh Agarwal                                          alekha@microsoft.com
Microsoft Research
Redmond, WA 98052, USA

Sham M. Kakade                                         sham@cs.washington.edu
University of Washington & Microsoft Research
Seattle, WA 98195, USA

Jason D. Lee                                           jasonlee@princeton.edu
Princeton University
Princeton, NJ 08540, USA

Gaurav Mahajan                                         gmahajan@eng.ucsd.edu
University of California, San Diego
La Jolla, CA 92093, USA

Editor: Csaba Szepesvari

Abstract
Policy gradient methods are among the most effective methods in challenging reinforcement learning
problems with large state and/or action spaces. However, little is known about even their most basic
theoretical convergence properties, including: if and how fast they converge to a globally optimal
solution or how they cope with approximation error due to using a restricted class of parametric
policies. This work provides provable characterizations of the computational, approximation, and
sample size properties of policy gradient methods in the context of discounted Markov Decision
Processes (MDPs). We focus on both: “tabular” policy parameterizations, where the optimal policy
is contained in the class and where we show global convergence to the optimal policy; and parametric
policy classes (considering both log-linear and neural policy classes), which may not contain the
optimal policy and where we provide agnostic learning results. One central contribution of this work
is in providing approximation guarantees that are average case — which avoid explicit worst-case
dependencies on the size of state space — by making a formal connection to supervised learning
under distribution shift. This characterization shows an important interplay between estimation
error, approximation error, and exploration (as characterized through a precisely defined condition
number).
Keywords: Policy Gradient, Reinforcement Learning

1. Introduction

Policy gradient methods have a long history in the reinforcement learning (RL) literature (Williams,
1992; Sutton et al., 1999; Konda and Tsitsiklis, 2000; Kakade, 2001) and are an attractive class of
algorithms as they are applicable to any differentiable policy parameterization; admit easy extensions
to function approximation; easily incorporate structured state and action spaces; are easy to implement
in a simulation based, model-free manner. Owing to their flexibility and generality, there has also

©2021 Alekh Agarwal, Sham Kakade, Jason Lee and Gaurav Mahajan.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
http://jmlr.org/papers/v22/19-736.html.

been a flurry of improvements and refinements to make these ideas work robustly with deep neural
network based approaches (see e.g. Schulman et al. (2015, 2017)).
Despite the large body of empirical work around these methods, their convergence properties
are only established at a relatively coarse level; in particular, the folklore guarantee is that these
methods converge to a stationary point of the objective, assuming adequate smoothness properties
hold and assuming either exact or unbiased estimates of a gradient can be obtained (with appropriate
regularity conditions on the variance). However, this local convergence viewpoint does not address
some of the most basic theoretical convergence questions, including: 1) if and how fast they converge
to a globally optimal solution (say with a sufficiently rich policy class); 2) how they cope with
approximation error due to using a restricted class of parametric policies; or 3) their finite sample
behavior. These questions are the focus of this work.
Overall, the results of this work place policy gradient methods under a solid theoretical footing,
analogous to the global convergence guarantees of iterative value function based algorithms.

1.1 Our Contributions


This work focuses on first-order and quasi second-order policy gradient methods which directly work
in the space of some parameterized policy class (rather than value-based approaches). We characterize
the computational, approximation, and sample size properties of these methods in the context of
a discounted Markov Decision Process (MDP). We focus on: 1) tabular policy parameterizations,
where there is one parameter per state-action pair so the policy class is complete in that it contains
the optimal policy, and 2) function approximation, where we have a restricted class of parametric
policies which may not contain the globally optimal policy. Note that policy gradient methods for
discrete action MDPs work in the space of stochastic policies, which permits the policy class to be
differentiable. We now discuss our contributions in both of these contexts.

Tabular case: We consider three algorithms: two of which are first order methods, projected
gradient ascent (on the simplex) and gradient ascent (with a softmax policy parameterization); and
the third algorithm, natural policy gradient ascent, can be viewed as a quasi second-order method (or
preconditioned first-order method). Table 1 summarizes our main results in this case: upper bounds
on the number of iterations taken by these algorithms to find an ε-optimal policy, when we have
access to exact policy gradients.
Arguably, the most natural starting point for an analysis of policy gradient methods is to consider
directly doing gradient ascent on the policy simplex itself and then to project back onto the simplex
if the constraint is violated after a gradient update; we refer to this algorithm as projected gradient
ascent on the simplex. Using a notion of gradient domination (Polyak, 1963), our results provably
show that any first-order stationary point of the value function results in an approximately optimal
policy, under certain regularity assumptions; this allows for a global convergence analysis by directly
appealing to standard results in the non-convex optimization literature.
A more practical and commonly used parameterization is the softmax parameterization, where
the simplex constraint is explicitly enforced by the exponential parameterization, thus avoiding
projections. This work provides the first global convergence guarantees using only first-order gradient
information for the widely-used softmax parameterization. Our first result for this parameterization
establishes the asymptotic convergence of the policy gradient algorithm; the analysis challenge here
is that the optimal policy (which is deterministic) is attained by sending the softmax parameters to
infinity.


Algorithm                                                    Iteration complexity

Projected Gradient Ascent on Simplex (Thm 5)                 O( D∞^2 |S||A| / ((1−γ)^6 ε^2) )

Policy Gradient, softmax parameterization (Thm 10)           asymptotic

Policy Gradient + log barrier regularization,                O( D∞^2 |S|^2 |A|^2 / ((1−γ)^6 ε^2) )
softmax parameterization (Cor 13)

Natural Policy Gradient (NPG),                               2 / ((1−γ)^2 ε)
softmax parameterization (Thm 16)

Table 1: Iteration Complexities with Exact Gradients for the Tabular Case: A summary of
the number of iterations required by different algorithms to find a policy π such that
V*(s_0) − V^π(s_0) ≤ ε for some fixed s_0, assuming access to exact policy gradients. The
first three algorithms optimize the objective E_{s∼µ}[V^π(s)], where µ is the starting state
distribution for the algorithms. The MDP has |S| states, |A| actions, and discount factor
0 ≤ γ < 1. The quantity D∞ := max_s ( d^{π*}_{s_0}(s) / µ(s) ) is termed the distribution mismatch
coefficient, where, roughly speaking, d^{π*}_{s_0}(s) is the fraction of time spent in state s when
executing an optimal policy π*, starting from the state s_0 (see (4)). The NPG algorithm
directly optimizes V π (s0 ) for any state s0 . In contrast to the complexities of the previous
three algorithms, NPG has no dependence on the coefficient D∞ , nor does it depend on
the choice of s0 . Both the MDP Experts Algorithm (Even-Dar et al., 2009) and MD-MPI
algorithm (Geist et al., 2019) (see Corollary 3 of their paper) also yield guarantees for the
same update rule as NPG for the softmax parameterization, though at a worse rate. See
Section 2 for further discussion.

In order to establish a finite time convergence rate to optimality for the softmax parameteriza-
tion, we then consider a log barrier regularizer and provide an iteration complexity bound that is
polynomial in all relevant quantities. Our use of the log barrier regularizer is critical to avoiding the
issue of gradients becoming vanishingly small at suboptimal near-deterministic policies, an issue
of significant practical relevance. The log barrier regularizer can also be viewed as using a relative
entropy regularizer; here, we note the general approach of entropy based regularization is common in
practice (e.g. see (Williams and Peng, 1991; Mnih et al., 2016; Peters et al., 2010; Abdolmaleki
et al., 2018; Ahmed et al., 2019)). One notable distinction, which we discuss later, is that our analysis
is for the log barrier regularization rather than the entropy regularization.
For these aforementioned algorithms, our convergence rates depend on the optimization measure
having coverage over the state space, as measured by the distribution mismatch coefficient D∞ (see
Table 1 caption). In particular, for the convergence rates shown in Table 1 (for the aforementioned
algorithms), we assume that the optimization objective is the expected (discounted) cumulative value
where the initial state is sampled under some distribution, and D∞ is a measure of the coverage of


this initial distribution. Furthermore, we provide a lower bound that shows such a dependence is
unavoidable for first-order methods, even when exact gradients are available.
We then consider the Natural Policy Gradient (NPG) algorithm (Kakade, 2001) (also see Bagnell
and Schneider (2003); Peters and Schaal (2008)), which can be considered a quasi second-order
method due to the use of its particular preconditioner, and provide an iteration complexity to achieve
an ε-optimal policy that is at most 2/((1−γ)^2 ε) iterations, improving upon the previous related results
of (Even-Dar et al., 2009; Geist et al., 2019) (see Section 2). Note the convergence rate has no
dependence on the number of states or the number of actions, nor does it depend on the distribution
mismatch coefficient D∞ . We provide a simple and concise proof for the convergence rate analysis
by extending the approach developed in (Even-Dar et al., 2009), which uses a mirror descent style
of analysis (Nemirovsky and Yudin, 1983; Cesa-Bianchi and Lugosi, 2006) and also handles the
non-concavity of the policy optimization problem.
This fast and dimension free convergence rate shows how the variable preconditioner in the
natural gradient method improves over the standard gradient ascent algorithm. The dimension free
aspect of this convergence rate is worth reflecting on, especially given the widespread use of the
natural policy gradient algorithm along with variants such as the Trust Region Policy Optimization
(TRPO) algorithm (Schulman et al., 2015); our results may help to provide analysis of a more general
family of entropy based algorithms (see for example Neu et al. (2017)).
Function Approximation: We now summarize our results with regards to policy gradient methods
in the setting where we work with a restricted policy class, which may not contain the optimal policy.
In this sense, these methods can be viewed as approximate methods. Table 2 provides a summary
along with the comparisons to some relevant approximate dynamic programming methods.
A long line of work in the function approximation setting focuses on mitigating the worst-case
“ℓ∞” guarantees that are inherent to approximate dynamic programming methods (Bertsekas and
Tsitsiklis, 1996) (see the first row in Table 2). The reason to focus on average case guarantees is that
it supports the applicability of supervised machine learning methods to solve the underlying approx-
imation problem. This is because supervised learning methods, like classification and regression,
typically have bounds on the expected error under a distribution, as opposed to worst-case guarantees
over all possible inputs.
The existing literature largely consists of two lines of provable guarantees that attempt to mitigate
the explicit ℓ∞ error conditions of approximate dynamic programming: those methods which utilize
a problem dependent parameter (the concentrability coefficient (Munos, 2005)) to provide more
refined dynamic programming guarantees (e.g. see Munos (2005); Szepesvári and Munos (2005);
Antos et al. (2008); Farahmand et al. (2010)) and those which work with a restricted policy class,
making incremental updates, such as Conservative Policy Iteration (CPI) (Kakade and Langford,
2002; Scherrer and Geist, 2014), Policy Search by Dynamic Programming (PSDP) (Bagnell et al.,
2004), and MD-MPI Geist et al. (2019). Both styles of approaches give guarantees based on
worst-case density ratios, i.e. they depend on a maximum ratio between two different densities over
the state space. As discussed in (Scherrer, 2014), the assumptions in the latter class of algorithms
are substantially weaker, in that the worst-case density ratio only depends on the state visitation
distribution of an optimal policy (also see Table 2 caption and Section 2).
With regards to function approximation, our main contribution is in providing performance
bounds that, in some cases, have milder dependence on these density ratios. We precisely quantify
an approximation/estimation error decomposition relevant for the analysis of the natural gradient
method; this decomposition is stated in terms of the compatible function approximation error as


introduced in Sutton et al. (1999). More generally, we quantify our function approximation results in
terms of a precisely quantified transfer error notion, based on approximation error under distribution
shift. Table 2 shows a special case of our convergence rates of NPG, which is governed by four
quantities: ε_stat, ε_approx, κ, and D∞.
Let us discuss the important special case of log-linear policies (i.e. policies that take the softmax
of linear functions in a given feature space) where the relevant quantities are as follows: ε_stat is
a bound on the excess risk (the estimation error) in fitting linearly parameterized value functions,
which can be driven to 0 with more samples (at the usual statistical rate of O(1/√N), where N is
the number of samples); ε_approx is the usual notion of average squared approximation error where
the target function may not be perfectly representable by a linear function; κ can be upper bounded
with an inverse dependence on the minimal eigenvalue of the feature covariance matrix of the fitting
measure (as such it can be viewed as a dimension dependent quantity but not necessarily state
dependent); and D∞ is as before.
For the realizable case, where all policies have values which are linear in the given features (such
as in linear MDP models of (Jin et al., 2019; Yang and Wang, 2019; Jiang et al., 2017)), we have
that the approximation error ε_approx is 0. Here, our guarantees yield a fully polynomial and sample
efficient convergence guarantee, provided the condition number κ is bounded. Importantly, there
always exists a good (universal) initial measure that ensures κ is bounded by a quantity that is only
polynomial in the dimension of the features, d, as opposed to an explicit dependence on the size of
the (infinite) state space (see Remark 22). Such a guarantee would not be implied by algorithms
which depend on the coefficients C∞ or D∞.¹
Our results are also suggestive that a broader class of incremental algorithms — such as
CPI (Kakade and Langford, 2002), PSDP (Bagnell et al., 2004), and MD-MPI Geist et al. (2019)
which make small changes to the policy from one iteration to the next — may also permit a sharper
analysis, where the dependence of worst-case density ratios can be avoided through an appropriate
approximation/estimation decomposition; this is an interesting direction for future work (a point
which we return to in Section 7). One significant advantage of NPG is that the explicit parametric
policy representation in NPG (and other policy gradient methods) leads to a succinct policy represen-
tation in comparison to CPI, PSDP, or related boosting-style methods (Scherrer and Geist, 2014),
where the representation complexity of the policy of the latter class of methods grows linearly in
the number of iterations (since these methods add one policy to the ensemble per iteration). This
representation complexity is likely why the latter class of algorithms are less widely used in practice.

1. Bounding C∞ would require a restriction on the dynamics of the MDP (see Chen and Jiang (2019) and Section 2).
   Bounding D∞ would require an initial state distribution that is constructed using knowledge of π*, through d^{π*}. In
   contrast, κ can be made O(d), with an initial state distribution that only depends on the geometry of the features (and
   does not depend on any other properties of the MDP). See Remark 22.


Algorithm                                  Suboptimality after T Iterations               Relevant Quantities

Approx. Value/Policy Iteration             ε_∞/(1−γ)^2 + γ^T/(1−γ)^2                      ε_∞: ℓ∞ error of values
(Bertsekas and Tsitsiklis, 1996)

Approx. Value/Policy Iteration,            C∞ ε_1/(1−γ)^2 + γ^T/(1−γ)^2                   ε_1: an ℓ1 average error
with concentrability                                                                      C∞: concentrability
(Munos, 2005; Antos et al., 2008)                                                         (max density ratio)

Conservative Policy Iteration              D∞ ε_1/(1−γ)^2 + 1/((1−γ)√T)                   ε_1: an ℓ1 average error
(Kakade and Langford, 2002)                                                               D∞: max density ratio to opt.,
Related: PSDP (Bagnell et al., 2004),                                                     D∞ ≤ C∞
MD-MPI Geist et al. (2019)

Natural Policy Gradient                    √((κ ε_stat + D∞ ε_approx)/(1−γ)^3)            ε_stat: excess risk
(Cor. 21 and Thm. 29)                        + 1/((1−γ)√T)                                ε_approx: approx. error
                                                                                          κ: a condition number
                                                                                          D∞: max density ratio to opt.,
                                                                                          D∞ ≤ C∞

Table 2: Overview of Approximate Methods: The suboptimality, V*(s_0) − V^π(s_0), after T iter-
         ations for various approximate algorithms, which use different notions of approximation
         error (sample complexities are not directly considered but instead may be thought of as
         part of ε_1 and ε_stat. See Section 2 for further discussion). Order notation is used to drop
         constants, and we assume |A| = 2 for ease of exposition. For approximate dynamic pro-
         gramming methods, the relevant error is the worst case, ℓ∞-error in approximating a value
         function, e.g. ε_∞ = max_{s,a} |Q^π(s, a) − Q̂^π(s, a)|, where Q̂^π is what an estimation oracle
         returns during the course of the algorithm. The second row (see Lemma 12 in Antos et al.
         (2008)) is a refinement of this approach, where ε_1 is an ℓ1-average error in fitting the value
         functions under the fitting (state) distribution µ, and, roughly, C∞ is a worst case density
         ratio between the state visitation distribution of any non-stationary policy and the fitting
         distribution µ. For Conservative Policy Iteration, ε_1 is a related ℓ1-average case fitting
         error with respect to a fitting distribution µ, and D∞ is as defined before, in the caption
         of Table 1 (see also (Kakade and Langford, 2002)); here, D∞ ≤ C∞ (e.g. see Scherrer
         (2014)). For NPG, ε_stat and ε_approx measure the excess risk (the regret) and approximation
         errors in fitting the values. Roughly speaking, ε_stat is the excess squared loss relative to
         the best fit (among an appropriately defined parametric class) under our fitting distribution
         (defined with respect to the state distribution µ). Here, ε_approx is the approximation error:
         the minimal possible error (in our parametric class) under our fitting distribution. The
         condition number κ is a relative eigenvalue condition between appropriately defined feature
         covariances with respect to the state visitation distribution of an optimal policy, d^{π*}_{s_0}, and
the state fitting distribution µ. See text for further discussion, and Section 6 for precise
statements as well as a more general result not explicitly dependent on D∞ .


2. Related Work
We now discuss related work, roughly in the order which reflects our presentation of results in the
previous section.
For the direct policy parameterization in the tabular case, we make use of a gradient domination-
like property, namely any first-order stationary point of the policy value is approximately optimal up
to a distribution mismatch coefficient. A variant of this result also appears in Theorem 2 of Scherrer
and Geist (2014), which itself can be viewed as a generalization of the approach in Kakade and
Langford (2002). In contrast to CPI (Kakade and Langford, 2002) and the more general boosting-
based approach in Scherrer and Geist (2014), we phrase this approach as a Polyak-like gradient
domination property (Polyak, 1963) in order to directly allow for the transfer of any advances in
non-convex optimization to policy optimization in RL. More broadly, it is worth noting the global
convergence of policy gradients for Linear Quadratic Regulators (Fazel et al., 2018) also goes through
a similar proof approach of gradient domination.
Empirically, the recent work of Ahmed et al. (2019) studies entropy based regularization and
shows the value of regularization in policy optimization, even with exact gradients. This is related to
our use of the log barrier regularization.
For our convergence results of the natural policy gradient algorithm in the tabular setting, there
are close connections between our results and the works of Even-Dar et al. (2009); Geist et al. (2019).
Even-Dar et al. (2009) provides provable online regret guarantees in changing MDPs utilizing experts
algorithms (also see Neu et al. (2010); Abbasi-Yadkori et al. (2019a)); as a special case, their MDP
Experts Algorithm is equivalent to the natural policy gradient algorithm with the softmax policy
parameterization. While the convergence result due to Even-Dar et al. (2009) was not specifically
designed for this setting, it is instructive to see what it implies due to the close connections between
optimization and regret (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz et al., 2012). The Mirror
Descent-Modified Policy Iteration (MD-MPI) algorithm (Geist et al., 2019) with negative entropy as
the Bregman divergence results is an identical algorithm as NPG for softmax parameterization in
the tabular case; Corollary 3 (Geist et al., 2019) applies to our updates, leading to a bound worse by
a 1/(1 − γ) factor and also has logarithmic dependence on |A|. Our proof for this case is concise
and may be of independent interest. Also worth noting is the Dynamic Policy Programming of Azar
et al. (2012), which is an actor-critic algorithm with a softmax parameterization; this algorithm, even
though not identical, comes with similar guarantees in terms of its rate (it is weaker in terms of an
additional 1/(1 − γ) factor) than the NPG algorithm.
We now turn to function approximation, starting with a discussion of iterative algorithms which
make incremental updates in which the next policy is effectively constrained to be close to the
previous policy, such as in CPI and PSDP (Bagnell et al., 2004). Here, the work in Scherrer and Geist
(2014) shows how CPI is part of a broader family of boosting-style methods. Also, with regards to PSDP,
the work in Scherrer (2014) shows how PSDP actually enjoys an improved iteration complexity
over CPI, namely O(log 1/ε_opt) vs. O(1/ε_opt^2). It is worthwhile to note that both NPG and projected
gradient ascent are also incremental algorithms.
We now discuss the approximate dynamic programming results characterized in terms of the
concentrability coefficient. Broadly we use the term approximate dynamic programming to refer
to fitted value iteration, fitted policy iteration and more generally generalized policy iteration
schemes such as classification-based policy iteration as well, in addition to the classical approximate
value/policy iteration works. While the approximate dynamic programming results typically require


ℓ∞ bounded errors, which is quite stringent, the notion of concentrability (originally due to (Munos,
2003, 2005)) permits sharper bounds in terms of average case function approximation error, provided
that the concentrability coefficient is bounded (e.g. see Munos (2005); Szepesvári and Munos
(2005); Antos et al. (2008); Lazaric et al. (2016)). Chen and Jiang (2019) provide a more detailed
discussion on this quantity. Based on this problem dependent constant being bounded, Munos (2005);
Szepesvári and Munos (2005), Antos et al. (2008) and Lazaric et al. (2016) provide meaningful
sample size and error bounds for approximate dynamic programming methods, where there is a data
collection policy (under which value-function fitting occurs) that induces a concentrability coefficient.
In terms of the concentrability coefficient C∞ and the “distribution mismatch coefficient” D∞ in
Table 2 , we have that D∞ ≤ C∞ , as discussed in (Scherrer, 2014) (also see the table caption). Also,
as discussed in Chen and Jiang (2019), a finite concentrability coefficient is a restriction on the MDP
dynamics itself, while a bounded D∞ does not require any restrictions on the MDP dynamics. The
more refined quantities defined by Farahmand et al. (2010) (for the approximate policy iteration
result) partially alleviate some of these concerns, but their assumptions still implicitly constrain the
MDP dynamics, like the finiteness of the concentrability coefficient.
Assuming bounded concentrability coefficient, there are a notable set of provable average case
guarantees for the MD-MPI algorithm (Geist et al., 2019) (see also (Azar et al., 2012; Scherrer et al.,
2015)), which are stated in terms of various norms of function approximation error. MD-MPI is
a class of algorithms for approximate planning under regularized notions of optimality in MDPs.
Specifically, Geist et al. (2019) analyze a family of actor-critic style algorithms, where there are both
approximate value functions updates and approximate policy updates. As a consequence of utilizing
approximate value function updates for the critic, the guarantees of Geist et al. (2019) are stated with
dependencies on concentrability coefficients.
When dealing with function approximation, computational and statistical complexities are
relevant because they determine the effectiveness of approximate updates with finite samples. With
regards to sample complexity, the work in Szepesvári and Munos (2005); Antos et al. (2008)
provide finite sample rates (as discussed above), further generalized to actor-critic methods in Azar
et al. (2012); Scherrer et al. (2015). In our policy optimization approach, the analysis of both
computational and statistical complexities are straightforward, since we can leverage known statistical
and computational results from the stochastic approximation literature; in particular, we use the
stochastic projected gradient ascent to obtain a simple, linear time method for the critic estimation
step in the natural policy gradient algorithm.
In terms of the algorithmic updates for the function approximation setting, our development
of NPG bears similarity to the natural actor-critic algorithm Peters and Schaal (2008), for which
some asymptotic guarantees under finite concentrability coefficients are obtained in Bhatnagar et al.
(2009). While both updates seek to minimize the compatible function approximation error, we
perform streaming updates based on stochastic optimization using Monte Carlo estimates for values.
In contrast Peters and Schaal (2008) utilize Least Squares Temporal Difference methods (Boyan,
1999) to minimize the loss. As a consequence, their updates additionally make linear approximations
to the value functions in order to estimate the advantages; our approach is flexible in allowing for
wide family of smoothly differentiable policy classes (including neural policies).
Finally, we remark on some concurrent works. The work of Bhandari and Russo (2019) provides
gradient domination-like conditions under which there is (asymptotic) global convergence to the
optimal policy. Their results are applicable to the projected gradient ascent algorithm; they are not
applicable to gradient ascent with the softmax parameterization (see the discussion in Section 5


herein for the analysis challenges). Bhandari and Russo (2019) also provide global convergence
results beyond MDPs. Also, Liu et al. (2019) provide an analysis of the TRPO algorithm (Schulman
et al., 2015) with neural network parameterizations, which bears resemblance to our natural policy
gradient analysis. In particular, Liu et al. (2019) utilize ideas from both Even-Dar et al. (2009)
(with a mirror descent style of analysis) along with Cai et al. (2019) (to handle approximation with
neural networks) to provide conditions under which TRPO returns a near optimal policy. Liu et al.
(2019) do not explicitly consider the case where the policy class is not complete (i.e when there
is approximation). Another related work of Shani et al. (2019) considers the TRPO algorithm and
provides
theoretical guarantees in the tabular case; their convergence rates with exact updates are
O(1/√T) for the (unregularized) objective function of interest; they also provide faster rates on
a modified (regularized) objective function. They do not consider the case of infinite state spaces
and function approximation. The closely related recent papers (Abbasi-Yadkori et al., 2019a,b)
also consider closely related algorithms to the Natural Policy Gradient approach studied here, in an
infinite horizon, average reward setting. Specifically, the EE-POLITEX algorithm is closely related
to the Q-NPG algorithm which we study in Section 6.2, though our approach is in the discounted
setting. We adopt the name Q-NPG to capture its close relationship with the NPG algorithm, with the
main difference being the use of function approximation for the Q-function instead of advantages.
We refer the reader to Section 6.2 (and Remark 25) for more discussion of the technical differences
between the two works.

3. Setting
A (finite) Markov Decision Process (MDP) M = (S, A, P, r, γ, ρ) is specified by: a finite state space
S; a finite action space A; a transition model P where P (s0 |s, a) is the probability of transitioning
into state s0 upon taking action a in state s; a reward function r : S × A → [0, 1] where r(s, a) is the
immediate reward associated with taking action a in state s; a discount factor γ ∈ [0, 1); a starting
state distribution ρ over S.
A deterministic, stationary policy π : S → A specifies a decision-making strategy in which
the agent chooses actions adaptively based on the current state, i.e., at = π(st ). The agent may
also choose actions according to a stochastic policy π : S → ∆(A) (where ∆(A) is the probability
simplex over A), and, overloading notation, we write at ∼ π(·|st ).
A policy induces a distribution over trajectories τ = (s_t, a_t, r_t)_{t=0}^∞, where s_0 is drawn from the
starting state distribution ρ, and, for all subsequent timesteps t, at ∼ π(·|st ) and st+1 ∼ P (·|st , at ).
The value function V π : S → R is defined as the discounted sum of future rewards starting at state s
and executing π, i.e.

V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s ],

where the expectation is with respect to the randomness of the trajectory τ induced by π in M . Since
we assume that r(s, a) ∈ [0, 1], we have 0 ≤ V^π(s) ≤ 1/(1−γ). We overload notation and define V^π(ρ)
as the expected value under the initial state distribution ρ, i.e.

V^π(ρ) := E_{s_0∼ρ}[ V^π(s_0) ].


The action-value (or Q-value) function Q^π : S × A → R and the advantage function A^π : S × A → R
are defined as:

Q^π(s, a) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s, a_0 = a ],    A^π(s, a) := Q^π(s, a) − V^π(s).

The goal of the agent is to find a policy π that maximizes the expected value from the initial state,
i.e. the optimization problem the agent seeks to solve is:

max_π V^π(ρ),    (1)

where the max is over all policies. The famous theorem of Bellman and Dreyfus (1959) shows there
exists a policy π* which simultaneously maximizes V^π(s_0), for all states s_0 ∈ S.
Policy Parameterizations. This work studies ascent methods for the optimization problem:

max_{θ∈Θ} V^{π_θ}(ρ),

where {πθ |θ ∈ Θ} is some class of parametric (stochastic) policies. We consider a number of


different policy classes. The first two are complete in the sense that any stochastic policy can be
represented in the class. The final class may be restrictive. These classes are as follows:

• Direct parameterization: The policies are parameterized by

π_θ(a|s) = θ_{s,a},    (2)

where θ ∈ ∆(A)^{|S|}, i.e. θ is subject to θ_{s,a} ≥ 0 and Σ_{a∈A} θ_{s,a} = 1 for all s ∈ S and a ∈ A.

• Softmax parameterization: For unconstrained θ ∈ R|S||A| ,

π_θ(a|s) = exp(θ_{s,a}) / Σ_{a′∈A} exp(θ_{s,a′}).    (3)

The softmax parameterization is also complete.

• Restricted parameterizations: We also study parametric classes {πθ |θ ∈ Θ} that may not
contain all stochastic policies. In particular, we pay close attention to both log-linear policy
classes and neural policy classes (see Section 6). Here, the best we may hope for is an agnostic
result where we do as well as the best policy in this class.

While the softmax parameterization is the more natural parametrization among the two complete
policy classes, it is also informative to consider the direct parameterization.
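To make the two complete classes concrete, the following minimal sketch (our own illustration; the array layout, with θ stored as a |S| × |A| NumPy array, and the function names are our choices, not the paper's) implements the direct parameterization (2) and the softmax parameterization (3):

```python
# A minimal sketch of the direct (2) and softmax (3) tabular parameterizations,
# with theta stored as an array of shape (|S|, |A|). Illustrative conventions only.
import numpy as np

def direct_policy(theta):
    """Direct parameterization: theta itself lies in the product of simplices,
    i.e. theta[s] is a probability vector over actions for every state s."""
    assert np.all(theta >= 0) and np.allclose(theta.sum(axis=1), 1.0)
    return theta  # pi_theta(a|s) = theta[s, a]

def softmax_policy(theta):
    """Softmax parameterization: theta is unconstrained in R^{|S| x |A|}."""
    z = theta - theta.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)  # pi_theta(a|s) as in (3)
```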
It is worth explicitly noting that V πθ (s) is non-concave in θ for both the direct and the softmax
parameterizations, so the standard tools of convex optimization are not applicable. For completeness,
we formalize this as follows (with a proof in Appendix A, along with an example in Figure 1):

Lemma 1 There is an MDP M (described in Figure 1) such that the optimization problem V πθ (s)
is not concave for both the direct and softmax parameterizations.


Figure 1: (Non-concavity example) A deterministic MDP corresponding to Lemma 1 where V^{π_θ}(s) is not concave. Numbers on arrows represent the rewards for each action.

Figure 2: (Vanishing gradient example) A deterministic, chain MDP of length H + 2. We consider a policy where π(a|s_i) = θ_{s_i,a} for i = 1, 2, . . . , H. Rewards are 0 everywhere other than r(s_{H+1}, a_1) = 1. See Proposition 6.

Policy gradients. In order to introduce these methods, it is useful to define the discounted state
visitation distribution d^π_{s_0} of a policy π as:

d^π_{s_0}(s) := (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s | s_0),    (4)

where Pr^π(s_t = s | s_0) is the state visitation probability that s_t = s, after we execute π starting at
state s_0. Again, we overload notation and write:

d^π_ρ(s) = E_{s_0∼ρ}[ d^π_{s_0}(s) ],

where d^π_ρ is the discounted state visitation distribution under initial distribution ρ.
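For a tabular MDP, the distribution in (4) can be computed exactly by a linear solve; the sketch below is our own illustration, and the array conventions (P[s, a, s'] for the transition model, pi[s, a] for the policy) are assumptions rather than anything specified in the paper:

```python
# A minimal sketch: exact computation of the discounted state visitation
# distribution (4) for a tabular MDP. Assumed conventions: P[s, a, s'], pi[s, a].
import numpy as np

def state_visitation(P, pi, s0, gamma):
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)    # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    e_s0 = np.zeros(S)
    e_s0[s0] = 1.0
    # d_{s0} = (1 - gamma) e_{s0}^T sum_t (gamma P_pi)^t = (1 - gamma) e_{s0}^T (I - gamma P_pi)^{-1}
    d = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e_s0)
    return d    # a probability distribution over states
```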
The policy gradient functional form (see e.g. Williams (1992); Sutton et al. (1999)) is then:

∇_θ V^{π_θ}(s_0) = (1/(1−γ)) E_{s∼d^{π_θ}_{s_0}} E_{a∼π_θ(·|s)} [ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ].    (5)

Furthermore, if we are working with a differentiable parameterization of π_θ(·|s) that explicitly
constrains π_θ(·|s) to be in the simplex, i.e. π_θ ∈ ∆(A)^{|S|} for all θ, then we also have:

∇_θ V^{π_θ}(s_0) = (1/(1−γ)) E_{s∼d^{π_θ}_{s_0}} E_{a∼π_θ(·|s)} [ ∇_θ log π_θ(a|s) A^{π_θ}(s, a) ].    (6)

Note the above gradient expression (Equation 6) does not hold for the direct parameterization, while
Equation 5 is valid.²
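Although the analysis in this paper is stated for exact gradients, expression (5) also underlies the standard sample-based estimators used in practice. The following REINFORCE-style sketch is our own illustration, not the estimator analyzed here; the environment interface (reset/step) and the use of the reward-to-go in place of Q^{π_θ}(s, a) are assumptions:

```python
# A minimal sketch (our own, not the paper's method) of a Monte Carlo estimate
# of the policy gradient (5) for the softmax parameterization. Assumptions:
# env.reset() -> state index, env.step(a) -> (next_state, reward, done).
import numpy as np

def softmax_policy(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def reinforce_gradient(env, theta, gamma, horizon):
    pi = softmax_policy(theta)
    grad = np.zeros_like(theta)
    s = env.reset()
    trajectory = []
    for t in range(horizon):                       # roll out one (truncated) trajectory
        a = np.random.choice(pi.shape[1], p=pi[s])
        s_next, reward, done = env.step(a)
        trajectory.append((s, a, reward))
        s = s_next
        if done:
            break
    G = 0.0
    for t in reversed(range(len(trajectory))):     # reward-to-go G_t estimates Q^pi(s_t, a_t)
        s_t, a_t, r_t = trajectory[t]
        G = r_t + gamma * G
        glog = -pi[s_t].copy()                     # grad of log pi_theta(a_t|s_t) w.r.t. row s_t of theta
        glog[a_t] += 1.0
        grad[s_t] += (gamma ** t) * G * glog
    return grad                                    # average over many rollouts in practice
```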
The performance difference lemma. The following lemma is helpful throughout:
Lemma 2 (The performance difference lemma (Kakade and Langford, 2002)) For all policies π, π′
and states s_0,

V^π(s_0) − V^{π′}(s_0) = (1/(1−γ)) E_{s∼d^π_{s_0}} E_{a∼π(·|s)}[ A^{π′}(s, a) ].

For completeness, we provide a proof in Appendix A.


2. This is due to Σ_a ∇_θ π_θ(a|s) = 0 not explicitly being maintained by the direct parameterization.


The distribution mismatch coefficient. We often characterize the difficulty of the exploration
problem faced by our policy optimization algorithms when maximizing the objective V π (µ) through
the following notion of distribution mismatch coefficient.

Definition 3 (Distribution mismatch coefficient) Given a policy π and measures ρ, µ ∈ ∆(S),


we refer to ‖ d^π_ρ / µ ‖_∞ as the distribution mismatch coefficient of π relative to µ. Here, d^π_ρ / µ denotes
componentwise division.

We often instantiate this coefficient with µ as the initial state distribution used in a policy
optimization algorithm, ρ as the distribution to measure the sub-optimality of our policy (this is the
start state distribution of interest), and where π above is often chosen to be π* ∈ argmax_{π∈Π} V^π(ρ),
given a policy class Π.
Notation. Following convention, we use V* and Q* to denote V^{π*} and Q^{π*} respectively. For
iterative algorithms which obtain policy parameters θ^(t) at iteration t, we let π^(t), V^(t) and A^(t) denote
the corresponding quantities parameterized by θ^(t), i.e. π_{θ^(t)}, V^{θ^(t)} and A^{θ^(t)}, respectively. For vectors
u and v, we use u/v to denote the componentwise ratio; u ≥ v denotes a componentwise inequality;
we use the standard convention where ‖v‖_2 = √(Σ_i v_i^2), ‖v‖_1 = Σ_i |v_i|, and ‖v‖_∞ = max_i |v_i|.

4. Warmup: Constrained Tabular Parameterization


Our starting point is, arguably, the simplest first-order method: we directly take gradient ascent
updates on the policy simplex itself and then project back onto the simplex if the constraints
are violated after a gradient update. This algorithm is projected gradient ascent on the direct
policy parametrization of the MDP, where the parameters are the state-action probabilities, i.e.
θs,a = πθ (a|s) (see (2)). As noted in Lemma 1, V πθ (s) is non-concave in the parameters πθ . Here,
we first prove that V πθ (µ) satisfies a Polyak-like gradient domination condition (Polyak, 1963), and
this tool helps in providing convergence rates. The basic approach was also used in the analysis
of CPI (Kakade and Langford, 2002); related gradient domination-like lemmas also appeared in
Scherrer and Geist (2014).
It is instructive to consider this special case due to the connections it makes to the non-convex
optimization literature. We also provide a lower bound that rules out algorithms whose runtime
appeals to the curvature of saddle points (e.g. (Nesterov and Polyak, 2006; Ge et al., 2015; Jin et al.,
2017)).
For the direct policy parametrization where θs,a = πθ (a|s), the gradient is:

∂V^π(µ)/∂π(a|s) = (1/(1−γ)) d^π_µ(s) Q^π(s, a),    (7)

using (5). In particular, for this parameterization, we may write ∇π V π (µ) instead of ∇θ V πθ (µ).
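In the tabular setting, (7) can be evaluated exactly by solving the Bellman equations; the sketch below is our own illustration, with assumed array conventions P[s, a, s'], r[s, a], pi[s, a], and mu[s]:

```python
# A minimal sketch: the exact gradient (7) for the direct parameterization,
# dV^pi(mu)/dpi(a|s) = (1/(1-gamma)) d^pi_mu(s) Q^pi(s, a). Assumed conventions only.
import numpy as np

def direct_policy_gradient(P, r, pi, mu, gamma):
    S, A = r.shape
    P_pi = np.einsum('sa,sat->st', pi, P)                   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)                             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)     # V^pi from the Bellman equation
    Q = r + gamma * np.einsum('sat,t->sa', P, V)            # Q^pi(s, a)
    d_mu = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return d_mu[:, None] * Q / (1.0 - gamma)                # gradient (7), shape (|S|, |A|)
```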

4.1 Gradient Domination


Informally, we say a function f (θ) satisfies a gradient domination property if for all θ ∈ Θ,

f(θ*) − f(θ) = O(G(θ)),


where θ* ∈ argmax_{θ′∈Θ} f(θ′) and where G(θ) is some suitable scalar notion of first-order station-
arity, which can be considered a measure of how large the gradient is (see (Karimi et al., 2016;
Bolte et al., 2007; Attouch et al., 2010)). Thus if one can find a θ that is (approximately) a first-
order stationary point, then the parameter θ will be near optimal (in terms of function value). Such
conditions are a standard device to establishing global convergence in non-convex optimization, as
they effectively rule out the presence of bad critical points. In other words, given such a condition,
quantifying the convergence rate for a specific algorithm, like say projected gradient ascent, will
require quantifying the rate of its convergence to a first-order stationary point, for which one can
invoke standard results from the optimization literature.
The following lemma shows that the direct policy parameterization satisfies a notion of gradient
domination. This is the basic approach used in the analysis of CPI (Kakade and Langford, 2002); a
variant of this lemma also appears in Scherrer and Geist (2014). We give a proof for completeness.
Even though we are interested in the value V π (ρ), it is helpful to consider the gradient with
respect to another state distribution µ ∈ ∆(S).

Lemma 4 (Gradient domination) For the direct policy parameterization (as in (2)), for all state
distributions µ, ρ ∈ ∆(S), we have

V*(ρ) − V^π(ρ) ≤ ‖ d^{π*}_ρ / d^π_µ ‖_∞ max_{π̄} (π̄ − π)^⊤ ∇_π V^π(µ)
             ≤ (1/(1−γ)) ‖ d^{π*}_ρ / µ ‖_∞ max_{π̄} (π̄ − π)^⊤ ∇_π V^π(µ),

where the max is over the set of all policies, i.e. π̄ ∈ ∆(A)^{|S|}.

Before we provide the proof, a few comments are in order with regards to the performance
measure ρ and the optimization measure µ. Subtly, note that although the gradient is with respect
to V π (µ), the final guarantee applies to all distributions ρ. The significance is that even though we
may be interested in our performance under ρ, it may be helpful to optimize under the distribution
µ. To see this, note the lemma shows that a sufficiently small gradient magnitude in the feasible
directions implies the policy is nearly optimal in terms of its value, but only if the state distribution
of π, i.e. d^π_µ, adequately covers the state distribution of some optimal policy π*. Here, it is also worth
recalling the theorem of Bellman and Dreyfus (1959) which shows there exists a single policy π*
that is simultaneously optimal for all starting states s0 . Note that the hardness of the exploration
problem is captured through the distribution mismatch coefficient (Definition 3).


Proof [of Lemma 4] By the performance difference lemma (Lemma 2),

V*(ρ) − V^π(ρ) = (1/(1−γ)) Σ_{s,a} d^{π*}_ρ(s) π*(a|s) A^π(s, a)
             ≤ (1/(1−γ)) Σ_s d^{π*}_ρ(s) max_ā A^π(s, ā)
             = (1/(1−γ)) Σ_s ( d^{π*}_ρ(s) / d^π_µ(s) ) · d^π_µ(s) max_ā A^π(s, ā)
             ≤ (1/(1−γ)) ( max_s d^{π*}_ρ(s) / d^π_µ(s) ) Σ_s d^π_µ(s) max_ā A^π(s, ā),    (8)

where the last inequality follows since max_ā A^π(s, ā) ≥ 0 for all states s and policies π. We wish to
upper bound (8). We then have:

Σ_s ( d^π_µ(s) / (1−γ) ) max_ā A^π(s, ā) = max_{π̄∈∆(A)^{|S|}} Σ_{s,a} ( d^π_µ(s) / (1−γ) ) π̄(a|s) A^π(s, a)
    = max_{π̄∈∆(A)^{|S|}} Σ_{s,a} ( d^π_µ(s) / (1−γ) ) (π̄(a|s) − π(a|s)) A^π(s, a)
    = max_{π̄∈∆(A)^{|S|}} Σ_{s,a} ( d^π_µ(s) / (1−γ) ) (π̄(a|s) − π(a|s)) Q^π(s, a)
    = max_{π̄∈∆(A)^{|S|}} (π̄ − π)^⊤ ∇_π V^π(µ),

where the first step follows since max_π̄ is attained at an action which maximizes A^π(s, ·) (per state);
the second step follows as Σ_a π(a|s) A^π(s, a) = 0; the third step uses Σ_a (π̄(a|s) − π(a|s)) V^π(s) = 0
for all s; and the final step follows from the gradient expression (see (7)). Using this in (8),

V*(ρ) − V^π(ρ) ≤ ‖ d^{π*}_ρ / d^π_µ ‖_∞ max_{π̄∈∆(A)^{|S|}} (π̄ − π)^⊤ ∇_π V^π(µ)
             ≤ (1/(1−γ)) ‖ d^{π*}_ρ / µ ‖_∞ max_{π̄∈∆(A)^{|S|}} (π̄ − π)^⊤ ∇_π V^π(µ),

where the last step follows due to max_{π̄∈∆(A)^{|S|}} (π̄ − π)^⊤ ∇_π V^π(µ) ≥ 0 for any policy π and
d^π_µ(s) ≥ (1 − γ)µ(s) (see (4)).

In a sense, the use of an appropriate µ circumvents the issues of strategic exploration. It is natural
to ask whether this additional term is necessary, a question which we return to. First, we provide a
convergence rate for the projected gradient ascent algorithm.

4.2 Convergence Rates for Projected Gradient Ascent


Using this notion of gradient domination, we now give an iteration complexity bound for projected
gradient ascent over the space of stochastic policies, i.e. over ∆(A)|S| . The projected gradient ascent


algorithm updates
π^(t+1) = P_{∆(A)^{|S|}}( π^(t) + η ∇_π V^(t)(µ) ),    (9)

where P_{∆(A)^{|S|}} is the projection onto ∆(A)^{|S|} in the Euclidean norm.
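A minimal sketch of the update (9) is given below; the Euclidean projection onto the simplex uses the standard sorting-based routine, and grad_fn stands for any exact gradient oracle for (7) (for instance, the sketch given after (7)). The structure and names are ours, not the paper's:

```python
# A minimal sketch of projected gradient ascent (9) on the direct parameterization.
# grad_fn(pi) is assumed to return the exact gradient (7) as an (|S|, |A|) array.
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    tau = (css[idx] - 1.0) / (idx + 1)
    return np.maximum(v - tau, 0.0)

def projected_gradient_ascent(pi0, grad_fn, eta, T):
    pi = pi0.copy()
    for _ in range(T):
        pi = pi + eta * grad_fn(pi)                          # gradient step
        pi = np.apply_along_axis(project_simplex, 1, pi)     # project each row back onto Delta(A)
    return pi
```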

Theorem 5 The projected gradient ascent algorithm (9) on V^π(µ) with stepsize η = (1−γ)^3/(2γ|A|) satisfies,
for all distributions ρ ∈ ∆(S),

min_{t<T} { V*(ρ) − V^(t)(ρ) } ≤ ε    whenever    T > (64γ|S||A| / ((1−γ)^6 ε^2)) ‖ d^{π*}_ρ / µ ‖_∞^2.

A proof is provided in Appendix B.1. The proof first invokes a standard iteration complexity result
of projected gradient ascent to show that the gradient magnitude with respect to all feasible directions
is small. More concretely, we show the policy is ε-stationary³, that is, for all π_θ + δ ∈ ∆(A)^{|S|} and
‖δ‖_2 ≤ 1, δ^⊤ ∇_π V^{π_θ}(µ) ≤ ε. We then use Lemma 4 to complete the proof.
Note that the guarantee we provide is for the best policy found over the T rounds, which we
obtain from a bound on the average norm of the gradients. This type of a guarantee is standard in the
non-convex optimization literature, where an average regret bound cannot be used to extract a single
good solution, e.g. by averaging. In the context of policy optimization, this is not a serious limitation
as we collect on-policy trajectories for each policy in doing sample-based gradient estimation, and
these samples can be also used to estimate the policy’s value. Note that the evaluation step is not
required for every policy, and can also happen on a schedule, though we still need to evaluate O(T )
policies to obtain the convergence rates described here.

4.3 A Lower Bound: Vanishing Gradients and Saddle Points


To understand the necessity of the distribution mismatch coefficient in Lemma 4 and Theorem 5, let
us first give an informal argument that some condition on the state distribution of π, or equivalently
µ, is necessary for stationarity to imply optimality. For example, in a sparse-reward MDP (where
the agent is only rewarded upon visiting some small set of states), a policy that does not visit any
rewarding states will have zero gradient, even though it is arbitrarily suboptimal in terms of values.
Below, we give a more quantitative version of this intuition, which demonstrates that even if π
chooses all actions with reasonable probabilities (and hence the agent will visit all states if the MDP
is connected), then there is an MDP where a large fraction of the policies π have vanishingly small
gradients, and yet these policies are highly suboptimal in terms of their value.
Concretely, consider the chain MDP of length H + 2 shown in Figure 2. The starting state
of interest is state s0 and the discount factor γ = H/(H + 1). Suppose we work with the direct
parameterization, where πθ (a|s) = θs,a for a = a1 , a2 , a3 and πθ (a4 |s) = 1 − θs,a1 − θs,a2 − θs,a3 .
Note we do not over-parameterize the policy. For this MDP and policy structure, if we were to
initialize the probabilities over actions, say deterministically, then there is an MDP (obtained by
permuting the actions) where all the probabilities for a1 will be less than 1/4.
The following result not only shows that the gradient is exponentially small in H, it also shows
that many higher order derivatives, up to O(H/ log H), are also exponentially small in H.

3. See Appendix B.1 for discussion on this definition.


Proposition 6 (Vanishing gradients at suboptimal parameters) Consider the chain MDP of Fig-
ure 2, with H + 2 states, γ = H/(H + 1), and with the direct policy parameterization (with 3|S|
parameters, as described in the text above). Suppose θ is such that 0 < θ < 1 (componentwise) and
θ_{s,a_1} < 1/4 (for all states s). For all k ≤ H/(40 log(2H)) − 1, we have ‖∇^k_θ V^{π_θ}(s_0)‖ ≤ (1/3)^{H/4}, where
∇^k_θ V^{π_θ}(s_0) is a tensor of the k-th order derivatives of V^{π_θ}(s_0) and the norm is the operator norm of
the tensor.⁴ Furthermore, V*(s_0) − V^{π_θ}(s_0) ≥ (H + 1)/8 − (H + 1)^2/3^H.

This lemma also suggests that results in the non-convex optimization literature, on escaping from
saddle points, e.g. (Nesterov and Polyak, 2006; Ge et al., 2015; Jin et al., 2017), do not directly imply
global convergence, because the higher order derivatives are small.

Remark 7 (Exact vs. Approximate Gradients) The chain MDP of Figure 2, is a common example
where sample based estimates of gradients will be 0 under random exploration strategies; there is an
exponentially small in H chance of hitting the goal state under a random exploration strategy. Note
that this lemma is with regards to exact gradients. This suggests that even with exact computations
(along with using exact higher order derivatives) we might expect numerical instabilities.

Remark 8 (Comparison with the upper bound) The lower bound does not contradict the upper
bound of Theorem 4 (where a small gradient is turned into a small policy suboptimality bound), as
the distribution mismatch coefficient, as defined in Definition 3, could be infinite in the chain MDP of
Figure 2, since the start-state distribution is concentrated on one state only. More generally, for any
policy with θ_{s,a_1} < 1/4 in all states s, ‖ d^{π*}_ρ / d^{π_θ}_ρ ‖_∞ = Ω(4^H).

Remark 9 (Comparison with information-theoretic lower bounds) The lower bound here is not
information theoretic, in that it does not present a hard problem instance for all algorithms. Indeed,
exploration algorithms for tabular MDPs starting from E^3 (Kearns and Singh, 2002), RMAX (Braf-
man and Tennenholtz, 2003) and several subsequent works yield polynomial sample complexities
for the chain MDP. Proposition 6 should be interpreted as a hardness result for the specific class
of policy gradient like approaches that search for a policy with a small policy gradient, as these
methods will find the initial parameters to be valid in terms of the size of (several orders of) gradients.
In particular, it precludes any meaningful claims on global optimality, based just on the size of the
policy gradients, without additional assumptions as discussed in the previous remark.

The proof is provided in Appendix B.2. The lemma illustrates that lack of good exploration can
indeed be detrimental in policy gradient algorithms, since the gradient can be small either due to π
being near-optimal, or, simply because π does not visit advantageous states often enough. In this
sense, it also demonstrates the necessity of the distribution mismatch coefficient in Lemma 4.

5. The Softmax Tabular Parameterization


We now consider the softmax policy parameterization (3). Here, we still have a non-concave
optimization problem in general, as shown in Lemma 1, though we do show that global optimality
can be reached under certain regularity conditions. From a practical perspective, the softmax
4. The operator norm of a k-th order tensor J ∈ (R^d)^{⊗k} is defined as sup_{u_1,...,u_k ∈ R^d : ‖u_i‖_2 = 1} ⟨J, u_1 ⊗ . . . ⊗ u_k⟩.


parameterization of policies is preferable to the direct parameterization, since the parameters θ are
unconstrained and standard unconstrained optimization algorithms can be employed. However,
optimization over this policy class creates other challenges as we study in this section, as the optimal
policy (which is deterministic) is attained by sending the parameters to infinity.
We study three algorithms for this problem. The first performs direct policy gradient ascent
on the objective without modification, while the second adds a log barrier regularizer to keep the
parameters from becoming too large, as a means to ensure adequate exploration. Finally, we study
the natural policy gradient algorithm and establish a global optimality result with no dependence on
the distribution mismatch coefficient or dimension-dependent factors.
For the softmax parameterization, the gradient takes the form:

∂V^{π_θ}(µ)/∂θ_{s,a} = (1/(1−γ)) d^{π_θ}_µ(s) π_θ(a|s) A^{π_θ}(s, a)    (10)

(see Lemma 40 for a proof).
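As with (7), the gradient (10) can be computed exactly in the tabular case. The sketch below is our own illustration (array conventions as in the earlier sketches) and makes the d^{π_θ}_µ(s) π_θ(a|s) A^{π_θ}(s, a) structure of (10) explicit; note how the π_θ(a|s) factor can drive entries of the gradient toward zero for near-deterministic policies, which is exactly the difficulty discussed in the next subsection:

```python
# A minimal sketch: the exact softmax policy gradient (10) for a tabular MDP.
# Assumed conventions: P[s, a, s'], r[s, a], theta of shape (|S|, |A|), mu[s].
import numpy as np

def softmax_policy_gradient(P, r, theta, mu, gamma):
    S, A = r.shape
    z = theta - theta.max(axis=1, keepdims=True)
    pi = np.exp(z)
    pi /= pi.sum(axis=1, keepdims=True)                      # pi_theta(a|s)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    Adv = Q - V[:, None]                                     # advantage A^{pi_theta}(s, a)
    d_mu = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return d_mu[:, None] * pi * Adv / (1.0 - gamma)          # gradient (10), shape (|S|, |A|)
```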

5.1 Asymptotic Convergence, without Regularization


Due to the exponential scaling with the parameters θ in the softmax parameterization, any policy
that is nearly deterministic will have gradients close to 0. In spite of this difficulty, we provide a
positive result that gradient ascent asymptotically converges to the global optimum for the softmax
parameterization.
The update rule for gradient ascent is:

θ(t+1) = θ(t) + η∇θ V (t) (µ). (11)

Theorem 10 (Global convergence for softmax parameterization) Assume we follow the gradient
ascent update rule as specified in Equation (11) and that the distribution µ is strictly positive, i.e.
µ(s) > 0 for all states s. Suppose η ≤ (1−γ)^3/8; then we have that for all states s, V^(t)(s) → V*(s)
as t → ∞.

Remark 11 (Strict positivity of µ and exploration) Theorem 10 assumed that optimization distribu-
tion µ was strictly positive, i.e. µ(s) > 0 for all states s. We leave it as an open question whether
or not gradient ascent will globally converge if this condition is not met. The concern is that if this
condition is not met, then gradient ascent may not globally converge, because d^{π_θ}_µ(s) effectively
scales down the learning rate for the parameters associated with state s (see (10)).

The complete proof is provided in Appendix C.1. We now discuss the subtleties in the proof
and show why the softmax parameterization precludes a direct application of the gradient domination
lemma. In order to utilize the gradient domination property (in Lemma 4), we would desire to show
that: ∇π V π (µ) → 0. However, using the functional form of the softmax parameterization (see
Lemma 40) and (7), we have that:

∂V^{π_θ}(µ)/∂θ_{s,a} = (1/(1−γ)) d^{π_θ}_µ(s) π_θ(a|s) A^{π_θ}(s, a) = π_θ(a|s) · ∂V^{π_θ}(µ)/∂π_θ(a|s).

Hence, we see that even if ∇θ V πθ (µ) → 0, we are not guaranteed that ∇π V πθ (µ) → 0.


We now briefly discuss the main technical challenges in the proof. The proof first shows that
the sequence V (t) (s) is monotone increasing pointwise, i.e. for every state s, V (t+1) (s) ≥ V (t) (s)
(Lemma 41). This implies the existence of a limit V (∞) (s) by the monotone convergence theorem
(Lemma 42). Based on the limiting quantities V (∞) (s) and Q(∞) (s, a), which we show exist, define
the following limiting sets for each state s:

I^s_0 := {a | Q^(∞)(s, a) = V^(∞)(s)},
I^s_+ := {a | Q^(∞)(s, a) > V^(∞)(s)},
I^s_− := {a | Q^(∞)(s, a) < V^(∞)(s)}.

The challenge is to then show that, for all states s, the set I^s_+ is the empty set, which would
immediately imply V^(∞)(s) = V*(s). The proof proceeds by contradiction, assuming that I^s_+ is non-
empty. Using that I^s_+ is non-empty and that the gradient tends to zero in the limit, i.e. ∇_θ V^{π_θ}(µ) → 0,
we have that for all a ∈ I^s_+, π^(t)(a|s) → 0 (see (10)). This, along with the functional form of the
softmax parameterization, implies that there must be divergence (in magnitude) among the set of
parameters associated with some action a at state s, i.e. that max_{a∈A} |θ^(t)_{s,a}| → ∞. The primary
technical challenge in the proof is to then use this divergence, along with the dynamics of gradient
ascent, to show that I^s_+ is empty via a contradiction.

We leave characterizing the convergence rate as a question for future work, though we conjecture it is
exponentially slow in some of the relevant quantities, such as the size of the state space.
Here, we turn to a regularization based approach to ensure convergence at a polynomial
rate in all relevant quantities.

5.2 Polynomial Convergence with Log Barrier Regularization


Due to the exponential scaling with the parameters θ, policies can rapidly become near deterministic,
when optimizing under the softmax parameterization, which can result in slow convergence. Indeed
a key challenge in the asymptotic analysis in the previous section was to handle the growth of the
absolute values of parameters as they tend to infinity. A common practical remedy for this is to
use entropy-based regularization to keep the probabilities from getting too small (Williams and
Peng, 1991; Mnih et al., 2016), and we study gradient ascent on a similarly regularized objective
in this section. Recall that the relative-entropy for distributions p and q is defined as: KL(p, q) :=
Ex∼p [− log q(x)/p(x)]. Denote the uniform distribution over a set X by UnifX , and define the
following log barrier regularized objective as:
 
L_λ(θ) := V^{π_θ}(µ) − λ E_{s∼Unif_S}[ KL(Unif_A, π_θ(·|s)) ]
        = V^{π_θ}(µ) + (λ/(|S||A|)) Σ_{s,a} log π_θ(a|s) + λ log |A|,    (12)

where λ is a regularization parameter. The constant (i.e. the last term) is not relevant with regards to
optimization. This regularizer is different from the more commonly utilized entropy regularizer as in
Mnih et al. (2016), a point which we return to in Remark 14.
The policy gradient ascent updates for Lλ (θ) are given by:

$$\theta^{(t+1)} = \theta^{(t)} + \eta \nabla_\theta L_\lambda(\theta^{(t)}). \qquad (13)$$
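For concreteness, the following is a minimal sketch (not the paper's implementation) of gradient ascent on $L_\lambda$ for the tabular softmax parameterization, assuming a small MDP given explicitly by transition and reward arrays and using exact policy evaluation; all array and function names are illustrative.

```python
# A minimal sketch of gradient ascent on the log-barrier regularized objective
# L_lambda for the tabular softmax parameterization. Assumptions (illustrative):
# P[s, a, s'] are transition probabilities, r[s, a] are rewards, mu is the start
# distribution, and policy evaluation is done exactly by solving linear systems.
import numpy as np

def softmax_policy(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)          # pi[s, a]

def value_and_advantage(pi, P, r, gamma):
    S, A = r.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)            # state transitions under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * P @ V
    return V, Q - V[:, None]

def state_visitation(pi, P, mu, gamma):
    S = len(mu)
    P_pi = np.einsum("sap,sa->sp", P, pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu)

def log_barrier_pg(P, r, mu, gamma, lam, eta, iters):
    S, A = r.shape
    theta = np.zeros((S, A))
    for _ in range(iters):
        pi = softmax_policy(theta)
        _, Adv = value_and_advantage(pi, P, r, gamma)
        d = state_visitation(pi, P, mu, gamma)
        # gradient of V^{pi_theta}(mu) plus the log-barrier term, as in (14)
        grad = d[:, None] * pi * Adv / (1 - gamma) + lam / S * (1.0 / A - pi)
        theta += eta * grad
    return softmax_policy(theta)
```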


Our next theorem shows that approximate first-order stationary points of the entropy-regularized
objective are approximately globally optimal, provided the regularization is sufficiently small.

Theorem 12 (Log barrier regularization) Suppose $\theta$ is such that:
$$\|\nabla_\theta L_\lambda(\theta)\|_2 \leq \epsilon_{\mathrm{opt}}$$
and $\epsilon_{\mathrm{opt}} \leq \lambda/(2|S|\,|A|)$. Then we have that for all starting state distributions $\rho$:
$$V^{\pi_\theta}(\rho) \geq V^\star(\rho) - \frac{2\lambda}{1-\gamma} \left\| \frac{d^{\pi^\star}_\rho}{\mu} \right\|_\infty.$$

Proof The proof consists of showing that $\max_a A^{\pi_\theta}(s,a) \leq 2\lambda/(\mu(s)|S|)$ for all states. To see that this is sufficient, observe that by the performance difference lemma (Lemma 2),
$$\begin{aligned}
V^\star(\rho) - V^{\pi_\theta}(\rho) &= \frac{1}{1-\gamma} \sum_{s,a} d^{\pi^\star}_\rho(s)\, \pi^\star(a|s)\, A^{\pi_\theta}(s,a) \\
&\leq \frac{1}{1-\gamma} \sum_s d^{\pi^\star}_\rho(s)\, \max_{a \in A} A^{\pi_\theta}(s,a) \\
&\leq \frac{1}{1-\gamma} \sum_s 2\, d^{\pi^\star}_\rho(s)\, \lambda/(\mu(s)|S|) \\
&\leq \frac{2\lambda}{1-\gamma}\, \max_s \left( \frac{d^{\pi^\star}_\rho(s)}{\mu(s)} \right),
\end{aligned}$$

which would then complete the proof.


We now proceed to show that $\max_a A^{\pi_\theta}(s,a) \leq 2\lambda/(\mu(s)|S|)$. For this, it suffices to bound $A^{\pi_\theta}(s,a)$ for any state-action pair $(s,a)$ where $A^{\pi_\theta}(s,a) \geq 0$; else the claim is trivially true. Consider an $(s,a)$ pair such that $A^{\pi_\theta}(s,a) > 0$. Using the policy gradient expression for the softmax parameterization (see Lemma 40),
$$\frac{\partial L_\lambda(\theta)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_\theta}_\mu(s)\, \pi_\theta(a|s)\, A^{\pi_\theta}(s,a) + \frac{\lambda}{|S|}\left(\frac{1}{|A|} - \pi_\theta(a|s)\right). \qquad (14)$$
The gradient norm assumption $\|\nabla_\theta L_\lambda(\theta)\|_2 \leq \epsilon_{\mathrm{opt}}$ implies that:
$$\epsilon_{\mathrm{opt}} \geq \frac{\partial L_\lambda(\theta)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_\theta}_\mu(s)\, \pi_\theta(a|s)\, A^{\pi_\theta}(s,a) + \frac{\lambda}{|S|}\left(\frac{1}{|A|} - \pi_\theta(a|s)\right) \geq \frac{\lambda}{|S|}\left(\frac{1}{|A|} - \pi_\theta(a|s)\right),$$
where we have used $A^{\pi_\theta}(s,a) \geq 0$. Rearranging and using our assumption $\epsilon_{\mathrm{opt}} \leq \lambda/(2|S|\,|A|)$,
$$\pi_\theta(a|s) \geq \frac{1}{|A|} - \frac{\epsilon_{\mathrm{opt}}\,|S|}{\lambda} \geq \frac{1}{2|A|}.$$


Solving for $A^{\pi_\theta}(s,a)$ in (14), we have:
$$\begin{aligned}
A^{\pi_\theta}(s,a) &= \frac{1-\gamma}{d^{\pi_\theta}_\mu(s)} \left( \frac{1}{\pi_\theta(a|s)} \frac{\partial L_\lambda(\theta)}{\partial \theta_{s,a}} + \frac{\lambda}{|S|}\left(1 - \frac{1}{\pi_\theta(a|s)|A|}\right) \right) \\
&\leq \frac{1-\gamma}{d^{\pi_\theta}_\mu(s)} \left( 2|A|\epsilon_{\mathrm{opt}} + \frac{\lambda}{|S|} \right)
\leq 2\, \frac{1-\gamma}{d^{\pi_\theta}_\mu(s)}\, \frac{\lambda}{|S|}
\leq 2\lambda/(\mu(s)|S|)\,,
\end{aligned}$$
where the penultimate step uses $\epsilon_{\mathrm{opt}} \leq \lambda/(2|S|\,|A|)$ and the final step uses $d^{\pi_\theta}_\mu(s) \geq (1-\gamma)\mu(s)$. This completes the proof.

By combining the above theorem with standard results on the convergence of gradient ascent (to
first order stationary points), we obtain the following corollary.
Corollary 13 (Iteration complexity with log barrier regularization) Let $\beta_\lambda := \frac{8\gamma}{(1-\gamma)^3} + \frac{2\lambda}{|S|}$. Starting from any initial $\theta^{(0)}$, consider the updates (13) with $\lambda = \frac{\epsilon(1-\gamma)}{2 \left\| d^{\pi^\star}_\rho / \mu \right\|_\infty}$ and $\eta = 1/\beta_\lambda$. Then for all starting state distributions $\rho$, we have
$$\min_{t<T} \left\{ V^\star(\rho) - V^{(t)}(\rho) \right\} \leq \epsilon \quad \text{whenever} \quad T \geq \frac{320\, |S|^2 |A|^2}{(1-\gamma)^6\, \epsilon^2} \left\| \frac{d^{\pi^\star}_\rho}{\mu} \right\|_\infty^2.$$

See Appendix C.2 for the proof. The corollary shows the importance of balancing how the regularization parameter $\lambda$ is set relative to the desired accuracy $\epsilon$, as well as the importance of the initial distribution $\mu$ to obtain global optimality.

Remark 14 (Entropy vs. log barrier regularization) The more commonly considered regularizer
is the entropy (Mnih et al., 2016) (also see Ahmed et al. (2019) for a more detailed empirical
investigation), where the regularizer would be:
$$\frac{1}{|S|} \sum_s H(\pi_\theta(\cdot|s)) = \frac{1}{|S|} \sum_s \sum_a -\pi_\theta(a|s) \log \pi_\theta(a|s).$$

Note the entropy is far less aggressive in penalizing small probabilities, in comparison to the log
barrier, which is equivalent to the relative entropy. In particular, the entropy regularizer is always
bounded between 0 and log |A|, while the relative entropy (against the uniform distribution over
actions), is bounded between 0 and infinity, where it tends to infinity as probabilities tend to 0. We
leave it as an open question if a polynomial convergence rate⁵ is achievable with the more common entropy regularizer; our polynomial convergence rate using the KL regularizer crucially relies on the aggressive nature in which the relative entropy prevents small probabilities (the proof shows that any action with a positive advantage has a significant probability under any near-stationary policy of the regularized objective).

5. Here, ideally we would like to be poly in $|S|$, $|A|$, $1/(1-\gamma)$, $1/\epsilon$, and the distribution mismatch coefficient, which we conjecture may not be possible.


5.3 Dimension-free Convergence of Natural Policy Gradient Ascent


We now show the Natural Policy Gradient algorithm, with the softmax parameterization (3), obtains
an improved iteration complexity. The NPG algorithm defines a Fisher information matrix (induced
by π), and performs gradient updates in the geometry induced by this matrix as follows:
$$F_\rho(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}_\rho}\, \mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s) \left( \nabla_\theta \log \pi_\theta(a|s) \right)^\top \right],$$
$$\theta^{(t+1)} = \theta^{(t)} + \eta\, F_\rho(\theta^{(t)})^\dagger \nabla_\theta V^{(t)}(\rho), \qquad (15)$$

where M † denotes the Moore-Penrose pseudoinverse of the matrix M . Throughout this section, we
restrict to using the initial state distribution ρ ∈ ∆(S) in our update rule in (15) (so our optimization
measure µ and the performance measure ρ are identical). Also, we restrict attention to states s ∈ S
reachable from ρ, since, without loss of generality, we can exclude states that are not reachable under
this start state distribution6 .
We leverage a particularly convenient form the update takes for the softmax parameterization
(see Kakade (2001)). For completeness, we provide a proof in Appendix C.3.

Lemma 15 (NPG as soft policy iteration) For the softmax parameterization (3), the NPG up-
dates (15) take the form:

$$\theta^{(t+1)} = \theta^{(t)} + \frac{\eta}{1-\gamma}\, A^{(t)} \quad \text{and} \quad \pi^{(t+1)}(a|s) = \pi^{(t)}(a|s)\, \frac{\exp\!\big(\eta A^{(t)}(s,a)/(1-\gamma)\big)}{Z_t(s)}\,,$$
where $Z_t(s) = \sum_{a \in A} \pi^{(t)}(a|s) \exp\!\big(\eta A^{(t)}(s,a)/(1-\gamma)\big)$.

The updates take a strikingly simple form in this special case; they are identical to the classical
multiplicative weights updates (Freund and Schapire, 1997; Cesa-Bianchi and Lugosi, 2006) for
online linear optimization over the probability simplex, where the linear functions are specified by
the advantage function of the current policy at each iteration. Notably, there is no dependence on the state distribution $d^{(t)}_\rho$, since the pseudoinverse of the Fisher information cancels out the effect of the state distribution in NPG. We now provide a dimension-free convergence rate for this algorithm.
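To make the multiplicative-weights view concrete, here is a minimal sketch (not the paper's code) of the tabular NPG update from Lemma 15, under the assumption of exact advantage evaluation on a small MDP specified explicitly by arrays `P[s, a, s']` and `r[s, a]`.

```python
# A minimal sketch of tabular NPG as soft policy iteration (Lemma 15), assuming
# exact advantage evaluation on a small MDP given by P[s, a, s'] and r[s, a].
import numpy as np

def advantages(pi, P, r, gamma):
    S, A = r.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * P @ V
    return Q - V[:, None]

def npg_softmax(P, r, gamma, eta, iters):
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)                  # uniform initialization
    for _ in range(iters):
        Adv = advantages(pi, P, r, gamma)
        # multiplicative-weights update: pi <- pi * exp(eta * A / (1 - gamma)), renormalized
        pi = pi * np.exp(eta * Adv / (1.0 - gamma))
        pi = pi / pi.sum(axis=1, keepdims=True)    # divide by Z_t(s)
    return pi
```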

Theorem 16 (Global convergence for NPG) Suppose we run the NPG updates (15) using $\rho \in \Delta(S)$ and with $\theta^{(0)} = 0$. Fix $\eta > 0$. For all $T > 0$, we have:
$$V^{(T)}(\rho) \geq V^\star(\rho) - \frac{\log |A|}{\eta T} - \frac{1}{(1-\gamma)^2 T}.$$
In particular, setting $\eta \geq (1-\gamma)^2 \log |A|$, we see that NPG finds an $\epsilon$-optimal policy in a number of iterations that is at most:
$$T \leq \frac{2}{(1-\gamma)^2 \epsilon}\,,$$
which has no dependence on the number of states or actions, despite the non-concavity of the underlying optimization problem.
6. Specifically, we restrict the MDP to the set of states {s ∈ S : ∃π such that dπρ (s) > 0}.


The proof strategy we take borrows ideas from the online regret framework in changing MDPs
(in (Even-Dar et al., 2009)); here, we provide a faster rate of convergence than the analysis implied
by Even-Dar et al. (2009) or by Geist et al. (2019). We also note that while this proof is obtained for
the NPG updates, it is known in the literature that in the limit of small stepsizes, NPG and TRPO
updates are closely related (e.g. see Schulman et al. (2015); Neu et al. (2017); Rajeswaran et al.
(2017)).
First, the following improvement lemma is helpful:

Lemma 17 (Improvement lower bound for NPG) For the iterates π (t) generated by the NPG up-
dates (15), we have for all starting state distributions µ

$$V^{(t+1)}(\mu) - V^{(t)}(\mu) \geq \frac{1-\gamma}{\eta}\, \mathbb{E}_{s \sim \mu} \log Z_t(s) \geq 0.$$

Proof First, let us show that log Zt (s) ≥ 0. To see this, observe:

$$\log Z_t(s) = \log \sum_a \pi^{(t)}(a|s) \exp\!\big(\eta A^{(t)}(s,a)/(1-\gamma)\big)
\geq \sum_a \pi^{(t)}(a|s) \log \exp\!\big(\eta A^{(t)}(s,a)/(1-\gamma)\big) = \frac{\eta}{1-\gamma} \sum_a \pi^{(t)}(a|s) A^{(t)}(s,a) = 0,$$
where the inequality follows by Jensen's inequality on the concave function $\log x$ and the final equality uses $\sum_a \pi^{(t)}(a|s) A^{(t)}(s,a) = 0$. Using $d^{(t+1)}$ as shorthand for $d^{(t+1)}_\mu$, the performance difference lemma implies:
$$\begin{aligned}
V^{(t+1)}(\mu) - V^{(t)}(\mu) &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{(t+1)}} \sum_a \pi^{(t+1)}(a|s) A^{(t)}(s,a) \\
&= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}} \sum_a \pi^{(t+1)}(a|s) \log \frac{\pi^{(t+1)}(a|s)\, Z_t(s)}{\pi^{(t)}(a|s)} \\
&= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}} \mathrm{KL}\big(\pi^{(t+1)}_s \,\|\, \pi^{(t)}_s\big) + \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}} \log Z_t(s) \\
&\geq \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}} \log Z_t(s) \geq \frac{1-\gamma}{\eta}\, \mathbb{E}_{s \sim \mu} \log Z_t(s),
\end{aligned}$$
where the last step uses that $d^{(t+1)} = d^{(t+1)}_\mu \geq (1-\gamma)\mu$, componentwise (by (4)), and that $\log Z_t(s) \geq 0$.

With this lemma, we now prove Theorem 16.


Proof [of Theorem 16] Since $\rho$ is fixed, we use $d^\star$ as shorthand for $d^{\pi^\star}_\rho$; we also use $\pi_s$ as shorthand for the vector $\pi(\cdot|s)$. By the performance difference lemma (Lemma 2),
$$\begin{aligned}
V^{\pi^\star}(\rho) - V^{(t)}(\rho) &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\star} \sum_a \pi^\star(a|s) A^{(t)}(s,a) \\
&= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^\star} \sum_a \pi^\star(a|s) \log \frac{\pi^{(t+1)}(a|s)\, Z_t(s)}{\pi^{(t)}(a|s)} \\
&= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^\star} \left( \mathrm{KL}(\pi^\star_s \| \pi^{(t)}_s) - \mathrm{KL}(\pi^\star_s \| \pi^{(t+1)}_s) + \sum_a \pi^\star(a|s) \log Z_t(s) \right) \\
&= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^\star} \left( \mathrm{KL}(\pi^\star_s \| \pi^{(t)}_s) - \mathrm{KL}(\pi^\star_s \| \pi^{(t+1)}_s) + \log Z_t(s) \right),
\end{aligned}$$
where we have used the closed form of our updates from Lemma 15 in the second step.
By applying Lemma 17 with $d^\star$ as the starting state distribution, we have:
$$\frac{1}{\eta}\, \mathbb{E}_{s \sim d^\star} \log Z_t(s) \leq \frac{1}{1-\gamma} \left( V^{(t+1)}(d^\star) - V^{(t)}(d^\star) \right),$$
which gives us a bound on $\mathbb{E}_{s \sim d^\star} \log Z_t(s)$.
Using the above equation and that $V^{(t+1)}(\rho) \geq V^{(t)}(\rho)$ (as $V^{(t+1)}(s) \geq V^{(t)}(s)$ for all states $s$ by Lemma 17), we have:
$$\begin{aligned}
V^{\pi^\star}(\rho) - V^{(T-1)}(\rho) &\leq \frac{1}{T} \sum_{t=0}^{T-1} \left( V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right) \\
&\leq \frac{1}{\eta T} \sum_{t=0}^{T-1} \mathbb{E}_{s \sim d^\star} \big( \mathrm{KL}(\pi^\star_s \| \pi^{(t)}_s) - \mathrm{KL}(\pi^\star_s \| \pi^{(t+1)}_s) \big) + \frac{1}{\eta T} \sum_{t=0}^{T-1} \mathbb{E}_{s \sim d^\star} \log Z_t(s) \\
&\leq \frac{\mathbb{E}_{s \sim d^\star} \mathrm{KL}(\pi^\star_s \| \pi^{(0)})}{\eta T} + \frac{1}{(1-\gamma) T} \sum_{t=0}^{T-1} \left( V^{(t+1)}(d^\star) - V^{(t)}(d^\star) \right) \\
&= \frac{\mathbb{E}_{s \sim d^\star} \mathrm{KL}(\pi^\star_s \| \pi^{(0)})}{\eta T} + \frac{V^{(T)}(d^\star) - V^{(0)}(d^\star)}{(1-\gamma) T} \\
&\leq \frac{\log |A|}{\eta T} + \frac{1}{(1-\gamma)^2 T}.
\end{aligned}$$
The proof is completed using that $V^{(T)}(\rho) \geq V^{(T-1)}(\rho)$.

6. Function Approximation and Distribution Shift


We now analyze the case of using parametric policy classes:

Π = {πθ | θ ∈ Rd },

where Π may not contain all stochastic policies (and it may not even contain an optimal policy). In
contrast with the tabular results in the previous sections, the policy classes that we are often interested

in are not fully expressive, e.g. $d \ll |S||A|$ (indeed $|S|$ or $|A|$ need not even be finite for the results
in this section); in this sense, we are in the regime of function approximation.
We focus on obtaining agnostic results, where we seek to do as well as the best policy in this
class (or as well as some other comparator policy). While we are interested in a solution to the
(unconstrained) policy optimization problem

$$\max_{\theta \in \mathbb{R}^d} V^{\pi_\theta}(\rho)$$
(for a given initial distribution $\rho$), we will see that optimization with respect to a different distribution will be helpful, just as in the tabular case.
We will consider variants of the NPG update rule (15):

$$\theta \leftarrow \theta + \eta\, F_\rho(\theta)^\dagger \nabla_\theta V^{\theta}(\rho). \qquad (16)$$

Our analysis will leverage a close connection between the NPG update rule (15) with the notion of
compatible function approximation (Sutton et al., 1999), as formalized in Kakade (2001). Specifically,
it can be easily seen that:
$$F_\rho(\theta)^\dagger \nabla_\theta V^{\theta}(\rho) = \frac{1}{1-\gamma}\, w_\star, \qquad (17)$$
where $w_\star$ is a minimizer of the following regression problem:
$$w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi_\theta}_\rho,\, a \sim \pi_\theta(\cdot|s)} \left[ \big( w^\top \nabla_\theta \log \pi_\theta(a|s) - A^{\pi_\theta}(s,a) \big)^2 \right].$$

The above is a straightforward consequence of the first order optimality conditions (see (50)).
The above regression problem can be viewed as “compatible” function approximation: we are
approximating Aπθ (s, a) using the ∇θ log πθ (·|s) as features. We also consider a variant of the above
update rule, Q-NPG, where instead of using advantages in the above regression we use the Q-values.
This viewpoint provides a methodology for approximate updates, where we can solve the relevant
regression problems with samples. Our main results establish the effectiveness of NPG updates
where there is error both due to statistical estimation (where we may not use exact gradients) and
approximation (due to using a parameterized function class); in particular, we provide a novel
estimation/approximation decomposition relevant for the NPG algorithm. For these algorithms, we
will first consider log-linear policy classes (as a special case) and then move on to more general
policy classes (such as neural policy classes). Finally, it is worth remarking that the results herein
provide one of the first provable approximation guarantees where the error conditions required do
not have explicit worst case dependencies over the state space.
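To make the connection concrete, here is a minimal sketch (under assumptions not stated in the paper) of how one NPG step can be computed by solving the compatible regression problem in (17) with least squares: the features are the score vectors $\nabla_\theta \log \pi_\theta(a|s)$, the targets are advantage estimates, and the fitted weights give the update direction. The data format and helper names below are illustrative only.

```python
# A minimal sketch of one NPG step via compatible function approximation (17):
# regress advantage estimates on the score features grad_theta log pi_theta(a|s),
# then step the parameters in the direction of the fitted weights.
# Assumptions (illustrative): `data` is a list of (score_vector, advantage_estimate)
# pairs collected with s ~ d_rho^{pi_theta}, a ~ pi_theta(.|s).
import numpy as np

def npg_step(theta, data, eta, ridge=1e-8):
    X = np.stack([score for score, _ in data])      # rows: grad log pi_theta(a|s)
    y = np.array([adv for _, adv in data])          # advantage estimates
    # least-squares solution of the compatible regression problem
    w_star = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
    # NPG update direction (the learning rate eta absorbs the 1/(1-gamma) scaling)
    return theta + eta * w_star
```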

6.1 NPG and Q-NPG Examples


In practice, the most common policy classes are of the form:
$$\Pi = \left\{ \pi_\theta(a|s) = \frac{\exp\!\big(f_\theta(s,a)\big)}{\sum_{a' \in A} \exp\!\big(f_\theta(s,a')\big)} \;\Big|\; \theta \in \mathbb{R}^d \right\}, \qquad (18)$$

where fθ is a differentiable function. For example, the tabular softmax policy class is one where
fθ (s, a) = θs,a . Typically, fθ is either a linear function or a neural network. Let us consider the NPG
algorithm, and a variant Q-NPG, in each of these two cases.


6.1.1 Log-linear Policy Classes and Soft Policy Iteration


For any state-action pair (s, a), suppose we have a feature mapping φs,a ∈ Rd . Each policy in the
log-linear policy class is of the form:

$$\pi_\theta(a|s) = \frac{\exp(\theta \cdot \phi_{s,a})}{\sum_{a' \in A} \exp(\theta \cdot \phi_{s,a'})}\,,$$

with θ ∈ Rd . Here, we can take fθ (s, a) = θ · φs,a .


With regards to compatible function approximation for the log-linear policy class, we have:
$$\nabla_\theta \log \pi_\theta(a|s) = \bar{\phi}^{\,\theta}_{s,a}, \quad \text{where} \quad \bar{\phi}^{\,\theta}_{s,a} = \phi_{s,a} - \mathbb{E}_{a' \sim \pi_\theta(\cdot|s)}[\phi_{s,a'}],$$
that is, $\bar{\phi}^{\,\theta}_{s,a}$ is the centered version of $\phi_{s,a}$. With some abuse of notation, we accordingly also define $\bar{\phi}^{\,\pi}$ for any policy $\pi$. Here, using (17), the NPG update rule (16) is equivalent to:
$$\text{NPG:} \quad \theta \leftarrow \theta + \eta w_\star, \quad w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi_\theta}_\rho,\, a \sim \pi_\theta(\cdot|s)} \left[ \big( A^{\pi_\theta}(s,a) - w \cdot \bar{\phi}^{\,\theta}_{s,a} \big)^2 \right].$$

(We have rescaled the learning rate η in comparison to (16)). Note that we recompute w? for
every update of θ. Here, the compatible function approximation error measures the expressivity of
our parameterization in how well linear functions of the parameterization can capture the policy’s
advantage function.
We also consider a variant of the NPG update rule (16), termed Q-NPG, where:
$$\text{Q-NPG:} \quad \theta \leftarrow \theta + \eta w_\star, \quad w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi_\theta}_\rho,\, a \sim \pi_\theta(\cdot|s)} \left[ \big( Q^{\pi_\theta}(s,a) - w \cdot \phi_{s,a} \big)^2 \right].$$

Note we do not center the features for Q-NPG; observe that Qπ (s, a) is also not 0 in expectation
under π(·|s), unlike the advantage function.

Remark 18 (NPG/Q-NPG and Soft-Policy Iteration) We now see how we can view both NPG and
Q-NPG as an incremental (soft) version of policy iteration, just as in Lemma 15 for the tabular case.
Rather than writing the update rule in terms of the parameter θ, we can write an equivalent update
rule directly in terms of the (log-linear) policy π:
$$\text{NPG:} \quad \pi(a|s) \leftarrow \pi(a|s)\exp(w_\star \cdot \phi_{s,a})/Z_s, \quad w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi}_\rho,\, a \sim \pi(\cdot|s)} \left[ \big( A^{\pi}(s,a) - w \cdot \bar{\phi}^{\,\pi}_{s,a} \big)^2 \right],$$
where $Z_s$ is a normalization constant. While the policy update uses the original features $\phi$ whereas the quadratic error minimization is in terms of the centered features $\bar{\phi}^{\,\pi}$, this distinction is not relevant because we may also instead use $\bar{\phi}^{\,\pi}$ in the policy update, which would result in an equivalent update; the normalization makes the update invariant to (constant) translations of the features. Similarly, an equivalent update for Q-NPG, where we update $\pi$ directly rather than $\theta$, is:
$$\text{Q-NPG:} \quad \pi(a|s) \leftarrow \pi(a|s)\exp(w_\star \cdot \phi_{s,a})/Z_s, \quad w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi}_\rho,\, a \sim \pi(\cdot|s)} \left[ \big( Q^{\pi}(s,a) - w \cdot \phi_{s,a} \big)^2 \right].$$

Remark 19 (On the equivalence of NPG and Q-NPG) If it is the case that the compatible function
approximation error is 0, then it is straightforward to verify that NPG and Q-NPG are equivalent
algorithms, in that their corresponding policy updates will be equivalent to each other.
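For illustration, a minimal sketch of the soft policy iteration form of the Q-NPG update from Remark 18 is given below, for a log-linear policy over a finite action set; the array names (`Phi`, `w_star`) are assumptions made for the sketch, not objects defined in the paper.

```python
# A minimal sketch of the Q-NPG policy update from Remark 18: reweight the current
# policy multiplicatively by exp(w_star . phi_{s,a}) and renormalize per state.
# Phi[s, a, :] is assumed to hold the feature vector phi_{s,a}; w_star holds the
# fitted regression weights.
import numpy as np

def q_npg_policy_update(pi, Phi, w_star):
    logits = Phi @ w_star                              # (S, A): w_star . phi_{s,a}
    # subtracting the per-state max is a numerical safeguard; it cancels in the
    # normalization (the update is invariant to constant translations of the features)
    new_pi = pi * np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_pi / new_pi.sum(axis=1, keepdims=True)  # divide by Z_s
```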


6.1.2 Neural Policy Classes


Now suppose $f_\theta(s,a)$ is a neural network parameterized by $\theta \in \mathbb{R}^d$, where the policy class $\Pi$ is of the form in (18). Observe:
$$\nabla_\theta \log \pi_\theta(a|s) = g_\theta(s,a), \quad \text{where} \quad g_\theta(s,a) = \nabla_\theta f_\theta(s,a) - \mathbb{E}_{a' \sim \pi_\theta(\cdot|s)}[\nabla_\theta f_\theta(s,a')],$$
and, using (17), the NPG update rule (16) is equivalent to:
$$\text{NPG:} \quad \theta \leftarrow \theta + \eta w_\star, \quad w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi_\theta}_\rho,\, a \sim \pi_\theta(\cdot|s)} \left[ \big( A^{\pi_\theta}(s,a) - w \cdot g_\theta(s,a) \big)^2 \right]$$
(again, we have rescaled the learning rate $\eta$ in comparison to (16)). The Q-NPG variant of this update rule is:
$$\text{Q-NPG:} \quad \theta \leftarrow \theta + \eta w_\star, \quad w_\star \in \mathop{\mathrm{argmin}}_w\; \mathbb{E}_{s \sim d^{\pi_\theta}_\rho,\, a \sim \pi_\theta(\cdot|s)} \left[ \big( Q^{\pi_\theta}(s,a) - w \cdot \nabla_\theta f_\theta(s,a) \big)^2 \right].$$

6.2 Q-NPG: Performance Bounds for Log-Linear Policies


For a state-action distribution υ, define:
 
$$L(w; \theta, \upsilon) := \mathbb{E}_{s,a \sim \upsilon} \left[ \big( Q^{\pi_\theta}(s,a) - w \cdot \phi_{s,a} \big)^2 \right].$$

The iterates of the Q-NPG algorithm can be viewed as minimizing this loss under some (changing)
distribution υ.
We now specify an approximate version of Q-NPG. It is helpful to consider a slightly more
general version of the algorithm in the previous section, where instead of optimizing under a starting
state distribution ρ, we have a different starting state-action distribution ν. Analogous to the definition
of the state visitation measure, dπµ , we can define a visitation measure over states and actions induced
by following π after s0 , a0 ∼ ν. We overload notation using dπν to also refer to the state-action
visitation measure; precisely,

$$d^\pi_\nu(s,a) := (1-\gamma)\, \mathbb{E}_{s_0, a_0 \sim \nu} \sum_{t=0}^{\infty} \gamma^t \Pr{}^\pi(s_t = s, a_t = a \,|\, s_0, a_0), \qquad (19)$$

where Prπ (st = s, at = a|s0 , a0 ) is the probability that st = s and at = a, after starting at state s0 ,
taking action a0 , and following π thereafter. While we overload notation for visitation distributions
(dπµ (s) and dπν (s, a)) for notational convenience, note that the state-action measure dπν uses the
subscript ν, which is a state-action measure.
Q-NPG will be defined with respect to the on-policy state action measure starting with s0 , a0 ∼ ν.
As per our convention, we define
$$d^{(t)} := d^{\pi^{(t)}}_\nu.$$
The approximate version of this algorithm is:
$$\text{Approx. Q-NPG:} \quad \theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}, \quad w^{(t)} \approx \mathop{\mathrm{argmin}}_{\|w\|_2 \leq W} L(w; \theta^{(t)}, d^{(t)}), \qquad (20)$$
where the above update rule also permits us to constrain the norm of the update direction $w^{(t)}$ (alternatively, we could use $\ell_2$ regularization as is also common in practice). The exact minimizer is denoted as:
$$w^{(t)}_\star \in \mathop{\mathrm{argmin}}_{\|w\|_2 \leq W} L(w; \theta^{(t)}, d^{(t)}).$$


Note that $w^{(t)}_\star$ depends on the current parameter $\theta^{(t)}$ and $W$ can scale with $|S|$ and $|A|$ in general. Our analysis will take into account both the excess risk (often also referred to as estimation error) and the transfer error. Here, the excess risk arises because $w^{(t)}$ may not be equal to $w^{(t)}_\star$, and the approximation error arises because even the best linear fit using $w^{(t)}_\star$ may not perfectly match the Q-values, i.e. $L(w^{(t)}_\star; \theta^{(t)}, d^{(t)})$ is unlikely to be 0 in practical applications.
We now formalize these concepts in the following assumption:
Assumption 6.1 (Estimation/Transfer errors) Fix a state distribution $\rho$; a state-action distribution $\nu$; an arbitrary comparator policy $\pi^\star$ (not necessarily an optimal policy). With respect to $\pi^\star$, define the state-action measure $d^\star$ as
$$d^\star(s,a) = d^{\pi^\star}_\rho(s) \circ \mathrm{Unif}_A(a),$$
i.e. $d^\star$ samples states from the comparator's state visitation measure, $d^{\pi^\star}_\rho$, and actions from the uniform distribution. Let us permit the sequence of iterates $w^{(0)}, w^{(1)}, \ldots, w^{(T-1)}$ used by the Q-NPG algorithm to be random, where the randomness could be due to sample-based estimation error. Suppose the following holds for all $t < T$:
1. (Excess risk) Assume that the estimation error is bounded as follows:
$$\mathbb{E}\left[ L(w^{(t)}; \theta^{(t)}, d^{(t)}) - L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \right] \leq \epsilon_{\mathrm{stat}}.$$
Note that using a sample based approach we would expect $\epsilon_{\mathrm{stat}} = O(1/\sqrt{N})$ or better, where $N$ is the number of samples used to estimate $w^{(t)}_\star$. We formalize this in Corollary 26.
2. (Transfer error) Suppose that the best predictor $w^{(t)}_\star$ has an error bounded by $\epsilon_{\mathrm{bias}}$, in expectation, with respect to the comparator's measure $d^\star$. Specifically, assume:
$$\mathbb{E}\left[ L(w^{(t)}_\star; \theta^{(t)}, d^\star) \right] \leq \epsilon_{\mathrm{bias}}.$$
We refer to $\epsilon_{\mathrm{bias}}$ as the transfer error (or transfer bias); it is the error where the relevant distribution is shifted to $d^\star$. For the softmax policy parameterization for tabular MDPs, $\epsilon_{\mathrm{bias}} = 0$ (see Remark 24 for another example).
In both conditions, the expectations are with respect to the randomness in the sequence of iterates
w(0) , w(1) , . . . w(T −1) , e.g. the approximate algorithm may be sample based.
Shortly, we discuss how the transfer error relates to the more standard approximation-estimation decomposition. Importantly, the transfer error is always defined with respect to a single, fixed measure, $d^\star$.
Assumption 6.2 (Relative condition number) Consider the same ρ, ν, and π ? as in Assump-
tion 6.1. With respect to any state-action distribution υ, define:
$$\Sigma_\upsilon = \mathbb{E}_{s,a \sim \upsilon}\left[ \phi_{s,a}\, \phi_{s,a}^\top \right],$$
and define
$$\sup_{w \in \mathbb{R}^d} \frac{w^\top \Sigma_{d^\star}\, w}{w^\top \Sigma_\nu\, w} = \kappa.$$
Assume that κ is finite.


Remark 22 discusses why it is reasonable to expect that κ is not a quantity related to the size of
the state space.7
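As a quick illustration (not from the paper), the relative condition number for a given feature matrix and a pair of state-action distributions can be computed as the largest generalized eigenvalue of $(\Sigma_{d^\star}, \Sigma_\nu)$; the sketch below assumes a finite state-action space and explicit distribution vectors.

```python
# A small sketch (assuming a finite state-action space) of the relative condition
# number kappa in Assumption 6.2: the largest generalized eigenvalue of
# (Sigma_{d_star}, Sigma_nu), where Sigma_v = E_{s,a ~ v}[phi phi^T].
import numpy as np
from scipy.linalg import eigh

def feature_covariance(Phi, dist):
    # Phi: (num_state_actions, d) feature matrix; dist: distribution over the rows
    return Phi.T @ (dist[:, None] * Phi)

def relative_condition_number(Phi, d_star, nu, reg=1e-10):
    Sigma_star = feature_covariance(Phi, d_star)
    Sigma_nu = feature_covariance(Phi, nu)
    # sup_w (w^T Sigma_star w) / (w^T Sigma_nu w) = largest generalized eigenvalue
    eigvals = eigh(Sigma_star, Sigma_nu + reg * np.eye(Phi.shape[1]), eigvals_only=True)
    return float(eigvals[-1])
```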
Our main theorem below shows how the approximation error, the excess risk, and the conditioning,
determine the final performance. Note that both the transfer error bias and κ are defined with respect
to the comparator policy π ? .

Theorem 20 (Agnostic learning with Q-NPG) Fix a state distribution $\rho$; a state-action distribution $\nu$; an arbitrary comparator policy $\pi^\star$ (not necessarily an optimal policy). Suppose Assumption 6.2 holds and $\|\phi_{s,a}\|_2 \leq B$ for all $s,a$. Suppose the Q-NPG update rule (in (20)) starts with $\theta^{(0)} = 0$, $\eta = \sqrt{2 \log |A| / (B^2 W^2 T)}$, and the (random) sequence of iterates satisfies Assumption 6.1. We have that
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{4|A|\,\kappa\, \epsilon_{\mathrm{stat}}}{(1-\gamma)^3}} + \frac{\sqrt{4|A|\, \epsilon_{\mathrm{bias}}}}{1-\gamma}.$$

The proof is provided in Section 6.4.
Note when $\epsilon_{\mathrm{bias}} = 0$, our convergence rate is $O(\sqrt{1/T})$ plus a term that depends on the excess risk; hence, provided we obtain enough samples, then $\epsilon_{\mathrm{stat}}$ will also tend to 0, and we will be competitive with the comparison policy $\pi^\star$. When $\epsilon_{\mathrm{bias}} = 0$ and $\epsilon_{\mathrm{stat}} = 0$, as in the tabular setting with exact gradients, the additional two terms become 0, consistent with Theorem 16 except that the convergence rate is $O(\sqrt{1/T})$ rather than the faster rate of $O(1/T)$. Obtaining a faster rate in the function approximation regime appears to require stronger conditions on how the approximation errors are controlled at each iteration.
The usual approximation-estimation error decomposition is that we can write our error as:
$$L(w^{(t)}; \theta^{(t)}, d^{(t)}) = \underbrace{L(w^{(t)}; \theta^{(t)}, d^{(t)}) - L(w^{(t)}_\star; \theta^{(t)}, d^{(t)})}_{\text{Excess risk}} + \underbrace{L(w^{(t)}_\star; \theta^{(t)}, d^{(t)})}_{\text{Approximation error}}.$$

As we obtain more samples, we can drive the excess risk (the estimation error) to 0 (see Corollary 26).
The approximation error above is due to modeling error. Importantly, for our Q-NPG performance bound, it is not this standard approximation error notion which is relevant, but this error under a different measure $d^\star$, i.e. $L(w^{(t)}_\star; \theta^{(t)}, d^\star)$. One appealing aspect about the transfer error is that this error is with respect to a fixed measure, namely $d^\star$. Furthermore, in practice, modern machine learning methods often perform favorably with regards to transfer learning, substantially better than worst case theory might suggest.
The following corollary provides a performance bound in terms of the usual notion of approxima-
tion error, at the cost of also depending on the worst case distribution mismatch ratio. The corollary
disentangles the estimation error from the approximation error.

Corollary 21 (Estimation error/Approximation error bound for Q-NPG) Consider the same setting as in Theorem 20. Rather than assuming the transfer error is bounded (part 2 in Assumption 6.1), suppose that, for all $t \leq T$,
$$\mathbb{E}\left[ L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \right] \leq \epsilon_{\mathrm{approx}}.$$

7. Technically, we only need the relative condition number $\sup_{w \in \mathbb{R}^d} \frac{w^\top \Sigma_{d^\star} w}{w^\top \Sigma_{d^{(t)}} w}$ to be bounded for all $t$. We state this as a sufficient condition based on the initial distribution $\nu$ because it is more interpretable and, as per Remark 22, this quantity can be bounded in a manner that is independent of the sequence of iterates produced by the algorithm.


We have that
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{4|A|}{(1-\gamma)^3} \left( \kappa \cdot \epsilon_{\mathrm{stat}} + \left\| \frac{d^\star}{\nu} \right\|_\infty \cdot \epsilon_{\mathrm{approx}} \right)}.$$

Proof We have the following crude upper bound on the transfer error:
$$L(w^{(t)}_\star; \theta^{(t)}, d^\star) \leq \left\| \frac{d^\star}{d^{(t)}} \right\|_\infty L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \leq \frac{1}{1-\gamma} \left\| \frac{d^\star}{\nu} \right\|_\infty L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}),$$
where the last step uses the definition of $d^{(t)}$ (see (19)). This implies $\epsilon_{\mathrm{bias}} \leq \frac{1}{1-\gamma} \left\| \frac{d^\star}{\nu} \right\|_\infty \epsilon_{\mathrm{approx}}$, and the corollary follows.

The above also shows the striking difference between the effects of estimation error and approximation error. The proof shows how the transfer error notion is weaker than previous conditions based on distribution mismatch coefficients or concentrability coefficients. Also, as discussed in Scherrer (2014), the (distribution mismatch) coefficient $\left\| \frac{d^\star}{\nu} \right\|_\infty$ is already weaker than the more standard concentrability coefficients.
We now make a few additional observations with regards to $\kappa$.

Remark 22 (Dimension dependence in κ and the importance of ν) It is reasonable to think about κ


as being dimension dependent (or worse), but it is not necessarily related to the size of the state space.
For example, if $\|\phi_{s,a}\|_2 \leq B$, then $\kappa \leq \frac{B^2}{\sigma_{\min}\!\left( \mathbb{E}_{s,a \sim \nu}[\phi_{s,a} \phi_{s,a}^\top] \right)}$, though this bound may be pessimistic.
Here, we also see the importance of choice of ν in having a small (relative) condition number;
in particular, this is the motivation for considering the generalization which allows for a starting
state-action distribution ν vs. just a starting state distribution µ (as we did in the tabular case).
Roughly speaking, we desire a ν which provides good coverage over the features. As the following
lemma shows, there always exists a universal distribution ν, which can be constructed only with
knowledge of the feature set (without knowledge of d? ), such that κ ≤ d.

Lemma 23 (κ ≤ d is always possible) Let Φ = {φ(s, a)|(s, a) ∈ S × A} ⊂ Rd and suppose Φ is


a compact set. There always exists a state-action distribution ν, which is supported on at most $d^2$
state-action pairs and which can be constructed only with knowledge of Φ (without knowledge of the
MDP or d? ), such that:
κ ≤ d.

Proof The distribution can be found through constructing the minimal volume ellipsoid containing
Φ, i.e. the Löwner-John ellipsoid (John, 1948). In particular, this ν is supported on the contact points
between this ellipsoid and Φ; the lemma immediately follows from properties of this ellipsoid (e.g.
see Ball (1997); Bubeck et al. (2012)).

It is also worth considering a more general example (beyond tabular MDPs) in which $\epsilon_{\mathrm{bias}} = 0$ for the log-linear policy class.


Algorithm 1 Sampler for: $s, a \sim d^\pi_\nu$ and unbiased estimate of $Q^\pi(s,a)$
Require: Starting state-action distribution $\nu$.
1: Sample $s_0, a_0 \sim \nu$.
2: Sample $s, a \sim d^\pi_\nu$ as follows: at every timestep $h$, with probability $\gamma$, act according to $\pi$; else, accept $(s_h, a_h)$ as the sample and proceed to Step 3. See (19).
3: From $s_h, a_h$, continue to execute $\pi$, and use a termination probability of $1-\gamma$. Upon termination, set $\widehat{Q}^\pi(s_h, a_h)$ as the undiscounted sum of rewards from time $h$ onwards.
4: return $(s_h, a_h)$ and $\widehat{Q}^\pi(s_h, a_h)$.
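A minimal Python sketch of this sampler follows, assuming an episodic simulator with hypothetical `env.sample_nu()` / `env.step(s, a)` methods and a stochastic policy `policy(s)`; these interfaces are illustrative placeholders, not from the paper.

```python
# A minimal sketch of Algorithm 1, assuming an episodic simulator `env` with
# hypothetical methods: env.sample_nu() -> (s, a) drawn from nu, and
# env.step(s, a) -> (next_state, reward). `policy(s)` samples an action.
import random

def sample_q_estimate(env, policy, gamma):
    s, a = env.sample_nu()                      # s_0, a_0 ~ nu
    # Phase 1: draw (s_h, a_h) ~ d^pi_nu by continuing with probability gamma.
    while random.random() < gamma:
        s, _ = env.step(s, a)
        a = policy(s)
    s_h, a_h = s, a
    # Phase 2: roll out pi with termination probability 1 - gamma and sum the
    # (undiscounted) rewards to get an unbiased estimate of Q^pi(s_h, a_h).
    q_hat, s, a = 0.0, s_h, a_h
    while True:
        s, r = env.step(s, a)
        q_hat += r
        if random.random() < 1.0 - gamma:
            break
        a = policy(s)
    return (s_h, a_h), q_hat
```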

Remark 24 ($\epsilon_{\mathrm{bias}} = 0$ for "linear" MDPs) In the recent linear MDP model of Jin et al. (2019); Yang and Wang (2019); Jiang et al. (2017), where the transition dynamics are low rank, we have that $\epsilon_{\mathrm{bias}} = 0$ provided we use the features of the linear MDP. Our guarantees also permit model misspecification of linear MDPs, with non worst-case approximation error where $\epsilon_{\mathrm{bias}} \neq 0$.

Remark 25 (Comparison with POLITEX and EE-POLITEX) Compared with POLITEX (Abbasi-Yadkori et al., 2019a), Assumption 6.2 is substantially milder, in that it just assumes a good relative condition number for one policy rather than all possible policies (which cannot hold in general even for tabular MDPs). Changing this assumption to an analog of Assumption 6.2 is the main improvement in the analysis of the EE-POLITEX (Abbasi-Yadkori et al., 2019b) algorithm. They provide a regret bound for the average reward setting, which is qualitatively different from the suboptimality bound in the discounted setting that we study. They provide a specialized result for linear function approximation, similar to Theorem 20.

6.2.1 Q-NPG Sample Complexity


Assumption 6.3 (Episodic Sampling Oracle) For a fixed state-action distribution ν, we assume
the ability to: start at s0 , a0 ∼ ν; continue to act thereafter in the MDP according to any policy π;
and terminate this “rollout” when desired. With this oracle, it is straightforward to obtain unbiased
samples of Qπ (s, a) (or Aπ (s, a)) under s, a ∼ dπν for any π; see Algorithms 1 and 3.

Algorithm 2 provides a sample based version of the Q-NPG algorithm; it simply uses stochastic
projected gradient ascent within each iteration. The following corollary shows this algorithm suffices
to obtain an accurate sample based version of Q-NPG.

Corollary 26 (Sample complexity of Q-NPG) Assume we are in the setting of Theorem 20 and that
we have access to an episodic sampling oracle (i.e. Assumption 6.3). Suppose that the Sample Based
Q-NPG Algorithm (Algorithm 2) is run for T iterations, with N gradient steps per iteration, with an
appropriate setting of the learning rates η and α. We have that:
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{8\kappa |A|\, BW(BW+1)}{(1-\gamma)^4}}\, \frac{1}{N^{1/4}} + \frac{\sqrt{4|A|\, \epsilon_{\mathrm{bias}}}}{1-\gamma}.$$

Furthermore, since each episode has expected length 2/(1 − γ), the expected number of total samples
used by Q-NPG is 2N T /(1 − γ).


Algorithm 2 Sample-based Q-NPG for Log-linear Policies
Require: Learning rate $\eta$; SGD learning rate $\alpha$; number of SGD iterations $N$
1: Initialize $\theta^{(0)} = 0$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   Initialize $w_0 = 0$
4:   for $n = 0, 1, \ldots, N-1$ do
5:     Call Algorithm 1 to obtain $s, a \sim d^{(t)}$ and an unbiased estimate $\widehat{Q}(s,a)$.
6:     Update:
$$w_{n+1} = \mathrm{Proj}_{\mathcal{W}}\left( w_n - 2\alpha \left( w_n \cdot \phi_{s,a} - \widehat{Q}(s,a) \right) \phi_{s,a} \right),$$
       where $\mathcal{W} = \{w : \|w\|_2 \leq W\}$.
7:   end for
8:   Set $\widehat{w}^{(t)} = \frac{1}{N} \sum_{n=1}^N w_n$.
9:   Update $\theta^{(t+1)} = \theta^{(t)} + \eta\, \widehat{w}^{(t)}$.
10: end for
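For concreteness, here is a minimal Python sketch of the inner projected-SGD loop of Algorithm 2, assuming a `sampler()` function that returns the feature vector $\phi_{s,a}$ together with $\widehat{Q}(s,a)$ for $s, a \sim d^{(t)}$; the sampler interface is an illustrative assumption.

```python
# A minimal sketch of the inner loop of Algorithm 2: projected SGD on the squared
# regression loss (w . phi_{s,a} - Q_hat)^2, followed by averaging the iterates.
import numpy as np

def project_l2_ball(w, W):
    norm = np.linalg.norm(w)
    return w if norm <= W else w * (W / norm)

def q_npg_inner_loop(sampler, dim, N, alpha, W):
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(N):
        phi_sa, q_hat = sampler()                       # s, a ~ d^{(t)}
        grad = 2.0 * (w @ phi_sa - q_hat) * phi_sa      # gradient of the squared error
        w = project_l2_ball(w - alpha * grad, W)
        w_sum += w
    return w_sum / N                                    # averaged iterate w_hat^{(t)}
```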

Proof Note that our sampled gradients are bounded by $G := 2B\big(BW + \frac{1}{1-\gamma}\big)$. Using $\alpha = \frac{W}{G\sqrt{N}}$, a standard analysis for stochastic projected gradient ascent (Theorem 59) shows that:
$$\epsilon_{\mathrm{stat}} \leq \frac{2BW\big( BW + \frac{1}{1-\gamma} \big)}{\sqrt{N}}.$$
The proof is completed via substitution.

Remark 27 (Improving the scaling with $N$) Our current rate of convergence is $1/N^{1/4}$ due to our use of stochastic projected gradient ascent. Instead, for the least squares estimator, $\epsilon_{\mathrm{stat}}$ would be $O(d/N)$ provided certain further regularity assumptions hold (a bound on the minimal eigenvalue of $\Sigma_\nu$ would be sufficient but not necessary; see Hsu et al. (2014) for such conditions). With such further assumptions, our rate of convergence would be $O(1/\sqrt{N})$.

6.3 NPG: Performance Bounds for Smooth Policy Classes


We now return to analyzing the standard NPG update rule, which uses advantages rather than Q-values (see Section 6.1). It is helpful to define
$$L_A(w; \theta, \upsilon) := \mathbb{E}_{s,a \sim \upsilon} \left[ \big( A^{\pi_\theta}(s,a) - w \cdot \nabla_\theta \log \pi_\theta(a|s) \big)^2 \right],$$
where $\upsilon$ is a state-action distribution, and the subscript of $A$ denotes that the loss function uses advantages
(rather than Q-values). The iterates of the NPG algorithm can be viewed as minimizing this loss
under some appropriately chosen measure.
We now consider an approximate version of the NPG update rule:

$$\text{Approx. NPG:} \quad \theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}, \quad w^{(t)} \approx \mathop{\mathrm{argmin}}_{\|w\|_2 \leq W} L_A(w; \theta^{(t)}, d^{(t)}), \qquad (21)$$


where again we use the on-policy, fitting distribution d(t) . As with Q-NPG, we also permit the use of
a starting state-action distribution ν as opposed to just a starting state distribution (see Remark 22).
Again, we let $w^{(t)}_\star$ denote the minimizer, i.e. $w^{(t)}_\star \in \mathop{\mathrm{argmin}}_{\|w\|_2 \leq W} L_A(w; \theta^{(t)}, d^{(t)})$.
For this section, our analysis will focus on more general policy classes, beyond log-linear policy
classes. In particular, we make the following smoothness assumption on the policy class:

Assumption 6.4 (Policy Smoothness) Assume for all s ∈ S and a ∈ A that log πθ (a|s) is a β-
smooth function of θ (to recall the definition of smoothness, see (24)).

It is not too difficult to verify that the tabular softmax policy parameterization is a 1-smooth policy class in the above sense. The more general class of log-linear policies is also smooth, as we remark below.
Remark 28 (Smoothness of the log-linear policy class) For the log-linear policy class (see Sec-
tion 6.1.1), smoothness is implied if the features φ have bounded Euclidean norm. Precisely, if
the feature mapping φ satisfies kφs,a k2 ≤ B, then it is not difficult to verify that log πθ (a|s) is a
B 2 -smooth function.
For any state-action distribution $\upsilon$, define:
$$\Sigma^\theta_\upsilon = \mathbb{E}_{s,a \sim \upsilon}\left[ \nabla_\theta \log \pi_\theta(a|s) \left( \nabla_\theta \log \pi_\theta(a|s) \right)^\top \right],$$
and, again, we use $\Sigma^{(t)}_\upsilon$ as shorthand for $\Sigma^{\theta^{(t)}}_\upsilon$.

Assumption 6.5 (Estimation/Transfer/Conditioning) Fix a state distribution ρ; a state-action distri-


bution ν; an arbitrary comparator policy π ? (not necessarily an optimal policy). With respect to π ? ,
define the state-action measure d? as
$$d^\star(s,a) = d^{\pi^\star}_\rho(s)\, \pi^\star(a|s).$$

Note that, in comparison to Assumption 6.1, d? is the state-action visitation measure of the comparator
policy. Let us permit the sequence of iterates w(0) , w(1) , . . . w(T −1) used by the NPG algorithm to
be random, where the randomness could be due to sample-based, estimation error. Suppose the
following holds for all t < T :
1. (Excess risk) Assume the estimation error is bounded as:
$$\mathbb{E}\left[ L_A(w^{(t)}; \theta^{(t)}, d^{(t)}) - L_A(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \,\big|\, \theta^{(t)} \right] \leq \epsilon_{\mathrm{stat}},$$
i.e. the above conditional expectation is bounded (with probability one).⁸ As we see in Corollary 26, we can guarantee $\epsilon_{\mathrm{stat}}$ to drop as $\sqrt{1/N}$.

2. (Transfer error) Suppose that:
$$\mathbb{E}\left[ L_A(w^{(t)}_\star; \theta^{(t)}, d^\star) \right] \leq \epsilon_{\mathrm{bias}}.$$
8. The use of a conditional expectation here (vs. the unconditional one in Assumption 6.1) permits the assumption to
hold even in settings where we may reuse data in the sample-based approximation of LA . Also, the expectation over
the iterates allows a more natural assumption on the relative condition number, relevant for the more general case of
smooth policies.


3. (Relative condition number) For all iterations $t$, assume the average relative condition number is bounded as follows:
$$\mathbb{E}\left[ \sup_{w \in \mathbb{R}^d} \frac{w^\top \Sigma^{(t)}_{d^\star}\, w}{w^\top \Sigma^{(t)}_\nu\, w} \right] \leq \kappa. \qquad (22)$$
Note that the term inside the expectation is a random quantity as $\theta^{(t)}$ is random.
In the above conditions, the expectation is with respect to the randomness in the sequence of iterates $w^{(0)}, w^{(1)}, \ldots, w^{(T-1)}$.

Analogous to our Q-NPG theorem, our main theorem for NPG shows how the transfer error is relevant in addition to the statistical error $\epsilon_{\mathrm{stat}}$.

Theorem 29 (Agnostic learning with NPG) Fix a state distribution $\rho$; a state-action distribution $\nu$; an arbitrary comparator policy $\pi^\star$ (not necessarily an optimal policy). Suppose Assumption 6.4 holds. Suppose the NPG update rule (in (21)) starts with $\pi^{(0)}$ being the uniform distribution (at each state), $\eta = \sqrt{2 \log |A| / (\beta W^2 T)}$, and the (random) sequence of iterates satisfies Assumption 6.5. We have that
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{W}{1-\gamma} \sqrt{\frac{2\beta \log |A|}{T}} + \sqrt{\frac{\kappa\, \epsilon_{\mathrm{stat}}}{(1-\gamma)^3}} + \frac{\sqrt{\epsilon_{\mathrm{bias}}}}{1-\gamma}.$$

The proof is provided in Section 6.4.

Remark 30 (The $|A|$ dependence: NPG vs. Q-NPG) Observe there is no polynomial dependence on $|A|$ in the rate for NPG (in contrast to Theorem 20); also observe that here we define $d^\star$ as the state-action distribution of $\pi^\star$ in Assumption 6.5, as opposed to a uniform distribution over the actions, as in Assumption 6.1. The main difference in the analysis is that, for Q-NPG, we need to bound the error in fitting the advantage estimates; this leads to the dependence on $|A|$ (which can be removed with a path dependent bound, i.e. a bound which depends on the sequence of iterates produced by the algorithm).⁹ For NPG, the direct fitting of the advantage function sidesteps this conversion step. Note that the relative condition number assumption in Q-NPG (Assumption 6.2) is a weaker assumption, because it can be bounded independently of the path of the algorithm (see Remark 22), while NPG's centering of the features makes the assumption on the relative condition number depend on the path of the algorithm.

Remark 31 (Comparison with Theorem 16) Compared with the result of Theorem 16 in the noiseless, tabular case, we see two main differences. In the setting of Theorem 16, we have $\epsilon_{\mathrm{stat}} = \epsilon_{\mathrm{bias}} = 0$, so that the last two terms vanish. This leaves the first term, where we observe a slower $1/\sqrt{T}$ rate compared with Theorem 16, and with an additional dependence on $W$ (which grows as $O(|S||A|/(1-\gamma))$ to approximate the advantages in the tabular setting). Both differences arise from the additional monotonicity property (Lemma 17) on the per-step improvements in the tabular case, which is not easily generalized to the function approximation setting.
9. For Q-NPG, we have to bound two distribution shift terms to both π ? and π (t) at step t of the algorithm.


Algorithm 3 Sampler for: $s, a \sim d^\pi_\nu$ and unbiased estimate of $A^\pi(s,a)$
Require: Starting state-action distribution $\nu$.
1: Set $\widehat{Q}^\pi = 0$ and $\widehat{V}^\pi = 0$.
2: Start at state $s_0 \sim \nu$. Sample $a_0 \sim \nu(\cdot|s_0)$ (though do not necessarily execute $a_0$).
3: ($d^\pi_\nu$ sampling) At every timestep $h \geq 0$,
   • With probability $\gamma$, execute $a_h$, transition to $s_{h+1}$, and sample $a_{h+1} \sim \pi(\cdot|s_{h+1})$.
   • Else accept $(s_h, a_h)$ as the sample and proceed to Step 4.
4: ($A^\pi(s,a)$ sampling) Set SampleQ = True with probability 1/2.
   • If SampleQ = True, execute $a_h$ at state $s_h$ and then continue executing $\pi$ with a termination probability of $1-\gamma$. Upon termination, set $\widehat{Q}^\pi$ as the undiscounted sum of rewards from time $h$ onwards.
   • Else sample $a'_h \sim \pi(\cdot|s_h)$. Then execute $a'_h$ at state $s_h$ and then continue executing $\pi$ with a termination probability of $1-\gamma$. Upon termination, set $\widehat{V}^\pi$ as the undiscounted sum of rewards from time $h$ onwards.
5: return $(s_h, a_h)$ and $\widehat{A}^\pi(s_h, a_h) = 2(\widehat{Q}^\pi - \widehat{V}^\pi)$.
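A minimal Python sketch of this advantage sampler follows, reusing the illustrative simulator interface assumed for the Algorithm 1 sketch (`env.sample_nu()`, `env.step(s, a)`, `policy(s)`); the distinguishing feature is the unbiased advantage estimate $2(\widehat{Q}^\pi - \widehat{V}^\pi)$.

```python
# A minimal sketch of Algorithm 3: sample (s_h, a_h) ~ d^pi_nu, then with
# probability 1/2 estimate Q^pi(s_h, a_h), else estimate V^pi(s_h); the returned
# value 2*(Q_hat - V_hat) is an unbiased estimate of A^pi(s_h, a_h). The simulator
# interface (env.sample_nu, env.step, policy) is an illustrative assumption.
import random

def rollout_return(env, policy, s, a, gamma):
    # Undiscounted sum of rewards with termination probability 1 - gamma per step.
    total = 0.0
    while True:
        s, r = env.step(s, a)
        total += r
        if random.random() < 1.0 - gamma:
            return total
        a = policy(s)

def sample_advantage_estimate(env, policy, gamma):
    s, a = env.sample_nu()
    while random.random() < gamma:                 # (s_h, a_h) ~ d^pi_nu
        s, _ = env.step(s, a)
        a = policy(s)
    q_hat, v_hat = 0.0, 0.0
    if random.random() < 0.5:                      # SampleQ = True
        q_hat = rollout_return(env, policy, s, a, gamma)
    else:                                          # execute a fresh a'_h ~ pi(.|s_h)
        v_hat = rollout_return(env, policy, s, policy(s), gamma)
    return (s, a), 2.0 * (q_hat - v_hat)
```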

Remark 32 (Generalizing Q-NPG for smooth policies) A similar reasoning as the analysis here
can be also used to establish a convergence result for the Q-NPG algorithm in this more general
setting of smooth policy classes. Concretely, we can analyze the Q-NPG update described for neural
policy classes in Section 6.1.2, assuming that the function fθ is Lipschitz-continuous in θ. Like for
Theorem 29, the main modification is that Assumption 6.2 on relative condition numbers is now
defined using the covariance matrix of the features $\nabla_\theta f_\theta(s,a)$, which depend on $\theta$, as opposed to a fixed feature map $\phi(s,a)$ in the log-linear case. The rest of the analysis follows with an appropriate
adaptation of the results above.

6.3.1 NPG Sample Complexity


Algorithm 4 provides a sample based version of the NPG algorithm, again using stochastic projected
gradient ascent; it uses a slight modification of the Q-NPG algorithm to obtain unbiased gradient
estimates. The following corollary shows that this algorithm provides an accurate sample based
version of NPG.

Corollary 33 (Sample complexity of NPG) Assume we are in the setting of Theorem 29 and that we
have access to an episodic sampling oracle (i.e. Assumption 6.3). Suppose that the Sample Based
NPG Algorithm (Algorithm 4) is run for T iterations, with N gradient steps per iteration. Also,
suppose that $\|\nabla_\theta \log \pi^{(t)}(a|s)\|_2 \leq B$ holds with probability one. There exists a setting of $\eta$ and $\alpha$ such that:
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{W}{1-\gamma} \sqrt{\frac{2\beta \log |A|}{T}} + \sqrt{\frac{8\kappa BW(BW+1)}{(1-\gamma)^4}}\, \frac{1}{N^{1/4}} + \frac{\sqrt{\epsilon_{\mathrm{bias}}}}{1-\gamma}.$$


Algorithm 4 Sample-based NPG
Require: Learning rate $\eta$; SGD learning rate $\alpha$; number of SGD iterations $N$
1: Initialize $\theta^{(0)} = 0$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   Initialize $w_0 = 0$
4:   for $n = 0, 1, \ldots, N-1$ do
5:     Call Algorithm 3 to obtain $s, a \sim d^{(t)}$ and an unbiased estimate $\widehat{A}(s,a)$ of $A^{(t)}(s,a)$.
6:     Update:
$$w_{n+1} = \mathrm{Proj}_{\mathcal{W}}\left( w_n - 2\alpha \left( w_n \cdot \nabla_\theta \log \pi^{(t)}(a|s) - \widehat{A}(s,a) \right) \nabla_\theta \log \pi^{(t)}(a|s) \right),$$
       where $\mathcal{W} = \{w : \|w\|_2 \leq W\}$.
7:   end for
8:   Set $\widehat{w}^{(t)} = \frac{1}{N} \sum_{n=1}^N w_n$.
9:   Update $\theta^{(t+1)} = \theta^{(t)} + \eta\, \widehat{w}^{(t)}$.
10: end for

Furthermore, since each episode has expected length 2/(1 − γ), the expected number of total samples
used by NPG is 2N T /(1 − γ).

Proof Let us see that the update direction in Step 6 of Algorithm 4 uses an unbiased estimate of the true gradient of the loss function $L_A$:
$$\begin{aligned}
& 2\, \mathbb{E}_{s,a \sim d^{(t)}}\left[ \left( w_n \cdot \nabla_\theta \log \pi^{(t)}(a|s) - \widehat{A}(s,a) \right) \nabla_\theta \log \pi^{(t)}(a|s) \right] \\
&= 2\, \mathbb{E}_{s,a \sim d^{(t)}}\left[ \left( w_n \cdot \nabla_\theta \log \pi^{(t)}(a|s) - \mathbb{E}[\widehat{A}(s,a)\,|\,s,a] \right) \nabla_\theta \log \pi^{(t)}(a|s) \right] \\
&= 2\, \mathbb{E}_{s,a \sim d^{(t)}}\left[ \left( w_n \cdot \nabla_\theta \log \pi^{(t)}(a|s) - A^{(t)}(s,a) \right) \nabla_\theta \log \pi^{(t)}(a|s) \right] \\
&= \nabla_w L_A(w_n; \theta^{(t)}, d^{(t)}),
\end{aligned}$$
where the last step follows because the sampling procedure in Algorithm 3 produces a conditionally unbiased estimate.
Since $\|\nabla_\theta \log \pi^{(t)}(a|s)\|_2 \leq B$ and since $|\widehat{A}(s,a)| \leq 2/(1-\gamma)$, our sampled gradients are bounded by $G := 8B\big(BW + \frac{1}{1-\gamma}\big)$. The remainder of the proof follows that of Corollary 26.

6.4 Analysis

We first proceed by providing a general analysis of NPG, for arbitrary sequences. We then specialize
it to complete the proof of our two main theorems in this section.


6.4.1 The NPG "Regret Lemma"

It is helpful for us to consider NPG more abstractly, as an update rule of the form

$$\theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}. \qquad (23)$$
We will now provide a lemma where $w^{(t)}$ is an arbitrary (bounded) sequence, which will be helpful when specialized.
Recall a function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be $\beta$-smooth if for all $x, x' \in \mathbb{R}^d$:
$$\|\nabla f(x) - \nabla f(x')\|_2 \leq \beta \|x - x'\|_2,$$
and, due to Taylor's theorem, recall that this implies:
$$\left| f(x') - f(x) - \nabla f(x) \cdot (x' - x) \right| \leq \frac{\beta}{2} \|x' - x\|_2^2. \qquad (24)$$

The following analysis of NPG is based on the mirror-descent approach developed in (Even-Dar
et al., 2009), which motivates us to refer to it as a “regret lemma”.

Lemma 34 (NPG Regret Lemma) Fix a comparison policy $\tilde{\pi}$ and a state distribution $\rho$. Assume for all $s \in S$ and $a \in A$ that $\log \pi_\theta(a|s)$ is a $\beta$-smooth function of $\theta$. Consider the update rule (23), where $\pi^{(0)}$ is the uniform distribution (for all states) and where the sequence of weights $w^{(0)}, \ldots, w^{(T)}$ satisfies $\|w^{(t)}\|_2 \leq W$ (but is otherwise arbitrary). Define:
$$\mathrm{err}_t = \mathbb{E}_{s \sim d^{\tilde{\pi}}_\rho}\, \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)} \left[ A^{(t)}(s,a) - w^{(t)} \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right].$$
We have that:
$$\min_{t<T} \left\{ V^{\tilde{\pi}}(\rho) - V^{(t)}(\rho) \right\} \leq \frac{1}{1-\gamma} \left( \frac{\log |A|}{\eta T} + \frac{\eta \beta W^2}{2} + \frac{1}{T} \sum_{t=0}^{T-1} \mathrm{err}_t \right).$$

Proof By smoothness (see (24)),
$$\log \frac{\pi^{(t+1)}(a|s)}{\pi^{(t)}(a|s)} \geq \nabla_\theta \log \pi^{(t)}(a|s) \cdot \left( \theta^{(t+1)} - \theta^{(t)} \right) - \frac{\beta}{2} \|\theta^{(t+1)} - \theta^{(t)}\|_2^2
= \eta\, \nabla_\theta \log \pi^{(t)}(a|s) \cdot w^{(t)} - \eta^2 \frac{\beta}{2} \|w^{(t)}\|_2^2.$$


We use $\tilde{d}$ as shorthand for $d^{\tilde{\pi}}_\rho$ (note $\rho$ and $\tilde{\pi}$ are fixed); for any policy $\pi$, we also use $\pi_s$ as shorthand for the vector $\pi(\cdot|s)$. Using the performance difference lemma (Lemma 2),
$$\begin{aligned}
\mathbb{E}_{s \sim \tilde{d}}\left[ \mathrm{KL}(\tilde{\pi}_s \| \pi^{(t)}_s) - \mathrm{KL}(\tilde{\pi}_s \| \pi^{(t+1)}_s) \right]
&= \mathbb{E}_{s \sim \tilde{d}}\, \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)}\left[ \log \frac{\pi^{(t+1)}(a|s)}{\pi^{(t)}(a|s)} \right] \\
&\geq \eta\, \mathbb{E}_{s \sim \tilde{d}}\, \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)}\left[ \nabla_\theta \log \pi^{(t)}(a|s) \cdot w^{(t)} \right] - \eta^2 \frac{\beta}{2} \|w^{(t)}\|_2^2 \quad \text{(using the previous display)} \\
&= \eta\, \mathbb{E}_{s \sim \tilde{d}}\, \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)}\left[ A^{(t)}(s,a) \right] - \eta^2 \frac{\beta}{2} \|w^{(t)}\|_2^2 \\
&\qquad + \eta\, \mathbb{E}_{s \sim \tilde{d}}\, \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)}\left[ \nabla_\theta \log \pi^{(t)}(a|s) \cdot w^{(t)} - A^{(t)}(s,a) \right] \\
&= (1-\gamma)\eta \left( V^{\tilde{\pi}}(\rho) - V^{(t)}(\rho) \right) - \eta^2 \frac{\beta}{2} \|w^{(t)}\|_2^2 - \eta\, \mathrm{err}_t.
\end{aligned}$$
Rearranging, we have:
$$V^{\tilde{\pi}}(\rho) - V^{(t)}(\rho) \leq \frac{1}{1-\gamma} \left( \frac{1}{\eta}\, \mathbb{E}_{s \sim \tilde{d}}\left[ \mathrm{KL}(\tilde{\pi}_s \| \pi^{(t)}_s) - \mathrm{KL}(\tilde{\pi}_s \| \pi^{(t+1)}_s) \right] + \frac{\eta \beta}{2} W^2 + \mathrm{err}_t \right).$$
Proceeding,
$$\begin{aligned}
\frac{1}{T} \sum_{t=0}^{T-1} \left( V^{\tilde{\pi}}(\rho) - V^{(t)}(\rho) \right)
&\leq \frac{1}{\eta T (1-\gamma)} \sum_{t=0}^{T-1} \mathbb{E}_{s \sim \tilde{d}}\left( \mathrm{KL}(\tilde{\pi}_s \| \pi^{(t)}_s) - \mathrm{KL}(\tilde{\pi}_s \| \pi^{(t+1)}_s) \right)
+ \frac{1}{T(1-\gamma)} \sum_{t=0}^{T-1} \left( \frac{\eta \beta W^2}{2} + \mathrm{err}_t \right) \\
&\leq \frac{\mathbb{E}_{s \sim \tilde{d}}\, \mathrm{KL}(\tilde{\pi}_s \| \pi^{(0)})}{\eta T (1-\gamma)} + \frac{\eta \beta W^2}{2(1-\gamma)} + \frac{1}{T(1-\gamma)} \sum_{t=0}^{T-1} \mathrm{err}_t \\
&\leq \frac{\log |A|}{\eta T (1-\gamma)} + \frac{\eta \beta W^2}{2(1-\gamma)} + \frac{1}{T(1-\gamma)} \sum_{t=0}^{T-1} \mathrm{err}_t,
\end{aligned}$$
which completes the proof.

6.4.2 Proofs of Theorems 20 and 29


Proof (of Theorem 20) Using the NPG regret lemma (Lemma 34) and the smoothness of the log-linear policy class (see Remark 28),
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \frac{1}{1-\gamma}\, \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} \mathrm{err}_t \right],$$
where we have used our setting of $\eta$.


We make the following decomposition of $\mathrm{err}_t$:
$$\mathrm{err}_t = \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ A^{(t)}(s,a) - w^{(t)}_\star \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right]
+ \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ \left( w^{(t)}_\star - w^{(t)} \right) \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right].$$
For the first term, using that $\nabla_\theta \log \pi_\theta(a|s) = \phi_{s,a} - \mathbb{E}_{a' \sim \pi_\theta(\cdot|s)}[\phi_{s,a'}]$ (see Section 6.1.1), we have:
$$\begin{aligned}
& \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ A^{(t)}(s,a) - w^{(t)}_\star \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right] \\
&= \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ Q^{(t)}(s,a) - w^{(t)}_\star \cdot \phi_{s,a} \right] - \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a' \sim \pi^{(t)}(\cdot|s)}\left[ Q^{(t)}(s,a') - w^{(t)}_\star \cdot \phi_{s,a'} \right] \\
&\leq \sqrt{\mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ \big( Q^{(t)}(s,a) - w^{(t)}_\star \cdot \phi_{s,a} \big)^2 \right]} + \sqrt{\mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a' \sim \pi^{(t)}(\cdot|s)}\left[ \big( Q^{(t)}(s,a') - w^{(t)}_\star \cdot \phi_{s,a'} \big)^2 \right]} \\
&\leq 2 \sqrt{|A|\, \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \mathrm{Unif}_A}\left[ \big( Q^{(t)}(s,a) - w^{(t)}_\star \cdot \phi_{s,a} \big)^2 \right]} = 2 \sqrt{|A|\, L(w^{(t)}_\star; \theta^{(t)}, d^\star)}, \qquad (25)
\end{aligned}$$
where in the first equality we have used $A^{(t)}(s,a) = Q^{(t)}(s,a) - \mathbb{E}_{a' \sim \pi^{(t)}(\cdot|s)}[Q^{(t)}(s,a')]$, and in the last step we have used the definition of $d^\star$ and $L(w^{(t)}_\star; \theta^{(t)}, d^\star)$.
For the second term, let us now show that:
$$\mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ \left( w^{(t)}_\star - w^{(t)} \right) \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right] \leq 2 \sqrt{\frac{|A|\kappa}{1-\gamma} \left( L(w^{(t)}; \theta^{(t)}, d^{(t)}) - L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \right)}. \qquad (26)$$
To see this, first observe that a similar argument to the above leads to:
$$\begin{aligned}
& \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ \left( w^{(t)}_\star - w^{(t)} \right) \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right] \\
&= \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ \left( w^{(t)}_\star - w^{(t)} \right) \cdot \phi_{s,a} \right] - \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a' \sim \pi^{(t)}(\cdot|s)}\left[ \left( w^{(t)}_\star - w^{(t)} \right) \cdot \phi_{s,a'} \right] \\
&\leq 2 \sqrt{|A|\, \mathbb{E}_{s,a \sim d^\star}\left[ \big( ( w^{(t)}_\star - w^{(t)} ) \cdot \phi_{s,a} \big)^2 \right]} = 2 \sqrt{|A| \cdot \|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma_{d^\star}}},
\end{aligned}$$
where we use the notation $\|x\|^2_M := x^\top M x$ for a matrix $M$ and a vector $x$. From the definition of $\kappa$,
$$\|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma_{d^\star}} \leq \kappa\, \|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma_\nu} \leq \frac{\kappa}{1-\gamma}\, \|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma_{d^{(t)}}},$$
using that $(1-\gamma)\nu \leq d^{\pi^{(t)}}_\nu$ (see (19)). Since $w^{(t)}_\star$ minimizes $L(w; \theta^{(t)}, d^{(t)})$ over the set $\mathcal{W} := \{w : \|w\|_2 \leq W\}$, for any $w \in \mathcal{W}$ the first-order optimality conditions for $w^{(t)}_\star$ imply that:
$$\left( w - w^{(t)}_\star \right) \cdot \nabla L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \geq 0.$$


Therefore, for any $w \in \mathcal{W}$,
$$\begin{aligned}
& L(w; \theta^{(t)}, d^{(t)}) - L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \\
&= \mathbb{E}_{s,a \sim d^{(t)}}\left[ \big( w \cdot \phi(s,a) - w^{(t)}_\star \cdot \phi(s,a) + w^{(t)}_\star \cdot \phi(s,a) - Q^{(t)}(s,a) \big)^2 \right] - L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \\
&= \mathbb{E}_{s,a \sim d^{(t)}}\left[ \big( w \cdot \phi(s,a) - w^{(t)}_\star \cdot \phi(s,a) \big)^2 \right] + 2 \left( w - w^{(t)}_\star \right) \cdot \mathbb{E}_{s,a \sim d^{(t)}}\left[ \phi(s,a) \big( w^{(t)}_\star \cdot \phi(s,a) - Q^{(t)}(s,a) \big) \right] \\
&= \|w - w^{(t)}_\star\|^2_{\Sigma_{d^{(t)}}} + \left( w - w^{(t)}_\star \right) \cdot \nabla L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \\
&\geq \|w - w^{(t)}_\star\|^2_{\Sigma_{d^{(t)}}}.
\end{aligned}$$
Noting that $w^{(t)} \in \mathcal{W}$ by construction in the update rule (20) yields the claimed bound on the second term in (26).
Using the bounds on the first and second terms in (25) and (26), along with concavity of the square root function, we have that:
$$\mathbb{E}[\mathrm{err}_t] \leq 2 \sqrt{|A|\, \mathbb{E}\left[ L(w^{(t)}_\star; \theta^{(t)}, d^\star) \right]} + 2 \sqrt{\frac{|A|\kappa}{1-\gamma}\, \mathbb{E}\left[ L(w^{(t)}; \theta^{(t)}, d^{(t)}) - L(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \right]}.$$
The proof is completed by substitution and using our assumptions on $\epsilon_{\mathrm{stat}}$ and $\epsilon_{\mathrm{bias}}$.

The following proof for the NPG algorithm follows along similar lines.
Proof (of Theorem 29) Using the NPG regret lemma and our setting of $\eta$,
$$\mathbb{E}\left[ \min_{t<T} \left\{ V^{\pi^\star}(\rho) - V^{(t)}(\rho) \right\} \right] \leq \frac{W}{1-\gamma} \sqrt{\frac{2\beta \log |A|}{T}} + \frac{1}{1-\gamma}\, \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} \mathrm{err}_t \right],$$
where the expectation is with respect to the sequence of iterates $w^{(0)}, w^{(1)}, \ldots, w^{(T-1)}$.
Again, we make the following decomposition of $\mathrm{err}_t$:
$$\mathrm{err}_t = \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ A^{(t)}(s,a) - w^{(t)}_\star \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right]
+ \mathbb{E}_{s \sim d^{\pi^\star}_\rho,\, a \sim \pi^\star(\cdot|s)}\left[ \left( w^{(t)}_\star - w^{(t)} \right) \cdot \nabla_\theta \log \pi^{(t)}(a|s) \right].$$


For the first term,


h i
(t)
Es∼d?ρ ,a∼π? (·|s) A(t) (s, a) − w? · ∇θ log π (t) (a|s)
r h 2 i q
(t) (t) (t)
≤ Es∼d?ρ ,a∼π? (·|s) A (s, a) − w? · φs,a = LA (w? ; θ(t) , d? ).

(t)
where we have used the definition of LA (w? ; θ(t) , d? ) in the last step.
For the second term, a similar argument leads to:
h i q
(t) (t)
Es∼d?ρ ,a∼π? (·|s) w? − w(t) · ∇θ log π (t) (a|s) = kw? − w(t) k2Σd? .



Define $\kappa^{(t)} := \|(\Sigma^{(t)}_\nu)^{-1/2}\, \Sigma^{(t)}_{d^\star}\, (\Sigma^{(t)}_\nu)^{-1/2}\|_2$, which is the relative condition number at iteration $t$. We have
$$\begin{aligned}
\|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma^{(t)}_{d^\star}} &\leq \|(\Sigma^{(t)}_\nu)^{-1/2}\, \Sigma^{(t)}_{d^\star}\, (\Sigma^{(t)}_\nu)^{-1/2}\|_2\, \|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma^{(t)}_\nu} \\
&\leq \frac{\kappa^{(t)}}{1-\gamma}\, \|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma^{(t)}_{d^{(t)}}} \\
&\leq \frac{\kappa^{(t)}}{1-\gamma} \left( L_A(w^{(t)}; \theta^{(t)}, d^{(t)}) - L_A(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \right),
\end{aligned}$$
where the last step uses that $w^{(t)}_\star$ is a minimizer of $L_A$ over $\mathcal{W}$ and that $w^{(t)}$ is feasible, as before (see the proof of Theorem 20). Now taking an expectation we have:
$$\begin{aligned}
\mathbb{E}\left[ \|w^{(t)}_\star - w^{(t)}\|^2_{\Sigma^{(t)}_{d^\star}} \right]
&\leq \mathbb{E}\left[ \frac{\kappa^{(t)}}{1-\gamma} \left( L_A(w^{(t)}; \theta^{(t)}, d^{(t)}) - L_A(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \right) \right] \\
&= \mathbb{E}\left[ \frac{\kappa^{(t)}}{1-\gamma}\, \mathbb{E}\left[ L_A(w^{(t)}; \theta^{(t)}, d^{(t)}) - L_A(w^{(t)}_\star; \theta^{(t)}, d^{(t)}) \,\big|\, \theta^{(t)} \right] \right] \\
&\leq \mathbb{E}\left[ \frac{\kappa^{(t)}}{1-\gamma} \cdot \epsilon_{\mathrm{stat}} \right] \leq \frac{\kappa\, \epsilon_{\mathrm{stat}}}{1-\gamma},
\end{aligned}$$
where we have used our assumption on $\kappa$ and $\epsilon_{\mathrm{stat}}$.


The proof is completed by substitution and using the concavity of the square root function.

7. Discussion
This work provides a systematic study of the convergence properties of policy optimization techniques,
both in the tabular and the function approximation settings. At the core, our results imply that the non-
convexity of the policy optimization problem is not the fundamental challenge for typical variants of
the policy gradient approach. This is evidenced by the global convergence results which we establish
and that demonstrate the relative niceness of the underlying optimization problem. At the same
time, our results highlight that insufficient exploration can lead to the convergence to sub-optimal
policies, as is also observed in practice; technically, we show how this is an issue of conditioning.
Conversely, we can expect typical policy gradient algorithms to find the best policy from amongst
those whose state-visitation distribution is adequately aligned with the policies we discover, provided
a distribution-shifted notion of approximation error is small.
In the tabular case, our results show that the nature and severity of the exploration/distribution
mismatch term differs in different policy optimization approaches. For instance, we find that doing
policy gradient in its standard form for both the direct and softmax parameterizations can be slow to
converge, particularly in the face of distribution mismatch, even when policy gradients are computed
exactly. Natural policy gradient, on the other hand, enjoys a fast dimension-free convergence when
we are in tabular settings with exact gradients. On the other hand, for the function approximation
setting, or when using finite samples, all algorithms suffer to some degree from the exploration issue
captured through a conditioning effect.


With regards to function approximation, the guarantees herein are the first provable results
that permit average case approximation errors, where the guarantees do not have explicit worst
case dependencies over the state space. These worst case dependencies are avoided by precisely
characterizing an approximation/estimation error decomposition, where the relevant approximation
error is under distribution shift to an optimal policy's measure. Here, we see that successful
function approximation relies on two key aspects: good conditioning (related to exploration) and low
distribution-shifted, approximation error. In particular, these results identify the relevant measure of
the expressivity of a policy class, for the natural policy gradient.
With regards to sample size issues, we showed that simply using stochastic (projected) gradient
ascent suffices for accurate policy optimization. However, in terms of improving sample efficiency
and polynomial dependencies, there are a number of important questions for future research, including
variance reduction techniques along with data re-use.
There are a number of compelling directions for further study. The first is in understanding
how to remove the density ratio guarantees among prior algorithms; our results are suggestive
that the incremental policy optimization approaches, including CPI (Kakade and Langford, 2002),
PSDP (Bagnell et al., 2004), and MD-MPI Geist et al. (2019), may permit such an improved analysis.
The question of understanding what representations are robust to distribution shift is well-motivated
by the nature of our distribution-shifted, approximation error (the transfer error). Finally, we hope
that policy optimization approaches can be combined with exploration approaches, so that, provably,
these approaches can retain their robustness properties (in terms of their agnostic learning guarantees)
while mitigating the need for a well conditioned initial starting distribution.

Acknowledgments

We thank the anonymous reviewers who provided detailed and constructive feedback that helped us
significantly improve the presentation and exposition. Sham Kakade and Alekh Agarwal gratefully
acknowledge numerous helpful discussions with Wen Sun with regards to the Q-NPG algorithm
and our notion of transfer error. We also acknowledge numerous helpful comments from Ching-An
Cheng and Andrea Zanette on an earlier draft of this work. We thank Nan Jiang, Bruno Scherrer, and
Matthieu Geist for their comments with regards to the relationship between concentrability coefficients, the condition number, and the transfer error; this discussion ultimately led to Corollary 21.
Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in
Data-intensive Discovery, the ONR award N00014-18-1-2247, and the DARPA award FA8650-18-2-
7836. Jason D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0303.
This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical
Research Council (EPSRC) under the Multidisciplinary University Research Initiative.

References
Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért
Weisz. POLITEX: Regret bounds for policy iteration using expert prediction. In International
Conference on Machine Learning, pages 3692–3702, 2019a.

Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced
politex. arXiv preprint arXiv:1908.10479, 2019b.


Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin
Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning
Representations, 2018.

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans, editors. Understand-
ing the impact of entropy on policy optimization, 2019. URL https://arxiv.org/abs/
1811.11214.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71
(1):89–129, 2008.

Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating mini-
mization and projection methods for nonconvex problems: An approach based on the kurdyka-
łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.

Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming.
J. Mach. Learn. Res., 13(1), November 2012. ISSN 1532-4435.

J. A. Bagnell, Sham M Kakade, Jeff G. Schneider, and Andrew Y. Ng. Policy search by dynamic
programming. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information
Processing Systems 16, pages 831–838. MIT Press, 2004.

J. Andrew Bagnell and Jeff Schneider. Covariant policy search. In Proceedings of the 18th Interna-
tional Joint Conference on Artificial Intelligence, IJCAI’03, pages 1019–1024, San Francisco, CA,
USA, 2003. Morgan Kaufmann Publishers Inc.

Keith Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1–58,
1997.

A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997.

Richard Bellman and Stuart Dreyfus. Functional approximations and dynamic programming. Mathe-
matical Tables and Other Aids to Computation, 13(68):247–251, 1959.

Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont,
MA, 1996.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. CoRR,
abs/1906.01786, 2019. URL http://arxiv.org/abs/1906.01786.

Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic
algorithms. Automatica, 45(11):2471–2482, 2009.

Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The łojasiewicz inequality for nonsmooth subana-
lytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization,
17(4):1205–1223, 2007.


Justin A Boyan. Least-squares temporal difference learning. In Proceedings of the Sixteenth


International Conference on Machine Learning, pages 49–56. Morgan Kaufmann Publishers Inc.,
1999.
Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-
optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for
online linear optimization with bandit feedback. In COLT 2012 - The 25th Annual Conference on
Learning Theory, June 25-27, 2012, Edinburgh, Scotland, volume 23 of JMLR Proceedings, pages
41.1–41.14, 2012.
Qi Cai, Zhuoran Yang, Jason D. Lee, and Zhaoran Wang. Neural temporal-difference learning
converges to global optima. CoRR, abs/1905.10027, 2019. URL http://arxiv.org/abs/
1905.10027.
Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University
Press, New York, NY, USA, 2006. ISBN 0521841089.
Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning.
In International Conference on Machine Learning, pages 1042–1051, 2019.
Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathe-
matics of Operations Research, 34(3):726–736, 2009.
Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate
policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576,
2010.
Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.
Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic
gradient for tensor decomposition. Proceedings of The 28th Conference on Learning Theory,
COLT 2015, Paris, France, July 3-6, 2015, 2015.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision
processes. arXiv preprint arXiv:1901.11275, 2019.
Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and
stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression.
Foundations of Computational Mathematics, 14(3):569–600, 2014.
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Con-
textual decision processes with low bellman rank are PAC-learnable. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70, pages 1704–1713. JMLR. org, 2017.


Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape
saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning,
ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1724–1732, 2017.

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement
learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.

Fritz John. Extremum problems with inequalities as subsidiary conditions. Interscience Publishers,
1948.

S. Kakade. A natural policy gradient. In NIPS, 2001.

Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In
Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 267–274,
2002.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-
gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.

Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information
processing systems, pages 1008–1014, 2000.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization
attains globally optimal policy. CoRR, abs/1906.10306, 2019. URL http://arxiv.org/
abs/1906.10306.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement
learning. In International conference on machine learning, pages 1928–1937, 2016.

Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567,
2003.

Rémi Munos. Error bounds for approximate value iteration. In AAAI, 2005.

Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method
efficiency in optimization. 1983.

Yurii Nesterov and Boris T. Polyak. Cubic regularization of newton method and its global perfor-
mance. Math. Program., pages 177–205, 2006.

Gergely Neu, Andras Antos, András György, and Csaba Szepesvári. Online markov decision
processes under bandit feedback. In Advances in Neural Information Processing Systems 23.
Curran Associates, Inc., 2010.


Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized markov
decision processes. CoRR, abs/1705.07798, 2017.

Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomput., 71(7-9):1180–1190, 2008. ISSN
0925-2312.

Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In Proceedings
of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2010), pages 1607–1612.
AAAI Press, 2010.

B. T. Polyak. Gradient methods for minimizing functionals. USSR Computational Mathematics and
Mathematical Physics, 3(4):864–878, 1963.

Aravind Rajeswaran, Kendall Lowrey, Emanuel V. Todorov, and Sham M Kakade. Towards general-
ization and simplicity in continuous control. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing
Systems 30, pages 6550–6561. Curran Associates, Inc., 2017.

Bruno Scherrer. Approximate policy iteration schemes: A comparison. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14. JMLR.org, 2014.

Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy
iteration as boosted policy search. In Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pages 35–50. Springer, 2014.

Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist.
Approximate modified policy iteration and its application to the game of tetris. Journal of Machine
Learning Research, 16:1629–1676, 2015.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. ISBN 9781107057135. URL https://books.google.com/books?id=ttJkAwAAQBAJ.

Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends
in Machine Learning, 4(2):107–194, 2012.

Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs, 2019.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient meth-
ods for reinforcement learning with function approximation. In Advances in Neural Information
Processing Systems, volume 99, pages 1057–1063, 1999.


Csaba Szepesvári and Rémi Munos. Finite time bounds for sampling based fitted value iteration. In
Proceedings of the 22nd international conference on Machine learning, pages 880–887. ACM,
2005.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning
algorithms. Connection Science, 3(3):241–268, 1991.

Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features.
In International Conference on Machine Learning, pages 6995–7004, 2019.


Appendix A. Proofs for Section 3


Proof [of Lemma 1] Recall the MDP in Figure 1. Since actions in the terminal states $s_3$, $s_4$, and $s_5$ do not change the expected reward, we only consider actions in states $s_1$ and $s_2$. Denote the "up/above" action by $a_1$ and the "right" action by $a_2$. Note that
$$V^\pi(s_1) = \pi(a_2|s_1)\,\pi(a_1|s_2)\cdot r.$$
Consider
$$\theta^{(1)} = (\log 1, \log 3, \log 3, \log 1), \qquad \theta^{(2)} = (-\log 1, -\log 3, -\log 3, -\log 1),$$
where $\theta$ is written as a tuple $(\theta_{a_1,s_1}, \theta_{a_2,s_1}, \theta_{a_1,s_2}, \theta_{a_2,s_2})$. Then, for the softmax parameterization, we have
$$\pi^{(1)}(a_2|s_1) = \tfrac{3}{4}, \quad \pi^{(1)}(a_1|s_2) = \tfrac{3}{4}, \quad V^{(1)}(s_1) = \tfrac{9}{16}\,r,$$
and
$$\pi^{(2)}(a_2|s_1) = \tfrac{1}{4}, \quad \pi^{(2)}(a_1|s_2) = \tfrac{1}{4}, \quad V^{(2)}(s_1) = \tfrac{1}{16}\,r.$$
Also, for $\theta^{(mid)} = \frac{\theta^{(1)}+\theta^{(2)}}{2}$,
$$\pi^{(mid)}(a_2|s_1) = \tfrac{1}{2}, \quad \pi^{(mid)}(a_1|s_2) = \tfrac{1}{2}, \quad V^{(mid)}(s_1) = \tfrac{1}{4}\,r.$$
This gives
$$V^{(1)}(s_1) + V^{(2)}(s_1) > 2\,V^{(mid)}(s_1),$$
which shows that $V^\pi$ is non-concave.
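As a quick numerical sanity check of the arithmetic above, the following minimal Python sketch (assuming the two-decision-state MDP of Figure 1 with reward $r$ collected at the end, which is our reading of the construction) evaluates the softmax policies at $\theta^{(1)}$, $\theta^{(2)}$, and their midpoint and confirms the non-concavity inequality.

```python
import numpy as np

def softmax_probs(theta_s1, theta_s2):
    # theta_s1 = (theta_{a1,s1}, theta_{a2,s1}), theta_s2 = (theta_{a1,s2}, theta_{a2,s2})
    p_s1 = np.exp(theta_s1) / np.exp(theta_s1).sum()
    p_s2 = np.exp(theta_s2) / np.exp(theta_s2).sum()
    return p_s1, p_s2

def value_s1(theta, r=1.0):
    # V^pi(s1) = pi(a2|s1) * pi(a1|s2) * r for the MDP of Figure 1
    p_s1, p_s2 = softmax_probs(theta[:2], theta[2:])
    return p_s1[1] * p_s2[0] * r

theta1 = np.array([np.log(1), np.log(3), np.log(3), np.log(1)])
theta2 = -theta1
theta_mid = (theta1 + theta2) / 2

v1, v2, vmid = value_s1(theta1), value_s1(theta2), value_s1(theta_mid)
print(v1, v2, vmid)            # 0.5625, 0.0625, 0.25
print(v1 + v2 > 2 * vmid)      # True: V is non-concave along this segment
```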

Proof [of Lemma 2] Let $\Pr^\pi(\tau|s_0=s)$ denote the probability of observing a trajectory $\tau$ when starting in state $s$ and following the policy $\pi$. Using a telescoping argument, we have:
$$
\begin{aligned}
V^{\pi}(s) - V^{\pi'}(s)
&= E_{\tau\sim \Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t)\right] - V^{\pi'}(s)\\
&= E_{\tau\sim \Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t,a_t) + V^{\pi'}(s_t) - V^{\pi'}(s_t)\big)\right] - V^{\pi'}(s)\\
&\overset{(a)}{=} E_{\tau\sim \Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t,a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\big)\right]\\
&\overset{(b)}{=} E_{\tau\sim \Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t,a_t) + \gamma E[V^{\pi'}(s_{t+1})\,|\,s_t,a_t] - V^{\pi'}(s_t)\big)\right]\\
&= E_{\tau\sim \Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi'}(s_t,a_t)\right]
= \frac{1}{1-\gamma}\, E_{s'\sim d^\pi_s}\, E_{a\sim\pi(\cdot|s')}\big[A^{\pi'}(s',a)\big],
\end{aligned}
$$
where (a) rearranges terms in the summation and cancels the $V^{\pi'}(s_0)$ term with the $-V^{\pi'}(s)$ outside the summation, (b) uses the tower property of conditional expectations, and the final equality follows from the definition of $d^\pi_s$.
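The identity above is easy to verify numerically. The sketch below is a standalone check on a small randomly generated MDP (our own toy setup, not code from the paper); it computes both sides of the performance difference lemma exactly via matrix inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] is the next-state distribution
r = rng.random((S, A))

def value_and_q(pi):
    # pi[s, a]: action probabilities; returns V^pi and Q^pi via (I - gamma P_pi)^{-1}
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V
    return V, Q

def discounted_visitation(pi, s0):
    # d_{s0}^pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0 = s0)
    P_pi = np.einsum('sa,sat->st', pi, P)
    e0 = np.zeros(S); e0[s0] = 1.0
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)

pi = rng.dirichlet(np.ones(A), size=S)           # pi
pi_p = rng.dirichlet(np.ones(A), size=S)         # pi'
s0 = 0

V, _ = value_and_q(pi)
Vp, Qp = value_and_q(pi_p)
Ap = Qp - Vp[:, None]                            # advantage A^{pi'}
d = discounted_visitation(pi, s0)

lhs = V[s0] - Vp[s0]
rhs = np.sum(d[:, None] * pi * Ap) / (1 - gamma)
print(np.isclose(lhs, rhs))                      # True
```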


Appendix B. Proofs for Section 4


B.1 Proofs for Section 4.2
We first define first-order optimality for constrained optimization.

Definition 35 (First-order Stationarity) A policy $\pi_\theta \in \Delta(A)^{|S|}$ is $\epsilon$-stationary with respect to the initial state distribution $\mu$ if
$$G(\pi_\theta) := \max_{\pi_\theta+\delta\in\Delta(A)^{|S|},\ \|\delta\|_2\le 1}\ \delta^\top \nabla_\pi V^{\pi_\theta}(\mu) \ \le\ \epsilon,$$
where $\Delta(A)^{|S|}$ is the set of all policies.

Since we are working with the direct parameterization (see (2)), we drop the θ subscript.

Remark 36 If $\epsilon = 0$, then the definition simplifies to $\delta^\top \nabla_\pi V^{\pi}(\mu) \le 0$. Geometrically, $\delta$ is a feasible direction of movement since the probability simplex $\Delta(A)^{|S|}$ is convex. Thus the gradient is negatively correlated with any feasible direction of movement, and so $\pi$ is first-order stationary.

Proposition 37 Let $V^\pi(\mu)$ be $\beta$-smooth in $\pi$. Define the gradient mapping
$$G^\eta(\pi) = \frac{1}{\eta}\Big(P_{\Delta(A)^{|S|}}\big(\pi + \eta \nabla_\pi V^{\pi}(\mu)\big) - \pi\Big),$$
and the update rule for the projected gradient is $\pi^+ = \pi + \eta G^\eta(\pi)$. If $\|G^\eta(\pi)\|_2 \le \epsilon$, then
$$\max_{\pi^+ + \delta \in \Delta(A)^{|S|},\ \|\delta\|_2 \le 1}\ \delta^\top \nabla_\pi V^{\pi^+}(\mu) \ \le\ \epsilon(\eta\beta + 1).$$

Proof By Theorem 58,
$$\nabla_\pi V^{\pi^+}(\mu) \in N_{\Delta(A)^{|S|}}(\pi^+) + \epsilon(\eta\beta + 1)\,B_2,$$
where $B_2$ is the unit $\ell_2$ ball, and $N_{\Delta(A)^{|S|}}$ is the normal cone of the product simplex $\Delta(A)^{|S|}$. Since $\nabla_\pi V^{\pi^+}(\mu)$ is within distance $\epsilon(\eta\beta + 1)$ of the normal cone and $\delta$ is in the tangent cone, we conclude that
$$\delta^\top \nabla_\pi V^{\pi^+}(\mu) \le \epsilon(\eta\beta + 1).$$
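To make the gradient mapping concrete, here is a small illustrative sketch (our own illustration, not from the paper) of one projected gradient-ascent step over the product simplex $\Delta(A)^{|S|}$ using the standard Euclidean simplex projection; $\|G^\eta(\pi)\|_2$ is the quantity whose smallness certifies near-stationarity in Proposition 37. The objective gradient `grad` is left abstract here.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of a vector v onto the probability simplex
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def gradient_mapping(pi, grad, eta):
    # pi, grad: arrays of shape (S, A); projection is applied per state (product simplex)
    pi_plus = np.vstack([project_simplex(pi[s] + eta * grad[s]) for s in range(pi.shape[0])])
    G = (pi_plus - pi) / eta          # G^eta(pi)
    return pi_plus, np.linalg.norm(G)

# toy usage with an arbitrary fixed gradient, just to exercise the update
S, A, eta = 3, 4, 0.1
pi = np.full((S, A), 1.0 / A)
grad = np.arange(S * A, dtype=float).reshape(S, A)
pi_next, G_norm = gradient_mapping(pi, grad, eta)
print(pi_next.sum(axis=1))            # each row still sums to 1
print(G_norm)
```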

Proof [of Theorem 5] Recall the definition of the gradient mapping
$$G^\eta(\pi) = \frac{1}{\eta}\Big(P_{\Delta(A)^{|S|}}\big(\pi + \eta\nabla_\pi V^{(t)}(\mu)\big) - \pi\Big).$$
From Lemma 54, $V^\pi(s)$ is $\beta$-smooth for all states $s$ (and hence $V^\pi(\mu)$ is also $\beta$-smooth) with $\beta = \frac{2\gamma|A|}{(1-\gamma)^3}$. Then, from the standard result (Theorem 57), we have that for $G^\eta(\pi)$ with step-size $\eta = \frac{1}{\beta}$,
$$\min_{t=0,1,\ldots,T-1}\ \|G^\eta(\pi^{(t)})\|_2 \ \le\ \frac{\sqrt{2\beta\big(V^\star(\mu)-V^{(0)}(\mu)\big)}}{\sqrt{T}}.$$
Then, from Proposition 37, we have
$$\min_{t=0,1,\ldots,T}\ \max_{\pi^{(t)}+\delta\in\Delta(A)^{|S|},\ \|\delta\|_2\le 1}\ \delta^\top\nabla_\pi V^{\pi^{(t+1)}}(\mu) \ \le\ (\eta\beta+1)\,\frac{\sqrt{2\beta\big(V^\star(\mu)-V^{(0)}(\mu)\big)}}{\sqrt{T}}.$$
Observe that
$$\max_{\bar\pi\in\Delta(A)^{|S|}} (\bar\pi - \pi)^\top\nabla_\pi V^\pi(\mu)
= 2\sqrt{|S|}\ \max_{\bar\pi\in\Delta(A)^{|S|}} \frac{1}{2\sqrt{|S|}}(\bar\pi-\pi)^\top\nabla_\pi V^\pi(\mu)
\ \le\ 2\sqrt{|S|}\ \max_{\pi+\delta\in\Delta(A)^{|S|},\ \|\delta\|_2\le 1} \delta^\top\nabla_\pi V^\pi(\mu),$$
where the last step follows since $\|\bar\pi-\pi\|_2 \le 2\sqrt{|S|}$. Then, using Lemma 4 and $\eta\beta = 1$, we have
$$\min_{t=0,1,\ldots,T}\ V^\star(\rho) - V^{(t)}(\rho) \ \le\ \frac{4\sqrt{|S|}}{1-\gamma}\left\|\frac{d_\rho^{\pi^\star}}{\mu}\right\|_\infty \frac{\sqrt{2\beta\big(V^\star(\mu)-V^{(0)}(\mu)\big)}}{\sqrt{T}}.$$
We can get our required bound of $\epsilon$ if we set $T$ such that
$$\frac{4\sqrt{|S|}}{1-\gamma}\left\|\frac{d_\rho^{\pi^\star}}{\mu}\right\|_\infty \frac{\sqrt{2\beta\big(V^\star(\mu)-V^{(0)}(\mu)\big)}}{\sqrt{T}} \ \le\ \epsilon,$$
or, equivalently,
$$T \ \ge\ \frac{32\,|S|\,\beta\big(V^\star(\mu)-V^{(0)}(\mu)\big)}{(1-\gamma)^2\,\epsilon^2}\left\|\frac{d_\rho^{\pi^\star}}{\mu}\right\|_\infty^2.$$
Using $V^\star(\mu)-V^{(0)}(\mu) \le \frac{1}{1-\gamma}$ and $\beta = \frac{2\gamma|A|}{(1-\gamma)^3}$ from Lemma 54 leads to the desired result.

B.2 Proofs for Section 4.3


Recall the MDP in Figure 2. Each trajectory starts from the initial state $s_0$, and we use the discount factor $\gamma = H/(H+1)$. Recall that we work with the direct parameterization, where $\pi_\theta(a|s) = \theta_{s,a}$ for $a = a_1, a_2, a_3$ and $\pi_\theta(a_4|s) = 1 - \theta_{s,a_1} - \theta_{s,a_2} - \theta_{s,a_3}$. Since states $s_0$ and $s_{H+1}$ have only one action, we only consider the parameters for states $s_1$ to $s_H$. For this policy class and MDP, let $P^\theta$ be the state transition matrix under $\pi_\theta$, i.e. $[P^\theta]_{s,s'}$ is the probability of going from state $s$ to $s'$ under policy $\pi_\theta$:
$$[P^\theta]_{s,s'} = \sum_{a\in A}\pi_\theta(a|s)\,P(s'|s,a).$$

For the MDP illustrated in Figure 2, the entries of this matrix are given as:
$$[P^\theta]_{s,s'} = \begin{cases}
\theta_{s,a_1} & \text{if } s'=s_{i+1} \text{ and } s=s_i \text{ with } 1\le i\le H\\
1-\theta_{s,a_1} & \text{if } s'=s_{i-1} \text{ and } s=s_i \text{ with } 1\le i\le H\\
1 & \text{if } s'=s_1 \text{ and } s=s_0\\
1 & \text{if } s'=s=s_{H+1}\\
0 & \text{otherwise.}
\end{cases} \qquad (27)$$


With this definition, we recall that the value function in the initial state $s_0$ is given by
$$V^{\pi_\theta}(s_0) = E_{\tau\sim\pi_\theta}\!\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] = e_0^\top (I-\gamma P^\theta)^{-1} r,$$

where e0 is an indicator vector for the starting state s0 . From the form of the transition probabili-
ties (27), it is clear that the value function only depends on the parameters θs,a1 in any state s. While
care is needed for derivatives as the parameters across actions are related by the simplex feasibility
constraints, we have assumed each parameter is strictly positive, so that an infinitesimal change to
any parameter other than θs,a1 does not affect the policy value and hence the policy gradients. With
this understanding, we succinctly refer to θs,a1 as θs in any state s. We also refer to the state si
simply as i to reduce subscripts.
For convenience, we also define p̄ (resp. p) to be the largest (resp. smallest) of the probabilities
θs across the states s ∈ [1, H] in the MDP.
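For concreteness, the following sketch (assuming the chain MDP of Figure 2 with a unit reward in state $s_{H+1}$, which is our reading of the construction) builds $P^\theta$ from (27) and evaluates $V^{\pi_\theta}(s_0) = e_0^\top(I-\gamma P^\theta)^{-1}r$, so the quantities analyzed below can be checked numerically.

```python
import numpy as np

def chain_value(theta, H):
    # theta[i-1] = theta_{s_i}, the probability of moving right from s_i, i = 1..H
    gamma = H / (H + 1.0)
    n = H + 2                                   # states s_0, s_1, ..., s_{H+1}
    P = np.zeros((n, n))
    P[0, 1] = 1.0                               # s_0 -> s_1 deterministically
    P[H + 1, H + 1] = 1.0                       # s_{H+1} is absorbing
    for i in range(1, H + 1):
        P[i, i + 1] = theta[i - 1]              # go right with prob theta_{s_i}
        P[i, i - 1] = 1.0 - theta[i - 1]        # otherwise go left
    r = np.zeros(n)
    r[H + 1] = 1.0                              # reward only in s_{H+1} (assumed r = 1)
    M = np.linalg.inv(np.eye(n) - gamma * P)    # M^theta
    return M[0] @ r                             # V^{pi_theta}(s_0) = e_0^T M^theta r

H = 20
print(chain_value(np.full(H, 0.25), H))         # small: the rewarding state is rarely reached
print(chain_value(np.full(H, 1.00), H))         # always move right: near-optimal value
```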
In this section, we prove Proposition 6, that is: for $0 < \theta < 1$ (componentwise across states and actions), $\bar p \le 1/4$, and for all $k \le \frac{H}{40\log(2H)} - 1$, we have $\|\nabla^k_\theta V^{\pi_\theta}(s_0)\| \le (1/3)^{H/4}$, where $\nabla^k_\theta V^{\pi_\theta}(s_0)$ is a tensor of the $k$th order. Furthermore, we seek to show $V^\star(s_0) - V^{\pi_\theta}(s_0) \ge (H+1)/8 - (H+1)^2/3^H$ (where $\theta^\star$ are the optimal policy's parameters).
It is easily checked that $V^{\pi_\theta}(s_0) = M^\theta_{0,H+1}$, where
$$M^\theta := (I - \gamma P^\theta)^{-1},$$
since the only rewards are obtained in the state $s_{H+1}$. In order to bound the derivatives of the expected reward, we first establish some properties of the matrix $M^\theta$.
Lemma 38 Suppose $\bar p \le 1/4$. Fix any $\alpha \in \left[\frac{1-\sqrt{1-4\gamma^2\bar p(1-p)}}{2\gamma(1-p)},\ \max\!\left(\frac{1+\sqrt{1-4\gamma^2\bar p(1-p)}}{2\gamma(1-p)},\,1\right)\right]$. Then

1. $M^\theta_{a,b} \le \frac{\alpha^{b-a-1}}{1-\gamma}$ for $0 \le a \le b \le H$.

2. $M^\theta_{a,H+1} \le \frac{\gamma\bar p}{1-\gamma}\,M^\theta_{a,H} \le \frac{\gamma\bar p}{(1-\gamma)^2}\,\alpha^{H-a}$ for $0 \le a \le H$.

Proof Let ρka,b be the normalized discounted probability of reaching b, when the initial state is a, in
k steps, that is
k
X
k
ρa,b := (1 − γ) [(γP θ )i ]a,b , (28)
i=0

where we recall the convention that U 0 is the identity matrix for any square matrix U . Observe that
0 ≤ ρka,b ≤ 1, and, based on the form (27) of P θ , we have the recursive relation for all k > 0:

γ(1 − θb+1 )ρk−1 k−1




 a,b+1 + γθb−1 ρa,b−1 if 1 < b < H
 k−1


 γθ ρ
H−1 a,H−1 if b = H
ρka,b = k−1 k−1
γθH ρa,H + γρa,H+1 if b = H + 1 and a < H + 1 . (29)



 1−γ if b = H + 1 and a = H + 1
γ(1 − θ2 )ρk−1 k−1

+ γ ρ if b = 0

a,2 a,0


Note that ρ0a,b = 0 for a 6= b and ρ0a,b = 1 − γ for a = b. Now let us inductively prove that for all
k≥0
ρka,b ≤ αb−a for 1 ≤ a ≤ b ≤ H. (30)

Clearly this holds for k = 0 since ρ0a,b = 0 for a 6= b and ρ0a,b = 1 − γ for a = b. Now, assuming the
bound for all steps till k − 1, we now prove it for k case by case.
For a = b the result follows since

ρka,b ≤ 1 = αb−a .

For 1 < b < H and a < b, observe that the recursion (29) and the inductive hypothesis imply
that

ρka,b ≤ γ(1 − θb+1 )αb+1−a + γθb−1 αb−1−a


= αb−a−1 γ(1 − θb+1 )α2 + γθb−1


≤ αb−a−1 γ(1 − p)α2 + γ p̄




= αb−a−1 α + γ(1 − p)α2 − α + γ p̄ ≤ αb−a ,




where the last inequality follows since $\alpha^2\gamma(1-p) - \alpha + \gamma\bar p \le 0$, because $\alpha$ lies between the roots
of this quadratic equation. Note the discriminant term in the square root is non-negative provided
p̄ < 1/4, since the condition along with the knowledge that p ≤ p̄ ensures that 4γ 2 p̄(1 − p) ≤ 1.
For b = H and a < H, we observe that

ρka,b ≤ γθH−1 αH−1−a


γθH−1
= αH−a
α
H−a γ p̄ γ p̄ 
≤α ( ) ≤ αH−a γ(1 − p)α + ≤ αH−a .
α α

This proves the inductive claim (note that the cases of b = a = 1 and b = a = H are already
handled in the first part above). Next, we prove that for all k ≥ 0

ρk0,b ≤ αb−1 .

Clearly this holds for k = 0 and b 6= 0 since ρ00,b = 0. Furthermore, for all k ≥ 0 and b = 0,

ρk0,b ≤ 1 ≤ αb−1 ,

since α ≤ 1 by construction and b = 0. Now, we consider the only remaining case when k > 0 and
b ∈ [1, H + 1]. By (27), observe that for k > 0 and b ∈ [1, H + 1],

[(P θ )i ]0,b = [(P θ )i−1 ]1,b , (31)


for all i ≥ 1. Using the definition of ρka,b (28) for k > 0 and b ∈ [1, H + 1],
k
X k
X
ρk0,b = (1 − γ) θ i
[(γP ) ]0,b θ 0
= (1 − γ)[(γP ) ]0,b + (1 − γ) [(γP θ )i ]0,b
i=0 i=1
k
X
= 0 + (1 − γ) γ i [(P θ )i ]0,b (since b ≥ 1)
i=1
k
X h i
= (1 − γ) γ i (P θ )i−1 (using Equation (31))
1,b
i=1
k−1
X
= (1 − γ)γ γ j [(P θ )j ]1,b (By substituting j = i − 1)
j=0

= γρk−1
1,b (using Equation (28))
≤ αb−1 (using Equation (30) and γ, α ≤ 1)

Hence, for all k ≥ 0


ρk0,b ≤ αb−1
In conjunction with Equation (30), the above display gives for all k ≥ 0,

ρka,b ≤ αb−a for 1 ≤ a ≤ b ≤ H


ρka,b ≤ αb−a−1 for 0 = a ≤ b ≤ H

Also observe that


θ
ρka,b
Ma,b = lim .
k→∞ 1−γ
θ , which shows
Since the above bound holds for all k ≥ 0, it also applies to the limiting value Ma,b
that

θ αb−a αb−a−1
Ma,b ≤ ≤ for 1 ≤ a ≤ b ≤ H
1−γ 1−γ
θ αb−a−1
Ma,b ≤ for 0 = a ≤ b ≤ H
1−γ
which completes the proof of the first part of the lemma.
For the second claim, from recursion (29) and b = H + 1 and a < H + 1

ρka,H+1 = γθH ρk−1 k−1 k−1 k−1


a,H + γρa,H+1 ≤ γ p̄ρa,H + γρa,H+1 ,

Taking the limit of k → ∞, we see that


θ θ
Ma,H+1 ≤ γ p̄Ma,H + γMa,H+1 .

Rearranging the terms in the above bound yields the second claim in the lemma.

Using the lemma above, we now bound the derivatives of M θ .


Lemma 39 The $k$th-order partial derivatives of $M^\theta$ satisfy:
$$\frac{\partial^k M^\theta_{0,H+1}}{\partial\theta_{\beta_1}\cdots\partial\theta_{\beta_k}} \ \le\ \frac{\bar p\, 2^k \gamma^{k+1} k!\, \alpha^{H-2k}}{(1-\gamma)^{k+2}},$$
where $\beta$ denotes a $k$-dimensional vector with entries in $\{1,2,\ldots,H\}$.

Proof Since the parameter θ is fixed throughout, we drop the superscript in M θ for brevity. Using
∇θ M = −M ∇θ (I − γP θ )M , using the form of P θ in (27), we get for any h ∈ [1, H]
H+1
∂Ma,b X ∂Pi,j
− = −γ Ma,i Mj,b = γMa,h (Mh−1,b − Mh+1,b ) (32)
∂θh ∂θh
i,j=0

where the second equality follows since Ph,h+1 = θh and Ph,h−1 = 1 − θh are the only two entries
in the transition matrix which depend on θh for h ∈ [1, H].
∂k M
0,H+1
Next, let us consider a kth order partial derivative of M0,H+1 , denoted as ∂θβ . Note that
β can have repeated entries to capture higher order derivative with respect to some parameter. We
∂k M
prove by induction for all k ≥ 1, − ∂θ0,H+1 can be written as N
P
β n=1 cn ζn where

1. |cn | = γ k and N ≤ 2k k!,

2. Each monomial ζn is of the form Mi1 ,j1 . . . Mik+1 ,jk+1 , i1 = 0, jk+1 = H + 1, jl ≤ Hand
il+1 = jl ± 1 for all l ∈ [1, k].

The base case k = 1 follows from Equation (32), as we can write for any h ∈ [H]

∂M0,H+1
− = γM0,h Mh−1,H+1 − γM0,h Mh+1,H+1
∂θh

Clearly, the induction hypothesis is true with |cn | = γ, N = 2, i1 = 0, j2 = H + 1, j1 ≤ H and


i2 = j1 ± 1. Now, suppose the claim holds till k − 1. Then by the chain rule:
∂ k−1 M0,H+1

∂kM 0,H+1 ∂θβ/1
= ,
∂θβ1 . . . ∂θβk ∂θβ1

where β /i is the vector β with the ith entry removed. By inductive hypothesis,
N
∂ k−1 M0,H+1 X
− = cn ζn
∂θβ/1
n=1

where

1. |cn | = γ k−1 and N ≤ 2k−1 (k − 1)!,

2. Each monomial ζn is of the form Mi1 ,j1 . . . Mik ,jk , i1 = 0, jk = H + 1, jl ≤ H and


il+1 = jl ± 1 for all l ∈ [1, k − 1].


In order to compute the (k)th derivative of M0,H+1 , we have to compute derivative of each mono-
mial ζn with respect to θβ1 . Consider one of the monomials in the (k − 1)th derivative, say,
ζ = Mi1 ,j1 . . . Mik ,jk . We invoke the chain rule as before and replace one of the terms in ζ, say
Mim ,jm , with γMim ,β1 Mβ1 −1,jm − γMim ,β1 Mβ1 +1,jm using Equation 32. That is, the derivative of
each entry gives rise to two monomials and therefore derivative of ζ leads to 2k monomials which
can be written in the form ζ 0 = Mi01 ,j10 . . . Mi0k+1 ,jk+1
0 where we have the following properties (by
appropriately reordering terms)

1. i0l , jl0 = il , jl for l < m

2. i0l , jl0 = il−1 , jl−1 for l > m + 1

3. i0m , jm
0 = i , β and i0
m 1
0
m+1 , jm+1 = jm ± 1, jm

Using the induction hypothesis, we can write

N0
∂ k M0,H+1 X
− = c0n ζn0
∂θβ1 . . . ∂θβk
n=0

where

1. |c0n | = γ|cn | = γ k , since as shown above each coefficient gets multiplied by ±γ.

2. N 0 ≤ 2k2k−1 (k − 1)! = 2k k!, since as shown above each monomial ζ leads to 2k monomials
ζ 0.

3. Each monomial ζn0 is of the form Mi1 ,j1 . . . Mik+1 ,jk+1 , i1 = 0, jk+1 = H + 1, jl ≤ H and
il+1 = jl ± 1 for all l ∈ [1, k].

This completes the induction.


Next we prove a bound on the magnitude of each of the monomials which arise in the derivatives
of M0,H+1 . Specifically, we show that for each monomial ζ = Mi1 ,j1 . . . Mik+1 ,jk+1 , we have

γ p̄αH−2k
Mi1 ,j1 . . . Mik+1 ,jk+1 ≤ (33)
(1 − γ)k+2

1
We observe that it suffices to only consider pairs of indices il , jl where il < jl . Since |Mi,j | ≤ 1−γ
for all i, j,


k+1
Y Y Y 1
Mi0l ,jl0 ≤ Mi0l ,jl0 Mi0k+1 ,jk+1
0
1−γ
l=1 1≤l≤k : i0l <jl0 1≤l≤k : i0l ≥jl0

Y Y 1
= Mi0l ,jl0 Mi0k+1 ,H+1
1−γ
1≤l≤k : i0l <jl0 1≤l≤k : i0l ≥jl0
(by the inductive claim shown above)
0 0
P
{1≤l≤k : i0 <j 0 } jl −il −1
0
α l l γ p̄αH−ik+1

(1 − γ)k (1 − γ)2
(using Lemma 38, parts 1 and 2 on the first and last terms resp.)
0 0
P
{1≤l≤k+1 : i0 <j 0 } jl −il −1
γ p̄α l l
= (34)
(1 − γ)k+2
0
The last step follows from H + 1 = jk+1 ≥ i0k+1 . Note that

X k+1
X k
X
jl0 − i0l ≥ 0
jl0 − i0l = jk+1 − i01 + 0
(jl+1 − i0l ) ≥ H + 1 − k ≥ 0
{1≤l≤k+1 : i0l <jl0 } l=1 l=1

where the first inequality follows from adding only non-positive terms to the sum, the second equality
follows from rearranging terms and the third inequality follows from i01 = 0, jk+1 0 = H + 1 and
0 0
il+1 = jl ± 1 for all l ∈ [1, k]. Therefore,
X
jl0 − i0l − 1 ≥ H − 2k
{1≤l≤k+1 : i0l <jl0 }

Using Equation (34) and α ≤ 1 with above display gives


k+1
Y γ p̄αH−2k
Mi0l ,jl0 ≤
(1 − γ)k+2
l=1

This proves the bound. Now using the claim that


N
∂ k M0,H+1 X
= cn ζn
∂θβ
n=1

where |cn | = γ k and N ≤ 2k k!, we have shown that

∂ k M0,H+1 p̄ 2k γ k+1 k! αH−2k


≤ ,
∂θβ (1 − γ)k+2

which completes the proof.


We are now ready to prove Proposition 6.


Proof [Proof of Proposition 6] The kth order partial derivative of V πθ (s0 ) is equal to

∂ k V πθ (s0 ) ∂ k M0,H+1
θ
= .
∂θβ1 . . . ∂θβh ∂θβ1 . . . ∂θβk

k k
Given vectors u1 , . . . , uk which are unit vectors in RH (we denote the unit sphere by SH ), the
norm of this gradient tensor is given by:

X ∂ k V πθ (s0 ) 1
k∇kθ V πθ (s0 )k = max u . . . ukβk
u1 ,...,uk ∈SH k ∂θβ1 . . . ∂θβk β1
β∈[H]k
v
∂ V (s0 ) 2
u X  k πθ  s X
2
u1β1 . . . ukβk
u
≤ max t
u1 ,...,uk ∈SH k ∂θβ1 . . . ∂θβk
β∈[H]k β∈[H]k
v v
k
∂ V (s0 ) 2 u
u X  k πθ  uY
= max
u t kui k2
2
t
u1 ,...,uk ∈SH k ∂θβ1 . . . ∂θβk
β∈[H]k i=1
v 2 v u X  k θ
u X  k πθ ∂ M0,H+1 2

u ∂ V (s0 ) u
=t =t
∂θβ1 . . . ∂θβk ∂θβ1 . . . ∂θβk
β∈[H]k β∈[H]k
s
H k p̄2 22k γ 2k+2 (k!)2 α2H−4k
≤ ,
(1 − γ)2k+4

where the last inequality follows from Lemma 39. In order to proceed further, we need an upper
bound on the smallest admissible value of α. To do so, let us consider all possible parameters θ such
that p̄ ≤ 1/4 in accordance with the theorem statement. In order to bound α, it suffices to place an
upper bound on the lower end of the range for α in Lemma 38 (note Lemma 38 holds for any choice
of α in the range). Doing so, we see that

q q
1− 1 − 4γ 2 p̄(1 − p) 1 − 1 + 2γ p̄(1 − p)

2γ(1 − p) 2γ(1 − p)
s r
p̄ 4p̄
= ≤ ,
1−p 3

√ √ √
where the first inequality uses x−y ≥ x− y, by triangle inequality while the last inequality
uses p ≤ p̄ ≤ 1/4.


Hence, we have the bound


s
X ∂ k V πθ (s0 ) 1 H k p̄2 22k γ 2k+2 (k!)2 ( 4p̄
3 )
H−2k
max u . . . ukβk ≤
u1 ,...,uk ∈SH k ∂θβ1 . . . ∂θβk β1 (1 − γ)2k+4
β∈[H]h
(a) q
≤ (H + 1)2k+4 H k p̄2 22k γ 2k+2 (k!)2 ( 4p̄
3 )
H−2k

(b)
q
≤ (2H)2k+4 H k 22k (H)2k ( 4p̄
3 )
H−2k
q
= (2)4k+4 (H)5k+4 ( 4p̄
3 )
H−2k

where (a) uses γ = H/(H + 1), (b) follows since p̄ ≤ 1, H, k ≥ 1, γ ≤ 1 and k ≤ H.


Requiring that the gradient norm be no larger than ( 4p̄
3 )
H/4 , we would like to satisfy

4p̄ H−2k 4p̄


(2)4k+4 (H)5k+4 ( ) ≤ ( )H/2 ,
3 3
for which it suffices to have
H
2 log(3/4p̄) − log(24 H 4 )
k ≤ k0 := .
log(24 H 5 ) + 2 log(3/4p̄)
Since,
H
2 log(3/4p̄) − log(24 H 4 )
log(24 H 5 ) + 2 log(3/4p̄)
(a) H log(3/4p̄) − log(24 H 4 )
2

2 log(24 H 5 )2 log(3/4p̄)
H log(24 H 4 )
≥ −
8 log(24 H 5 ) 4 log(24 H 5 ) log(3/4p̄)
(b) H log(24 H 4 )
≥ −
8 log(24 H 5 ) 4 log(24 H 4 ) log(3)
H
≥ −1
40 log(2H)
where (a) follows from a + b ≤ 2ab when a, b ≥ 1, (b) follows from H ≥ 1 and p̄ ≤ 1/4. Therefore,
in order to obtain the smallest value of k0 for all choices of 0 ≤ p̄ < 1/4, we further lower bound k0
as

H
k0 ≥ − 1,
40 log(2H)
Thus, the norm of the gradient is bounded by ( 4p̄
3 )
H/4 ≤ (1/3)H/4 for all k ≤ H
40 log(2H) − 1 as long
as p̄ ≤ 1/4, which gives the first part of the lemma.
For the second part, note that the optimal policy always chooses the action a1 , and gets a
discounted reward of
 H+2
H+2 1 H +1
γ /(1 − γ) = (H + 1) 1 − ≥ ,
H +1 8


where the final inequality uses (1 − 1/x)x ≥ 1/8 for x ≥ 1. On the other hand, the value of πθ is
upper bounded by
H
γ p̄αH

γ p̄ 4p̄
M0,H+1 ≤ 2

(1 − γ) (1 − γ)2 3
(H + 1)2
≤ .
3H
This gives the second part of the lemma.

Appendix C. Proofs for Section 5


We first give a useful lemma about the structure of policy gradients for the softmax parameterization.
We use the notation $\Pr^\pi(\tau|s_0=s)$ to denote the probability of observing a trajectory $\tau$ when starting in state $s$ and following the policy $\pi$, and $\Pr^\pi_\mu(\tau)$ to denote $E_{s\sim\mu}[\Pr^\pi(\tau|s_0=s)]$ for a distribution $\mu$ over states.

Lemma 40 For the softmax policy class, we have:
$$\frac{\partial V^{\pi_\theta}(\mu)}{\partial\theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_\theta}_\mu(s)\,\pi_\theta(a|s)\,A^{\pi_\theta}(s,a).$$

Proof First note that
$$\frac{\partial \log \pi_\theta(a|s)}{\partial \theta_{s',a'}} = \mathbf{1}[s=s']\,\big(\mathbf{1}[a=a'] - \pi_\theta(a'|s)\big), \qquad (35)$$
where $\mathbf{1}[E]$ is the indicator of $E$ being true.

Using this along with the policy gradient expression (6), we have:
$$
\begin{aligned}
\frac{\partial V^{\pi_\theta}(\mu)}{\partial \theta_{s,a}}
&= E_{\tau\sim \Pr^{\pi_\theta}_\mu}\left[\sum_{t=0}^{\infty}\gamma^t\,\mathbf{1}[s_t=s]\,\big(\mathbf{1}[a_t=a]\,A^{\pi_\theta}(s,a) - \pi_\theta(a|s)\,A^{\pi_\theta}(s_t,a_t)\big)\right]\\
&= E_{\tau\sim \Pr^{\pi_\theta}_\mu}\left[\sum_{t=0}^{\infty}\gamma^t\,\mathbf{1}[(s_t,a_t)=(s,a)]\,A^{\pi_\theta}(s,a)\right]
- \pi_\theta(a|s)\sum_{t=0}^{\infty}\gamma^t\, E_{\tau\sim \Pr^{\pi_\theta}_\mu}\big[\mathbf{1}[s_t=s]\,A^{\pi_\theta}(s_t,a_t)\big]\\
&= \frac{1}{1-\gamma}\, E_{s'\sim d^{\pi_\theta}_\mu}\, E_{a'\sim\pi_\theta(\cdot|s')}\big[\mathbf{1}[(s',a')=(s,a)]\,A^{\pi_\theta}(s,a)\big] - 0\\
&= \frac{1}{1-\gamma}\, d^{\pi_\theta}_\mu(s)\,\pi_\theta(a|s)\,A^{\pi_\theta}(s,a),
\end{aligned}
$$
where the second-to-last step uses that for any policy, $\sum_a \pi(a|s)A^\pi(s,a) = 0$.
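The expression in Lemma 40 can be checked against a finite-difference approximation on a small random MDP. The sketch below is a standalone illustration (our own check, using exact value and visitation computations rather than sampled trajectories).

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # transition tensor P[s, a, s']
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)                         # starting-state distribution

def pi_from(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def V_of(theta):
    pi = pi_from(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, np.einsum('sa,sa->s', pi, r))
    return mu @ V, V, pi, P_pi

theta = rng.normal(size=(S, A))
Vmu, V, pi, P_pi = V_of(theta)
Q = r + gamma * P @ V
Adv = Q - V[:, None]
d_mu = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

# Lemma 40: dV(mu)/dtheta_{s,a} = d_mu(s) * pi(a|s) * A(s,a) / (1 - gamma)
analytic = d_mu[:, None] * pi * Adv / (1 - gamma)

eps = 1e-6
numeric = np.zeros_like(theta)
for s in range(S):
    for a in range(A):
        tp = theta.copy(); tp[s, a] += eps
        tm = theta.copy(); tm[s, a] -= eps
        numeric[s, a] = (V_of(tp)[0] - V_of(tm)[0]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))        # should be ~1e-8 or smaller
```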


C.1 Proofs for Section 5.1


We now prove Theorem 10, i.e. we show that for the updates given by

θ(t+1) = θ(t) + η∇V (t) (µ), (36)

policy gradient converges to the optimal policy for the softmax parameterization.


We prove this theorem by first proving a series of supporting lemmas. First, we show in Lemma
41, that V (t) (s) is monotonically increasing for all states s using the fact that for appropriately chosen
stepsizes GD makes monotonic improvement for smooth objectives.

Lemma 41 (Monotonic Improvement in $V^{(t)}(s)$) For all states $s$ and actions $a$, for updates (36) with learning rate $\eta \le \frac{(1-\gamma)^2}{5}$, we have
$$V^{(t+1)}(s) \ge V^{(t)}(s); \qquad Q^{(t+1)}(s,a) \ge Q^{(t)}(s,a).$$

Proof The proof will consist of showing that
$$\sum_{a\in A}\pi^{(t+1)}(a|s)\,A^{(t)}(s,a) \ \ge\ \sum_{a\in A}\pi^{(t)}(a|s)\,A^{(t)}(s,a) = 0 \qquad (37)$$
holds for all states $s$. To see this, observe that since the above holds for all states $s'$, the performance difference lemma (Lemma 2) implies
$$V^{(t+1)}(s) - V^{(t)}(s) = \frac{1}{1-\gamma}\, E_{s'\sim d^{\pi^{(t+1)}}_{s}}\, E_{a\sim\pi^{(t+1)}(\cdot|s')}\big[A^{(t)}(s',a)\big] \ \ge\ 0,
$$
which would complete the proof.


Let us use the notation $\theta_s \in \mathbb{R}^{|A|}$ to refer to the vector of $\theta_{s,\cdot}$ for some fixed state $s$. Define the function
$$F_s(\theta_s) := \sum_{a\in A}\pi_{\theta_s}(a|s)\,c(s,a), \qquad (38)$$
where $c(s,a)$ is a constant, which we later set to be $A^{(t)}(s,a)$; note we do not treat $c(s,a)$ as a function of $\theta$. Thus,
$$
\begin{aligned}
\frac{\partial F_s(\theta_s)}{\partial\theta_{s,a}}\bigg|_{\theta_s^{(t)}}
&= \sum_{a'\in A}\frac{\partial\pi_{\theta_s}(a'|s)}{\partial\theta_{s,a}}\bigg|_{\theta_s^{(t)}} c(s,a')\\
&= \pi^{(t)}(a|s)\big(1-\pi^{(t)}(a|s)\big)c(s,a) - \sum_{a'\ne a}\pi^{(t)}(a|s)\,\pi^{(t)}(a'|s)\,c(s,a')\\
&= \pi^{(t)}(a|s)\left(c(s,a) - \sum_{a'\in A}\pi^{(t)}(a'|s)\,c(s,a')\right).
\end{aligned}
$$
Taking $c(s,a)$ to be $A^{(t)}(s,a)$ implies $\sum_{a'\in A}\pi^{(t)}(a'|s)c(s,a') = \sum_{a'\in A}\pi^{(t)}(a'|s)A^{(t)}(s,a') = 0$, so
$$\frac{\partial F_s(\theta_s)}{\partial\theta_{s,a}}\bigg|_{\theta_s^{(t)}} = \pi^{(t)}(a|s)\,A^{(t)}(s,a). \qquad (39)$$
Observe that for the softmax parameterization,
$$\theta_s^{(t+1)} = \theta_s^{(t)} + \eta\,\nabla_s V^{(t)}(\mu),$$
where $\nabla_s$ is the gradient with respect to $\theta_s$, and from Lemma 40:
$$\frac{\partial V^{(t)}(\mu)}{\partial\theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi^{(t)}}_\mu(s)\,\pi^{(t)}(a|s)\,A^{(t)}(s,a).$$
This gives, using Equation (39),
$$\theta_s^{(t+1)} = \theta_s^{(t)} + \eta\,\frac{1}{1-\gamma}\, d^{\pi^{(t)}}_\mu(s)\,\nabla_s F_s(\theta_s)\Big|_{\theta_s^{(t)}}.$$
Recall that for a $\beta$-smooth function, a gradient ascent step does not decrease the function value provided that the stepsize is at most $1/\beta$ (Theorem 57). Because $F_s(\theta_s)$ is $\beta$-smooth for $\beta = \frac{5}{1-\gamma}$ (Lemma 52 and $A^{(t)}(s,a) \le \frac{1}{1-\gamma}$), our assumption that
$$\eta \le \frac{(1-\gamma)^2}{5} = (1-\gamma)\,\beta^{-1}$$
implies that $\eta\,\frac{1}{1-\gamma}\,d^{\pi^{(t)}}_\mu(s) \le 1/\beta$, and so we have
$$F_s(\theta_s^{(t+1)}) \ge F_s(\theta_s^{(t)}),$$
which implies (37).
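As an illustration of the lemma (not part of the proof), the following sketch runs the softmax policy-gradient update (36) with exact gradients on a small random MDP of our own, using $\eta = (1-\gamma)^2/5$, and checks that $V^{(t)}(s)$ never decreases at any state.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 4, 3, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)
eta = (1 - gamma) ** 2 / 5.0                     # stepsize from Lemma 41

def evaluate(theta):
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, np.einsum('sa,sa->s', pi, r))
    Q = r + gamma * P @ V
    d_mu = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    grad = d_mu[:, None] * pi * (Q - V[:, None]) / (1 - gamma)   # Lemma 40
    return V, grad

theta = np.zeros((S, A))
V_prev, grad = evaluate(theta)
for t in range(200):
    theta = theta + eta * grad
    V, grad = evaluate(theta)
    assert np.all(V >= V_prev - 1e-10)           # monotone improvement, per Lemma 41
    V_prev = V
print(V_prev)                                    # values after 200 exact policy gradient steps
```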

Next, we show that the limits of the iterates $V^{(t)}(s)$ and $Q^{(t)}(s,a)$ exist for all states $s$ and actions $a$.

Lemma 42 For all states $s$ and actions $a$, there exist values $V^{(\infty)}(s)$ and $Q^{(\infty)}(s,a)$ such that as $t\to\infty$, $V^{(t)}(s)\to V^{(\infty)}(s)$ and $Q^{(t)}(s,a)\to Q^{(\infty)}(s,a)$. Define
$$\Delta = \min_{\{s,a\,|\,A^{(\infty)}(s,a)\ne 0\}} \big|A^{(\infty)}(s,a)\big|,$$

where A(∞) (s, a) = Q(∞) (s, a) − V (∞) (s). Furthermore, there exists a T0 such that for all t > T0 ,
s ∈ S, and a ∈ A, we have
Q(t) (s, a) ≥ Q(∞) (s, a) − ∆/4 (40)
1
Proof Observe that Q(t+1) (s, a) ≥ Q(t) (s, a) (by Lemma 41) and Q(t) (s, a) ≤ 1−γ , therefore by
(t) (∞) (∞)
monotone convergence theorem, Q (s, a) → Q (s, a) for some constant Q (s, a). Similarly it
follows that V (t) (s) → V (∞) (s) for some constant V (∞) (s). Due to the limits existing, this implies
we can choose T0 , such that the result (40) follows.

Based on the limits V (∞) (s) and Q(∞) (s, a), define following sets:
I0s := {a|Q(∞) (s, a) = V (∞) (s)}
s
I+ := {a|Q(∞) (s, a) > V (∞) (s)}
s
I− := {a|Q(∞) (s, a) < V (∞) (s)} .


In the following lemmas 44- 50, we first show that probabilities π (t) (a|s) → 0 for actions a ∈ I+
s ∪I s

s , lim (t) s
as t → ∞. We then show that for actions a ∈ I− t→∞ θs,a = −∞ and for all actions a ∈ I+ ,
θ(t) (a|s) is bounded from below as t → ∞.

Lemma 43 There exists a $T_1$ such that for all $t>T_1$, $s\in S$, and $a\in A$, we have
$$A^{(t)}(s,a) < -\frac{\Delta}{4}\ \text{ for } a\in I^s_-; \qquad A^{(t)}(s,a) > \frac{\Delta}{4}\ \text{ for } a\in I^s_+. \qquad (41)$$
Proof Since $V^{(t)}(s)\to V^{(\infty)}(s)$, there exists $T_1 > T_0$ such that for all $t > T_1$,
$$V^{(t)}(s) > V^{(\infty)}(s) - \frac{\Delta}{4}. \qquad (42)$$
s
Using Equation (40), it follows that for t > T1 > T0 , for a ∈ I−

A(t) (s, a) = Q(t) (s, a) − V (t) (s)


≤ Q(∞) (s, a) − V (t) (s)
≤ Q(∞) (s, a) − V (∞) (s) + ∆/4 (Equation (42))
≤ −∆ + ∆/4 (definition of s
I− and Lemma 42)
< −∆/4

Similarly A(t) (s, a) = Q(t) (s, a) − V (t) (s) > ∆/4 for a ∈ I+
s as

A(t) (s, a) = Q(t) (s, a) − V (t) (s)


≥ Q(∞) (s, a) − ∆/4 − V (t) (s) (Lemma 42)
(∞) (∞) (∞) (s) (t) (s)
≥Q (s, a) − V (s) − ∆/4 (V ≥V from Lemma 41)
≥ ∆ − ∆/4
> ∆/4

which completes the proof.

∂V (t) (µ) s ∪ Is ,
Lemma 44 → 0 as t → ∞ for all states s and actions a. This implies that for a ∈ I+
∂θs,a −
(t) (t)
P
π (a|s) → 0 and that a∈I s π (a|s) → 1.
0

Proof Because V πθ (µ) is smooth (Lemma 55) as a function of θ, it follows from standard optimiza-
(t)
tion results (Theorem 57) that ∂V∂θs,a(µ) → 0 for all states s and actions a. We have from Lemma
40
∂V (t) (µ) 1 (t)
= dπµ (s)π (t) (a|s)A(t) (s, a).
∂θs,a 1−γ
(t) µ(s)
Since, |A(t) (s, a)| > ∆ s s π
4 for all t > T1 (from Lemma 43) for all a ∈ I− ∪ I+ and dµ (s) ≥ 1−γ >0
(using the strict positivity of µ in our assumption in Theorem 10), we have π (t) (a|s) → 0.


(t) s, θ (t)
Lemma 45 (Monotonicity in θs,a ). For all a ∈ I+ s,a is strictly increasing for t ≥ T1 . For all
s (t)
a ∈ I− , θs,a is strictly decreasing for t ≥ T1 .

Proof We have from Lemma 40


∂V (t) (µ) 1 (t)
= dπµ (s)π (t) (a|s)A(t) (s, a)
∂θs,a 1−γ
From Lemma 43, we have for all t > T1

A(t) (s, a) > 0 for a ∈ I+


s
; A(t) (s, a) < 0 for a ∈ I−
s

(t)
Since dπµ (s) > 0 and π (t) (a|s) > 0 for the softmax parameterization, we have for all t > T1

∂V (t) (µ) s ∂V (t) (µ) s


> 0 for a ∈ I+ ; < 0 for a ∈ I−
∂θs,a ∂θs,a

s,θ (t+1) (t) ∂V (t) (µ) (t)


This implies for all a ∈ I+ s,a − θs,a = ∂θs,a > 0 i.e. θs,a is strictly increasing for t ≥ T1 .
The second claim follows similarly.

s 6= ∅, we have that:
Lemma 46 For all s where I+
(t) (t)
maxs θs,a → ∞, min θs,a → −∞
a∈I0 a∈A

s 6= ∅, there exists some action a ∈ I s . From Lemma 44,


Proof Since I+ + +

π (t) (a+ |s) → 0, as t → ∞

or equivalently by softmax parameterization,


(t)
exp(θs,a+ )
P (t)
→ 0, as t → ∞
a exp(θs,a )

s and in particular for a , θ (t)


From Lemma 45, for any action a ∈ I+ + s,a+ is monotonically increasing
for t > T1 . That is the numerator in previous display is monotonically increasing. Therefore, the
denominator should go to infinity i.e.
X
(t)
exp(θs,a ) → ∞, as t → ∞.
a

From Lemma 44, X


π (t) (a|s) → 1, as t → ∞
a∈I0s

or equivalently
P (t)
a∈I0s exp(θs,a )
P (t)
→ 1, as t → ∞
a exp(θs,a )


Since, denominator goes to ∞,


X
(t)
exp(θs,a ) → ∞, as t → ∞
a∈I0s

which implies
(t)
maxs θs,a → ∞, as t → ∞
a∈I0

(t)
Note this also implies maxa∈A θs,a → ∞. The last part of the proof is completed using that the
P ∂V (t) (µ) P (t)
gradients sum to 0, i.e. a ∂θs,a = 0. From gradient sum to 0, we get that a∈A θs,a =
P (0)
a∈A θs,a := c for all t > 0 where c is defined as the sum (over A) of initial parameters. That is
(t) 1 (t) (t)
mina∈A θs,a < − |A| maxa∈A θs,a + c. Since, maxa∈A θs,a → ∞, the result follows.

Lemma 47 Suppose a+ ∈ I+ s . For any a ∈ I s , if there exists a t ≥ T such that π (t) (a|s) ≤
0 0
π (t) (a+ |s), then for all τ ≥ t, π (τ ) (a|s) ≤ π (τ ) (a+ |s).

Proof The proof is inductive. Suppose π (t) (a|s) ≤ π (t) (a+ |s), this implies from Lemma 40

∂V (t) (µ)
 
1 (t) (t) (t) (t)
= d (s)π (a|s) Q (s, a) − V (s)
∂θs,a 1−γ µ
∂V (t) (µ)
 
1 (t) (t) (t) (t)
≤ dµ (s)π (a+ |s) Q (s, a+ ) − V (s) = .
1−γ ∂θs,a+

where the second to last step follows from Q(t) (s, a+ ) ≥ Q(∞) (s, a+ ) − ∆/4 ≥ Q(∞) (s, a) + ∆ −
∆/4 > Q(t) (s, a) for t > T0 . This implies that π (t+1) (a|s) ≤ π (t+1) (a+ |s) which completes the
proof.

Consider an arbitrary a+ ∈ I+ s . Let us partition the set I s into B s (a ) and B̄ s (a ) as follows:


0 0 + 0 +
B0s (a+ ) is the set of all a ∈ I0s such that for all t ≥ T0 , π (t) (a+ |s) < π (t) (a|s), and B̄0s (a+ ) contains
the remainder of the actions from I0s . We drop the argument (a+ ) when clear from the context.
s 6= ∅. For all a ∈ I s , we have that B s (a ) 6= ∅ and that
Lemma 48 Suppose I+ + + 0 +
X
π (t) (a|s) → 1, as t → ∞.
a∈B0s (a+ )

This implies that:


(t)
max θs,a → ∞.
a∈B0s (a+ )

Proof Let a+ ∈ I+ s . Consider any a ∈ B̄ s . Then, by definition of B̄ s , there exists t0 > T such
0 0 0
that π (a+ |s) ≥ π (t) (a|s). From Lemma 47, for all τ > t π (τ ) (a+ |s) ≥ π (τ ) (a|s). Also, since
(t)

π (t) (a+ |s) → 0, this implies


π (t) (a|s) → 0 for all a ∈ B̄0s


Since, B0s ∪ B̄0s = I0s and π (t) (a|s) → 1 (from Lemma 44), this implies that B0s 6= ∅ and that
P
a∈I0s
means X
π (t) (a|s) → 1, as t → ∞,
a∈B0s

which completes the proof of the first claim. The proof of the second claim is identical to the proof in Lemma 46, where instead of $\sum_{a\in I^s_0}\pi^{(t)}(a|s)\to 1$, we use $\sum_{a\in B^s_0}\pi^{(t)}(a|s)\to 1$.

s 6= ∅. Then, for any a ∈ I s , there exists an iteration T


Lemma 49 Consider any s where I+ + + a+
such that for all t > Ta+ ,
π (t) (a+ |s) > π (t) (a|s)
for all a ∈ B̄0s (a+ ).

Proof The proof follows from definition of B̄0s (a+ ). That is if a ∈ B̄0s (a+ ), then there exists
a iteration ta > T0 such that π (ta ) (a+ |s) > π (ta ) (a|s). Then using Lemma 47, for all τ > ta ,
π (τ ) (a+ |s) > π (τ ) (a|s). Choosing
Ta+ = max s
ta
a∈B0 (a+ )

completes the proof.

s , we have that θ (t)


Lemma 50 For all actions a ∈ I+ s,a is bounded from below as t → ∞. For all
s (t)
actions a ∈ I− , we have that θs,a → −∞ as t → ∞.

(t)
Proof For the first claim, from Lemma 45, we know that after T1 , θs,a is strictly increasing for
s , i.e. for all t > T
a ∈ I+ 1
(t) (T1 )
θs,a ≥ θs,a .
(t) s (Lemma 45).
For the second claim, we know that after T1 , θs,a is strictly decreasing for a ∈ I−
(t)
Therefore, by monotone convergence theorem, limt→∞ θs,a exists and is either −∞ or some constant
θ0 . We now prove the second claim by contradiction. Suppose a ∈ I− s and that there exists a θ , such
0
(t) 0
that θs,a > θ0 , for t ≥ T1 . By Lemma 46, there must exist an action where a ∈ A such that
(t)
lim inf θs,a0 = −∞. (43)
t→∞

(T )
Let us consider some δ > 0 such that θs,a10 ≥ θ0 − δ. Now for t ≥ T1 define τ (t) as follows:
(k)
τ (t) = k if k is the largest iteration in the interval [T1 , t] such that θs,a0 ≥ θ0 − δ (i.e. τ (t) is
the latest iteration before θs,a0 crosses below θ0 − δ). Define T (t) as the subsequence of iterations
(t0 )
τ (t) < t0 < t such that θs,a0 decreases, i.e.
0
∂V (t ) (µ)
≤ 0, for τ (t) < t0 < t.
∂θs,a0


Define Zt as the sum (if T (t) = ∅, we define Zt = 0):

X ∂V (t0 ) (µ)
Zt = .
0 (t)
∂θs,a0
t ∈T

For non-empty T (t) , this gives:

X ∂V (t0 ) (µ) t−1 0 t−1 0


X ∂V (t ) (µ) X ∂V (t ) (µ) 1
Zt = ≤ ≤ +
∂θs,a0 ∂θs,a0 ∂θs,a0 1 − γ2
0 (t)
t ∈T t0 =τ (t)−1 t0 =τ (t)
 
1 (t) (τ (t)) 1 1 (t) 1
= (θs,a0 − θs,a0 ) + 2
≤ θs,a0 − (θ0 − δ) + ,
η 1−γ η 1 − γ2

(t0 ) (µ)
where we have used that | ∂V∂θ | ≤ 1/(1 − γ). By (43), this implies that:
s,a0

lim inf Zt = −∞.


t→∞

For any T (t) 6= ∅, this implies that for all t0 ∈ T (t) , from Lemma 40
0 0 0
∂V (t ) (µ)/∂θs,a π (t ) (a|s)A(t ) (s, a) (t0 )  (1 − γ)∆
(t 0) = (t0) 0 (t0) 0
≥ exp θ0 − θs,a0
∂V (µ)/∂θs,a0 π (a |s)A (s, a ) 4
 (1 − γ)∆
≥ exp δ
4
0 0
where we have used that |A(t ) (s, a0 )| ≤ 1/(1 − γ) and |A(t ) (s, a)| ≥ ∆
4 for all t0 > T1 (from
∂V (t0 ) (µ) ∂V (t0 ) (µ)
Lemma 43). Note that since ∂θs,a < 0 and ∂θs,a0 < 0 over the subsequence T (t) , the sign of
the inequality reverses. In particular, for any T (t) 6= ∅

t−1 0
X ∂V (t0 ) (µ)
1 (T1 ) (t)
X ∂V (t ) (µ)
(θs,a − θs,a )= ≤
η 0
∂θs,a ∂θs,a
t =T1 0 (t) t ∈T
 (1 − γ)∆ X ∂V (t0 ) (µ)
≤ exp δ
4 0 (t)
∂θs,a0
t ∈T
 (1 − γ)∆
= exp δ Zt
4
(t) ∂V (t) (µ)
where the first step follows from that θs,a is monotonically decreasing, i.e. ∂θs,a < 0 for t ∈
/T
(Lemma 45). Since,
lim inf Zt = −∞,
t→∞

(t)
this contradicts that θs,a is lower bounded from below, which completes the proof.


s 6= ∅. Then, for any a ∈ I s ,


Lemma 51 Consider any s where I+ + +
X
(t)
θs,a → ∞, as t → ∞
a∈B0s (a+ )

Proof Consider any a ∈ B0s . We have by definition of B0s that π (t) (a+ |s) < π (t) (a|s) for all t > T0 .
(t) (t) (t)
This implies by the softmax parameterization that θs,a+ < θs,a . Since, θs,a+ is lower bounded as
(t)
t → ∞ (using Lemma 50), this implies θs,a is lower bounded as t → ∞ for all a ∈ B0s . This in
(t)
conjunction with maxa∈B0s (a+ ) θs,a → ∞ implies
X
(t)
θs,a → ∞, (44)
a∈B0s

which proves this claim.

We are now ready to complete the proof for Theorem 10. We prove it by showing that I+ s is

empty for all states s or equivalently V (t) (s0 ) → V ? (s0 ) as t → ∞.


Proof [Proof for Theorem 10] Suppose the set I+ s is non-empty for some s, else the proof is complete.
s
Let a+ ∈ I+ . Then, from Lemma 51,
X
(t)
θs,a → ∞, (45)
a∈B0s

s , we have that since π (t) (a|s)


Now we proceed by showing a contradiction. For a ∈ I− π (t) (a+ |s)
=
(t) (t) (t) (t)
exp(θs,a − θs,a+ ) → 0 (as θs,a+ is lower bounded and θs,a → −∞ by Lemma 50), there exists
T2 > T0 such that
π (t) (a|s) (1 − γ)∆
(t)
<
π (a+ |s) 16|A|
or, equivalently,
X π (t) (a|s) ∆
− > −π (t) (a+ |s) . (46)
s
1−γ 16
a∈I−

π (t) (a+ |s)


For a ∈ B̄0s , we have A(t) (s, a) → 0 (by definition of set I0s and B̄0s ⊂ I0s ) and 1 < π (t) (a|s)
for
all t > Ta+ from Lemma 49. Thus, there exists T3 > T2 , Ta+ such that

π (t) (a+ |s) ∆


|A(t) (s, a)| <
π (t) (a|s) 16|A|
which implies
X ∆
π (t) (a|s)|A(t) (s, a)| < π (t) (a+ |s)
16
a∈B̄0s

∆ X ∆
−π (t) (a+ |s) < π (t) (a|s)A(t) (s, a) < π (t) (a+ |s) (47)
16 s
16
a∈B̄0


We have for t > T3 , from a∈A π (t) (a|s)A(t) (s, a) = 0,


P

X X X
0= π (t) (a|s)A(t) (s, a) + π (t) (a|s)A(t) (s, a) + π (t) (a|s)A(t) (s, a)
a∈I0s s
a∈I+ s
a∈I−
(a) X X
≥ π (t) (a|s)A(t) (s, a) + π (t) (a|s)A(t) (s, a) + π (t) (a+ |s)A(t) (s, a+ )
a∈B0s a∈B̄0s
X
+ π (t) (a|s)A(t) (s, a)
s
a∈I−
(b) X X ∆ X π (t) (a|s)
≥ π (t) (a|s)A(t) (s, a) + π (t) (a|s)A(t) (s, a) + π (t) (a+ |s) −
4 1−γ
a∈B0s a∈B̄0s s
a∈I−
(c) X ∆ ∆ ∆
> π (t) (a|s)A(t) (s, a) − π (t) (a+ |s) + π (t) (a+ |s) − π (t) (a+ |s)
16 4 16
a∈B0s
X
> π (t) (a|s)A(t) (s, a)
a∈B0s

where in the step (a), we used A(t) (s, a) > 0 for all actions a ∈ I+
s for t > T > T from Lemma 43.
3 1
∆ 1
In the step (b), we used A (s, a+ ) ≥ 4 for t > T3 > T1 from Lemma 43 and A(t) (s, a) ≥ − 1−γ
(t) .
In the step (c), we used Equation (46) and left inequality in (47). This implies that for all t > T3
X ∂V (t) (µ)
<0
s
∂θs,a
a∈B0

This contradicts Equation (45) which requires


∞ X
X 
(t) (T3 )
 X ∂V (t) (µ)
lim θs,a − θs,a =η → ∞.
t→∞ ∂θs,a
a∈B0s s t=T3 a∈B0

s must be empty, which completes the proof.


Therefore, the set I+

C.2 Proofs for Section 5.2


Proof [of Corollary 13] Using Theorem 12, the desired optimality gap $\epsilon$ will follow if we set
$$\lambda = \frac{\epsilon(1-\gamma)}{2\left\|\frac{d^{\pi^\star}_\rho}{\mu}\right\|_\infty} \qquad (48)$$
and if $\|\nabla_\theta L_\lambda(\theta)\|_2 \le \lambda/(2|S|\,|A|)$. In order to complete the proof, we need to bound the iteration complexity of making the gradient sufficiently small.
Since the optimization is deterministic and unconstrained, we can appeal to standard results (Theorem 57) which give that after $T$ iterations of gradient ascent with stepsize of $1/\beta_\lambda$, we have
$$\min_{t\le T}\ \|\nabla_\theta L_\lambda(\theta^{(t)})\|_2^2 \ \le\ \frac{2\beta_\lambda\big(L_\lambda(\theta^\star) - L_\lambda(\theta^{(0)})\big)}{T} \ \le\ \frac{2\beta_\lambda}{(1-\gamma)\,T}, \qquad (49)$$


where $\beta_\lambda$ is an upper bound on the smoothness of $L_\lambda(\theta)$. We seek to ensure
$$\epsilon_{\mathrm{opt}} \ \le\ \sqrt{\frac{2\beta_\lambda}{(1-\gamma)\,T}} \ \le\ \frac{\lambda}{2|S|\,|A|}.$$
Choosing $T \ge \frac{8\beta_\lambda |S|^2|A|^2}{(1-\gamma)\lambda^2}$ satisfies the above inequality. By Lemma 55, we can take $\beta_\lambda = \frac{8\gamma}{(1-\gamma)^3} + \frac{2\lambda}{|S|}$, and so
$$
\frac{8\beta_\lambda|S|^2|A|^2}{(1-\gamma)\lambda^2}
\ \le\ \frac{64\,|S|^2|A|^2}{(1-\gamma)^4\lambda^2} + \frac{16\,|S|\,|A|^2}{(1-\gamma)\lambda}
\ \le\ \frac{80\,|S|^2|A|^2}{(1-\gamma)^4\lambda^2}
\ =\ \frac{320\,|S|^2|A|^2}{(1-\gamma)^6\,\epsilon^2}\left\|\frac{d^{\pi^\star}_\rho}{\mu}\right\|_\infty^2,
$$
where we have used that $\lambda < 1$. This completes the proof.

C.3 Proofs for Section 5.3


Proof [of Lemma 15] Following the definition of compatible function approximation in Sutton et al.
(1999), which was also invoked in Kakade (2001), for a vector w ∈ R|S||A| , we define the error
function
$$L_\theta(w) = E_{s\sim d^{\pi_\theta}_\rho,\,a\sim\pi_\theta(\cdot|s)}\Big[\big(w^\top\nabla_\theta\log\pi_\theta(a|s) - A^{\pi_\theta}(s,a)\big)^2\Big].$$

Let wθ? be the minimizer of Lθ (w) with the smallest `2 norm. Then by definition of Moore-
Penrose pseudoinverse, it is easily seen that

$$w^\star_\theta = F_\rho(\theta)^\dagger\, E_{s\sim d^{\pi_\theta}_\rho,\,a\sim\pi_\theta(\cdot|s)}\big[\nabla_\theta\log\pi_\theta(a|s)\,A^{\pi_\theta}(s,a)\big] = (1-\gamma)\,F_\rho(\theta)^\dagger\,\nabla_\theta V^{\pi_\theta}(\rho). \qquad (50)$$

In other words, $w^\star_\theta$ is precisely proportional to the NPG update direction. Note further that for the softmax policy parameterization, we have by (35),
$$w^\top\nabla_\theta\log\pi_\theta(a|s) = w_{s,a} - \sum_{a'\in A} w_{s,a'}\,\pi_\theta(a'|s).$$
Since $\sum_{a\in A}\pi(a|s)A^\pi(s,a) = 0$, this immediately yields that $L_\theta(A^{\pi_\theta}) = 0$. However, this might not be the unique minimizer of $L_\theta$, which is problematic since $w^\star(\theta)$ as defined in terms of the Moore–Penrose pseudoinverse is formally the smallest-norm solution to the least-squares problem, which $A^{\pi_\theta}$ may not be. However, given any vector $v\in\mathbb{R}^{|S||A|}$, let us consider solutions of the form $A^{\pi_\theta}+v$. Due to the form of the derivatives of the policy for the softmax parameterization (recall Equation 35), we have for any state $s$ and action $a$ such that $s$ is reachable under $\rho$,
$$v^\top\nabla_\theta\log\pi_\theta(a|s) = \sum_{a'\in A}\big(v_{s,a'}\mathbf{1}[a=a'] - v_{s,a'}\pi_\theta(a'|s)\big) = v_{s,a} - \sum_{a'\in A}v_{s,a'}\,\pi(a'|s).$$


Note that here we have used that $\pi_\theta$ is a stochastic policy with $\pi_\theta(a|s) > 0$ for all actions $a$ in each state $s$, so that if a state is reachable under $\rho$, it will also be reachable using $\pi_\theta$, and hence the zero-derivative conditions apply at each reachable state. For $A^{\pi_\theta}+v$ to minimize $L_\theta$, we would like $v^\top\nabla_\theta\log\pi_\theta(a|s) = 0$ for all $s,a$, so that $v_{s,a}$ is independent of the action and can be written as a constant $c_s$ for each $s$ by the above equality. Hence, the minimizer of $L_\theta(w)$ is determined up to a state-dependent offset, and
$$F_\rho(\theta)^\dagger\,\nabla_\theta V^{\pi_\theta}(\rho) = \frac{A^{\pi_\theta}}{1-\gamma} + v,$$
where $v_{s,a} = c_s$ for some $c_s\in\mathbb{R}$ for each state $s$ and action $a$. Finally, we observe that this yields the updates
$$\theta^{(t+1)} = \theta^{(t)} + \frac{\eta}{1-\gamma}\,A^{(t)} + \eta v \quad\text{and}\quad \pi^{(t+1)}(a|s) = \pi^{(t)}(a|s)\,\frac{\exp\!\big(\eta A^{(t)}(s,a)/(1-\gamma) + \eta c_s\big)}{Z_t(s)}.$$
Owing to the normalization factor $Z_t(s)$, the state-dependent offset $c_s$ cancels in the updates for $\pi$, so the resulting policy is invariant to the specific choice of $c_s$. Hence, we pick $c_s \equiv 0$, which yields the statement of the lemma.
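In implementation terms, the lemma says that the NPG step with the softmax parameterization is a per-state multiplicative-weights update on the policy probabilities. A minimal standalone sketch of one such update (with $A^{(t)}$ assumed to be given exactly) is:

```python
import numpy as np

def npg_softmax_update(pi, adv, eta, gamma):
    # pi, adv: arrays of shape (S, A) holding pi^{(t)}(a|s) and A^{(t)}(s, a)
    # Lemma 15: pi^{(t+1)}(a|s) proportional to pi^{(t)}(a|s) * exp(eta * A^{(t)}(s,a) / (1-gamma))
    logits = np.log(pi) + eta * adv / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)        # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)  # the Z_t(s) normalization

# toy usage
pi = np.full((2, 3), 1.0 / 3)
adv = np.array([[0.5, -0.2, -0.3], [0.1, 0.0, -0.1]])
print(npg_softmax_update(pi, adv, eta=0.5, gamma=0.9))
```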

Appendix D. Smoothness Proofs


Various convergence guarantees we show leverage results from smooth, non-convex optimization.
In this section, we collect the various results on smoothness of policies and value functions in the
different parameterizations which are needed in our analysis.
Define the Hadamard product of two vectors:
$$[x \odot y]_i = x_i\,y_i.$$
Define $\mathrm{diag}(x)$ for a column vector $x$ as the diagonal matrix with diagonal $x$.

Lemma 52 (Smoothness of $F$ (see Equation 38)) Fix a state $s$. Let $\theta_s\in\mathbb{R}^{|A|}$ be the column vector of parameters for state $s$, and let $\pi_\theta(\cdot|s)$ be the corresponding vector of action probabilities given by the softmax parameterization. For some fixed vector $c\in\mathbb{R}^{|A|}$, define:
$$F(\theta_s) := \pi_\theta(\cdot|s)\cdot c = \sum_a \pi_\theta(a|s)\,c_a.$$
Then
$$\|\nabla_{\theta_s}F(\theta_s) - \nabla_{\theta_s}F(\theta_s')\|_2 \le \beta\,\|\theta_s - \theta_s'\|_2,$$
where $\beta = 5\|c\|_\infty$.

Proof For notational convenience, we do not explicitly state the $s$ dependence. For the softmax parameterization, we have that
$$\nabla_\theta \pi_\theta = \mathrm{diag}(\pi_\theta) - \pi_\theta\pi_\theta^\top.$$
We can then write (as $\nabla_\theta\pi_\theta$ is symmetric),
$$\nabla_\theta(\pi_\theta\cdot c) = \big(\mathrm{diag}(\pi_\theta) - \pi_\theta\pi_\theta^\top\big)c = \pi_\theta\odot c - (\pi_\theta\cdot c)\,\pi_\theta, \qquad (51)$$
and therefore
$$\nabla^2_\theta(\pi_\theta\cdot c) = \nabla_\theta\big(\pi_\theta\odot c - (\pi_\theta\cdot c)\pi_\theta\big).$$
For the first term, we get
$$\nabla_\theta(\pi_\theta\odot c) = \mathrm{diag}(\pi_\theta\odot c) - \pi_\theta(\pi_\theta\odot c)^\top,$$
and the second term we can decompose by the chain rule as
$$\nabla_\theta\big((\pi_\theta\cdot c)\pi_\theta\big) = (\pi_\theta\cdot c)\,\nabla_\theta\pi_\theta + \big(\nabla_\theta(\pi_\theta\cdot c)\big)\pi_\theta^\top.$$
Substituting these back, we get
$$\nabla^2_\theta(\pi_\theta\cdot c) = \mathrm{diag}(\pi_\theta\odot c) - \pi_\theta(\pi_\theta\odot c)^\top - (\pi_\theta\cdot c)\,\nabla_\theta\pi_\theta - \big(\nabla_\theta(\pi_\theta\cdot c)\big)\pi_\theta^\top. \qquad (52)$$
Note that
$$\max\big(\|\mathrm{diag}(\pi_\theta\odot c)\|_2,\ \|\pi_\theta\odot c\|_2,\ |\pi_\theta\cdot c|\big) \le \|c\|_\infty,$$
$$\|\nabla_\theta\pi_\theta\|_2 = \big\|\mathrm{diag}(\pi_\theta) - \pi_\theta\pi_\theta^\top\big\|_2 \le 1,$$
$$\|\nabla_\theta(\pi_\theta\cdot c)\|_2 \le \|\pi_\theta\odot c\|_2 + \|(\pi_\theta\cdot c)\pi_\theta\|_2 \le 2\|c\|_\infty,$$
which gives
$$\big\|\nabla^2_\theta(\pi_\theta\cdot c)\big\|_2 \le 5\|c\|_\infty.$$
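The constant in Lemma 52 can be sanity-checked numerically. The sketch below (our own check) estimates the Hessian of $\theta_s \mapsto \pi_{\theta_s}\cdot c$ by finite differences of the gradient in (51) and compares its spectral norm to $5\|c\|_\infty$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = 6
c = rng.normal(size=A)

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_F(theta):
    # Equation (51): gradient of pi_theta . c is (pi Hadamard c) - (pi . c) * pi
    pi = softmax(theta)
    return pi * c - (pi @ c) * pi

theta = rng.normal(size=A)
eps = 1e-5
H = np.zeros((A, A))
for j in range(A):
    tp = theta.copy(); tp[j] += eps
    tm = theta.copy(); tm[j] -= eps
    H[:, j] = (grad_F(tp) - grad_F(tm)) / (2 * eps)

spec = np.linalg.norm(H, 2)
print(spec, 5 * np.max(np.abs(c)), spec <= 5 * np.max(np.abs(c)))   # the bound holds
```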

Before we prove the smoothness results for ∇π V π (s0 ) and ∇θ V πθ (s0 ), we prove the following
helpful lemma. This lemma is general and not specific to the direct or softmax policy parameteriza-
tions.

Lemma 53 Let πα := πθ+αu and let Ve (α) be the corresponding value at a fixed state s0 , i.e.

Ve (α) := V πα (s0 ).

Assume that
X dπα (a|s0 ) X d2 πα (a|s0 )
≤ C1 , ≤ C2
dα α=0 (dα)2 α=0
a∈A a∈A

Then
d2 Ve (α) C2 2γC12
max ≤ + .
kuk2 =1 (dα)2 α=0 (1 − γ)2 (1 − γ)3


Proof Consider a unit vector u and let Pe(α) be the state-action transition matrix under π, i.e.

[Pe(α)](s,a)→(s0 ,a0 ) = πα (a0 |s0 )P (s0 |s, a).

We can differentiate Pe(α) w.r.t α to get


" #
dPe(α) dπα (a0 |s0 )
= P (s0 |s, a).
dα α=0 dα α=0
(s,a)→(s0 ,a0 )

For an arbitrary vector x,


" #
dPe(α) X dπα (a0 |s0 )
x = P (s0 |s, a)xa0 ,s0
dα α=0 dα α=0
s,a a0 ,s0

and therefore
" #
dPe(α) X dπα (a0 |s0 )
max x = max P (s0 |s, a)xa0 ,s0
kuk2 =1 dα α=0 kuk2 =1 dα α=0
s,a a0 ,s0
X dπα (a0 |s0 )
≤ P (s0 |s, a)|xa0 ,s0 |
0 0
dα α=0
a ,s
X X dπα (a0 |s0 )
≤ P (s0 |s, a)kxk∞
dα α=0
s0 0a
X
0
≤ P (s |s, a)kxk∞ C1
s0
≤ C1 kxk∞ .

By definition of `∞ norm,
dPe(α)
max x ≤ C1 kxk∞
kuk2 =1 dα

Similarly, differentiating Pe(α) twice w.r.t. α, we get


" #
d2 Pe(α) d2 πα (a0 |s0 )
= P (s0 |s, a).
(dα)2 α=0 0 0
(dα)2 α=0
(s,a)→(s ,a )

An identical argument leads to that, for arbitrary x,

d2 Pe(α)
max x ≤ C2 kxk∞
kuk2 =1 (dα)2 α=0 ∞

Let Qα (s0 , a0 ) be the corresponding Q-function for policy πα at state s0 and action a0 . Observe
that Qα (s0 , a0 ) can be written as:

Qα (s0 , a0 ) = e> −1 >


(s0 ,a0 ) (I − γ P (α)) r = e(s0 ,a0 ) M (α)r
e


where M (α) := (I − γ Pe(α))−1 and differentiating twice w.r.t α gives:


dQα (s0 , a) dPe(α)
= γe>
(s0 ,a) M (α) M (α)r,
dα dα
d2 Qα (s0 , a0 ) dPe(α) dPe(α)
2
= 2γ 2 e>
(s0 ,a0 ) M (α) M (α) M (α)r
(dα) dα dα
d2 Pe(α)
+ γe> (s0 ,a0 ) M (α) M (α)r.
(dα)2
By using power series expansion of matrix inverse, we can write M (α) as:

X
M (α) = (I − γ Pe(α))−1 = γ n Pe(α)n
n=0
1
which implies that M (α) ≥ 0 (componentwise) and M (α)1 = 1−γ 1, i.e. each row of M (α) is
positive and sums to 1/(1 − γ). This implies:
1
max kM (α)xk∞ ≤ kxk∞
kuk2 =1 1−γ
d2 Qα (s0 ,a0 ) dQα (s0 ,a)
This gives using expression for (dα)2
and dα ,

d2 Qα (s0 , a0 ) dPe(α) dPe(α)


max ≤ 2γ 2 M (α) M (α) M (α)r
kuk2 =1 (dα)2 α=0 dα dα

d2 Pe(α)
+ γ M (α) M (α)r
(dα)2

2γ 2 C12 γC2
≤ 3
+
(1 − γ) (1 − γ)2
dQα (s0 , a) dPe(α)
max ≤ γM (α) M (α)r
kuk2 =1 dα α=0 dα

γC1

(1 − γ)2
Consider the identity: X
Ve (α) = πα (a|s0 )Qα (s0 , a),
a

By differentiating Ve (α) twice w.r.t α, we get


d2 Ve (α) X d2 πα (a|s0 ) α X dπα (a|s0 ) dQα (s0 , a) X d2 Qα (s0 , a)
= Q (s 0 , a) + 2 + π α (a|s 0 ) .
(dα)2 a
(dα)2 a
dα dα a
(dα)2
Hence,
d2 Ve (α) C2 2γC12 2γ 2 C12 γC2
max 2
≤ + 2
+ 3
+
kuk2 =1 (dα) 1 − γ (1 − γ) (1 − γ) (1 − γ)2
C2 2γC12
= + ,
(1 − γ)2 (1 − γ)3


which completes the proof.

Using this lemma, we now establish smoothness for: the value functions under the direct policy
parameterization and the log barrier regularized objective 12 for the softmax parameterization.

Lemma 54 (Smoothness for direct parameterization) For all starting states $s_0$,
$$\big\|\nabla_\pi V^{\pi}(s_0) - \nabla_\pi V^{\pi'}(s_0)\big\|_2 \ \le\ \frac{2\gamma|A|}{(1-\gamma)^3}\,\|\pi-\pi'\|_2.$$

Proof By differentiating πα w.r.t α gives


X dπα (a|s0 ) X p
≤ |ua,s | ≤ |A|

a∈A a∈A

and differentiating again w.r.t α gives


X d2 πα (a|s0 )
=0
(dα)2
a∈A
p
Using this with Lemma 53 with C1 = |A| and C2 = 0, we get

d2 Ve (α) C2 2γC12 2γ|A|


max ≤ + ≤
kuk2 =1 (dα)2 α=0 (1 − γ)2 (1 − γ)3 (1 − γ)3

which completes the proof.

We now present a smoothness result for the entropy regularized policy optimization problem
which we study for the softmax parameterization.

Lemma 55 (Smoothness for log barrier regularized softmax) For the softmax parameterization and
$$L_\lambda(\theta) = V^{\pi_\theta}(\mu) + \frac{\lambda}{|S|\,|A|}\sum_{s,a}\log\pi_\theta(a|s),$$
we have that
$$\|\nabla_\theta L_\lambda(\theta) - \nabla_\theta L_\lambda(\theta')\|_2 \le \beta_\lambda\,\|\theta-\theta'\|_2,$$
where
$$\beta_\lambda = \frac{8}{(1-\gamma)^3} + \frac{2\lambda}{|S|}.$$

Proof Let us first bound the smoothness of V πθ (µ). Consider a unit vector u. Let θs ∈ R|A| denote
the parameters associated with a given state s. We have:
 
∇θs πθ (a|s) = πθ (a|s) ea − π(·|s)


and
 
> > > >
∇2θs πθ (a|s) = πθ (a|s) ea ea − ea π(·|s) − π(·|s)ea + 2π(·|s)π(·|s) − diag(π(·|s)) ,

where ea is a standard basis vector and π(·|s) is a vector of probabilities. We also have by differenti-
ating πα (a|s) once w.r.t α,
X dπα (a|s) X
≤ u> ∇θ+αu πα (a|s)
dα α=0 α=0
a∈A a∈A
X
≤ πθ (a|s) u> >
s ea − us π(·|s)
a∈A
 
≤ max u>
s ea + u>
s π(·|s) ≤2
a∈A

Similarly, differentiating once again w.r.t. α, we get


X d2 πα (a|s) X
≤ u> ∇2θ+αu πα (a|s) u
(dα)2 α=0 α=0
a∈A a∈A

≤ max u> > > > > >
s ea ea us + us ea π(·|s) us + us π(·|s)ea us
a∈A

> > >
+ 2 us π(·|s)π(·|s) us + us diag(π(·|s))us

≤6

Using this with Lemma 53 for C1 = 2 and C2 = 6, we get

d2 Ve (α) C2 2γC12 6 8γ 8
max ≤ + ≤ + ≤
kuk2 =1 (dα)2 α=0 (1 − γ) 2 (1 − γ)3 (1 − γ)2 (1 − γ) 3 (1 − γ)3

or equivalently for all starting states s and hence for all starting state distributions µ,

k∇θ V πθ (µ) − ∇θ V πθ0 (µ)k2 ≤ β θ − θ0 2


(53)
8
where β = (1−γ)3
.
λ
Now let us bound the smoothness of the regularizer |S| R(θ), where

1 X
R(θ) := log πθ (a|s)
|A| s,a

We have
∂R(θ) 1
= − πθ (a|s).
∂θs,a |A|
Equivalently,
1
∇θs R(θ) = 1 − πθ (·|s).
|A|


Hence,
∇2θs R(θ) = −diag(πθ (·|s)) + πθ (·|s)πθ (·|s)> .
For any vector us ,

u> 2 > 2 2
s ∇θs R(θ)us = us diag(πθ (·|s))us − (us · πθ (·|s)) ≤ 2kus k∞ .

Since ∇θs ∇θs0 R(θ) = 0 for s 6= s0 ,

X X
u> ∇2θ R(θ)u = u> 2
s ∇θs R(θ)us ≤ 2 kus k2∞ ≤ 2kuk22 .
s s

λ 2λ
Thus R is 2-smooth and |S| R is |S| -smooth, which completes the proof.

Appendix E. Standard Optimization Results


In this section, we present the standard optimization results from Ghadimi and Lan (2016); Beck
(2017) used in our proofs. We consider solving the following problem

$$\min_{x\in C}\ f(x) \qquad (54)$$

with C being a nonempty closed and convex set. We assume the following

Assumption E.1 f : Rd → (−∞, ∞) is proper and closed, dom(f ) is convex and f is β smooth
over int(dom(f )).

Throughout the section, we will denote the optimal f value by f (x∗ ).

Definition 56 (Gradient Mapping) We define the gradient mapping Gη (x) as

$$G^\eta(x) := \frac{1}{\eta}\big(x - P_C(x - \eta\nabla f(x))\big) \qquad (55)$$

where PC is the projection onto C.

Note that when C = Rd , the gradient mapping Gη (x) = ∇f (x).

Theorem 57 (Theorem 10.15 Beck (2017)) Suppose that Assumption E.1 holds and let {xk }k≥0
be the sequence generated by the gradient descent algorithm for solving the problem (54) with the
stepsize η = 1/β. Then,

1. The sequence $\{f(x_t)\}_{t\ge 0}$ is non-increasing.

2. $G^\eta(x_t) \to 0$ as $t\to\infty$.

3. $\min_{t=0,1,\ldots,T-1}\ \|G^\eta(x_t)\| \le \frac{\sqrt{2\beta\,(f(x_0)-f(x^*))}}{\sqrt{T}}$.


Theorem 58 (Lemma 3 Ghadimi and Lan (2016)) Suppose that Assumption E.1 holds. Let x+ =
x − ηGη (x). Then,

∇f (x+ ) ∈ NC (x+ ) + (ηβ + 1)B2 ,

where B2 is the unit `2 ball, and NC is the normal cone of the set C.

We now consider the stochastic projected gradient descent algorithm where at each time step t,
we update xt by sampling a random vt such that

xt+1 = PC (xt − ηvt ) , where E[vt |xt ] = ∇f (xt ) (56)

Theorem 59 (Theorem 14.8 and Lemma 14.9, Shalev-Shwartz and Ben-David (2014)) Assume $C=\{x:\|x\|\le B\}$ for some $B>0$. Let $f$ be a convex function and let $x^*\in\arg\min_{x:\|x\|\le B} f(x)$. Assume also that for all $t$, $\|v_t\|\le\rho$, and that stochastic projected gradient descent is run for $N$ iterations with $\eta = \sqrt{\frac{B^2}{\rho^2 N}}$. Then,
$$E\left[f\left(\frac{1}{N}\sum_{t=1}^{N}x_t\right)\right] - f(x^*) \ \le\ \frac{B\rho}{\sqrt{N}}. \qquad (57)$$
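A minimal sketch of the stochastic projected gradient method analyzed in Theorem 59, run on a toy convex quadratic with noisy gradients (our own illustration, not code from the cited book), is:

```python
import numpy as np

rng = np.random.default_rng(2)
d, B, N = 5, 2.0, 5000
x_star = np.array([1.0, -0.5, 0.3, 0.0, 0.2])    # minimizer of f(x) = 0.5 * ||x - x_star||^2

def noisy_grad(x):
    return (x - x_star) + 0.1 * rng.normal(size=d)   # unbiased: E[v_t | x_t] = grad f(x_t)

def project_ball(x, B):
    n = np.linalg.norm(x)
    return x if n <= B else x * (B / n)

rho = B + np.abs(x_star).sum() + 1.0             # crude bound on ||v_t||
eta = np.sqrt(B**2 / (rho**2 * N))               # stepsize from Theorem 59

x = np.zeros(d)
avg = np.zeros(d)
for t in range(N):
    x = project_ball(x - eta * noisy_grad(x), B)
    avg += x / N

print(np.linalg.norm(avg - x_star))              # small; the gap scales as B * rho / sqrt(N)
```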
