On the Theory of Policy Gradient Methods

Alekh Agarwal, Sham Kakade, Jason Lee, and Gaurav Mahajan
Abstract
Policy gradient methods are among the most effective methods in challenging reinforcement learning
problems with large state and/or action spaces. However, little is known about even their most basic
theoretical convergence properties, including: if and how fast they converge to a globally optimal
solution or how they cope with approximation error due to using a restricted class of parametric
policies. This work provides provable characterizations of the computational, approximation, and
sample size properties of policy gradient methods in the context of discounted Markov Decision
Processes (MDPs). We focus on both: “tabular” policy parameterizations, where the optimal policy
is contained in the class and where we show global convergence to the optimal policy; and parametric
policy classes (considering both log-linear and neural policy classes), which may not contain the
optimal policy and where we provide agnostic learning results. One central contribution of this work
is in providing approximation guarantees that are average case — which avoid explicit worst-case
dependencies on the size of state space — by making a formal connection to supervised learning
under distribution shift. This characterization shows an important interplay between estimation
error, approximation error, and exploration (as characterized through a precisely defined condition
number).
Keywords: Policy Gradient, Reinforcement Learning
1. Introduction
Policy gradient methods have a long history in the reinforcement learning (RL) literature (Williams,
1992; Sutton et al., 1999; Konda and Tsitsiklis, 2000; Kakade, 2001) and are an attractive class of
algorithms as they are applicable to any differentiable policy parameterization; admit easy extensions
to function approximation; easily incorporate structured state and action spaces; are easy to implement
in a simulation based, model-free manner. Owing to their flexibility and generality, there has also
been a flurry of improvements and refinements to make these ideas work robustly with deep neural
network based approaches (see e.g. Schulman et al. (2015, 2017)).
Despite the large body of empirical work around these methods, their convergence properties
are only established at a relatively coarse level; in particular, the folklore guarantee is that these
methods converge to a stationary point of the objective, assuming adequate smoothness properties
hold and assuming either exact or unbiased estimates of a gradient can be obtained (with appropriate
regularity conditions on the variance). However, this local convergence viewpoint does not address
some of the most basic theoretical convergence questions, including: 1) if and how fast they converge
to a globally optimal solution (say with a sufficiently rich policy class); 2) how they cope with
approximation error due to using a restricted class of parametric policies; or 3) their finite sample
behavior. These questions are the focus of this work.
Overall, the results of this work place policy gradient methods under a solid theoretical footing,
analogous to the global convergence guarantees of iterative value function based algorithms.
Tabular case: We consider three algorithms: two of which are first order methods, projected
gradient ascent (on the simplex) and gradient ascent (with a softmax policy parameterization); and
the third algorithm, natural policy gradient ascent, can be viewed as a quasi second-order method (or
preconditioned first-order method). Table 1 summarizes our main results in this case: upper bounds
on the number of iterations taken by these algorithms to find an ε-optimal policy, when we have
access to exact policy gradients.
Arguably, the most natural starting point for an analysis of policy gradient methods is to consider
directly doing gradient ascent on the policy simplex itself and then to project back onto the simplex
if the constraint is violated after a gradient update; we refer to this algorithm as projected gradient
ascent on the simplex. Using a notion of gradient domination (Polyak, 1963), our results provably
show that any first-order stationary point of the value function results in an approximately optimal
policy, under certain regularity assumptions; this allows for a global convergence analysis by directly
appealing to standard results in the non-convex optimization literature.
A more practical and commonly used parameterization is the softmax parameterization, where
the simplex constraint is explicitly enforced by the exponential parameterization, thus avoiding
projections. This work provides the first global convergence guarantees using only first-order gradient
information for the widely-used softmax parameterization. Our first result for this parameterization
establishes the asymptotic convergence of the policy gradient algorithm; the analysis challenge here
is that the optimal policy (which is deterministic) is attained by sending the softmax parameters to
infinity.
Policy Gradient + log barrier regularization, softmax parameterization (Cor 13):  O( D∞^2 |S|^2 |A|^2 / ((1−γ)^6 ε^2) )
Table 1: Iteration Complexities with Exact Gradients for the Tabular Case: A summary of the number of iterations required by different algorithms to find a policy π such that V^⋆(s_0) − V^π(s_0) ≤ ε for some fixed s_0, assuming access to exact policy gradients. The first three algorithms optimize the objective E_{s∼µ}[V^π(s)], where µ is the starting state distribution for the algorithms. The MDP has |S| states, |A| actions, and discount factor 0 ≤ γ < 1. The quantity D∞ := max_s d^{π^⋆}_{s_0}(s)/µ(s) is termed the distribution mismatch coefficient, where, roughly speaking, d^{π^⋆}_{s_0}(s) is the fraction of time spent in state s when executing an optimal policy π^⋆, starting from the state s_0 (see (4)). The NPG algorithm directly optimizes V^π(s_0) for any state s_0. In contrast to the complexities of the previous three algorithms, NPG has no dependence on the coefficient D∞, nor does it depend on the choice of s_0. Both the MDP Experts Algorithm (Even-Dar et al., 2009) and the MD-MPI algorithm (Geist et al., 2019) (see Corollary 3 of their paper) also yield guarantees for the same update rule as NPG for the softmax parameterization, though at a worse rate. See Section 2 for further discussion.
In order to establish a finite time convergence rate to optimality for the softmax parameteriza-
tion, we then consider a log barrier regularizer and provide an iteration complexity bound that is
polynomial in all relevant quantities. Our use of the log barrier regularizer is critical to avoiding the
issue of gradients becomingly vanishingly small at suboptimal near-deterministic policies, an issue
of significant practical relevance. The log barrier regularizer can also be viewed as using a relative
entropy regularizer; here, we note the general approach of entropy based regularization is common in
practice (e.g. see (Williams and Peng, 1991; Mnih et al., 2016; Peters et al., 2010; Abdolmaleki
et al., 2018; Ahmed et al., 2019)). One notable distinction, which we discuss later, is that our analysis
is for the log barrier regularization rather than the entropy regularization.
For these aforementioned algorithms, our convergence rates depend on the optimization measure
having coverage over the state space, as measured by the distribution mismatch coefficient D∞ (see
Table 1 caption). In particular, for the convergence rates shown in Table 1 (for the aforementioned
algorithms), we assume that the optimization objective is the expected (discounted) cumulative value
where the initial state is sampled under some distribution, and D∞ is a measure of the coverage of
this initial distribution. Furthermore, we provide a lower bound that shows such a dependence is
unavoidable for first-order methods, even when exact gradients are available.
We then consider the Natural Policy Gradient (NPG) algorithm (Kakade, 2001) (also see Bagnell
and Schneider (2003); Peters and Schaal (2008)), which can be considered a quasi second-order
method due to the use of its particular preconditioner, and provide an iteration complexity to achieve
an ε-optimal policy that is at most 2/((1−γ)^2 ε) iterations, improving upon the previous related results
of (Even-Dar et al., 2009; Geist et al., 2019) (see Section 2). Note the convergence rate has no
dependence on the number of states or the number of actions, nor does it depend on the distribution
mismatch coefficient D∞ . We provide a simple and concise proof for the convergence rate analysis
by extending the approach developed in (Even-Dar et al., 2009), which uses a mirror descent style
of analysis (Nemirovsky and Yudin, 1983; Cesa-Bianchi and Lugosi, 2006) and also handles the
non-concavity of the policy optimization problem.
This fast and dimension free convergence rate shows how the variable preconditioner in the
natural gradient method improves over the standard gradient ascent algorithm. The dimension free
aspect of this convergence rate is worth reflecting on, especially given the widespread use of the
natural policy gradient algorithm along with variants such as the Trust Region Policy Optimization
(TRPO) algorithm (Schulman et al., 2015); our results may help to provide analysis of a more general
family of entropy based algorithms (see for example Neu et al. (2017)).
Function Approximation: We now summarize our results with regards to policy gradient methods
in the setting where we work with a restricted policy class, which may not contain the optimal policy.
In this sense, these methods can be viewed as approximate methods. Table 2 provides a summary
along with the comparisons to some relevant approximate dynamic programming methods.
A long line of work in the function approximation setting focuses on mitigating the worst-case
“`∞ ” guarantees that are inherent to approximate dynamic programming methods (Bertsekas and
Tsitsiklis, 1996) (see the first row in Table 2). The reason to focus on average case guarantees is that
it supports the applicability of supervised machine learning methods to solve the underlying approx-
imation problem. This is because supervised learning methods, like classification and regression,
typically have bounds on the expected error under a distribution, as opposed to worst-case guarantees
over all possible inputs.
The existing literature largely consists of two lines of provable guarantees that attempt to mitigate
the explicit `∞ error conditions of approximate dynamic programming: those methods which utilize
a problem dependent parameter (the concentrability coefficient (Munos, 2005)) to provide more
refined dynamic programming guarantees (e.g. see Munos (2005); Szepesvári and Munos (2005);
Antos et al. (2008); Farahmand et al. (2010)) and those which work with a restricted policy class,
making incremental updates, such as Conservative Policy Iteration (CPI) (Kakade and Langford,
2002; Scherrer and Geist, 2014), Policy Search by Dynamic Programming (PSDP) (Bagnell et al.,
2004), and MD-MPI Geist et al. (2019). Both styles of approaches give guarantees based on
worst-case density ratios, i.e. they depend on a maximum ratio between two different densities over
the state space. As discussed in (Scherrer, 2014), the assumptions in the latter class of algorithms
are substantially weaker, in that the worst-case density ratio only depends on the state visitation
distribution of an optimal policy (also see Table 2 caption and Section 2).
With regards to function approximation, our main contribution is in providing performance
bounds that, in some cases, have milder dependence on these density ratios. We precisely quantify
an approximation/estimation error decomposition relevant for the analysis of the natural gradient
method; this decomposition is stated in terms of the compatible function approximation error as
introduced in Sutton et al. (1999). More generally, we quantify our function approximation results in
terms of a precisely quantified transfer error notion, based on approximation error under distribution
shift. Table 2 shows a special case of our convergence rates of NPG, which is governed by four
quantities: ε_stat, ε_approx, κ, and D∞.
Let us discuss the important special case of log-linear policies (i.e. policies that take the softmax of linear functions in a given feature space), where the relevant quantities are as follows: ε_stat is a bound on the excess risk (the estimation error) in fitting linearly parameterized value functions, which can be driven to 0 with more samples (at the usual statistical rate of O(1/√N), where N is the number of samples); ε_approx is the usual notion of average squared approximation error, where the target function may not be perfectly representable by a linear function; κ can be upper bounded with an inverse dependence on the minimal eigenvalue of the feature covariance matrix of the fitting measure (as such it can be viewed as a dimension dependent quantity, but not necessarily state dependent); and D∞ is as before.
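To make the role of κ concrete, the sketch below (ours, not from the paper) computes a relative eigenvalue condition between two feature covariance matrices, one under a stand-in for a comparator's state-action measure and one under the fitting measure; the feature matrix and weight vectors are synthetic placeholders, and the precise definition of κ used in the analysis appears in Section 6.

```python
import numpy as np

def relative_condition_number(Phi, weights_dstar, weights_nu, reg=1e-8):
    """sup_w (w^T Sigma_dstar w) / (w^T Sigma_nu w), i.e. the largest generalized
    eigenvalue of (Sigma_dstar, Sigma_nu), where each Sigma is a weighted feature
    covariance E[phi phi^T] under the corresponding distribution."""
    Sigma_dstar = Phi.T @ (weights_dstar[:, None] * Phi)
    Sigma_nu = Phi.T @ (weights_nu[:, None] * Phi) + reg * np.eye(Phi.shape[1])
    eigvals = np.linalg.eigvals(np.linalg.solve(Sigma_nu, Sigma_dstar))
    return float(np.max(eigvals.real))

# Synthetic example: 200 state-action pairs with d = 5 features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))               # rows are placeholder features phi_{s,a}
w_dstar = rng.dirichlet(np.ones(200))         # stand-in for the comparator's measure
w_nu = rng.dirichlet(np.ones(200))            # stand-in for the fitting measure
print("kappa ~", relative_condition_number(Phi, w_dstar, w_nu))
```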
For the realizable case, where all policies have values which are linear in the given features (such
as in linear MDP models of (Jin et al., 2019; Yang and Wang, 2019; Jiang et al., 2017)), we have
that the approximation error ε_approx is 0. Here, our guarantees yield a fully polynomial and sample
efficient convergence guarantee, provided the condition number κ is bounded. Importantly, there
always exists a good (universal) initial measure that ensures κ is bounded by a quantity that is only
polynomial in the dimension of the features, d, as opposed to an explicit dependence on the size of
the (infinite) state space (see Remark 22). Such a guarantee would not be implied by algorithms
which depend on the coefficients C∞ or D∞ .1
Our results are also suggestive that a broader class of incremental algorithms — such as
CPI (Kakade and Langford, 2002), PSDP (Bagnell et al., 2004), and MD-MPI Geist et al. (2019)
which make small changes to the policy from one iteration to the next — may also permit a sharper
analysis, where the dependence of worst-case density ratios can be avoided through an appropriate
approximation/estimation decomposition; this is an interesting direction for future work (a point
which we return to in Section 7). One significant advantage of NPG is that the explicit parametric
policy representation in NPG (and other policy gradient methods) leads to a succinct policy represen-
tation in comparison to CPI, PSDP, or related boosting-style methods (Scherrer and Geist, 2014),
where the representation complexity of the policy of the latter class of methods grows linearly in
the number of iterations (since these methods add one policy to the ensemble per iteration). This
representation complexity is likely why the latter class of algorithms are less widely used in practice.
1. Bounding C∞ would require a restriction on the dynamics of the MDP (see Chen and Jiang (2019) and Section 2). Bounding D∞ would require an initial state distribution that is constructed using knowledge of π^⋆, through d^{π^⋆}. In contrast, κ can be made O(d), with an initial state distribution that only depends on the geometry of the features (and does not depend on any other properties of the MDP). See Remark 22.
[Table 2 layout: Algorithm | Suboptimality after T Iterations | Relevant Quantities; see the caption below.]
Table 2: Overview of Approximate Methods: The suboptimality, V^⋆(s_0) − V^π(s_0), after T iterations for various approximate algorithms, which use different notions of approximation error (sample complexities are not directly considered but instead may be thought of as part of ε_1 and ε_stat; see Section 2 for further discussion). Order notation is used to drop constants, and we assume |A| = 2 for ease of exposition. For approximate dynamic programming methods, the relevant error is the worst case, ℓ∞-error in approximating a value function, e.g. ε∞ = max_{s,a} |Q^π(s, a) − Q̂^π(s, a)|, where Q̂^π is what an estimation oracle returns during the course of the algorithm. The second row (see Lemma 12 in Antos et al. (2008)) is a refinement of this approach, where ε_1 is an ℓ_1-average error in fitting the value functions under the fitting (state) distribution µ, and, roughly, C∞ is a worst case density ratio between the state visitation distribution of any non-stationary policy and the fitting distribution µ. For Conservative Policy Iteration, ε_1 is a related ℓ_1-average case fitting error with respect to a fitting distribution µ, and D∞ is defined as before, in the caption of Table 1 (see also (Kakade and Langford, 2002)); here, D∞ ≤ C∞ (e.g. see Scherrer (2014)). For NPG, ε_stat and ε_approx measure the excess risk (the regret) and approximation errors in fitting the values. Roughly speaking, ε_stat is the excess squared loss relative to the best fit (among an appropriately defined parametric class) under our fitting distribution (defined with respect to the state distribution µ). Here, ε_approx is the approximation error: the minimal possible error (in our parametric class) under our fitting distribution. The condition number κ is a relative eigenvalue condition between appropriately defined feature covariances with respect to the state visitation distribution of an optimal policy, d^{π^⋆}_{s_0}, and the state fitting distribution µ. See text for further discussion, and Section 6 for precise statements as well as a more general result not explicitly dependent on D∞.
2. Related Work
We now discuss related work, roughly in the order which reflects our presentation of results in the
previous section.
For the direct policy parameterization in the tabular case, we make use of a gradient domination-
like property, namely any first-order stationary point of the policy value is approximately optimal up
to a distribution mismatch coefficient. A variant of this result also appears in Theorem 2 of Scherrer
and Geist (2014), which itself can be viewed as a generalization of the approach in Kakade and
Langford (2002). In contrast to CPI (Kakade and Langford, 2002) and the more general boosting-
based approach in Scherrer and Geist (2014), we phrase this approach as a Polyak-like gradient
domination property (Polyak, 1963) in order to directly allow for the transfer of any advances in
non-convex optimization to policy optimization in RL. More broadly, it is worth noting the global
convergence of policy gradients for Linear Quadratic Regulators (Fazel et al., 2018) also goes through
a similar proof approach of gradient domination.
Empirically, the recent work of Ahmed et al. (2019) studies entropy based regularization and
shows the value of regularization in policy optimization, even with exact gradients. This is related to
our use of the log barrier regularization.
For our convergence results of the natural policy gradient algorithm in the tabular setting, there
are close connections between our results and the works of Even-Dar et al. (2009); Geist et al. (2019).
Even-Dar et al. (2009) provides provable online regret guarantees in changing MDPs utilizing experts
algorithms (also see Neu et al. (2010); Abbasi-Yadkori et al. (2019a)); as a special case, their MDP
Experts Algorithm is equivalent to the natural policy gradient algorithm with the softmax policy
parameterization. While the convergence result due to Even-Dar et al. (2009) was not specifically
designed for this setting, it is instructive to see what it implies due to the close connections between
optimization and regret (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz et al., 2012). The Mirror
Descent-Modified Policy Iteration (MD-MPI) algorithm (Geist et al., 2019) with negative entropy as
the Bregman divergence results in an algorithm identical to NPG for the softmax parameterization in the tabular case; Corollary 3 of Geist et al. (2019) applies to our updates, leading to a bound that is worse by a 1/(1 − γ) factor and that also has a logarithmic dependence on |A|. Our proof for this case is concise
and may be of independent interest. Also worth noting is the Dynamic Policy Programming of Azar
et al. (2012), which is an actor-critic algorithm with a softmax parameterization; this algorithm, though not identical, comes with guarantees similar in rate to those of the NPG algorithm (weaker by an additional 1/(1 − γ) factor).
We now turn to function approximation, starting with a discussion of iterative algorithms which
make incremental updates in which the next policy is effectively constrained to be close to the
previous policy, such as in CPI and PSDP (Bagnell et al., 2004). Here, the work in Scherrer and Geist (2014) shows how CPI is part of a broader family of boosting-style methods. Also, with regards to PSDP, the work in Scherrer (2014) shows how PSDP actually enjoys an improved iteration complexity over CPI, namely O(log(1/ε_opt)) vs. O(1/ε_opt^2). It is worthwhile to note that both NPG and projected
gradient ascent are also incremental algorithms.
We now discuss the approximate dynamic programming results characterized in terms of the
concentrability coefficient. Broadly we use the term approximate dynamic programming to refer
to fitted value iteration, fitted policy iteration and more generally generalized policy iteration
schemes such as classification-based policy iteration as well, in addition to the classical approximate
value/policy iteration works. While the approximate dynamic programming results typically require
`∞ bounded errors, which is quite stringent, the notion of concentrability (originally due to (Munos,
2003, 2005)) permits sharper bounds in terms of average case function approximation error, provided
that the concentrability coefficient is bounded (e.g. see Munos (2005); Szepesvári and Munos
(2005); Antos et al. (2008); Lazaric et al. (2016)). Chen and Jiang (2019) provide a more detailed
discussion on this quantity. Based on this problem dependent constant being bounded, Munos (2005);
Szepesvári and Munos (2005), Antos et al. (2008) and Lazaric et al. (2016) provide meaningful
sample size and error bounds for approximate dynamic programming methods, where there is a data
collection policy (under which value-function fitting occurs) that induces a concentrability coefficient.
In terms of the concentrability coefficient C∞ and the “distribution mismatch coefficient” D∞ in
Table 2 , we have that D∞ ≤ C∞ , as discussed in (Scherrer, 2014) (also see the table caption). Also,
as discussed in Chen and Jiang (2019), a finite concentrability coefficient is a restriction on the MDP
dynamics itself, while a bounded D∞ does not require any restrictions on the MDP dynamics. The
more refined quantities defined by Farahmand et al. (2010) (for the approximate policy iteration
result) partially alleviate some of these concerns, but their assumptions still implicitly constrain the
MDP dynamics, like the finiteness of the concentrability coefficient.
Assuming bounded concentrability coefficient, there are a notable set of provable average case
guarantees for the MD-MPI algorithm (Geist et al., 2019) (see also (Azar et al., 2012; Scherrer et al.,
2015)), which are stated in terms of various norms of function approximation error. MD-MPI is
a class of algorithms for approximate planning under regularized notions of optimality in MDPs.
Specifically, Geist et al. (2019) analyze a family of actor-critic style algorithms, where there are both
approximate value functions updates and approximate policy updates. As a consequence of utilizing
approximate value function updates for the critic, the guarantees of Geist et al. (2019) are stated with
dependencies on concentrability coefficients.
When dealing with function approximation, computational and statistical complexities are
relevant because they determine the effectiveness of approximate updates with finite samples. With
regards to sample complexity, the work in Szepesvári and Munos (2005); Antos et al. (2008)
provide finite sample rates (as discussed above), further generalized to actor-critic methods in Azar
et al. (2012); Scherrer et al. (2015). In our policy optimization approach, the analysis of both
computational and statistical complexities are straightforward, since we can leverage known statistical
and computational results from the stochastic approximation literature; in particular, we use the
stochastic projected gradient ascent to obtain a simple, linear time method for the critic estimation
step in the natural policy gradient algorithm.
In terms of the algorithmic updates for the function approximation setting, our development
of NPG bears similarity to the natural actor-critic algorithm Peters and Schaal (2008), for which
some asymptotic guarantees under finite concentrability coefficients are obtained in Bhatnagar et al.
(2009). While both updates seek to minimize the compatible function approximation error, we
perform streaming updates based on stochastic optimization using Monte Carlo estimates for values.
In contrast Peters and Schaal (2008) utilize Least Squares Temporal Difference methods (Boyan,
1999) to minimize the loss. As a consequence, their updates additionally make linear approximations
to the value functions in order to estimate the advantages; our approach is flexible in allowing for a wide family of smoothly differentiable policy classes (including neural policies).
Finally, we remark on some concurrent works. The work of Bhandari and Russo (2019) provides
gradient domination-like conditions under which there is (asymptotic) global convergence to the
optimal policy. Their results are applicable to the projected gradient ascent algorithm; they are not
applicable to gradient ascent with the softmax parameterization (see the discussion in Section 5
herein for the analysis challenges). Bhandari and Russo (2019) also provide global convergence
results beyond MDPs. Also, Liu et al. (2019) provide an analysis of the TRPO algorithm (Schulman
et al., 2015) with neural network parameterizations, which bears resemblance to our natural policy
gradient analysis. In particular, Liu et al. (2019) utilize ideas from both Even-Dar et al. (2009)
(with a mirror descent style of analysis) along with Cai et al. (2019) (to handle approximation with
neural networks) to provide conditions under which TRPO returns a near optimal policy. Liu et al.
(2019) do not explicitly consider the case where the policy class is not complete (i.e. when there is approximation). Another related work of Shani et al. (2019) considers the TRPO algorithm and provides theoretical guarantees in the tabular case; their convergence rates with exact updates are O(1/√T) for the (unregularized) objective function of interest; they also provide faster rates on
a modified (regularized) objective function. They do not consider the case of infinite state spaces
and function approximation. The closely related recent papers (Abbasi-Yadkori et al., 2019a,b)
also consider closely related algorithms to the Natural Policy Gradient approach studied here, in an
infinite horizon, average reward setting. Specifically, the EE-POLITEX algorithm is closely related
to the Q-NPG algorithm which we study in Section 6.2, though our approach is in the discounted
setting. We adopt the name Q-NPG to capture its close relationship with the NPG algorithm, with the
main difference being the use of function approximation for the Q-function instead of advantages.
We refer the reader to Section 6.2 (and Remark 25) for more discussion of the technical differences
between the two works.
3. Setting
A (finite) Markov Decision Process (MDP) M = (S, A, P, r, γ, ρ) is specified by: a finite state space
S; a finite action space A; a transition model P where P(s′|s, a) is the probability of transitioning into state s′ upon taking action a in state s; a reward function r : S × A → [0, 1] where r(s, a) is the
immediate reward associated with taking action a in state s; a discount factor γ ∈ [0, 1); a starting
state distribution ρ over S.
A deterministic, stationary policy π : S → A specifies a decision-making strategy in which
the agent chooses actions adaptively based on the current state, i.e., at = π(st ). The agent may
also choose actions according to a stochastic policy π : S → ∆(A) (where ∆(A) is the probability
simplex over A), and, overloading notation, we write at ∼ π(·|st ).
A policy induces a distribution over trajectories τ = (s_t, a_t, r_t)_{t=0}^∞, where s_0 is drawn from the
starting state distribution ρ, and, for all subsequent timesteps t, at ∼ π(·|st ) and st+1 ∼ P (·|st , at ).
The value function V π : S → R is defined as the discounted sum of future rewards starting at state s
and executing π, i.e.

V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s ],
where the expectation is with respect to the randomness of the trajectory τ induced by π in M . Since
we assume that r(s, a) ∈ [0, 1], we have 0 ≤ V^π(s) ≤ 1/(1−γ). We overload notation and define V^π(ρ) as the expected value under the initial state distribution ρ, i.e.

V^π(ρ) := E_{s_0∼ρ}[ V^π(s_0) ].

The action-value function Q^π : S × A → R is defined analogously as Q^π(s, a) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s, a_0 = a ], and the advantage function is A^π(s, a) := Q^π(s, a) − V^π(s).
The goal of the agent is to find a policy π that maximizes the expected value from the initial state,
i.e. the optimization problem the agent seeks to solve is:

max_π V^π(ρ),

where the max is over all policies. The famous theorem of Bellman and Dreyfus (1959) shows there
exists a policy π ? which simultaneously maximizes V π (s0 ), for all states s0 ∈ S.
Policy Parameterizations. This work studies ascent methods for the optimization problem:
max_{θ∈Θ} V^{π_θ}(ρ),

for a class of parametric policies {π_θ | θ ∈ Θ}. We consider both complete and restricted policy classes:

• Direct parameterization: The policies are parameterized by

π_θ(a|s) = θ_{s,a},   (2)

where θ ∈ ∆(A)^{|S|}, i.e. θ is subject to θ_{s,a} ≥ 0 and Σ_{a∈A} θ_{s,a} = 1 for all s ∈ S and a ∈ A.

• Softmax parameterization: For unconstrained θ ∈ R^{|S||A|},

π_θ(a|s) = exp(θ_{s,a}) / Σ_{a′∈A} exp(θ_{s,a′}).   (3)
• Restricted parameterizations: We also study parametric classes {πθ |θ ∈ Θ} that may not
contain all stochastic policies. In particular, we pay close attention to both log-linear policy
classes and neural policy classes (see Section 6). Here, the best we may hope for is an agnostic
result where we do as well as the best policy in this class.
While the softmax parameterization is the more natural parametrization among the two complete
policy classes, it is also informative to consider the direct parameterization.
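As a concrete illustration (ours, not from the paper), the two complete policy classes can be realized as follows for a small tabular problem; the state and action counts are arbitrary.

```python
import numpy as np

def direct_policy(theta):
    """Direct parameterization (2): theta itself lies in the simplex per state,
    i.e. theta[s, a] >= 0 and theta[s].sum() == 1, and pi_theta(a|s) = theta[s, a]."""
    assert np.all(theta >= 0) and np.allclose(theta.sum(axis=1), 1.0)
    return theta

def softmax_policy(theta):
    """Softmax parameterization (3): theta is unconstrained in R^{|S| x |A|}."""
    z = theta - theta.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
pi_direct = direct_policy(rng.dirichlet(np.ones(n_actions), size=n_states))
pi_softmax = softmax_policy(rng.normal(size=(n_states, n_actions)))
print(pi_direct.sum(axis=1), pi_softmax.sum(axis=1))   # rows of both sum to 1
```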
It is worth explicitly noting that V πθ (s) is non-concave in θ for both the direct and the softmax
parameterizations, so the standard tools of convex optimization are not applicable. For completeness,
we formalize this as follows (with a proof in Appendix A, along with an example in Figure 1):
Lemma 1 There is an MDP M (described in Figure 1) such that the optimization problem V πθ (s)
is not concave for both the direct and softmax parameterizations.
[Figure 1: the MDP used in Lemma 1, with states s1, . . . , s5, actions a1, . . . , a4, and a single rewarding transition (r > 0). Figure 2: a chain MDP with states s0, s1, . . . , sH, sH+1.]
Policy gradients. In order to introduce these methods, it is useful to define the discounted state
visitation distribution dπs0 of a policy π as:
d^π_{s_0}(s) := (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s | s_0),   (4)
where Prπ (st = s|s0 ) is the state visitation probability that st = s, after we execute π starting at
state s0 . Again, we overload notation and write:
d^π_ρ(s) = E_{s_0∼ρ}[ d^π_{s_0}(s) ],
where dπρ is the discounted state visitation distribution under initial distribution ρ.
The policy gradient functional form (see e.g. Williams (1992); Sutton et al. (1999)) is then:

∇_θ V^{π_θ}(s_0) = (1/(1−γ)) E_{s∼d^{π_θ}_{s_0}} E_{a∼π_θ(·|s)}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ].   (5)

Equivalently, since E_{a∼π_θ(·|s)}[∇_θ log π_θ(a|s)] = 0, the Q-value may be replaced by the advantage:

∇_θ V^{π_θ}(s_0) = (1/(1−γ)) E_{s∼d^{π_θ}_{s_0}} E_{a∼π_θ(·|s)}[ ∇_θ log π_θ(a|s) A^{π_θ}(s, a) ].   (6)

Note the above gradient expression (Equation 6) does not hold for the direct parameterization, while Equation 5 is valid.²
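For concreteness, the following sketch (ours) implements the standard trajectory-based Monte Carlo estimator of the policy gradient for the tabular softmax parameterization, using the discounted reward-to-go as an unbiased stand-in for Q^{π_θ}(s_t, a_t); the trajectories fed to it at the end are made-up placeholders rather than samples from a real simulator.

```python
import numpy as np

def softmax_policy(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def reinforce_gradient(theta, trajectories, gamma):
    """Monte Carlo estimate of grad_theta V^{pi_theta}(s_0) for tabular softmax.
    Each trajectory is a list of (state, action, reward) tuples generated under
    pi_theta; the discounted reward-to-go G_t estimates Q^{pi_theta}(s_t, a_t)."""
    pi = softmax_policy(theta)
    grad = np.zeros_like(theta)
    for traj in trajectories:
        # discounted reward-to-go: G_t = sum_{k >= t} gamma^{k-t} r_k
        G = np.zeros(len(traj))
        running = 0.0
        for t in reversed(range(len(traj))):
            running = traj[t][2] + gamma * running
            G[t] = running
        for t, (s, a, _) in enumerate(traj):
            # grad_{theta_{s,.}} log pi_theta(a|s) = e_a - pi_theta(.|s) for softmax
            contrib = -gamma**t * G[t] * pi[s]
            contrib[a] += gamma**t * G[t]
            grad[s] += contrib
    return grad / len(trajectories)

# Tiny synthetic check with made-up trajectories (not from a real simulator).
theta = np.zeros((3, 2))
fake_trajs = [[(0, 1, 1.0), (2, 0, 0.0), (1, 1, 0.5)] for _ in range(4)]
print(reinforce_gradient(theta, fake_trajs, gamma=0.9))
```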
The performance difference lemma. The following lemma is helpful throughout:
Lemma 2 (The performance difference lemma (Kakade and Langford, 2002)) For all policies π, π′ and states s_0,

V^π(s_0) − V^{π′}(s_0) = (1/(1−γ)) E_{s∼d^π_{s_0}} E_{a∼π(·|s)}[ A^{π′}(s, a) ].
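The lemma can be checked numerically on a small random MDP. The sketch below (ours) computes V^π, Q^π, A^π and the visitation distribution d^π_{s_0} exactly by solving linear systems and verifies the identity; the transition model, rewards, and policies are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition model
r = rng.uniform(size=(nS, nA))                  # rewards in [0, 1]

def random_policy():
    return rng.dirichlet(np.ones(nA), size=nS)  # pi[s, a]

def value_quantities(pi):
    """Exact V^pi, Q^pi, A^pi via the Bellman equation (I - gamma P_pi) V = r_pi."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    r_pi = np.einsum('sa,sa->s', r, pi)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('sap,p->sa', P, V)
    return V, Q, Q - V[:, None]

def visitation(pi, s0):
    """Discounted state visitation d^pi_{s0} from (4):
    d = (1 - gamma) e_{s0}^T (I - gamma P_pi)^{-1}."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    e = np.zeros(nS); e[s0] = 1.0
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, e)

pi, pi_prime, s0 = random_policy(), random_policy(), 0
V_pi, _, _ = value_quantities(pi)
V_pp, _, A_pp = value_quantities(pi_prime)
d_pi = visitation(pi, s0)
lhs = V_pi[s0] - V_pp[s0]
rhs = np.einsum('s,sa,sa->', d_pi, pi, A_pp) / (1 - gamma)
print(abs(lhs - rhs))   # ~0 up to floating point error
```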
The distribution mismatch coefficient. We often characterize the difficulty of the exploration problem faced by our policy optimization algorithms when maximizing the objective V^π(µ) through the following notion of distribution mismatch coefficient.

Definition 3 (Distribution mismatch coefficient) Given a policy π and distributions ρ, µ ∈ ∆(S), we refer to ‖ d^π_ρ / µ ‖_∞ as the distribution mismatch coefficient of π (given ρ) with respect to µ.

We often instantiate this coefficient with µ as the initial state distribution used in a policy optimization algorithm, ρ as the distribution used to measure the sub-optimality of our policy (this is the start state distribution of interest), and with π above often chosen to be π^⋆ ∈ argmax_{π∈Π} V^π(ρ), for a given policy class Π.
Notation. Following convention, we use V^⋆ and Q^⋆ to denote V^{π^⋆} and Q^{π^⋆}, respectively. For iterative algorithms which obtain policy parameters θ^{(t)} at iteration t, we let π^{(t)}, V^{(t)} and A^{(t)} denote the corresponding quantities parameterized by θ^{(t)}, i.e. π_{θ^{(t)}}, V^{θ^{(t)}} and A^{θ^{(t)}}, respectively. For vectors u and v, we use u/v to denote the componentwise ratio; u ≥ v denotes a componentwise inequality; we use the standard conventions ‖v‖_2 = sqrt(Σ_i v_i^2), ‖v‖_1 = Σ_i |v_i|, and ‖v‖_∞ = max_i |v_i|.
For the direct policy parameterization (2), where π = θ, the policy gradient is

∂V^π(µ) / ∂π(a|s) = (1/(1−γ)) d^π_µ(s) Q^π(s, a),   (7)

using (5). In particular, for this parameterization, we may write ∇_π V^π(µ) instead of ∇_θ V^{π_θ}(µ).
Informally, we say a function f(θ) satisfies a gradient domination property if, for all θ ∈ Θ,

f(θ^⋆) − f(θ) = O(G(θ)),

where θ^⋆ ∈ argmax_{θ′∈Θ} f(θ′) and where G(θ) is some suitable scalar notion of first-order stationarity, which can be considered a measure of how large the gradient is (see (Karimi et al., 2016; Bolte et al., 2007; Attouch et al., 2010)). Thus if one can find a θ that is (approximately) a first-
order stationary point, then the parameter θ will be near optimal (in terms of function value). Such
conditions are a standard device to establishing global convergence in non-convex optimization, as
they effectively rule out the presence of bad critical points. In other words, given such a condition,
quantifying the convergence rate for a specific algorithm, like say projected gradient ascent, will
require quantifying the rate of its convergence to a first-order stationary point, for which one can
invoke standard results from the optimization literature.
The following lemma shows that the direct policy parameterization satisfies a notion of gradient
domination. This is the basic approach used in the analysis of CPI (Kakade and Langford, 2002); a
variant of this lemma also appears in Scherrer and Geist (2014). We give a proof for completeness.
Even though we are interested in the value V π (ρ), it is helpful to consider the gradient with
respect to another state distribution µ ∈ ∆(S).
Lemma 4 (Gradient domination) For the direct policy parameterization (as in (2)), for all state distributions µ, ρ ∈ ∆(S), we have

V^⋆(ρ) − V^π(ρ) ≤ ‖ d^{π^⋆}_ρ / d^π_µ ‖_∞ max_{π̄} (π̄ − π)^⊤ ∇_π V^π(µ)
              ≤ (1/(1−γ)) ‖ d^{π^⋆}_ρ / µ ‖_∞ max_{π̄} (π̄ − π)^⊤ ∇_π V^π(µ),

where the max is over the set of all policies, i.e. π̄ ∈ ∆(A)^{|S|}.
Before we provide the proof, a few comments are in order with regards to the performance
measure ρ and the optimization measure µ. Subtly, note that although the gradient is with respect
to V π (µ), the final guarantee applies to all distributions ρ. The significance is that even though we
may be interested in our performance under ρ, it may be helpful to optimize under the distribution
µ. To see this, note the lemma shows that a sufficiently small gradient magnitude in the feasible
directions implies the policy is nearly optimal in terms of its value, but only if the state distribution
of π, i.e. dπµ , adequately covers the state distribution of some optimal policy π ? . Here, it is also worth
recalling the theorem of Bellman and Dreyfus (1959) which shows there exists a single policy π ?
that is simultaneously optimal for all starting states s0 . Note that the hardness of the exploration
problem is captured through the distribution mismatch coefficient (Definition 3).
where the last inequality follows since max_ā A^π(s, ā) ≥ 0 for all states s and policies π. We wish to upper bound (8). We then have:

Σ_s (d^π_µ(s)/(1−γ)) max_ā A^π(s, ā)
  = max_{π̄∈∆(A)^{|S|}} Σ_{s,a} (d^π_µ(s)/(1−γ)) π̄(a|s) A^π(s, a)
  = max_{π̄∈∆(A)^{|S|}} Σ_{s,a} (d^π_µ(s)/(1−γ)) (π̄(a|s) − π(a|s)) A^π(s, a)
  = max_{π̄∈∆(A)^{|S|}} Σ_{s,a} (d^π_µ(s)/(1−γ)) (π̄(a|s) − π(a|s)) Q^π(s, a)
  = max_{π̄∈∆(A)^{|S|}} (π̄ − π)^⊤ ∇_π V^π(µ),

where the last step follows due to max_{π̄∈∆(A)^{|S|}} (π̄ − π)^⊤ ∇_π V^π(µ) ≥ 0 for any policy π and d^π_µ(s) ≥ (1 − γ)µ(s) (see (4)).
In a sense, the use of an appropriate µ circumvents the issues of strategic exploration. It is natural
to ask whether this additional term is necessary, a question which we return to. First, we provide a
convergence rate for the projected gradient ascent algorithm.
The projected gradient ascent algorithm updates

π^{(t+1)} = P_{∆(A)^{|S|}}( π^{(t)} + η ∇_π V^{(t)}(µ) ),   (9)

where P_{∆(A)^{|S|}} denotes the Euclidean projection onto ∆(A)^{|S|}.

Theorem 5 The projected gradient ascent algorithm (9) on V^π(µ) with stepsize η = (1−γ)^3/(2γ|A|) satisfies, for all distributions ρ ∈ ∆(S),

min_{t<T} { V^⋆(ρ) − V^{(t)}(ρ) } ≤ ε   whenever   T > (64 γ |S| |A| / ((1−γ)^6 ε^2)) ‖ d^{π^⋆}_ρ / µ ‖^2_∞.
A proof is provided in Appendix B.1. The proof first invokes a standard iteration complexity result
of projected gradient ascent to show that the gradient magnitude with respect to all feasible directions
is small. More concretely, we show the policy is ε-stationary³, that is, for all π_θ + δ ∈ ∆(A)^{|S|} with ‖δ‖_2 ≤ 1, we have δ^⊤ ∇_π V^{π_θ}(µ) ≤ ε. We then use Lemma 4 to complete the proof.
Note that the guarantee we provide is for the best policy found over the T rounds, which we
obtain from a bound on the average norm of the gradients. This type of guarantee is standard in the
non-convex optimization literature, where an average regret bound cannot be used to extract a single
good solution, e.g. by averaging. In the context of policy optimization, this is not a serious limitation
as we collect on-policy trajectories for each policy in doing sample-based gradient estimation, and
these samples can be also used to estimate the policy’s value. Note that the evaluation step is not
required for every policy, and can also happen on a schedule, though we still need to evaluate O(T )
policies to obtain the convergence rates described here.
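A sketch (ours) of update (9): each row of the policy table takes a gradient step and is then projected back onto the simplex in the Euclidean norm. The gradient callback is left abstract; in the tabular case it can be computed exactly via (7).

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex
    (the standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def projected_gradient_ascent(pi0, grad_fn, eta, iters):
    """Update (9): pi <- Proj_{Delta(A)^{|S|}}(pi + eta * grad V^pi(mu)).
    grad_fn(pi) should return dV^pi(mu)/dpi as an |S| x |A| array, e.g. computed
    exactly via (7) as (1/(1-gamma)) d^pi_mu(s) Q^pi(s, a)."""
    pi = pi0.copy()
    for _ in range(iters):
        step = pi + eta * grad_fn(pi)
        pi = np.vstack([project_to_simplex(row) for row in step])
    return pi

# Minimal usage with a dummy gradient (replace with an exact or estimated one).
pi0 = np.full((4, 3), 1.0 / 3.0)
dummy_grad = lambda pi: np.zeros_like(pi)
print(projected_gradient_ascent(pi0, dummy_grad, eta=0.1, iters=5))
```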
Proposition 6 (Vanishing gradients at suboptimal parameters) Consider the chain MDP of Figure 2, with H + 2 states, γ = H/(H + 1), and with the direct policy parameterization (with 3|S| parameters, as described in the text above). Suppose θ is such that 0 < θ < 1 (componentwise) and θ_{s,a1} < 1/4 (for all states s). For all k ≤ H/(40 log(2H)) − 1, we have ‖∇^k_θ V^{π_θ}(s_0)‖ ≤ (1/3)^{H/4}, where ∇^k_θ V^{π_θ}(s_0) is a tensor of the k-th order derivatives of V^{π_θ}(s_0) and the norm is the operator norm of the tensor.⁴ Furthermore, V^⋆(s_0) − V^{π_θ}(s_0) ≥ (H + 1)/8 − (H + 1)^2/3^H.
This lemma also suggests that results in the non-convex optimization literature on escaping from saddle points, e.g. (Nesterov and Polyak, 2006; Ge et al., 2015; Jin et al., 2017), do not directly imply global convergence, because the higher-order derivatives are small.
Remark 7 (Exact vs. Approximate Gradients) The chain MDP of Figure 2, is a common example
where sample based estimates of gradients will be 0 under random exploration strategies; there is an
exponentially small in H chance of hitting the goal state under a random exploration strategy. Note
that this lemma is with regards to exact gradients. This suggests that even with exact computations
(along with using exact higher order derivatives) we might expect numerical instabilities.
Remark 8 (Comparison with the upper bound) The lower bound does not contradict the upper bound of Lemma 4 (where a small gradient is turned into a small policy suboptimality bound), as the distribution mismatch coefficient, as defined in Definition 3, could be infinite in the chain MDP of Figure 2, since the start-state distribution is concentrated on one state only. More generally, for any policy with θ_{s,a1} < 1/4 in all states s, ‖ d^{π^⋆}_ρ / d^{π_θ}_ρ ‖_∞ = Ω(4^H).
Remark 9 (Comparison with information-theoretic lower bounds) The lower bound here is not
information theoretic, in that it does not present a hard problem instance for all algorithms. Indeed,
exploration algorithms for tabular MDPs starting from E³ (Kearns and Singh, 2002), RMAX (Brafman and Tennenholtz, 2003) and several subsequent works yield polynomial sample complexities
for the chain MDP. Proposition 6 should be interpreted as a hardness result for the specific class
of policy gradient like approaches that search for a policy with a small policy gradient, as these
methods will find the initial parameters to be valid in terms of the size of (several orders of) gradients.
In particular, it precludes any meaningful claims on global optimality, based just on the size of the
policy gradients, without additional assumptions as discussed in the previous remark.
The proof is provided in Appendix B.2. The lemma illustrates that lack of good exploration can
indeed be detrimental in policy gradient algorithms, since the gradient can be small either due to π
being near-optimal, or, simply because π does not visit advantageous states often enough. In this
sense, it also demonstrates the necessity of the distribution mismatch coefficient in Lemma 4.
The softmax parameterization of policies is preferable to the direct parameterization, since the parameters θ are
unconstrained and standard unconstrained optimization algorithms can be employed. However,
optimization over this policy class creates other challenges as we study in this section, as the optimal
policy (which is deterministic) is attained by sending the parameters to infinity.
We study three algorithms for this problem. The first performs direct policy gradient ascent
on the objective without modification, while the second adds a log barrier regularizer to keep the
parameters from becoming too large, as a means to ensure adequate exploration. Finally, we study
the natural policy gradient algorithm and establish a global optimality result with no dependence on
the distribution mismatch coefficient or dimension-dependent factors.
For the softmax parameterization, the gradient takes the form:

∂V^{π_θ}(µ) / ∂θ_{s,a} = (1/(1−γ)) d^{π_θ}_µ(s) π_θ(a|s) A^{π_θ}(s, a).   (10)

The gradient ascent update rule is then:

θ^{(t+1)} = θ^{(t)} + η ∇_θ V^{(t)}(µ).   (11)
Theorem 10 (Global convergence for softmax parameterization) Assume we follow the gradient ascent update rule as specified in Equation (11) and that the distribution µ is strictly positive, i.e. µ(s) > 0 for all states s. Suppose η ≤ (1−γ)^3/8. Then we have that, for all states s, V^{(t)}(s) → V^⋆(s) as t → ∞.
Remark 11 (Strict positivity of µ and exploration) Theorem 10 assumed that the optimization distribution µ is strictly positive, i.e. µ(s) > 0 for all states s. We leave it as an open question whether gradient ascent will globally converge if this condition is not met. The concern is that, if this condition is not met, gradient ascent may not globally converge because d^{π_θ}_µ(s) effectively scales down the learning rate for the parameters associated with state s (see (10)).
The complete proof is provided in the Appendix C.1. We now discuss the subtleties in the proof
and show why the softmax parameterization precludes a direct application of the gradient domination
lemma. In order to utilize the gradient domination property (in Lemma 4), we would desire to show
that: ∇π V π (µ) → 0. However, using the functional form of the softmax parameterization (see
Lemma 40) and (7), we have that:

∂V^{π_θ}(µ) / ∂θ_{s,a} = (1/(1−γ)) d^{π_θ}_µ(s) π_θ(a|s) A^{π_θ}(s, a),

so each component of ∇_θ V^{π_θ}(µ) carries an extra factor of π_θ(a|s) (and of d^{π_θ}_µ(s)) relative to the corresponding component of ∇_π V^{π_θ}(µ) in (7). Hence, we see that even if ∇_θ V^{π_θ}(µ) → 0, we are not guaranteed that ∇_π V^{π_θ}(µ) → 0.
We now briefly discuss the main technical challenges in the proof. The proof first shows that the sequence V^{(t)}(s) is monotonically increasing pointwise, i.e. for every state s, V^{(t+1)}(s) ≥ V^{(t)}(s) (Lemma 41). This implies the existence of a limit V^{(∞)}(s) by the monotone convergence theorem (Lemma 42). Based on the limiting quantities V^{(∞)}(s) and Q^{(∞)}(s, a), which we show exist, define the following limiting sets for each state s:

I^s_+ := { a ∈ A : Q^{(∞)}(s, a) > V^{(∞)}(s) },  I^s_0 := { a ∈ A : Q^{(∞)}(s, a) = V^{(∞)}(s) },  I^s_− := { a ∈ A : Q^{(∞)}(s, a) < V^{(∞)}(s) }.

The challenge is to then show that, for all states s, the set I^s_+ is the empty set, which would immediately imply V^{(∞)}(s) = V^⋆(s). The proof proceeds by contradiction, assuming that I^s_+ is non-empty. Using that I^s_+ is non-empty and that the gradient tends to zero in the limit, i.e. ∇_θ V^{π_θ}(µ) → 0, we have that, for all a ∈ I^s_+, π^{(t)}(a|s) → 0 (see (10)). This, along with the functional form of the softmax parameterization, implies that there must be divergence (in magnitude) among the set of parameters associated with some action a at state s, i.e. that max_{a∈A} |θ^{(t)}_{s,a}| → ∞. The primary technical challenge in the proof is to then use this divergence, along with the dynamics of gradient ascent, to show that I^s_+ is empty via a contradiction.
We leave characterizing the convergence rate as a question for future work; we conjecture it to be exponentially slow in some of the relevant quantities, such as in terms of the size of the
state space. Here, we turn to a regularization based approach to ensure convergence at a polynomial
rate in all relevant quantities.
The log barrier regularized objective is:

L_λ(θ) := V^{π_θ}(µ) − λ E_{s∼Unif_S}[ KL(Unif_A, π_θ(·|s)) ]
        = V^{π_θ}(µ) + (λ/(|S| |A|)) Σ_{s,a} log π_θ(a|s) + λ log |A|,   (12)
where λ is a regularization parameter. The constant (i.e. the last term) is not relevant with regards to
optimization. This regularizer is different from the more commonly utilized entropy regularizer as in
Mnih et al. (2016), a point which we return to in Remark 14.
The policy gradient ascent updates for L_λ(θ) are given by:

θ^{(t+1)} = θ^{(t)} + η ∇_θ L_λ(θ^{(t)}).   (13)
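As a sketch (ours) of update (13), the gradient of L_λ is the usual policy gradient plus the log barrier term (λ/|S|)(1/|A| − π_θ(a|s)) that appears in (14) below; the policy-gradient callback is left abstract.

```python
import numpy as np

def softmax_policy(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_barrier_gradient(theta, policy_grad_fn, lam):
    """Gradient of the regularized objective L_lambda in (12).
    policy_grad_fn(theta) should return grad_theta V^{pi_theta}(mu) (an |S| x |A|
    array, exact or estimated); the barrier term is lambda/|S| * (1/|A| - pi_theta(a|s))."""
    pi = softmax_policy(theta)
    n_states, n_actions = theta.shape
    barrier = (lam / n_states) * (1.0 / n_actions - pi)
    return policy_grad_fn(theta) + barrier

def log_barrier_ascent(theta0, policy_grad_fn, lam, eta, iters):
    """Update (13): theta <- theta + eta * grad L_lambda(theta)."""
    theta = theta0.copy()
    for _ in range(iters):
        theta = theta + eta * log_barrier_gradient(theta, policy_grad_fn, lam)
    return theta

# Minimal usage with a dummy policy gradient (replace with an exact/estimated one).
theta = log_barrier_ascent(np.zeros((4, 3)), lambda th: np.zeros_like(th),
                           lam=0.1, eta=0.5, iters=10)
print(softmax_policy(theta))   # with no value gradient, the barrier keeps the policy uniform
```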
Our next theorem shows that approximate first-order stationary points of the entropy-regularized
objective are approximately globally optimal, provided the regularization is sufficiently small.
Theorem 12 Suppose θ is such that ‖∇_θ L_λ(θ)‖_2 ≤ ε_opt and ε_opt ≤ λ/(2|S| |A|). Then we have that for all starting state distributions ρ:

V^{π_θ}(ρ) ≥ V^⋆(ρ) − (2λ/(1−γ)) ‖ d^{π^⋆}_ρ / µ ‖_∞.
Proof The proof consists of showing that max_a A^{π_θ}(s, a) ≤ 2λ/(µ(s)|S|) for all states. To see that this is sufficient, observe that by the performance difference lemma (Lemma 2),

V^⋆(ρ) − V^{π_θ}(ρ) = (1/(1−γ)) Σ_{s,a} d^{π^⋆}_ρ(s) π^⋆(a|s) A^{π_θ}(s, a)
  ≤ (1/(1−γ)) Σ_s d^{π^⋆}_ρ(s) max_{a∈A} A^{π_θ}(s, a)
  ≤ (1/(1−γ)) Σ_s 2 d^{π^⋆}_ρ(s) λ/(µ(s)|S|)
  ≤ (2λ/(1−γ)) max_s ( d^{π^⋆}_ρ(s)/µ(s) ).

We now bound max_a A^{π_θ}(s, a). Consider an (s, a) pair such that A^{π_θ}(s, a) > 0. Using the policy gradient expression for the softmax parameterization (see Lemma 40),

∂L_λ(θ)/∂θ_{s,a} = (1/(1−γ)) d^{π_θ}_µ(s) π_θ(a|s) A^{π_θ}(s, a) + (λ/|S|) ( 1/|A| − π_θ(a|s) ).   (14)

Since ‖∇_θ L_λ(θ)‖_2 ≤ ε_opt, the above yields ε_opt ≥ (λ/|S|)( 1/|A| − π_θ(a|s) ), where we have used A^{π_θ}(s, a) ≥ 0. Rearranging and using our assumption ε_opt ≤ λ/(2|S| |A|),

π_θ(a|s) ≥ 1/|A| − ε_opt |S|/λ ≥ 1/(2|A|).
By combining the above theorem with standard results on the convergence of gradient ascent (to
first order stationary points), we obtain the following corollary.
Corollary 13 (Iteration complexity with log barrier regularization) Let β_λ := 8γ/(1−γ)^3 + 2λ/|S|. Starting from any initial θ^{(0)}, consider the updates (13) with λ = ε(1−γ) / (2 ‖ d^{π^⋆}_ρ / µ ‖_∞) and η = 1/β_λ. Then for all starting state distributions ρ, we have

min_{t<T} { V^⋆(ρ) − V^{(t)}(ρ) } ≤ ε   whenever   T ≥ (320 |S|^2 |A|^2 / ((1−γ)^6 ε^2)) ‖ d^{π^⋆}_ρ / µ ‖^2_∞.
See Appendix C.2 for the proof. The corollary shows the importance of balancing how the
regularization parameter λ is set relative to the desired accuracy , as well as the importance of the
initial distribution µ to obtain global optimality.
Remark 14 (Entropy vs. log barrier regularization) The more commonly considered regularizer
is the entropy (Mnih et al., 2016) (also see Ahmed et al. (2019) for a more detailed empirical
investigation), where the regularizer would be:

(1/|S|) Σ_s H(π_θ(·|s)) = (1/|S|) Σ_s Σ_a −π_θ(a|s) log π_θ(a|s).
Note the entropy is far less aggressive in penalizing small probabilities, in comparison to the log
barrier, which is equivalent to the relative entropy. In particular, the entropy regularizer is always
bounded between 0 and log |A|, while the relative entropy (against the uniform distribution over
actions), is bounded between 0 and infinity, where it tends to infinity as probabilities tend to 0. We
leave it as an open question whether a polynomial convergence rate⁵ is achievable with the more common entropy regularizer; our polynomial convergence rate using the KL regularizer crucially relies on
the aggressive nature in which the relative entropy prevents small probabilities (the proof shows that
any action, with a positive advantage, has a significant probability for any near-stationary policy of
the regularized objective).
5. Here, ideally we would like the rate to be polynomial in |S|, |A|, 1/(1 − γ), 1/ε, and the distribution mismatch coefficient, which we conjecture may not be possible.
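A small numeric illustration (ours) of the point made in Remark 14: as an action probability shrinks to 0, the entropy of the action distribution stays bounded by log |A|, while the relative entropy to the uniform distribution (the log barrier penalty) diverges.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl_uniform_to(p, eps=1e-300):
    """KL(Unif_A, p): the relative-entropy (log barrier) penalty discussed in Remark 14."""
    n = len(p)
    return float(np.sum((1.0 / n) * (np.log(1.0 / n) - np.log(p + eps))))

for q in [1e-1, 1e-3, 1e-6, 1e-12]:
    p = np.array([q, 1.0 - q])   # two actions, one probability shrinking toward 0
    print(f"q={q:.0e}  entropy={entropy(p):.4f}  KL(unif, p)={kl_uniform_to(p):.2f}")
```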
The natural policy gradient (NPG) algorithm performs gradient updates preconditioned by the Fisher information matrix F_ρ(θ) := E_{s∼d^{π_θ}_ρ} E_{a∼π_θ(·|s)}[ ∇_θ log π_θ(a|s) (∇_θ log π_θ(a|s))^⊤ ], i.e.

θ^{(t+1)} = θ^{(t)} + η F_ρ(θ^{(t)})^† ∇_θ V^{(t)}(ρ),   (15)

where M^† denotes the Moore-Penrose pseudoinverse of the matrix M. Throughout this section, we restrict to using the initial state distribution ρ ∈ ∆(S) in our update rule in (15) (so our optimization measure µ and the performance measure ρ are identical). Also, we restrict attention to states s ∈ S reachable from ρ, since, without loss of generality, we can exclude states that are not reachable under this start state distribution.⁶
We leverage a particularly convenient form that the update takes for the softmax parameterization (see Kakade (2001)). For completeness, we provide a proof in Appendix C.3.

Lemma 15 (NPG as soft policy iteration) For the softmax parameterization (3), the NPG updates (15) take the form:

θ^{(t+1)} = θ^{(t)} + (η/(1−γ)) A^{(t)}   and   π^{(t+1)}(a|s) = π^{(t)}(a|s) exp( η A^{(t)}(s, a)/(1−γ) ) / Z_t(s),

where Z_t(s) = Σ_{a∈A} π^{(t)}(a|s) exp( η A^{(t)}(s, a)/(1−γ) ).
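A sketch (ours) of this soft policy iteration update in the tabular case; the advantage callback is left abstract (exact advantages can be computed as in the earlier tabular sketch), and the fixed table used in the demo is a made-up stand-in rather than a true advantage function.

```python
import numpy as np

def npg_soft_policy_iteration(pi0, advantage_fn, eta, gamma, iters):
    """Lemma 15: pi^{(t+1)}(a|s) propto pi^{(t)}(a|s) * exp(eta * A^{(t)}(s,a) / (1-gamma)).
    advantage_fn(pi) should return the |S| x |A| advantage table A^pi."""
    pi = pi0.copy()
    for _ in range(iters):
        A = advantage_fn(pi)
        logits = np.log(pi) + eta * A / (1.0 - gamma)
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)             # the normalizer Z_t(s)
    return pi

# Minimal usage with a fixed, made-up table standing in for the advantages.
pi0 = np.full((3, 2), 0.5)
fixed_A = np.array([[0.2, -0.2], [0.0, 0.0], [-0.1, 0.1]])
print(npg_soft_policy_iteration(pi0, lambda pi: fixed_A, eta=1.0, gamma=0.9, iters=20))
```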
The updates take a strikingly simple form in this special case; they are identical to the classical
multiplicative weights updates (Freund and Schapire, 1997; Cesa-Bianchi and Lugosi, 2006) for
online linear optimization over the probability simplex, where the linear functions are specified by
the advantage function of the current policy at each iteration. Notably, there is no dependence on the
state distribution d^{(t)}_ρ, since the pseudoinverse of the Fisher information cancels out the effect of the
state distribution in NPG. We now provide a dimension free convergence rate of this algorithm.
Theorem 16 (Global convergence for NPG) Suppose we run the NPG updates (15) using ρ ∈ ∆(S) and with θ^{(0)} = 0. Fix η > 0. For all T > 0, we have:

V^{(T)}(ρ) ≥ V^⋆(ρ) − log |A|/(ηT) − 1/((1−γ)^2 T).

In particular, setting η ≥ (1−γ)^2 log |A|, we see that NPG finds an ε-optimal policy in a number of iterations that is at most:

T ≤ 2/((1−γ)^2 ε),

which has no dependence on the number of states or actions, despite the non-concavity of the underlying optimization problem.
6. Specifically, we restrict the MDP to the set of states {s ∈ S : ∃π such that dπρ (s) > 0}.
The proof strategy we take borrows ideas from the online regret framework in changing MDPs
(in (Even-Dar et al., 2009)); here, we provide a faster rate of convergence than the analysis implied
by Even-Dar et al. (2009) or by Geist et al. (2019). We also note that while this proof is obtained for
the NPG updates, it is known in the literature that in the limit of small stepsizes, NPG and TRPO
updates are closely related (e.g. see Schulman et al. (2015); Neu et al. (2017); Rajeswaran et al.
(2017)).
First, the following improvement lemma is helpful:
Lemma 17 (Improvement lower bound for NPG) For the iterates π^{(t)} generated by the NPG updates (15), we have, for all starting state distributions µ,

V^{(t+1)}(µ) − V^{(t)}(µ) ≥ ((1−γ)/η) E_{s∼µ}[ log Z_t(s) ] ≥ 0.

Proof First, let us show that log Z_t(s) ≥ 0. To see this, observe:

log Z_t(s) = log Σ_a π^{(t)}(a|s) exp( η A^{(t)}(s, a)/(1−γ) )
  ≥ Σ_a π^{(t)}(a|s) log exp( η A^{(t)}(s, a)/(1−γ) ) = (η/(1−γ)) Σ_a π^{(t)}(a|s) A^{(t)}(s, a) = 0,

where the inequality follows by Jensen's inequality on the concave function log x and the final equality uses Σ_a π^{(t)}(a|s) A^{(t)}(s, a) = 0. Using d^{(t+1)} as shorthand for d^{(t+1)}_µ, the performance difference lemma implies:

V^{(t+1)}(µ) − V^{(t)}(µ) = (1/(1−γ)) E_{s∼d^{(t+1)}} Σ_a π^{(t+1)}(a|s) A^{(t)}(s, a)
  = (1/η) E_{s∼d^{(t+1)}} Σ_a π^{(t+1)}(a|s) log( π^{(t+1)}(a|s) Z_t(s) / π^{(t)}(a|s) )
  = (1/η) E_{s∼d^{(t+1)}}[ KL(π^{(t+1)}_s || π^{(t)}_s) ] + (1/η) E_{s∼d^{(t+1)}}[ log Z_t(s) ]
  ≥ (1/η) E_{s∼d^{(t+1)}}[ log Z_t(s) ] ≥ ((1−γ)/η) E_{s∼µ}[ log Z_t(s) ],

where the last step uses that d^{(t+1)} = d^{(t+1)}_µ ≥ (1 − γ)µ, componentwise (by (4)), and that log Z_t(s) ≥ 0.
Proof [of Theorem 16] Since ρ is fixed, we use d^⋆ as shorthand for d^{π^⋆}_ρ; we also use π_s as shorthand for the vector of π(·|s). By the performance difference lemma (Lemma 2),

V^{π^⋆}(ρ) − V^{(t)}(ρ) = (1/(1−γ)) E_{s∼d^⋆} Σ_a π^⋆(a|s) A^{(t)}(s, a)
  = (1/η) E_{s∼d^⋆} Σ_a π^⋆(a|s) log( π^{(t+1)}(a|s) Z_t(s) / π^{(t)}(a|s) )
  = (1/η) E_{s∼d^⋆}[ KL(π^⋆_s || π^{(t)}_s) − KL(π^⋆_s || π^{(t+1)}_s) + Σ_a π^⋆(a|s) log Z_t(s) ]
  = (1/η) E_{s∼d^⋆}[ KL(π^⋆_s || π^{(t)}_s) − KL(π^⋆_s || π^{(t+1)}_s) + log Z_t(s) ],

where we have used the closed form of our updates from Lemma 15 in the second step.

By applying Lemma 17 with d^⋆ as the starting state distribution, we have:

(1/η) E_{s∼d^⋆}[ log Z_t(s) ] ≤ (1/(1−γ)) ( V^{(t+1)}(d^⋆) − V^{(t)}(d^⋆) ),

which gives us a bound on E_{s∼d^⋆}[ log Z_t(s) ].

Using the above equation and that V^{(t+1)}(ρ) ≥ V^{(t)}(ρ) (as V^{(t+1)}(s) ≥ V^{(t)}(s) for all states s by Lemma 17), we have:

V^{π^⋆}(ρ) − V^{(T−1)}(ρ) ≤ (1/T) Σ_{t=0}^{T−1} ( V^{π^⋆}(ρ) − V^{(t)}(ρ) )
  ≤ (1/(ηT)) Σ_{t=0}^{T−1} E_{s∼d^⋆}[ KL(π^⋆_s || π^{(t)}_s) − KL(π^⋆_s || π^{(t+1)}_s) ] + (1/(ηT)) Σ_{t=0}^{T−1} E_{s∼d^⋆}[ log Z_t(s) ]
  ≤ E_{s∼d^⋆}[ KL(π^⋆_s || π^{(0)}_s) ]/(ηT) + (1/((1−γ)T)) Σ_{t=0}^{T−1} ( V^{(t+1)}(d^⋆) − V^{(t)}(d^⋆) )
  = E_{s∼d^⋆}[ KL(π^⋆_s || π^{(0)}_s) ]/(ηT) + ( V^{(T)}(d^⋆) − V^{(0)}(d^⋆) )/((1−γ)T)
  ≤ log |A|/(ηT) + 1/((1−γ)^2 T).
We now consider policy classes of the form:

Π = { π_θ | θ ∈ R^d },

where Π may not contain all stochastic policies (and it may not even contain an optimal policy). In contrast with the tabular results in the previous sections, the policy classes that we are often interested
in are not fully expressive, e.g. d ≪ |S||A| (indeed |S| or |A| need not even be finite for the results
in this section); in this sense, we are in the regime of function approximation.
We focus on obtaining agnostic results, where we seek to do as well as the best policy in this
class (or as well as some other comparator policy). While we are interested in a solution to the
(unconstrained) policy optimization problem
max V πθ (ρ),
θ∈Rd
(for a given initial distribution ρ), we will see that optimization with respect to a different distribution will be helpful, just as in the tabular case.

We will consider variants of the NPG update rule (15):

θ ← θ + η F_ρ(θ)^† ∇_θ V^θ(ρ).   (16)
Our analysis will leverage a close connection between the NPG update rule (15) and the notion of compatible function approximation (Sutton et al., 1999), as formalized in Kakade (2001). Specifically,
it can be easily seen that:

F_ρ(θ)^† ∇_θ V^θ(ρ) = (1/(1−γ)) w^⋆,   (17)

where w^⋆ is a minimizer of the following regression problem:

w^⋆ ∈ argmin_w E_{s∼d^{π_θ}_ρ, a∼π_θ(·|s)}[ ( w^⊤ ∇_θ log π_θ(a|s) − A^{π_θ}(s, a) )^2 ].
The above is a straightforward consequence of the first order optimality conditions (see (50)).
The above regression problem can be viewed as “compatible” function approximation: we are approximating A^{π_θ}(s, a) using ∇_θ log π_θ(a|s) as features. We also consider a variant of the above
update rule, Q-NPG, where instead of using advantages in the above regression we use the Q-values.
This viewpoint provides a methodology for approximate updates, where we can solve the relevant
regression problems with samples. Our main results establish the effectiveness of NPG updates
where there is error both due to statistical estimation (where we may not use exact gradients) and
approximation (due to using a parameterized function class); in particular, we provide a novel
estimation/approximation decomposition relevant for the NPG algorithm. For these algorithms, we
will first consider log linear policies classes (as a special case) and then move on to more general
policy classes (such as neural policy classes). Finally, it is worth remarking that the results herein
provide one of the first provable approximation guarantees where the error conditions required do
not have explicit worst case dependencies over the state space.
In this section and the next, we consider policy classes of the form:

π_θ(a|s) = exp( f_θ(s, a) ) / Σ_{a′∈A} exp( f_θ(s, a′) ),

where f_θ is a differentiable function. For example, the tabular softmax policy class is one where f_θ(s, a) = θ_{s,a}. Typically, f_θ is either a linear function or a neural network. Let us consider the NPG algorithm, and a variant Q-NPG, in each of these two cases.
For the log-linear policy class, the policy is parameterized by a feature map φ_{s,a} ∈ R^d:
    π_θ(a|s) = exp(θ · φ_{s,a}) / Σ_{a'∈A} exp(θ · φ_{s,a'}).
For this class, the NPG update takes the form
    NPG: θ ← θ + η w⋆,   w⋆ ∈ argmin_w E_{s∼d_ρ^{π_θ}, a∼π_θ(·|s)} [ ( A^{π_θ}(s,a) − w · \bar φ^{π_θ}_{s,a} )^2 ],
where \bar φ^{π_θ}_{s,a} := φ_{s,a} − E_{a'∼π_θ(·|s)}[φ_{s,a'}] denotes the centered features. (We have rescaled the learning rate η in comparison to (16).) Note that we recompute w⋆ for every update of θ. Here, the compatible function approximation error measures the expressivity of our parameterization, in how well linear functions of the parameterization can capture the policy's advantage function.
We also consider a variant of the NPG update rule (16), termed Q-NPG, where:
    Q-NPG: θ ← θ + η w⋆,   w⋆ ∈ argmin_w E_{s∼d_ρ^{π_θ}, a∼π_θ(·|s)} [ ( Q^{π_θ}(s,a) − w · φ_{s,a} )^2 ].
Note that we do not center the features for Q-NPG; observe that Q^π(s,a) is also not 0 in expectation under π(·|s), unlike the advantage function.
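A minimal sketch of one Q-NPG iteration for the log-linear class follows, assuming access to sampled state-action features φ_i and Q-value estimates drawn from the on-policy measure (the sampler itself is hypothetical and not specified here); the helper names are illustrative only.

```python
# One Q-NPG iteration: regress Q-values on the (uncentered) features, then step.
import numpy as np

def log_linear_policy(theta, Phi_s):
    """pi_theta(.|s) for one state, where Phi_s has one row of features per action."""
    logits = Phi_s @ theta
    z = np.exp(logits - logits.max())
    return z / z.sum()

def q_npg_step(theta, Phi_samples, Q_hat, eta):
    """theta <- theta + eta * w_star, w_star in argmin_w E[(Q - w . phi)^2]."""
    w_star, *_ = np.linalg.lstsq(Phi_samples, Q_hat, rcond=None)
    return theta + eta * w_star

rng = np.random.default_rng(1)
d, n = 4, 500
Phi_samples = rng.normal(size=(n, d))                        # features of sampled (s, a) pairs
Q_hat = Phi_samples @ rng.normal(size=d) + 0.05 * rng.normal(size=n)
theta = q_npg_step(np.zeros(d), Phi_samples, Q_hat, eta=0.5)
print(log_linear_policy(theta, rng.normal(size=(3, d))))     # policy at a toy state with 3 actions
```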
Remark 18 (NPG/Q-NPG and Soft-Policy Iteration) We now see how we can view both NPG and
Q-NPG as an incremental (soft) version of policy iteration, just as in Lemma 15 for the tabular case.
Rather than writing the update rule in terms of the parameter θ, we can write an equivalent update
rule directly in terms of the (log-linear) policy π:
    NPG: π(a|s) ← π(a|s) exp(w⋆ · φ_{s,a}) / Z_s,   w⋆ ∈ argmin_w E_{s∼d_ρ^π, a∼π(·|s)} [ ( A^π(s,a) − w · \bar φ^π_{s,a} )^2 ],
where Z_s is a normalization constant. While the policy update uses the original features φ whereas the quadratic error minimization is in terms of the centered features \bar φ^π, this distinction is not relevant: we may instead use \bar φ^π in the policy update and obtain an equivalent update, since the normalization makes the update invariant to (constant) translations of the features. Similarly, an equivalent update for Q-NPG, where we update π directly rather than θ, is:
    Q-NPG: π(a|s) ← π(a|s) exp(w⋆ · φ_{s,a}) / Z_s,   w⋆ ∈ argmin_w E_{s∼d_ρ^π, a∼π(·|s)} [ ( Q^π(s,a) − w · φ_{s,a} )^2 ].
Remark 19 (On the equivalence of NPG and Q-NPG) If the compatible function approximation error is 0, then it is straightforward to verify that NPG and Q-NPG are equivalent algorithms, in that their corresponding policy updates will be equivalent to each other.
Define the regression loss L(w; θ, υ) := E_{s,a∼υ} [ ( Q^{π_θ}(s,a) − w · φ_{s,a} )^2 ]. The iterates of the Q-NPG algorithm can be viewed as minimizing this loss under some (changing) distribution υ.
We now specify an approximate version of Q-NPG. It is helpful to consider a slightly more
general version of the algorithm in the previous section, where instead of optimizing under a starting
state distribution ρ, we have a different starting state-action distribution ν. Analogous to the definition
of the state visitation measure, dπµ , we can define a visitation measure over states and actions induced
by following π after s0 , a0 ∼ ν. We overload notation using dπν to also refer to the state-action
visitation measure; precisely,
    d_ν^π(s,a) := (1−γ) E_{(s_0,a_0)∼ν} Σ_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0, a_0),     (19)
where Pr^π(s_t = s, a_t = a | s_0, a_0) is the probability that s_t = s and a_t = a after starting at state s_0, taking action a_0, and following π thereafter. While we overload notation for visitation distributions (d_µ^π(s) and d_ν^π(s,a)) for notational convenience, note that the state-action measure d_ν^π uses the subscript ν, which is itself a state-action measure, while the state measure d_µ^π uses a starting state distribution µ.
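A sketch of drawing one sample (s, a) ∼ d_ν^π as defined in (19) follows: start from (s_0, a_0) ∼ ν, follow π, and terminate each step with probability 1 − γ. The interfaces sample_nu, transition, and pi are hypothetical stand-ins for an episodic simulator.

```python
import random

def sample_state_action_visitation(sample_nu, transition, pi, gamma):
    s, a = sample_nu()                 # (s_0, a_0) ~ nu
    while True:
        if random.random() > gamma:    # accept the current pair with probability 1 - gamma
            return s, a
        s = transition(s, a)           # s_{t+1} ~ P(.|s_t, a_t)
        a = pi(s)                      # a_{t+1} ~ pi(.|s_{t+1})

# Toy usage on a 2-state, 2-action chain (purely illustrative).
sample_nu = lambda: (0, random.randrange(2))
transition = lambda s, a: (s + a) % 2
pi = lambda s: random.randrange(2)
print(sample_state_action_visitation(sample_nu, transition, pi, gamma=0.9))
```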
Q-NPG will be defined with respect to the on-policy state-action measure starting with (s_0, a_0) ∼ ν. As per our convention, we define d^{(t)} := d_ν^{π^{(t)}}.
The approximate version of this algorithm is:
    Approx. Q-NPG: θ^{(t+1)} = θ^{(t)} + η w^{(t)},   w^{(t)} ≈ argmin_{‖w‖_2 ≤ W} L(w; θ^{(t)}, d^{(t)}),     (20)
where the above update rule also permits us to constrain the norm of the update direction w(t)
(alternatively, we could use ℓ_2 regularization, as is also common in practice). The exact minimizer is denoted as w_⋆^{(t)} ∈ argmin_{‖w‖_2 ≤ W} L(w; θ^{(t)}, d^{(t)}).
Note that w_⋆^{(t)} depends on the current parameter θ^{(t)}, and that W can scale with |S| and |A| in general. Our analysis will take into account both the excess risk (often also referred to as estimation error) and the transfer error. Here, the excess risk arises because w^{(t)} may not equal w_⋆^{(t)}, and the approximation error arises because even the best linear fit using w_⋆^{(t)} may not perfectly match the Q-values, i.e. L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) is unlikely to be 0 in practical applications.
We now formalize these concepts in the following assumption:
Assumption 6.1 (Estimation/Transfer errors) Fix a state distribution ρ; a state-action distribution
ν; an arbitrary comparator policy π ? (not necessarily an optimal policy). With respect to π ? , define
the state-action measure d⋆ as
    d⋆(s,a) := d_ρ^{π⋆}(s) ◦ Unif_A(a),
i.e. d⋆ samples states from the comparator's state visitation measure, d_ρ^{π⋆}, and actions from the uniform distribution. Let us permit the sequence of iterates w^{(0)}, w^{(1)}, . . . , w^{(T−1)} used by the Q-NPG algorithm to be random, where the randomness could be due to sample-based estimation error.
Suppose the following holds for all t < T :
1. (Excess risk) Assume that the estimation error is bounded as follows:
    E[ L(w^{(t)}; θ^{(t)}, d^{(t)}) − L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ] ≤ ε_stat.
Note that using a sample-based approach we would expect ε_stat = O(1/√N) or better, where N is the number of samples used to estimate w_⋆^{(t)}. We formalize this in Corollary 26.
2. (Transfer error) Suppose that the best predictor w_⋆^{(t)} has an error bounded by ε_bias, in expectation, with respect to the comparator's measure d⋆. Specifically, assume:
    E[ L(w_⋆^{(t)}; θ^{(t)}, d⋆) ] ≤ ε_bias.
We refer to ε_bias as the transfer error (or transfer bias); it is the error when the relevant distribution is shifted to d⋆. For the softmax policy parameterization for tabular MDPs, ε_bias = 0 (see Remark 24 for another example).
In both conditions, the expectations are with respect to the randomness in the sequence of iterates
w(0) , w(1) , . . . w(T −1) , e.g. the approximate algorithm may be sample based.
Shortly, we discuss how the transfer error relates to the more standard approximation-estimation decomposition. Importantly, the transfer error is always defined with respect to a single, fixed measure, d⋆.
Assumption 6.2 (Relative condition number) Consider the same ρ, ν, and π ? as in Assump-
tion 6.1. With respect to any state-action distribution υ, define:
    Σ_υ := E_{s,a∼υ} [ φ_{s,a} φ_{s,a}^⊤ ],
and define
    sup_{w∈R^d} (w^⊤ Σ_{d⋆} w)/(w^⊤ Σ_ν w) = κ.
Assume that κ is finite.
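The following sketch estimates the relative condition number κ of Assumption 6.2 from feature samples; it computes the top generalized eigenvalue of the pencil (Σ_{d⋆}, Σ_ν). The sampled feature arrays, the ridge term, and the function name are illustrative assumptions, not part of the paper.

```python
import numpy as np

def relative_condition_number(Phi_dstar, Phi_nu, ridge=1e-8):
    """kappa = sup_w (w' Sigma_dstar w)/(w' Sigma_nu w), from sampled feature rows."""
    Sigma_dstar = Phi_dstar.T @ Phi_dstar / len(Phi_dstar)
    Sigma_nu = Phi_nu.T @ Phi_nu / len(Phi_nu) + ridge * np.eye(Phi_nu.shape[1])
    # Whiten by Sigma_nu^{-1/2} via a Cholesky factor, then take the top eigenvalue.
    L = np.linalg.cholesky(Sigma_nu)
    M = np.linalg.solve(L, np.linalg.solve(L, Sigma_dstar).T)   # L^{-1} Sigma_dstar L^{-T}
    return np.linalg.eigvalsh(M).max()

rng = np.random.default_rng(2)
Phi_nu = rng.normal(size=(1000, 6))
Phi_dstar = rng.normal(size=(1000, 6)) * 1.5    # a more "spread out" comparator measure
print(relative_condition_number(Phi_dstar, Phi_nu))
```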
Remark 22 discusses why it is reasonable to expect that κ is not a quantity related to the size of the state space; see also footnote 7.
Our main theorem below shows how the approximation error, the excess risk, and the conditioning,
determine the final performance. Note that both the transfer error ε_bias and κ are defined with respect to the comparator policy π⋆.
Theorem 20 (Agnostic learning with Q-NPG) Fix a state distribution ρ; a state-action distribution
ν; an arbitrary comparator policy π ? (not necessarily an optimal policy). Suppose Assumption 6.2
holds and ‖φ_{s,a}‖_2 ≤ B for all s, a. Suppose the Q-NPG update rule (in (20)) starts with θ^{(0)} = 0 and η = √(2 log |A| / (B^2 W^2 T)), and that the (random) sequence of iterates satisfies Assumption 6.1. We have that
    E[ min_{t<T} { V^{π⋆}(ρ) − V^{(t)}(ρ) } ] ≤ (BW/(1−γ)) √(2 log |A| / T) + √( 4|A| κ ε_stat / (1−γ)^3 ) + √(4|A| ε_bias) / (1−γ).
As we obtain more samples, we can drive the excess risk (the estimation error) to 0 (see Corollary 26). The approximation error above is due to modeling error. Importantly, for our Q-NPG performance bound, it is not this standard approximation error notion which is relevant, but rather this error under a different measure d⋆, i.e. L(w_⋆^{(t)}; θ^{(t)}, d⋆). One appealing aspect of the transfer error is that it is with respect to a fixed measure, namely d⋆. Furthermore, in practice, modern machine learning methods often perform favorably with regards to transfer learning, substantially better than worst-case theory might suggest.
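The small helper below evaluates the Theorem 20 bound (as reconstructed above) for user-supplied constants, which can be useful for seeing how the optimization, estimation, and transfer terms trade off. Nothing here is estimated from data; the numbers in the usage line are arbitrary.

```python
import math

def q_npg_bound(B, W, A, kappa, eps_stat, eps_bias, T, gamma):
    opt = B * W / (1 - gamma) * math.sqrt(2 * math.log(A) / T)       # optimization term
    est = math.sqrt(4 * A * kappa * eps_stat / (1 - gamma) ** 3)     # estimation term
    bias = math.sqrt(4 * A * eps_bias) / (1 - gamma)                 # transfer-error term
    return opt + est + bias

print(q_npg_bound(B=1.0, W=10.0, A=5, kappa=4.0, eps_stat=1e-3, eps_bias=0.0, T=10_000, gamma=0.9))
```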
The following corollary provides a performance bound in terms of the usual notion of approxima-
tion error, at the cost of also depending on the worst case distribution mismatch ratio. The corollary
disentangles the estimation error from the approximation error.
Corollary 21 (Estimation error/Approximation error bound for Q-NPG) Consider the same setting
as in Theorem 20. Rather than assuming the transfer error is bounded (part 2 in Assumption 6.1),
suppose that, for all t ≤ T,
    E[ L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ] ≤ ε_approx.

7. Technically, we only need the relative condition number sup_{w∈R^d} (w^⊤ Σ_{d⋆} w)/(w^⊤ Σ_{d^{(t)}} w) to be bounded for all t. We state this as a sufficient condition based on the initial distribution ν because it is more interpretable and, as per Remark 22, this quantity can be bounded in a manner that is independent of the sequence of iterates produced by the algorithm.
We have that
    E[ min_{t<T} { V^{π⋆}(ρ) − V^{(t)}(ρ) } ] ≤ (BW/(1−γ)) √(2 log |A| / T) + √( (4|A|/(1−γ)^3) ( κ ε_stat + ‖d⋆/ν‖_∞ ε_approx ) ).
Proof We have the following crude upper bound on the transfer error:
    L(w_⋆^{(t)}; θ^{(t)}, d⋆) ≤ ‖d⋆/d^{(t)}‖_∞ L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ≤ (1/(1−γ)) ‖d⋆/ν‖_∞ L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}),
where the last step uses the definition of d^{(t)} (see (19)), which implies d^{(t)} ≥ (1−γ)ν. This implies ε_bias ≤ (1/(1−γ)) ‖d⋆/ν‖_∞ ε_approx, and the corollary follows.
The above also shows the striking difference between the effects of estimation error and approximation error. The proof shows how the transfer error notion is weaker than previous conditions based on distribution mismatch coefficients or concentrability coefficients. Also, as discussed in Scherrer (2014), the (distribution mismatch) coefficient ‖d⋆/ν‖_∞ is already weaker than the more standard concentrability coefficients.
A few additional remarks with regards to κ are now in order.
Proof The distribution ν can be found by constructing the minimal volume ellipsoid containing Φ, i.e. the Löwner-John ellipsoid (John, 1948). In particular, this ν is supported on the contact points between this ellipsoid and Φ; the lemma immediately follows from properties of this ellipsoid (e.g. see Ball (1997); Bubeck et al. (2012)).
It is also worth considering a more general example (beyond tabular MDPs) in which ε_bias = 0 for the log-linear policy class.
Remark 24 (ε_bias = 0 for "linear" MDPs) In the recent linear MDP model of Jin et al. (2019); Yang and Wang (2019); Jiang et al. (2017), where the transition dynamics are low rank, we have that ε_bias = 0 provided we use the features of the linear MDP. Our guarantees also permit model misspecification of linear MDPs, with a non-worst-case approximation error where ε_bias ≠ 0.
Remark 25 (Comparison with P OLITEX and EE-P OLITEX) Compared with P OLITEX (Abbasi-
Yadkori et al., 2019a), Assumption 6.2 is substantially milder, in that it just assumes a good relative
condition number for one policy rather than all possible policies (which cannot hold in general
even for tabular MDPs). Changing this assumption to an analog of Assumption 6.2 is the main
improvement in the analysis of the EE-P OLITEX (Abbasi-Yadkori et al., 2019b) algorithm. They
provide a regret bound for the average reward setting, which is qualitatively different from the
suboptimality bound in the discounted setting that we study. They provide a specialized result for
linear function approximation, similar to Theorem 20.
Algorithm 2 provides a sample based version of the Q-NPG algorithm; it simply uses stochastic
projected gradient ascent within each iteration. The following corollary shows this algorithm suffices
to obtain an accurate sample based version of Q-NPG.
Corollary 26 (Sample complexity of Q-NPG) Assume we are in the setting of Theorem 20 and that
we have access to an episodic sampling oracle (i.e. Assumption 6.3). Suppose that the Sample Based
Q-NPG Algorithm (Algorithm 2) is run for T iterations, with N gradient steps per iteration, with an
appropriate setting of the learning rates η and α. We have that:
    E[ min_{t<T} { V^{π⋆}(ρ) − V^{(t)}(ρ) } ]
        ≤ (BW/(1−γ)) √(2 log |A| / T) + √( 8κ|A| BW (BW + 1) / (1−γ)^4 ) · (1/N^{1/4}) + √(4|A| ε_bias) / (1−γ).
Furthermore, since each episode has expected length 2/(1 − γ), the expected number of total samples
used by Q-NPG is 2N T /(1 − γ).
    where W = {w : ‖w‖_2 ≤ W}.
7:   end for
8:   Set ŵ^{(t)} = (1/N) Σ_{n=1}^N w_n.
9:   Update θ^{(t+1)} = θ^{(t)} + η ŵ^{(t)}.
10: end for
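Only the tail of Algorithm 2 survives above, so the sketch below reconstructs the kind of inner loop it relies on: N steps of stochastic projected gradient descent on the square loss L(w; θ^{(t)}, d^{(t)}), projecting onto the ball {w : ‖w‖_2 ≤ W}, followed by iterate averaging. The per-sample oracle sample_phi_and_q is a hypothetical stand-in for the paper's episodic sampling procedure, and the details may differ from Algorithm 2 itself.

```python
import numpy as np

def project_l2_ball(w, W):
    norm = np.linalg.norm(w)
    return w if norm <= W else w * (W / norm)

def sgd_regression(sample_phi_and_q, d, W, alpha, N, rng):
    w = np.zeros(d)
    iterates = []
    for _ in range(N):
        phi, q = sample_phi_and_q(rng)               # one (phi_{s,a}, Qhat(s,a)) sample
        grad = 2.0 * (np.dot(w, phi) - q) * phi      # stochastic gradient of the square loss
        w = project_l2_ball(w - alpha * grad, W)     # projected step
        iterates.append(w)
    return np.mean(iterates, axis=0)                 # averaged iterate, i.e. w-hat^{(t)}

def oracle(r):                                       # toy oracle with a noisy linear target
    true_w = np.array([1.0, -2.0, 0.5])
    phi = r.normal(size=3)
    return phi, float(phi @ true_w) + 0.1 * r.normal()

rng = np.random.default_rng(3)
print(sgd_regression(oracle, d=3, W=5.0, alpha=0.05, N=2000, rng=rng))
```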
Proof Note that our sampled gradients are bounded by G := 2B(BW + 1/(1−γ)). Using α = W/(G√N), a standard analysis for stochastic projected gradient ascent (Theorem 59) shows that:
    ε_stat ≤ 2BW(BW + 1/(1−γ)) / √N.
The proof is completed via substitution.
Remark 27 (Improving the scaling with N) Our current rate of convergence is 1/N^{1/4} due to our use of stochastic projected gradient ascent. Instead, for the least squares estimator, ε_stat would be O(d/N) provided certain further regularity assumptions hold (a bound on the minimal eigenvalue of Σ_ν would be sufficient but not necessary; see Hsu et al. (2014) for such conditions). With such further assumptions, our rate of convergence would be O(1/√N).
Analogous to the Q-NPG case, define the loss L_A(w; θ, υ) := E_{s,a∼υ} [ ( A^{π_θ}(s,a) − w · ∇_θ log π_θ(a|s) )^2 ], where υ is a state-action distribution, and the subscript of A denotes that the loss function uses advantages (rather than Q-values). The iterates of the NPG algorithm can be viewed as minimizing this loss under some appropriately chosen measure.
We now consider an approximate version of the NPG update rule:
    Approx. NPG: θ^{(t+1)} = θ^{(t)} + η w^{(t)},   w^{(t)} ≈ argmin_{‖w‖_2 ≤ W} L_A(w; θ^{(t)}, d^{(t)}),     (21)
where again we use the on-policy, fitting distribution d(t) . As with Q-NPG, we also permit the use of
a starting state-action distribution ν as opposed to just a starting state distribution (see Remark 22).
Again, we let w_⋆^{(t)} denote the minimizer, i.e. w_⋆^{(t)} ∈ argmin_{‖w‖_2 ≤ W} L_A(w; θ^{(t)}, d^{(t)}).
For this section, our analysis will focus on more general policy classes, beyond log-linear policy
classes. In particular, we make the following smoothness assumption on the policy class:
Assumption 6.4 (Policy Smoothness) Assume for all s ∈ S and a ∈ A that log πθ (a|s) is a β-
smooth function of θ (to recall the definition of smoothness, see (24)).
It is not too difficult to verify that the tabular softmax policy parameterization is a 1-smooth policy
class in the above sense. The more general class of log-linear policies is also smooth as we remark
below.
Remark 28 (Smoothness of the log-linear policy class) For the log-linear policy class (see Sec-
tion 6.1.1), smoothness is implied if the features φ have bounded Euclidean norm. Precisely, if
the feature mapping φ satisfies ‖φ_{s,a}‖_2 ≤ B, then it is not difficult to verify that log π_θ(a|s) is a B^2-smooth function.
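The sketch below numerically checks the claim of Remark 28 at a single state: for the log-linear class, the score is the centered feature vector, the Hessian of −log π_θ(a|s) is the feature covariance under π_θ(·|s), and its spectral norm is at most B^2 when ‖φ_{s,a}‖_2 ≤ B. The toy features are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 3))                       # 6 actions, d = 3 features for one state
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # normalize rows so B = 1
theta = rng.normal(size=3)

p = np.exp(Phi @ theta); p /= p.sum()               # pi_theta(.|s)
mean_phi = p @ Phi
score = Phi - mean_phi                              # rows: grad_theta log pi_theta(a|s)
hessian = (Phi * p[:, None]).T @ Phi - np.outer(mean_phi, mean_phi)   # Cov_{a~pi}[phi]
print(np.linalg.norm(hessian, 2) <= 1.0 + 1e-9)     # smoothness constant bounded by B^2 = 1
```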
For any state-action distribution υ, define:
    Σ_υ^θ := E_{s,a∼υ} [ ∇_θ log π_θ(a|s) ( ∇_θ log π_θ(a|s) )^⊤ ],
and, again, we use Σ_υ^{(t)} as shorthand for Σ_υ^{θ^{(t)}}.
Assumption 6.5 (Estimation/Transfer errors for NPG) With respect to the comparator policy π⋆, define the state-action measure d⋆ as d⋆(s,a) := d_ρ^{π⋆}(s) π⋆(a|s); note that, in comparison to Assumption 6.1, d⋆ is the state-action visitation measure of the comparator policy. Let us permit the sequence of iterates w^{(0)}, w^{(1)}, . . . , w^{(T−1)} used by the NPG algorithm to be random, where the randomness could be due to sample-based estimation error. Suppose the following holds for all t < T:
1. (Excess risk) Assume the estimation error is bounded as:
    E[ L_A(w^{(t)}; θ^{(t)}, d^{(t)}) − L_A(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) | θ^{(t)} ] ≤ ε_stat.
2. (Transfer error) Suppose that the best predictor w_⋆^{(t)} has a transfer error bounded, in expectation, as E[ L_A(w_⋆^{(t)}; θ^{(t)}, d⋆) ] ≤ ε_bias.
3. (Relative condition number) For all iterations t, assume the average relative condition number is bounded as follows:
    E[ sup_{w∈R^d} (w^⊤ Σ_{d⋆}^{(t)} w)/(w^⊤ Σ_ν^{(t)} w) ] ≤ κ.     (22)
Note that the term inside the expectation is a random quantity, as θ^{(t)} is random.
In the above conditions, the expectation is with respect to the randomness in the sequence of iterates
w(0) , w(1) , . . . w(T −1) .
Analogous to our Q-NPG theorem, our main theorem for NPG shows how the transfer error is relevant in addition to the statistical error ε_stat.
Theorem 29 (Agnostic learning with NPG) Fix a state distribution ρ; a state-action distribution
ν; an arbitrary comparator policy π ? (not necessarily an optimal policy). Suppose Assumption 6.4
holds. Suppose the NPG update rule (in (21)) starts with π^{(0)} being the uniform distribution (at each state) and η = √(2 log |A| / (β W^2 T)), and that the (random) sequence of iterates satisfies Assumption 6.5. We have that
    E[ min_{t<T} { V^{π⋆}(ρ) − V^{(t)}(ρ) } ] ≤ (W/(1−γ)) √(2β log |A| / T) + √( κ ε_stat / (1−γ)^3 ) + √(ε_bias) / (1−γ).
Remark 30 (The |A| dependence: NPG vs. Q-NPG) Observe there is no polynomial dependence on |A| in the rate for NPG (in contrast to Theorem 20); also observe that here we define d⋆ as the state-action distribution of π⋆ in Assumption 6.5, as opposed to using a uniform distribution over the actions, as in Assumption 6.1. The main difference in the analysis is that, for Q-NPG, we need to bound the error in fitting the advantage estimates; this leads to the dependence on |A| (which can be removed with a path-dependent bound, i.e. a bound which depends on the sequence of iterates produced by the algorithm).^9 For NPG, the direct fitting of the advantage function sidesteps this conversion step. Note that the relative condition number assumption in Q-NPG (Assumption 6.2) is a weaker assumption, because it can be bounded independently of the path of the algorithm (see Remark 22), while NPG's centering of the features makes the assumption on the relative condition number depend on the path of the algorithm.
Remark 31 (Comparison with Theorem 16) Compared with the result of Theorem 16 in the noiseless, tabular case, we see two main differences. In the setting of Theorem 16, we have ε_stat = ε_bias = 0, so that the last two terms vanish. This leaves the first term, where we observe a slower 1/√T rate compared with Theorem 16, and with an additional dependence on W (which grows as O(|S||A|/(1−γ)) to approximate the advantages in the tabular setting). Both differences arise from the additional monotonicity property (Lemma 17) on the per-step improvements in the tabular case, which is not easily generalized to the function approximation setting.
9. For Q-NPG, we have to bound two distribution shift terms to both π ? and π (t) at step t of the algorithm.
Remark 32 (Generalizing Q-NPG for smooth policies) A similar reasoning to the analysis here can also be used to establish a convergence result for the Q-NPG algorithm in this more general setting of smooth policy classes. Concretely, we can analyze the Q-NPG update described for neural policy classes in Section 6.1.2, assuming that the function f_θ is Lipschitz-continuous in θ. As for Theorem 29, the main modification is that Assumption 6.2 on relative condition numbers is now defined using the covariance matrix for the features of f_θ(s,a), which depend on θ, as opposed to a fixed feature map φ(s,a) as in the log-linear case. The rest of the analysis follows with an appropriate adaptation of the results above.
Corollary 33 (Sample complexity of NPG) Assume we are in the setting of Theorem 29 and that we
have access to an episodic sampling oracle (i.e. Assumption 6.3). Suppose that the Sample Based
NPG Algorithm (Algorithm 4) is run for T iterations, with N gradient steps per iteration. Also,
suppose that ‖∇_θ log π^{(t)}(a|s)‖_2 ≤ B holds with probability one. There exists a setting of η and α
such that:
    E[ min_{t<T} { V^{π⋆}(ρ) − V^{(t)}(ρ) } ]
        ≤ (W/(1−γ)) √(2β log |A| / T) + √( 8κ BW (BW + 1) / (1−γ)^4 ) · (1/N^{1/4}) + √(ε_bias) / (1−γ).
    where W = {w : ‖w‖_2 ≤ W}.
7:   end for
8:   Set ŵ^{(t)} = (1/N) Σ_{n=1}^N w_n.
9:   Update θ^{(t+1)} = θ^{(t)} + η ŵ^{(t)}.
10: end for
Furthermore, since each episode has expected length 2/(1 − γ), the expected number of total samples
used by NPG is 2N T /(1 − γ).
Proof Let us see that the update direction in Step 6 of Algorithm 4 uses an unbiased estimate of the true gradient of the loss function L_A:
    2 E_{s,a∼d^{(t)}} [ ( w_n · ∇_θ log π^{(t)}(a|s) − Â(s,a) ) ∇_θ log π^{(t)}(a|s) ]
    = 2 E_{s,a∼d^{(t)}} [ ( w_n · ∇_θ log π^{(t)}(a|s) − E[Â(s,a)|s,a] ) ∇_θ log π^{(t)}(a|s) ]
    = 2 E_{s,a∼d^{(t)}} [ ( w_n · ∇_θ log π^{(t)}(a|s) − A^{(t)}(s,a) ) ∇_θ log π^{(t)}(a|s) ],
where the last step follows because the sampling procedure in Algorithm 3 produces a conditionally unbiased estimate.
Since ‖∇_θ log π^{(t)}(a|s)‖_2 ≤ B and |Â(s,a)| ≤ 2/(1−γ), our sampled gradients are bounded by G := 8B(BW + 1/(1−γ)). The remainder of the proof follows that of Corollary 26.
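The sketch below gives one conditionally unbiased advantage estimator in the spirit of the sampling procedure the proof refers to; the paper's Algorithm 3 is not reproduced here, so this is an assumed construction, not the authors' estimator. One rollout from (s, a) estimates Q(s, a), an independent rollout from s with a ∼ π estimates V(s), and their difference is Â(s, a); each rollout terminates with probability 1 − γ per step, which makes the undiscounted return an unbiased estimate of the discounted value.

```python
import random

def rollout_return(env_step, pi, s, a, gamma):
    """Unbiased estimate of Q^pi(s, a): sum rewards, terminating w.p. 1 - gamma per step."""
    total = 0.0
    while True:
        s, r = env_step(s, a)
        total += r
        if random.random() > gamma:
            return total
        a = pi(s)

def advantage_estimate(env_step, pi, s, a, gamma):
    q_hat = rollout_return(env_step, pi, s, a, gamma)        # estimates Q^pi(s, a)
    v_hat = rollout_return(env_step, pi, s, pi(s), gamma)    # fresh a ~ pi estimates V^pi(s)
    return q_hat - v_hat                                     # E[Ahat | s, a] = A^pi(s, a)

# Toy usage on a trivial 2-state chain (illustrative only).
toy_env_step = lambda s, a: (a, 1.0 if s == 1 else 0.0)
pi = lambda s: random.randrange(2)
print(advantage_estimate(toy_env_step, pi, 0, 1, gamma=0.9))
```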
6.4 Analysis
We first proceed by providing a general analysis of NPG, for arbitrary sequences. We then specialize
it to complete the proof of our two main theorems in this section.
It is helpful for us to consider NPG more abstractly, as an update rule of the form
    θ^{(t+1)} = θ^{(t)} + η w^{(t)}.     (23)
We will now provide a lemma where w^{(t)} is an arbitrary (bounded) sequence, which will be helpful when specialized.
Recall that a function f : R^d → R is said to be β-smooth if for all x, x' ∈ R^d:
    | f(x') − f(x) − ∇f(x) · (x' − x) | ≤ (β/2) ‖x' − x‖_2^2.     (24)
The following analysis of NPG is based on the mirror-descent approach developed in (Even-Dar
et al., 2009), which motivates us to refer to it as a “regret lemma”.
Lemma 34 (NPG Regret Lemma) Fix a comparison policy π̃ and a state distribution ρ. Assume for all s ∈ S and a ∈ A that log π_θ(a|s) is a β-smooth function of θ. Consider the update rule (23), where π^{(0)} is the uniform distribution (for all states) and where the sequence of weights w^{(0)}, . . . , w^{(T)} satisfies ‖w^{(t)}‖_2 ≤ W (but is otherwise arbitrary). Define:
    err_t := E_{s∼d_ρ^{π̃}} E_{a∼π̃(·|s)} [ A^{(t)}(s,a) − w^{(t)} · ∇_θ log π^{(t)}(a|s) ].
We have that:
    min_{t<T} { V^{π̃}(ρ) − V^{(t)}(ρ) } ≤ (1/(1−γ)) ( log |A|/(ηT) + η β W^2 / 2 + (1/T) Σ_{t=0}^{T−1} err_t ).
Proof By smoothness (see (24)),
    log ( π^{(t+1)}(a|s) / π^{(t)}(a|s) ) ≥ ∇_θ log π^{(t)}(a|s) · (θ^{(t+1)} − θ^{(t)}) − (β/2) ‖θ^{(t+1)} − θ^{(t)}‖_2^2
        = η ∇_θ log π^{(t)}(a|s) · w^{(t)} − η^2 (β/2) ‖w^{(t)}‖_2^2.
We use d̃ as shorthand for d_ρ^{π̃} (note ρ and π̃ are fixed); for any policy π, we also use π_s as shorthand for the vector π(·|s). Using the performance difference lemma (Lemma 2),
    E_{s∼d̃} [ KL(π̃_s ‖ π_s^{(t)}) − KL(π̃_s ‖ π_s^{(t+1)}) ]
    = E_{s∼d̃} E_{a∼π̃(·|s)} [ log ( π^{(t+1)}(a|s) / π^{(t)}(a|s) ) ]
    ≥ η E_{s∼d̃} E_{a∼π̃(·|s)} [ ∇_θ log π^{(t)}(a|s) · w^{(t)} ] − η^2 (β/2) ‖w^{(t)}‖_2^2     (using the previous display)
    = η E_{s∼d̃} E_{a∼π̃(·|s)} [ A^{(t)}(s,a) ] − η^2 (β/2) ‖w^{(t)}‖_2^2 + η E_{s∼d̃} E_{a∼π̃(·|s)} [ ∇_θ log π^{(t)}(a|s) · w^{(t)} − A^{(t)}(s,a) ]
    = (1−γ) η ( V^{π̃}(ρ) − V^{(t)}(ρ) ) − η^2 (β/2) ‖w^{(t)}‖_2^2 − η err_t.
Rearranging, we have:
    V^{π̃}(ρ) − V^{(t)}(ρ) ≤ (1/(1−γ)) ( (1/η) E_{s∼d̃} [ KL(π̃_s ‖ π_s^{(t)}) − KL(π̃_s ‖ π_s^{(t+1)}) ] + (ηβ/2) W^2 + err_t ).
Proceeding,
    (1/T) Σ_{t=0}^{T−1} ( V^{π̃}(ρ) − V^{(t)}(ρ) )
        ≤ (1/(ηT(1−γ))) Σ_{t=0}^{T−1} E_{s∼d̃} ( KL(π̃_s ‖ π_s^{(t)}) − KL(π̃_s ‖ π_s^{(t+1)}) ) + η β W^2 / (2(1−γ)) + (1/(T(1−γ))) Σ_{t=0}^{T−1} err_t
        ≤ E_{s∼d̃} KL(π̃_s ‖ π^{(0)}) / (ηT(1−γ)) + η β W^2 / (2(1−γ)) + (1/(T(1−γ))) Σ_{t=0}^{T−1} err_t
        ≤ log |A| / (ηT(1−γ)) + η β W^2 / (2(1−γ)) + (1/(T(1−γ))) Σ_{t=0}^{T−1} err_t,
which completes the proof of the regret lemma.
Proof (of Theorem 20) By the NPG regret lemma (applied with π̃ = π⋆ and our choice of η), it suffices to bound E[err_t]. We decompose err_t into two terms:
    err_t = E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ A^{(t)}(s,a) − w_⋆^{(t)} · ∇_θ log π^{(t)}(a|s) ] + E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · ∇_θ log π^{(t)}(a|s) ].
For the first term, using that ∇_θ log π_θ(a|s) = φ_{s,a} − E_{a'∼π_θ(·|s)}[φ_{s,a'}] (see Section 6.1.1), we have:
    E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ A^{(t)}(s,a) − w_⋆^{(t)} · ∇_θ log π^{(t)}(a|s) ]
    = E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ Q^{(t)}(s,a) − w_⋆^{(t)} · φ_{s,a} ] − E_{s∼d_ρ^⋆, a'∼π^{(t)}(·|s)} [ Q^{(t)}(s,a') − w_⋆^{(t)} · φ_{s,a'} ]
    ≤ √( E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( Q^{(t)}(s,a) − w_⋆^{(t)} · φ_{s,a} )^2 ] ) + √( E_{s∼d_ρ^⋆, a'∼π^{(t)}(·|s)} [ ( Q^{(t)}(s,a') − w_⋆^{(t)} · φ_{s,a'} )^2 ] )
    ≤ 2 √( |A| E_{s∼d_ρ^⋆, a∼Unif_A} [ ( Q^{(t)}(s,a) − w_⋆^{(t)} · φ_{s,a} )^2 ] ) = 2 √( |A| L(w_⋆^{(t)}; θ^{(t)}, d⋆) ),     (25)
where in the first equality we have used A^{(t)}(s,a) = Q^{(t)}(s,a) − E_{a'∼π^{(t)}(·|s)}[Q^{(t)}(s,a')], and in the last step we have used the definitions of d⋆ and L(w_⋆^{(t)}; θ^{(t)}, d⋆).
For the second term, let us now show that:
    E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · ∇_θ log π^{(t)}(a|s) ] ≤ 2 √( (|A| κ / (1−γ)) ( L(w^{(t)}; θ^{(t)}, d^{(t)}) − L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ) ).     (26)
To see this, first observe that a similar argument to the above leads to:
    E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · ∇_θ log π^{(t)}(a|s) ]
    = E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · φ_{s,a} ] − E_{s∼d_ρ^⋆, a'∼π^{(t)}(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · φ_{s,a'} ]
    ≤ 2 √( |A| E_{s,a∼d⋆} [ ( ( w_⋆^{(t)} − w^{(t)} ) · φ_{s,a} )^2 ] ) = 2 √( |A| ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_{d⋆}} ),
where we use the notation ‖x‖^2_M := x^⊤ M x for a matrix M and a vector x. From the definition of κ,
    ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_{d⋆}} ≤ κ ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_ν} ≤ (κ/(1−γ)) ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_{d^{(t)}}},
using that (1−γ)ν ≤ d_ν^{π^{(t)}} (see (19)). Because w_⋆^{(t)} minimizes L(w; θ^{(t)}, d^{(t)}) over the set W := {w : ‖w‖_2 ≤ W}, for any w ∈ W the first-order optimality conditions for w_⋆^{(t)} imply that:
    ( w − w_⋆^{(t)} ) · ∇L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ≥ 0,
and, since L is quadratic in w, this gives ‖w − w_⋆^{(t)}‖^2_{Σ_{d^{(t)}}} ≤ L(w; θ^{(t)}, d^{(t)}) − L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}).
Noting that w^{(t)} ∈ W by construction in the update rule (20) yields the claimed bound on the second term in (26).
Using the bounds on the first and second terms in (25) and (26), along with concavity of the square root function (Jensen's inequality), we have that:
    E[err_t] ≤ 2 √( |A| E[ L(w_⋆^{(t)}; θ^{(t)}, d⋆) ] ) + 2 √( (|A| κ / (1−γ)) E[ L(w^{(t)}; θ^{(t)}, d^{(t)}) − L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ] ).
The proof is completed by substitution and using our assumptions on ε_stat and ε_bias.
The following proof for the NPG algorithm follows along similar lines.
Proof (of Theorem 29) Using the NPG regret lemma and our setting of η,
    E[ min_{t<T} { V^{π⋆}(ρ) − V^{(t)}(ρ) } ] ≤ (W/(1−γ)) √(2β log |A| / T) + (1/(1−γ)) E[ (1/T) Σ_{t=0}^{T−1} err_t ],
where the expectation is with respect to the sequence of iterates w(0) , w(1) , . . . w(T −1) .
Again, we make the following decomposition of err_t:
    err_t = E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ A^{(t)}(s,a) − w_⋆^{(t)} · ∇_θ log π^{(t)}(a|s) ] + E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · ∇_θ log π^{(t)}(a|s) ].
For the first term, concavity of the square root gives
    E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ A^{(t)}(s,a) − w_⋆^{(t)} · ∇_θ log π^{(t)}(a|s) ] ≤ √( L_A(w_⋆^{(t)}; θ^{(t)}, d⋆) ),
where we have used the definition of L_A(w_⋆^{(t)}; θ^{(t)}, d⋆) in the last step.
For the second term, a similar argument leads to:
    E_{s∼d_ρ^⋆, a∼π⋆(·|s)} [ ( w_⋆^{(t)} − w^{(t)} ) · ∇_θ log π^{(t)}(a|s) ] ≤ √( ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ^{(t)}_{d⋆}} ).
Define κ^{(t)} := ‖(Σ_ν^{(t)})^{−1/2} Σ_{d⋆}^{(t)} (Σ_ν^{(t)})^{−1/2}‖_2, which is the relative condition number at iteration t. We have
    ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_{d⋆}^{(t)}} ≤ ‖(Σ_ν^{(t)})^{−1/2} Σ_{d⋆}^{(t)} (Σ_ν^{(t)})^{−1/2}‖_2 ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_ν^{(t)}}
        ≤ (κ^{(t)}/(1−γ)) ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_{d^{(t)}}^{(t)}}
        ≤ (κ^{(t)}/(1−γ)) ( L_A(w^{(t)}; θ^{(t)}, d^{(t)}) − L_A(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ),
where the last step uses that w_⋆^{(t)} is a minimizer of L_A over W and that w^{(t)} is feasible, as before (see the proof of Theorem 20). Now taking an expectation we have:
    E[ ‖w_⋆^{(t)} − w^{(t)}‖^2_{Σ_{d⋆}^{(t)}} ] ≤ E[ (κ^{(t)}/(1−γ)) ( L_A(w^{(t)}; θ^{(t)}, d^{(t)}) − L_A(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ) ]
        = E[ (κ^{(t)}/(1−γ)) E[ L_A(w^{(t)}; θ^{(t)}, d^{(t)}) − L_A(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) | θ^{(t)} ] ]
        ≤ E[ (κ^{(t)}/(1−γ)) ε_stat ] ≤ κ ε_stat / (1−γ).
The proof is completed by substitution.
7. Discussion
This work provides a systematic study of the convergence properties of policy optimization techniques,
both in the tabular and the function approximation settings. At the core, our results imply that the non-
convexity of the policy optimization problem is not the fundamental challenge for typical variants of
the policy gradient approach. This is evidenced by the global convergence results which we establish
and that demonstrate the relative niceness of the underlying optimization problem. At the same
time, our results highlight that insufficient exploration can lead to the convergence to sub-optimal
policies, as is also observed in practice; technically, we show how this is an issue of conditioning.
Conversely, we can expect typical policy gradient algorithms to find the best policy from amongst
those whose state-visitation distribution is adequately aligned with the policies we discover, provided
a distribution-shifted notion of approximation error is small.
In the tabular case, our results show that the nature and severity of the exploration/distribution
mismatch term differs in different policy optimization approaches. For instance, we find that doing
policy gradient in its standard form for both the direct and softmax parameterizations can be slow to
converge, particularly in the face of distribution mismatch, even when policy gradients are computed
exactly. Natural policy gradient, on the other hand, enjoys a fast dimension-free convergence when
we are in tabular settings with exact gradients. On the other hand, for the function approximation
setting, or when using finite samples, all algorithms suffer to some degree from the exploration issue
captured through a conditioning effect.
With regards to function approximation, the guarantees herein are the first provable results
that permit average case approximation errors, where the guarantees do not have explicit worst
case dependencies over the state space. These worst case dependencies are avoided by precisely
characterizing an approximation/estimation error decomposition, where the relevant approximation
error is under distribution shift to an optimal policy's measure. Here, we see that successful
function approximation relies on two key aspects: good conditioning (related to exploration) and low
distribution-shifted, approximation error. In particular, these results identify the relevant measure of
the expressivity of a policy class, for the natural policy gradient.
With regards to sample size issues, we showed that simply using stochastic (projected) gradient
ascent suffices for accurate policy optimization. However, in terms of improving sample efficiency
and polynomial dependencies, there are a number of important questions for future research, including variance reduction techniques along with data re-use.
There are a number of compelling directions for further study. The first is in understanding
how to remove the density ratio guarantees among prior algorithms; our results are suggestive
that the incremental policy optimization approaches, including CPI (Kakade and Langford, 2002),
PSDP (Bagnell et al., 2004), and MD-MPI Geist et al. (2019), may permit such an improved analysis.
The question of understanding what representations are robust to distribution shift is well-motivated
by the nature of our distribution-shifted, approximation error (the transfer error). Finally, we hope
that policy optimization approaches can be combined with exploration approaches, so that, provably,
these approaches can retain their robustness properties (in terms of their agnostic learning guarantees)
while mitigating the need for a well conditioned initial starting distribution.
Acknowledgments
We thank the anonymous reviewers who provided detailed and constructive feedback that helped us
significantly improve the presentation and exposition. Sham Kakade and Alekh Agarwal gratefully
acknowledge numerous helpful discussions with Wen Sun with regards to the Q-NPG algorithm
and our notion of transfer error. We also acknowledge numerous helpful comments from Ching-An
Cheng and Andrea Zanette on an earlier draft of this work. We thank Nan Jiang, Bruno Scherrer, and
Matthieu Geist for their comments with regards to the relationship between concentrability coeffi-
cients, the condition number, and the transfer error; this discussion ultimately led to Corollary 21.
Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in
Data-intensive Discovery, the ONR award N00014-18-1-2247, and the DARPA award FA8650-18-2-
7836. Jason D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0303.
This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical
Research Council (EPSRC) under the Multidisciplinary University Research Initiative.
References
Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért
Weisz. POLITEX: Regret bounds for policy iteration using expert prediction. In International
Conference on Machine Learning, pages 3692–3702, 2019a.
Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced
politex. arXiv preprint arXiv:1908.10479, 2019b.
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin
Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning
Representations, 2018.
Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans, editors. Understanding the impact of entropy on policy optimization, 2019. URL https://arxiv.org/abs/1811.11214.
András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71
(1):89–129, 2008.
Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating mini-
mization and projection methods for nonconvex problems: An approach based on the kurdyka-
łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming.
J. Mach. Learn. Res., 13(1), November 2012. ISSN 1532-4435.
J. A. Bagnell, Sham M Kakade, Jeff G. Schneider, and Andrew Y. Ng. Policy search by dynamic
programming. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information
Processing Systems 16, pages 831–838. MIT Press, 2004.
J. Andrew Bagnell and Jeff Schneider. Covariant policy search. In Proceedings of the 18th Interna-
tional Joint Conference on Artificial Intelligence, IJCAI’03, pages 1019–1024, San Francisco, CA,
USA, 2003. Morgan Kaufmann Publishers Inc.
Keith Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1–58,
1997.
A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997.
Richard Bellman and Stuart Dreyfus. Functional approximations and dynamic programming. Mathe-
matical Tables and Other Aids to Computation, 13(68):247–251, 1959.
Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont,
MA, 1996.
Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. CoRR,
abs/1906.01786, 2019. URL http://arxiv.org/abs/1906.01786.
Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic
algorithms. Automatica, 45(11):2471–2482, 2009.
Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The łojasiewicz inequality for nonsmooth subana-
lytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization,
17(4):1205–1223, 2007.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape
saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning,
ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1724–1732, 2017.
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement
learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
Fritz John. Extremum problems with inequalities as subsidiary conditions. Interscience Publishers,
1948.
Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In
Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 267–274,
2002.
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-
gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.
Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information
processing systems, pages 1008–1014, 2000.
Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. CoRR, abs/1906.10306, 2019. URL http://arxiv.org/abs/1906.10306.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement
learning. In International conference on machine learning, pages 1928–1937, 2016.
Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567,
2003.
Rémi Munos. Error bounds for approximate value iteration. In AAAI, 2005.
Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method
efficiency in optimization. 1983.
Yurii Nesterov and Boris T. Polyak. Cubic regularization of newton method and its global perfor-
mance. Math. Program., pages 177–205, 2006.
Gergely Neu, Andras Antos, András György, and Csaba Szepesvári. Online markov decision
processes under bandit feedback. In Advances in Neural Information Processing Systems 23.
Curran Associates, Inc., 2010.
Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized markov
decision processes. CoRR, abs/1705.07798, 2017.
Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomput., 71(7-9):1180–1190, 2008. ISSN
0925-2312.
Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In Proceedings
of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2010), pages 1607–1612.
AAAI Press, 2010.
B. T. Polyak. Gradient methods for minimizing functionals. USSR Computational Mathematics and
Mathematical Physics, 3(4):864–878, 1963.
Aravind Rajeswaran, Kendall Lowrey, Emanuel V. Todorov, and Sham M Kakade. Towards general-
ization and simplicity in continuous control. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing
Systems 30, pages 6550–6561. Curran Associates, Inc., 2017.
Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy
iteration as boosted policy search. In Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pages 35–50. Springer, 2014.
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist.
Approximate modified policy iteration and its application to the game of tetris. Journal of Machine
Learning Research, 16:1629–1676, 2015.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends
in Machine Learning, 4(2):107–194, 2012.
Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps, 2019.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient meth-
ods for reinforcement learning with function approximation. In Advances in Neural Information
Processing Systems, volume 99, pages 1057–1063, 1999.
Csaba Szepesvári and Rémi Munos. Finite time bounds for sampling based fitted value iteration. In
Proceedings of the 22nd international conference on Machine learning, pages 880–887. ACM,
2005.
Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning
algorithms. Connection Science, 3(3):241–268, 1991.
Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features.
In International Conference on Machine Learning, pages 6995–7004, 2019.
Proof [of Lemma 2] Let Pr^π(τ | s_0 = s) denote the probability of observing a trajectory τ when starting in state s and following the policy π. Using a telescoping argument, we have:
    V^π(s) − V^{π'}(s) = E_{τ∼Pr^π(τ|s_0=s)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ] − V^{π'}(s)
    = E_{τ∼Pr^π(τ|s_0=s)} [ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) + V^{π'}(s_t) − V^{π'}(s_t) ) ] − V^{π'}(s)
    =(a) E_{τ∼Pr^π(τ|s_0=s)} [ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) + γ V^{π'}(s_{t+1}) − V^{π'}(s_t) ) ]
    =(b) E_{τ∼Pr^π(τ|s_0=s)} [ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) + γ E[V^{π'}(s_{t+1}) | s_t, a_t] − V^{π'}(s_t) ) ]
    = E_{τ∼Pr^π(τ|s_0=s)} [ Σ_{t=0}^∞ γ^t A^{π'}(s_t, a_t) ] = (1/(1−γ)) E_{s'∼d_s^π} E_{a∼π(·|s')} [ A^{π'}(s', a) ],
where (a) rearranges terms in the summation and cancels the V^{π'}(s_0) term with the −V^{π'}(s) outside the summation, and (b) uses the tower property of conditional expectations; the final equality follows from the definition of d_s^π.
Since we are working with the direct parameterization (see (2)), we drop the θ subscript. Define the gradient mapping
    G^η(π) := (1/η) ( P_{Δ(A)^{|S|}}(π + η ∇_π V^π(µ)) − π ),
so that the update rule for projected gradient ascent is π^+ = π + η G^η(π). If ‖G^η(π)‖_2 ≤ ε, then
    max_{π+δ ∈ Δ(A)^{|S|}, ‖δ‖_2 ≤ 1} δ^⊤ ∇_π V^{π^+}(µ) ≤ ε(ηβ + 1).
By a standard property of the gradient mapping, ∇_π V^{π^+}(µ) ∈ N_{Δ(A)^{|S|}}(π^+) + ε(ηβ + 1) B_2, where B_2 is the unit ℓ_2 ball and N_{Δ(A)^{|S|}} is the normal cone of the product simplex Δ(A)^{|S|}. Since ∇_π V^{π^+}(µ) is within distance ε(ηβ + 1) of the normal cone and δ is in the tangent cone, we have δ^⊤ ∇_π V^{π^+}(µ) ≤ ε(ηβ + 1).
    G^η(π) = (1/η) ( P_{Δ(A)^{|S|}}(π + η ∇_π V^{(t)}(µ)) − π ).
From Lemma 54, we have that V^π(s) is β-smooth for all states s (and hence V^π(µ) is also β-smooth) with β = 2γ|A|/(1−γ)^3. Then, from a standard result (Theorem 57), we have that for G^η(π) with step-size η = 1/β,
    min_{t=0,1,...,T−1} ‖G^η(π^{(t)})‖_2 ≤ √( 2β ( V^⋆(µ) − V^{(0)}(µ) ) ) / √T,
or, equivalently,
    T ≥ ( 32|S| β ( V^⋆(µ) − V^{(0)}(µ) ) / ((1−γ)^2 ε^2) ) ‖d_ρ^{π⋆}/µ‖_∞^2.
Using V^⋆(µ) − V^{(0)}(µ) ≤ 1/(1−γ) and β = 2γ|A|/(1−γ)^3 from Lemma 54 leads to the desired result.
For the MDP illustrated in Figure 2, the entries of this matrix are given in (27).
With this definition, we recall that the value function in the initial state s_0 is given by
    V^{π_θ}(s_0) = E_{τ∼π_θ} [ Σ_{t=0}^∞ γ^t r_t ] = e_0^⊤ (I − γ P^θ)^{−1} r,
where e0 is an indicator vector for the starting state s0 . From the form of the transition probabili-
ties (27), it is clear that the value function only depends on the parameters θs,a1 in any state s. While
care is needed for derivatives as the parameters across actions are related by the simplex feasibility
constraints, we have assumed each parameter is strictly positive, so that an infinitesimal change to
any parameter other than θs,a1 does not affect the policy value and hence the policy gradients. With
this understanding, we succinctly refer to θs,a1 as θs in any state s. We also refer to the state si
simply as i to reduce subscripts.
For convenience, we also define p̄ (resp. p) to be the largest (resp. smallest) of the probabilities
θs across the states s ∈ [1, H] in the MDP.
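The sketch below evaluates the closed-form expression V^{π_θ}(s_0) = e_0^⊤ (I − γP)^{−1} r on a toy chain MDP. The transition structure, state ordering, and reward placement here are illustrative assumptions and not the exact chain of Figure 2 (whose transition matrix (27) is not reproduced in this excerpt).

```python
import numpy as np

def chain_value(theta, gamma, H):
    """States 0..H+1; from state s in [1, H] move right w.p. theta[s], left otherwise;
    state H+1 is absorbing and is the only state yielding reward 1."""
    n = H + 2
    P = np.zeros((n, n))
    P[0, 1] = 1.0                        # the start state transitions into the chain
    P[H + 1, H + 1] = 1.0                # absorbing reward state
    for s in range(1, H + 1):
        P[s, s + 1] = theta[s]
        P[s, s - 1] = 1.0 - theta[s]
    r = np.zeros(n); r[H + 1] = 1.0
    M = np.linalg.inv(np.eye(n) - gamma * P)
    return M[0] @ r                      # e_0^T (I - gamma P)^{-1} r

H = 10
print(chain_value(theta=np.full(H + 2, 0.2), gamma=1 - 1 / (H + 1), H=H))
```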
In this section, we prove Proposition 6, that is: for 0 < θ < 1 (componentwise across states and actions) with p̄ ≤ 1/4, and for all k ≤ H/(40 log(2H)) − 1, we have ‖∇_θ^k V^{π_θ}(s_0)‖ ≤ (1/3)^{H/4}, where ∇_θ^k V^{π_θ}(s_0) is a tensor of the kth order. Furthermore, we seek to show V^⋆(s_0) − V^{π_θ}(s_0) ≥ (H+1)/8 − (H+1)^2/3^H (where θ^⋆ are the optimal policy's parameters).
It is easily checked that V^{π_θ}(s_0) = M^θ_{0,H+1}, where
    M^θ := (I − γ P^θ)^{−1},
since the only rewards are obtained in the state s_{H+1}. In order to bound the derivatives of the expected reward, we first establish some properties of the matrix M^θ.
Lemma 38 Suppose p̄ ≤ 1/4. Fix any α ∈ [ (1 − √(1 − 4γ^2 p̄(1−p))) / (2γ(1−p)),  min{ (1 + √(1 − 4γ^2 p̄(1−p))) / (2γ(1−p)), 1 } ]. Then
1. M^θ_{a,b} ≤ α^{b−a−1}/(1−γ) for 0 ≤ a ≤ b ≤ H.
2. M^θ_{a,H+1} ≤ (γp̄/(1−γ)) M^θ_{a,H} ≤ (γp̄/(1−γ)^2) α^{H−a} for 0 ≤ a ≤ H.
Proof Let ρ^k_{a,b} be the normalized discounted probability of reaching b in k steps when the initial state is a, that is,
    ρ^k_{a,b} := (1−γ) Σ_{i=0}^k [ (γP^θ)^i ]_{a,b},     (28)
where we recall the convention that U^0 is the identity matrix for any square matrix U. Observe that 0 ≤ ρ^k_{a,b} ≤ 1, and, based on the form (27) of P^θ, ρ^k_{a,b} satisfies a recursive relation, (29), for all k > 0.
Note that ρ^0_{a,b} = 0 for a ≠ b and ρ^0_{a,b} = 1−γ for a = b. Now let us inductively prove that for all k ≥ 0,
    ρ^k_{a,b} ≤ α^{b−a} for 1 ≤ a ≤ b ≤ H.     (30)
Clearly this holds for k = 0 since ρ^0_{a,b} = 0 for a ≠ b and ρ^0_{a,b} = 1−γ ≤ 1 = α^{b−a} for a = b. Now, assuming the bound for all steps up to k − 1, we prove it for k case by case.
For a = b the result follows since ρ^k_{a,b} ≤ 1 = α^{b−a}.
For 1 < b < H and a < b, observe that the recursion (29) and the inductive hypothesis imply the claimed bound, where the last inequality follows since α^2 γ(1−p) − α + γp̄ ≤ 0, because α lies within the roots of this quadratic equation. Note the discriminant term in the square root is non-negative provided p̄ ≤ 1/4, since this condition along with p ≤ p̄ ensures that 4γ^2 p̄(1−p) ≤ 1.
For b = H and a < H, a similar argument applies.
This proves the inductive claim (note that the cases of b = a = 1 and b = a = H are already handled in the first part above). Next, we prove that for all k ≥ 0,
    ρ^k_{0,b} ≤ α^{b−1}.
Clearly this holds for k = 0 and b ≠ 0 since ρ^0_{0,b} = 0. Furthermore, for all k ≥ 0 and b = 0,
    ρ^k_{0,b} ≤ 1 ≤ α^{b−1},
since α ≤ 1 by construction and b = 0. Now, we consider the only remaining case, k > 0 and b ∈ [1, H+1]. By (27), observe that
    [ (P^θ)^i ]_{0,b} = [ (P^θ)^{i−1} ]_{1,b}     (31)
for all i ≥ 1. Using the definition of ρ^k_{a,b} (28), for k > 0 and b ∈ [1, H+1],
    ρ^k_{0,b} = (1−γ) Σ_{i=0}^k [ (γP^θ)^i ]_{0,b} = (1−γ) [ (γP^θ)^0 ]_{0,b} + (1−γ) Σ_{i=1}^k [ (γP^θ)^i ]_{0,b}
        = 0 + (1−γ) Σ_{i=1}^k γ^i [ (P^θ)^i ]_{0,b}     (since b ≥ 1)
        = (1−γ) Σ_{i=1}^k γ^i [ (P^θ)^{i−1} ]_{1,b}     (using Equation (31))
        = (1−γ) γ Σ_{j=0}^{k−1} γ^j [ (P^θ)^j ]_{1,b}     (substituting j = i − 1)
        = γ ρ^{k−1}_{1,b}     (using Equation (28))
        ≤ α^{b−1}     (using Equation (30) and γ, α ≤ 1).
Since ρ^k_{a,b} → (1−γ) M^θ_{a,b} as k → ∞, the above bounds give
    M^θ_{a,b} ≤ α^{b−a}/(1−γ) ≤ α^{b−a−1}/(1−γ) for 1 ≤ a ≤ b ≤ H,   and   M^θ_{a,b} ≤ α^{b−a−1}/(1−γ) for 0 = a ≤ b ≤ H,
which completes the proof of the first part of the lemma.
For the second claim, use the recursion (29) with b = H+1 and a < H+1; rearranging the terms in the resulting bound yields the second claim in the lemma.
    | ∂^k M^θ_{0,H+1} / (∂θ_{β_1} · · · ∂θ_{β_k}) | ≤ p̄ 2^k γ^{k+1} k! α^{H−2k} / (1−γ)^{k+2}.
Proof Since the parameter θ is fixed throughout, we drop the superscript in M^θ for brevity. Using ∇_θ M = −M ∇_θ(I − γP^θ) M and the form of P^θ in (27), we get for any h ∈ [1, H]:
    −∂M_{a,b}/∂θ_h = −γ Σ_{i,j=0}^{H+1} M_{a,i} (∂P_{i,j}/∂θ_h) M_{j,b} = γ M_{a,h} ( M_{h−1,b} − M_{h+1,b} ),     (32)
where the second equality follows since P_{h,h+1} = θ_h and P_{h,h−1} = 1 − θ_h are the only two entries in the transition matrix which depend on θ_h for h ∈ [1, H].
Next, let us consider a kth order partial derivative of M_{0,H+1}, denoted as ∂^k M_{0,H+1}/∂θ_β for β ∈ [H]^k. Note that β can have repeated entries to capture higher order derivatives with respect to the same parameter. We prove by induction that for all k ≥ 1, −∂^k M_{0,H+1}/∂θ_β can be written as Σ_{n=1}^N c_n ζ_n, where
1. |c_n| = γ^k and N ≤ 2^k k!;
2. each monomial ζ_n is of the form M_{i_1,j_1} · · · M_{i_{k+1},j_{k+1}}, with i_1 = 0, j_{k+1} = H+1, j_l ≤ H and i_{l+1} = j_l ± 1 for all l ∈ [1, k].
The base case k = 1 follows from Equation (32), as we can write for any h ∈ [H]
    −∂M_{0,H+1}/∂θ_h = γ M_{0,h} M_{h−1,H+1} − γ M_{0,h} M_{h+1,H+1},
which satisfies both properties. For the inductive step, write ∂^k/∂θ_β = (∂/∂θ_{β_1}) ∂^{k−1}/∂θ_{β_{/1}}, where β_{/i} is the vector β with the ith entry removed. By the inductive hypothesis,
    −∂^{k−1} M_{0,H+1}/∂θ_{β_{/1}} = Σ_{n=1}^N c_n ζ_n,
where the coefficients and monomials satisfy the properties above with k replaced by k − 1.
In order to compute the kth derivative of M_{0,H+1}, we have to compute the derivative of each monomial ζ_n with respect to θ_{β_1}. Consider one of the monomials in the (k−1)th derivative, say ζ = M_{i_1,j_1} · · · M_{i_k,j_k}. We invoke the chain rule as before and replace one of the terms in ζ, say M_{i_m,j_m}, with γM_{i_m,β_1}M_{β_1−1,j_m} − γM_{i_m,β_1}M_{β_1+1,j_m}, using Equation (32). That is, the derivative of each entry gives rise to two monomials, and therefore the derivative of ζ leads to 2k monomials, each of which can be written in the form ζ' = M_{i'_1,j'_1} · · · M_{i'_{k+1},j'_{k+1}}, where (after appropriately reordering terms) (i'_m, j'_m) = (i_m, β_1) and (i'_{m+1}, j'_{m+1}) = (β_1 ± 1, j_m). This gives
    −∂^k M_{0,H+1} / (∂θ_{β_1} · · · ∂θ_{β_k}) = Σ_{n=0}^{N'} c'_n ζ'_n,
where
1. |c'_n| = γ|c_n| = γ^k, since as shown above each coefficient gets multiplied by ±γ;
2. N' ≤ 2k · 2^{k−1}(k−1)! = 2^k k!, since as shown above each monomial ζ leads to 2k monomials ζ';
3. each monomial ζ'_n is of the form M_{i_1,j_1} · · · M_{i_{k+1},j_{k+1}}, with i_1 = 0, j_{k+1} = H+1, j_l ≤ H and i_{l+1} = j_l ± 1 for all l ∈ [1, k].
It remains to bound each monomial:
    M_{i_1,j_1} · · · M_{i_{k+1},j_{k+1}} ≤ γ p̄ α^{H−2k} / (1−γ)^{k+2}.     (33)
We observe that it suffices to only consider pairs of indices i_l, j_l where i_l < j_l. Since |M_{i,j}| ≤ 1/(1−γ) for all i, j,
    Π_{l=1}^{k+1} M_{i'_l,j'_l} ≤ Π_{1≤l≤k: i'_l<j'_l} M_{i'_l,j'_l} · Π_{1≤l≤k: i'_l≥j'_l} (1/(1−γ)) · M_{i'_{k+1},j'_{k+1}}
    = Π_{1≤l≤k: i'_l<j'_l} M_{i'_l,j'_l} · Π_{1≤l≤k: i'_l≥j'_l} (1/(1−γ)) · M_{i'_{k+1},H+1}     (by the inductive claim shown above)
    ≤ ( α^{Σ_{1≤l≤k: i'_l<j'_l} (j'_l − i'_l − 1)} / (1−γ)^k ) · ( γ p̄ α^{H−i'_{k+1}} / (1−γ)^2 )     (using Lemma 38, parts 1 and 2, on the first and last terms respectively)
    ≤ γ p̄ α^{Σ_{1≤l≤k+1: i'_l<j'_l} (j'_l − i'_l − 1)} / (1−γ)^{k+2}.     (34)
The last step follows from H + 1 = j'_{k+1} ≥ i'_{k+1}. Note that
    Σ_{1≤l≤k+1: i'_l<j'_l} (j'_l − i'_l) ≥ Σ_{l=1}^{k+1} (j'_l − i'_l) = j'_{k+1} − i'_1 + Σ_{l=1}^{k} (j'_l − i'_{l+1}) ≥ H + 1 − k ≥ 0,
where the first inequality follows from adding only non-positive terms to the sum, the second equality follows from rearranging terms, and the third inequality follows from i'_1 = 0, j'_{k+1} = H + 1 and i'_{l+1} = j'_l ± 1 for all l ∈ [1, k]. Therefore,
    Σ_{1≤l≤k+1: i'_l<j'_l} (j'_l − i'_l − 1) ≥ H − 2k,
which combined with (34) and α ≤ 1 proves (33), and hence the lemma.
Recall that
    ∂^k V^{π_θ}(s_0) / (∂θ_{β_1} · · · ∂θ_{β_k}) = ∂^k M^θ_{0,H+1} / (∂θ_{β_1} · · · ∂θ_{β_k}).
Given vectors u^1, . . . , u^k which are unit vectors in R^H (we denote the unit sphere by S^H), the norm of this gradient tensor is given by:
    ‖∇_θ^k V^{π_θ}(s_0)‖ = max_{u^1,...,u^k ∈ S^H} Σ_{β∈[H]^k} ( ∂^k V^{π_θ}(s_0) / (∂θ_{β_1} · · · ∂θ_{β_k}) ) u^1_{β_1} · · · u^k_{β_k}
    ≤ max_{u^1,...,u^k ∈ S^H} √( Σ_{β∈[H]^k} ( ∂^k V^{π_θ}(s_0) / ∂θ_β )^2 ) √( Σ_{β∈[H]^k} ( u^1_{β_1} · · · u^k_{β_k} )^2 )
    = max_{u^1,...,u^k ∈ S^H} √( Σ_{β∈[H]^k} ( ∂^k V^{π_θ}(s_0) / ∂θ_β )^2 ) √( Π_{i=1}^k ‖u^i‖_2^2 )
    = √( Σ_{β∈[H]^k} ( ∂^k V^{π_θ}(s_0) / ∂θ_β )^2 ) = √( Σ_{β∈[H]^k} ( ∂^k M^θ_{0,H+1} / ∂θ_β )^2 )
    ≤ √( H^k p̄^2 2^{2k} γ^{2k+2} (k!)^2 α^{2H−4k} / (1−γ)^{2k+4} ),
where the last inequality follows from Lemma 39. In order to proceed further, we need an upper
bound on the smallest admissible value of α. To do so, let us consider all possible parameters θ such
that p̄ ≤ 1/4 in accordance with the theorem statement. In order to bound α, it suffices to place an
upper bound on the lower end of the range for α in Lemma 38 (note Lemma 38 holds for any choice
of α in the range). Doing so, we see that
    (1 − √(1 − 4γ^2 p̄(1−p))) / (2γ(1−p)) ≤ (1 − (1 − 2γ√(p̄(1−p)))) / (2γ(1−p)) = √(p̄/(1−p)) ≤ √(4p̄/3),
where the first inequality uses √(x − y) ≥ √x − √y (by the triangle inequality), while the last inequality uses p ≤ p̄ ≤ 1/4.
Using this bound on α in the expression above (together with 1/(1−γ) = H + 1 ≤ 2H, γ ≤ 1, p̄ ≤ 1, and k! ≤ H^k), we obtain
    ‖∇_θ^k V^{π_θ}(s_0)‖ ≤ √( (2H)^{2k+4} H^k 2^{2k} H^{2k} (4p̄/3)^{H−2k} ) = √( 2^{4k+4} H^{5k+4} (4p̄/3)^{H−2k} ).
One can verify that for k ≤ H/(40 log(2H)) − 1 and p̄ ≤ 1/4, the right hand side is at most (4p̄/3)^{H/4}. Thus, the norm of the gradient is bounded by (4p̄/3)^{H/4} ≤ (1/3)^{H/4} for all k ≤ H/(40 log(2H)) − 1 as long as p̄ ≤ 1/4, which gives the first part of the lemma.
For the second part, note that the optimal policy always chooses the action a_1, and gets a discounted reward of
    γ^{H+2}/(1−γ) = (H+1) (1 − 1/(H+1))^{H+2} ≥ (H+1)/8,
where the final inequality uses (1 − 1/x)^x ≥ 1/8 for x ≥ 1. On the other hand, the value of π_θ is upper bounded by
    M_{0,H+1} ≤ γ p̄ α^H/(1−γ)^2 ≤ (γ p̄/(1−γ)^2) (4p̄/3)^{H/2} ≤ (H+1)^2/3^H.
This gives the second part of the lemma.
    ∂V^{π_θ}(µ)/∂θ_{s,a} = (1/(1−γ)) d_µ^{π_θ}(s) π_θ(a|s) A^{π_θ}(s,a),
where the second to last step uses that, for any policy π, Σ_a π(a|s) A^π(s,a) = 0.
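The sketch below computes the softmax policy gradient expression above exactly, for a small randomly generated MDP, with all quantities (V, Q, the discounted visitation distribution) obtained in closed form; the random MDP and function names are illustrative only. The final check confirms that the per-state gradients sum to zero, as remarked above.

```python
import numpy as np

def softmax_pg(theta, P, r, mu, gamma):
    S, A = theta.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,sap->sp", pi, P)                     # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)       # V^pi
    Q = r + gamma * np.einsum("sap,p->sa", P, V)              # Q^pi
    d_mu = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)   # discounted visitation
    return d_mu[:, None] * pi * (Q - V[:, None]) / (1 - gamma)

rng = np.random.default_rng(5)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))                    # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))
grad = softmax_pg(np.zeros((S, A)), P, r, mu=np.full(S, 1 / S), gamma=0.9)
print(np.allclose(grad.sum(axis=1), 0.0))                     # gradients sum to zero per state
```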
Lemma 41 (Monotonic Improvement in V^{(t)}(s)) For all states s and actions a, for updates (36) with learning rate η ≤ (1−γ)^2/5, we have V^{(t+1)}(s) ≥ V^{(t)}(s) and Q^{(t+1)}(s,a) ≥ Q^{(t)}(s,a).
Proof It suffices to show that
    E_{a∼π^{(t+1)}(·|s)} [ A^{(t)}(s,a) ] ≥ 0     (37)
holds for all states s. To see this, observe that since the above holds for all states s', the performance difference lemma (Lemma 2) implies
    V^{(t+1)}(s) − V^{(t)}(s) = (1/(1−γ)) E_{s'∼d_s^{π^{(t+1)}}} E_{a∼π^{(t+1)}(·|s')} [ A^{(t)}(s',a) ] ≥ 0
(the claim for Q^{(t)} then follows immediately, since Q^{(t+1)}(s,a) = r(s,a) + γ E_{s'}[V^{(t+1)}(s')] ≥ r(s,a) + γ E_{s'}[V^{(t)}(s')] = Q^{(t)}(s,a)).
where c(s,a) is a constant, which we later set to be A^{(t)}(s,a); note we do not treat c(s,a) as a function of θ. Thus,
    ∂F_s(θ_s)/∂θ_{s,a} = π_{θ_s}(a|s) ( c(s,a) − Σ_{a'∈A} π_{θ_s}(a'|s) c(s,a') ).
Taking c(s,a) to be A^{(t)}(s,a) implies Σ_{a'∈A} π^{(t)}(a'|s) c(s,a') = Σ_{a'∈A} π^{(t)}(a'|s) A^{(t)}(s,a') = 0, so that
    ∂F_s(θ_s)/∂θ_{s,a} |_{θ_s = θ_s^{(t)}} = π^{(t)}(a|s) A^{(t)}(s,a).     (39)
Recall that for a β-smooth function, a gradient ascent step does not decrease the function value provided that the step size is at most 1/β (Theorem 57). Because F_s(θ_s) is β-smooth for β = 5/(1−γ) (by Lemma 52 and A^{(t)}(s,a) ≤ 1/(1−γ)), our assumption that
    η ≤ (1−γ)^2/5 = (1−γ) β^{−1}
implies that η (1/(1−γ)) d_µ^{π^{(t)}}(s) ≤ 1/β, and so we have
    F_s(θ_s^{(t+1)}) ≥ F_s(θ_s^{(t)}),
which implies (37).
Next, we show that the limits of the iterates V^{(t)}(s) and Q^{(t)}(s,a) exist for all states s and actions a.
Lemma 42 For all states s and actions a, there exist values V^{(∞)}(s) and Q^{(∞)}(s,a) such that, as t → ∞, V^{(t)}(s) → V^{(∞)}(s) and Q^{(t)}(s,a) → Q^{(∞)}(s,a). Define
    ∆ := min_{{s,a : A^{(∞)}(s,a) ≠ 0}} |A^{(∞)}(s,a)|,
where A^{(∞)}(s,a) := Q^{(∞)}(s,a) − V^{(∞)}(s). Furthermore, there exists a T_0 such that for all t > T_0, s ∈ S, and a ∈ A, we have
    Q^{(t)}(s,a) ≥ Q^{(∞)}(s,a) − ∆/4.     (40)
Proof Observe that Q^{(t+1)}(s,a) ≥ Q^{(t)}(s,a) (by Lemma 41) and Q^{(t)}(s,a) ≤ 1/(1−γ); therefore, by the monotone convergence theorem, Q^{(t)}(s,a) → Q^{(∞)}(s,a) for some constant Q^{(∞)}(s,a). Similarly it follows that V^{(t)}(s) → V^{(∞)}(s) for some constant V^{(∞)}(s). Because the limits exist, we can choose T_0 such that the result (40) follows.
Based on the limits V^{(∞)}(s) and Q^{(∞)}(s,a), define the following sets:
    I_0^s := {a : Q^{(∞)}(s,a) = V^{(∞)}(s)},
    I_+^s := {a : Q^{(∞)}(s,a) > V^{(∞)}(s)},
    I_−^s := {a : Q^{(∞)}(s,a) < V^{(∞)}(s)}.
In the following lemmas (Lemmas 44–50), we first show that the probabilities π^{(t)}(a|s) → 0 for actions a ∈ I_+^s ∪ I_−^s as t → ∞. We then show that for actions a ∈ I_−^s, lim_{t→∞} θ_{s,a}^{(t)} = −∞, and that for all actions a ∈ I_+^s, θ_{s,a}^{(t)} is bounded from below as t → ∞.
Lemma 43 There exists a T_1 such that for all t > T_1, s ∈ S, and a ∈ A, we have
    A^{(t)}(s,a) < −∆/4 for a ∈ I_−^s,   and   A^{(t)}(s,a) > ∆/4 for a ∈ I_+^s.     (41)
Proof Since V^{(t)}(s) → V^{(∞)}(s), there exists T_1 > T_0 such that for all t > T_1,
    V^{(t)}(s) > V^{(∞)}(s) − ∆/4.     (42)
Using Equation (40) and the monotonicity of Q^{(t)}, it follows that for t > T_1 > T_0 and a ∈ I_−^s,
    A^{(t)}(s,a) = Q^{(t)}(s,a) − V^{(t)}(s) ≤ Q^{(∞)}(s,a) − V^{(t)}(s) < Q^{(∞)}(s,a) − V^{(∞)}(s) + ∆/4 ≤ −∆ + ∆/4 < −∆/4.
Similarly, A^{(t)}(s,a) = Q^{(t)}(s,a) − V^{(t)}(s) > ∆/4 for a ∈ I_+^s, as
    A^{(t)}(s,a) ≥ Q^{(∞)}(s,a) − ∆/4 − V^{(∞)}(s) ≥ ∆ − ∆/4 > ∆/4,
using (40) and V^{(t)}(s) ≤ V^{(∞)}(s).
Lemma 44 ∂V^{(t)}(µ)/∂θ_{s,a} → 0 as t → ∞ for all states s and actions a. This implies that, for a ∈ I_+^s ∪ I_−^s, π^{(t)}(a|s) → 0, and that Σ_{a∈I_0^s} π^{(t)}(a|s) → 1.
Proof Because V^{π_θ}(µ) is smooth (Lemma 55) as a function of θ, it follows from standard optimization results (Theorem 57) that ∂V^{(t)}(µ)/∂θ_{s,a} → 0 for all states s and actions a. We have from Lemma 40
    ∂V^{(t)}(µ)/∂θ_{s,a} = (1/(1−γ)) d_µ^{π^{(t)}}(s) π^{(t)}(a|s) A^{(t)}(s,a).
Since |A^{(t)}(s,a)| > ∆/4 for all t > T_1 (from Lemma 43) for all a ∈ I_−^s ∪ I_+^s, and d_µ^{π^{(t)}}(s) ≥ (1−γ)µ(s) > 0 (using the strict positivity of µ in our assumption in Theorem 10), we have π^{(t)}(a|s) → 0.
Lemma 45 (Monotonicity in θ_{s,a}^{(t)}) For all a ∈ I_+^s, θ_{s,a}^{(t)} is strictly increasing for t ≥ T_1. For all a ∈ I_−^s, θ_{s,a}^{(t)} is strictly decreasing for t ≥ T_1.
Proof Since d_µ^{π^{(t)}}(s) > 0 and π^{(t)}(a|s) > 0 for the softmax parameterization, Lemma 40 together with Lemma 43 implies that, for all t > T_1, ∂V^{(t)}(µ)/∂θ_{s,a} > 0 for a ∈ I_+^s and ∂V^{(t)}(µ)/∂θ_{s,a} < 0 for a ∈ I_−^s, which gives the claim.
Lemma 46 For all s where I_+^s ≠ ∅, we have that
    max_{a∈I_0^s} θ_{s,a}^{(t)} → ∞   and   min_{a∈A} θ_{s,a}^{(t)} → −∞,   as t → ∞.
Proof From Lemma 44, Σ_{a∈I_0^s} π^{(t)}(a|s) → 1, or equivalently,
    Σ_{a∈I_0^s} exp(θ_{s,a}^{(t)}) / Σ_{a∈A} exp(θ_{s,a}^{(t)}) → 1,   as t → ∞,
which implies
    max_{a∈I_0^s} θ_{s,a}^{(t)} → ∞,   as t → ∞.
Note this also implies max_{a∈A} θ_{s,a}^{(t)} → ∞. The last part of the proof is completed using that the gradients sum to 0, i.e. Σ_a ∂V^{(t)}(µ)/∂θ_{s,a} = 0. From the gradients summing to 0, we get that Σ_{a∈A} θ_{s,a}^{(t)} = Σ_{a∈A} θ_{s,a}^{(0)} := c for all t > 0, where c is defined as the sum (over A) of the initial parameters. That is, min_{a∈A} θ_{s,a}^{(t)} < −(1/|A|) max_{a∈A} θ_{s,a}^{(t)} + c. Since max_{a∈A} θ_{s,a}^{(t)} → ∞, the result follows.
Lemma 47 Suppose a_+ ∈ I_+^s. For any a ∈ I_0^s, if there exists a t ≥ T_0 such that π^{(t)}(a|s) ≤ π^{(t)}(a_+|s), then for all τ ≥ t, π^{(τ)}(a|s) ≤ π^{(τ)}(a_+|s).
Proof The proof is inductive. Suppose π^{(t)}(a|s) ≤ π^{(t)}(a_+|s); this implies, from Lemma 40,
    ∂V^{(t)}(µ)/∂θ_{s,a} = (1/(1−γ)) d_µ^{(t)}(s) π^{(t)}(a|s) ( Q^{(t)}(s,a) − V^{(t)}(s) )
        ≤ (1/(1−γ)) d_µ^{(t)}(s) π^{(t)}(a_+|s) ( Q^{(t)}(s,a_+) − V^{(t)}(s) ) = ∂V^{(t)}(µ)/∂θ_{s,a_+},
where the second-to-last step follows from Q^{(t)}(s,a_+) ≥ Q^{(∞)}(s,a_+) − ∆/4 ≥ Q^{(∞)}(s,a) + ∆ − ∆/4 > Q^{(t)}(s,a) for t > T_0. This implies that π^{(t+1)}(a|s) ≤ π^{(t+1)}(a_+|s), which completes the proof.
For a_+ ∈ I_+^s, define B̄_0^s(a_+) := {a ∈ I_0^s : π^{(t)}(a|s) ≤ π^{(t)}(a_+|s) for some t > T_0} and B_0^s(a_+) := I_0^s \ B̄_0^s(a_+); when a_+ is clear from context we simply write B̄_0^s and B_0^s.
Proof Let a_+ ∈ I_+^s. Consider any a ∈ B̄_0^s. Then, by definition of B̄_0^s, there exists t_0 > T_0 such that π^{(t_0)}(a_+|s) ≥ π^{(t_0)}(a|s). From Lemma 47, for all τ > t_0, π^{(τ)}(a_+|s) ≥ π^{(τ)}(a|s). Also, since π^{(τ)}(a_+|s) → 0 (by Lemma 44, as a_+ ∈ I_+^s), it follows that π^{(τ)}(a|s) → 0 for all a ∈ B̄_0^s.
Since B_0^s ∪ B̄_0^s = I_0^s and Σ_{a∈I_0^s} π^{(t)}(a|s) → 1 (from Lemma 44), this implies that B_0^s ≠ ∅ and that
    Σ_{a∈B_0^s} π^{(t)}(a|s) → 1,   as t → ∞.
Proof The proof follows from the definition of B̄_0^s(a_+). That is, if a ∈ B̄_0^s(a_+), then there exists an iteration t_a > T_0 such that π^{(t_a)}(a_+|s) > π^{(t_a)}(a|s). Then, using Lemma 47, for all τ > t_a, π^{(τ)}(a_+|s) > π^{(τ)}(a|s). Choosing
    T_{a_+} := max_{a∈B̄_0^s(a_+)} t_a
completes the proof.
Proof For the first claim, from Lemma 45, we know that after T1, θ^(t)_{s,a} is strictly increasing for a ∈ I_+^s, i.e. for all t > T1,
θ^(t)_{s,a} ≥ θ^(T1)_{s,a}.
For the second claim, we know that after T1, θ^(t)_{s,a} is strictly decreasing for a ∈ I_−^s (Lemma 45). Therefore, by the monotone convergence theorem, lim_{t→∞} θ^(t)_{s,a} exists and is either −∞ or some constant θ_0. We now prove the second claim by contradiction. Suppose a ∈ I_−^s and that there exists a θ_0 such that θ^(t)_{s,a} > θ_0 for all t ≥ T1. By Lemma 46, there must exist an action a' ∈ A such that
lim inf_{t→∞} θ^(t)_{s,a'} = −∞.   (43)
Let us consider some δ > 0 such that θs,a10 ≥ θ0 − δ. Now for t ≥ T1 define τ (t) as follows:
(k)
τ (t) = k if k is the largest iteration in the interval [T1 , t] such that θs,a0 ≥ θ0 − δ (i.e. τ (t) is
the latest iteration before θs,a0 crosses below θ0 − δ). Define T (t) as the subsequence of iterations
(t0 )
τ (t) < t0 < t such that θs,a0 decreases, i.e.
0
∂V (t ) (µ)
≤ 0, for τ (t) < t0 < t.
∂θs,a0
Define
Z_t := Σ_{t'∈T^(t)} ∂V^(t')(µ)/∂θ_{s,a'}.
Since iterations outside T^(t) can only increase θ_{s,a'}, the gradient ascent update gives
θ^(t)_{s,a'} ≥ θ^(τ(t))_{s,a'} − η/(1−γ) + η Z_t ≥ θ_0 − δ − η/(1−γ) + η Z_t,
where we have used that |∂V^(t')(µ)/∂θ_{s,a'}| ≤ 1/(1−γ). By (43), this implies that
lim inf_{t→∞} Z_t = −∞.
For any T^(t) ≠ ∅, this implies that for all t' ∈ T^(t), from Lemma 40,
(∂V^(t')(µ)/∂θ_{s,a}) / (∂V^(t')(µ)/∂θ_{s,a'}) = (π^(t')(a|s) A^(t')(s, a)) / (π^(t')(a'|s) A^(t')(s, a'))
  ≥ exp(θ_0 − θ^(t')_{s,a'}) (1−γ)∆/4
  ≥ exp(δ) (1−γ)∆/4,
where we have used that |A^(t')(s, a')| ≤ 1/(1−γ) and |A^(t')(s, a)| ≥ ∆/4 for all t' > T1 (from Lemma 43). Note that since ∂V^(t')(µ)/∂θ_{s,a} < 0 and ∂V^(t')(µ)/∂θ_{s,a'} < 0 over the subsequence T^(t), the sign of the inequality reverses upon multiplying through by ∂V^(t')(µ)/∂θ_{s,a'}. In particular, for any T^(t) ≠ ∅,
(1/η)(θ^(t)_{s,a} − θ^(T1)_{s,a}) = Σ_{t'=T1}^{t−1} ∂V^(t')(µ)/∂θ_{s,a} ≤ Σ_{t'∈T^(t)} ∂V^(t')(µ)/∂θ_{s,a}
  ≤ exp(δ) (1−γ)∆/4 Σ_{t'∈T^(t)} ∂V^(t')(µ)/∂θ_{s,a'}
  = exp(δ) (1−γ)∆/4 · Z_t,
where the first inequality follows from the fact that θ^(t)_{s,a} is monotonically decreasing, i.e. ∂V^(t')(µ)/∂θ_{s,a} < 0 for t' ∉ T^(t) (Lemma 45). Since
lim inf_{t→∞} Z_t = −∞,
this contradicts the assumption that θ^(t)_{s,a} is bounded from below, which completes the proof.
Proof Consider any a ∈ B_0^s. We have by the definition of B_0^s that π^(t)(a+|s) < π^(t)(a|s) for all t > T_0. This implies by the softmax parameterization that θ^(t)_{s,a+} < θ^(t)_{s,a}. Since θ^(t)_{s,a+} is lower bounded as t → ∞ (using Lemma 50), this implies that θ^(t)_{s,a} is lower bounded as t → ∞ for all a ∈ B_0^s. This, in conjunction with max_{a∈B_0^s(a+)} θ^(t)_{s,a} → ∞, implies
Σ_{a∈B_0^s} θ^(t)_{s,a} → ∞.   (44)
We are now ready to complete the proof of Theorem 10. We prove it by showing that I_+^s is empty for all states s; suppose instead that a+ ∈ I_+^s for some state s. Then, for all t > T_3,
−(∆/16) π^(t)(a+|s) < Σ_{a∈B̄_0^s} π^(t)(a|s) A^(t)(s, a) < (∆/16) π^(t)(a+|s).   (47)
Using Σ_{a∈A} π^(t)(a|s) A^(t)(s, a) = 0, we have
0 = Σ_{a∈I_0^s} π^(t)(a|s) A^(t)(s, a) + Σ_{a∈I_+^s} π^(t)(a|s) A^(t)(s, a) + Σ_{a∈I_−^s} π^(t)(a|s) A^(t)(s, a)
 (a) ≥ Σ_{a∈B_0^s} π^(t)(a|s) A^(t)(s, a) + Σ_{a∈B̄_0^s} π^(t)(a|s) A^(t)(s, a) + π^(t)(a+|s) A^(t)(s, a+) + Σ_{a∈I_−^s} π^(t)(a|s) A^(t)(s, a)
 (b) ≥ Σ_{a∈B_0^s} π^(t)(a|s) A^(t)(s, a) + Σ_{a∈B̄_0^s} π^(t)(a|s) A^(t)(s, a) + (∆/4) π^(t)(a+|s) − Σ_{a∈I_−^s} π^(t)(a|s)/(1−γ)
 (c) > Σ_{a∈B_0^s} π^(t)(a|s) A^(t)(s, a) − (∆/16) π^(t)(a+|s) + (∆/4) π^(t)(a+|s) − (∆/16) π^(t)(a+|s)
  > Σ_{a∈B_0^s} π^(t)(a|s) A^(t)(s, a),
where in step (a) we used A^(t)(s, a) > 0 for all actions a ∈ I_+^s for t > T_3 > T_1 from Lemma 43; in step (b) we used A^(t)(s, a+) ≥ ∆/4 for t > T_3 > T_1 from Lemma 43 and A^(t)(s, a) ≥ −1/(1−γ); and in step (c) we used Equation (46) and the left inequality in (47). This implies that for all t > T_3,
Σ_{a∈B_0^s} ∂V^(t)(µ)/∂θ_{s,a} < 0,
i.e. Σ_{a∈B_0^s} θ^(t)_{s,a} is strictly decreasing for all t > T_3, which contradicts Equation (44).
and if ‖∇_θ L_λ(θ)‖_2 ≤ λ/(2|S| |A|). In order to complete the proof, we need to bound the iteration complexity of making the gradient sufficiently small. Since the optimization is deterministic and unconstrained, we can appeal to standard results (Theorem 57), which give that after T iterations of gradient ascent with stepsize 1/β_λ, we have
Let w⋆_θ be the minimizer of L_θ(w) with the smallest ℓ_2 norm. Then, by the definition of the Moore–Penrose pseudoinverse, it is easily seen that
w⋆_θ = F_ρ(θ)† E_{s∼d^{πθ}_ρ, a∼πθ(·|s)} [∇_θ log πθ(a|s) A^{πθ}(s, a)] = (1−γ) F_ρ(θ)† ∇_θ V^{πθ}(ρ).   (50)
In other words, w⋆_θ is precisely proportional to the NPG update direction. Note further that, for the softmax policy parameterization, we have by (35),
w^⊤ ∇_θ log πθ(a|s) = w_{s,a} − Σ_{a'∈A} w_{s,a'} πθ(a'|s).
Since Σ_{a∈A} π(a|s) A^π(s, a) = 0, this immediately yields that L_θ(A^{πθ}) = 0. However, A^{πθ} might not be the unique minimizer of L_θ, which is problematic since w⋆(θ), as defined in terms of the Moore–Penrose pseudoinverse, is formally the smallest-norm solution to the least-squares problem, which A^{πθ} may not be. However, given any vector v ∈ R^{|S||A|}, let us consider solutions of the form A^{πθ} + v. Due to the form of the derivatives of the policy for the softmax parameterization (recall Equation 35), we have for any state s and action a such that s is reachable under ρ,
v^⊤ ∇_θ log πθ(a|s) = Σ_{a'∈A} (v_{s,a'} 1[a = a'] − v_{s,a'} πθ(a'|s)) = v_{s,a} − Σ_{a'∈A} v_{s,a'} π(a'|s).
Note that here we have used that πθ is a stochastic policy with πθ(a|s) > 0 for all actions a in each state s, so that if a state is reachable under ρ, it will also be reachable using πθ, and hence the zero-derivative conditions apply at each reachable state. For A^{πθ} + v to minimize L_θ, we would like v^⊤ ∇_θ log πθ(a|s) = 0 for all s, a, so that, by the above equality, v_{s,a} is independent of the action and can be written as a constant c_s for each s. Hence, the minimizer of L_θ(w) is determined up to a state-dependent offset, and
F_ρ(θ)† ∇_θ V^{πθ}(ρ) = A^{πθ}/(1−γ) + v,
where v_{s,a} = c_s for some c_s ∈ R for each state s and action a. Finally, we observe that this yields the updates
π^(t+1)(a|s) = π^(t)(a|s) exp(η A^(t)(s, a)/(1−γ) + η c_s) / Z_t(s).
Owing to the normalization factor Z_t(s), the state-dependent offset c_s cancels in the update for π, so that the resulting policy is invariant to the specific choice of c_s. Hence, we pick c_s ≡ 0, which yields the statement of the lemma.
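A minimal numerical sketch of this last observation (our illustration, not the paper's code) is given below: the softmax NPG update π^(t+1)(a|s) ∝ π^(t)(a|s) exp(ηA^(t)(s, a)/(1−γ) + ηc_s) is unchanged by the state-dependent offset c_s, since exp(ηc_s) is absorbed by the per-state normalizer Z_t(s). The arrays pi, Adv, and c are arbitrary placeholders.

```python
# Sketch (ours): the state-dependent offset c_s cancels in the softmax NPG update
# because exp(eta*c_s) is absorbed by the per-state normalizer Z_t(s).
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, eta = 5, 4, 0.9, 0.1
pi = rng.dirichlet(np.ones(A), size=S)            # current policy, rows sum to 1
Adv = rng.normal(size=(S, A))                     # some advantage values
c = rng.normal(size=S)                            # arbitrary state-dependent offsets

def npg_update(pi, Adv, offset):
    w = Adv / (1 - gamma) + offset[:, None]       # update direction A/(1-gamma) + c_s
    new = pi * np.exp(eta * w)
    return new / new.sum(axis=1, keepdims=True)   # normalization Z_t(s)

print(np.allclose(npg_update(pi, Adv, c), npg_update(pi, Adv, np.zeros(S))))  # True
```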
For vectors x and y, let x ⊙ y denote the entrywise (Hadamard) product, [x ⊙ y]_i = x_i y_i, and define diag(x) for a column vector x as the diagonal matrix with x on its diagonal.
Lemma 52 (Smoothness of F; see Equation 38) Fix a state s. Let θ_s ∈ R^{|A|} be the column vector of parameters for state s, and let πθ(·|s) be the corresponding vector of action probabilities given by the softmax parameterization. For some fixed vector c ∈ R^{|A|}, define
F(θ) := πθ(·|s) · c = Σ_a πθ(a|s) c_a.
Then
‖∇_{θ_s} F(θ_s) − ∇_{θ_s} F(θ'_s)‖_2 ≤ β ‖θ_s − θ'_s‖_2,
where β = 5‖c‖_∞.
Proof For notational convenience, we do not explicitly state the s dependence. For the softmax
parameterization, we have that
∇_θ πθ = diag(πθ) − πθ πθ^⊤,
and therefore
∇²_θ(πθ · c) = ∇_θ(πθ ⊙ c − (πθ · c) πθ).
Expanding term by term, we get
∇²_θ(πθ · c) = diag(πθ ⊙ c) − πθ(πθ ⊙ c)^⊤ − (πθ · c) ∇_θ πθ − (∇_θ(πθ · c)) πθ^⊤.   (52)
Note that ‖diag(πθ ⊙ c)‖_2 ≤ ‖c‖_∞, ‖πθ(πθ ⊙ c)^⊤‖_2 ≤ ‖c‖_∞, ‖(πθ · c) ∇_θ πθ‖_2 ≤ ‖c‖_∞, and ‖(∇_θ(πθ · c)) πθ^⊤‖_2 ≤ 2‖c‖_∞, which gives
‖∇²_θ(πθ · c)‖_2 ≤ 5‖c‖_∞.
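The bound in this proof can be sanity-checked numerically. The sketch below (ours, with arbitrary test vectors) assembles the Hessian of F(θ) = Σ_a πθ(a) c_a from the expression in Equation (52), validates it against a finite-difference Hessian, and compares its spectral norm with 5‖c‖_∞.

```python
# Numerical sanity check (ours, with arbitrary test vectors) of the bound
# ||Hessian of sum_a pi_theta(a) c_a||_2 <= 5 * ||c||_inf.
import numpy as np

rng = np.random.default_rng(2)
A = 6
theta = rng.normal(size=A)
c = rng.normal(size=A)

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def grad_F(t):
    p = softmax(t)
    return p * c - (p @ c) * p                    # (diag(p) - p p^T) c

p = softmax(theta)
dpi = np.diag(p) - np.outer(p, p)                 # gradient of pi w.r.t. theta
H = (np.diag(p * c) - np.outer(p, p * c)
     - (p @ c) * dpi - np.outer(dpi @ c, p))      # Equation (52)

# finite-difference Hessian from the analytic gradient
eps = 1e-6
H_num = np.stack([(grad_F(theta + eps * e) - grad_F(theta - eps * e)) / (2 * eps)
                  for e in np.eye(A)], axis=1)

print("formula vs numeric:", np.allclose(H, H_num, atol=1e-5))
print("spectral norm:", np.linalg.norm(H, 2), "<= 5*||c||_inf =", 5 * np.abs(c).max())
```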
Before we prove the smoothness results for ∇_π V^π(s_0) and ∇_θ V^{πθ}(s_0), we prove the following helpful lemma. This lemma is general and not specific to the direct or softmax policy parameterizations.
Lemma 53 Let π_α := π_{θ+αu} and let Ṽ(α) be the corresponding value at a fixed state s_0, i.e.
Ṽ(α) := V^{π_α}(s_0).
Assume that
Σ_{a∈A} |dπ_α(a|s_0)/dα |_{α=0}| ≤ C_1,   Σ_{a∈A} |d²π_α(a|s_0)/(dα)² |_{α=0}| ≤ C_2.
Then
max_{‖u‖_2=1} |d²Ṽ(α)/(dα)² |_{α=0}| ≤ C_2/(1−γ)² + 2γC_1²/(1−γ)³.
Proof Consider a unit vector u, and let P̃(α) be the state-action transition matrix under π_α, i.e.
[P̃(α)]_{(s,a),(s',a')} := P(s'|s, a) π_α(a'|s'),
and therefore
max_{‖u‖_2=1} [ dP̃(α)/dα |_{α=0} x ]_{s,a} = max_{‖u‖_2=1} Σ_{a',s'} [dπ_α(a'|s')/dα]_{α=0} P(s'|s, a) x_{a',s'}
  ≤ Σ_{a',s'} P(s'|s, a) |[dπ_α(a'|s')/dα]_{α=0}| |x_{a',s'}|
  ≤ Σ_{s'} P(s'|s, a) ‖x‖_∞ Σ_{a'} |[dπ_α(a'|s')/dα]_{α=0}|
  ≤ Σ_{s'} P(s'|s, a) ‖x‖_∞ C_1
  ≤ C_1 ‖x‖_∞.
By the definition of the ℓ_∞ norm, it follows that
max_{‖u‖_2=1} ‖ dP̃(α)/dα |_{α=0} x ‖_∞ ≤ C_1 ‖x‖_∞,
and similarly
max_{‖u‖_2=1} ‖ d²P̃(α)/(dα)² |_{α=0} x ‖_∞ ≤ C_2 ‖x‖_∞.
Let Q^α(s_0, a_0) be the corresponding Q-function for policy π_α at state s_0 and action a_0. Observe that Q^α(s_0, a_0) can be written as
Q^α(s_0, a_0) = e_{(s_0,a_0)}^⊤ (Id − γ P̃(α))^{−1} r,
where r is the vector of rewards r(s, a) and e_{(s_0,a_0)} is the indicator vector for the state-action pair (s_0, a_0).
Using this lemma, we now establish smoothness for the value functions under the direct policy parameterization and for the log barrier regularized objective (12) under the softmax parameterization.
We now present a smoothness result for the entropy-regularized policy optimization problem which we study for the softmax parameterization.
Lemma 55 (Smoothness for log barrier regularized softmax) For the softmax parameterization and
L_λ(θ) = V^{πθ}(µ) + (λ/(|S| |A|)) Σ_{s,a} log πθ(a|s),
we have that
‖∇_θ L_λ(θ) − ∇_θ L_λ(θ')‖_2 ≤ β_λ ‖θ − θ'‖_2,
where
β_λ = 8/(1−γ)³ + 2λ/|S|.
Proof Let us first bound the smoothness of V^{πθ}(µ). Consider a unit vector u, and let θ_s ∈ R^{|A|} denote the parameters associated with a given state s. We have
∇_{θ_s} πθ(a|s) = πθ(a|s) (e_a − π(·|s))
and
∇²_{θ_s} πθ(a|s) = πθ(a|s) (e_a e_a^⊤ − e_a π(·|s)^⊤ − π(·|s) e_a^⊤ + 2π(·|s)π(·|s)^⊤ − diag(π(·|s))),
where e_a is a standard basis vector and π(·|s) is the vector of action probabilities at state s. Differentiating π_α(a|s) once w.r.t. α, we have
Σ_{a∈A} |dπ_α(a|s)/dα |_{α=0}| ≤ Σ_{a∈A} |u^⊤ ∇_{θ+αu} π_α(a|s) |_{α=0}|
  ≤ Σ_{a∈A} πθ(a|s) |u_s^⊤ e_a − u_s^⊤ π(·|s)|
  ≤ max_{a∈A} (|u_s^⊤ e_a| + |u_s^⊤ π(·|s)|) ≤ 2,
and, differentiating twice w.r.t. α and using the expression for ∇²_{θ_s} πθ(a|s) above,
Σ_{a∈A} |d²π_α(a|s)/(dα)² |_{α=0}| ≤ 6.
Applying Lemma 53 with C_1 = 2 and C_2 = 6,
max_{‖u‖_2=1} |d²Ṽ(α)/(dα)² |_{α=0}| ≤ C_2/(1−γ)² + 2γC_1²/(1−γ)³ ≤ 6/(1−γ)² + 8γ/(1−γ)³ ≤ 8/(1−γ)³,
or equivalently, since this holds for all starting states s and hence for all starting state distributions µ,
‖∇²_θ V^{πθ}(µ)‖_2 ≤ 8/(1−γ)³,
i.e. V^{πθ}(µ) is 8/(1−γ)³-smooth. Next, we bound the smoothness of the regularizer
R(θ) := (1/|A|) Σ_{s,a} log πθ(a|s).
We have
∂R(θ)/∂θ_{s,a} = 1/|A| − πθ(a|s).
Equivalently,
∇_{θ_s} R(θ) = (1/|A|) 1 − πθ(·|s).
Hence,
∇²_{θ_s} R(θ) = −diag(πθ(·|s)) + πθ(·|s) πθ(·|s)^⊤.
For any vector u_s,
|u_s^⊤ ∇²_{θ_s} R(θ) u_s| = |u_s^⊤ diag(πθ(·|s)) u_s − (u_s · πθ(·|s))²| ≤ 2‖u_s‖²_∞,
and therefore
|u^⊤ ∇²_θ R(θ) u| = |Σ_s u_s^⊤ ∇²_{θ_s} R(θ) u_s| ≤ 2 Σ_s ‖u_s‖²_∞ ≤ 2‖u‖²_2.
Thus R is 2-smooth and (λ/|S|) R is (2λ/|S|)-smooth, which completes the proof.
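As a quick numerical illustration of this last step (ours, with arbitrary parameters): the per-state Hessian blocks of R are −diag(πθ(·|s)) + πθ(·|s)πθ(·|s)^⊤, the full Hessian is block diagonal across states, and the largest block norm stays below 2, consistent with the 2-smoothness bound established above.

```python
# Quick check (ours): the per-state Hessian block of the regularizer R is
# -diag(pi(.|s)) + pi(.|s) pi(.|s)^T; its operator norm never exceeds 2
# (in fact it is at most 1), consistent with 2-smoothness of R.
import numpy as np

rng = np.random.default_rng(3)
S, A = 7, 5
theta = rng.normal(size=(S, A))
pi = np.exp(theta - theta.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

worst = 0.0
for s in range(S):
    H_s = -np.diag(pi[s]) + np.outer(pi[s], pi[s])   # per-state Hessian block
    worst = max(worst, np.linalg.norm(H_s, 2))
# The Hessian of R is block diagonal in the states, so its norm equals the
# largest block norm.
print("largest per-state Hessian norm:", worst)
```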
Consider the problem
min_{x∈C} f(x),   (54)
with C being a nonempty closed and convex set. We assume the following.
Assumption E.1 f : R^d → (−∞, ∞] is proper and closed, dom(f) is convex, and f is β-smooth over int(dom(f)).
Define the gradient mapping
G_η(x) := (1/η) (x − P_C(x − η∇f(x))),   (55)
where P_C denotes the Euclidean projection onto C.
Theorem 57 (Theorem 10.15, Beck (2017)) Suppose that Assumption E.1 holds, and let {x_k}_{k≥0} be the sequence generated by the gradient descent algorithm for solving problem (54) with stepsize η = 1/β. Then,
2. G_η(x_t) → 0 as t → ∞;
3. min_{t=0,1,...,T−1} ‖G_η(x_t)‖ ≤ √(2β(f(x_0) − f(x*))) / √T.
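The guarantee of Theorem 57 can be observed on a toy instance. The sketch below (our illustration; the objective, constraint set, and constants are assumptions for this example) runs projected gradient descent with the gradient mapping G_η from (55) on a β-smooth quadratic over a box and compares min_t ‖G_η(x_t)‖ with the √(2β(f(x_0) − f(x*)))/√T rate.

```python
# Small sketch (ours) of projected gradient descent with the gradient mapping
# G_eta from (55) on a beta-smooth objective over a box; all choices are
# assumptions made only for this demonstration.
import numpy as np

d, beta = 10, 2.0
z = 3.0 * np.ones(d)                        # f(x) = ||x - z||^2, so beta = 2
f = lambda x: np.sum((x - z) ** 2)
grad = lambda x: 2.0 * (x - z)
proj = lambda x: np.clip(x, -1.0, 1.0)      # C = {x : ||x||_inf <= 1}

eta = 1.0 / beta
G = lambda x: (x - proj(x - eta * grad(x))) / eta   # gradient mapping (55)

x, T = np.zeros(d), 200
norms = []
for t in range(T):
    norms.append(np.linalg.norm(G(x)))
    x = proj(x - eta * grad(x))             # projected gradient step

x_star = proj(z)                            # constrained minimizer of f over C
bound = np.sqrt(2 * beta * (f(np.zeros(d)) - f(x_star))) / np.sqrt(T)
print(f"min_t ||G_eta(x_t)|| = {min(norms):.4f} <= bound {bound:.4f}")
```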
Theorem 58 (Lemma 3, Ghadimi and Lan (2016)) Suppose that Assumption E.1 holds. Let x^+ = x − η G_η(x). Then,
where B_2 is the unit ℓ_2 ball, and N_C is the normal cone of the set C.
We now consider the stochastic projected gradient descent algorithm, where at each time step t we update x_t by sampling a random vector v_t such that E[v_t | x_t] ∈ ∂f(x_t) and setting x_{t+1} = P_C(x_t − η v_t).
Theorem 59 (Theorem 14.8 and Lemma 14.9, Shalev-Shwartz and Ben-David (2014)) Assume C = {x : ‖x‖ ≤ B} for some B > 0. Let f be a convex function and let x* ∈ argmin_{x:‖x‖≤B} f(x). Assume also that for all t, ‖v_t‖ ≤ ρ, and that stochastic projected gradient descent is run for N iterations with η = √(B²/(ρ²N)). Then,
E[ f( (1/N) Σ_{t=1}^N x_t ) ] − f(x*) ≤ Bρ/√N.   (57)
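Theorem 59 can likewise be illustrated on a toy convex problem. The following sketch (ours; the data-generating choices are arbitrary, and a single run is only indicative since the theorem bounds an expectation) runs stochastic projected gradient descent over an ℓ_2 ball of radius B with ‖v_t‖ ≤ ρ and step size η = B/(ρ√N), and compares the suboptimality of the averaged iterate with Bρ/√N.

```python
# Illustrative run (ours) of stochastic projected gradient descent in the
# setting of Theorem 59; the theorem bounds the expected suboptimality, so a
# single run is only indicative. All data-generating choices are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
d, m, B, N = 5, 200, 2.0, 20000
X = rng.normal(size=(m, d))
x_true = rng.normal(size=d)
x_true *= min(1.0, B / np.linalg.norm(x_true))      # ensure ||x_true|| <= B
y = X @ x_true                                      # so min_{||x||<=B} f(x) = 0
f = lambda x: np.mean(np.abs(X @ x - y))            # convex, piecewise linear
rho = np.linalg.norm(X, axis=1).max()               # bound on ||v_t||

def proj_ball(x):
    n = np.linalg.norm(x)
    return x if n <= B else x * (B / n)

eta = B / (rho * np.sqrt(N))
x, avg = np.zeros(d), np.zeros(d)
for t in range(N):
    i = rng.integers(m)
    v = np.sign(X[i] @ x - y[i]) * X[i]             # stochastic subgradient
    x = proj_ball(x - eta * v)
    avg += x / N                                    # running average of iterates

print(f"f(avg) = {f(avg):.4f}   bound B*rho/sqrt(N) = {B * rho / np.sqrt(N):.4f}")
```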