Abstract
Solving multi-objective optimization problems is important in various applications
where users are interested in obtaining optimal policies subject to multiple (yet often con-
flicting) objectives. A typical approach to obtain the optimal policies is to first construct
a loss function based on the scalarization of individual objectives and then derive optimal
policies that minimize the scalarized loss function. Albeit simple and efficient, the typical
approach provides little insight into the optimization of multiple objectives due
to its inability to quantify the inter-objective relationship. To address the issue, we
propose to develop a new efficient gradient-based multi-objective reinforcement learning
approach that seeks to iteratively uncover the quantitative inter-objective relationship via
finding a minimum-norm point in the convex hull of the set of multiple policy gradients
when the impact of one objective on others is unknown a priori. In particular, we first
propose a new PAOLS algorithm that integrates pruning and approximate optimistic linear
support algorithm to efficiently discover the weight-vector sets of multiple gradients that
quantify the inter-objective relationship. Then we construct an actor and a multi-objective
critic that can co-learn the policy and the multi-objective vector value function. Finally,
the weight discovery process and the policy and vector value function learning process can
be iteratively executed to yield stable weight-vector sets and policies. To validate the ef-
fectiveness of the proposed approach, we present a quantitative evaluation of the approach
based on three case studies.
1. Introduction
In recent years, the application of reinforcement learning (RL) in tasks with high-dimensional
sensory inputs has shown the potential of creating artificial agents that can learn to accom-
plish a number of challenging tasks, including the Atari games (Mnih, et al., 2015; Guo,
et al., 2014; Schaul, et al., 2015; Wang, et al., 2016; Van Hasselt, Guez, & Silver, 2016;
Oh, et al., 2015; Nair, et al., 2015), Go (Maddison, et al., 2014; Silver, et al., 2016, 2018),
and self-driving cars (Pan, et al., 2017). However, the approaches developed therein mainly
focus on finding a single usable strategy, without considering the trade-off among potential
alternatives that can increase one objective’s value at the cost of another.
In the multi-objective setting, the completion of a task requires the simultaneous satis-
faction of multiple objectives such as balancing power consumption and performance in Web
servers (Tesauro, et al., 2008). Such problems can be modeled as multi-objective Markov
decision processes (MOMDPs) and solved by some existing multi-objective reinforcement
learning (MORL) algorithms (Tesauro et al., 2008; Nguyen, 2018; Tajmajer, 2018; Abels,
et al., 2018; Van Moffaert, Drugan, & Nowé, 2013; Vamplew, Dazeley, & Foale, 2017) based
on the assumption that either the weighting factor for different objective functions or the
ordering information is available. The solution for parameterized algorithms assumes the
weighting factor for different objective functions can be obtained either directly (i.e., known
a priori ) or indirectly (through learning). Hence, the linear scalarization approach can be
used to transform the multi-objective problem into a single-objective problem, which can be
solved via one or several parameterized single-objective optimization problems. For exam-
ple, an approach was proposed to use both linear weighted sum and nonlinear thresholded
lexicographic ordering methods to develop a multi-objective deep RL framework that in-
cludes both single- and multi-policy strategies (Nguyen, 2018). An architecture that uses
separated deep Q-networks (DQNs) was proposed to control the agent’s behavior with re-
spect to particular objectives (Tajmajer, 2018). Each DQN has an additional decision value
output that acts as a dynamic weight that is used while summing up Q-values. Another
method was proposed to generalize across weight changes and high-dimensional inputs by
proposing a multi-objective Q-network in a dynamic weight setting (Abels et al., 2018).
Other parameterized methods include the convex hull (Roijers, Whiteson, & Oliehoek,
2015), the varying parameters approaches (Abels et al., 2018; Liu, Xu, & Hu, 2015), the
constraint method (Konak, Coit, & Smith, 2006), the sequential method (Nakayama, Yun,
& Yoon, 2009), and the max-min method (Lin, 2005). When the weighting factor is assumed
to be unobtainable, parameter-free multi-objective optimization techniques usually apply
an ordering of the importance of the vector objective functions, i.e., the use of softmax-
epsilon selection based on a nonlinear action selection operator (Vamplew et al., 2017).
The agents incorporate an action selection function that is defined as an ordering over the
Q-values. However, this optimization process is usually augmented by an interactive proce-
dure of linear preferences to specify the ordering. Due to the above limitations, new MORL
methods are required when neither the weighting factor for different objective functions nor
the ordering can be obtained a priori.
Recently, some interesting research was conducted to propose an upper bound for the
multi-objective loss and prove that optimizing this upper bound via gradient-based multi-
objective optimization yields a Pareto optimal solution (Sener & Koltun, 2018). The Frank-
Wolfe solver is used to find a minimum-norm point in the convex hull of the set of input
points. This paper provides a new perspective for parameter-free objective balancing when
the gradients for all objectives are considered as the min-norm points in the convex hull.
However, the proposed method has two major limitations. First, this paper shows success in
large-scale multi-label learning tasks without addressing complex continuous space planning
tasks. Second, when the number of vector value functions grows, the Frank-Wolfe solver
is not applicable because the constrained convex optimization problem cannot be directly
formulated by minimizing the linear approximation of multiple differentiable and convex
real-valued functions.
In contrast to the previous work, we focus on discovering the weight-vector sets of
multiple gradients of the loss instead of learning the weights for the vector value functions
and scalarizing it using the weights. Hence, the proposed approach is a parameter-free
approach because the objective functions are treated individually without employing any
scalarization. Here is a brief description of the main idea behind the proposed approach.
Because some objectives are more sensitive than others, we seek to reach a desired point
in the Pareto set such that it is impossible to reduce the local loss of any objective without
increasing at least one other loss. In other words, we aim to discover the weight-vector sets
of multiple gradients towards convergence such that, when moving in the direction of this
learned scalarization of multiple gradients of the loss, we can finally reach a desired point
in the Pareto set. Therefore, it is essential to learn the gradient directions that quantify
the relationship among these objectives and then train a decision policy that can balance
these objectives based on the quantitative inter-objective relationship, when the values for
multiple objectives are considered a vector and balancing them is required in the decision
making process.
To address this issue, this paper focuses on proposing a new efficient gradient-based
multi-objective reinforcement learning approach via multiple-gradient descent with discov-
ered weight-vector sets. The proposed approach is a gradient-based multi-policy method
that focuses on efficiently discovering an approximate convex coverage set and then de-
signing actor-critic policy learning algorithms. The proposed approach has four main
contributions. First, we propose a new learning algorithm that integrates pruning and
approximate optimistic linear support algorithm, named PAOLS, to efficiently learn the
weights that quantify the inter-objective relationship. This algorithm allows us to compute
a minimum-norm point in the convex hull of the policy set and find the corresponding
weights of multiple gradients. Different from the standard optimistic linear support, which
is developed for multi-objective planning (Mossalam, et al., 2016), PAOLS is motivated
by the need to address multi-objective reinforcement learning. Second, instead of using a
scalarized Q-value in the reinforcement learning based action selection, the proposed method
supports vector value functions. In particular, all value functions can be used sequentially
in the training of the actor network to update the control policy. Third, our multi-objective
optimization is applicable in high-dimensional continuous action spaces, such as robotics
motion planning. Lastly, we develop an iterative framework and provide rigorous analysis
on the stability of the proposed approach. To the best of our knowledge, this is the first time
that an actor-critic method with a learned quantifiable inter-objective relationship has been developed to solve
MORL using vector value functions with rigorous stability analysis. To verify the effective-
ness and advantages of the proposed approach, we finally provide three case studies and
compare the performance of the proposed approach with some state-of-the-art approaches.
2. Previous Work
In this section, we will provide a brief review of related technical approaches and the moti-
vation of the work.
Typical multi-objective optimization (MOO) studies the problem of optimizing a set of
possibly conflicting objectives. One main strategy to solve this problem is the scalariza-
tion approach (Ward & Lee, 2001; Nguyen, 2018; Tajmajer, 2018; Vamplew et al., 2017;
Van Moffaert et al., 2013), where one or several single-objective optimization problems are
solved. Two major disadvantages of these approaches are (1) the choice of the weighting
factors is needed, leading to the burden of choosing them in the model, and (2) scalarization
only results in a properly efficient solution (Ward & Lee, 2001). Another typical approach is
the multi-criteria optimization. This approach usually uses an ordering of different criteria,
i.e., an ordering of importance of different objectives (Miettinen & Mäkelä, 1995; Vamplew
et al., 2017). In this case, the ordering needs to be specified, leaving the decision maker
with the burden of deciding priorities among alternatives.
When the weighting factors for different objectives are unavailable, some important ap-
proaches have been developed. For example, a Pareto Q-learning approach was proposed
to learn the entire Pareto front when each state-action pair is sufficiently sampled, without
the knowledge of the weighting factors (Van Moffaert & Nowé, 2014). A multi-objective
version of the Bellman optimality operator was proposed to learn a single parametric rep-
resentation of all optimal policies over the space of preferences (Yang, Sun, & Narasimhan,
2019). A deep optimistic linear support learning algorithm was proposed to compute the
convex coverage set containing all potential optimal solutions of the convex combinations of
the objectives via using features from the high-dimensional inputs (Mossalam et al., 2016).
These studies provided very interesting and important approaches for multi-objective opti-
mization without knowing the weighting factor. However, an explicit quantitative analysis
of the inter-objective relationship has not been conducted.
Another class of relevant approaches that have been proposed is the gradient-based
MOO. Gradient-based methods play an important role in multi-objective optimization. For
example, the gradient-based multi-objective optimization methods (Fliege & Svaiter, 2000;
Désidéri, 2012) use multi-objective Karush-Kuhn-Tucker (KKT) conditions (Gordon & Tib-
shirani, 2012) to find a descent direction that decreases all objectives. This approach was
then extended to the case of a stochastic gradient descent (Peitz & Dellnitz, 2016; Poirion,
Mercier, & Désidéri, 2017). One key disadvantage of these approaches is that they scale
poorly with the dimensionality of the gradients. An importance sampling approach was
proposed for multi-objective reinforcement learning that can achieve good control perfor-
mance in partially observable Markov decision processes with few data (Shelton, 2001).
A new constrained policy gradient reinforcement learning approach that integrates policy
gradient reinforcement learning algorithms and techniques used in nonlinear programming
was proposed to maximize the long-term average intrinsic reward under the inequality con-
straints induced by the extrinsic rewards (Uchibe & Doya, 2007). Two new manifold-based
algorithms that combine episodic exploration and importance sampling were proposed to
efficiently learn a manifold in the policy parameter space such that its image in the ob-
jective space accurately approximates the Pareto frontier (Parisi, Pirotta, & Peters, 2017).
There are also numerous other gradient-based methods to solve multi-objective optimiza-
tion (Pirotta, Parisi, & Restelli, 2015; Parisi, Pirotta, & Restelli, 2016; Parisi, et al., 2014a,
2014b; Pinder, 2016). For example, policy gradient techniques were developed to approxi-
mate the Pareto frontier in multi-objective Markov decision processes (Pirotta et al., 2015;
Parisi et al., 2016, 2014a, 2014b). Note that an explicit quantitative study of the inter-
objective relationship is also lacking in these studies.
A recent study proposed the optimization of an upper bound for the multi-objective loss
via a gradient-based method and show success in large-scale multi-label learning tasks (Sener
& Koltun, 2018). However, whether the method can be extended to MORL (MOO on
reinforcement learning tasks) remains unknown. In complex continuous space planning
tasks, it is typical that both the large number of the gradients and the high dimensionality
of the gradients need to be considered. When a large number of gradients are employed,
the Frank-Wolfe solver is not directly applicable. Different from this approach, we focus on
solving the problem of optimizing the upper bound of the multiple value functions of the
learned policy set via an efficient PAOLS algorithm whose running time is bounded by some
polynomial function of the number of gradients. This algorithm is more computationally
efficient than the optimistic linear support method (Mossalam et al., 2016) to compute all
potential optimal solutions of the convex combinations of the objective. In particular, this
proposed algorithm can gradually improve the approximation of the convex coverage set
via maintaining a lexicographically maximum order of the gradients (Ehrgott, 1995). In
addition, PAOLS is scalable to both the number of gradients and the dimension of gradients.
As the proposed method is a multi-policy MORL approach, we will also briefly re-
view some relevant work on multi-policy optimization. The existing multi-policy algo-
rithms mainly focus on learning multiple policies that form an approximation of the Pareto
front (Vamplew, et al., 2011). For example, some earlier studies focused on combining the
basic RL algorithms with an online estimate of the minimally approachable target set (Man-
nor & Shimkin, 2002, 2004). This method does not hold when the vector value function is
monotonically increasing in all objectives without constraints because the approachability
in the presence of a finite set of steering gradient directions cannot be proved. Another
approach was proposed to compute the set of the multiple criteria on the convex hull for
multiple policies (Barrett & Narayanan, 2008). Note that the policies are still built via
computing the set’s maximum weighted sum value. In contrast, the scalarization of the
original vector value functions is not needed in the proposed method. Instead, it focuses on
learning the weights of gradients that can quantify inter-objective relationship via comput-
ing a discrete approximation of the set of gradients for high-dimensional continuous action
spaces.
3. Preliminaries
In this section, we will provide some background information, including the multi-objective
Markov decision processes (MOMDP), the multi-objective decision making setting, and the
MORL setting.
3.1 MOMDP
The readers are referred to the survey (Roijers, et al., 2013) for a thorough introduction
of algorithmic solutions for MOMDP. By following the typical MOMDP settings (Roijers
et al., 2013), we here adopt a finite MOMDP that is a tuple $\langle X, A, T, \mathbf{R}, \mu, \gamma \rangle$, where $X$ is the state space, $A$ is the action space, $T$ is the transition function, $\mathbf{R}$ is the vector-valued reward function for the $I$ objectives, $\mu$ is the distribution over initial states, and $\gamma \in [0, 1)$ is the discount factor.
A control policy π is a map that specifies the probability of taking a specific action at any
given state. If the policy π is stationary, i.e., it conditions only on the current state, it can
be formalized as $\pi: X \times A \to [0, 1]$. The vector value function $\mathbf{V}^\pi = [V_1^\pi, \cdots, V_I^\pi]' : X \to \mathbb{R}^I$ specifies the expected cumulative discounted reward vector $\mathbf{r}$:
$$\mathbf{V}^\pi(x) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k \mathbf{r}_{t+k+1} \,\middle|\, \pi, x_t = x\right], \quad (1)$$
where $\mathbb{E}[\cdot]$ is the expectation operator and $\mathbf{r}$ is the immediate vector reward for all objectives under the policy $\pi$.
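To make the expression in (1) concrete, the minimal sketch below computes the discounted cumulative reward vector of a single sampled trajectory; the function name and array layout are illustrative assumptions, and the expectation in (1) is what the multi-objective critic later approximates.

import numpy as np

def vector_return(rewards, gamma=0.99):
    """Discounted cumulative reward vector as in Eq. (1) for one trajectory.
    `rewards` has shape (T, I): one I-dimensional reward vector per time step.
    Sketch only; the paper estimates the expectation with a critic network."""
    rewards = np.asarray(rewards, dtype=float)
    T, I = rewards.shape
    discounts = gamma ** np.arange(T)                     # gamma^0, ..., gamma^(T-1)
    return (discounts[:, None] * rewards).sum(axis=0)     # shape (I,)

# Two objectives, three time steps:
print(vector_return([[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]))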
Following the K-shot learning setting, we sample K input/output pairs from each objective, for a total of IK data points for I objectives
(please refer to Yoo, et al., 2018 for more details). In other words, we adopt K samples,
namely, trajectories, for each of the I objectives in the training of reinforcement learning
algorithms. We consider a uniform distribution over trajectories p(T ) that we want our
model fθ : X → V to fit. The model is trained to fit a new trajectory Ti drawn from p(T )
from only K samples drawn from µi and the loss LTi that is generated by Ti . Formally, the
model’s parameters θ will be updated via the single objective gradient update as
I
X
θ = θ − αOθ wi LTi (fθi0 ), (3)
i=1
where wi is some static or dynamically computed weights for the objective i. The K-shot
learning procedure is shown in Algorithm 5.
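As a minimal illustration of the scalarized update in (3), the sketch below performs one gradient step given per-objective gradients of the K-shot losses; the function name and interface are assumptions, and in the proposed approach the weights are discovered rather than fixed.

import numpy as np

def kshot_update(theta, grads, weights, alpha=3e-4):
    """One step of Eq. (3): theta <- theta - alpha * grad_theta sum_i w_i L_{T_i}.
    `grads[i]` is the gradient of the i-th K-shot loss w.r.t. theta;
    `weights[i]` is w_i (static, or discovered as in Section 4). Sketch only."""
    combined = sum(w * g for w, g in zip(weights, grads))
    return theta - alpha * combined

# Two objectives with partially conflicting gradient directions:
theta = np.zeros(3)
theta = kshot_update(theta, [np.array([1.0, 0.0, 0.5]),
                             np.array([-1.0, 0.2, 0.5])], weights=[0.5, 0.5])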
Recall that the objectives are possibly conflicting. Consider a parametric hypothesis per
objective as $f^i(x; \theta_{IO}, \theta_{IS}^i): X \to \mathbf{V}^\pi$, such that some parameters ($\theta_{IO}$) are shared between
objectives and some ($\theta_{IS}^i$) are objective-specific. The network architecture details for this
hypothesis are given in Section 6.1. The resulting multi-objective loss minimization problem is
$$\min_{\theta_{IO}, \theta_{IS}^1, \cdots, \theta_{IS}^I} \mathbf{L}\big(\theta_{IO}, \theta_{IS}^1, \cdots, \theta_{IS}^I\big) = \min_{\theta_{IO}, \theta_{IS}^1, \cdots, \theta_{IS}^I} \left[\hat{L}_1\big(\theta_{IO}, \theta_{IS}^1\big), \cdots, \hat{L}_I\big(\theta_{IO}, \theta_{IS}^I\big)\right]', \quad (4)$$
where $\hat{L}_i$ is the loss associated with objective $i$ and $\mathbf{L}(\theta_{IO}, \theta_{IS}^1, \cdots, \theta_{IS}^I) \neq L(\theta_{IO}, \theta_{IS}, \cdots, \theta_{IS})$.
In addition, if the objective functions L̂i , i = 1, · · · , I, are convex and continuously differ-
entiable, a Pareto stationary solution θ is also a Pareto optimal solution (Hosseinzade &
Hassanpour, 2011). Hence, if L̂i , i = 1, · · · , I are convex and continuously differentiable,
solving the optimization in (4) is the same as finding the Pareto stationary solution based
on the two KKT conditions.
Typically, the second KKT condition holds when freezing the objective-specific parameters $\theta_{IS}^i$. Any solution that satisfies these conditions is called a Pareto stationary point.
Although every Pareto optimal point is Pareto stationary, the reverse may not be true.
For the first KKT condition, namely, $\sum_{i=1}^{I} w_i \Omega_i = 0$, one needs to learn both the weights
$w_i$, $i = 1, \cdots, I$, and the policy $\theta$ such that $\left\|\sum_{i=1}^{I} w_i \Omega_i\right\|_2^2$ is minimized (to zero). This
optimization problem admits a general solution $v$ in the convex hull of the family of vectors
$\Omega_i$, $i = 1, \cdots, I$,
$$v = \left\{ \sum_{i=1}^{I} w_i \Omega_i \;\middle|\; w_i \geq 0, \; \sum_{i=1}^{I} w_i = 1 \right\}. \quad (5)$$
Equivalently, the minimization of $\left\|\sum_{i=1}^{I} w_i \Omega_i\right\|_2^2$ can be written as learning the weights
$\mathbf{W}$ corresponding to the minimum-norm element in $v$. Hence, the weights can be
learned via
$$\mathbf{W} = \arg\min_{\mathbf{W}} \left\{ \left\| \sum_{i=1}^{I} w_i \Omega_i \right\|_2^2 \;\middle|\; \sum_{i=1}^{I} w_i = 1, \; w_i \geq 0 \right\}, \quad (6)$$
where $\mathbf{W} = [w_1, \cdots, w_I]$.
Because the dimension of θIO , typically described by neural networks, can be very large
(in thousands or more), it is very computationally expensive to compute the minimum-norm
point in the convex set. To reduce the dimension, by following the idea in (Sener & Koltun,
2018), define a new representation $z = [z_1, \cdots, z_M] \in \mathbb{R}^M$, where $z_i = g_i(x; \theta_{IO})$. $\Omega_i$ can be
rewritten using the chain rule as
$$\Omega_i = \frac{\partial \hat{L}(\theta_{IO}, \theta_{IS}^i)}{\partial z} \frac{\partial z}{\partial \theta_{IO}}. \quad (7)$$
Because each vector $\Omega_i$ includes the term $\frac{\partial z}{\partial \theta_{IO}}$, which is independent of $\mathbf{W}$, one can obtain
$$\sum_{i=1}^{I} w_i \Omega_i = \left( \sum_{i=1}^{I} w_i \frac{\partial \hat{L}(\theta_{IO}, \theta_{IS}^i)}{\partial z} \right) \frac{\partial z}{\partial \theta_{IO}}.$$
When $\frac{\partial z}{\partial \theta_{IO}}$ has linearly independent rows, which is typically true when the dimension of
$\theta_{IO}$ is large, the optimization problem in (6) can be equivalently written as
$$\mathbf{W} = \arg\min_{\mathbf{W}} \left\{ \left\| \sum_{i=1}^{I} w_i \nabla_z \hat{L}_i(\theta_{IO}, \theta_{IS}^i) \right\|_2^2 \;\middle|\; \sum_{i=1}^{I} w_i = 1, \; w_i \geq 0 \right\}. \quad (8)$$
Because the dimension of $z$ can be much smaller than that of $\theta_{IO}$, the computational
complexity can be reduced significantly.
Since the only requirement for the selection of the loss function L is that each L̂i is
convex, one can select L as −Vπ because each function in −Vπ , namely, −Viπ , is a convex
combination of the negative immediate vector reward −rt+k+1 , defined in (1).
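As a concrete sketch of solving (8) (equivalently, (6) in the reduced z-space), the routine below approximates the minimum-norm weights with a simple Frank-Wolfe iteration over the probability simplex; it is an illustrative stand-in under the assumption that the per-objective gradients are stacked as rows of a matrix G, not the PAOLS procedure of Section 4.1.

import numpy as np

def min_norm_weights(G, iters=200):
    """Approximate the weights W of Eq. (8): the minimum-norm point in the
    convex hull of the rows of G, where G[i] = grad_z L_i (one row per objective).
    Frank-Wolfe on the probability simplex; a sketch only."""
    I = G.shape[0]
    w = np.full(I, 1.0 / I)            # start from uniform weights
    M = G @ G.T                        # Gram matrix of the gradients
    for t in range(iters):
        grad = M @ w                   # gradient of ||G^T w||^2 (up to a factor of 2)
        s = np.zeros(I)
        s[np.argmin(grad)] = 1.0       # best vertex of the simplex
        gamma = 2.0 / (t + 2.0)        # standard Frank-Wolfe step size
        w = (1 - gamma) * w + gamma * s
    return w

# Example with two conflicting gradients in a 2-D latent space z:
G = np.array([[1.0, 0.2],
              [-0.8, 0.3]])
w = min_norm_weights(G)
print(w, np.linalg.norm(G.T @ w))      # weights sum to 1; small residual norm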
The intuition to find the minimum-norm point and the associated marginal weight in
the convex hull defined in (5) is that we can find model parameters that are sensitive to
changes in training each objective, such that small changes in the parameters will produce
large improvements on the loss function of any objective drawn from p(T ). When altered
in the direction of the gradients, these vectors are often associated with the criteria that
have already achieved a fair degree of convergence.
Remark 1. The direction of the minimum-norm vector $V$, the weighted sum of vectors $\Omega_i$
with a minimum norm, is mostly influenced by the gradients of small norms in the family
of vectors, as illustrated in Figure 1 when $I = 2$ and $z \in \mathbb{R}^2$. In the course of the
K-shot optimization, these vectors are often varied towards convergence. In the visualization
of the minimum-norm point in the convex hull of the two-dimensional vectors $\Omega_1$ and $\Omega_2$
($\min_{w_1 \in [0,1]} \|w_1 \Omega_1 + (1 - w_1)\Omega_2\|_2$), we can see that the solution is either a vector itself
or a perpendicular vector. As computational geometry suggests, one of the following
cases occurs: (1) the solution to this optimization problem is 0 and the resulting point is
Pareto stationary; and (2) the solution defines a descent direction common to all the criteria (Désidéri, 2012).

Figure 1: Possible directions of $V$ with respect to the two gradients $\Omega_1$ and $\Omega_2$ when $I = 2$
and $z \in \mathbb{R}^2$.
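For the two-gradient case visualized in Figure 1, the minimizer admits a well-known closed form (the expression commonly used in gradient-based MOO, e.g., Sener & Koltun, 2018); a sketch under that assumption, with the helper name chosen here for illustration:

import numpy as np

def min_norm_two(omega1, omega2):
    """Closed-form solution of min_{w in [0,1]} ||w*omega1 + (1-w)*omega2||^2."""
    diff = omega1 - omega2
    denom = float(diff @ diff)
    if denom == 0.0:                       # identical gradients: any w works
        return 1.0
    w1 = float((omega2 - omega1) @ omega2) / denom
    return min(max(w1, 0.0), 1.0)          # clip to the interval [0, 1]

w1 = min_norm_two(np.array([1.0, 0.2]), np.array([-0.8, 0.3]))
v = w1 * np.array([1.0, 0.2]) + (1 - w1) * np.array([-0.8, 0.3])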
We next discuss how to compute the minimum-norm point. We define the weight W at
the minimum-norm point as the marginal weight. A typical approach is the approximate op-
timistic linear support (AOLS) approach (Roijers et al., 2014b). AOLS is an ε-approximate
MDP solver that can compute a set of policies for which the scalarized value for each possible
weight is at most a factor ε away from the optimal value for the MDP. AOLS is developed to
deal with the computational infeasibility issue of optimistic linear support (OLS) although
OLS can theoretically obtain a solution to an MOMDP with linear scalarizations. Before
discussing the AOLS approach, it is necessary to define the undominated set (US).
Definition 2. The undominated set $US(\mathcal{V})$ is the subset of all possible vectors $\mathcal{V}$ that are
optimal for some weight vector $\mathbf{W}$ with a scalarized comparison:
$$US(\mathcal{V}) = \left\{ \mathbf{V} \in \mathcal{V} \;\middle|\; \exists\, \mathbf{W}, \; \forall\, \mathbf{V}' \in \mathcal{V}: \; \mathbf{W}\mathbf{V} \geq \mathbf{W}\mathbf{V}' \right\},$$
where $\mathbf{W}\mathbf{V}'$ and $\mathbf{W}\mathbf{V}$ are the scalarized values of $\mathbf{V}'$ and $\mathbf{V}$ with respect to a given weight
vector $\mathbf{W}$.
We now describe AOLS in more detail. AOLS is a method that can gradually
improve the approximation of US. Given a maximum improvement threshold $\epsilon > 0$, the
AOLS algorithm can compute an approximated $\epsilon$-optimal US, denoted as $US_\epsilon$, which may
diverge from the optimal US by at most $\epsilon$. Before a complete US is obtained, a partial
US can be obtained by evaluating the largest improvement for weights $\mathbf{W}$ via the priority
queue of the marginal weight set in this step. An element in the vector value function over
a partial US is defined by $V_S^*(\mathbf{W}) = \max_{\mathbf{V} \in S} \mathbf{W}\mathbf{V}$, where $S$ is the partial undominated set.
[Figure 2: $V_S^*(\mathbf{W})$ for a partial set $S$ with three gradient vectors $\Omega_1$, $\Omega_2$, and $\Omega_3$; the minimum-norm point is marked with a red 'o'.]
As described in Section 3.3, the dimension of the vectors $\Omega_i$ has been shrunk to the
dimension of $z$ by rewriting the optimization problem (6) into (8). An example of $V_S^*(\mathbf{W})$
for an $S$ containing three value vectors and $z \in \mathbb{R}^2$ is given in Figure 2. The vectors $\Omega_1$, $\Omega_2$,
and $\Omega_3$ are represented in the 2D space. Each line $\Omega_i$ denotes $\beta \frac{\partial \hat{L}(\theta_{IO}, \theta_{IS}^i)}{\partial z_1} + (1 - \beta) \frac{\partial \hat{L}(\theta_{IO}, \theta_{IS}^i)}{\partial z_2}$.
$V_S^*(\mathbf{W})$ is a piecewise linear and convex function that consists of line segments, each of which
is the upper surface among all scalarized gradient vectors. The marginal weight is the weight
corresponding to the minimum-norm point marked with a red 'o'. When $z \in \mathbb{R}^3$, each
element in $V_S^*(\mathbf{W})$ associated with a policy is a plane instead of a line. When there are
more than three objectives, each element in $V_S^*(\mathbf{W})$ can be represented as a hyperplane.
AOLS always selects the minimum-norm point and the $\mathbf{W}$ that reflects the largest difference
between $V_{US}(\mathbf{W})$ and $V_S^*(\mathbf{W})$ on an optimistic upper bound, i.e., $V_{US}(\mathbf{W}) - V_S^*(\mathbf{W})$, which
can be updated iteratively to obtain a more accurate $\mathbf{W} = \arg\max V_{US}(\mathbf{W})$ in the convex
hull. The pseudocode for AOLS is shown in Algorithm 1, where the minimum-norm
point $\mathbf{W}V_S^*(\mathbf{W})$ with respect to $\mathbf{W}$ is appended to the minimum point queue $P$.
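The scalarized upper surface $V_S^*(\mathbf{W})$ and the optimistic improvement that AOLS maximizes can be written compactly; the sketch below treats $V_{US}(\mathbf{W})$ as a given optimistic estimate, which is a simplification of how Algorithm 1 actually bounds it, and the function names are assumptions.

import numpy as np

def v_s_star(W, S):
    """V_S^*(W) = max_{V in S} W . V over the partial undominated set S
    (S is a list of value or gradient vectors). Sketch only."""
    return max(float(np.dot(W, V)) for V in S)

def optimistic_improvement(W, v_us_estimate, S):
    """The quantity V_US(W) - V_S^*(W) that AOLS uses to pick the next weight;
    `v_us_estimate` stands in for the optimistic upper bound maintained by
    Algorithm 1 (an assumption of this sketch)."""
    return v_us_estimate - v_s_star(W, S)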
4.1 PAOLS
In this subsection, we will propose a new approach that is more efficient than the existing
AOLS method. Before describing the new approach, let us discuss the time complexity
associated with the AOLS algorithm. Let the number of the marginal weights in the AOLS
algorithm be given by $|W|$, where $|\cdot|$ is the cardinality of a set. The while loop in Algorithm 1
will run $|W|$ times until $Q$ is empty. In each run, $|V|$ linear equations (also referred to
as primitive operations) need to be solved up to $|W|$ times. Hence, the total number
of primitive operations in AOLS can be up to $|W|\,|V|\,(1 + |W|)\,|W|$, namely, in an order of
$O(|W|^3 |V|)$. We next present a new algorithm that can reduce the time complexity to the
order of $O(|W|^2 |V|)$.
The proposed new method is the introduction of pruning in the AOLS algorithm, named
PAOLS. While still employing the standard AOLS procedure, PAOLS focuses on identifying
the elements in the undominated set that are not optimal and hence can be removed from
the list to be visited by the AOLS procedure. In other words, the PAOLS algorithm seeks to
reduce the number of weighted sums $\mathbf{W}\mathbf{V}_{old}$ to be visited in the US set, hence improving
the efficiency. In particular, PAOLS replaces the original OLS subroutine (Algorithm 2) with
a new paols subroutine (Algorithm 3). In the new paols subroutine, we introduce a new
function lexgt(·) that uses the lexicographic order, which is a dictionary order of vectors, to
obtain the lexicographically maximized element. More specifically, the lexgt(·) function used in Algorithm 3
is defined in (11).
Because the lexicographically maximum vector g(V[j]) is chosen so that W [i] g(V[j]) is lex-
icographically maximized for all V[j], no other element can be used to construct a larger
W [i]g(V[j]) − VS∗ (W [i]) while still satisfying the marginal weight Wmax = W [i]. In other
words, the lexicographic order can further eliminate the non-optimal elements in the un-
dominated set so that these non-optimal elements will not be visited for obtaining the
corresponding policies. Meanwhile, such a Wmax is guaranteed to be a marginal weight
because it dominates all previous vectors due to the lexicographic order operation in (11).
One can also observe that the eliminated non-optimal elements are not marginal weights
since they are dominated by the lexicographically maximum vector.
We next show that the new algorithm can reduce the complexity to the order of
$O(|W|^2 |V|)$.

Theorem 2. The new PAOLS algorithm described in Alg. 1 runs in polynomial time in an
order of $O(|W|^2 |V|)$.

Proof. In computing $\mathbf{W}$, the total number of $\mathbf{W}$ added to the unchecked set depends on
the size of $W$, namely $|W|$. In the new subroutine paols, a number of $|W|\,|V|$ primitive
operations is needed. Each pass of the while loop of Algorithm 1 consumes an element from
a total number of $|W|$ unchecked weights. Hence, the total number of primitive operations
can be up to $|W|\,|W|\,|V|$, namely, in an order of $O(|W|^2 |V|)$.

Because the proposed PAOLS algorithm can reduce the time complexity from $O(|W|^3 |V|)$
to $O(|W|^2 |V|)$, it is more efficient than the AOLS algorithm.
Data: W, S
Result: W[i], W[i]g(V[j]) − VS∗(W[i]), VS∗(W[i])
for W[i] ∈ W do
    maxval ← −∞
    for V[j] ∈ S do
        if W[i]V[j] ≥ maxval and lexgt(W[i]V[j], W[i]g(V[j])) then
            maxval ← W[i]V[j]
            g(V[j]) ← V[j]
        end
    end
    return W[i], W[i]g(V[j]) − VS∗(W[i]), VS∗(W[i])
end
Algorithm 3: function paolsSubroutine(W, S)
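A Python rendering of the paols subroutine above may read as follows; the lexgt test here is written as a plain dictionary-order comparison because the paper's exact definition in (11) is not reproduced above, and the interfaces (the V_S_star callable and list inputs) are assumptions of this sketch rather than the authors' implementation.

import numpy as np

def lexgt(u, v):
    """Lexicographic 'greater-than' on vectors: the first differing component decides.
    Stand-in for the lexgt of Eq. (11)."""
    for a, b in zip(u, v):
        if a != b:
            return a > b
    return False

def paols_subroutine(weights, S, V_S_star):
    """Sketch of Algorithm 3: for each candidate weight W, pick the lexicographically
    maximal value vector attaining the scalarized maximum over S, and report the
    optimistic improvement W.g - V_S^*(W)."""
    results = []
    for W in weights:
        maxval, g = -np.inf, None
        for V in S:
            val = float(W @ V)
            if val >= maxval and (g is None or lexgt(W * V, W * g)):
                maxval, g = val, V
        vs = V_S_star(W)
        results.append((W, float(W @ g) - vs, vs))
    return results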
also known as the temporal difference (TD) residual of V̂iπ with discount γ (Sutton & Barto,
2018), given by
$$\delta_{k,t}^i = r_i(x_{k,t}, a_{k,t}) + \gamma \hat{V}_i^{\pi}\big(x_{k,t+1}; \phi^-\big) - \hat{V}_i^{\pi^-}\big(x_{k,t}; \phi^-\big), \quad (12)$$
where $r_i(x_{k,t}, a_{k,t})$ is the immediate reward at the $t$th time step on the $k$th experience,
$\hat{V}_i^{\pi^-}(x_{k,t}; \phi^-)$ is the approximation of the value function $V_i$ based on the old policy $\pi^-$
for the actor network and the old weights $\phi^-$ for the critic network, and $\hat{V}_i^{\pi}(x_{k,t+1}; \phi^-)$ is
the approximation of the value function $V_i$ based on the updated policy $\pi$ for the actor
network and the old weights $\phi^-$ for the critic network.
In the standard TD-residual method, the value of one action evaluated via (12) is an
incremental form of value iteration. The key drawback of the standard TD-residual method
includes the need for a large number of samples and large variance of policy gradient esti-
mate. To address these issues, an existing approach, called generalized advantage estimator
(GAE) (Schulman, et al., 2015b), can be used to substantially reduce the variance of policy
gradient estimates at the cost of some bias (Schulman, et al., 2017, 2015a; Rockafellar &
Wets, 1991).
The GAE is defined by
$$\hat{A}_t^i(x_t, a_t) = \lim_{H\to\infty} (1 - \lambda) \sum_{j=1}^{H} \lambda^{j-1} \sum_{l=1}^{j} \gamma^{l-1} \delta_{k,t+l-1}^i = \lim_{H\to\infty} \sum_{l=0}^{H} (\gamma\lambda)^{l} \delta_{k,t+l}^i, \quad (13)$$
where $\lambda \in [0, 1]$ and $\gamma \in [0, 1]$ adjust the bias-variance tradeoff of GAE. To further improve
the performance as well as decrease the complexity of implementation and computation, we
also use proximal policy optimization (PPO) with clipping to constrain the objective in (13) as
$$\tilde{A}_t^i(\theta) = \mathbb{E}_{D_k}\left[ \min\left( \frac{\pi_\theta(a_t|x_t)}{\pi_{\theta^-}(a_t|x_t)} \hat{A}_t^i(x_t, a_t),\; g\big(\epsilon, \hat{A}_t^i(x_t, a_t)\big) \right) \right].$$
Note that $\theta$ includes the inter-objective weights $\theta_{IO}$ and the objective-specific weights $\theta_{IS}^i$.
For the $I$-objective critic network, its $i$th output value with hyperparameters $\phi$ is used to
approximate each element in the vector value function $\mathbf{V}^\pi(s)$. The weights can be updated
via $\Delta\phi \sim -\nabla_\phi \sum_k \|\delta_{k,t}\|_2^2$, where $\delta_{k,t} = [\delta_{k,t}^1, \cdots, \delta_{k,t}^I]'$.
After new weights of the A3C network models are obtained, $\mathbf{V}^\pi$ can be obtained via
new samples using the updated policy. Afterwards, the procedure in Subsection 4.1 can be
implemented to obtain the updated $\mathbf{W}_{max}$. The entire process will iterate until $V_{US}(\mathbf{W}) -
V_S^*(\mathbf{W}) < \epsilon$. The pseudocode is given in Algorithm 4.
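The sketch below shows how the TD residuals in (12), the GAE estimate in (13), and the clipped surrogate can be computed for one objective; the array names, the bootstrap convention for `values`, and λ = 0.95 are assumptions of this sketch, and in the proposed method one such estimate is kept per objective.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates, Eq. (13), for ONE objective.
    `rewards` has length T; `values` has length T+1 (one extra bootstrap entry)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals, Eq. (12)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                          # backward recursion
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped surrogate objective averaged over samples;
    `ratio` = pi_theta(a|x) / pi_theta_old(a|x)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))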
5. Convergence Analysis
In this section, we will prove the convergence property of the policy in the proposed algo-
rithm. To do this, we first need to define a few additional notations.
Let $\mu_a(x, \varphi)$, $\varphi \in \mathbb{R}^n$, represent the probability of selecting action $a$ at state $x$ subject to
a policy $\pi(\varphi)$, where $\varphi$ is the parameter of the policy. We define the one-stage reward as $g_{x_k}(\varphi)$,
where $x_k$ is the state at time step $k$ (one-stage state). Based on the value function defined
in (1), a typical performance metric to compare different policies is the average discounted
reward given by $\lambda(\varphi) = \lim_{t\to\infty} \mathbb{E}\left[\sum_{k=0}^{t} \gamma^k g_{x_k}(\varphi)\right]$, where $\mathbb{E}$ is the expectation operator. For
any $\varphi$ and $x$, the differential reward $v_x(\varphi)$ of observation $x$ is defined as
$$v_x(\varphi) = \mathbb{E}\left[\sum_{k=1}^{T-1} \big(g_{x_k}(\varphi) - \lambda(\varphi)\big) \,\middle|\, x_0 = x\right].$$
Π: empty queue of policy π; K: # of time steps in one episode; $\theta_{i,k}$: param. of the $i$th objective at the $k$th time step; $\theta_{IO}^{k}$, $\theta_{IS}^{i,k}$: $\theta_{IO}$, $\theta_{IS}^{i}$ at the $k$th time step
for $k = 1, \cdots, K$ do
    Collect a set of trajectories $D_{i,k}$ by running policy $\pi = \pi(\theta_{i,k})$
    Compute rewards-to-go $\hat{R}_t = [r_1(x, a) + \gamma \hat{V}_1^{\pi}(x; \phi_k), \cdots, r_I(x, a) + \gamma \hat{V}_I^{\pi}(x; \phi_k)]'$
    Compute advantage estimates $\hat{A}_t^i$ using the GAE method based on $\hat{V}_i(x; \phi_k)$
    Obtain $\hat{\mathbf{V}}(x; \phi_k) = [\hat{V}_1(x; \phi_k), \cdots, \hat{V}_I(x; \phi_k)]$
    Evaluate $\hat{L}(\theta_{i,k}) = \hat{L}(\theta_{IO}^{k}, \theta_{IS}^{i,k})$
    Update the policy by maximizing the PPO-Clip objective
        $\theta_{i,k+1} = \arg\max_{\theta} \frac{1}{|D_{i,k}|\, T} \sum_{D_{i,k}} \sum_{t=0}^{T} \min\left( \frac{\pi_{\theta}(a_t|x_t)}{\pi_{\theta_{i,k}}(a_t|x_t)} \hat{A}^i(x_t, a_t),\; g\big(\epsilon, \hat{A}^i(x_t, a_t)\big) \right)$
        via stochastic gradient ascent (e.g., Adam (Kingma & Ba, 2014))
    Fit the value function by regression on the mean-squared error
        $\phi_{k+1} = \arg\min_{\phi_k} \frac{1}{|D_{i,k}|\, T} \sum_{\tau_n^i \in D_{i,k}} \sum_{t=0}^{T} \left\| \mathbf{V}(x_t; \phi_k) - \hat{R}_t \right\|_2^2$
end
Assumption 1. For each ϕ, the Markov chains {Xn } and {Xn , An }, denoting the se-
quence of states and state-action pairs, are irreducible and aperiodic under the stationary
probabilities πx (ϕ).
Assumption 2. For every $x, x' \in X$, the transition probability $p_{xx'}(\varphi)$ and the one-stage reward $g_x(\varphi)$ are
bounded, twice differentiable, and have bounded first and second derivatives. In addition,
there exists a bounded function $\psi_a(x, \varphi)$ such that, for every observation $x$ and action $a$,
$$\psi_a(x, \varphi) = \frac{\nabla \mu_a(x, \varphi)}{\mu_a(x, \varphi)}, \quad (14)$$
where the mapping $\psi_a(x, \varphi)$ has bounded first derivatives for any fixed $x$ and $a$.
When Assumption 1 holds true, the MOMDP admits a unique invariant probability
distribution for each objective (Tsitsiklis & Van Roy, 1999). Hence, the policy $\pi(\varphi)$ is
composed of a steady-state probability of state $x$. When Assumption 2 holds true, $\mu_a(x, \varphi)$
is a smooth function of $\varphi$, hence $\lambda(\varphi)$ has a bounded first derivative. When Assumptions 1
and 2 hold, $\lambda(\varphi)$ and $\pi_x(\varphi)$ are twice differentiable and have bounded first and second
derivatives. This property is needed in the convergence analysis. Furthermore, (14) holds
whenever $\mu_a(x, \varphi)$ is nonzero. In order to project $\nabla\lambda(\varphi)$ onto a subspace and derive a general
form that can show the proposed algorithm is applicable to the case of an infinite space, we
first need to rewrite $\nabla\lambda(\varphi)$.
Lemma 1. Let Assumptions 1 and 2 hold. The gradient of $\lambda(\varphi)$ can be represented by¹
$$\nabla\lambda(\varphi) = \sum_{x \in X}\sum_{a \in A} \eta_a(x, \varphi)\, q_{x,a}(\varphi)\, \psi_a^i(x, \varphi), \quad (15)$$
by moving the gradient inside the summation. Meanwhile, the transition probability is
given by
$$p_{xx'}(\varphi) = \sum_{a \in A} \mu_a(x, \varphi)\, p_{xx'}(a). \quad (18)$$
By following a similar analysis as that for $\nabla g_x(\varphi)$, we can obtain
$$\sum_{x' \in X} \nabla p_{xx'}(\varphi)\, v_{x'}(\varphi) = \sum_{x' \in X}\sum_{a \in A} \nabla\mu_a(x, \varphi)\, p_{xx'}(a)\, v_{x'}(\varphi). \quad (19)$$
By inserting (17) and (19) into (16) and making a few rearrangements, we can obtain (15).
Based on Lemma 1, we now show that the gradient $\nabla\lambda(\varphi)$ can be written in the form of
inner products given in the following lemma. Before moving on, let us define $q_\varphi$ and $\psi(\varphi)$
as the vectors of, respectively, $q_{x,a}(\varphi)$ and $\psi_a(x, \varphi)$ on $X \times A$. Define the inner product of
two real value functions $q_\varphi$ and $\psi(\varphi)$ as
$$\langle q_\varphi, \psi(\varphi)\rangle_\varphi = \sum_{x \in X}\sum_{a \in A} \eta_a(x, \varphi)\, q_{x,a}(\varphi)\, \psi_a(x, \varphi). \quad (20)$$
1. $q_{x,a}(\varphi)$ in (15) corresponds to the TD residual in (12).
Lemma 2. Let Assumptions 1 and 2 hold. The gradient of $\lambda(\varphi)$ can be computed by the
inner product of two real value functions given by
$$\nabla\lambda(\varphi) = \langle q_\varphi, \psi(\varphi)\rangle_\varphi = \left\langle \Pi_\varphi q_\varphi, \psi(\varphi)\right\rangle_\varphi, \quad (21)$$
where
$$\Pi_\varphi q = \arg\min_{\bar{q} \in \zeta_\varphi} \|q - \bar{q}\|_\varphi \quad (22)$$
with $\zeta_\varphi$ denoting the span of the vectors $\psi_a^i(x, \varphi)$, $i = 1, \cdots, n$, in $\mathbb{R}^{|X| \times |A|}$.

Proof. Based on the definition in (20) and Lemma 1, we can obtain that $\nabla\lambda(\varphi) = \langle q_\varphi, \psi(\varphi)\rangle_\varphi$.
We next show that the second equality in (21) holds.
We can rewrite (15) as
$$\frac{\partial}{\partial\varphi_i}\lambda(\varphi) = \left\langle q(\varphi), \psi^i(\varphi)\right\rangle_\varphi, \quad i = 1, \cdots, n, \quad (23)$$
where $n$ is the dimension of $\varphi$. For a high (or even infinite) dimensional space, computing
the gradient of $\lambda(\varphi)$ depends on $q_{x,a}(\varphi)$ (equivalently, $q_\varphi$ in (23)), and is typically difficult.
An alternative approach is to use the projection of $q_\varphi$ based on (22) in the computation of the
inner product. Based on (20), the inner product $\langle q_\varphi, \psi(\varphi)\rangle_\varphi$ is equivalent to the inner
product of $\psi(\varphi)$ and the projection of $q_\varphi$ on $\zeta_\varphi$. Hence, $\langle q_\varphi, \psi(\varphi)\rangle_\varphi = \langle \Pi_\varphi q_\varphi, \psi(\varphi)\rangle_\varphi$
always holds. In other words, the projection of $q_\varphi$ onto $\zeta_\varphi$ is sufficient to learn $\nabla\lambda(\varphi)$ since
$\langle q_\varphi, \psi(\varphi)\rangle_\varphi = \langle \Pi_\varphi q_\varphi, \psi(\varphi)\rangle_\varphi$.
With Lemma 2, we are ready to present the proof of convergence in Theorem 3. Before
moving on, the following two assumptions are needed.
Assumption 3. The value update stepsize sequences for the critic $\{\gamma_k^i\}$ and the actor $\{\beta_k\}$
are positive and nonincreasing, and satisfy $\sum_{k=0}^{\infty} \vartheta_k = \infty$ and $\sum_{k=0}^{\infty} \vartheta_k^2 < \infty$, $\vartheta \in \{\gamma, \beta\}$.
In addition, the actor updates much slower than the critic, i.e., $\beta_k / \gamma_k^i \to 0$.
Proof. Under Assumption 3, the size of the actor updates is negligible compared with the size
of the critic updates. If the critic network is stable, the actor network is stationary. We
next show the convergence of the critic network.
When Assumptions 1 and 2 hold, the gradient $\nabla\lambda(\varphi)$ can be written in the form of inner
products, as shown in Lemma 2. When Assumptions 3 and 4 hold, the convergence analysis
in (Tsitsiklis & Van Roy, 1999) can be used to prove that any critic in the proposed policy
will converge with probability 1. Once the critic networks converge (i.e., are stationary), all
weights in the critic networks are stationary. Because $\lambda(\varphi)$ has a bounded second derivative
when Assumptions 1 and 2 hold, the update of the actor network can be proved stationary
as well by the stochastic approximation algorithm (Spall, 1992).
This completes the proof of Theorem 3.
6. Experiments
In this section, we evaluate the performance of the proposed approach. We first describe
the experiment setup and demonstrate how the gradient weights among various objectives
can be quantified via computing a min-norm point in the convex optimization. After that,
we will show how the maximal relative improvement $\Delta_r(\mathbf{W}) = \frac{V_{US}(\mathbf{W}) - V_S^*(\mathbf{W})}{V_{US}(\mathbf{W})}$ changes with
respect to the number of training episodes. Finally, we show the testing results and the
respect to the number of training episodes. Finally, we show the testing results and the
solution stability.
6.1 Setup
In our experiment setting, the parametric hypothesis $f^i(x; \theta_{IO}, \theta_{IS}^i): X \to \mathbf{V}^\pi$, where $\theta_{IS}^i$
is the objective-specific weight and $\theta_{IO}$ is the inter-objective weight, is a CNN shown in
Figure 3. The architecture specification is given in Table 1. The $FC\text{-}n$ and $FC\text{-}I$ layers
correspond to the $n$-action policy $\pi(\cdot|x_t)$ and the value function $\mathbf{V}(x_t) \in \mathbb{R}^{I \times 1}$, respectively.
The screen is resized to an $84 \times 84 \times 3$ RGB image as the network input. In the proposed K-shot
RL setting, a network starts with a single row: a CNN having layer $l$ with feature
maps $X_{l-1}^1$ and weights $W_l^1$ trained when the objective number $i = 1$ in Algorithm 4.
When switching to the training of the second objective, the parameters $W_l^1$ are frozen and
a new row with weights $W_l^2$ is instantiated, where feature maps $X_l^2$ receive input from both
$X_{l-1}^1$ and $X_{l-1}^2$ via lateral connections with weights $FW_l^1$ and $W_l^2$. The generalization to $I$
objectives is given by
$$X_l^i = f\Big(W_l^i * X_{l-1}^i + \sum_{j<i} FW_l^{i:j} * X_{l-1}^j\Big), \quad (24)$$
where $W_l^i$ is the weight matrix of layer $l$ of row $i$, $FW_l^{i:j}$ are the lateral connections from
layer $l-1$ of row $j$, and $*$ is the convolution operation. In summary, for the $i$th objective, $\theta_{IO}$
represents the weights for the $(i-1)$th objective, and $\theta_{IS}^i$ is $\theta \setminus \theta_{IO}$. The proposed approach
is shown in Algorithm 5.
Table 1: Architecture specification.
Layer #      1                  2                  3       4
Parameters   C4×4-32×i, S2      C4×4-32×i, S2      FC-n    FC-I
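To illustrate the lateral connections in (24), the sketch below gives the forward pass of one layer of column i, with dense matrices standing in for the convolutions of Table 1; it is a schematic of the progressive-column idea under those simplifying assumptions, not the authors' TensorFlow implementation.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def column_layer(x_prev_i, x_prev_lower, W_l_i, FW_l, f=relu):
    """Forward pass of one layer of column i, Eq. (24):
    X_l^i = f( W_l^i * X_{l-1}^i + sum_{j<i} FW_l^{i:j} * X_{l-1}^j ).
    `x_prev_lower` and `FW_l` are lists over the already-frozen columns j < i."""
    out = W_l_i @ x_prev_i
    for FW, x_prev_j in zip(FW_l, x_prev_lower):
        out = out + FW @ x_prev_j          # lateral connection from column j
    return f(out)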
[Figure 3: the progressive CNN architecture of Table 1, with one Conv-Conv-FC-FC column per objective and lateral connection weights $FW_l^{i:j}$ between columns.]
and Reward Forward (Rfor). For Humanoid-v2, we select five objectives: Mean Episode
Length (Mel), Mean Episode Reward (Mer), Linear Velocity (Lvel), Quadratic Control
(Qctrl), and Quadratic Impact Cost (Qim). For HumanoidStandup-v2, three objectives,
namely, Standup Cost (Stc), Quadratic Control (Qctrl), and Quadratic Impact Cost (Qim),
are selected.
We use the proximal policy optimization clipping algorithm with $\epsilon = 0.2$ as the
optimizer. The discounting factor is selected as $\gamma = 0.99$. One episode, characterizing the
number of time steps of the vectorized environment per update, is chosen as 10240. For
the purpose of stabilization, we execute parallel episodes in each batch. The batch size is
chosen as the product of the episode size and the number of environment copies simulated
in parallel. The number of environment copies is selected as 8. The results are an average
of 6 runs. The parameters are optimized using the Adam algorithm (Kingma & Ba, 2014)
and a learning rate of 3 × 10−4 . The dimension of z, namely, M , is selected as the num-
ber of actions in each scenario. All experiments were conducted using TensorFlow, which
allows for automatic differentiation through the gradient updates (Abadi, et al., 2016), on
a NVIDIA GeForce RTX 2070 GPU.
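For reference, the reported training hyperparameters can be collected as follows; the dictionary and its key names are illustrative only and are not taken from the authors' code.

# Training configuration as reported in Section 6.1 (values from the text).
config = {
    "clip_epsilon": 0.2,         # PPO clipping parameter
    "gamma": 0.99,               # discount factor
    "steps_per_update": 10240,   # time steps of the vectorized environment per update
    "num_envs": 8,               # environment copies simulated in parallel
    "batch_size": 10240 * 8,     # episode size x number of parallel copies
    "learning_rate": 3e-4,       # Adam step size
    "num_runs": 6,               # results averaged over 6 runs
}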
[Figure: heat map of the learned weight matrix W for the Humanoid-v2 objectives (Mel, Mer, Qctrl, Lvel, Qim); the color scale ranges from 0.0 to 1.0.]
From the matrix, we can observe that the first two objectives Mel and Mer, correspond-
ing to the first two rows, are independent. In other words, Mel and Mer will be optimized
without affecting other objectives, corresponding to Figures 1b and 1c in the computing of the min-norm point. It can also be observed that the last three objectives, namely
Qctrl, Lvel and Qim, are conflicting with Mel and Mer. In particular, the third row of W
indicates that objective 3 has direct impact on objectives 4 and 5. Similarly, the fourth
and fifth rows of W indicate that objectives 4 and 5 have direct impact on other objec-
339
Zhan & Cao
tives except the first two objectives, corresponding to Figure 1a in the computing of the
min-norm point. In other words, Qctrl, Lvel, and Qim are dependent. Moreover, Qctrl
has larger impact on Lvel than that on Qim because w34 = 0.208 > 0.087 = w35 . Hence,
the quantitative relationship among these objectives is explicitly described by the matrix.
The gradients, taking a vector-valued form, reflect the weighted sum of the gradients of all
objectives based on the impact of each objective on other (possibly conflicting) objectives.
Each element of the vector value function will gradually converge to a value with a very
small variance.
Figure 5a and Figure 5b show the time complexity of paolsSubroutine and aolsSub-
routine. In order to show that the proposed PAOLS subroutine provides much better scal-
ability with respect to the dimension and the number of gradients, we report the amount
of CPU time spent in a user-mode code (outside the kernel). In Figure 5a, when the num-
ber of the gradients, namely, |V|, is 5, the CPU time of the proposed PAOLS subroutine
grows much slower than the AOLS subroutine as the dimension of z increases. Similarly,
in Figure 5b, when the dimension of z is 5, the logarithm of the CPU time of the proposed
PAOLS subroutine also grows much slower than the AOLS subroutine. All results are based
on an average of 100 runs.

Figure 5: The amount of CPU time spent in user-mode code for the paolsSubroutine and aolsSubroutine: (a) versus the dimension of z when |V| = 5; (b) versus the number of gradients (log scale) when the dimension of z is 5.
Figure 6: Evolution of ∆r (W) with respect to time steps in percentage. MOO LWP:
proposed multi-objective optimization approach with discovered weights and PAOLS.
MOO LW: proposed multi-objective optimization approach with discovered weights.
et al., 2016; Dhariwal, et al., 2017), and compare it with the proposed method. It is worth-
while to emphasize that HumanoidStandup-v2 does not have a specified reward threshold
beyond which “stand up” is considered successful. Figure 7 shows the performance of var-
ious algorithms on the three scenarios. It can be observed that the proposed MOO LWP
and MOO LW methods outperform other methods in yielding higher rewards. In addition,
the use of the proposed PAOLS algorithm in PPO can improve the performance of the
standard PPO method using the AOLS algorithm.
Figure 7: Comparison among MOO LW, MOO LWP, PPO CCS, PPO CCS (PAOLS),
PPO equalWeights, and PPO with single objective on Ant-v2, Humanoid-v2, and
HumanoidStandup-v2.
Figure 8: The actor π (x; θ) trained on the Humanoid-v2. The surfaces represent the func-
tions. The blue dots show the trajectories in the state-action space using the current policy. The
red and blue contours show the forces that shape the surfaces.
shape the surfaces. The lines show the probabilities for one particular action. After the
157th episode, the contours show that the learned policy becomes more and more stable
because the policy becomes more consistent across episodes, indicating that the policy is
“settled” after an appropriate number of episodes.
It is worth noting that the stability of the proposed method is determined by both the
stability of the actor-critic network and the stability in learning W. Moreover, the stability
of the actor-critic network and the stability in learning W are interdependent. On the one
hand, if the actor-critic network is stable, W can often be stabilized since W becomes the
main one to be learned after the stabilization of the actor-critic network. On the other
hand, if W is stable, the actor-critic network becomes the main one to be trained after
the value of W is stabilized. Notice that W described in Subsection 6.2 becomes stable
as the training proceeds. Hence, the actor-critic network will also become stable, which is
consistent with the obtained stable action policy shown in Figure 8.
7. Conclusions
In multi-objective optimization problems, the possibly conflicting objectives necessitate a
trade-off when multiple objectives need to be optimized simultaneously. A typical approach is
to minimize a loss given by a weighted linear summation of all objective functions. However, this
approach may only be effective in limited cases (e.g., when the objectives do not compete). To
address the potential competing nature among these objectives, we proposed a new efficient
gradient-based multi-objective deep reinforcement learning approach to solve high-dimensional
multi-objective decision making problems in continuous control environments. The proposed
method optimizes vectorized proxy objectives sequentially based on proximal policy
optimization, an actor-critic network, and the derivation of the optimal weights via the marginal
weight using the new efficient PAOLS method.
By explicitly quantifying the inter-objective relationship via solving for the min-norm point in
the convex optimization, the relative importance of objectives that is unknown a priori can be
obtained via reinforcement learning. Each entry in the discovered weight matrix W specifies and
explains the relative impact of one objective on another objective in the optimization step.
Acknowledgment
The work was supported by the Office of Naval Research under Grants N000141712613 and
N000141912278.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat,
S., Irving, G., Isard, M., et al. (2016). Tensorflow: a system for large-scale machine
learning. In USENIX Symposium on Operating Systems Design and Implementation,
Vol. 16, pp. 265–283.
Abels, A., Roijers, D. M., Lenaerts, T., Nowé, A., & Steckelmacher, D. (2018). Dy-
namic weights in multi-objective deep reinforcement learning. arXiv preprint
arXiv:1809.07803.
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2016). Reinforcement
learning through asynchronous advantage actor-critic on a gpu. arXiv preprint
arXiv:1611.06256.
Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria.
In International Conference on Machine Learning, pp. 41–47.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba,
W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
Désidéri, J.-A. (2012). Multiple-gradient descent algorithm (MGDA) for multiobjective
optimization. Comptes Rendus Mathematique, 350 (5-6), 313–318.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J.,
Sidor, S., & Wu, Y. (2017). Openai baselines. GitHub, GitHub Repository.
Ehrgott, M. (1995). Lexicographic max-ordering: A solution concept for multicriteria combinatorial optimization.
Fliege, J., & Svaiter, B. F. (2000). Steepest descent methods for multicriteria optimization.
Mathematical Methods of Operations Research, 51 (3), 479–494.
Gordon, G., & Tibshirani, R. (2012). Karush-kuhn-tucker conditions. Optimization,
10 (725/36), 725.
Guo, X., Singh, S., Lee, H., Lewis, R. L., & Wang, X. (2014). Deep learning for real-time
Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural
Information Processing Systems, pp. 3338–3346.
Hosseinzade, E., & Hassanpour, H. (2011). The Karush-Kuhn-Tucker optimality condi-
tions in interval-valued multiobjective programming problems. Journal of Applied
Mathematics & Informatics, 29 (5-6), 1157–1165.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
Konak, A., Coit, D. W., & Smith, A. E. (2006). Multi-objective optimization using genetic
algorithms: A tutorial. Reliability Engineering & System Safety, 91 (9), 992–1007.
Lin, J. G. (2005). On min-norm and min-max methods of multi-objective optimization.
Mathematical Programming, 103 (1), 1–33.
Liu, C., Xu, X., & Hu, D. (2015). Multiobjective reinforcement learning: A comprehensive
overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45 (3),
385–398.
Maddison, C. J., Huang, A., Sutskever, I., & Silver, D. (2014). Move evaluation in Go using
deep convolutional neural networks. arXiv preprint arXiv:1412.6564.
Mannor, S., & Shimkin, N. (2002). The steering approach for multi-criteria reinforcement
learning. In Advances in Neural Information Processing Systems, pp. 1563–1570.
Mannor, S., & Shimkin, N. (2004). A geometric approach to multi-criterion reinforcement
learning. Journal of Machine Learning Research, 5, 325–360.
Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based optimization of markov reward
processes. IEEE Transactions on Automatic Control, 46 (2), 191–209.
Miettinen, K., & Mäkelä, M. (1995). Interactive bundle-based method for nondifferentiable
multiobjective optimization: NIMBUS. Optimization, 34 (3), 231–246.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,
A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik,
A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D.
(2015). Human-level control through deep reinforcement learning. Nature, 518 (7540),
529.
Mossalam, H., Assael, Y. M., Roijers, D. M., & Whiteson, S. (2016). Multi-objective deep
reinforcement learning. arXiv preprint arXiv:1610.02707.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershel-
vam, V., Suleyman, M., Beattie, C., Petersen, S., et al. (2015). Massively parallel
methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296.
Nakayama, H., Yun, Y., & Yoon, M. (2009). Sequential approximate multiobjective opti-
mization using computational intelligence. Springer Science & Business Media.
Nguyen, T. T. (2018). A multi-objective deep reinforcement learning framework. arXiv
preprint arXiv:1803.02965.
Oh, J., Guo, X., Lee, H., Lewis, R. L., & Singh, S. (2015). Action-conditional video pre-
diction using deep networks in Atari games. In Advances in Neural Information
Processing Systems, pp. 2863–2871.
Pan, X., You, Y., Wang, Z., & Lu, C. (2017). Virtual to real reinforcement learning for
autonomous driving. arXiv preprint arXiv:1704.03952.
Parisi, S., Pirotta, M., & Peters, J. (2017). Manifold-based multi-objective policy search
with sample reuse. Neurocomputing, 263, 3–14.
Parisi, S., Pirotta, M., & Restelli, M. (2016). Multi-objective reinforcement learning through
continuous pareto manifold approximation. Journal of Artificial Intelligence Research,
57, 187–227.
Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., & Restelli, M. (2014a). Policy gradient
approaches for multi-objective sequential decision making. In International Joint
Conference on Neural Networks (IJCNN), pp. 2323–2330.
Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., & Restelli, M. (2014b). Policy gra-
dient approaches for multi-objective sequential decision making: A comparison. In
IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning
(ADPRL), pp. 1–8.
Peitz, S., & Dellnitz, M. (2016). Gradient-based multiobjective optimization with uncer-
tainties. arXiv preprint arXiv:1612.03815.
Pinder, J. (2016). Multi-objective reinforcement learning framework for unknown stochastic
& uncertain environments. Ph.D. thesis, University of Salford.
Pirotta, M., Parisi, S., & Restelli, M. (2015). Multi-objective reinforcement learning with
continuous pareto frontier approximation. In Twenty-Ninth AAAI Conference on
Artificial Intelligence.
Poirion, F., Mercier, Q., & Désidéri, J.-A. (2017). Descent algorithm for nonsmooth stochas-
tic multiobjective optimization. Computational Optimization and Applications, 68 (2),
317–331.
Rockafellar, R. T., & Wets, R. J.-B. (1991). Scenarios and policy aggregation in optimization
under uncertainty. Mathematics of Operations Research, 16 (1), 119–147.
Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multi-objective
sequential decision-making. Journal of Artificial Intelligence Research, 48, 67–113.
Roijers, D. M., Whiteson, S., Oliehoek, F. A., et al. (2014a). Linear support for multi-
objective coordination graphs. In The International Conference on Autonomous
Agents & Multiagent Systems, pp. 1297–1304.
Roijers, D. M., Scharpff, J., Spaan, M. T., Oliehoek, F. A., De Weerdt, M., & Whiteson,
S. (2014b). Bounded approximations for linear multi-objective planning under un-
certainty. In International Conference on Automated Planning and Scheduling, pp.
262–270.
Roijers, D. M., Whiteson, S., & Oliehoek, F. A. (2015). Computing convex coverage sets
for faster multi-objective coordination. Journal of Artificial Intelligence Research, 52,
399–443.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay.
arXiv preprint arXiv:1511.05952.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015a). Trust region policy
optimization. In International Conference on Machine Learning, pp. 1889–1897.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015b). High-
dimensional continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347.
Sener, O., & Koltun, V. (2018). Multi-task learning as multi-objective optimization. In
Advances in Neural Information Processing Systems, pp. 527–538.
Shelton, C. R. (2001). Importance sampling for reinforcement learning with multiple ob-
jectives.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrit-
twieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe,
D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu,
K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural
networks and tree search. Nature, 529 (7587), 484.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre,
L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A
general reinforcement learning algorithm that masters chess, shogi, and Go through
self-play. Science, 362 (6419), 1140–1144.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation
gradient approximation. IEEE Transactions on Automatic Control, 37 (3), 332–341.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
Tajmajer, T. (2018). Modular multi-objective deep reinforcement learning with decision
values. In 2018 Federated Conference on Computer Science and Information Systems,
pp. 85–93.
Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2008).
Managing power consumption and performance of computing systems using reinforce-
ment learning. In Advances in Neural Information Processing Systems, pp. 1497–1504.
Todorov, E., Erez, T., & Tassa, Y. (2012). Mujoco: A physics engine for model-based control.
In International Conference on Intelligent Robots and Systems, pp. 5026–5033.
Tsitsiklis, J. N., & Van Roy, B. (1999). Average cost temporal-difference learning. Auto-
matica, 35 (11), 1799–1808.
Uchibe, E., & Doya, K. (2007). Constrained reinforcement learning from intrinsic and ex-
trinsic rewards. In 6th IEEE International Conference on Development and Learning,
pp. 163–168.
Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2011). Empirical evalua-
tion methods for multiobjective reinforcement learning algorithms. Machine Learning,
84 (1-2), 51–80.
Vamplew, P., Dazeley, R., & Foale, C. (2017). Softmax exploration strategies for multiob-
jective reinforcement learning. Neurocomputing, 263, 74–86.
Vamvoudakis, K. G., & Lewis, F. L. (2010). Online actor–critic algorithm to solve the
continuous-time infinite horizon optimal control problem. Automatica, 46 (5), 878–
888.
Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double
Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, Vol. 2, p. 5.
Van Moffaert, K., Drugan, M. M., & Nowé, A. (2013). Scalarized multi-objective rein-
forcement learning: Novel design techniques. In 2013 IEEE Symposium on Adaptive
Dynamic Programming and Reinforcement Learning, pp. 191–199.
Van Moffaert, K., & Nowé, A. (2014). Multi-objective reinforcement learning using sets
of pareto dominating policies. The Journal of Machine Learning Research, 15 (1),
3483–3512.
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016).
Dueling network architectures for deep reinforcement learning. In International Con-
ference on Machine Learning, pp. 1995–2003.
Ward, D., & Lee, G. (2001). Generalized properly efficient solutions of vector optimization
problems. Mathematical Methods of Operations Research, 53 (2), 215–232.
Yang, R., Sun, X., & Narasimhan, K. (2019). A generalized algorithm for multi-objective
reinforcement learning and policy adaptation. In Advances in Neural Information
Processing Systems, pp. 14636–14647.