Special Issue on Multi-Objective Reinforcement Learning
Neurocomputing
Editorial
Article history: Received 2 June 2017; Accepted 3 June 2017; Available online 26 June 2017

Abstract

Many real-life problems involve multiple objectives. For example, in network routing the criteria may consist of energy consumption, latency, and channel capacity, which are in essence conflicting objectives. As many problems involve multiple (conflicting) objectives, there usually does not exist a single optimal solution. In those cases, it is desirable to obtain a set of trade-off solutions between the objectives. This problem has in the last decade also gained the attention of many researchers in the field of reinforcement learning (RL). RL addresses sequential decision problems in initially (possibly) unknown stochastic environments. The goal is the maximization of the agent’s reward in an environment that is not always completely observable. The purpose of this special issue is to obtain a broader picture of the algorithmic techniques at the confluence of multi-objective optimization and reinforcement learning. The growing interest in multi-objective reinforcement learning (MORL) was reflected in the quantity and quality of submissions received for this special issue. After a rigorous review process, seven papers were accepted for publication, and they reflect the diversity of research being carried out within this emerging field. The accepted papers consider many different aspects of algorithmic design and evaluation, and this editorial puts them in a unified framework.
optimal policies when additional rewards are added to each objective through reward shaping functions. The hypervolume unary indicator is also used in [1] to measure the performance of a policy-search MORL algorithm; the empirical performance is improved using multiple importance sampling estimators. In [3], the authors use a variant of geometric steering for multi-objective stochastic games with scalarized reward vectors. The MORL algorithm in [4] is an interesting mixture of on-line learning for the first objective and off-line learning for two independently found secondary objectives. The secondary objectives are found using unsupervised learning, and their corresponding learned policies are useful when the primary task changes in the environment. In [7], one objective is considered more important than the second objective under a so-called lexicographic ordering, and to solve this problem the RL algorithm is integrated with new variants of the softmax exploration strategy. In another line of reasoning, the authors of [2] use Pareto dominance to partially order policies; not one, but several policies with associated Q-value vectors are simultaneously optimized.
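To make the Pareto-dominance ordering concrete, the short Python sketch below shows how value vectors of candidate policies can be partially ordered and filtered to a non-dominated set. It is only an illustrative example, not code from any of the accepted papers; the example vectors and the maximization convention are assumptions made for the illustration.

from typing import List, Tuple

def dominates(u: Tuple[float, ...], v: Tuple[float, ...]) -> bool:
    """Return True if value vector u Pareto-dominates v (maximization):
    u is at least as good on every objective and strictly better on at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(values: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the non-dominated value vectors (an approximation of the Pareto front)."""
    return [u for u in values if not any(dominates(v, u) for v in values if v != u)]

# Hypothetical Q-value vectors of four policies over two objectives
# (e.g., negated energy consumption and negated latency).
policy_values = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
print(pareto_front(policy_values))  # (0.5, 0.5) is dominated by (0.6, 0.6) and is removed

Because dominance yields only a partial order, several mutually non-dominated policies can remain, which is why the algorithms discussed above maintain sets of policies rather than a single one.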
3. Theoretical analysis

Only some of the papers in this special issue give theoretical guarantees on the expected behavior of the algorithms: in [5] and [6], proofs are provided for the convergence of MORL variants with reward shaping functions.
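As background for these results, the following is a sketch of the standard potential-based shaping form from the single-objective literature; the per-objective extension written here is only one natural formulation and is not an equation reproduced from [5] or [6]. With a state potential \Phi and discount factor \gamma, adding the shaping term

    F(s, s') = \gamma \Phi(s') - \Phi(s)

to the reward leaves the optimal policy of the original problem unchanged. A multi-objective analogue attaches one potential per objective,

    F_i(s, s') = \gamma \Phi_i(s') - \Phi_i(s), \quad i = 1, \dots, m,

applied componentwise to the reward vector; [6] analyses when such shaping preserves the multi-objective solution, and [5] studies ensembles of such shapings for a single-objective task.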
4. Short summary of papers in the current issue

The first three papers propose and evaluate the performance of reinforcement learning algorithms designed specifically for tasks involving multiple conflicting objectives.

• “Manifold-Based Multi-Objective Policy Search with Sample Reuse” by Simone Parisi, Matteo Pirotta, and Jan Peters: This paper extends prior approaches to policy-search learning of multiobjective policies by learning a manifold in policy-parameter space. Sampling points on this manifold can produce policies which accurately approximate the Pareto front of policies, which is more efficient than directly learning a set of these policies.
• “A Temporal Difference Method for Multi-Objective Reinforcement Learning” by Manuela Ruiz-Montiel, Lawrence Mandow and José-Luis Pérez-de-la-Cruz: Like Parisi et al., this work addresses the task of learning multiple policies which represent different Pareto-optimal tradeoffs between objectives. However, rather than policy search, this paper extends the temporal-difference Q-learning algorithm to the task of learning multiple Pareto-optimal policies.
• “Steering Approaches to Pareto-Optimal Multiobjective Reinforcement Learning” by Peter Vamplew, Rustam Issabekov, Richard Dazeley, Cameron Foale, Adam Berry, Tim Moore, and Douglas Creighton: This paper adapts the geometric steering algorithm, originally designed for stochastic multi-criteria games, to learning Pareto-optimal non-stationary policies for multiobjective Markov Decision Processes. It also provides an example of the application of the steering approach to the problem of controlling local battery storage for a household’s solar power system.

The next two papers address the incorporation of additional objectives into an existing reinforcement learning task.

• “Identification and Off-Policy Learning of Multiple Objectives Using Adaptive Clustering” by Thommen Karimpanal George and Erik Wilhelm: In this paper additional objectives are discovered by the agent itself during its exploration of the environment, using online unsupervised clustering. It is shown that Q-learning can be used to learn, at least partially, the values associated with these additional objectives in parallel with learning to solve the primary goal, thereby minimizing the need for additional exploration in case the goal changes (a minimal sketch of this kind of parallel off-policy update follows this list).
• “Multi-objectivization and Ensembles of Shapings in Reinforcement Learning” by Tim Brys, Anna Harutyunyan, Peter Vrancx, Matthew Taylor and Ann Nowé: This paper examines the use of multi-objectivization to improve the performance of a reinforcement learning agent on a single-objective task. Additional objectives are introduced either by decomposition of the original objective or based on external heuristic knowledge. This introduces an additional source of diversity, which supports the use of ensemble methods that significantly improve the learning performance.
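The sketch below illustrates the general idea of learning value estimates for several reward signals off-policy from a single stream of experience, as in the first paper of this group. It is not the algorithm of [4]: the clustering step that discovers the secondary objectives is not shown, the reward signals are assumed to be given, and all names and constants are hypothetical.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
ACTIONS = [0, 1, 2, 3]

# One tabular Q-function per reward signal; index 0 is the primary objective,
# the others stand in for objectives identified during exploration.
q_tables = [defaultdict(float) for _ in range(3)]

def act(state):
    # Epsilon-greedy behaviour policy driven by the primary objective only.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_tables[0][(state, a)])

def update_all(state, action, rewards, next_state):
    # One off-policy Q-learning backup per reward signal, all sharing the same transition.
    for q, r in zip(q_tables, rewards):
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += ALPHA * (r + GAMMA * best_next - q[(state, action)])

# e.g., after each environment step: update_all(s, a, (primary_r, r1, r2), s_next)

Because every Q-table is updated from the same transitions, the values of the secondary objectives are acquired without dedicated exploration, which is the property emphasised in the summary above.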
The final two papers in the issue examine how methods which are widely used in single-objective reinforcement learning can be applied in the context of multiobjective reinforcement learning.

• “Policy Invariance under Reward Transformations for Multi-Objective Reinforcement Learning” by Patrick Mannion, Sam Devlin, Karl Mason, Jim Duggan and Enda Howley: Potential-Based Reward Shaping (PBRS) has been shown to be an effective means of accelerating learning in single-objective problems, with proven guarantees that it does not interfere with the final optimal policy. This paper extends these theoretical guarantees to the case of multiple objectives, for both single-agent and multi-agent systems. It also provides the first empirical results for the use of PBRS within multiobjective reinforcement learning.
• “Softmax Exploration Strategies for Multiobjective Reinforcement Learning” by Peter Vamplew, Richard Dazeley, and Cameron Foale: The effectiveness of exploration strategies has been widely studied in single-objective reinforcement learning, but this paper provides one of the first intensive studies of these techniques in the context of multiple objectives, showing that unexpected complications may arise due to the introduction of additional objectives. It also proposes and evaluates two multiobjective adaptations of the widely used softmax approach to exploration (an illustrative scalarized variant is sketched after this list).
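One simple way to lift softmax exploration to the multiobjective setting is to scalarize each action's Q-vector before applying the Boltzmann distribution, as sketched below. This is only an illustrative baseline under an assumed linear scalarization with given weights, and is not necessarily either of the two adaptations proposed in [7].

import math, random

def softmax_action(q_vectors, weights, temperature=1.0):
    # Linearly scalarize each action's Q-vector, then sample an action from the
    # softmax (Boltzmann) distribution over the scalarized scores.
    scores = {a: sum(w * q for w, q in zip(weights, qs)) / temperature
              for a, qs in q_vectors.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    total = sum(exps.values())
    r, acc = random.random() * total, 0.0
    for a, e in exps.items():
        acc += e
        if r <= acc:
            return a
    return a  # fallback for floating-point edge cases

# Hypothetical Q-vectors for three actions over two objectives, equal weights.
print(softmax_action({0: (1.0, 0.2), 1: (0.4, 0.9), 2: (0.1, 0.1)}, (0.5, 0.5)))

As the paper points out, such direct transplants of single-objective exploration schemes can behave unexpectedly once additional objectives are introduced, which motivates the dedicated multiobjective adaptations it evaluates.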
Acknowledgments

We would like to thank all of the authors who submitted their work for this issue, as well as the reviewers who generously gave their time and expertise during the review process. We also wish to thank the editors of Neurocomputing who supervised an independent review process for those papers for which we had a conflict of interest.

References

[1] S. Parisi, M. Pirotta, J. Peters, Manifold-based multi-objective policy search with sample reuse, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[2] M. Ruiz-Montiel, L. Mandow, J.-L. Pérez-de-la-Cruz, A temporal difference method for multi-objective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[3] P. Vamplew, R. Issabekov, R. Dazeley, C. Foale, A. Berry, T. Moore, D. Creighton, Steering approaches to Pareto-optimal multiobjective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[4] T.G. Karimpanal, E. Wilhelm, Identification and off-policy learning of multiple objectives using adaptive clustering, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[5] T. Brys, A. Harutyunyan, P. Vrancx, M. Taylor, A. Nowé, Multi-objectivization and ensembles of shapings in reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[6] P. Mannion, S. Devlin, K. Mason, J. Duggan, E. Howley, Policy invariance under reward transformations for multi-objective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[7] P. Vamplew, R. Dazeley, C. Foale, Softmax exploration strategies for multiobjective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.