
Neurocomputing 263 (2017) 1–2


Editorial

Special issue on multi-objective reinforcement learning


Madalina Drugan a, Marco Wiering b, Peter Vamplew c,∗, Madhu Chetty c

a Technical University of Eindhoven, Eindhoven, Netherlands
b Institute of Artificial Intelligence and Cognitive Engineering, University of Groningen, Groningen, Netherlands
c School of Engineering and Information Technology, Federation University, Ballarat, Australia
∗ Corresponding author.

Article history: Received 2 June 2017; Accepted 3 June 2017; Available online 26 June 2017

Abstract

Many real-life problems involve dealing with multiple objectives. For example, in network routing the criteria may consist of energy consumption, latency, and channel capacity, which are in essence conflicting objectives. Because many problems involve multiple (conflicting) objectives, there usually does not exist a single optimal solution. In those cases, it is desirable to obtain a set of trade-off solutions between the objectives. Over the last decade this problem has also gained the attention of many researchers in the field of reinforcement learning (RL). RL addresses sequential decision problems in initially (possibly) unknown stochastic environments. The goal is the maximization of the agent’s reward in an environment that is not always completely observable. The purpose of this special issue is to obtain a broader picture of the algorithmic techniques at the confluence of multi-objective optimization and reinforcement learning. The growing interest in multi-objective reinforcement learning (MORL) was reflected in the quantity and quality of submissions received for this special issue. After a rigorous review process, seven papers were accepted for publication, and they reflect the diversity of research being carried out within this emerging field. The accepted papers consider many different aspects of algorithmic design and evaluation, and this editorial places them in a unified framework.

© 2017 Elsevier B.V. All rights reserved.

1. Test environment

The practical motivation, such as novel approaches for challenging real-world applications or the development of new algorithms with improved computational efficiency for particular problems, is essential for any proposed technique. In this issue, all the papers use benchmark environments with two or three objectives. The Deep Sea Treasure task [2,3,6] is a bi-objective environment consisting of ten Pareto-optimal states, which has often been used for testing MORL algorithms. The Bonus World used in [7] is an original three-objective environment. Another bi-objective environment that has been used to evaluate a novel multi-objective RL algorithm is the Linked Rings problem [3]. Some of the environments used consist of continuous state variables. The Cart Pole problem [5] has two objectives with continuous state values that reflect the position and velocity of the cart and the angle and angular velocity of the pole. The Water Reservoir problem [1] models an agent controlling the water level of a reservoir with three conflicting objectives: flooding, water demand, and electricity demand. The Dynamic Economic Emissions Dispatch problem [6] involves scheduling electricity generators to meet the customers’ demand while minimizing fuel cost and emissions. Resource Gathering [2] and Predator Prey [5] are three-objective environments with stochastic transitions which are related to strategic games. Two of the above multi-objective environments have stochastic transition functions [1,2]; the other environments are deterministic. In [4], an agent navigates through a maze with continuous states that contains obstacles and different kinds of areas. The problem has one primary objective, while other secondary objectives are found with an unsupervised learning method and are subsequently solved with off-policy RL techniques.
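To make the notion of a multi-objective benchmark concrete, the following minimal Python sketch (purely illustrative, not code from any of the papers above and not a faithful reproduction of Deep Sea Treasure) shows an environment whose step function returns a reward vector with one component per objective instead of a single scalar; the goal positions and payoffs are invented for the example.

import numpy as np

class ToyTwoObjectiveEnv:
    # Toy episodic chain with a 2-component reward vector: objective 0 pays
    # off when a goal is reached, objective 1 charges -1 per time step, so
    # the two objectives conflict. Dynamics are invented for illustration.

    GOALS = {3: 1.0, 7: 5.0, 10: 12.0}   # hypothetical position -> payoff

    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (move back) or +1 (move forward)
        self.pos = max(0, self.pos + action)
        reward = np.array([self.GOALS.get(self.pos, 0.0), -1.0])
        done = self.pos in self.GOALS
        return self.pos, reward, done

A policy that cares only about the first component heads for the largest payoff, while one that cares only about the second stops at the nearest goal; the trade-offs in between are exactly the Pareto-optimal policies such benchmarks are designed to expose.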


2. The methodological approach

Many of the proposed MORL algorithms use variants of the Q-learning algorithm [2–7]. In [5], multi-objectivization is used to create additional objectives alongside the primary goal in order to improve empirical efficiency. The objectives are assumed to be independent, and Q-values for each objective are learned in parallel. On top of the multi-objectivization mechanism, reward shaping is used to incorporate heuristic knowledge. The goal is to learn the Pareto front of optimal policies. The algorithm proposed in [6] uses scalarization functions and the hypervolume unary indicator to transform the reward vectors into scalar reward values. Similarly to [5], the goal is to identify the Pareto front of optimal policies when additional rewards are added to each objective through reward shaping functions. The hypervolume unary indicator is also used in [1] to measure the performance of a policy-search MORL algorithm, whose empirical performance is improved using multiple importance sampling estimators. In [3], the authors use a variant of geometric steering for multi-objective stochastic games with scalarized reward vectors. The MORL algorithm in [4] is an interesting mixture of on-line learning for the first objective and off-line learning for two independently found secondary objectives. The secondary objectives are found using unsupervised learning, and their corresponding learned policies are useful when the primary task in the environment changes. In [7], one objective is considered more important than the second objective under a so-called lexicographic ordering, and to solve this problem the RL algorithm is integrated with new variants of the softmax exploration strategy. In another line of reasoning, the authors of [2] use Pareto dominance to partially order policies; not one but several policies, with associated Q-value vectors, are optimized simultaneously.
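As a concrete illustration of the operators named in this section, the sketch below gives plain-Python versions of a linear scalarization function, a Pareto-dominance test, and a two-objective hypervolume computation (both objectives maximized). It is a generic sketch under assumed conventions, not the implementation used in [1], [2] or [6]; the example vectors and reference point are hypothetical.

import numpy as np

def linear_scalarize(reward_vec, weights):
    # Collapse a reward (or Q) vector into a single scalar via a weighted sum.
    return float(np.dot(weights, reward_vec))

def dominates(a, b):
    # True if vector a Pareto-dominates b: no worse everywhere, better somewhere.
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b) and np.any(a > b))

def hypervolume_2d(points, ref):
    # Area jointly dominated by a set of 2-D points with respect to a
    # reference point ref that is dominated by every point (maximization).
    area, best_y = 0.0, ref[1]
    for x, y in sorted(points, key=lambda p: p[0], reverse=True):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

front = [(0.7, 8.2), (1.0, 5.0), (0.2, 9.0)]           # hypothetical returns
print(linear_scalarize(front[0], weights=(0.5, 0.5)))  # 4.45
print(dominates((1.0, 5.0), (0.7, 4.0)))               # True
print(hypervolume_2d(front, ref=(0.0, 0.0)))           # area dominated by the set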
3. Theoretical analysis

Only some of the papers in this special issue give theoretical guarantees on the expected behavior of the algorithms. In [5] and [6], proofs are provided for the convergence of MORL variants with reward shaping functions.

4. Short summary of papers in the current issue

The first three papers propose and evaluate the performance of reinforcement learning algorithms designed specifically for tasks involving multiple conflicting objectives.

• “Manifold-Based Multi-Objective Policy Search with Sample Reuse” by Simone Parisi, Matteo Pirotta, and Jan Peters: This paper extends prior approaches to policy-search learning of multiobjective policies by learning a manifold in policy-parameter space. Sampling points on this manifold can produce policies which accurately approximate the Pareto front of policies, which is more efficient than directly learning a set of these policies.
• “A Temporal Difference Method for Multi-Objective Reinforcement Learning” by Manuela Ruiz-Montiel, Lawrence Mandow and José-Luis Pérez-de-la-Cruz: Like Parisi et al., this work addresses the task of learning multiple policies which represent different Pareto-optimal tradeoffs between objectives. However, rather than policy search, this paper extends the temporal-difference Q-learning algorithm to the task of learning multiple Pareto-optimal policies (a simplified vector-valued TD update is sketched after this list).
• “Steering Approaches to Pareto-Optimal Multiobjective Reinforcement Learning” by Peter Vamplew, Rustam Issabekov, Richard Dazeley, Cameron Foale, Adam Berry, Tim Moore, and Douglas Creighton: This paper adapts the geometric steering algorithm, originally designed for stochastic multi-criteria games, to learning Pareto-optimal non-stationary policies for multiobjective Markov Decision Processes. It also provides an example of the application of the steering approach to the problem of controlling local battery storage for a household’s solar power system.
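The temporal-difference extension referenced in the second item above can be pictured as a Q-table whose entries are vectors, updated componentwise. The single-policy sketch below scalarizes the next-state Q-vectors only to pick the greedy action; it is a simplified illustration under assumed names, not the multi-policy algorithm of Ruiz-Montiel et al. [2].

import numpy as np
from collections import defaultdict

def make_q_table(n_objectives):
    # Q[s][a] is a value-estimate vector with one entry per objective.
    return defaultdict(lambda: defaultdict(lambda: np.zeros(n_objectives)))

def td_update(Q, s, a, r_vec, s_next, actions, weights, alpha=0.1, gamma=0.95):
    # Greedy next action chosen on the linearly scalarized next-state vectors;
    # the TD target and the update itself are computed componentwise.
    a_greedy = max(actions, key=lambda b: np.dot(weights, Q[s_next][b]))
    target = np.asarray(r_vec, dtype=float) + gamma * Q[s_next][a_greedy]
    Q[s][a] = Q[s][a] + alpha * (target - Q[s][a])
    return Q[s][a]

Learning several Pareto-optimal policies simultaneously, as [2] does, requires more than this single-weight update, for example maintaining sets of non-dominated Q-vectors; the sketch only shows the componentwise TD mechanics.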
The next two papers address the incorporation of additional objectives into an existing reinforcement learning task.

• “Identification and Off-Policy Learning of Multiple Objectives Using Adaptive Clustering” by Thommen Karimpanal George and Erik Wilhelm: In this paper additional objectives are discovered by the agent itself during its exploration of the environment, using online unsupervised clustering. It is shown that Q-learning can be used to learn, at least partially, the values associated with these additional objectives in parallel with learning to solve the primary goal, thereby minimizing the need for additional exploration should the goal change (a sketch of such parallel per-objective updates follows this list).
• “Multi-objectivization and Ensembles of Shapings in Reinforcement Learning” by Tim Brys, Anna Harutyunyan, Peter Vrancx, Matthew Taylor and Ann Nowé: This paper examines the use of multi-objectivization to improve the performance of a reinforcement learning agent on a single-objective task. Additional objectives are introduced either by decomposition of the original objective or based on external heuristic knowledge. This introduces an additional source of diversity, which supports the use of ensemble methods that significantly improve learning performance.
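The parallel, off-policy learning of values for additional objectives described in the two items above can be pictured as several objective-specific Q-tables updated from the same stream of experience, whichever objective is currently driving behaviour. The sketch below is a generic illustration under assumed names, not the method of [4] or the ensemble machinery of [5].

from collections import defaultdict

class ParallelObjectiveLearner:
    # One tabular Q-function per objective, all updated off-policy from the
    # same transitions (e.g. the primary task plus objectives discovered by
    # clustering, or shaped copies of a decomposed objective).
    def __init__(self, n_objectives, n_actions, alpha=0.1, gamma=0.95):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.Q = [defaultdict(float) for _ in range(n_objectives)]

    def update(self, s, a, r_vec, s_next):
        # r_vec holds one reward component per objective for this transition.
        for k, r in enumerate(r_vec):
            best_next = max(self.Q[k][(s_next, b)] for b in range(self.n_actions))
            td_error = r + self.gamma * best_next - self.Q[k][(s, a)]
            self.Q[k][(s, a)] += self.alpha * td_error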
The final two papers in the issue examine how methods which are widely used in single-objective reinforcement learning can be applied in the context of multiobjective reinforcement learning.

• “Policy Invariance under Reward Transformations for Multi-Objective Reinforcement Learning” by Patrick Mannion, Sam Devlin, Karl Mason, Jim Duggan and Enda Howley: Potential-Based Reward Shaping (PBRS) has been shown to be an effective means of accelerating learning in single-objective problems, with proven guarantees that it does not interfere with the final optimal policy. This paper extends these theoretical guarantees to the case of multiple objectives, for both single-agent and multi-agent systems. It also provides the first empirical results for the use of PBRS within multiobjective reinforcement learning (a per-objective shaping sketch follows this list).
• “Softmax Exploration Strategies for Multiobjective Reinforcement Learning” by Peter Vamplew, Richard Dazeley, and Cameron Foale: The effectiveness of exploration strategies has been widely studied in single-objective reinforcement learning, but this paper provides one of the first intensive studies of these techniques in the context of multiple objectives, showing that unexpected complications may arise due to the introduction of additional objectives. It also proposes and evaluates two multiobjective adaptations of the widely used softmax approach to exploration.
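The two techniques covered by the last two bullets can be written down compactly. The sketch below applies a potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s) independently to each objective and selects actions with a softmax over linearly scalarized Q-vectors; both functions are generic illustrations under assumed interfaces, not the implementations from [6] or [7].

import numpy as np

def shaped_reward(r_vec, s, s_next, potentials, gamma=0.99):
    # Add a potential-based shaping term gamma*Phi(s') - Phi(s) to every
    # objective; potentials holds one potential function per objective.
    shaping = np.array([gamma * phi(s_next) - phi(s) for phi in potentials])
    return np.asarray(r_vec, dtype=float) + shaping

def softmax_action(q_vectors, weights, temperature=1.0, rng=np.random):
    # Softmax exploration over actions applied to linearly scalarized
    # Q-vectors; q_vectors has shape (n_actions, n_objectives).
    scores = np.dot(np.asarray(q_vectors), weights) / temperature
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(rng.choice(len(probs), p=probs))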
Acknowledgments

We would like to thank all of the authors who submitted their work for this issue, as well as the reviewers who generously gave their time and expertise during the review process. We also wish to thank the editors of Neurocomputing who supervised an independent review process for those papers for which we had a conflict of interest.

References

[1] S. Parisi, M. Pirotta, J. Peters, Manifold-based multi-objective policy search with sample reuse, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[2] M. Ruiz-Montiel, L. Mandow, J.-L. Pérez-de-la-Cruz, A temporal difference method for multi-objective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[3] P. Vamplew, R. Issabekov, R. Dazeley, C. Foale, A. Berry, T. Moore, D. Creighton, Steering approaches to Pareto-optimal multiobjective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[4] T.G. Karimpanal, E. Wilhelm, Identification and off-policy learning of multiple objectives using adaptive clustering, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[5] T. Brys, A. Harutyunyan, P. Vrancx, M. Taylor, A. Nowé, Multi-objectivization and ensembles of shapings in reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[6] P. Mannion, S. Devlin, K. Mason, J. Duggan, E. Howley, Policy invariance under reward transformations for multi-objective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
[7] P. Vamplew, R. Dazeley, C. Foale, Softmax exploration strategies for multiobjective reinforcement learning, Neurocomputing (2017). Special issue on multi-objective reinforcement learning.
