
Received: 28 April 2023 | Revised: 5 July 2023 | Accepted: 3 August 2023

DOI: 10.1049/cit2.12269

ORIGINAL RESEARCH

CAAI Transactions on Intelligence Technology

A safe reinforcement learning approach for autonomous navigation of mobile robots in dynamic environments

Zhiqian Zhou1 | Junkai Ren1 | Zhiwen Zeng1 | Junhao Xiao1 | Xinglong Zhang1 | Xian Guo2 | Zongtan Zhou1 | Huimin Lu1

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China
2 Institute of Robotics and Automatic Information System, College of Artificial Intelligence, NanKai University, Tianjin, China

Correspondence
Zhiwen Zeng and Junhao Xiao.
Email: [email protected] and junhao.[email protected]

Funding information
National Natural Science Foundation of China, Grant/Award Numbers: 62203460, U1913202, U22A2059

[Correction added on 14 Oct 2023, after first online publication. The below correction needs to be noted. The affiliation of authors Zhiqian Zhou, Junkai Ren, Zhiwen Zeng, Junhao Xiao, Xinglong Zhang, Zongtan Zhou and Huimin Lu is corrected as below: "College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China"]

Abstract
When deploying mobile robots in real-world scenarios, such as airports, train stations, hospitals, and schools, collisions with pedestrians are intolerable and catastrophic. Motion safety becomes one of the most fundamental requirements for mobile robots. However, until now, efficient and safe robot navigation in such dynamic environments is still an open problem. The critical reason is that the inconsistency between navigation efficiency and motion safety is greatly intensified by the high dynamics and uncertainties of pedestrians. To face the challenge, this paper proposes a safe deep reinforcement learning algorithm named Conflict-Averse Safe Reinforcement Learning (CASRL) for autonomous robot navigation in dynamic environments. Specifically, it first separates the collision avoidance sub-task from the overall navigation task and maintains a safety critic to evaluate the safety/risk of actions. Later, it constructs two task-specific but model-agnostic policy gradients for the goal-reaching and collision avoidance sub-tasks to eliminate their mutual interference. Then, it further performs a conflict-averse gradient manipulation to address the inconsistency between the two sub-tasks. Finally, extensive experiments are performed to evaluate the superiority of CASRL. Simulation results show an average 8.2% performance improvement over the vanilla baseline in eight groups of dynamic environments, which is further extended to 13.4% in the most challenging group. Besides, forty real-world experiments fully illustrated that CASRL could be successfully deployed on a real robot.

KEYWORDS
autonomous navigation, dynamic environments, gradient-based policy optimisation, mobile robots, safe reinforcement learning

1 | INTRODUCTION

With the deployment of an increasing number of robots for various service tasks in the past decades, there has been a growing interest in robotics regarding the autonomous navigation of mobile robots in real-world dynamic environments. In the problem, a crucial bottleneck is how to ensure the motion safety of mobile robots. With various dynamic obstacles (e.g., pedestrians, other robots, and vehicles), the environment is time-varying, and it becomes difficult to find time-efficient and collision-free paths. One natural idea is to expand traditional autonomous navigation algorithms with the temporal dimension. It first predicts trajectories of dynamic obstacles over the coming period and then constructs time-varying obstacle constraints for the subsequent planning sessions [1–3]. However, both trajectory predicting and optimising with time-varying constraints are time-consuming. Besides, the two-stage framework may lead to the freezing robot problem [4] in which a great number of possible trajectories fill up the space, eliminating any plausible path.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
© 2023 The Authors. CAAI Transactions on Intelligence Technology published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology and Chongqing University of Technology.

CAAI Trans. Intell. Technol. 2023;1–16. wileyonlinelibrary.com/journal/cit2



Meanwhile, benefiting from the impressive non-linear fitting capacity of deep neural networks, deep reinforcement learning (DRL) has shown impressive potential in autonomous navigation, especially in dynamic and complex environments [5–13]. These methods formulate the robot autonomous navigation problem as a Markov decision process (MDP) and solve it via reinforcement learning (RL). Their core idea is to continuously improve the navigation policy based on reward signals from interactions with the environment [14]. Therefore, their performance highly depends on a proper task-specific reward function.

Since autonomous robot navigation is essentially a multi-objective optimisation problem, the corresponding reward function always consists of multiple components for different objectives. Some widely-used reward components include rewards on reaching or approaching the goal [10, 15] and penalties on collisions with obstacles, violations of social norms [16], or predictable collisions [17, 18]. The former is built to navigate the robot to its goal as fast as possible, while the latter is built to keep the robot away from unsafe or high-risk states. Though these reward components can be integrated into the overall navigation task by introducing a minus to penalties, they are not always consistent and sometimes even conflicted. In simple scenarios, reward shaping via carefully weighting multiple components is enough to balance time efficiency and motion safety [9, 15, 16]. However, in more realistic and complex dynamic environments, it is hard to design a proper reward function manually [10, 17, 18]. There exists mutual interference among different objectives. As a result, though DRL-based approaches are capable of performing better than traditional methods in navigation efficiency, they fail to deal with the motion safety problem, and most of them are limited to laboratory environments [10, 11, 15].

FIGURE 1 Three frameworks of SRL. Paths and blocks in cyan highlight their difference from the standard DRL. Paths in red indicate the back-forward propagation of gradients. The structure-based framework maintains task-specific modules for the collision avoidance sub-task [19]. Constraint-based solutions sum up two value functions and then generate a policy gradient for the policy network without any consideration of the inconsistency between two sub-tasks [20, 21]. The proposed gradient-based framework constructs an optimisation problem to determine the final policy gradient f(g_r, g_c) = w_r g_r + w_c g_c, which balances two sub-tasks by weighting policy gradients for sub-tasks. (a) Structure-based SRL. (b) Constraint-based SRL. (c) Gradient-based SRL.

Over the past few years, extensive work from safe reinforcement learning (SRL) has been proposed to improve the safety of DRL-based approaches, which coincidentally detaches safety components from the original reward function. As shown in Figure 1, constraint-based approaches extend the MDP into a constrained MDP and explicitly construct a constraint on safety. By rescaling policy gradients with Lagrange multipliers [20–22], or introducing a safety-constraint-based penalty to the policy objective [23, 24], they can limit constraint violations under a given threshold. However, such a threshold is somewhat contrary to the collision-free requirement in autonomous robot navigation. Apart from them, some structure-based approaches maintain a safety critic to estimate the risk of actions or state-action pairs. Once the risk is over a given threshold, unsafe actions will be projected into safe zones via predefined rules [25, 26]. However, since such an action editing function is also hard to determine, the latest research furthermore designs additional task-specific policy networks [27, 28] or modules [19] as the action editor.

This work follows the idea of task decomposition from previous SRL methods [19–21] but further separates sub-tasks in the aspect of policy gradient. In particular, it first decomposes the navigation task into two sub-tasks, goal-reaching and collision avoidance, by grouping reward components by sub-tasks, which is the common basis of extensive SRL methods. Later, from the grouped rewards, a critic and a safety critic can be learnt for the goal-reaching and collision avoidance sub-tasks, respectively. Based on them, this work further constructs two policy gradients to explicitly separate the two sub-tasks. And then, to deal with the inconsistency between the two sub-tasks, it proposes an optimisation objective of maximising the worse local policy improvement over the sub-tasks. Under the guidance of the above ideas, this work presents an SRL approach for autonomous robot navigation in dynamic environments and shows its superiority via extensive simulation experiments. Finally, a real-world experiment proves that the proposed algorithm can be migrated effectively to real-world applications. To sum up, the contribution of this work is threefold.

• A gradient-based SRL framework for autonomous robot navigation in dynamic environments is designed, which constructs two task-specific but model-agnostic policy gradients to eliminate the mutual interference between the goal-reaching and collision avoidance sub-tasks. The proposed framework can be combined with all existing DRL-based approaches for autonomous robot navigation.
• A novel conflict-averse policy optimisation method is presented, which can effectively reconcile the inconsistency between the two sub-tasks and ensure an efficient and stable training process. To our best knowledge, it is the first work to improve motion safety in autonomous robot navigation using gradient manipulations.
• A practical autonomous robot navigation algorithm named Conflict-Averse Safe Reinforcement Learning (CASRL) is proposed, which outperforms the vanilla algorithm by an 8.2% success rate and fully illustrates the effectiveness of the two proposed innovations.

This paper is structured as follows. Section 2 presents related work, and Section 3 depicts the proposed CASRL algorithm. Section 4 verifies the effectiveness of CASRL with extensive simulation and real-world experiments. Finally, Section 5 concludes the whole work.

2 | RELATED WORK

In this section, we first briefly review recent DRL-based approaches for autonomous robot navigation in dynamic environments. And then, we go further to highlight related work on SRL in robotics and gradient-based optimisation in multi-task learning, which directly inspire this work.

2.1 | DRL for robot navigation in dynamic environments

Encouraged by the superior performance in video games [29], DRL methods have been intensively studied for autonomous robot navigation in dynamic environments over the last few years. Sensor-level methods take the raw perceptual information from laser scans and cameras as inputs of networks and then learn a sensorimotor control policy by RL [30, 31]. With an accurate sensor simulation model, their policies learnt from the simulation environment can be migrated to real robots. However, due to the lack of low-level motion properties and high-level interaction models for obstacles, their decisions are inherently reactive and may lead to oscillatory motions in complex dynamic environments, such as crowd scenarios with many pedestrians. To address the problem, the authors in Refs. [8, 9, 32] unanimously utilise another network, named a world transition model, to estimate the motion properties of moving objects by supervised learning, which greatly enhances mobile robots' ability to avoid collisions proactively. Yet, world transition models are far from enough to model sophisticated interactions in real-world dynamic environments. Besides, since world transition models are always constructed in the form of images or occupancy maps, either the training or the inference of the world transition model is time-consuming.

To achieve more proactive and foresighted avoidance of dynamic obstacles, some researchers turn their attention to the agent-level representation of surrounding environments and focus on modelling interactions among agents. These methods follow the classic two-stage framework, which detects obstacles first and then learns a collision avoidance policy from the processed agent-level representation [15, 16]. The separation of detection and navigation makes it possible to perform higher-level inference than sensor-level methods. Representative achievements include the relational graph model in Ref. [33], the human-like gaze model in Ref. [5], and the social attention model in Ref. [10]. They enable mobile robots to assess disturbances from surrounding obstacles and identify critical threats to the robot. Based on them, some researchers furthermore introduce historical trajectory information [11, 34], semantic information (e.g. types of obstacles) [18, 35], and velocity-obstacle-based interactive information [17, 18] to help mobile robots understand scenarios better. Though the agent-level representation is manually designed, experimental results show that this high-level information, especially interactive information from traditional collision avoidance models, plays an important role in improving the motion safety of mobile robots in dynamic environments.

Although these methods achieve continuous performance improvements, they suffer from a common problem arising from the scalar reward function. Rewards for arriving at and approaching the goal, and penalties for collisions with obstacles and high-risk states, are all integrated into a scalar reward function. However, the navigation task is inherently multi-objective, and different objectives may be inconsistent or conflicted. A scalar reward function may result in mutual interference among different objectives. Therefore, reward shaping, via carefully fine-tuning reward weights [11, 33, 34] or introducing other so-called reasonable reward components [16–18], is a critical point for their final performance in motion safety.

To deal with the mutual interference among different objectives, this work decomposes the navigation task into two sub-tasks, goal-reaching and collision avoidance, and correspondingly groups the rewards/penalties. Then two policy gradients are constructed from the learnt critic and safety critic to separate the two sub-tasks. Finally, a gradient-based policy optimisation method is proposed to resolve the inherent inconsistency between the goal-reaching and collision avoidance sub-tasks. In practical implementation, considering that modelling interactions among agents is the key to dealing with the collision avoidance problem in dynamic environments, a heterogeneous graph representation from Ref. [18] is used to describe navigation scenarios.

2.2 | SRL in robotics

Over the past few years, although much promising progress has been made by DRL in robot navigation, the motion safety problem of robots is still open, especially in real-world applications. To address the safety problem, many researchers are turning their attention to SRL. Their key innovations are twofold: separating cost functions (related to different safety constraints) from the scalar reward function in the traditional MDP and constructing explicit constraints on costs in the optimisation objective [20–24]. The former greatly alleviates the problem of carefully balancing safety and performance with a problematic hand-designed scalar reward function. Meanwhile, the latter can theoretically prevent the optimised policy from constraint violations.

However, unlike constraints on velocity, acceleration, or energy consumption, a positive threshold for the collision avoidance constraint means a permissible possibility of collisions. On the one hand, it is hard to define the threshold of motion safety [20–23].

On the other hand, the positive threshold is contrary to the collision-free requirement of autonomous robot navigation. Meanwhile, for methods using a Lagrange multiplier to balance time efficiency and motion safety, it is necessary to evaluate the safety performance online [20–22]. Finally, current benchmark environments in SRL consider only static or simple dynamic obstacles. There is rarely any interaction between obstacles in these environments, which differs from real-world navigation scenarios. These problems make it difficult to deploy these SRL algorithms in real-world autonomous robot navigation scenarios [36].

Meanwhile, other researchers concentrate on editing unsafe actions. They also separate the cost/penalty function for collision and construct a safety critic to evaluate the expected cumulative cost. In deployment, the robot can identify high-risk actions with the safety critic and then project them into safe zones with predefined rules [25, 26]. However, predefined collision avoidance policies do not necessarily work by themselves in dynamic or complex environments. Therefore, a two-policy framework is proposed in Ref. [27], which learns a task policy to maximise the task reward under non-emergency cases and a recovery policy to minimise collision probability under emergency cases. Another similar work is the safety editor framework proposed in Ref. [19]. It also trains two policies: a utility maximiser to maximise the utility reward and a safety editor to adjust unsafe actions from the utility maximiser into safe actions. Benefiting from the explicit structural separation between the primary task and the collision avoidance task, both considerably improve safety.

Inspired by the above research work [19–21, 26], this work also divides the robot navigation task into two sub-tasks: the primary goal-reaching sub-task and the secondary collision avoidance sub-task. On this basis, we further propose a new optimisation objective to reconcile the inconsistencies between the two sub-tasks. Compared with existing methods, there is no need to set an upper bound and perform an online evaluation of safety performance, or to introduce additional task-specific modules to edit unsafe actions.

2.3 | Gradient-based optimisation in multi-task learning

Multi-task learning (MTL) aims to simultaneously deal with multiple tasks with a single model [37–39]. Compared to learning with separate models, MTL methods achieve better data and computational efficiency with a smaller model size by sharing parameters across tasks. However, since the objectives of multiple tasks are always inconsistent, a challenging optimisation problem is to harmonise multiple gradients with varying directions and magnitudes [40–42].

To tackle the MTL problem, several architectural solutions have been proposed based on task-specific modules [43], attention mechanisms [38], or activating different paths along deep networks [44]. Apart from reconstructing network architectures, another branch of methods decomposes a large task into smaller but general sub-task modules and then aligns the learning targets of tasks and sub-task modules with knowledge distillation [45].

Recently, considering that over-parameterised networks have adequate capacity to allow harmonious joint training for multiple tasks, a number of gradient-based methods focus on balancing the learning process of each task via manipulation of task-specific but model-agnostic gradients. Representative work includes MGDA-UB [37], CAGrad [41], PCGrad [40], and Meta-GF [42]. Among them, both MGDA-UB and CAGrad formulate MTL as a multi-objective optimisation problem and focus on maximising the worse local policy improvement of sub-tasks. Their difference lies in the search area of parameters. PCGrad identifies conflicting gradients with cosine similarity and then suppresses the interfering components. Meta-GF takes a meta-learnt weighted fusion policy to combine multi-task gradients. Experimental results verify that all of them can effectively improve the learning process on a series of challenging multi-task supervised, semi-supervised, and RL problems.

3 | METHODOLOGY

This section proposes a novel safe reinforcement learning approach named CASRL. As shown in Figure 2, the network consists of three components: an actor, a critic, and a safety critic. Firstly, the critic is trained with rewards from the goal-reaching sub-task, while the safety critic is trained with costs from collision cases. Then, by substituting the learnt critic and safety critic into the policy gradient equation, two task-specific but model-agnostic policy gradients are obtained for both sub-tasks. Finally, via a conflict-averse gradient manipulation of maximising the worst local policy improvement of the sub-tasks, the inconsistency between the two sub-tasks can be alleviated in policy optimisation. Additionally, in practical implementation, the framework is built on the heterogeneous graph representation of states. Details of the approach are as follows.

3.1 | Safe Markov decision process

The critical point to improving motion safety is the notion of safety for target scenarios. Therefore, this work formulates the autonomous navigation problem in the form of the safe MDP proposed in Refs. [21, 26]. The safe MDP can be represented with a tuple ⟨S, A, P, γ_r, R, γ_c, C⟩. Compared with the standard MDP, the key improvement is that the single unified reward function in the standard MDP is divided into two portions, the reward function R : S × S × A → ℝ for the goal-reaching sub-task and the cost function C : S × S × A → ℝ for the collision avoidance sub-task. Their discount factors are denoted by γ_r and γ_c. For a clear comparison to the reward function, the penalty for unsafe states and actions is used to build the cost function C. It means that the cost value is set positive when a collision occurs.

Due to the distinct separation of sub-tasks, the policy optimisation process can also be adapted accordingly.

FIGURE 2 The overall framework of CASRL.

Firstly, the optimisation objective can be explicitly divided into two components: maximising the expected cumulative returns $J_r^{\pi_\theta}$ while minimising the expected cumulative costs $J_c^{\pi_\theta}$:

$$J_r^{\pi_\theta} = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma_r^t R(s_t, a_t, s_{t+1})\right], \qquad J_c^{\pi_\theta} = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma_c^t C(s_t, a_t, s_{t+1})\right] \tag{1}$$

where τ = s_0, a_0, …, a_{t−1}, s_t is a trajectory following policy π_θ, and a_t is the action sampled from policy π_θ(⋅|s_t).

The two optimisation objectives are built for the goal-reaching sub-task and the collision avoidance sub-task, respectively. Then, it is possible to differentiate the sub-tasks in the aspect of policy gradient. For the goal-reaching sub-task, the deterministic policy gradient is

$$\nabla_\theta J_r^{\pi_\theta} = \mathbb{E}_\tau\left[\nabla_a Q_r^{\pi_\theta}(s, a)\,\nabla_\theta \pi_\theta(s)\big|_{a=\pi_\theta(\cdot|s)}\right] \tag{2}$$

with

$$Q_r^{\pi_\theta}(s, a) = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma_r^t R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\ a_0 = a\right] \tag{3}$$

where ∇ indicates the partial derivative, and $Q_r^{\pi_\theta}(s, a)$ is the expected sum of rewards for the goal-reaching sub-task.

For the collision avoidance sub-task, the deterministic policy gradient $\nabla_\theta J_c^{\pi_\theta}$ replaces the reward function and discount factor in Equation (2) with the cost function C and the cost discount factor γ_c, which is depicted as follows:

$$\nabla_\theta J_c^{\pi_\theta} = -\mathbb{E}_\tau\left[\nabla_a Q_c^{\pi_\theta}(s, a)\,\nabla_\theta \pi_\theta(s)\big|_{a=\pi_\theta(\cdot|s)}\right] \tag{4}$$

with

$$Q_c^{\pi_\theta}(s, a) = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma_c^t C(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\ a_0 = a\right]. \tag{5}$$

Here, $Q_c^{\pi_\theta}(s, a)$ is the expected sum of costs. Note that the deterministic policy gradient for the collision avoidance sub-task introduces an additional minus. It is caused by the fact that the optimisation objective for the collision avoidance sub-task is to minimise the expected cumulative costs.

3.2 | Training of critic and safety critic

Since the optimal critic $Q_r^{\pi_\theta}(s, a)$ and the optimal safety critic $Q_c^{\pi_\theta}(s, a)$ are unavailable, two neural networks parameterised by φ and ψ are used to estimate them. By temporal-difference (TD) learning, $Q_r^{\varphi}(s, a)$ and $Q_c^{\psi}(s, a)$ are regarded as approximate estimates of the optimal ones and can be used to replace them in the policy gradient:

$$Q_r^{\varphi}(s, a) \rightarrow Q_r^{\pi_\theta}(s, a), \qquad Q_c^{\psi}(s, a) \rightarrow Q_c^{\pi_\theta}(s, a). \tag{6}$$

Specifically, the critic is updated by the clipped double Q-learning proposed in Ref. [46], which maintains two separate value networks to resolve the overestimation problem caused by the critic. Meanwhile, a separate target network is built for stable off-policy training. It is the same as the online network except that its parameters, $\hat{\theta}$, $\hat{\varphi}$, and $\hat{\psi}$, are learnt via soft updates. Finally, the target value for the two value networks in the critic is written as

$$y_r = r + \gamma_r \min_{j=1,2} Q_r^{\hat{\varphi}_j}\!\left(s', \pi_{\hat{\theta}}(s')\right) \tag{7}$$

where s′ = P(s, a) is the subsequent state after taking action a in state s, and r = R(s, a, s′) is the immediate reward for the state transition.

As for the safety critic, considering that the collision avoidance sub-task only dominates the navigation task in high-risk cases, the overestimation problem no longer happens. Therefore, only one value network is used, and its target value is

$$y_c = c + \gamma_c \max\left\{ Q_c^{\hat{\psi}}\!\left(s', \pi_{\hat{\theta}}(s')\right),\ 0 \right\} \tag{8}$$

where c = C(s, a, s′) is the immediate cost for the state transition. Considering that the cost value is always non-negative, a ReLU operator is used to avoid underestimating the collision risk.
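For concreteness, the two target computations translate into a few lines of code. The following is a minimal PyTorch sketch of Equations (7) and (8); the network modules, the transition batch, and all names are illustrative assumptions rather than the authors' implementation, and episode-termination masking is omitted for brevity.

```python
import torch

def td_targets(batch, actor_target, critic_target_1, critic_target_2,
               safety_critic_target, gamma_r, gamma_c):
    """Sketch of the TD targets in Equations (7) and (8)."""
    with torch.no_grad():
        next_action = actor_target(batch.next_state)
        # Equation (7): clipped double Q-learning takes the minimum of the
        # two target critics to counter overestimation.
        q1 = critic_target_1(batch.next_state, next_action)
        q2 = critic_target_2(batch.next_state, next_action)
        y_r = batch.reward + gamma_r * torch.min(q1, q2)
        # Equation (8): a single safety critic, clamped at zero so the
        # expected cumulative cost is never underestimated.
        q_c = safety_critic_target(batch.next_state, next_action)
        y_c = batch.cost + gamma_c * torch.clamp(q_c, min=0.0)
    return y_r, y_c
```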
Then, by replacing the optimal critic $Q_r^{\pi_\theta}(s, a)$ and safety critic $Q_c^{\pi_\theta}(s, a)$ in the policy gradient equations (Equations 2 and 4) with the learnt critic $Q_r^{\varphi_1}(s, a)$ and safety critic $Q_c^{\psi}(s, a)$, the policy gradients for the sub-tasks are estimated by

$$\nabla_\theta J_r^{\pi_\theta} = \mathbb{E}_\tau\left[\nabla_a Q_r^{\varphi_1}(s, a)\,\nabla_\theta \pi_\theta(s)\big|_{a=\pi_\theta(\cdot|s)}\right], \qquad \nabla_\theta J_c^{\pi_\theta} = -\mathbb{E}_\tau\left[\nabla_a Q_c^{\psi}(s, a)\,\nabla_\theta \pi_\theta(s)\big|_{a=\pi_\theta(\cdot|s)}\right]. \tag{9}$$

3.3 | Conflict-averse policy optimisation

Finally, the key problem is updating the policy network with the policy gradients for the sub-tasks. Considering that the minimisation objective can be transformed into a maximisation objective by introducing a minus, an intuitive and simple combination of the optimisation objectives can be described by

$$\arg\max_\theta J_{r,c}^{\pi_\theta} = \arg\max_\theta\, J_r^{\pi_\theta} - J_c^{\pi_\theta} = \arg\max_\theta \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma_r^t R(s_t, a_t, s_{t+1}) - \gamma_c^t C(s_t, a_t, s_{t+1})\right]. \tag{10}$$

Then, according to the sum rule of derivatives, the parameterised policy network can be updated by the following gradient:

$$\nabla_\theta J_{r,c}^{\pi_\theta} = \nabla_\theta J_r^{\pi_\theta} + \nabla_\theta J_c^{\pi_\theta}. \tag{11}$$

Even though the optimisation objective in Equation (10) can combine different components, it ignores the inherent inconsistency, even conflict, between the two sub-tasks. For example, in the autonomous robot navigation problem, the goal-reaching sub-task aims to navigate the robot along time-efficient paths, while the collision avoidance sub-task drives the robot away from high collision-risk regions. When an obstacle lies between the robot and its goal, an obvious conflict exists between the goal-reaching and collision avoidance sub-tasks. To deal with the inconsistency between them, this work further proposes a conflict-averse policy optimisation algorithm via gradient manipulation.

In particular, to mitigate the inconsistency between the two sub-tasks, the optimisation objective for each policy update is constructed based on the minimum performance improvement on the sub-tasks, as subsequently defined:

$$\arg\max_\theta\ \min\left\{ \Delta J_r^{\pi_\theta},\ -\Delta J_c^{\pi_\theta} \right\} \tag{12}$$

where $\Delta J_r^{\pi_\theta}$ and $-\Delta J_c^{\pi_\theta}$ are the local policy improvements of expected returns and costs at each update step. When there is only a slight inconsistency between the goal-reaching and collision avoidance sub-tasks, such an optimisation objective aims to simultaneously improve the performance of both sub-tasks. When there exists a clear conflict between the two sub-tasks, the optimisation objective focuses on minimising the performance degradation on the sub-tasks. The following is a detailed solution to this optimisation problem.

Given a small step size α and an update vector g ∈ ℝ^m, the updated policy network is determined by θ′ = θ + αg. By using the first-order Taylor approximation, Equation (12) can be represented by

$$\arg\max_{g\in\mathbb{R}^m}\ \min_{i\in\{r,c\}} g_i^{\mathrm{T}} g \qquad \text{s.t.}\ \|g - g_0\| \le \sigma \|g_0\| \tag{13}$$

where T is the transpose operation for vectors and ‖⋅‖ denotes the L2 norm. The policy gradients are flattened into gradient vectors: g_r for $\nabla_\theta J_r^{\pi_\theta}$ and g_c for $\nabla_\theta J_c^{\pi_\theta}$. The inequality constrains the best policy update vector to a local ball centred at the averaged gradient vector g_0 = (g_r + g_c)/2, with a radius controlled by a hyper-parameter σ > 0.

Considering $\min_{i\in\{r,c\}} g_i^{\mathrm{T}} g = \min_{\omega_r,\omega_c} \sum_{i\in\{r,c\}} \omega_i g_i^{\mathrm{T}} g$, the Lagrangian objective of Equation (13) can be written as follows:

$$\max_{g\in\mathbb{R}^m}\ \min_{\lambda>0,\,\omega\in W}\ g_\omega^{\mathrm{T}} g - \frac{\lambda}{2}\|g - g_0\|^2 + \frac{\lambda}{2}\sigma\|g_0\|^2 \tag{14}$$

where ω = [ω_r, ω_c] ∈ W is the normalised weight and g_ω = ω_r g_r + ω_c g_c is the weighted sum of the two policy gradient vectors. For the convex optimisation problem, both Slater's condition and strong duality hold when σ > 0. Therefore, it is allowed to exchange the min and max:

$$\min_{\lambda>0,\,\omega\in W}\ \max_{g\in\mathbb{R}^m}\ g_\omega^{\mathrm{T}} g - \frac{\lambda}{2}\|g - g_0\|^2 + \frac{\lambda}{2}\sigma\|g_0\|^2. \tag{15}$$

With λ and ω fixed, the optimal gradient vector is $g^*_{\omega,\lambda} = g_0 + g_\omega/\lambda$. Substituting it into Equation (15), another dual optimisation problem is

$$\min_{\lambda>0,\,\omega\in W}\ g_\omega^{\mathrm{T}} g_0 + \frac{1}{2\lambda}\|g_\omega\|^2 + \frac{\lambda}{2}\sigma\|g_0\|^2. \tag{16}$$

Further, letting $\lambda^* = \frac{1}{\sigma^{1/2}}\,\|g_\omega\|/\|g_0\|$, the optimisation problem becomes

$$\min_{\omega}\ g_\omega^{\mathrm{T}} g_0 + \sigma^{1/2}\|g_\omega\|\,\|g_0\|. \tag{17}$$

This optimisation problem is equal to Equation (12), but more time-efficient to solve, with only two optimisation parameters.

After several steps of iterative optimisation on Equation (17), the optimal weight vector is approximated by ω*. Substituting ω* into $g^*_{\omega,\lambda}$, the optimal policy gradient vector for Equation (13) is $g^* = \omega_r^* g_r + \omega_c^* g_c$.
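A compact way to realise this combination step is sketched below in PyTorch. It solves Equation (17) over the two normalised weights with a few gradient steps and then returns g* = ω*_r g_r + ω*_c g_c; the parameterisation of ω, the optimiser, and the default σ are illustrative assumptions, not values from the paper.

```python
import torch

def conflict_averse_gradient(g_r, g_c, sigma=0.5, steps=25, lr=0.1):
    """Sketch of the conflict-averse combination of two flattened policy
    gradients (Equations 13-17). sigma is the radius hyper-parameter."""
    g_r, g_c = g_r.detach(), g_c.detach()
    g_0 = 0.5 * (g_r + g_c)                      # averaged gradient vector
    norm_g0 = g_0.norm()

    # Parameterise the normalised weight as (w, 1 - w) with a sigmoid and
    # minimise the dual objective of Equation (17) over a few steps.
    z = torch.zeros(1, requires_grad=True)
    optim = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        w_r = torch.sigmoid(z)
        g_w = w_r * g_r + (1.0 - w_r) * g_c
        loss = torch.dot(g_w, g_0) + (sigma ** 0.5) * g_w.norm() * norm_g0
        optim.zero_grad()
        loss.backward()
        optim.step()

    with torch.no_grad():
        w_r = torch.sigmoid(z)
        return w_r * g_r + (1.0 - w_r) * g_c
```

The returned vector is the update direction used for the actor in Algorithm 1.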
The proposed CASRL is summarised in Algorithm 1, which is built on the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) [46]. It maintains a single actor, a single safety critic, and a pair of critics. For each episode, the critics are updated towards the minimum target reward value (line 14), while the safety critic is updated to approximate the target cost value (line 15). Every f iterations, the policy is updated along the optimal conflict-averse gradient vector g* with the step size α (lines 17–20). Meanwhile, the target networks are updated with τ (line 21).

Algorithm 1: Conflict-Averse Safe Reinforcement Learning
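The pseudocode listing of Algorithm 1 does not survive the page layout, so a hedged PyTorch-style sketch of one CASRL update is given instead, reusing the td_targets and conflict_averse_gradient helpers sketched above. The module, optimiser, and replay-buffer names are illustrative assumptions; the line numbers in the comments refer to the original listing.

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def flat_grad(scalar, module):
    """Flatten d(scalar)/d(parameters of module) into a single vector."""
    grads = torch.autograd.grad(scalar, list(module.parameters()))
    return torch.cat([g.reshape(-1) for g in grads]).detach()

def casrl_update(it, buffer, actor, actor_t, critics, critics_t, safety_critic,
                 safety_critic_t, critic_optim, safety_optim,
                 gamma_r=0.987, gamma_c=0.987, alpha=1e-3, tau=0.005, f=8):
    batch = buffer.sample()
    y_r, y_c = td_targets(batch, actor_t, critics_t[0], critics_t[1],
                          safety_critic_t, gamma_r, gamma_c)

    # Update both critics towards the clipped double-Q target (line 14).
    critic_loss = sum(((q(batch.state, batch.action) - y_r) ** 2).mean()
                      for q in critics)
    critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()

    # Update the safety critic towards the cost target (line 15).
    safety_loss = ((safety_critic(batch.state, batch.action) - y_c) ** 2).mean()
    safety_optim.zero_grad(); safety_loss.backward(); safety_optim.step()

    if it % f == 0:
        # Task-specific policy gradients of Equation (9).
        g_r = flat_grad(critics[0](batch.state, actor(batch.state)).mean(), actor)
        g_c = flat_grad(-safety_critic(batch.state, actor(batch.state)).mean(), actor)
        g_star = conflict_averse_gradient(g_r, g_c)

        # Ascent step along the conflict-averse direction (lines 17-20).
        theta = parameters_to_vector(actor.parameters())
        vector_to_parameters(theta + alpha * g_star, actor.parameters())

        # Soft update of all target networks (line 21).
        pairs = [(actor, actor_t), (critics[0], critics_t[0]),
                 (critics[1], critics_t[1]), (safety_critic, safety_critic_t)]
        for net, net_t in pairs:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```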

FIGURE 3 Indoor crowd scenario. During navigation, the robot can only observe obstacles, such as pedestrians, bins, and walls, within its sensor range. Its goal is marked by a red pentagram. Black arrows depict interactions among objects.

3.4 | Feature vector construction with HGAT

To help the robot navigate in dynamic environments, for example, the crowd scenario shown in Figure 3, this work follows the representation from Ref. [18], which constructs a feature vector with a heterogeneous graph attention network (HGAT). After simplification, it describes various navigation scenarios with four types of objects: the robot, dynamic circular obstacles, static circular obstacles, and static line obstacles. Their observable configurations are defined by

$$o_i = \begin{cases}
[d,\ v_0,\ v_m,\ \theta] \in \mathbb{R}^5, & \text{if } \mathrm{label}(i) = 0 \\
[p_i,\ \tilde{\rho}_i,\ v_i,\ v_{li},\ v_{ri},\ \mu_i,\ \xi_i] \in \mathbb{R}^{11}, & \text{if } \mathrm{label}(i) = 1 \\
[p_i,\ \tilde{\rho}_i,\ v_{li},\ v_{ri},\ \mu_i,\ \xi_i] \in \mathbb{R}^{9}, & \text{if } \mathrm{label}(i) = 2 \\
[p_{si},\ p_{ei},\ \rho_0,\ v_{li},\ v_{ri},\ \mu_i,\ \xi_i] \in \mathbb{R}^{11}, & \text{if } \mathrm{label}(i) = 3
\end{cases} \tag{18}$$

where all objects are numbered with subscript i (i ≥ 0): 0 for the robot and the others for obstacles. The function label(⋅) marks the type of each object. In the robot's configuration, d and θ describe the Euclidean distance and orientation from its current position to the given goal. Its velocity is denoted by v_0, limited by the maximum speed v_m. For circular obstacles, both dynamic and static, their positions are represented by p_i = [p_{xi}, p_{yi}]. Their radii are ρ̃_i = ρ_i + ρ_0, enlarged by the robot's radius ρ_0. Line obstacles, for example boundaries, are located by their start points p_{si} and end points p_{ei}. Meanwhile, they are also inflated by ρ_0. Besides, for all obstacles, a VO-inspired interactive configuration is introduced to enhance the state representation. Assuming an obstacle maintains its velocity v_i, the robot will collide with it only if the robot's velocity lies in the VO cone, as shown in Figure 4. The VO cone can be briefly described with a vertex v_i and two normalised direction vectors, v_li and v_ri, for the left and right rays [47].

FIGURE 4 An illustration of the velocity obstacle for circular (left) and static line obstacles (right).

Furthermore, to better distinguish the emergency of obstacle avoidance, the minimum separation distance μ_i and the expected collision time ς_i are introduced to represent the configuration of the obstacle. However, since ς_i is infinite when the robot's velocity lies outside the VO cone, an equivalent variable ξ_i = 1/(ς_i + 0.5) is used to replace ς_i. Finally, the vertex components of static obstacles are ignored since they have zero velocity.
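To make the interactive configuration concrete, a small NumPy sketch is given below. The left/right-ray and collision-time formulas follow the standard velocity-obstacle construction and are an assumption about the implementation; the paper itself only specifies μ_i and ξ_i = 1/(ς_i + 0.5).

```python
import numpy as np

def vo_features(p_robot, v_robot, rho0, p_obs, v_obs, rho_obs):
    """Sketch of the VO-inspired interactive configuration for one obstacle.

    Returns the VO apex (the obstacle velocity), the left/right ray
    directions v_li and v_ri, the minimum separation distance mu_i and
    xi_i = 1 / (collision_time + 0.5)."""
    d = p_obs - p_robot                      # robot-to-obstacle offset
    rho = rho_obs + rho0                     # radius enlarged by the robot's radius
    dist = np.linalg.norm(d)
    mu = max(dist - rho, 0.0)                # minimum separation distance

    # Left/right rays of the VO cone: rotate the unit offset by the half-angle.
    half = np.arcsin(min(rho / max(dist, 1e-6), 1.0))
    heading = np.arctan2(d[1], d[0])
    v_li = np.array([np.cos(heading + half), np.sin(heading + half)])
    v_ri = np.array([np.cos(heading - half), np.sin(heading - half)])

    # Expected collision time under constant velocities (inf if no collision).
    u = v_obs - v_robot                      # relative velocity of the obstacle
    a = float(u @ u)
    b = 2.0 * float(d @ u)
    c = float(d @ d) - rho ** 2
    disc = b * b - 4.0 * a * c
    varsigma = np.inf
    if a > 1e-9 and disc >= 0.0:
        t = (-b - np.sqrt(disc)) / (2.0 * a)
        if t >= 0.0:
            varsigma = t
    xi = 1.0 / (varsigma + 0.5)              # 0 when no collision is expected

    return v_obs, v_li, v_ri, mu, xi
```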
Then, the state s_t can be represented as a heterogeneous graph G(V, E) with nodes V for objects and edges E for interactions among objects. An edge from node i to node j means that there exists an interaction from object i to object j. In other words, object j is required to actively avoid collision with object i.

Based on the heterogeneous graph representation of states, the proposed method utilises two layers of HGATs to extract the feature vector. It has two main advantages: scalability in the number of nodes and attention-based message passing among nodes. A detailed setting and description of the HGAT can be found in Ref. [18]. Then, the feature vector is fed into subsequent Multi-Layer Perceptrons (MLPs) to construct the actor π_θ, the critic $Q_r^{\varphi}$, and the safety critic $Q_c^{\psi}$.
critic Qφr , and safety‐critic Qψc . 1.0 m/s and 1.0 m/s2. At the beginning of each training
episode, the robot is set with the initial and goal position of
(0.0, −4.0) and (0.0, 4.0). Additionally, the robot is assumed
4 | CASE STUDY to have a limited observation range of 6.0 m. Therefore, only
observed circular obstacles or segments of static linear ob-
4.1 | Simulation platform and settings stacles, rather than global obstacle information, are consid-
ered in navigation. Other detailed settings for the training
The simulation environment is adapted from CrowdNav [48]. process are followed.
As shown in Figure 5, there exists a robot, dynamic and static
circular obstacles, and static linear obstacles from the bound-
ary. The width and length of the room are 10.0 m. All dynamic 4.1.1 | Reward function
obstacles are navigated by ORCA [49], a widely used reciprocal
collision avoidance algorithm for multi‐agent simulation. The The reward function R is a sum of three components: rg, rd,
dynamic obstacles have a radius of 0.3 m and a maximum and rθ. Among them, rd and rθ are the dense rewards designed
velocity of 1 m/s. To interact with the robot frequently, some to navigate the robot approaching and steering towards its
dynamic obstacles are initialised to simulate the circle‐crossing goal, and rg is the sparse immediate reward for reaching its
phenomenon. Their initial and goal positions are roughly goal. Namely,
symmetrically placed on the central circle with a radius of
g
4.0 m. Moreover, random noises uniformly distributed be- rt ¼ rt þ rtt þ rθt : ð19Þ
tween −0.5 and 0.5 m are added to their positions. Apart from
circle‐crossing dynamic obstacles, the remaining are initialised
g
with random initial and goal positions in the room to mimic where rt , rdt , and rθt are given by


$$\begin{aligned}
r_t^g &= \begin{cases} \kappa_1, & \text{if } d_t < \eta_1 \\ 0, & \text{otherwise} \end{cases} \\
r_t^d &= \kappa_2\,(d_{t-1} - d_t), \\
r_t^\theta &= \kappa_3\,(\cos(\theta_t) - 1)/(d_t + \eta_2),
\end{aligned} \tag{20}$$

where d_t and θ_t are the distance and the orientation from the robot's position to the robot's goal at step t, respectively. The goal tolerance of the robot η_1 is set to 0.2 m. The hyper-parameter η_2 is used to smooth and limit the value of $r_t^\theta$, with a value of 5.0 m. The parameters κ_1, κ_2, and κ_3 are used to balance the different reward components, and their values are 25.0, 1.0, and 1.0.

4.1.2 | Cost function

The cost function C only includes the sparse immediate cost of κ_4 = 25.0 on collisions, which is given by

$$c_t = \begin{cases} \kappa_4, & \text{if } \min_{i>0} \mu_i \le 0 \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

where μ_i is the minimal distance from the robot to object i.
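The reward and cost of Equations (19)–(21) translate directly into code. The sketch below uses the parameter values quoted in the text; the function signatures themselves are illustrative.

```python
import numpy as np

# Parameter values taken from the text above.
KAPPA1, KAPPA2, KAPPA3, KAPPA4 = 25.0, 1.0, 1.0, 25.0
ETA1, ETA2 = 0.2, 5.0   # goal tolerance [m], smoothing distance [m]

def reward(d_prev, d_t, theta_t):
    """Reward of Equations (19)-(20): sparse goal bonus plus dense shaping."""
    r_g = KAPPA1 if d_t < ETA1 else 0.0
    r_d = KAPPA2 * (d_prev - d_t)                              # progress towards the goal
    r_theta = KAPPA3 * (np.cos(theta_t) - 1.0) / (d_t + ETA2)  # heading alignment
    return r_g + r_d + r_theta

def cost(min_separations):
    """Cost of Equation (21): sparse penalty when any separation drops to zero."""
    return KAPPA4 if min(min_separations) <= 0.0 else 0.0
```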
4.1.3 | Action space

The robot is assumed to have a common two-wheeled differential drive model, and its action $a_t = [a_t^l, a_t^r]$ denotes the acceleration of the wheels ($a_t^l$ for the left wheel and $a_t^r$ for the right one). Considering the acceleration constraint, both $a_t^l$ and $a_t^r$ are limited to [−a_max, a_max], where a_max = 1.0 is the maximal acceleration.
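Because the learnt action is a pair of wheel accelerations while a real base (Section 4.3) is driven by linear and angular velocity commands, a conversion such as the following hedged sketch is needed; the wheel-separation value is a placeholder, not a figure from the paper.

```python
import numpy as np

def wheel_accels_to_cmd(v_l, v_r, a_l, a_r, dt=0.01,
                        a_max=1.0, v_max=1.0, wheel_sep=0.33):
    """Integrate the planned wheel accelerations into a (linear, angular)
    velocity command for a differential-drive base (illustrative sketch)."""
    a_l = np.clip(a_l, -a_max, a_max)        # acceleration constraint of Sec. 4.1.3
    a_r = np.clip(a_r, -a_max, a_max)
    v_l = np.clip(v_l + a_l * dt, -v_max, v_max)
    v_r = np.clip(v_r + a_r * dt, -v_max, v_max)
    v = 0.5 * (v_l + v_r)                    # linear velocity command
    omega = (v_r - v_l) / wheel_sep          # angular velocity command
    return v_l, v_r, v, omega
```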
4.2 | Simulation results and discussions

Four methods, MPC (Model Predictive Control) from Ref. [50], TEB (Timed-Elastic-Band) from Ref. [51], HGAT-DRL from Ref. [18], and HGAT-DRL-Lag adapted from Ref. [20], are implemented as baseline algorithms. Among them, both MPC and TEB are traditional optimisation-based approaches, while HGAT-DRL is a recent DRL-based approach. HGAT-DRL-Lag is a variant of HGAT-DRL, which introduces a Lagrange multiplier λ_l to balance the two sub-tasks adaptively. The Lagrange multiplier is updated by −λ_l(ϕ − ϕ′), where ϕ and ϕ′ are the current collision rate and the threshold of the collision rate. In the implementation, ϕ is estimated on the last 500 episodes, and ϕ′ is set to 10%.

This work proposes three approaches: NGSRL, PCSRL, and CASRL, which differ in how they update the policy network. Specifically, NGSRL updates the policy network with the gradient vector g_0 = (g_r + g_c)/2, without considering the consistency between the goal-reaching and collision avoidance sub-tasks. PCSRL adapts PCGrad [40] to the proposed SRL framework. Specifically, its optimal policy gradient vector is defined by

$$g^* = \begin{cases}
0.5\,(g_r + g_c), & \text{if } g_r^{\mathrm{T}} g_c \ge 0 \\[2pt]
0.5\left(g_r + g_c - \dfrac{g_r^{\mathrm{T}} g_c}{\|g_r\|^2}\, g_r\right), & \text{otherwise.}
\end{cases} \tag{22}$$
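For reference, Equation (22) corresponds to the following few lines, a sketch of the two-gradient case of PCGrad-style gradient surgery:

```python
import torch

def pcgrad_combine(g_r, g_c):
    """Sketch of the PCSRL combination in Equation (22): when the two
    sub-task gradients conflict, the interfering component of g_c along
    g_r is projected out before averaging."""
    dot = torch.dot(g_r, g_c)
    if dot >= 0:                                   # no conflict: plain average
        return 0.5 * (g_r + g_c)
    g_c_proj = g_c - dot / g_r.norm() ** 2 * g_r   # remove the conflicting component
    return 0.5 * (g_r + g_c_proj)
```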
CASRL is a further development of PCSRL. It aims to maximise the worse local policy improvement of the sub-tasks to deal with the inherent inconsistency between the two sub-tasks.

The five DRL-based approaches are each trained five times with different random seeds. In the training process, all network parameters are trained by the Adam optimiser with a learning rate of 0.001. Both discount factors for expected cumulative rewards and costs, γ_r and γ_c, are set to 0.987. Since this work is adapted from the TD3 algorithm, the hyper-parameters for training include the exploration noise, policy noise, noise clip, update rate, and update frequency, with values of 0.2, 0.2, 0.2, 0.005, and 8, respectively. With these hyper-parameters and a 13th-generation i7 CPU from Intel, the training of HGAT-DRL, NGSRL, and CASRL spent 11,659 s, 13,569 s, and 13,936 s, respectively. The separation of the two sub-tasks only requires an additional 1,910 s, and the proposed conflict-averse policy optimisation only increases the training time by 367 s. Therefore, there is no need to worry about an excessive computational burden caused by the proposed conflict-averse policy optimisation.

For a comprehensive and fair comparison, all methods are evaluated with 12,000 random test cases. These cases are grouped based on the number of dynamic circular obstacles (ranging from 3 to 10) and static circular obstacles (ranging from 3 to 5), with 500 cases for each combination. Eight metrics are used to evaluate the different algorithms quantitatively, including "Suc. Rate", "Col. Rate", "Nav. Time", "Col. Time", "Dis. Number", "Avg. Return", "Avg. Cost", and "Over. Perf.". They describe the success rate, collision rate, navigation time of success cases, time of response in collision cases, number of violations of the safety zones of dynamic obstacles, discounted cumulative return over steps, discounted cumulative cost over steps, and overall performance, respectively.

4.2.1 | Performance comparison

Statistical results are shown in Table 1, and the training curves of the DRL-based methods are depicted in Figure 6. Since the random seed does not affect TEB and MPC, their standard deviations are always 0. As expected, TEB and MPC achieve a terrible performance in the simulation environment. Especially, since MPC does not take into account the motion properties of obstacles, its performance is catastrophic, with a success rate of only 17.5%.

TABLE 1 Statistical results of different algorithms.

Methods | Suc. Rate ↑ | Col. Rate ↓ | Nav. Time (s) ↓ | Col. Time (s) ↑ | Dis. Number ↓ | Avg. Return ↑ | Avg. Cost ↓ | Over. Perf. ↑
MPC [50] | 0.132 ± 0.000 | 0.867 ± 0.000 | 18.88 ± 0.00 | 4.53 ± 0.00 | 1819.3 ± 0.0 | 3.45 ± 0.00 | 19.93 ± 0.00 | −16.49 ± 0.00
TEB [52] | 0.756 ± 0.000 | 0.236 ± 0.000 | 13.79 ± 0.00 | 5.21 ± 0.00 | 1358.8 ± 0.0 | 16.59 ± 0.00 | 4.96 ± 0.00 | 11.63 ± 0.00
HGAT-DRL [18] | 0.767 ± 0.037 | 0.233 ± 0.38 | 10.97 ± 0.49 | 5.80 ± 0.43 | 1061.5 ± 165.9 | 17.78 ± 0.61 | 5.08 ± 0.87 | 12.69 ± 1.47
HGAT-DRL-Lag [20] | 0.833 ± 0.021 | 0.138 ± 0.023 | 15.82 ± 1.03 | 10.61 ± 0.71 | 929.4 ± 75.0 | 17.37 ± 0.75 | 2.68 ± 0.48 | 14.69 ± 1.07
NGSRL (ours) | 0.809 ± 0.018 | 0.190 ± 0.018 | 11.72 ± 0.34 | 6.72 ± 0.43 | 1010.1 ± 117.7 | 18.38 ± 0.33 | 4.05 ± 0.40 | 14.33 ± 0.71
PCSRL (ours) | 0.829 ± 0.017 | 0.170 ± 0.017 | 12.19 ± 0.16 | 7.34 ± 0.30 | 1066.6 ± 82.5 | 18.61 ± 0.52 | 3.56 ± 0.37 | 15.05 ± 0.88
CASRL (ours) | 0.849 ± 0.010 | 0.148 ± 0.009 | 12.76 ± 0.40 | 7.67 ± 0.28 | 968.6 ± 95.0 | 18.82 ± 0.25 | 3.11 ± 0.19 | 15.71 ± 0.42

Note: ± is used to indicate the standard deviation. All DRL-based algorithms are tested with five models from different random seeds. The upward/downward arrow indicates that the higher/lower the metric is, the better the algorithm. In each metric, the best two are bolded, and the best one is marked with an additional wave line.

FIGURE 6 Training curves from HGAT-DRL, HGAT-DRL-Lag, NGSRL, PCSRL, and CASRL, which are coloured blue, orange, green, red, and purple, respectively. (a) Training curve of episodic reward. (b) Training curve of episodic cost.

All five DRL-based methods perform better than the two optimisation-based methods, which shows the ability of DRL to solve complex optimisation problems.

In the further comparison among the five DRL-based algorithms, it is easy to find that CASRL achieves the top two in almost all metrics, except for navigation time. Specifically, CASRL achieves the highest success rate of 84.9%, the highest average discounted return of 18.84, and the best overall performance of 15.71. In the other metrics on motion safety, including Col. Rate, Col. Time, Avg. Cost, and Dis. Number, it also achieves the second-best performance, slightly worse than HGAT-DRL-Lag. Compared to the vanilla algorithm HGAT-DRL, CASRL greatly delays the occurrence of collisions in high-risk cases by about 1.94 s and reduces violations of the safety zones of obstacles (with a distance less than 0.2 m) by 11.1%. HGAT-DRL-Lag is the safest algorithm, with a Col. Rate of only 13.8%. However, since it does not take into account the conflicting policy gradients from the two sub-tasks, its time efficiency is greatly reduced, spending 23.7% more navigation time than CASRL. Therefore, it performs terribly in Avg. Return and ranks only third in Over. Perf.

4.2.2 | Ablation study of gradient manipulation

NGSRL is implemented for ablation experiments on the two proposed innovations: decomposition of the overall navigation task and gradient-based conflict-averse policy optimisation. Benefiting from the first innovation, NGSRL outperforms HGAT-DRL in terms of safety, showing a 0.92 s (15.7%) increase in Col. Time, as well as a 4.3%, 4.9%, and 18.4% drop in Col. Rate, Dis. Number, and Avg. Cost, respectively. Benefiting from the second innovation, CASRL shows a further improvement over NGSRL in terms of safety, increasing/decreasing by 1.02 s (15.1%), 4.0%, 6.6%, and 11.9% in Col. Time, Col. Rate, Dis. Number, and Avg. Cost, respectively.

PCSRL is also implemented as a comparison algorithm to NGSRL and CASRL, which performs gradient surgery according to Equation (22). Due to the projection operation on conflicted gradients, PCSRL achieves better motion safety than NGSRL. However, PCSRL focuses on the conflicted cases, ignoring the more common cases of slight inconsistency. Therefore, its motion safety is worse than CASRL's.

4.2.3 | Performance in different environments

To further validate the effectiveness of CASRL and demonstrate the two proposed innovations, a comprehensive comparative analysis is conducted on a more detailed set of experimental results. Here, the 12,000 cases are divided into eight groups according to the number of dynamic circular obstacles, ranging from 3 to 10. Each group comprises 1500 test cases with static circular obstacles ranging from 3 to 5. Six methods, TEB, HGAT-DRL, HGAT-DRL-Lag, NGSRL, PCSRL, and CASRL, are tested on four core metrics, including Ave. Return, Suc. Rate, Nav. Time, and Ave. Cost.

FIGURE 7 Experimental comparisons of TEB, HGAT-DRL, NGSRL, and CASRL in different environments. They depict the average return, average cost, success rate, and navigation time, respectively. (a) Average Return. (b) Average Cost. (c) Success Rate. (d) Navigation Time.

Experimental results are shown in Figure 7. As the number of dynamic obstacles increases, the difficulty of collision avoidance increases, so the performance of all six algorithms deteriorates. Nevertheless, CASRL always keeps the best performance in Suc. Rate and Ave. Return. TEB experiences the most dramatic decline in performance, from the third best in the simplest scenarios to the worst in the others. Furthermore, it is worth noting that as the difficulty level increases, the gaps among HGAT-DRL, NGSRL, and CASRL also widen significantly. Specifically, in the simplest cases with only three dynamic circular obstacles, their success rates are 90.9%, 92.5%, and 94.9%, with gaps of 1.6% and 2.4%. When the number of dynamic obstacles becomes 6, their success rates become 79.9%, 83.0%, and 87.1%, with widened gaps of 3.1% and 4.1%. Yet in the most difficult cases with 10 dynamic circular obstacles, their success rates are 59.3%, 67.1%, and 72.7%, with further widened gaps of 7.8% and 5.6%. This phenomenon is in line with the expectations behind the proposed innovations. In simple scenarios, the conflict between the goal-reaching and collision avoidance sub-tasks is slight, so neglecting the conflict causes only a slight performance degradation. However, as the number of dynamic obstacles increases, obstacle avoidance becomes more challenging, and the conflict between the two sub-tasks intensifies. Then, either separating the collision avoidance sub-task from the overall task or eliminating conflicts between the two sub-tasks can effectively improve the motion safety of mobile robots.

13.70 � 1.95 ↑

15.89 � 0.84 ↑

14.74 � 1.13 ↑

16.26 � 2.08 ↑

16.66 � 0.58 ↑
improve the motion safety of mobile robots. Another notable

Statistical results in interactive environments. � is used to indicate the standard deviation. The upward/downward arrow indicates that the item is larger/smaller than the corresponding one in Table 1.

Ove. Perf. ↑
point is that though HGAT‐DRL‐Lag achieves the best per-
formance in motion safety, it spends the most navigation time
in each group of environments. The result is consistent with
the result mentioned above.

4.33 � 1.09 ↓

1.80 � 0.44 ↓

3.61 � 0.62 ↓

2.71 � 0.96 ↓

2.41 � 0.32 ↓
Avg. Cost ↓
4.2.4 | Experiment in interactive environments

The crowd is one of the typical dynamic environments that


mobile robots encounter. In real‐world crowd scenarios, a key
difference is that pedestrians can see and react to the robot.

18.07 � 0.88 ↑

17.69 � 0.62 ↑

18.35 � 0.52 ↑

18.97 � 1.12 ↑

19.08 � 0.28 ↑
Avg. Return ↑
Therefore, we further evaluate DRL‐based policies from
Sec. 4.2.1 in interactive scenarios, where the root is visible to
dynamic circular obstacles. The statistical results are shown in
Table 2.
As expected, in interactive environments, all five DRL‐
based algorithms achieve a visible improvement in the four

1469.7 � 154.1 ↑

1028.4 ± 73.5 ↑

1301.4 � 183.6 ↑

1370.7 � 83.5 ↑

1213.3 � 108.6 ↑
metrics for motion safety. It shows that the proactive collision

Dis. Number ↓
avoidance policy learnt by an invisible robot can be directly
deployed in real‐world visible robots. The performance
improvement can be attributed to pedestrians' proactive
avoidance behaviours. Even if there is no communication
network between the robot and pedestrians, they can still
achieve collaborative collision avoidance. A possible problem

Col. Time (s) ↑


6.85 � 0.43 ↑

11.96 � 1.00 ↑

7.57 � 0.39 ↑

8.67 � 0.24 ↑

8.72 � 0.21 ↑
is the increased violations and navigation time. It is caused by
the lack of a communication mechanism between the robot
and pedestrians. Actually, even for pedestrians, it is also
difficult to realise optimal reciprocal avoidance in
communication‐less environments. Therefore, the robot and
pedestrians sometimes hider each other, resulting in increased
Nav. Time (s) ↓

navigation time and violations of the safety zone. Finally, their


11.68 � 0.54 ↑

16.46 � 0.92 ↑

12.47 � 0.31 ↑

12.89 � 0.24 ↑

13.40 � 0.26 ↑
Over. Perf increase by an average of 6.6%.

Note: In each metric, the best two are bolded, and the best one is marked with another wave line.
4.3 | Real‐world experiments

Apart from simulation experiments, we also deploy CASRL on


0.203 � 0.049 ↓

0.096 � 0.022 ↓

0.173 � 0.029 ↓

0.133 � 0.047 ↓

0.116 � 0.016 ↓

a Fetch robot. Its diameter, maximum velocity, and maximum


Col. Rate ↓

wheel acceleration are 0.56 m, 1.0 m/s, and 1.0 m/s2,


respectively. Though the policy is trained with a simulation
timestep of 0.25s, it runs at a frequency of 20 Hz for quicker
reactions to possible collisions. Besides, since only the robot
provides the control interface with linear and angular velocity,
planned actions on accelerations are transformed to control
0.865 � 0.018 ↑

0.880 � 0.017 ↑
0.796 � 0.049 ↑

0.826 � 0.028 ↑

0.864 � 0.045 ↑

commands at a higher frequency of 100 Hz. The whole


The whole experimental field is 8.0 m × 18.0 m, and a motion capture system is utilised to record the positions of all objects. To better verify the algorithm's effectiveness, the Fetch robot must patrol between the points (0, −4.0) and (0, 4.0), rather than perform the goal-reaching navigation task used in the simulation experiments. Once the distance to the current goal is less than 0.3 m, the robot is considered to have completed the current individual navigation task, and its goal is reset for the next navigation. Its free space is also limited to a 10.0 m × 10.0 m square. As shown in Figure 8, three suitcases are randomly placed as static circular obstacles. Besides them, four volunteers join the hardware experiment to interact with the robot.
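The patrol behaviour reduces to a simple goal-switching rule layered on top of the navigation policy; a sketch using the waypoints and the 0.3 m threshold from this description is shown below.

```python
import math

WAYPOINTS = [(0.0, -4.0), (0.0, 4.0)]  # patrol endpoints in field coordinates
GOAL_TOLERANCE = 0.3                   # metres


def update_goal(robot_xy, goal_idx):
    """Switch to the next patrol waypoint once the current goal is reached."""
    gx, gy = WAYPOINTS[goal_idx]
    if math.hypot(robot_xy[0] - gx, robot_xy[1] - gy) < GOAL_TOLERANCE:
        # The current individual navigation task is completed;
        # reset the goal for the next one.
        goal_idx = (goal_idx + 1) % len(WAYPOINTS)
    return goal_idx
```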

FIGURE 8  An illustration of the real-world experiment field. (a) The field coordinate system and some parameters. (b) A snapshot from the real-world experiments.

Statistical results of the hardware experiment are shown in Table 3, which come from 40 individual navigation tasks. One surprising result is that CASRL achieves a 100% success rate and an average speed of 0.643 m/s, without any collision with volunteers, suitcases, or walls. However, in the real-world experiment, the Nav. Time increases to 14.36 s, 3.14 s higher than that from the simulation experiment with four dynamic and three static circular obstacles. There are three possible reasons for this difference. First, in real-world environments, the volunteers exhibit more diverse and changeable behaviour than the ideal ORCA agents in the simulation environment. Secondly, there is unavoidable noise in the state estimation of the volunteers, caused by the relative perception modules, which can also weaken the time efficiency of the navigation policy. Finally, the patrol task in the real-world experiments differs from the goal-reaching task in the simulation environment, in that the Fetch robot has to make a U-turn at the beginning of each run.

TABLE 3  Statistical results from hardware experiments.

Methods   Suc. Rate   Nav. Time (s)   Ave. Speed (m/s)
CASRL     1.00        14.36 ± 3.19    0.64 ± 0.06

Figure 9 shows two sample trajectory diagrams and the corresponding velocity curves; the navigation times are 15.35 and 15.60 s, respectively. In Figure 9(a), noticeable collision avoidance behaviours of the robot include (1) turning right at about t = 2.0 s to avoid the oncoming cyan-coloured volunteer; (2) turning left and giving way to the magenta and orange volunteers at t = 5.0 s; and (3) turning towards its goal after passing by the magenta volunteer at t = 10.0 s. In Figure 9(b), noticeable collision avoidance behaviours include (1) rotating towards the goal first; (2) turning right to avoid the oncoming cyan volunteer; (3) keeping its velocity to pass quickly in front of the magenta volunteer; and (4) slowing down at t = 6.25 s, turning left at t = 7.0 s, and staying parallel with the orange volunteer until t = 10.0 s. Comparing the robot's behaviours in the two tests, an interesting point is its reaction to the magenta volunteer. In the test shown in Figure 9(a), since the robot is close to the magenta volunteer after avoiding the cyan one, it decides not only to give way to the volunteer but also to turn left to enlarge the distance between them. In the test shown in Figure 9(b), by contrast, there is enough passable space in front of the magenta volunteer, and the robot performs a quick pass ahead of the volunteer. As for the velocity curves, slight deviations exist between the velocity commands and the actual velocities, which illustrates that the planned acceleration actions are dynamically feasible for the Fetch robot.

FIGURE 9  Trajectory diagrams and velocity curves in two sample scenarios. In the trajectory diagrams (top row), grey discs represent static circular obstacles. Coloured curves depict the trajectories of the robot (black) and the volunteers (other colours). Black lines represent the boundaries of the robot's free space. Coloured numbers on the trajectories indicate the time, and the interval between two consecutive nodes is 0.5 s. Small arrows on the humans' trajectories indicate their forward directions, while red arrows on the robot's trajectory show its heading angles. The robot's goal is marked with a red pentagon. The bottom row shows the robot's velocity curves, consisting of the velocity commands and the actual velocities of the Fetch robot. Two dashed black lines mark the start and end times of the navigation task. (a) Scenario 1. (b) Scenario 2.

5 | CONCLUSION
This paper proposes an SRL framework for autonomous robot navigation, along with a safe navigation algorithm for mobile robots in dynamic environments. The framework first decomposes the overall navigation task into two sub-tasks, one for goal-reaching and another for collision avoidance, and introduces a safety critic to estimate the motion safety/risk of actions. On the basis of the critic and the safety critic, it further constructs two task-specific but model-agnostic policy gradients, eliminating the mutual interference between the two sub-tasks. It then proposes a conflict-averse manipulation of the policy gradients: by maximising the worst local policy improvement of both sub-tasks, the inconsistency between them is largely eliminated, and the conflict-averse policy update ensures an efficient and stable learning process. Furthermore, this work proposes a practical safe navigation algorithm named CASRL by combining the SRL framework with the heterogeneous graph representation of dynamic environments from Ref. [18].
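The conflict-averse update can be illustrated with a simplified two-task sketch in the spirit of MGDA [37] and CAGrad [41]; the closed-form minimum-norm combination below is a stand-in for, not a restatement of, the exact manipulation used by CASRL.

```python
import numpy as np


def conflict_averse_direction(g_goal: np.ndarray, g_safe: np.ndarray) -> np.ndarray:
    """Combine the goal-reaching and collision-avoidance policy gradients.

    Returns the minimum-norm convex combination of the two gradients;
    whenever it is non-zero, it has a non-negative inner product with
    both gradients, so neither sub-task is locally sacrificed.
    """
    diff = g_goal - g_safe
    denom = float(diff @ diff)
    if denom < 1e-12:          # gradients are (almost) identical
        return g_goal
    w = float(np.clip(-(g_safe @ diff) / denom, 0.0, 1.0))
    return w * g_goal + (1.0 - w) * g_safe
```

A full implementation would apply this direction, or the paper's exact variant of it, to the actor parameters at every policy update.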
As expected, the proposed algorithm achieves a high success rate of over 84.9% in a challenging simulated environment, outperforming the vanilla HGAT-DRL algorithm by 8.2%. Especially in the most difficult environment with 10 dynamic circular obstacles, where the inconsistency between the two sub-tasks intensifies, the proposed algorithm further extends its lead over HGAT-DRL to 13.4%. The superiority of the proposed algorithm over the baseline fully illustrates the effectiveness of the two proposed innovations. Besides, extensive hardware experiments demonstrate that the proposed algorithm can be directly deployed in real-world dynamic environments.


CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
Research data are not shared.

ORCID
Zhiqian Zhou https://orcid.org/0000-0002-9407-075X
Junkai Ren https://orcid.org/0000-0002-6199-7452
Xinglong Zhang https://orcid.org/0000-0002-0587-2487

REFERENCES
1. Ferrer, G., Sanfeliu, A.: Proactive kinodynamic planning using the extended social force model and human motion prediction in urban environments. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1730–1735 (2014)
2. Chen, Y., Zhao, F., Lou, Y.: Interactive model predictive control for robot navigation in dense crowds. IEEE Trans. Syst. Man Cybernetics, 1–13 (2021)
3. Lin, S., et al.: Receding horizon control with Trajectron++: navigating mobile robots in the crowd. In: 2022 41st Chinese Control Conference (CCC), pp. 3871–3877 (2022)
4. Trautman, P., Krause, A.: Unfreezing the robot: navigation in dense, interacting crowds. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 797–803 (2010)
5. Chen, Y., et al.: Robot navigation in crowds by graph convolutional networks with attention learned from human gaze. IEEE Rob. Autom. Lett. 5(2), 2754–2761 (2020). https://doi.org/10.1109/lra.2020.2972868
6. Yan, C., Xiang, X., Wang, C.: Towards real-time path planning through deep reinforcement learning for a UAV in dynamic environments. J. Intell. Rob. Syst. 98(2), 297–309 (2020). https://doi.org/10.1007/s10846-019-01073-3

7. Li, G., Hildre, H.P., Zhang, H.: Toward time-optimal trajectory planning for autonomous ship maneuvering in close-range encounters. IEEE J. Ocean. Eng. 45(4), 1219–1234 (2020). https://doi.org/10.1109/joe.2019.2926822
8. Cui, Y., et al.: Learning world transition model for socially aware robot navigation. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 9262–9268 (2021)
9. Xie, Z., Xin, P., Dames, P.: Towards safe navigation through crowded dynamic environments. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4934–4940 (2021)
10. Zhou, Z., et al.: Robot navigation in a crowd by integrating deep reinforcement learning and online planning. Appl. Intell. 52(13), 15600–15616 (2022). https://doi.org/10.1007/s10489-022-03191-2
11. Liu, S., et al.: Decentralized structural-RNN for robot crowd navigation with deep reinforcement learning. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 3517–3524 (2021)
12. Yan, C., et al.: Deep reinforcement learning of collision-free flocking policies for multiple fixed-wing UAVs using local situation maps. IEEE Trans. Ind. Inf. 18(2), 1260–1270 (2022). https://doi.org/10.1109/tii.2021.3094207
13. Zhao, L., et al.: Toward an online decision support system to improve collision risk assessment at sea. IEEE Intell. Transp. Syst. Mag. 15(2), 137–148 (2023). https://doi.org/10.1109/mits.2022.3190965
14. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd ed. The MIT Press (2018)
15. Chen, Y.F., et al.: Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292 (2017)
16. Chen, Y.F., et al.: Socially aware motion planning with deep reinforcement learning. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350 (2017)
17. Han, R., et al.: Reinforcement learned distributed multi-robot navigation with reciprocal velocity obstacle shaped rewards. IEEE Rob. Autom. Lett. 7(3), 5896–5903 (2022). https://doi.org/10.1109/lra.2022.3161699
18. Zhou, Z., et al.: Navigating robots in dynamic environment with deep reinforcement learning. IEEE Trans. Intell. Transport. Syst. 23(12), 25201–25211 (2022). https://doi.org/10.1109/tits.2022.3213604
19. Yu, H., Xu, W., Zhang, H.: Towards safe reinforcement learning with a safety editor policy. In: Advances in Neural Information Processing Systems (2022)
20. Ha, S., et al.: Learning to walk in the real world with minimal human effort. In: Kober, J., Ramos, F., Tomlin, C. (eds.) Proceedings of the 2020 Conference on Robot Learning. Vol. 155 of Proceedings of Machine Learning Research, pp. 1110–1120 (2021)
21. Achiam, J., Amodei, D.: Benchmarking Safe Exploration in Deep Reinforcement Learning. arXiv (2019)
22. Stooke, A., Achiam, J., Abbeel, P.: Responsive safety in reinforcement learning by PID Lagrangian methods. In: Proceedings of the 37th International Conference on Machine Learning (2020)
23. Achiam, J., et al.: Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 22–31 (2017)
24. Yang, L., et al.: Constrained update projection approach to safe policy optimization. In: Advances in Neural Information Processing Systems, vol. 35, pp. 9111–9124. Curran Associates, Inc. (2022)
25. Dalal, G., et al.: Safe Exploration in Continuous Action Spaces. arXiv (2018)
26. Srinivasan, K., et al.: Learning to Be Safe: Deep RL with a Safety Critic. arXiv (2020)
27. Thananjeyan, B., et al.: Recovery RL: safe reinforcement learning with learned recovery zones. IEEE Rob. Autom. Lett. 6(3), 4915–4922 (2021). https://doi.org/10.1109/lra.2021.3070252
28. Flet-Berliac, Y., Basu, D.: SAAC: safe reinforcement learning as an adversarial game of actor-critics. In: The Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (2022)
29. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
30. Tai, L., Paolo, G., Liu, M.: Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36 (2017)
31. Long, P., et al.: Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6252–6259 (2018)
32. Sathyamoorthy, A.J., et al.: DenseCAvoid: real-time navigation in dense crowds using anticipatory behaviors. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 11345–11352 (2020)
33. Chen, C., et al.: Relational graph learning for crowd navigation. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10007–10013 (2020)
34. Liu, S., et al.: Intention aware robot crowd navigation with attention-based interaction graph. In: IEEE International Conference on Robotics and Automation (ICRA) (2023)
35. Kästner, L., et al.: Enhancing navigational safety in crowded environments using semantic-deep-reinforcement-learning-based navigation. In: 2022 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 87–93 (2022)
36. Lv, S., et al.: A deep safe reinforcement learning approach for mapless navigation. In: 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1520–1525 (2021)
37. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 525–536. Curran Associates Inc., NY (2018)
38. Liu, S., Johns, E., Davison, A.J.: End-to-end multi-task learning with attention. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1871–1880 (2019)
39. Zhang, Y., Yang, Q.: A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 34(12), 5586–5609 (2022). https://doi.org/10.1109/tkde.2021.3070203
40. Yu, T., et al.: Gradient surgery for multi-task learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates Inc., NY (2020)
41. Liu, B., et al.: Conflict-averse gradient descent for multi-task learning. In: Ranzato, M., et al. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 18878–18890. Curran Associates, Inc. (2021)
42. Sun, Y., Li, J., Xu, X.: Meta-GF: training dynamic-depth neural networks harmoniously. In: Computer Vision – ECCV 2022, pp. 691–708. Springer-Verlag, Berlin (2022)
43. Misra, I., et al.: Cross-stitch networks for multi-task learning. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3994–4003 (2016)
44. Rosenbaum, C., Klinger, T., Riemer, M.: Routing networks: adaptive selection of non-linear functions for multi-task learning. In: International Conference on Learning Representations (2018)
45. Wang, X., Li, Y.: Harmonized dense knowledge distillation training for multi-exit architectures. Proc. AAAI Conf. Artif. Intell. 35(11), 10218–10226 (2021). https://doi.org/10.1609/aaai.v35i11.17225
46. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, pp. 1587–1596 (2018)
47. Fiorini, P., Shiller, Z.: Motion planning in dynamic environments using velocity obstacles. Int. J. Robot Res. 17(7), 760–772 (1998). https://doi.org/10.1177/027836499801700706
48. Chen, C., et al.: Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In: 2019 International Conference on Robotics and Automation, pp. 6015–6022 (2019)
49. Van den Berg, J., et al.: Reciprocal n-body collision avoidance. In: The International Journal of Robotics Research, pp. 3–19 (2011)
50. Rösmann, C., et al.: Exploiting sparse structures in nonlinear model predictive control with hypergraphs. In: 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pp. 1332–1337 (2018)

51. Rösmann, C., Hoffmann, F., Bertram, T.: Timed-elastic-bands for time-optimal point-to-point nonlinear model predictive control. In: 2015 European Control Conference (ECC), pp. 3352–3357 (2015)
52. Rösmann, C., Hoffmann, F., Bertram, T.: Integrated online trajectory planning and optimization in distinctive topologies. Robot. Autonom. Syst. 88, 142–153 (2017). https://doi.org/10.1016/j.robot.2016.11.007

How to cite this article: Zhou, Z., et al.: A safe reinforcement learning approach for autonomous navigation of mobile robots in dynamic environments. CAAI Trans. Intell. Technol. 1–16 (2023). https://doi.org/10.1049/cit2.12269
Syst. 88, 142–153 (2017). https://doi.org/10.1016/j.robot.2016.11.007
