DOI: 10.1049/cit2.12269

ORIGINAL RESEARCH

Revised: 5 July 2023    Accepted: 3 August 2023

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China
2 Institute of Robotics and Automatic Information System, College of Artificial Intelligence, NanKai University, Tianjin, China

Correspondence
Zhiwen Zeng and Junhao Xiao.
Email: [email protected] and [email protected]

Funding information
National Natural Science Foundation of China, Grant/Award Numbers: 62203460, U1913202, U22A2059

Abstract
When deploying mobile robots in real-world scenarios, such as airports, train stations, hospitals, and schools, collisions with pedestrians are intolerable and catastrophic. Motion safety therefore becomes one of the most fundamental requirements for mobile robots. However, until now, efficient and safe robot navigation in such dynamic environments has remained an open problem. The critical reason is that the inconsistency between navigation efficiency and motion safety is greatly intensified by the high dynamics and uncertainties of pedestrians. To face this challenge, this paper proposes a safe deep reinforcement learning algorithm named Conflict-Averse Safe Reinforcement Learning (CASRL) for autonomous robot navigation in dynamic environments. Specifically, it first separates the collision avoidance sub-task from the overall navigation task and maintains a safety critic to evaluate the safety/risk of actions. Later, it constructs two task-specific but model-agnostic policy gradients for the goal-reaching and collision avoidance sub-tasks to eliminate their mutual interference. Then, it further performs a conflict-averse gradient manipulation to address the inconsistency between the two sub-tasks. Finally, extensive experiments are performed to evaluate the superiority of CASRL. Simulation results show an average 8.2% performance improvement over the vanilla baseline in eight groups of dynamic environments, which is further extended to 13.4% in the most challenging group. Besides, forty real-world experiments fully illustrate that CASRL can be successfully deployed on a real robot.

KEYWORDS
autonomous navigation, dynamic environments, gradient-based policy optimisation, mobile robots, safe reinforcement learning

[Correction added on 14 Oct 2023, after first online publication: the affiliation of authors Zhiqian Zhou, Junkai Ren, Zhiwen Zeng, Junhao Xiao, Xinglong Zhang, Zongtan Zhou and Huimin Lu is corrected to "College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China".]

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
© 2023 The Authors. CAAI Transactions on Intelligence Technology published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology and Chongqing University of Technology.
8.2% success rate and fully illustrates the effectiveness of the proposed two innovations.

This paper is structured as follows. Section 2 presents related work, and Section 3 depicts the proposed CASRL algorithm. Section 4 verifies the effectiveness of CASRL with extensive simulation and real-world experiments. Finally, Section 5 concludes the whole work.

2 | RELATED WORK

In this section, we first briefly review recent DRL-based approaches for autonomous robot navigation in dynamic environments. Then, we highlight related work on SRL in robotics and gradient-based optimisation in multi-task learning, which directly inspires this work.

2.1 | DRL for robot navigation in dynamic environments

Encouraged by their superior performance in video games [29], DRL methods have been intensively studied for autonomous robot navigation in dynamic environments over the last few years. Sensor-level methods take raw perceptual information from laser scans and cameras as network inputs and then learn a sensorimotor control policy by RL [30, 31]. With an accurate sensor simulation model, their policies learnt in simulation can be migrated to real robots. However, due to the lack of low-level motion properties and high-level interaction models for obstacles, their decisions are inherently reactive and may lead to oscillatory motions in complex dynamic environments, such as crowd scenarios with many pedestrians. To address this problem, the authors in Refs. [8, 9, 32] unanimously utilise another network, named a world transition model, to estimate the motion properties of moving objects by supervised learning, which greatly enhances mobile robots' ability to avoid collisions proactively. Yet, world transition models are far from enough to model the sophisticated interactions in real-world dynamic environments. Besides, since world transition models are always constructed in the form of images or occupancy maps, both the training and the inference of the world transition model are time-consuming.

To achieve more proactive and foresighted avoidance of dynamic obstacles, some researchers turn their attention to the agent-level representation of surrounding environments and focus on modelling interactions among agents. These methods follow the classic two-stage framework, which first detects obstacles and then learns a collision avoidance policy from the processed agent-level representation [15, 16]. The separation of detection and navigation makes it possible to perform higher-level inference than sensor-level methods. Representative achievements include the relational graph model in Ref. [33], the human-like gaze model in Ref. [5], and the social attention model in Ref. [10]. They enable mobile robots to assess disturbances from surrounding obstacles and identify critical threats to the robot. Based on them, some researchers further introduce historical trajectory information [11, 34], semantic information (e.g. types of obstacles) [18, 35], and velocity-obstacle-based interactive information [17, 18] to help mobile robots understand scenarios better. Though the agent-level representation is manually designed, experimental results show that this high-level information, especially the interactive information from traditional collision avoidance models, plays an important role in improving the motion safety of mobile robots in dynamic environments.

Although these methods achieve continuous performance improvements, they suffer from a common problem rooted in the scalar reward function. Rewards for arriving at and approaching the goal, and penalties for collisions and high-risk states, are all integrated into a single scalar reward function. However, the navigation task is inherently multi-objective, and different objectives may be inconsistent or even conflicting. A scalar reward function may therefore result in mutual interference among different objectives. As a consequence, reward shaping, via carefully fine-tuning reward weights [11, 33, 34] or introducing other so-called reasonable reward components [16–18], is a critical point for their final performance in motion safety.

To deal with the mutual interference among different objectives, this work decomposes the navigation task into two sub-tasks, goal-reaching and collision avoidance, and correspondingly groups the rewards/penalties. Then two policy gradients are constructed from the learnt critic and safety critic to separate the two sub-tasks. Finally, a gradient-based policy optimisation method is proposed to resolve the inherent inconsistency between the goal-reaching and collision avoidance sub-tasks. In the practical implementation, considering that modelling interactions among agents is the key to dealing with the collision avoidance problem in dynamic environments, a heterogeneous graph representation from Ref. [18] is used to describe navigation scenarios.

2.2 | SRL in robotics

Over the past few years, although much promising progress has been made by DRL in robot navigation, the motion safety problem of robots is still open, especially in real-world applications. To address the safety problem, many researchers are turning their attention to SRL. Their key innovations are twofold: separating cost functions (related to different safety constraints) from the scalar reward function in the traditional MDP and constructing explicit constraints on costs in the optimisation objective [20–24]. The former greatly alleviates the problem of carefully balancing safety and performance with a problematic hand-designed scalar reward function. Meanwhile, the latter can theoretically prevent the optimised policy from violating the constraints.

However, unlike constraints on velocity, acceleration, or energy consumption, a positive threshold for the collision avoidance constraint means a permissible possibility of collisions. On the one hand, it is hard to define the threshold of
motion safety [20–23]. On the other hand, a positive threshold is contrary to the collision-free requirement of autonomous robot navigation. Meanwhile, for methods that use a Lagrange multiplier to balance time efficiency and motion safety, it is necessary to evaluate the safety performance online [20–22]. Finally, current benchmark environments in SRL consider only static or simple dynamic obstacles. There is rarely any interaction between obstacles in these environments, which differs from real-world navigation scenarios. These problems make it difficult to deploy these SRL algorithms in real-world autonomous robot navigation scenarios [36].

Meanwhile, other researchers concentrate on editing unsafe actions. They also separate the cost/penalty function for collisions and construct a safety critic to evaluate the expected cumulative cost. In deployment, the robot can identify high-risk actions with the safety critic and then project them into safe zones with predefined rules [25, 26]. However, predefined collision avoidance policies do not necessarily work by themselves in dynamic or complex environments. Therefore, a two-policy framework is proposed in Ref. [27], which learns a task policy to maximise the task reward under non-emergency cases and a recovery policy to minimise the collision probability under emergency cases. Another similar work is the safety editor framework proposed in Ref. [19]. It also trains two policies: a utility maximiser to maximise the utility reward and a safety editor to adjust unsafe actions from the utility maximiser into safe actions. Benefiting from the explicit structural separation between the primary task and the collision avoidance task, both considerably improve safety.

Inspired by the above research work [19–21, 26], this work also divides the robot navigation task into two sub-tasks: the primary goal-reaching sub-task and the secondary collision avoidance sub-task. On this basis, we further propose a new optimisation objective to reconcile the inconsistencies between the two sub-tasks. Compared with existing methods, there is no need to set an upper bound and perform an online evaluation of safety performance, or to introduce additional task-specific modules to edit unsafe actions.

2.3 | Gradient-based optimisation in multi-task learning

learning targets of tasks and sub-task modules with knowledge distillation [45].

Recently, considering that over-parameterised networks have adequate capacity to allow harmonious joint training of multiple tasks, a number of gradient-based methods focus on balancing the learning process of each task via manipulation of task-specific but model-agnostic gradients. Representative work includes MGDA-UB [37], CAGrad [41], PCGrad [40], and Meta-GF [42]. Among them, both MGDA-UB and CAGrad formulate MTL as a multi-objective optimisation problem and focus on maximising the worst local policy improvement of sub-tasks. Their difference lies in the search area of the parameters. PCGrad identifies conflicting gradients with cosine similarity and then suppresses the interfering components. Meta-GF takes a meta-learnt weighted fusion policy to combine multi-task gradients. Experimental results verify that all of them can effectively improve the learning process on a series of challenging multi-task supervised, semi-supervised, and RL problems.

3 | METHODOLOGY

This section proposes a novel safe reinforcement learning approach named CASRL. As shown in Figure 2, the network consists of three components: an actor, a critic, and a safety critic. Firstly, the critic is trained with rewards from the goal-reaching sub-task, while the safety critic is trained with costs from collision cases. Then, by substituting the learnt critic and safety critic into the policy gradient equation, two task-specific but model-agnostic policy gradients are obtained for both sub-tasks. Finally, via a conflict-averse gradient manipulation that maximises the worst local policy improvement of the sub-tasks, the inconsistency between the two sub-tasks can be alleviated in policy optimisation. Additionally, in the practical implementation, the framework is built on the heterogeneous graph representation of states. Details of the approach are as follows.
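As a rough orientation for the three-component structure described above, the sketch below builds a stand-alone actor, critic, and safety critic as plain multi-layer perceptrons. It is only a minimal sketch under simplifying assumptions: the paper's networks consume HGAT features of a heterogeneous graph (Section 3.4) rather than a flat state vector, and the class names and layer sizes here are illustrative, not those of the original implementation.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    # Small helper: stack Linear + ReLU layers, no activation on the last layer.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class Actor(nn.Module):
    """Deterministic policy pi_theta(s) -> action in [-a_max, a_max]^act_dim."""
    def __init__(self, feat_dim, act_dim, a_max=1.0):
        super().__init__()
        self.net = mlp([feat_dim, 256, 256, act_dim])
        self.a_max = a_max

    def forward(self, feat):
        return self.a_max * torch.tanh(self.net(feat))

class QCritic(nn.Module):
    """State-action value head; used for the reward critic Q_r and the safety critic Q_c."""
    def __init__(self, feat_dim, act_dim):
        super().__init__()
        self.net = mlp([feat_dim + act_dim, 256, 256, 1])

    def forward(self, feat, action):
        return self.net(torch.cat([feat, action], dim=-1)).squeeze(-1)

# One actor, a pair of reward critics (for the clipped double-Q target of Section 3.2),
# and a single safety critic, mirroring the component list of CASRL.
feat_dim, act_dim = 64, 2   # illustrative sizes only
actor = Actor(feat_dim, act_dim)
critic_r1, critic_r2 = QCritic(feat_dim, act_dim), QCritic(feat_dim, act_dim)
safety_critic = QCritic(feat_dim, act_dim)
```

In the full method, the same feature vector extracted by two HGAT layers is fed to all three heads, and the pair of reward critics supports the clipped double-Q target used in Section 3.2.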
the optimisation objective can be explicitly divided into two components: maximising the expected cumulative return $J_r^{\pi_\theta}$ while minimising the expected cumulative cost $J_c^{\pi_\theta}$:

$$J_r^{\pi_\theta} = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T}\gamma_r^{t}\, R(s_t, a_t, s_{t+1})\right], \qquad J_c^{\pi_\theta} = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T}\gamma_c^{t}\, C(s_t, a_t, s_{t+1})\right] \tag{1}$$

where $\tau = s_0, a_0, \ldots, a_{t-1}, s_t$ is a trajectory following policy $\pi_\theta$, and $a_t$ is the action sampled from policy $\pi_\theta(\cdot\,|\,s_t)$.

The two optimisation objectives are built for the goal-reaching sub-task and the collision avoidance sub-task, respectively. It is then possible to differentiate the two sub-tasks at the level of their policy gradients. For the goal-reaching sub-task, the deterministic policy gradient is

$$\nabla_\theta J_r^{\pi_\theta} = \mathbb{E}_{\tau}\left[\nabla_a Q_r^{\pi_\theta}(s,a)\,\nabla_\theta \pi_\theta(s)\,\big|_{\,a=\pi_\theta(\cdot|s)}\right] \tag{2}$$

with

$$Q_r^{\pi_\theta}(s,a) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T}\gamma_r^{t}\, R(s_t,a_t,s_{t+1})\,\Big|\, s_0=s,\ a_0=a\right] \tag{3}$$

where $\nabla$ indicates the partial derivative, and $Q_r^{\pi_\theta}(s,a)$ is the expected sum of rewards for the goal-reaching sub-task.

For the collision avoidance sub-task, the deterministic policy gradient $\nabla_\theta J_c^{\pi_\theta}$ replaces the reward function and discount factor in Equation (2) with the cost function $C$ and the cost discount factor $\gamma_c$, as follows:

$$\nabla_\theta J_c^{\pi_\theta} = -\mathbb{E}_{\tau}\left[\nabla_a Q_c^{\pi_\theta}(s,a)\,\nabla_\theta \pi_\theta(s)\,\big|_{\,a=\pi_\theta(\cdot|s)}\right] \tag{4}$$

with

$$Q_c^{\pi_\theta}(s,a) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T}\gamma_c^{t}\, C(s_t,a_t,s_{t+1})\,\Big|\, s_0=s,\ a_0=a\right] \tag{5}$$

where $Q_c^{\pi_\theta}(s,a)$ is the expected sum of costs. Note that the deterministic policy gradient for the collision avoidance sub-task introduces an additional minus sign, because the optimisation objective for this sub-task is to minimise the expected cumulative cost.

3.2 | Training of critic and safety critic

Since the optimal critic $Q_r^{\pi_\theta}(s,a)$ and the optimal safety critic $Q_c^{\pi_\theta}(s,a)$ are unavailable, two neural networks parameterised by $\varphi$ and $\psi$ are used to estimate them. By temporal-difference (TD) learning, $Q_r^{\varphi}(s,a)$ and $Q_c^{\psi}(s,a)$ are regarded as approximate estimates of the optimal ones and can be used to replace them in the policy gradients:

$$Q_r^{\varphi}(s,a) \rightarrow Q_r^{\pi_\theta}(s,a), \qquad Q_c^{\psi}(s,a) \rightarrow Q_c^{\pi_\theta}(s,a). \tag{6}$$

Specifically, the critic is updated by the clipped double Q-learning proposed in Ref. [46], which maintains two separate value networks to resolve the overestimation problem caused by the critic. Meanwhile, a separate target network is built for stable off-policy training. It is the same as the online network except that its parameters, $\hat{\theta}$, $\hat{\varphi}$, and $\hat{\psi}$, are learnt via soft updates. Finally, the target value for the two value networks in the critic is written as

$$y_r = r + \gamma_r \min_{j=1,2} Q_r^{\hat{\varphi}_j}\!\left(s', \pi_{\hat{\theta}}(s')\right) \tag{7}$$

where $s' = P(s,a)$ is the subsequent state after taking action $a$ in state $s$, and $r = R(s,a,s')$ is the immediate reward for the state transition.

As for the safety critic, considering that the collision avoidance sub-task only dominates the navigation task in high-risk cases, the overestimation problem no longer happens. Therefore, only one value network is used, and its target value is
$$y_c = c + \gamma_c \max\!\left\{ Q_c^{\hat{\psi}}\!\left(s', \pi_{\hat{\theta}}(s')\right),\ 0 \right\} \tag{8}$$

where $c = C(s,a,s')$ is the immediate cost for the state transition. Considering that the cost value is always non-negative, a ReLU operator is used to avoid underestimating the collision risk.

Then, by replacing the optimal critic $Q_r^{\pi_\theta}(s,a)$ and safety critic $Q_c^{\pi_\theta}(s,a)$ in the policy gradient equations (Equations 2 and 4) with the learnt critic $Q_r^{\varphi_1}(s,a)$ and safety critic $Q_c^{\psi}(s,a)$, the policy gradients for the two sub-tasks are estimated by

$$\nabla_\theta J_r^{\pi_\theta} = \mathbb{E}_{\tau}\left[\nabla_a Q_r^{\varphi_1}(s,a)\,\nabla_\theta \pi_\theta(s)\,\big|_{\,a=\pi_\theta(\cdot|s)}\right], \qquad \nabla_\theta J_c^{\pi_\theta} = -\mathbb{E}_{\tau}\left[\nabla_a Q_c^{\psi}(s,a)\,\nabla_\theta \pi_\theta(s)\,\big|_{\,a=\pi_\theta(\cdot|s)}\right] \tag{9}$$

Then, according to the addition rule of derivatives, the parameterised policy network can be updated by the following gradient:

$$\nabla_\theta J_{r,c}^{\pi_\theta} = \nabla_\theta J_r^{\pi_\theta} + \nabla_\theta J_c^{\pi_\theta} \tag{11}$$

Even though the optimisation objective in Equation (10) can combine different components, it ignores the inherent inconsistency, even conflict, between the two sub-tasks. For example, in the autonomous robot navigation problem, the goal-reaching sub-task aims to navigate the robot along time-efficient paths, while the collision avoidance sub-task drives the robot away from high collision-risk regions. When an obstacle lies between the robot and its goal, an obvious conflict exists between the goal-reaching and collision avoidance sub-tasks. To deal with the inconsistency between them, this work further proposes a conflict-averse policy optimisation algorithm via gradient manipulation.

In particular, to mitigate the inconsistency between the two sub-tasks, the optimisation objective for each policy update is constructed based on the minimum performance improvement over the sub-tasks, defined as

$$\arg\max_{\theta}\ \min\left\{\Delta J_r^{\pi_\theta},\ -\Delta J_c^{\pi_\theta}\right\} \tag{12}$$

where $\Delta J_r^{\pi_\theta}$ and $-\Delta J_c^{\pi_\theta}$ are the local policy improvements of the expected return and the expected cost at each update step. When there is only a slight inconsistency between the goal-reaching and collision avoidance sub-tasks, such an optimisation objective aims to simultaneously improve the performance of both sub-tasks. When there exists a clear conflict between the two sub-tasks, the optimisation objective focuses on minimising the performance degradation of the sub-tasks. The following is a detailed solution to this optimisation problem.

Given a small step size $\alpha$ and an update vector $g \in \mathbb{R}^m$, the updated policy network is determined by $\theta' = \theta + \alpha g$. By using the first-order Taylor approximation, Equation (12) can be represented by

where $\omega = [\omega_r, \omega_c] \in W$ is the normalised weight and $g_\omega = \omega_r g_r + \omega_c g_c$ is the weighted sum of the two policy gradient vectors. For this convex optimisation problem, both Slater's condition and strong duality hold when $\sigma > 0$. Therefore, it is allowed to exchange the min and the max:

$$\min_{\lambda>0,\ \omega\in W}\ \max_{g\in\mathbb{R}^m}\ g_\omega^{\top} g - \frac{\lambda}{2}\left\|g - g_0\right\|^2 + \frac{\lambda}{2}\sigma\left\|g_0\right\|^2. \tag{15}$$

With $\lambda$ and $\omega$ fixed, the optimal gradient vector is $g^{*}_{\omega,\lambda} = g_0 + g_\omega/\lambda$. Substituting it into Equation (15), another dual optimisation problem is obtained:

$$\min_{\lambda>0,\ \omega\in W}\ g_\omega^{\top} g_0 + \frac{1}{2\lambda}\left\|g_\omega\right\|^2 + \frac{\lambda}{2}\sigma\left\|g_0\right\|^2. \tag{16}$$

Further, letting $\lambda^{*} = \frac{1}{\sigma^{1/2}}\left\|g_\omega\right\| / \left\|g_0\right\|$, the optimisation problem becomes

$$\min_{\omega\in W}\ g_\omega^{\top} g_0 + \sigma^{1/2}\left\|g_\omega\right\|\left\|g_0\right\|. \tag{17}$$

This optimisation problem is equal to Equation (12), but more time-efficient to solve, with only two optimisation
parameters. After several steps of iterative optimisation on Equation (17), the optimal weight vector is approximated by $\omega^{*}$. Substituting $\omega^{*}$ into $g^{*}_{\omega,\lambda}$, the optimal policy gradient vector for Equation (13) is $g^{*} = \omega_r^{*} g_r + \omega_c^{*} g_c$.

The proposed CASRL is summarised in Algorithm 1, which is built on the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) [46]. It maintains a single actor, a single safety critic, and a pair of critics. For each episode, the critics are updated towards the minimum target reward value (line 14), while the safety critic is updated to approximate the target cost value (line 15). Every $f$ iterations, the policy is updated along the optimal conflict-averse gradient vector $g^{*}$ with the step size $\alpha$ (lines 17–20). Meanwhile, the target networks are updated with $\tau$ (line 21).

Algorithm 1: Conflict-Averse Safe Reinforcement Learning

3.4 | Feature vector construction with HGAT

To help the robot navigate in dynamic environments, for example, the crowd scenario shown in Figure 3, this work follows the representation from Ref. [18], which constructs a feature vector with a heterogeneous graph attention network (HGAT). After simplification, it describes various navigation scenarios with four types of objects: the robot, dynamic circular obstacles, static circular obstacles, and static line obstacles. Their observable configurations are defined by

$$o_i = \begin{cases}
[d, v_0, v_m, \theta] \in \mathbb{R}^{5}, & \text{if } \mathrm{label}(i) = 0 \\
[p_i, \tilde{\rho}_i, v_i, v_{li}, v_{ri}, \mu_i, \xi_i] \in \mathbb{R}^{11}, & \text{if } \mathrm{label}(i) = 1 \\
[p_i, \tilde{\rho}_i, v_{li}, v_{ri}, \mu_i, \xi_i] \in \mathbb{R}^{9}, & \text{if } \mathrm{label}(i) = 2 \\
[p_{si}, p_{ei}, \rho_0, v_{li}, v_{ri}, \mu_i, \xi_i] \in \mathbb{R}^{11}, & \text{if } \mathrm{label}(i) = 3
\end{cases} \tag{18}$$

where all objects are numbered with subscript $i$ ($i \geq 0$): 0 for the robot and the others for obstacles. The function $\mathrm{label}(\cdot)$ marks the type of an object. In the robot's configuration, $d$ and $\theta$ describe the Euclidean distance and orientation from its current position to the given goal. Its velocity is denoted by $v_0$, limited by the maximum speed $v_m$. For circular obstacles, both dynamic and static, their positions are represented by $p_i = [p_{xi}, p_{yi}]$. Their radii are $\tilde{\rho}_i = \rho_i + \rho_0$, enlarged by the robot's radius $\rho_0$. Line obstacles, for example, boundaries, are located by their start points $p_{si}$ and end points $p_{ei}$. Meanwhile, they are also inflated by $\rho_0$. Besides, for all obstacles, a VO-inspired interactive configuration is introduced to enhance the state representation. Assuming an obstacle maintains its velocity $v_i$, the robot will collide with it only if the robot's velocity lies in the VO cone shown in Figure 4. The VO cone can be briefly described with a vertex $v_i$ and two normalised direction vectors, $v_{li}$ and $v_{ri}$, for the left and right rays [47]. Furthermore, to better distinguish the urgency of obstacle avoidance, the minimum separation distance $\mu_i$ and the expected collision time $\varsigma_i$ are introduced to represent the configuration of the obstacle.
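To make the VO-inspired interactive configuration concrete, the sketch below computes the cone rays $v_{li}$, $v_{ri}$ and the scalar features $\mu_i$ and $\xi_i$ for one circular obstacle. It is only an illustrative sketch under assumptions: the separation distance is taken here as the current centre distance minus the inflated radius, the expected collision time as the first contact time under constant velocities, and $\xi_i = 1/(\varsigma_i + 0.5)$ as introduced in the following paragraph; the exact feature definitions of Ref. [18] may differ in detail.

```python
import numpy as np

def vo_features(p_rob, v_rob, p_obs, v_obs, rho_tilde):
    """VO-inspired features of one circular obstacle (positions/velocities are 2-D arrays).

    rho_tilde: obstacle radius already inflated by the robot radius.
    Returns (v_li, v_ri, mu, xi): left/right unit rays of the VO cone,
    separation distance, and the bounded collision-time feature.
    """
    p_rel = p_obs - p_rob                        # relative position
    dist = np.linalg.norm(p_rel)
    mu = dist - rho_tilde                        # surface-to-surface separation (<= 0 means collision)

    # Half-opening angle of the velocity-obstacle cone.
    half = np.arcsin(np.clip(rho_tilde / max(dist, 1e-6), -1.0, 1.0))
    ang = np.arctan2(p_rel[1], p_rel[0])
    v_li = np.array([np.cos(ang + half), np.sin(ang + half)])   # left ray direction
    v_ri = np.array([np.cos(ang - half), np.sin(ang - half)])   # right ray direction

    # Expected collision time under constant velocities (infinite if no contact ahead).
    v_rel = v_rob - v_obs
    a = v_rel @ v_rel
    b = -2.0 * (p_rel @ v_rel)
    c = dist**2 - rho_tilde**2
    disc = b * b - 4.0 * a * c
    if c <= 0.0:                                 # already overlapping
        tc = 0.0
    elif a < 1e-9 or disc < 0.0:
        tc = np.inf
    else:
        t1 = (-b - np.sqrt(disc)) / (2.0 * a)
        tc = t1 if t1 > 0.0 else np.inf
    xi = 1.0 / (tc + 0.5)                        # xi -> 0 when no collision is expected
    return v_li, v_ri, mu, xi
```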
However, since $\varsigma_i$ is infinite when the robot's velocity lies outside the VO cone, another equivalent variable $\xi_i = 1/(\varsigma_i + 0.5)$ is used to replace $\varsigma_i$. Finally, the vertex components of static obstacles are ignored since they have zero velocity.

Then, the state $s_t$ can be represented as a heterogeneous graph $G(V, E)$ with nodes $V$ for objects and edges $E$ for interactions among objects. An edge from node $i$ to node $j$ means that there exists an interaction from object $i$ to object $j$; in other words, object $j$ is required to actively avoid collision with object $i$.

Based on the heterogeneous graph representation of states, the proposed method utilises two layers of HGATs to extract the feature vector. It has two main advantages: scalability in the number of nodes and attention-based message passing among nodes. A detailed setting and description of the HGAT can be found in Ref. [18]. Then, the feature vector is fed into subsequent Multi-Layer Perceptrons (MLPs) to construct the actor $\pi_\theta$, the critic $Q_r^{\varphi}$, and the safety critic $Q_c^{\psi}$.

4 | CASE STUDY

4.1 | Simulation platform and settings

The simulation environment is adapted from CrowdNav [48]. As shown in Figure 5, it contains a robot, dynamic and static circular obstacles, and static linear obstacles from the boundary. The width and length of the room are 10.0 m. All dynamic obstacles are navigated by ORCA [49], a widely used reciprocal collision avoidance algorithm for multi-agent simulation. The dynamic obstacles have a radius of 0.3 m and a maximum velocity of 1 m/s. To interact with the robot frequently, some dynamic obstacles are initialised to simulate the circle-crossing phenomenon. Their initial and goal positions are roughly symmetrically placed on a central circle with a radius of 4.0 m. Moreover, random noises uniformly distributed between −0.5 and 0.5 m are added to their positions. Apart from the circle-crossing dynamic obstacles, the remaining ones are initialised with random initial and goal positions in the room to mimic the square-crossing phenomenon. Once a dynamic obstacle reaches its goal, a new random goal is generated within the room. Static circular obstacles, with random radii ranging from 0.1 to 0.4 m, are randomly placed in the room.

FIGURE 5 Simulation environment. Both the robot and dynamic circular obstacles are represented by hollow circles, with a red arrow indicating the heading. Dynamic obstacles are painted with numbers to distinguish them from the robot. Except for hollow circles, solid grey circles and black lines are painted for static circular obstacles and boundaries, respectively. Additionally, two pentagons are used to mark the start (black) and the goal (red) positions of the robot. Meanwhile, the goal positions of dynamic obstacles are marked with black stars.

During the training process, the number of circular obstacles is randomly generated, with 3–10 dynamic obstacles and 3–5 static obstacles. As for the robot, its maximum velocity $v_{max}$ and acceleration $a_{max}$ are set to 1.0 m/s and 1.0 m/s². At the beginning of each training episode, the robot is set with the initial and goal positions of (0.0, −4.0) and (0.0, 4.0). Additionally, the robot is assumed to have a limited observation range of 6.0 m. Therefore, only observed circular obstacles or segments of static linear obstacles, rather than global obstacle information, are considered in navigation. Other detailed settings for the training process are as follows.

4.1.1 | Reward function

The reward function $R$ is a sum of three components: $r^g$, $r^d$, and $r^\theta$. Among them, $r^d$ and $r^\theta$ are dense rewards designed to navigate the robot to approach and steer towards its goal, and $r^g$ is the sparse immediate reward for reaching the goal. Namely,

$$r_t = r_t^{g} + r_t^{d} + r_t^{\theta} \tag{19}$$

where $r_t^{g}$, $r_t^{d}$, and $r_t^{\theta}$ are given by
$$r_t^{g} = \begin{cases} \kappa_1, & \text{if } d_t < \eta_1 \\ 0, & \text{otherwise} \end{cases} \qquad r_t^{d} = \kappa_2\,(d_{t-1} - d_t), \qquad r_t^{\theta} = \kappa_3\,(\cos(\theta_t) - 1)/(d_t + \eta_2) \tag{20}$$

where $d_t$ and $\theta_t$ are the distance and the orientation from the robot's position to the robot's goal at step $t$, respectively. The goal tolerance of the robot $\eta_1$ is set to 0.2 m. The hyper-parameter $\eta_2$ is used to smooth and limit the value of $r_t^{\theta}$, with a value of 5.0 m. The parameters $\kappa_1$, $\kappa_2$, and $\kappa_3$ are used to balance the different reward components, and their values are 25.0, 1.0, and 1.0.

4.1.2 | Cost function

The cost function $C$ only includes the sparse immediate cost of $\kappa_4 = 25.0$ on collisions, which is given by

$$c_t = \begin{cases} \kappa_4, & \text{if } \min_{i>0} \mu_i \leq 0 \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

where $\mu_i$ is the minimal distance from the robot to object $i$.

4.1.3 | Action space

The robot is assumed to have a common two-wheeled differential drive model, and its action $a_t = [a_t^{l}, a_t^{r}]$ denotes the accelerations of the wheels ($a_t^{l}$ for the left wheel and $a_t^{r}$ for the right one). Considering the acceleration constraint, both $a_t^{l}$ and $a_t^{r}$ are limited to $[-a_{max}, a_{max}]$, where $a_{max} = 1.0$ is the maximal acceleration.

4.2 | Simulation results and discussions

Four methods, MPC (Model Predictive Control) from Ref. [50], TEB (Time-Elastic-Band) from Ref. [51], HGAT-DRL from Ref. [18], and HGAT-DRL-Lag adapted from Ref. [20], are implemented as baseline algorithms. Among them, both MPC and TEB are traditional optimisation-based approaches, while HGAT-DRL is a recent DRL-based approach. HGAT-DRL-Lag is a variant of HGAT-DRL, which introduces a Lagrange multiplier $\lambda_l$ to balance the two sub-tasks adaptively. The Lagrange multiplier is updated by $-\lambda_l(\phi - \phi')$, where $\phi$ and $\phi'$ are the current collision rate and the threshold of the collision rate. In the implementation, $\phi$ is estimated over the last 500 episodes, and $\phi'$ is set to 10%.

This work proposes three approaches: NGSRL, PCSRL, and CASRL, which differ in how they update the policy network. Specifically, NGSRL updates the policy network with the gradient vector $g_0 = (g_r + g_c)/2$, without considering the consistency between the goal-reaching and collision avoidance sub-tasks. PCSRL modifies PCGrad [40] into the proposed SRL framework. Specifically, its optimal policy gradient vector is defined by

$$g^{*} = \begin{cases} 0.5\left(g_c + g_r - \dfrac{g_r^{\top} g_c}{\left\|g_r\right\|^2}\, g_r\right), & \text{if } g_r^{\top} g_c \leq 0 \\[2ex] 0.5\,(g_r + g_c), & \text{otherwise} \end{cases} \tag{22}$$

CASRL is a further development of PCSRL. It aims to maximise the worst local policy improvement of the sub-tasks to deal with the inherent inconsistency between the two sub-tasks.

The five DRL-based approaches are trained five times with different random seeds. In the training process, all network parameters are trained by the Adam optimiser with a learning rate of 0.001. Both discount factors for the expected cumulative rewards and costs, $\gamma_r$ and $\gamma_c$, are set to 0.987. Since this work is adapted from the TD3 algorithm, the hyper-parameters for training include the exploration noise, policy noise, noise clip, update rate, and update frequency, with values of 0.2, 0.2, 0.2, 0.005, and 8, respectively. With these hyper-parameters and a 13th-generation Intel i7 CPU, the training of HGAT-DRL, NGSRL, and CASRL took 11,659 s, 13,569 s, and 13,936 s, respectively. The separation of the two sub-tasks only requires an additional 1,910 s, and the proposed conflict-averse policy optimisation only increases the training time by 367 s. Therefore, there is no need to worry about an excessive computational burden caused by the proposed conflict-averse policy optimisation.

For a comprehensive and fair comparison, all methods are evaluated with 12,000 random test cases. These cases are grouped based on the number of dynamic circular obstacles (ranging from 3 to 10) and static circular obstacles (ranging from 3 to 5), with 500 cases for each combination. Eight metrics are used to evaluate the different algorithms quantitatively, including "Suc. Rate", "Col. Rate", "Nav. Time", "Col. Time", "Dis. Number", "Avg. Return", "Avg. Cost", and "Over. Perf.". They describe the success rate, the collision rate, the navigation time of successful cases, the time at which collisions occur in collision cases, the number of violations of the safety zones of dynamic obstacles, the discounted cumulative return over steps, the discounted cumulative cost over steps, and the overall performance, respectively.

4.2.1 | Performance comparison

Statistical results are shown in Table 1, and the training curves of the DRL-based methods are depicted in Figure 6. Since the random seed does not affect TEB and MPC, their standard deviations are always 0. As expected, TEB and MPC achieve a terrible performance in the simulation environment. Especially, since MPC does not take into account the motion properties of obstacles, its performance is catastrophic, with a success rate of only 17.5%. All five DRL-based methods perform better than
TABLE 1

| Methods | Suc. Rate ↑ | Col. Rate ↓ | Nav. Time (s) ↓ | Col. Time (s) ↑ | Dis. Number ↓ | Avg. Return ↑ | Avg. Cost ↓ | Over. Perf. ↑ |
|---|---|---|---|---|---|---|---|---|
| MPC [50] | 0.132 ± 0.000 | 0.867 ± 0.000 | 18.88 ± 0.00 | 4.53 ± 0.00 | 1819.3 ± 0.0 | 3.45 ± 0.00 | 19.93 ± 0.00 | −16.49 ± 0.00 |
| TEB [52] | 0.756 ± 0.000 | 0.236 ± 0.000 | 13.79 ± 0.00 | 5.21 ± 0.00 | 1358.8 ± 0.0 | 16.59 ± 0.00 | 4.96 ± 0.00 | 11.63 ± 0.00 |
| HGAT-DRL [18] | 0.767 ± 0.037 | 0.233 ± 0.038 | 10.97 ± 0.49 | 5.80 ± 0.43 | 1061.5 ± 165.9 | 17.78 ± 0.61 | 5.08 ± 0.87 | 12.69 ± 1.47 |
| HGAT-DRL-Lag [20] | 0.833 ± 0.021 | 0.138 ± 0.023 | 15.82 ± 1.03 | 10.61 ± 0.71 | 929.4 ± 75.0 | 17.37 ± 0.75 | 2.68 ± 0.48 | 14.69 ± 1.07 |
| NGSRL (ours) | 0.809 ± 0.018 | 0.190 ± 0.018 | 11.72 ± 0.34 | 6.72 ± 0.43 | 1010.1 ± 117.7 | 18.38 ± 0.33 | 4.05 ± 0.40 | 14.33 ± 0.71 |
| PCSRL (ours) | 0.829 ± 0.017 | 0.170 ± 0.017 | 12.19 ± 0.16 | 7.34 ± 0.30 | 1066.6 ± 82.5 | 18.61 ± 0.52 | 3.56 ± 0.37 | 15.05 ± 0.88 |
| CASRL (ours) | 0.849 ± 0.010 | 0.148 ± 0.009 | 12.76 ± 0.40 | 7.67 ± 0.28 | 968.6 ± 95.0 | 18.82 ± 0.25 | 3.11 ± 0.19 | 15.71 ± 0.42 |

Note: ± is used to indicate the standard deviation. All DRL-based algorithms are tested with five models from different random seeds. The upward/downward arrow indicates that the higher/lower the metric is, the better the algorithm. In each metric, the best two are bolded, and the best one is marked with another wave line.
FIGURE 6 Training curves from HGAT-DRL, HGAT-DRL-Lag, NGSRL, PCSRL, and CASRL, which are coloured blue, orange, green, red, and purple, respectively. (a) Training curve of episodic reward. (b) Training curve of episodic cost.
the two optimisation-based methods, which shows the ability of DRL to solve complex optimisation problems.

In a further comparison among the five DRL-based algorithms, it is easy to find that CASRL ranks in the top two in almost all metrics, except for navigation time. Specifically, CASRL achieves the highest success rate of 84.9%, the highest average discounted return of 18.84, and the best overall performance of 15.71. In the other metrics on motion safety, including Col. Rate, Col. Time, Avg. Cost, and Dis. Number, it also achieves the second-best performance, slightly worse than HGAT-DRL-Lag. Compared to the vanilla algorithm HGAT-DRL, CASRL greatly delays the occurrence of collisions in high-risk cases by about 1.94 s and reduces violations of the safety zones of obstacles (with a distance less than 0.2 m) by 11.1%. HGAT-DRL-Lag is the safest algorithm, with a Col. Rate of only 13.8%. However, since it does not take into account the conflicting policy gradients from the two sub-tasks, its time efficiency is greatly reduced, spending 23.7% more navigation time than CASRL. Therefore, it performs terribly in Avg. Return and ranks only third in Over. Perf.

4.2.2 | Ablation study of gradient manipulation

NGSRL is implemented for ablation experiments on the proposed two innovations: the decomposition of the overall navigation task and the gradient-based conflict-averse policy optimisation. Benefiting from the first innovation, NGSRL outperforms HGAT-DRL in terms of safety, showing a 0.92 s (15.7%) increase in Col. Time, as well as a 4.3%, 4.9%, and 18.4% drop in Col. Rate, Dis. Number, and Avg. Cost, respectively. Benefiting from the second innovation, CASRL shows a further improvement over NGSRL in terms of safety, increasing Col. Time by 1.02 s (15.1%) and decreasing Col. Rate, Dis. Number, and Avg. Cost by 4.0%, 6.6%, and 11.9%, respectively.

PCSRL is also implemented as a comparison algorithm to NGSRL and CASRL, which performs gradient surgery according to Equation (22). Due to the projection operation on conflicting gradients, PCSRL achieves better motion safety than NGSRL. However, PCSRL focuses on the conflicting cases, ignoring the more common cases of slight inconsistency. Therefore, its motion safety is worse than that of CASRL.
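The following sketch contrasts the three gradient-combination rules compared in this ablation: the plain average used by NGSRL, the PCGrad-style projection of Equation (22) used by PCSRL, and the conflict-averse weighting of Equation (17) used by CASRL. It is a minimal illustration under stated assumptions: gradients are flattened into vectors, the weight search in Equation (17) is approximated by a coarse grid over ω ∈ [0, 1], σ is a hyper-parameter of the conflict-averse step whose value is not taken from the paper, and the final combination g* = ω_r g_r + ω_c g_c follows the statement in Section 3.3.

```python
import numpy as np

def combine_ngsrl(g_r, g_c):
    # NGSRL: plain average of the two task-specific policy gradients.
    return 0.5 * (g_r + g_c)

def combine_pcsrl(g_r, g_c):
    # PCSRL (Equation 22): project away the interfering component when the
    # gradients conflict (non-positive inner product), otherwise average them.
    dot = g_r @ g_c
    if dot <= 0.0:
        g_c_proj = g_c - dot / (g_r @ g_r + 1e-12) * g_r
        return 0.5 * (g_r + g_c_proj)
    return 0.5 * (g_r + g_c)

def combine_casrl(g_r, g_c, sigma=0.5, grid=1001):
    # CASRL (Equation 17): search the normalised weight omega that minimises
    # g_w^T g_0 + sqrt(sigma) * ||g_w|| * ||g_0||, then combine with the optimal weights.
    g_0 = 0.5 * (g_r + g_c)
    best_w, best_val = 0.5, np.inf
    for w in np.linspace(0.0, 1.0, grid):
        g_w = w * g_r + (1.0 - w) * g_c
        val = g_w @ g_0 + np.sqrt(sigma) * np.linalg.norm(g_w) * np.linalg.norm(g_0)
        if val < best_val:
            best_w, best_val = w, val
    return best_w * g_r + (1.0 - best_w) * g_c

# Toy example with two conflicting gradients (obstacle between robot and goal).
g_r = np.array([1.0, 0.2])     # goal-reaching gradient
g_c = np.array([-0.8, 0.6])    # collision avoidance gradient
for name, fn in [("NGSRL", combine_ngsrl), ("PCSRL", combine_pcsrl), ("CASRL", combine_casrl)]:
    print(name, fn(g_r, g_c))
```

In training, g_r and g_c would be the flattened policy gradients from Equation (9), and the chosen update vector would then be applied with the step size α.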
4.2.3 | Performance in different environments

To further validate the effectiveness of CASRL and demonstrate the proposed two innovations, a comprehensive comparative analysis is conducted on a more detailed set of experimental results. Here, the 12,000 cases are divided into eight groups according to the number of dynamic circular obstacles, ranging from 3 to 10. Each group comprises 1500 test cases with static circular obstacles ranging from 3 to 5. Six methods, TEB, HGAT-DRL, HGAT-DRL-Lag, NGSRL, PCSRL, and CASRL, are tested on four core metrics, including Ave. Return, Suc. Rate, Nav. Time, and Ave. Cost.

FIGURE 7 Experimental comparisons of TEB, HGAT-DRL, NGSRL, and CASRL in different environments. They depict the average return, average cost, success rate, and navigation time, respectively. (a) Average Return. (b) Average Cost. (c) Success Rate. (d) Navigation Time.

Experimental results are shown in Figure 7. As the number of dynamic obstacles increases, the difficulty of collision avoidance increases, so the performance of all six algorithms deteriorates. Nevertheless, CASRL always keeps the best performance in Suc. Rate and Ave. Return. TEB experiences the most dramatic decline in performance, from the third best in the simplest scenarios to the worst in the others. Furthermore, it is worth noting that as the difficulty level increases, the gaps among HGAT-DRL, NGSRL, and CASRL also widen significantly. Specifically, in the simplest cases with only three dynamic circular obstacles, their success rates are 90.9%, 92.5%, and 94.9%, with gaps of 1.6% and 2.4%. When the number of dynamic obstacles becomes 6, their success rates become 79.9%, 83.0%, and 87.1%, with widened gaps of 3.1% and 4.1%. Yet in the most difficult cases with 10 dynamic circular obstacles, their success rates are 59.3%, 67.1%, and 72.7%, with further widened gaps of 7.8% and 5.6%. This phenomenon is in line with the expectations of the proposed innovations. In simple scenarios, the conflict between the goal-reaching and collision avoidance sub-tasks is slight, so neglecting the conflict causes only a slight performance degradation. However, as the number of dynamic obstacles increases, obstacle avoidance becomes more challenging, and the conflict between the two sub-tasks intensifies. Then, either separating the collision avoidance sub-task from the overall task or eliminating conflicts between the two sub-tasks can effectively
improve the motion safety of mobile robots. Another notable point is that though HGAT-DRL-Lag achieves the best performance in motion safety, it spends the most navigation time in each group of environments. This result is consistent with the results mentioned above.

4.2.4 | Experiment in interactive environments

Therefore, we further evaluate the DRL-based policies from Sec. 4.2.1 in interactive scenarios, where the robot is visible to the dynamic circular obstacles. The statistical results are shown in Table 2.

TABLE 2 Statistical results in interactive environments, reporting Suc. Rate ↑, Col. Rate ↓, Nav. Time (s) ↓, Col. Time (s) ↑, Dis. Number ↓, Avg. Return ↑, Avg. Cost ↓, and Over. Perf. ↑ for HGAT-DRL [18], HGAT-DRL-Lag [20], NGSRL (ours), PCSRL (ours), and CASRL (ours). ± is used to indicate the standard deviation. The upward/downward arrow indicates that the item is larger/smaller than the corresponding one in Table 1. Note: In each metric, the best two are bolded, and the best one is marked with another wave line.

As expected, in interactive environments, all five DRL-based algorithms achieve a visible improvement in the four metrics for motion safety. It shows that the proactive collision avoidance policy learnt by an invisible robot can be directly deployed on real-world visible robots. The performance improvement can be attributed to pedestrians' proactive avoidance behaviours. Even if there is no communication network between the robot and pedestrians, they can still achieve collaborative collision avoidance. A possible problem is the increased violations and navigation time. It is caused by the lack of a communication mechanism between the robot and pedestrians. Actually, even for pedestrians, it is difficult to realise optimal reciprocal avoidance in communication-less environments. Therefore, the robot and pedestrians sometimes hinder each other, resulting in increased violations and navigation time. Nevertheless, Over. Perf. increases by an average of 6.6%.

4.3 | Real-world experiments

Once the distance to the current goal is less than 0.3 m, the robot completes the current navigation task, and its goal will be reset for the next navigation task.
volunteers join in the hardware experiment to interact with the robot.

Statistical results of the hardware experiment are shown in Table 3, which come from 40 individual navigation tasks. One surprising result is that CASRL achieves a 100% success rate and an average speed of 0.643 m/s, without any collision with volunteers, suitcases, or walls. However, in the real-world experiment, Nav. Time increases to 14.36 s, 3.14 s higher than that of the simulation experiment with four dynamic and three static circular obstacles. There are three possible reasons for the difference. First, in real-world environments, volunteers exhibit more diverse and changeable behaviour than the ideal ORCA-controlled obstacles in the simulation environment. Secondly, there exists unavoidable noise in the state estimation of volunteers caused by the relative perception modules, which can also weaken the time efficiency of the navigation policy. Finally, the patrol task in real-world

TABLE 3 Statistical results from hardware experiments.

| Methods | Suc. Rate | Nav. Time (s) | Ave. Speed (m/s) |
|---|---|---|---|
| CASRL | 1.00 | 14.36 ± 3.19 | 0.64 ± 0.06 |

5 | CONCLUSION

This paper proposes an SRL framework for autonomous robot navigation, along with a safe navigation algorithm for mobile robots in dynamic environments. It first decomposes the overall navigation task into two sub-tasks, one for goal-reaching and another for collision avoidance, and introduces a safety critic to estimate the motion safety/risk of actions. On the basis of the critic and safety critic, it further constructs two task-specific but model-agnostic policy gradients, eliminating the mutual interference between the two sub-tasks. It then proposes a conflict-averse manipulation of the policy gradients. By maximising the worst local policy improvement of both sub-tasks, the inconsistency between the two sub-tasks can be greatly eliminated. Additionally, the conflict-averse policy update ensures an efficient and stable learning process. Furthermore, this work proposes a practical safe navigation algorithm named CASRL by combining the SRL framework with the heterogeneous graph representation of dynamic environments from Ref. [18].

As expected, the proposed algorithm achieves a high success rate of over 84.9% in a challenging simulated environment, outperforming the vanilla algorithm HGAT-DRL by 8.2%. Especially in the most difficult environment with 10 dynamic circular obstacles, where the inconsistency between
FIGURE 9 Trajectory diagrams and velocity curves in two sample scenarios. In trajectory diagrams (the top line), grey discs represent static circular obstacles. Coloured curves depict the trajectories of the robot (black) and volunteers (other colours). Black lines represent the boundaries of the robot's free space. Coloured numbers on trajectories indicate the time, and the interval between two consecutive nodes is 0.5 s. Small arrows on humans' trajectories indicate the forward directions, while red arrows on the robot's trajectory show its heading angles. The robot's goal is marked with a red pentagon. The figures' bottom line shows the robot's velocity curves, consisting of the velocity commands and the actual velocities of the Fetch robot. In addition, two dashed black lines are used to mark the start time and end time of the navigation task. (a) Scenario 1. (b) Scenario 2.
7. Li, G., Hildre, H.P., Zhang, H.: Toward time-optimal trajectory planning for autonomous ship maneuvering in close-range encounters. IEEE J. Ocean. Eng. 45(4), 1219–1234 (2020). https://doi.org/10.1109/joe.2019.2926822
8. Cui, Y., et al.: Learning world transition model for socially aware robot navigation. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 9262–9268 (2021)
9. Xie, Z., Xin, P., Dames, P.: Towards safe navigation through crowded dynamic environments. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4934–4940 (2021)
10. Zhou, Z., et al.: Robot navigation in a crowd by integrating deep reinforcement learning and online planning. Appl. Intell. 52(13), 15600–15616 (2022). https://doi.org/10.1007/s10489-022-03191-2
11. Liu, S., et al.: Decentralized structural-rnn for robot crowd navigation with deep reinforcement learning. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 3517–3524 (2021)
12. Yan, C., et al.: Deep reinforcement learning of collision-free flocking policies for multiple fixed-wing uavs using local situation maps. IEEE Trans. Ind. Inf. 18(2), 1260–1270 (2022). https://doi.org/10.1109/tii.2021.3094207
13. Zhao, L., et al.: Toward an online decision support system to improve collision risk assessment at sea. IEEE Intell. Transp. Syst. Mag. 15(2), 137–148 (2023). https://doi.org/10.1109/mits.2022.3190965
14. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd ed. The MIT Press (2018)
15. Chen, Y.F., et al.: Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292 (2017)
16. Chen, Y.F., et al.: Socially aware motion planning with deep reinforcement learning. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350 (2017)
17. Han, R., et al.: Reinforcement learned distributed multi-robot navigation with reciprocal velocity obstacle shaped rewards. IEEE Rob. Autom. Lett. 7(3), 5896–5903 (2022). https://doi.org/10.1109/lra.2022.3161699
18. Zhou, Z., et al.: Navigating robots in dynamic environment with deep reinforcement learning. IEEE Trans. Intell. Transport. Syst. 23(12), 25201–25211 (2022). https://doi.org/10.1109/tits.2022.3213604
19. Yu, H., Xu, W., Zhang, H.: Towards safe reinforcement learning with a safety editor policy. In: Advances in Neural Information Processing Systems (2022)
20. Ha, S., et al.: Learning to walk in the real world with minimal human effort. In: Kober, J., Ramos, F., Tomlin, C. (eds.) Proceedings of the 2020 Conference on Robot Learning. Vol. 155 of Proceedings of Machine Learning Research, pp. 1110–1120 (2021)
21. Achiam, J., Amodei, D.: Benchmarking Safe Exploration in Deep Reinforcement Learning. arXiv (2019)
22. Stooke, A., Achiam, J., Abbeel, P.: Responsive safety in reinforcement learning by pid Lagrangian methods. In: Proceedings of the 37th International Conference on Machine Learning (2020)
23. Achiam, J., et al.: Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 22–31 (2017)
24. Yang, L., et al.: Constrained update projection approach to safe policy optimization. In: Advances in Neural Information Processing Systems, vol. 35, pp. 9111–9124. Curran Associates, Inc. (2022)
25. Dalal, G., et al.: Safe Exploration in Continuous Action Spaces. arXiv (2018)
26. Srinivasan, K., et al.: Learning to Be Safe: Deep RL with a Safety Critic. arXiv (2020)
27. Thananjeyan, B., et al.: Recovery rl: safe reinforcement learning with learned recovery zones. IEEE Rob. Autom. Lett. 6(3), 4915–4922 (2021). https://doi.org/10.1109/lra.2021.3070252
28. Flet-Berliac, Y., Basu, D.: SAAC: safe reinforcement learning as an adversarial game of actor-critics. In: The Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (2022)
29. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
30. Tai, L., Paolo, G., Liu, M.: Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36 (2017)
31. Long, P., et al.: Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6252–6259 (2018)
32. Sathyamoorthy, A.J., et al.: Densecavoid: real-time navigation in dense crowds using anticipatory behaviors. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 11345–11352 (2020)
33. Chen, C., et al.: Relational graph learning for crowd navigation. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10007–10013 (2020)
34. Liu, S., et al.: Intention aware robot crowd navigation with attention-based interaction graph. In: IEEE International Conference on Robotics and Automation (ICRA) (2023)
35. Kästner, L., et al.: Enhancing navigational safety in crowded environments using semantic-deep-reinforcement-learning-based navigation. In: 2022 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 87–93 (2022)
36. Lv, S., et al.: A deep safe reinforcement learning approach for mapless navigation. In: 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1520–1525 (2021)
37. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 525–536. Curran Associates Inc., NY (2018)
38. Liu, S., Johns, E., Davison, A.J.: End-to-end multi-task learning with attention. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1871–1880 (2019)
39. Zhang, Y., Yang, Q.: A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 34(12), 5586–5609 (2022). https://doi.org/10.1109/tkde.2021.3070203
40. Yu, T., et al.: Gradient surgery for multi-task learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates Inc., NY (2020)
41. Liu, B., et al.: Conflict-averse gradient descent for multi-task learning. In: Ranzato, M., et al. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 18878–18890. Curran Associates, Inc. (2021)
42. Sun, Y., Li, J., Xu, X.: Meta-gf: training dynamic-depth neural networks harmoniously. In: Computer Vision – ECCV 2022, pp. 691–708. Springer-Verlag, Berlin (2022)
43. Misra, I., et al.: Cross-stitch networks for multi-task learning. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3994–4003 (2016)
44. Rosenbaum, C., Klinger, T., Riemer, M.: Routing networks: adaptive selection of non-linear functions for multi-task learning. In: International Conference on Learning Representations (2018)
45. Wang, X., Li, Y.: Harmonized dense knowledge distillation training for multi-exit architectures. Proc. AAAI Conf. Artif. Intell. 35(11), 10218–10226 (2021). https://doi.org/10.1609/aaai.v35i11.17225
46. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, pp. 1587–1596 (2018)
47. Fiorini, P., Shiller, Z.: Motion planning in dynamic environments using velocity obstacles. Int. J. Robot Res. 17(7), 760–772 (1998). https://doi.org/10.1177/027836499801700706
48. Chen, C., et al.: Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 6015–6022 (2019)
49. Van den Berg, J., et al.: Reciprocal n-body collision avoidance. In: The International Journal of Robotics Research, pp. 3–19 (2011)
50. Rösmann, C., et al.: Exploiting sparse structures in nonlinear model predictive control with hypergraphs. In: 2018 IEEE/ASME