Deep Reinforcement Learning Based Mobile Robot Navigation: A Review
Abstract: Navigation is a fundamental problem of mobile robots, for which Deep Reinforcement Learning (DRL)
has received significant attention because of its strong representation and experience learning abilities. There is a
growing trend of applying DRL to mobile robot navigation. In this paper, we review DRL methods and DRL-based
navigation frameworks. Then we systematically compare and analyze the relationship and differences between
four typical application scenarios: local obstacle avoidance, indoor navigation, multi-robot navigation, and social
navigation. Next, we describe the development of DRL-based navigation. Last, we discuss the challenges and some
possible solutions regarding DRL-based navigation.
Key words: mobile robot navigation; obstacle avoidance; deep reinforcement learning

Kai Zhu and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China. E-mail: [email protected]; [email protected]. Tao Zhang is also with the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China. To whom correspondence should be addressed. Manuscript received: 2021-02-05; accepted: 2021-02-22.
1 Introduction
Navigation is a fundamental capability of
mobile robots, which include unmanned vehicles, aerial
vehicles, and ships. The general aim of navigation is to
identify an optimal or suboptimal path from a starting
point to a target point in a Two-Dimensional (2D) or
Three-Dimensional (3D) environment while avoiding
obstacles. Delivery robots, warehouse automated guided
vehicles, and indoor service robots require robust robot
navigation systems in their dynamic environments.
In the past two decades, researchers from all over the world have focused on solving the navigation problem. One popular approach is combining a series of different algorithms. As shown in Fig. 1, the traditional navigation framework uses Simultaneous Localization and Mapping (SLAM) to construct a map of the unknown environment, then uses a localization algorithm to determine the current position of the robot and moves it to its destination using a path planning module[1].

Fig. 1 Traditional robot navigation framework.

SLAM algorithms can be divided into visual and laser SLAMs. The visual SLAM algorithm extracts artificial image features, estimates the pose of the robot and camera based on multi-view geometry theory, and builds an obstacle map. Classical visual SLAM methods, such as LSD-SLAM[2] and ORB-SLAM[3], face two main challenges: (1) designing effective image features to express image information and (2) possible failure of the algorithm in cases of object movement, camera parameter change, illumination change, and single environments that lack texture. The laser SLAM algorithm directly constructs an obstacle map of the environment based on the dense laser ranging results of algorithms such as GMapping[4] and Hector SLAM[5].
The challenges of laser SLAM include (1) the time-consuming establishment and update of the obstacle map and (2) the need for a dense laser sensor, because the algorithm performance strongly depends on the sensor accuracy.

Path planning is another key module in the traditional navigation framework. Based on the different amounts of environmental information obtained, this module can be divided into global and local path planning. Global path planning involves selecting a complete path based on a known environmental map. Commonly used methods include the A-star, ant colony optimization, and rapid-exploration random tree[6], which rely on known static maps and are therefore difficult to use in dynamic environments. Local path planning methods, such as the Artificial Potential Field (APF) and dynamic window approach[7], are used to deal with dynamic changes in the environment and to replan local paths. The bottlenecks encountered by traditional path planning algorithms include (1) the contradiction of grid-based map representation between its accuracy and memory requirements and (2) the intensive calculations required for real-time replanning of the navigation path in a dynamic environment, which limits its reactivity to some extent.

As mentioned above, each aspect of the traditional navigation framework represents a challenging research topic, and their integration often leads to large computational errors. These calculation errors gradually accumulate along the pipeline from mapping, to positioning, to the path planning algorithm, which leads to poor performance of all these algorithms in practical applications. The traditional navigation framework relies on a high-precision global map that is very sensitive to sensor noise, resulting in limitations in the ability to manage an unknown or dynamic environment.

With the powerful representation capabilities of deep-learning technology, new ideas have been introduced for using reinforcement learning frameworks that can directly learn navigation strategies from raw sensor inputs. In 2013, Mnih et al.[8] were the first to propose the concept of Deep Reinforcement Learning (DRL). They proposed the Deep Q Network (DQN), which can learn to play Atari 2600 games at a level beyond that of human experts based only on image input. Since then, researchers have proposed numerous methods that use DRL algorithms for handling autonomous navigation tasks. These methods describe navigation as a Markov Decision Process (MDP), with sensor observations as the state and a goal of maximizing the expected return of the action. By interacting with the environment, the DRL method finds the optimal policy for guiding the robot to the target position. DRL-based navigation has the advantages of being mapless and having a strong learning ability and low dependence on sensor accuracy. Since 2016, the trend of applying DRL to mobile robot navigation has increased, achieving great success[9].

Researchers have published several surveys on mobile robot navigation[10, 11], which mainly introduce path planning and obstacle avoidance methods under the traditional navigation framework; their coverage of DRL technology is not comprehensive. In 2020, Nguyen et al.[12] investigated multi-agent DRL, which uses DRL to solve multi-agent cooperation problems. Zeng et al.[13] published a survey on the use of DRL for the visual navigation of artificial agents, which mainly focused on visual navigation tasks and divided DRL methods into five categories for review. In contrast to their research, we believe that different navigation scenarios have similar characteristics; hence, we focus more on the intrinsic enhancement that DRL brings to a variety of navigation tasks and on the use of state-of-the-art techniques in dealing with mobile robot navigation problems.

In this paper, we present a comprehensive and systematic review of DRL-based mobile robot navigation from 2016 to 2020. The application scenarios, current challenges, and possible solutions to the challenges of DRL-based navigation are discussed in detail to guide researchers for further improvement of current research results and their deployment to real systems.

In Section 2, we present the background knowledge of DRL. In Section 3, we present the framework and key elements of the DRL-based navigation problem. In Section 4, we divide the application scenarios of DRL-based navigation into four categories and describe in detail the developments and approaches used in each scenario. We present the current challenges and available solutions in Section 5. Finally, we draw our conclusions in Section 6. The architecture of this paper is shown in Fig. 2.

2 Background: Deep Reinforcement Learning

2.1 Preliminary

Reinforcement Learning (RL), inspired by animal learning in psychology, learns optimal decision-making policies through trial-and-error interaction with the environment.
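Before moving to deep methods, it helps to recall the tabular Q learning[14] on which DQN is based. The following minimal sketch, given purely for illustration, runs ε-greedy Q learning on a small hypothetical grid world; the grid size, reward values, and hyperparameters are assumptions of this sketch and are not taken from any of the reviewed systems.

import numpy as np

# Hypothetical 5x5 grid world: start at (0, 0), goal at (4, 4).
# Actions: 0=up, 1=down, 2=left, 3=right; reaching the goal gives +1.
SIZE, GOAL = 5, (4, 4)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def step(state, action):
    dr, dc = MOVES[action]
    nxt = (min(max(state[0] + dr, 0), SIZE - 1),
           min(max(state[1] + dc, 0), SIZE - 1))
    done = nxt == GOAL
    return nxt, (1.0 if done else -0.01), done      # small per-step penalty

Q = np.zeros((SIZE, SIZE, 4))                        # tabular action values
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(2000):
    s, done = (0, 0), False
    while not done:
        a = np.random.randint(4) if np.random.rand() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + (0.0 if done else gamma * np.max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

DQN keeps essentially this update target but replaces the table with a deep network and, as described below, stabilizes training with a target network and experience replay.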
In general, RL methods can be divided into value-based and policy-based methods. A value-based method learns an action value function and selects the action with the highest estimated value, whereas a policy-based method feeds the state into the policy network to select an action, executes the action to obtain the reward value, and optimizes the policy network parameters along the gradient direction for obtaining an optimized policy that maximizes the reward value.

2.2 Value-based DRL methods

2.2.1 Deep Q network

Mnih et al. published DQN-related work in Nature in 2015, reporting that the trained network could reach a level equivalent to that of humans after playing 49 games[15]. DQN, which is based on Q learning, uses a convolutional neural network (a deep neural network) to represent the action value function, and the network is trained based on reward feedback from the game. The main features of DQN are as follows:

(1) The target network is set to deal with the TD error in the time-difference algorithm separately. The parameter θ_i of the current Q-network Q(s, a; θ_i) is copied to θ_i^- of the target Q-network Q(s', a'; θ_i^-) every n time steps, which prevents instability of the target Q network from the changes made in the current Q network during training.

(2) The experience pool U(D) is used to store and manage samples (s, a, r, s'), and an experience replay mechanism is used to select the samples. These samples are stored in the experience pool, from which batch samples are randomly selected to train the Q network. The experience replay mechanism helps to eliminate the correlation between samples so that the samples used in the training approximately realize independent and identical distributions.

The parameters of the neural network are updated by gradient descent. The loss function of the DQN is denoted as:
L(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]   (6)
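As a concrete reading of Eq. (6), the following sketch computes the DQN targets and loss for a replayed minibatch; the array names, shapes, and the use of NumPy are illustrative assumptions rather than code from the cited works.

import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma=0.99):
    # Targets r + gamma * max_a' Q(s', a'; theta_i^-) from Eq. (6).
    # rewards: (B,); q_next_target: (B, A) target-network values; dones: (B,) in {0, 1}
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

def dqn_loss(q_current, actions, targets):
    # Mean squared TD error between Q(s, a; theta_i) and the fixed targets.
    q_sa = q_current[np.arange(len(actions)), actions]   # pick Q(s, a) per sample
    return np.mean((targets - q_sa) ** 2)

Replacing the max in dqn_targets with the action chosen by the current network but evaluated by the target network yields the DDQN target of Eq. (7) discussed next.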
2.2.2 Double DQN (DDQN)

The emergence of DQN promoted widespread use of DRL, but DQN has a number of shortcomings, one of which is the overestimation of the action value function. Van Hasselt et al.[16] pointed out that when the DQN calculates the TD error, using the same Q network to select actions and calculate value functions leads to overestimation of the value function. Thus, the authors proposed the DDQN algorithm.

DDQN uses a dual network structure in the target Q function, whereby the optimal action is selected based on the current Q network, and the target Q network evaluates the selected optimal action. Two sets of parameters separate the action selection and policy evaluation tasks, which reduces the overestimation risk. The loss function of the DDQN is denoted as
L(\theta_i) = \mathbb{E}\left[\left(r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta_i); \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]   (7)

The experimental results in 57 Atari games show that the normalized performance of DDQN without adjustment is twice that of DQN, and three times that of DQN when adjusted.

2.3 Policy-based DRL methods

2.3.1 Deep Deterministic Policy Gradient (DDPG)

Value-based DRL methods (DQN and its variants) solve problems with a high-dimensional observation space but can only handle discrete and low-dimensional action spaces. However, several practical tasks, especially physical control tasks, have continuous and high-dimensional action spaces. To address this issue, the action space can be discretized, but this will inevitably face the curse of dimensionality, i.e., the number of actions will increase exponentially with the increase in degrees of freedom.

Lillicrap et al.[17] proposed the DDPG, which uses a method based on the policy gradient to directly optimize the policy and can be used for problems with a continuous action space. Unlike a random strategy represented by the probability distribution function a_t ∼ π(s_t | θ^π), DDPG uses a deterministic policy function a_t = μ(s_t | θ^μ). It also uses a convolutional neural network to simulate the policy and Q functions and learns from the experience replay and target network in the DQN to stabilize the training and ensure high sample utilization efficiency. K samples in the experience pool are randomly selected, and the Q network is gently updated by gradient ascent. The loss function of the Q network is defined as follows:
L = \frac{1}{K}\sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2   (8)
where y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}) indicates the expected value.

The unbiased estimate of the policy network gradient is obtained as follows:
\nabla_{\theta^\mu} J \approx \frac{1}{K}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\big|_{s_i}   (9)

The experimental results show that the DDPG is suitable for solving more than 20 continuous control tasks such as robotic arm control.
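A compact PyTorch-style sketch of the updates in Eqs. (8) and (9) is given below, assuming small fully connected actor and critic networks; the layer sizes, learning rates, and soft-update coefficient are illustrative assumptions and do not reproduce the settings of the papers reviewed here.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma, tau = 10, 2, 0.99, 0.005     # assumed sizes/constants

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())           # a = mu(s | theta^mu)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                           # Q(s, a | theta^Q)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    # One update on a replayed minibatch; r and done are column tensors of shape (K, 1).
    with torch.no_grad():                                          # y_i of Eq. (8)
        q_next = critic_targ(torch.cat([s2, actor_targ(s2)], dim=1))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), y)  # Eq. (8)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # ascend Eq. (9)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, targ in ((actor, actor_targ), (critic, critic_targ)): # soft target update
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)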
2.3.2 Asynchronous Advantage Actor-Critic (A3C)

The A3C algorithm proposed by Mnih et al.[18] is a representative Actor-Critic (AC) method. The classic policy gradient algorithm directly optimizes the agent's policy; it must collect a series of complete sequence data to update the policy. In DRL, collecting sequence data is often challenging and large variances can be introduced. The AC structure that combines the value function with the policy gradient method is receiving much attention. In the AC structure, the actor selects actions using the policy gradient method, and the critic evaluates those actions using the value function method. During training, the parameters of the actor and critic are alternately updated. The advantage of the AC structure is that it changes the sequence update in the policy gradient to a single-step update, without the need to wait for the sequence to end before evaluating and improving the policy. This ability reduces both the difficulty of data collection and the variance experienced by the policy gradient algorithm.

Based on the AC structure, A3C makes the following improvements:

(1) Parallel agents: The A3C algorithm creates multiple parallel environments, thereby enabling multiple agents with secondary structures to simultaneously update the parameters of the main structure in these parallel environments. Multiple actors are used to explore the environment.

(2) N-step return: Although other algorithms typically use a one-step return of the instant reward calculation function obtained in the sample, the value function of A3C's critic is updated based on the multi-step cumulative return. Calculation of the N-step return improves the iterative update propagation and convergence speed.

A3C can run on a multi-core CPU, and its computational cost is lower than that of methods like DQN. The experimental results show that, despite the problems of hyperparameter adjustment and low sampling efficiency, A3C has achieved success in tasks such as the continuous control of a robotic arm and maze navigation.

2.3.3 Proximal Policy Optimization (PPO)

Traditional policy gradient methods adopt an on-policy strategy in which the sampled minibatch can only be used for one update epoch, and the minibatch must be resampled to implement the next policy update. Schulman et al.[19] proposed the PPO algorithm, which can perform multiple epochs of minibatch updates, thereby improving the sample utilization efficiency.

The PPO algorithm uses a surrogate objective to optimize the new policy using the old policy:
L(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)} \hat{A}_t\right]   (10)
where \hat{A}_t is an estimation of the advantage function at time t, and r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)} is the probability ratio of the new policy \pi_\theta to the old policy \pi_{\theta_{\rm old}}.

Equation (10) is used to improve the actions generated by the new policy relative to those generated by the old policy. However, a large-scale improvement by the new policy will lead to instability of the training algorithm. The PPO algorithm improves the objective function to obtain the following new clipped surrogate objective:
L^{\rm CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]   (11)
where \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) denotes the restriction of the probability ratio to the interval of [1-\varepsilon, 1+\varepsilon]. The clipped surrogate objective keeps the change of the new policy within a certain range and improves the algorithm's stability.

By achieving a good balance between sample complexity, simplicity, and time effectiveness, PPO outperforms A3C and other on-policy gradient methods.
3 DRL-Based Navigation

3.1 Framework

Mobile robots include unmanned vehicles, aerial vehicles, and ships that move in two or three dimensions. Their navigation involves searching for an optimal or suboptimal path from the starting point to a target point while avoiding obstacles. To simplify this challenge, most research has focused only on the navigation problem in 2D space.

In essence, the mobile robot navigation task constitutes Point-To-Point (P2P) movement and obstacle avoidance: (1) The P2P task requires the position of the goal point relative to the start point, which can be directly obtained via GPS or ultra-wideband localization[20], or indirectly through a target perspective image. (2) Obstacles include static, dynamic, and structurally continuous obstacles, which can be sensed by laser range finding, ultrasonic ranging, cameras, or other sensors. In this paper, a structurally continuous obstacle refers to an inherent structure in the environment, such as a corridor or wall. These obstacles constitute an indoor or maze-like environment.

The purpose of using a DRL algorithm in an autonomous navigation task is to find the optimal policy for guiding the robot to its target position through interaction with the environment. Many well-known DRL algorithms, such as DQN, DDPG, PPO, and their variants, have been extended to realize a DRL-based navigation system. These methods describe the navigation process as an MDP that uses sensor observation as the state with the goal of maximizing the expected return of the action. As mentioned above, DRL-based navigation has the advantages of being mapless and having a strong learning ability and low dependence on sensor accuracy. As RL is a trial-and-error learning technology, the physical training process inevitably leads to collisions of the robot with environmental obstacles, which is prohibited. Generally, the deep neural network is trained in a simulation environment before being deployed in a real robot for real-time navigation decision making.

DRL-based navigation has been used to replace or be integrated into the traditional navigation framework. Figure 3 shows the interaction process between the agent and environment of the DRL-based navigation system. The DRL agent replaces the localization and map building module as well as the local path planning module of the traditional navigation framework, moving toward the target point while avoiding static, dynamic, and simple structurally continuous obstacles. However, in an environment where structurally continuous obstacles are too complex, the agent may fall into a local trap. In this case, DRL requires additional global information provided by the traditional navigation technique[21]. As shown in Fig. 3, the global path planning module generates a series of waypoints as intermediate goal points for DRL-based navigation, which enables the integrated navigation system to realize long-distance navigation in a complex structural environment.

Fig. 3 DRL-based navigation system.

3.2 Key elements

The DRL-based navigation system contains three key elements that directly determine the application scenarios and performance of the DRL algorithm: the state space S, the action space A, and the reward function R.

3.2.1 State space

The most often used state-space settings include the start point, goal point, and obstacles. (1) The start point and goal point are represented by the current and destination coordinates of the mobile robot, respectively. Several researchers convert global Cartesian coordinates into local polar coordinates and use the direction and distance relative to the robot for expressing the target point position. (2) The obstacle state is represented by the speed, position, and size of the moving obstacle (agent level) or treats sensor data directly as a sensor-level state, i.e., lidar/ultrasonic ranging data, monocular camera images, or depth camera data.

3.2.2 Action space

In DRL-based navigation research, there are three kinds of actions, i.e., (1) discrete moving actions: moving forward, moving backward, turning left, turning right, and so on; (2) continuous velocity commands: the linear velocity and angular velocity of the mobile robot; and (3) motor speed commands: the desired speeds of the left and right motors. In general, discrete moving actions and continuous velocity commands require a Proportional-Integral-Derivative (PID) or other low-level motion controller to output motion control instructions and control the mobile robot to achieve the desired motion. Motor speed commands can realize end-to-end control using the sensor-level state, but the associated training is much more difficult.
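To make the state and action definitions of Sections 3.2.1 and 3.2.2 concrete, the sketch below assembles a sensor-level state vector from lidar ranges plus the goal expressed in local polar coordinates, and bounds a continuous velocity command; the dimensions and limits are illustrative assumptions.

import numpy as np

V_MAX, W_MAX = 0.5, 1.0        # assumed linear/angular velocity limits (m/s, rad/s)

def build_state(ranges, robot_xy, robot_yaw, goal_xy):
    # Sensor-level state: normalized lidar ranges + goal in polar form (distance, bearing).
    dx, dy = goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1]
    dist = np.hypot(dx, dy)
    bearing = np.arctan2(dy, dx) - robot_yaw
    bearing = np.arctan2(np.sin(bearing), np.cos(bearing))   # wrap to [-pi, pi]
    ranges = np.asarray(ranges, dtype=float)
    return np.concatenate([ranges / ranges.max(), [dist, bearing]])

def clip_action(raw_action):
    # Continuous velocity command (v, w), bounded to the platform limits.
    v = np.clip(raw_action[0], 0.0, V_MAX)
    w = np.clip(raw_action[1], -W_MAX, W_MAX)
    return np.array([v, w])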
3.2.3 Reward function

The reward function is used to train the RL agent to complete a task. In the navigation task, positive or negative rewards are only given when reaching the target or colliding with obstacles, which means this reward is very sparse. Sparse rewards are not conducive to rapid convergence by the agent. To improve training efficiency, dense reward-shaping methods are used in most studies: (1) goal rewards include positive rewards given for arriving at goals and movement close to these goals; (2) a collision penalty is a negative reward given following a collision with an obstacle, or movement too close to an obstacle; and (3) a time step penalty is a negative reward given at each time step to encourage the robot to move faster on its way to the target.
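A shaped reward combining the three terms above might look like the following sketch; the weights and thresholds are illustrative assumptions and vary widely across the surveyed papers.

def shaped_reward(dist_to_goal, prev_dist_to_goal, min_obstacle_dist,
                  reached_goal, collided,
                  r_goal=10.0, r_collision=-10.0, k_progress=1.0,
                  safe_dist=0.3, step_penalty=-0.01):
    if reached_goal:
        return r_goal                                        # (1) goal reward
    if collided:
        return r_collision                                   # (2) collision penalty
    reward = k_progress * (prev_dist_to_goal - dist_to_goal) # reward progress to goal
    if min_obstacle_dist < safe_dist:                        # discourage near misses
        reward -= (safe_dist - min_obstacle_dist) / safe_dist
    return reward + step_penalty                             # (3) time step penalty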
4 Application Scenario

In the past five years, several studies have been conducted on DRL-based navigation, but the classification of DRL-based navigation remains confusing. For example, when using lightweight localization solutions, such as GPS and Wi-Fi, a DRL-based navigation system can obtain the relative position of a goal point without global map information, which several researchers refer to as "mapless" navigation. In other research, the DRL method preprocesses the sensor's local observation data into the form of a local map, which is called a "map-based" method, and global map information is not used. Moreover, some studies refer to "visual navigation" as the use of a first-person-view Red-Green-Blue (RGB) image as the target, whereas other studies refer to navigation based on visual sensors.

We found that although researchers use similar DRL algorithms for essentially solving the same problem (Fig. 4), different researchers have conducted specific research and added expert knowledge for different application scenarios. These different approaches have occurred because, in the current state of the art, if the DRL navigation policy space is set too large, it is difficult to converge. Currently, to reduce the difficulty of DRL training, agents usually learn navigation capabilities in a specific scene, which are then generalized to similar scenes.

Fig. 4 Four application scenarios of DRL-based navigation.

In this review, we divide the application scenarios of DRL navigation into four categories: local obstacle avoidance, indoor navigation, multi-robot navigation, and social navigation. A simple comparison of these scenarios is shown in Table 1. Each scenario has the same basic navigation tasks but features different emphases and details. The local obstacle-avoidance scenario emphasizes dynamic changes in the simple structural environment, whereas indoor navigation focuses on the complexity of the indoor structural environment. The multi-robot navigation scenario involves an environment with multiple high-speed mobile robots. Social navigation focuses on moving through pedestrian-rich environments.

4.1 Local obstacle avoidance

4.1.1 Feature

The local obstacle-avoidance scenario, which is the most common application scenario of the DRL-based navigation system, is the basis of the other scenarios and can be extended to more complex navigation tasks. In traditional navigation frameworks, reactive methods are typically used to solve this type of problem, such as the APF or velocity-based methods. One of the biggest problems of reactive methods is the need for a good sensor system that can generate accurate position coordinates for any local obstacle. DRL methods implicitly process sensor data through neural networks, which overcome the shortcomings of traditional obstacle-avoidance methods.

4.1.2 Development

In 2016, Duguleana and Mogan[22] studied autonomous navigation in environments containing static and dynamic obstacles; they combined the neural network Pose-Net and a 30-20-3 multi-layer perceptron with the famous RL method Q learning.
By dividing the surrounding obstacle environment into eight angular regions, they reduced the number of states. Pose-Net can output three discrete actions, i.e., moving forward, turning left, and turning right. This early research realized effective obstacle avoidance in simple physical environments.

Subsequently, DRL solutions, such as DQN, were rapidly developed, receiving widespread attention. In numerous works, mature DRL methods have been applied to local obstacle-avoidance scenarios. Feng et al.[23] used DDQN to train the agent in a simulation environment to avoid collisions with a wall without using a target point. Most actual local obstacle-avoidance tasks must be simultaneously performed with P2P tasks. For example, Kato et al.[24] developed a navigation system that combines DDQN with a topological-map-based global planning method. The topology map node stores the topology map, plans the global path, and selects the next waypoint. The local navigation node uses the control commands learned through the DDQN to move between waypoints. The authors further improved the system using a real-time kinematic global navigation satellite system that provides waypoints[25]; this system does not perform global path planning and was verified in an outdoor obstacle environment. Wang et al.[26] proposed the Fast Recurrent Deterministic Policy Gradient (Fast-RDPG) algorithm, which improved the DDPG performance on Unmanned Aerial Vehicles (UAVs). By assuming that a virtual UAV only flies at a fixed altitude and speed, they simplified the flying task to two dimensions. The input of the Fast-RDPG includes five-dimensional ranging data, the distance and angle of the target point provided by the GPS signal, and the direction angle of the UAV. The control profile only includes turning left or right. The authors then extended their work to the 3D environment[27]. In the study of Ma et al.[28], a single camera was used to avoid flight obstacles in 3D space. The authors used an improved saliency detection method based on a Convolutional Neural Network (CNN) to extract monocular visual cues; an AC RL module receives states from the obstacle detection module to adjust the position and altitude of the UAV.

In addition, Woo and Kim[29] and Wu et al.[30] combined international regulations for the prevention of collisions at sea[31] to study the collision avoidance problem of the unmanned surface vehicle based on the feature extraction ability of the CNN and learning ability of DRL.

4.1.3 Sim-to-real

Like most scientific research, the purpose of DRL-based navigation research is its application to real systems. In the RL training process, high-speed collisions, even during training, can damage the robot. Kahn et al.[32] focused on this problem and proposed a learning algorithm based on an uncertainty model that uses a neural network to estimate the probability of collision. The algorithm naturally chooses to proceed cautiously in unfamiliar environments and increases the robot's velocity in environments where it has high confidence.

However, training in a physical environment is very time-consuming and dangerous. Researchers usually train agents in a simulation environment and then transfer this learning to the physical environment. Sparse laser ranging data can reduce the large difference between simulation and reality[33]. Tai et al.[34] trained an asynchronous DDPG algorithm in a simulation environment, took only the 10-dimensional sparse range findings and the target position as input, and output continuous linear and angular velocities. They were able to directly transfer the network to a real nonholonomic differential drive robot platform without any fine-tuning. Yokoyama and Morioka[35] proposed a system that uses the pre-trained Struct2depth model to estimate depth data from a monocular camera image. They then converted the depth data into 2D distance data and trained the DDQN agent in a 2D-lidar simulation environment. The proposed system realized navigation in the local environment using only a monocular camera.

For expanding the state-space distribution of the pool of samples and enabling the agent to adapt to more new situations, Zhang et al.[36] adopted the approach of randomly changing the starting and target points in each epoch. They set up four different maze environments for training. Lei et al.[37] introduced a radius constraint on the initialization by randomly setting the start point in a circle with the target point as the center and a radius of Lr, as sketched below. The initial value of Lr is small, and as the neural network is updated, the value of Lr gradually increases, thereby increasing the success probability and ensuring a positive incentive in the sample space. However, expanding the sample space causes the DRL algorithm training to become very time-consuming. Several researchers have combined methods, such as expert demonstrations[38], artificial potential fields[39], and non-expert helpers[40], to improve training efficiency, which have been tested in real environments.
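The radius-constrained initialization of Lei et al.[37] described above can be sketched as follows; the growth schedule and bounds are illustrative assumptions rather than the exact settings of that work.

import numpy as np

def sample_start(goal_xy, Lr, rng=np.random):
    # Sample a start point uniformly inside a circle of radius Lr around the goal.
    angle = rng.uniform(0.0, 2.0 * np.pi)
    radius = Lr * np.sqrt(rng.uniform())             # sqrt gives uniform area coverage
    return np.array([goal_xy[0] + radius * np.cos(angle),
                     goal_xy[1] + radius * np.sin(angle)])

Lr, Lr_max, growth = 0.5, 8.0, 1.02                  # assumed curriculum schedule
for episode in range(1000):
    start = sample_start(goal_xy=np.array([0.0, 0.0]), Lr=Lr)
    # ... run one training episode from `start` toward the goal ...
    Lr = min(Lr * growth, Lr_max)                    # enlarge the start region over time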
4.2 Indoor navigation

In indoor visual navigation, Mirowski et al.[44] proposed the NAV A3C framework for navigation in complex 3D maze environments with randomly placed goals and used sparse "fruit" rewards to encourage exploration. Two auxiliary tasks, including depth prediction and loop-closure prediction, were proposed to solve the sparse reward problem. However, NAV A3C was found to have an unstable policy, poor data efficiency, and poor robustness in a complex environment. Zeng and Wang[50] used the monotonic policy improvement advantage of PPO and proposed the appoNav (asynchronous PPO) algorithm to solve the visual navigation problem. Kulhánek et al.[51] developed auxiliary tasks including pixel control, reward prediction, a depth map, segmentation, and target segmentation. They conducted experiments on the DeepMind Lab, AI2-THOR, and House3D frameworks and verified the performance improvement of the visual navigation algorithm.

In previous research[45], the DRL algorithm was found to be generalizable to new scenarios, but at the expense of a decrease in performance and the need to fine-tune the network. To improve the generalization ability of the visual navigation algorithm, Devo et al.[52] proposed the importance weighted actor-learner architecture, a new framework comprising object localization and navigation networks. The object localization network takes the target image and current frame as input, and outputs a six-dimensional vector that represents the position of the target in the current frame. The vector and the current frame are then input into the navigation network, and an action decision is generated. The agent only uses sparse rewards and can be transferred to real environments and real objects without fine-tuning. However, the authors noted that the navigation performance fluctuated, and with some combinations of light and textures, the agent could not achieve satisfactory results. Thus, further research is required.

In addition to using a first-view image to express the target location, Devo et al.[53] studied the navigation task that follows natural language instruction input. Hsu et al.[54] divided the complex indoor environment into different local areas, and generated navigation actions based on the scene image and target location. The latest research interests also include hierarchical RL[55] and the graph structure neural network[56]. Indoor navigation is becoming increasingly practical.

4.3 Multi-robot navigation

4.3.1 Feature

Multi-robot navigation is an extension of the local obstacle-avoidance task performed by a single robot. Scenarios with multiple high-speed mobile robots, such as a warehouse, feature stronger dynamics that bring new challenges to DRL-based navigation. Traditional solutions for multi-robot navigation can be divided into centralized and distributed solutions. Centralized methods typically provide a central server, which requires communication between robots, collects all relevant information, and then determines the actions of each robot using an optimization algorithm. In the classical distributed method[57], each robot makes independent decisions, but must obtain the accurate motion data of other robots, such as speed, acceleration, and path. The DRL method provides a new solution for distributed obstacle avoidance and the cooperation of multiple robots under non-communication conditions.

4.3.2 Development

In 2016, Chen et al.[58] first proposed multi-robot Collision Avoidance with Deep RL (CADRL) in the decentralized non-communicating condition, in which the expensive online calculations of traditional methods were converted into the offline training processes of value networks. The optimal reciprocal collision avoidance method was used to generate the training set. Then, given an agent's joint configuration (position vector, velocity vector, and robot radius) with its neighbors, the value network encoded the estimated time to reach the target. The authors refined the policy using RL and simulated multiple situations.

However, the agent's joint configuration cannot be obtained directly, and the calculation is time-consuming. To simplify the decision-making process, raw sensor data can be directly used as input. Ding et al.[59] studied the hierarchical RL method using lidar data as input. A high-level evaluation module is responsible for perceiving the overall environmental risks, and a low-level control module is responsible for making action decisions. Long et al.[60] have also done a lot of work on the multi-robot collision avoidance problem. They chose 512-dimensional lidar data with the 180° FOV, target position, and current robot velocity as the state space, and continuous translational and rotational velocities as the action space. To improve the training efficiency, they made the artificially designed reward function dense. This artificially designed dense reward function and multi-scenario multi-stage training framework were used to improve the training efficiency of the parallel PPO algorithm, but the authors found that the RL policy still could not produce consistently perfect behavior.
Therefore, they combined the classic PID controller with the DRL policy and proposed a hybrid-RL framework[61, 62], classified the scenarios faced by a robot into three categories, and designed separate control sub-policies for each category.

In addition to avoiding collisions during navigation, multi-robots must often perform cooperative tasks such as maintaining formation. Chen et al.[63] studied the problem of multi-robot formation. In their parallel DDPG method, multiple agents share experience memory data and navigation strategies. Reward-shaping techniques are used to adjust the reward function of single-robot navigation tasks into a multi-robot version, and curriculum learning techniques are used to train robots to complete a series of increasingly difficult tasks. Lin et al.[64] proposed a distributed navigation method based on PPO, which takes the geometric center of the mass of the robot group as the target, and achieves the goal point while avoiding collisions and maintaining connectivity. The authors verified this algorithm in a real system. Recently, Sartoretti et al.[65] combined RL and imitation learning to achieve distributed path planning for thousands of robots in a large-scale environment. Ma et al.[66] successfully applied DRL to a multi-robot hunting problem.

4.4 Social navigation

4.4.1 Feature

Social navigation refers to the movement of a mobile robot in pedestrian-rich environments such as airports and shopping malls. This scenario focuses on obstacle-avoidance capabilities in the complex environments of human society. Social and multi-robot navigation are similar, because we can treat pedestrians as non-cooperative mobile robots. Most local obstacle avoidance and distributed multi-robot navigation methods can be extended to social navigation scenarios, but the density of dynamic obstacles is much higher in crowded environments. In addition, because of the randomness of human behavior, navigation that meets social standards remains difficult to quantify, which widens the difference between simulated and actual environments. Ensuring safe crowd navigation requires further research.

4.4.2 Development

In 2017, Chen et al.[67] extended their previous work CADRL to social navigation scenarios, and proposed the socially aware CADRL algorithm. Various sensors, such as lidar, a depth camera, and a camera in differential drive vehicles, are used to detect pedestrians[68]. The speed, velocity, and size (radius) of a pedestrian are estimated by clustering point-cloud data. The authors considered complex normative motion patterns to possibly be the result of simple local interactions, and they proposed a way to induce behavior that respects the norms of human society. To induce a particular norm, they introduced a small bias to the RL training process in favor of one set of behaviors over others. They conducted an experiment in a real environment with plenty of pedestrians. After further research, they found that their assumptions regarding social norms deviated from reality as the number of agents in the environment increased.

Previous research had required fixed-size observations. In 2018, the GA3C-CADRL[69] algorithm combined with LSTM was proposed to avoid collisions between various types of dynamic agents without assuming that they follow any specific behavioral rules. In terms of the reward function, Ciou et al.[70] used reward updates from human feedback to learn appropriate social navigation. Sun et al.[71] calculated the collision probability between the subject and a dynamic obstacle, whereby the higher the collision probability is, the greater the penalty received by the agent.

Unlike agent-level states such as coordinates and velocity, Sasaki et al.[72] used a local map to represent the status of dynamic crowds. The advantage of this approach is that it can simultaneously identify multiple moving targets and track all targets based on lidar data. Each map is generated from the current step to several previous steps. They designed a unique geometric transformation method, wherein the map contains target location information. They evaluated the A3C algorithm using pedestrian data collected in the real world, but at the time of publication of this review, they had not yet deployed it in a real system.

Recently, Sathyamoorthy et al.[73] studied the freezing robot problem that arises when a robot navigates through dense scenarios and crowds, using a DRL-based hybrid approach to handle crowds of varying densities. Chen et al.[74] focused on the problem of DRL performance degradation when the number of people increases and combined it with a state-of-the-art graph convolutional network technique for training the agent to pay attention to the most critical person in the crowd. Their research efforts will facilitate the development of social navigation. A simple comparison of relevant references is shown in Table 2.
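To illustrate how a recurrent network removes the fixed-size observation assumption mentioned above, the sketch below encodes an arbitrary number of per-pedestrian state vectors with an LSTM and concatenates the summary with the robot's own state. The feature sizes are illustrative assumptions, and this is a generic sketch of the idea rather than the GA3C-CADRL implementation.

import torch
import torch.nn as nn

class CrowdEncoder(nn.Module):
    # Summarize a variable-length list of neighbor states into a fixed-size vector.
    def __init__(self, neighbor_dim=5, robot_dim=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(neighbor_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden + robot_dim, hidden)

    def forward(self, robot_state, neighbor_states):
        # neighbor_states: (1, N, neighbor_dim), where N may vary between calls
        _, (h_n, _) = self.lstm(neighbor_states)
        summary = h_n[-1]                             # final hidden state, shape (1, hidden)
        joint = torch.cat([robot_state, summary], dim=1)
        return torch.relu(self.head(joint))           # fixed-size feature for the policy

enc = CrowdEncoder()
robot = torch.zeros(1, 6)
for n_people in (1, 3, 7):                            # works for any crowd size
    feature = enc(robot, torch.randn(1, n_people, 5))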
5 Current Challenge and Solution

5.1 Partial observation

5.1.1 Challenge

In robotics, RL is typically applied to motion control tasks such as manipulator control, because the state space is considered to be fully observable. However, because of the limited FOV and the uncertainty regarding the state of other subjects, the mobile robot navigation task is partially observable. This causes two problems: (1) the agent cannot uniquely distinguish its state based on individual current observations and (2) the agent can only learn suboptimal strategies.

The MDP assumes that the state is completely observable, and it cannot capture the complex structure of an environment. Many researchers therefore model the task using the Partially Observable Markov Decision Process (POMDP), an extension of the MDP that can obtain more information from historical trajectories. Formally, a POMDP consists of a six-tuple (S, A, R, P, Ω, O), where Ω is the observation space (o_t ∈ Ω) and O is the observation probability distribution given the system state (o_t ∼ O(s_t)). At each time step t, the agent observes o_t, executes an action a_t according to the policy π(a_t | o_t), receives a reward r_t, and transfers into a new state s_{t+1} according to the distribution P(s_{t+1} | s_t, a_t).
5.1.2 Solution

Two main methods are used to solve the partial observation problem.

(1) Expansion of network input

Considering that the agent cannot uniquely distinguish its state based on current observations, the simplest solution is to add several previous observation frames as network inputs to improve its ability to distinguish among states[36, 51, 72, 75, 76]. In addition, previous rewards and actions also contain state information, so some studies have input previous rewards and actions to the network[33, 44, 63, 77]. Another input expansion technique is the two-stream Q network proposed by Wang et al.[78], which adds the difference between two frames of laser scanning data.
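A minimal sketch of this input expansion is shown below: the most recent k observation frames are stacked into a single network input; the window length and feature size are illustrative assumptions. Previous actions and rewards can be appended to each frame in the same way when they carry useful state information.

from collections import deque
import numpy as np

class ObservationStacker:
    # Concatenate the most recent k observation frames into one network input.
    def __init__(self, k=4, obs_dim=12):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(np.asarray(first_obs, dtype=float))
        return self.state()

    def push(self, obs):
        self.frames.append(np.asarray(obs, dtype=float))
        return self.state()

    def state(self):
        return np.concatenate(list(self.frames))      # shape: (k * obs_dim,)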
(2) Addition of memory ability

Compared with the manual selection of previous observation frames as input, having the agent automatically memorize and process previous observations is more attractive[79]. The Recurrent Neural Network (RNN), a classic network structure in deep-learning research, uses sequence data as input and has memory capabilities. Previous observations have an impact on the current output of the RNN. Some scholars have integrated an RNN into the DRL framework and proposed RDPG to solve the POMDP. Wang et al.[26, 27] successfully applied this approach to DRL-based navigation tasks.

However, RNNs have a long-term dependency problem, and the LSTM network was proposed by deep-learning researchers to alleviate this problem. Numerous DRL-based navigation studies have dealt with the POMDP by adding several LSTM layers to the network structure[44, 50, 54, 65, 71]. As a variant of LSTM, the Gated Recurrent Unit (GRU) is easier to train. Zeng et al.[80] built a GRU-based memory neural network and proved that it could improve performance in complex navigation tasks.
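The alternative to stacking frames is to let the policy carry its own memory, as in the RDPG- and LSTM/GRU-based approaches above. The sketch below wraps a GRU around per-step observations and keeps its hidden state across time steps; the layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    # GRU-based policy: the hidden state summarizes past observations.
    def __init__(self, obs_dim=12, act_dim=2, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, act_dim)

    def forward(self, obs, h=None):
        # obs: (batch, 1, obs_dim) for a single decision step
        out, h = self.gru(obs, h)
        return torch.tanh(self.pi(out[:, -1])), h     # bounded action, updated memory

policy = RecurrentPolicy()
h = None                                              # reset memory at episode start
for t in range(5):
    obs = torch.randn(1, 1, 12)                       # one observation per time step
    action, h = policy(obs, h)                        # memory carried across steps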
5.2 Sparse reward

5.2.1 Challenge

As noted above, the mobile robot navigation task comprises P2P movement and obstacle avoidance. The agent receives a positive reward when it reaches the target point and a negative reward when it collides with an obstacle. The reward is generated only at the end of each epoch, which means it is very sparse. As we know, effective RL depends on collecting useful reward signals, so agents must identify a series of correct actions to reach the goal that generates the sparse reward. The probability of finding a sparse reward signal by a random search is very small. When the surrounding environment becomes more complex and dynamic, the sample space rapidly expands. Sparse rewards can aggravate data inefficiency and result in poor training convergence and long training times.

5.2.2 Solution

The well-known experience replay mechanism and exploration-exploitation techniques in RL research can be directly applied to DRL-based navigation. In addition, three techniques can be used to solve the sparse reward problem.

(1) Reward shaping

An intuitive approach is to design dense rewards manually to make the problem easier to learn. By introducing expert knowledge, the agent can earn a reward at each step in which it executes an action and interacts with the environment. Most methods, especially those that can be verified in real systems, adopt reward-shaping techniques.

Generally, based on the target distance, moving closer to a target will generate a small positive reward[24, 34, 58, 60], and moving closer to an obstacle will generate a small negative reward[69]. A small penalty is also imposed at each time step to encourage the robot to reach the target faster[26, 36, 37, 45, 54]. Direction-related rewards can also encourage agents to move toward a target. Sampedro et al.[39] adopted an APF formulation to design their reward function. Other researchers have also designed special reward functions to promote learning[31, 42, 67, 72, 81].

However, manually designed reward functions have two disadvantages: (1) they are closely related to the task scene, and over-fitting a certain scene leads to non-universality, and (2) an inappropriate reward sometimes leads to the wrong guidance for learning. To solve these problems, Chiang et al.[75] proposed the AutoRL algorithm, which automates the search for both the shaped reward and the neural network architecture. Zhang et al.[82] introduced a general reward function based on a matching network. In addition, expert demonstration can provide an auxiliary experience for DRL, which can be used to address the sparse reward problem[38, 83].
(2) Auxiliary task

In the case of a sparse reward, when it is difficult to complete the original task, we can set auxiliary tasks to accelerate learning. In essence, auxiliary tasks provide extra dense pseudo-rewards for RL[84]. Auxiliary tasks are supervised, so the associated auxiliary loss can be used to adjust the network parameters. The original and auxiliary tasks share part of the network structure, which helps to build up the model representation.

The auxiliary task technique is often used in indoor visual navigation research. Mirowski et al.[44] added two auxiliary tasks to the Nav A3C framework, including depth and loop-closure predictions. They also applied the auxiliary heading prediction task[85] to an urban navigation problem, that is, to predict the angle between the current direction and due north. Hsu et al.[54] trained their local region model with an auxiliary task that can speed up the convergence. Kulhánek et al.[51] further developed the auxiliary task technique for visual navigation by extending the batched A2C algorithm with pixel-control, reward prediction, depth-map prediction, and image segmentation tasks.
(3) Curriculum learning

Curriculum learning refers to the design of appropriate curricula for progressive learning from simple to complex, for example, moving a target from nearby to far away in navigation. The probability of completing a simple task and receiving a reward is higher than the probability of completing the original task, which helps to speed up the neural network training. The complexity of the course is then gradually increased.

This technique has been applied in DRL-based navigation tasks. Chen et al.[76] introduced a two-stage training process for curriculum learning. In the first stage, they trained the policy in a random scenario with eight robots; and in the second stage, they trained the policy in both random and circular scenarios with 16 robots. Similar to curriculum learning[63], multi-stage learning with policy evolution can also improve the DRL training efficiency under the sparse reward condition[40, 86].

5.3 Poor generalization

5.3.1 Challenge

Poor generalization is another challenge in DRL-based navigation. Because training in a physical environment is very time-consuming and dangerous, researchers typically train the agent in a simulation environment and then transfer it to the physical environment. There are two types of generalization:

(1) The agent is transferred from one simulation environment to another. The DRL algorithm essentially trains a reactive strategy to select the action with the maximum cumulative return in its training environment. The data distributions in different environments are very different, so it is not easy to transfer the navigation model trained in one environment to another environment.

(2) The agent is transferred from a simulation environment to a real environment. Generally, the real environment is more complex and dynamic. The reality gap between the virtual and real worlds is the core challenge of deploying a trained model directly into a real robot.

5.3.2 Solution

We can address the generalization problem by either expanding the sample space or reducing the state space.

(1) Expansion of the sample space

Adding randomness to the simulation environment to expand the sample space is a technique commonly used to solve the generalization problem. Random sensor noise, a random target position, random obstacles, and other changes help expand the data distribution of various scenarios and reduce the difficulty of transfer between simulation environments.

To cause the agent to adapt to new situations, Lei et al.[37] randomly set the target position in an unoccupied area to expand the state-space distribution of the experience pool. Zhang et al.[36] had the agent start from a random position and used four different maze environments for training. Long et al.[60] presented a multi-scenario training framework to learn an optimal policy, which is simultaneously trained using a large number of robots in rich, complex environments. The research of Zhu et al.[45] showed that a model trained in 16 scenes could be applied to new scenes without extra training. The model can then be transferred from a simulation environment to a real one with only a small amount of fine-tuning.

(2) Reduction of the state space

Expanding the sample space is time-consuming and requires subsequent fine-tuning in practical applications. Reducing the state-space dimensions can achieve similar generalization ability at less computational cost. Using the image as a part of the state space has the advantage of abundant information, but there are often details such as illumination, texture, and color changes, which may not be helpful or may even be harmful to the learning of navigation ability.
Using sparse laser ranging as input focuses the DRL model on learning the mapping relationship between the distances to obstacles and the motion instructions. Most DRL-based navigation research that can be transferred to a real environment has used range finders as the sensor. Tai et al.[34] used only the 10-dimensional sparse laser range findings and related target information as the state space. Their model can be extended to a real mobile robot platform without any fine-tuning. Shi et al.[33] also reported that sparse laser range findings could reduce the reality gap. The system proposed by Yokoyama and Morioka[35] converts the depth images that are estimated from a monocular camera to 2D range data, rather than directly inputting the image.

In addition to reducing the state input size, Devo et al.[52] designed a two-network architecture comprising an object localization network and a navigation network to solve the generalization problem. This architecture reduces the state-space dimension of the navigation network by preprocessing images through the object localization network.

6 Conclusion

Navigation is a core ability of the mobile robot. Although the DRL policy cannot generate perfect behavior all the time and the DRL-based navigation performance fluctuates, DRL continues to show the most promise for achieving breakthroughs in navigation capability.

This paper provided a comprehensive and systematic review of DRL-based mobile robot navigation research. The application scenarios, current challenges, and possible solutions were discussed. We anticipate that this paper will help researchers further improve the current research results and apply them to real systems to realize human-level navigation intelligence.

References

[1] W. Rone and P. Ben-Tzvi, Mapping, localization and motion planning in mobile multi-robotic systems, Robotica, vol. 31, no. 1, pp. 1–23, 2013.
[2] J. Engel, T. Schöps, and D. Cremers, LSD-SLAM: Large-scale direct monocular SLAM, in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds. Zurich, Switzerland: Springer International Publishing, 2014, pp. 834–849.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, ORB-SLAM: A versatile and accurate monocular SLAM system, IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[4] G. Grisetti, C. Stachniss, and W. Burgard, Improved techniques for grid mapping with rao-blackwellized particle filters, IEEE Transactions on Robotics, vol. 23, no. 1, pp. 34–46, 2007.
[5] S. Kohlbrecher, O. von Stryk, J. Meyer, and U. Klingauf, A flexible and scalable SLAM system with full 3D motion estimation, presented at 2011 IEEE Int. Symp. Safety, Security, and Rescue Robotics, Kyoto, Japan, 2011, pp. 155–160.
[6] M. Elbanhawi and M. Simic, Sampling-based robot motion planning: A review, IEEE Access, vol. 2, pp. 56–77, 2014.
[7] D. Fox, W. Burgard, and S. Thrun, The dynamic window approach to collision avoidance, IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602, 2013.
[9] A. Banino, C. Barry, B. Uria, C. Blundell, T. Lillicrap, P. Mirowski, A. Pritzel, M. J. Chadwick, T. Degris, J. Modayil, et al., Vector-based navigation using grid-like representations in artificial agents, Nature, vol. 557, no. 7705, pp. 429–433, 2018.
[10] A. Pandey, S. Pandey, and D. R. Parhi, Mobile robot navigation and obstacle avoidance techniques: A review, International Robotics & Automation Journal, vol. 2, no. 3, pp. 96–105, 2017.
[11] F. Kamil, S. H. Tang, W. Khaksar, N. Zulkifli, and S. A. Ahmad, A review on motion planning and obstacle avoidance approaches in dynamic environments, Advances in Robotics & Automation, vol. 4, no. 2, p. 1000134, 2015.
[12] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications, IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3826–3839, 2020.
[13] F. Y. Zeng, C. Wang, and S. S. Ge, A survey on visual navigation for artificial agents with deep reinforcement learning, IEEE Access, vol. 8, pp. 135426–135442, 2020.
[14] C. J. C. H. Watkins, Learning from delayed rewards, PhD dissertation, University of Cambridge, Cambridge, England, 1989.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[16] H. Van Hasselt, A. Guez, and D. Silver, Deep reinforcement learning with double Q-learning, in Proc. 30th AAAI Conf. Artificial Intelligence, Phoenix, AZ, USA, 2016, pp. 2094–2100.
[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971, 2015.
[18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, arXiv preprint arXiv:1602.01783, 2016.
[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347, 2017.
[20] Q. Shi, S. Zhao, X. W. Cui, M. Q. Lu, and M. D. Jia, Anchor self-localization algorithm based on UWB ranging and inertial measurements, Tsinghua Science and Technology, vol. 24, no. 6, pp. 728–737, 2019.
[21] A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson, PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning, in Proc. 2018 IEEE Int. Conf. Robotics and Automation, Brisbane, Australia, 2018, pp. 5113–5120.
[22] M. Duguleana and G. Mogan, Neural networks based reinforcement learning for mobile robots obstacle avoidance, Expert Systems with Applications, vol. 62, pp. 104–115, 2016.
[23] S. M. Feng, H. L. Ren, X. R. Wang, and P. Ben-Tzvi, Mobile robot obstacle avoidance based on deep reinforcement learning, in Proc. ASME 2019 Int. Design Engineering Technical Conferences and Computers and Information in Engineering Conf., Anaheim, CA, USA, 2019.
[24] Y. Kato, K. Kamiyama, and K. Morioka, Autonomous robot navigation system with learning based on deep Q-network and topological maps, in Proc. 2017 IEEE/SICE Int. Symp. System Integration, Taipei, China, 2017, pp. 1040–1046.
[25] Y. Kato and K. Morioka, Autonomous robot navigation system without grid maps based on double deep Q-network and RTK-GNSS localization in outdoor environments, in Proc. 2019 IEEE/SICE Int. Symp. System Integration, Paris, France, 2019, pp. 346–351.
[26] C. Wang, J. Wang, X. D. Zhang, and X. Zhang, Autonomous navigation of UAV in large-scale unknown complex environment with deep reinforcement learning, in Proc. 2017 IEEE Global Conf. Signal and Information Processing, Montreal, Canada, 2017, pp. 858–862.
[27] C. Wang, J. Wang, Y. Shen, and X. D. Zhang, Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach, IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2124–2136, 2019.
[28] Z. W. Ma, C. Wang, Y. F. Niu, X. K. Wang, and L. C. Shen, A saliency-based reinforcement learning approach for a UAV to avoid flying obstacles, Robotics and Autonomous Systems, vol. 100, pp. 108–118, 2018.
[29] J. Woo and N. Kim, Collision avoidance for an unmanned surface vehicle using deep reinforcement learning, Ocean Engineering, vol. 199, p. 107001, 2020.
[30] X. Wu, H. L. Chen, C. G. Chen, M. Y. Zhong, S. R. Xie, Y. K. Guo, and H. Fujita, The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method, Knowledge-Based Systems, vol. 196, p. 105201, 2020.
[31] X. Y. Zhang, C. B. Wang, Y. C. Liu, and X. Chen, Decision-making for the autonomous navigation of maritime autonomous surface ships based on scene division and deep reinforcement learning, Sensors, vol. 19, no. 18, p. 4055, 2019.
[32] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, Uncertainty-aware reinforcement learning for collision avoidance, arXiv preprint arXiv:1702.01182, 2017.
[33] H. B. Shi, L. Shi, M. Xu, and K. S. Hwang, End-to-end navigation strategy with deep reinforcement learning for mobile robots, IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2393–2402, 2020.
[34] L. Tai, G. Paolo, and M. Liu, Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation, in Proc. 2017 IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Vancouver, Canada, 2017, pp. 31–36.
[35] K. Yokoyama and K. Morioka, Autonomous mobile robot with simple navigation system based on deep reinforcement learning and a monocular camera, in Proc. 2020 IEEE/SICE Int. Symp. System Integration, Honolulu, HI, USA, 2020, pp. 525–530.
[36] J. W. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, Deep reinforcement learning with successor features for navigation across similar environments, in Proc. 2017 IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Vancouver, Canada, 2017, pp. 2371–2378.
[37] X. Y. Lei, Z. Zhang, and P. F. Dong, Dynamic path planning of unknown environment based on deep reinforcement learning, Journal of Robotics, vol. 2018, p. 5781591, 2018.
[38] M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations, IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4423–4430, 2018.
[39] C. Sampedro, H. Bavle, A. Rodriguez-Ramos, P. de la Puente, and P. Campoy, Laser-based reactive navigation for multirotor aerial robots using deep reinforcement learning, in Proc. 2018 IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Madrid, Spain, 2018, pp. 1024–1031.
[40] C. Wang, J. Wang, J. J. Wang, and X. D. Zhang, Deep-reinforcement-learning-based autonomous UAV navigation with sparse rewards, IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6180–6190, 2020.
[41] F. Aznar, M. Pujol, and R. Rizo, Obtaining fault tolerance avoidance behavior using deep reinforcement learning, Neurocomputing, vol. 345, pp. 77–91, 2019.
[42] J. Choi, K. Park, M. Kim, and S. Seok, Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view, in Proc. 2019 Int. Conf. Robotics and Automation, Montreal, Canada, 2019, pp. 5993–6000.
[43] F. Leiva and J. Ruiz-del-Solar, Robust RL-based map-less local planning: Using 2D point clouds as observations, IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5787–5794, 2020.
[44] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., Learning to navigate in complex environments, arXiv preprint arXiv:1611.03673, 2017.
[45] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, Target-driven visual navigation in indoor scenes using deep reinforcement learning, in Proc. 2017 IEEE Int. Conf. Robotics and Automation, Singapore, 2017.
[74] Y. Y. Chen, C. C. Liu, B. E. Shi, and M. Liu, Robot navigation in crowds by graph convolutional networks with attention learned from human gaze, IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2754–2761, 2020.
[75] H. T. L. Chiang, A. Faust, M. Fiser, and A. Francis, Learning navigation behaviors end-to-end with AutoRL, IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2007–2014, 2019.
[76] G. D. Chen, S. Y. Yao, J. Ma, L. F. Pan, Y. A. Chen, P. Xu, J. M. Ji, and X. P. Chen, Distributed non-communicating multi-robot collision avoidance via map-based deep reinforcement learning, Sensors, vol. 20, no. 17, p. 4836, 2020.
[77] V. J. Hodge, R. Hawkins, and R. Alexander, Deep reinforcement learning for drone navigation using sensor data, Neural Computing and Applications, vol. 33, no. 6, pp. 2015–2033, 2021.
[78] Y. D. Wang, H. B. He, and C. Y. Sun, Learning to navigate through complex dynamic environment with modular deep reinforcement learning, IEEE Transactions on Games, vol. 10, no. 4, pp. 400–412, 2018.
[79] E. Parisotto and R. Salakhutdinov, Neural map: Structured memory for deep reinforcement learning, arXiv preprint arXiv:1702.08360, 2017.
[80] J. J. Zeng, R. S. Ju, L. Qin, Y. Hu, Q. J. Yin, and C. Hu, Navigation in unknown dynamic environments based on deep reinforcement learning, Sensors, vol. 19, no. 18, p. 3837, 2019.
[81] K. Lobos-Tsunekawa, F. Leiva, and J. Ruiz-del-Solar, Visual navigation for biped humanoid robots using deep reinforcement learning, IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3247–3254, 2018.
[82] Q. C. Zhang, M. Q. Zhu, L. Zou, M. Li, and Y. Zhang, Learning reward function with matching network for mapless navigation, Sensors, vol. 20, no. 13, p. 3664, 2020.
[83] A. Hussein, E. Elyan, M. M. Gaber, and C. Jayne, Deep imitation learning for 3D navigation tasks, Neural Computing and Applications, vol. 29, no. 7, pp. 389–404, 2018.
[84] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, Reinforcement learning with unsupervised auxiliary tasks, arXiv preprint arXiv:1611.05397, 2016.
[85] P. Mirowski, M. K. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell, Learning to navigate in cities without a map, arXiv preprint arXiv:1804.00168, 2019.
[86] D. W. Wang, T. X. Fan, T. Han, and J. Pan, A two-stage reinforcement learning approach for multi-UAV collision avoidance under imperfect sensing, IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3098–3105, 2020.
Tao Zhang received the BS, MS, and PhD degrees in control science and engineering from Tsinghua University, Beijing, China in 1993, 1995, and 1999, respectively, and the second PhD degree from Saga University, Saga, Japan in 2002. He was a visiting associate professor with Saga University from 1999 to 2003. From 2003 to 2006, he worked as a researcher with the National Institute of Informatics, Tokyo, Japan.
He is currently a professor and serves as the dean of the Department of Automation, School of Information Science and Technology, Tsinghua University, Beijing. He has authored or coauthored more than 200 papers and eight books. He is a fellow of IET, and a member of the American Institute of Aeronautics and Astronautics and the Institute of Electronics, Information and Communication Engineers. He currently serves as an editorial board member and technical editor for IEEE/ASME Transactions on Mechatronics. His current research includes robotics, image processing, control theory, artificial intelligence, navigation, and control of spacecraft.

Kai Zhu received the BS degree in mechanical engineering from Tsinghua University, Beijing, China in 2019. He is currently pursuing the PhD degree at the Department of Automation, Tsinghua University. His research interests include path planning, autonomous navigation, deep reinforcement learning, and their applications in multi-robot collaboration.