Abstract— In Urban Search and Rescue (USAR) missions, mobile rescue robots need to search cluttered disaster environments in order to find victims. However, these environments can be very challenging due to the unknown rough terrain that the robots must be able to navigate. In this paper, we explore the first use of deep reinforcement learning (DRL) to address the robot navigation problem in such cluttered environments with unknown rough terrain. We have developed and trained a DRL network that uses raw sensory data from the robot's onboard sensors to determine a series of local navigation actions for a mobile robot to execute. The performance of our approach was successfully tested in several unique 3D simulated environments with varying sizes and levels of traversability.

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Canada Research Chairs (CRC) Program.
K. Zhang, F. Niroui and G. Nejat are with the Autonomous Systems and Biomechatronics Laboratory (ASBLab) in the Department of Mechanical and Industrial Engineering, University of Toronto (email: [email protected]; [email protected]; [email protected]). The first two authors contributed equally to this work.
M. Ficocelli is with the Guidance, Navigation & Control group at MacDonald, Dettwiler and Associates (MDA) Inc., Brampton, Canada (email: [email protected]).

I. INTRODUCTION

Mobile rescue robots deployed in Urban Search and Rescue (USAR) missions must navigate unknown rough terrain in order to explore cluttered environments and search for potential victims [1]. However, the traversability of the rough terrain can vary greatly, consisting of different rubble piles with various shapes and sizes. In order to perform semi-autonomously or fully autonomously, these robots need to find paths to safely navigate these cluttered environments with unknown terrain, without an a priori map of the environment.

Previous work on robot navigation of rough terrain has mainly focused on known terrain [2]. A feasible path to a goal location in the environment can be determined using such techniques as graph search [3], rapidly exploring random trees [4] and potential field methods [5], [6]. In cases when the terrain is unknown, the robot can navigate to multiple local target locations using a defined utility function [7]–[11]. By navigating to these local locations, the robot progresses towards the final goal location. For a number of these approaches, the robot also builds a model of the environment, e.g. [7]–[10]. The challenge with such approaches is that they can require substantial expert input for parameter tuning [12].

In order to address this issue, learning techniques have been proposed for robot navigation in rough terrain [12]–[18]. These techniques focus on learning to classify the traversability of terrain from environment features. In particular, learning is used to 1) classify the surrounding terrain, which is then represented as a costmap [13], [16]–[18], or 2) learn the overall cost function [12], [14], [15], in order to plan optimal paths to goal locations.

Our own previous research has focused on utilizing traditional learning methods (e.g. MAXQ hierarchical reinforcement learning, support vector machines) and utility function based approaches to address such tasks as exploration, rough terrain navigation, and victim identification [19]–[22].

However, to effectively train learning techniques, a large amount of labeled data is usually required, which can be time consuming to obtain [23]. A handful of techniques [13], [18] have automated the process of data collection and labeling by having the robot directly interact with the environment in order to assign a class to a set of online captured data.

In this research, we investigate the use of deep reinforcement learning (DRL) to address the robot navigation problem in environments with unknown rough terrain, in particular in USAR scenarios. DRL can directly use raw sensory data to determine robot navigation actions without the need for pre-labeled data [24]. A handful of papers have applied DRL approaches to robot navigation using onboard sensory information in environments with known [25]–[27] and unknown [28] flat terrain. However, DRL has yet to be implemented for cluttered environments with unknown rough terrain.

In USAR missions, there are areas of interest with high likelihoods of victims being present. A rescue robot needs to navigate to these regions in order to search for victims. For the robot navigation problem addressed in this paper, these areas are defined as goal target locations for the robot. Namely, we are addressing the local navigation problem, where the goal target locations can be given by a global exploration planner such as in [29], and the robot needs to locally navigate to these locations without previous knowledge of the unknown rough terrain. Such a scenario would arise after a natural disaster such as an earthquake, when a building has collapsed and the terrain at the site is unknown a priori, yet a rescue robot needs to navigate the environment to help search for victims.

In this paper we present the first use of DRL to address the mobile robot navigation problem in unknown rough terrain such as in USAR environments. The main contribution of this work is the design of a DRL network which uses raw
sensory data from the robot's onboard sensors to determine a series of primitive navigation actions for the robot to execute in order to traverse to a goal location in an environment with unknown rough terrain.

II. ROBOT LEARNING IN ROUGH TERRAIN

Existing work on robot navigation in rough terrain can be categorized into learning approaches that focus on classifying the terrain [17], [18] as well as its traversability [12]–[16], and learning approaches that determine robot navigation actions [25], [26], [28].

A. Learning to Classify Terrain for Navigation

With respect to classifying terrain, in [17], a Support Vector Machine (SVM) classifier was used to improve robot navigation capabilities in forest environments. The classifier was trained to identify 3D points from LiDAR data features which represented the surface of the terrain. The 3D points were hand labeled as either ground or non-ground and used for training the classifier. Online learning was also utilized in order to adapt the classifier to changes in the terrain due to season and vegetation growth. This approach was validated using different experimental data sets.

A terrain modeling approach was introduced in [18], where noisy sensor data from stereo cameras and LiDAR were used to estimate ground and vegetation height and classify obstacles, mainly in agricultural environments. Three prior assumptions were made: ground height varies smoothly, terrain in neighboring cells is likely to be in the same class, and class members have similar heights. The assumptions were represented as a probabilistic generative model which included two Markov random fields and a latent variable. Training data was autonomously collected by driving a tractor in a known environment to train the model.

With respect to learning the traversability of rough terrain environments, in general, sensory information and prior knowledge of the environment have been used to learn cost functions [12], [14], [15] or compute cost maps [13], [16], which are then used in planning navigation paths. For example, in [12], learning from demonstration (LfD) was used to learn a cost function from terrain data which were labeled by a human expert for navigation. In particular, LfD was used to learn the mapping from perceptual information to planning costs. The data included prior overhead feature maps of the outdoor environment and onboard sensory information such as LiDAR data. The cost function was used in conjunction with the Field D* planning algorithm to navigate a robot to waypoints. This approach was compared to manually engineered cost maps in outdoor field tests where a robot navigated a series of courses. The learned cost function achieved more reliable results and improved performance over the engineered cost maps.

Similarly, in [14], LfD was used to learn non-linear cost functions from geometric features of satellite maps of rough terrain. The maps were labeled by tracking the paths walked by several human experts, and then conformal geometric algebra was used to describe the maps. The geometric features were used as inputs into a neural network to learn the non-linear cost function to be used by a navigation planner.

In [15], an approach was introduced where far-range data from overhead images and height maps were used along with near-range data from a robot's onboard LiDAR to compute traversal costs of the terrain. This was done by using an online Bayesian learning approach to combine the far-range and near-range features. In experiments, a robot was able to use the traversal costs along with the D* planner to navigate large outdoor cluttered environments.

In [13], a robot used online learning to classify different terrain patches of outdoor environments as traversable or non-traversable. Geometric and appearance features from stereo images were used as inputs. An autonomous data collection system was used to gather labeled training data. The collected features, which included height estimates and surface texture, were clustered into traversable and non-traversable terrain. A planner used the online terrain classifier to compute cost maps in two different outdoor environments. The results showed that the classifier approach was able to outperform a traditional planner when the robot had to navigate through tall traversable vegetation.

In [16], terrain traversability was determined by using a neural network with geometric features from stereo cameras as input. Four cost classes of low, intermediate, high, and lethal were used to represent the traversability. The training was done offline first, and then in an online learning process a robot learned to infer geometric features from RGB images when stereo information was not available. By combining the classifications from geometric features and color images, a cost map was created for path planning. A comparison with different network configurations and other classifiers showed that the neural network configurations were the most suitable classifiers due to their low error score and fast computation for a wide range of environments with obstacles and different lighting conditions.

B. Learning to Navigate Terrain

Learning techniques used for a robot to learn navigation actions have mainly used DRL to determine continuous or discrete navigation actions using onboard sensory information in environments with known [25]–[27] or unknown [28] flat terrain.

In [25], a robot motion controller used a Deep Q-Network (DQN) with successor features to provide discrete robot actions to a goal object location in a known environment. These actions included standing still, turning 90° left or right, or going straight for 1 m. The input to the controller was depth images from a Kinect sensor. The controller was trained in simulation within an environment with walls and obstacles. The trained controller was then transferred to a new environment on a physical mobile robot, and additional training was conducted. Experiments in this environment showed that the amount of additional training needed was less than training a new network from scratch, and the proposed model with successor features preserved the ability to navigate both the simulated and real environments.

In [26], a motion controller using DRL with generalized computation graphs was developed to provide continuous steering angle predictions for a mobile robot. The inputs to the system included grayscale images from an onboard camera, as well as previous action estimations. Training and testing were conducted in the same single-path corridor for the robot to learn to move at a fixed speed without collisions. The proposed planner was compared with planners using Double Q-learning and random policies. The results demonstrated that, on average, the robot was able to travel 7 times and 15 times longer distances, respectively, after 4 hours of real world training.

In [27], a robot localization and navigation module was developed with a modified Asynchronous Advantage Actor-Critic (A3C) approach. The objective was to localize a robot in an environment for which a map is given and determine navigation actions such as move (forward, backward, left, right) and rotate (left, right) to a given target location. The inputs to the system included a 2D map of the environment, RGB-D images, the robot heading and the previously estimated robot location. Both training and testing were conducted in 3D simulated maze environments, with the testing environments different from the ones used in training. Results showed navigation success rates of 91%–100% depending on map size.

With respect to environments with unknown flat terrain, in [28], a mapless motion planner was developed using the asynchronous deep deterministic policy gradients method to provide continuous linear and angular velocities for a mobile robot to navigate to a target location. The inputs to the system were sparse laser range data, the previous robot action, and the relative position of the target location. The system was trained in two simulated indoor environments with walls and obstacles. Experiments were conducted on a mobile robot in an office environment in both simulation and a real setting. The proposed planner was compared with the Move Base motion planner, and the results showed that it was able to navigate the robot to all target locations in unseen environments, especially in narrow areas, whereas the Move Base planner was not.

The aforementioned literature shows the feasibility of using DRL to learn navigation policies from high dimensional sensory inputs. However, to date, DRL has not yet been applied to robot navigation in cluttered environments with unknown rough terrain, where traversability is challenging due to the varying shapes and sizes of both climbable and non-climbable terrain. Our research focuses on the first development and investigation of a DRL network to address the robot navigation problem in such difficult environments.
III. DEEP REINFORCEMENT LEARNING FOR ROBOT NAVIGATION IN UNKNOWN ROUGH TERRAIN

Our proposed DRL architecture for a robot to learn how to navigate an environment with unknown cluttered terrain is presented in Fig. 1. The architecture uses an end-to-end learning strategy, in which a robot learns to navigate rough terrain by using elevation maps and high dimensional sensory data from depth images as inputs to a DRL Navigation module. This module uses deep reinforcement learning to determine the navigation actions for the robot. Once a navigation action has been chosen, the robot's Low-Level Controller executes the action.

Figure 1. Proposed architecture for rough terrain robot navigation.

In the following subsections we discuss the main modules of our architecture in more detail.

A. Inputs

Inputs to the DRL Navigation module consist of depth images, the Elevation Map, and the robot's orientation (α, β, γ), where the robot heading γ is relative to a goal target location on the unknown terrain.

B. Elevation Map

An elevation map of the robot's surrounding environment is generated and updated as the robot navigates the environment. In order to obtain this elevation map we use the ROS OctoMap package [30]. Depth images as well as the robot's 3D pose, containing both position (x, y, z) and orientation (α, β, γ), are used to create a 3D occupancy grid. This 3D occupancy grid is then converted into a 2D grid map using the ROS grid_map library [31], where each grid cell represents the elevation of a local square area.
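To make the elevation-map step concrete, the following is a minimal sketch, independent of the ROS OctoMap [30] and grid_map [31] packages actually used in our pipeline: world-frame points obtained from the depth images and the robot's 3D pose are binned into a 2D grid, with each cell keeping the maximum observed height. The function name, grid resolution, and origin are illustrative assumptions, not values from the paper.

```python
# A minimal sketch (not the OctoMap/grid_map pipeline itself) of the elevation-map
# idea: 3D points in the world frame are binned into a 2D grid, and each cell
# stores the highest observed z value.
import numpy as np

def update_elevation_map(points_xyz, elev_map, resolution=0.05, origin=(0.0, 0.0)):
    """points_xyz: (N, 3) array of world-frame points from the depth sensor.
    elev_map: (H, W) array of cell elevations (NaN where unobserved).
    resolution and origin are assumptions, not values from the paper."""
    h, w = elev_map.shape
    ix = np.floor((points_xyz[:, 0] - origin[0]) / resolution).astype(int)
    iy = np.floor((points_xyz[:, 1] - origin[1]) / resolution).astype(int)
    valid = (ix >= 0) & (ix < h) & (iy >= 0) & (iy < w)
    for x, y, z in zip(ix[valid], iy[valid], points_xyz[valid, 2]):
        # Keep the maximum height seen in each cell.
        if np.isnan(elev_map[x, y]) or z > elev_map[x, y]:
            elev_map[x, y] = z
    return elev_map
```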
C. DRL Navigation Module

The DRL Navigation module uses the Asynchronous Advantage Actor-Critic (A3C) approach [32] to learn a policy that determines the robot's navigation actions using depth information, the robot's 3D orientation and the elevation map. We use A3C as it can handle our high dimensional input space and has a faster learning speed compared to other DRL techniques such as DQN [32]. This is due to the approach deploying several agents that learn in parallel using multiple threads.

1) A3C

The A3C model uses the actor-critic method for reinforcement learning by combining the policy and state-value functions [32]. At each discrete time step t, the robot in state s_t executes a navigation action a_t which transitions the robot to state s_t' in order to maximize the expected future reward. These navigation actions consist of the robot moving forward, backward, or turning right or left.

Herein, the actor maintains the policy π(a_t|s_t; θ), which is determined by a neural network with parameters θ. The critic estimates the value function V(s_t; θ_v) in order for the actor to adjust its policy accordingly. V(s_t; θ_v) is estimated by a neural network with parameters θ_v. Both the actor and the critic update their networks as new information is obtained by the robot. The policy and the value function are updated after every t_max steps or when a terminal state is reached.

Rewards are used to compute and accumulate the gradients at every step. Stochastic gradient descent (SGD) is used to update the policy as follows [32], [33]:

\Delta\theta = \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, A(s_t, a_t; \theta_v) + \zeta \nabla_{\theta} H(\pi(s_t; \theta)),   (1)

where

A(s_t, a_t; \theta_v) = \sum_{i=0}^{n-1} \Gamma^{i} r_{t+i} + \Gamma^{n} V(s_{t+n}; \theta_v) - V(s_t; \theta_v),   (2)

and n represents the number of remaining steps until t_max, or until a terminal state. H is the entropy of the policy, included to encourage the robot to explore different navigation actions, and the hyperparameter ζ controls the strength of the entropy regularization term. A(s_t, a_t; θ_v) is an estimate of the advantage function, and Γ is the discount factor. SGD is also used to update the value function [32], [33]:

\Delta\theta_v = A(s_t, a_t; \theta_v)\, \nabla_{\theta_v} V(s_t; \theta_v).   (3)

Herein, θ and θ_v are optimized using the RMSprop approach [34].

As A3C deploys several agents acting independently in their own environments, each agent provides updates to the global policy and value function asynchronously.
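To connect Eqs. (1)–(3) to an implementation, the following is a minimal, framework-agnostic sketch (not our code) of how a single rollout of up to t_max steps can be turned into advantage estimates and the scalar objectives whose gradients correspond to the policy and value updates; the array names are illustrative.

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma, terminal):
    """Advantage estimates per Eq. (2) for one rollout of up to t_max steps.
    rewards: [r_t, ..., r_{t+n-1}]; values: [V(s_t), ..., V(s_{t+n-1})];
    bootstrap_value: V(s_{t+n}), taken as 0 if the rollout ended at a terminal state."""
    R = 0.0 if terminal else bootstrap_value
    advantages = np.zeros(len(rewards))
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R           # n-step return accumulated backwards
        advantages[i] = R - values[i]        # A(s_i, a_i; theta_v) = R - V(s_i; theta_v)
    return advantages

def a3c_objectives(log_pi_taken, entropies, advantages, zeta=0.01):
    """Scalar objectives matching Eqs. (1) and (3); advantages are treated as constants.
    log_pi_taken: log pi(a_i|s_i; theta); entropies: H(pi(.|s_i; theta))."""
    policy_objective = np.sum(log_pi_taken * advantages + zeta * entropies)  # maximize, Eq. (1)
    value_loss = 0.5 * np.sum(advantages ** 2)  # minimize; its gradient matches Eq. (3) up to the ascent/descent sign convention
    return policy_objective, value_loss
```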
2) Rewards

The reward r_t is given based on the robot: 1) getting closer to the goal target location, 2) reaching the target location within a certain tolerance (since the terrain is unknown), or 3) reaching an undesirable terminal state before reaching the goal.

The reward r_t is given after every executed navigation action and is represented as follows:

r_t = \begin{cases} \mu \dfrac{d_{min} - d_t}{\Delta t}, & d_t < d_{min} \\ 0.5 + r_v, & d_t \le d_g \\ -0.5, & s_{tu} \\ 0, & \text{elsewhere} \end{cases}   (4)

where d_min represents the closest distance the robot has navigated to with respect to the goal target location up to step t−1, d_t is the distance from the robot to the goal target location at step t, and Δt is the time duration from the previous time step t−1 to the current time step t. μ is a discount factor that can be used to choose the impact of this reward relative to the rewards provided for the terminal states. d_g represents the distance tolerance to the goal location, and r_v is a dynamic variable that represents the distance from the robot's starting location to the goal location over the total time the robot took to reach the goal. s_tu represents an undesirable terminal state, which includes repeatedly colliding with the same obstacle, the robot flipping over, the robot being stuck (i.e., attempting to climb a steep hill), or exceeding n_m steps. n_m is the maximum number of steps allowed for an episode, similar to the concept of a time-out.
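The reward in Eq. (4) can be read as the short Python sketch below; it is a direct transcription of the cases above, under the assumption that the distances and the undesirable-terminal flag are supplied by the simulator, and the function and argument names are illustrative.

```python
def step_reward(d_t, d_min, dt, mu, d_g, r_v, undesirable_terminal):
    """Reward r_t per Eq. (4).
    d_t: current distance to the goal; d_min: closest distance reached up to step t-1;
    dt: time elapsed since the previous step; mu: progress discount factor;
    d_g: goal distance tolerance; r_v: start-to-goal distance over total travel time."""
    if d_t <= d_g:                      # goal reached within tolerance
        return 0.5 + r_v
    if undesirable_terminal:            # collision loop, flip, stuck, or step limit n_m exceeded
        return -0.5
    if d_t < d_min:                     # new closest approach to the goal
        return mu * (d_min - d_t) / dt
    return 0.0                          # elsewhere
```

The goal case is checked first since it can overlap with the progress case when the robot's final step also sets a new closest distance.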
3) A3C Network Design

As previously mentioned, the neural network uses a depth image, an elevation map and the robot's 3D orientation as inputs to determine a navigation action. The proposed network can be seen in Fig. 2. There are three input branches, one for each input type. The elevation map and the 3D orientation branches are merged together by connecting to a shared convolutional layer (CL). The output of this CL then merges with the output of the depth image branch.

Figure 2. The proposed A3C network architecture. The three rectangular bounding boxes from top to bottom represent the Depth image branch, Elevation map branch, and Robot orientation branch, respectively.

Figure 3. Hidden layer configurations of the CNNs for: (a) the depth image branch, and (b) the elevation map branch and the layer after merging with the robot orientation branch. CONV represents a convolutional layer with filter size F, stride S, and padding P.

Depth image branch: depth images are downsampled from 480×640 to an 84×84 single channel array. Each element in the array represents a specific distance measurement. There are 4 CLs in this branch; the first two have 32 filters and the remaining two have 64 filters. Filter sizes decrease gradually from 5×5 to 3×3, Fig. 3a. The output of the last layer is reshaped to a vector and concatenated head-to-tail with the reshaped output vector of the CL which merges the other two branches.

Elevation map branch: the elevation map is input as a single channel 84×84 array. The first 2 CLs of this branch have 32 filters and the remaining use 64 filters. Each filter produces an activation map as the result of convolutions. Filter sizes decrease gradually from 7×7 to 4×4 to account for the decrease in activation map size, Fig. 3b.

Robot orientation branch: the 3D orientation (α, β, γ) of the robot is input as a vector with 3 elements and processed through a fully connected layer (FCL). The output of this layer is reshaped to a volume of 9×9×64 to match the output shape of the elevation map branch. Using element-wise addition, the two volumes are merged and used as the input to an additional CL. The combination of the α and β orientations with the elevation maps helps to infer the 3D terrain conditions that the robot is traversing. This facilitates the learning of a policy that helps the robot avoid steep slopes that could potentially roll or flip the robot over. Additionally, the robot heading γ allows the network to learn that when the robot's motion aligns with the direction of the goal target location, it usually approaches the goal quicker and receives higher rewards.

In this network, Rectified Linear Units (ReLU) [35] are applied to the output of all CLs. Once the branches are combined, the merged tensor is fed into a Long Short-Term Memory (LSTM) layer. This recurrent layer can improve DRL training by better capturing the underlying states of a partially observed environment [36].

After the LSTM layer, the network separates into two output branches, actor and critic. The actor output branch has one FCL that outputs a 4-dimensional vector, each dimension representing one navigation action. The value of each dimension indicates how likely a certain action is to be optimal (i.e., p_1 to p_4). The Softmax [35] function is applied to normalize the values, and the action with the highest likelihood of being optimal is chosen as a_t. The critic branch also uses an FCL. It outputs a single value that represents the estimate of the value function V(s_t; θ_v), which is used to improve future policies.
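To make the branch structure concrete, below is a PyTorch sketch of a network with this topology. It is not our exact implementation: the 84×84 inputs, the filter counts, the 9×9×64 merge volume, the element-wise addition followed by a shared CL, the LSTM, and the 4-way actor and scalar critic heads follow the description above, while the strides, paddings, number of elevation-branch layers, and the LSTM size are assumptions chosen so that the tensor shapes are consistent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoughTerrainA3CNet(nn.Module):
    def __init__(self, lstm_size=256, num_actions=4):
        super().__init__()
        # Depth image branch: 1x84x84 -> 4 CLs (32, 32, 64, 64 filters; 5x5 down to 3x3).
        self.depth = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )                                              # -> 64 x 6 x 6
        # Elevation map branch: 1x84x84 -> 32, 32, 64 filters (7x7 down to 4x4) -> 64 x 9 x 9.
        self.elev = nn.Sequential(
            nn.Conv2d(1, 32, 7, stride=3), nn.ReLU(),
            nn.Conv2d(32, 32, 5, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=1), nn.ReLU(),
        )                                              # -> 64 x 9 x 9
        # Robot orientation branch: (alpha, beta, gamma) -> FCL -> reshaped to 64 x 9 x 9.
        self.orient = nn.Linear(3, 64 * 9 * 9)
        # Additional CL applied after element-wise addition of the two 9x9x64 volumes.
        self.merge_conv = nn.Conv2d(64, 64, 3, stride=1, padding=1)
        self.lstm = nn.LSTMCell(64 * 6 * 6 + 64 * 9 * 9, lstm_size)
        self.actor = nn.Linear(lstm_size, num_actions)   # p_1..p_4 after softmax
        self.critic = nn.Linear(lstm_size, 1)            # V(s_t; theta_v)

    def forward(self, depth, elevation, orientation, hidden):
        d = torch.flatten(self.depth(depth), 1)
        e = self.elev(elevation)
        o = self.orient(orientation).view(-1, 64, 9, 9)
        m = F.relu(self.merge_conv(e + o))               # element-wise addition, then the shared CL
        x = torch.cat([d, torch.flatten(m, 1)], dim=1)   # head-to-tail concatenation with the depth vector
        h, c = self.lstm(x, hidden)
        return F.softmax(self.actor(h), dim=-1), self.critic(h), (h, c)
```

In use, each A3C worker would keep its own (h, c) LSTM state across an episode and reset it at episode boundaries.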
4) Training

In order to train the A3C network for the DRL Navigation module, we created a 3D simulator in Gazebo within the Robot Operating System (ROS). A 3D physics model of the Jaguar 4x4 wheeled mobile platform was created, including weight, center of mass, speed and ground resistance, to match the physical robot. The robot was equipped with both 3D odometry and a front-facing depth sensor.

We used 2,000 randomly generated unique 100 m² environments with a percentage of traversability varying from 70–90%. These environments included traversable terrain that was non-flat, including hills, valleys and irregular-shaped rubble piles, as shown in Fig. 4. The robot's starting location and goal target location within each environment were randomly selected for every episode on the traversable portion of the environment.

Figure 4. An example of the robot training environment.

For our training, nine agent threads were used simultaneously on an AMD Ryzen Threadripper 1950X CPU, with each agent having its own unique environment. The values of the parameters used in training are as follows: ζ = 0.01; Γ = 0.99; μ = 0.05; d_g = 0.5; n_m = 600; and t_max = 60. The learning rate was set to 0.0001 [32].
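For concreteness, the per-worker rollout loop implied by these parameters can be sketched as follows; the env.reset()/env.step() and policy interfaces and the way gradients are shipped to the global network are assumptions, while t_max = 60 and the episode step limit n_m = 600 are the values stated above.

```python
# A simplified single-worker rollout loop for the training setup above (a sketch,
# not the actual training code).
def run_worker_episode(env, policy, update_global, t_max=60, n_m=600):
    state = env.reset()           # random start and goal on traversable terrain
    done, steps = False, 0
    while not done and steps < n_m:
        rollout = []
        for _ in range(t_max):    # accumulate up to t_max transitions (Eqs. (1)-(3))
            action = policy(state)
            next_state, reward, done = env.step(action)   # reward follows Eq. (4)
            rollout.append((state, action, reward))
            state, steps = next_state, steps + 1
            if done or steps >= n_m:
                break
        update_global(rollout, state, done)   # asynchronous update of theta, theta_v
    return steps
```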
Fig. 5 and Fig. 6 present the training results for the cumulative reward per episode and the number of steps per episode, both averaged over 500 episodes. As can be seen in Fig. 5, the network started to converge at 22,000 episodes with an average cumulative reward of 0.5. The number of steps per episode also decreased to an average of 208 steps.

Figure 5. Cumulative Reward Per Episode Averaged Per 500 Episodes.

Figure 6. Number of Steps Per Episode Averaged Per 500 Episodes.

IV. EXPERIMENTS

We conducted experiments in new environments to validate the performance of our proposed DRL network for robot navigation in environments with unknown rough terrain. These environments varied in size from 100 m² to 400 m² and in percent traversability from 80–90%. Namely, within our simulator, 600 environments, each with a unique terrain distribution, were generated. The starting location and goal target location for the mobile robot were randomly selected; however, these locations were constrained to the traversable portion of each environment, similar to training.

A. Results

The robot was able to navigate to the goal target locations in the majority of the environments. For the training stage we only used environment sizes of 100 m², whereas in testing we also included the larger environment size of 400 m². The experimental results can be found in Table I. We obtained high success rates regardless of the size of the environment, showing that our approach is robust to new environments with different terrain distributions and sizes than those used in training. As expected, the success rates were higher for the environments with fewer non-climbable obstacles, i.e., 85–90% traversability. The average computation time for the DRL Navigation module to select an action was 0.074 s.

TABLE I. SUCCESS RATE FOR ROBOT REACHING GOAL TARGET LOCATIONS

  Environment Size   Traversability 90%   Traversability 85%   Traversability 80%
  100 m²             84.1%                81.7%                71.3%
  400 m²             85%                  83.1%                76.7%

Robot navigation paths in example environments with unknown rough terrain are shown in Fig. 7 and Fig. 8. In Fig. 7, it can be seen that the robot learns to traverse different terrain (i.e., a hill and an irregular-shaped rubble pile) while avoiding large non-traversable obstacles. Fig. 8 shows how the robot is able to traverse narrow pathways between large non-traversable objects.

Figure 7. Topological view (left) and side view (right) of the robot traversing a hill and an irregular-shaped rubble pile. The robot navigation path is shown on the topological view. S represents the starting point and E is the end goal target location.

Figure 8. Topological view (left) and side view (right) of the robot navigating narrow pathways. The robot navigation path is shown on the topological view. S represents the starting point and E is the end goal target location.

From the experiments, we observed that if the robot traversed too close to deep valleys with large descending slopes, it could slip down into the valley and not be able to get out (e.g., becoming stuck). This led to an undesirable terminal state.
V. CONCLUSION

In this paper, we investigate the use of DRL for robot navigation in environments with unknown rough terrain such as the ones found in USAR. Namely, we developed a network based on the A3C architecture which uses depth images, elevation maps, and 3D orientation as inputs to determine optimal robot navigation actions. The network was trained in unique simulated 3D environments which varied in terms of level of traversability. Experiments in new environments ranging in size and traversability were conducted. The results showed that the DRL approach can successfully navigate a robot in an environment towards a goal target location when the rough terrain is unknown. Future work will include implementing and testing the navigation approach on a physical robot in similar environments.

REFERENCES

[1] Y. Liu and G. Nejat, "Robotic Urban Search and Rescue: A Survey from the Control Perspective," J. Intell. Robot. Syst., vol. 72, no. 2, pp. 147–165, Nov. 2013.
[2] M. Hoy, A. S. Matveev, and A. V. Savkin, "Algorithms for collision-free navigation of mobile robots in complex cluttered environments: a survey," Robotica, vol. 33, no. 3, pp. 463–497, Mar. 2015.
[3] A. Stentz, "Optimal and efficient path planning for partially-known environments," in Proceedings of IEEE International Conference on Robotics and Automation, vol. 4, pp. 3310–3317, 1994.
[4] J. J. Kuffner and S. M. LaValle, "RRT-connect: An efficient approach to single-query path planning," in Proceedings of IEEE International Conference on Robotics and Automation, vol. 2, pp. 995–1001, 2000.
[5] O. Khatib, "Real-time obstacle avoidance for manipulators and mobile robots," in Proceedings of IEEE International Conference on Robotics and Automation, vol. 2, pp. 500–505, 1985.
[6] J. Barraquand, B. Langlois, and J. C. Latombe, "Numerical potential field techniques for robot path planning," IEEE Trans. Syst. Man Cybern., vol. 22, no. 2, pp. 224–241, Mar. 1992.
[7] M. Wang and J. N. K. Liu, "Fuzzy logic-based real-time robot navigation in unknown environment with dead ends," Robot. Auton. Syst., vol. 56, no. 7, pp. 625–643, Jul. 2008.
[8] F. Niroui, B. Sprenger, and G. Nejat, "Robot exploration in unknown cluttered environments when dealing with uncertainty," in Proceedings of IEEE International Symposium on Robotics and Intelligent Sensors (IRIS), pp. 224–229, 2017.
[9] A. V. Savkin and C. Wang, "Seeking a path through the crowd: Robot navigation in unknown dynamic environments with moving obstacles based on an integrated environment representation," Robot. Auton. Syst., vol. 62, no. 10, pp. 1568–1580, Oct. 2014.
[10] A. Chilian and H. Hirschmüller, "Stereo camera based navigation of mobile robots on rough terrain," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4571–4576, 2009.
[11] S. Lacroix and R. Chatila, "Motion and perception strategies for outdoor mobile robot navigation in unknown environments," in Experimental Robotics IV, Springer, Berlin, Heidelberg, pp. 538–547, 1997.
[12] D. Silver, J. A. Bagnell, and A. Stentz, "Learning from Demonstration for Autonomous Navigation in Complex Unstructured Terrain," Int. J. Robot. Res., vol. 29, no. 12, pp. 1565–1592, Oct. 2010.
[13] D. Kim, J. Sun, S. M. Oh, J. M. Rehg, and A. F. Bobick, "Traversability classification using unsupervised on-line visual learning for outdoor robot navigation," in Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 518–525, 2006.
[14] R. Valencia-Murillo, N. Arana-Daniel, C. Lopez-Franco, and A. Alanis, "Rough Terrain Perception Through Geometric Entities for Robot Navigation," in Proceedings of International Conference on Advances in Computer Science and Engineering, 2013.
[15] B. Sofman, E. Lin, J. A. Bagnell, J. Cole, N. Vandapel, and A. Stentz, "Improving robot navigation through self-supervised online learning," J. Field Robot., vol. 23, no. 11–12, pp. 1059–1075, Nov. 2006.
[16] M. Happold, M. Ollis, and N. Johnson, "Enhancing Supervised Terrain Classification with Predictive Unsupervised Learning," in Robotics: Science and Systems, 2006.
[17] S. Zhou, J. Xi, M. W. McDaniel, T. Nishihata, P. Salesses, and K. Iagnemma, "Self-supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain," J. Field Robot., vol. 29, no. 2, pp. 277–297, Mar. 2012.
[18] C. Wellington, A. Courville, and A. Stentz, "A Generative Model of Terrain for Autonomous Navigation in Vegetation," Int. J. Robot. Res., vol. 25, no. 12, pp. 1287–1304, Dec. 2006.
[19] B. Doroodgar, Y. Liu, and G. Nejat, "A Learning-Based Semi-Autonomous Controller for Robotic Exploration of Unknown Disaster Scenes While Searching for Victims," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2719–2732, Dec. 2014.
[20] Y. Liu, G. Nejat, and B. Doroodgar, "Learning based semi-autonomous control for robots in urban search and rescue," in Proceedings of IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 1–6, 2012.
[21] B. Doroodgar and G. Nejat, "A hierarchical reinforcement learning based control architecture for semi-autonomous rescue robots in cluttered environments," in Proceedings of IEEE International Conference on Automation Science and Engineering, pp. 948–953, 2010.
[22] W.-Y. G. Louie and G. Nejat, "A victim identification methodology for rescue robots operating in cluttered USAR environments," Adv. Robot., vol. 27, no. 5, pp. 373–384, Apr. 2013.
[23] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning," arXiv:1312.5602 [cs], Dec. 2013.
[24] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A Brief Survey of Deep Reinforcement Learning," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[25] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, "Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments," arXiv:1612.05533 [cs], Dec. 2016.
[26] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, "Self-supervised Deep Reinforcement Learning with Generalized Computation Graphs for Robot Navigation," arXiv:1709.10489 [cs], Sep. 2017.
[27] G. Brunner, O. Richter, Y. Wang, and R. Wattenhofer, "Teaching a Machine to Read Maps with Deep Reinforcement Learning," arXiv:1711.07479 [cs, stat], Nov. 2017.
[28] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36, 2017.
[29] J. Vilela, Y. Liu, and G. Nejat, "Semi-autonomous exploration with robot teams in urban search and rescue," in Proceedings of IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 1–6, 2013.
[30] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: an efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013.
[31] P. Fankhauser and M. Hutter, Robot Operating System (ROS) – The Complete Reference, chapter 5, vol. 1. Springer, 2016.
[32] V. Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning," in Proceedings of International Conference on Machine Learning (PMLR), Feb. 2016.
[33] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. Bellemare, and R. Munos, "The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning," in Proceedings of International Conference on Learning Representations, Feb. 2018.
[34] G. Hinton, "rmsprop: Divide the gradient by a running average of its recent magnitude," presented at the COURSERA: Neural Networks for Machine Learning, 2012.
[35] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[36] M. Hausknecht and P. Stone, "Deep Recurrent Q-Learning for Partially Observable MDPs," arXiv:1507.06527 [cs], Jul. 2015.