
Received July 17, 2018, accepted September 16, 2018, date of publication September 28, 2018, date of current version October 29, 2018.

Digital Object Identifier 10.1109/ACCESS.2018.2872751

UGV Navigation Optimization Aided by Reinforcement Learning-Based Path Tracking

MINGGAO WEI, SONG WANG, JINFAN ZHENG, AND DAN CHEN, (Member, IEEE)
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China
Corresponding author: Dan Chen ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 61772380 and in part by the Foundation
for Innovative Research Groups of Hubei Province under Grant 2017CFA007.

ABSTRACT The success of robotic systems such as UGVs largely benefits from the fundamental capability of autonomously finding collision-free path(s) to carry out mobile tasks in routinely rough and complicated environments. Optimization of navigation under such circumstances has long been an open problem: 1) to meet the critical requirements of this task, typically the shortest distance and smoothness, and 2) more challengingly, to enable a general solution to track the optimal path in real-time outdoor applications. Aiming at this problem, this study develops a two-tier approach to navigation optimization in terms of path planning and tracking. First, a ‘‘rope’’ model has been designed to mimic the deformation of a path in the axial direction under external force and the fixedness of the radial plane to contain a UGV in a collision-free space. Second, a deterministic policy gradient (DPG) algorithm has been trained efficiently on abstracted structures of an arbitrarily derived ‘‘rope’’ to model the controller for tracking the optimal path. The learned policy can be generalized to a variety of scenarios. Experiments have been performed over complicated environments of different types. The results indicate that: 1) the rope model helps in minimizing distance and enhancing smoothness of the path while guaranteeing clearance; 2) the DPG can be modeled quickly (in a couple of minutes on an office desktop) and the model can apply to environments of increasing complexity under external disturbances without the need for tuning parameters; and 3) the DPG-based controller can autonomously adjust the UGV to follow the correct path free of risks by itself.

INDEX TERMS UGV navigation, reinforcement learning, deterministic policy gradient, path tracking.

I. INTRODUCTION

Over the past decades, robotic systems like unmanned ground vehicles (UGVs) have gained tremendous successes in various applications, from daily transportation and jungle reconnaissance to planet exploration. These successes benefit from a UGV's fundamental navigation capability of autonomously finding collision-free path(s) to carry out mobile tasks in routinely rough and complicated environments. Given the geometry of a UGV and the obstacles in a large-scale outdoor environment, a planner needs first to generate a feasible path between the start and end points, and then to follow the planned path under the physical constraints of the UGV and the path as well as the uncertain disturbances [1]. Optimization of navigation under such circumstances has long been a grand challenge.

First, one needs to identify an ‘‘optimal path’’ in the context of path planning. Optimization of path planning may involve various objectives, but the basic issues under consideration are focused on (1) distance and (2) collision avoidance. A*-based grid methods are simple and commonly used in practice, but they discretize motion and lack the capability of distance minimization. The visibility graph (VG) [2], [3] and the Voronoi diagram (VD) [4] are two classic methods for this purpose. VG builds a collection of lines connecting a feature of an object to that of another in the free space. In contrast, VD partitions space into cells, each consisting of the points closer to one particular object. In large and complicated environments, both methods encounter technical barriers in solving NP-hard problems.

In view of this, numerous suboptimal solutions have been proposed, such as the Probabilistic Roadmap method (PRM) [5] and the Potential Field method (PFM) [6], [7]. There is still plenty of room to improve these methods: e.g., PRM may produce a rough path due to the randomness of the insertion points, and the Potential Field can achieve obstacle avoidance in real time but the planned path may get trapped at local minimums.


Nature-inspired methods have been proposed for this purpose, including the genetic algorithm (GA) [8], the artificial bee colony (ABC) [9], and particle swarm optimization (PSO) [10]. Elastic band methods [11]–[13] utilize the internal force between adjacent free spaces around the robot to construct a deformable collision-free path, and they have been widely used for optimizing global path planners including PRM and PFM. These methods, however, offer no guarantee over the fixedness of the width of the path, and the consequent path tracking will be difficult.

Given a path properly planned in theory, tracking or following the path to carry out mobile tasks remains no less challenging. Methods for this purpose have been extensively explored for decades; salient examples include Fuzzy Logic, Neural Networks, and Model Predictive Control (MPC). The Fuzzy Logic method employs a fuzzy logic system (FLS) to estimate the uncertainties of a UGV system [14]. Fuzzy Logic has been further incorporated with a Neural Network to improve the effectiveness of uncertainty estimation, using a dynamic Petri recurrent fuzzy neural network for path tracking control [15]. MPC enables a control strategy considering the error dynamics derived from both the robot states and the path states [16].

These conventional controllers are focused on driving a robot along a track best fitting the previously planned path, i.e., with the objective of high precision. Note that in practice UGVs are expected to operate in outdoor applications, where the influence of uncertain disturbances in the course of path tracking increases the difficulty of the problem. The latest learning-based nonlinear MPC models the various disturbances in a possibly complicated environment as a Gaussian process [17]. These conventional methods generally treat various uncertainties as a (few) deterministic distribution(s) (aka distributional uncertainty), which is insufficient in practical scenarios that are usually large in terms of environment and complicated in terms of dynamics. For robotics applications this is even more difficult, as significant uncertainty may propagate unboundedly in time and space. There exists a pressing need to introduce computational intelligence to address the challenge.

Deep learning technology excels in the capability of directly learning from empirical data to achieve increasingly optimized performance in problem solving. Methods along this direction have recently been widely used to solve artificial intelligence problems, especially control systems in robotics [18]. Supervised learning methods heavily rely on knowledge from experts; in contrast, reinforcement learning (RL) is salient because it requires no human labeling, relying instead on trial-and-error interactions with the environment [19]–[21]. Furthermore, deterministic policy gradient (DPG) algorithms have recently gained more and more attention for their superiority in solving problems concerning continuous action and state spaces [22]. The robot (UGV) path tracking problem is exactly such a case. Actually, RL methods combining deep learning with DPG (DDPG) have achieved successes in robot control, such as object grabbing [23]–[25], path planning [26], [27], and locomotion skills learning [28], [29]. The success of these applications motivates us to extend this tool to path tracking. However, there still exists a technical gap: (1) to generalize the learning model for mobile robot control to adapt to various scenarios and (2) to adapt to external disturbances without the support of sufficient empirical data, which is mandatory for real-time applications in large-scale outdoor environments.

After all, navigation of a UGV routinely needs first to obtain an optimal path between the two points considering distance and collision avoidance, and then to follow the planned path successfully even when uncertain disturbances occur. This study is aimed at the challenges in this two-tiered problem: (1) to meet the critical requirements of this task, typically the shortest distance and smoothness, and (2) to enable a general solution to track the optimal path in real-time applications in large-scale outdoor environments:
1) This study first designs a ‘‘rope’’ to mimic the deformation of a path in the axial direction under external force and the fixedness of the radial plane to contain a UGV in a collision-free space, by considering the revolute and collision constraints (Subsection III-A). Given the start and end points in any 2D environment, the model can optimize a global path planner by automatically constructing a tube that defines the optimal path with both shortened distance and enhanced smoothness.
2) Second, this study constructs a Deep Deterministic Policy Gradient algorithm (DDPG, Subsection III-B). DDPG can be efficiently trained on abstracted structures (Subsection III-B.1) of an arbitrarily derived tube defining any ‘‘safe area’’ for UGV traversing. The trained DDPG model enables a general policy to control a UGV to track the correct path free of risks by itself. The model can apply to environments of increasing complexity under external disturbances without the need for tuning parameters.

Experiments have been performed over complicated environments of different types (Section IV). The results indicate that (1) the rope model helps in minimizing distance and enhancing smoothness of the path, and (2) the DDPG can be modelled quickly and the DDPG-based controller can autonomously adjust the UGV to follow the correct path. The main contributions of this study are as follows:
1) This study develops an intuitive method to globally optimize path planning for UGVs with the capability of shortening distance and improving smoothness;
2) This study enables a general solution to track the optimal path targeting real-time outdoor applications.

II. RELATED WORK
Recent trends in the optimization of path planning focus on (1) convex optimization and (2) nature-inspired methods.

Convex optimization aims to plan a continuous trajectory that directly meets the dynamics constraints of UGVs. A typical example is the CHOMP method proposed by Zucker et al. [30].


CHOMP optimizes a cost function that makes a trade-off between smoothness and obstacle avoidance to gain a high-quality trajectory of a predetermined duration via a gradient descent technique. Schulman et al. [31] utilize a sequential convex optimization procedure to find a continuous-time safe path by penalizing violated constraints.

Nature-inspired methods mainly consider how to shorten the distance while ensuring collision avoidance. Davoodi et al. [32] apply a genetic algorithm with the NSGA-II framework to intensify explorative power in complicated path planning problems in the context of a grid environment model. Mac et al. [33] utilized constrained multi-objective particle swarm optimization (PSO) to optimize Dijkstra's algorithm with the Visibility Graph model. Contreras-Cruz et al. [34] proposed using the artificial bee colony (ABC) algorithm for local search and the evolutionary programming algorithm for refining the feasible global path.

Besides traditional conventional controllers for path tracking, Reinforcement Learning (RL)-based methods have also been examined. Baltes and Lin [35] utilized a function approximator to fit the state space in order to approximate the desired path. The action space was discretized, which resulted in rough steering.

Zuo et al. [36] attempted to apply RL as a feedback control over PD. The Laplacian-based hierarchical approximate policy iteration (GHAPI) was applied to decompose the state space into smaller subspaces, as exploration of a subspace was much easier. But the action space was also discretized, and this made PD operate in a limited number of parameter settings.

Abbeel et al. [37] proposed an ‘‘inaccurate’’ model that followed human drivers' experiences to obtain a tracking strategy without the need for excessive training. The resulting controller used an abstracted UGV model for policy search in both continuous state space and action space. The model always required a number of training cases for each individual tracking task.

Liu and Tong [38] pointed out that reinforcement learning can achieve optimal control performance for a class of multiple-input multiple-output nonlinear discrete-time systems. Inspired by this study, Liu et al. [39] integrated adaptive reinforcement learning into a fault tolerant controller (FTC) for the tracking problem. Their efforts focused on reducing the number of training parameters, thus alleviating the computational load of online parameter tuning at each iteration. However, the learning time of the neural network can still be excessively long.

Inspired by the above work, this study mainly focuses on: (1) optimizing any initial path planner to shorten the distance while ensuring obstacle clearance and (2) a general solution to tracking the optimal path with a low training computational cost towards real-time outdoor applications.

III. METHOD FOR UGV NAVIGATION OPTIMIZATION
This section first introduces the ‘‘rope’’ model that optimizes a global path planner. It then presents the DDPG algorithm that enables a general policy to control the UGV to track the correct path free of risks.

FIGURE 1. Two kinds of constraints. (a) revolute constraint. (b) collision constraint.

A. ROPE MODEL FOR OPTIMIZATION OF PATH PLANNING
1) BACKGROUND FOR CONSTRAINT DYNAMICS
Consider a common constraint between two objects as in Figure 1. When the physical constraint is satisfied, the constraint equation can be defined as:

C_{constraint}(\vec{x}_a, R_a, \vec{x}_b, R_b) = 0 \quad (1)

where \vec{x}_a is the center of mass of object a and R_a is the rotation of a. Taking the derivative with respect to time gives the velocity constraint:

\frac{d(C_{constraint})}{dt} = \frac{d(C_{constraint})}{dx}\,\vec{v} = J\vec{v} = 0 \quad (2)

According to the principle that constraint forces do no work, the Jacobian indicates the direction of the constraint force, so the constraint force can be represented as (3):

f_{constraint} = J^{T}\lambda \quad (3)

where \lambda denotes the magnitude of the force. The final velocity of the system is

\vec{v}_{i+1} = \vec{v}_i + \Delta t\,\vec{a} = \vec{v}_i + \Delta t\,M^{-1}\big(f_{ext} + f_{constraint}\big)

If the external force f_{ext} is integrated in advance, the equation changes to

\vec{v}_{i+1} = \vec{v}_i + \Delta t\,M^{-1} f_{constraint} \quad (4)

Now substitute equation (3) into (4):

\vec{v}_{i+1} = \vec{v}_i + \Delta t\,M^{-1} J^{T}\lambda \quad (5)

Substituting equation (5) into \dot{C}_{constraint} = J\vec{v} = 0 then gives:

J\big(\vec{v}_i + \Delta t\,M^{-1} J^{T}\lambda\big) = 0, \qquad -J\vec{v}_i = J M^{-1} J^{T}\,\Delta t\,\lambda \quad (6)

Now \lambda in (6) can be solved, and the velocity of the system in the next step \vec{v}_{i+1} is calculated according to (5).
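For readers who prefer code, the velocity-level solve in (5) and (6) can be written compactly. The following is a minimal NumPy sketch under the assumption of a single constraint row and a block-diagonal inverse mass matrix; the function name and interface are illustrative, not part of the paper's implementation.

```python
import numpy as np

def solve_constraint_step(J, M_inv, v, dt):
    """One velocity-level constraint solve, following Eqs. (5)-(6).

    J     : (1, n) Jacobian row of the constraint, so that J @ v = 0 is required.
    M_inv : (n, n) inverse mass matrix (block-diagonal masses and inertias).
    v     : (n,)   stacked velocities of the constrained bodies.
    dt    : time step.
    Returns v_{i+1} = v_i + dt * M^{-1} J^T * lambda.
    """
    # Effective mass of the constraint: K = J M^{-1} J^T (a scalar for one row).
    K = float(J @ M_inv @ J.T)
    # Solve  -J v = K * dt * lambda  for the force magnitude (Eq. 6).
    dt_lambda = -float(J @ v) / K          # this is dt * lambda
    # Velocity update of Eq. (5).
    return v + (M_inv @ J.T).flatten() * dt_lambda
```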


2) UTILIZED CONSTRAINTS
Two kinds of constraints are utilized to simulate the rope model: the revolute constraint and the collision constraint. As shown in Figure 1(a), the revolute constraint arises from the fact that components rotate freely around a common point, which keeps a certain distance between components so as to construct an approximately fixed-size tube. In Figure 1(b), the collision constraint exists between obstacles and components to prevent mutual penetration, which ensures that the constructed tube is safe. Following the rule in [40], the Jacobians of the two constraints are deduced in turn. The first step is to write out an equation that describes the position constraint. Since adjacent components rotate around a common point under the revolute constraint, the corresponding position constraint is described in (7):

C_{revolute}(\vec{x}_a, R_a, \vec{x}_b, R_b) = \vec{p}_a - \vec{p}_b = \vec{x}_a + R_a\vec{r}_a - \vec{x}_b - R_b\vec{r}_b \quad (7)

where R_a and R_b are the rotation matrices and \vec{r}_a and \vec{r}_b are the vectors from the centers of the components to the common point. The equation indicates that \vec{p}_a and \vec{p}_b must be the same at any step, which allows the components to rotate freely about the common point.

The position constraint for the collision constraint is represented in (8); \vec{r}_a and \vec{r}_b locate the contact points on components a and b, and \vec{n} is the normal vector from b to a. The position constraint measures whether penetration will occur, in the sense that the constraint force will separate the components if the value is negative. On the other hand, if the value is greater than zero, the interaction does not exist.

C_{collision}(\vec{x}_a, R_a, \vec{x}_b, R_b) = (\vec{p}_a - \vec{p}_b)\cdot\vec{n} = (\vec{x}_a + R_a\vec{r}_a - \vec{x}_b - R_b\vec{r}_b)\cdot\vec{n} \quad (8)

The next step after defining the position constraints is to take the derivative with respect to time, which yields the velocity constraints (9) and (10). In (9), \omega_a \times \vec{r}_a = \begin{bmatrix} -\omega_a r_{ay} \\ \omega_a r_{ax} \end{bmatrix} = \omega_a \begin{bmatrix} -r_{ay} \\ r_{ax} \end{bmatrix} and R_{ra} = \begin{bmatrix} -r_{ay} \\ r_{ax} \end{bmatrix}. After isolating the velocity \vec{v}_i = [\vec{v}_a\ \omega_a\ \vec{v}_b\ \omega_b]^T, the Jacobians are obtained easily: J_{revolute} = [\,1\ \ R_{ra}\ \ -1\ \ -R_{rb}\,] and J_{collision} = [\,\vec{n}^T\ \ (\vec{r}_a\times\vec{n})^T\ \ -\vec{n}^T\ \ -(\vec{r}_b\times\vec{n})^T\,].

\dot{C}_{revolute} = \vec{v}_a + \omega_a\times\vec{r}_a - (\vec{v}_b + \omega_b\times\vec{r}_b) = \begin{bmatrix} 1 & R_{ra} & -1 & -R_{rb} \end{bmatrix} \begin{bmatrix} \vec{v}_a \\ \omega_a \\ \vec{v}_b \\ \omega_b \end{bmatrix} \quad (9)

\dot{C}_{collision} = (\vec{v}_a + \omega_a\times\vec{r}_a - \vec{v}_b - \omega_b\times\vec{r}_b)\cdot\vec{n} = \begin{bmatrix} \vec{n}^T & (\vec{r}_a\times\vec{n})^T & -\vec{n}^T & -(\vec{r}_b\times\vec{n})^T \end{bmatrix} \begin{bmatrix} \vec{v}_a \\ \omega_a \\ \vec{v}_b \\ \omega_b \end{bmatrix} \quad (10)

FIGURE 2. Rope model.

Substituting J_{revolute} and J_{collision} into (5) and (6) gives the update formulae directly; for the detailed derivation refer to Appendix A.

\vec{v}_{a,i+1} = \vec{v}_{a,i} + \vec{P}/m_a, \qquad \omega_{a,i+1} = \omega_{a,i} + I_a^{-1}\,\vec{r}_a\times\vec{P}, \qquad \vec{v}_{b,i+1} = \vec{v}_{b,i} - \vec{P}/m_b, \qquad \omega_{b,i+1} = \omega_{b,i} - I_b^{-1}\,\vec{r}_b\times\vec{P} \quad (11)

For the revolute constraint, the impulse is calculated as \vec{P}_{revolute} = -K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big) with K = \big(\tfrac{1}{m_a} + \tfrac{1}{m_b}\big)E_{2\times 2} + R_{ra} I_a^{-1} R_{ra}^T + R_{rb} I_b^{-1} R_{rb}^T. As to the collision constraint, \vec{P}_{collision} = P_{collision}\,\vec{n} with

P_{collision} = -\frac{\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big)\cdot\vec{n}}{K}, \qquad K = m_a^{-1} + m_b^{-1} + I_a^{-1}\big(\vec{n}\times\vec{r}_a\big)^2 + I_b^{-1}\big(\vec{n}\times\vec{r}_b\big)^2,

and P_{collision} \ge 0 is enforced so that the collision constraint force only separates the contact.
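To illustrate how the revolute impulse of (11) might be evaluated between two circular components, the sketch below assembles K and P from the quantities defined above. It is a simplified 2D sketch with variable names of our own choosing, not the paper's code; in 2D the cross products are scalars and R_ra denotes the rotated lever arm (-r_ay, r_ax).

```python
import numpy as np

def cross2(a, b):
    # Scalar z-component of the 2D cross product a x b.
    return a[0] * b[1] - a[1] * b[0]

def revolute_impulse(va, wa, vb, wb, ra, rb, ma, mb, Ia, Ib):
    """Impulse of the revolute constraint and the velocity update of Eq. (11)."""
    Rra = np.array([-ra[1], ra[0]])          # R_ra = (-r_ay, r_ax)
    Rrb = np.array([-rb[1], rb[0]])
    # K = (1/ma + 1/mb) E + R_ra Ia^{-1} R_ra^T + R_rb Ib^{-1} R_rb^T (2x2 matrix).
    K = (1.0 / ma + 1.0 / mb) * np.eye(2) \
        + np.outer(Rra, Rra) / Ia + np.outer(Rrb, Rrb) / Ib
    # Relative velocity at the common point: va + wa x ra - vb - wb x rb.
    rel = va + wa * Rra - (vb + wb * Rrb)    # w x r = w * (-r_y, r_x) in 2D
    P = -np.linalg.solve(K, rel)             # P_revolute = -K^{-1} * rel
    # Apply the impulse as in Eq. (11).
    va_new = va + P / ma
    wa_new = wa + cross2(ra, P) / Ia
    vb_new = vb - P / mb
    wb_new = wb - cross2(rb, P) / Ib
    return va_new, wa_new, vb_new, wb_new
```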


3) CONSTRUCTION OF THE ROPE MODEL
The above discussion provides concrete solutions for the two constraint forces. In this part, the method to construct a rope model based on these two constraints is discussed. As shown in Figure 2, solid circular components are initialized to cover the original path, and the common rotation point of adjacent circles is designated as the center of the prior circle, which makes the next component rotate only about the center of the prior component so as to maintain a certain distance. The first component, drawn in gray, is fixed so that the chain can be tightened under external forces. On the other hand, in order to ensure the correctness of the convergence direction, a shrink plate is constructed at the end of the original chain, and the direction of the shrink plate is consistent with the original direction of the end. Then, an external force is applied to the end in the same direction as the shrink plate so that the chain slowly contracts. The length of the chain is chosen as the judgment condition of convergence. Under the force, the distance between the last component and the end point grows farther and farther. When the distance between them exceeds a certain threshold, the component is deleted and the force is applied to the new end. By continually removing components, the length of the chain becomes constant when the chain is tightened. However, since the two constraints under consideration do not dissipate the energy of the entire chain, jitter exists when the chain converges to a fixed length, which leads to an undesirable shape for the tube. Therefore, it is necessary to add an air damping effect

\vec{v}_{t+1} = k\,\vec{v}_t, \quad k \in [0, 1] \quad (12)

where k is the decay ratio of the velocity per step.

The update process for each component per step is the same as in [41], as shown on the right of Figure 2. Integration of external forces is applied first, where only the end component receives the external force. Then the impulses in (11) are calculated iteratively to correct the velocity until the velocity has converged or the iterations have been exhausted. At last, each component's position is advanced with the corrected velocity. In addition, due to the inaccuracy of the calculations, Baumgarte stabilization is also used to prevent drift and penetration of the constraints. Thus the modified impulse equations are

\vec{P}_{m\_revolute} = \vec{P}_{revolute} - \frac{\beta K^{-1}(\vec{p}_a - \vec{p}_b)}{\Delta t}, \qquad P_{m\_collision} = \max\Big(P_{collision} + \frac{\beta\delta}{\Delta t\,K},\ 0\Big) \quad (13)

where \vec{p}_a - \vec{p}_b is the drift error, \delta is the penetration error, and \beta > 0 is the decaying bias. A more detailed analysis can be found in [41].
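The per-step update just described (external force on the end component, iterative impulse correction with (11) and (13), damping with (12), position integration, and removal of the end component) can be summarized in a structural Python sketch. It is only an illustration of the procedure as described in the text; the impulse solver is passed in by the caller and stands in for a routine the paper does not spell out.

```python
import numpy as np

def relax_rope(pos, vel, mass, solve_impulses, f_ext, end_point, dt=0.01,
               k_damp=0.98, remove_threshold=0.5, solver_iterations=20,
               max_steps=20000, tol=1e-4):
    """Tighten a chain of circular components (positions `pos`, shape (N, 2)).

    `solve_impulses(pos, vel)` is a caller-supplied routine that applies the
    revolute/collision impulses of Eq. (11) with the Baumgarte terms of
    Eq. (13) in place; it is a placeholder, not an API from the paper.
    """
    pos, vel = pos.copy(), vel.copy()
    prev_len = np.sum(np.linalg.norm(np.diff(pos, axis=0), axis=1))
    for _ in range(max_steps):
        vel[-1] += dt * f_ext / mass[-1]        # pull only the end component
        for _ in range(solver_iterations):      # impulse iterations, Eq. (11)/(13)
            solve_impulses(pos, vel)
        vel[0] = 0.0                            # the first component stays fixed
        vel *= k_damp                           # air damping, Eq. (12)
        pos += dt * vel                         # position integration
        # Delete the end component once it drifts too far from the end point.
        if np.linalg.norm(pos[-1] - end_point) > remove_threshold and len(pos) > 2:
            pos, vel, mass = pos[:-1], vel[:-1], mass[:-1]
        # Stop when the chain length no longer changes.
        cur_len = np.sum(np.linalg.norm(np.diff(pos, axis=0), axis=1))
        if abs(cur_len - prev_len) < tol:
            break
        prev_len = cur_len
    return pos
```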

B. DEEP REINFORCEMENT LEARNING FOR PATH TRACKING
Path tracking is feasible with the implementation of deep reinforcement learning. The model can be trained on a simulated environment that abstracts the paths in real-world scenarios with extremely simple structures. The trained DDPG algorithm then commits the tracking tasks.

1) SIMULATED SCENARIO FOR TRAINING
In order to effectively simulate the complexities of tracking for car-like robots in reality, only four configurations need to be extracted here: I, L, Z, and U. This is a reasonable practice for training a good policy, because a car-like robot comprising nonholonomic mechanical systems is constrained by its rate of rotation, much like highways constrain cars in the real world. Shown in Figure 3 are the typical cases dealt with. The training scenario is constructed to contain these four configurations.

FIGURE 3. The training scenario. Shown on the left are the four configurations: I, L, Z, U. On the right is the training scenario consisting of the four configurations. The dotted line is the desired following path.

In this study, a common four-wheel mobile robot with two steered front wheels and two fixed-heading rear wheels is considered, as shown in Figure 4. The following kinematic model is described as in [42]:

\dot{q} = \begin{bmatrix} \dot{\varphi} \\ \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ \sin\varphi & 0 \\ \cos\varphi & 0 \end{bmatrix} \begin{bmatrix} v \\ \omega \end{bmatrix} \quad (14)

where v is the robot's forward speed and \omega is the rate of rotation. The detailed parameter settings related to the robot movement will be introduced in Section IV.

FIGURE 4. The model of the car-like robot. W, H are defined as the width and height of the robot body, respectively. (x, y) is the location of the robot. φ is the robot's heading direction. ψ is the steering angle of the robot decided by the rate of rotation.

The robot is surrounded by five laser sensors at uniform angles from its center, which can form a simple bounding box to detect collisions effectively. The state is

S_t = (sensor_1, sensor_2, \ldots, sensor_n) \quad (15)

where sensor_n represents the distance to the obstacle detected by the nth sensor. The reward function is designed on the principle of collision avoidance:

r(s_t, a_t) = \begin{cases} -1 & \text{if } \min(sensor_1, sensor_2, \ldots, sensor_n) < W/2 \\ 0 & \text{else} \end{cases} \quad (16)

A negative reward of -1 is given when a collision is detected; otherwise the reward is zero.
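To make the simulated environment concrete, the sketch below integrates the kinematic model (14) with a simple Euler step and evaluates the reward of (16) from the five range readings of (15). It is a minimal illustration; the sensor readings are assumed to come from the simulator, and the default width W = 0.34 m is taken from the parameter settings in Section IV.

```python
import numpy as np

def step_robot(x, y, phi, v, omega, dt=0.1):
    """Euler integration of the car-like model of Eq. (14):
    phi_dot = omega, x_dot = v * sin(phi), y_dot = v * cos(phi)."""
    phi_new = phi + omega * dt
    x_new = x + v * np.sin(phi) * dt
    y_new = y + v * np.cos(phi) * dt
    return x_new, y_new, phi_new

def reward(sensors, W=0.34):
    """Collision-avoidance reward of Eq. (16): -1 if the closest of the
    laser readings falls below half the robot width, 0 otherwise."""
    return -1.0 if min(sensors) < W / 2.0 else 0.0
```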


2) PATH TRACKING BASED ON THE DDPG ALGORITHM
A standard reinforcement learning formulation uses a Markov decision process (MDP) to describe the environment. The MDP is a tuple \langle S, A, P, R, \gamma\rangle. At each time step t, the robot observes the state s_t \in S, chooses the action a_t \in A according to the policy \pi(a|s) mapping s to a, and receives a scalar reward r_{t+1} \in R and a new state s_{t+1} from the environment; \gamma \in [0,1] is the discount factor for future reward. The cumulative reward at t is represented as R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots, and the action-value function is Q^{\pi}(s_t, a_t) = E_{\pi}[R_t; s_t, a_t]. The goal of reinforcement learning is to learn the optimal policy \pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a), which obeys the Bellman optimality equation Q^{\pi}(s_t, a_t) = E_{s_{t+1}}[r + \gamma \max_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t, a_t]. When the state space is continuous, value function approximation is introduced, i.e., the action-value function is approximated with parameters \theta^{Q}, so the loss function is given by:

L_i(\theta^{Q}) = E_{s_t, a_t, r, s_{t+1}}\big[(y_i - Q^{\pi}(s_t, a_t; \theta^{Q}))^2\big] \quad (17)

where y_i = r + \gamma Q(s_{t+1}, a_{t+1}; \theta^{Q}) denotes the expected cumulative reward at (s_t, a_t), Q^{\pi}(s_t, a_t; \theta^{Q}) is the real cumulative reward, and i is the ith iteration.

When the action space is also continuous, the DPG-based [22] actor-critic algorithm maintains a parameterized actor function \mu(s|\theta^{\mu}), which maps a state to a specific action deterministically. The critic function is updated with the Bellman equation, similarly to Q-learning. The actor's iterative equation is obtained through derivation with respect to \theta^{\mu} according to the chain rule:

\nabla_{\theta^{\mu}} \approx E_{s_t, a_t, r, s_{t+1}}\big[\nabla_{\theta^{\mu}} Q(s, a|\theta^{Q})\big|_{s=s_t, a=\mu(s_t|\theta^{\mu})}\big] = E_{s_t, a_t, r, s_{t+1}}\big[\nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_t, a=\mu(s_t)}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s=s_t}\big] \quad (18)

The DDPG [42] uses neural networks as the function approximators of DPG in the continuous action domain. The DDPG algorithm adapted for path tracking is shown in Algorithm 1. Similar to [43], experience replay and a separate target network are also utilized to enhance the stability of the algorithm. Random noise N is added to the continuous action space in order to perform more efficient exploration, and it diminishes step by step to imitate the \epsilon-greedy strategy. MAX_EP_STEPS is the threshold indicating whether exploration is sufficient in each trial. In this paper, a neural network is used to learn the policy. Figure 5 shows the structure of the network. The actor and critic networks both contain two fully connected layers, of 100 nodes and 20 nodes respectively. The input vector for the actor is the distances to obstacles detected by the five sensors. Moreover, tanh is used as the activation function of the output layer to constrain the rate of rotation. As to the critic, the action is merged with the state as the action-state input, and the Q(s, a) value is calculated with a linear activation function.

Algorithm 1: DDPG for Path Tracking (Adapted From [42])
1  Initialize critic network Q(s, a|\theta^{Q}), actor \mu(s|\theta^{\mu}), target critic Q' and target actor \mu' with weights \theta^{Q'} \leftarrow \theta^{Q}, \theta^{\mu'} \leftarrow \theta^{\mu}, and replay buffer M.
2  for episode = 1, MAX_EPISODES do
3      Reset the environment and receive the initial observation state s_t.
4      for t = 0, MAX_EP_STEPS do
5          Choose the action a_t according to \mu(s_t; \theta^{\mu}) + N.
6          Execute a_t; receive the feedback reward r_t, a new state s_{t+1}, and the end condition terminal.
7          Store the tuple \langle s_t, a_t, r_t, s_{t+1}\rangle in M.
8          if size(M) == capacity(M) then
9              Sample a random batch of transitions (s_i, a_i, r_i, s_{i+1}) from M.
10             Set Z_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'}).
11             Update the critic by minimizing the loss: L = \frac{1}{batch}\sum_i (Z_i - Q(s_i, a_i|\theta^{Q}))^2.
12             Update the actor policy using the sampled gradient: \nabla_{\theta^{\mu}}\mu|_{s_i} \approx \frac{1}{batch}\sum_i \nabla_{a} Q(s, a|\theta^{Q})|_{s=s_i, a=\mu(s_i)}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})|_{s_i}.
13         end
14         if episode % replace_iteration == 0 then
15             Update the target networks: \theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}, \quad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}.
16         end
17         if terminal == true then
18             break
19         end
20     end
21 end

FIGURE 5. The structure of the networks. From left to right in each small square: type of layer, number of nodes, activation function.

IV. EXPERIMENT AND RESULTS
Two sets of experiments have been performed to evaluate the proposed approach: (1) global path optimization with the rope model and (2) path tracking based on DDPG. Models were trained and examined with TensorFlow on a single NVIDIA GeForce GTX 1060 and an Intel Core i7-4790 with 24 GB RAM.

A. GLOBAL PATH OPTIMIZATION
Ten complicated maps with different layouts and obstacles were built to validate the performance of the rope model by referring to [44] (illustrated in Figure 6). The classic A* algorithm was utilized to plan the initial global path with a one-unit searching radius. After that, the proposed rope model planned the final optimized path on this basis.

FIGURE 6. Result of the proposed optimized method. Black cells are occupied; grey cells are feasible for the robot.

The initial and final paths are marked as blue and brownish red lines, respectively. Apparently, redundancy exists in the initial path in terms of distance due to the discrete movement setting. The optimized path significantly improved the initial path in both distance and smoothness. The improvement was measured as D_{improvedRatio} = 1 - \frac{D_{optimized}}{D_{initial}} and S_{improved} = S_{optimized} - S_{initial}, where D and S were calculated with (19) according to [44]:

Distance = \sum_{p=1}^{N-1} \lVert x_{p+1} - x_p \rVert, \qquad Smoothness = \frac{1}{N-2}\sum_{p=2}^{N-1}\cos^{-1}\frac{(x_p - x_{p-1})\cdot(x_{p+1} - x_p)}{\lVert x_p - x_{p-1}\rVert\,\lVert x_{p+1} - x_p\rVert} \quad (19)

where x_p is the location of the pth point and N is the number of components.
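The two metrics of (19) and the improvement measures defined above translate directly into code. The following NumPy sketch (for a path stored as an array of shape (N, 2)) is provided for clarity only and is not taken from the authors' implementation.

```python
import numpy as np

def distance(path):
    """Total length: sum of ||x_{p+1} - x_p||, Eq. (19)."""
    return float(np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1)))

def smoothness(path):
    """Mean turning angle between consecutive segments, Eq. (19)."""
    seg = np.diff(path, axis=0)                       # x_{p+1} - x_p
    angles = []
    for a, b in zip(seg[:-1], seg[1:]):
        cos_t = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    return float(np.mean(angles))

def improvement(initial, optimized):
    """Improvement measures reported in Table 1."""
    d_ratio = 1.0 - distance(optimized) / distance(initial)
    s_diff = smoothness(optimized) - smoothness(initial)
    return d_ratio, s_diff
```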


TABLE 1. The improvement between the initial path and the final path.

As shown in Table 1, distances were significantly reduced in almost all maps. Only the ratio for Map8 was not very pronounced, because its key point set changed little. The ratios for Map1, Map6, and Map9 were close to the optimum 1 - \frac{\sqrt{2}}{2} \approx 0.293, calculated for the scene where the start and goal points are located on the diagonal of a square blank map.

It should also be noted that the smoothness for all maps significantly increased. The largest improvements were witnessed in Map2 and Map10, where the number of deforming points decreased the most. Observations from Figure 6 imply that the proposed method can minimize the redundant distance while maximizing smoothness by simulating the deformation characteristics of a chain-like object along the direction of the force. It was applicable in various complicated maps. The method avoided local optima such as being trapped in a U-shaped obstacle (the typical problem with BFS and potential field methods).

In addition, when overlapping the adjacent circles properly, a collision-free tube could be constructed as shown in Figure 6, where the size of the grid is enlarged compared to Figure 6. When the distance between the adjacent circles is small enough, a tube of fixed size is theoretically obtained. The experiments assumed that the robot occupied one grid cell; settings of more grid cells could easily be planned by setting a larger radius in A* for the global planner [32].

B. EVALUATION OF PATH TRACKING
Path tracking models should first be trained with DDPG before the tracking strategies can be examined.

1) PARAMETER SETTING FOR TRAINING
This study used a four-wheel mobile robot similar to those described in [36] and [45]. The parameters of the model were set as v = 0.34 m/s, \omega_{max} = 60°/s, and H = 2W = 0.68 m, and the time step was 0.1 s. The width of the proposed scenario was large enough to allow the robot to go through easily. The parameters for training were as follows: MAX_EP_EPISODES = 200, MAX_EP_STEPS = 3000, \gamma = 0.9, learning rate for actor and critic = 0.0001, batch_size = 32, replay = 3000; N is a normal distribution whose mean was equal to a_t and whose variance diminished step by step. The training procedure of the model is illustrated in Figure 7.

FIGURE 7. The training process. The yellow object represents the UGV; the red object denotes the sensor; the horizontal axis represents the number of episodes.

The learned policy was gradually improved with more exploration. It took 7 minutes to derive a satisfactory policy at the 95th episode: closely staying in the middle of the tube to fit the desired path in Figure 3.

2) PERFORMANCE EVALUATION
This subsection first presents the objectives of training, then the performance of the tracking policy is analyzed. Finally, the generalization ability of the learned tracking policy is examined with a very complicated environment.

a: LEARNING CAPABILITY
Two scenarios were designed to demonstrate what the model could learn. The first scenario, with obstacles of different sizes, is shown in Figure 8, in which eleven points were randomly chosen as start points. The trajectories (driving lines) indicated that each robot can take the right action by evaluating the learned policy to avoid the obstacles successfully, even in dead ends like the upper right and lower right corners. The result suggests that the learned policy based on the proposed scenario has great reaction ability to avoid obstacles in a complex environment according to its own state.

FIGURE 8. Examination of the capability of collision avoidance. The object with the box shape denotes the robot; each arrow denotes an initial direction of the robot.

The second scenario (Figure 9) consisted of a grid world (the same cell width as in Figure 6 applied). The tube was constructed between a fixed start point (red) and any of 191 random end points, which represented a variety of radians for the robot to explore, similar to highways in the real world. Experimental results suggested the tracking policy could lead the UGV to reach all destinations without collision and to stay in the middle of the path as much as possible. The details of the errors between the desired path (defined by the tube constructed by the ‘‘rope’’ model) and the actual tracks are analyzed in the later part of this section.

FIGURE 9. Examination of the capability of following. Each number denotes a randomly chosen destination; the red dot denotes the start point; the tube constructed by the rope model is marked blue.

Clearly, the learned policy made use of the constructed tube to enable obstacle avoidance when steering the UGV to track the desired path.


FIGURE 10. Path tracking and performance analysis. (a) following. (b) analysis.

b: PERFORMANCE ANALYSIS
In order to analyze the performance of the learned model, the most complicated map (Map10) was selected for examination. The size was set to about 120×150 m² and the velocity of the robot was set to 1.45 m/s. The results (Sub-figure 10(a)) indicated that the model could correctly control the car to pass through the entire scenario. Sub-figure 10(b) is a zoomed-in view of Figure 10 (from the beginning to the completion of the first turn), which represents the hardest part of the whole scenario. Although the initial direction and position of the UGV were not in line with the expectations, the model was able to quickly adapt to the harsh environment with no collisions with the constructed tube, steering the UGV to fit the desired path as closely as possible, i.e., the center of the tube.

This study only used a very simple reward function for model training. Nevertheless, the resulting tracking policy exhibited excellent performance in path tracking. It significantly outperformed the conventional controllers by minimizing the errors between the desired path and the actual track.

c: CAPABILITY OF ADAPTING TO VARIATIONS OF VELOCITY AND EXTERNAL DISTURBANCES
Three sets of experiments were performed to examine how well the model might adapt to abrupt changes of velocity and external disturbances: (1) setting the UGV to operate at a speed four times the original one; (2) gradually accelerating the UGV to the previous speed; (3) introducing external disturbances while the UGV operated at the original speed, in which noise conforming to a normal distribution (mean: a_t, variance: 4.0) was added to the action of the UGV.

FIGURE 11. Examination of the capability of adapting to variations of velocity and external disturbances. (a) quadruple velocity. (b) accelerate. (c) disturbance.

The first results (Sub-figure 11(a)) demonstrated that, in spite of the sharp increase of speed, the UGV still traversed the scenario successfully. The second results (Sub-figure 11(b)) showed that the learned policy worked well while the UGV was accelerated, with no collision. The third results (Sub-figure 11(c)) exhibited that although the disturbances caused significant jitters abruptly, the UGV quickly responded with rectification to avoid collisions. The tracking policy also worked well with the much more complicated scenario presented in Figure 9. It should be noted that this was achieved without any parameter fine-tuning.

C. DISCUSSIONS
Reinforcement learning has always been an important method both in the control field and in the machine learning field. In the control field, stability analysis of systems is always necessary, for example with the Lyapunov method. In contrast, the convergence of an algorithm in the context of machine learning is to ensure that the desired knowledge can be learned from the data.

The high-dimensional continuous action space has always been a key issue in traditional RL. In recent years, DDPG has emerged with the problem properly solved, and its convergence has been demonstrated in [42]. This study extends the DDPG method to solve the tracking problem. The experimental results have demonstrated that a proper tracking policy can be learned.

UGV navigation models with generalization ability have also received attention, such as learning-based and numerical optimization methods [46], [47]:
• A Bayesian learning model [46] aims to gain navigation ability in an unfamiliar environment. The model encodes safety constraints as a prior over collision probabilities and can be trained on expert data collected in a separate hallway environment. Expert data and the prior are mandatory, otherwise it is impossible for the model to rule out risky behaviors.
• Collision-free path planning can also be formulated as an open-loop pursuit-evasion game [47], which can be solved by the modified fast marching method. However, it remains unclear whether these methods can support the non-holonomic dynamics associated with car-like robots, i.e., their practicality needs further examination.

Compared with these works, this study shows that deep reinforcement learning can easily apply to the navigation problem without the need for extra expert data. The method proposed at the current stage applies to tracking; for future work, deep reinforcement learning will be extended to end-to-end navigation.

V. CONCLUSIONS
This study developed a two-tier approach to navigation optimization in terms of path planning and tracking towards large-scale outdoor applications.

A rope model was first designed to optimize any given path planner. It mimics the deformation of a path in the axial direction under external force and the fixedness of the radial plane to contain a UGV in a collision-free space. Given the start and end points in any 2D environment, the model automatically constructed a tube that defined the optimal path with shortened distance and enhanced smoothness.

A Deep Deterministic Policy Gradient algorithm was used to enable tracking of the resulting optimal path. A simulated scenario was designed that abstracted the paths in real-world scenarios with extremely simple structures. The trained DDPG model enables a general policy to control a UGV to track the correct path free of risks by itself in environments of increasing complexity without the need for tuning parameters.

Experiments have been performed over 10 types of complicated environments examined in renowned literature. The global paths were first planned with A*, the rope model then operated on the paths, and it could significantly shorten the distance and enhance the smoothness of the paths in all cases. The DDPG could be trained quickly, in only a couple of minutes on an office desktop, over the simulated scenario. The DPG-based controller could then autonomously adjust the UGV model to follow the correct path.

Overall, the proposed method was useful in optimizing the global path with respect to distance and clearance in complicated environments, independent of the initial path planner. The deep reinforcement learning technique was able to support a general solution to path tracking free of risks with a low computational cost, which holds great potential for real-time outdoor applications.

APPENDIX A
For the revolute constraint, J_{revolute} = [\,1\ \ R_{ra}\ \ -1\ \ -R_{rb}\,]. According to (6),

K = J M^{-1} J^{T} = \begin{bmatrix} 1 & R_{ra} & -1 & -R_{rb} \end{bmatrix} \begin{bmatrix} M_a^{-1} & & & \\ & I_a^{-1} & & \\ & & M_b^{-1} & \\ & & & I_b^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ R_{ra}^{T} \\ -1 \\ -R_{rb}^{T} \end{bmatrix} = \Big(\frac{1}{m_a} + \frac{1}{m_b}\Big) E_{2\times 2} + R_{ra} I_a^{-1} R_{ra}^{T} + R_{rb} I_b^{-1} R_{rb}^{T} \quad (20)

-J\vec{v}_i = -\begin{bmatrix} 1 & R_{ra} & -1 & -R_{rb} \end{bmatrix} \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} = -\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big) \quad (21)

So \Delta t\,\lambda = -K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big). Substituting this into (5) gives:

\begin{bmatrix} \vec{v}_{a,i+1} \\ \omega_{a,i+1} \\ \vec{v}_{b,i+1} \\ \omega_{b,i+1} \end{bmatrix} = \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} - M^{-1} J^{T} K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big) = \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} - M^{-1}\begin{bmatrix} 1 \\ R_{ra}^{T} \\ -1 \\ -R_{rb}^{T} \end{bmatrix} K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big) = \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} + \begin{bmatrix} \vec{P}/m_a \\ I_a^{-1} R_{ra}^{T}\vec{P} \\ -\vec{P}/m_b \\ -I_b^{-1} R_{rb}^{T}\vec{P} \end{bmatrix} \quad (22)

where \vec{P}_{revolute} = -K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big).

The same process applies to the collision constraint. With J_{collision} = [\,\vec{n}^{T}\ \ (\vec{r}_a\times\vec{n})^{T}\ \ -\vec{n}^{T}\ \ -(\vec{r}_b\times\vec{n})^{T}\,], according to (6),

K = J M^{-1} J^{T} = \begin{bmatrix} \vec{n}^{T} & (\vec{r}_a\times\vec{n})^{T} & -\vec{n}^{T} & -(\vec{r}_b\times\vec{n})^{T} \end{bmatrix} \begin{bmatrix} M_a^{-1} & & & \\ & I_a^{-1} & & \\ & & M_b^{-1} & \\ & & & I_b^{-1} \end{bmatrix} \begin{bmatrix} \vec{n} \\ \vec{r}_a\times\vec{n} \\ -\vec{n} \\ -(\vec{r}_b\times\vec{n}) \end{bmatrix} = \frac{1}{m_a} + \frac{1}{m_b} + I_a^{-1}\big(\vec{n}\times\vec{r}_a\big)^2 + I_b^{-1}\big(\vec{n}\times\vec{r}_b\big)^2 \quad (23)

-J\vec{v}_i = -\begin{bmatrix} \vec{n}^{T} & (\vec{r}_a\times\vec{n})^{T} & -\vec{n}^{T} & -(\vec{r}_b\times\vec{n})^{T} \end{bmatrix} \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} = -\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big)\cdot\vec{n} \quad (24)

So \Delta t\,\lambda = -K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big)\cdot\vec{n}. Substituting this into (5) gives:

\begin{bmatrix} \vec{v}_{a,i+1} \\ \omega_{a,i+1} \\ \vec{v}_{b,i+1} \\ \omega_{b,i+1} \end{bmatrix} = \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} - M^{-1} J^{T} K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big)\cdot\vec{n} = \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} - M^{-1}\begin{bmatrix} \vec{n} \\ \vec{r}_a\times\vec{n} \\ -\vec{n} \\ -(\vec{r}_b\times\vec{n}) \end{bmatrix} K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big)\cdot\vec{n} = \begin{bmatrix} \vec{v}_{a,i} \\ \omega_{a,i} \\ \vec{v}_{b,i} \\ \omega_{b,i} \end{bmatrix} + \begin{bmatrix} \vec{P}/m_a \\ I_a^{-1}\,\vec{r}_a\times\vec{P} \\ -\vec{P}/m_b \\ -I_b^{-1}\,\vec{r}_b\times\vec{P} \end{bmatrix} \quad (25)

where \vec{P}_{collision} = P_{collision}\,\vec{n} and P_{collision} = -K^{-1}\big(\vec{v}_{a,i} + \omega_{a,i}\times\vec{r}_a - \vec{v}_{b,i} - \omega_{b,i}\times\vec{r}_b\big)\cdot\vec{n}.

ACKNOWLEDGMENT
This work was supported in part by the National Natural Science Foundation of China (No. 61772380) and the Foundation for Innovative Research Groups of Hubei Province (No. 2017CFA007).

REFERENCES
[1] B. Siciliano and O. Khatib, Springer Handbook of Robotics, 2nd ed. Berlin, Germany: Springer-Verlag, 2016.
[2] H.-P. Huang and S.-Y. Chung, ''Dynamic visibility graph for path planning,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), vol. 3, Sep./Oct. 2004, pp. 2813–2818.
[3] T. Lozano-Pérez and M. A. Wesley, ''An algorithm for planning collision-free paths among polyhedral obstacles,'' Commun. ACM, vol. 22, no. 10, pp. 560–570, 1979.
[4] P. Bhattacharya and M. L. Gavrilova, ''Voronoi diagram in optimal path planning,'' in Proc. 4th Int. Symp. Voronoi Diagrams Sci. Eng. (ISVD), Jul. 2007, pp. 38–47.
[5] M. Spong, S. Hutchinson, and M. Vidyasagar, Robot Modeling and Control. Hoboken, NJ, USA: Wiley, 2006, pp. 163–182.
[6] Y. Koren and J. Borenstein, ''Potential field methods and their inherent limitations for mobile robot navigation,'' in Proc. IEEE Int. Conf. Robot. Automat., Apr. 1991, pp. 1398–1404.
[7] C. W. Warren, ''Global path planning using artificial potential fields,'' in Proc. Int. Conf. Robot. Automat., May 1989, pp. 316–321.
[8] Y. Hu and S. X. Yang, ''A knowledge based genetic algorithm for path planning of a mobile robot,'' in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), vol. 5, Apr./May 2004, pp. 4350–4355.
[9] D. Karaboga, B. Gorkemli, C. Ozturk, and N. Karaboga, ''A comprehensive survey: Artificial bee colony (ABC) algorithm and applications,'' Artif. Intell. Rev., vol. 42, no. 1, pp. 21–57, 2014.
[10] R. C. Eberhart and Y. Shi, ''Particle swarm optimization: Developments, applications and resources,'' in Proc. Congr. Evol. Comput., vol. 1, May 2001, pp. 81–86.
[11] S. Quinlan and O. Khatib, ''Elastic bands: Connecting path planning and control,'' in Proc. IEEE Int. Conf. Robot. Automat., May 1993, pp. 802–807.
[12] O. Brock and O. Khatib, ''Elastic strips: A framework for motion generation in human environments,'' Int. J. Robot. Res., vol. 21, no. 12, pp. 1031–1052, 2002.
[13] Z. Zhu, E. Schmerling, and M. Pavone, ''A convex optimization approach to smooth trajectories for motion planning with car-like robots,'' in Proc. 54th IEEE Conf. Decis. Control (CDC), Dec. 2015, pp. 835–842.
[14] T. Das and I. N. Kar, ''Design and implementation of an adaptive fuzzy logic-based controller for wheeled mobile robots,'' IEEE Trans. Control Syst. Technol., vol. 14, no. 3, pp. 501–510, May 2006.
[15] R.-J. Wai and C.-M. Liu, ''Design of dynamic petri recurrent fuzzy neural network and its application to path-tracking control of nonholonomic mobile robot,'' IEEE Trans. Ind. Electron., vol. 56, no. 7, pp. 2667–2683, Jul. 2009.
[16] K. Kanjanawanishkul and A. Zell, ''Path following for an omnidirectional mobile robot based on model predictive control,'' in Proc. IEEE Int. Conf. Robot. Automat., May 2009, pp. 3341–3346.
[17] C. J. Ostafew, A. P. Schoellig, T. D. Barfoot, and J. Collier, ''Learning-based nonlinear model predictive control to improve vision-based mobile robot path tracking,'' J. Field Robot., vol. 33, no. 1, pp. 133–152, 2015.
[18] Y. LeCun, Y. Bengio, and G. Hinton, ''Deep learning,'' Nature, vol. 521, no. 7553, p. 436, 2015.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, no. 1. Cambridge, MA, USA: MIT Press, 1998.
[20] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, ''DeepDriving: Learning affordance for direct perception in autonomous driving,'' in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 2722–2730.
[21] T. Kollar and N. Roy, ''Trajectory optimization using reinforcement learning for map exploration,'' Int. J. Robot. Res., vol. 27, no. 2, pp. 175–196, 2008.
[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, ''Deterministic policy gradient algorithms,'' in Proc. 31st Int. Conf. Mach. Learn. (ICML), 2014, pp. 387–395.
[23] S. Paul and L. Vig, ''Deterministic policy gradient based robotic path planning with continuous action spaces,'' in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 725–733.
[24] S. Gu, E. Holly, T. Lillicrap, and S. Levine, ''Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,'' in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May/Jun. 2017, pp. 3389–3396.
[25] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, ''Overcoming exploration in reinforcement learning with demonstrations.'' [Online]. Available: https://arxiv.org/abs/1709.10089
[26] L. Tai, G. Paolo, and M. Liu. (2017). ''Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation.'' [Online]. Available: https://arxiv.org/abs/1703.00420
[27] P. Mirowski et al. (2016). ''Learning to navigate in complex environments.'' [Online]. Available: https://arxiv.org/abs/1611.03673
[28] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, ''Benchmarking deep reinforcement learning for continuous control,'' in Proc. Int. Conf. Mach. Learn., 2016, pp. 1329–1338.
[29] X. B. Peng and M. van de Panne, ''Learning locomotion skills using DeepRL: Does the choice of action space matter?'' in Proc. ACM SIGGRAPH/Eurograph. Symp. Comput. Animation, 2017, p. 12.
[30] M. Zucker et al., ''CHOMP: Covariant Hamiltonian optimization for motion planning,'' Int. J. Robot. Res., vol. 32, nos. 9–10, pp. 1164–1193, 2013.
[31] J. Schulman et al., ''Motion planning with sequential convex optimization and convex collision checking,'' Int. J. Robot. Res., vol. 33, no. 9, pp. 1251–1270, 2014.
[32] M. Davoodi, F. Panahi, A. Mohades, and S. N. Hashemi, ''Multi-objective path planning in discrete space,'' Appl. Soft Comput., vol. 13, no. 1, pp. 709–720, 2013.
[33] T. T. Mac, C. Copot, D. T. Tran, and R. De Keyser, ''A hierarchical global path planning approach for mobile robots based on multi-objective particle swarm optimization,'' Appl. Soft Comput., vol. 59, pp. 68–76, Oct. 2017.
[34] M. A. Contreras-Cruz, V. Ayala-Ramirez, and U. H. Hernandez-Belmonte, ''Mobile robot path planning using artificial bee colony and evolutionary programming,'' Appl. Soft Comput., vol. 30, pp. 319–328, May 2015.


[35] J. Baltes and Y. Lin, ''Path tracking control of non-holonomic car-like robot with reinforcement learning,'' in Robot Soccer World Cup. Berlin, Germany: Springer, 1999, pp. 162–173.
[36] L. Zuo, X. Xu, C. Liu, and Z. Huang, ''A hierarchical reinforcement learning approach for optimal path tracking of wheeled mobile robots,'' Neural Comput. Appl., vol. 23, nos. 7–8, pp. 1873–1883, 2013.
[37] P. Abbeel, M. Quigley, and A. Y. Ng, ''Using inaccurate models in reinforcement learning,'' in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 1–8.
[38] Y.-J. Liu and S. Tong, ''Optimal control-based adaptive NN design for a class of nonlinear discrete-time block-triangular systems,'' IEEE Trans. Cybern., vol. 46, no. 11, pp. 2670–2680, Nov. 2016.
[39] L. Liu, Z. Wang, and H. Zhang, ''Adaptive fault-tolerant tracking control for MIMO discrete-time systems via reinforcement learning algorithm with less learning parameters,'' IEEE Trans. Autom. Sci. Eng., vol. 14, no. 1, pp. 299–313, Jan. 2017.
[40] E. Catto, ''Iterative dynamics with temporal coherence,'' in Proc. Game Developers Conf., vol. 2, no. 4, 2005, p. 5.
[41] E. Catto, ''Modeling and solving constraints,'' in Proc. Game Developers Conf., 2009, p. 16.
[42] T. P. Lillicrap et al. (2015). ''Continuous control with deep reinforcement learning.'' [Online]. Available: https://arxiv.org/abs/1509.02971
[43] V. Mnih et al., ''Human-level control through deep reinforcement learning,'' Nature, vol. 518, pp. 529–533, 2015.
[44] J. Han and Y. Seo, ''Mobile robot path planning with surrounding point set and path improvement,'' Appl. Soft Comput., vol. 57, pp. 35–47, Aug. 2017.
[45] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, ''Learning-based nonlinear model predictive control to improve vision-based mobile robot path-tracking in challenging outdoor environments,'' in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May/Jun. 2014, pp. 4029–4036.
[46] C. Richter, W. Vega-Brown, and N. Roy, ''Bayesian learning for safe high-speed navigation in unknown environments,'' in Robotics Research. Cham, Switzerland: Springer, 2018, pp. 325–341.
[47] R. Takei, H. Huang, J. Ding, and C. J. Tomlin, ''Time-optimal multi-stage motion planning with guaranteed collision avoidance via an open-loop game formulation,'' in Proc. IEEE Int. Conf. Robot. Automat., May 2012, pp. 323–329.

MINGGAO WEI received the bachelor's degree from the Dalian University of Technology. He is currently pursuing the M.D. degree with the School of Computer Science, Wuhan University. His main research interests include machine learning and multi-agent systems.

SONG WANG received the bachelor's degree from the China University of Geosciences. He is currently pursuing the M.D. degree with the School of Computer Science, Wuhan University. His main research interests include machine learning and multi-agent systems.

JINFAN ZHENG is currently pursuing the bachelor's degree with the School of Computer Science, Wuhan University. His main research interests include transfer learning and multitask learning.

DAN CHEN was an HEFCE Research Fellow with the University of Birmingham, U.K. He is currently a Professor with the School of Computer Science, Wuhan University, Wuhan, China. His research interests include data science and engineering, high-performance computing, and modeling and simulation of complex systems.
