UGV Navigation Optimization Aided by Reinforcement Learning-Based Path Tracking
ABSTRACT The success of robotic systems such as UGVs largely depends on the fundamental capability of autonomously finding collision-free paths to carry out mobile tasks in routinely rough and complicated environments. Optimization of navigation under such circumstances has long been an open problem: 1) to meet the critical requirements of this task, typically the shortest distance and smoothness, and 2) more challengingly, to enable a general solution to track the optimal path in real-time outdoor applications. Aiming at this problem, this study develops a two-tier approach to navigation optimization in terms of path planning and tracking. First, a ''rope'' model has been designed to mimic the deformation of a path in the axial direction under external force and the fixedness of the radial plane to contain a UGV in a collision-free space. Second, a deterministic policy gradient (DPG) algorithm has been trained efficiently on abstracted structures of an arbitrarily derived ''rope'' to model the controller for tracking the optimal path. The learned policy can be generalized to a variety of scenarios. Experiments have been performed over complicated environments of different types. The results indicate that: 1) the rope model helps in minimizing distance and enhancing smoothness of the path while guaranteeing clearance; 2) the DPG can be modeled quickly (in a couple of minutes on an office desktop) and the model can apply to environments of increasing complexity under external disturbances without the need for tuning parameters; and 3) the DPG-based controller can autonomously adjust the UGV to follow the correct path free of risks by itself.
INDEX TERMS UGV navigation, reinforcement learning, deterministic policy gradient, path tracking.
Nature-inspired methods have been proposed for this purpose, including the genetic algorithm (GA) [8], artificial bee colony (ABC) [9], and particle swarm optimization (PSO) [10]. Elastic band methods [11]–[13] utilize the internal force between adjacent free spaces around the robot to construct a deformable collision-free path, and have been widely used for optimizing global path planners including PRM and PFM. These methods offer no guarantee over the fixedness of the width of the path, and the consequent path tracking will be difficult.

Given a path properly planned in theory, tracking or following the path to carry out mobile tasks remains no less challenging. Methods for this purpose have been extensively explored for decades; salient examples include Fuzzy Logic, Neural Networks, and Model Predictive Control (MPC). The Fuzzy Logic method employs a fuzzy logic system (FLS) to estimate the uncertainties of a UGV system [14]. Fuzzy Logic has been further incorporated with a Neural Network to improve the effectiveness of uncertainty estimation with a dynamic Petri recurrent fuzzy neural network in path tracking control [15]. MPC enables a control strategy considering the error dynamics derived from both the robot states and the path states [16].

These conventional controllers are focused on driving a robot along a track best fitting the previously planned path, i.e., with the objective of high precision. Note that in practice UGVs are expected to operate in outdoor applications, where the influence of uncertain disturbances in the course of path tracking increases the difficulty of problem solving. The latest learning-based nonlinear MPC models the various disturbances in a possibly complicated environment as a Gaussian process [17]. These conventional methods generally treat various uncertainties as a (few) deterministic distribution(s) (a.k.a. distributional uncertainty), which is insufficient in practical scenarios that are usually large in terms of environment and complicated in terms of dynamics. For robotics applications, this is even more difficult as significant uncertainty may propagate unboundedly in time and space. There exists a pressing need to introduce computational intelligence to address the challenge.

Deep learning technology excels in the capability of directly learning from empirical data to achieve increasingly optimized performance in problem solving. Methods along this direction have recently been widely used to solve artificial intelligence problems, especially control systems in robotics [18]. Supervised learning methods rely heavily on knowledge from experts; in contrast, reinforcement learning (RL) is salient because it requires no human labeling, building instead on trial-and-error interactions with the environment [19]–[21]. Furthermore, deterministic policy gradient (DPG) algorithms have recently gained more and more attention for their superiority in solving problems concerning continuous action and state spaces [22]. The robot (UGV) path tracking problem is exactly such a case. Indeed, RL methods combining deep learning with DPG (DDPG) have achieved success in robot control, such as object grabbing [23]–[25], path planning [26], [27], and locomotion skills learning [28], [29]. The success of these applications motivates us to extend this tool to path tracking. However, there still exists a technical gap: (1) to generalize the learning model for mobile robot control to adapt to various scenarios and (2) to adapt to external disturbances without the support of sufficient empirical data, which is mandatory for real-time applications in large-scale outdoor environments.

After all, navigation of a UGV routinely needs first to obtain an optimal path between two points considering distance and collision avoidance, and then to follow the planned path successfully even when uncertain disturbances occur. This study is aimed at the challenges in this two-tiered problem: (1) to meet the critical requirements of this task, typically the shortest distance and smoothness, and (2) to enable a general solution to track the optimal path in real-time applications in large-scale outdoor environments:
1) This study first designs a ''rope'' to mimic the deformation of a path in the axial direction under external force and the fixedness of the radial plane to contain a UGV in a collision-free space by considering the revolute and collision constraints (Subsection III-A). Given the start and end points in any 2D environment, the model can optimize a global path planner by automatically constructing a tube that defines the optimal path with both shortened distance and enhanced smoothness.
2) Second, this study constructs a Deep Deterministic Policy Gradient algorithm (DDPG, Subsection III-B). DDPG can be efficiently trained on abstracted structures (Subsection III-B.1) of an arbitrarily derived tube defining any ''safe area'' for UGV traversing. The trained DDPG model enables a general policy to control a UGV to track the correct path free of risks by itself. The model can apply to environments of increasing complexity under external disturbances without the need for tuning parameters.

Experiments have been performed over complicated environments of different types (Section IV). The results indicate that (1) the rope model helps in minimizing distance and enhancing smoothness of the path and (2) the DDPG can be modeled quickly and the DDPG-based controller can autonomously adjust the UGV to follow the correct path. The main contributions of this study are as follows:
1) This study develops an intuitive method to globally optimize path planning for UGVs with the capability of shortening distance and improving smoothness;
2) This study enables a general solution to track the optimal path targeting real-time outdoor applications.

II. RELATED WORK
Recent trends in optimization of path planning focus on (1) convex optimization and (2) nature-inspired methods. Convex optimization aims to plan a continuous trajectory to directly meet the dynamics constraints over UGVs. A typical example is the CHOMP method proposed by Zucker et al. [30].
2) UTILIZED CONSTRAINTS
Two kinds of constraints are utilized to simulate the rope model: the revolute constraint and the collision constraint. As shown in Figure 1(a), the revolute constraint arises from the fact that components rotate freely around a common point, which keeps a certain distance between components so as to construct an approximately fixed-size tube. In Figure 1(b), the collision constraint exists between obstacles and components to prevent mutual penetration, which ensures that the constructed tube is safe. According to the rule in [40], the Jacobians of the two constraints are deduced in turn. The first step is to write out an equation that describes the position constraint. Since adjacent components rotate around a common point under the revolute constraint, the corresponding position constraint is described in (7):

C_revolute(x_a, R_a, x_b, R_b) = p_a − p_b = x_a + R_a r_a − x_b − R_b r_b    (7)

where R_a and R_b are the rotation matrices, and r_a and r_b are the vectors from the centers of the components to the common point. The equation indicates that p_a and p_b must coincide at any step, which allows the components to rotate freely about the common point.

The position constraint for the collision constraint is represented in (8); r_a and r_b locate the contact points on components a and b, and n is the normal vector from b to a. This position constraint measures whether penetration will occur, in the sense that the constraint force will separate the components if the value is negative. On the other hand, if the value is greater than zero, no interaction exists.

C_collision(x_a, R_a, x_b, R_b) = (p_a − p_b) · n = (x_a + R_a r_a − x_b − R_b r_b) · n    (8)

The next step after defining the position constraints is to take their derivative with respect to time, which yields the velocity constraints in (9) and (10). In (9), ω_a × r_a = (−ω_a r_ay, ω_a r_ax)^T = ω_a (−r_ay, r_ax)^T, and R_ra = (−r_ay, r_ax)^T. After isolating the velocity v_i = (v_a, ω_a, v_b, ω_b)^T, the Jacobians are obtained easily: J_revolute = [1  R_ra  −1  −R_rb] and J_collision = [n^T  (r_a × n)^T  −n^T  −(r_b × n)^T].

Ċ_revolute = v_a + ω_a × r_a − (v_b + ω_b × r_b) = [1  R_ra  −1  −R_rb] (v_a, ω_a, v_b, ω_b)^T    (9)

Ċ_collision = (v_a + ω_a × r_a − v_b − ω_b × r_b) · n = [n^T  (r_a × n)^T  −n^T  −(r_b × n)^T] (v_a, ω_a, v_b, ω_b)^T    (10)

Substituting J_revolute and J_collision into (5) and (6), we obtain the update formulas directly; for the detailed derivation, refer to Appendix A.

v_{a,i+1} = v_{a,i} + P/m_a
ω_{a,i+1} = ω_{a,i} + I_a^{−1} r_a × P
v_{b,i+1} = v_{b,i} − P/m_b
ω_{b,i+1} = ω_{b,i} − I_b^{−1} r_b × P    (11)

For the revolute constraint, the impulse is calculated as P_revolute = −K^{−1}(v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b), with K = (1/m_a + 1/m_b) E_{2×2} + R_ra I_a^{−1} R_ra^T + R_rb I_b^{−1} R_rb^T. As to the collision constraint, the impulse is P_collision = P_collision n, where the scalar P_collision = −((v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b) · n)/K, K = m_a^{−1} + m_b^{−1} + I_a^{−1}(n × r_a)^2 + I_b^{−1}(n × r_b)^2, and P_collision ≥ 0 is required so that the collision constraint force only separates the contact.

FIGURE 2. Rope model.

3) CONSTRUCTION FOR ROPE MODEL
The above discussion provides concrete solutions to the two constraint forces. In this part, the method to construct a rope model based on these two constraints is discussed. As shown in Figure 2, the solid circle components are initialized to cover the original path, and the common rotation point of adjacent circles is designated as the center of the prior circle, which makes the next component rotate only about the center of the prior component so as to maintain a certain distance. The first component, drawn in gray, is fixed so that the chain can be tightened under external forces. On the other hand, in order to ensure the correctness of the convergence direction, a shrink plate is constructed at the end of the original chain, and the direction of the shrink plate is consistent with the original direction of the end. Then, an external force is applied to the end in the same direction as the shrink plate so that the chain will slowly contract. The length of the chain is chosen as the judgment condition of convergence. Under the force, the distance between the last component and the end point grows larger and larger. When the distance between them exceeds a certain threshold, the component is deleted and the force is applied to the new end. By continually removing components, the length of the chain becomes constant once the chain is tightened. However, since the two constraints under consideration are not conspicuous for the energy consumption of the entire chain, jitter exists when the chain converges to a fixed length, which leads to an undesirable shape for the chain.
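To make the tightening procedure concrete, the sketch below outlines one possible implementation in Python/NumPy. It is a minimal illustration under simplifying assumptions (point components with unit mass, position-level corrections in place of the impulse updates of (11)); the function and parameter names (tighten_rope, pull, remove_dist) are ours, not the paper's.

```python
import numpy as np

def tighten_rope(path, radius, obstacles, spacing,
                 pull=0.05, remove_dist=2.0, max_iters=5000):
    """Minimal sketch of the rope-tightening loop: `path` holds the circle
    centres covering the initial path, `obstacles` is a list of (centre,
    radius) discs.  Sequential position corrections stand in for the
    revolute and collision impulses of (11)."""
    pts = np.asarray(path, dtype=float)
    goal = pts[-1].copy()                 # original end point of the chain
    prev_len = np.inf
    for _ in range(max_iters):
        # external force: pull the free end along the shrink-plate direction
        d_end = pts[-1] - pts[-2]
        pts[-1] += pull * d_end / (np.linalg.norm(d_end) + 1e-9)

        # revolute-like constraint: keep adjacent centres `spacing` apart
        # (the first, grey component pts[0] is never moved, i.e. it is fixed)
        for i in range(1, len(pts)):
            d = pts[i] - pts[i - 1]
            dist = np.linalg.norm(d) + 1e-9
            pts[i] -= (dist - spacing) * d / dist

        # collision constraint: push circles out of penetrated obstacles,
        # applying only separating (non-negative) corrections
        for i in range(1, len(pts)):
            for c, r in obstacles:
                d = pts[i] - np.asarray(c, dtype=float)
                dist = np.linalg.norm(d) + 1e-9
                depth = (r + radius) - dist
                if depth > 0.0:
                    pts[i] += depth * d / dist

        # delete the last component once it drifts too far from the goal
        if len(pts) > 2 and np.linalg.norm(pts[-1] - goal) > remove_dist:
            pts = pts[:-1]

        # convergence: the total chain length stops changing
        length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
        if abs(prev_len - length) < 1e-6:
            break
        prev_len = length
    return pts
```

A full implementation would also carry the angular terms ω and I^{−1} from (11) and scale each correction by K^{−1}, as derived in Appendix A.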
feedback scalar reward r_{t+1} ∈ R and a new state s_{t+1} from the environment; γ ∈ [0, 1] is the discount factor for future reward. The cumulative reward at step t is represented as R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ..., and the action-value function is Q^π(s_t, a_t) = E[R_t | s_t, a_t]. The actor parameters θ^µ are updated along the deterministic policy gradient:

∇_{θ^µ} J ≈ E_{s_t, a_t, r, s_{t+1}} [ ∇_{θ^µ} Q(s, a | θ^Q) |_{s=s_t, a=µ(s_t|θ^µ)} ]
          = E_{s_t, a_t, r, s_{t+1}} [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=µ(s_t)} ∇_{θ^µ} µ(s | θ^µ) |_{s=s_t} ]    (18)

The DDPG [42] uses neural networks to parameterize DPG in the continuous action domain. The DDPG algorithm adapted for path tracking is shown in Algorithm 1. Similar to [43], experience replay and a separate target network are also utilized to enhance the stability of the algorithm. Random noise N is added to the continuous action space in order to perform more efficient exploration, and it diminishes step by step to imitate the ε-greedy strategy. MAX_EP_STEPS is the threshold indicating whether exploration is sufficient for each trial. In this paper, a neural network is used to learn the policy. Figure 5 shows the structure of the networks. The Actor and Critic networks both contain two fully-connected layers, with 100 and 20 nodes, respectively. The input vector for the actor is the distance to obstacles detected by the five sensors. Moreover, tanh is used as the activation function of the output layer to constrain the rate of rotation. As to the critic, the action is merged with the state as the action-state input, and the Q(s, a) value is calculated with a linear activation function.

Algorithm 1: DDPG for Path Tracking (Adapted From [42])
1  Initialize critic network Q(s, a|θ^Q), actor µ(s|θ^µ), and target networks Q′ and µ′ with weights θ^{Q′} ← θ^Q, θ^{µ′} ← θ^µ
   ...
13     end
14     if episode % replace_iteration == 0 then
15         Update the target networks:
               θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
               θ^{µ′} ← τ θ^µ + (1 − τ) θ^{µ′}
16     end
17     if terminal == true then
18         break
19     end
20   end
21 end

FIGURE 5. The structure of networks. From left to right in each small square: type of layer, number of nodes, activation function.
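For illustration, the network layout of Figure 5 can be written down as the following PyTorch-style sketch. The layer widths (100 and 20 nodes), the five-sensor input, the tanh-bounded rotation output, and the linear Q output follow the description above; the hidden-layer activations, class names, and the soft-update helper are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

N_SENSORS = 5          # state: distances reported by the five range sensors
N_ACTIONS = 1          # action: rate of rotation, bounded by tanh

class Actor(nn.Module):
    """mu(s | theta_mu): two fully-connected layers (100, 20 nodes), tanh output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_SENSORS, 100), nn.ReLU(),
            nn.Linear(100, 20), nn.ReLU(),
            nn.Linear(20, N_ACTIONS), nn.Tanh(),   # constrains the rotation rate
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q(s, a | theta_Q): state and action merged at the input, linear Q output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_SENSORS + N_ACTIONS, 100), nn.ReLU(),
            nn.Linear(100, 20), nn.ReLU(),
            nn.Linear(20, 1),                      # linear activation for Q(s, a)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def soft_update(target, source, tau=0.01):
    """Target-network update from Algorithm 1: theta' <- tau*theta + (1-tau)*theta'."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)
```

The actor would then be trained by ascending the deterministic policy gradient of (18), e.g. minimizing -critic(s, actor(s)).mean(), while the critic regresses the usual one-step TD target built from the target networks.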
FIGURE 6. Result of the proposed optimization method. Black areas are occupied; grey areas are feasible for the robot.
planned the final optimized path on its basis. The initial and final paths were marked as blue and brown-red lines, respectively. Apparently, redundancy exists in the initial path in terms of distance due to the discrete movement setting. The optimized path significantly improved the initial path in both distance and smoothness. The improvement was measured as D_improvedRatio = 1 − D_optimized/D_initial and S_improved = S_optimized − S_initial, where D and S were calculated with (19) according to [44]:

Distance = Σ_{p=1}^{N−1} ||x_{p+1} − x_p||

Smoothness = (1/(N−2)) Σ_{p=2}^{N−1} cos^{−1} [ (x_p − x_{p−1}) · (x_{p+1} − x_p) / (||x_p − x_{p−1}|| ||x_{p+1} − x_p||) ]    (19)

where x_p is the location of the p-th point and N is the number of components.

As shown in Table 6, distances were significantly reduced in almost all maps. Only the ratio for Map8 was not very obvious because the key point set changed little. The ratios for Map1, Map6 and Map9 were close to the optimum 1 − √2/2 ≈ 0.293, calculated for the scene where the start and goal points are located on the diagonal of a square blank map.

It should also be noted that the smoothness for all maps has significantly increased. The largest improvements were witnessed in Map2 and Map10, where the number of deforming points decreased the most. Observations from Figure 6 implied that the proposed method could minimize the redundant distances while maximizing smoothness by simulating the rope model.
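For reference, the measures in (19) and the improvement statistics quoted above translate directly into the short NumPy sketch below (function names are ours):

```python
import numpy as np

def distance(path):
    """Total path length: sum of ||x_{p+1} - x_p||, as in (19)."""
    path = np.asarray(path, dtype=float)
    return np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1))

def smoothness(path):
    """Mean turning angle (radians) between consecutive segments, as in (19)."""
    path = np.asarray(path, dtype=float)
    seg = np.diff(path, axis=0)                      # x_{p+1} - x_p
    total = 0.0
    for a, b in zip(seg[:-1], seg[1:]):
        cos_t = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        total += np.arccos(np.clip(cos_t, -1.0, 1.0))
    return total / (len(path) - 2)

def improvement(initial, optimized):
    """D_improvedRatio = 1 - D_opt / D_init and S_improved = S_opt - S_init."""
    d_ratio = 1.0 - distance(optimized) / distance(initial)
    s_diff = smoothness(optimized) - smoothness(initial)
    return d_ratio, s_diff
```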
FIGURE 7. The training process. The yellow object represents the UGV; the red object denotes the sensor; the horizontal axis represents the number of episodes.
V. CONCLUSIONS
This study developed a two-tier approach to navigation optimization in terms of path planning and tracking towards large-scale outdoor applications.

A rope model was first designed to optimize any given path planner. It mimics the deformation of a path in the axial direction under external force and the fixedness of the radial plane to contain a UGV in a collision-free space. Given the start and end points in any 2D environment, the model automatically constructed a tube that defined the optimal path with shortened distance and enhanced smoothness.

A Deep Deterministic Policy Gradient algorithm was used to enable tracking of the resulting optimal path. A simulated scenario was designed that abstracted the paths in real-world scenarios with extremely simple structures. The trained DDPG model enables a general policy to control a UGV to track the correct path free of risks by itself in environments of increasing complexity without the need for tuning parameters.

Experiments have been performed over 10 types of complicated environments examined in renowned literature. The global paths were first planned with A*; the rope model then operated on the paths and could significantly shorten the distance and enhance the smoothness of the paths in all cases. The DDPG can be trained quickly, in only a couple of minutes on an office desktop, over the simulated scenario. The DPG-based controller could then autonomously adjust the UGV model to follow the correct path.

Overall, the proposed method was useful in optimizing the global path with respect to distance and clearance in complicated environments, independent of initial path planners. The deep reinforcement learning technique had been able to generalize to a variety of scenarios.

APPENDIX A
For the revolute constraint, J_revolute = [1  R_ra  −1  −R_rb]; according to (6),

K = J M^{−1} J^T
  = [1  R_ra  −1  −R_rb] diag(M_a^{−1}, I_a^{−1}, M_b^{−1}, I_b^{−1}) [1  R_ra  −1  −R_rb]^T
  = [M_a^{−1}  R_ra I_a^{−1}  −M_b^{−1}  −R_rb I_b^{−1}] [1  R_ra  −1  −R_rb]^T
  = (1/m_a + 1/m_b) E_{2×2} + R_ra I_a^{−1} R_ra^T + R_rb I_b^{−1} R_rb^T    (20)

−J v_i = −[1  R_ra  −1  −R_rb] (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T
       = −(v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b)    (21)

So Δt λ = −K^{−1}(v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b). Substituting this formula into (5) then gives

(v_{a,i+1}, ω_{a,i+1}, v_{b,i+1}, ω_{b,i+1})^T
  = (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T − M^{−1} J^T K^{−1} (v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b)
  = (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T − M^{−1} [1  R_ra  −1  −R_rb]^T K^{−1} (v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b)
  = (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T + (P/m_a,  I_a^{−1} R_ra^T P,  −P/m_b,  −I_b^{−1} R_rb^T P)^T    (22)

where P_revolute = −K^{−1}(v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b).

The same process applies to the collision constraint: J_collision = [n^T  (r_a × n)^T  −n^T  −(r_b × n)^T]; according to (6),

K = [n^T  (r_a × n)^T  −n^T  −(r_b × n)^T] diag(M_a^{−1}, I_a^{−1}, M_b^{−1}, I_b^{−1}) [n  (r_a × n)  −n  −(r_b × n)]^T
  = n^T M_a^{−1} n + (r_a × n)^T I_a^{−1} (r_a × n) + n^T M_b^{−1} n + (r_b × n)^T I_b^{−1} (r_b × n)
  = 1/m_a + 1/m_b + I_a^{−1} (n × r_a)^2 + I_b^{−1} (n × r_b)^2    (23)
−J v_i = −[n^T  (r_a × n)^T  −n^T  −(r_b × n)^T] (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T
       = −(v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b) · n    (24)

So Δt λ = −K^{−1}((v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b) · n). Substituting this formula into (5) then gives

(v_{a,i+1}, ω_{a,i+1}, v_{b,i+1}, ω_{b,i+1})^T
  = (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T − M^{−1} J^T K^{−1} ((v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b) · n)
  = (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T − M^{−1} [n  (r_a × n)  −n  −(r_b × n)]^T K^{−1} ((v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b) · n)
  = (v_{a,i}, ω_{a,i}, v_{b,i}, ω_{b,i})^T + (P/m_a,  I_a^{−1} r_a × P,  −P/m_b,  −I_b^{−1} r_b × P)^T    (25)

where P_collision = P_collision n and the scalar P_collision = −K^{−1}((v_{a,i} + ω_{a,i} × r_a − v_{b,i} − ω_{b,i} × r_b) · n).

ACKNOWLEDGMENT
This work was supported in part by the National Natural Science Foundation of China (No. 61772380) and the Foundation for Innovative Research Groups of Hubei Province (No. 2017CFA007).

REFERENCES
[1] B. Siciliano and O. Khatib, Springer Handbook of Robotics, 2nd ed. Berlin, Germany: Springer-Verlag, 2016.
[2] H.-P. Huang and S.-Y. Chung, ''Dynamic visibility graph for path planning,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), vol. 3, Sep./Oct. 2004, pp. 2813–2818.
[3] T. Lozano-Pérez and M. A. Wesley, ''An algorithm for planning collision-free paths among polyhedral obstacles,'' Commun. ACM, vol. 22, no. 10, pp. 560–570, 1979.
[4] P. Bhattacharya and M. L. Gavrilova, ''Voronoi diagram in optimal path planning,'' in Proc. 4th Int. Symp. Voronoi Diagrams Sci. Eng. (ISVD), Jul. 2007, pp. 38–47.
[5] M. Spong, S. Hutchinson, and M. Vidyasagar, Robot Modeling and Control. Hoboken, NJ, USA: Wiley, 2006, pp. 163–182.
[6] Y. Koren and J. Borenstein, ''Potential field methods and their inherent limitations for mobile robot navigation,'' in Proc. IEEE Int. Conf. Robot. Automat., Apr. 1991, pp. 1398–1404.
[7] C. W. Warren, ''Global path planning using artificial potential fields,'' in Proc. Int. Conf. Robot. Automat., May 1989, pp. 316–321.
[8] Y. Hu and S. X. Yang, ''A knowledge based genetic algorithm for path planning of a mobile robot,'' in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), vol. 5, Apr./May 2004, pp. 4350–4355.
[9] D. Karaboga, B. Gorkemli, C. Ozturk, and N. Karaboga, ''A comprehensive survey: Artificial bee colony (ABC) algorithm and applications,'' Artif. Intell. Rev., vol. 42, no. 1, pp. 21–57, 2014.
[10] R. C. Eberhart and Y. Shi, ''Particle swarm optimization: Developments, applications and resources,'' in Proc. Congr. Evol. Comput., vol. 1, May 2001, pp. 81–86.
[11] S. Quinlan and O. Khatib, ''Elastic bands: Connecting path planning and control,'' in Proc. IEEE Int. Conf. Robot. Automat., May 1993, pp. 802–807.
[12] O. Brock and O. Khatib, ''Elastic strips: A framework for motion generation in human environments,'' Int. J. Robot. Res., vol. 21, no. 12, pp. 1031–1052, 2002.
[13] Z. Zhu, E. Schmerling, and M. Pavone, ''A convex optimization approach to smooth trajectories for motion planning with car-like robots,'' in Proc. 54th IEEE Conf. Decis. Control (CDC), Dec. 2015, pp. 835–842.
[14] T. Das and I. N. Kar, ''Design and implementation of an adaptive fuzzy logic-based controller for wheeled mobile robots,'' IEEE Trans. Control Syst. Technol., vol. 14, no. 3, pp. 501–510, May 2006.
[15] R.-J. Wai and C.-M. Liu, ''Design of dynamic Petri recurrent fuzzy neural network and its application to path-tracking control of nonholonomic mobile robot,'' IEEE Trans. Ind. Electron., vol. 56, no. 7, pp. 2667–2683, Jul. 2009.
[16] K. Kanjanawanishkul and A. Zell, ''Path following for an omnidirectional mobile robot based on model predictive control,'' in Proc. IEEE Int. Conf. Robot. Automat., May 2009, pp. 3341–3346.
[17] C. J. Ostafew, A. P. Schoellig, T. D. Barfoot, and J. Collier, ''Learning-based nonlinear model predictive control to improve vision-based mobile robot path tracking,'' J. Field Robot., vol. 33, no. 1, pp. 133–152, 2015.
[18] Y. LeCun, Y. Bengio, and G. Hinton, ''Deep learning,'' Nature, vol. 521, no. 7553, p. 436, 2015.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, no. 1. Cambridge, MA, USA: MIT Press, 1998.
[20] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, ''DeepDriving: Learning affordance for direct perception in autonomous driving,'' in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 2722–2730.
[21] T. Kollar and N. Roy, ''Trajectory optimization using reinforcement learning for map exploration,'' Int. J. Robot. Res., vol. 27, no. 2, pp. 175–196, 2008.
[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, ''Deterministic policy gradient algorithms,'' in Proc. 31st Int. Conf. Mach. Learn. (ICML), 2014, pp. 387–395.
[23] S. Paul and L. Vig, ''Deterministic policy gradient based robotic path planning with continuous action spaces,'' in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 725–733.
[24] S. Gu, E. Holly, T. Lillicrap, and S. Levine, ''Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,'' in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May/Jun. 2017, pp. 3389–3396.
[25] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel. (2017). ''Overcoming exploration in reinforcement learning with demonstrations.'' [Online]. Available: https://arxiv.org/abs/1709.10089
[26] L. Tai, G. Paolo, and M. Liu. (2017). ''Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation.'' [Online]. Available: https://arxiv.org/abs/1703.00420
[27] P. Mirowski et al. (2016). ''Learning to navigate in complex environments.'' [Online]. Available: https://arxiv.org/abs/1611.03673
[28] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, ''Benchmarking deep reinforcement learning for continuous control,'' in Proc. Int. Conf. Mach. Learn., 2016, pp. 1329–1338.
[29] X. B. Peng and M. van de Panne, ''Learning locomotion skills using DeepRL: Does the choice of action space matter?'' in Proc. ACM SIGGRAPH/Eurograph. Symp. Comput. Animation, 2017, p. 12.
[30] M. Zucker et al., ''CHOMP: Covariant Hamiltonian optimization for motion planning,'' Int. J. Robot. Res., vol. 32, nos. 9–10, pp. 1164–1193, 2013.
[31] J. Schulman et al., ''Motion planning with sequential convex optimization and convex collision checking,'' Int. J. Robot. Res., vol. 33, no. 9, pp. 1251–1270, 2014.
[32] M. Davoodi, F. Panahi, A. Mohades, and S. N. Hashemi, ''Multi-objective path planning in discrete space,'' Appl. Soft Comput., vol. 13, no. 1, pp. 709–720, 2013.
[33] T. T. Mac, C. Copot, D. T. Tran, and R. De Keyser, ''A hierarchical global path planning approach for mobile robots based on multi-objective particle swarm optimization,'' Appl. Soft Comput., vol. 59, pp. 68–76, Oct. 2017.
[34] M. A. Contreras-Cruz, V. Ayala-Ramirez, and U. H. Hernandez-Belmonte, ''Mobile robot path planning using artificial bee colony and evolutionary programming,'' Appl. Soft Comput., vol. 30, pp. 319–328, May 2015.
[35] J. Baltes and Y. Lin, ''Path tracking control of non-holonomic car-like robot with reinforcement learning,'' in Robot Soccer World Cup. Berlin, Germany: Springer, 1999, pp. 162–173.
[36] L. Zuo, X. Xu, C. Liu, and Z. Huang, ''A hierarchical reinforcement learning approach for optimal path tracking of wheeled mobile robots,'' Neural Comput. Appl., vol. 23, nos. 7–8, pp. 1873–1883, 2013.
[37] P. Abbeel, M. Quigley, and A. Y. Ng, ''Using inaccurate models in reinforcement learning,'' in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 1–8.
[38] Y.-J. Liu and S. Tong, ''Optimal control-based adaptive NN design for a class of nonlinear discrete-time block-triangular systems,'' IEEE Trans. Cybern., vol. 46, no. 11, pp. 2670–2680, Nov. 2016.
[39] L. Liu, Z. Wang, and H. Zhang, ''Adaptive fault-tolerant tracking control for MIMO discrete-time systems via reinforcement learning algorithm with less learning parameters,'' IEEE Trans. Autom. Sci. Eng., vol. 14, no. 1, pp. 299–313, Jan. 2017.
[40] E. Catto, ''Iterative dynamics with temporal coherence,'' in Proc. Game Developers Conf., vol. 2, no. 4, 2005, p. 5.
[41] E. Catto, ''Modeling and solving constraints,'' in Proc. Game Developers Conf., 2009, p. 16.
[42] T. P. Lillicrap et al. (2015). ''Continuous control with deep reinforcement learning.'' [Online]. Available: https://arxiv.org/abs/1509.02971
[43] V. Mnih et al., ''Human-level control through deep reinforcement learning,'' Nature, vol. 518, pp. 529–533, 2015.
[44] J. Han and Y. Seo, ''Mobile robot path planning with surrounding point set and path improvement,'' Appl. Soft Comput., vol. 57, pp. 35–47, Aug. 2017.
[45] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, ''Learning-based nonlinear model predictive control to improve vision-based mobile robot path-tracking in challenging outdoor environments,'' in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May/Jun. 2014, pp. 4029–4036.
[46] C. Richter, W. Vega-Brown, and N. Roy, ''Bayesian learning for safe high-speed navigation in unknown environments,'' in Robotics Research. Cham, Switzerland: Springer, 2018, pp. 325–341.
[47] R. Takei, H. Huang, J. Ding, and C. J. Tomlin, ''Time-optimal multi-stage motion planning with guaranteed collision avoidance via an open-loop game formulation,'' in Proc. IEEE Int. Conf. Robot. Automat., May 2012, pp. 323–329.

MINGGAO WEI received the bachelor's degree from the Dalian University of Technology. He is currently pursuing the M.D. degree with the School of Computer Science, Wuhan University. His main research interests include machine learning and multi-agent systems.

SONG WANG received the bachelor's degree from the China University of Geosciences. He is currently pursuing the M.D. degree with the School of Computer Science, Wuhan University. His main research interests include machine learning and multi-agent systems.

JINFAN ZHENG is currently pursuing the bachelor's degree with the School of Computer Science, Wuhan University. His main research interests include transfer learning and multitask learning.

DAN CHEN was an HEFCE Research Fellow with the University of Birmingham, U.K. He is currently a Professor with the School of Computer Science, Wuhan University, Wuhan, China. His research interests include data science and engineering, high-performance computing, and modeling and simulation of complex systems.