The Actor-Dueling-Critic Method for
Reinforcement Learning
Menghao Wu 1,2 , Yanbin Gao 1, *, Alexander Jung 2 , Qiang Zhang 1 and Shitong Du 1
1 College of Automation, Harbin Engineering University, Harbin 150001, China;
[email protected] (M.W.); [email protected] (Q.Z.); [email protected] (S.D.)
2 Department of Computer Science, Aalto University, 02150 Espoo, Finland; [email protected]
* Correspondence: [email protected]
Received: 25 February 2019; Accepted: 25 March 2019; Published: 30 March 2019
1. Introduction
Autonomous navigation is a core research area of mobile robotics, and the obstacle avoidance technique has long been treated as a planning problem [1]. An efficient navigation system requires both global path planning and local motion control ability. Local motion control usually uses sensory information to determine a motion that avoids collision with unknown obstacles [2]. Classical solutions such as simultaneous localization and mapping (SLAM) enable autonomous vehicles to safely roam in unknown environments while incrementally building a map of them [3]. The robot uses lasers or cameras as its main sensors to scan a great many points in an area; based on the resulting map, it can avoid collisions. However, avoiding obstacles based on a complex world representation can be inconvenient and inflexible. Besides, some basic obstacle avoidance algorithms determine a suitable distance from recent sensory data to ensure real-time avoidance, but to tune these algorithms, many environmental circumstances must be considered and tested [1]. Visual navigation systems recognize objects and measure the distance to them with monocular or binocular cameras to avoid obstacles; however, visual information is easily affected by environmental conditions such as the position of light sources, illumination intensity, etc. [4]. For obstacle avoidance based on distance measurement, sensors such as lasers and LiDARs provide simple and effective solutions. These sensors collect information around the robot, so the robot can perceive the relative position between itself and the obstacles in the environment. Traditional control algorithms rely on expert knowledge and experimental experience to make the sensors and motors coordinate to avoid collisions; they have a high training cost and are not flexible.
With the rise of artificial intelligence in recent years, approaches based on deep learning have achieved many impressive results [5–9]. Combined with transfer learning techniques, a local motion planning model and knowledge pre-trained in a simulator can be effectively used on real robots to achieve obstacle avoidance [10]. There are many excellent works [11] training models in simulators such as Gazebo [11] and MuJoCo [12]. Typically, robots such as the Turtlebot measure the distance to the environment with onboard LiDAR and sonar to avoid unpredictable obstacles. Based on the sensor input, the robot can finish navigation tasks with a pre-trained model, and the model and knowledge can be acquired by means of reinforcement learning. Therefore, studying reinforcement learning algorithms and approaches to improve their training performance is of practical significance to the navigation field.
Reinforcement learning (RL) is a mathematical framework for autonomously learning an optimal control strategy through trial and error, with applications in a wide range of fields including engineering and robotics [13]. The control strategy, or policy, is a mapping between states and actions, which enables the agent to select a good action based upon the current state and its experience from interacting with an environment. This learning process continues until the agent acquires a promising performance, and the whole process is fully driven by the reward. For obstacle-avoiding robots, the sensory input can be regarded as the state, the operation of the motors as the action, and whether or not a collision occurs as the reward. Through this formulation, an obstacle avoidance task can be formalized as a standard RL process. Extracting useful information from the environment (in forms such as images, text, and audio) is a key capability for training the agent. With the recent advances in deep learning (DL), the powerful function approximation and representation learning properties of neural networks allow an RL agent to efficiently learn features and patterns from high-dimensional data with models composed of multiple processing layers [14].
This has dramatically accelerated the development of RL, and deep reinforcement learning (DRL), the integration of RL and neural networks, can be applied to more fields and can learn knowledge end to end. There is a range of successful neural network architectures, such as convolutional neural networks (CNN) [15], multilayer perceptrons, recurrent neural networks [16], generative adversarial nets (GAN) [17], etc., which have dramatically improved the state of the art in applications such as object detection, speech recognition, and language understanding [14]. DRL algorithms can deal with decision-making problems that were previously intractable, involving high-dimensional state inputs and large action spaces [18]; with this progress, it has become easier than before to train complex neural network models.
In this paper, we focus on improving the training performance of RL algorithms by providing a new training technique, and we apply it in an obstacle avoidance simulator to discuss its practicability in the navigation field. The method combines the benefits of the actor-critic framework and the dueling network architecture [19]. We refer to this hybrid approach as the ADC algorithm. The ADC algorithm operates well in a continuous action space since it has an actor network, which directly optimizes the policy and selects actions. On the other hand, the dueling-critic network can estimate the state-action value and the action-advantage value. By combining the two estimated values with a technique we present, the estimation of the Q-value can be made insensitive to environmental noise, thereby improving training stability. The dueling-critic adopts a framework design similar to the dueling network [19], an efficient technique that decouples state and action pairs and therefore evaluates the action value independently of the state. Our method provides a more accurate estimation of the state-action value at each time step, which is an important factor for guiding the actor to update its policy network.
However, the original dueling network can only work in a discrete action space, since it is based on the standard deep Q-network and the action advantage is a relative action value in each state, relative to the other unselected actions. In a continuous action space, the unselected actions are countless, and it is impossible to evaluate each unselected action's advantage value. We introduce the concept of the action interval, converting the action's advantage to the action interval's advantage value, which makes it possible to use this technique in continuous action spaces.
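As a simple illustration of this idea (a minimal sketch of our own, not the paper's implementation; the action bounds and the number of intervals n are assumed values), a continuous action can be assigned to one of n uniform intervals as follows:

def action_to_interval(action, low, high, n):
    # Index of the uniform sub-interval of [low, high] that contains `action`.
    width = (high - low) / n
    idx = int((action - low) // width)
    return min(max(idx, 0), n - 1)   # clamp boundary cases to a valid index

# Example: with actions in [-2, 2] and n = 20 intervals, an action of 0.37
# falls into interval index 11.
print(action_to_interval(0.37, low=-2.0, high=2.0, n=20))

The advantage stream then only needs to score these n intervals instead of an uncountable set of individual actions.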
To test our approach's performance, we apply our method to the gym Pendulum-v0 environment and an obstacle avoidance environment; both tasks have continuous action spaces. To explore the stability of the ADC method, we manually add noise to the environments. The results show that our method operates well in these continuous control tasks and that the training process is more efficient than that of the DDPG algorithm [20], especially in environments with noisy input. The contributions of this paper are summarized as follows:
• We provide a novel network structure (ADC) working in continuous action space, which can decouple states and actions and estimate the state value and the action advantage separately.
• We introduce the concept of the action interval's advantage, which makes it possible for the advantage technique to be used in a continuous action domain.
• Based on the ADC structure, we propose an algorithm that is effective at learning policies for continuous control tasks. The algorithm is fully model-free and leads to stable long-term behavior.
The rest of the paper is organized as follows. Section 2 gives a brief review of the main RL-related techniques for improving training performance and of robotics applications. In Section 3, we formalize the problem setup and provide the necessary background knowledge of RL. Our main contribution is in Section 4, which discusses our approach combining the dueling network architecture with the actor-critic network and presents details such as the action interval's advantage and the dueling network's aggregating module. In Section 5, we present the experimental results of our method on a gym classic control simulator and a navigation simulator. Some general discussions are presented in Section 6, and Section 7 concludes the paper with some potential use cases.
2. Related Work
Many researchers have studied RL algorithms and relevant techniques to improve their training
performance. At the same time, many research groups have narrowed the gap between the algorithms
and practical applications. In particular, the applications in continuous control have drawn more
attention, which has great significance in the field of robotic navigation.
Mnih et al. [21] developed the first standout DRL algorithm, the Deep Q-Network (DQN), a novel artificial agent that achieved human-level performance in playing the Atari game series directly from raw video input. Silver et al. [22] set a landmark in artificial intelligence (AI) by playing the game of Go based on supervised learning and RL. DRL is poised to revolutionize the field of AI and represents a big step towards building fully autonomous, end-to-end systems with a higher-level understanding of the visual world [18]. Since then, a number of new RL algorithms and techniques have sprung up, each improving training performance in its own way. Van Hasselt et al. [23] presented a double-Q estimator for value-based RL methods to decrease the overestimation of the Q-value and hence improve the agent's training performance. Wang et al. [19] improved the accuracy of Q-value estimation by adopting two split networks, one for estimating the state value and the other for estimating the action advantage. In contrast to modifying the network structure, Schaul et al. [24] investigated the prioritized experience replay (PER) method to make experience replay more efficient and effective; this prioritization can lead to quicker convergence in sparse-reward environments. Nair et al. [25] introduced a massively distributed DRL architecture which consists of parallel actors and learners and uses a distributed replay memory and neural network to scale up computation.
These advances in algorithms drove many researchers to experiment with applications such as visual navigation and robot control. Barron et al. [26] explored virtual 3-D world navigation with a deep Q-learning method; the trained agent performs well with a shallow neural network. In related work, Mirowski et al. [27] formulated the navigation task as an RL problem and trained an agent to navigate in a complex environment with dynamic elements. Haarnoja et al. explored a series of tasks and methods that enable real robots to learn skills [28–30], and they presented the soft actor-critic [31] method to improve sampling efficiency. For applications of RL in the control field, the first thing to consider is the action space, because the majority of previous RL methods operate in domains with discrete actions and are based on value function estimation [32]. For real-world applications related to physical control, such as robotics, an important property is the continuous (real-valued) action space. Methods based on value functions, such as deep Q-learning, cannot be straightforwardly applied to continuous domains, since they rely on finding the action that maximizes the action-value function, which requires an iterative optimization process at every step [20]. Therefore, exploring RL algorithms in continuous action spaces is important and practical work.
For continuous control tasks, simply discretizing the action space and using value-based methods such as DQN is feasible. Obviously, this discards much information about the action space and thus undermines the ability to find the true optimal policy. Large action spaces are difficult to explore effectively, making the training process intractable with traditional value-based methods; losing most of the action space's information therefore results in poor performance. Another series of algorithms is based on policy-gradient methods [33], which directly optimize the parameters of a stochastic policy through local gradient information obtained by interacting with the environment using the current policy [34]. The critical challenge for policy-based methods is finding a proper score function to evaluate how good or bad a policy is. To address this, actor-critic approaches have grown in popularity in the continuous domain; they take advantage of prior research experience and are capable of selecting actions in a continuous domain with a temporal-difference learning value function. Based on this hybrid framework, Mnih et al. [35] proposed the asynchronous advantage actor-critic (A3C) method, which surpasses the original actor-critic in convergence time and performance. Lillicrap et al. [20] presented the deep deterministic policy-gradient (DDPG) algorithm, which robustly solved a variety of challenging problems with continuous action spaces. O'Donoghue et al. [36] gave a technique similar to DDPG which combines the policy gradient with off-policy Q-learning (PGQL). Among these actor-critic-based methods, the critic networks all estimate the state-action value in the same way. Notably, these estimates serve as a signal to guide the actor network to select better actions and then update the policy. Therefore, we present a method with a more precise and proper estimation of the state-action value in the critic network, which can potentially improve the overall performance.
There has been much recent work solving robotic navigation tasks with RL approaches. Tai et al. [37] trained a mapless motion planner for navigation tasks with an asynchronous DRL method, which can be directly applied in unseen environments; the motion planner was trained end to end based on sparse laser sensors. Zhu et al. [38] presented an RL-based model for target-driven visual navigation tasks which addressed issues such as the lack of generalization capability and data inefficiency. Xie et al. [39] presented a method based on a double-Q network for obstacle avoidance tasks, using monocular RGB vision as input. Zuo et al. [40] built a robotic navigation system based on Q-learning, which helps a robot quickly adapt to unseen environments using sonar measurements as input. Zhang et al. [41] proposed a successor-feature-based DRL algorithm used for obstacle avoidance tasks relying on raw onboard sensor data. Tai et al. [42] presented a deep-network structure for obstacle avoidance tasks; they tested their model in real-world experiments and showed that the robot's control policy is highly similar to human decisions. Khan et al. [43] proposed a self-supervised policy-gradient algorithm and applied it to a LiDAR-based robot. These works showed that RL methods can make full use of a robot's sensory input and map it to appropriate action outputs for moving safely without collisions, and that the models trained in simulators can be successfully transferred to real-world robots for the same tasks.
3. Background

Normally, we refer to V^π(s) in Equation (1) as the state-value function, which measures the expected discounted return when starting in a state s and following a policy π:

V^π(s) = E[ R | s, π ] = E[ ∑_{k=0}^{∞} γ^k r_{t+k} | s, π ]   (1)

When actions follow the optimal policy π^*, the state-value function is optimal:

V^*(s) = max_π V^π(s),  ∀s ∈ S   (2)
In addition to measuring the value of states, there is also an indicator for measuring the quality of action selection, denoted the state-action-value or quality function Q^π(s, a). It defines the value of choosing an action a in a given state s and thereafter following a policy π:
Q^π(s, a) = E[ R | s, a, π ] = E[ ∑_{k=0}^{∞} γ^k r_{t+k} | s, a, π ]   (3)
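As a small numerical illustration of Equation (3) (with assumed values, not taken from the paper), the discounted return for a short reward sequence can be computed directly:

# Discounted return R = sum_k gamma^k * r_{t+k} for an assumed reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]
R = sum(gamma**k * r for k, r in enumerate(rewards))
print(R)   # 1.0 + 0.0 + 0.81*2.0 + 0.729*1.0 = 3.349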
The state-action value is similar to the state value V^π except that the initial action a is given, and the policy π is only followed from the succeeding state onwards. The optimal state-action-value function is denoted as:
Q^*(s, a) = max_π Q^π(s, a),  ∀s ∈ S, ∀a ∈ A   (4)
Q∗ (s, a) gives the maximum state-action value for state s and action a achievable by any policy.
This action-value function satisfies a recursive property, which is a fundamental property of value
functions in the RL setting, and it expresses a relationship between the value of a state and its
successor states:
Q^π(s, a) = E_{s'}[ r + γ E_{a'∼π(s')}[ Q^π(s', a') ] | s, a, π ]   (5)
Many successful value-based RL algorithms [32,35,46] rely on the idea of advantage updates, where the advantage function measures how much better an action is than the state value:

A^π(s, a) = Q^π(s, a) − V^π(s)   (6)

In our approach, we also adopt the advantage value to measure the relative quality of the actions at each step.
The DQN algorithm [21] learns the network parameters θ by minimizing the squared temporal-difference (TD) error between Q(s, a; θ) and the target

y_i^{DQN} = r + γ max_{a'} Q(s', a'; θ^−)   (8)
in which θ^− denotes the parameters of the target network. The first stabilizing method is to fix the target network's parameters rather than calculating the TD error against the network's own rapidly fluctuating Q-value estimates. The second, experience replay, uses a buffer that stores a certain number of transitions (s_t, a_t, r_{t+1}, s_{t+1}), which makes off-policy training possible and enhances the efficiency of data sampling.
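A minimal sketch of these two stabilizing ingredients (our own illustration under assumed interfaces, not the authors' code) could look as follows:

import random
from collections import deque

class ReplayBuffer:
    # Fixed-size buffer of transitions; old transitions are discarded first.
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the temporal correlation of consecutive transitions.
        return random.sample(self.buffer, batch_size)

def update_target(online_params, target_params, tau=1.0):
    # tau = 1.0 is the periodic hard copy used by DQN; tau < 1 gives the soft
    # update used later by DDPG-style methods.
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]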
A series of improvements in the value-based RL setting followed after the DQN algorithm ignited this field. To reduce the overestimation of Q-values in DQN, van Hasselt et al. [23] proposed the double DQN algorithm. Wang et al. [19] presented a dueling Q-network architecture that estimates the state-value function V(s) and the associated advantage function A(s, a) separately. Tamar et al. [48] proposed a value iteration network that can effectively learn to plan and leads to better generalization in many RL tasks. Schaul et al. [24] developed the PER approach built on top of double DQN, which makes experience replay more efficient and effective than replaying all transitions uniformly.
In the dueling architecture, the two streams share a common input and feature extraction layer (or lower layers). A deep Q-network focuses on estimating the value of every state-action pair. The idea of the dueling network, in contrast, is to estimate an action-independent state-value function and an action-dependent advantage function separately, because in RL environments not all states are related to a specific action: there are many states that are independent of the action, and in these states the agent does not need to change its action to adapt. Therefore, it is meaningless and inefficient to estimate the value of such state-action pairs. The dueling network was first presented by Wang et al. [19], and through this change training efficiency is greatly improved compared with single-stream Q-networks. According to Wang's work, the dueling network achieves a new state of the art for tasks in the discrete action space. In short, the Q-values generated by the dueling network are more beneficial to performance improvement than those of the deep Q-network in an RL task. In our approach, we adopt a dual-network design similar to the dueling architecture to generate appropriate Q-values. In Section 4, we discuss the ADC network's architecture and the aggregating method in detail.
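For a discrete action set, the two-stream idea described above can be sketched as follows (an illustrative example with arbitrary weights, not the architecture used in the cited works):

import numpy as np

def dueling_q(features, w_value, w_advantage):
    # Shared features feed a scalar value stream and a per-action advantage stream;
    # subtracting the mean advantage keeps V and A identifiable (see Section 4).
    v = features @ w_value            # V(s), shape (1,)
    a = features @ w_advantage        # A(s, a) for every discrete action
    return v + (a - a.mean())         # one Q-value per action

rng = np.random.default_rng(0)
features = rng.normal(size=8)         # assumed shared feature vector
print(dueling_q(features, rng.normal(size=(8, 1)), rng.normal(size=(8, 4))))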
Figure 1. Dueling Q-network (left) and standard single-stream Q-network (right). Both networks have the same input and feature extraction module, and the outputs are state-action values. The difference is that the dueling network adopts two network streams to estimate state values and action advantages and then combines them to indirectly generate the Q-values, whereas the Q-network has a single-stream network that directly produces the Q-value estimates.
Figure 2. The actor-critic network structure: an input and feature extraction module followed by an output layer that produces the policy π(a|s) and the state-action value Q(s, a).
4. Proposed Method
In this work, we propose an approach for operating in continuous action spaces. We name our method the ADC network; it can find stable policies in continuous action spaces and combines the benefits of the actor-critic network and the dueling network. The main structure of the ADC network (Figure 3) is similar to the actor-critic network and consists of two sequence networks. The actor network (left, blue part in Figure 3) computes continuous actions with the DPG method. The dueling-critic network (right, orange part in Figure 3) supplies the estimate of expected return as performance knowledge for the actor. A difference from actor-critic networks is the application of the dueling design (the A-network and V-network in Figure 3) in the original critic branch. The dueling-critic network consists of two sequences (or streams) of fully connected layers which provide separate estimates of the state value V(s) and the state-dependent advantage A(s, a). Then, an aggregating module combines the two streams to produce the estimate of the state-action value Q. In a continuous action space, we cannot output an advantage estimate for each possible action, so we add a new mechanism that enables the dueling network, originally used in discrete action spaces, to be used in continuous spaces. We manually divide the action space and estimate the advantage of each action interval in every state. Through this change, the agent can learn which action interval is good in a specific state and pick an action belonging to this interval. The action-advantage value is a relative value that measures the quality of the possible actions in one state. Meanwhile, it is a small quantity close to zero and is independent of the environmental state and noise. Therefore, it can be seen as a fine-tuning factor that improves the accuracy of the Q-value estimation.
Figure 3. The ADC network architecture. A shared input and feature extraction module feeds the actor branch, which outputs the policy π(a|s), and the dueling-critic branch, in which the A-network outputs the action intervals' advantages and the V-network outputs V(s); the two are aggregated into Q(s, a).
The dueling-critic branch provides Q-values to the actor, so the actor learns how good or bad the action taken was. Thus, an accurate Q-value estimation leads to better performance for actor-critic-based methods. In traditional actor-critic methods, the critic uses a single sequence network and Q-learning updates to estimate state-action values, which forces it to build connections between states and actions. In practice, however, many states are independent of the action, meaning that in some states the choice of action has no effect on what happens. Therefore, it is unnecessary to estimate the value of every state-action pair. In our method, the dueling-critic decouples the action and the state through its dual-network design. The value stream learns to pay attention to the state's value; the advantage stream learns to pay attention to the action interval's advantage in a state, thus making the Q estimation more accurate when the two separate values are combined. It also improves computational efficiency. The original dueling network focuses on solving RL problems with discrete actions; it cannot scale to continuous control tasks since it is a purely value-based method. The ADC method, however, can cope with continuous action spaces since it has an actor network responsible for selecting actions based on the policy. ADC combines the merits of the dueling architecture and the actor-critic framework. With an accurate state-action-value estimation, the actor-dueling-critic network can be more efficient in finding suitable policies than classic actor-critic methods.
From the advantage definition in Equation (6), we have Q^π(s, a) = V^π(s) + A^π(s, a). Under this definition, we build an aggregating module:

Q(s, a; θ^Q, α, β) = V(s; θ^Q, β) + A(s, a; θ^Q, α)   (10)

where θ^Q denotes the parameters of the first layer in the dueling-critic branch, and α and β are the network parameters of the advantage and value streams, respectively (the A-network and V-network). Q(s, a; θ^Q, α, β) is the output of the dueling-critic network, and it is a parameterized estimate of the true Q-function.
Equation (10) lacks identifiability: given Q, the values V and A cannot be recovered uniquely. To mitigate this issue, we force the advantage to be zero at the chosen action:

Q(s, a; θ^Q, α, β) = V(s; θ^Q, β) + ( A(s, a; θ^Q, α) − max_{a'∈|A|} A(s, a'; θ^Q, α) )   (11)

Through this change, when a = a^* = arg max_{a'∈|A|} Q(s, a'; θ^Q, α, β) = arg max_{a'∈|A|} A(s, a'; θ^Q, α), the advantage is zero and Q equals V. An alternative aggregating equation presented by Wang et al. [19] is:
Q(s, a; θ^Q, α, β) = V(s; θ^Q, β) + ( A(s, a; θ^Q, α) − (1/|A|) ∑_{a'} A(s, a'; θ^Q, α) )   (12)
It replaces the max operator with the mean. Equation (12) increases the stability of the optimization, because the advantages only need to change at the same pace as the mean rather than compensating for any change to the optimal action's advantage [19]. It also helps identifiability and does not change the relative rank of A. The original intention of the advantage technique is to measure relative value by comparing multiple actions in a state in a discrete action space. In this work, however, we focus on continuous action spaces, so we uniformly partition the action space into n intervals (Figure 3) according to the experimental environment, and we use z to denote an action interval. At each step, the A-network outputs the advantage of each action interval (z_1, z_2, ..., z_a, ..., z_{n−1}, z_n), and we subtract the mean of all intervals' advantages from the advantage of the interval z_a containing the action the actor network adopted. The advantage of the step in which the agent takes action a can thus be calculated with Equation (13):
A(s, a; θ^Q, α) = A(s, z_a; θ^Q, α) − (1/n) ∑_z A(s, z; θ^Q, α)   (13)
Therefore, the aggregating module of the ADC network can be presented as Equation (14):

Q(s, a; θ^Q, α, β) = V(s; θ^Q, β) + ( A(s, z_a; θ^Q, α) − (1/n) ∑_z A(s, z; θ^Q, α) )   (14)
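The following sketch (our own illustration with assumed shapes and action bounds, not the paper's implementation) shows how Equations (13) and (14) combine V(s), the interval advantages, and the actor's continuous action into a Q-value:

import numpy as np

def adc_q_value(v, interval_advantages, action, low, high):
    n = len(interval_advantages)
    width = (high - low) / n
    z_a = min(int((action - low) // width), n - 1)       # interval containing the chosen action
    step_advantage = interval_advantages[z_a] - interval_advantages.mean()   # Equation (13)
    return v + step_advantage                            # Equation (14)

# Example with assumed values: 20 intervals over an action range of [-2, 2].
adv = np.random.default_rng(1).normal(scale=0.1, size=20)   # A-network output (assumed)
print(adc_q_value(v=-3.2, interval_advantages=adv, action=0.37, low=-2.0, high=2.0))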
In the actor network branch, we apply the off-policy DPG algorithm [49]. We parameterize the policy as μ(s|θ^μ), which maps states to a specific action (μ : S → A). The actor network adjusts the parameters θ^μ of the policy in the direction of the performance gradient ∇_{θ^μ} J:

∇_{θ^μ} J ≈ E_s[ ∇_a Q(s, a; θ^Q, α, β) |_{a=μ(s|θ^μ)} ∇_{θ^μ} μ(s|θ^μ) ]
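In a deep implementation, this gradient can be followed by ascending the critic's Q estimate at the actor's own action. The sketch below is a PyTorch-style illustration under assumed placeholders (actor, critic, and actor_optimizer are not the paper's code; critic(states, actions) is assumed to return the aggregated Q-value of Equation (14)):

import torch

def actor_update(actor, critic, actor_optimizer, states):
    actions = actor(states)                        # a = mu(s | theta_mu)
    actor_loss = -critic(states, actions).mean()   # ascend Q by descending -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()                          # gradient flows through the critic into the actor
    actor_optimizer.step()
    return actor_loss.item()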
5. Experiments
We evaluated our approach on a gym classic control environment and a navigation task, both in the continuous action domain. The experiments include non-noisy and noisy environments to explore the stability of our method.
Table 1. Hyper-parameters.
As Figure 4 shows, the vanilla actor-critic method performs poorly and fluctuates violently; it is hard for the vanilla actor-critic method to learn a good policy without additional techniques such as experience replay. The dueling approach learns a policy very slowly; it achieves good performance after the 200th episode but still behaves unstably before the 400th episode. The ADC method overcomes the shortcomings of actor-critic and learns a stable policy quickly. From the beginning to the 50th episode, ADC and DDPG both reach a good level, and then ADC behaves more stably than DDPG (from the 50th to the 200th episode). After the 200th episode, the performance of ADC and DDPG is at the same level. To compare the stability of these two methods, we plot their variance in Figure 5.
The variance of ADC is significantly lower than that of DDPG in the initial stage, from the beginning to the 450th episode. After the 450th episode, DDPG's variance tends to decrease, and then both variances stay stable. In the initial part (50th–300th episodes), ADC obtains higher rewards than DDPG (Figure 4), and the reward variance of ADC is lower than that of DDPG (Figure 5). Therefore, in this task, the ADC approach learns a better policy, with higher performance (rewards) and stability, than the DDPG method. ADC's learning ability is also clearly better than that of the vanilla actor-critic and dueling approaches.
Figure 4. Performance of ADC, DDPG, dueling, and actor-critic in the gym Pendulum-v0 environment. The x-axis represents the episodes and the y-axis the cumulative reward per episode. Table 1 lists the important hyper-parameters.
Figure 6. The obstacle avoidance task. The agent measures its distance from obstacles with 8 sonars; the goal is to travel as far as possible without any collision.
The actor branches of the ADC and DDPG architectures have two hidden layers, with 100 neurons in the first hidden layer and 20 neurons in the second. The critic branch of DDPG has 100 and 20 neurons in the first and second hidden layers, respectively. The dueling-critic branch of ADC has 100 neurons in the first hidden layer, and 100 neurons (advantage stream) and 20 neurons (value stream) in the second hidden layers, respectively. We list the important hyper-parameters in Table 2.
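To make these layer sizes concrete, the two branches could be declared as in the following PyTorch-style sketch (our own illustration; the input dimension, activations, output bounds, and number of action intervals are assumed, since the paper does not publish code):

import torch.nn as nn

STATE_DIM, N_INTERVALS = 8, 20        # assumed: 8 sonar readings, 20 action intervals

# Actor branch: 100 -> 20 hidden units, one bounded continuous action.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, 100), nn.ReLU(),
    nn.Linear(100, 20), nn.ReLU(),
    nn.Linear(20, 1), nn.Tanh(),
)

class DuelingCritic(nn.Module):
    # Dueling-critic branch: shared 100-unit layer, then a 100-unit advantage
    # stream over action intervals and a 20-unit value stream.
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(STATE_DIM, 100), nn.ReLU())
        self.advantage = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, N_INTERVALS))
        self.value = nn.Sequential(nn.Linear(100, 20), nn.ReLU(), nn.Linear(20, 1))

    def forward(self, state):
        h = self.shared(state)
        return self.value(h), self.advantage(h)   # V(s) and interval advantages A(s, z)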
The experimental results are shown in Figure 7.
Figure 7. ADC and DDPG’s performance on obstacle avoidance task. The x-axis presents the episodes
and y-axis presents the cumulative rewards per episode. The whole process has 3500 episodes and
each episode has 500 steps.
Table 2. Hyper-parameters.
In terms of overall performance, both approaches reach a similar level in this task and quickly adapt to the environment within the first 100 episodes. In terms of stability, however, ADC is more stable than DDPG, especially around the 900th, 1800th, and 3300th episodes. To compare the stability of the two methods' rewards intuitively, we also plot the variance (Figure 8).
More intuitively, the variance of ADC's rewards shows a significant downward trend and is lower than DDPG's variance in most episodes. The variance of DDPG is particularly large at the 900th, 1350th, and 1800th episodes. In contrast, ADC behaves more stably than DDPG over the whole process. After the 2500th episode, DDPG's variance fluctuates less violently and maintains the same level as ADC's. To sum up, the ADC approach can tackle this continuous control task, and it learns a more stable policy than the DDPG method during long-term training.
Table 3. Hyper-parameters.
The results show that training is affected by the noise disturbance, which is reflected in a slower learning rate and worse stability. Nevertheless, the ADC method outperforms DDPG in both convergence speed and stability. When n = 60, 100, 140, training converges at around the 120th episode, whereas DDPG needs more than 200 episodes. As Figure 9 shows, when n = 20, the performance of ADC is only slightly better than that of DDPG, while at n = 60 the improvement is more obvious. When n = 100, 140 (Figure 10), the overall performance of the two settings is similar, and the stability with 100 intervals is slightly better. With n = 100, 140, the curves rise more steeply in the first 50 episodes than with 20 or 60 action intervals, but the upward trend becomes slower after the 50th episode.
Figure 9. ADC and DDPG’s performance on Pendulum-v0 environment with noise input. The action
intervals of ADC network are 20, 60, respectively. When n is 20, ADC has a limited promotion compared
with DDPG. When n is 60, the convergence speed and stability have been greatly improved.
Figure 10. The action intervals of the ADC network are 100 and 140, respectively. The performance of the two groups is similar, and both are better than DDPG. The stability with 100 intervals is slightly better than with 140.
6. Discussion
The combination of the dueling architecture and the actor-critic network allows our approach to use action advantages as an auxiliary value for Q-value estimation, and hence helps the policy select a correct action in a continuous action domain. As the noise-free experiment shows, our approach overcomes the shortcomings of the actor-critic network, which cannot learn a good policy and whose performance is significantly unstable, while the dueling network learns slowly in the continuous control task. DDPG is a successful actor-critic-based method with good results in continuous control tasks; the ADC method achieves better results with slightly higher stability. The navigation task demonstrated that the ADC approach attains higher long-term stability than DDPG in a noise-free environment. Furthermore, ADC's average reward over the whole period is also higher than that of DDPG. Meanwhile, the navigation task also shows the feasibility of our method in the field of real-world navigation. We directly applied the trained model to unseen simulator environments by changing the path and its width, and the agent avoided obstacles without any collision, showing that the trained model has generalization ability. To further explore the performance of our method in noisy environments, we designed the second experiment, in which we also studied the effect of different numbers of action intervals on overall performance. The experimental results show that ADC is less sensitive to environmental noise than DDPG, even though the noise makes the performance of both fluctuate a little. From the exploration of different action intervals, the preliminary conclusion is that with a small number such as n = 20 the improvement over DDPG is not very obvious, whereas with n = 60 or 100 the overall effect is much better. Increasing the number further, for example to n = 140, does not obviously improve the effect but does increase the training time and computational resources. The specific impact of the number of action intervals needs further study. Overall, both ADC and DDPG work well in continuous action spaces, and in noisy environments the learning efficiency and stability of ADC are better than those of DDPG.
7. Conclusions
This paper introduces a novel ADC approach for solving the obstacle avoidance task of sensor-based robots, which is a continuous control problem. ADC is based on the actor-critic network and is more efficient than the original vanilla actor-critic method. Continuous control ability is a fundamental requirement for autonomous robots that interact with the real environment. We used the navigation scenario to test the performance of the ADC algorithm in the obstacle avoidance task, and the results show that the obstacle avoidance problem of sensor-based robots can be well solved by the ADC algorithm. To improve training stability, the algorithm uses a series of techniques such as experience replay, a target network, soft updates, and ε-greedy exploration. These techniques make the learning process more stable and improve the sample utilization of the replay buffer. In addition, since the traditional method of state-action estimation hinders the performance improvement of actor-critic-based algorithms, we introduce a dueling-critic network which decouples states and actions and estimates the state value and the action-interval advantage separately. By aggregating the two values, the dueling-critic outputs the state-action values, and the actor network then updates its parameters according to the Q-value. The dueling structure improves the accuracy of the Q-value estimation in noisy environments by using the advantage technique. Through the combination of the dueling and actor-critic networks, ADC works well and remains stable in a noisy environment. We conducted experiments to examine the algorithm and compare it with other methods: a vanilla actor-critic method, a dueling network method, and the DDPG method. In the gym Pendulum-v0 experiment, our approach quickly adapts to the environment and shows high efficiency and stability in dealing with continuous control problems. In the navigation environment, the results show that our method solves the obstacle avoidance problem and that its training performance is stable and reliable. Furthermore, we designed a noisy environment to compare the training efficiency of ADC and DDPG; the superiority of ADC in the noisy environment is more obvious, indicating that our approach improves training efficiency.
There are some problems we plan to address in future work. First, the stability and efficiency of the ADC network need further investigation, especially for more complex problems and application scenarios. Second, the influence of the interval advantage on performance needs to be further explored. Third, in dealing with the action-interval advantage, we need to explore how to divide the action space reasonably in complex environments, for example by dividing it adaptively. Fourth, the method will be transferred to a real laser-equipped robot to test its performance on obstacle avoidance tasks.
Author Contributions: Conceptualization, M.W.; Funding acquisition, Y.G.; Investigation, M.W. and S.D.;
Methodology, M.W., A.J. and Q.Z.; Project administration, Y.G.; Software, M.W. and S.D.; Supervision, Y.G.
and A.J.; Validation, M.W., A.J. and Q.Z.; Writing—original draft, M.W.; Writing—review & editing, Y.G., A.J.
and Q.Z.
Funding: Menghao Wu is sponsored by the China Scholarship Council (CSC) grant number 201706680063 for his
joint Ph.D. research program at Aalto University, Finland. This work was partially supported by the National
Natural Science Foundation of China (NSFC) grant number 61803118.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Khatib, O. Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Robot. Res. 1986, 5, 90–98. [CrossRef]
2. Djekoune, A.O.; Achour, K.; Toum, R. A sensor based navigation algorithm for a mobile robot using the
DVFF approach. Int. J. Adv. Robot. Syst. 2009, 6, 97–108. [CrossRef]
3. Spero, D.J.; Jarvis, R.A. A New Solution to the Simultaneous Localization and Map Building Problem. IEEE
Trans. Rob. Autom 2005, 17, 229–241.
4. Bonin-Font, F.; Ortiz, A.; Oliver, G. Visual navigation for mobile robots: A survey. J. Intell. Robot. Syst.
Theory Appl. 2008, 53, 263–296, doi:10.1007/s10846-008-9235-4. [CrossRef]
5. Tai, L.; Zhang, J.; Liu, M.; Boedecker, J.; Burgard, W. A Survey of Deep Network Solutions for Learning
Control in Robotics: From Reinforcement to Imitation. arXiv 2016, arXiv:1612.07139.
6. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724,
doi:10.1177/0278364914549607. [CrossRef]
7. Zhou, X.; Gao, Y.; Guan, L. Towards goal-directed navigation through combining learning based global and
local planners. Sensors 2019, 19, 176, doi:10.3390/s19010176. [CrossRef] [PubMed]
8. Fragkos, G.; Apostolopoulos, P.A.; Tsiropoulou, E.E. ESCAPE: Evacuation strategy through clustering
and autonomous operation in public safety systems. Future Internet 2019, 11, 20, doi:10.3390/fi11010020.
[CrossRef]
9. Narendra, K.S.; Lakshmivarahan, S. Learning Automata: A Critique. J. Cybern. Inf. Sci. 1987, 1, 53–66.
10. Chaplot, D.S.; Lample, G.; Sathyendra, K.M.; Salakhutdinov, R. Transfer Deep Reinforcement Learning in 3D
Environments: An Empirical Study. In Proceedings of the NIPS Deep Reinforcemente Leaning Workshop,
Barcelona, Spain, 9 December 2016.
11. Zamora, I.; Lopez, N.G.; Vilches, V.M.; Cordero, A.H. Extending the OpenAI Gym for robotics: A toolkit for
reinforcement learning using ROS and Gazebo. arXiv 2016, 1–6, arXiv:1608.05742.
12. Tassa, Y.; Doron, Y.; Muldal, A.; Erez, T.; Li, Y.; Casas, D.d.L.; Budden, D.; Abdolmaleki, A.; Merel, J.;
Lefrancq, A.; et al. DeepMind Control Suite. arXiv 2018, arXiv:1801.00690.
13. Sutton, R.S.; Barto, A.G. [Draft-2] Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA,
USA; London, UK, 2013.
14. Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444, doi:10.1038/nature14539.
[CrossRef]
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the NIPS, Lake Tahoe, NV, USA, 7–8 December 2012.
16. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of
the NIPS, Montreal, QC, Canada, 8–13 December 2014; pp. 2204–2212.
17. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y.
Generative Adversarial Nets. In Proceedings of the NIPS, Montreal, QC, Canada, 8–13 December 2014;
pp. 2672–2680. doi:10.1001/jamainternmed.2016.8245.
18. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. A Brief Survey of Deep Reinforcement
Learning. arXiv 2017, 1–16, arXiv:1708.05866.
19. Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling Network Architectures
for Deep Reinforcement Learning. arXiv 2015, arXiv:1511.06581.
20. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari
with Deep Reinforcement Learning. arXiv 2013, 1–9, arXiv:1312.5602.
22. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.;
Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural
networks and tree search. Nature 2016, 529, 484–489, doi:10.1038/nature16961. [CrossRef] [PubMed]
23. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. Artif. Intell.
2016, 230, 173–191, doi:10.1016/j.artint.2015.09.002. [CrossRef]
24. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2015, 1–21,
arXiv:1511.05952.
25. Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, R.; De Maria, A.; Panneershelvam, V.; Suleyman, M.;
Beattie, C.; Petersen, S.; et al. Massively Parallel Methods for Deep Reinforcement Learning. arXiv 2015,
arXiv:1507.04296.
26. Barron, T.; Whitehead, M.; Yeung, A. Deep Reinforcement Learning in a 3-D Blockworld Environment.
In Proceedings of the International Joint Conference on Artificial Intelligence, New York, NY, USA,
9–15 July 2016.
27. Mirowski, P.; Pascanu, R.; Viola, F.; Soyer, H.; Ballard, A.J.; Banino, A.; Denil, M.; Goroshin, R.; Sifre, L.;
Kavukcuoglu, K.; et al. Learning to Navigate in Complex Environments. arXiv 2016, arXiv:1611.03673.
28. Haarnoja, T.; Zhou, A.; Ha, S.; Tan, J.; Tucker, G.; Levine, S.; Dec, L.G. Learning to Walk via Deep
Reinforcement Learning. arXiv 2018, arXiv:1812.11103.
29. Haarnoja, T.; Pong, V.; Zhou, A.; Dalal, M.; Abbeel, P.; Levine, S. Composable Deep Reinforcement Learning
for Robotic Manipulation. In Proceedings of the 2018 IEEE International Conference on Robotics and
Automation, Brisbane, QLD, Australia, 21–25 May 2018. doi:10.1038/nature20101.
30. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement Learning with Deep Energy-Based Policies.
In Proceedings of the ICML’17 34th International Conference on Machine Learning, Sydney, NSW, Australia,
6–11 August 2017; pp. 1352–1361.
31. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep
Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290.
32. Gu, S.; Lillicrap, T.; Sutskever, I.; Levine, S. Continuous Deep Q-Learning with Model-based Acceleration.
In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
33. Sutton, R.S.; Mcallester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with
Function Approximation. In Proceedings of the NIPS, Denver, CO, USA, 1 January 2000; pp. 1057–1063,
doi:10.1.1.37.9714. [CrossRef]
34. Wu, C.; Rajeswaran, A.; Duan, Y.; Kumar, V.; Bayen, A.M.; Kakade, S.; Mordatch, I.; Abbeel, P. Variance
Reduction for Policy Gradient with Action-Dependent Factorized Baselines. arXiv 2018, arXiv:1803.07246.
35. Mnih, V.; Badia, A.; Mirza, M.; Graves, A.; Lillicrap, T. Asynchronous methods for deep reinforcement
learning. In Proceedings of the International Conference on Machine Learning Machine Learning, New York,
NY, USA, 19–24 June 2016; Volume 48.
36. O’Donoghue, B.; Munos, R.; Kavukcuoglu, K.; Mnih, V. Combining policy gradient and Q-learning. arXiv
2016, arXiv:1611.01626.
37. Tai, L.; Paolo, G.; Liu, M. Virtual-to-real Deep Reinforcement Learning: Continuous Control of Mobile
Robots for Mapless Navigation. In Proceedings of the 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 31–36.
doi:10.1109/IROS.2017.8202134. [CrossRef]
38. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven Visual Navigation
in Indoor Scenes using Deep Reinforcement Learning. In Proceedings of the IEEE international conference
on robotics and automation (ICRA), Singapore, 29 May–3 June 2017. doi:10.1109/ICRA.2017.7989381.
39. Xie, L.; Wang, S.; Markham, A.; Trigoni, N. Towards Monocular Vision based Obstacle Avoidance through
Deep Reinforcement Learning. arXiv 2017, arXiv:1706.09829, doi:10.1016/j.renene.2009.02.025. [CrossRef]
40. Zuo, B.; Chen, J.; Wang, L.; Wang, Y. A reinforcement learning based robotic navigation system.
In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, San Diego, CA, USA,
5–8 October 2014; pp. 3452–3457, doi:10.1109/smc.2014.6974463. [CrossRef]
41. Zhang, J.; Springenberg, J.T.; Boedecker, J.; Burgard, W. Deep Reinforcement Learning with Successor
Features for Navigation across Similar Environments. In Proceedings of the 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017.
42. Tai, L.; Li, S.; Liu, M. A deep-network solution towards model-less obstacle avoidance. In Proceedings of
the IEEE International Conference on Intelligent Robots and Systems, Daejeon, Korea, 9–14 October 2016;
pp. 2759–2764, doi:10.1109/IROS.2016.7759428. [CrossRef]
43. Khan, A.; Kumar, V.; Ribeiro, A. Learning Sample-Efficient Target Reaching for Mobile Robots. In Proceedings
of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain,
1–5 October 2018.
44. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054,
doi:10.1109/TNN.1998.712192. [CrossRef]
45. Schaul, T.; Horgan, D.; Gregor, K.; Silver, D. Universal Value Function Approximators. In Proceedings of the
32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1312–1320.
46. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-Dimensional Continuous Control Using
Generalized Advantage Estimation. arXiv 2015, arXiv:1506.02438 .
47. Zhan, Y.; Ammar, H.B.; Taylor, M.E. Human-level control through deep reinforcement learning.
In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, New York, NY, USA,
9–15 July 2016; pp. 2315–2321, doi:10.1038/nature14236. [CrossRef]
48. Tamar, A.; Wu, Y.; Thomas, G.; Levine, S.; Abbeel, P. Value Iteration Networks. Adv. Neural Inf. Process. Syst.
2016, 29, 2154–2162.
49. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient
Algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing,
China, 21–26 June 2014; pp. 387–395.
50. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with
asynchronous off-policy updates. In Proceedings of the IEEE International Conference on Robotics and
Automation, Singapore, 29 May–3 June 2017; pp. 3389–3396, doi:10.1109/ICRA.2017.7989385. [CrossRef]
51. Levine, S.; Koltun, V. Guided Policy Search. In Proceedings of the 30th International Conference on
Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 28, pp. 1–9, doi:10.1109/ICRA.2015.7138994.
[CrossRef]
52. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res.
2016, 17, 1–40, doi:10.1007/s13398-014-0173-7.2. [CrossRef]
53. Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the 2006 IEEE/RSJ International
Conference on Intelligent Robots and Systems, Beijing, China, 9–15 October 2006; pp. 2219–2225,
doi:10.1109/IROS.2006.282564. [CrossRef]
54. Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Tassa, Y.; Erez, T. Learning Continuous Control Policies by
Stochastic Value Gradients. In Proceedings of the NIPS, Montreal, QC, Canada, 11–12 December 2015;
pp. 2944–2952.
55. Konda, V.R.; Tsitsiklis, J.N. Actor-critic algorithms. In Proceedings of the NIPS, Denver, CO, USA,
1 January 2000; pp. 1008–1014.
56. Grondman, I.; Buoniu, L.; Lopes, G.A.D.; Babuška, R. A Survey of Actor-Critic Reinforcement Learning:
Standard and Natural Policy Gradients. IEEE Trans. Syst. Man Cybern. Part C 2012, 42, 1291–1307,
doi:10.1109/TSMCC.2012.2218595. [CrossRef]
57. Levy, A.; Platt, R.; Saenko, K. Hierarchical Actor-Critic. arXiv 2017, arXiv:1712.00948.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).