

Explainable Deep Reinforcement Learning for UAV Autonomous Navigation

Lei He, Student Member, IEEE, Aouf Nabil, Member, IEEE, Bifeng Song

arXiv:2009.14551v2 [cs.RO] 2 Feb 2021

Lei He and Bifeng Song are with the School of Aeronautics, Northwestern Polytechnical University, Xi'an 710072, China (email: [email protected]). Aouf Nabil is with the Department of Electrical and Electronic Engineering, City University of London, EC1V 0HB, UK.

Abstract—Autonomous navigation in unknown complex environments is still a hard problem, especially for small Unmanned Aerial Vehicles (UAVs) with limited computation resources. In this paper, a neural network-based reactive controller is proposed for a quadrotor to fly autonomously in unknown outdoor environments. The navigation controller uses only the current sensor data to generate the control signal, without any optimization or configuration-space search, which reduces both the memory and the computation requirements. The navigation problem is modelled as a Markov Decision Process (MDP) and solved using a deep reinforcement learning (DRL) method. Specifically, to get a better understanding of the trained network, several model explanation methods are proposed. Based on feature attribution, each decision made during flight is explained with both a visual and a textual explanation. Moreover, some global analyses are provided for experts to evaluate and improve the trained neural network. The simulation results illustrate that the proposed method can produce useful and reasonable explanations for the trained model, which is beneficial for both non-expert users and the controller designer. Finally, real-world tests show that the proposed controller can navigate the quadrotor to the goal position successfully and that the reactive controller runs much faster than a conventional approach under the same computation resources.

Index Terms—Explainable, Deep reinforcement learning, UAV obstacle avoidance.

I. INTRODUCTION

UNMANNED Aerial Vehicles (UAVs) have been widely used in many applications, such as goods delivery and emergency surveying and mapping. Autonomous navigation in large unknown complex environments is an essential capability for these UAVs to operate more intelligently and safely.

In general, there are two main solutions for UAV obstacle avoidance. The first solution relies on a state estimator using VIO or SLAM, and then generates safe trajectories using optimization-based [1], [2] or searching-based methods. It is a cascaded process that includes mapping, localization, planning and control. This kind of method can generate nearly optimal trajectories for optimization objectives such as safety and smoothness; however, it requires a lot of computation and memory to store the map and to run the optimization algorithms at every step. In addition, these techniques also suffer from high drift and noise, impacting the quality of both the localization and the map used for planning. The other solution is a reactive control method, which generates control commands from the perception information directly [3], [4]. This method requires less computation and memory because the control signal is obtained with a single forward calculation and no map has to be maintained during flight. This property makes it promising for micro UAVs with size, weight and power (SWaP) constraints. The weakness is that it is usually non-optimal because of the lack of global information. Also, the design of the reactive policy usually relies on expert experience.

Reactive UAV navigation based on only the current information can be formulated as a sequential decision-making problem. Some researchers have modelled this problem as a Markov decision process (MDP) and solved it using reinforcement learning (RL) methods. For example, Ross et al [5] built an Imitation Learning (IL)-based controller using a small set of human demonstrations and achieved good performance in natural forest environments. Imanberdiyev et al [6] developed a high-level control method for autonomous navigation of UAVs using a novel model-based reinforcement learning method, TEXPLORE. He et al [7] combined a bio-inspired monocular vision perception method with a deep reinforcement learning (DRL) reactive local planner to address the UAV navigation problem. They also proposed a learning-from-demonstration method to speed up the training process [8]. Wang et al [9] formulated the navigation problem as a partially observable Markov decision process (POMDP) and solved it with a novel online DRL algorithm. They also investigated the sparse-reward situation using a learn-with-help (LwH) method [10]. Compared with traditional rule-based reactive controllers, a control policy trained by DRL can obtain near-optimal actions in the training environment. Moreover, relying on the powerful feature extraction capacity of deep neural networks (DNNs), the trained policy can extract features autonomously without human design and usually achieves better performance.

Although DRL methods can achieve excellent performance, an enormous problem is that deep learning methods turn out to be uninterpretable "black boxes," which creates serious challenges for Artificial Intelligence (AI) systems based on neural networks [11]. This problem falls within the so-called eXplainable AI (XAI) field. Arrieta et al give a review of XAI, including concepts, taxonomies, opportunities and challenges toward responsible AI [12].

Compared with the burst of XAI research in supervised learning, explainability for RL is hardly explored [13]. Juozapaitis et al [14] explain the RL agent using reward decomposition. This approach decomposes the reward into sums of semantically meaningful reward types so that actions can be compared in terms of trade-offs among the types. Reward decomposition is also used in strategic tasks such as StarCraft II [15].

Jung Hoon Lee [16] proposed a method to derive a secondary comprehensible agent from an NN-based RL agent, whose decisions are based on simple rules. Beyret et al [17] proposed an explainable RL approach for robotic manipulation. They presented a hierarchical DRL system that includes both a low-level agent handling actions and a high-level agent learning the dynamics and the environment; the high-level agent is used to provide interpretations for the human operator. Madumal et al [18] use causal models to derive causal explanations of the behaviour of a model-free reinforcement learning agent. A structural causal model is learned during the reinforcement learning phase, and the explanations of behaviour are generated based on counterfactual analysis of the causal model. They also introduced a distal explanation model that can analyse counterfactuals and opportunity chains using decision trees and causal models [19].

Explainability is critical and essential for a DRL-based UAV navigation system. On the one hand, it is useful for non-expert users to know why the controller turns right rather than left when it faces an obstacle. On the other hand, it also helps the network and controller designers to understand the decision-making process and make adjustments to improve the network performance.

In this paper, an end-to-end neural network is proposed to address the UAV reactive navigation problem in complex unknown environments for small UAVs with SWaP constraints. The network is trained using a DRL method in a high-fidelity simulation environment. Then, a post-hoc explanation method is proposed to provide explainable information about the trained network. Compared with transparent-model methods, post-hoc methods can provide explanations of an RL policy after its training, which preserves the model performance. To get a better understanding of the trained network, both visual and textual explanations of each model output are provided as local explanations for non-expert users. Moreover, some global explanations are provided for experts to analyze and improve the network.

Our main contributions can be summarised as follows:

• A DNN-based reactive controller for UAV navigation is learned using a DRL method, which can be used by small UAVs with limited computation resources or in scenarios that need very rapid reaction to environment changes.
• A novel CNN attention visualization method as well as a textual explanation are provided, based on a feature attribution that is fairer than gradients.
• Local and global explanations are provided for non-expert users and for experts to diagnose the trained DNN model.
• Real-world experiments are carried out to validate the trained network and to show the computational efficiency of our reactive controller compared with a conventional searching-based approach.

II. PRELIMINARIES

A. MDP and DRL

In this work, the navigation and obstacle avoidance problem is formulated as an MDP. An MDP is defined by a tuple ⟨S, A, R, P, γ⟩, which consists of a set of states S, a set of actions A, a reward function R(s, a), a transition function P(s′|s, a), and a discount factor γ ∈ (0, 1). In each state s ∈ S, the agent takes an action a ∈ A. After executing the action a in the environment, the agent receives a reward R(s, a) and reaches a new state s′, determined from the probability distribution P(s′|s, a).

Solutions for MDPs with finite state and action spaces can be found through a variety of methods such as dynamic programming, especially when the transition probabilities are given. However, in most MDPs the transition probabilities or the reward functions are not available. In this situation, to solve the MDP we need to interact with the environment to gather information, which is the RL approach. The goal of RL is to find a policy π mapping states to actions that maximizes the expected discounted total reward over the agent's lifetime. This concept is formalized by the action value function

  Q^π(s, a) = E^π [ Σ_{t=0}^{T} γ^t R(s_t, a_t) ],

where E^π is the expectation over the distribution of admissible trajectories (s_0, a_0, s_1, a_1, ...) obtained by running the policy π starting from s_0 = s and a_0 = a. The action value function can be defined by a tabular mapping of discrete inputs and outputs. However, this tabular mapping is limiting for continuous states or an infinite/large number of states. Different from traditional RL algorithms, DRL algorithms use a DNN to approximate the action value function, which makes it possible to deal with complex problems with an infinite/large number of states.
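To make the action value function above concrete, it can be estimated by Monte Carlo rollouts: execute a in s, then follow π and accumulate discounted rewards. The sketch below is a minimal illustration in Python, assuming a generic Gym-style environment interface and an environment that can be reset to a chosen state; it is not code from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one trajectory."""
    return float(sum(r * gamma ** t for t, r in enumerate(rewards)))

def estimate_q(env, policy, s0, a0, gamma=0.99, episodes=50, max_steps=500):
    """Monte Carlo estimate of Q^pi(s0, a0): take a0 first, then follow the policy."""
    returns = []
    for _ in range(episodes):
        s = env.reset(state=s0)          # assumes the env can be reset to a given state
        a, rewards = a0, []
        for _ in range(max_steps):
            s, r, done, _ = env.step(a)
            rewards.append(r)
            if done:
                break
            a = policy(s)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```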

B. Explainable AI and Explainable RL

In recent years, AI has achieved notable momentum and lies at the core of many activity sectors. However, because of the black-box property of deep neural network models, the demand for transparency from the various stakeholders in AI is increasing. In order to avoid limiting the effectiveness of the current generation of AI systems, XAI techniques are proposed to produce more explainable models while maintaining a high level of learning performance, enabling humans to understand and trust the emerging generation of artificially intelligent partners [12].

There are two kinds of methods to increase the transparency of AI models; a widely accepted classification is between transparent models and post-hoc XAI techniques. A model is considered transparent if it is understandable by itself, such as linear regression, decision trees, rule-based models, etc. This kind of model is usually simple enough to be understood by humans or is designed from manually set rules. However, as model prediction accuracy increases, more and more models use deep and complex neural networks, and such models cannot be easily understood by humans directly. Thus, post-hoc XAI techniques are important to handle such complex models and provide some inner view.

The success of DRL could augur an imminent arrival in the industrial world. However, like many Machine Learning algorithms, RL algorithms suffer from a lack of explainability. Although a large body of XAI literature is emerging to explain DNN outputs, assessing how XAI techniques can help understand models beyond classification tasks, e.g. for reinforcement learning (RL), has not been extensively studied [13]. Furthermore, DRL models are usually complex to debug for developers, as they rely on many factors such as the environment, the reward function, the observation, and even the algorithm used for training the policy. Thus, there is an urgent demand for explainable RL (XRL).

C. Feature Attribution

Feature attribution is a common method to analyse a trained DNN model. Formally, suppose we have a function F : R^n → [0, 1] that represents a deep neural network and an input x = (x_1, ..., x_n) ∈ R^n. An attribution of the prediction at input x relative to a baseline input x′ is a vector A_F(x, x′) = (a_1, ..., a_n) ∈ R^n, where a_i is the contribution of x_i to the prediction F(x). There are two different types of feature attribution algorithms, Shapley-value-based algorithms and gradient-based algorithms, and there is a fundamental difference between the two.

The Shapley value is a classic method to distribute the total gains of a collaborative game to a coalition of cooperating players. It is a fair way to attribute the total gain to the players based on their contributions. For ML models, we formulate a game for the prediction at each instance: we consider the "total gain" to be the prediction value for that instance and the "players" to be the model features of that instance, so the collaborative game is all of the model features cooperating to form a prediction value. A Shapley-value-based explanation method tries to approximate the Shapley values of a given prediction by examining the effect of removing a feature under all possible combinations of presence or absence of the other features. Shapley values are the only additive feature attribution method that satisfies the desirable properties of local accuracy, missingness, and consistency. However, exact Shapley value computation is exponential in the number of features.

Besides Shapley values, gradients can also be used as the feature attribution. A gradient-based explanation method tries to explain a given prediction by using the gradient of the output with respect to the input features. However, the problem with gradients is that they break sensitivity, a property that attribution methods should satisfy. For example, consider a one-variable, one-ReLU network, f(x) = 1 − ReLU(1 − x). Suppose the baseline is x = 0 and the input is x = 2. The output changes from 0 to 1, but the gradient is zero at x = 2 because f becomes flat after x = 1, so the gradient method gives an attribution of 0 to x. This phenomenon has been reported in [26]. To address this problem, Sundararajan et al [27] proposed the Integrated Gradients (IG) algorithm. However, this algorithm requires computing the gradients of the model output on a number of different inputs (typically 50) between the current feature value and the baseline value.
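The sensitivity failure described above is easy to verify numerically. The following toy check (our own illustration in PyTorch, not code from the paper) shows that the gradient at x = 2 is zero even though the prediction differs from the baseline.

```python
import torch

def f(x):
    # one-variable, one-ReLU network: f(x) = 1 - ReLU(1 - x)
    return 1 - torch.relu(1 - x)

x = torch.tensor(2.0, requires_grad=True)
y = f(x)
y.backward()

print(f(torch.tensor(0.0)).item())   # baseline output: 0.0
print(y.item())                      # output at x = 2: 1.0
print(x.grad.item())                 # gradient at x = 2: 0.0, so a gradient attribution is 0
```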

D. SHAP and DeepSHAP

SHAP (SHapley Additive exPlanations), proposed by Lundberg and Lee [20], assigns each feature an importance value for a particular prediction. For a simple linear regression problem, the predictions can be written as

  ŷ_i = b_0 + b_1 x_{1i} + ··· + b_d x_{di}    (1)

where ŷ_i is the i-th predicted response, x_{1i}, ..., x_{di} are the features of the current observation, and b_0, ..., b_d are the estimated regression coefficients. If the features are independent, the contribution of the k-th feature to the predicted response ŷ_i can be unambiguously expressed as b_k x_{ki} for k = 1, ..., d. SHAP is a generalization of this concept to more complex neural network models. We define the following:

• F is the entire set of features, and S denotes a subset.
• S ∪ {i} is the union of the subset S and feature i.
• E[f(X) | X_S = x_S] is the conditional expectation of the model f(·) when a subset S of features is fixed at the local point x.

Then, the SHAP value is defined to measure the contribution of the i-th feature as

  Φ_i = Σ_{S ⊆ F\{i}} ( |S|! (|F| − |S| − 1)! / |F|! ) [ f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) ]    (2)

SHAP values are proven to satisfy good properties such as fairness and consistency in attributing importance scores to each feature, but their calculation is computationally expensive. In our case, we use Deep SHAP, a model-specific method that improves computational performance through a connection between Shapley values and DeepLIFT [21].

DeepSHAP [22] is a framework for layer-wise propagation of Shapley values that builds upon DeepLIFT [21]. If we define including an input as setting it to its actual value instead of its reference value, DeepLIFT can be thought of as a fast approximation method of the Shapley values. If the model is fully linear, we can get exact SHAP values by summing the attributions along all possible paths between input x_i and the model's output y. However, in our network, for example in the fully connected part, non-linear activation functions such as ReLU, tanh or sigmoid are applied after the linear layers. To deal with the non-linear part, DeepLIFT provides the Rescale rule and the RevealCancel rule. Passing back nonlinear attributions linearly is an approximation, but there are two main benefits: 1) fast computation using only one backward pass and 2) a guarantee of local accuracy.
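For a model with only a handful of features, Eq. (2) can be evaluated exactly by enumerating all subsets, with "missing" features replaced by their reference values. The brute-force sketch below is our own illustration of the definition, not the Deep SHAP algorithm used in the paper; in practice the shap library's DeepExplainer performs the layer-wise approximation described above.

```python
import itertools
import math

def shap_values_exact(f, x, x_ref):
    """Exact Shapley values of f at x (Eq. 2); features not in S are set to
    their reference values x_ref. Exponential cost, so small d only."""
    d = len(x)
    phis = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        phi = 0.0
        for k in range(len(others) + 1):
            for S in itertools.combinations(others, k):
                w = math.factorial(len(S)) * math.factorial(d - len(S) - 1) / math.factorial(d)
                x_S = [x[j] if j in S else x_ref[j] for j in range(d)]
                x_Si = [x[j] if (j in S or j == i) else x_ref[j] for j in range(d)]
                phi += w * (f(x_Si) - f(x_S))
        phis.append(phi)
    return phis

# Sanity check on a linear model: attributions recover b_k * (x_k - x_ref_k).
f = lambda z: 1.0 + 2.0 * z[0] - 3.0 * z[1]
print(shap_values_exact(f, x=[1.0, 1.0], x_ref=[0.0, 0.0]))   # -> [2.0, -3.0]
```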

III. DRL-BASED UAV NAVIGATION

In this section, we introduce a DRL-based reactive controller to solve the UAV navigation problem in unknown environments. In contrast to conventional simultaneous localization and mapping-based methods, we control the UAV using only the current sensor data. This kind of controller can react quickly in complex environments, which is beneficial for small UAVs with limited computation resources.

Reactive navigation in an unknown environment can be treated as a sequential decision problem. In each time step, the current sensor information is used to generate the final control signal. This means the action a depends only on the current state s, and the next state s′ depends on the current state s and the decision maker's action a. This kind of problem can be modelled as an MDP after defining a corresponding reward function R_a(s, s′).

A. Problem Formulation

Suppose the UAV takes off from a departure position in a 3-D environment, denoted as (x_0, y_0, z_0) in the Earth-fixed coordinate frame, and targets a destination denoted as (x_d, y_d, z_d). The observation or state at time t consists of both the raw depth image and the UAV state features: o_t = [o_t^depth, o_t^state]. The state feature consists of the relative position to the goal and the current velocity information: o_t^state = [d_xy^t, d_z^t, ξ^t, v_xy^t, v_z^t, φ^t], where d_xy^t and d_z^t denote the distance between the UAV's current position and the destination in the x-y plane and along the z axis, ξ^t is the relative angle between the UAV's current first-person heading and the destination, v_xy^t and v_z^t are the UAV's current horizontal and vertical speed, and φ^t is the steering angular speed. The action a = [v_xy^cmd, v_z^cmd, φ^cmd] generated by the policy network π(s) consists of two linear velocities and one angular velocity. These actions are passed to the low-level controller as a velocity setpoint command to achieve the navigation. The network architecture of the navigation network is shown in Fig. 1.

Fig. 1. Network architecture of our control policy. The input is the raw depth image and UAV states such as the current speed and the relative position to the goal. The features in the depth image are extracted using a CNN. A global average pooling layer then produces the intensity of each visual feature, which is fed to the fully connected network together with the state features. The outputs are 3 control commands: forward, climb and steering speed.
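For concreteness, the observation and action above might be packed as follows. This is an illustrative sketch with names and conventions of our own choosing, not the paper's implementation.

```python
import numpy as np

def build_observation(depth_image, uav_pos, goal_pos, v_xy, v_z, yaw_rate, heading):
    """o_t = [o_depth, o_state] with o_state = [d_xy, d_z, xi, v_xy, v_z, phi]."""
    dx, dy = goal_pos[0] - uav_pos[0], goal_pos[1] - uav_pos[1]
    d_xy = np.hypot(dx, dy)                      # horizontal distance to goal
    d_z = goal_pos[2] - uav_pos[2]               # vertical distance to goal
    xi = np.arctan2(dy, dx) - heading            # relative angle to goal
    xi = (xi + np.pi) % (2 * np.pi) - np.pi      # wrap to [-pi, pi]
    o_state = np.array([d_xy, d_z, xi, v_xy, v_z, yaw_rate], dtype=np.float32)
    return {"depth": depth_image, "state": o_state}

# The policy maps this observation to a = [v_xy_cmd, v_z_cmd, yaw_rate_cmd],
# which is sent to the low-level controller as a velocity setpoint.
```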

B. Training Environment and Setting

The navigation network is trained from scratch in the AirSim [23] simulator, which is built on Unreal Engine. This simulator provides high-fidelity depth images and a low-level controller to stabilize the UAV. As shown in Fig. 2, a customized environment is created for the training. The environment is a square with 200 meters on each side, and some stones are randomly placed as obstacles. At the beginning of each episode, the quadrotor takes off from the centre of the environment. The goal position is set randomly on a circle with a radius of 70 meters centred on the take-off point. The episode terminates when the quadrotor reaches the goal position within an acceptance radius of 2 meters or crashes into an obstacle. At each time step, the neural network controller receives the depth image as well as the velocity and position information of the quadrotor and generates the velocity setpoint in the 3D environment as the control command. The controller runs at 10 Hz and the velocity control is realised by the low-level controller provided by AirSim.

Fig. 2. Customized training environment created using Unreal Engine: (a) AirSim simulation environment, (b) top view of the environment. The quadrotor takes off from the center of the environment. The goal is set randomly on a circle with a radius of 70 meters centred on the take-off point. Some stones are randomly placed as obstacles.

To get a smooth velocity command, we use a continuous action space. An off-policy model-free reinforcement learning algorithm, Twin Delayed DDPG (TD3) [24], is used for model training. As the successor of the DDPG method, TD3 addresses the Q-value overestimation problem of DDPG by introducing three critical tricks: clipped double Q-learning, delayed policy updates and target policy smoothing [25]. This DRL algorithm is widely used for continuous control problems. The hyper-parameters are tuned based on extensive training runs, and the final hyper-parameters of the algorithm are summarized in Table I in the Appendix.
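A training setup along these lines can be written with an off-the-shelf TD3 implementation such as Stable-Baselines3. The sketch below plugs in the hyper-parameters from Table I; the Gym-compatible AirSim wrapper `AirSimNavEnv` is a placeholder name, since the paper does not publish its training code.

```python
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = AirSimNavEnv()   # hypothetical Gym wrapper around the AirSim navigation task

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.3 * np.ones(n_actions))   # exploration noise (Table I)

model = TD3(
    "MultiInputPolicy",        # dict observation: depth image + state vector
    env,
    buffer_size=50_000,        # replay buffer size
    batch_size=128,            # mini-batch size
    gamma=0.99,                # discount factor
    learning_rate=3e-4,
    learning_starts=2_000,     # random exploration steps
    action_noise=action_noise,
    verbose=1,
)
model.learn(total_timesteps=200_000)   # roughly 1000 episodes in the paper's setup
```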

C. Reward Function Design

The reward function is critical for a DRL problem. In general, the reward function for navigation is simple: we could reward only reaching the goal position as soon as possible and punish collisions. However, because of the huge state space of the navigation task, especially in a 3D environment, it is better to introduce a continuous reward signal to guide exploration and speed up the training process. After extensive testing, a hand-designed reward function is utilized, which consists of a continuous goal-approaching reward and some penalty terms:

  r(s_t) = 10,                 if success
           R_goal − P_state,   otherwise    (3)

where R_goal = d(s_{t−1}) − d(s_t) is the goal-approaching reward and d(s_t) is the Euclidean distance from the current position to the goal position at time t. P_state is the penalty term at the current step:

  P_state = ω_1 · C_obs − ω_2 · C_act − ω_3 · C_pos    (4)

where C_obs, C_act and C_pos are penalty terms for the obstacle, the action, and the position error, respectively. The obstacle term

  C_obs = (d_safe − d_obs(s_t)) / (d_safe − d_min)    (5)

penalizes the quadrotor for getting close to an obstacle. In Eq. (5), d_safe and d_min are the safety distance and the minimum distance allowed to the obstacles, and d_obs(s_t) is the minimum distance to the obstacles at time t. In our training process, d_safe = 5 and d_min = 1, which means the quadrotor is punished when it gets within 5 meters of an obstacle. When the minimum distance to an obstacle is less than 1 meter, the quadrotor is considered crashed and the episode terminates. To stabilize the training process, the continuous reward part is constrained to the range -1 to 1.
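A direct transcription of Eqs. (3)-(5) is sketched below. The weights ω and the action/position penalty terms are placeholders, since their exact values are not listed in the paper; clipping the obstacle term outside the 5 m band is our reading of the text.

```python
import numpy as np

D_SAFE, D_MIN = 5.0, 1.0   # meters, as used during training

def reward(d_goal_prev, d_goal, d_obs, c_act=0.0, c_pos=0.0,
           w=(1.0, 1.0, 1.0), success=False):
    """Eqs. (3)-(5); w, c_act and c_pos are illustrative placeholders."""
    if success:
        return 10.0
    r_goal = d_goal_prev - d_goal                          # goal-approaching reward
    c_obs = (D_SAFE - d_obs) / (D_SAFE - D_MIN)            # Eq. (5)
    c_obs = np.clip(c_obs, 0.0, 1.0)                       # inactive beyond d_safe (assumption)
    p_state = w[0] * c_obs - w[1] * c_act - w[2] * c_pos   # Eq. (4), signs as printed
    return float(np.clip(r_goal - p_state, -1.0, 1.0))     # continuous part kept in [-1, 1]
```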

D. Training Result

The policy network is trained for 200k time steps (around 1000 episodes) in the simulation environment. To speed up the training process, the AirSim simulation clock speed is set to 10, making it run 10 times faster than real time. The total training process took about 7 hours on a PC with an Intel i7-8700 processor and an Nvidia GeForce GTX1060 GPU. The episode reward and success rate are plotted in Fig. 3. From the training result, the policy reaches a success rate of about 75% when the algorithm converges, which means that the network can guide the UAV to the goal position without colliding with any obstacle in most scenarios.

Fig. 3. Mean episode reward (a) and success rate (b) versus the training step. The success rate is obtained by evaluating each learned policy over 10 randomly generated navigation tasks without action noise. The evaluation is executed every 2k time steps during training.

IV. POST-HOC EXPLANATION METHOD

In this section, we introduce our model explanation method. To keep the network performance, a post-hoc explanation approach is used to explain the trained network. Feature attribution is a common way to obtain post-hoc model explanations. In most cases, gradients can be used to reflect the feature attribution if the model is differentiable. However, because of model saturation and discontinuities, gradients are not always fair to all the features. In our work, a novel feature attribution metric, the SHAP value, is used to measure the feature attribution. The SHAP value is provably the only distribution with certain desirable properties, which makes the explanation fairer.

Different from traditional image classification or purely vision-based navigation, in our case the input of the network consists of both depth information (image) and state information (scalars). Hence, the navigation network consists of a CNN perception part to deal with the image information and a fully connected part to fuse the image features and the state features. Because of this specific network architecture, our explanation consists of both a visual part for the image input and a textual part for the state features.

A. Visual explanation

In our problem, the depth image provides the obstacle information, and a CNN part is used to extract visual features from the raw depth image. CNN visualization is therefore important for understanding the output of the learned policy.

Understanding the insights of a CNN has always been a pain point, even though CNNs can achieve excellent predictive performance. In [28], a deconvolutional network (Deconvnet) approach was proposed to visualize the activated pattern in each hidden unit. This method can visualize features individually but is limited, as it is hard to summarize all hidden patterns into one pattern. Simonyan et al [29] visualize partial derivatives of predicted class scores w.r.t. pixel intensities, while Guided Backpropagation [30] makes modifications to the 'raw' gradients that result in qualitative improvements. These methods can provide fine-grained visualizations.

In [31], the authors proposed the Class Activation Map (CAM), which uses a global average pooling (GAP) layer to summarize the activations of the last CNN layer. However, it is only applicable to a particular CNN architecture in which the GAP layer is fed directly into the soft-max layer. To address this problem, the Grad-CAM [32] method combines the feature maps with the gradient signal and does not require any modification of the network architecture, so it can be applied to off-the-shelf CNN architectures. Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to assign importance values to each neuron for a particular decision of interest.

To visualize the CNN perception part of our network, a method combining both CAM and SHAP values is proposed. Similar to the CAM method, a global average pooling (GAP) layer is kept to summarize the visual features in our CNN perception network, and the output of the GAP layer is defined as the CNN feature. Different from CAM and Grad-CAM, in our method the SHAP value of each CNN feature is used to determine the importance of the CNN feature generated from the corresponding activation map. A coarse localization map highlighting the important regions in the image is generated by a weighted sum of the last CNN activation maps, where the SHAP values are the weights.

Compared with Grad-CAM, the SHAP values are used as the weights of the forward activation maps rather than the gradients, because SHAP values have some unique properties compared with gradients, such as efficiency, which means the feature attributions sum to the prediction value. Using SHAP values as the feature importance therefore provides a fairer attribution of the activation maps. The difference between CAM, Grad-CAM and our method is shown in Fig. 4.

Fig. 4. SHAP-CAM method. Different from CAM and Grad-CAM, in our problem the network output is an action rather than a class score. We use global average pooling, as in CAM, to get the intensity of each CNN perception feature, and the SHAP value is then calculated directly as the weight of the saliency map.
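The saliency-map construction described above reduces to a SHAP-weighted sum of the last convolutional activation maps. A minimal sketch of this step (our own, with the ReLU and normalization as display-oriented assumptions) is:

```python
import numpy as np

def shap_cam(activation_maps, shap_values):
    """SHAP-CAM: weighted sum of the last CNN activation maps.

    activation_maps: array (H, W, K), last conv layer output for one input
    shap_values:     array (K,), SHAP value of each GAP'd CNN feature
    """
    cam = np.tensordot(activation_maps, shap_values, axes=([-1], [0]))   # (H, W)
    cam = np.maximum(cam, 0.0)          # keep positively contributing regions (as in Grad-CAM)
    if cam.max() > 0:
        cam /= cam.max()                # normalize to [0, 1] for display
    return cam
```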

B. Textual explanation

In addition to the visual part, our model also takes some state features as network inputs. To get a reasonable explanation of the model output, both the image and the state inputs should be considered. To explain the state feature contributions, textual explanations are provided based on their SHAP values.

Our model has 3 continuous action outputs: the horizontal speed v_xy^cmd, the vertical speed v_z^cmd and the steering angular speed φ^cmd. To get the textual explanation, each action is divided into 3 parts based on the reference action, as shown in Fig. 5. If the action is similar to the reference action, we say that this action maintains the current state. If the output action is either bigger or smaller than the reference action, a specific text is used to describe the action, such as 'slow down' or 'speed up' for the horizontal speed v_xy^cmd. The final textual output is the combination of these three individual action descriptions; for example, an action can be described as 'slow down, maintain the altitude and turn right'.

Fig. 5. Action description. Each action is divided into 3 parts. When the prediction falls into the central part, we say it maintains the current state; otherwise, a textual description is generated for that action. The final description is the combination of these three individual descriptions.
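The mapping from a predicted action to text can be as simple as a three-way threshold around the reference action. The sketch below is illustrative; the dead-band values in `eps` are hypothetical, and the sign convention follows the reference output quoted in Section V-A.

```python
def describe_action(action, reference, eps=(0.5, 0.3, 5.0)):
    """action / reference = [v_xy_cmd, v_z_cmd, yaw_rate_cmd]; eps is a per-channel
    dead-band (illustrative values) inside which we report 'maintain'."""
    labels = [("slow down", "maintain speed", "speed up"),
              ("descend", "maintain altitude", "climb"),
              ("turn left", "keep heading", "turn right")]
    parts = []
    for a, ref, e, (low, mid, high) in zip(action, reference, eps, labels):
        if a < ref - e:
            parts.append(low)
        elif a > ref + e:
            parts.append(high)
        else:
            parts.append(mid)
    return ", ".join(parts)

# describe_action([2.0, 0.0, 10.0], [3.71, -0.03, 4.15])
#   -> "slow down, maintain altitude, turn right"
```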

Finally, with both the visual and the textual explanation, every output of the network can be explained to non-expert users to illustrate the reason for the network's decision. These explanations require only one forward propagation, so they can also be provided in real time during flight.

V. MODEL EXPLANATION

In this section, we explain the trained model using the proposed explanation method. Both visual and textual explanations are provided to analyse every model prediction, which can give real-time action explanations to non-expert users. The visual part shows the attention of the CNN perception part, and the textual part summarises the contribution of the other state features. In addition, the activation map of the last CNN layer is drawn to show the visual features extracted by the CNN part. Finally, to help experts diagnose and improve the network design, some global explanations are also provided to analyse the network performance based on the data gathered in 20 consecutive episodes.

A. Defining the Reference Input

Baselines or norms are essential to all explanations [33]. A feature attribution method always generates the contribution of each feature based on a reference or baseline input, so the choice of the reference input is critical for obtaining insightful results [21]. In practice, choosing a good reference relies on domain-specific knowledge; for instance, in object recognition networks it is the black image.

In our case, we choose the depth image at the target flight height without any obstacles as the reference image input. For the state feature input, we set o_ref = [d_xy = 70, d_z = 0, ξ = 0, v_xy = 0, v_z = 0, φ = 0], which means the UAV has just taken off from the start point and has no velocity. The reference image is shown in Fig. 6. Based on this reference input, we get the reference model output from the trained network: v_xy^cmd = 3.71 m/s, v_z^cmd = −0.03 m/s, φ^cmd = 4.15°/s, which means the network wants the quadrotor to speed up and turn slightly right from the reference input.

Fig. 6. Reference depth image. In our case, we choose the depth image at the target flight height without any obstacles as the reference image input.

B. Local explanation

A local explanation can be generated for every time step and every model prediction. Here, 3 specific time steps are chosen to demonstrate our visual and textual explanations of the actions in one of the model evaluation episodes. As shown in Fig. 7, at t = 0 the action is slow down, keep altitude and turn right. The explanation shows that both the slow down and the turn right are caused by the angular error to the goal. This makes sense, because the heading at t = 0 does not match the goal position, so the UAV needs to turn right. At t = 53, the action is slow down, climb and turn right. The explanation shows that this is caused by the CNN features: from the heat-map generated using SHAP-CAM, we can see the CNN detected the left edge of the stone, which is the obstacle. At t = 89, the action is slow down, climb and turn left, which is also caused by the CNN features.

Fig. 7. Action explanations at 3 different time steps: (a) t = 0, (b) t = 53, (c) t = 89. At t = 0, the action is slow down, keep altitude and turn right, mainly because of the large angle error. At t = 53, the action is slow down, climb and turn right, mainly because of the image features; from the heat map, we can see the quadrotor is close to the stone and the CNN detected the edge of the stone. At t = 89, the action is slow down, climb and turn left, which is also caused by the image features.

To find out the meaning of the CNN features, we also plot the last CNN layer activation maps at both t = 53 and t = 89, as shown in Fig. 8. From these activation maps, we can see that at t = 53, CNN feature 8 corresponds to the left and right edges of the obstacle, which contributes most to the slow down action. CNN feature 7 is the obstacle and some ground, which contributes to the climb. CNN feature 4 shows the right-side edge of the obstacle with some free-space background, which leads to the turn right action.

Fig. 8. Last CNN layer activation maps at (a) t = 53 and (b) t = 89. From these maps we can get the meaning of the different CNN features. For example, according to Fig. 7, at t = 53 action 3 is turn right because of the CNN 4 and CNN 3 features; from (a), CNN 3 and CNN 4 correspond to the right edge of the stone.

To illustrate more local explanation results, we choose one episode from the evaluation process and explain the model predictions at each time step using SHAP-CAM. Fig. 9 shows the depth image and the SHAP-CAM activation maps for the 3 actions at different time steps. From Fig. 9, we can see that at different time steps, the network's decision making for the different outputs relies on different visual patterns. Moreover, Fig. 10 and Fig. 11 show the state features and the control commands during this evaluation episode. From d_xy in Fig. 10, we can see that the UAV always flies towards the goal position and the distance to the goal decreases over the trajectory. Finally, at t = 160, the UAV reaches the goal position.

Fig. 9. Depth image and SHAP-CAM at 10 different time steps in the evaluation episode. The first row is the input depth image; the second to fourth rows are the three SHAP-CAM activation maps for the three network outputs.

Fig. 10. State features in the evaluation episode. The blue line is the state feature and the orange line is the reference state feature value.

Fig. 11. Network output in the evaluation episode. The blue line is the action and the orange line is the reference action.

C. Global explanation

Apart from the local explanations, some global explanations are provided. We summarize all the feature attributions over 20 trajectories, 2858 time steps in total. Fig. 12 shows the SHAP summary plot that orders the features based on their importance to the different actions. We can see that the CNN features contribute most to actions a_1: v_xy^cmd and a_2: v_z^cmd. Apart from the CNN features, the current horizontal velocity v_xy and the distance to goal d_xy are the most important features contributing to a_1: v_xy^cmd. d_xy, v_xy and v_z contribute more to a_2: v_z^cmd, the vertical velocity command. The angle error ξ is the most important feature for a_3: φ^cmd.

Fig. 12. Feature analysis over 20 trajectories for the 3 actions: forward speed v_xy^cmd (left), vertical speed v_z^cmd (middle) and steering speed φ^cmd (right). For each action, all the features are sorted according to their average attribution. For v_xy^cmd, the CNN 8 and CNN 2 features contribute most, followed by the current forward speed v_xy. For v_z^cmd, the distance to the goal contributes most. For φ^cmd, the angle error to the goal is the most important feature.

With the feature values and their SHAP values, we can investigate the relationship between the feature intensity and its importance measurement, as shown in Fig. 13. From the plots, we can find that there is some relationship between the feature value and the SHAP value. For example, the first plot in the third row of Fig. 13 shows the SHAP value of the state feature ξ with respect to a_3: φ^cmd; the angle error shows a positive correlation with its SHAP value, whereas the angular speed shows a negative correlation.

Fig. 13. Feature dependence plots using 2858 samples from 20 trajectories. The x-axis is the feature value and the y-axis is its SHAP value. The feature value is normalized to 0 to 1, so an angle error of 0.5 means ξ = 0. The first row shows the SHAP values of the state features d_xy and v_xy with respect to a_1: v_xy^cmd. The second row shows the SHAP values of d_z and v_z with respect to a_2: v_z^cmd. The third row shows the SHAP values of ξ and φ with respect to a_3: φ^cmd.
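Given per-step SHAP values collected over the evaluation episodes, summaries of this kind can be produced directly with the shap plotting utilities. The sketch below is a hedged example; the arrays and feature names are placeholders for the data gathered over the 20 episodes.

```python
import shap

# shap_values_per_action: list with one (N, d) attribution array per action output,
# collected over the 2858 evaluation steps; feature_matrix: the matching (N, d) inputs.
shap.summary_plot(shap_values_per_action[0],          # attributions for a1: v_xy_cmd
                  feature_matrix,
                  feature_names=feature_names)

# Dependence of one feature's SHAP value on its own value (cf. Fig. 13);
# the column index of the angle-error feature is an assumption here.
shap.dependence_plot(2, shap_values_per_action[2], feature_matrix,
                     feature_names=feature_names)
```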

VI. REAL WORLD FLIGHT TEST

To validate the performance of our reactive navigation controller, some real-world outdoor experiments are carried out.

A. Flight Platform

A self-assembled quadrotor platform is used to evaluate the trained navigation network, as shown in Fig. 15(a). The platform is based on the S500 quadrotor frame and is equipped with a Pixhawk flight controller for the low-level attitude and velocity control, which also provides the position and velocity information of the quadrotor. An Intel RealSense D435i camera is mounted forward to perceive the depth information in front of the quadrotor. The on-board computer is an Nvidia Jetson Nano, which runs the neural network and generates the velocity command signals. The velocity command signal is sent to the flight controller as a velocity setpoint via a serial port at 10 Hz.

B. Model Retraining

To simplify the experiment, we fix the height of the quadrotor during flight to 5 m and reduce the controller output to only the forward velocity and the steering velocity. Notably, to reduce the gap between the simulation and the real environment, the network is retrained in a custom Gazebo environment. The new environment uses simulated trees as obstacles, and the controller runs in the PX4 Software-In-The-Loop (SITL) configuration. Also, to keep the quadrotor safe, the maximum forward velocity is limited to 1 m/s. The network is trained for 20k time steps, and the training result is shown in Fig. 14. The success rate is about 75% after training.

Fig. 14. Training result in the Gazebo simulation environment: (a) episode reward, (b) success rate. The model is trained from scratch for 20k time steps. The episode reward and success rate are smoothed over 100 episodes. After training, the model achieves about 75% success rate.

C. Real World Flight

After training, the trained network is deployed directly on the real flight platform. The real-world test environment is shown in Fig. 15(a). A big tree was chosen as the obstacle and the goal position was set behind the tree. One of the flight paths is shown in Fig. 15(b). From the flight path, the trained reactive controller can navigate the quadrotor around the obstacle and finally reach the goal.

Fig. 15. Real-world flight test: (a) the test environment, where the quadrotor faces a tree as an obstacle in front of it and the goal position is set about 35 m behind the obstacle; (b) the flight trajectory.

For comparison, we also tested a traditional search-based obstacle avoidance algorithm. The PX4 avoidance system, which is based on 3DVFH+, is deployed on the test platform. This local planner plans in a vector field histogram that includes some history information. The flight results show that both algorithms can navigate the quadrotor to the final position. However, using the same hardware, the PX4 avoidance system can only run at 10 Hz, whereas our neural network-based reactive controller can run at up to 60 Hz. This shows the computational advantage of the reactive controller over a conventional planner, which is a benefit for lightweight UAVs with limited computation resources.

However, the flight tests also exposed some problems. The main problem is output oscillation. As shown in Fig. 16, the output during the avoidance manoeuvre is not very smooth in the real test, although it is smooth in simulation. We think this is caused by the persistent gap between the simulation and the real environment, such as the different dynamics and the different input images. This also reflects the domain adaptation problem faced by all learning-based control systems.

Fig. 16. Forward speed (top) and steering speed (bottom) during the real flight test.

VII. CONCLUSION

In this paper, the UAV autonomous navigation problem is solved with the DRL technique. Different from other works, this paper focuses on improving the model explainability rather than treating the trained model as a black box. Based on the feature attributions, both visual and textual explanations are generated to open the black box. To get a better visual explanation of the CNN perception part, a new saliency map generation method is proposed that combines both CAM and SHAP values. Our method can provide both visual and textual explanations of every model output for non-expert users, which is important for the application of DRL-based models in the real world.

Because this paper mainly focuses on the explanation part, the trained model is not perfect, and some explanations still do not make sense. In the future, the model will be fine-tuned and improved based on the explanations.

Feature attribution methods can provide some explanation of a deep neural network, but it is still fairly shallow. For example, attributions do not explain how the network combines the features to produce the answer or why gradient descent converged. In the future, we will look for other methods to provide better explanations of the trained network. We are also interested in explaining the training process of DRL to reflect the knowledge acquisition process.

APPENDIX

A. Hyperparameters of TD3

The hyperparameters are shown in Table I.

TABLE I
HYPERPARAMETERS OF TD3

Hyperparameter                              Value
mini-batch size                             128
replay buffer size                          50000
discount factor                             0.99
learning rate                               0.0003
random exploration steps                    2000
standard deviation of exploration noise     0.3

REFERENCES

[1] S. Liu, M. Watterson, K. Mohta, K. Sun, S. Bhattacharya, C. J. Taylor, and V. Kumar, "Planning dynamically feasible trajectories for quadrotors using safe flight corridors in 3-d complex environments," IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1688–1695, 2017.
[2] B. Zhou, F. Gao, L. Wang, C. Liu, and S. Shen, "Robust and efficient quadrotor trajectory generation for fast autonomous flight," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3529–3536, 2019.
[3] S. Paschall and J. Rose, "Fast, lightweight autonomy through an unknown cluttered environment: Distribution statement: A—approved for public release; distribution unlimited," in 2017 IEEE Aerospace Conference. IEEE, 2017, pp. 1–8.
[4] H. D. Escobar-Alvarez, N. Johnson, T. Hebble, K. Klingebiel, S. A. Quintero, J. Regenstein, and N. A. Browning, "R-advance: Rapid adaptive prediction for vision-based autonomous navigation, control, and evasion," Journal of Field Robotics, vol. 35, no. 1, pp. 91–100, 2018.
[5] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive uav control in cluttered natural environments," in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 1765–1772.
[6] N. Imanberdiyev, C. Fu, E. Kayacan, and I.-M. Chen, "Autonomous navigation of uav by using real-time model-based reinforcement learning," in 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 2016, pp. 1–6.
[7] L. He, N. Aouf, J. F. Whidborne, and B. Song, "Integrated moment-based lgmd and deep reinforcement learning for uav obstacle avoidance," pp. 7491–7497, 2020.
[8] ——, "Deep reinforcement learning based local planner for uav obstacle avoidance using demonstration data," arXiv preprint arXiv:2008.02521, 2020.
[9] C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of uavs in large-scale complex environments: A deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2124–2136, 2019.
[10] C. Wang, J. Wang, J. Wang, and X. Zhang, "Deep reinforcement learning-based autonomous uav navigation with sparse rewards," IEEE Internet of Things Journal, 2020.
[11] R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg, and A. Holzinger, "Explainable ai: the new 42?" in International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Springer, 2018, pp. 295–303.
[12] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins et al., "Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai," Information Fusion, vol. 58, pp. 82–115, 2020.
[13] A. Heuillet, F. Couthouis, and N. Díaz-Rodríguez, "Explainability in deep reinforcement learning," arXiv preprint arXiv:2008.06693, 2020.
[14] Z. Juozapaitis, A. Koul, A. Fern, M. Erwig, and F. Doshi-Velez, "Explainable reinforcement learning via reward decomposition," in IJCAI/ECAI Workshop on Explainable Artificial Intelligence, 2019.
[15] R. Pocius, L. Neal, and A. Fern, "Strategic tasks for explainable reinforcement learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 10007–10008.
[16] J. H. Lee, "Complementary reinforcement learning towards explainable agents," arXiv preprint arXiv:1901.00188, 2019.
[17] B. Beyret, A. Shafti, and A. A. Faisal, "Dot-to-dot: Explainable hierarchical reinforcement learning for robotic manipulation," arXiv preprint arXiv:1904.06703, 2019.
[18] P. Madumal, T. Miller, L. Sonenberg, and F. Vetere, "Explainable reinforcement learning through a causal lens," arXiv preprint arXiv:1905.10958, 2019.
[19] ——, "Distal explanations for explainable reinforcement learning agents," arXiv preprint arXiv:2001.10284, 2020.
[20] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[21] A. Shrikumar, P. Greenside, and A. Kundaje, "Learning important features through propagating activation differences," arXiv preprint arXiv:1704.02685, 2017.
[22] H. Chen, S. Lundberg, and S.-I. Lee, "Explaining models by propagating shapley values of local components," arXiv preprint arXiv:1911.11888, 2019.
[23] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "Airsim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics, 2017. [Online]. Available: https://arxiv.org/abs/1705.05065
[24] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[25] J. Achiam, "Spinning Up in Deep Reinforcement Learning," 2018.
[26] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje, "Not just a black box: Learning important features through propagating activation differences," arXiv preprint arXiv:1605.01713, 2016.
[27] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," arXiv preprint arXiv:1703.01365, 2017.
[28] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[29] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
[30] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.

[31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[32] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[33] D. Kahneman and D. T. Miller, "Norm theory: Comparing reality to its alternatives," Psychological Review, vol. 93, no. 2, p. 136, 1986.
