Adversarial Attacks and Defenses in Deep Learning
Kui Ren, Tianhang Zheng, Zhan Qin, Xue Liu
DOI: https://doi.org/10.1016/j.eng.2019.12.012
Research
Artificial Intelligence—Article
* Corresponding author.
E-mail address: [email protected] (K. Ren)
ARTICLE INFO
Article history:
Available online
Keywords:
Machine learning
Deep neural network
Adversarial example
Adversarial attack
Adversarial defense
ABSTRACT
With the rapid developments of artificial intelligence (AI) and deep learning (DL) techniques, it is critical to ensure the
security and robustness of the deployed algorithms. Recently, the security vulnerability of DL algorithms to adversarial
samples has been widely recognized. The fabricated samples can lead to various misbehaviors of the DL models while
being perceived as benign by humans. Successful implementations of adversarial attacks in real physical-world scenarios
further demonstrate their practicality. Hence, adversarial attack and defense techniques have attracted increasing attention
from both machine learning and security communities and have become a hot research topic in recent years. In this paper,
we first introduce the theoretical foundations, algorithms, and applications of adversarial attack techniques. We then describe a few research efforts on the defense techniques, which cover the broad frontier in the field. Several open problems
and challenges are subsequently discussed, which we hope will provoke further research efforts in this critical area.
1. Introduction
A trillion-fold increase in computation power has popularized the usage of deep learning (DL) for
handling a variety of machine learning (ML) tasks, such as image classification [1], natural language
processing [2], and game theory [3]. However, a severe security threat to the existing DL algorithms has
been discovered by the research community: Adversaries can easily fool DL models by perturbing benign
samples without being discovered by humans [4]. Perturbations that are imperceptible to human vision/hearing are sufficient to prompt the model to make a wrong prediction with high confidence. This phenomenon, named the adversarial sample, is considered to be a significant obstacle to the mass deployment of DL models in production. Substantial research efforts have been made to study this open
problem.
According to the threat model, existing adversarial attacks can be categorized into white-box, gray-
box, and black-box attacks. The difference between the three models lies in the knowledge of the adversaries. In the threat model of white-box attacks, the adversaries are assumed to have full knowledge of
their target model, including model architecture and parameters. Hence, they can directly craft adversarial
samples on the target model by any means. In the gray-box threat model, the knowledge of the adversaries
is limited to the structure of the target model. In the black-box threat model, the adversaries can only
resort to the query access to generate adversarial samples. In the frameworks of these threat models, a
number of attack algorithms for adversarial sample generation have been proposed, such as the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [4], the fast gradient sign method (FGSM) [5], the basic iterative method/projected gradient descent (BIM/PGD) [6], the distributionally adversarial attack (DAA) [7], Carlini and Wagner attacks (C&W) [8], the Jacobian-based saliency map attack (JSMA)
[9], and DeepFool [10]. These attack algorithms are designed in the white-box threat model. However,
they are also effective in many gray-box and black-box settings due to the transferability of the adversarial samples among models [11,12].
Meanwhile, various defensive techniques for adversarial sample detection/classification have been
proposed recently, including heuristic and certificated defenses. Heuristic defense refers to a defense
mechanism that performs well in defending against specific attacks without theoretical accuracy guarantees. Currently, the most successful heuristic defense is adversarial training, which attempts to improve the DL
model’s robustness by incorporating adversarial samples into the training stage. In terms of empirical
results, PGD adversarial training achieves state-of-the-art accuracy against a wide range of 𝐿∞ attacks on
several DL model benchmarks such as the modified National Institute of Standards and Technology
(MNIST) database, the Canadian Institute for Advanced Research-10 dataset (CIFAR10), and ImageNet
[13,14]. Other heuristic defenses mainly rely on input/feature transformations and denoising to alleviate
the adversarial effects in the data/feature domains. In contrast, certified defenses can always provide
certifications for their lowest accuracy under a well-defined class of adversarial attacks. A recently popular network certification approach is to formulate an adversarial polytope and define its upper bound using convex relaxations. The relaxed upper bound is a certification for trained DL models, which guarantees that no attack with specific limitations can surpass the certificated attack success rate, as approximated by the upper bound. However, the actual performance of these certificated defenses is still much
worse than that of the adversarial training.
In this paper, we investigate and summarize the adversarial attacks and defenses that represent the
state-of-the-art efforts in this area. After that, we provide comments and discussions on the effectiveness
of the presented attack and defense techniques. The remainder of the paper is organized as follows: In
Section 2, we first sketch out the background. In Section 3, we detail several classic adversarial attack
methods. In Section 4, we present a few applications of adversarial attacks in real-world industrial scenarios. In Section 5, we introduce a few recently proposed defense methods. In Section 6, we provide
some observations and insights on several related open problems. In Section 7, we conclude this survey.
2. Preliminaries
3. Adversarial attacks
In this section, we introduce a few representative adversarial attack algorithms and methods. These
methods primarily target image classification DL models but can also be extended to other DL models. We
detail the specific adversarial attacks on the other DL models in Section 4.
3.1 L-BFGS
The vulnerability of deep neural networks (DNNs) to adversarial samples is first reported in Ref. [4];
that is, hardly perceptible adversarial perturbations are introduced to an image to mislead the DNN classification result. A method called L-BFGS is proposed to find the adversarial perturbations with the minimum 𝐿𝑝 norm, which is formulated as follows:

min_{𝑥′} ||𝑥 − 𝑥′||𝑝   subject to   𝑓(𝑥′) = 𝑦′ and 𝑥′ ∈ [0, 1]^𝒹

where 𝑓 is the target classifier, ||𝑥 − 𝑥′||𝑝 is the 𝐿𝑝 norm of the adversarial perturbation, and 𝑦′ is the adversarial target label (𝑦′ ≠ 𝑦). However, this optimization problem is intractable. The authors propose minimizing a hybrid loss, that is, 𝑐||𝑥 − 𝑥′||𝑝 + 𝐽(𝜃, 𝑥′, 𝑦′), where 𝑐 is a hyperparameter, as an approximation of the solution to the optimization problem, where an optimal value of 𝑐 can be found by line search/grid search.
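As a rough illustration only, the hybrid-loss approximation above can be sketched in PyTorch with the library's L-BFGS optimizer followed by a clipping step; the choice of the 𝐿2 norm, the use of cross-entropy as 𝐽, and all hyperparameter values are assumptions made for this sketch rather than the exact setup of Ref. [4].

import torch
import torch.nn.functional as F

def lbfgs_style_attack(model, x, y_target, c=0.1, steps=50, lr=0.1):
    # Minimize c * ||x' - x||_2 + J(theta, x', y') over x'.
    x_adv = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.LBFGS([x_adv], lr=lr, max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = c * torch.norm(x_adv - x, p=2) + F.cross_entropy(model(x_adv), y_target)
        loss.backward()
        return loss

    optimizer.step(closure)
    # L-BFGS here is unconstrained, so project the result back into the valid pixel box.
    return torch.clamp(x_adv.detach(), 0.0, 1.0)

In practice, 𝑐 would be tuned by line search/grid search as described above.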
Fig. 1. A demonstration of an adversarial sample generated by applying FGSM to GoogLeNet [5]. The imperceptible perturbation crafted by FGSM fools GoogLeNet into recognizing the image as a gibbon.
𝑥𝑡+1′ = Clip{𝑥𝑡′ + 𝛼 ⋅ sign(∇𝑥𝐽(𝜃, 𝑥𝑡′, 𝑦))}   (6)
where 𝛼𝑇 = 𝜖, and 𝛼 is the magnitude of the perturbation in each iteration. PGD can be considered as a generalized version of BIM without the constraint 𝛼𝑇 = 𝜖. In order to constrain the adversarial perturbations, PGD projects the adversarial samples learned from each iteration into the 𝜖-𝐿∞ neighborhood of the benign samples. Hence, the adversarial perturbation size is always smaller than 𝜖. Formally, the update procedure follows
𝑥𝑡+1′ = Proj{𝑥𝑡′ + 𝛼 ⋅ sign(∇𝑥𝐽(𝜃, 𝑥𝑡′, 𝑦))}   (7)
where Proj projects the updated adversarial sample into the 𝜖-𝐿∞ neighborhood of the benign sample and the valid pixel range.
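The update in Eq. (7) can be sketched in PyTorch as follows; the classifier interface, the use of cross-entropy as 𝐽, the random start, and the hyperparameter values are illustrative assumptions, and the projection Proj is implemented by elementwise clipping to the 𝜖-𝐿∞ ball and the valid pixel range.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=40):
    # Untargeted L-infinity PGD following Eq. (7), with a random start inside the eps ball.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Proj: clip back into the eps-L_inf neighborhood of x and into the valid range.
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()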
𝑥𝑡+1′ = Clip{𝑥𝑡′ + 𝛼 ⋅ sign(𝑔𝑡+1)}   (8)
where the gradient 𝑔 is updated by 𝑔𝑡+1 = 𝜑 ⋅ 𝑔𝑡 + ∇𝑥𝐽(𝜃, 𝑥𝑡′, 𝑦)/||∇𝑥𝐽(𝜃, 𝑥𝑡′, 𝑦)||1, and 𝜑 is a decay factor.
The authors further propose a scheme that builds an ensemble of models to attack a model in the black-box/gray-box settings. The basic idea is to consider the gradients of multiple models with respect to the input and identify a gradient direction that is more likely to transfer to other models. The combination of MI-FGSM and the ensemble attack scheme won first place in both the non-targeted and the targeted adversarial attack competitions (black-box setting) at the 2017 Neural Information Processing Systems (NIPS) conference.
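A minimal sketch of the momentum update in Eq. (8) is given below for a single model (the ensemble scheme is omitted); the use of cross-entropy as 𝐽, the step size 𝛼 = 𝜖/𝑇, and the assumption of four-dimensional image batches are illustrative choices rather than the exact implementation of Ref. [18].

import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps=8/255, steps=10, decay=1.0):
    # Momentum iterative FGSM following Eq. (8); x is an image batch of shape [N, C, H, W].
    alpha = eps / steps                                  # so that alpha * T = eps
    g = torch.zeros_like(x)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Accumulate the L1-normalized gradient with the decay factor phi (here: decay).
            g = decay * g + grad / grad.abs().sum(dim=(1, 2, 3), keepdim=True)
            x_adv = x_adv + alpha * g.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)  # Clip
    return x_adv.detach()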
where 𝜇 denotes the adversarial data distribution and 𝜋(𝑥) denotes the benign data distribution. Since
direct optimization over the distribution is intractable, the authors exploit two particle-optimization methods for approximation. Compared with PGD, DAA explores new adversarial patterns, as shown in Fig. 2 [7]. It ranks second on the Massachusetts Institute of Technology (MIT) MadryLab’s white-box leaderboards [13], and is considered to be one of the most effective 𝐿∞ attacks on multiple defensive models.
Fig. 2. Comparison between PGD and DAA. DAA tends to generate more structured perturbations [7].
such that 𝑥 + 𝛿 = (tanh(𝜅) + 1)/2, which always resides in the range of [0, 1] during the optimization process. C&W attacks achieve a 100% attack success rate on naturally trained DNNs for MNIST, CIFAR10, and ImageNet. They also compromise defensively distilled models, on which L-BFGS and DeepFool fail to find adversarial samples.
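The change of variables can be sketched as follows; the margin-style objective, the Adam optimizer, and the hyperparameters loosely follow Ref. [8] but are simplified (for instance, the binary search over 𝑐 is omitted), so this is an illustrative sketch rather than the reference implementation.

import torch

def cw_l2_attack(model, x, y_target, c=1.0, steps=200, lr=0.01, kappa=0.0):
    # Targeted C&W-style L2 attack with the tanh change of variables: x' = (tanh(w) + 1) / 2.
    w = torch.atanh(torch.clamp(x * 2 - 1, -0.999, 0.999)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = (torch.tanh(w) + 1) / 2                  # always lies in [0, 1]
        logits = model(x_adv)
        target = logits.gather(1, y_target.unsqueeze(1)).squeeze(1)
        other = logits.scatter(1, y_target.unsqueeze(1), float('-inf')).max(dim=1).values
        margin = torch.clamp(other - target + kappa, min=0.0)  # push the target logit above the rest
        loss = ((x_adv - x) ** 2).flatten(1).sum(dim=1).sum() + c * margin.sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return ((torch.tanh(w) + 1) / 2).detach()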
∇𝑙(𝑥) = ∂𝑙(𝑥)/∂𝑥 = [∂𝑙𝑗(𝑥)/∂𝑥𝛾]_{𝛾∈{1,…,𝑀in}, 𝑗∈{1,…,𝑀out}}   (12)
where 𝑀in is the number of neurons in the input layer; 𝑀out is the number of neurons in the output layer; 𝛾 is the index of the input 𝑥 component; and 𝑗 is the index of the output 𝑙 component.
The Jacobian matrix identifies how the elements of input 𝑥 affect the logit outputs of different classes.
According to the Jacobian matrix, an adversarial saliency map 𝑆(𝑥, 𝑦′) is defined to select the features/pixels that should be perturbed in order to obtain the desired changes in the logit outputs. Specifically, the proposed algorithm perturbs the element 𝑥[𝛾] with the highest value of 𝑆(𝑥, 𝑦′)[𝛾] to significantly increase the logit output of the target class and decrease the logit outputs of the other classes. Hence, perturbations on a small proportion of elements can already affect 𝑙(𝑥) and fool the neural network.
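A simplified sketch of the Jacobian-based saliency computation is shown below; it scores single features rather than the feature pairs searched in Ref. [9], and the model interface and the use of torch.autograd.functional.jacobian are assumptions made for illustration.

import torch
from torch.autograd.functional import jacobian

def jsma_saliency(model, x, target):
    # Jacobian of the logits l(x) with respect to the input, as in Eq. (12); x has shape [1, ...].
    jac = jacobian(lambda inp: model(inp).squeeze(0), x).flatten(start_dim=1)   # [M_out, M_in]
    d_target = jac[target]                     # derivatives of the target logit
    d_others = jac.sum(dim=0) - d_target       # summed derivatives of all other logits
    # Keep features that increase the target logit while decreasing the other logits.
    return torch.where((d_target > 0) & (d_others < 0),
                       d_target * d_others.abs(),
                       torch.zeros_like(d_target))

# Usage: gamma = jsma_saliency(model, x, target_class).argmax()  # index of the feature to perturb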
3.8 DeepFool
Moosavi-Dezfooli et al. [10] propose a new algorithm named DeepFool to find the minimum 𝐿2 adversarial perturbations on both an affine binary classifier and a general binary differentiable classifier.
For an affine classifier 𝑓(𝑥) = 𝑤ᵀ𝑥 + 𝑏, where 𝑤 is the weight of the affine classifier and 𝑏 is its bias, the minimum perturbation to change the class of example 𝑥0 is the distance from 𝑥0 to the decision boundary hyperplane ℱ = {𝑥: 𝑤ᵀ𝑥 + 𝑏 = 0}, that is, −𝑓(𝑥0)𝑤/||𝑤||₂². For a general differentiable classifier, DeepFool assumes that 𝑓 is linear around 𝑥𝑡′ and iteratively calculates the perturbation 𝛿𝑡:

argmin_{𝛿𝑡} ||𝛿𝑡||₂   subject to   𝑓(𝑥𝑡′) + ∇𝑓(𝑥𝑡′)ᵀ𝛿𝑡 = 0   (13)
This process runs until sign(𝑓(𝑥𝑡′)) ≠ sign(𝑓(𝑥)), that is, until the predicted label changes, and the minimum perturbation is eventually approximated by the sum of the 𝛿𝑡. This method can also be extended to attack general multi-class classifiers, where the problem is transformed into calculating the distance from 𝑥0 to the surface of a convex polyhedron formed by
the decision boundaries between all the classes, as illustrated in Fig. 3 [10]. Experiments show that the
perturbation introduced by DeepFool is smaller than that of FGSM on several benchmark datasets.
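For the binary case, the closed-form solution of Eq. (13), 𝛿𝑡 = −𝑓(𝑥𝑡′)∇𝑓(𝑥𝑡′)/||∇𝑓(𝑥𝑡′)||₂², can be iterated as in the sketch below; the overshoot factor and the stopping rule are simplifications assumed for illustration rather than the full algorithm of Ref. [10].

import torch

def deepfool_binary(f, x, max_iter=50, overshoot=0.02):
    # f is a differentiable binary classifier returning a scalar whose sign gives the class.
    x_adv = x.clone().detach()
    orig_sign = torch.sign(f(x)).item()
    total_pert = torch.zeros_like(x)
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        fx = f(x_adv)
        if torch.sign(fx).item() != orig_sign:           # the label has flipped: stop
            break
        grad = torch.autograd.grad(fx, x_adv)[0]
        # Closed-form solution of Eq. (13): delta_t = -f(x'_t) * grad / ||grad||_2^2.
        total_pert = total_pert + (-fx.item()) * grad / (grad.norm() ** 2 + 1e-12)
        x_adv = (x + (1 + overshoot) * total_pert).detach()
    return x_adv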
Fig. 3. Convex polyhedron formed by the decision boundaries between all the classes: (a) linear model; (b) nonlinear
model [10].
min_{𝑥′} 𝑐𝐽(𝜃, 𝑥′, 𝑦′) + 𝛽||𝑥′ − 𝑥||₁ + ||𝑥′ − 𝑥||₂²   (14)
subject to 𝑥′ ∈ [0, 1]^𝒹
where 𝑐 and 𝛽 are hyperparameters, 𝐽(𝜃, 𝑥′, 𝑦′) is the targeted adversarial loss, and 𝛽||𝑥′ − 𝑥||₁ + ||𝑥′ − 𝑥||₂² is used to penalize the 𝐿1 and 𝐿2 distances between the adversarial samples 𝑥′ and the benign
samples 𝑥. EAD first introduces an 𝐿1 norm constraint into adversarial attacks, leading to a different set
of adversarial examples with comparable performance to other state-of-the-art methods.
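As a hedged sketch of objective (14), the snippet below minimizes the elastic-net-regularized loss by plain gradient steps with clipping; the original EAD algorithm [19] instead uses an ISTA-style iterative shrinkage solver, and the model, loss, and hyperparameter values here are assumptions for illustration.

import torch
import torch.nn.functional as F

def ead_style_attack(model, x, y_target, c=1.0, beta=1e-2, steps=200, lr=0.01):
    # Minimize c*J(theta, x', y') + beta*||x' - x||_1 + ||x' - x||_2^2 while keeping x' in [0, 1].
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        diff = x_adv - x
        loss = (c * F.cross_entropy(model(x_adv), y_target)
                + beta * diff.abs().sum()
                + (diff ** 2).sum())
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0.0, 1.0)        # enforce x' in [0, 1]^d
    return x_adv.detach()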
Fig. 4. Eyeglasses with adversarial perturbations deceive a facial recognition system into recognizing the faces in the first row as those in the second row [25].
Fig. 5. (a) shows the original image, identified by an Inception-v3 model as a microwave, and (b) shows its physical adversarial example, identified as a phone [33].
Fig. 6. In (a), Faster R-CNN correctly detects three dogs and identifies their regions, while in (b), generated by DAG, the segmentation results are completely wrong [40].
defined by the difference between the confidence values of the sentence with and without the
word.
Fig. 7. Adversarial text generated by TextBugger [16]: A negative comment is misclassified as a positive comment.
5. Adversarial defenses
In this section, we summarize the representative defenses developed in recent years, mainly including adversarial training, randomization-based schemes, denoising methods, provable defenses, and some
other new defenses. We also present a brief discussion on their effectiveness against different attacks
under different settings.
min_𝜃 max_{𝐷(𝑥,𝑥′)<𝜂} 𝐽(𝜃, 𝑥′, 𝑦)   (15)
where 𝐽(𝜃, 𝑥′, 𝑦) is the adversarial loss, with network weights 𝜃 , adversarial input 𝑥′, and ground-truth
label 𝑦. 𝐷(𝑥, 𝑥 ′ ) represents a certain distance metric between 𝑥 and 𝑥′. The inner maximization problem
is to find the most effective adversarial samples, which is solved by a well-designed adversarial attack,
such as FGSM [5] and PGD [6]. The outer minimization is the standard training procedure to minimize
the loss. The resulting network is supposed to be resistant against the adversarial attack used for the
adversarial sample generation in the training stage. Recent studies in Refs. [13,14,57,58] show that adversarial training is one of the most effective defenses against adversarial attacks. In particular, it achieves
state-of-the-art accuracy on several benchmarks. Therefore, in this section, we elaborate on the best-
performing adversarial training techniques in the past few years.
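A minimal sketch of this min-max training loop is given below; it reuses the pgd_attack sketch from Section 3 for the inner maximization, and the optimizer, the number of inner PGD steps, and the choice of training only on adversarial samples are illustrative assumptions rather than the exact recipe of Ref. [13].

import torch
import torch.nn.functional as F

def pgd_adversarial_training(model, loader, optimizer, epochs=10,
                             eps=8/255, alpha=2/255, pgd_steps=7):
    # Solve the min-max problem of Eq. (15): inner max by PGD, outer min by standard SGD.
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            # Inner maximization: craft adversarial samples (pgd_attack as sketched in Section 3).
            x_adv = pgd_attack(model, x, y, eps=eps, alpha=alpha, steps=pgd_steps)
            # Outer minimization: one standard training step on the adversarial samples.
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()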
In Ref. [5], the model is trained on a hybrid loss 𝑐𝐽(𝜃, 𝑥, 𝑦) + (1 − 𝑐)𝐽(𝜃, 𝑥 + 𝜖 ⋅ sign(∇𝑥𝐽(𝜃, 𝑥, 𝑦)), 𝑦), where 𝑥 + 𝜖 ⋅ sign(∇𝑥𝐽(𝜃, 𝑥, 𝑦)) is the FGSM-generated adversarial sample for the benign sample 𝑥, and 𝑐 is a hyperparameter used to balance the accuracy on benign and adversarial samples. Experiments in Ref. [5] show that the network becomes somewhat robust to FGSM-generated adversarial samples. Specifically, with adversarial training, the error rate on adversarial samples dramatically fell from 89.4% to 17.9%. However, the trained model is still vulnerable to iterative/optimization-based adversarial attacks despite its effectiveness when defending against FGSM-generated adversarial samples. Therefore, a number of studies further investigate adversarial training with stronger adversarial attacks, such as BIM/PGD attacks.
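The hybrid loss described above can be sketched as follows, assuming cross-entropy as 𝐽 and illustrative values of 𝜖 and 𝑐.

import torch
import torch.nn.functional as F

def fgsm_mixed_loss(model, x, y, eps=0.25, c=0.5):
    # Hybrid loss c * J(theta, x, y) + (1 - c) * J(theta, x + eps * sign(grad_x J), y).
    x = x.clone().detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(clean_loss, x, retain_graph=True)[0]
    x_adv = torch.clamp(x + eps * grad.sign(), 0.0, 1.0).detach()    # FGSM-generated sample
    adv_loss = F.cross_entropy(model(x_adv), y)
    return c * clean_loss + (1 - c) * adv_loss

# During training: loss = fgsm_mixed_loss(model, x, y); loss.backward(); optimizer.step()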
where 𝐽(𝜃, 𝑥, 𝑦) is the original loss, 𝐽(𝜃, 𝑥, 𝑥′) is the cross-entropy between the logits of 𝑥 and 𝑥′, and 𝑐 is a hyperparameter. Experiments in Ref. [62] show that this pairing loss helps improve the performance of PGD adversarial training on several benchmarks, such as SVHN, CIFAR-10, and Tiny ImageNet. Concretely, it is claimed in Ref. [62] that ALP increases the accuracy of the Inception-v3 model under the white-box PGD attack from 1.5% to 27.9%. Its performance is almost as good as that of EAT against black-box attacks. However, the work in Ref. [64] evaluates the robustness of an ALP-trained ResNet and discovers that the ResNet only achieves a 0.6% correct classification rate under the targeted attack considered in Ref. [62]. The authors also point out that ALP is less amenable to gradient descent, since ALP sometimes induces a “bumpier,” that is, more depressed, loss landscape tightly around the input points. Therefore, ALP might not be as robust as expected in Ref. [62].
5.2 Randomization
Many recent defenses resort to randomization schemes for mitigating the effects of adversarial perturbations in the input/feature domain. The intuition behind this type of defense is that DNNs are naturally robust to random perturbations. A randomization-based defense attempts to convert the adversarial effects into random effects, which are not a concern for most DNNs. Randomization-based defenses have achieved comparable performance under black-box and gray-box settings, but in the white-box setting, the EoT method [28] can compromise most of them by taking the randomization process into account during attack generation. In this section, we present details of several typical randomization-based defenses and
introduce their performance against various attacks in different settings.
Fig. 8. The pipeline of the randomization-based defense mechanism proposed by Xie et al. [67]: The input image is first
randomly resized and then randomly padded.
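A minimal sketch of this input randomization is given below; the resizing range, padding scheme, and interpolation mode are illustrative assumptions in the spirit of Ref. [67] rather than its exact configuration.

import random
import torch.nn.functional as F

def random_resize_and_pad(x, final_size=331, resize_range=(299, 330)):
    # x: image batch of shape [N, C, H, W]; randomly resize, then randomly zero-pad to final_size.
    new_size = random.randint(*resize_range)
    x = F.interpolate(x, size=(new_size, new_size), mode='nearest')
    pad_total = final_size - new_size
    pad_left = random.randint(0, pad_total)
    pad_top = random.randint(0, pad_total)
    return F.pad(x, (pad_left, pad_total - pad_left, pad_top, pad_total - pad_top), value=0.0)

# At inference time: logits = model(random_resize_and_pad(x))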
Fig. 9. The architecture of RSE [69]. FC: fully connected layer; Fin: the input vector of the noise layer; Fout: the output vector of the noise layer; 𝜖: the perturbation, which follows the Gaussian distribution 𝒩(0, σ²).
also reduce the accuracy of SAP to 0 with 8/255 𝐿∞ adversarial perturbations. Luo et al. [74] introduce a
new CNN structure by randomly masking the feature maps output from the convolutional layers. By
randomly masking the output features, each filter only extracts the features from partial positions. The
authors claim that this assists the filters in learning features that are distributed consistently with respect to the
mask pattern; hence, the CNN can capture more information on the spatial structures of local features.
5.3 Denoising
Denoising is a very straightforward method in terms of mitigating adversarial perturbations/effects.
Previous works point out two directions to design such a defense: input denoising and feature denoising.
The first direction attempts to partially or fully remove the adversarial perturbations from the inputs, and
the second direction attempts to alleviate the effects of adversarial perturbations on high-level features
learned by DNNs. In this section, we elaborate on several well-known defenses in both directions.
Fig. 10. The feature-squeezing framework proposed by Xu et al. [75]. d1 and d2: the difference between the model’s prediction on a squeezed input and its prediction on the original input; H: the threshold used to detect adversarial examples.
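A simplified sketch of one squeezer (color bit-depth reduction) and the detection rule is shown below; the full framework of Ref. [75] combines several squeezers and a calibrated threshold, so the single-squeezer setup and the function names here are assumptions for illustration.

import torch
import torch.nn.functional as F

def bit_depth_squeeze(x, bits=4):
    # Reduce the color bit depth of an image in [0, 1] (one of the squeezers used in Ref. [75]).
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def feature_squeezing_detect(model, x, threshold=1.0):
    # Flag inputs whose predictions change too much after squeezing (L1 distance d vs. threshold H).
    p_orig = F.softmax(model(x), dim=1)
    p_squeezed = F.softmax(model(bit_depth_squeeze(x)), dim=1)
    d = (p_orig - p_squeezed).abs().sum(dim=1)
    return d > threshold                  # True indicates a likely adversarial example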
adversarial sample by using it as input, and outputs a benign counterpart. Although APE-GAN achieves
a good performance in the testbed of Ref. [80], the adaptive white-box CW2 attack proposed in Ref. [81]
can easily defeat APE-GAN.
Fig. 11. The pipeline of Defense-GAN [79]. G: the generative model, which can generate a high-dimensional input sample from a low-dimensional vector z; R: the number of random vectors generated by the random number generator.
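The reconstruction step of this pipeline can be sketched as follows; the pretrained generator G and classifier, the latent dimension, and the optimizer settings are assumptions for illustration, with R random restarts and L gradient-descent steps as indicated in the caption above.

import torch

def defense_gan_classify(G, classifier, x, z_dim=128, R=10, L=200, lr=0.05):
    # Project x onto the range of the generator G, then classify the reconstruction G(z*).
    best_z, best_err = None, float('inf')
    for _ in range(R):                                   # R random restarts
        z = torch.randn(x.size(0), z_dim, requires_grad=True)
        opt = torch.optim.SGD([z], lr=lr)
        for _ in range(L):                               # L gradient-descent steps on ||G(z) - x||^2
            opt.zero_grad()
            err = ((G(z) - x) ** 2).mean()
            err.backward()
            opt.step()
        if err.item() < best_err:
            best_err, best_z = err.item(), z.detach()
    return classifier(G(best_z))                         # prediction on the "purified" sample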
where 𝜙 is a candidate set for all the distributions around the benign data, which can be constructed by f-divergence balls [90] or Wasserstein balls [91], and 𝜑 is a distribution sampled from the candidate set 𝜙.
Optimization over this distributional objective is equivalent to minimizing the empirical risk over
all the samples in the neighborhood of the benign data—that is, all the candidates for the adversarial samples. Since 𝜙 affects the computability and direct optimization over an arbitrary 𝜙 is intractable, the work in Ref. [89] derives tractable sets 𝜙 using the Wasserstein distance metric with computationally efficient relaxations that are computable even when 𝐽(𝜃, 𝑥, 𝑦) is non-convex. In fact, the work in Ref. [89] also
provides an adversarial training procedure with provable guarantees on its computational and statistical
performance. In the proposed training procedure, it incorporates a penalty to characterize the adversarial
robustness region. Since optimization over this penalty is intractable, the authors propose a Lagrangian
relaxation for the penalty and achieve robust optimization over the proposed distributional loss. In addition, the authors derive guarantees for the empirical minimizer of the robust saddle-point problem and give specialized bounds for domain adaptation problems, which also shed light on the distributional robustness certification.
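A minimal sketch of this Lagrangian-relaxed inner maximization is given below; the squared 𝐿2 transport cost, the penalty coefficient, and the ascent schedule are illustrative assumptions in the spirit of Ref. [89].

import torch
import torch.nn.functional as F

def wrm_inner_max(model, x, y, gamma=1.3, steps=15, lr=0.1):
    # Lagrangian relaxation: maximize J(theta, x', y) - gamma * ||x' - x||_2^2 over x'.
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        penalty = gamma * ((x_adv - x) ** 2).flatten(1).sum(dim=1).mean()
        objective = F.cross_entropy(model(x_adv), y) - penalty
        (-objective).backward()          # gradient ascent via minimizing the negated objective
        opt.step()
    return x_adv.detach()                # used as the training sample in the outer minimization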
Since Ω(√(𝑑𝑛 log 𝑛)) is too large for real datasets with a high data dimension and numerous samples, the authors propose an efficient 1-nearest neighbor algorithm. Based on the observation that the 1-nearest neighbor is robust when oppositely labeled points are far apart, the proposed algorithm removes the nearby oppositely labeled points and keeps the points whose neighbors share the same label. On MNIST, for small adversarial perturbations (low attack radii), this algorithm followed by 1-nearest neighbor-based classification performs slightly worse than the other defenses, such as an adversarially trained classifier, while
it outperforms those defenses in the case of large attack radii. Papernot et al. [98] propose a KNN-based
defensive mechanism called DkNN by executing the KNN algorithm on the representations of the data
learned by each layer of the DNN. The KNN algorithm is mainly used to estimate the abnormality of a
prediction on the test input. The prediction is considered abnormal when the intermediate representations
learned by the DNN are not close to the representations of those training samples that share the same
label with the prediction. Experiments show that DkNN significantly improves the accuracy of the DNN
under multiple adversarial attacks, especially under the C&W attack.
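A much-simplified sketch of the per-layer nearest-neighbor check is shown below; the calibration set and the p-value computation of the actual DkNN [98] are omitted, and the data layout (precomputed per-layer representations) is an assumption made for illustration.

import torch

def dknn_nonconformity(train_feats_per_layer, train_labels, test_feats_per_layer, pred_label, k=50):
    # Count, over all layers, how many of the k nearest training points disagree with the prediction.
    nonconformity = 0
    for train_feats, test_feat in zip(train_feats_per_layer, test_feats_per_layer):
        # Euclidean distances between the test representation and all training representations.
        dists = torch.cdist(test_feat.unsqueeze(0), train_feats).squeeze(0)
        knn_idx = dists.topk(k, largest=False).indices
        nonconformity += (train_labels[knn_idx] != pred_label).sum().item()
    return nonconformity   # large values indicate an abnormal (possibly adversarial) prediction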
6. Discussions
method with this gradient estimation achieved first place in the NIPS 2017 Challenge (under a black-box
setting) [18]. Chen et al. [104] investigate another black-box setting, where additional query access is
granted to the adversaries. Therefore, the adversaries can infer the gradients from the output of the target
model given well-designed inputs. In this setting, the proposed design can apply a zeroth-order method to give a much better estimation of the model gradients. However, a drawback of this method is its requirement for a large number of queries, which is proportional to the data dimension.
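A minimal sketch of such finite-difference gradient estimation is given below; the coordinate-sampling strategy and the step size are illustrative assumptions, and the full ZOO attack [104] additionally uses coordinate-wise optimization and dimension-reduction tricks that are not shown here.

import torch

def finite_difference_gradient(loss_fn, x, h=1e-4, num_coords=128):
    # Estimate grad_x loss_fn(x) by symmetric finite differences over a random subset of coordinates.
    # loss_fn only needs query access to the black-box model and must return a scalar loss.
    flat = x.flatten()
    grad = torch.zeros_like(flat)
    coords = torch.randperm(flat.numel())[:num_coords]
    for i in coords:
        e = torch.zeros_like(flat)
        e[i] = h
        grad[i] = (loss_fn((flat + e).view_as(x)) - loss_fn((flat - e).view_as(x))) / (2 * h)
    return grad.view_as(x)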
where 𝒹 is the data dimension. Schmidt et al. [113] show that adversarially robust generalization requires more data than common ML tasks, and the gap in the required data size scales in Ο(√𝒹).
Existence of a general robust decision boundary. Since there are numerous adversarial attacks
defined under different metrics, a natural question is: Is there a general robust decision boundary that can
be learned by a certain kind of DNN with a particular training strategy? At present, the answer to this
question is “no.” Although PGD adversarial training demonstrates remarkable resistance against a wide
range of 𝐿∞ attacks, Sharma et al. [59] show that it is still vulnerable to adversarial attacks measured by
other 𝐿𝑝 norms, such as EAD and CW2. Recently, Khoury et al. [111] prove that the optimal 𝐿2 and 𝐿∞
decision boundaries are different for a two-concentric-sphere dataset, and their disparity grows with the
codimension of the dataset—that is, the difference between the dimensions of the data manifold and the
whole data space.
Effective and efficient defense against white-box attacks. To the best of our knowledge, no defense that can achieve a balance between effectiveness and efficiency has been proposed. In terms of effectiveness, adversarial training demonstrates the best performance but at a substantial computational cost. In terms of efficiency, the configuration of many randomization-based and denoising-based defenses/detection systems only takes a few seconds. However, many recent works [17,84,114,115] show
that those schemes are not as effective as they claim to be. Certificated defenses indicate a way to reach
theoretically guaranteed security, but both their accuracy and their efficiency are far from meeting the
practical requirements.
7. Conclusions
In this paper, we have presented a general overview of the recent representative adversarial attack and
defense techniques. We have investigated the ideas and methodologies of the proposed methods and
algorithms. We have also discussed the effectiveness of these adversarial defenses based on the most
recent advances. New adversarial attacks and defenses developed in the past two years have been elaborated. Some fundamental problems, such as the causation of adversarial samples and the existence of a
general robust boundary, have also been investigated. We have observed that there is still no existing
defense mechanism that achieves both efficiency and effectiveness against adversarial samples. The most
effective defense mechanism, which is adversarial training, is too computationally expensive for practical
deployment, while many efficient heuristic defenses have been demonstrated to be vulnerable to adaptive
white-box adversaries. We have also discussed several open problems and challenges in this critical area, in the hope of providing a useful research guideline to boost its further development.
Acknowledgement
This work has been supported by Ant Financial-Zhejiang University Financial Technology Research
Center.
Compliance with ethics guidelines
Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu declare that they have no conflict of interest or
financial conflicts to disclose.
References
[1] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Proceedings of
the 26th Conference on Neural Information Processing Systems; 2012 Dec 3–6; Nevada, USA; 2012. p. 1097–105.
[2] Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using
RNN encoder-decoder for statistical machine translation. 2014. arXiv:1406.1078.
[3] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural
networks and tree search. Nature 2016;529(7587):484–9.
[4] Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, et al. Intriguing properties of neural networks. 2013.
arXiv:1312.6199.
[5] Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. 2014. arXiv:1412.6572.
[6] Kurakin A, Goodfellow I, Bengio S. Adversarial examples in the physical world. 2016. arXiv:1607.02533.
[7] Zheng T, Chen C, Ren K. Distributionally adversarial attack. 2018. arXiv:1808.05537.
[8] Carlini N, Wagner D. Towards evaluating the robustness of neural networks. In: Proceedings of the 2017 IEEE Symposium
on Security and Privacy; 2017 May 22–26; San Jose, CA, USA; 2017. p. 39–57.
[9] Papernot N, McDaniel P, Jha S, Fredrikson M, Celik ZB, Swami A. The limitations of deep learning in adversarial settings.
In: Proceedings of the 2016 IEEE European Symposium on Security and Privacy; 2016 Mar 21–24; Saarbrucken, Germany;
2016. p. 372–87.
[10] Moosavi-Dezfooli SM, Fawzi A, Frossard P. DeepFool: a simple and accurate method to fool deep neural networks. In:
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV,
USA; 2016. p. 2574–82.
[11] Papernot N, McDaniel P, Goodfellow I. Transferability in machine learning: from phenomena to black-box attacks using
adversarial samples. 2016. arXiv:1605.07277.
[12] Liu Y, Chen X, Liu C, Song D. Delving into transferable adversarial examples and black-box attacks. 2016.
arXiv:1611.02770.
[13] Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. 2017.
arXiv: 1706.06083.
[14] Xie C, Wu Y, van der Maaten L, Yuille A, He K. Feature denoising for improving adversarial robustness. 2018.
arXiv:1812.03411.
[15] Zheng T, Chen C, Yuan J, Li B, Ren K. PointCloud saliency maps. 2018. arXiv:1812.01687.
[16] Li J, Ji S, Du T, Li B, Wang T. TextBugger: generating adversarial text against real-world applications. 2018.
arXiv:1812.05271.
[17] Athalye A, Carlini N, Wagner D. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial
examples. 2018. arXiv:1802.00420.
[18] Dong Y, Liao F, Pang T, Su H, Zhu J, Hu X, et al. Boosting adversarial attacks with momentum. In: Proceedings of the 2018
IEEE Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23; Salt Lake City, UT, USA; 2018. p. 9185–
193.
[19] Chen PY, Sharma Y, Zhang H, Yi J, Hsieh CJ. EAD: elastic-net attacks to deep neural networks via adversarial examples.
In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence; 2018 Feb 2–7; New Orleans, LA, USA;
2018.
[20] Moosavi-Dezfooli SM, Fawzi A, Fawzi O, Frossard P. Universal adversarial perturbations. In: Proceedings of the 2017 IEEE
Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA; 2017. p. 1765–73.
[21] Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. Caffe: convolutional architecture for fast feature
embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia; 2014 Nov 3–7; Orlando, FL, USA;
2014. p.675–8.
[22] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the
2015 IEEE Conference on Computer Vision and Pattern Recognition; 2015 Jun 7–12; Boston, MA, USA; 2015. p. 1–9.
[23] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv:1409.1556.
[24] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference
on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV, USA; 2016. p. 770–8.
[25] Sharif M, Bhagavatula S, Bauer L, Reiter MK. Accessorize to a crime: real and stealthy attacks on state-of-the-art face
recognition. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security; 2016 Oct
24–28; Vienna, Austria; 2016. p. 1528–40.
[26] Parkhi OM, Vedaldi A, Zisserman A. Deep face recognition. In: Proceedings of the British Machine Vision Conference; 2015 Sep 7–10; Swansea, UK; 2015.
[27] Brown TB, Mané D, Roy A, Abadi M, Gilmer J. Adversarial patch. 2017. arXiv:1712.09665.
[28] Athalye A, Engstrom L, Ilyas A, Kwok K. Synthesizing robust adversarial examples. 2017. arXiv:1707.07397.
[29] Liu Y, Ma S, Aafer Y, Lee WC, Zhai J, Wang W, et al. Trojaning attack on neural networks. In: Proceedings of Network
and Distributed Systems Security Symposium; 2018 Feb 18–21; San Diego, CA, USA; 2018.
[30] Xiao C, Li B, Zhu JY, He W, Liu M, Song D. Generating adversarial examples with adversarial networks. 2018.
arXiv:1801.02610.
[31] Song Y, Shu R, Kushman N, Ermon S. Constructing unrestricted adversarial examples with generative models. In: Proceed-
ings of the 32nd Conference on Neural Information Processing Systems; 2018 Dec 3–8; Montréal, Canada; 2018. p. 8312–
23.
[32] Odena A, Olah C, Shlens J. Conditional image synthesis with auxiliary classifier GANs. In: Proceedings of the 34th Inter-
national Conference on Machine Learning; 2017 Aug 6–11; Sydney, Australia; 2017. p. 2642–51.
[33] Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Xiao C, et al. Robust physical-world attacks on deep learning visual
classification. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23;
Salt Lake City, UT, USA; 2018. p. 1625–34.
[34] Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Proceedings of
the International Conference on Medical Image Computing and Computer-Assisted Intervention; 2015 Oct 5–9; Munich,
Germany; 2015. p. 234–41.
[35] Grundmann M, Kwatra V, Han M, Essa I. Efficient hierarchical graph-based video segmentation. In: Proceedings of the 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2010 Jun 13–18; San Francisco, CA, USA;
2010. p. 2141–8.
[36] Su H, Maji S, Kalogerakis E, Learned-Miller E. Multi-view convolutional neural networks for 3D shape recognition. In:
Proceedings of the IEEE International Conference on Computer Vision; 2015 Dec 7–13; Santiago, Chile; 2015. p.945–53.
[37] Qi CR, Su H, Mo K, Guibas LJ. Pointnet: deep learning on point sets for 3D classification and segmentation. In: Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA; 2017. p.
652–60.
[38] Lee H, Pham P, Largman Y, Ng AY. Unsupervised feature learning for audio classification using convolutional deep belief
networks. In: Proceedings of the 23rd Conference on Neural Information Processing Systems; 2009 Dec 7–10; Vancouver,
Canada; 2009. p. 1096–104.
[39] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforce-
ment learning. Nature 2015;518(7540): 529–33.
[40] Xie C, Wang J, Zhang Z, Zhou Y, Xie L, Yuille A. Adversarial examples for semantic segmentation and object detection.
In: Proceedings of the 2017 IEEE International Conference on Computer Vision; 2017 Oct 22–29; Venice, Italy; 2017. p.
1369–78.
[41] Cisse M, Adi Y, Neverova N, Keshet J. Houdini: fooling deep structured prediction models. 2017. arXiv:1707.05373.
[42] Qi CR, Yi L, Su H, Guibas LJ. Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Proceedings
of the 31st Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, USA; 2017. p. 5099–
108.
[43] Wang Y, Sun Y, Liu Z, Sarma SE, Bronstein MM, Solomon JM. Dynamic graph CNN for learning on point clouds. 2018.
arXiv:1801.07829.
[44] Xiang C, Qi CR, Li B. Generating 3D adversarial point clouds. 2018. arXiv:1809.07016.
[45] Liu D, Yu R, Su H. Extending adversarial attacks and defenses to deep 3D point cloud classifiers. 2019. arXiv:1901.03006.
[46] Xiao C, Yang D, Li B, Deng J, Liu M. MeshAdv: adversarial meshes for visual recognition. 2018. arXiv:1810.05206v2.
[47] Carlini N, Wagner D. Audio adversarial examples: targeted attacks on speech-to-text. In: Proceedings of 2018 IEEE Security
and Privacy Workshops; 2018 May 24; San Francisco, CA, USA; 2018. p. 1–7.
[48] Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, et al. Deep speech: scaling up end-to-end speech recognition.
2014. arXiv:1412.5567.
[49]Yakura H, Sakuma J. Robust audio adversarial example for a physical attack. 2018. arXiv:1810.11793.
[50]Liang B, Li H, Su M, Bian P, Li X, Shi W. Deep text classification can be fooled. 2017. arXiv:1704.08006.
[51]Huang S, Papernot N, Goodfellow I, Duan Y, Abbeel P. Adversarial attacks on neural network policies. 2017.
arXiv:1702.02284.
[52]Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with deep reinforcement learning.
2013. arXiv:1312.5602.
[53] Mnih V, Badia AP, Mirza M, Graves A, Harley T, Lillicrap TP, et al. Asynchronous methods for deep reinforcement learning.
In: Proceedings of the 33rd International Conference on Machine Learning; 2016 Jun 19–24; New York, NY, USA; 2016. p.
1928–37.
[54]Schulman J, Levine S, Moritz P, Jordan M, Abbeel P. Trust region policy optimization. In: Proceedings of the 32nd Interna-
tional Conference on Machine Learning; 2015 Jul 6–11; Lille, France; 2015. p. 1889–97.
[55]Behzadan V, Munir A. Vulnerability of deep reinforcement learning to policy induction attacks. In: Proceedings of the Inter-
national Conference on Machine Learning and Data Mining in Pattern Recognition; 2017 Jul 15–20; New York, NY, USA;
2017. p. 262–75.
[56]Lin YC, Hong ZW, Liao YH, Shih ML, Liu MY, Sun M. Tactics of adversarial attack on deep reinforcement learning agents.
2017. arXiv:1703.06748.
[57]Carlini N, Katz G, Barrett C, Dill DL. Ground-truth adversarial examples. In: ICLR 2018 Conference; 2018 Apr 30; Vancou-
ver, BC, Canada; 2018.
[58]Papernot N, Faghri F, Carlini N, Goodfellow I, Feinman R, Kurakin A, et al. Technical report on the CleverHans v2.1.0
adversarial examples library. 2016. arXiv:1610.00768v6.
[59]Sharma Y, Chen PY. Attacking the Madry defense model with L1-based adversarial examples. 2017. arXiv:1710.10733v4.
[60]Kurakin A, Goodfellow I, Bengio S. Adversarial machine learning at scale. 2016. arXiv: 1611.01236.
[61]Tramèr F, Kurakin A, Papernot N, Goodfellow I, Boneh D, McDaniel P. Ensemble adversarial training: attacks and defenses.
2017. arXiv:1705.07204.
[62]Kannan H, Kurakin A, Goodfellow I. Adversarial logit pairing. 2018. arXiv:1803.06373.
[63]Zheng S, Song Y, Leung T, Goodfellow I. Improving the robustness of deep neural networks via stability training. In: Pro-
ceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV, USA;
2016. p. 4480–8.
[64]Engstrom L, Ilyas A, Athalye A. Evaluating and understanding the robustness of adversarial logit pairing. 2018. arXiv:
1807.10272.
[65]Lee H, Han S, Lee J. Generative adversarial trainer: defense to adversarial perturbations with GAN. 2017. arXiv: 1705.03387.
[66]Liu X, Hsieh CJ. Rob-GAN: generator, discriminator, and adversarial attacker. 2018. arXiv:1807.10454v3.
[67]Xie C, Wang J, Zhang Z, Ren Z, Yuille A. Mitigating adversarial effects through randomization. 2017. arXiv: 1711.01991.
[68]Guo C, Rana M, Cisse M, van der Maaten L. Countering adversarial images using input transformations. 2017. arXiv:
1711.00117.
[69]Liu X, Cheng M, Zhang H, Hsieh CJ. Towards robust neural networks via random self-ensemble. In: Proceedings of the 2018
European Conference on Computer Vision; 2018 Sep 8–14; Munich, Germany; 2018. p. 369–85.
[70]Lecuyer M, Atlidakis V, Geambasu R, Hsu D, Jana S. Certified robustness to adversarial examples with differential privacy.
2018, arXiv:1802.03471v4.
[71]Dwork C, Lei J. Differential privacy and robust statistics. In: Proceedings of the 41st Annual ACM Symposium on Theory
of Computing; 2009 May 31– Jun 2; Bethesda, MD, USA; 2009. p. 371–80.
[72]Li B, Chen C, Wang W, Carin L. Certified adversarial robustness with additive noise. 2018. arXiv: 1809.03113v6.
[73]Dhillon GS, Azizzadenesheli K, Lipton ZC, Bernstein J, Kossaifi J, Khanna A, et al. Stochastic activation pruning for robust
adversarial defense. 2018. arXiv: 1803.01442.
[74] Luo T, Cai T, Zhang M, Chen S, Wang L. Random mask: towards robust convolutional neural networks. In: ICLR 2019
Conference; 2019 Apr 30; New Orleans, LA, USA; 2019.
[75]Xu W, Evans D, Qi Y. Feature squeezing: detecting adversarial examples in deep neural networks. 2017. arXiv: 1704.01155.
[76]Xu W, Evans D, Qi Y. Feature squeezing mitigates and detects Carlini/Wagner adversarial examples. 2017. arXiv:
1705.10686.
[77]He W, Wei J, Chen X, Carlini N, Song D. Adversarial example defenses: ensembles of weak defenses are not strong. 2017.
arXiv: 1706.04701.
[78]Sharma Y, Chen PY. Bypassing feature squeezing by increasing adversary strength. 2018. arXiv:1803.09868.
[79]Samangouei P, Kabkab M, Chellappa R. Defense-GAN: protecting classifiers against adversarial attacks using generative
models. 2018. arXiv:1805.06605.
[80] Shen S, Jin G, Gao K, Zhang Y. APE-GAN: adversarial perturbation elimination with GAN. 2017. arXiv: 1707.05474.
[81]Carlini N, Wagner D. MagNet and "efficient defenses against adversarial attacks" are not robust to adversarial examples.
2017. arXiv:1711.08478.
[82]Meng D, Chen H. MagNet: a two-pronged defense against adversarial examples. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security; 2017 Oct 30– Nov 3; New York, NY, USA; 2017. p. 135–47.
[83]Liao F, Liang M, Dong Y, Pang T, Hu X, Zhu J. Defense against adversarial attacks using high-level representation guided
denoiser. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23; Salt
Lake City, UT, USA; 2018. p. 1778–87.
[84]Athalye A, Carlini N. On the robustness of the CVPR 2018 white-box adversarial example defenses. 2018. arXiv:1804.03286.
[85]Raghunathan A, Steinhardt J, Liang P. Certified defenses against adversarial examples. 2018. arXiv:1801.09344.
[86]Raghunathan A, Steinhardt J, Liang P. Semidefinite relaxations for certifying robustness to adversarial examples. In: Pro-
ceedings of the 32nd Conference on Neural Information Processing Systems; 2018 Dec 3–8; Montréal, Canada; 2018. p.
10877–87.
[87]Wong E, Kolter JZ. Provable defenses against adversarial examples via the convex outer adversarial polytope. In: Proceedings
of the 31st Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, USA; 2017.
[88]Wong E, Schmidt FR, Metzen JH, Kolter JZ. Scaling provable adversarial defenses. 2018. arXiv:1805.12514.
[89] Sinha A, Namkoong H, Duchi J. Certifying some distributional robustness with principled adversarial training. 2017.
arXiv:1710.10571.
[90]Namkoong H, Duchi JC. Stochastic gradient methods for distributionally robust optimization with f-divergences. In: Pro-
ceedings of the 30th Conference on Neural Information Processing Systems; 2016 Dec 5–10; Barcelona, Spain; 2016. p. 2208–
16.
[91]Gao R, Kleywegt AJ. Distributionally robust stochastic optimization with Wasserstein distance. 2016. arXiv:1604.02199.
[92]Guo Y, Zhang C, Zhang C, Chen Y. Sparse DNNs with improved adversarial robustness. In: Proceedings of the 32nd Con-
ference on Neural Information Processing Systems; 2018 Dec 3–8; Montréal, Canada; 2018. p. 242–51.
[93]Hein M, Andriushchenko M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In: Pro-
ceedings of the 31st Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, USA; 2017. p.
2266–76.
[94]Weng TW, Zhang H, Chen PY, Yi J, Su D, Gao Y, et al. Evaluating the robustness of neural networks: an extreme value
theory approach. 2018. arXiv:1801.10578.
[95]Xiao KY, Tjeng V, Shafiullah NM, Madry A. Training for faster adversarial robustness verification via inducing ReLU sta-
bility. 2018. arXiv:1809.03008.
[96]Katz G, Barrett C, Dill DL, Julian K, Kochenderfer MJ. Reluplex: an efficient SMT solver for verifying deep neural networks.
In: Proceedings of the International Conference on Computer Aided Verification; 2017 Jul 24–28; Heidelberg, Germany; 2017. p. 97–117.
[97]Wang Y, Jha S, Chaudhuri K. Analyzing the robustness of nearest neighbors to adversarial examples. 2017 . arXiv:
1706.03922.
[98]Papernot N, McDaniel P. Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. 2018.
arXiv:1803.04765.
[99] Liu X, Li Y, Wu C, Hsieh C. Adv-BNN: improved adversarial defense through robust Bayesian neural network. 2018.
arXiv:1810.01279.
[100] Neal RM. Bayesian learning for neural networks. New York: Springer Science & Business Media; 2012.
[101] Schott L, Rauber J, Bethge M, Brendel W. Towards the first adversarially robust neural network model on MNIST. 2018.
arXiv:1805.09190.
[102]Xiao C, Deng R, Li B, Yu F, Liu M, Song D. Characterizing adversarial examples based on spatial consistency information
for semantic segmentation. In: Proceedings of the European Conference on Computer Vision; 2018 Sep 8–14; Munich, Ger-
many; 2018. p. 217–34.
[103]Yang Z, Li B, Chen PY, Song D. Characterizing audio adversarial examples using temporal dependency. 2018.
arXiv:1809.10875.
[104]Chen PY, Zhang H, Sharma Y, Yi J, Hsieh CJ. Zoo: zeroth order optimization based black-box attacks to deep neural
networks without training substitute models. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Secu-
rity; 2017 Nov 3; Dallas, TX, USA; 2017. p. 15–26.
[105]Cao Y, Xiao C, Yang D, Fang J, Yang R, Liu M, et al. Adversarial objects against LiDAR-based autonomous driving
systems. 2019. arXiv:1907.05418.
[106]Fawzi A, Fawzi O, Frossard P. Analysis of classifiers’ robustness to adversarial perturbations. Mach Learn
2018;107(3):481–508.
[107]Mirman M, Gehr T, Vechev M. Differentiable abstract interpretation for provably robust neural networks. In: Proceedings
of the 35th International Conference on Machine Learning; 2018 Jul 10–15; Stockholm, Sweden; 2018. p. 3578–86.
[108]Singh G, Gehr T, Mirman M, Puschel M, Vechev M. Fast and effective robustness certification. In: Proceedings of the 32nd
Conference on Neural Information Processing Systems; 2018 Dec 3–8; Montréal, Canada; 2018. p. 10802–13.
[109] Gowal S, Dvijotham K, Stanforth R, Bunel R, Qin C, Uesato J, et al. On the effectiveness of interval bound propagation for
training verifiably robust models. 2018. arXiv:1810.12715.
[110]Dube S. High dimensional spaces, deep learning and adversarial examples. 2018. arXiv:1801.00634.
[111]Khoury M, Hadfield-Menell D. On the geometry of adversarial examples. 2018. arXiv:1811.00525.
[112] Gilmer J, Metz L, Faghri F, Schoenholz SS, Raghu M, Wattenberg M, et al. Adversarial spheres. 2018. arXiv:1801.02774.
[113]Schmidt L, Santurkar S, Tsipras D, Talwar K, Madry A. Adversarially robust generalization requires more data. 2018.
arXiv:1804.11285.
[114]Carlini N, Wagner D. Adversarial examples are not easily detected: bypassing ten detection methods. In: Proceedings of the
10th ACM Workshop on Artificial Intelligence and Security; 2017 Nov 3; Dallas, TX, USA; 2017. p. 3–14.
[115]Carlini N. Is AmI (attacks meet interpretability) robust to adversarial examples? 2019. arXiv:1902.02322v1.