Adversarial examples are not bugs, they are features

The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

Citation: Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., et al. 2019. "Adversarial examples are not bugs, they are features." Advances in Neural Information Processing Systems, 32.

As Published: https://papers.nips.cc/paper/2019/hash/e2c420d928d4bf8ce0ff2ec19b371514-Abstract.html

Version: Final published version

Citable link: https://hdl.handle.net/1721.1/137500

Terms of Use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
Adversarial Examples are not Bugs, they are Features

Andrew Ilyas∗        Shibani Santurkar∗        Dimitris Tsipras∗
MIT                  MIT                       MIT

Logan Engstrom∗      Brandon Tran              Aleksander Mądry
MIT                  MIT                       MIT
Abstract

Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate
that adversarial examples can be directly attributed to the presence of non-robust
features: features (derived from patterns in the data distribution) that are highly
predictive, yet brittle and (thus) incomprehensible to humans. After capturing
these features within a theoretical framework, we establish their widespread ex-
istence in standard datasets. Finally, we present a simple setting where we can
rigorously tie the phenomena we observe in practice to a misalignment between
the (human-specified) notion of robustness and the inherent geometry of the data.

1 Introduction
The pervasive brittleness of deep neural networks [Sze+14; Eng+19b; HD19; Ath+18] has attracted
significant attention in recent years. Particularly worrisome is the phenomenon of adversarial ex-
amples [Big+13; Sze+14], imperceptibly perturbed natural inputs that induce erroneous predictions
in state-of-the-art classifiers. Previous work has proposed a variety of explanations for this phe-
nomenon, ranging from theoretical models [Sch+18; BPR18] to arguments based on concentration
of measure in high-dimensions [Gil+18; MDM18; Sha+19a]. These theories, however, are often
unable to fully capture behaviors we observe in practice (we discuss this further in Section 5).
More broadly, previous work in the field tends to view adversarial examples as aberrations arising
either from the high dimensional nature of the input space or statistical fluctuations in the training
data [GSS15; Gil+18]. From this point of view, it is natural to treat adversarial robustness as a goal
that can be disentangled and pursued independently from maximizing accuracy [Mad+18; SHS19;
Sug+19], either through improved standard regularization methods [TG16] or pre/post-processing
of network inputs/outputs [Ues+18; CW17a; He+17].
In this work, we propose a new perspective on the phenomenon of adversarial examples. In con-
trast to the previous models, we cast adversarial vulnerability as a fundamental consequence of the
dominant supervised learning paradigm. Specifically, we claim that:

Adversarial vulnerability is a direct result of sensitivity to well-generalizing features in the data.

Recall that we usually train classifiers to solely maximize (distributional) accuracy. Consequently,
classifiers tend to use any available signal to do so, even those that look incomprehensible to hu-
mans. After all, the presence of “a tail” or “ears” is no more natural to a classifier than any other
equally predictive feature. In fact, we find that standard ML datasets do admit highly predictive yet imperceptible features. We posit that our models learn to rely on these “non-robust” features,
leading to adversarial perturbations that exploit this dependence.2
Our hypothesis also suggests an explanation for adversarial transferability: the phenomenon that
perturbations computed for one model often transfer to other, independently trained models. Since
any two models are likely to learn similar non-robust features, perturbations that manipulate such
features will apply to both. Finally, this perspective establishes adversarial vulnerability as a human-
centric phenomenon, since, from the standard supervised learning point of view, non-robust features
can be as important as robust ones. It also suggests that approaches aiming to enhance the inter-
pretability of a given model by enforcing “priors” for its explanation [MV15; OMS17; Smi+17]
actually hide features that are “meaningful” and predictive to standard models. As such, produc-
ing human-meaningful explanations that remain faithful to underlying models cannot be pursued
independently from the training of the models themselves.
To corroborate our theory, we show that it is possible to disentangle robust from non-robust features
in standard image classification datasets. Specifically, given a training dataset, we construct:

1. A “robustified” version for robust classification (Figure 1a)3 . We are able to effectively
remove non-robust features from a dataset. Concretely, we create a training set (semanti-
cally similar to the original) on which standard training yields good robust accuracy on the
original, unmodified test set. This finding establishes that adversarial vulnerability is not
necessarily tied to the standard training framework, but is also a property of the dataset.
2. A “non-robust” version for standard classification (Figure 1b)2 . We are also able to
construct a training dataset for which the inputs are nearly identical to the originals, but
all appear incorrectly labeled. In fact, the inputs in the new training set are associated
to their labels only through small adversarial perturbations (and hence utilize only non-
robust features). Despite the lack of any predictive human-visible information, training on
this dataset yields good accuracy on the original, unmodified test set. This demonstrates
that adversarial perturbations can arise from flipping features in the data that are useful for
classification of correct inputs (hence not being purely aberrations).

Finally, we present a concrete classification task where the connection between adversarial examples
and non-robust features can be studied rigorously. This task consists of separating Gaussian distri-
butions, and is loosely based on the model presented in Tsipras et al. [Tsi+19], while expanding
upon it in a few ways. First, adversarial vulnerability in our setting can be precisely quantified as a
difference between the intrinsic data geometry and that of the adversary’s perturbation set. Second,
robust training yields a classifier which utilizes a geometry corresponding to a combination of these
two. Lastly, the gradients of standard models can be significantly misaligned with the inter-class
direction, capturing a phenomenon that has been observed in practice [Tsi+19].

2 The Robust Features Model


We begin by developing a framework, loosely based on the setting of Tsipras et al. [Tsi+19], that
enables us to rigorously refer to “robust” and “non-robust” features. In particular, we present a set of
definitions which allow us to formally describe our setup, theoretical results, and empirical evidence.

Setup. We study binary classification, where input-label pairs (x, y) ∈ X × {±1} are sampled
from a distribution D; the goal is to learn a classifier C : X → {±1} predicting y given x.
We define a feature to be a function mapping from the input space X to real numbers, with the
set of all features thus being F = {f : X → R}. For convenience, we assume that the features
in F are shifted/scaled to be mean-zero and unit-variance (i.e., so that E_{(x,y)∼D}[f(x)] = 0 and E_{(x,y)∼D}[f(x)²] = 1), making the following definitions scale-invariant. Note that this definition
captures what we abstractly think of as features (e.g., a function capturing how “furry” an image is).
2 It is worth emphasizing that while our findings demonstrate that adversarial vulnerability does arise from non-robust features, they do not preclude the possibility of adversarial vulnerability also arising from other phenomena [Nak19a]. Nevertheless, the mere existence of useful non-robust features suffices to establish that without explicitly preventing models from utilizing these features, adversarial vulnerability will persist.
3 The corresponding datasets for CIFAR-10 are publicly available at http://git.io/adv-datasets.

[Figure 1 omitted: conceptual diagram with panels (a) "Robust dataset" / "Non-robust dataset" and (b) "Training image → Adversarial example towards "cat" → Relabel as cat", annotated with robust and non-robust feature labels.]

Figure 1: A conceptual diagram of the experiments of Section 3: (a) we disentangle features into robust and non-robust (Section 3.1), (b) we construct a dataset which appears mislabeled to humans (via adversarial examples) but results in good accuracy on the original test set (Section 3.2).

Useful, robust, and non-robust features. We now define the key concepts required for formulat-
ing our framework. To this end, we categorize features in the following manner:

• ρ-useful features: For a given distribution D, we call a feature f ρ-useful (ρ > 0) if it is correlated with the true label in expectation, that is if
E(x,y)∼D [y · f (x)] ≥ ρ. (1)
We then define ρD (f ) as the largest ρ for which feature f is ρ-useful under distribution D.
(Note that if a feature f is negatively correlated with the label, then −f is useful instead.)
Crucially, a linear classifier trained on ρ-useful features can attain non-trivial performance.
• γ-robustly useful features: Suppose we have a ρ-useful feature f (ρD (f ) > 0). We
refer to f as a robust feature (formally a γ-robustly useful feature for γ > 0) if, under
adversarial perturbation (for some specified set of valid perturbations ∆), f remains γ-
useful. Formally, if we have that
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\inf_{\delta\in\Delta(x)} y\cdot f(x+\delta)\right] \ge \gamma. \tag{2}$$

• Useful, non-robust features: A useful, non-robust feature is a feature which is ρ-useful for
some ρ bounded away from zero, but is not a γ-robust feature for any γ ≥ 0. These features
help with classification in the standard setting, but may hinder accuracy in the adversarial
setting, as the correlation with the label can be flipped.

Classification. In our framework, a classifier C = (F, w, b) is comprised of a set of features F ⊆ F, a weight vector w, and a scalar bias b. For an input x, the classifier predicts the label y as
$$C(x) = \operatorname{sgn}\Big(b + \sum_{f\in F} w_f\, f(x)\Big).$$

For convenience, we denote the set of features learned by a classifier C as FC .
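To make these definitions concrete, the quantities in (1) and (2) can be estimated on a finite sample. The inner infimum in (2) generally has no closed form, so the sketch below (ours, not the authors' code) approximates it with projected gradient descent, assuming a differentiable scalar feature f implemented in PyTorch and taking ∆(x) to be an ℓ∞ ball of radius eps; all hyperparameter values are illustrative.

```python
import torch

def rho_usefulness(f, x, y):
    """Empirical estimate of E[y * f(x)], i.e. the rho-usefulness of feature f."""
    return (y.float() * f(x)).mean().item()

def gamma_robust_usefulness(f, x, y, eps=8 / 255, steps=20, step_size=2 / 255):
    """Approximate E[ inf_{delta in Delta(x)} y * f(x + delta) ] with PGD,
    where Delta(x) is taken to be an l_inf ball of radius eps (an assumption)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        obj = (y.float() * f(x + delta)).sum()      # the adversary drives y * f(x + delta) down
        grad, = torch.autograd.grad(obj, delta)
        delta = (delta - step_size * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (y.float() * f(x + delta)).mean().item()
```

A feature whose ρ estimate is clearly positive but whose γ estimate drops to zero or below under the chosen ∆ is, in this terminology, useful but non-robust.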

Standard Training. Training a classifier is performed by minimizing, over input-label pairs (x, y) from the training set (i.e., via empirical risk minimization (ERM)), a loss function L_θ(x, y) that decreases with the correlation between the weighted combination of the features and the label. When
minimizing classification loss, no distinction exists between robust and non-robust features: the
only distinguishing factor of a feature is its ρ-usefulness. Furthermore, the classifier will utilize any
ρ-useful feature in F to decrease the loss of the classifier.
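For contrast with the robust objective introduced next, the following minimal sketch (ours; the model and data loader are placeholders) spells out the ERM procedure this paragraph describes, which is free to exploit any ρ-useful feature, robust or not:

```python
import torch
import torch.nn as nn

def standard_train(model, loader, epochs=10, lr=0.1):
    """Plain ERM: minimize classification loss, with no robust/non-robust distinction."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```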

Robust training. In the presence of an adversary, any useful but non-robust features can be made
anti-correlated with the true label, leading to adversarial vulnerability. Therefore, ERM is no longer
sufficient to train classifiers that are robust, and we need to explicitly account for the effect of the adversary on the classifier. To do so, we use an adversarial loss function that can discern between
robust and non-robust features [Mad+18]:
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\delta\in\Delta(x)} L_\theta(x+\delta, y)\right], \tag{3}$$

for an appropriately defined set of perturbations ∆. Since the adversary can exploit non-robust
features to degrade classification accuracy, minimizing this adversarial loss [GSS15; Mad+18] can
be viewed as explicitly preventing the classifier from relying on non-robust features.
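A minimal sketch of minimizing (3), with the inner maximization approximated by projected gradient descent (PGD) in the spirit of Madry et al. [Mad+18]; the ℓ∞ perturbation set, the hyperparameter values, and the omission of input-range clamping are simplifications of ours:

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps, step_size, steps):
    """Approximate the inner maximization of (3) over an l_inf ball of radius eps."""
    loss_fn = nn.CrossEntropyLoss()
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_train(model, loader, epochs=10, lr=0.1, eps=8 / 255, step_size=2 / 255, steps=7):
    """Minimize the adversarial loss (3): train on worst-case perturbed inputs,
    which discourages reliance on non-robust features."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = pgd_attack(model, x, y, eps, step_size, steps)
            opt.zero_grad()
            loss_fn(model(x_adv), y).backward()
            opt.step()
    return model
```

Replacing the standard ERM loop above with `adversarial_train` is the only change needed to move from the standard objective to the robust objective (3).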

Remark. We want to note that even though this framework enables us to describe and predict
the outcome of our experiments, it does not capture the notion of non-robust features exactly as
we intuitively might think of them. For instance, in principle, our theoretical framework would
allow for useful non-robust features to arise as combinations of useful robust features and useless
non-robust features [Goh19b]. These types of constructions, however, are actually precluded by our
experimental results (for instance, the classifiers trained in Section 3 would not generalize). This
shows that our experimental findings capture a stronger, more fine-grained statement than our formal
definitions are able to express. We view bridging this gap as an interesting direction for future work.

3 Finding Robust (and Non-Robust) Features


The central premise of our proposed framework is that there exist both robust and non-robust features
that constitute useful signals for standard classification. We now provide evidence in support of this
hypothesis by disentangling these two sets of features (see conceptual description in Figure 1).
On one hand, we will construct a “robustified” dataset, consisting of samples that primarily contain
robust features. Using such a dataset, we are able to train robust classifiers (with respect to the
standard test set) using standard (i.e., non-robust) training. This demonstrates that robustness can
arise by removing certain features from the dataset (as, overall, the new dataset contains less infor-
mation about the original training set). Moreover, it provides evidence that adversarial vulnerability
is caused by non-robust features and is not inherently tied to the standard training framework.
On the other hand, we will construct datasets where the input-label association is based purely on
non-robust features (and thus the resulting dataset appears completely mislabeled to humans). We
show that this dataset suffices to train a classifier with good performance on the standard test set. This
indicates that natural models use non-robust features to make predictions, even in the presence of
robust features. These features alone are sufficient for non-trivial generalization to natural images,
indicating that they are indeed predictive, rather than artifacts of finite-sample overfitting.

3.1 Disentangling robust and non-robust features

Recall that the features a classifier learns to rely on are based purely on how useful these features
are for (standard) generalization. Thus, under our conceptual framework, if we can ensure that only
robust features are useful, standard training should result in a robust classifier. Unfortunately, we
cannot directly manipulate the features of very complex, high-dimensional datasets. Instead, we will
leverage a robust model and modify our dataset to contain only the features relevant to that model.
Conceptually, given a robust (i.e., adversarially trained [Mad+18]) model C, we aim to construct a distribution D̂_R for which features used by C are as useful as they were on the original distribution D, while ensuring that the rest of the features are not useful. In terms of our formal framework:

$$\mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}_R}[f(x)\cdot y] =
\begin{cases}
\mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x)\cdot y] & \text{if } f \in F_C, \\
0 & \text{otherwise},
\end{cases} \tag{4}$$
where FC again represents the set of features utilized by C.
We will construct a training set for D̂_R via a one-to-one mapping x ↦ x_r from the original training
set for D. In the case of a deep neural network, FC corresponds to exactly the set of activations in the
penultimate layer (since these correspond to inputs to a linear classifier). To ensure that features used
by the model are equally useful under both training sets, we (approximately) enforce all features in
F_C to have similar values for both x and x_r through the following optimization:
$$\min_{x_r}\ \|g(x_r) - g(x)\|_2, \tag{5}$$
where x is the original input and g is the mapping from x to the representation layer.

[Figure 2 omitted: (a) sample images from the original training set D, the robust training set D̂_R, and the non-robust training set D̂_NR; (b) bar chart of standard and adversarial (ε = 0.25) test accuracy on D for standard training on D, adversarial training on D, standard training on D̂_R, and standard training on D̂_NR.]

Figure 2: (a): Random samples from our variants of the CIFAR-10 [Kri09] training set: the original training set; the robust training set D̂_R, restricted to features used by a robust model; and the non-robust training set D̂_NR, restricted to features relevant to a standard model (labels appear incorrect to humans). (b): Standard and robust accuracy on the CIFAR-10 test set (D) for models trained with: (i) standard training (on D); (ii) standard training on D̂_NR; (iii) adversarial training (on D); and (iv) standard training on D̂_R. Models trained on D̂_R and D̂_NR reflect the original models used to create them: notably, standard training on D̂_R yields nontrivial robust accuracy. Results for Restricted-ImageNet [Tsi+19] are in Appendix D.8, Figure 12.

We optimize this objective using (normalized) gradient descent (see details in Appendix C).
Since we don’t have access to features outside FC , there is no way to ensure that the expectation
in (4) is zero for all f ∉ F_C. To approximate this condition, we choose the starting point of gradient
descent for the optimization in (5) to be an input x0 which is drawn from D independently of the
label of x (we also explore sampling x0 from noise in Appendix D.1). This choice ensures that
any feature present in that input will not be useful since they are not correlated with the label in
expectation over x0 . The underlying assumption here is that, when performing the optimization
in (5), features that are not being directly optimized (i.e., features outside FC ) are not affected. We
provide pseudocode for the construction in Figure 5 (Appendix C).
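The following sketch is our simplified rendition of this procedure, not the code from Appendix C; the `penultimate` helper (exposing the representation layer g), the step size, and the number of steps are assumptions.

```python
import torch

def robustify_example(robust_model, x, x0, steps=1000, step_size=0.1):
    """Construct x_r by minimizing ||g(x_r) - g(x)||_2 as in (5), starting from an
    input x0 drawn independently of the label of x."""
    def g(inp):
        # Assumed helper: penultimate-layer activations of the robust model.
        return robust_model.penultimate(inp)

    with torch.no_grad():
        target = g(x)
    x_r = x0.clone().requires_grad_(True)
    for _ in range(steps):
        loss = (g(x_r) - target).norm()
        grad, = torch.autograd.grad(loss, x_r)
        # Normalized gradient descent; keep pixels in [0, 1].
        x_r = (x_r - step_size * grad / (grad.norm() + 1e-12)).clamp(0, 1)
        x_r = x_r.detach().requires_grad_(True)
    return x_r.detach()   # paired with the original label y of x in the new training set
```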
Given the new training set for D̂_R (a few random samples are visualized in Figure 2a), we train a
classifier using standard (non-robust) training. We then test this classifier on the original test set (i.e.
D). The results (Figure 2b) indicate that the classifier learned using the new dataset attains good
accuracy in both standard and adversarial settings (see additional evaluation in Appendix D.2).4
As a control, we repeat this methodology using a standard (non-robust) model for C in our con-
struction of the dataset. Sample images from the resulting “non-robust dataset” D̂_NR are shown
in Figure 2a—they tend to resemble more the source image of the optimization x0 than the target
image x. We find that training on this dataset leads to good standard accuracy, yet yields almost
no robustness (Figure 2b). We also verify that this procedure is not simply a matter of encoding
the weights of the original model—we get the same results for both D̂_R and D̂_NR if we train with
different architectures than that of the original models.
Overall, our findings corroborate the hypothesis that adversarial examples can arise from (non-
robust) features of the data itself. By filtering out non-robust features from the dataset (e.g. by
restricting the set of available features to those used by a robust model), one can train a significantly
more robust model using standard training.

3.2 Non-robust features suffice for standard classification

The results of the previous section show that by restricting the dataset to only contain features that
are used by a robust model, standard training results in classifiers that are significantly more robust.
4 In an attempt to explain the gap in accuracy between the model trained on D̂_R and the original robust classifier C, we test for distributional shift by reporting results on the “robustified” test set in Appendix D.3.

This suggests that when training on the standard dataset, non-robust features take on a large role in
the resulting learned classifier. Here we will show that this is not merely incidental. In particular, we
demonstrate that non-robust features alone suffice for standard generalization— i.e., a model trained
solely on non-robust features can generalize to the standard test set.
To show this, we construct a dataset where the only features that are useful for classification are
non-robust features (or in terms of our formal model from Section 2, all features f that are ρ-useful
are non-robust). To accomplish this, we modify each input-label pair (x, y) as follows. We select a
target class t either (a) uniformly at random (hence features become uncorrelated with the labels) or
(b) deterministically according to the source class (e.g. permuting the labels). Then, we add a small
adversarial perturbation to x to cause it to be classified as t by a standard model:
$$x_{\mathrm{adv}} = \arg\min_{\|x'-x\|\le \varepsilon} L_C(x', t), \tag{6}$$

where LC is the loss under a standard (non-robust) classifier C and ε is a small constant. The result-
ing inputs are indistinguishable from the originals (Appendix D Figure 9)—to a human observer,
it thus appears that the label t assigned to the modified input is simply incorrect. The resulting
input-label pairs (xadv , t) make up the new training set (pseudocode in Appendix C Figure 6).
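In code, the construction reduces to a targeted PGD attack against a standard classifier, with the target t drawn uniformly at random (for D̂_rand) or via a fixed permutation of the source label (for D̂_det). The sketch below is our simplified rendition under an ℓ2 constraint; the hyperparameters, helper names, and the (y + 1) mod C permutation are illustrative assumptions.

```python
import torch
import torch.nn as nn

def _flat_norm(t):
    """Per-example l2 norm, shaped to broadcast against t."""
    return t.flatten(1).norm(dim=1).view(-1, *([1] * (t.dim() - 1)))

def targeted_pgd(model, x, t, eps=0.5, step_size=0.1, steps=100):
    """Approximately solve (6): move x toward target class t within an l2 ball of radius eps."""
    loss_fn = nn.CrossEntropyLoss()
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), t)
        grad, = torch.autograd.grad(loss, delta)
        delta = delta - step_size * grad / (_flat_norm(grad) + 1e-12)        # normalized step
        delta = delta * (eps / (_flat_norm(delta) + 1e-12)).clamp(max=1.0)   # project onto l2 ball
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def make_nonrobust_dataset(model, x, y, num_classes=10, deterministic=False):
    """Return (x_adv, t): perturbed inputs relabeled as their target class."""
    t = (y + 1) % num_classes if deterministic else torch.randint(num_classes, y.shape)
    return targeted_pgd(model, x, t), t
```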
Now, since ‖x_adv − x‖ is small, by definition the robust features of x_adv are still correlated with
class y (and not t) in expectation over the dataset. After all, humans still recognize the original class.
On the other hand, since every x_adv is strongly classified as t by a standard classifier, it must be that
some of the non-robust features are now strongly correlated with t (in expectation).
In the case where t is chosen at random, the robust features are originally uncorrelated with the label
t (in expectation), and after the perturbation can be only slightly correlated (hence being significantly
less useful for classification than before).5 Formally, we aim to construct a dataset D̂_rand where
$$\mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}_{\mathrm{rand}}}[y\cdot f(x)]\ \begin{cases} > 0 & \text{if } f \text{ non-robustly useful under } \mathcal{D}, \\ \simeq 0 & \text{otherwise}. \end{cases} \tag{7}$$

In contrast, when t is chosen deterministically based on y, the robust features actually point away
from the assigned label t. In particular, all of the inputs labeled with class t exhibit non-robust
features correlated with t, but robust features correlated with the original class y. Thus, robust
features on the original training set provide significant predictive power on the training set, but will
actually hurt generalization on the standard test set. Formally, our goal is to construct D̂_det such that
$$\mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}_{\mathrm{det}}}[y\cdot f(x)]\ \begin{cases} > 0 & \text{if } f \text{ non-robustly useful under } \mathcal{D}, \\ < 0 & \text{if } f \text{ robustly useful under } \mathcal{D}, \\ \in \mathbb{R} & \text{otherwise } (f \text{ not useful under } \mathcal{D}). \end{cases} \tag{8}$$

We find that standard training on these datasets actually generalizes to the original test set, as shown in Table 1. This indicates that non-robust features are indeed useful for classification in the standard setting. Remarkably, even training on D̂_det (where all the robust features are correlated with the wrong class) results in a well-generalizing classifier. This indicates that non-robust features can be picked up by models during standard training, even in the presence of predictive robust features.6

3.3 Transferability can arise from non-robust features

One of the most intriguing properties of adversarial examples is that they transfer across models with
different architectures and independently sampled training sets [Sze+14; PMG16; CRP19]. Here, we
show that this phenomenon can in fact be viewed as a natural consequence of the existence of non-
robust features. Recall that, according to our main thesis, adversarial examples can arise as a result
of perturbing well-generalizing, yet brittle features. Given that such features are inherent to the data
distribution, different classifiers trained on independent samples from that distribution are likely
to utilize similar non-robust features. Consequently, perturbations constructed by exploiting non-
robust features learned by one classifier will transfer to other classifiers utilizing similar features.
5 Goh [Goh19a] provides an approach to quantifying this “robust feature leakage” and finds that one can obtain a (small) amount of test accuracy by leveraging robust feature leakage on D̂_rand.
6 Additional results and analysis are in Appendices D.5, D.6, and D.7.

In order to illustrate and corroborate this hypothesis, we train five different architectures on the
dataset generated in Section 3.2 (adversarial examples with deterministic labels) for a standard
ResNet-50 [He+16]. Our hypothesis would suggest that architectures which learn better from this
training set (in terms of performance on the standard test set) are more likely to learn similar non-
robust features to the original classifier. Indeed, we find that the test accuracy of each architecture is
predictive of how often adversarial examples transfer from the original model to standard classifiers
with that architecture (Figure 3). In a similar vein, Nakkiran [Nak19a] constructs a set of adversarial
perturbations that is explicitly non-transferable and finds that these perturbations cannot be used to
learn a good classifier. These findings thus corroborate our hypothesis that adversarial transferability
arises when models learn similar brittle features of the underlying dataset.
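Measuring transferability itself is straightforward; a hedged sketch (function names are ours, and `attack` can be any untargeted attack, e.g. a PGD attack like the earlier sketch) is:

```python
import torch

def transfer_success_rate(source_model, target_model, x, y, attack):
    """Fraction of adversarial examples crafted against source_model that also fool
    target_model, measured over the examples that fool the source model."""
    x_adv = attack(source_model, x, y)
    with torch.no_grad():
        fooled_source = source_model(x_adv).argmax(dim=1) != y
        fooled_target = target_model(x_adv).argmax(dim=1) != y
    return (fooled_source & fooled_target).float().sum() / fooled_source.float().sum().clamp(min=1)
```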

[Figure 3 omitted: scatter plot of transfer success rate (%) from a ResNet-50 source model versus test accuracy (%) of VGG-16, Inception-v3, ResNet-18, DenseNet, and ResNet-50 target architectures trained on the dataset of Section 3.2.]

Figure 3: Transfer rate of adversarial examples from a ResNet-50 to different architectures alongside test set performance of these architectures when trained on the dataset generated in Section 3.2. Architectures more susceptible to transfer attacks also performed better on the standard test set, supporting our hypothesis that adversarial transferability arises from using similar non-robust features.

Table 1: Test accuracy (on D) of classifiers trained on the D, D̂_rand, and D̂_det training sets created using a standard (non-robust) model. For both D̂_rand and D̂_det, only non-robust features correspond to useful features on both the train set and D. These datasets are constructed using adversarial perturbations of x towards a class t (random for D̂_rand and deterministic for D̂_det); the resulting images are relabeled as t.

Source dataset      CIFAR-10    ImageNet_R
D                   95.3%       96.6%
D̂_rand              63.3%       87.9%
D̂_det               43.7%       64.4%

4 A Theoretical Framework for Studying (Non)-Robust Features

The experiments from the previous section demonstrate that the conceptual framework of robust and
non-robust features is strongly predictive of the empirical behavior of state-of-the-art models on real-
world datasets. In order to further strengthen our understanding of the phenomenon, we instantiate
the framework in a concrete setting that allows us to theoretically study various properties of the
corresponding model. Our model is similar to that of Tsipras et al. [Tsi+19] in the sense that it
contains a dichotomy between robust and non-robust features, extending upon it in a few ways: a)
the adversarial vulnerability can be explicitly expressed as a difference between the inherent data
metric and the ℓ2 metric, b) robust learning corresponds exactly to learning a combination of these
two metrics, c) the gradients of robust models align better with the adversary’s metric.

Setup. We study a simple problem of maximum likelihood classification between two Gaussian
distributions. In particular, given samples (x, y) sampled from D according to
$$y \overset{\text{u.a.r.}}{\sim} \{-1, +1\}, \qquad x \sim \mathcal{N}(y\cdot\mu_*,\ \Sigma_*), \tag{9}$$
our goal is to learn parameters Θ = (µ, Σ) such that
$$\Theta = \arg\min_{\mu,\Sigma}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(x;\ y\cdot\mu, \Sigma)\right], \tag{10}$$

where ℓ(x; µ, Σ) represents the Gaussian negative log-likelihood (NLL) function. Intuitively, we
find the parameters µ, Σ which maximize the likelihood of the sampled data under the given model.
Classification can be accomplished via likelihood test: given an unlabeled sample x, we predict y as
$$y = \arg\max_y\ \ell(x;\ y\cdot\mu, \Sigma) = \operatorname{sign}\left(x^\top \Sigma^{-1}\mu\right).$$

In turn, the robust analogue of this problem arises from replacing ℓ(x; y·µ, Σ) with the NLL under adversarial perturbation. The resulting robust parameters Θ_r can be written as
$$\Theta_r = \arg\min_{\mu,\Sigma}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_2 \le \varepsilon} \ell(x+\delta;\ y\cdot\mu, \Sigma)\right], \tag{11}$$
A detailed analysis appears in Appendix E—here we present a high-level overview of the results.
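As a worked numerical instance of (9)–(10) (a sketch with two-dimensional parameters of our choosing, not the paper's experimental setup), one can fit the maximum likelihood parameters and classify with the likelihood test sign(x^⊤ Σ^{-1} µ); the low-variance second coordinate plays the role of a non-robust feature, receiving a large weight under Σ^{-1}µ despite its small mean separation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 10_000
mu_star = np.array([3.0, 0.5])
sigma_star = np.diag([1.0, 0.05])   # large-variance "robust" direction, small-variance "non-robust" one

y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu_star + rng.multivariate_normal(np.zeros(d), sigma_star, size=n)

# Maximum likelihood estimates of (mu, Sigma) under the model x ~ N(y * mu, Sigma).
mu_hat = (y[:, None] * x).mean(axis=0)
resid = x - y[:, None] * mu_hat
sigma_hat = resid.T @ resid / n

# Likelihood-test classifier: y_hat = sign(x^T Sigma^{-1} mu).
w = np.linalg.solve(sigma_hat, mu_hat)
print("classifier weights:", w)                       # the second coordinate gets the larger weight
print("standard accuracy:", np.mean(np.sign(x @ w) == y))
```

An ℓ2-bounded adversary as in (11) can cheaply flip the heavily weighted low-variance coordinate, which is exactly the metric misalignment discussed next.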

(1) Vulnerability from metric misalignment (non-robust features). Note that in this model, one
can rigorously refer to an inner product (and thus a metric) induced by the features. In particular,
one can view the learned parameters of a Gaussian Θ = (µ, Σ) as defining an inner product over the
input space given by ⟨x, y⟩_Θ = (x − µ)^⊤ Σ^{−1} (y − µ). This in turn induces the Mahalanobis distance,
which represents how a change in the input affects the features of the classifier. This metric is not
necessarily aligned with the metric in which the adversary is constrained, the ℓ2-norm. Actually, we
show that adversarial vulnerability arises exactly as a misalignment of these two metrics.
Theorem 1 (Adversarial vulnerability from misalignment). Consider an adversary whose pertur-
bation is determined by the “Lagrangian penalty” form of (11), i.e.
$$\max_{\delta}\ \ell(x+\delta;\ y\cdot\mu, \Sigma) - C\cdot\|\delta\|_2,$$
where C ≥ 1/σ_min(Σ_*) is a constant trading off NLL minimization and the adversarial constraint (the bound on C ensures the problem is concave). Then, the adversarial loss L_adv incurred by (µ, Σ) is
$$L_{\mathrm{adv}}(\Theta) - L(\Theta) = \operatorname{tr}\left[\left(I + (C\cdot\Sigma_* - I)^{-1}\right)^2\right] - d,$$
and, for a fixed tr(Σ_*) = k, the above is minimized by Σ_* = (k/d)·I.


In fact, note that such a misalignment corresponds precisely to the existence of non-robust features—
“small” changes in the adversary’s metric along certain directions can cause large changes under the
notion of distance established by the parameters (illustrated in Figure 4).

(2) Robust Learning. The (non-robust) maximum likelihood estimate is Θ = Θ∗ , and thus the
vulnerability for the standard MLE depends entirely on the data distribution. The following theorem
characterizes the behaviour of the learned parameters in the robust problem (we study a slight relax-
ation of (11) that becomes exact exponentially fast as d → ∞, see Appendix E.3.3). In fact, we can
prove (Section E.3.4) that performing (sub)gradient descent on the inner maximization (known as
adversarial training [GSS15; Mad+18]) yields exactly Θr . We find that as the perturbation budget ε
increases, the metric induced by the classifier mixes ℓ2 and the metric induced by the data features.
Theorem 2 (Robustly Learned Parameters). Just as in the non-robust case, µ_r = µ_*, i.e. the true mean is learned. For the robust covariance Σ_r, there exists an ε_0 > 0, such that for any ε ∈ [0, ε_0),
$$\Sigma_r = \frac{1}{2}\Sigma_* + \frac{1}{\lambda}\cdot I + \sqrt{\frac{1}{\lambda}\cdot\Sigma_* + \frac{1}{4}\Sigma_*^2}, \quad \text{where} \quad \Omega\!\left(\frac{1+\varepsilon^{1/2}}{\varepsilon^{1/2}+\varepsilon^{3/2}}\right) \le \lambda \le O\!\left(\frac{1+\varepsilon^{1/2}}{\varepsilon^{1/2}}\right).$$
The effect of robust optimization under an ℓ2-constrained adversary is visualized in Figure 4. As ε
grows, the learned covariance becomes more aligned with identity. For instance, we can see that the
classifier learns to be less sensitive in certain directions, despite their usefulness for classification.

(3) Gradient Interpretability. Tsipras et al. [Tsi+19] observe that gradients of robust models
tend to look more semantically meaningful. It turns out that under our model, this behaviour arises
as a natural consequence of Theorem 2. In particular, we show that the resulting robustly learned
parameters cause the gradient of the linear classifier and the vector connecting the means of the two
distributions to better align (in a worst-case sense) under the ℓ2 inner product.
Theorem 3 (Gradient alignment). Let f (x) and fr (x) be monotonic classifiers based on the linear
separator induced by standard and ℓ2-robust maximum likelihood classification, respectively. The
maximum angle formed between the gradient of the classifier (wrt input) and the vector connecting
the classes can be smaller for the robust model:
$$\min_{\mu}\ \frac{\langle \mu, \nabla_x f_r(x)\rangle}{\|\mu\|\cdot\|\nabla_x f_r(x)\|} \ >\ \min_{\mu}\ \frac{\langle \mu, \nabla_x f(x)\rangle}{\|\mu\|\cdot\|\nabla_x f(x)\|}.$$

Figure 4 illustrates this phenomenon in the two-dimensional case, where ℓ2-robustness causes the
gradient direction to become increasingly aligned with the vector between the means (µ).
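The qualitative content of Theorems 2 and 3 can be checked numerically on the diagonal example from the sketch above; the λ values below are hand-picked for illustration (Theorem 2 only pins λ down up to the stated bounds, with smaller λ roughly corresponding to a larger budget ε), and the element-wise square root is valid only because Σ_* is diagonal.

```python
import numpy as np

mu_star = np.array([3.0, 0.5])
sigma_star = np.diag([1.0, 0.05])

def robust_covariance(sigma, lam):
    """Closed form from Theorem 2; for diagonal sigma the matrix square root is element-wise."""
    return 0.5 * sigma + np.eye(len(sigma)) / lam + np.sqrt(sigma / lam + 0.25 * sigma @ sigma)

def alignment(sigma, mu):
    """Cosine between the linear classifier's gradient Sigma^{-1} mu and mu (cf. Theorem 3)."""
    w = np.linalg.solve(sigma, mu)
    return mu @ w / (np.linalg.norm(mu) * np.linalg.norm(w))

for lam in [1e6, 10.0, 1.0, 0.1]:   # decreasing lambda ~ increasing perturbation budget
    sigma_r = robust_covariance(sigma_star, lam)
    print(f"lambda={lam:>9}: cos(gradient, mu) = {alignment(sigma_r, mu_star):.3f}")
```

As λ decreases, the covariance blends toward the identity and the gradient of the resulting linear classifier aligns increasingly well with µ, mirroring Figure 4.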

[Figure 4 omitted: four two-dimensional panels ("Maximum likelihood estimate", "True Parameters (ε = 0)", "Robust parameters, ε = 1.0", "Robust parameters, ε = 10.0") showing samples from N(µ_*, Σ_*) in the (x1, x2) feature plane, together with the ℓ2 unit ball and the Σ^{-1}-induced metric unit ball.]

Figure 4: An empirical demonstration of the effect illustrated by Theorem 2—as the adversarial
perturbation budget ε is increased, the learned mean µ remains constant, but the learned covariance
“blends” with the identity matrix, effectively adding uncertainty onto the non-robust feature.

Discussion. Our analysis suggests that rather than offering quantitative classification benefits, a
natural way to view the role of robust optimization is as enforcing a prior over the features learned
by the classifier. In particular, training with an ℓ2-bounded adversary prevents the classifier from
relying heavily on features which induce a metric dissimilar to the ℓ2 metric. The strength of the
adversary then allows for a trade-off between the enforced prior, and the data-dependent features.

Robustness and accuracy. Note that in the setting described so far, robustness can be at odds
with accuracy since robust training prevents us from learning the most accurate classifier (a similar
conclusion is drawn in [Tsi+19]). However, we note that there are very similar settings where non-
robust features manifest themselves in the same way, yet a classifier with perfect robustness and
accuracy is still attainable. Concretely, consider the distributions pictured in Figure 14 in Appendix
D.10. It is straightforward to show that while there are many perfectly accurate classifiers, any
standard loss function will learn an accurate yet non-robust classifier. Only when robust training is
employed does the classifier learn a perfectly accurate and perfectly robust decision boundary.

5 Related Work

Several models for explaining adversarial examples have been proposed in prior work, utilizing
ideas ranging from finite-sample overfitting to high-dimensional statistical phenomena [Gil+18;
FFF18; For+19; TG16; Sha+19a; MDM18; Sha+19b; GSS15; BPR18]. The key differentiating
aspect of our model is that adversarial perturbations arise as well-generalizing, yet brittle, features,
rather than statistical anomalies. In particular, adversarial vulnerability does not stem from using
a specific model class or a specific training method, since standard training on the “robustified”
data distribution of Section 3.1 leads to robust models. At the same time, as shown in Section 3.2,
these non-robust features are sufficient to learn a good standard classifier. We discuss the connection
between our model and others in detail in Appendix A and additional related work in Appendix B.

6 Conclusion

In this work, we cast the phenomenon of adversarial examples as a natural consequence of the
presence of highly predictive but non-robust features in standard ML datasets. We provide support
for this hypothesis by explicitly disentangling robust and non-robust features in standard datasets,
as well as showing that non-robust features alone are sufficient for good generalization. Finally,
we study these phenomena in more detail in a theoretical setting where we can rigorously study
adversarial vulnerability, robust training, and gradient alignment.
Our findings prompt us to view adversarial examples as a fundamentally human phenomenon. In
particular, we should not be surprised that classifiers exploit highly predictive features that happen
to be non-robust under a human-selected notion of similarity, given such features exist in real-world
datasets. In the same manner, from the perspective of interpretability, as long as models rely on these
non-robust features, we cannot expect to have model explanations that are both human-meaningful
and faithful to the models themselves. Overall, attaining models that are robust and interpretable
will require explicitly encoding human priors into the training process.

Acknowledgements
We thank Preetum Nakkiran for suggesting the experiment of Appendix D.9 (i.e. replicating Figure 3
with targeted attacks). We also are grateful to the authors of Engstrom et al. [Eng+19a] (Chris Olah,
Dan Hendrycks, Justin Gilmer, Reiichiro Nakano, Preetum Nakkiran, Gabriel Goh, Eric Wallace)
for their insights and efforts replicating, extending, and discussing our experimental results.
Work supported in part by the NSF grants CCF-1553428, CCF-1563880, CNS-1413920, CNS-
1815221, IIS-1447786, IIS-1607189, the Microsoft Corporation, the Intel Corporation, the MIT-
IBM Watson AI Lab research grant, and an Analog Devices Fellowship.

References
[ACW18] Anish Athalye, Nicholas Carlini, and David A. Wagner. “Obfuscated Gradients Give a
False Sense of Security: Circumventing Defenses to Adversarial Examples”. In: Inter-
national Conference on Machine Learning (ICML). 2018.
[Ath+18] Anish Athalye et al. “Synthesizing Robust Adversarial Examples”. In: International
Conference on Machine Learning (ICML). 2018.
[BCN06] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. “Model compres-
sion”. In: International Conference on Knowledge Discovery and Data Mining (KDD).
2006.
[Big+13] Battista Biggio et al. “Evasion attacks against machine learning at test time”. In:
Joint European conference on machine learning and knowledge discovery in databases
(ECML-KDD). 2013.
[BPR18] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. “Adversarial examples from com-
putational constraints”. In: arXiv preprint arXiv:1805.10204. 2018.
[Car+19] Nicholas Carlini et al. “On Evaluating Adversarial Robustness”. In: ArXiv preprint
arXiv:1902.06705. 2019.
[CRK19] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. “Certified adversarial robustness
via randomized smoothing”. In: arXiv preprint arXiv:1902.02918. 2019.
[CRP19] Zachary Charles, Harrison Rosenberg, and Dimitris Papailiopoulos. “A Geometric Per-
spective on the Transferability of Adversarial Directions”. In: International Conference
on Artificial Intelligence and Statistics (AISTATS). 2019.
[CW17a] Nicholas Carlini and David Wagner. “Adversarial Examples Are Not Easily Detected:
Bypassing Ten Detection Methods”. In: Workshop on Artificial Intelligence and Secu-
rity (AISec). 2017.
[CW17b] Nicholas Carlini and David Wagner. “Towards evaluating the robustness of neural net-
works”. In: Symposium on Security and Privacy (SP). 2017.
[Dan67] John M. Danskin. The Theory of Max-Min and its Application to Weapons Allocation
Problems. 1967.
[Das+19] Constantinos Daskalakis et al. “Efficient Statistics, in High Dimensions, from Trun-
cated Samples”. In: Foundations of Computer Science (FOCS). 2019.
[Din+19] Gavin Weiguang Ding et al. “On the Sensitivity of Adversarial Robustness to Input
Data Distributions”. In: International Conference on Learning Representations. 2019.
[Eng+19a] Logan Engstrom et al. “A Discussion of ’Adversarial Examples Are Not Bugs, They
Are Features’”. In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion. DOI:
10.23915/distill.00019.
[Eng+19b] Logan Engstrom et al. “A Rotation and a Translation Suffice: Fooling CNNs with
Simple Transformations”. In: International Conference on Machine Learning (ICML).
2019.
[FFF18] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. “Adversarial vulnerability for any
classifier”. In: Advances in Neural Information Processing Systems (NeurIPS). 2018.
[FMF16] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. “Robustness
of classifiers: from adversarial to random noise”. In: Advances in Neural Information
Processing Systems. 2016.

[For+19] Nic Ford et al. “Adversarial Examples Are a Natural Consequence of Test Error in
Noise”. In: arXiv preprint arXiv:1901.10513. 2019.
[Fur+18] Tommaso Furlanello et al. “Born-Again Neural Networks”. In: International Confer-
ence on Machine Learning (ICML). 2018.
[Gei+19] Robert Geirhos et al. “ImageNet-trained CNNs are biased towards texture; increasing
shape bias improves accuracy and robustness.” In: International Conference on Learn-
ing Representations. 2019.
[Gil+18] Justin Gilmer et al. “Adversarial spheres”. In: Workshop of International Conference
on Learning Representations (ICLR). 2018.
[Goh19a] Gabriel Goh. “A Discussion of ’Adversarial Examples Are Not Bugs, They Are
Features’: Robust Feature Leakage”. In: Distill (2019). https://distill.pub/2019/advex-
bugs-discussion/response-2. DOI: 10.23915/distill.00019.2.
[Goh19b] Gabriel Goh. “A Discussion of ’Adversarial Examples Are Not Bugs, They
Are Features’: Two Examples of Useful, Non-Robust Features”. In: Distill
(2019). https://distill.pub/2019/advex-bugs-discussion/response-3. DOI: 10 . 23915 /
distill.00019.3.
[GSS15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harness-
ing Adversarial Examples”. In: International Conference on Learning Representations
(ICLR). 2015.
[HD19] Dan Hendrycks and Thomas G. Dietterich. “Benchmarking Neural Network Robust-
ness to Common Corruptions and Surface Variations”. In: International Conference on
Learning Representations (ICLR). 2019.
[He+16] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: Conference
on Computer Vision and Pattern Recognition (CVPR). 2016.
[He+17] Warren He et al. “Adversarial example defense: Ensembles of weak defenses are not
strong”. In: USENIX Workshop on Offensive Technologies (WOOT). 2017.
[HVD14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neu-
ral Network”. In: Neural Information Processing Systems (NeurIPS) Deep Learning
Workshop. 2014.
[JLT18] Saumya Jetley, Nicholas Lord, and Philip Torr. “With friends like these, who needs ad-
versaries?” In: Advances in Neural Information Processing Systems (NeurIPS). 2018.
[Kri09] Alex Krizhevsky. “Learning Multiple Layers of Features from Tiny Images”. In: Tech-
nical report. 2009.
[KSJ19] Beomsu Kim, Junghoon Seo, and Taegyun Jeon. “Bridging Adversarial Robustness and
Gradient Interpretability”. In: International Conference on Learning Representations
Workshop on Safe Machine Learning (ICLR SafeML). 2019.
[Lec+19] Mathias Lecuyer et al. “Certified robustness to adversarial examples with differential
privacy”. In: Symposium on Security and Privacy (SP). 2019.
[Liu+17] Yanpei Liu et al. “Delving into Transferable Adversarial Examples and Black-box At-
tacks”. In: International Conference on Learning Representations (ICLR). 2017.
[LM00] Beatrice Laurent and Pascal Massart. “Adaptive estimation of a quadratic functional
by model selection”. In: Annals of Statistics. 2000.
[Mad+18] Aleksander Madry et al. “Towards deep learning models resistant to adversarial at-
tacks”. In: International Conference on Learning Representations (ICLR). 2018.
[MDM18] Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. “The curse of
concentration in robust learning: Evasion and poisoning attacks from concentration of
measure”. In: AAAI Conference on Artificial Intelligence (AAAI). 2018.
[Moo+17] Seyed-Mohsen Moosavi-Dezfooli et al. “Universal adversarial perturbations”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[MV15] Aravindh Mahendran and Andrea Vedaldi. “Understanding deep image representations by inverting them”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[Nak19a] Preetum Nakkiran. “A Discussion of ’Adversarial Examples Are Not Bugs,
They Are Features’: Adversarial Examples are Just Bugs, Too”. In: Distill
(2019). https://distill.pub/2019/advex-bugs-discussion/response-5. DOI: 10 . 23915 /
distill.00019.5.

[Nak19b] Preetum Nakkiran. “Adversarial robustness may be at odds with simplicity”. In: arXiv
preprint arXiv:1901.00532. 2019.
[OMS17] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. “Feature Visualization”.
In: Distill. 2017.
[Pap+17] Nicolas Papernot et al. “Practical black-box attacks against machine learning”. In: Asia
Conference on Computer and Communications Security. 2017.
[PMG16] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. “Transferability in Machine
Learning: from Phenomena to Black-box Attacks using Adversarial Samples”. In:
ArXiv preprint arXiv:1605.07277. 2016.
[Rec+19] Benjamin Recht et al. “Do CIFAR-10 Classifiers Generalize to CIFAR-10?” In: Inter-
national Conference on Machine Learning (ICML). 2019.
[RSL18] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. “Certified defenses against
adversarial examples”. In: International Conference on Learning Representations
(ICLR). 2018.
[Rus+15] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In:
International Journal of Computer Vision (IJCV). 2015.
[Sch+18] Ludwig Schmidt et al. “Adversarially Robust Generalization Requires More Data”. In:
Advances in Neural Information Processing Systems (NeurIPS). 2018.
[Sha+19a] Ali Shafahi et al. “Are adversarial examples inevitable?” In: International Conference
on Learning Representations (ICLR). 2019.
[Sha+19b] Adi Shamir et al. “A Simple Explanation for the Existence of Adversarial Examples
with Small Hamming Distance”. In: arXiv preprint arXiv:1901.10861. 2019.
[SHS19] David Stutz, Matthias Hein, and Bernt Schiele. “Disentangling Adversarial Robustness
and Generalization”. In: Computer Vision and Pattern Recognition (CVPR). 2019.
[Smi+17] D. Smilkov et al. “SmoothGrad: removing noise by adding noise”. In: ICML workshop
on visualization for deep learning. 2017.
[Sug+19] Arun Sai Suggala et al. “Revisiting Adversarial Risk”. In: Conference on Artificial
Intelligence and Statistics (AISTATS). 2019.
[Sze+14] Christian Szegedy et al. “Intriguing properties of neural networks”. In: International
Conference on Learning Representations (ICLR). 2014.
[TG16] Thomas Tanay and Lewis Griffin. “A Boundary Tilting Perspective on the Phenomenon
of Adversarial Examples”. In: ArXiv preprint arXiv:1608.07690. 2016.
[Tra+17] Florian Tramer et al. “The Space of Transferable Adversarial Examples”. In: ArXiv
preprint arXiv:1704.03453. 2017.
[Tsi+19] Dimitris Tsipras et al. “Robustness May Be at Odds with Accuracy”. In: International
Conference on Learning Representations (ICLR). 2019.
[Ues+18] Jonathan Uesato et al. “Adversarial Risk and the Dangers of Evaluating Against Weak
Attacks”. In: International Conference on Machine Learning (ICML). 2018.
[Wan+18] Tongzhou Wang et al. “Dataset Distillation”. In: ArXiv preprint arXiv:1811.10959.
2018.
[WK18] Eric Wong and J Zico Kolter. “Provable defenses against adversarial examples via the
convex outer adversarial polytope”. In: International Conference on Machine Learning
(ICML). 2018.
[Xia+19] Kai Y. Xiao et al. “Training for Faster Adversarial Robustness Verification via Inducing
ReLU Stability”. In: International Conference on Learning Representations (ICLR).
2019.
[Zou+18] Haosheng Zou et al. “Geometric Universality of Adversarial Examples in Deep Learn-
ing”. In: Geometry in Machine Learning ICML Workshop (GIML). 2018.
