Invariant Anomaly Detection Under Distribution Shifts: A Causal Perspective
João B. S. Carvalho, Mengtao Zhang, Robin Geyer, Carlos Cotrini, Joachim M. Buhmann
Institute for Machine Learning
Department of Computer Science
ETH Zürich
{joao.carvalho, mengtao.zhang, robin.geyer, ccarlos, jbuhmann}@inf.ethz.ch
Abstract
Anomaly detection (AD) is the machine learning task of identifying highly dis-
crepant abnormal samples by solely relying on the consistency of the normal
training samples. Under the constraints of a distribution shift, the assumption that
training samples and test samples are drawn from the same distribution breaks down.
In this work, by leveraging tools from causal inference we attempt to increase the
resilience of anomaly detection models to different kinds of distribution shifts. We
begin by elucidating a simple yet necessary statistical property that ensures invari-
ant representations, which is critical for robust AD under both domain and covariate
shifts. From this property, we derive a regularization term which, when minimized,
leads to partial distribution invariance across environments. Through extensive
experimental evaluation on both synthetic and real-world tasks, covering a range
of six different AD methods, we demonstrate significant improvements in out-of-
distribution performance. Under both covariate and domain shift, models regular-
ized with our proposed term show markedly increased robustness. Code is available
at: https://github.com/JoaoCarv/invariant-anomaly-detection.
1 Introduction
Anomaly detectors are the subject of increased interest in fields such as finance (Ahmed et al. [2016],
Hilal et al. [2022]), medicine (Schlegl et al. [2019]), and security (Mothukuri et al. [2021], Siddiqui
et al. [2019], Hosseinzadeh et al. [2021]). Having been trained on a sample from an unknown
distribution, these models are capable of identifying abnormal objects unlikely to come from the
original distribution (Bishop and Nasrabadi [2006]). Anomaly detection (AD) stands apart from
supervised classification as it does not involve anomalies during training, making it challenging to
articulate a model that depicts the class of objects deemed as normal.
AD as a field boasts a plethora of diverse methodologies (Ruff et al. [2021]). Current detectors have
demonstrated the advantage of approaches based on representation learning (Reiss and Hoshen [2021],
Deng and Li [2022]). In this context, an encoder maps objects to representations which capture the
most distinctive features of an object. In addition, it strives to map the class of normal objects onto
a subset characterized by a more regular shape, thereby rendering representations from abnormal
samples easily identifiable by comparison. Central to this second goal is a notable vulnerability of
representation learning-based methods: they hinge on the assumption of independent and identically
distributed (i.i.d.) training and test data. This implies that normal samples in the training data are ex-
pected to be sampled identically in the test data, thereby being mapped to the same vicinity in the repre-
sentation space - an assumption that is frequently violated in real-world scenarios (Koh et al. [2021]).
Indeed, distribution shifts in the context of AD present a unique challenge because they involve discerning
two types of distribution shifts targeting the distribution of the normal objects, pn. Anomalies
2 Related Works
The relationship between invariance and robustness to shifts in data distribution has been extensively
explored, notably within the field of causal inference.
A range of studies has pursued the development of invariant representation learning, achieving notable
success (Long et al. [2013], Kang et al. [2019], Mitrovic et al. [2020], Lv et al. [2022], Nguyen et al.
[2021]). Moreover, a detailed examination of various forms of invariance-based methods derived
from underlying causal graphs has been provided by Veitch et al. [2021] and Wang and Veitch [2022].
Figure 1: Comparative analysis of various mapping strategies from X to Z in the context of AD under
distribution shifts. (a) Anomaly detection setting with multiple environments in X . (b) Ideal scenario
where the mapping is both invariant and informative; normal samples from different environments
converge to the same subset of Z, maintaining the distinction between normal and abnormal samples.
(c) Mapping is invariant but not informative, resulting in both normal and abnormal representations
collapsing to the same subset of Z. (d) Mapping is informative but not invariant; although the
mappings of all environments remain distinct, this makes the encoder vulnerable to distribution shifts.
Causal inference has also been shown to foster distributional robustness, as documented in Mein-
shausen [2018]. Building on this notion, Invariant Risk Minimization (IRM) (Arjovsky et al. [2019])
introduced a novel optimization principle hinged on maintaining invariance across diverse environ-
ments. However, Koh et al. [2021] demonstrated that this approach tends to falter when subjected to
real-world distribution shifts. Yao et al. [2022] introduced selective data augmentations to induce
robustness to distribution shifts. However, due to its reliance on class labels, this approach is not
directly applicable to AD.
Despite the effectiveness of these methods, it is important to note that they primarily focus on AD
within datasets where the training and test data distributions are identical. As we will show, their
performance tends to degrade when confronted with data that exhibit different kinds of distribution
shifts.
One way to improve robustness to distribution shifts entails pretraining the encoder using an
invariance-inducing method (for example, IRM or LISA), applied to an unrelated pretext task
using the same data (Smeu et al. [2022]). Unfortunately, this requires the existence of additional
task labels, which is uncommon for AD tasks. Moreover, as we experimentally demonstrate, this
approach achieves limited performance when paired with state-of-the-art AD methods.
3 Formalization
3.1 Background
Our formulation operates under the assumption of a given set of environments, denoted as E, and
a sample space, X . We have a sample D = {x1 , . . . , xn } drawn from an unknown distribution
pn , whose mass is mostly concentrated on a subset N ⊆ X . This subset, N , defines the normal
class. Given that X ∼ pn , we consider X as a pair of random variables Xa and Xe , such that
X = (Xa , Xe ). Here, the environment E only influences Xe . Conceptually, Xa represents the
component of X that determines whether X is an anomaly, while Xe comprises style features. These
style features, while unaffected by the anomaly status of an object, are influenced by the environment.
Note that this assumption implies that X = Xa × Xe , for some appropriate sets Xa and Xe . Fig. 1 (a)
illustrates our setting under three different environments.
Our main goal in this section is to elicit the requirements that representations should fulfil in order to
lead to effective anomaly detectors. More precisely, we argue that representations must simultaneously
(i) maximize the information they contain about the original objects and (ii) be invariant to
the environment.
We illustrate the necessity of invariance and informativeness for effective AD using an example
with random variables X1 , X2 , X3 . We assume that they are distributed over the set N of all points
(x1 , x2 , x3 ) ∈ R3 such that 0 ≤ x1 = x2 ≤ 1. We assume that x3 denotes the environment. Our
dataset D consists of points in N where x3 = 0.
Consider now the following encoders f1, f2, and f3, where f1(x1, x2, x3) = x1, f2(x1, x2, x3) =
(x1, x2), and f3(x1, x2, x3) = (x1, x2, x3). Note that encoder f1 is invariant but lacks the necessary
information to detect anomalies, as it can map different points to the same representation. Indeed, the
normal point (1, 1, 1) and the anomaly point (1, 2, 1) would be mapped to the same representation,
the number 1. Hence, any anomaly detector based on this encoder can be easily evaded. Encoder f3
captures all information but is not invariant, which leads to a potential evasion of AD. Encoder f2 , on
the other hand, is both invariant and informative, correctly capturing the necessary information to
detect anomalies while ignoring the environment variable x3 .
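For concreteness, the following minimal sketch (our illustration, not code from the paper) implements the three encoders and shows that f1 collapses the normal point (1, 1, 1) and the anomaly (1, 2, 1) to the same representation, while f2 keeps them apart without exposing the environment.

import numpy as np

# Toy encoders from the example above (illustrative sketch).
def f1(x):
    # Invariant but not informative: drops x2 and the environment x3.
    return np.array([x[0]])

def f2(x):
    # Invariant and informative: drops only the environment x3.
    return np.array([x[0], x[1]])

def f3(x):
    # Informative but not invariant: keeps the environment x3.
    return np.array(x)

normal = (1.0, 1.0, 1.0)   # satisfies 0 <= x1 = x2 <= 1
anomaly = (1.0, 2.0, 1.0)  # violates x1 = x2

print(f1(normal), f1(anomaly))  # identical: the anomaly is indistinguishable
print(f2(normal), f2(anomaly))  # differ in the second coordinate
print(f3(normal), f3(anomaly))  # differ, but also expose the environment x3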
This simplified scenario emphasizes the dual necessity of invariance and informativeness for the
development of robust anomaly detectors. Regrettably, as we will demonstrate, contemporary AD
methods tend to excel in informativeness while significantly lagging in invariance. For a graphical
illustration of this effect, please see Fig. 1 (b-d).
We now formalize the desiderata for representations computed by an encoder for AD.
Let f be an encoder mapping X to a representation space Z. We also let Z = f(X) denote the
representation of X.
Definition 1. We say that Z is an invariant representation of X under domain shifts if
pdo(E=e)(Z = ·) = pdo(E=e′)(Z = ·), for any e, e′ ∈ E.  (1)
Definition 2. We say that Z is an invariant representation of X under covariate shifts if
pdo(Xe=x)(Z = ·) = pdo(Xe=x′)(Z = ·), for any x, x′ ∈ Xe.  (2)
We simply say that Z is an invariant representation of X if it remains unaltered under both domain
shifts and covariate shifts. Note that domain shifts are stronger than covariate shifts. Specifically,
domain shifts entail assessing data from diverse sources with disparate parameters, whereas covariate
shifts pertain to subtle alterations in the existing data. Consequently, the broader changes inherent to
domain shifts usually exert a more substantial impact on data representation stability and are harder
to tackle.
Invariant representations prevent anomaly detectors from incorrectly classifying objects that result
from covariate and domain shifts. These representations are, by definition, resistant to changes in E
or Xe . As a result, adversaries cannot expect to alter the detection of an anomaly by merely using a
different environment or changing style features of the anomaly.
However, solely enforcing invariant representations can be insufficient, and potentially lead to a
collapse of the representations (see Fig. 1 (c)). For this reason, we argue that representations shall
store as much information about the original object as possible. This principle is in line with the
design of many popular encoders (Linsker [1988], Stone [2004], Hjelm et al. [2018]) and can be
formalized using information theory.
We can use the mutual information between two random variables as our metric to evaluate how
much information about X is captured by the representation Z.
Definition 3. The mutual information I(X; Z) between X and Z is defined as
I(X; Z) = ∫∫ pX,Z(x, z) log ( pX,Z(x, z) / (pX(x) pZ(z)) ) dx dz.  (3)
Here, pX,Z is the joint pdf of (X, Z) and pX and pZ are the marginal pdfs of X and Z, respectively.
Intuitively, I(X; Z) indicates how much information learning about one of X or Z reveals about the
other. We argue then that encoders shall compute representations that keep as much information as
possible from the original objects.
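For illustration only, Definition 3 can be approximated from paired samples with a simple plug-in estimator; the sketch below (ours, with an arbitrary choice of 16 histogram bins) estimates I(X; Z) for one-dimensional X and Z.

import numpy as np

def mutual_information(x, z, bins=16):
    # Plug-in estimate of I(X; Z) in nats from paired one-dimensional samples.
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    p_xz = joint / joint.sum()                 # empirical joint distribution
    p_x = p_xz.sum(axis=1, keepdims=True)      # marginal of X
    p_z = p_xz.sum(axis=0, keepdims=True)      # marginal of Z
    mask = p_xz > 0                            # avoid log(0)
    return float((p_xz[mask] * np.log(p_xz[mask] / (p_x @ p_z)[mask])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print(mutual_information(x, x + 0.1 * rng.normal(size=10_000)))  # large: Z informative about X
print(mutual_information(x, rng.normal(size=10_000)))            # close to zero: independent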
We start by introducing a causal graph for anomaly detection in Fig. 2. In these graphs, E, Xa,
and Xe denote the environment, the relevant features, and the style features, respectively. We introduce
two new random variables W and U. W is a binary variable indicating whether the object is normal
(W = 0) or an anomaly (W = 1). U denotes all confounding factors, that is, factors that produce
correlations between Xa and Xe during the sample generation process.
We now recall that an encoder f that attains invariant representations is Xa -measurable (Veitch et al.
[2021]). The following result, which follows from a theorem from Veitch et al. [2021], argues that
we can ensure invariant representations by enforcing a possibly conditional statistical independence
between Z and E.
Figure 2: Causal graphs for anomaly detection. The left figure shows the case of no confounding. The
right figure shows the case of confounding. An intervention at the E variable induces a domain shift
(gray hammer), whereas an intervention at the Xe variable induces a covariate shift (black hammer).
Theorem 1. Suppose that f learns invariant representations. If W and E are confounded, then
Z ⊥ E | W . Otherwise, Z ⊥ E.
Proof. We sketch a proof here and provide a more rigorous proof in the appendix. Since f learns
invariant representations, it is Xa-measurable, and therefore so is Z. This
means that any probability of an event involving Z can be rewritten as a probability of an event
involving Xa. As a result, it suffices to show that Xa ⊥ E | W when W and E are confounded, and
Xa ⊥ E otherwise. These claims can be shown using d-separation. For example, when W and E
are not confounded, there are only two paths from Xa to E and both are blocked. Hence, Xa ⊥ E by
d-separation.
The fundamental question we must address is how to ensure statistical independence between Z and
E. If such independence holds, then for all e, e′ ∈ E we have pdo(E=e)(Z = ·) = pdo(E=e′)(Z = ·).
In general, we cannot access counterfactual examples, so directly enforcing counterfactual
invariance is impossible. However, it is still possible to induce a counterfactual invariance
signature by imposing appropriate conditional independence conditions. In practice, this condition
can be set as MMD (p(Z = · | E = e), p(Z = · | E = e′)) = 0, where MMD stands for maximum
mean discrepancy (Gretton et al. [2006, 2012]) and measures the distance between two distributions
from empirical samples by computing the divergence between the means of the two sample sets once
they have been projected into a reproducing kernel Hilbert space (RKHS).
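For concreteness, a minimal empirical estimator of this quantity is sketched below: a biased (V-statistic) estimate of MMD^2 between two batches of representations with an RBF kernel. The bandwidth and interface are our own illustrative choices, not the implementation used in our experiments.

import torch

def rbf_mmd(z_a, z_b, bandwidth=1.0):
    # Biased (V-statistic) estimate of MMD^2 between two batches of
    # representations of shapes (n_a, d) and (n_b, d), using an RBF kernel.
    def kernel(u, v):
        d2 = torch.cdist(u, v) ** 2
        return torch.exp(-d2 / (2.0 * bandwidth ** 2))
    return kernel(z_a, z_a).mean() + kernel(z_b, z_b).mean() - 2.0 * kernel(z_a, z_b).mean()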
Based on the previous insights, we propose using the MMD between two distributions as the driver for
the invariance-inducing regularization term. Taking into account our previous formulation depicted
in Fig. 2, we are now in a position to derive a novel regularization term specifically designed for
invariant AD:
ΩPCIR = Σ_{e,e′ ∈ E, e ≠ e′} MMD (p(Z = · | W = 0, E = e), p(Z = · | W = 0, E = e′)).  (4)
We call this approach partial conditional invariant regularization (PCIR), as it induces conditional
distribution invariance over only one instantiation of W .
This partial conditional regularization aligns with other regularization terms proposed in preceding
research (Veitch et al. [2021], Li et al. [2018]). It is ’conditional’ because, in the event of confounding,
we must condition on W when computing the MMD. It is ’partial’ due to the fact that the training
dataset only contains samples where W = 0.
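A minimal sketch of how the PCIR term of Equation 4 can be computed from per-environment batches of normal-sample representations is given below, reusing the rbf_mmd helper sketched earlier; the function name and interface are illustrative, not the released code. Since MMD is symmetric, each unordered environment pair is counted once here, a constant factor that can be absorbed into the regularization weight.

from itertools import combinations

import torch

def pcir(z_by_env, bandwidth=1.0):
    # Partial conditional invariant regularizer: sum of pairwise MMD^2 estimates
    # between the representations of normal (W = 0) samples across environments.
    # rbf_mmd is the estimator sketched above.
    total = torch.zeros((), device=next(iter(z_by_env.values())).device)
    for e, e_prime in combinations(list(z_by_env), 2):
        total = total + rbf_mmd(z_by_env[e], z_by_env[e_prime], bandwidth)
    return total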
5 Experiments
5.1 Experimental Setup
Our solutions are validated under two distinct settings: domain generalization across domain shifts
and domain generalization across covariate shifts. All experiments described were the result of 5
repetitions over different seeds. For additional details on the experimental design please refer to the
supplementary material.
Real-world domain shift For a realistic anomaly detection scenario, we considered the task of
identifying tumorous tissue from images of histological cuts, using the Camelyon17 (Koh et al.
[2021], Bandi et al. [2018]) dataset. This dataset consists of five different subsets of images arising
from five different hospitals, with domain shifts occurring due to differences in slide staining, image
acquisition protocol, and patient cohorts. This presents a challenging anomaly detection task, as
the alterations that differentiate normal from abnormal samples are often subtle and may correlate
with unknown features that are domain-dependent. In our experimental design, we designate one
environment per domain and set up two sets of experiments: one using three environments for training
and another using two environments for training.
Real-world shortcut We also evaluated our method on the Waterbirds dataset (Sagawa et al.
[2019]). This is a real-world natural image dataset where the distribution shift occurs as the natural
habitat depicted in the background changes between an aquatic and a land setting. Of the two kinds
of birds in the dataset (water and land birds), water birds were assigned to the training data and land
birds were treated as anomalies. To make the setup more challenging, we defined a highly unbalanced
distribution of the environments, with 184 training images with a water background and 3498
with a land background. The evaluation of the methods was performed on images with only a water
background for out-of-distribution and only a land background for in-distribution.
Synthetic covariate shifts To evaluate robustness against covariate shifts, we use the DiagViB-6
dataset (Eulig et al. [2021]) for our experiments. This dataset comprises modified and upsampled
images from MNIST (Deng [2012]), where the generative factors of the image can be altered, leading
to changes in texture, hue, lightness, position, and scale.
The training data comprises two distinct environments, generated by manipulating the original
instantiation of each factor while ensuring that all factors differ between the two environments.
the test data, we defined five distinct environments denoted as e0 , e1 , ..., e4 . The environment e0
corresponds to images from MNIST for a specific digit under an original instantiation of each
factor. All images of that digit are labeled as normal, while images of any other digit are considered
anomalies. For 0 < i ≤ 4, each environment ei corresponds to images where i factors have been
modified. In ei , all factors modified in ei−1 , plus an additional unique factor, are altered. The task is
to classify an image of a handwritten digit as either normal or an anomaly.
We also adapted this setup to include the Fashion-MNIST dataset, a more challenging synthetic
benchmark. The same cumulative covariate shifts were induced from an original in-distribution
environment, e0 , to subsequent environments e1−4 . Two specific classes were chosen to generate
normal and abnormal samples.
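To make the cumulative-shift construction concrete, the sketch below applies i stand-in factor manipulations to an image. The actual environments are generated with the DiagViB-6 benchmark code, so the torchvision transforms and parameter values here are only illustrative approximations of its generative factors.

import torchvision.transforms.functional as TF

# Stand-in factor manipulations (hue, a texture proxy, lightness, position);
# the real environments are produced by the DiagViB-6 generator.
FACTOR_SHIFTS = [
    lambda img: TF.adjust_hue(img, 0.3),
    lambda img: TF.adjust_contrast(img, 2.0),
    lambda img: TF.adjust_brightness(img, 1.5),
    lambda img: TF.affine(img, angle=0, translate=[20, 20], scale=1.0, shear=0),
]

def shift_to_environment(img, i):
    # Apply the first i cumulative factor shifts to an RGB image, producing a
    # sample from environment e_i (e_0 leaves the image unchanged).
    for shift in FACTOR_SHIFTS[:i]:
        img = shift(img)
    return img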
5.2 Results
Real-world anomaly detection under domain shift Even with the complexity and real-
world variability introduced by the domain shift, the incorporation of partial conditional distribution
invariance still resulted in notable improvements for both training setups, with MeanShift
and Red PANDA in particular recovering almost in-distribution performance (see Fig. 3). This suggests
that our regularization term is robust to the more substantial changes associated with domain shifts.
[Figure 3: bar plots of AUROC for STFPM, ReverseDistil, CFA, MeanShift, CSI, and Red PANDA, comparing each original method (Original) with its PCIR-regularized version (PCIR).]
Focusing on the out-of-distribution setting, our results highlight a consistent pattern of performance
enhancement when applying the partial conditional invariance regularization term, ranging from
2% to 8% increase in the AUROC when comparing regularized and un-regularized methods. The
consistency of this performance boost across different models and with both two and three in-
distribution training environments underscores the potential of PCIR as a beneficial regularization
technique in out-of-distribution settings.
Realistic shortcut learning In all original, unregularized methods evaluated, we observe
a noticeable drop in performance from in-distribution (background transparent plots) to
out-of-distribution (foreground opaque plots), which we attribute to the models capturing
the exposed shortcut. Models with PCIR nearly recover the in-distribution performance, showing
the effectiveness of this approach in ignoring uninformative environment features even when they are
exposed through a shortcut. Furthermore, across all models tested, performance not only stabilizes but
also increases when the regularization term is added, even in in-distribution scenarios. The performance
gain from adding PCIR to each method ranges from 5% to 15% AUROC.
MNIST and Fashion-MNIST under covariate shift Models that absorb shortcut features have
been observed to be especially susceptible to covariate shifts (Eulig et al. [2021], Geirhos et al.
[2020]). Therefore, if a model exploits a shortcut, then inducing a covariate shift in the shortcut feature
often significantly deteriorates performance. This effect is discernible in Fig. 4, where performance
drops progressively increase for all models as more covariate shifts are induced in the images, and
thus more shortcut features deviate from their original form in the training data.
However, note that by adopting regularization based on partial conditional invariance, we have been
able to construct anomaly detectors that consistently exhibit enhanced robustness against induced
covariate shifts. In some cases, even in-distribution performance increases through partial conditional
invariance. These findings corroborate our hypothesis that partial conditional distribution invariance
serves as a sufficient prior for robustness to covariate shifts in AD. For a more comprehensive analysis
of performance improvements achieved refer to the supplementary material.
Ablation study In our ablation study, we held all method parameters constant to exclusively
examine the impact of the weight assigned to the partial conditional invariance regularization term.
Across all tested methods, the relationship between regularization weight and performance exhibited
a concave shape. This pattern suggests a simple linear search could be sufficient to identify an
optimal weight for the regularization term. This observed behaviour aligns intuitively with the nature
of invariance regularization. Over-emphasizing the regularization term could potentially cause the
model’s encoder to generate non-discriminative and non-informative representations. In an extreme
case, the model could collapse all inputs into a single value. Interestingly, the behaviour of the
regularization term remained consistent across both datasets used in our study. This reinforces the
robustness of our approach across diverse datasets.
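Schematically, the weight examined in this ablation enters the training objective as a multiplier on the PCIR term. The sketch below shows one training step under an assumed generic interface (encoder, ad_loss, and the per-environment batches are hypothetical placeholders rather than any specific method's implementation); a linear search over reg_weight, e.g. on a grid such as 0.001 to 100, then selects the weight.

import torch

def training_step(encoder, ad_loss, batches_by_env, optimizer, reg_weight=1.0):
    # One optimization step: base AD loss on all (normal) training batches plus
    # reg_weight times the PCIR term over per-environment representations.
    # encoder and ad_loss stand for the chosen detector; pcir is sketched above.
    optimizer.zero_grad()
    z_by_env = {e: encoder(x) for e, x in batches_by_env.items()}
    base = sum(ad_loss(z) for z in z_by_env.values())
    loss = base + reg_weight * pcir(z_by_env)
    loss.backward()
    optimizer.step()
    return float(loss.detach())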
[Figure 4: AUROC of STFPM, ReverseDistil, MeanShift, CSI, and Red PANDA (Original vs. PCIR) under cumulative covariate shifts: (a) e1: MNIST s.t. single covariate shift; (b) e2: MNIST s.t. double covariate shift; (c) e3: MNIST s.t. triple covariate shift; (d) e4: MNIST s.t. 4-fold covariate shift; (e) e1: Fashion-MNIST s.t. single covariate shift; (f) e2: Fashion-MNIST s.t. double covariate shift.]
[Figure: mean-AUROC as a function of the regularization weight (0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0) for STFPM, ReverseDistil, MeanShift, CSI, and Red PANDA, with one panel per dataset.]
Invariance regularization also improves in-distribution AD A surprising result from this study
was the consistent enhancement in in-distribution performance across most methods when partial
conditional invariance was incorporated. This observation can be explained by recent advances
suggesting that invariance also bolsters in-distribution robustness against noise and sampling variabil-
ity (Lopes et al. [2019], Mitrovic et al. [2020]) - factors to which anomaly detection is inherently
vulnerable (Ramotsoela et al. [2018]). By design, invariance regularizers discourage representations
from encoding style features. Hence, representations only carry over information that is inherently related
to the object rather than the environment. This leads to models identifying meaningful patterns
instead of noisy variations in the data, even in-distribution.
Unlabeled environments A prominent limitation of invariance regularization lies in its dependence
on environment labels (Veitch et al. [2021], Jiang and Veitch [2022], Arjovsky et al. [2019]). Addition-
ally, as evidenced in our experimental comparisons between covariate and domain shifts, interventions
directly applied to the covariates during environment generation yield enhanced robustness under the
constraints of partial conditional invariance. This underscores that conditional invariance regular-
ization’s effectiveness diminishes when the environments are not distinctly segregated. However, in
situations where datasets lack explicit environment partitions, alternative strategies can be employed.
As demonstrated by Lin et al. [2022], it remains feasible to jointly estimate environment partitions
and invariant representations, provided there is access to sufficient auxiliary information. This finding
opens the door to expanding the applicability of partial conditional invariance in AD, even in scenarios
with limited information on environmental conditions or corrupted separation between environments.
Invariance beyond partial conditioning Theorem 1 shows that in the presence of confounding,
learning invariant representations requires that representations are independent from the environment,
when conditioned on W . Note that our partial conditional invariant regularization conditions only on
W = 0 and not on W = 1 (see Equation 4). This is an inherent limitation due to the fact that we only
have normal objects in the sample. However, our regularization is still powerful enough to provide
improvements over the state of the art. We argue that the quality of this improvement depends on
how disentangled content features are from style features (i.e., Xa and Xe in Figure 2). Consider,
for example, an MNIST setting where normal objects are images of a particular digit and images of
any other digit are anomalies. Furthermore, assume that images from one environment have one
background color and images from another environment have another background color. There is a
clear disentanglement between content (i.e., the digit) and style (i.e., the background color). In such
settings, attaining invariance to the background color just with W = 0 also leads to invariance to
the background color with W = 1. We conjecture that partial conditional invariant regularization is
sufficient when W does not influence Xe ; that is, when there is no arrow between these variables in
Figure 2.
Multi-shift environments Certain datasets, such as the one described by Christie et al. [2018],
take into account data shifts across two different kinds of domain shifts (e.g. time and location),
equivalent to dual interventions in a bivariate E in our formulation. While the extension of our work
to address multi-attribute settings might be plausible in a dual intervention scenario, it presents a
non-trivial challenge as the number of different possible interventions increases, given that it requires
the induction of pairwise invariance across environments. This issue continues to be an area of active
research.
Fairness One potential extension of this work concerns the setting where either a covariate shift
is induced over a protected attribute (e.g., gender or race), or a domain shift that permeates to such
attributes. Such nuances have been brought up in the context of invariant representation learning by
Veitch et al. [2021] and could serve as a potential application of this work in AD. In particular, our
regularization term could be implemented similarly to Louizos et al. [2015], but instead with
the intent of effectively discarding incidental correlations that might reflect societal or systemic biases
present in the data.
7 Conclusions
Our study sheds light on the significant challenges anomaly detectors face in the context of distribution
shifts. We proposed a novel, robust solution centered around invariant representations, which mitigates
the impact of shortcut learning by enforcing statistical independence between the representations and
the environment. Empirical validation of our theoretical proposals confirmed the effectiveness of our
approach, with a regularization term inducing partial conditional distribution invariance significantly
improving model performance under covariate and domain shifts. We believe these findings pave the
way for a deeper understanding of AD methods’ robustness and how to mitigate their vulnerability to
distribution shifts.
References
M. Ahmed, A. N. Mahmood, and M. R. Islam. A survey of anomaly detection techniques in financial
domain. Future Generation Computer Systems, 55:278–288, 2016.
S. Akcay, D. Ameln, A. Vaidya, B. Lakshmanan, N. Ahuja, and U. Genc. Anomalib: A deep learning
library for anomaly detection, 2022.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup
and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning,
volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 06–11 Aug
2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html.
M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint
arXiv:1907.02893, 2019.
P. Bandi, O. Geessink, Q. Manson, M. Van Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee,
K. Paeng, A. Zhong, et al. From detection of individual metastases to classification of lymph node
status at the patient level: the camelyon17 challenge. IEEE Transactions on Medical Imaging, 38
(2):550–560, 2018.
P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger. Uninformed students: Student-teacher anomaly
detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 4183–4192, 2020.
C. M. Bishop and N. M. Nasrabadi. Pattern recognition and machine learning, volume 4. Springer,
2006.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of
visual representations. arXiv preprint arXiv:2002.05709, 2020.
G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.
N. Cohen, J. Kahana, and Y. Hoshen. Red PANDA: Disambiguating image anomaly detection by
removing nuisance factors. In The Eleventh International Conference on Learning Representations,
2023. URL https://openreview.net/forum?id=z37tDDHHgi.
H. Deng and X. Li. Anomaly detection via reverse distillation from one-class embedding. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
pages 9737–9746, June 2022.
L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE
Signal Processing Magazine, 29(6):141–142, 2012.
E. Eulig, P. Saranrittichai, C. K. Mummadi, K. Rambach, W. Beluch, X. Shi, and V. Fischer. Diagvib-
6: A diagnostic benchmark suite for vision models in the presence of shortcut and generalization
opportunities. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 10655–10664, 2021.
R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann.
Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel. Memorizing
normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly
detection. In IEEE International Conference on Computer Vision (ICCV), 2019.
S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain. DROCC: Deep robust one-class
classification. In International conference on machine learning, pages 3711–3721. PMLR, 2020.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-
sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems, volume 19. MIT Press, 2006.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test.
The Journal of Machine Learning Research, 13(1):723–773, 2012.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual
representation learning. arXiv preprint arXiv:1911.05722, 2019.
W. Hilal, S. A. Gadsden, and J. Yawney. Financial fraud: A review of anomaly detection techniques
and recent advances. 2022.
R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio.
Learning deep representations by mutual information estimation and maximization. arXiv preprint
arXiv:1808.06670, 2018.
M. Hosseinzadeh, A. M. Rahmani, B. Vo, M. Bidaki, M. Masdari, and M. Zangakani. Improving
security using SVM-based anomaly detection: issues and challenges. Soft Computing, 25:3195–
3223, 2021.
Y. Jiang and V. Veitch. Invariant and transportable representations for anti-causal domain shifts.
arXiv preprint arXiv:2207.01603, 2022.
J. Kahana and Y. Hoshen. A contrastive objective for learning disentangled representations. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part XXVI, pages 579–595. Springer, 2022.
G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann. Contrastive adaptation network for unsupervised
domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2019.
P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga,
R. L. Phillips, I. Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International
Conference on Machine Learning, pages 5637–5664. PMLR, 2021.
L. Kong, C. d. M. d’Autume, W. Ling, L. Yu, Z. Dai, and D. Yogatama. A mutual information
maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350,
2019.
S. Lee, S. Lee, and B. C. Song. CFA: Coupled-hypersphere-based feature adaptation for target-
oriented anomaly localization. arXiv preprint arXiv:2206.04325, 2022.
H. Li, S. J. Pan, S. Wang, and A. C. Kot. Domain generalization with adversarial feature learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
5400–5409, 2018.
Y. Lin, S. Zhu, L. Tan, and P. Cui. Zin: When and how to learn invariance without environment
partition? Advances in Neural Information Processing Systems, 35:24529–24542, 2022.
R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
C. Liu, X. Sun, J. Wang, H. Tang, T. Li, T. Qin, W. Chen, and T.-Y. Liu. Learning causal semantic
representation for out-of-distribution prediction. Advances in Neural Information Processing
Systems, 34:6155–6170, 2021.
M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution
adaptation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),
December 2013.
R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk. Improving robustness without sacrificing
accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. arXiv
preprint arXiv:1511.00830, 2015.
F. Lv, J. Liang, S. Li, B. Zang, C. H. Liu, Z. Wang, and D. Liu. Causality inspired representation
learning for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 8046–8056, June 2022.
N. Meinshausen. Causality from a distributional robustness point of view. In 2018 IEEE Data Science
Workshop (DSW), pages 6–10. IEEE, 2018.
Y. Ming, H. Yin, and Y. Li. On the impact of spurious correlation for out-of-distribution detection. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10051–10059,
2022.
J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell. Representation learning via
invariant causal mechanisms. arXiv preprint arXiv:2010.07922, 2020.
V. Mothukuri, P. Khare, R. M. Parizi, S. Pouriyeh, A. Dehghantanha, and G. Srivastava. Federated-
learning-based anomaly detection for IoT security attacks. IEEE Internet of Things Journal, 9(4):
2545–2554, 2021.
A. T. Nguyen, T. Tran, Y. Gal, and A. G. Baydin. Domain invariant representation learning with
domain density transformations. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and
J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages
5264–5275. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_
files/paper/2021/file/2a2717956118b4d223ceca17ce3865e2-Paper.pdf.
A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748, 2018.
H. Park, J. Noh, and B. Ham. Learning memory-guided normality for anomaly detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
14372–14381, 2020.
D. Ramotsoela, A. Abu-Mahfouz, and G. Hancke. A survey of anomaly detection in industrial
wireless sensor networks with critical water system infrastructure as a case study. Sensors, 18(8):
2491, 2018.
T. Reiss and Y. Hoshen. Mean-shifted contrastive loss for anomaly detection. arXiv preprint
arXiv:2106.03844, 2021.
T. Reiss, N. Cohen, L. Bergman, and Y. Hoshen. Panda: Adapting pretrained features for anomaly
detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2806–2814, 2021.
L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich,
and K.-R. Müller. A unifying review of deep and shallow anomaly detection. Proceedings of the
IEEE, 109(5):756–795, 2021.
S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks. In
International Conference on Learning Representations, 2019.
T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth. f-AnoGAN: Fast
unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis,
54:30–44, 2019.
M. A. Siddiqui, J. W. Stokes, C. Seifert, E. Argyle, R. McCann, J. Neil, and J. Carroll. Detecting cyber
attacks using anomaly detection with explanations and expert feedback. In ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2872–2876. IEEE, 2019.
S. Smeu, E. Burceanu, A. L. Nicolicioiu, and E. Haller. Env-aware anomaly detection: Ignore style
changes, stay true to content! arXiv preprint arXiv:2210.03103, 2022.
A. Sordoni, N. Dziri, H. Schulz, G. Gordon, P. Bachman, and R. T. Des Combes. Decomposed
mutual information estimation for contrastive representation learning. In International Conference
on Machine Learning, pages 9859–9869. PMLR, 2021.
J. V. Stone. Independent component analysis: a tutorial introduction. MIT press, 2004.
J. Tack, S. Mo, J. Jeong, and J. Shin. CSI: Novelty detection via contrastive learning on distributionally
shifted instances. In Advances in Neural Information Processing Systems, 2020.
D. Tellez, G. Litjens, P. Bándi, W. Bulten, J.-M. Bokhorst, F. Ciompi, and J. Van Der Laak. Quantify-
ing the effects of data augmentation and stain color normalization in convolutional neural networks
for computational pathology. Medical Image Analysis, 58:101544, 2019.
N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE
Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
V. Veitch, A. D’Amour, S. Yadlowsky, and J. Eisenstein. Counterfactual invariance to spurious
correlations: Why and how to pass stress tests. arXiv preprint arXiv:2106.00545, 2021.
G. Wang, S. Han, E. Ding, and D. Huang. Student-teacher feature pyramid matching for anomaly
detection. In The British Machine Vision Conference (BMVC), 2021.
R. Wang, M. Yi, Z. Chen, and S. Zhu. Out-of-distribution generalization with causal invariant
transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 375–385, 2022.
Z. Wang and V. Veitch. A unified causal view of domain invariant representation learning. arXiv
preprint arXiv:2208.06987, 2022.
M. Wu, C. Zhuang, M. Mosse, D. Yamins, and N. Goodman. On mutual information in contrastive
learning for visual representations. arXiv preprint arXiv:2005.13149, 2020.
H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
X. Yan, H. Zhang, X. Xu, X. Hu, and P.-A. Heng. Learning semantic context from normal samples
for unsupervised anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence,
35(4):3110–3118, May 2021. doi: 10.1609/aaai.v35i4.16420. URL https://ojs.aaai.org/
index.php/AAAI/article/view/16420.
H. Yao, Y. Wang, S. Li, L. Zhang, W. Liang, J. Zou, and C. Finn. Improving out-of-distribution
robustness via selective augmentation. In Proceeding of the Thirty-ninth International Conference
on Machine Learning, 2022.
A Theoretical Details
A.1 Measurability
Intuitively, the σ-algebra generated by Y describes all events that Y “can express” and whose
probability can be measured. If X is Y-measurable, this means that Y can express everything that X can express.
We provide here a brief overview on d-separation and confounding and refer the reader to Bishop and
Nasrabadi [2006] for details.
Definition 6. Two random variables in a Bayesian network are confounded if they share a latent
parent.
Definition 7. A path is a sequence of random variables (V1 , . . . , Vn ) in a Bayesian network, where
Vi is a parent or child of Vi+1 , for i < n. For 1 < i < n, the variable Vi is a collision if Vi is a child
of both Vi−1 and Vi+1 .
Definition 8. A path is blocked if V1 and Vn are not observed and at least one of the following holds:
• For some collision V in the path, neither V nor any of its descendants is observed.
• For some variable V in the path that is not a collision, V is observed.
Figure 6: Causal graphs for anomaly detection. The left figure shows the case of no confounding. The
right figure shows the case of confounding. An intervention at the E variable induces a domain shift
(gray hammer), whereas an intervention at the Xe variable induces a covariate shift (black hammer).
We repeat the causal graph for anomaly detection in Figure 6 for convenience. To prove Theorem
1, we use the following lemma.
Lemma 2. If W and E are confounded, then Xa ⊥ E | W . Otherwise, Xa ⊥ E.
Proof. We prove this via d-separation. We start with the case of W and E not being confounded.
Note that there are two paths from Xa to E. One via W and the other via Z. The path via W is
blocked, because Xe is a collision and neither Xe nor its descendant Z are observed. The path via Z
is blocked, because Z is a collision with no descendants and is not observed. Hence, Xa ⊥ E when
W and E are not confounded. In the case of W and E being confounded, we can show analogously
that Xa ⊥ E | W . Note that there are three paths from Xa to E. The first one is via U , but this is
blocked, because W is in that path, it is not a collision, and it is observed. The second one is via W
and Xe , but W is in that path and it is not a collision there, so that path is also blocked. The third
one is via Z, but Z is a collision in that path with no descendants and it is not observed, so it is also
blocked. Since all paths are blocked, when W is observed, we conclude that Xa ⊥ E | W , when W
and E are confounded.
Theorem 2. Suppose that f learns invariant representations. If W and E are confounded, then
Z ⊥ E | W . Otherwise, Z ⊥ E.
Proof. Recall that if f learns invariant representations, then we assume it to be Xa-measurable. This
assumption is justified in Veitch et al. [2021]. As a result, since Z = f(Xa, Xe), we have that Z is
Xa-measurable.
We first assume that W and E are not confounded and show that Z ⊥ E by proving that
p(Z ∈ A, E ∈ B) = p(Z ∈ A) p(E ∈ B), for any A, B ∈ B(R^n). Note that p(Z ∈ A, E ∈ B) =
p(Z^{-1}(A) ∩ E^{-1}(B)) = p(Xa^{-1}(CA) ∩ E^{-1}(B)), for some Borel set CA. The last equality
follows from Z being Xa-measurable, which implies that Z^{-1}(A) = Xa^{-1}(CA), for some Borel set CA,
by Definition 5. By Lemma 2, we have that Xa ⊥ E. This implies that p(Xa^{-1}(CA) ∩ E^{-1}(B)) =
p(Xa^{-1}(CA)) p(E^{-1}(B)) = p(Z^{-1}(A)) p(E^{-1}(B)) = p(Z ∈ A) p(E ∈ B), which is what we
wanted to show.
We now assume that W and E are confounded and show that Z ⊥ E | W by proving that
p(Z ∈ A, E ∈ B | W ∈ C) = p(Z ∈ A | W ∈ C) p(E ∈ B | W ∈ C),
for any A, B, C ∈ B(R^n). By an analogous argument, we can show that
p(Z ∈ A, E ∈ B | W ∈ C) = p(Xa^{-1}(CA) ∩ E^{-1}(B) | W^{-1}(C)). By Lemma 2, this equals
p(Xa^{-1}(CA) | W^{-1}(C)) p(E^{-1}(B) | W^{-1}(C)). With arguments similar to those above, we can
show that the last expression is equal to p(Z ∈ A | W ∈ C) p(E ∈ B | W ∈ C), which is what we
wanted to show.
B Dataset Details
B.1 Camelyon17
Our realistic anomaly detection dataset was derived from the Camelyon17 dataset (Koh et al. [2021],
Bandi et al. [2018]), and contains 3 × 96 × 96 patches of whole-slide images of lymph node sections
sourced from patients who may have metastatic breast cancer. This dataset encompasses tissue
patches obtained from five different hospitals. The objective here is to accurately predict the presence
of tumor tissue within the patches drawn from hospitals that were not part of the training data. Prior
work has shown that differences in staining between hospitals are the primary source of variation in
this dataset; however, other divergent factors in the sampling distribution include different acquisition
protocols and patient populations (Tellez et al. [2019]).
The in-distribution data comprised 151,280 images evenly distributed across three hospitals,
or 100,810 images evenly distributed across two hospitals, depending on the training setting. The
out-of-distribution data covered two additional subsets, the first with 34,904 patches and the
second with 85,054 patches. Note that to adapt this dataset to the anomaly detection setting, only
normal images were included in the in-distribution training data.
The synthetic datasets employed in this study were derived from the DiagViB-6 benchmark (Eulig
et al. [2021]). This benchmark uniquely allows for the manipulation of five independent generative
factors from colored images: overlaid texture, object size, object position, lightness, and saturation,
in addition to the semantic features that correspond to the label. Our synthetic experiments utilized
two datasets: MNIST (Deng [2012]) and Fashion-MNIST (Xiao et al. [2017]). All images in both
datasets were upsampled to dimensions of 3 × 256 × 256.
Figure 7: Illustration of our experimental setup for the synthetic covariate shift experiment. The image
demonstrates representative examples of training data from two distinct environments, alongside
instances of normal and abnormal test data subject to progressively accumulated covariate shifts. This
configuration embodies the nuanced challenge of identifying subtle, yet potentially consequential,
changes in the data distribution.
Initially, we generated two distinct environments specifically for the training data, ensuring that
every generative factor exhibited noticeable differences between the two environments.
Following the generation of these training environments, we created another pair of environments
for the validation data. These validation environments closely mirrored the factor configuration
of the initial training environments, thus retaining an in-distribution setting.
In the final step, six additional environments, denoted as e0 , e1 , ..., e5 , were generated. Each environ-
ment ei consists of images in which i factors have been altered with respect to e0 . For a depiction of
the samples for these different environments, please refer to Fig. 7.
In devising our evaluation setup, we opted to induce covariate shifts that are minor deviations
from the original in-distribution environments. This decision was motivated by our goal of simulating
covariate shifts that are subtle yet potentially detrimental, in particular when compared to the changes
that differentiate normal from abnormal samples.
A description of the accumulated covariate shifts in the test environments, e0 , e1 , e2 , e3 , e4 , is
provided in Table 1.
i Chosen factor in ei
0 None
1 Hue
2 Texture
3 Lightness
4 Position
Table 1: Environments e0 , . . . , e4 used in our synthetic benchmark. For 0 < i ≤ 4, the environment
ei modifies the new factor indicated in the table in addition to the factors modified by e0 , . . . , ei−1 .
C Invariantly pretrained encoders
We also extend our experimental evaluation by adding a comparison to Smeu et al. [2022], an
environment-aware framework for AD that pretrains the encoder of the AD model using an invariance-
inducing method (LISA or IRM). We evaluate this method on MNIST and F-MNIST subject to
targeted covariate shifts (the same setup as in our original experiments). The method was
incorporated into all baselines and compared against the same baselines regularized through partial
conditional invariance.
We show the results of these experiments in Fig. 8. Across all methods tested, and on both datasets,
we observe that invariant pretraining leads to a sharp decrease in performance compared to the baseline
non-invariant method, and that it provides less robustness to covariate shifts than our proposed methodology.
[Figure 8 panels: (a) e1: MNIST s.t. one covariate shift; (b) e2: MNIST s.t. two covariate shifts; (c) e3: MNIST s.t. three covariate shifts; (d) e4: MNIST s.t. four covariate shifts; (e) e1: F-MNIST s.t. one covariate shift; (f) e2: F-MNIST s.t. two covariate shifts; (g) e3: F-MNIST s.t. three covariate shifts; (h) e4: F-MNIST s.t. four covariate shifts. Bars compare Original, PCIR, IRM pretraining, and LISA pretraining for STFPM, ReverseDistil, MeanShift, and Red PANDA.]
Figure 8: Experimental results on MNIST and Fashion-MNIST with additional invariant pretraining
following Smeu et al. [2022]. (Background transparent bar-plots: in-distribution evaluation; foreground
opaque bar-plots: out-of-distribution evaluation.) (a-d) Results on MNIST. (e-h) Results on Fashion-MNIST.
In Fig. 9 we plot the two-dimensional representation of the final layer of a model trained with
MeanShift (Reiss and Hoshen [2021]) at different levels of PCIR regularization. The embeddings are
obtained through t-SNE. As the weight of the PCIR term increases, the representations of the
different environments become increasingly superimposed, yielding more invariance at the cost of
informativeness in the representation.
[Figure 9 panels, t-SNE scatter plots of representations from three environments (blue, red, and yellow digits): (a) invariant and informative; (b) invariant but non-informative; (c) informative but non-invariant.]
Figure 9: t-SNE embeddings of MNIST with three background colors for the digits 4 and 9. The
model used was MeanShift subject to different degrees of partial conditional invariant regularization.
(a) PCIR weight set to 5. (b) PCIR weight set to 150. (c) PCIR weight set to 0.
E Tables of Results
Method                In dist. AUROC (↑)    Out of dist. AUROC (↑)
STFPM 0.699 ± 0.025 0.630 ± 0.025
STFPM (PCIR) 0.724 ± 0.020 0.698 ± 0.023
ReverseDistil 0.673 ± 0.057 0.617 ± 0.032
ReverseDistil (PCIR) 0.734 ± 0.013 0.723 ± 0.013
CFA 0.752 ± 0.003 0.705 ± 0.018
CFA (PCIR) 0.785 ± 0.003 0.759 ± 0.005
MeanShift 0.683 ± 0.041 0.629 ± 0.043
MeanShift (PCIR) 0.731 ± 0.022 0.726 ± 0.052
CSI 0.671 ± 0.026 0.626 ± 0.024
CSI (PCIR) 0.692 ± 0.017 0.674 ± 0.019
Red PANDA 0.732 ± 0.026 0.691 ± 0.031
Red PANDA (PCIR) 0.742 ± 0.029 0.721 ± 0.012
Table 4: Experimental results on realistic shortcut learning in the Waterbirds dataset, for both
regularized and unregularized methods. Results are presented for in-distribution and out-of-
distribution evaluation.
F Performance Gained Compared to Baseline
[Figure: mean-AUROC difference (%) between each PCIR-regularized method and its unregularized baseline (STFPM, ReverseDistil, CFA, MeanShift, CSI, Red PANDA), per environment. Panels: (a) Camelyon17: two in-distribution environments; (b) Camelyon17: three in-distribution environments; and two panels over environments e0-e4 for the synthetic benchmarks.]
G Ablation Studies
As part of our examination of how covariate shifts interact with individual regularization
weights, we have plotted the performance trajectories of all evaluated models across environments
e0 to e4. These are captured in Fig. 14 and Fig. 15.
As expected, and already observed in both our main findings and previous work (Ming et al. [2022]),
these plots reveal a trend shared by all examined models: performance is inversely
correlated with the number of induced covariate shifts. As the complexity introduced by these shifts
mounts, model performance declines proportionately and systematically. This trend is essentially
monotonic, with an erosion in performance at each incremental increase in the number of covariate shifts.
However, this trend is not without exceptions. In particular, for environment e4 we observe
deviations from the downward trend: despite the increase in the number of covariate shifts, the
performance of certain models resists the general declining pattern.
[Figure 11 panels: (a) Camelyon17: STFPM; (b) Camelyon17: ReverseDistil; (c) Camelyon17: CFA; (d) Camelyon17: MeanShift; (e) Camelyon17: CSI; (f) Camelyon17: Red PANDA. Each panel plots mean-AUROC across the in-distribution and out-of-distribution environments for regularization weight 0 and one non-zero regularization weight.]
Figure 11: Mean-AUROC curve of each anomaly detector and its regularized version in the
Camelyon17 dataset.
Thus, our overall conclusion, while acknowledging these exceptions, is that the prevalence of covariate
shifts largely contributes to a degradation in model performance.
It is, however, important to note that the unregularized methods still underperform when compared to
the same methods with even a small amount (0.001) of partial conditional invariance regularization added.
H Additional Discussion
To expand on the experiment tackling real-world shortcut learning, and to better understand how a distribution shift affects the different kinds of shortcut features (Geirhos et al. [2020]) captured by the model, we will now look at how inducing distinct changes to the anomaly detection causal graph may lead to malfunctions in the model.
[Figure: mean-AUROC of STFPM, ReverseDistil, and MeanShift, unregularized (regularization weight 0) versus regularized (weights 0.1, 1.0, and 10, respectively), across environments.]
Recall our formulation of the partition of an object X into the semantic features that distinguish normal from abnormal samples, X_a, and the style features induced by the environment, X_e. With this partition, it is possible to distinguish between settings that may lead to a model failure when a shortcut feature is captured.
Let us simplify the analysis by considering the setting where the training data is sampled under the intervention p^{do(E=e')}(X), that is, the style features are fixed to a specific setting X_{e'}. This is a prevalent setting in real-world applications, as spurious correlations between style and semantic features may arise when sampling the training data. Recall that in the training set, X_a only produces features of normal objects. Under this constraint, it has previously been noted that anomaly detection methods are particularly susceptible to capturing the style features as a prominent factor in the representation of X (Ming et al. [2022]).
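To make this training regime concrete, the following toy sketch (our own illustration, not the paper's data pipeline) samples training objects whose style features are clamped to a single environment e', while the semantic features only take normal-class values:

```python
import torch

def sample_training_batch(n, e_prime):
    """Toy sketch: sample objects X = (X_a, X_e) under the intervention do(E = e').

    X_a carries the semantic features (only normal-class values at training time),
    while the style features X_e are clamped to the fixed environment e'.
    """
    x_a = torch.randn(n, 4)        # normal-class semantic features
    x_e = e_prime.expand(n, -1)    # style features fixed by the intervention
    return torch.cat([x_a, x_e], dim=1)

# Training data comes from a single environment e' and contains no anomalies.
e_prime = torch.tensor([[1.0, -0.5]])
x_train = sample_training_batch(512, e_prime)
```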
Moving to the evaluation stage of the anomaly detector, we can then consider two settings.
[Figure: mean-AUROC across environments e0–e4 for regularization weights 0.0, 0.001, 0.01, 0.1, 1.0, 10.0, and 100.0. Panels: (a) Fashion-MNIST: STFPM, (b) Fashion-MNIST: Reverse Distil, (c) Fashion-MNIST: MeanShift, (d) CSI, (e) Red PANDA.]
The first setting consists of no change to the intervention p^{do(E=e')}(X), that is, there was no distribution shift in the sampling of the data. In this setting, the main source of model failure comes from anomalies characterized by small changes in X_a from normal to abnormal samples. In other words, the style features would be treated as the main source of information for characterizing new samples, and under this constraint all test instances would be highly likely to be labelled as normal. This setting was introduced by both Ming et al. [2022] and Cohen et al. [2023], and the underlying features can be referred to as nuisance features.
The second setting consists of a different intervention, p^{do(E=e'')}(X), that differs from the training density p^{do(E=e')}(X). In particular, we can consider an intervention that only changes the stylistic features X_e captured by the model. We are then essentially operating under a highly targeted covariate shift that focuses on the shortcut features. Therefore, depending on the extent of the changes in X_a from normal to abnormal samples, and on how well they are captured by the model, this distribution shift can lead to an anomaly detector that classifies all new samples as anomalies.
Note that in both settings, as shown in our main presentation of the method, by inducing a partially
conditional invariance to different environments, our regularization method also inherently introduces
an invariance to the style features Xe . As supported by our results in the synthetic covariate shift
experiments, we believe this also produces models that are not only robust to covariate shifts, as in
the second scenario, but also to shortcut learning, as described in the first scenario.
pool of examples of the data in hopes that the model implicitly captures the additional simulated
variance in the images.
Yet, data augmentation could still be beneficial when combined with penalized invariant regularization, by producing meaningful augmentations in a multi-environment setting. It could be used to generate additional data for a single intervention, thus increasing the pool of available samples for a specific environment, or to generate data for a new intervention, increasing both the number of available environments and the overall pool of samples in the dataset. This would effectively alleviate the second drawback of solely using data augmentations.
We considered other options besides MMD, such as the Wasserstein distance. MMD is known to be computationally cheaper than the Wasserstein distance: it can be easily computed using Gaussian kernels, whereas the Wasserstein distance requires an optimization over couplings of the two distributions, which is computationally intractable in general.
Another option we considered was the KL divergence. However, previous work has demonstrated that this divergence is very unstable for probability distributions supported on low-dimensional manifolds (Arjovsky et al. [2017]). Note that many of the methods used in anomaly detection and machine learning are trained on samples from such distributions. A similar argument applies to variations of this divergence, such as the Jensen-Shannon divergence and the total variation distance.
Finally, we remark that MMD has a strong theoretical foundation. In spite of its simplicity, it is derived from a supremum of differences of expected values of functions from a reproducing kernel Hilbert space (Gretton et al. [2012]). It has been shown that, for characteristic kernels such as the Gaussian kernel, this metric is zero only if the two distributions match, which is precisely what we aim to achieve when learning representations that are invariant to the environments.
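As a concrete illustration, here is a minimal sketch of a biased squared-MMD estimator with a Gaussian kernel between two batches of representations (one per environment); the single fixed bandwidth and the way the penalty is attached to the loss are simplifications of ours, not the exact implementation used in the paper:

```python
import torch

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix: k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bw^2))."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Usage: penalize the discrepancy between representations of two environments.
z_env1 = torch.randn(64, 128)  # placeholder features from environment 1
z_env2 = torch.randn(64, 128)  # placeholder features from environment 2
reg = mmd2(z_env1, z_env2)     # added to the training loss, scaled by a regularization weight
```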
I Implementation Details
I.1 Baselines
STFPM The STFPM (Wang et al. [2021]) algorithm incorporates a pretrained teacher network
and a student network that share the same structure. The student network assimilates the distribution
of images devoid of anomalies by aligning the features with corresponding features in the teacher
network. To boost robustness, the algorithm uses multi-scale feature matching. This multi-tiered
feature alignment lets the student network absorb a blend of multi-level insights from the feature
pyramid, thereby permitting the detection of anomalies of varying magnitudes.
During the inference stage, the feature pyramids of the teacher and student networks are compared. A larger discrepancy between these pyramids implies a higher likelihood that anomalies are present.
Reverse distillation The reverse distillation method (Deng and Li [2022]) is assembled from three
networks: an initial pretrained feature extractor, f , a bottleneck embedding, ϕ, and the student
decoder network, ν. The primary layer, or the backbone, of f is derived from a ResNet model that
was pretrained on the ImageNet dataset.
During the execution of a forward pass, the model extracts features from three separate ResNet blocks.
These features are encoded by amalgamating the three feature maps using the multi-scale feature
fusion block of ϕ, and then transferred to the decoder, ν, which is constructed to mirror the feature
extractor, albeit with operations reversed.
Throughout the training process, the output of these mirrored blocks is trained to match the outputs of the respective layers of the feature extractor, using the cosine distance as the loss metric.
CFA The CFA method (Lee et al. [2022]) identifies anomalies utilizing features that are specifically tailored to the target dataset. The CFA model comprises two main elements: firstly, a learnable patch descriptor that learns features oriented towards the target dataset, and secondly, a scalable memory bank whose size is unaffected by the size of the target dataset. In conjunction with a pre-trained encoder, CFA applies the patch descriptor and memory bank, thereby making use of transfer learning to increase the density of normal features. Consequently, this facilitates an easier distinction between normal and abnormal features.
Mean-shifted The mean-shifted contrastive learning method (Reiss and Hoshen [2021]) introduces
a novel loss function that calculates angular distances using the mean of all feature vectors as a
reference point. This is done in contrast to using the origin as a reference and the Euclidean distance.
It also combines two loss functions, one involving contrastive terms akin to Chen et al. [2020], but with these terms positioned around a hypersphere centred on the mean of all feature vectors. To deter positive samples from repelling each other, it also incorporates an angular centre loss that encourages samples to gravitate towards the mean of the normal samples.
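As a rough sketch of the two ingredients described above, the snippet below implements a re-centred (mean-shifted) cosine similarity and a simplified angular centre term; the exact formulation in Reiss and Hoshen [2021] may differ in its details, so this is only our reading of the description:

```python
import torch
import torch.nn.functional as F

def mean_shifted_similarity(z_i, z_j, center):
    """Cosine similarity of features re-centred on the mean feature vector."""
    return F.cosine_similarity(z_i - center, z_j - center, dim=1)

def angular_center_loss(z, center):
    """Pull (normal) features towards the direction of the normal-sample mean."""
    return -(F.normalize(z, dim=1) @ F.normalize(center, dim=0)).mean()

# center: mean of the training feature vectors (fixed or periodically updated).
z = torch.randn(32, 128)
center = z.mean(dim=0)
loss = angular_center_loss(z, center)
```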
CSI CSI (Tack et al. [2020]) is a direct extension of Chen et al. [2020], introducing a unique form
of data augmentations known as distribution-shifting augmentations. In this setup, distribution-shifted
augmentations are treated as negative samples instead of positive ones and are consequently pushed
away from all positive samples. These augmentations include manipulations such as rotations and
permutations. An augmentation's potential to shift the distribution is assessed through the AUROC, where samples altered by said augmentation are treated as out-of-distribution samples. The underlying notion is that distinguishability is directly proportional to the shift in distribution.
Red PANDA The Red PANDA method for anomaly detection (Cohen et al. [2023]) tackles the particular problem of anomaly detection under nuisance or distracting features. Relying on labels for the nuisance factors, it employs a contrastive disentanglement loss following Kahana and Hoshen [2022], in conjunction with a perceptual loss, to train a generator function end-to-end with a pretrained encoder.
I.2 Anomaly Scoring
STFPM During training, the student network tries to align its features on the training distribution with those of the teacher. During prediction, the input x is fed into both the student and teacher feature extractors, which output fs(x) and ft(x), respectively. The anomaly scoring relies on a traditional density estimation approach: normal samples are assumed to be mapped to the high-density region where the student and teacher encoders are aligned, while anomalous samples are mapped to low-density regions of the student extractor. Therefore, the anomaly score is computed as the distance between fs(x) and ft(x), i.e. AS(x) = d(fs(x), ft(x)), where the distance metric d is the l2-norm.
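A minimal sketch of this scoring rule with hypothetical student/teacher callables; the actual method aggregates the discrepancy over several feature-pyramid levels rather than a single feature map:

```python
import torch

def stfpm_anomaly_score(x, student, teacher):
    """AS(x) = ||f_s(x) - f_t(x)||_2 (single scale here, for brevity)."""
    with torch.no_grad():
        fs = student(x).flatten(1)   # student features
        ft = teacher(x).flatten(1)   # frozen teacher features
    return torch.norm(fs - ft, p=2, dim=1)
```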
Reverse distillation The anomaly scoring function here is derived from the standard anomaly scoring functions used in reconstruction-based anomaly detection algorithms. In particular, the anomaly score is defined as the distance between the encoded features and the features reconstructed by the decoder, AS(y) = d(f(y), ν(ϕ(f(y)))), where ν(·) is the decoder, ϕ(·) is the bottleneck (distiller) and f(·) is the encoder. The idea behind this anomaly scoring function is that the decoder has learned to reconstruct normal samples from the training data, but is not able to reconstruct anomalous samples. Therefore, anomalous samples receive a higher anomaly score.
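Correspondingly, a sketch of this reconstruction-style score; using the cosine distance here mirrors the training loss above and is our assumption, as is the flattening of the multi-block features into a single vector:

```python
import torch
import torch.nn.functional as F

def reverse_distillation_score(y, f, phi, nu):
    """AS(y): dissimilarity between f(y) and its reconstruction nu(phi(f(y)))."""
    with torch.no_grad():
        feats = f(y)             # encoder features
        recon = nu(phi(feats))   # bottleneck embedding + mirrored decoder
    cos = F.cosine_similarity(feats.flatten(1), recon.flatten(1), dim=1)
    return 1.0 - cos             # higher = more anomalous
```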
CFA For CFA we use the standard image-level density estimation approach. In particular, the training samples are mapped to feature maps f(x) and clustered into k clusters using k-means. For prediction, given an input y, its final feature f(y) is computed by feeding it through the feature extractor and the patch descriptor. The d nearest cluster centers of f(y) are then selected, and the anomaly score for that sample is the mean distance to those d centers, AS(y) = (1/d) Σ_{f_i ∈ N_d(f(y))} d(f_i, f(y)). In this case, the distance metric d is simply the l2 distance. The idea behind this approach is that anomalous samples are mapped to low-density areas of the feature space, far away from all the cluster centers, while normal samples are mapped to high-density areas, close to the normal training samples.
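A minimal sketch of this image-level scoring step; the use of plain k-means centres and the particular value of d are illustrative assumptions following the description above, not the exact CFA code:

```python
import torch

def cfa_anomaly_score(f_y, cluster_centers, d=3):
    """Mean l2 distance from the test feature f(y) to its d nearest cluster centres."""
    dists = torch.cdist(f_y.unsqueeze(0), cluster_centers).squeeze(0)   # (num_centers,)
    nearest = torch.topk(dists, k=d, largest=False).values
    return nearest.mean()

# cluster_centers: k-means centres fitted on the training features f(x).
```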
Mean-shift and Red PANDA Both contrastive learning-based baselines, namely mean-shift (Reiss and Hoshen [2021]) and Red PANDA (Cohen et al. [2023]), rely on finetuning an encoder (pre-trained or not) by grouping the feature vectors of the training images around a sub-region of the hypersphere centered at the origin. At prediction time, the most common approach to classify a new sample as anomalous or not is through the mean distance to the k nearest normal (training) images. Following the original works, k was set to 2 for mean-shift and to 1 for Red PANDA.
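A sketch of this kNN criterion over the stored training features; using cosine distance on normalized features is our assumption given the hypersphere geometry described above:

```python
import torch
import torch.nn.functional as F

def knn_anomaly_score(f_y, train_feats, k=2):
    """Mean cosine distance to the k nearest (normal) training features.

    k=2 for mean-shift and k=1 for Red PANDA, as noted above.
    """
    sims = F.normalize(train_feats, dim=1) @ F.normalize(f_y, dim=0)
    nearest = torch.topk(sims, k=k, largest=True).values
    return (1.0 - nearest).mean()
```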
CSI CSI (Tack et al. [2020]) relies only on a vanilla contrastive loss between original samples and highly augmented samples that serve as a proxy for abnormal samples. Similarly to the setting of mean-shift and Red PANDA, this leads to a feature space that falls around the hypersphere centered at the origin, but not necessarily on its surface. However, operating under the hypothesis that the highly augmented samples match the anomalies, the feature vectors of the training images are already pushed to the diametrically opposite side of the hypersphere compared to abnormal samples. Additionally, it was empirically verified that the norm of the feature vectors of abnormal samples is much lower than that of in-distribution samples. This leads to a criterion that combines the cosine similarity to the closest training sample with the norm of the sample's feature vector: max_m sim(f(x_m), f(x)) · |f(x)|, where f is the encoder that maps the input object x to its feature space, and sim is the cosine similarity.
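A minimal sketch of this criterion, where f_x is the feature vector of the test sample and train_feats the stored training features (names are ours); higher values indicate normal samples, so an anomaly score can be obtained by negating it:

```python
import torch
import torch.nn.functional as F

def csi_normality_score(f_x, train_feats):
    """max_m sim(f(x_m), f(x)) * |f(x)|: higher values indicate normal samples."""
    max_sim = (F.normalize(train_feats, dim=1) @ F.normalize(f_x, dim=0)).max()
    return max_sim * f_x.norm()
```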
I.3 Hyperparameters
As this work considered a novel setting where each anomaly detection method was evaluated for the
first time, we modestly optimized hyperparameters. Our approach consisted of two primary steps.
The first involved scaling up two key factors: (a) batch size, and (b) learning rate. Subsequently,
we methodically scanned through an array of distinct parameters for each baseline model. These
included the backbones ResNet18, ResNet34, ResNet50 and WideResNet50, alongside various
anomaly scoring methodologies that leverage image-level, density estimation, reconstruction error,
and pixel-wise density estimation approaches. An additional aspect of our study was an ablation analysis in which the regularization weight was fine-tuned by sweeping through the set of values {0.001, 0.01, 0.1, 1, 10, 100}. For all other hyperparameters we adhered to the values reported in the original works and refrained from any further optimization.
One thing we noticed during our experiments is that, for models relying on pretrained backbones, the choice of backbone matters. For instance, for STFPM the optimal choice was the simplest feature extractor, ResNet18, whereas for reverse distillation the optimal choice was WideResNet50. The choice appears to be model-dependent rather than dataset-dependent. For more details on the backbone chosen for each method, refer to Tab. 7.
The complete project required 3400 hours of GPU usage throughout all experiments, covering development, testing, and comparisons. The resources supplied were part of a local cluster and consisted of two GPU models: the NVIDIA TITAN RTX and the NVIDIA Tesla V100.
The main Python libraries used in our implementation were Pytorch, which is under a BSD-3 license [1], and Pytorch Lightning, which is under an Apache 2.0 license [2].
Methods that were derived from the anomalib library (Akcay et al. [2022]), namely STFPM, reverse distillation, and CFA, were already implemented as Pytorch Lightning Modules and are all under an Apache 2.0 license [3]. These were incorporated directly into our pipeline.
[1] https://github.com/pytorch/pytorch/blob/main/LICENSE
[2] https://github.com/Lightning-AI/lightning/blob/master/LICENSE
[3] https://github.com/openvinotoolkit/anomalib/blob/main/LICENSE
[Table 7, header and excerpt: STFPM (Wang et al. [2021]), ReverseDistil (Deng and Li [2022]), CFA (Lee et al. [2022]), MeanShift (Reiss and Hoshen [2021]), CSI (Tack et al. [2020]), Red PANDA (Cohen et al. [2023]); Camelyon17 (3x96x96).]
Table 7: A detailed summary of the hyperparameters used for each evaluated model across three datasets: Camelyon17, DiagViB-6 (MNIST), and DiagViB-6 (Fashion-MNIST). Parameters include learning rate, scheduler, optimizer, batch size, backbone, pretraining, regularization weight, and MMD kernel type, along with the type of anomaly score. Notably, the CFA model could not be successfully implemented for the DiagViB-6-based experiments despite trying an extensive range of hyperparameter combinations. Models are referenced by their respective citations.
DCoDR (Kahana and Hoshen [2022]), on which Red PANDA is based, was released under a Software Research License [4]. Our experiments for Red PANDA were derived directly from the official repository. Mean-shifted contrastive learning was released under a Software Research License [5]. We re-implemented this method as a Pytorch Lightning Module, loosely following its original official implementation. The DiagViB-6 (Eulig et al. [2021]) and Camelyon17 (Koh et al. [2021]) datasets were also publicly released, under a GNU Affero General Public License v3.0 [6] and an MIT License [7], respectively. Our implementation follows directly from their official repositories.
[4] https://github.com/jonkahana/DCoDR/blob/main/LICENSE
[5] https://github.com/talreiss/Mean-Shifted-Anomaly-Detection/blob/main/LICENSE
[6] https://github.com/boschresearch/diagvib-6/blob/main/LICENSE
[7] https://github.com/p-lambda/wilds/blob/main/LICENSE