
Context-Aware Drift Detection

Oliver Cobb¹ Arnaud Van Looveren¹

Abstract

arXiv:2203.08644v2 [stat.ML] 2 Aug 2022

When monitoring machine learning systems, two-sample tests of homogeneity form the foundation upon which existing approaches to drift detection build. They are used to test for evidence that the distribution underlying recent deployment data differs from that underlying the historical reference data. Often, however, various factors such as time-induced correlation mean that batches of recent deployment data are not expected to form an i.i.d. sample from the historical data distribution. Instead we may wish to test for differences in the distributions conditional on context that is permitted to change. To facilitate this we borrow machinery from the causal inference domain to develop a more general drift detection framework built upon a foundation of two-sample tests for conditional distributional treatment effects. We recommend a particular instantiation of the framework based on maximum conditional mean discrepancies. We then provide an empirical study demonstrating its effectiveness for various drift detection problems of practical interest, such as detecting drift in the distributions underlying subpopulations of data in a manner that is insensitive to their respective prevalences. The study additionally demonstrates applicability to ImageNet-scale vision problems.

Figure 1. Batches of the most recent deployment data may not cover all contexts (e.g. day and night) covered by the training data. We often do not wish for this partial coverage to cause drift detections, but instead to detect other changes not explained by the change/narrowing of context. A nighttime deployment batch should be permitted to contain only wolves and owls, but a daytime deployment batch should not contain any owls.

1. Introduction

Machine learning models are designed to operate on unseen data sharing the same underlying distribution as a set of historical training data. When the distribution changes, the data is said to have drifted and models can fail catastrophically (Recht et al., 2019; Engstrom et al., 2019; Hendrycks & Dietterich, 2019; Taori et al., 2020; Barbu et al., 2019). It is therefore important to have systems in place that detect when drift occurs and raise alarms accordingly (Breck et al., 2019; Klaise et al., 2020; Paleyes et al., 2020).

Given a model of interest, drift can be categorised based on whether the change is in the distribution of features, the distribution of labels, or the relationship between the two. Approaches to drift detection are therefore diverse, with some methods focusing on one or more of these categories. They invariably, however, share an underlying structure (Lu et al., 2018). Available data is repeatedly arranged into a set of reference samples, a set of deployment² samples, and a test of equality is performed under the assumption that the samples within each set are i.i.d. Methods then vary in the notion of equality chosen to be repeatedly tested, which may be defined in terms of specific moments, model-dependent transformations, or in the most general distributional sense (Gretton et al., 2008; 2012a; Tasche, 2017; Lipton et al., 2018; Page, 1954; Gama et al., 2004; Baena-García et al., 2006; Bifet & Gavaldà, 2007; Wang & Abraham, 2015; Quionero-Candela et al., 2009; Rabanser et al., 2019; Cobb et al., 2022). Each test evaluates a test statistic capturing the extent to which the two sets of samples deviate from the chosen notion of equality and makes a detection if a threshold is exceeded. Threshold values are set such that, under the assumptions of i.i.d. samples and strict equality, the rate of false positives is controlled. However, in practice windows of deployment data may stray from these assumptions in ways deemed to be permissible.

¹Seldon Technologies. Correspondence to: Oliver Cobb <[email protected]>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

²Throughout this paper we use "deployment" data to mean the unseen data on which a model was designed to operate.

As a simple example, illustrated in Figure 1, consider a computer vision model operating on sequentially arriving images. The distribution underlying the images is known to change depending on the time of day, throughout which lighting conditions change. The distribution underlying a nighttime batch of deployment samples differs from that underlying the full training set, which also contains daytime images. Despite this, the model owner does not wish to be inundated with alarms throughout the night reminding them of this fact. In this situation the time of day (or lighting condition) forms important context which the practitioner is happy to let vary between windows of data. They are interested only in detecting deviations from equality that can not be attributed to a change in this context. Although this simple example may be addressed by comparing only to relevant subsets of the reference data, more generally changes in context are distributional and such simple approaches can not be effectively applied.

Deviations from the i.i.d. assumption are the norm rather than the exception in practice. An application may be used by different age groups at different times of the week. Search engines experience surges in similar queries in response to trending news stories. Food delivery services expect the distribution of orders to differ depending on the weather. In all of these cases there exists important context that existing approaches to drift detection are not equipped to account for. A common response is to decrease the sensitivity of detectors so that such context changes cause fewer unwanted detections. This, however, hampers the detector's ability to detect the potentially costly changes that are of interest. With this in mind, our contributions are to:

1. Develop a framework for drift detection that allows practitioners to specify context permitted to change between windows of deployment data and only detects drift that can not be attributed to such context changes.

2. Recommend an effective and intuitive instantiation of the framework based on the maximum mean discrepancy (MMD) and explore connections to popular MMD-based two-sample tests.

3. Explain and demonstrate the applicability of the framework to various drift detection problems of practical interest.

4. Make an implementation available to use as part of the open-source Python library alibi-detect (Van Looveren et al., 2022).

2. Background and Notation

We briefly review two-sample tests for homogeneity and treatment effects. To make clear the connections between the two we adopt treatment-effect notation for both. We focus on methods applicable to multivariate distributions and sensitive to differences or effects in the general distributional sense, not only in certain moments or projections.

2.1. Two-sample tests of homogeneity

Let $(\Omega, \mathcal{F}, P)$ denote a probability space with associated random variables $X : \Omega \to \mathcal{X}$ and $Z : \Omega \to \{0, 1\}$. We subscript $P$ to denote distributions of random variables such that, for example, $P_X$ denotes the distribution of $X$. Consider an i.i.d. sample $(\mathbf{x}, \mathbf{z}) = \{(x_i, z_i)\}_{i=1}^{n}$ from $P_{X,Z}$ and the two associated samples $\mathbf{x}^0 = \{x^0_i\}_{i=1}^{n_0}$ and $\mathbf{x}^1 = \{x^1_i\}_{i=1}^{n_1}$ from $P_{X_0} := P_{X|Z=0}$ and $P_{X_1} := P_{X|Z=1}$ respectively. A two-sample test of homogeneity is a statistical test of the null hypothesis $h_0 : P_{X_0} = P_{X_1}$ against the alternative $h_1 : P_{X_0} \neq P_{X_1}$. The test starts by specifying a test statistic $\hat{t} : \mathcal{X}^{n_0} \times \mathcal{X}^{n_1} \to \mathbb{R}$, typically an estimator of a distance $D(P_{X_0}, P_{X_1})$, expected to be large under $h_1$ and small under $h_0$. To test at significance level (false positive probability) $\alpha$, the observed value of the test statistic is computed along with the probability $p = P_{h_0}(T > \hat{t}(\mathbf{x}^0, \mathbf{x}^1))$ that such a large value of the test statistic would be observed under $h_0$. If $p < \alpha$ then $h_0$ is rejected.

Although effective alternatives exist (Lopez-Paz & Oquab, 2017; Ramdas et al., 2017; Bu et al., 2018), we focus on kernel-based test statistics (Harchaoui et al., 2007; Gretton et al., 2008; 2009; Fromont et al., 2012; Gretton et al., 2012a;b; Chwialkowski et al., 2015; Jitkrittum et al., 2016; Liu et al., 2020), which are particularly popular due to their applicability to any domain $\mathcal{X}$ upon which a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, capturing a meaningful notion of similarity, can be defined. The most common example is to let $\hat{t}$ be an estimator of the squared MMD $D(P_{X_0}, P_{X_1}) = \|\mu_{P_0} - \mu_{P_1}\|^2_{\mathcal{H}_k}$ (Gretton et al., 2012a), which is the distance between the distributions' kernel mean embeddings (Muandet et al., 2017) in the reproducing kernel Hilbert space $\mathcal{H}_k$. The squared MMD admits a consistent (although biased) estimator of the form

$$\hat{t}(\mathbf{x}^0, \mathbf{x}^1) = \frac{1}{n_0^2}\sum_{i,j} k(x^0_i, x^0_j) + \frac{1}{n_1^2}\sum_{i,j} k(x^1_i, x^1_j) - \frac{2}{n_0 n_1}\sum_{i,j} k(x^0_i, x^1_j). \quad (1)$$

In cases such as this, where the distribution of the test statistic $\hat{t}$ under the null hypothesis $h_0$ is unknown, it is common to use a permutation test to obtain an accurate estimate $\hat{p}$ of the unknown p-value $p$. This compares the observed value of the test statistic $\hat{t}$ against a large number of alternatives that could, with equal probability under the null hypothesis, have been observed under random reassignments of the indexes $\{z_i\}_{i=1}^{n}$ to instances $\{x_i\}_{i=1}^{n}$.
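The estimator in Equation 1 and the accompanying permutation test can be sketched as follows. This is an illustrative NumPy implementation, not the alibi-detect one; the Gaussian RBF kernel, bandwidth, and the small-sample (1+b)/(1+B) p-value convention are our own choices.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_biased(x0, x1, bandwidth=1.0):
    # Biased estimator of the squared MMD (Equation 1).
    n0, n1 = len(x0), len(x1)
    k00 = rbf_kernel(x0, x0, bandwidth).sum() / n0 ** 2
    k11 = rbf_kernel(x1, x1, bandwidth).sum() / n1 ** 2
    k01 = rbf_kernel(x0, x1, bandwidth).sum() / (n0 * n1)
    return k00 + k11 - 2 * k01

def permutation_pvalue(x0, x1, n_perm=200, seed=0):
    # Estimate p by recomputing the statistic under random reassignments
    # of the domain indexes z_i to the pooled instances x_i.
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x0, x1)
    pooled = np.concatenate([x0, x1])
    n0 = len(x0)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        exceed += mmd2_biased(pooled[perm[:n0]], pooled[perm[n0:]]) > observed
    return (1 + exceed) / (1 + n_perm)  # +1 correction is a common convention

rng = np.random.default_rng(1)
x_ref = rng.normal(0.0, 1.0, size=(100, 2))
x_dep = rng.normal(1.5, 1.0, size=(100, 2))  # mean-shifted: drift present
print(permutation_pvalue(x_ref, x_dep))      # small p-value expected
```

With the shifted deployment batch the observed statistic exceeds essentially all permuted variants, so $h_0$ is rejected at any reasonable $\alpha$.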

2.2. Two-sample tests for treatment effects

A related problem is that of inferring treatment effects. Here, instead of asking whether $X$ is independent of $Z$ we ask whether it is causally affected by $Z$. To illustrate, now consider $Z$ a treatment assignment and $X$ an outcome of interest. We may write $X = X^0(1 - Z) + X^1 Z$, where both the observed outcome and the counterfactual outcome corresponding to the alternative treatment assignment are considered. If, as is common in observational studies, the treatment assignment $Z$ is somehow guided by $X$, then the distribution of $X$ will depend on $Z$ even if the treatment is ineffective. The dependence will be non-causal however. In such cases, to determine the causal effect of $Z$ on $X$ it is important to control for covariates $C$ through which $Z$ might depend on $X$. Supposing we can identify such covariates, i.e. a random variable $C : \Omega \to \mathcal{C}$ satisfying the following condition of unconfoundedness,

$$Z \perp\!\!\!\perp (X^0, X^1) \mid C, \quad (A1)$$

then differences between the distributions of $X^0$ and $X^1$ can be identified from observational data. This is because unconfoundedness ensures $P_{X_0|C} := P_{X|C,Z=0} = P_{X^0|C,Z=0} = P_{X^0|C}$, and likewise for $Z = 1$. Henceforth we assume the unconfoundedness assumption holds and use $P_{X_z|C}$ to refer to both $P_{X|C,Z=z}$ and $P_{X^z|C}$.

A common summary of effect size is the average treatment effect $\mathrm{ATE} = E[X^1] - E[X^0]$ (Rosenbaum & Rubin, 1983), which is the expectation of the conditional average treatment effect (CATE) $U(c) = E[X^1|C=c] - E[X^0|C=c]$ with respect to the marginal distribution $P_C$. More generally however we may be interested in effects beyond the mean and consider a treatment effect to exist if the conditional distributions $P_{X_0|C=c}(\cdot)$ and $P_{X_1|C=c}(\cdot)$ are not equal almost everywhere (a.e.) with respect to $P_C$ (Lee & Whang, 2009; Chang et al., 2015; Shen, 2019). In this more general setting the effect size can be summarised by defining a conditional distributional treatment effect (CoDiTE) function $U_D(c) = D(P_{X_0|C=c}, P_{X_1|C=c})$ and marginalising over $P_C$ to obtain what we refer to as an average distributional treatment effect (ADiTE) $E[D(P_{X_0|C}, P_{X_1|C})]$. This is the expected distance, as measured by $D$, between two $C$-measurable random variables. For either the ATE or ADiTE quantities to be well defined requires a second assumption of overlap,

$$0 < e(c) := P(Z=1 \mid C=c) < 1 \quad P_C\text{-a.e.}, \quad (A2)$$

to avoid the inclusion of quantities conditioned on events of zero probability. The function $e : \mathcal{C} \to [0, 1]$ is often referred to as the propensity score.

Although various CoDiTE functions have been proposed (Hohberg et al., 2020; Chernozhukov et al., 2013; Briseño Sanchez et al., 2020), to the best of our knowledge only that of Park et al. (2021) straightforwardly (through a kernel formulation) generalises to multivariate and non-euclidean domains for both $X$ and $C$. They choose $D$ to be the squared MMD such that the associated CoDiTE

$$U_{\mathrm{MMD}}(c) = \|\mu_{X_0|C=c} - \mu_{X_1|C=c}\|^2_{\mathcal{H}_k} \quad (2)$$

is the squared distance in $\mathcal{H}_k$ between the mean embeddings of $P_{X_0|C=c}$ and $P_{X_1|C=c}$, which are each well defined under the overlap assumption A2. The associated ADiTE $E[U_{\mathrm{MMD}}(C)]$ is then equal to the expected distance between the conditional mean embeddings (CMEs) $\mu_{X_0|C} = E[k(X_0, \cdot)|C]$ and $\mu_{X_1|C} = E[k(X_1, \cdot)|C]$, which are $C$-measurable random variables in $\mathcal{H}_k$. Here CMEs are defined using the measure theoretic formulation which Park & Muandet (2020) introduce as preferable to the operator-theoretic formulation of Song et al. (2009) for various reasons. Singh et al. (2020) study CoDiTE-like quantities within the operator-theoretic framework.

To estimate $U_{\mathrm{MMD}}(c)$ Park et al. (2021) introduce a covariate kernel $l : \mathcal{C} \times \mathcal{C} \to \mathbb{R}$ and perform regularised operator-valued kernel regression to obtain the estimator

$$\hat{U}_{\mathrm{MMD}}(c) = l_0^\top(c) L_{\lambda_0}^{-1} K_{0,0} L_{\lambda_0}^{-\top} l_0(c) + l_1^\top(c) L_{\lambda_1}^{-1} K_{1,1} L_{\lambda_1}^{-\top} l_1(c) - 2\, l_0^\top(c) L_{\lambda_0}^{-1} K_{0,1} L_{\lambda_1}^{-\top} l_1(c), \quad (3)$$

where $L_{\lambda_z}^{-1} = (L_{z,z} + \lambda_z n_z I)^{-1}$ is a regularised inverse of the kernel matrix $L_{z,z}$ with $(i,j)$-th entry $l(c^z_i, c^z_j)$ and $l_z(c)$ is the vector with $i$-th entry $l(c^z_i, c)$. This estimator is consistent if $k$ and $l$ are bounded, $l$ is universal and $\lambda_0$ and $\lambda_1$ decay at slower rates than $O(n_0^{-1/2})$ and $O(n_1^{-1/2})$ respectively. Moreover the associated ADiTE $E[U_{\mathrm{MMD}}(C)]$ can be consistently estimated via the Monte Carlo estimator

$$\hat{t}(\mathbf{x}, \mathbf{c}, \mathbf{z}) = \frac{1}{n}\sum_{i=1}^{n} \hat{U}_{\mathrm{MMD}}(c_i). \quad (4)$$

Park & Muandet (2020) use this estimator to test the null hypothesis that there exists no distributional treatment effect of $Z$ on $X$. Unlike for tests of homogeneity however, the p-value can not straightforwardly be estimated using a permutation test. This is because a value of the unpermuted test statistic that is extreme relative to the permuted variants may result from a dependence of $Z$ on $(X^0, X^1)$ that ceases to exist under permutation of $\{z_i\}_{i=1}^{n}$. By the unconfoundedness assumption the dependence must pass through $C$ and can therefore be preserved by reassigning treatments as $z'_i \sim \mathrm{Ber}(e(c_i))$, instead of naively permuting the instances. Under the null hypothesis, the distribution of the original statistic and test statistics computed via this treatment reassignment procedure are then equal (Rosenbaum, 1984), allowing p-values to be estimated in the usual way. Implementing this procedure requires using $\{(c_i, z_i)\}_{i=1}^{n}$ to fit a classifier $\hat{e}(c)$ approximating the propensity score $e(c)$.
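The estimator in Equation 3, averaged as in Equation 4, can be sketched in NumPy as follows. Gaussian RBF kernels for both $k$ and $l$ and the fixed regularisation values are illustrative choices on our part, not prescribed by the text.

```python
import numpy as np

def rbf(a, b, bw=1.0):
    # Gaussian RBF kernel matrix between the rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def codite_mmd(c_query, x0, c0, x1, c1, lam0=0.1, lam1=0.1):
    """Estimate U_MMD(c) (Equation 3) at each row of c_query.

    x0, x1 are outcomes and c0, c1 their covariates for groups z = 0, 1.
    """
    K00, K11, K01 = rbf(x0, x0), rbf(x1, x1), rbf(x0, x1)
    L0, L1 = rbf(c0, c0), rbf(c1, c1)
    n0, n1 = len(c0), len(c1)
    # A_z = (L_{z,z} + lambda_z * n_z * I)^{-1} l_z(c), one column per query.
    A0 = np.linalg.solve(L0 + lam0 * n0 * np.eye(n0), rbf(c0, c_query))
    A1 = np.linalg.solve(L1 + lam1 * n1 * np.eye(n1), rbf(c1, c_query))
    # Quadratic forms of Equation 3 (the regularised matrices are symmetric,
    # so the inverse and its transpose coincide).
    term00 = np.einsum('im,ij,jm->m', A0, K00, A0)
    term11 = np.einsum('im,ij,jm->m', A1, K11, A1)
    term01 = np.einsum('im,ij,jm->m', A0, K01, A1)
    return term00 + term11 - 2 * term01
```

Averaging `codite_mmd(...)` over all observed contexts gives the Monte Carlo ADiTE estimate of Equation 4, e.g. `codite_mmd(np.vstack([c0, c1]), x0, c0, x1, c1).mean()`.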

3. Context-Aware Drift Detection

Suppose we wish to monitor a model $M : \mathcal{X} \to \mathcal{Y}$ mapping features $X \in \mathcal{X}$ onto labels $Y \in \mathcal{Y}$. Existing approaches to drift detection differ in the statistic $S(X, Y; M)$ chosen to be monitored³ for underlying changes in distribution $P_S$. This involves applying tests of homogeneity to samples $\{s^0_i\}_{i=1}^{n_0}$ and $\{s^1_i\}_{i=1}^{n_1}$ as described in Section 2.1. In this section we will introduce a framework that affords practitioners the ability to augment samples with realisations of an associated context variable $C \in \mathcal{C}$, whose distribution is permitted to differ between reference and deployment stages. It is then the conditional distribution $P_{S|C}$ that is monitored by applying methods adapted from those in Section 2.2 to contextualised samples $\{(s^0_i, c^0_i)\}_{i=1}^{n_0}$ and $\{(s^1_i, c^1_i)\}_{i=1}^{n_1}$. Recalling the deployment scenarios discussed in Section 1, the underlying motivation is that the reference set often corresponds to a wider variety of contexts than a specific deployment batch. In other words $P_{C^1}$ may differ from $P_{C^0}$ in a manner such that the support of the former may be a strict subset of that of the latter. In such scenarios, which are common in practice, we postulate that practitioners instead wish for their detectors to satisfy the following desiderata:

D1: The detector should be completely insensitive to changes in the data distribution that can be attributed to changes in the distribution of the context variable.

D2: The detector should be sensitive to changes in the data distribution that can not be attributed to changes in the distribution of the context variable.

D3: The detector should prioritise being sensitive to changes in the data distribution in regions that are highly probable under the deployment distribution.

Before describing our framework for drift detection that satisfies the above desiderata, we describe a number of drift detection problems of practical interest, corresponding to different choices of $S(X, Y; M)$ and $C$, for which we envisage our framework being particularly useful.

1. Features $X$ conditional on an indexing variable $t$ – such as time, lighting or weather – informed by domain specific knowledge.

2. Features $X$ conditional on the relative prevalences of known subpopulations. This would allow changes to the proportion of instances falling into pre-existing modes of the distribution whilst requiring the distribution of each mode to remain constant.

3. Features $X$ conditional on model predictions $M(X)$. An increased frequency of certain predictions should correspond to the expected change in the covariate distribution rather than the emergence of a new concept. Similarly, conditioning on a notion of model uncertainty $H(M(X))$ would allow increases in model uncertainty due to covariate drift into familiar regions of high aleatoric uncertainty (often fine) to be distinguished from that into unfamiliar regions of high epistemic uncertainty (often problematic).

4. Labels $Y$ conditional on features $X$. Although deployment labels are rarely available, this would correspond to explicitly detecting concept drift where the underlying change is in the conditional distribution $P_{Y|X}$.

3.1. Context-Aware Drift Detection with ADiTT Estimators

Consider a set of contextualised reference samples $\{(s^0_i, c^0_i)\}_{i=1}^{n_0}$ and deployment samples $\{(s^1_i, c^1_i)\}_{i=1}^{n_1}$. Rather than making the assumption that each set forms an i.i.d. sample from their underlying distributions and testing for equality, we first make the much weaker assumption that $\{(s_i, c_i, z_i)\}_{i=1}^{n}$ constitutes an i.i.d. sample from $P_{S,C,Z}$, where $Z \in \{0,1\}$ is a domain indicator with reference samples corresponding to $Z = 0$ and deployment samples to $Z = 1$. We then make the stronger assumption that each i.i.d. sample admits the following generative process

$$Z \sim P_Z, \quad C \sim P_{C|Z}, \quad S^0 \sim P_{S^0|C}, \quad S^1 \sim P_{S^1|C}, \quad S = S^0(1-Z) + S^1 Z. \quad (5)$$

Intuitively we can consider this as relaxing the assumption that $\{s^0_i\}_{i=1}^{n_0}$ and $\{s^1_i\}_{i=1}^{n_1}$ are each i.i.d. samples to them each being i.i.d. conditional on their respective contexts $\{c^0_i\}_{i=1}^{n_0}$ and $\{c^1_i\}_{i=1}^{n_1}$. We are then interested in testing for differences between $P_{S^0|C}$ and $P_{S^1|C}$. Note that focusing on these context-conditional distributions allows the marginal distributions $P_{S^0}$ and $P_{S^1}$ underlying $\{s^0_i\}_{i=1}^{n_0}$ and $\{s^1_i\}_{i=1}^{n_1}$ to differ, so long as the difference can be attributed to a difference between the distributions $P_{C^0}$ and $P_{C^1}$ underlying $\{c^0_i\}_{i=1}^{n_0}$ and $\{c^1_i\}_{i=1}^{n_1}$. Importantly, the above process satisfies the unconfoundedness condition $Z \perp\!\!\!\perp (S^0, S^1) \mid C$, such that $P_{S_z|C} := P_{S|C,Z=z} = P_{S^z|C}$. This allows us to apply adapted versions of the methods described in Section 2.2 to test the null and alternative hypotheses

$$h_0 : P_{S_0|C=c}(\cdot) = P_{S_1|C=c}(\cdot) \quad P_{C^1}\text{-almost everywhere},$$
$$h_1 : P_{C^1}\big(\{c \in \mathcal{C} : P_{S_0|C=c}(\cdot) \neq P_{S_1|C=c}(\cdot)\}\big) > 0.$$

³The unavailability of deployment labels, and the fact that any change in the distribution of features could cause performance degradation, makes $S(X, Y; M) = X$ a common choice.
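Testing these hypotheses relies on the conditional resampling scheme recalled at the end of Section 2.2: domain labels are reassigned as $z'_i \sim \mathrm{Ber}(\hat{e}(c_i))$ rather than permuted. A minimal sketch follows, where the logistic-regression propensity model and the generic statistic callable `t_hat` are our own illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_propensity(c, z, lr=0.1, steps=2000):
    # Logistic-regression approximation of e(c) = P(Z=1 | C=c); any
    # probabilistic classifier would serve equally well here.
    cb = np.hstack([c, np.ones((len(c), 1))])  # add a bias column
    w = np.zeros(cb.shape[1])
    for _ in range(steps):
        w -= lr * cb.T @ (sigmoid(cb @ w) - z) / len(z)
    return lambda cq: sigmoid(np.hstack([cq, np.ones((len(cq), 1))]) @ w)

def context_aware_pvalue(t_hat, s, c, z, n_perm=100, seed=0):
    # Compare the observed statistic against statistics recomputed under
    # domain reassignments z' ~ Bernoulli(e_hat(c)), which preserves the
    # (null-permissible) dependence of Z on C.
    rng = np.random.default_rng(seed)
    e_hat = fit_propensity(c, z)
    observed = t_hat(s, c, z)
    probs = e_hat(c)
    exceed = sum(t_hat(s, c, rng.binomial(1, probs)) > observed
                 for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)
```

Here `t_hat(s, c, z)` stands in for any estimator of the effect summary being monitored; extreme observed values relative to the resampled ones indicate differences not attributable to the context distribution.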

Note here that we interest ourselves only in differences between $P_{S_0|C=c}(\cdot)$ and $P_{S_1|C=c}(\cdot)$ at contexts supported by the deployment context distribution $P_{C^1}$. It would not be possible, or from the practitioner's point of view desirable (D3), to detect differences at contexts that are not possible under the deployment distribution $P_{C^1}$, regardless of their likelihood under the reference distribution $P_{C^0}$.

To test the above hypotheses first recall that we may use a CoDiTE function $U_D(c)$, as introduced in Section 2.2, to assign contexts $c$ to distances between corresponding conditional distributions $P_{S_0|C=c}$ and $P_{S_1|C=c}$. Secondly note that testing the above hypotheses is equivalent to testing whether, for deployment instances specifically and controlling for context, their status as deployment instances causally affected the distribution underlying their statistics. This focus on deployment instances specifically means that we are not interested in an effect summary such as ADiTE that averages a context-conditional effect summary $U_D(c)$ over the full context distribution $P_C$, but instead in an effect summary that averages a context-conditional summary over the deployment context distribution $P_{C^1}$. This narrowing of focus is not uncommon in analogous cases in causal inference where practitioners wish to focus on the average effect of a treatment on those who actually received it – the average effect on the treated – commonly abbreviated as ATT. We therefore refer to the distributional version, $E[U_D(C) \mid Z = 1]$, as the ADiTT associated with $D$.

Algorithm 1 Context-Aware Drift Detection
  Input: Statistics, contexts and domains $(\mathbf{s}, \mathbf{c}, \mathbf{z})$, an ADiTT estimator $\hat{t}$, significance level $\alpha$, number of permutations $n_{\mathrm{perm}}$.
  Compute $\hat{t}$ on $(\mathbf{s}, \mathbf{c}, \mathbf{z})$.
  Use $(\mathbf{c}, \mathbf{z})$ to fit a probabilistic classifier $\hat{e}(c)$ that approximates the propensity score $e(c)$.
  for $i = 1$ to $n_{\mathrm{perm}}$ do
    Reassign domains as $\mathbf{z}_i \sim \mathrm{Bernoulli}(\hat{e}(\mathbf{c}))$.
    Compute $\hat{t}_i$ on $(\mathbf{s}, \mathbf{c}, \mathbf{z}_i)$.
  end for
  Output: p-value $\hat{p} = \frac{1}{n_{\mathrm{perm}}} \sum_{i=1}^{n_{\mathrm{perm}}} 1\{\hat{t}_i > \hat{t}\}$

Assuming $D$ is a probability metric or divergence (thereby satisfying $D(P, Q) \geq 0$ with equality iff $P = Q$), the ADiTT is non-zero if and only if the null hypothesis $h_0$ fails to hold. We may therefore consider it an oracular test statistic and note that any consistent estimator $\hat{t}$ can be used in a consistent test of the hypothesis. In practice however we must estimate it using a finite number of samples, and the fact that the estimate is a weighted average w.r.t. $P_{C^1}$ explicitly adheres to desideratum D3. We use Rosenbaum (1984)'s conditional resampling scheme, as described in Section 2.2, to estimate the p-value summarising the extremity of $\hat{t}$ under the null. Thankfully, when estimating the p-value associated with an estimate of ADiTT (or ATT), we can relax the overlap assumption required when estimating a p-value associated with an estimate of ADiTE (or ATE). We instead must only satisfy the condition of weak overlap

$$0 < e(c) = P(Z=1 \mid C=c) < 1 \quad P_{C^1}\text{-a.e.}, \quad (A3)$$

which is a necessary relaxation given we do not expect all contexts supported by the reference distribution to be supported by the deployment distribution.

Our general framework for context-aware drift detection is described in Algorithm 1. Implementation, however, requires a suitable choice of CoDiTE function and procedure for estimating the associated ADiTT. CATE (i.e. $U_D(c) = E[S_1|C=c] - E[S_0|C=c]$), for which associated ATT estimators are well established, can be considered a particularly simple choice. Recall, however, that changes harmful to the performance of a deployed model may not be easily identifiable through the mean and therefore we focus on distributional alternatives.

3.2. MMD-based ADiTT Estimation

In this section we recommend Park et al. (2021)'s MMD-based CoDiTE function and adapt their corresponding estimator for use within the framework described above. We make this recommendation for several reasons. Firstly, it allows for statistics and contexts residing in any domains $\mathcal{S}$ and $\mathcal{C}$ upon which meaningful kernels can be defined. Secondly, the computation consists primarily of manipulating kernel matrices, which can be implemented efficiently on modern hardware. Thirdly, the procedure is simple and intuitive and closely parallels the MMD-based approach that is widely used for two-sample testing.

Recall from Section 2.2 that Park et al. (2021) define, for a given kernel $k : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, the MMD-based CoDiTE $U_{\mathrm{MMD}}(c) = \|\mu_{S_0|C=c} - \mu_{S_1|C=c}\|^2_{\mathcal{H}_k}$ that captures the squared distance between the kernel mean embeddings of the conditional distributions $P_{S_0|C=c}$ and $P_{S_1|C=c}$. They show that $U_{\mathrm{MMD}}(c)$ and the associated ADiTE $E[U_{\mathrm{MMD}}(C)]$ can each be consistently estimated by Equations 3 and 4 respectively. However, the context-aware drift detection framework instead requires estimation of the ADiTT $E[U_{\mathrm{MMD}}(C) \mid Z=1]$. We note that this can be achieved with the alternative estimator

$$\hat{t}(\mathbf{s}, \mathbf{c}, \mathbf{z}) = \frac{1}{n_1}\sum_{i=1}^{n_1} \hat{U}_{\mathrm{MMD}}(c^1_i), \quad (6)$$

which averages only over deployment contexts $\{c^1_i\}_{i=1}^{n_1}$, rather than all contexts $\{c_i\}_{i=1}^{n}$. We further note that although this estimator is consistent, and therefore asymptotically unbiased, averaging over estimates of CoDiTE conditioned on the same contexts used in estimation introduces bias at finite sample sizes. We instead recommend the test statistic

$$\hat{t}((\mathbf{s}, \mathbf{c}, \mathbf{z}), \tilde{\mathbf{c}}^1) = \frac{1}{n_h}\sum_{i=1}^{n_h} \hat{U}_{\mathrm{MMD}}(\tilde{c}^1_i), \quad (7)$$

where a portion $\tilde{\mathbf{c}}^1 \in \mathcal{C}^{n_h}$ of the deployment contexts (e.g. 25%) are held out to be conditioned on whilst the rest are used for estimating the corresponding CoDiTEs. A further motivation for this modification is that conditioning on and averaging over all possible contexts carries a high computational cost that we found unjustified.

Figure 2. Illustration of the computation of the MMD-based ADiTT statistic, where the difference between distributions underlying the reference (blue) and deployment (orange) data $\mathbf{s}$ can be attributed to a change in distribution of context $\mathbf{c}$. Computation can be thought of as considering a number of held out deployment samples (red stars), and for each one computing a weighted MMD where only reference and deployment samples with similar contexts significantly contribute. These weighted MMDs are then averaged to form the test statistic. To visualise weight matrices we sort the samples in ascending order w.r.t. $\mathbf{c}$ and, for $W_{0,1}$ for example, show the $(j,k)$-th entry as white if the similarity $k(s^0_j, s^1_k)$ is to significantly contribute.

3.2.1. Setting Regularisation Parameters $\lambda_0, \lambda_1$

In Section 2.2 we noted that $\hat{U}_{\mathrm{MMD}}(c)$ is a consistent estimator of $U_{\mathrm{MMD}}(c)$ if $\lambda_0$ and $\lambda_1$ decay at slower rates than $O(n_0^{-1/2})$ and $O(n_1^{-1/2})$ respectively. In practice however sample sizes are fixed and values for $\lambda_0$ and $\lambda_1$ must be chosen. These parameters arise as regularisation parameters in an operator-valued kernel regression, where functions $\{k(s^z_i, \cdot)\}_{i=1}^{n_z}$ are regressed against contexts $\{c^z_i\}_{i=1}^{n_z}$ to obtain an estimator of $\mu_{S_z|C} = E[k(S^z, \cdot)|C]$. We propose using k-fold cross-validation to identify regularisation parameters that minimise the validation error in this regression problem. Full details can be found in Appendix A.

3.2.2. Relationship to MMD Two-Sample Tests

To facilitate illustrative comparisons to traditional MMD-based tests of homogeneity, we first note that Equation 1 can be rewritten in matrix form as

$$t(\mathbf{s}^0, \mathbf{s}^1) = \langle K_{0,0}, W_{0,0} \rangle + \langle K_{1,1}, W_{1,1} \rangle - 2\langle K_{0,1}, W_{0,1} \rangle, \quad (8)$$

where, for $u, v \in \{0, 1\}$, $K_{u,v}$ denotes the kernel matrix with $(j,k)$-th entry $k(s^u_j, s^v_k)$ and $W_{u,v}$ is a uniform weight matrix with all entries equal to $(n_u n_v)^{-1}$. We now additionally note that we can rewrite Equation 7 in exactly the same form but with $W_{u,v} = \sum_{i=1}^{n_h} W_{u,v,i}$, where

$$W_{u,v,i} = \big(L_{\lambda_u}^{-1} l_u(\tilde{c}^1_i)\big)\big(L_{\lambda_v}^{-1} l_v(\tilde{c}^1_i)\big)^\top. \quad (9)$$

Here $W_{u,v,i}$ can be viewed as an outer product between $l_u(\tilde{c}^1_i)$ and $l_v(\tilde{c}^1_i)$, assigning weight to pairs $(c^u_j, c^v_k)$ that are both similar to $\tilde{c}^1_i$, but adjusted via $L_{\lambda_u}^{-1}$ and $L_{\lambda_v}^{-1}$ such that the weight is less if $c^u_j$ (resp. $c^v_k$) has many similar instances in $\mathbf{c}^u$ (resp. $\mathbf{c}^v$). This adjustment is important to ensure that, for example, the combined weight assigned to comparing $s^0_j$ to $s^1_k$, both with contexts similar to $\tilde{c}^1_i$, does not depend on the number of reference or deployment instances with similar contexts. If there are fewer such instances they receive higher weight to compensate. This has the effect of controlling for $P_{C^0}$ and $P_{C^1}$ in the CoDiTE estimates. The dependence on $P_{C^1}$ returns when we average over the held out contexts $\tilde{\mathbf{c}}^1$. We visualise this process for estimating the MMD-based ADiTT in Figure 2. For illustrative purposes we consider only two held out contexts, visualise the corresponding weight matrices $W_{u,v,i}$ for $u, v \in \{0, 1\}$ and $i \in \{1, 2\}$ and show how they are summed to form the matrices $W_{u,v}$ used by Equation 8. In particular note the intuitive block structure showing that only similarities $k(s^u_j, s^v_k)$ between instances with similar contexts are weighted and contribute to the test statistic $\hat{t}$. By contrast the weight matrices used in an estimate of MMD by two-sample testing approaches would be fully white.

4. Experiments

This section will show that using the MMD-based ADiTT estimator of Section 3.2 within the framework developed in Section 3.1 results in a detector satisfying desiderata D1-D3. This will involve showing that the resulting detector is calibrated when there has been no change in the distribution $P_S$ and when there is a change which can be attributed to a

change in the context distribution P_C. This prevents comparisons to conventional drift detectors that are not designed to detect drift for the latter case. We therefore first develop a baseline that generalises the principle underlying ad-hoc approaches that might be considered by practitioners faced with changing context.

Suppose a batch of deployment instances s^1 has contexts c^1 (such as time) contained within a strict subset [min(c^1), max(c^1)] ⊂ [min(c^0), max(c^0)] of those covered by the reference distribution (e.g. 1 of 24 hours). Practitioners might here perform a two-sample test of the deployment batch against the subsample of reference instances with contexts contained within the same interval. More generally, the practitioner wishes to perform the test with a subset of the reference data that is sampled such that its underlying context distribution matches P_{C_1}. Knowledge of P_{C_0} and P_{C_1} would allow the use of rejection sampling to obtain such a subsample, and subsequent application of a two-sample test would provide a perfectly calibrated detector for our setting. We therefore consider, as a baseline we refer to as MMD-Sub, rejection sampling using density estimators P̂_{C_0} and P̂_{C_1}. We cannot fit the density estimators using the samples being rejection sampled; we therefore use the held-out portion of samples that the MMD-based ADiTT method uses to condition on. For the closest possible comparison we fit kernel density estimators using the same kernel as the ADiTT method and apply the MMD-based two-sample test described in Section 2.1. Further details on this baseline can be found in Appendix C.

Evaluating detectors by performing multiple runs and reporting false positive rates (FPRs) and true positive rates (TPRs) at fixed significance levels results in high-variance performance measures that vary depending on the levels chosen. We instead evaluate power using AUC: the area under the receiver operating characteristic (ROC) curve of TPR plotted against FPR across all significance levels. More powerful detectors obtain higher AUCs. To evaluate calibration we similarly capture the FPR across all significance levels using the Kolmogorov–Smirnov (KS) distance between the set of obtained p-values and U[0, 1], the distribution of p-values for a perfectly calibrated detector. We contextualise KS distances in plots by shading the interval (0.046, 0.146): the 95% confidence interval of the KS distance computed using p-values actually sampled from U[0, 1]. We use plots to present key trends contained within results and defer full tables, as well as more detailed descriptions of experimental procedures, to Appendices B–E. This includes, in Appendix D, a discussion of ablations performed to confirm the importance of using the adapted estimator of ADiTT, rather than Park et al. (2021)'s estimator of ADiTE. For kernels we use Gaussian RBFs with bandwidths set using the median heuristic (Gretton et al., 2012a). For MMD-ADiTT we fit ê(c) using kernel logistic regression.

Figure 3. Visualisation of the weight attributed by MMD-ADiTT to comparing each reference sample to the set of deployment samples (left) and vice versa (right). Only reference samples with contexts in the support of the deployment contexts significantly contribute. Weight here refers to the corresponding row/column sum of W_{0,1}.

Figure 4. Visualisation of the weight matrices used in computation of the MMD-ADiTT when deployment contexts fall into two disjoint modes. The block-like structure that emerges when ordering the samples by context confirms that only similarities between instances with contexts in the same deployment mode contribute.

4.1. Controlling for domain specific context

This example is designed to correspond to problems where we wish to allow changes in domain specific context, such as time or weather. To facilitate visualisations we consider univariate statistics S ∈ R and contexts C ∈ R. For the reference distribution we take S_0 ∼ N(C, 1) and C_0 ∼ N(0, 1). For changes in the context distribution we consider for P_{C_1} both a simple narrowing from N(0, 1) to N(0, σ²), where σ < 1, and a more complex change to a mixture of Gaussians with K modes. Figure 5 shows that MMD-ADiTT remains perfectly calibrated across all settings and MMD-Sub is also strongly calibrated. Unsurprisingly, even a slight narrowing of the context distribution causes conventional two-sample tests to become wildly uncalibrated in our setting.

To compare how powerfully detectors respond to changes in the context-conditional distribution we change P_{S_1|C} from S_1 ∼ N(C, 1) to S_1 ∼ N(C + ε, ω²) for instances within one of the K deployment modes, with an example for K = 2 and (ε, ω) = (0, 2) shown in Figure 3. Figure 6 demonstrates how the power of each detector varies with sample size for the K = 1, 2 and (ε, ω) = (0.25, 1), (0, 0.5) cases. We see that even for the unimodal case, where we might not necessarily expect the MMD-ADiTT approach to have an advantage, it is more powerful across all sample sizes and distortions considered (see Appendix E.1 for more results). For the bimodal case the difference in performance is much larger. This is for an important reason that we illustrate by considering the difference between reference

and deployment distributions shown in Figure 3. Although MMD-Sub subsamples reference instances corresponding to the deployment modes to achieve a marginal weighting effect similar to that shown in Figure 3, the MMD is computed in a manner that weights the similarities between instances in different modes equally to the similarities between instances in the same mode. This adds noise to the test statistic that makes it more difficult to observe differences of interest. Instead, the ADiTT method leverages the information provided by the context, only comparing the similarity of instances with similar contexts. This can be observed by visualising the weight matrices W_{0,0}, W_{1,1} and W_{0,1} with the rows and columns ordered by increasing context, as shown in Figure 4. Appendix E.1 gives further explanation and visualises the corresponding matrices for MMD-Sub. In summary, MMD-ADiTT detects differences conditional on context, i.e. differences between P_{S_1|C} and P_{S_0|C}, whereas subsampling detects differences between the marginal distributions P_{S_1} and ∫ P_{S_0|C}(·|c) P_{C_1}(c)/P_{C_0}(c) dc.

Figure 5. Plots of the calibration of detectors as (left) the context distribution P_{C_1} gradually narrows from N(0, 1) to N(0, σ²) and (right) completely changes to a mixture of Gaussians with K modes.

Figure 6. Plots showing how powerfully detectors respond when the context-conditional distribution P_{S_1|C} changes from N(C, 1) to N(C + ε, ω²). We consider both unimodal (K = 1) and bimodal (K = 2) deployment context distributions P_{C_1} and in the latter case only change P_{S_1|C} for one mode.

4.2. Controlling for subpopulation prevalences

The population underlying reference data can often be decomposed into subpopulations between which the distributions underlying features, labels or their relationship may differ. Figure 1 is one example where the distribution underlying both the image and its label differed depending on whether it corresponded to day or night. Often practitioners wish to be alerted to changes in the distributions underlying subpopulations, but not to changes in their prevalences. Sometimes subpopulation membership will not be available explicitly but will have to be inferred. If subpopulations are known and labels can be assigned to all (resp. some) of the reference data, a classifier could be trained in a fully (resp. semi-) supervised manner to map instances onto a vector representing subpopulation membership probabilities. Alternatively, a fully unsupervised approach could be taken where a probabilistic clustering algorithm is used to identify subpopulations and map instances onto probabilities accordingly, which we demonstrate in the following experiments.

We take S = R² and the reference distribution to be a mixture of two Gaussians. We wish to allow changes to the mixture weights but detect when a component is scaled by a factor of ω or shifted by ε standard deviations in a random direction. We do not assume access to subpopulation (component) labels and therefore train a Gaussian mixture model to associate instances with a probability that they belong to subpopulation 1, which is then used as the context C.

Figure 7 shows how the calibration, and – for the ε = 0.6 and ω = 0.5 cases – power, varies with sample size. Given that the MMD-ADiTT method remains well calibrated, we can attribute the miscalibration of the MMD-Sub detector to the bias resulting from imperfect density estimation used for rejection sampling. Although the fit will improve with the number of samples, it seems to be outpaced by the increase in power with which the bias is detected as the sample size grows. Moreover, we see that across all sample sizes MMD-ADiTT more powerfully detects both changes in mean and changes in variance, with additional results shown in Appendix E.1.

Figure 7. Plots showing (left) calibration under resampling of subpopulation prevalences and (right) power under changes to a subpopulation mean (ε = 0.6) or scale (ω = 0.5).

4.3. Controlling for model predictions

Detecting covariate drift conditional on model predictions allows a model to be more or less confident than average on a batch of deployment samples, or more frequently predict
certain labels, as long as the features are indistinguishable from reference features for which the model made similar predictions. Covariate drift into a mode existing in the reference set would therefore be permitted whereas covariate drift into a newly emerging concept would be detected. We use the ImageNet (Deng et al., 2009) class structure developed by Santurkar et al. (2021) to represent realistic drifts in distributions underlying subpopulations. The ImageNet classes are partitioned into 6 semantically similar superclasses, and drift in the distribution underlying a superclass corresponds to a change to the constituent subclasses. Adhering to the popularity of self-supervised backbones in computer vision, we define a model M(x) = H(B(x)), where B is the convolutional base of a pretrained SimCLR model (Chen et al., 2020) and H is a classification head we train on the ImageNet training split to predict superclasses. Experiments are then performed using the validation split. We also use the pretrained SimCLR model as part of the kernel k, applying both the base and projection head to images before applying the usual Gaussian RBF in R^128.

To investigate calibration under changing context C = M(X), we randomly choose K ∈ {1, .., 6} of the 6 labels and sample a deployment batch containing only images to which the model assigns a most-likely label within the K chosen. Therefore for K = 6 the marginal distribution of the images remains the same in both the reference and deployment sets, but for K < 6 they differ in a way we wish to allow. In Figure 8 we see that MMD-Sub was unable to remain calibrated for this trickier problem, which requires density estimators to be fit in six-dimensional space. By contrast, the MMD-ADiTT method remains well calibrated.

Figure 8. Plots showing (left) calibration under changes to model predictions resulting in only K of 6 predicted classes and (right) power under changes to J of 6 class distributions.

Given MMD-Sub's ineffectiveness on this harder problem, which can be further seen in Figure 8, we provide an alternative baseline to contextualise power results. The standard MMD two-sample test does not allow for changes in distribution that can be attributed to changes in model predictions and therefore does not have built-in insensitivities like MMD-ADiTT. We might therefore expect it to respond more powerfully to drift generally, including drifts corresponding to specific subpopulations. Figure 8 shows, however, that this is not generally the case. When only one or two subpopulations have drifted, the MMD-ADiTT detector, by focusing on the type of drift of interest, is able to respond more powerfully. As the number of drifted subpopulations increases to J ≥ 3, such that the distribution has changed in a more global manner, the standard MMD test is equally powerful, as might be expected. Table 5 in Appendix E.3 shows that this pattern holds more generally.

5. Conclusion

We introduced a new framework for drift detection which breaks from the i.i.d. assumption by allowing practitioners to specify context under which the deployment data is permitted to change. This drastically expands the space of problems to which drift detectors can be usefully applied. In future work we intend to further explore how certain combinations of contexts and statistics may be used to target certain types of drift, such as covariate drift into regions of high epistemic uncertainty.

Acknowledgements

We would like to thank Ashley Scillitoe and Hao Song for their help integrating our research into the open-source Python library alibi-detect.

References

Baena-García, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., and Morales-Bueno, R. Early drift detection method. In Fourth International Workshop on Knowledge Discovery from Data Streams, volume 6, pp. 77–86, 2006.

Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz, B. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/97af07a14cacba681feacf3012730892-Paper.pdf.

Bifet, A. and Gavaldà, R. Learning from Time-Changing Data with Adaptive Windowing, pp. 443–448. 2007. doi: 10.1137/1.9781611972771.42. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611972771.42.

Breck, E., Polyzotis, N., Roy, S., Whang, S., and Zinkevich, M. Data validation for machine learning. In MLSys, 2019.

Briseño Sanchez, G., Hohberg, M., Groll, A., and Kneib, T. Flexible instrumental variable distributional regression. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4):1553–1574, 2020.

Bu, L., Alippi, C., and Zhao, D. A pdf-free change detection test based on density difference estimation. IEEE Transactions on Neural Networks and Learning Systems, 29(2):324–334, 2018. doi: 10.1109/TNNLS.2016.2619909.

Chang, M., Lee, S., and Whang, Y.-J. Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. The Econometrics Journal, 18(3):307–346, 2015.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Chernozhukov, V., Fernández-Val, I., and Melly, B. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.

Chwialkowski, K., Ramdas, A., Sejdinovic, D., and Gretton, A. Fast two-sample testing with analytic representations of probability measures. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1981–1989, 2015. URL https://proceedings.neurips.cc/paper/2015/hash/b571ecea16a9824023ee1af16897a582-Abstract.html.

Cobb, O., Van Looveren, A., and Klaise, J. Sequential multivariate change detection with calibrated and memoryless false detection rates. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 226–239. PMLR, 28–30 Mar 2022. URL https://proceedings.mlr.press/v151/cobb22a.html.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. Exploring the landscape of spatial robustness. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1802–1811, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/engstrom19a.html.

Fromont, M., Laurent, B., Lerasle, M., and Reynaud-Bouret, P. Kernels based tests with non-asymptotic bootstrap approaches for two-sample problems. In Mannor, S., Srebro, N., and Williamson, R. C. (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pp. 23.1–23.23, Edinburgh, Scotland, 25–27 Jun 2012. PMLR. URL https://proceedings.mlr.press/v23/fromont12.html.

Gama, J., Medas, P., Castillo, G., and Rodrigues, P. Learning with drift detection. In Bazzan, A. L. C. and Labidi, S. (eds.), Advances in Artificial Intelligence – SBIA 2004, pp. 286–295, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. ISBN 978-3-540-28645-5.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift and local learning by distribution matching. In Dataset Shift in Machine Learning, pp. 131–160. MIT Press, 2008.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. A fast, consistent kernel two-sample test. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pp. 673–681. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper/2009/hash/9246444d94f081e3549803b928260f56-Abstract.html.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012a. URL http://jmlr.org/papers/v13/gretton12a.html.

Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012b. URL https://proceedings.neurips.cc/paper/2012/file/dbe272bab69f8e13f14b405e038deb64-Paper.pdf.

Harchaoui, Z., Bach, F. R., and Moulines, E. Testing for homogeneity with kernel Fisher discriminant analysis. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T. (eds.), Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6,
2007, pp. 609–616. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper/2007/hash/4ca82782c5372a547c104929f03fe7a9-Abstract.html.

Hendrycks, D. and Dietterich, T. G. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HJz6tiCqYm.

Hohberg, M., Pütz, P., and Kneib, T. Treatment effects beyond the mean using distributional regression: Methods and guidance. PLoS ONE, 15(2):e0226514, 2020.

Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. Interpretable distribution features with maximum testing power. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 181–189, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/0a09c8844ba8f0936c20bd791130d6b6-Abstract.html.

Klaise, J., Looveren, A. V., Cox, C., Vacanti, G., and Coca, A. Monitoring and explainability of models in production, 2020.

Lee, S. and Whang, Y.-J. Nonparametric tests of conditional treatment effects. 2009.

Lipton, Z. C., Wang, Y., and Smola, A. J. Detecting and correcting for label shift with black box predictors. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3128–3136. PMLR, 2018. URL http://proceedings.mlr.press/v80/lipton18a.html.

Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J. Learning deep kernels for non-parametric two-sample tests. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 6316–6326. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/liu20m.html.

Lopez-Paz, D. and Oquab, M. Revisiting classifier two-sample tests. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=SJkXfE5xx.

Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018.

Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017. doi: 10.1561/2200000060. URL https://doi.org/10.1561/2200000060.

Page, E. S. Continuous inspection schemes. Biometrika, 41(1-2):100–115, 06 1954. ISSN 0006-3444. doi: 10.1093/biomet/41.1-2.100. URL https://doi.org/10.1093/biomet/41.1-2.100.

Paleyes, A., Urma, R.-G., and Lawrence, N. D. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys (CSUR), 2020.

Park, J. and Muandet, K. A measure-theoretic approach to kernel conditional mean embeddings. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 21247–21259. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/f340f1b1f65b6df5b5e3f94d95b11daf-Paper.pdf.

Park, J., Shalit, U., Schölkopf, B., and Muandet, K. Conditional distributional treatment effect with kernel conditional mean embeddings and U-statistic regression. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8401–8412. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/park21c.html.

Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. The MIT Press, 2009. ISBN 0262170051.

Rabanser, S., Günnemann, S., and Lipton, Z. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019.

Ramdas, A., Trillos, N. G., and Cuturi, M. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017. doi: 10.3390/e19020047. URL https://doi.org/10.3390/e19020047.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the
36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5389–5400, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/recht19a.html.

Rosenbaum, P. R. Conditional permutation tests and the propensity score in observational studies. Journal of the American Statistical Association, 79(387):565–574, 1984.

Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Santurkar, S., Tsipras, D., and Madry, A. BREEDS: Benchmarks for subpopulation shift. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=mQPBmvyAuk.

Scott, D. W. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.

Shen, S. Estimation and inference of distributional partial effects: theory and application. Journal of Business & Economic Statistics, 37(1):54–66, 2019.

Singh, R., Xu, L., and Gretton, A. Reproducing kernel methods for nonparametric and semiparametric treatment effects. arXiv preprint arXiv:2010.04855, 2020.

Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 961–968, 2009.

Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.

Tasche, D. Fisher consistency for prior probability shift. Journal of Machine Learning Research, 18(95):1–32, 2017. URL http://jmlr.org/papers/v18/17-048.html.

Van Looveren, A., Klaise, J., Vacanti, G., Cobb, O., Scillitoe, A., and Samoilescu, R. Alibi Detect: Algorithms for outlier, adversarial and drift detection, 1 2022. URL https://github.com/SeldonIO/alibi-detect.

Wang, H. and Abraham, Z. Concept drift detection for streaming data. 2015 International Joint Conference on Neural Networks (IJCNN), Jul 2015. doi: 10.1109/ijcnn.2015.7280398. URL http://dx.doi.org/10.1109/IJCNN.2015.7280398.
A. Setting Regularisation Parameters for the MMD-based ADiTT Estimator

Consider the problem of estimating the CoDiTE function U_MMD : C → R, defined in Equation 2 as

U_MMD(c) = ||µ_{S^0|C=c} − µ_{S^1|C=c}||²_{H_k},    (10)

where µ_{S^z|C=c} is the kernel mean embedding of P_{S^z|C=c}, which by unconfoundedness is equal to P_{S_z|C} := P_{S|C,Z=z}. Unconfoundedness allows us to use the methods of Park & Muandet (2020) on samples {(s_i^0, c_i^0)}_{i=1}^{n_0} and {(s_i^1, c_i^1)}_{i=1}^{n_1} separately to obtain estimators µ̂_{S^0|C=c} and µ̂_{S^1|C=c} of the CMEs µ_{S^0|C} and µ_{S^1|C} respectively. Park et al. (2021) show that the plug-in estimator

Û_MMD(c) = ||µ̂_{S^0|C=c} − µ̂_{S^1|C=c}||²_{H_k},    (11)

for which a closed form expression is given in Equation 3, is then consistent in the sense that

E[(Û_MMD(C) − U_MMD(C))²] →_P 0 as n_0, n_1 → ∞.    (12)

Park & Muandet (2020)'s method for estimating the CME µ_{S^0|C} from samples {(s_i^0, c_i^0)}_{i=1}^{n_0} first considers the RKHS G_SC of functions from C → H_k induced by the operator-valued kernel l_SC = l(c, c′) Id, where l : C × C → R is a kernel on C and Id : H_k → H_k is the identity operator. If one then uses G_SC as the hypothesis space for a regression of the functions {k(s_i^0, ·)}_{i=1}^{n_0} against contexts {c_i^0}_{i=1}^{n_0} under the regularised objective

(1/n_0) Σ_{i=1}^{n_0} ||k(s_i^0, ·) − f(c_i^0)||²_{H_k} + λ_0 ||f||²_{G_SC},    (13)

then a representer theorem applies which states that there exists an optimal solution of the form

f_{λ_0}(c) = Σ_i α_i l(c, c_i^0) = α^⊤ l_0(c),    (14)

where

α = (L_{0,0} + n_0 λ_0 I)^{−1} k_0(·) ∈ H_k^{n_0}.    (15)

We therefore recommend running an optimisation process for λ_0 which, for each candidate value, splits {(s_i^0, c_i^0)}_{i=1}^{n_0} into k folds, computes f_{λ_0}(c) for each fold, and sums the squared errors ||k(s, ·) − f_{λ_0}(c)||²_{H_k} across out-of-fold instances (s, c). Recalling the shorthand L_{λ_0}^{−1} = (L_{0,0} + n_0 λ_0 I)^{−1}, this can be achieved by noting

||k(s, ·) − f_{λ_0}(c)||²_{H_k} = ||k(s, ·) − l_0(c) L_{λ_0}^{−1} k_0(·)||²_{H_k}    (16)

= ⟨k(s, ·), k(s, ·)⟩_{H_k} + ⟨l_0(c) L_{λ_0}^{−1} k_0(·), l_0(c) L_{λ_0}^{−1} k_0(·)⟩_{H_k} − 2⟨l_0(c) L_{λ_0}^{−1} k_0(·), k(s, ·)⟩_{H_k}    (17)

= ⟨k(s, ·), k(s, ·)⟩_{H_k} + ⟨Σ_{i,j} l(c_i, c)(L_{λ_0}^{−1})_{i,j} k(s_j, ·), Σ_{u,v} l(c_u, c)(L_{λ_0}^{−1})_{u,v} k(s_v, ·)⟩_{H_k}    (18)

− 2⟨Σ_{i,j} l(c_i, c)(L_{λ_0}^{−1})_{i,j} k(s_j, ·), k(s, ·)⟩_{H_k}    (19)

= k(s, s) + Σ_{i,j,u,v} l(c_i, c)(L_{λ_0}^{−1})_{i,j} k(s_j, s_v) l(c_u, c)(L_{λ_0}^{−1})_{u,v} − 2 Σ_{i,j} l(c_i, c)(L_{λ_0}^{−1})_{i,j} k(s_j, s)    (20)

= k(s, s) + l_0(c) L_{λ_0}^{−1} K_{0,0} L_{λ_0}^{−⊤} l_0(c)^⊤ − 2 l_0(c) L_{λ_0}^{−1} k_0(s)^⊤.    (21)

The errors for all out-of-fold instances can be computed in one go by stacking the l_0(c) and k_0(s) vectors into matrices. The same procedure can then be performed to select λ_1.
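The k-fold selection procedure above can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: the Gaussian kernels, candidate grid, fold count and random seed are placeholder choices, and the per-instance validation error is computed directly from Equation 21 (note that k(s, s) = 1 for a Gaussian RBF).

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel matrix between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def cv_error(S, C, lam, n_folds=5, seed=0):
    """Sum of out-of-fold errors ||k(s,.) - f_lam(c)||^2_Hk, per Equation 21."""
    n = len(S)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    err = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        K = rbf(S[train], S[train])                                   # K_{0,0}
        A = np.linalg.inv(rbf(C[train], C[train])                     # L_lam^{-1}
                          + len(train) * lam * np.eye(len(train)))
        M = rbf(C[fold], C[train]) @ A                                # rows l_0(c) L_lam^{-1}
        k0 = rbf(S[fold], S[train])                                   # rows k_0(s)
        # k(s,s) + l_0(c) L^{-1} K_{0,0} L^{-T} l_0(c)^T - 2 l_0(c) L^{-1} k_0(s)^T
        err += np.sum(1.0 + np.einsum('ij,jk,ik->i', M, K, M) - 2 * (M * k0).sum(1))
    return err

def select_lambda(S, C, grid=(1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    # Return the candidate regularisation parameter minimising validation error.
    return min(grid, key=lambda lam: cv_error(S, C, lam))
```

Running `select_lambda` once on {(s_i^0, c_i^0)} and once on {(s_i^1, c_i^1)} yields the two regularisation parameters λ_0 and λ_1.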

B. Implementation Details for Drift Detection with the MMD-based ADiTT Estimator

In this section we make clear the exact process used for computing the MMD-ADiTT test statistic and estimating the associated p-value representing its extremity under the null hypothesis. Recall from Equations 7 and 3 that the test statistic is the average MMD-based CoDiTE estimate on a set of held-out deployment contexts c̃^1. The portion of samples we hold out for this purpose is 25% across all experiments. CoDiTE estimates require a regularisation parameter λ to use as part of the estimation process, for which we use λ = 0.001 across all experiments. For kernels k : S × S → R and l : C × C → R we use the Gaussian RBF f(x, x′) = exp(−||x − x′||²/(2σ²)), where σ is set to be the median distance between all reference and deployment statistics, for k, or contexts, for l.
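As a concrete illustration, the median heuristic and the resulting RBF kernel can be implemented as below. This is a minimal sketch, assuming the common convention of taking the median over all pairwise distances in the pooled reference and deployment sample.

```python
import numpy as np

def median_heuristic(X, Y):
    """sigma = median pairwise distance over the pooled sample of X and Y."""
    Z = np.concatenate([X, Y])
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.median(D[np.triu_indices_from(D, k=1)])  # distinct pairs only

def gaussian_rbf(X, Y, sigma):
    """f(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), evaluated pairwise."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))
```

The same two functions serve for both k (on statistics) and l (on contexts), with σ recomputed per kernel.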
To associate resulting test statistics with estimates of p-values we use the conditional permutation test of Rosenbaum (1984). We do so using n_perm = 100 conditional permutations. This requires fitting a classifier ê : C → [0, 1] to approximate the propensity score e(c) = P(Z = 1 | C = c). We do this by training a kernel logistic regressor on the data {(c_i, z_i)}_{i=1}^{n}, using the same kernel l defined above. More precisely, we achieve this by first fitting a kernel support vector classifier (SVC) and then performing logistic regression on the scores to obtain probabilities. We found that mapping SVC scores onto probabilities in this manner, using just two logistic regression parameters, meant that overfitting to {(c_i, z_i)}_{i=1}^{n} was not a problem.
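The propensity model can be sketched as follows. Note this is a simplified stand-in: rather than the SVC-plus-logistic-calibration pipeline described above, it fits a regularised kernel logistic regressor directly (the estimator named in Section 4), with the bandwidth, regularisation strength, step size and iteration count below as placeholder choices.

```python
import numpy as np

def fit_propensity(C, Z, sigma=1.0, lam=1e-3, lr=0.1, iters=2000):
    """Kernel logistic regression estimate of e(c) = P(Z = 1 | C = c)."""
    K = np.exp(-((C[:, None, :] - C[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    alpha = np.zeros(len(C))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))              # current probabilities
        grad = K @ (p - Z) / len(C) + lam * (K @ alpha)   # regularised log-loss gradient
        alpha -= lr * grad
    def e_hat(Cq):
        # Evaluate the fitted propensity at query contexts Cq.
        Kq = np.exp(-((Cq[:, None, :] - C[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
        return 1.0 / (1.0 + np.exp(-Kq @ alpha))
    return e_hat
```

The fitted ê then drives the conditional permutation test: group labels are reassigned with probabilities governed by ê(c_i) rather than uniformly at random.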

C. MMD-Sub Baseline
In this section we make clear the exact process used for computing the MMD-Sub test statistic and estimating the associated
p-value representing its extremity under the null hypothesis. Recall from Section 4 that MMD-Sub aims to obtain a subset
of reference instances for which the underlying context distribution matches PC1 . It proceeds by first fitting kernel density
estimators P̂C0 and P̂C1 to approximate PC0 and PC1 . We use 25% of the reference and deployment contexts for fitting the
estimators and then hold out the samples from the rest of the process. We again use Gaussian RBF kernels but this time
allow the bandwidth to be tuned using Scott's Rule (Scott, 2015). Once fit, we retain each reference sample i in the set of unheld indices U with probability P̂C1(ci)/(m P̂C0(ci)), where m = max_{j∈U} P̂C1(cj)/P̂C0(cj) ensures this probability is at most one.
Once a subset of the reference set has been sampled, an MMD two-sample test is applied against the deployment set in the
normal way (Gretton et al., 2012a). We use the same Gaussian RBF kernel with median heuristic and estimate the p-value
using a conventional (unconditional) permutation test, again with nperm = 100.
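The subset-selection step can be sketched as below (`subsample_reference` is a hypothetical helper; note that `scipy.stats.gaussian_kde` uses Scott's rule for its bandwidth by default):

```python
import numpy as np
from scipy.stats import gaussian_kde

def subsample_reference(c_ref, c_dep, seed=0):
    """Rejection-sample reference indices so that their context distribution
    matches the deployment one."""
    rng = np.random.default_rng(seed)
    n0, n1 = len(c_ref), len(c_dep)
    # Fit KDEs on a held-out 25% of each context set, as in the text.
    fit0, rest0 = c_ref[: n0 // 4], c_ref[n0 // 4 :]
    fit1 = c_dep[: n1 // 4]
    p0, p1 = gaussian_kde(fit0), gaussian_kde(fit1)
    ratio = p1(rest0) / p0(rest0)
    # Accept i with probability p1(c_i) / (m p0(c_i)), m = max_j p1(c_j)/p0(c_j).
    accept_prob = ratio / ratio.max()
    keep = rng.random(len(rest0)) < accept_prob
    return rest0[keep]
```

The retained subset is then compared to the deployment sample with a standard MMD permutation test.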

D. MMD-ADiTE: An Ablation
In Section 3.1 we noted the importance of defining a test statistic that considers the difference between conditional
distributions only at contexts supported by the deployment context distribution PC1 , rather than the more general PC . When
using MMD-based CoDiTE estimators this corresponded to averaging over held out deployment contexts, rather than both
reference and deployment contexts. We refer to the estimator that would have resulted from averaging over both reference
and deployment contexts as MMD-ADiTE. We show in Tables 3 and 4 that on the experiments considered in Section 4.1
using MMD-ADiTE as a test statistic results in a wildly miscalibrated detector, as would be expected. Further details on
these experiments can be found in Appendix E.1.

E. Experiments: Further Details and Visualisations


The general procedure we follow for obtaining results is as follows. For calibration we define reference and deployment
distributions PS0 |C (s|c)PC0 (c) and PS1 |C (s|c)PC1 (c) respectively where PS0 |C (s|c) is equal to PS1 |C (s|c) but PC0 (c)
is not necessarily equal to PC1 (c). A single run then involves generating batches of reference and deployment data and
applying a detector to obtain a p-value, with permutation tests performed using 100 permutations. We perform 100 runs to
obtain 100 p-values. We then report the Kolmogorov-Smirnov distance between the empirical CDF of the 100 p-values and
the CDF of the uniform distribution on [0, 1].
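This calibration metric is a one-liner with SciPy (a sketch; `calibration_ks` is our name):

```python
import numpy as np
from scipy.stats import kstest

def calibration_ks(p_values):
    """KS distance between the empirical CDF of the p-values and the U[0, 1] CDF;
    values near zero indicate a well-calibrated detector."""
    return kstest(p_values, "uniform").statistic
```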
For experiments exploring power, PS0 |C additionally differs from PS1 |C . In this case we perform 100 runs where this
change is present and 100 runs where only the change in PC is present. The ROC curve then plots TPRs, computed using
the first 100 p-values, against FPRs, computed using the second 100 p-values. The area under the ROC curve – the AUC – is
then reported as a measure of power.
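Since lower p-values indicate stronger evidence of drift, the AUC reduces to the Mann-Whitney statistic P(p_drift < p_no-drift), with ties counted as one half. A minimal sketch:

```python
import numpy as np

def power_auc(p_drift, p_nodrift):
    """Area under the ROC curve traced out by thresholding p-values:
    AUC = P(p_drift < p_nodrift) + 0.5 * P(p_drift == p_nodrift)."""
    p_drift = np.asarray(p_drift, dtype=float)[:, None]
    p_nodrift = np.asarray(p_nodrift, dtype=float)[None, :]
    return (p_drift < p_nodrift).mean() + 0.5 * (p_drift == p_nodrift).mean()
```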

E.1. Controlling for domain specific context


For this problem we take S = C = R. For the reference distribution we take S0 | C ∼ N(C, 1) and C0 ∼ N(0, 1), as shown in blue in both plots of Figure 9, such that marginally S0 ∼ N(0, 2). We first consider a simple narrowing of the context

Table 1. Calibration (KS) under a narrowing of context from N(0, 1) to N(0, σ²).

Method       σ = 0.125   σ = 0.25   σ = 0.5   σ = 1.0
MMD-ADiTT    0.10        0.06       0.09      0.09
MMD-Sub      0.12        0.06       0.16      0.19
MMD          1.00        1.00       0.96      0.10

Table 2. Calibration (KS) under a change of context from N(0, 1) to a mixture of Gaussians with K components.

Method       K = 1   K = 2   K = 3   K = 4   K = 5
MMD-ADiTT    0.06    0.11    0.11    0.14    0.08
MMD-Sub      0.11    0.19    0.12    0.11    0.14

distribution from PC0 = N (0, 1) to PC1 = N (0, σ 2 ) for σ ∈ {0.125, 0.25, 0.5, 1.0} in order to demonstrate as clearly as
possible how conventional two-sample tests fail to satisfy our notion of calibration, whereas MMD-ADiTT and MMD-Sub
succeed. The context-conditional distribution remains unchanged at PS0 |C = PS1 |C = N (C, 1). An example for the
σ = 0.5 case is shown in Figure 9a and the results are shown in Table 1 for a sample size of n0 = n1 = 1000.
Secondly, to test whether detectors remain calibrated under more complex changes in context, we consider for the deployment context distribution a mixture of Gaussians PC1 = (1/K) Σ_{k=1}^{K} N(µk, σk²). For each of the 100 runs we generate new means {µk}_{k=1}^{K} from a N(0, 1) and fix σk = 0.2, resulting in deployment samples such as that shown in Figure 9b for the K = 2 case. Again the context-conditional distribution remains unchanged at PS0|C = PS1|C = N(C, 1). Results are shown in Table 2 for a sample size of n0 = n1 = 1000 and a Q-Q plot is shown in Figure 13 for the K = 2 case. Figure 12 shows that the conditional resampling scheme indeed manages to reassign samples into reference and deployment windows in a diversity of ways whilst staying true to the context distribution PC1.
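The data-generating process for these runs can be sketched as follows (a hypothetical helper; parameter names are ours):

```python
import numpy as np

def sample_deployment_batch(n, K=2, sigma_k=0.2, seed=0):
    """Deployment batch for E.1: contexts from a K-component Gaussian mixture
    with means drawn from N(0, 1) and component sd 0.2; statistics S | C ~ N(C, 1)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=K)                 # fresh component means each run
    comp = rng.integers(K, size=n)          # equal-weight component assignment
    c = rng.normal(mu[comp], sigma_k)       # contexts
    s = rng.normal(c, 1.0)                  # statistics, conditionally N(C, 1)
    return c, s
```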
Finally, to test power we again take the mixture of Gaussians deployment context distribution described above but now, for deployment instances in one of the K modes, vary the context-conditional distribution from N(C, 1) to N(C + ε, ω²) for ε ∈ {0.25, 0.5} or ω ∈ {0.5, 2.0}, with examples shown in Figure 10 for the K = 2 case. Table 3 shows how the detectors' power varies with sample size for the unimodal (K = 1) case. Table 4 shows the bimodal (K = 2) case, where the difference in performance is more significant.
As noted in Section 4.1, this difference is because MMD-ADiTT detects differences conditional on context, i.e. differences between PS1|C and PS0|C, whereas subsampling detects differences between the marginal distributions PS1 and ∫ PS0|C(·|c) PC1(c) dc. This was apparent from the block-structured weight matrices visualised in Figure 4, corresponding to the context distributions of Figure 9b, showing that only similarities between instances with similar contexts
significantly contributed. For example in W0,0 the blob in the lower left corresponds to the weights assigned to similarities
between reference instances whose contexts fall within the lower deployment context cluster and the blob in the upper right
similarly corresponds to those within the upper cluster, with no weight assigned to similarities between the two. Figure 11
shows the corresponding matrices for MMD-Sub. Here rows and columns are fully active or inactive depending on whether
a given reference instance was successfully rejection sampled. We again see similar blobs in the lower left and upper right,
but now additionally in the upper left and lower right, adding unwanted noise to the test statistic. The pattern is even clearer
for W1,1 where, in the MMD-Sub case, all similarities between deployment instances are considered relevant, regardless of
how similar their contexts are.

E.2. Controlling for subpopulation prevalences


For this problem we take S = R² and a (context-unconditional) reference distribution of PS0 = π1 N(µ1, σ1²I) + π2 N(µ2, σ2²I), where I is the 2-dimensional identity matrix. We aim to detect drift in the distributions underlying subpopulations, i.e. changes in (µ1, µ2, σ1, σ2), whilst allowing their relative prevalences, i.e. (π1, π2), to change. For each run we
sample new prevalences (π1 , π2 ) from a Beta(2, 2), means µ1 , µ2 from a N (0, I) and variances from an inverse gamma
distribution with mean 0.5, such that some runs have highly overlapping modes and others do not. An example, for a run
with minimal mode overlap, is shown in Figure 15.
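One run's reference distribution can be sketched as follows. The shape of the inverse gamma is not stated in the text; drawing variances as the reciprocal of Gamma(3, 1) draws yields an inverse gamma with mean 1/(3 − 1) = 0.5, and is our assumption:

```python
import numpy as np

def sample_run_reference(n, seed=0):
    """One E.2 run: a mixture of two 2-d Gaussians with prevalence pi1 ~ Beta(2, 2),
    means ~ N(0, I) and variances from an inverse gamma with mean 0.5
    (assumed shape 3, scale 1, sampled via the reciprocal of a Gamma draw)."""
    rng = np.random.default_rng(seed)
    pi1 = rng.beta(2, 2)                               # subpopulation prevalence
    mu = rng.normal(size=(2, 2))                       # one mean per subpopulation
    var = 1.0 / rng.gamma(3.0, 1.0, size=2)            # InvGamma(3, 1), mean 0.5
    comp = (rng.random(n) > pi1).astype(int)           # subpopulation labels 0/1
    s = mu[comp] + np.sqrt(var[comp])[:, None] * rng.normal(size=(n, 2))
    return s, comp
```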

Table 3. Power under a change of context-conditional distribution from N(C, 1) to N(C + ε, 1) or N(C, ω²).

Method       Sample Size   Calibration (KS)   Power (AUC)
                                              ε = 0.25   ε = 0.5   ω = 0.5   ω = 2.0
MMD-ADiTT    128           0.10               0.64       0.83      0.93      0.94
MMD-ADiTE    128           0.37               0.61       0.77      0.69      0.89
MMD-Sub      128           0.08               0.56       0.77      0.87      0.87
MMD-ADiTT    256           0.14               0.68       0.89      0.97      0.98
MMD-ADiTE    256           0.52               0.63       0.75      0.77      0.92
MMD-Sub      256           0.10               0.67       0.87      0.97      0.92
MMD-ADiTT    512           0.07               0.78       0.97      0.97      0.98
MMD-ADiTE    512           0.55               0.58       0.75      0.65      0.90
MMD-Sub      512           0.10               0.69       0.90      0.97      0.94
MMD-ADiTT    1024          0.12               0.88       0.99      0.98      0.99
MMD-ADiTE    1024          0.50               0.62       0.71      0.55      0.88
MMD-Sub      1024          0.08               0.83       0.98      0.97      0.98
MMD-ADiTT    2048          0.09               0.97       0.99      0.99      0.99
MMD-ADiTE    2048          0.44               0.60       0.69      0.53      0.91
MMD-Sub      2048          0.09               0.95       0.99      0.99      0.99

Table 4. Power under a change of context-conditional distribution from N(C, 1) to N(C + ε, 1) or N(C, ω²) for instances in one of 2 deployment modes.

Method       Sample Size   Calibration (KS)   Power (AUC)
                                              ε = 0.25   ε = 0.5   ω = 0.5   ω = 2.0
MMD-ADiTT    128           0.09               0.56       0.73      0.78      0.87
MMD-ADiTE    128           0.22               0.55       0.69      0.66      0.84
MMD-Sub      128           0.04               0.53       0.63      0.68      0.74
MMD-ADiTT    256           0.10               0.63       0.85      0.90      0.93
MMD-ADiTE    256           0.39               0.59       0.77      0.74      0.89
MMD-Sub      256           0.16               0.55       0.67      0.75      0.76
MMD-ADiTT    512           0.10               0.69       0.92      0.94      0.97
MMD-ADiTE    512           0.27               0.58       0.77      0.71      0.85
MMD-Sub      512           0.12               0.62       0.76      0.83      0.86
MMD-ADiTT    1024          0.12               0.80       0.95      0.97      0.97
MMD-ADiTE    1024          0.32               0.65       0.80      0.72      0.87
MMD-Sub      1024          0.12               0.68       0.89      0.94      0.95
MMD-ADiTT    2048          0.11               0.91       0.98      0.98      0.98
MMD-ADiTE    2048          0.20               0.69       0.76      0.67      0.89
MMD-Sub      2048          0.08               0.79       0.96      0.95      0.98

(a) C1 ∼ N(0, 0.5²).    (b) C1 ∼ ½ N(−0.75, 0.2²) + ½ N(0.75, 0.2²).

Figure 9. Reference (blue) and deployment (orange) instances where the reference data has context-conditional distribution S0 ∼ N(C0, 1) and context distribution C0 ∼ N(0, 1). The context-conditional distribution of the deployment instances remains the same, but the context distributions change to the distributions stated. We do not wish for these changes to result in a detection.

(a) (ε, ω) = (0.5, 1).    (b) (ε, ω) = (0, 2).

Figure 10. Reference (blue) and deployment (orange) instances where the reference data has context-conditional distribution S0 ∼ N(C0, 1) and context distribution C0 ∼ N(0, 1). The context distribution then changes to PC1 = ½ N(−0.75, 0.2²) + ½ N(0.75, 0.2²) for the deployment sample and, for deployment instances corresponding to the first mode, the context-conditional distribution changes to N(C + ε, ω²). We wish for these changes to result in a detection.

Figure 11. Visualisation of the weight matrices used in computation of MMD-Sub when deployment contexts fall into two disjoint modes. We see that the similarities between instances in different modes of context contribute as much as the similarities between instances in the same mode of context.


Figure 12. The central plot shows a batch of deployment samples generated under the same setup as Figure 9b. The surrounding plots all show alternative sets of reassigned deployment samples obtained by using the conditional resampling procedure of Rosenbaum (1984) to reassign deployment statuses as z′i ∼ Ber(ê(ci)) for i = 1, ..., n. Note that the alternatives do not use identical samples for each reassignment, but do manage to achieve the desired context distribution, with none of the plots being noticeably different from any other.

Figure 13. Shown centrally is the Q-Q plot of a U[0, 1] against the p-values obtained by MMD-ADiTT under a change in context distribution from N(0, 1) to a mixture of Gaussians (the K = 2 case). The context-conditional distribution has not changed and therefore a perfectly calibrated detector should have a Q-Q plot lying close to the diagonal. To contextualise how well the central plot follows the diagonal, we surround it with Q-Q plots corresponding to 100 p-values actually sampled from U[0, 1].

Figure 14. Shown centrally is the Q-Q plot of a U[0, 1] against the p-values obtained by MMD-Sub under a change in context distribution from N(0, 1) to a mixture of Gaussians (the K = 2 case). The context-conditional distribution has not changed and therefore a perfectly calibrated detector should have a Q-Q plot lying close to the diagonal. To contextualise how well the central plot follows the diagonal, we surround it with Q-Q plots corresponding to 100 p-values actually sampled from U[0, 1].

Table 5. Calibration under changes to the mixture weights of a mixture of two Gaussians and power under a change corresponding to shifting one of the components by ε standard deviations or scaling its standard deviation by a factor of ω.

Method       Sample Size   Calibration (KS)   Power (AUC)
                                              ε = 0.2   ε = 0.4   ε = 0.6   ε = 0.8   ε = 1.0   ω = 0.5   ω = 2.0
MMD-ADiTT    128           0.13               0.53      0.63      0.72      0.79      0.84      0.79      0.83
MMD-Sub      128           0.31               0.52      0.59      0.66      0.74      0.78      0.70      0.76
MMD-ADiTT    256           0.13               0.56      0.70      0.80      0.88      0.91      0.89      0.91
MMD-Sub      256           0.28               0.55      0.65      0.75      0.81      0.84      0.81      0.83
MMD-ADiTT    512           0.09               0.56      0.75      0.88      0.92      0.93      0.92      0.95
MMD-Sub      512           0.35               0.57      0.71      0.80      0.84      0.86      0.83      0.83
MMD-ADiTT    1024          0.15               0.67      0.84      0.91      0.94      0.96      0.96      0.96
MMD-Sub      1024          0.32               0.63      0.83      0.88      0.90      0.91      0.87      0.90
MMD-ADiTT    2048          0.12               0.78      0.91      0.94      0.95      0.96      0.94      0.95
MMD-Sub      2048          0.38               0.70      0.82      0.86      0.87      0.87      0.85      0.86

We do not assume access to knowledge of the subpopulation from which samples were generated. We therefore first perform unsupervised clustering to associate each sample with a vector containing the probabilities that it belongs to each subpopulation. We do this by fitting, in an unsupervised manner, a Gaussian mixture model (GMM) with 2 components. The GMM is fit using a held-out portion of the reference data. We hold out the same amount of reference data – 25% – that is held out of the deployment data by MMD-ADiTT to condition CoDiTE functions on and by MMD-Sub to fit density estimators. Because we consider only two subpopulations (for illustrative purposes) we can simply use the first element of the vector of two subpopulation probabilities as the context variable C ∈ [0, 1], which we can think of as a proxy for subpopulation membership.
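The proxy-construction step might look as follows with scikit-learn's GaussianMixture (a sketch under our naming; the paper's exact fitting code may differ):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def context_from_gmm(s_held_out, s_rest):
    """Fit a 2-component GMM on held-out reference data and use the posterior
    probability of the first component as a univariate context proxy C in [0, 1]."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(s_held_out)
    return gmm.predict_proba(s_rest)[:, 0]
```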
To test the calibration of detectors as subpopulation prevalences vary we sample, for each run, new prevalences (π1, π2) from a Beta(1, 1). We choose Beta(2, 2) and Beta(1, 1) to generate reference and deployment prevalences respectively so that deployment distributions are typically more extremely dominated by one subpopulation than reference distributions, as would be common in practice. Table 5 shows, for various sample sizes, the calibration of the MMD-ADiTT and MMD-Sub detectors as prevalences vary in this manner. We see that MMD-ADiTT achieves strong calibration whereas MMD-Sub does not. Figures 20 and 21 further demonstrate the difference in calibration.
To test the power of detectors in response to a change in location of the distribution underlying a single subpopulation we randomly select one of the two subpopulations and perturb its mean in a random direction by ε ∈ {0.2, 0.4, 0.6, 0.8, 1.0} standard deviations. Similarly, to test power to detect a change in scale we randomly select one of the two subpopulations and multiply its standard deviation by a factor of ω ∈ {0.5, 2.0}. Results are shown in Table 5. Having a univariate context variable again allows us to visualise weight matrices, as shown in Figure 16 and Figure 17, where we again observe the desired block structure for MMD-ADiTT and not for MMD-Sub. We also plot the marginal weights assigned by MMD-ADiTT to reference and deployment instances in Figure 18. Here we see deployment instances receiving fairly uniform weights whereas the reference instances in the left mode receive much less weight than those in the right mode. This is due to the changes in prevalences making it necessary to upweight reference instances in the right mode to match the high proportion of deployment instances in that mode and similarly downweight the reference instances in the left mode to match the lower proportion.

E.3. Controlling for model predictions


Santurkar et al. (2021) organise the 1000 ImageNet (Deng et al., 2009) classes into a hierarchy of various levels. At level 2 of their hierarchy there are 10 superclasses, each containing a number of semantically related subclasses. We retain the 6 superclasses containing 50 or more subclasses, to allow sufficient samples for our experiments. Anew for each run, for each superclass we sample 25 subclasses to act as the reference distribution of the superclass and 25 subclasses to act as a drifted alternative. We use the ImageNet training split to train a model M to predict superclasses from the undrifted samples. We then use the ImageNet validation split for experiments, assigning each image x its model prediction M(x) ∈ [0, 1]^6 as context c. We wish to allow the distribution of model predictions to change between reference and deployment samples so long as the distribution of images, conditional on the model predictions, remains the same.
The model is defined as M(x) = H(B(x)), where B is the convolutional base of a pretrained SimCLR (Chen et al., 2020) model, which maps images onto 2048-dimensional vectors, and H : R^2048 → [0, 1]^6 is a classification head. We define


Figure 15. Visualisation of reference (blue) and deployment (orange) instances under various no drift/drift scenarios where we would like to allow the prevalence of modes in a mixture of two Gaussians to vary, but not their underlying distributions. From top left to bottom right: no change to either the prevalence of modes or their underlying distribution; a change only to the prevalence of modes; a change in the prevalence of modes as well as a shift in the mean of one mode by ε = 0.25; a change in the prevalence of modes as well as a shift in the mean of one mode by ε = 0.5; a change in the prevalence of modes as well as a scaling of the standard deviation of one mode by ω = 0.5; a change in the prevalence of modes as well as a scaling of the standard deviation of one mode by ω = 2.0.

Figure 16. Visualisation of the weight matrices used in computation of MMD-ADiTT when deployment contexts fall into two disjoint modes. Here we order samples by subpopulation (rather than context) to show explicitly that conditioning on proxies manages to achieve the desired block structure where only similarities between instances in the same subpopulation contribute. The shapes of the blocks correspond to reference and deployment subpopulation prevalences.

Figure 17. Visualisation of the weight matrices used in computation of the MMD-Sub when deployment contexts fall into two disjoint
modes. Here we order samples by subpopulation (rather than context) and again see how similarities between instances in different modes
contribute.

Figure 18. Visualisation of the weight attributed by MMD-ADiTT to comparing each reference sample to the set of deployment samples
(left) and vice versa (right). Only reference samples with contexts in the support of the deployment contexts significantly contribute.
Weight here refers to the corresponding row/column sum of W0,1 .


Figure 19. The central plot shows a batch of deployment samples generated under the same setup as Figure 15. The surrounding plots all show alternative sets of reassigned deployment samples obtained by using the conditional resampling procedure of Rosenbaum (1984) to reassign deployment statuses as z′i ∼ Ber(ê(ci)) for i = 1, ..., n. Note that the alternatives do not use identical samples for each reassignment, but do manage to achieve the desired context distribution, with none of the plots being noticeably different from any other.

Figure 20. Shown centrally is the Q-Q plot of a U [0, 1] against the p-values obtained by MMD-ADiTT under a change to the prevalence
of modes in a Gaussian mixture. The context-conditional distribution has not changed and therefore a perfectly calibrated detector should
have a Q-Q plot lying close to the diagonal. To contextualise how well the central plot follows the diagonal, we surround it with Q-Q plots
corresponding to 100 p-values actually sampled from U [0, 1].

Figure 21. Shown centrally is the Q-Q plot of a U [0, 1] against the p-values obtained by MMD-Sub under a change to the prevalence of
modes in a Gaussian mixture. The context-conditional distribution has not changed and therefore a perfectly calibrated detector should
have a Q-Q plot lying close to the diagonal. To contextualise how well the central plot follows the diagonal, we surround it with Q-Q plots
corresponding to 100 p-values actually sampled from U [0, 1].

Table 6. Calibration (KS) under the change of context corresponding to a computer vision model making predictions evenly across 6 classes to making predictions only within K of 6 classes.

Method       K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
MMD-ADiTT    0.08    0.08    0.11    0.14    0.09    0.07
MMD-Sub      0.89    0.88    0.80    0.64    0.43    0.23

Table 7. Power to detect changes in the distributions underlying J of the 6 classes being predicted by a computer vision model.

Method       Sample Size   Calibration (KS)   Power (AUC)
                           J = 0              J = 1   J = 2   J = 3   J = 4   J = 5   J = 6
MMD-ADiTT    128           0.19               0.55    0.59    0.70    0.73    0.82    0.84
MMD-Sub      128           0.15               0.50    0.52    0.57    0.65    0.84    0.72
MMD          128           0.13               0.57    0.58    0.62    0.74    0.79    0.89
MMD-ADiTT    256           0.13               0.65    0.83    0.86    0.92    0.95    0.98
MMD-Sub      256           0.17               0.60    0.55    0.72    0.78    0.88    0.86
MMD          256           0.19               0.59    0.65    0.78    0.87    0.94    0.98
MMD-ADiTT    512           0.10               0.82    0.94    0.99    1.00    0.99    0.99
MMD-Sub      512           0.14               0.54    0.65    0.82    0.91    0.88    0.96
MMD          512           0.18               0.58    0.73    0.92    0.96    1.00    1.00
MMD-ADiTT    1024          0.06               0.95    1.00    1.00    1.00    1.00    1.00
MMD-Sub      1024          0.20               0.62    0.77    0.87    0.92    0.88    0.94
MMD          1024          0.15               0.65    0.88    0.98    1.00    1.00    1.00
MMD-ADiTT    2048          0.10               0.97    0.98    0.98    0.98    0.98    0.98
MMD-Sub      2048          0.09               0.79    0.88    0.94    0.95    0.85    0.96
MMD          2048          0.12               0.69    0.98    1.00    1.00    1.00    1.00

H(x) = Softmax(L2(σ(L1(x)))) to consist of a linear projection onto R^128, followed by a ReLU activation, followed by a
linear projection onto R6 , followed by a softmax activation. Each of the 100 runs uses different subclasses for superclasses
and therefore we retrain H on the full ImageNet training set for each run. We do so for just a single epoch, which was
invariably sufficient to obtain a classifier with an accuracy of between 91% and 93%.
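The head's forward pass, H(x) = Softmax(L2(σ(L1(x)))), can be illustrated framework-free in NumPy (training code omitted; the weights here are stand-ins, not the fitted parameters):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_forward(x, W1, b1, W2, b2):
    """H(x) = Softmax(L2(ReLU(L1(x)))): R^2048 -> R^128 -> [0, 1]^6."""
    h = np.maximum(x @ W1 + b1, 0.0)   # linear projection + ReLU
    return softmax(h @ W2 + b2)        # linear projection + softmax
```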
We take the reference context distribution to be that corresponding to the model’s predictions across all images. To test
calibration we vary this for the deployment context distribution by selecting K of 6 superclasses and only retaining contexts
for which the superclass deemed most probable is one of the K chosen. Therefore when K = 1, the context distribution has
collapsed onto around one sixth of its original support, whereas for K = 6 it has not changed. Table 6 shows, for a sample
size of n0 = n1 = 1000, how well calibrated detectors are under such changes.
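The context restriction used in these calibration experiments amounts to filtering prediction vectors by their argmax (a sketch; function and parameter names are ours):

```python
import numpy as np

def restrict_contexts(c, k_classes, seed=0):
    """Keep only contexts (softmax prediction vectors) whose most-probable
    superclass lies in a random subset of K of the 6 superclasses."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(c.shape[1], size=k_classes, replace=False)
    mask = np.isin(c.argmax(axis=1), chosen)
    return c[mask]
```

For K = 6 every context is retained and the distribution is unchanged; for K = 1 the support collapses onto roughly one sixth of its original size, matching the description above.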
To test power to detect changes in the distribution of images conditional on model predictions we keep the deployment context distribution the same as the reference context distribution at K = 6, so that we can also compare against conventional MMD two-sample tests. We then change the distribution underlying J of the 6 superclasses to their drifted alternatives. Table 7 shows how power varies with J.
