AI Models Collapse When Trained on Recursively
Generated Data: Supplementary Material
1 Related work
In this section we cover the two concepts from the existing literature that are closest to model collapse: catastrophic forgetting and data poisoning. Neither fully explains the phenomenon of model collapse, since the setting is fundamentally different, but each offers another perspective on the observed phenomenon.
the notion of separate tasks. Our work examines a particular scenario in which the changed data distributions arise from the model itself, as a result of training in the previous iteration. A typical challenge in continual learning is that the model forgets previous samples when learning new information; this is known as catastrophic forgetting [3]. A typical way of preventing it is to use regularisation (e.g. Memory Aware Synapses [4]) or to rely on the data itself (e.g. Learning without Forgetting [5]). This has an indirect connection to our work, yet differs since the data in the process of model collapse are generated by different generations of models.

Fig. 1: Changing distribution of latents over the learning process with generated data, as perceived by the original encoder. Just as with the Gaussian case described above, the tails get washed away and the model arrives at the mean representation of the underlying data.
1.2 Data poisoning
Poisoning attacks are crafted and inserted during training in order to degrade the
model’s performance when deployed [6]. Malicious data can be inserted into train-
ing data to induce unintended behaviors that can be activated by special triggers [7].
The early literature on data poisoning focused mainly on supervised learning, where classifiers are trained with labeled samples. With the emergence of contrastive learning [8] and LLMs [9], however, more recent models are trained on large-scale web crawls, making data poisoning attacks on these untrustworthy web sources more feasible.
sources. Recent studies have demonstrated that web-scale datasets can be poisoned
by introducing malicious data into a small percentage of samples [10, 11].
Input: some started before 1360 — was typically accomplished by a master mason
and a small team of itinerant masons, supplemented by local parish labourers,
according to Poyntz Wright. But other authors reject this model, suggesting
instead that leading architects designed the parish church towers based on early
examples of Perpendicular
Outputs:
Gen 0: Revival architecture such as St. John’s Cathedral in London. The earliest
surviving example of Perpendicular Revival architecture is found in the 18th
@-@ century Church of Our Lady of Guernsey, which dates from the late 19th
century. There are two types of perpendicular churches : those
Gen 1: architecture such as St. Peter’s Basilica in Rome or St. Peter’s Basilica in
Buenos Aires. There is no evidence that any of these buildings were built during
the reign of Pope Innocent III, but it is possible that they may have been built
during the reign of his successor, Pope Innocent
Gen 3: architecture such as St. Peter’s Basilica ( now part of the Church of Jesus
Christ of Latter @-@ day Saints ) or St. Peter’s Basilica ( now part of the Church
of Jesus Christ of Latter @-@ day Saints ). There is no evidence that any of
these buildings
Gen 5: ism, which had been translated into more than 100 languages including
English, French, German, Italian, Spanish, Portuguese, Dutch, Swedish, Norwegian,
Polish, Hungarian, Slovak, Lithuanian, Estonian, Finnish, Romanian,
Bulgarian, Turkish, Croatian, Serbian, Ukrainian, Russian, Kazakh, Kyrgyz
Gen 7: architecture in England. In an interview with The New York Times, Wright
said : ” I don ’t think there is anything wrong with me being able to do what
I want to do. It just doesn ’t work for me. ” He added : ” I don ’t know if you
can call it funny,
Gen 9: architecture. In addition to being home to some of the world’s largest
populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue
@-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-
2 Regularized Density Estimation in Hilbert Space
Here we consider a model of density estimation in a reproducing kernel Hilbert space (RKHS) presented by Kazashi and Nobile [12]. This choice is motivated by the similarity of model collapse to self-distillation; what is more, in this setting a closed-form solution can be derived. To be precise, Mobahi et al. [13] analyzed the effect of multi-stage regularized regression in an RKHS; we show that, similarly, in expectation the density estimator collapses onto the topmost directions in the data.
Let $(\mathcal{X}, \mathcal{B}, \mu)$ be a measure space, and let $(\mathcal{N}_K, \langle\cdot,\cdot\rangle_K, \|\cdot\|_K)$ denote the RKHS associated with the positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The full assumptions and the corresponding derivations are provided in Section R3.2.1. We further define the integral operator $T_K : L^2_\mu(\mathcal{X}) \to L^2_\mu(\mathcal{X})$ by $T_K g := \int_{\mathcal{X}} K(\cdot, x)\, g(x)\, d\mu(x)$ for $g \in L^2_\mu(\mathcal{X})$. Let $Y_1, \dots, Y_M : \Omega \to \mathcal{X}$ be independent random variables that follow the distribution defined by a density $y \in \mathcal{N}_K$ with respect to $\mu$. Similar to Kazashi and Nobile [12], we approximate $y$ in the form $y(\cdot) \approx \sum_{n=1}^{N} a_n K(x_n, \cdot)$ for $X = \{x_1, \dots, x_N\} \subset \mathcal{X}$ a set of specifically chosen points, defining the approximation space $V_N := \operatorname{span}\{K(x_j, \cdot) \mid j = 1, \dots, N\}$. The idealised regularised density estimation would consist of solving the following constrained optimisation problem:
$$\arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u - y\|_{L^2_\mu(\mathcal{X})}^2 \le \frac{\epsilon_n}{2}.$$
However, in the density estimation setting we do not have access to the true density, but only to samples. Therefore the problem we consider is a sample approximation of the ideal problem:
$$f_n^* = \arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) + \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right] \le \frac{\epsilon_n}{2},$$
where we define $\hat{A}_{jk} \equiv \langle K(x_j, \cdot), K(x_k, \cdot)\rangle_{L^2_\mu(\mathcal{X})}$, $A_{jk} \equiv K(x_j, x_k)$, the positive definite $P \equiv \hat{A}^{-1/2} A \hat{A}^{-1/2}$, $k(\cdot)_j \equiv K(x_j, \cdot)$, and similarly $k(Y)_j \equiv \frac{1}{M}\sum_{m=1}^{M} K(x_j, Y_m)$ for $j = 1, \dots, N$. As $P$ is symmetric positive definite, we can write it as $P = V^\top D V$, with $D$ diagonal. Let $d_{\min}$ and $d_{\max}$ denote its smallest and largest eigenvalues, and let $B_n = \prod_{i=0}^{n} (c_i D + I)^{-1}$. Now, considering the process of learning with generational data, we have $Y^{n+1} \sim f_n^*$ and $f_0^* \equiv y$, which lets us derive the following:
Theorem 2.1. Starting with $v_0 = V\hat{A}^{-1/2}[T_K y](x_{1:N})$ such that $\|v_0\| > \epsilon$, and with $M \to \infty$, the process does not collapse for at least $\underline{n}$ generations, where
$$\underline{n} = \frac{d_{\min}}{d_{\max}}\left(\frac{\|v_0\|}{\sqrt{\epsilon}} - 1\right).$$
In this case, for $n \le \underline{n}$, the average density estimate can be written exactly as
$$\mathbb{E}[f_n^*] = \left[V\hat{A}^{-1/2} k(\cdot)\right]^\top B_n \left[V\hat{A}^{-1/2}[T_K y](x_{1:N})\right]. \qquad (1)$$
In particular, for $d_j$ and $d_k$ a pair of diagonal elements of $D$ with $d_k < d_j$, the following inequality holds:
$$\frac{B_{n-1}[k, k]}{B_{n-1}[j, j]} \ge \left(\frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_j}{d_{\max}}}{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_k}{d_{\max}}}\right)^n.$$
This, similarly to Mobahi et al. [13], tells us that the process of model collapse in this case progressively limits the number of basis functions available to represent the density in the average-case scenario, considered in the limit of infinite samples. For the sake of simplicity, this model has been stated with $M \to \infty$. As such, Theorem 2.1 portrays the process of model collapse occurring due to the combination of functional expressivity and functional approximation errors. The model similarly allows for analysis in the case of finite $M$, but that is left for future work.
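To make the statement concrete, the following minimal sketch (with hypothetical values of $D$, $v_0$ and $\epsilon$ that are not taken from the paper) iterates the averaged recursion $v_n = (c_{n-1}D + I)^{-1}v_{n-1}$, with $c_n$ chosen as in the derivation below so that the accuracy constraint holds in expectation, and prints the collapse-free horizon $\underline{n}$ of Theorem 2.1. Components of $v$ associated with larger diagonal entries of $D$ shrink fastest, which is the sparsification the theorem describes.

```python
import numpy as np

# Minimal sketch (hypothetical toy values, not the paper's experiment):
# iterate v_n = (c_{n-1} D + I)^{-1} v_{n-1}, with
# c_n = 1 / (alpha_n (||v_n|| / sqrt(eps) - 1)) and alpha_n = d_max,
# as in the closed-form derivation of the regularised density estimator below.
d = np.array([2.0, 1.0, 0.5])      # assumed eigenvalues of P (diagonal of D)
v = np.array([1.0, 1.0, 1.0])      # assumed initial coefficients v_0
eps = 1e-2

n_bar = (d.min() / d.max()) * (np.linalg.norm(v) / np.sqrt(eps) - 1)
print(f"no collapse for at least {n_bar:.1f} generations (Theorem 2.1)")

for n in range(1, 16):
    c = 1.0 / (d.max() * (np.linalg.norm(v) / np.sqrt(eps) - 1.0))
    v = v / (c * d + 1.0)          # (cD + I)^{-1} v, elementwise since D is diagonal
    if n % 5 == 0:
        # components tied to larger d_j decay fastest
        print(n, np.round(v / np.abs(v).max(), 3))
```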
$$\mu_{i+1} = \frac{1}{M_i}\sum_j X_j^i; \qquad \sigma_{i+1}^2 = \frac{1}{M_i - 1}\sum_j \left(X_j^i - \mu_{i+1}\right)^2. \qquad (2)$$
Note here that if we were to use maximum likelihood estimation, we would instead arrive at a biased variance estimator. With these estimates, the functional approximation step simply corresponds to considering a normal distribution with these parameters.
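As an illustration of this estimation-and-resampling loop, here is a minimal sketch (assumed setup: a single Gaussian and a fixed sample size per generation; not the paper's code) that applies the unbiased estimators of Equation (2) to data sampled from the previous generation's fit.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's code): repeatedly fit a
# Gaussian to samples drawn from the previous generation's fit, using the
# unbiased estimators of Equation (2).
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # generation-0 ("real") distribution
M = 1000                      # samples per generation, M_i = M

for gen in range(10):
    x = rng.normal(mu, sigma, size=M)     # sample from the current model
    mu = x.mean()                          # mu_{i+1}
    sigma = x.std(ddof=1)                  # sigma_{i+1} from the unbiased variance
    print(f"gen {gen + 1}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```

Individual runs drift away from $(\mu, \sigma) = (0, 1)$, and the drift accumulates over generations much like a random walk in the parameters, as quantified below.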
[Fig. 2 plot panels: "Real distribution 1", "Real distribution 2", "Resampled 1 and 2"; y-axis: log(Count), with the level log M marked.]
Fig. 2: Shown in the middle is a histogram plot of samples from a Gaussian mixture
with means (−4, 4) and variances of 1. To the left of it is a similar distribution, but
with ’fatter’ tails, and on the right the same histograms are shown, but with low
probability events being cut off due to finite resampling. Although distributions 1 and
2 are very different, when resampled (only assuming the expected behaviour), the tails
get cut off, leading to the same observed distribution. In this case this is all states
with probability less than 1/M , or equivalently, bins with log Count ≤ log M .
This provides us with the conditional distribution of $X_j^i$, which allows us to calculate the full distribution of $X_j^i$. From Equation (4), we see that even after the first approximation the distribution of $X_j^i$ is no longer normal; it follows a variance-gamma distribution [14]. However, instead of writing out the probability density function at each generation, we can explicitly construct the $X_j^i$ in terms of independent random variables. In particular, it is well known [15] that $\mu_1$ and $\sigma_1$ are independent, with $\mu_1 \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{M_0}\right)$ and $(M_0 - 1)\sigma_1^2 \sim \sigma^2\,\Gamma\!\left(\frac{M_0 - 1}{2}, \frac{1}{2}\right)$. In what follows we will denote by $Z$ random variables distributed as $\mathcal{N}(0, 1)$ and by $S^i$ random variables distributed as $\frac{1}{M_{i-1} - 1}\Gamma\!\left(\frac{M_{i-1} - 1}{2}, \frac{1}{2}\right)$.
$$X_j^0 = \mu + \sigma Z_j^0; \qquad X_j^1 = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \sigma\sqrt{S^1}\, Z_j^1; \qquad \dots \qquad (4)$$
$$X_j^n = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \frac{\sigma}{\sqrt{M_1}}\sqrt{S^1}\, Z^2 + \dots + \frac{\sigma}{\sqrt{M_{n-1}}}\sqrt{S^1 \times \dots \times S^{n-1}}\, Z^n + \sigma\sqrt{S^1 \times \dots \times S^n}\, Z_j^n. \qquad (5)$$
[Fig. 3 plots: estimation of µ and σ (µ = 0, σ = 1); x-axis: evolution (generations, log scale); legend: sample sizes 100, 500, 1000, 10000, 100000, 1000000, 10000000.]
These are not joint distributions, as $Z^n$ and $S^n$ depend directly on $Z_j^{n-1}$, but when considering $X_j^n$ on its own the formula above provides all the information about its full distribution. The first thing we may try calculating is the variance. It is possible to find its exact value, but the mean and variance of the square root of a gamma-distributed variable are expressed in terms of gamma functions, making the result quite clunky. In what follows we expand everything to second order in each of the $1/M_i$, as we assume each sample size to be large (in practice this becomes quite accurate already for $M \sim 100$). We then find that
$$\frac{1}{\sigma^2}\operatorname{Var}\!\left(X_j^n\right) = \frac{1}{M_0} + \frac{1}{M_1} + \dots + \frac{1}{M_{n-1}} + 1 + \mathcal{O}(2).$$
This means that as n → ∞, the variance diverges linearly. This is the same scaling
as for a single dimensional Gaussian random walk. We can further see the similarities
in numerical experiments shown on Figure 3 for a range of different sample sizes,
confirming these theoretical intuitions.
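A quick Monte Carlo check of this expansion (assumed setup, with a constant sample size $M_i = M$; not the paper's code) can be run as follows; the empirical variance of a generation-$n$ sample should match $1 + n/M$ up to higher-order terms.

```python
import numpy as np

# Assumed-setup Monte Carlo check of the second-order expansion
# Var(X_j^n) / sigma^2 ~= 1 + 1/M_0 + ... + 1/M_{n-1}:
# simulate many independent chains of re-fitting and re-sampling.
rng = np.random.default_rng(1)
n_chains, M, n_gen = 20000, 200, 5
x = rng.normal(0.0, 1.0, size=(n_chains, M))          # generation 0, sigma = 1

for _ in range(n_gen):
    mu_hat = x.mean(axis=1, keepdims=True)
    sd_hat = x.std(axis=1, ddof=1, keepdims=True)
    x = mu_hat + sd_hat * rng.standard_normal((n_chains, M))   # resample

empirical = x[:, 0].var()                              # Var of one sample at generation n
predicted = 1.0 + n_gen / M                            # 1 + sum_i 1/M_i for constant M
print(empirical, predicted)                            # should agree up to ~1/M^2 terms
```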
Even though the variance of $X_j^n$ diverges, it does not provide us with any information about what the corresponding estimates of $\mu_{n+1}$ and $\sigma^2_{n+1}$ are, or how far they are from the original $\mu$ and $\sigma$. In particular, we may want to consider the distance between the true distribution and the approximated distribution at step $n + 1$. To measure this we can consider the Wasserstein-2 distance between two normals:
$$R^{n+1}_{W_2} := W_2^2\!\left(\mathcal{N}(\mu, \sigma^2), \mathcal{N}(\mu_{n+1}, \sigma^2_{n+1})\right) = \|\mu_{n+1} - \mu\|^2 + \|\sigma_{n+1} - \sigma\|^2.$$
Now we can calculate the risk that occurs due to finite sampling, i.e. the expected value of the distance (expanding in $1/M_i$):
$$\mathbb{E}_{\mu_{n+1}, \sigma^2_{n+1}}\!\left[R^{n+1}_{W_2}\right] = \frac{3}{2}\sigma^2\left(\frac{1}{M_0} + \frac{1}{M_1} + \dots + \frac{1}{M_n}\right) + \mathcal{O}(2), \qquad (6)$$
$$\operatorname{Var}_{\mu_{n+1}, \sigma^2_{n+1}}\!\left[R^{n+1}_{W_2}\right] = \frac{1}{2}\sigma^4\left(\frac{3}{M_0^2} + \frac{3}{M_1^2} + \dots + \frac{3}{M_n^2} + \sum_{i \ne j}\frac{4}{M_i M_j}\right) + \mathcal{O}(3). \qquad (7)$$
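The leading term of Equation (6) can likewise be checked numerically. The sketch below (assumed setup: constant sample size, $\sigma = 1$; not the paper's code) simulates many independent chains of re-estimation and compares the empirical mean of $R^{n+1}_{W_2}$ with $\tfrac{3}{2}\sigma^2\sum_i 1/M_i$.

```python
import numpy as np

# Assumed-setup Monte Carlo check of Equation (6): the expected Wasserstein-2
# risk E[R_{W2}^{n+1}] ~= (3/2) sigma^2 (1/M_0 + ... + 1/M_n) for 1-D Gaussians.
rng = np.random.default_rng(2)
n_chains, M, n_gen = 50000, 100, 3
mu = np.zeros(n_chains)
sd = np.ones(n_chains)

for _ in range(n_gen + 1):                 # n + 1 estimation steps in total
    x = mu[:, None] + sd[:, None] * rng.standard_normal((n_chains, M))
    mu = x.mean(axis=1)
    sd = x.std(axis=1, ddof=1)

risk = (mu - 0.0) ** 2 + (sd - 1.0) ** 2   # W2^2 between N(0,1) and N(mu_{n+1}, sd_{n+1}^2)
print(risk.mean(), 1.5 * (n_gen + 1) / M)  # empirical vs. leading-order prediction
```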
This result allows us to interpret exactly what occurs in this formulation of model
collapse. To be precise, due to errors occurring from re-sampling the approximated
distribution, each generation ends up corresponding to a new step in a random walk of
model parameters. The risk that occurs in this model ends up diverging for a constant
sample size at each generation. In order for the end distribution approximation to be
accurate, and for the distance to be finite, the sampling rate Mi needs to increase
superlinearly, i.e. one needs to collect increasingly more samples over time, perhaps
quadratically. However, even in that case the expected distance after n steps remains
non-zero and the only case in which it does in fact end up being 0 is when sampling
is infinite at each step. Overall, this only shows us how far on average we go from the
original distribution, but the process can only ’terminate’ if the estimated variance at a
certain generation becomes small enough, i.e. we effectively turn into a delta function.
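To make the "perhaps quadratically" remark concrete (an illustrative choice of schedule, not one prescribed by the analysis above): if the sample size grows as $M_i = M_0\,(i+1)^2$, the accumulated expected risk in Equation (6) stays bounded, since
$$\sum_{i=0}^{\infty}\frac{1}{M_i} = \frac{1}{M_0}\sum_{i=0}^{\infty}\frac{1}{(i+1)^2} = \frac{\pi^2}{6 M_0} < \infty,$$
whereas for a constant sample size $M_i = M$ the same sum grows as $(n+1)/M$ and the expected risk diverges linearly in the number of generations.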
Shown on Figures 4 and 5 are different runs of this process for different values of the parameters $\alpha_i, \beta_i, \gamma_i$ and for different sample sizes; these were investigated numerically to see whether they can be enough to overcome model collapse. However, even in those cases the changes are inevitable, although attenuated.
[Figs. 4 and 5 plots: estimation of µ and σ (µ = 0, σ = 1); x-axis: evolution (generations, log scale); legend: sample sizes 100, 500, 1000, 10000.]
With this, instead of quantifying the distance exactly, we can instead lower bound it. The only limitation is that we have to specify a functional approximation model. In order to achieve a $W_2$ bound, we will be required to specify how the mean changes between generations. In the scenario where we only have access to the sample mean, we would approximate the mean of the next-generation distribution as in Equation (2). However, as more information arrives, or the model begins using it better, we may end up diverging from the sample mean. We would still require that the model have good performance, i.e. that on average the mean estimate is the same. We will also have to specify the expected behaviour of the model for the variance calculation, which once again will be chosen such that it averages out. Thus, we will adopt the following evolution over generations:
$$\mu_{i+1} = \frac{1}{M_i}\sum_j X_j^i + \varepsilon_{i+1} = \frac{\Sigma_i^{1/2}}{\sqrt{M_i}} T^{i+1} + \mu_i + \varepsilon_{i+1}; \qquad \mathbb{E}_{X_j^i}(\Sigma_{i+1}) = \Sigma_i, \qquad (8)$$
where we define $T^{i+1}$ to satisfy the equation above, i.e. $T^{i+1} = \frac{\Sigma_i^{-1/2}}{\sqrt{M_i}}\sum_j\left(X_j^i - \mu_i\right)$. With this normalisation, $T$ has mean $0$ and covariance $I_N$, and by the central limit theorem (CLT) we would have $T^{i+1} \mid \mu_i, \Sigma_i \xrightarrow{D} \mathcal{N}(0, I_N)$; the lower bound, however, will not rely on this. To arrive at a lower bound for the risk, similar to that of Equation (6), we are going to have to make an assumption about the form of $\varepsilon_{i+1}$. Overall the result can be summarised in the following way:
Theorem 3.1 (Mean and Covariance-consistent Generalisation Error). Let the model be updated between generations in a way such that its mean and covariance matrix are on average consistent, i.e. satisfy
$$\mu_{i+1} = \frac{1}{M_i}\sum_j X_j^i + \varepsilon_{i+1}\!\left(X_j^i\right); \qquad \mathbb{E}_{X_j^i}(\Sigma_{i+1}) = \Sigma_i; \qquad \mathbb{E}[\varepsilon_{i+1} \mid \mu_i, \Sigma_i] = 0; \qquad (9)$$
with $\varepsilon_{i+1}$ denoting potential deviations from the usual unbiased estimators due to some implicit bias of the model or the optimisation routine. Then:
(a) If $\operatorname{Cov}(\varepsilon_{i+1}, \mu_{i+1} \mid \mu_i, \Sigma_i) = 0$:
$$\mathbb{E}\!\left[R^{n+1}_{W_2}\right] \ge \operatorname{Tr}\Sigma \sum_{i=0}^{n}\frac{1}{M_i} + \sum_{i=1}^{n+1}\mathbb{E}\,\|\varepsilon_i\|^2. \qquad (10)$$
(b) Under the additional assumptions on $\varepsilon$ listed below (in particular Assumption 3B, $\|\varepsilon_{i+1}\| \le K/M_i$):
$$\mathbb{E}\!\left[R^{n+1}_{W_2}\right] \ge \operatorname{Tr}\Sigma \sum_{i=0}^{n}\frac{1}{M_i} + \sum_{i=1}^{n+1}\mathbb{E}\,\|\varepsilon_i\|^2 - 2K\sqrt{N}\,\sqrt{\operatorname{Tr}\Sigma}\sum_{i=0}^{n}\frac{1}{M_i^{3/2}} \qquad (11)$$
$$= \operatorname{Tr}\Sigma \sum_{i=0}^{n}\frac{1}{M_i} + \sum_{i=1}^{n+1}\mathbb{E}\,\|\varepsilon_i\|^2 + \mathcal{O}(3/2).$$
[2] Given all of the $X_j^i$, $\varepsilon$ must be constant, i.e. it is a function of only the data. In particular, it depends on $\mu_i$ and $\Sigma_i$ only through the data.
[3A] The extra noise is orthogonal to the sample mean in the sense of random variables. This effectively assumes that the noise does not contain any first-moment information. This may seem like a rather strong assumption compared to the previous ones; however, this property can be shown to hold when imposing the CLT on $T^{i+1}$ in the limit of large $M_i$ (note here that $M_i$ can only be assumed to be large, and not infinite) and assuming that $\varepsilon$ is strictly a function of moments higher than the first. Furthermore, a property of this type is necessary to actually provide any information, since without it there would be no need to separate the update into an $\varepsilon$ term and a sample-mean term, as everything could be absorbed into $\varepsilon$.
[3B] The extra noise is assumed to be bounded and of an order smaller than the sample-mean deviation. To be precise, there is a constant $K$ (not dependent on the generation $i$) such that for all $i$:
$$\|\varepsilon_{i+1}\| \le \frac{K}{M_i}. \qquad (15)$$
(b) Now, using Assumption 3B, we can start from
$$\mathbb{E}\!\left[R^{i+1}_{W_2}\right] \ge \mathbb{E}\,\|\mu_i - \mu\|^2 + \frac{\operatorname{Tr}\Sigma}{M_i} + \mathbb{E}\,\|\varepsilon_{i+1}\|^2 + \frac{2}{\sqrt{M_i}}\,\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2} T^{i+1}\right]. \qquad (22)$$
Similar to before, we need to evaluate (which we instead bound this time):
$$\frac{2}{\sqrt{M_i}}\,\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2} T^{i+1}\right] = \frac{2}{\sqrt{M_i}}\int d\Sigma_i\, p(\Sigma_i)\, \operatorname{Tr}\!\left[\Sigma_i^{1/2}\operatorname{Cov}\!\left(\varepsilon_{i+1}, T^{i+1} \mid \Sigma_i\right)\right] \ne 0$$
$$\ge -\frac{2\sqrt{N}}{\sqrt{M_i}}\int d\Sigma_i\, p(\Sigma_i)\sqrt{\operatorname{Tr}\Sigma_i\,\Sigma_{\epsilon_{i+1}}}$$
$$\ge -\frac{2\sqrt{N}}{\sqrt{M_i}}\sqrt{\mathbb{E}\!\left[\varepsilon_{i+1}^\top \Sigma_i\, \varepsilon_{i+1}\right]}$$
$$\ge -\frac{2\sqrt{N}}{\sqrt{M_i}}\sqrt{\frac{K^2\operatorname{Tr}\Sigma}{M_i^2}} = \frac{-2K\sqrt{N}}{M_i\sqrt{M_i}}\sqrt{\operatorname{Tr}\Sigma},$$
where we used the Cauchy–Schwarz and Jensen inequalities. Note that this is far from an optimal inequality, since instead of using the expected value of the largest eigenvalue we bounded it by $\operatorname{Tr}\Sigma$. In particular, the per-step bound is then:
$$\mathbb{E}\!\left[R^{i+1}_{W_2}\right] \ge \mathbb{E}\,\|\mu_i - \mu\|^2 + \frac{\operatorname{Tr}\Sigma}{M_i} + \mathbb{E}\,\|\varepsilon_{i+1}\|^2 - \frac{2K\sqrt{N}}{M_i\sqrt{M_i}}\sqrt{\operatorname{Tr}\Sigma}. \qquad (23)$$
In particular, we find in the same way that superlinear scaling would be required to minimise the lower bound on model collapse even in the case of more generic models of approximation.
These results allow us to interpret the process of model collapse. To be precise,
each generation corresponds to a new step in a random walk of model parameters.
Due to errors occurring from re-sampling the approximated distribution, on aver-
age, the expected distance from the original distribution is non-zero. For constant
sample size, the risk ends up diverging and in order for the limiting distribution to
be accurate and the distance to be finite, the sampling rate Mi needs to increase
superlinearly, i.e. one needs to collect superlinearly more samples over time, perhaps
quadratically. However, even then, the expected distance remains non-zero unless
sampling is infinite at each step. Overall, this only shows us how far on average we
depart from the original distribution, but the process can only ’terminate’ if the
estimated variance at a certain generation becomes small enough, i.e. we effectively
turn into a delta function. In total, the message so far can be summarised as follows:
When learning on generational data, due to finite sampling, we are only able to approximate the original distribution. While on average we should recover the original distribution, the variance arising from this is non-zero. As a result, over the generations, the average distance of the n-th generation from the original grows and can become infinite in the limit, since errors compound over time.
where $W_2^2$ denotes the Wasserstein-2 distance between the true distribution and the approximation at step $n$.

Proof. First note that $W_2^2(\mathcal{N}(\mu_n, \Sigma_n), D_0) \ge \|\mu_n - \mu_0\|^2$. Assume that $\hat\mu_0, \hat\Sigma_0$ are the estimates from samples $x_j^0$ from $D_0$ and $\hat\Sigma_0 \ne 0$. Then, by the triangle inequality,
$$W_2^2(\mathcal{N}(\mu_n, \Sigma_n), D_0) + W_2^2\!\left(D_0, \mathcal{N}(\hat\mu_0, \hat\Sigma_0)\right) \ge \frac{1}{2} W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), \mathcal{N}(\hat\mu_0, \hat\Sigma_0)\right).$$
Therefore,
$$W_2^2(\mathcal{N}(\mu_n, \Sigma_n), D_0) \ge -W_2^2\!\left(D_0, \mathcal{N}(\hat\mu_0, \hat\Sigma_0)\right) + \frac{1}{2}\|\mu_n - \hat\mu_0\|^2.$$
Now we are in the same setting as in the subsections above, and e.g. using the result from Theorem 3.1 with $\varepsilon = 0$, we have that
$$\mathbb{E}\!\left[W_2^2(\mathcal{N}(\mu_n, \Sigma_n), D_0)\right] \ge \frac{1}{2}\operatorname{Tr}\Sigma\left(\frac{1}{M_0} + \frac{1}{M_1} + \dots + \frac{1}{M_n}\right) - W_2^2\!\left(D_0, \mathcal{N}(\hat\mu_0, \hat\Sigma_0)\right).$$
For a constant sample size $M_i = M$ this gives
$$\mathbb{E}\!\left[W_2^2(\mathcal{N}(\mu_n, \Sigma_n), D_0)\right] \ge n\,\frac{\operatorname{Tr}\hat\Sigma_0}{2M} - W_2^2\!\left(D_0, \mathcal{N}(\hat\mu_0, \hat\Sigma_0)\right) \to \infty.$$
This proves the first claim. The proof of the second claim is heavily based on Alemohammad et al. [17] and is repeated here only for completeness. Once again, looking after the first step, such that only normal distributions are considered, we can write $X_n^i = \Sigma_{n-1}^{1/2} Z_n^i + \mu_{n-1}$, with $Z_n^i \sim \mathcal{N}(0, I)$, and denote the sample mean as $\zeta_n = \frac{1}{M}\sum_{i=1}^{M} Z_n^i$. Now, noting that $\operatorname{Tr}\Sigma_n$ is a lower bounded submartingale and writing
$$\operatorname{Tr}\Sigma_n = \operatorname{Tr}\!\left[\frac{1}{M-1}\sum_{i=1}^{M}\left(Z_n^i - \zeta_n\right)\left(Z_n^i - \zeta_n\right)^\top \Sigma_{n-1}\right],$$
we see that $\operatorname{Tr}\Sigma_n \xrightarrow{a.s.} \sigma_\infty$ for some random variable $\sigma_\infty$ by Doob's martingale convergence theorem [18]. Without loss of generality assume $\Sigma_n$ is diagonal. Then $\operatorname{Tr}\Sigma_n$ can be seen as a linear combination of $N$ independent $\chi^2$ random variables with $N - 1$ degrees of freedom and weights $\operatorname{diag}(\Sigma_{n-1})$. We write $\operatorname{Tr}\Sigma_n = Q_n \operatorname{Tr}\Sigma_{n-1}$, where $Q_n$ is a generalized $\chi^2$ random variable with $N - 1$ degrees of freedom and weights $\frac{\operatorname{diag}(\Sigma_{n-1})}{\operatorname{Tr}\Sigma_{n-1}}$, with $\mathbb{E}[Q_n \mid \Sigma_{n-1}] = 1$. Therefore, at least one of the weights is greater than $1/N$ for each $n$, and so for any $0 < \epsilon < 1$ there exists $c > 0$ such that $\Pr(|Q_n - 1| > \epsilon) > c$. Since $|Q_n - 1| > \epsilon$ infinitely often with probability one, $\sigma_\infty = 0$ is the only random variable that satisfies $\lim_{n\to\infty} \operatorname{Tr}\Sigma_0 \prod_{j=1}^{n} Q_j = \sigma_\infty$, and as all matrix norms are equivalent, $\Sigma_n \xrightarrow{a.s.} 0$.
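The almost-sure collapse of the covariance can be observed directly in simulation. The sketch below (assumed setup: small dimension and sample size, repeated unbiased re-fitting and re-sampling of a multivariate Gaussian; not the paper's code) tracks $\operatorname{Tr}\Sigma_n$, which drifts towards zero even though $\mathbb{E}[\operatorname{Tr}\Sigma_n]$ stays constant.

```python
import numpy as np

# Assumed-setup simulation of the second claim: with repeated unbiased
# refitting and resampling of a multivariate Gaussian, the estimated
# covariance collapses, Tr(Sigma_n) -> 0 almost surely.
rng = np.random.default_rng(3)
N, M = 2, 10                       # dimension and (deliberately small) sample size
mu, Sigma = np.zeros(N), np.eye(N)

for gen in range(2000):
    x = rng.multivariate_normal(mu, Sigma, size=M)
    mu = x.mean(axis=0)
    Sigma = np.cov(x, rowvar=False)         # unbiased sample covariance
    if gen % 400 == 0:
        print(gen, np.trace(Sigma))         # trace shrinks over the generations
```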
where $\lambda > 0$ is a parameter and the bilinear form $\langle\cdot,\cdot\rangle_\lambda$ is an inner product on $\mathcal{N}_K$. We further define the integral operator $T_K : L^2_\mu(\mathcal{X}) \to L^2_\mu(\mathcal{X})$ by
$$T_K g := \int_{\mathcal{X}} K(\cdot, x)\, g(x)\, d\mu(x), \qquad g \in L^2_\mu(\mathcal{X}).$$
Let $Y_1^0, \dots, Y_M^0 : \Omega \to \mathcal{X}$ be independent random variables that follow the distribution defined by a density $y \in \mathcal{N}_K$ with respect to $\mu$. Similar to Kazashi and Nobile [12], we approximate $y$ in the form $y_\theta(\cdot) \approx \sum_{n=1}^{N} a_n K(x_n, \cdot)$ for $X = \{x_1, \dots, x_N\} \subset \mathcal{X}$ a set of specifically chosen points, defining the approximation space $V_N := \operatorname{span}\{K(x_j, \cdot) \mid j = 1, \dots, N\}$. Unlike Kazashi and Nobile [12], we are interested in a constrained problem, which provides us with a way to select the regularisation parameter. The idealised regularised density estimation would consist of solving the following constrained optimisation problem:
$$\arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u - y\|_{L^2_\mu(\mathcal{X})}^2 \le \frac{\epsilon}{2}.$$
Equivalently, in terms of the Karush–Kuhn–Tucker (KKT) conditions,
$$\arg\min_{u \in V_N} \frac{\lambda}{2}\|u - y\|_{L^2_\mu(\mathcal{X})}^2 + \frac{1}{2}\|u\|_K^2$$
$$\text{s.t.} \quad \lambda \ge 0, \qquad \|u - y\|_{L^2_\mu(\mathcal{X})}^2 \le \epsilon, \qquad \lambda\left(\|u - y\|_{L^2_\mu(\mathcal{X})}^2 - \epsilon\right) = 0.$$
However, in the density estimation setting we do not have access to the true density, but only to samples. Therefore the problem we consider is a sample approximation of the ideal problem:
$$f_n^* = \arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) + \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right] \le \frac{\epsilon_n}{2},$$
where we define $\hat{A}_{jk} \equiv \langle K(x_j, \cdot), K(x_k, \cdot)\rangle_{L^2_\mu(\mathcal{X})}$. In this case, the last term acts as a proxy for $\|y\|_{L^2_\mu(\mathcal{X})}^2$. With this, we can follow Mobahi et al. [13] and, in analogy, write down the Karush–Kuhn–Tucker (KKT) conditions for this problem:
$$f_n^* = \arg\min_{u \in V_N} \lambda_n\left(\frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n)\right) + \frac{1}{2}\|u\|_K^2$$
$$\text{s.t.} \quad \lambda_n \ge 0, \qquad \frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) \le \frac{\epsilon_n}{2} - \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right],$$
$$\lambda_n\left(\frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) - \frac{\epsilon_n}{2} + \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right]\right) = 0.$$
Then, in analogy, when $\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right] \le \epsilon_n$, $f^*$ has the trivial solution $f^* = 0$, which is referred to as collapse. In what follows, the more interesting case is investigated, i.e. $\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right] > \epsilon_n$, with
$$\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right] > \epsilon_n \iff \lambda_n > 0.$$
From now on we let $c_n = 1/\lambda_n$ and write down the following approximate and ideal problems:
$$f := \arg\min_{v \in V_N} \frac{1}{2}\|v - y\|_{L^2_\mu(\mathcal{X})}^2 + \frac{c}{2}\|v\|_K^2,$$
$$f_n^* := \arg\min_{v \in V_N} \frac{1}{2}\|v\|_{L^2_\mu(\mathcal{X})}^2 + \frac{c_n}{2}\|v\|_K^2 - \frac{1}{M}\sum_{m=1}^{M} v(Y_m^n).$$
For these, we can write down the exact solutions. For the second, that is equivalent to finding $f_n^* \in V_N$ such that
$$\langle f_n^*, v\rangle_{c_n} = \frac{1}{M}\sum_{m=1}^{M} v(Y_m^n) \quad \text{for all } v \in V_N. \qquad (25)$$
The reader is referred to Kazashi and Nobile [12] and Ciarlet [19] for the equivalence and well-posedness discussion. In particular, defining $\hat{A}_{jk} \equiv \langle K(x_j, \cdot), K(x_k, \cdot)\rangle_{L^2_\mu(\mathcal{X})}$, $A_{jk} \equiv K(x_j, x_k)$, and further letting $P = \hat{A}^{-1/2} A \hat{A}^{-1/2}$, with $P$ symmetric positive definite, $k(\cdot)_j = K(x_j, \cdot)$, and similarly $k(Y)_j = \frac{1}{M}\sum_{m=1}^{M} K(x_j, Y_m)$ for $j = 1, \dots, N$, the variational problems satisfy
$$f_n^* = k(\cdot)^\top \hat{A}^{-1/2}(c_n P + I)^{-1}\hat{A}^{-1/2} k_Y,$$
$$f = k(\cdot)^\top \hat{A}^{-1/2}(c_n P + I)^{-1}\hat{A}^{-1/2}[T_K y](x_{1:N}),$$
for $Y^n \sim y\, d\mu$, and by Kazashi and Nobile [12] we have that $\mathbb{E}[f_n^*] = f$. The generational process then reads
$$Y^{n+1} \sim f_n^* = k(\cdot)^\top \hat{A}^{-1/2}(c_n P + I)^{-1}\hat{A}^{-1/2} k_{Y^n}.$$
Now, in order to proceed, we need to make sure that one of $\epsilon_n$ or $c_n$ is constant, while the other implicitly becomes a random variable and can only be treated as such. For simplicity we choose $c_n$ to be constant and $\epsilon_n$ to be defined implicitly, with $c_n$ chosen such that, in expectation, $\mathbb{E}(\epsilon_n) = \epsilon$. In this case, we have
$$\mathbb{E}[f_n^*] = k(\cdot)^\top \hat{A}^{-1/2}\left[\prod_{i=0}^{n}(c_i P + I)^{-1}\right]\hat{A}^{-1/2}[T_K y](x_{1:N}).$$
This follows from iteratively using Equation (25), similar to Kazashi and Nobile [12], e.g. for $n = 2$. Since $\lambda_n > 0$, the constraint is active at the optimum, i.e.
$$\frac{1}{2}\|f_n^*\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} f_n^*(Y_m^n) - \frac{\epsilon_n}{2} + \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\!\left[\hat{A}^{-1/2} k_{Y^n}\right] = 0.$$
Now, based on Equation (25), we need to evaluate $\frac{1}{2}\|f_n^*\|_{L^2_\mu(\mathcal{X})}^2$ and $\frac{1}{2}\|f_n^*\|_K^2$:
$$= k_{zY^n}^\top \operatorname{diag}\!\left(\frac{1}{\left(\frac{1}{c_n d_j} + 1\right)^2}\right) k_{zY^n}.$$
Thus, in analogy to Mobahi et al. [13], we can derive the following upper and lower bounds on $c_n$, which we evaluate here using the expected values, based on the discussion above:
$$\frac{\sqrt{\mathbb{E}(\epsilon_n)}}{d_{\max}\left(\sqrt{\mathbb{E}\!\left(k_{zY^n}^\top k_{zY^n}\right)} - \sqrt{\mathbb{E}(\epsilon_n)}\right)} \le c_n \le \frac{\sqrt{\mathbb{E}(\epsilon_n)}}{d_{\min}\left(\sqrt{\mathbb{E}\!\left(k_{zY^n}^\top k_{zY^n}\right)} - \sqrt{\mathbb{E}(\epsilon_n)}\right)}.$$
Therefore,
$$c_n = \frac{\sqrt{\mathbb{E}(\epsilon_n)}}{\alpha_n\left(\sqrt{\mathbb{E}\!\left(k_{zY^n}^\top k_{zY^n}\right)} - \sqrt{\mathbb{E}(\epsilon_n)}\right)}$$
for some $\alpha_n \in [d_{\min}, d_{\max}]$. Now, as mentioned above, we set $\mathbb{E}(\epsilon_n) = \epsilon$, i.e. the desired accuracy is achieved in expectation. Thus, we only need to calculate $\mathbb{E}\!\left(k_{zY^n}^\top k_{zY^n}\right)$:
$$\mathbb{E}\left[\hat{A}^{-1/2} k_Y\right]^\top\!\left[\hat{A}^{-1/2} k_Y\right] = \sum_{ij}\hat{A}^{-1}_{ij}\left(\mathbb{E}(K(x_i, Y))\,\mathbb{E}(K(x_j, Y)) + \frac{1}{M}\operatorname{Cov}(K(x_i, Y), K(x_j, Y))\right),$$
and thus
$$\mathbb{E}(K(x_{1:N}, Y^n))^\top \hat{A}^{-1}\,\mathbb{E}(K(x_{1:N}, Y^n)) = \left[\hat{A}^{-1/2}[T_K y](x_{1:N})\right]^\top\left[\prod_{i=0}^{n-1}(c_i P + I)^{-2}\right]\hat{A}^{-1/2}[T_K y](x_{1:N})$$
$$= \left[V\hat{A}^{-1/2}[T_K y](x_{1:N})\right]^\top \operatorname{diag}\!\left(\prod_{i=0}^{n-1}\frac{1}{(c_i d_j + 1)^2}\right) V\hat{A}^{-1/2}[T_K y](x_{1:N}).$$
$$v_n = (c_{n-1} D + I)^{-1} v_{n-1} = \left(\frac{1}{\alpha_n\left(\frac{\|v_{n-1}\|}{\sqrt{\epsilon}} - 1\right)} D + I\right)^{-1} v_{n-1},$$
with which we have $\sqrt{\mathbb{E}\!\left(k_{zY^n}^\top k_{zY^n}\right)} = \|v_n\|$. This in turn directly coincides with that in Mobahi et al. [13], replacing $D$ with its inverse. To be precise, the following follows exactly from Mobahi et al. [13]:
Proposition 3.1. For any $n \ge 0$, if $\|v_i\| > \sqrt{\epsilon}$ for $i = 0, \dots, n$, then $\|v_n\|$ is decreasing in $n$ and
$$\|v_n\| \ge a^n(\kappa)\,\|v_0\| - \sqrt{\epsilon}\, b(\kappa)\,\frac{a^n(\kappa) - 1}{a(\kappa) - 1},$$
where
$$a(x) \equiv \frac{(r_0 - 1)^2 + x(2r_0 - 1)}{(r_0 - 1 + x)^2}, \qquad b(x) \equiv \frac{r_0^2\, x}{(r_0 - 1 + x)^2}, \qquad r_0 \equiv \frac{\|v_0\|}{\sqrt{\epsilon}}, \qquad \kappa \equiv \frac{d_{\max}}{d_{\min}}.$$
This in the same way provides a lower bound on the number of generations until collapse occurs:
Proposition 3.2. Starting from $\|v_0\| > \epsilon$, meaningful (non-collapsing) training on generational data is possible for at least $\underline{n}$ generations, where
$$\underline{n} \equiv \frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1}{\kappa}.$$
During this period, the effect can be viewed as a sparsification of the underlying function, in the following sense. The expected density is
$$\mathbb{E}[f_n^*] = k(\cdot)^\top \hat{A}^{-1/2} V^\top\left[\prod_{i=0}^{n}(c_i D + I)^{-1}\right] V\hat{A}^{-1/2}[T_K y](x_{1:N}),$$
and, denoting $B_n = \prod_{i=0}^{n}(c_i D + I)^{-1}$, we have the following:
Proposition 3.3. Start from $\|v_0\| > \epsilon$ and let $n \le \frac{\|v_0\|/\sqrt{\epsilon} - 1}{\kappa}$. Then for $d_j$ and $d_k$ any pair of diagonal elements of $D$ with $d_k < d_j$, the following inequality holds:
$$\frac{B_{n-1}[k, k]}{B_{n-1}[j, j]} \ge \left(\frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_j}{d_{\max}}}{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_k}{d_{\max}}}\right)^n.$$
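A direct numerical check of this inequality (a sketch with hypothetical values of $D$, $v_0$ and $\epsilon$, and the choice $\alpha_i = d_{\max}$; not the paper's code) can be done as follows.

```python
import numpy as np

# Sketch with hypothetical toy values: iterate c_i (choosing alpha_i = d_max),
# form B_n = prod_i (c_i D + I)^{-1}, and compare the ratio
# B_n[k,k] / B_n[j,j] with the lower bound of Proposition 3.3.
d = np.array([2.0, 0.5])                      # d_j = 2.0 > d_k = 0.5
v = np.array([3.0, 3.0])                      # assumed v_0 in the eigenbasis of P
eps = 1e-2
r0 = np.linalg.norm(v) / np.sqrt(eps)

B_diag = np.ones_like(d)
n_steps = 5                                   # stays below (r0 - 1) / kappa here
for _ in range(n_steps):
    c = 1.0 / (d.max() * (np.linalg.norm(v) / np.sqrt(eps) - 1.0))
    step = 1.0 / (c * d + 1.0)
    B_diag *= step
    v = step * v

ratio = B_diag[1] / B_diag[0]                 # B_n[k,k] / B_n[j,j]
bound = ((r0 - 1 + d[0] / d.max()) / (r0 - 1 + d[1] / d.max())) ** n_steps
print(ratio, bound, ratio >= bound)           # the ratio respects the bound
```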
In total, the message so far can be summarised as follows:
When regularising density estimation with generational data, due to functional approx-
imation errors, the approximated distribution gets sparser over generations, even in
the limit of infinite data. As such, even with perfect information, we do not recover
the original distribution, and over time model collapse occurs.
Fig. 6: Example of a GMM fitted to the data at iterations {0, 50, 100, 150, 200, 350, 2000} (leftmost panel: real data). At first the model fits the data very well, as shown on the left; yet even at generation 50 the perception of the underlying distribution changes completely. At generation 2000 it converges to a state with very little variance. The GMM is sampled a thousand times.

$$T^k = \begin{pmatrix} Q^k & 0_{r\times s} \\ R + RQ + \dots + RQ^{k-1} & I_s \end{pmatrix} = \begin{pmatrix} Q^k & 0_{r\times s} \\ R\sum_{i=0}^{k-1} Q^i & I_s \end{pmatrix}. \qquad (28)$$

Finally, for an absorbing Markov chain¹ with $T = \begin{pmatrix} Q & 0_{r\times s} \\ R & I_s \end{pmatrix}$, we have
$$\lim_{k\to\infty} T^k = \begin{pmatrix} 0_{r\times r} & 0_{r\times s} \\ R(I_r - Q)^{-1} & I_s \end{pmatrix}.$$

¹ www.math.umd.edu/~immortal/MATH401/book/ch_absorbing_markov_chains.pdf
Since in the limit the transition probabilities to transient states are zero, we end up
converging to absorbing states and staying there. In the case of discrete distributions,
where we can perfectly approximate a zero-variance dataset (i.e. a delta function), the
absorbing states are delta functions centered at any non-zero probability point from
the original distribution. In practice, we would like to know the expected number of
steps before being absorbed, which may be large. But without knowing our fitting
procedure it is impossible to calculate the matrix Q and therefore the average length
of time before collapse.
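For a concrete (hypothetical) illustration of what such a calculation would look like if $Q$ were known, the expected number of steps before absorption follows from the fundamental matrix $(I - Q)^{-1} = \sum_k Q^k$; with the column convention used above it is given by the column sums.

```python
import numpy as np

# Hypothetical worked example (the matrix Q is made up, not from the paper):
# with the column convention T = [[Q, 0], [R, I_s]] used above, column j of Q
# holds the probabilities of moving from transient state j to each transient
# state. The expected number of steps before absorption, starting from
# transient state j, is the j-th column sum of the fundamental matrix.
Q = np.array([[0.90, 0.10],
              [0.05, 0.80]])                # transitions among 2 transient states
fundamental = np.linalg.inv(np.eye(2) - Q)  # (I - Q)^{-1} = sum_k Q^k
expected_steps = fundamental.sum(axis=0)    # column sums: mean time to absorption
print(expected_steps)                       # ~ [16.67, 13.33] for these values
```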
(a) Original model (b) Generation 5 (c) Generation 10 (d) Generation 20
Fig. 7: Random latent reconstructions from VAEs. No training data comes from the
original distribution. Over the generations, different modes of the original distribution
get entangled and generated data starts looking unimodal.
[Fig. 8 plot: x-axis: Generation (0 to 2000); y-axis: log scale; legend: sample sizes 100, 500, 1000, 10000, 50000, 200000.]
Fig. 8: Progressive fitting of a GMM with different numbers of samples. The y-axis shows the logarithm of the L2 distance between the two GMM distributions. Over the generations the distance begins to grow and can become quite large. The jumps in the distance for large sample sizes occur due to the fixed number of iterations and finite precision of the expectation–maximization algorithm.
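For reference, the self-consuming loop behind Figures 6 and 8 can be sketched as follows (assumed setup using scikit-learn's GaussianMixture and the two-component mixture of Fig. 2; hyperparameters are illustrative and not the paper's exact protocol).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch (assumed setup, not the paper's exact protocol) of the
# self-consuming GMM loop: fit a 2-component GMM, sample from the fit,
# and re-fit on the generated samples only.
rng = np.random.default_rng(4)
real = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])[:, None]

data = real
for gen in range(200):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    data, _ = gmm.sample(len(real))        # next generation trains only on generated data
    if gen % 50 == 0:
        print(gen, gmm.means_.ravel().round(2), gmm.covariances_.ravel().round(3))
```

Computing a distance between the fitted density and the original mixture at each generation is then what Fig. 8 reports.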
[3D histogram panels: "Perplexity of generated datapoints evaluated by model trained with real wikitext2"; axes: Perplexity, Generation, Probability; legend: Generations 0, 1, 2, 3, 5, 9. (a) Figure 1b from main text in 3D, no data preserved. (b) Figure 1c from main text in 3D, 10% original data preserved.]
Fig. 10: Histogram of perplexities of each individual training sequence produced by different generations, as evaluated by the very first model trained with the real data. Over the generations, models tend to produce samples that the original model (trained with real data) is more likely to produce. At the same time, a much longer tail appears for later generations: later generations start producing samples that would never be produced by the original model, i.e. they start misperceiving reality based on errors introduced by their ancestors.
References
[1] Ven, G.M., Tolias, A.S.: Three scenarios for continual learning. arXiv preprint
arXiv:1904.07734 (2019)
[2] Aljundi, R., Kelchtermans, K., Tuytelaars, T.: Task-free continual learning. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11254–11263 (2019)
[3] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017)
[4] Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory
aware synapses: Learning what (not) to forget. In: Proceedings of the European
Conference on Computer Vision (ECCV), pp. 139–154 (2018)
[5] Li, Z., Hoiem, D.: Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), 2935–2947 (2017)
[6] Biggio, B., Nelson, B., Laskov, P.: Poisoning attacks against support vector machines. In: Proceedings of the 29th International Conference on Machine Learning. ICML'12, pp. 1467–1474. Omnipress, Madison, WI, USA (2012)
[7] Gu, T., Dolan-Gavitt, B., Garg, S.: Badnets: Identifying vulnerabilities in the
machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017)
[8] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International Conference on Machine Learning,
pp. 8748–8763 (2021). PMLR
[9] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models
are few-shot learners. Advances in neural information processing systems 33,
1877–1901 (2020)
[10] Carlini, N., Terzis, A.: Poisoning and backdooring contrastive learning. In: Inter-
national Conference on Learning Representations (2022). https://openreview.
net/forum?id=iC4UHbQ01Mp
[11] Carlini, N., Jagielski, M., Choquette-Choo, C., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., Tramèr, F.: Poisoning web-scale training datasets is practical. In: 2024 IEEE Symposium on Security and Privacy (SP), pp. 175–175. IEEE Computer Society, Los Alamitos, CA, USA (2024). https://doi.org/10.1109/SP54263.2024.00179. https://doi.ieeecomputersociety.org/10.1109/SP54263.2024.00179
[12] Kazashi, Y., Nobile, F.: Density estimation in RKHS with application to Korobov spaces in high dimensions. SIAM Journal on Numerical Analysis 61(2), 1080–1102 (2023)
[13] Mobahi, H., Farajtabar, M., Bartlett, P.: Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems 33, 3351–3361 (2020)
[14] Fischer, A., Gaunt, R., Sarantsev, A.: The variance-gamma distribution: A review.
Statistical Science (2024)
[15] Cochran, W.G.: The distribution of quadratic forms in a normal system, with
applications to the analysis of covariance. Mathematical Proceedings of the
Cambridge Philosophical Society 30(2), 178–191 (1934) https://doi.org/10.1017/
S0305004100016595
[16] Gelbrich, M.: On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten 147(1), 185–203 (1990)
[17] Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A.I., Babaei, H.,
LeJeune, D., Siahkoohi, A., Baraniuk, R.: Self-consuming generative models go
MAD. In: The Twelfth International Conference on Learning Representations
(2024). https://openreview.net/forum?id=ShjMHfmPs0
[19] Ciarlet, P.G.: Linear and nonlinear functional analysis with applications (2013)
[21] LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST Database of Handwritten Digits (1998). http://yann.lecun.com/exdb/mnist/