
https://doi.org/10.1038/s41586-024-07566-y

Supplementary information

AI models collapse when trained on recursively generated data

In the format provided by the authors and unedited

Nature | www.nature.com/nature
AI Models Collapse When Trained on Recursively
Generated Data: Supplementary Material

1 Related work
In this section we cover the two concepts from the existing literature that are closest to model collapse: catastrophic forgetting and data poisoning. Neither fully explains the phenomenon of model collapse, as the setting is fundamentally different, but both provide another perspective on the observed phenomenon.

1.1 Continual learning and catastrophic forgetting


Unlike traditional machine learning, which seeks to learn from a static data distribution, continual learning attempts to learn from a dynamic one, where data are supplied in a sequential fashion [1]. This tends to be task-based, where data are provided with delineated task boundaries, e.g. classifying dogs from cats and recognising handwritten digits. Our work is more similar to task-free continual learning [2], where data distributions gradually change without the notion of separate tasks. Our work examines a particular scenario in which the changed data distributions arise from the model itself, as a result of training in the previous iteration. A typical challenge in continual learning is that the model forgets previous samples when learning new information; this is known as catastrophic forgetting [3]. A typical way of preventing it is to use regularisation (e.g. Memory Aware Synapses [4]) or simply to rely on data (e.g. Learning without Forgetting [5]). This has an indirect connection to our work, yet differs, since the data in the process of model collapse are generated by different generations of models.

Fig. 1: Changing distribution of latents over the learning process with generated data, as perceived by the original encoder (density over latent value, one curve per generation 0-9). Just as with the Gaussian case described above, the tails get washed away and the model arrives at the mean representation of the underlying data.

1.2 Data poisoning
Poisoning attacks are crafted and inserted during training in order to degrade the model's performance when deployed [6]. Malicious data can be inserted into training data to induce unintended behaviours that can be activated by special triggers [7]. The early literature on data poisoning focused mainly on supervised learning, where classifiers are trained with labelled samples. However, with the emergence of contrastive learning [8] and LLMs [9], more recent models are trained on large-scale web crawls, and these untrustworthy web sources make data poisoning attacks more feasible. Recent studies have demonstrated that web-scale datasets can be poisoned by introducing malicious data into a small percentage of samples [10, 11].

Example of text outputs of an OPT-125m model affected by model collapse – models degrade over generations, where each new generation is trained on data produced by the previous generation.

Input: some started before 1360 — was typically accomplished by a master mason
and a small team of itinerant masons, supplemented by local parish labourers,
according to Poyntz Wright. But other authors reject this model, suggesting
instead that leading architects designed the parish church towers based on early
examples of Perpendicular
Outputs:
Gen 0: Revival architecture such as St. John’s Cathedral in London. The earliest
surviving example of Perpendicular Revival architecture is found in the 18th
@-@ century Church of Our Lady of Guernsey, which dates from the late 19th
century. There are two types of perpendicular churches : those
Gen 1: architecture such as St. Peter’s Basilica in Rome or St. Peter’s Basilica in
Buenos Aires. There is no evidence that any of these buildings were built during
the reign of Pope Innocent III, but it is possible that they may have been built
during the reign of his successor, Pope Innocent
Gen 3: architecture such as St. Peter’s Basilica ( now part of the Church of Jesus
Christ of Latter @-@ day Saints ) or St. Peter’s Basilica ( now part of the Church
of Jesus Christ of Latter @-@ day Saints ). There is no evidence that any of
these buildings
Gen 5: ism, which had been translated into more than 100 languages including
English, French, German, Italian, Spanish, Portuguese, Dutch, Swedish, Nor-
wegian, Polish, Hungarian, Slovak, Lithuanian, Estonian, Finnish, Romanian,
Bulgarian, Turkish, Croatian, Serbian, Ukrainian, Russian, Kazakh, Kyrgyz
Gen 7: architecture in England. In an interview with The New York Times, Wright
said : ” I don ’t think there is anything wrong with me being able to do what
I want to do. It just doesn ’t work for me. ” He added : ” I don ’t know if you
can call it funny,
Gen 9: architecture. In addition to being home to some of the world’s largest
populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue
@-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-

2 Regularized Density Estimation in Hilbert Space
Here we consider a model of density estimation in a reproducing kernel Hilbert space (RKHS) presented by Kazashi and Nobile [12]. This choice is motivated by the similarity of model collapse to self-distillation; what is more, in this setting a closed-form solution can be derived. To be precise, Mobahi et al. [13] analysed the effect of multi-stage regularised regression in an RKHS – we show that, similarly, in expectation the density estimator collapses onto the topmost directions in the data.

Let $(\mathcal{X}, \mathcal{B}, \mu)$ be a measure space, and let $(\mathcal{N}_K, \langle\cdot,\cdot\rangle_K, \|\cdot\|_K)$ denote the RKHS associated with the positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The full assumptions and the corresponding derivations are provided in Section 3.2.1. We further define the integral operator $T_K : L^2_\mu(\mathcal{X}) \to L^2_\mu(\mathcal{X})$ by $T_K g := \int_{\mathcal{X}} K(\cdot, x) g(x)\, d\mu(x)$ for $g \in L^2_\mu(\mathcal{X})$. Let $Y_1, \ldots, Y_M : \Omega \to \mathcal{X}$ be independent random variables that follow the distribution defined by a density $y \in \mathcal{N}_K$ with respect to $\mu$. Similar to Kazashi and Nobile [12], we approximate $y$ in the form $y(\cdot) \approx \sum_{n=1}^{N} a_n K(x_n, \cdot)$ for $X = \{x_1, \ldots, x_N\} \subset \mathcal{X}$ a set of specifically chosen points, defining the approximation space $V_N := \operatorname{span}\{K(x_j, \cdot) \mid j = 1, \ldots, N\}$. The idealised regularised density estimation would consist of solving the following constrained optimisation problem:

$$\arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u - y\|_{L^2_\mu(\mathcal{X})}^2 \leq \frac{\epsilon_n}{2}.$$

However, in the density estimation setting we do not have access to the true density, but only to samples. Therefore the problem we consider is a sample approximation of the ideal problem:

$$f_n^* = \arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) + \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top \left[\hat{A}^{-1/2} k_{Y^n}\right] \leq \frac{\epsilon_n}{2},$$

where we define $\hat{A}_{jk} \equiv \langle K(x_j, \cdot), K(x_k, \cdot)\rangle_{L^2_\mu(\mathcal{X})}$, $A_{jk} \equiv K(x_j, x_k)$, the positive definite $P \equiv \hat{A}^{-1/2} A \hat{A}^{-1/2}$, $k(\cdot)_j \equiv K(x_j, \cdot)$, and similarly $k(Y)_j \equiv \frac{1}{M}\sum_{m=1}^{M} K(x_j, Y_m)$, for $j = 1, \ldots, N$. As $P$ is symmetric positive definite, we can write it as $P = V^\top D V$, with $D$ diagonal. Let $d_{\min}$ and $d_{\max}$ denote its smallest and largest eigenvalues, and let $B_n = \prod_{i=0}^{n} (c_i D + I)^{-1}$. Now, considering the process of learning with generational data, we have $Y^{n+1} \sim f_n^*$ and $f_0^* \equiv y$, which lets us derive the following:

Theorem 2.1. Starting with $v_0 = V \hat{A}^{-1/2} [T_K y](x_{1:N})$, such that $\|v_0\| > \sqrt{\epsilon}$, and with $M \to \infty$, the process does not collapse for at least $\underline{n}$ generations, where

$$\underline{n} = \frac{d_{\min}}{d_{\max}}\left(\frac{\|v_0\|}{\sqrt{\epsilon}} - 1\right).$$

In this case, for $n \leq \underline{n}$, the average density estimate can be written exactly as

$$\mathbb{E}[f_n^*] = \left[V \hat{A}^{-1/2} k(\cdot)\right]^\top B_n \left[V \hat{A}^{-1/2} [T_K y](x_{1:N})\right]. \tag{1}$$

In particular, for $d_j$ and $d_k$ a pair of diagonal elements of $D$ with $d_k < d_j$, the following inequality holds:

$$\frac{B_{n-1}[k,k]}{B_{n-1}[j,j]} \geq \left(\frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_j}{d_{\max}}}{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_k}{d_{\max}}}\right)^{n}.$$
This, similarly to Mobahi et al. [13], tells us that the process of model collapse in this case progressively limits the number of basis functions used to represent the density in the average-case scenario, considered in the limit of infinite samples. This model, for the sake of simplicity, has been stated with M → ∞. As such, Theorem 2.1 portrays the process of model collapse occurring due to the combination of functional expressivity and functional approximation errors. The model similarly allows for analysis in the case of finite M, but that is left for future work.
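As a quick numerical illustration of this sparsification, the following sketch (in Python; the diagonal of $D$ and the constant regularisation value $c_i = c$ are hypothetical choices, not taken from the paper) tracks the diagonal of $B_n = \prod_{i=0}^{n}(c_i D + I)^{-1}$ and shows how, over generations, the relative weight of directions with larger eigenvalues decays geometrically, leaving ever fewer basis functions with significant weight.

```python
import numpy as np

# Hypothetical eigenvalues of D and a constant c_i = c (illustrative values only).
d = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
c = 0.3

B_diag = np.ones_like(d)
for n in range(1, 51):
    B_diag *= 1.0 / (c * d + 1.0)          # diagonal of B_n = prod_i (c_i D + I)^{-1}
    if n in (1, 10, 50):
        # Relative weight of each direction; it concentrates on the smallest d_j.
        print(n, np.round(B_diag / B_diag.max(), 6))
```

For constant $c$, the ratio between two diagonal entries is $B_n[k,k]/B_n[j,j] = \left((c d_j + 1)/(c d_k + 1)\right)^{n+1}$ for $d_k < d_j$, which grows geometrically in $n$, in line with the inequality above.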

3 Theoretical Intuition: Expanded


In this section we provide a more in-depth theoretical intuition for the phenomenon of
model collapse. We argue that the process of model collapse is universal among gen-
erative models that recursively train on data generated by previous generations. We
illustrate this using simple mathematical models that allow for analytical expressions
of quantities of interest, while also portraying the phenomenon of model collapse. In
the main section we examined three models: a discrete distribution in the absence
of functional expressivity and approximation errors; a multi-dimensional Gaussian
approximation, portraying joint functional expressivity and statistical errors; and den-
sity estimation in Hilbert Spaces, illustrating all three errors jointly. Here, we discuss
in more depth the model of a continuous single dimensional Gaussian, a noisy mul-
tidimensional model, and density estimation in a reproducing kernel Hilbert space.

3.1 Continuous random variables


3.1.1 1D Gaussian
In this subsection, we consider learning data coming from a single dimensional Gaussian $X^0 \sim \mathcal{N}(\mu, \sigma^2)$. We estimate the density using the sample mean and variance and fitting a single dimensional Gaussian:

$$\mu_{i+1} = \frac{1}{M_i}\sum_j X_j^i; \qquad \sigma_{i+1}^2 = \frac{1}{M_i - 1}\sum_j \left(X_j^i - \mu_{i+1}\right)^2. \tag{2}$$

Note here that, if we were to use maximum likelihood estimation, we would instead arrive at a biased variance estimator.

Fig. 2: Shown in the middle is a histogram of samples from a Gaussian mixture with means (−4, 4) and variances of 1. To the left of it is a similar distribution, but with 'fatter' tails, and on the right the same histograms are shown, but with low-probability events being cut off due to finite resampling. Although distributions 1 and 2 are very different, when resampled (only assuming the expected behaviour), the tails get cut off, leading to the same observed distribution. In this case this is all states with probability less than 1/M, or equivalently, bins with log Count ≤ log M. (Panels: Real distribution 1, Real distribution 2, Resampled 1 and 2; axes: log(Count) versus value.)

With these estimates, the functional approximation step simply corresponds to considering a normal distribution with these parameters, which we can sample from:

$$X_j^{i+1} \mid \mu_{i+1}, \sigma_{i+1} \sim \mathcal{N}\!\left(\mu_{i+1}, \sigma_{i+1}^2\right). \tag{3}$$

This provides us with the conditional distribution of $X_j^{i+1}$, which allows us to calculate its full distribution. From Equation (4), we see that even after the first approximation the distribution of $X_j^i$ is no longer normal; it follows a variance-gamma distribution [14]. However, instead of writing the probability density function at each generation, we can explicitly construct the variables in terms of independent random variables. In particular, it is well known [15] that $\mu_1$ and $\sigma_1$ are independent, with $\mu_1 \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{M_0}\right)$ and $(M_0 - 1)\,\sigma_1^2 \sim \sigma^2\, \Gamma\!\left(\frac{M_0 - 1}{2}, \frac{1}{2}\right)$. In what follows we denote by $Z$ random variables distributed as $\mathcal{N}(0, 1)$ and by $S^i$ random variables distributed as $\frac{1}{M_{i-1}-1}\Gamma\!\left(\frac{M_{i-1}-1}{2}, \frac{1}{2}\right)$:

$$X_j^0 = \mu + \sigma Z_j^0; \qquad X_j^1 = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \sigma\sqrt{S^1}\, Z_j^1; \qquad \ldots \tag{4}$$

$$X_j^n = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \frac{\sigma}{\sqrt{M_1}}\sqrt{S^1}\, Z^2 + \cdots + \frac{\sigma}{\sqrt{M_{n-1}}}\sqrt{S^1 \times \cdots \times S^{n-1}}\, Z^n + \sigma\sqrt{S^1 \times \cdots \times S^n}\, Z_j^n. \tag{5}$$

Fig. 3: Recursive fitting-sampling of a 1D Gaussian (µ = 0, σ = 1) with different numbers of samples drawn (100 to 10000000); panel (a) shows the mean estimate and panel (b) the standard deviation estimate over generations. We find that, unless a very large number of samples is drawn (on the order of 100000 or more), both standard deviation and mean get significantly affected. Here we report a single run; while re-running the experiment changes the initial performance, both µ and σ drift over time. The overall graph looks quite similar to that of a Gaussian random walk.

Note that these do not describe joint distributions, as $Z^n$ and $S^n$ depend directly on $Z_j^{n-1}$; but when considering $X_j^n$ on its own, the formula above provides all the information about its full distribution.

The first thing we may try calculating is the variance. It is possible to find its exact value, but the mean and variance of the square root of a gamma-distributed variable are expressed in terms of gamma functions, making the result quite clunky. In what follows, we will expand everything to second order in each of the $1/M_i$, as we assume each sample size to be large (in practice this becomes quite accurate for $M \sim 100$ and above). We then find that

$$\frac{1}{\sigma^2}\operatorname{Var}\!\left(X_j^n\right) = \frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_{n-1}} + 1 + O(2).$$

And if we were to assume that $M_i = M$ is constant, we would find that

$$\operatorname{Var}\!\left(X_j^n\right) = \sigma^2\left(1 + \frac{n}{M}\right); \qquad \mathbb{E}\!\left(X_j^n\right) = \mu.$$

This means that as n → ∞ the variance diverges linearly, which is the same scaling as for a single dimensional Gaussian random walk. We can further see the similarities in the numerical experiments shown in Figure 3 for a range of different sample sizes, confirming these theoretical intuitions.
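These intuitions are easy to reproduce numerically. The sketch below is a minimal illustration (the sample sizes and generation counts are arbitrary choices, not the exact settings behind Figure 3): it repeats the fit-sample loop of Equations (2) and (3), prints how far µ and σ drift in a single run, and checks the marginal variance of $X_j^n$ against the $\sigma^2(1 + n/M)$ prediction across independent runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_sample_chain(mu=0.0, sigma=1.0, M=1000, generations=1000):
    """Repeatedly draw M samples from the current Gaussian fit and refit it
    with the unbiased estimators of Equation (2), as in Equation (3)."""
    for _ in range(generations):
        x = rng.normal(mu, sigma, size=M)
        mu, sigma = x.mean(), x.std(ddof=1)
    return mu, sigma

# Drift of the fitted parameters in a single run, for a few sample sizes.
for M in (100, 1000, 10000):
    mu_n, sigma_n = fit_sample_chain(M=M)
    print(f"M={M}: |mu_n - 0|={abs(mu_n):.3f}, |sigma_n - 1|={abs(sigma_n - 1):.3f}")

# Marginal variance of X_j^n over independent chains vs. sigma^2 (1 + n/M).
M, n, chains = 100, 20, 2000
draws = np.array([rng.normal(*fit_sample_chain(M=M, generations=n))
                  for _ in range(chains)])
print(draws.var(), 1 + n / M)   # approximately equal for large M
```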
Even though the variance of $X_j^n$ diverges, this does not provide us with any information about what the corresponding estimates of $\mu_{n+1}$ and $\sigma_{n+1}^2$ are, or how far they are from the original $\mu$ and $\sigma$. In particular, we may want to consider what the distance would be between the true distribution and the approximated distribution at step $n+1$. To measure this we can consider the Wasserstein-2 distance between two normals:

$$R_{W_2}^{n+1} := W_2^2\!\left(\mathcal{N}(\mu, \sigma^2), \mathcal{N}(\mu_{n+1}, \sigma_{n+1}^2)\right) = \|\mu_{n+1} - \mu\|^2 + \|\sigma_{n+1} - \sigma\|^2.$$

Now we can calculate the risk that occurs due to finite sampling, i.e. what the expected value of the distance is (expanding in $1/M_i$):

$$\mathbb{E}_{\mu_{n+1}, \sigma_{n+1}^2}\!\left[R_{W_2}^{n+1}\right] = \frac{3}{2}\sigma^2\left(\frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_n}\right) + O(2), \tag{6}$$

$$\operatorname{Var}_{\mu_{n+1}, \sigma_{n+1}^2}\!\left[R_{W_2}^{n+1}\right] = \frac{1}{2}\sigma^4\left(\frac{3}{M_0^2} + \frac{3}{M_1^2} + \cdots + \frac{3}{M_n^2} + \sum_{i \neq j}\frac{4}{M_i M_j}\right) + O(3). \tag{7}$$
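Equation (6) is straightforward to check by Monte Carlo for a single generation ($n = 0$): over many independent batches of size $M_0$, the average of $(\mu_1 - \mu)^2 + (\sigma_1 - \sigma)^2$ should be close to $\tfrac{3}{2}\sigma^2/M_0$. A minimal sketch (batch size and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, M0, trials = 0.0, 1.0, 500, 20_000

x = rng.normal(mu, sigma, size=(trials, M0))
mu1 = x.mean(axis=1)                    # sample means of each batch
sigma1 = x.std(axis=1, ddof=1)          # unbiased sample standard deviations

risk = (mu1 - mu) ** 2 + (sigma1 - sigma) ** 2    # W_2^2 between the two normals
print(risk.mean(), 1.5 * sigma**2 / M0)           # agree up to O(1/M0^2) terms
```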

This result allows us to interpret exactly what occurs in this formulation of model collapse. To be precise, due to errors occurring from re-sampling the approximated distribution, each generation ends up corresponding to a new step in a random walk of model parameters. The risk in this model ends up diverging for a constant sample size at each generation. In order for the end distribution approximation to be accurate, and for the distance to be finite, the sampling rate $M_i$ needs to increase superlinearly, i.e. one needs to collect increasingly more samples over time, perhaps quadratically. However, even in that case the expected distance after $n$ steps remains non-zero, and the only case in which it does end up being 0 is when sampling is infinite at each step. Overall, this only shows us how far on average we move away from the original distribution, but the process can only 'terminate' if the estimated variance at a certain generation becomes small enough, i.e. if we effectively turn into a delta function. Figures 4 and 5 show different runs of this process for different values of the parameters $\alpha_i, \beta_i, \gamma_i$, which control how data from different generations are mixed, and for different sample sizes; this was investigated numerically to see whether such mixing can be enough to overcome model collapse. Even in those cases the changes are inevitable, although attenuated. The scaling of the divergence is the same as that of a single dimensional Gaussian random walk, and the numerical experiments shown in Figure 3 for a range of sample sizes confirm these theoretical intuitions.

3.1.2 Multidimensional random variable


With the simple example of a 1D Gaussian out of the way, we can now construct a lower bound on the distance of the generation-$n$ distribution from the original and show that, without superlinear sampling, it similarly diverges in the limit of large $n$. A nice property of the Wasserstein-2 distance is that Gaussians provide a universal lower bound for it [16]. In particular, for probability measures $\kappa$ and $\nu$ on a Euclidean $N$-dimensional space with means $\mu_\kappa$ and $\mu_\nu$ and covariance matrices $\Sigma_\kappa$ and $\Sigma_\nu$, we have that

$$W_2^2(\kappa, \nu) \geq \|\mu_\kappa - \mu_\nu\|^2 + \operatorname{Tr}\!\left(\Sigma_\kappa + \Sigma_\nu - 2\left(\Sigma_\kappa^{1/2} \Sigma_\nu \Sigma_\kappa^{1/2}\right)^{1/2}\right) \geq \|\mu_\kappa - \mu_\nu\|^2.$$
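For completeness, this lower bound is easy to evaluate numerically from means and covariances alone; a minimal sketch (assuming SciPy is available for the matrix square root) is given below.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_lower_bound(mu_k, cov_k, mu_v, cov_v):
    """Gelbrich lower bound on W_2^2(kappa, nu) computed from the first two moments."""
    cross = sqrtm(sqrtm(cov_k) @ cov_v @ sqrtm(cov_k))
    bures = np.trace(cov_k + cov_v - 2.0 * np.real(cross))
    return float(np.sum((np.asarray(mu_k) - np.asarray(mu_v)) ** 2) + bures)

# Example with two 2-D Gaussians (arbitrary numbers).
print(gaussian_w2_lower_bound(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), 2 * np.eye(2)))
```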

Fig. 4: Recursive fitting-sampling of a 1D Gaussian (µ = 0, σ = 1) with different numbers of samples drawn (100, 500, 1000, 10000); panel (a) shows the mean estimate and panel (b) the standard deviation. In this plot data get accumulated in a pool, from which a fixed-size sample is drawn. In other words, data are sampled from model n, its output is mixed with data sampled from models 1 . . . n, and then the mix is subsampled to fit model n + 1. The uncertainty arising from all of the different modalities appearing in the data causes the distribution parameters to jump around quite significantly.

Fig. 5: Recursive fitting-sampling of a 1D Gaussian (µ = 0, σ = 1) with different numbers of samples drawn (100, 500, 1000, 10000); panel (a) shows the mean estimate and panel (b) the standard deviation. In this plot data are accumulated in a pool, all of which is used to fit a model. In other words, data are sampled from model n, its output is mixed with data sampled from models 1 . . . n, and then the full result is used to fit model n + 1. Over time the variance in the estimates reduces due to the linear growth of the amount of data.

With this, instead of quantifying the distance exactly, we can instead lower bound it. The only limitation is that we have to specify a functional approximation model. In order to achieve a $W_2$ bound, we will be required to specify how the mean changes between generations. In the scenario where we only have access to the sample mean, we would approximate the mean of the next-generation distribution as in Equation (2). However, as more information arrives, or the model begins using it better, we may end up diverging from the sample mean. We still require that the model have good performance, i.e. that on average the mean estimate is the same. We also have to specify the expected behaviour of the model for the variance calculation, which once again will be chosen such that it averages out. Thus, we will adopt the following evolution over generations:
$$\mu_{i+1} = \frac{1}{M_i}\sum_j X_j^i + \varepsilon_{i+1} = \frac{\Sigma_i^{1/2}}{\sqrt{M_i}}\, T^{i+1} + \mu_i + \varepsilon_{i+1}; \qquad \mathbb{E}_{X_j^i}(\Sigma_{i+1}) = \Sigma_i, \tag{8}$$

where we define $T^{i+1}$ to satisfy the equation above, i.e. $T^{i+1} = \frac{\Sigma_i^{-1/2}}{\sqrt{M_i}}\sum_j\left(X_j^i - \mu_i\right)$. With this normalisation, $T^{i+1}$ has mean 0 and covariance $I_N$, and by the central limit theorem (CLT) we would have $T^{i+1} \mid \mu_i, \Sigma_i \xrightarrow{D} \mathcal{N}(0, I_N)$; however, the lower bound will not rely on this. To arrive at a lower bound for the risk, similar to that of Equation (6), we are going to have to make an assumption about the form of $\varepsilon_{i+1}$. Overall the result can be summarised in the following way:
Theorem 3.1 (Mean and Covariance-consistent Generalisation Error). Let the model be updated between generations in a way such that its mean and covariance matrix are on average consistent, i.e. satisfy

$$\mu_{i+1} = \frac{1}{M_i}\sum_j X_j^i + \varepsilon_{i+1}\!\left(X_j^i\right); \qquad \mathbb{E}_{X_j^i}(\Sigma_{i+1}) = \Sigma_i; \qquad \mathbb{E}[\varepsilon_{i+1} \mid \mu_i, \Sigma_i] = 0, \tag{9}$$

with $\varepsilon_{i+1}$ denoting potential deviations from the usual unbiased estimators due to some implicit bias of the model or the optimisation routine. Then:

(a) If $\operatorname{Cov}(\varepsilon_{i+1}, \mu_{i+1} \mid \mu_i, \Sigma_i) = 0$:

$$\mathbb{E}\!\left[R_{W_2}^{n+1}\right] \geq \operatorname{Tr}\Sigma \sum_{i=0}^{n}\frac{1}{M_i} + \sum_{i=1}^{n+1}\mathbb{E}\!\left[\|\varepsilon_i\|^2\right]. \tag{10}$$

(b) If there exists a constant $K > 0$ such that $\|\varepsilon_{i+1}\| \leq K/M_i$:

$$\mathbb{E}\!\left[R_{W_2}^{n+1}\right] \geq \operatorname{Tr}\Sigma \sum_{i=0}^{n}\frac{1}{M_i} + \sum_{i=1}^{n+1}\mathbb{E}\!\left[\|\varepsilon_i\|^2\right] - 2K\sqrt{N\operatorname{Tr}\Sigma}\sum_{i=0}^{n}\frac{1}{M_i^{3/2}} \tag{11}$$
$$= \operatorname{Tr}\Sigma \sum_{i=0}^{n}\frac{1}{M_i} + \sum_{i=1}^{n+1}\mathbb{E}\!\left[\|\varepsilon_i\|^2\right] + O(3/2).$$

Let us first summarise the assumptions in more detail:

Assumptions:
[1] On average we capture the mean to be the same as at the prior iteration:
$$\mathbb{E}[\varepsilon_{i+1} \mid \mu_i, \Sigma_i] = 0. \tag{12}$$
[2] Given all of $X_j^i$, $\varepsilon$ must be constant, i.e. it is a function of only the data:
$$\varepsilon_{i+1} = \varepsilon_{i+1}\!\left(X_j^i\right). \tag{13}$$
    In particular, it depends on $\mu_i$ and $\Sigma_i$ only through the data.
[3A] The extra noise is orthogonal to the sample mean in the sense of random variables. This effectively assumes that the noise does not contain any first-moment information, i.e. we have
$$\operatorname{Cov}(\varepsilon_{i+1}, T^{i+1} \mid \mu_i, \Sigma_i) = 0. \tag{14}$$
    This may seem like a rather strong assumption compared to the previous ones; however, this property can be shown to hold when imposing the CLT on $T^{i+1}$ in the limit of large $M_i$ (note that $M_i$ can only be assumed to be large, not infinite) and assuming that $\varepsilon$ is strictly a function of moments higher than the first. Furthermore, a property of this type is necessary to provide any information at all, since without it there would be no need to separate the estimate into an $\varepsilon$ term and a sample-mean term, as everything could be absorbed into $\varepsilon$.
[3B] The extra noise is assumed to be bounded and of higher order in $1/M_i$ than the sample-mean deviation. To be precise, there is a constant $K$ (not dependent on the generation $i$) such that for all $i$:
$$\|\varepsilon_{i+1}\| \leq \frac{K}{M_i}. \tag{15}$$

Assumption 3A and Assumption 3B are two inherently different assumptions.


Depending on the setting considered, one or the other may be more realistic. With
these, we can now prove Theorem 3.1.
Proof of Theorem 3.1. With all the assumptions in place, we now have the following bound:

$$\mathbb{E}\!\left[R_{W_2}^{i+1}\right] \geq \mathbb{E}\!\left[\|\mu_{i+1} - \mu\|^2\right] \tag{16}$$
$$= \mathbb{E}\!\left[\|\mu_i - \mu\|^2\right] + \mathbb{E}\!\left[\|\varepsilon_{i+1}\|^2\right] + \frac{1}{M_i}\mathbb{E}\!\left[(T^{i+1})^\top \Sigma_i\, T^{i+1}\right] \tag{17}$$
$$\quad + \frac{2}{\sqrt{M_i}}\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2}\, T^{i+1} + (\mu_i - \mu)^\top \Sigma_i^{1/2}\, T^{i+1}\right] \tag{18}$$
$$= \mathbb{E}\!\left[\|\mu_i - \mu\|^2\right] + \frac{\operatorname{Tr}\Sigma}{M_i} + \mathbb{E}\!\left[\|\varepsilon_{i+1}\|^2\right] + \frac{2}{\sqrt{M_i}}\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2}\, T^{i+1}\right], \tag{19}$$

where the cross terms involving $(\mu_i - \mu)$ vanish because $\mathbb{E}[T^{i+1} \mid \mu_i, \Sigma_i] = 0$ and $\mathbb{E}[\varepsilon_{i+1} \mid \mu_i, \Sigma_i] = 0$.

(a) Now the only quantity left to evaluate is

$$\frac{2}{\sqrt{M_i}}\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2}\, T^{i+1}\right] = \frac{2}{\sqrt{M_i}}\int d\Sigma_i\, p(\Sigma_i)\, \operatorname{Tr}\!\left[\Sigma_i^{1/2}\operatorname{Cov}(\varepsilon_{i+1}, T^{i+1} \mid \Sigma_i)\right] = 0, \tag{20}$$

by Assumption 3A. Therefore, the overall bound is similar to the Gaussian case, but with extra noise variance terms:

$$\mathbb{E}_{\mu_{n+1}, \sigma_{n+1}^2}\!\left[R_{W_2}^{n+1}\right] \geq \operatorname{Tr}\Sigma\left(\frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_n}\right) + \sum_{i=1}^{n+1}\mathbb{E}\!\left[\|\varepsilon_i\|^2\right]. \tag{21}$$

(b) Now, using Assumption 3B, we can start from

$$\mathbb{E}\!\left[R_{W_2}^{i+1}\right] \geq \mathbb{E}\!\left[\|\mu_i - \mu\|^2\right] + \frac{\operatorname{Tr}\Sigma}{M_i} + \mathbb{E}\!\left[\|\varepsilon_{i+1}\|^2\right] + \frac{2}{\sqrt{M_i}}\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2}\, T^{i+1}\right]. \tag{22}$$

Similar to before, we need to evaluate (which we instead bound this time):

$$\frac{2}{\sqrt{M_i}}\mathbb{E}\!\left[(\varepsilon_{i+1})^\top \Sigma_i^{1/2}\, T^{i+1}\right] = \frac{2}{\sqrt{M_i}}\int d\Sigma_i\, p(\Sigma_i)\, \operatorname{Tr}\!\left[\Sigma_i^{1/2}\operatorname{Cov}(\varepsilon_{i+1}, T^{i+1} \mid \Sigma_i)\right] \neq 0$$
$$\geq -\frac{2\sqrt{N}}{\sqrt{M_i}}\int d\Sigma_i\, p(\Sigma_i)\,\sqrt{\operatorname{Tr}\!\left[\Sigma_i\, \Sigma_{\epsilon_{i+1}}\right]} \geq -\frac{2\sqrt{N}}{\sqrt{M_i}}\sqrt{\mathbb{E}\!\left[\varepsilon_{i+1}^\top \Sigma_i\, \varepsilon_{i+1}\right]} \geq -\frac{2\sqrt{N}}{\sqrt{M_i}}\sqrt{\frac{K^2\operatorname{Tr}\Sigma}{M_i^2}} = \frac{-2K\sqrt{N}}{M_i\sqrt{M_i}}\sqrt{\operatorname{Tr}\Sigma},$$

where we used the Cauchy-Schwarz and Jensen inequalities. Note that this is far from an optimal inequality, since instead of using the expected value of the largest eigenvalue, we bounded it by $\operatorname{Tr}\Sigma$. In particular, the per-step bound is then:

$$\mathbb{E}\!\left[R_{W_2}^{i+1}\right] \geq \mathbb{E}\!\left[\|\mu_i - \mu\|^2\right] + \frac{\operatorname{Tr}\Sigma}{M_i} + \mathbb{E}\!\left[\|\varepsilon_{i+1}\|^2\right] - \frac{2K\sqrt{N}}{M_i\sqrt{M_i}}\sqrt{\operatorname{Tr}\Sigma}. \tag{23}$$

Without knowledge of the specific values of $K$, $N$ or $\operatorname{Tr}\Sigma$, the best we can do is consider what this means for the bound as $M_i$ becomes large. In particular, the contribution from the last two terms will be of order at most $3/2$. As a result we recover a bound similar to all of the ones observed so far:

$$\mathbb{E}_{\mu_{n+1}, \sigma_{n+1}^2}\!\left[R_{W_2}\right] \geq \operatorname{Tr}\Sigma\left(\frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_n}\right) + O(3/2). \tag{24}$$

In particular, we find in the same way that superlinear scaling would be required to minimise the lower bound on model collapse, even in the case of more generic models of approximation.
These results allow us to interpret the process of model collapse. To be precise,
each generation corresponds to a new step in a random walk of model parameters.
Due to errors occurring from re-sampling the approximated distribution, on aver-
age, the expected distance from the original distribution is non-zero. For constant
sample size, the risk ends up diverging and in order for the limiting distribution to
be accurate and the distance to be finite, the sampling rate Mi needs to increase
superlinearly, i.e. one needs to collect superlinearly more samples over time, perhaps
quadratically. However, even then, the expected distance remains non-zero unless
sampling is infinite at each step. Overall, this only shows us how far on average we

depart from the original distribution, but the process can only ’terminate’ if the
estimated variance at a certain generation becomes small enough, i.e. we effectively
turn into a delta function. In total, the message so far can be summarised as follows:

When learning on generational data, due to finite sampling, we are only able to
approximate the original distribution. While on average we should recover the orig-
inal distribution, the variance arising from this is non-zero. As a result, over the
generations, the average distance of n’th generation from the original grows and can
become infinite in the limit, since errors compound over time.

3.1.3 Gaussian Model Collapse


In this subsection we briefly summarise the proof behind the main theorem, which is mostly based on facts from the discussions in the two subsections above.

Theorem 3.2 (Gaussian Model Collapse). Let $x_j^0$ be fixed samples from some original distribution $D_0$ (not necessarily Gaussian), with sample mean and covariance $(\hat{\mu}_0, \hat{\Sigma}_0)$ and $\hat{\Sigma}_0 \neq 0$. Assume $X^n$ are fit recursively using the unbiased sample mean and variance estimators from the previous generation, $X_j^n \mid \mu_n, \Sigma_n \sim \mathcal{N}(\mu_n, \Sigma_n)$, with a fixed sample size. Then

$$\mathbb{E}\!\left[W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), D_0\right)\right] \to \infty; \qquad \Sigma_n \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty,$$

where $W_2^2$ denotes the squared Wasserstein-2 distance between the true distribution and the approximation at step $n$.
Proof. First note that $W_2^2(\mathcal{N}(\mu_n, \Sigma_n), D_0) \geq \|\mu_n - \mu_0\|^2$. Assume that $\hat{\mu}_0, \hat{\Sigma}_0$ are the estimates from samples $x_j^0$ from $D_0$ and $\hat{\Sigma}_0 \neq 0$. Then, by the triangle inequality,

$$W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), D_0\right) + W_2^2\!\left(D_0, \mathcal{N}(\hat{\mu}_0, \hat{\Sigma}_0)\right) \geq \frac{1}{2}\, W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), \mathcal{N}(\hat{\mu}_0, \hat{\Sigma}_0)\right).$$

Therefore,

$$W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), D_0\right) \geq -W_2^2\!\left(D_0, \mathcal{N}(\hat{\mu}_0, \hat{\Sigma}_0)\right) + \frac{1}{2}\|\mu_n - \hat{\mu}_0\|^2.$$

Now, we are in the same setting as in the subsections above, and e.g. using the result from Theorem 3.1 with $\varepsilon = 0$, we have that

$$\mathbb{E}\!\left[W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), D_0\right)\right] \geq \frac{1}{2}\operatorname{Tr}\Sigma\left(\frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_n}\right) - W_2^2\!\left(D_0, \mathcal{N}(\hat{\mu}_0, \hat{\Sigma}_0)\right).$$

Since in our case $M$ is assumed to be constant, we have

$$\mathbb{E}\!\left[W_2^2\!\left(\mathcal{N}(\mu_n, \Sigma_n), D_0\right)\right] \geq n\,\frac{\operatorname{Tr}\hat{\Sigma}_0}{2M} - W_2^2\!\left(D_0, \mathcal{N}(\hat{\mu}_0, \hat{\Sigma}_0)\right) \to \infty.$$

This proves the first claim. The proof of the second claim is heavily based on Alemohammad et al. [17], and is repeated here only for completeness. Once again, looking after the first step, such that only normal distributions are considered, we can write $X_n^i = \Sigma_{n-1}^{1/2} Z_n^i + \mu_{n-1}$, with $Z_n^i \sim \mathcal{N}(0, I)$, and denote the sample mean of the $Z_n^i$ as $\zeta_n = \frac{1}{M}\sum_{i=1}^{M} Z_n^i$. Now, noting that $\operatorname{Tr}\Sigma_n$ is a lower-bounded submartingale and writing

$$\operatorname{Tr}\Sigma_n = \operatorname{Tr}\!\left[\left(\frac{1}{M-1}\sum_{i=1}^{M}\left(Z_n^i - \zeta_n\right)\left(Z_n^i - \zeta_n\right)^\top\right)\Sigma_{n-1}\right],$$

we see that $\operatorname{Tr}\Sigma_n \xrightarrow{\text{a.s.}} \sigma_\infty$ for some random variable $\sigma_\infty$ by Doob's martingale convergence theorem [18]. Without loss of generality assume $\Sigma_{n-1}$ is diagonal. Then $\operatorname{Tr}\Sigma_n$ can be seen as a linear combination of $N$ independent $\chi^2$ random variables with $M - 1$ degrees of freedom and weights $\operatorname{diag}(\Sigma_{n-1})$. We write $\operatorname{Tr}\Sigma_n = Q_n \operatorname{Tr}\Sigma_{n-1}$, where $Q_n$ is a generalised $\chi^2$ random variable with $M - 1$ degrees of freedom and weights $\frac{\operatorname{diag}(\Sigma_{n-1})}{\operatorname{Tr}\Sigma_{n-1}}$, with $\mathbb{E}[Q_n \mid \Sigma_{n-1}] = 1$. At least one of the weights is greater than $1/N$ for each $n$; therefore, for any $0 < \epsilon < 1$, there exists $c > 0$ such that $\Pr(|Q_n - 1| > \epsilon) > c$. Since $|Q_n - 1| > \epsilon$ infinitely often with probability one, $\sigma_\infty = 0$ is the only random variable that satisfies $\lim_{n\to\infty} \operatorname{Tr}\Sigma_0 \prod_{j=1}^{n} Q_j = \sigma_\infty$, and as all matrix norms are equivalent, $\Sigma_n \xrightarrow{\text{a.s.}} 0$.
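Both claims of Theorem 3.2 are visible in a direct simulation. The sketch below (dimension, sample size, and number of generations are arbitrary illustrative choices) recursively fits a multivariate Gaussian with the unbiased estimators and samples from the fit; over many generations the trace of the fitted covariance tends towards zero while the fitted mean drifts away from the original one.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, generations = 5, 200, 5000    # dimension, samples per generation, generations

mu, cov = np.zeros(N), np.eye(N)    # start from D_0 = N(0, I)
for n in range(generations):
    x = rng.multivariate_normal(mu, cov, size=M)
    mu = x.mean(axis=0)                  # unbiased sample mean
    cov = np.cov(x, rowvar=False)        # unbiased sample covariance
    if (n + 1) % 1000 == 0:
        print(n + 1, np.trace(cov), np.linalg.norm(mu))
```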

3.2 Regularised density estimation in RKHS


3.2.1 In-depth setup
In this subsection we provide the theoretical framework for the density estimation problem in an RKHS. The setting closely follows that presented in Kazashi and Nobile [12]. Let $(\mathcal{X}, \mathcal{B}, \mu)$ be a measure space, and let $(\mathcal{N}_K, \langle\cdot,\cdot\rangle_K, \|\cdot\|_K)$ denote the RKHS associated with the positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, i.e. $K(x, x') = K(x', x)$ for all $x, x' \in \mathcal{X}$, and for any $m \in \mathbb{N}$, $t_j, t_k \in \mathbb{R}$, and $x_j, x_k \in \mathcal{X}$, $j, k = 0, \ldots, m$, we have $\sum_{j,k=0}^{m} t_j K(x_j, x_k)\, t_k \geq 0$. This kernel is allowed to be unbounded, but we assume $\int_{\mathcal{X}} \sqrt{K(x, x)}\, d\mu(x) < \infty$ and $\int_{\mathcal{X}} K(x, x)\, d\mu(x) < \infty$. We assume that $K$ admits a representation

$$K(x, x') = \sum_{\ell=0}^{\infty} \beta_\ell\, \varphi_\ell(x)\varphi_\ell(x'), \qquad x, x' \in \mathcal{X},$$

with a positive sequence $(\beta_\ell)_{\ell=0}^{\infty} \subset (0, \infty)$ converging to 0, and a complete orthonormal system $\left(\sqrt{\beta_\ell}\,\varphi_\ell\right)_\ell$ of $\mathcal{N}_K$ such that the series is absolutely (point-wise) convergent and $\{\varphi_\ell\}$ is an orthonormal system of $L^2_\mu(\mathcal{X})$. Then the inner product on $\mathcal{N}_K$ may be represented by

$$\langle f, g\rangle_K = \sum_{\ell=0}^{\infty} \frac{\langle f, \varphi_\ell\rangle_{L^2_\mu(\mathcal{X})}\, \langle g, \varphi_\ell\rangle_{L^2_\mu(\mathcal{X})}}{\beta_\ell},$$

where we use the notation $\langle u, w\rangle_{L^2_\mu(\mathcal{X})} := \int_{\mathcal{X}} u(x) w(x)\, d\mu(x)$ for the $L^2_\mu(\mathcal{X})$-inner product. Similar to Kazashi and Nobile [12], we introduce the notation

$$\langle u, w\rangle_\lambda := \langle u, w\rangle_{L^2_\mu(\mathcal{X})} + \lambda\, \langle u, w\rangle_K \qquad \text{for } u, w \in \mathcal{N}_K,$$

where $\lambda > 0$ is a parameter and the bilinear form $\langle\cdot,\cdot\rangle_\lambda$ is an inner product on $\mathcal{N}_K$. We further define the integral operator $T_K : L^2_\mu(\mathcal{X}) \to L^2_\mu(\mathcal{X})$ by

$$T_K g := \int_{\mathcal{X}} K(\cdot, x) g(x)\, d\mu(x), \qquad g \in L^2_\mu(\mathcal{X}).$$

Let $Y_1^0, \ldots, Y_M^0 : \Omega \to \mathcal{X}$ be independent random variables that follow the distribution defined by a density $y \in \mathcal{N}_K$ with respect to $\mu$. Similar to Kazashi and Nobile [12], we approximate $y$ in the form $y_\theta(\cdot) \approx \sum_{n=1}^{N} a_n K(x_n, \cdot)$ for $X = \{x_1, \ldots, x_N\} \subset \mathcal{X}$ a set of specifically chosen points, defining the approximation space $V_N := \operatorname{span}\{K(x_j, \cdot) \mid j = 1, \ldots, N\}$. Unlike Kazashi and Nobile [12], we are interested in a constrained problem, which provides us with a way to select the regularisation parameter. The idealised regularised density estimation would consist of solving the following constrained optimisation problem:

$$\arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u - y\|_{L^2_\mu(\mathcal{X})}^2 \leq \frac{\epsilon}{2}.$$

The Karush-Kuhn-Tucker (KKT) conditions for this problem yield

$$\arg\min_{u \in V_N} \frac{\lambda}{2}\|u - y\|_{L^2_\mu(\mathcal{X})}^2 + \frac{1}{2}\|u\|_K^2$$
$$\text{s.t.} \quad \lambda \geq 0, \qquad \|u - y\|_{L^2_\mu(\mathcal{X})}^2 \leq \epsilon, \qquad \lambda\left(\|u - y\|_{L^2_\mu(\mathcal{X})}^2 - \epsilon\right) = 0.$$

However, in the density estimation setting we do not have access to the true density, but only to samples. Therefore the problem we consider is a sample approximation of the ideal problem:

$$f_n^* = \arg\min_{u \in V_N} \frac{1}{2}\|u\|_K^2 \quad \text{s.t.} \quad \frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) + \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top \left[\hat{A}^{-1/2} k_{Y^n}\right] \leq \frac{\epsilon_n}{2},$$

where we define $\hat{A}_{jk} \equiv \langle K(x_j, \cdot), K(x_k, \cdot)\rangle_{L^2_\mu(\mathcal{X})}$. In this case, the last term acts as a proxy for $\|y\|_{L^2_\mu(\mathcal{X})}^2$. With this, we can follow Mobahi et al. [13] and in analogy write down the Karush-Kuhn-Tucker (KKT) conditions for this problem,

$$f_n^* = \arg\min_{u \in V_N} \lambda_n\left(\frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n)\right) + \frac{1}{2}\|u\|_K^2$$
$$\text{s.t.} \quad \lambda_n \geq 0, \qquad \frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) \leq \frac{\epsilon_n}{2} - \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right],$$
$$\lambda_n\left(\frac{1}{2}\|u\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} u(Y_m^n) - \left(\frac{\epsilon_n}{2} - \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right]\right)\right) = 0.$$

Then, in analogy, when $\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right] \leq \epsilon_n$, the problem has the trivial solution $f^* = 0$, which is referred to as collapse. In what follows the more interesting case is investigated, i.e. $\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right] > \epsilon_n$, with

$$\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right] > \epsilon_n \iff \lambda_n > 0.$$

From now on we let $c_n = 1/\lambda_n$ and write down the following ideal and approximate problems:

$$f := \arg\min_{v \in V_N} \left(\frac{1}{2}\|v - y\|_{L^2_\mu(\mathcal{X})}^2 + \frac{c_n}{2}\|v\|_K^2\right),$$

$$f_n^* := \arg\min_{v \in V_N} \left(\frac{1}{2}\|v\|_{L^2_\mu(\mathcal{X})}^2 + \frac{c_n}{2}\|v\|_K^2 - \frac{1}{M}\sum_{m=1}^{M} v(Y_m^n)\right).$$

For these, we can write down the exact solutions. For the second, this is equivalent to finding $f_n^* \in V_N$ such that

$$\langle f_n^*, v\rangle_{c_n} = \frac{1}{M}\sum_{m=1}^{M} v(Y_m^n) \quad \text{for all } v \in V_N. \tag{25}$$

The reader is referred to Kazashi and Nobile [12] and Ciarlet [19] for the equivalence and well-posedness discussion. In particular, define $\hat{A}_{jk} \equiv \langle K(x_j, \cdot), K(x_k, \cdot)\rangle_{L^2_\mu(\mathcal{X})}$ and $A_{jk} \equiv K(x_j, x_k)$, let $P = \hat{A}^{-1/2} A \hat{A}^{-1/2}$, with $P$ symmetric positive definite, and let $k(\cdot)_j = K(x_j, \cdot)$ and similarly $k(Y)_j = \frac{1}{M}\sum_{m=1}^{M} K(x_j, Y_m)$, for $j = 1, \ldots, N$. Then the variational problems satisfy

$$f_n^* = \left[k(\cdot)\right]^\top \hat{A}^{-1/2} (c_n P + I)^{-1} \hat{A}^{-1/2} k_{Y^n}, \qquad f = \left[k(\cdot)\right]^\top \hat{A}^{-1/2} (c_n P + I)^{-1} \hat{A}^{-1/2} [T_K y](x_{1:N})$$

for $Y^n \sim y\, d\mu$, and by Kazashi and Nobile [12] we have that $\mathbb{E}[f_n^*] = f$.

3.2.2 Unrolling the model collapse


Now we would like to understand what happens when we perform this approximation repeatedly. As the $x_i$ are fixed, the result turns out to be very close to Mobahi et al. [13]. In particular, for this subsection we will denote samples at generation $n$ as $Y^n$, and letting $f_{-1}^* \equiv y$, we have the following:

$$Y^{n+1} \sim f_n^* = \left[k(\cdot)\right]^\top \hat{A}^{-1/2} (c_n P + I)^{-1} \hat{A}^{-1/2} k_{Y^n}.$$

Now, in order to proceed we need to make sure that one of $\epsilon_n$ or $c_n$ is deterministic, while the other implicitly becomes a random variable and can only be treated as such. For simplicity we choose $c_n$ to be deterministic and $\epsilon_n$ defined implicitly, with $c_n$ chosen such that, in expectation, $\mathbb{E}(\epsilon_n) = \epsilon$. In this case we have

$$\mathbb{E}[f_n^*] = \left[k(\cdot)\right]^\top \hat{A}^{-1/2} \left[\prod_{i=0}^{n} (c_i P + I)^{-1}\right] \hat{A}^{-1/2} [T_K y](x_{1:N}).$$

This follows from iteratively using Equation (25), similar to Kazashi and Nobile [12]; e.g. for $n = 2$:

$$\langle \mathbb{E}_{Y^2, Y^1, Y^0}[f_2^*], v\rangle_{c_2} = \int_{z^1}\int_{z^0} \mathbb{E}_{Y^2 \mid Y^1 = z^1}\!\left[v(Y^2)\right] f_0^*\big|_{Y^0 = z^0}(z^1)\, y(z^0)\, d\mu(z^1)\, d\mu(z^0)$$
$$= \int_{x} d\mu(x)\, v(x)\, k(x)^\top \hat{A}^{-1/2} (c_1 P + I)^{-1} \hat{A}^{-1/2} \int_{z^1, z^0} k_{z^1}\left[k(z^1)\right]^\top \hat{A}^{-1/2} (c_0 P + I)^{-1} \hat{A}^{-1/2} k_{z^0}\, d\mu(z^1)\, y(z^0)\, d\mu(z^0)$$
$$= \int_{x} d\mu(x)\, v(x)\, k(x)^\top \hat{A}^{-1/2} (c_1 P + I)^{-1} (c_0 P + I)^{-1} \hat{A}^{-1/2} [T_K y](x_{1:N}).$$

As $P$ is symmetric positive definite, we can write it as $P = V^\top D V$, with $D$ diagonal. Let $d_{\min}$ and $d_{\max}$ denote its smallest and largest eigenvalues, let $B_n = \prod_{i=0}^{n}(c_i D + I)^{-1}$, and let $k_{zY} = V \hat{A}^{-1/2} k_Y$. Now, the only thing missing is the relevant bounds on the $c_i$, which would allow us to directly utilise results in [13]. In our case they are defined through

$$\frac{1}{2}\|f_n^*\|_{L^2_\mu(\mathcal{X})}^2 - \frac{1}{M}\sum_{m=1}^{M} f_n^*(Y_m^n) - \left(\frac{\epsilon_n}{2} - \frac{1}{2}\left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right]\right) = 0.$$

Now, based on Equation (25), we need to evaluate $\frac{1}{2}\|f_n^*\|_{L^2_\mu(\mathcal{X})}^2$ and $\frac{1}{2}\|f_n^*\|_K^2$:

$$\|f_n^*\|_{L^2_\mu(\mathcal{X})}^2 = k_{Y^n}^\top \hat{A}^{-1/2} (c_n P + I)^{-2} \hat{A}^{-1/2} k_{Y^n}, \qquad \|f_n^*\|_K^2 = k_{Y^n}^\top \hat{A}^{-1/2} P (c_n P + I)^{-2} \hat{A}^{-1/2} k_{Y^n}.$$

This in turn lets us calculate the value of $\epsilon_n$:

$$\epsilon_n = -\|f_n^*\|_{L^2_\mu(\mathcal{X})}^2 - 2 c_n \|f_n^*\|_K^2 + \left[\hat{A}^{-1/2} k_{Y^n}\right]^\top\left[\hat{A}^{-1/2} k_{Y^n}\right]$$
$$= k_{Y^n}^\top \hat{A}^{-1/2}\, c_n^2 P^2 (c_n P + I)^{-2}\, \hat{A}^{-1/2} k_{Y^n} = k_{z_{Y^n}}^\top \operatorname{diag}\!\left(\frac{c_n^2 d_j^2}{(c_n d_j + 1)^2}\right) k_{z_{Y^n}} = k_{z_{Y^n}}^\top \operatorname{diag}\!\left(\frac{1}{\left(\frac{1}{c_n d_j} + 1\right)^2}\right) k_{z_{Y^n}}.$$

Thus, in analogy to Mobahi et al. [13], we can derive the following upper and lower bounds on $c_n$, which we evaluate here using the expected values, based on the discussion above:

$$\frac{\sqrt{\mathbb{E}(\epsilon_n)}}{d_{\max}\left(\sqrt{\mathbb{E}\!\left(k_{z_{Y^n}}^\top k_{z_{Y^n}}\right)} - \sqrt{\mathbb{E}(\epsilon_n)}\right)} \leq c_n \leq \frac{\sqrt{\mathbb{E}(\epsilon_n)}}{d_{\min}\left(\sqrt{\mathbb{E}\!\left(k_{z_{Y^n}}^\top k_{z_{Y^n}}\right)} - \sqrt{\mathbb{E}(\epsilon_n)}\right)}.$$

Therefore,

$$c_n = \frac{\sqrt{\mathbb{E}(\epsilon_n)}}{\alpha_n\left(\sqrt{\mathbb{E}\!\left(k_{z_{Y^n}}^\top k_{z_{Y^n}}\right)} - \sqrt{\mathbb{E}(\epsilon_n)}\right)}$$

for some $\alpha_n \in [d_{\min}, d_{\max}]$. Now, as mentioned above, we set $\mathbb{E}(\epsilon_n) = \epsilon$, i.e. the desired accuracy is achieved in expectation. Thus, we only need to calculate $\mathbb{E}\!\left(k_{z_{Y^n}}^\top k_{z_{Y^n}}\right)$:

$$\mathbb{E}\!\left(\left[\hat{A}^{-1/2} k_Y\right]^\top\left[\hat{A}^{-1/2} k_Y\right]\right) = \sum_{ij} \hat{A}^{-1}_{ij}\left(\mathbb{E}\left(K(x_i, Y)\right)\mathbb{E}\left(K(x_j, Y)\right) + \frac{1}{M}\operatorname{Cov}\left(K(x_i, Y), K(x_j, Y)\right)\right).$$

Now, in order to simplify our calculations, we assume $M \to \infty$ so as to be able to drop the second term. This is possible because the second term is bounded, as a product of two kernels is a kernel. From now on, working in this limit, based on the previous calculations we know that

$$\mathbb{E}\!\left[\hat{A}^{-1/2} k_{Y^n}\right] = \left[\prod_{i=0}^{n-1}(c_i P + I)^{-1}\right]\hat{A}^{-1/2}[T_K y](x_{1:N}), \tag{26}$$

and thus

$$\mathbb{E}\!\left(K(x_{1:N}, Y^n)\right)^\top \hat{A}^{-1}\, \mathbb{E}\!\left(K(x_{1:N}, Y^n)\right) = \left[\hat{A}^{-1/2}[T_K y](x_{1:N})\right]^\top\left[\prod_{i=0}^{n-1}(c_i P + I)^{-2}\right]\hat{A}^{-1/2}[T_K y](x_{1:N})$$
$$= \left[V\hat{A}^{-1/2}[T_K y](x_{1:N})\right]^\top \operatorname{diag}\!\left(\prod_{i=0}^{n-1}\frac{1}{(c_i d_j + 1)^2}\right) V\hat{A}^{-1/2}[T_K y](x_{1:N}).$$

We use the following concise notation:

$$v_0 = V\hat{A}^{-1/2}[T_K y](x_{1:N}), \qquad v_n = (c_{n-1} D + I)^{-1} v_{n-1} = \left(\frac{1}{\alpha_n\left(\frac{\|v_{n-1}\|}{\sqrt{\epsilon}} - 1\right)}\, D + I\right)^{-1} v_{n-1},$$

with which we have $\sqrt{\mathbb{E}\!\left(k_{z_{Y^n}}^\top k_{z_{Y^n}}\right)} = \|v_n\|$. This in turn directly coincides with the setting of Mobahi et al. [13], replacing $D$ with its inverse. To be precise, the following follows exactly from Mobahi et al. [13]:
Proposition 3.1. For any $n \geq 0$, if $\|v_i\| > \sqrt{\epsilon}$ for $i = 0, \ldots, n$, then $\|v_n\|$ is decreasing with respect to $n$ and

$$\|v_n\| \geq a^n(\kappa)\,\|v_0\| - \sqrt{\epsilon}\, b(\kappa)\,\frac{a^n(\kappa) - 1}{a(\kappa) - 1},$$

where

$$a(x) \equiv \frac{(r_0 - 1)^2 + x\,(2r_0 - 1)}{(r_0 - 1 + x)^2}, \qquad b(x) \equiv \frac{r_0^2\, x}{(r_0 - 1 + x)^2}, \qquad r_0 \equiv \frac{\|v_0\|}{\sqrt{\epsilon}}, \qquad \kappa \equiv \frac{d_{\max}}{d_{\min}}.$$

This in the same way provides a lower bound on the number of generations before collapse occurs:

Proposition 3.2. Starting from $\|v_0\| > \sqrt{\epsilon}$, meaningful (non-collapsing) training on generational data is possible for at least $\underline{n}$ generations, where

$$\underline{n} \equiv \frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1}{\kappa}.$$

During this process, the effect can be viewed as a sparsification of the underlying function, in the following sense. The expected density is

$$\mathbb{E}[f_n^*] = \left[k(\cdot)\right]^\top \hat{A}^{-1/2}\, V^\top\left[\prod_{i=0}^{n}(c_i D + I)^{-1}\right] V \hat{A}^{-1/2}[T_K y](x_{1:N}),$$

and denoting $B_n = \prod_{i=0}^{n}(c_i D + I)^{-1}$, we have the following:

Proposition 3.3. Starting from $\|v_0\| > \sqrt{\epsilon}$ and for $n \leq \frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1}{\kappa}$, for $d_j$ and $d_k$ any pair of diagonal elements of $D$ with $d_k < d_j$, the following inequality holds:

$$\frac{B_{n-1}[k,k]}{B_{n-1}[j,j]} \geq \left(\frac{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_j}{d_{\max}}}{\frac{\|v_0\|}{\sqrt{\epsilon}} - 1 + \frac{d_k}{d_{\max}}}\right)^{n}.$$

In total, the message so far can be summarised as follows:

When regularising density estimation with generational data, due to functional approx-
imation errors, the approximated distribution gets sparser over generations, even in
the limit of infinite data. As such, even with perfect information, we do not recover
the original distribution, and over time model collapse occurs.

3.3 Absorbing Markov Chain


This subsection explains a well-known fact about absorbing Markov chains: they converge to an absorbing state with probability one. Assume that the $X^m$ form a Markov chain. In order to reason about this chain we need to consider the transition probabilities, which in general correspond to our functional approximation scheme. Due to the stochastic nature of the Markov chain, we expect the variance to go up and down. But as the variance decreases, the newly sampled data, due to their finiteness, will be more concentrated, leading in the limit to a set of delta functions. This argument assumes that the approximation scheme is good enough to converge to delta functions; if not, the errors in approximation may prevent the propagation of the errors due to stochasticity.
As discussed in the previous section, we can model the process of repeated 'sampling' and 'fitting' as a Markov chain. In this subsection, we explain how such a process can converge to a stationary state, i.e. the absorbing state of a Markov chain. In this derivation we follow Allan Yashinski¹. Suppose we have an absorbing Markov chain with r transient states t1, . . . , tr and s absorbing states a1, . . . , as. The whole Markov chain has r + s states, ordered as follows: t1, . . . , tr, a1, . . . , as. The transition matrix is then defined as

$$T = \begin{pmatrix} Q & 0_{r\times s} \\ R & I_s \end{pmatrix}, \tag{27}$$

where
• Q is an r × r matrix which holds the probabilities of moving from a transient state to another transient state;
• R is an s × r matrix which holds the probabilities of moving from a transient state to an absorbing state;
• 0r×s is the r × s matrix of all 0's. These 0's represent the probabilities of moving from an absorbing state to a transient state (which is impossible by definition);
• Is holds the probabilities of transitioning between the absorbing states. As transitions between them are impossible, this is just the s × s identity matrix.

We are interested in $\lim_{k\to\infty} T^k(X_0)$. For a given k, the matrix becomes

$$T^k = \begin{pmatrix} Q^k & 0_{r\times s} \\ R + RQ + \cdots + RQ^{k-1} & I_s \end{pmatrix} = \begin{pmatrix} Q^k & 0_{r\times s} \\ R\sum_{i=0}^{k-1} Q^i & I_s \end{pmatrix}. \tag{28}$$
Finally, for an absorbing Markov chain with $T = \begin{pmatrix} Q & 0_{r\times s} \\ R & I_s \end{pmatrix}$, we have

$$\lim_{k\to\infty} T^k = \begin{pmatrix} 0_{r\times r} & 0_{r\times s} \\ R(I_r - Q)^{-1} & I_s \end{pmatrix}.$$

¹ www.math.umd.edu/∼immortal/MATH401/book/ch absorbing markov chains.pdf

Fig. 6: Examples of a GMM fitting data at iterations {0, 50, 100, 150, 200, 350, 2000}. At first the model fits the data very well, as shown on the left; yet even at generation 50 the perception of the underlying distribution completely changes. At generation 2000 it converges to a state with very little variance. The GMM is sampled a thousand times.
Since in the limit the transition probabilities to transient states are zero, we end up
converging to absorbing states and staying there. In the case of discrete distributions,
where we can perfectly approximate a zero-variance dataset (i.e. a delta function), the
absorbing states are delta functions centered at any non-zero probability point from
the original distribution. In practice, we would like to know the expected number of
steps before being absorbed, which may be large. But without knowing our fitting
procedure it is impossible to calculate the matrix Q and therefore the average length
of time before collapse.
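The quantities above are simple to compute for any concrete chain. The sketch below uses a toy absorbing chain with hypothetical transition probabilities, written in the column-stochastic convention of Equation (27) (entry $(i, j)$ is the probability of moving to state $i$ from state $j$); it computes the limiting absorption probabilities $R(I_r - Q)^{-1}$ and the expected number of steps before absorption.

```python
import numpy as np

# Toy absorbing chain: r = 2 transient states, s = 2 absorbing states.
# Probabilities are hypothetical; each column of [[Q], [R]] sums to 1.
Q = np.array([[0.5, 0.2],
              [0.3, 0.4]])                 # transient -> transient
R = np.array([[0.1, 0.3],
              [0.1, 0.1]])                 # transient -> absorbing
r, s = Q.shape[0], R.shape[0]

T = np.block([[Q, np.zeros((r, s))],
              [R, np.eye(s)]])             # full transition matrix, Equation (27)

absorption = R @ np.linalg.inv(np.eye(r) - Q)        # bottom-left block of lim T^k
print(absorption)                                     # column j: absorption probs from t_j
print(np.linalg.matrix_power(T, 200)[r:, :r])         # numerically matches the limit
print(np.ones(r) @ np.linalg.inv(np.eye(r) - Q))      # expected steps to absorption from t_j
```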

4 GMM and VAE experiments


4.1 Training from scratch with GMMs and VAEs
Gaussian Mixture Models. In this subsection we evaluate the performance of Gaussian Mixture Models (GMMs) [20]. The underlying task here is that a given GMM tries to separate two artificially generated Gaussians. Figure 6 shows the progression of the GMM fitting process over time. The left-most plot shows the original two Gaussians with the ground-truth labels. The next plot shows the GMM fitted on the original data with no cross-generational data used, i.e. αi = γi = 0, where the error is minimal. Yet within 50 iterations of re-sampling we arrive at a point where the underlying distribution is mis-perceived. The performance worsens over time and by iteration 2000 we arrive at a point estimate of the distribution with very little variance. The L2 distance between the original GMM and its descendants is plotted in Figure 8.
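A minimal version of this experiment can be written with scikit-learn. The sketch below is an illustrative stand-in (cluster locations, sample counts, and the number of generations are not the paper's exact settings): each generation fits a two-component GMM to data sampled from the previous generation's fit, with no original data retained (αi = γi = 0), and an L2 distance between the original and current densities is approximated on a grid, one simple way to compute a quantity like the one plotted in Figure 8.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Original data: two well-separated Gaussian clusters (illustrative parameters).
X = np.vstack([rng.normal((-2.0, 0.0), 0.5, size=(500, 2)),
               rng.normal((2.0, 0.0), 0.5, size=(500, 2))])

gmm0 = GaussianMixture(n_components=2, random_state=0).fit(X)

def l2_distance(a, b, lim=6.0, steps=200):
    """Approximate L2 distance between two 2-D GMM densities on a grid."""
    xs = np.linspace(-lim, lim, steps)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
    pa = np.exp(a.score_samples(grid))
    pb = np.exp(b.score_samples(grid))
    cell = (xs[1] - xs[0]) ** 2
    return np.sqrt(np.sum((pa - pb) ** 2) * cell)

gmm = gmm0
for gen in range(1, 201):
    samples, _ = gmm.sample(1000)                       # data for the next generation
    gmm = GaussianMixture(n_components=2, random_state=0).fit(samples)
    if gen % 50 == 0:
        print(gen, l2_distance(gmm0, gmm))
```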
Variational Autoencoders. In this subsection we turn to Variational Autoencoders
(VAE). As before, we train an autoencoder on an original data source, in this case
MNIST [21], which we later sample. Here, we generate latents from a Gaussian dis-
tribution which are then used by the decoder to generate data for the subsequent
generation. Figure 7 on the left shows an example of generated data using the setting
described by Kingma and Welling [22].
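For concreteness, the generational loop can be sketched as follows. This is a deliberately tiny stand-in (a small VAE on synthetic two-mode 2-D data rather than MNIST, with arbitrary widths, latent size, and training schedule), not the architecture of Kingma and Welling used in the actual experiment; it only illustrates the loop in which each new VAE is trained exclusively on decoder samples from its predecessor.

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    # Hypothetical, deliberately small VAE used only to illustrate the loop.
    def __init__(self, d_in=2, d_lat=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 2 * d_lat))
        self.dec = nn.Sequential(nn.Linear(d_lat, 64), nn.ReLU(), nn.Linear(64, d_in))
        self.d_lat = d_lat

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        return self.dec(z), mu, logvar

def train(model, data, steps=2000, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        xb = data[torch.randint(len(data), (128,))]
        recon, mu, logvar = model(xb)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        loss = ((recon - xb) ** 2).sum(-1).mean() + kl          # MSE reconstruction + KL
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Original data: a two-mode distribution (the analogue of distinct classes).
modes = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
data = modes[torch.randint(0, 2, (2000,))] + 0.3 * torch.randn(2000, 2)

for gen in range(10):
    vae = train(TinyVAE(), data)
    with torch.no_grad():
        z = torch.randn(2000, vae.d_lat)   # latents drawn from the prior
        data = vae.dec(z)                  # next generation sees only generated data
    print(gen, data.mean(0).tolist(), data.std(0).tolist())
```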

(a) Original model; (b) Generation 5; (c) Generation 10; (d) Generation 20

Fig. 7: Random latent reconstructions from VAEs. No training data come from the original distribution. Over the generations, different modes of the original distribution get entangled and the generated data start looking unimodal.

Fig. 8: Distance between the original GMM and its approximation as a function of the number of data samples. Shown is the progressive fitting of a GMM with different numbers of samples (500, 1000, 10000, 50000, 200000); the y-axis shows the logarithm of the L2 distance between the two GMM distributions over 2000 generations. Over the generations the distance begins to grow and can become quite large. The jumps in the distance for large sample sizes occur due to the fixed number of iterations and precision of the expectation-maximisation algorithm.

Having performed the process a number of times, we arrive at a representation that bears very little resemblance to the original classes learned from the data. On the right, one sees the generated images from generation 20, which appear to be a mix of all of the different digits. Interestingly, the original encoder perceives the generated data from its descendants with ever-growing confidence – the encoder places such data closer and closer to the mean. Figure 1 shows the density of the latent representation of the original model when presented with data generated by its descendants. As with single-dimensional Gaussians, the tails disappear over time and all of the density shifts towards the mean. We find that such degradation happens consistently over multiple independent runs.

Fig. 9: Histogram of perplexities of each individual training sequence produced by different generations, as evaluated by the very first model trained with the real wikitext2 data; shown as (a) overlaid histograms and (b) a 3D view. Over the generations, models tend to produce samples that the original model (trained with real data) is more likely to produce. At the same time, a much longer tail appears for later generations – later generations start producing samples that would never be produced by the original model, i.e. they start misperceiving reality based on errors introduced by their ancestors. Models here are explicitly forced not to repeat sequences, with a penalty of 2.0.

Fig. 10: Histogram of perplexities of each individual training sequence produced by different generations, as evaluated by the very first model trained with the real data; (a) Figure 1b from the main text in 3D, with no original data preserved, and (b) Figure 1c from the main text in 3D, with 10% of the original data preserved. Over the generations, models tend to produce samples that the original model (trained with real data) is more likely to produce. At the same time, a much longer tail appears for later generations – later generations start producing samples that would never be produced by the original model, i.e. they start misperceiving reality based on errors introduced by their ancestors.

References
[1] Ven, G.M., Tolias, A.S.: Three scenarios for continual learning. arXiv preprint
arXiv:1904.07734 (2019)

[2] Aljundi, R., Kelchtermans, K., Tuytelaars, T.: Task-free continual learning. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11254–11263 (2019)

[3] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu,
A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Over-
coming catastrophic forgetting in neural networks. Proceedings of the National
Academy of Sciences 114(13), 3521–3526 (2017)

[4] Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory
aware synapses: Learning what (not) to forget. In: Proceedings of the European
Conference on Computer Vision (ECCV), pp. 139–154 (2018)

[5] Li, Z., Hoiem, D.: Learning without forgetting. IEEE transactions on pattern
analysis and machine intelligence 40(12), 2935–2947 (2017)

[6] Biggio, B., Nelson, B., Laskov, P.: Poisoning attacks against support vector
machines. In: Proceedings of the 29th International Coference on International
Conference on Machine Learning. ICML’12, pp. 1467–1474. Omnipress, Madison,
WI, USA (2012)

[7] Gu, T., Dolan-Gavitt, B., Garg, S.: Badnets: Identifying vulnerabilities in the
machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017)

[8] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International Conference on Machine Learning,
pp. 8748–8763 (2021). PMLR

[9] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models
are few-shot learners. Advances in Neural Information Processing Systems 33,
1877–1901 (2020)

[10] Carlini, N., Terzis, A.: Poisoning and backdooring contrastive learning. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=iC4UHbQ01Mp

[11] Carlini, N., Jagielski, M., Choquette-Choo, C., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., Tramèr, F.: Poisoning web-scale training datasets is practical. In: 2024 IEEE Symposium on Security and Privacy (SP), pp. 175–175. IEEE Computer Society, Los Alamitos, CA, USA (2024). https://doi.org/10.1109/SP54263.2024.00179. https://doi.ieeecomputersociety.org/10.1109/SP54263.2024.00179

[12] Kazashi, Y., Nobile, F.: Density estimation in RKHS with application to Korobov spaces in high dimensions. SIAM Journal on Numerical Analysis 61(2), 1080–1102 (2023)

[13] Mobahi, H., Farajtabar, M., Bartlett, P.: Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems 33, 3351–3361 (2020)

[14] Fischer, A., Gaunt, R., Sarantsev, A.: The variance-gamma distribution: A review.
Statistical Science (2024)

[15] Cochran, W.G.: The distribution of quadratic forms in a normal system, with applications to the analysis of covariance. Mathematical Proceedings of the Cambridge Philosophical Society 30(2), 178–191 (1934). https://doi.org/10.1017/S0305004100016595

[16] Gelbrich, M.: On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten 147(1), 185–203 (1990)

[17] Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A.I., Babaei, H.,
LeJeune, D., Siahkoohi, A., Baraniuk, R.: Self-consuming generative models go
MAD. In: The Twelfth International Conference on Learning Representations
(2024). https://openreview.net/forum?id=ShjMHfmPs0

[18] Williams, D.: Probability with martingales (1991)

[19] Ciarlet, P.G.: Linear and nonlinear functional analysis with applications (2013)

[20] Reynolds, D.A., et al.: Gaussian mixture models. Encyclopedia of Biometrics 741(659-663) (2009)

[21] LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST Database of Handwritten Digits (1998). http://yann.lecun.com/exdb/mnist/

[22] Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes (2022)
