0% found this document useful (0 votes)
87 views

PFGM++ - Unlocking The Potential of Physics-Inspired Generative Models

This document introduces PFGM++, a new family of physics-inspired generative models that unifies diffusion models and Poisson Flow Generative Models (PFGM). PFGM++ models data using an N+D dimensional space to generate samples, reducing to PFGM when D=1 and diffusion models as D approaches infinity. The flexibility of choosing D allows balancing robustness against rigidity. The paper establishes equivalences between PFGM++, PFGM, and diffusion models and introduces an unbiased training objective. Experiments show PFGM++ models with finite D outperform diffusion models on CIFAR-10 and FFHQ while having improved robustness against errors.

Uploaded by

markus.aurelius
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
87 views

PFGM++ - Unlocking The Potential of Physics-Inspired Generative Models

This document introduces PFGM++, a new family of physics-inspired generative models that unifies diffusion models and Poisson Flow Generative Models (PFGM). PFGM++ models data using an N+D dimensional space to generate samples, reducing to PFGM when D=1 and diffusion models as D approaches infinity. The flexibility of choosing D allows balancing robustness against rigidity. The paper establishes equivalences between PFGM++, PFGM, and diffusion models and introduces an unbiased training objective. Experiments show PFGM++ models with finite D outperform diffusion models on CIFAR-10 and FFHQ while having improved robustness against errors.

Uploaded by

markus.aurelius
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Yilun Xu 1 Ziming Liu 1 Yonglong Tian 1 Shangyuan Tong 1 Max Tegmark 1 Tommi Jaakkola 1

Abstract involve iteratively de-noising samples by following phys-


ically meaningful trajectories. Diffusion models learn a
We introduce a new family of physics-inspired noise-level dependent score function so as to reverse the ef-
generative models termed PFGM++ that unifies
arXiv:2302.04265v2 [cs.LG] 10 Feb 2023

fects of forward diffusion, progressively reducing the noise


diffusion models and Poisson Flow Generative level σ along the generation trajectory. PFGMs in turn aug-
Models (PFGM). These models realize generative ment N -dimensional data points with an extra dimension
trajectories for N dimensional data by embed- and evolve samples drawn from a uniform distribution over
ding paths in N +D dimensional space while still a large N +1-dimensional hemisphere back to the z=0 hy-
controlling the progression with a simple scalar perplane where the clean data (as charges) reside by tracing
norm of the D additional variables. The new mod- learned electric field lines. Diffusion models in particular
els reduce to PFGM when D=1 and to diffusion have been demonstrated across image (Song et al., 2021b;
models when D→∞. The flexibility of choos- Nichol et al., 2022a; Ramesh et al., 2022), 3D (Zeng et al.,
ing D allows us to trade off robustness against 2022; Poole et al., 2022), audio (Kong et al., 2020; Chen
rigidity as increasing D results in more concen- et al., 2020) and biological data (Shi et al., 2021; Watson
trated coupling between the data and the addi- et al., 2022) generation, and have more stable training objec-
tional variable norms. We dispense with the bi- tives compared to GANs (Arjovsky et al., 2017; Brock et al.,
ased large batch field targets used in PFGM and 2019). More recent PFGM (Xu et al., 2022) rival diffusion
instead provide an unbiased perturbation-based models on image generation.
objective similar to diffusion models. To explore
different choices of D, we provide a direct align- In this paper, we introduce a broader family of physics-
ment method for transferring well-tuned hyperpa- inspired generative models that we call PFGM++. These
rameters from diffusion models (D→∞) to any models extend the electrostatic view into higher dimen-
finite D values. Our experiments show that mod- sions through multi-dimensional z ∈ RD augmentations.
els with finite D can be superior to previous state- When interpreting N -dimensional data points x as posi-
of-the-art diffusion models on CIFAR-10/FFHQ tive charges, the electric field lines define a surjection from
64×64 datasets, with FID scores of 1.91/2.43 a uniform distribution on an infinite N +D-dimensional
when D=2048/128. In class-conditional setting, hemisphere to the data distribution located on the z=0 hy-
D=2048 yields current state-of-the-art FID of perplane. We can therefore draw generative samples by
1.74 on CIFAR-10. In addition, we demonstrate following the electric field lines, evolving points from the
that models with smaller D exhibit improved ro- hemisphere back to the z=0 hyperplane. Since the electric
bustness against modeling errors. Code is avail- field has rotational symmetry on the surface of the D-dim
able at https://github.com/Newbeeer/ cylinder kzk2 = r for any r > 0, we can track the sampling
pfgmpp trajectory with a simple scalar r instead of every compo-
nent of z. The use of symmetry turns the aforementioned
surjection into a bijection between an easy-to-sample prior
on a large r = rmax hyper-cylinder to the data distribution.
1. Introduction The symmetry reduction also permits D to take any positive
Physics continues to inspire new deep generative models values, including reals. We derive a new perturbation-based
such as diffusion models (Sohl-Dickstein et al., 2015; Ho training objective akin to denoising score matching (Vincent,
et al., 2020; Song et al., 2021b; Karras et al., 2022) based 2011) that avoids the need to use large batches to construct
on thermodynamics (Jarzynski, 1997) or Poisson flow gener- electric field line targets in PFGM. The perturbation-based
ative models (PFGM) (Xu et al., 2022) derived from electro- objective is more efficient, unbiased, and compatible with
statics (Griffiths, 2005). The associated generative processes paired sample training of conditional generation models.
1
Massachusetts Institute of Technology, MIT, Cambridge, MA,
The models in the new family differ based on their aug-
USA. Correspondence to: Yilun Xu <[email protected]>. mentation dimension D which is now a hyper-parameter.
By setting D=1 we obtain PFGM while D→∞ leads to
arXiv preprint.
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Diffusion models
Sec 5 Sweet spot balancing VE/VP (Song et al, 2021)
PFGM (Xu et al, 2022) robustness and rigidity EDM (Karras et al, 2022)

D=1 D* D→∞
PFGM++ (D ∈ ℝ+)
Extension from PFGM Equivalence between D → ∞ and diffusion models

Sec 3.1 Higher-dimensional augmentation Thm 4.1 Field / Sampling equivalence


Sec 3.2 Perturbation-based training objective Prop 4.2 Training equivalence

Figure 1. Overview of paper contributions and structure. PFGM++ unify PFGM and diffusion models, as well as the potential to combine
their strengths (robustness and rigidity).

diffusion models. We establish D→∞ equivalence with demonstrate the trade-off between robustness and rigidity
popular diffusion models (Song et al., 2021b; Karras et al., by varying D (Sec 5). We also detail the hyperparameter
2022) both in terms of their training objectives as well as transfer procedures from EDM/DDPM (D → ∞) to finite
their inferential processes. We demonstrate that the hyper- Ds in Appendix C.2; (5) We empirically show that mod-
parameter D controls the balance between robustness and els with finite D achieve superior performance to diffusion
rigidity: using a small D widens the distribution of noisy models while exhibiting improved robustness (Sec 6).
training sample norms in comparison to the norm of the
augmented variables. However, small D also leads to a
2. Background and Related Works
heavy-tailed sampling problem at any fixed augmentation
norm making learning more challenging. Neither D=1 nor Diffusion Model Diffusion models (Sohl-Dickstein et al.,
D→∞ offers an ideal balance between being insensitive 2015; Ho et al., 2020; Song et al., 2021b; Karras et al.,
to missteps (robustness) and allowing effective learning 2022) are often presented as a pair of two processes. A fixed
(rigidity). Instead, we adjust D in response to different ar- forward process governs the training of the model, which
chitectures and tasks. To facilitate quickly finding the best learns to denoise data of different noise levels. A corre-
D we provide an alignment method to directly transfer other sponding backward process involves utilizing the trained
hyperparameters across different choices of D. model iteratively to denoise the samples starting from a fully
noisy prior distribution. Karras et al. (2022) propose a unify-
Experimentally, we show that some models with finite
ing framework for popular diffusion models (VE/VP (Song
D outperform the previous state-of-the-art diffusion mod-
et al., 2021b) and EDM (Karras et al., 2022)), and their sam-
els (D→∞), i.e., EDM (Karras et al., 2022), on image
pling process can be understood as traveling in time with a
generation tasks. In particular, intermediate D=2048/128
probability flow ordinary differential equation (ODE):
achieve the best performance among other choices of D
ranging from 64 to ∞, with min FID scores of 1.91/2.43 dx = −σ̇(t)σ(t)∇x log pσ(t) (x)dt
on CIFAR-10 and FFHQ 64×64 datasets in unconditional
generation, using 35/79 NFE. In class-conditional genera- where σ(t) is a predefined noise schedule w.r.t. time, and
tion, D=2048 achieves new state-of-the-art FID of 1.74 on ∇x log pσ(t) (x) is the score of noise-injected data distribu-
CIFAR-10. We further verify that in general, decreasing D tion at time t. A neural network fθ (x, σ) is trained to learn
leads to improved robustness against a variety of sources of the score ∇x log pσ(t) (x) by minimizing a weighted sum of
errors, i.e., controlled noise injection, large sampling step the denoising score-matching objectives (Vincent, 2011):
sizes and post-training quantization.
Eσ∼p(σ) λ(σ)Ey∼p(y) Ex∼pσ (x|y)
Our contributions are summarized as follows: (1) We
kfθ (x, σ) − ∇x log pσ (x|y)k22 (1)
 
propose PFGM++ as a new family of generative models
based on expanding augmented dimensions and show that where p(σ) defines a training distribution of noise levels,
symmetries involved enable us to define generative paths λ(σ) is a weighting function, p(y) is the data distribution,
simply based on the scalar norm of the augmented vari- and pσ (x|y) = N (0, σ 2 I) defines a Gaussian perturbation
ables (Sec 3.1); (2) We propose a perturbation-based objec- kernel which samples a noisy version x of the clean data y.
tive to dispense with any biased large batch derived electric Please refer to Table 1 in Karras et al. (2022) for specific
field targets, allowing unbiased training (Sec 3.2); (3) We instantiations of different diffusion models.
prove that the score field and the training objective of dif-
fusion models arise in the limit D→∞ (Sec 4); (4) We PFGM Inspired by the theory of electrostatics (Griffiths,
2005), Xu et al. (2022) propose Poisson flow generative
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

models (PFGM), which interpret the N -dimensional data to the inherent symmetry of the electric field. To improve the
x ∈ RN as electric charges in an N +1-dimensional space training process, we propose an efficient perturbation-based
augmented with an extra dimension z: x̃ = (x, z) ∈ RN +1 . objective for training PFGM++ (Sec 3.2) without relying on
In particular, the training data is placed on the z=0 hyper- the large batch approximation in the original PFGM.
plane, and the electric field lines emitted by the charges
define a bijection between the data distribution and a uni- 3.1. Electric field in N +D-dimensional space
form distribution on the infinite hemisphere of the aug-
mented space1 . To perform generative modeling, PFGM While PFGM (Xu et al., 2022) consider the electric field
learn the following high-dimensional electric field, which is in a N +1-dimensional augmented space, we augment the
the derivative of the electric potential in a Poisson equation: data x with D-dimensional variables z = (z1 , . . . , zD ), i.e.,
x̃ = (x, z) and D ∈ Z+ . Similar to the N +1-dimensional
1
Z
x̃ − ỹ electric field (Eq. (2)), the electric field at the augmented
E(x̃) = p(y)dy (2) data x̃ = (x, z) ∈ RN +D is:
SN (1) kx̃ − ỹkN +1
x̃ − ỹ
Z
1
where SN (1) is the surface area of a unit N -sphere (a ge- E(x̃) = p(y)dy (3)
SN +D−1 (1) kx̃ − ỹkN +D
ometric constant), and p(y) is the data distribution. Sam-
ples are then generated by following the electric field lines, Analogous to the theoretical results presented in PFGM,
which are described by the ODE dx̃ = E(x̃)dt. In prac- with the electric field as the drift term, the ODE dx̃=E(x̃)dt
tice, the network is trained to estimate a normalized ver- defines a surjection from a uniform distribution on an infinite
sion of the following empirical electric field: Ê(x̃) = N +D-dim hemisphere and the data distribution on the N -
Pn x̃−ỹi Pn 1
c(x̃) i=1 kx̃−ỹ ik
N +1 , where c(x̃) = 1/ i=1 kx̃−ỹi kN +1 dim z=0 hyperplane. However, the mapping has SO(D)
PD
and {ỹi }ni=1 ∼ p̃(ỹ) is a large batch used to approximate symmetry on the surface of D-dim cylinder i=1 zi2 = r2
the integral in Eq. (2). The training objective is minimizing for any positive r. We provide an illustrative example at
the `2 -loss between the neural model prediction fθ (x̃) and the bottom of Fig. 2 (D=2, N =1), where the electric flux
the normalized field E(x̃)/kE(x̃)k at various positions of x̃. emitted from a line segment (red) has rotational symmetry
These positions are heuristically designed to carefully cover through the ring area (blue) on the z12 + z22 = r2 cylinder.
the regions that the sampling trajectories pass through. Hence, instead of modeling the individual behavior of each
zi , it suffices to track the norm of augmented variables —
Phases of Score Field Xu et al. (2023) show that the score
r(x̃) = kzk2 — due to symmetry. Specifically, note that
field in the forward process of diffusion models can be
dzi = E(x̃)zi dt, and the time derivative of r is
decomposed into three phases. When moving from the
near field (Phase 1) to the far field (Phase 3), the perturbed D Z PD 2
data get influenced by more modes in the data distribution. dr X zi dzi i=1 zi
= = p(y)dy
They show that the posterior p0|σ (y|x) ∝ pσ (x|y)p(y) dt i=1
r dt SN +D−1 (1)rkx̃ − ỹkN +D
serves as a phase indicator, as it gradually evolves from 1
Z
r
a delta distribution to uniform distribution when shifting = p(y)dy
SN +D−1 (1) kx̃ − ỹkN +D
from Phase 1 to Phase 3. The relevant concepts of phases
have also been explored in Karras et al. (2022); Choi et al. Henceforth we replace the notation for augmented data with
(2022); Xiao et al. (2022). Similar to the PFGM training x̃ = (x, r) for simplicity. After the symmetry reduction, the
objective, Xu et al. (2023) approximates the score field by field to be modeled has a similar form as Eq. (3) except that
large batches to reduce the variance of training targets in the last D sub-components {E(x̃) D
R zi }i=1 are condensed into
Phase 2, where multiple data points exert comparable but a scalar E(x̃)r = SN +D−11 r
p(y)dy. There-
(1) kx̃−ỹkN +D
distinct influences on the scores. These observations inspire fore, we can use the physically meaningful r as the anchor
us to align the phases of different Ds in Sec 4. variable in the ODE dx/dr by change-of-variable:

3. PFGM++: A Novel Generative Framework dx dx dt E(x̃)x


= = E(x̃)x · E(x̃)−1
r = (4)
dr dt dr E(x̃)r
In this section, we present our new family of generative
models PFGM++, generalizing PFGM (Xu et al., 2022) in Indeed, the ODE dx/dr turns the aforementioned surjection
terms of the augmented space dimensionality. We show that into a bijection between an easy-to-sample prior distribution
the electric fields in N +D-dimensional space with D ∈ Z+ on the r=rmax hyper-cylinder2 and the data distribution on
still constitute a valid generative model (Sec 3.1). Fur- r=0 (i.e., z=0) hyperplane. The following theorem states
thermore, we show that the additional D-dimensional aug- the observation formally:
mented variable can be condensed into their scalar norm due 2
The hyper-cylinder here is consistent with the hemisphere in
1
In practice, the hemisphere is projected to a hyperplane PFGM (Xu et al., 2022), because hyper-cylinders degrade to hyper-
z=zmax , so that all samples have the initial z. planes for D = 1, which are in turn isomorphic to hemispheres.
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

objective to dispense with the large batch in PFGM. The


objective from PFGM paper (Xu et al., 2022) requires sam-
pling a large batch of data {yi }ni=1 ∼pn (y) in each training
step to approximate the integral in the electric field (Eq. (3)):

Ep̃train (x̃) E{yi }ni=1 ∼pn (y) Ex∼pσ (x|y1 )


" Pn−1 x̃−ỹi 2
#
i=0 kx̃−ỹi kN +D
fθ (x̃) − Pn−1 x̃−ỹi
i=0 kx̃−ỹi kN +D 2 +γ 2

where p̃train is heuristically designed to cover the regions


that the backward ODE traverses. This objective has several
obvious drawbacks: (1) The large batch incurs additional
overheads; (2) Its minimizer is a biased estimator of the
electric field (Eq. (3)); (3) The large batch is incompatible
with typical paired sample training of conditional generation,
where each condition is paired with only one sample, such
as text-to-image (Rombach et al., 2021; Saharia et al., 2022)
Figure 2. The augmented dimension D affects electric field lines and text-to-3D generation (Poole et al., 2022; Nichol et al.,
(gray), which connect charge/data on a line (purple) to latent 2022b).
space (green). When D = 1 (top) or D = 2 (bottom), electric field
lines map the same red line segment to a blue line segment or onto To remedy these issues, we propose a perturbation-based ob-
a blue ring, respectively. The mapping defined by electric lines has jective without the need for the large batch, while achieving
SO(2) symmetry on the surface of z12 + z22 = r2 cylinder. an unbiased minimizer and enabling paired sample train-
ing of conditional generation. Inspired by denoising score-
Theorem 3.1. Assume the data distribution p ∈ C 1 and matching (Vincent, 2011), we design the perturbation kernel
p has compact support. As rmax →∞, for D ∈ R+ , the to guarantee that the minimizer in the following square loss
ODE dx/dr = E(x̃)x /E(x̃)r defines a bijection between objective matches the ground-truth electric field in Eq. (3):
N +D
D
limrmax →∞ prmax (x) ∝ limrmax →∞ rmax /(kxk22 + rmax
2
) 2
Er∼p(r) Ep(y) Epr (x|y) kfθ (x̃) − (x̃ − ỹ)k22
 
when r = rmax and the data distribution p when r = 0. (5)

where r ∈ (0, ∞), p(r) is the training distribution over r,


Proof sketch. The r-dependentRintermediate distribution of
pr (x|y) is the perturbation kernel and ỹ=(y, 0)/x̃=(x, r)
the ODE (Eq. (4)) is pr (x)∝ rD /kx̃ − ỹkN +D p(y)dy,
are the clean/perturbed augmented data. The mini-
which satisfies initial/terminal conditions, i.e., pr=0 =p,
mizer of Eq. (5) is fθ∗ (x̃)∝ pr (x|y)(x̃ − ỹ)p(y)dy,
R
N +D
D
limrmax →∞ prmax ∝ limrmax →∞ rmax /(kxk22 + rmax
2
) 2 , as which R matches the direction of electric field
well as the continuity equation of the ODE, i.e., ∂r pr + ∇x · E(x̃)∝ (x̃ − ỹ)/kx̃ − ỹkN +D p(y)dy when setting the
(pr E(x̃)x /E(x̃)r ) = 0. N +D
perturbation kernel to pr (x|y)∝1/(kx − yk22 + r2 ) 2 .
We defer the formal proof to Appendix A.1. Note that in the Denoting the r-dependent intermediate marginal dis-
theorem we further extend the domain of D from positive in- R
tribution as pr (x)= pr (x|y)p(y)dy, the following
tegers to positive real numbers. In practice, the starting con- proposition states that the choice of pr (·|y) guarantee that
dition of the ODE is some sufficiently large rmax such that the minimizer of the square loss to match the direction of
N +D
prmax (x) ∝ D 2 2
∼ rmax /(kxk2 + rmax ) 2 . The terminal condi- the electric field:
tion is r= 0, which represents the generated samples reach-
ing the data support. The proposed PFGM++ framework Proposition 3.2. With perturbation kernel pr (x|y) ∝
N +D
thus permits choosing arbitrary D, including D = 1 which 1/(kx − yk22 + r2 ) 2 , for ∀x ∈ RN , r > 0, the mini-
recovers the original PFGM formulation. Interestingly, we mizer fθ∗ (x̃) in the PFGM++ objective (Eq. (5)) matches
will also show that when D→∞, PFGM++ recover the dif- the direction of electric field E(x̃) in Eq. (3). Specifically,
fusion models (Sec 4). In addition, as discussed in Sec 5, fθ∗ (x̃) = (SN +D−1 (1)/pr (x))E(x̃).
the choice of D is important, since it controls two properties
of the associated electric field, i.e., robustness and rigidity, We defer the proof to Appendix A.2. The proposition in-
which affect the sampling performance. dicates that the minimizer fθ∗ (x̃) can match the direction
of E(x̃) with sufficient data and model capacity. The cur-
3.2. New objective with Perturbation Kernel rent training target in Eq. (5) is the directional vector be-
tween the clean data ỹ and perturbed data x̃ akin to de-
Although the training process in PFGM can be directly ap- noising score-matching for diffusion models (Song et al.,
plied to PFGM++, we propose a more efficient training 2021b; Karras et al., 2022). In addition, the new objective
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

allows for conditional generations under a one-sample-per- limD→∞,r=σ√D pr (x|y) ∝ exp(−kx − yk22 /2σ 2 ) for
condition setup. Since the perturbation kernel is isotropic, ∀x, y ∈ RN +D :
we can decompose pr (·|y) in hyperspherical coordinates
to Uψ (ψ)pr (R), where Uψ is the uniform distribution over 1
lim √ N +D
the angle component and the distribution of the perturbed D→∞,r=σ D (kx − yk22 + r2 ) 2
radius R = kx − yk2 is (N +D) kx−yk2
∝ lim√ e− 2 ln(1+ r2 )
N −1 D→∞,r=σ D
R
pr (R) ∝ N +D (N +D)kx−yk2
2 kx−yk2 2
(R2 + r2 ) 2
= lim√ e− 2r 2 = e− 2σ 2 (7)
D→∞,r=σ D
We defer the practical sampling procedure of the perturba-
tion kernel to Appendix B. The meanp of the r-dependent The equivalence of√trajectories can be proven by change-of-
radius distribution pr (R) is around r N/D.√ Hence we variable dσ = dr/ D. Their prior distributions are also the
explicitly normalize the target in Eq. (5) by √ r/ D, to keep same since limD→∞ prmax =σmax √D (x) = N (0, σmax I).
the norm of the target around the constant N , similar to
diffusion models (Song et al., 2021b). In addition, we drop We defer the formal proof to Appendix A.3. Since kx −
the last dimension yk22 /r2 ≈ N/D when x ∼ pr (x), y ∼ p(y), Eq. (7) ap-
√ of √ the target because it is a constant —
(x̃ − ỹ)r /(r/ D) = D. Together, the new objective is proximately holds under the condition D  N . Remark-
ably, the theorem states that PFGM++ recover the field
and sampling of previous popular diffusion models, such
 
x−y 2
Er∼p(r) Ep(ỹ) Epr (x̃|ỹ) fθ (x̃) − √ (6) as VE/VP (Song & Ermon, 2020) and EDM (Karras et al.,
r/ D 2
2022), by choosing the appropriate schedule and scale func-
After training the neural network through objective Eq. (6), tion in Karras et al. (2022).
we can use the ODE (Eq. (4)) anchored by r to √
generate sam- In addition to the field and sampling equivalence, we demon-
ples, i.e., dx/dr = E(x̃)x /E(x̃)r = fθ (x̃)/ D, starting strate that the proposed PFGM++ objective (Eq. (6)) with
from the prior distribution prmax . N +D
perturbation kernel pr (x|y) ∝ 1/(kx − yk22 + r2 ) 2 re-
covers the weighted sum of the denoising score matching
4. Diffusion Models as D→∞ Special Cases objective (Vincent, 2011) for training continuous diffusion
model (Karras et al., 2022; Song et al., 2021b) when D→∞.
Diffusion models generate samples by simulating ODE/SDE All previous objectives for training diffusion models can be
involving the score function ∇x log pσ (x) at different inter- subsumed in the following form (Karras et al., 2022), under
mediate distributions pσ (Song et al., 2021b; Karras et al., different parameterizations of the neural networks fθ :
2022), where σ is the standard deviation of the Gaussian
kernel. In this section, we show that both sampling and
 
x−y 2
training schemes in diffusion models are equivalent to those Eσ∼p(σ) λ(σ)Ep(y) Epσ (x|y) fθ (x, σ) − (8)
σ 2
in D→∞ case under the PFGM++ framework. To begin
with, we show that the electric field (Eq. (3)) in PFGM++ where pσ (x|y) ∝ exp(−kx − yk22 /2σ 2 ). The ob-
has the same direction as the score function when D tends jective of the diffusion models resembles the one of
to infinity, and their sampling processes are also identical. PFGM++ (Eq. (6)). Indeed, we show that when D→∞, the
Theorem 4.1. Assume the data distribution p ∈ C 1 . Con-

minimizer of the√ proposed PFGM++ objective at x̃=(x, r)
sider taking the limit D → ∞ while holding σ = r/ D is fθ∗ (x, r = σ D)=σ∇x log pσ (x), the same as the √mini-
fixed. Then, for all x, mizer of diffusion objective at the noise level σ=r/ D.

√ Proposition 4.2. When r = σ D, D → ∞, the minimizer
D
lim E(x̃)x − σ∇x log pσ=r/√D (x) = 0 in the PFGM++ objective (Eq. (6)) is equaivalent to the
D→∞√ E(x̃)r 2 minimizer in the weighted sum of denoising score matching
r=σ D
objective (Eq. (8))
where E(x̃ = (x, r))x is given in Eq. (3). Fur-
ther, given the same initial point, the trajectory of We defer the proof to Appendix A.4. The proposition states
the PFGM++ ODE (dx/dr=E(x̃)x /E(x̃)r ) matches that the training objective of diffusion models is essentially
the diffusion ODE (Karras et al., 2022) (dx/dt= − the same as PFGM++’s when D→∞. Combined with The-
σ̇(t)σ(t)∇x log pσ(t) (x)) in the same limit. orem 4.1, PFGM++ thus recover both the training and sam-
pling processes of diffusion models when D→∞.
Proof sketch. By re-expressing the x component E(x̃)x
in the electric field and the score ∇x log pσ in dif- Transfer hyperparameters to finite Ds The training
fusion models, the proof boils down to show that hyperparameters of diffusion models (D→∞) have been
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models
No alignment r = D alignment
1.0 D = 24 1.0 D = 24 5.1. Behavior of perturbation kernel when varying D
0.8 D = 28 D = 28
0.8
Mean TVD

D = 212 D = 212 According to Theorem 4.1, when D→∞, the field in


0.6 D = 216 D = 216
D = 220 0.6 D = 220
0.4
PFGM++√ has the same direction as the score function,
0.4 i.e., DE(x̃)x /E(x̃)r =σ∇x log pσ=r/√D (x). In addi-
0.2
0.2 tion to the theoretical analysis, we provide further em-
0.0
0 20000 40000 60000 80000 0 20 40 60 80 pirical study to characterize the convergence towards
r diffusion models as D increases. Fig. 4(a) reports

(a) No alignment (b) r = σ D alignment the average `√ 2 difference between the two quantities,
i.e., Epσ (x) [k DE(x̃)x /E(x̃)r −σ∇x log pσ (x)k2 ] with
Figure 3. Mean TVD between the posterior p0|r (·|x) (x is per- √
turbed sample) and√the uniform prior, w/o (a) and w/ (b) the phase r=σ D. We observe that the difference monotonically
alignment (r = σ D). decreases as a function of D, and converges to 0 as pre-
dicted by theory. For σ=1, the distance remains 0 since
the empirical posterior p0|r concentrates around a single
highly optimized through a series of works (Ho et al., example for all D.
2020; Song et al., 2021b; Karras et al., 2022). It mo-
tivates us to transfer hyperparameters, such as rmax and Next, we examine the behavior of the perturbation kernel
p(r), of D→∞ to finite Ds. Here we present an align- after the phase alignment. Recall that the isotropic per-
N +D
ment method that enables a “zero-shot” transfer of hyper- turbation kernel pr (x|y) ∝ 1/(kx − yk22 + r2 ) 2 can
parameters across different Ds. Our alignment method is be decomposed into a uniform angle component and a ra-
N +D
inspired by the concept of phases in Xu et al. (2023), which dius distribution pr (R) ∝ RN −1 /(R2 + r2 ) 2 . Fig. 4(b)
is related to the variation of training targets. We aim to shows the variance of the radius distribution significantly
align the intermediate marginal distributions pr for two decreases as D increases. The results imply that with rel-
distinct D1 , D2√> 0. In Appendix C.1, we demonstrate atively large r, the norm of the training sample in pr (x)
that when r ∝ D, the phase of the intermediate distribu- becomes increasingly concentrated around a specific value
tion pr is approximately invariant to all D > p0 (including as D increases, reaching its highest level of concentration as
D→∞). In other words, when rD1 /rD2 = D1 /D2 , the D→∞ (diffusion models). Fig. 4(c) further shows the den-
phases of prD1 and prD2 , under D1 and D2 respectively, are sity of training sample norms in pr=σ√D (x) on CIFAR-10.
roughly
√ aligned. Theorem 4.1 further shows that the relation We can see that the range of the high-mass region gradually
r=σ D makes PFGM++ equivalent √ to diffusion models shrinks when D increases.
when D→∞. Together, the r=σ D formula aligns the
phases of pσ in diffusion models and pr=σ√D in PFGM++ 5.2. Balancing the trade-off by controlling D
for ∀D>0. Such alignment enables directly transferring
the finely tuned hyperparameters σmax , p(σ) in previous As noted in Xu et al. (2022), diffusion models (D→∞)
state-of-the-art
√ diffusion models √ (Karras
√ et al., 2022) with are more susceptible to estimation errors compared to
rmax =σmax D, p(r)=p(σ=r/ D)/ D. We put the prac- PFGM (D=1) due to the strong correlation between σ
tical hyperparameter transfer procedures in Appendix C.2. and the training sample norm, as demonstrated in Fig. 4(c).
When D and r are large, the marginal distribution p pr (x) is
We empirically verify the alignment formula on the CIFAR-
approximately supported on the sphere with radius r N/D.
10 (Krizhevsky, 2009). Xu et al. (2023) shows that the pos-
The backward ODE can lead to unexpected results if the
terior p0|r (y|x) ∝ pr (x|y)p(y) gradually grows towards
sampling trajectories deviate from this norm-r relation
a uniform distribution from the near to the far field. As a
present in training samples. This phenomenon was em-
result, the mean total variational distance (TVD) between a
pirically confirmed by Xu et al. (2022) for PFGM/diffusion
uniform distribution and the posterior serves as an indicator
models (D=1 and D→∞ cases) using a weaker architec-
of the phase of pr : Epr (x) TVD U (·) k p0|r (·|x) . Fig. 3
√ ture NCSNv2 (Song & Ermon, 2020), where PFGM was
reports the mean TVD before and after the r=σ D align- shown to be significantly more robust than diffusion models.
ment. We observe that the mean TVDs of a wide range of
Ds take similar values after the alignment, suggesting that Smaller D, however, implies a heavy-tailed input distribu-
the phases of pr=σ√D are roughly aligned for different Ds. tion. Fig. 4(c) illustrates that the examples used as the input
to the neural network have a broader range of norms when D
is small. In particular, when D<25 , the variance of pertur-
5. Balancing Robustness and Rigidity bation radius can be larger than 210 (Fig. 4(b)). This broader
In this section, we first delve into the behaviors of PFGM++ input range can be challenging for any finite-capacity neural
with different Ds (Sec 5.1) based on the alignment formula. network. Although Xu et al. (2022) introduced heuristics
Then we demonstrate how to leverage D to balance the to bypass this issue in the D=1 case, e.g., restricting the
robustness and rigidity of models (Sec 5.2). We defer all sampling/training regions, these heuristics also prevent the
experimental details in this section to Appendix D.1. sampling process from faithfully recovering the data distri-
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models
D = 22
D = 26
1e 1 D = 210
=1 D = 214
Average 2 difference

1.50 =1 15 D = 218
1.25 =5 =5

log2Varpr(R)

D (||x||2 )
= 20 = 20
1.00 = 80 10 = 80
0.75

pr =
0.50 5
0.25 80

0.00 0
0
1e3 1
20 25 210 215 220 25 210 215 2 40

D D
3
||x|4|2 5
6 20
7
8 10
(a) (b) (c)
Figure 4. (a) Average `2 difference between scaled electric field and score function, versus D. (b) Log-variance of radius distribution
versus D. (c) Density of radius distributions pr=σ√D (R) with varying σ and D.

bution. Heun’s 2nd method (Ascher & Petzold, 1998) as in EDM.


Thus, we can view D as a parameter to optimize so as to
balance the robustness of generation against rigidity that Table 1. CIFAR-10 sample quality (FID) and number of function
helps learning. Increased robustness allows practitioners to evaluations (NFE).
use smaller neural networks, e.g., by applying post-training Min FID ↓ Top-3 Avg FID ↓ NFE ↓
quantization (Han et al., 2015; Banner et al., 2018). In DDPM (Ho et al., 2020) 3.17 - 1000
other words, smaller D allows for more aggressive quantiza- DDIM (Song et al., 2021a) 4.67 - 50
VE-ODE (Song et al., 2021b) 5.29 - 194
tion/larger sampling step sizes/smaller architectures. These VP-ODE (Song et al., 2021b) 2.86 - 134
can be crucial in real-world applications where computa- PFGM (Xu et al., 2022) 2.48 - 104

tional resources and storage are limited. On the other hand, PFGM++ (unconditional)
such gains need to be balanced against easier training af- D = 64 1.96 1.98 35
D = 128 1.92 1.94 35
forded by larger values of D. The ability to optimize the D = 2048 1.91 1.93 35
balance by varying D can be therefore advantageous. We D = 3072000 1.99 2.02 35
D → ∞ (Karras et al., 2022) 1.98 2.00 35
expect that there exists a sweet spot of D in the middle
striking the balance, as the model robustness and rigidity go PFGM++ (class-conditional)

in opposite directions. D = 2048 1.74 - 35


D → ∞ (Karras et al., 2022) 1.79 - 35

6. Experiments
In this section, we assess the performance of different gen- Table 2. FFHQ sample quality (FID) with 79 NFE in unconditional
erative models on image generation tasks (Sec 6.1), where setting
models with some median Ds outperform previous state- Min FID ↓ Top-3 Avg FID ↓
of-the-art diffusion models (D→∞), consistent with the
D = 128 2.43 2.48
sweet spot argument in Sec 5. We also demonstrate the D = 2048 2.46 2.47
improved robustness against three kinds of error as D de- D = 3072000 2.49 2.52
creases (Sec 6.2). D → ∞ (Karras et al., 2022) 2.53 2.54

6.1. Image generation We compare models trained with D→∞ (EDM) and
We consider the widely used benchmarks CIFAR-10 D∈{64, 128, 2048, 3072000}. In our experiments, we ex-
32×32 (Krizhevsky, 2009) and FFHQ 64×64 (Karras et al., clude the case of D=1 (PFGM) because the perturbation
2018) for image generation. For training, we utilize the im- kernel is extremely heavy-tailed (Fig. 4(b)), making it diffi-
proved NCSN++/DDPM++ architectures, preconditioning cult to integrate with our perturbation-based objective with-
techniques and hyperparameters from the state-of-the-art out the restrictive region heuristics proposed in Xu et al.
diffusion model EDM (Karras et al., 2022). Specifically, (2022). We also exclude the small D = 64 for the higher-
we use the alignment method developed in Sec 4 to transfer resolution dataset FFHQ. We include several popular gen-
their tuned critical hyperparameters σmax , σmin , p(σ) in the erative models for reference and defer more training and
D→∞ case to finite D cases. According to the experimen- sampling details to Appendix D.
tal results in Karras et al. (2018), the log-normal training Results: In Table 1 and Table 2, we report the sample qual-
distribution p(σ) has the most substantial impact on the final ity measured by the FID score (Heusel et al., 2017) (lower
performances. For ODE solver during sampling, we use is better), and inference speed measured by the number
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

of function evaluations. As in EDM, we report the min- In addition to the controlled scenario, we conduct two more
imum FID score over checkpoints. Since we empirically realistic experiments: (1) We introduce more estimation
observe a large variation of FID scores on FFHQ across error of neural networks by applying post-training quan-
checkpoints (Appendix D.4), we also use the average FID tization (Sung et al., 2015), which can directly compress
score over the Top-3 checkpoints as another metric. Our neural networks without fine-tuning. Table 3 reports the FID
main findings are (1) Median Ds outperform diffusion score with varying quantization bit-widths for the convolu-
models (D→∞) under PFGM++ framework. We ob- tion weight values. We can see that finite Ds have better
serve that the D=2048/128 cases achieve the best perfor- robustness than the infinite case, and a lower D exhibits
mance among our choices on CIFAR-10 and FFHQ, with a larger performance gain when applying lower bit-widths
a current state-of-the-art min FID score of 1.91/2.43 in quantization. (2) We increase the discretization error dur-
unconditional setting, using the perturbation-based objec- ing sampling by using smaller NFEs, i.e., larger sample
tive. In addition, the D=2048 case obtain better Top-3 aver- steps. As shown in Fig. 5(b), gaps between D=128 and
age FID scores (1.93/2.47) than EDM (2.00/2.54) on both diffusion models gradually widen, indicating greater robust-
datasets in unconditional setting, and achieve current state- ness against the discretization error. The rigidity issue of
of-the-art FID score of 1.74 on CIFAR-10 class-conditional smaller D also affects the robustness to discretization error,
setting. (2) There is a sweet spot between (1, ∞). Nei- as D=64 is consistently inferior to D=128.
ther small D nor infinite D obtains the best performance,
which confirms that there is a sweet spot in the middle,
well-balancing rigidity and robustness. (3) Model with Table 3. FID score versus quantization bit-widths on CIFAR-10.
DN recovers diffusion models. We find that model Quantization bits: 9 8 7 6 5
with sufficiently large D roughly matches the performance D = 64 1.96 1.96 2.12 2.94 28.50
of diffusion models, as predicted by the theory. Further re- D = 128 1.93 1.97 2.15 3.68 34.26
sults in Appendix E.1 show that D=3072000 and diffusion D = 2048 1.91 1.97 2.12 5.67 47.02
D →∞ 1.97 2.04 2.16 5.91 50.09
models obtain the same FID score when using a more stable
training target (Xu et al., 2023) to mitigate the variations
between different runs and checkpoints.
7. Conclusion and Future Directions
6.2. Model robustness versus D We present a new family of physics-inspired generative
models called PFGM++, by extending the dimensionality
300
D = 64 2.8 D = 64 of augmented variable in PFGM from 1 to D ∈ R+ . Re-
250 D = 128 D = 128
D = 2048 2.6 D = 2048 markably, PFGM++ includes diffusion models as special
FID Score

200
D (Diffusion) D (Diffusion) cases when D→∞. To address issues related to large batch
150 2.4
100 2.2 training, we propose a perturbation-based objective. In addi-
50
2.0
tion, we show that D effectively controls the robustness and
0 rigidity in the PFGM++ family. Empirical results show that
0.0 0.1 0.2 0.3 0.4 20 25 30 35
NFE models with finite values of D can perform better than previ-
ous state-of-the-art diffusion models, while also exhibiting
Figure 5. FID score versus (left) α and (right) NFE on CIFAR-10. improved robustness.

In Section 5, we show that the model robustness degrades There are many potential avenues for future research in the
with an increasing D by analyzing the behavior of pertur- PFGM++ framework. For example, it may be possible to
bation kernels. To further validate the phenomenon, we identify the “sweet spot” value of D for different architec-
conduct three sets of experiments with different sources of tures and tasks by analyzing the behavior of errors. Since
errors on CIFAR-10. We defer more details to Appedix D.5. PFGM++ enables adjusting robustness, another direction is
Firstly, we perform controlled experiments to compare the to apply aggressive network compression techniques, i.e.,
robustness of models quantitatively. To simulate the errors, pruning and low-bit training, to smaller D. Furthermore,
we inject noise into the intermediate point xr in each√of the there may be opportunities to develop stochastic samplers
35 ODE steps: xr = xr + αr where r ∼ N (0, r/ DI), for PFGM++, with the reverse SDE in diffusion models as
and α is a positive number controlling the amount of a special case. Lastly, as diffusion models have been highly
noise. Fig. 5(a) demonstrates that as α increases, FID optimized for image generation, the PFGM++ framework
score exhibits a much slower degradation for smaller D. may show a greater advantage over its special case (diffusion
In particular, when D=64, 128, the sample quality degrades models) in emergent fields, such as biology data.
gracefully. We further visualize the generated samples in
Appendix E.2. It shows that when α=0.2, models with
D=64, 128 can still produce clean images while the sam-
pling process of diffusion models (D→∞) breaks down.
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Acknowledgements Karras, T., Laine, S., and Aila, T. A style-based generator


architecture for generative adversarial networks. 2019
YX and TJ acknowledge support from MIT-DSTA Singa- IEEE/CVF Conference on Computer Vision and Pattern
pore collaboration, from NSF Expeditions grant (award Recognition (CVPR), pp. 4396–4405, 2018.
1918839) “Understanding the World Through Code”, and
from MIT-IBM Grand Challenge project. ZL and MT would Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating
like to thank the Center for Brains, Minds, and Machines the design space of diffusion-based generative models.
(CBMM) for hospitality. ZL and MT are supported by The ArXiv, abs/2206.00364, 2022.
Casey and Family Foundation, the Foundational Questions
Institute, the Rothberg Family Fund for Cognitive Science Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B.
and IAIFI through NSF grant PHY-2019786. ST and TJ Diffwave: A versatile diffusion model for audio synthesis.
also acknowledge support from the ML for Pharmaceutical ArXiv, abs/2009.09761, 2020.
Discovery and Synthesis Consortium (MLPDS). Krizhevsky, A. Learning multiple layers of features from
tiny images. 2009.
References Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin,
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gen- P., McGrew, B., Sutskever, I., and Chen, M. Glide: To-
erative adversarial networks. In International Conference wards photorealistic image generation and editing with
on Machine Learning, 2017. text-guided diffusion models. In ICML, 2022a.

Ascher, U. M. and Petzold, L. R. Computer methods for Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., and Chen, M.
ordinary differential equations and differential-algebraic Point-e: A system for generating 3d point clouds from
equations. 1998. complex prompts. ArXiv, abs/2212.08751, 2022b.

Banner, R., Nahshan, Y., and Soudry, D. Post training Poole, B., Jain, A., Barron, J. T., and Mildenhall, B.
4-bit quantization of convolutional networks for rapid- Dreamfusion: Text-to-3d using 2d diffusion. ArXiv,
deployment. In Neural Information Processing Systems, abs/2209.14988, 2022.
2018. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen,
M. Hierarchical text-conditional image generation with
Brock, A., Donahue, J., and Simonyan, K. Large scale gan
clip latents. ArXiv, abs/2204.06125, 2022.
training for high fidelity natural image synthesis. ArXiv,
abs/1809.11096, 2019. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. High-resolution image synthesis with la-
Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and
tent diffusion models. 2022 IEEE/CVF Conference on
Chan, W. Wavegrad: Estimating gradients for waveform
Computer Vision and Pattern Recognition (CVPR), pp.
generation. ArXiv, abs/2009.00713, 2020.
10674–10685, 2021.
Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., and Yoon, S. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Den-
Perception prioritized training of diffusion models. In ton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mah-
Proceedings of the IEEE/CVF Conference on Computer davi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet,
Vision and Pattern Recognition, pp. 11472–11481, 2022. D. J., and Norouzi, M. Photorealistic text-to-image dif-
fusion models with deep language understanding. ArXiv,
Griffiths, D. J. Introduction to electrodynamics, 2005.
abs/2205.11487, 2022.
Han, S., Mao, H., and Dally, W. J. Deep compression: Shi, C., Luo, S., Xu, M., and Tang, J. Learning gradient
Compressing deep neural network with pruning, trained fields for molecular conformation generation. In ICML,
quantization and huffman coding. arXiv: Computer Vi- 2021.
sion and Pattern Recognition, 2015.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Ganguli, S. Deep unsupervised learning using nonequi-
Hochreiter, S. Gans trained by a two time-scale update librium thermodynamics. In International Conference on
rule converge to a local nash equilibrium. In NIPS, 2017. Machine Learning, pp. 2256–2265. PMLR, 2015.
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion proba- Song, J., Meng, C., and Ermon, S. Denoising diffusion
bilistic models. ArXiv, abs/2006.11239, 2020. implicit models. ArXiv, abs/2010.02502, 2021a.
Jarzynski, C. Equilibrium free-energy differences from Song, Y. and Ermon, S. Improved techniques for training
nonequilibrium measurements: A master-equation ap- score-based generative models. ArXiv, abs/2006.09011,
proach. Physical Review E, 56:5018–5035, 1997. 2020.
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A.,


Ermon, S., and Poole, B. Score-based generative mod-
eling through stochastic differential equations. ArXiv,
abs/2011.13456, 2021b.
Sung, W., Shin, S., and Hwang, K. Resiliency of deep neural
networks under quantization. ArXiv, abs/1511.06488,
2015.
Vincent, P. A connection between score matching and de-
noising autoencoders. Neural Computation, 23:1661–
1674, 2011.
Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L.,
Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte,
R. J., Milles, L. F., Wicky, B. I. M., Hanikel, N., Pellock,
S. J., Courbet, A., Sheffler, W., Wang, J., Venkatesh, P.,
Sappington, I., Torres, S. V., Lauko, A., Bortoli, V. D.,
Mathieu, E., Barzilay, R., Jaakkola, T., DiMaio, F., Baek,
M., and Baker, D. Broadly applicable and accurate pro-
tein design by integrating structure prediction networks
and diffusion generative models. bioRxiv, 2022.
Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative
learning trilemma with denoising diffusion GANs. In
International Conference on Learning Representations,
2022. URL https://openreview.net/forum?
id=JprM0p-q0Co.
Xu, Y., Liu, Z., Tegmark, M., and Jaakkola, T. Poisson flow
generative models. ArXiv, abs/2209.11178, 2022.
Xu, Y., Tong, S., and Jaakkola, T. Stable target field for
reduced variance score estimation in diffusion models.
ArXiv, abs/2302.00670, 2023.
Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O.,
Fidler, S., and Kreis, K. Lion: Latent point diffusion
models for 3d shape generation. ArXiv, abs/2210.06978,
2022.
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Appendix
A. Proofs
A.1. Proof of Theorem 3.1
Theorem 3.1. Assume the data distribution p ∈ C 1 and p has compact support. As rmax →∞, for D ∈ R+ , the ODE
N +D
D
dx/dr = E(x̃)x /E(x̃)r defines a bijection between limrmax →∞ prmax (x) ∝ limrmax →∞ rmax /(kxk22 + rmax
2
) 2 when
r = rmax and the data distribution p when r = 0.

Proof. Let qr (x) ∝ rD /kx̃ − ỹkN +D p(y)dy. We will show that qr ∝ rD /kx̃ − ỹkN +D p(y)dy is equal to
R R
the r-dependent marginal distribution pr by verifying (1) the starting distribution is correct when r=0; (2) the con-
tinuity equation holds, i.e., ∂r qr + ∇x · (qr E(x̃)x /E(x̃)r ) = 0. The starting distribution is limr→0 qr (x) ∝
limr→0 rD /kx̃ − ỹkN +D p(y)dy ∝ p(x), which confirms that qr =p. The continuity equation can be expressed as:
R

∂r qr + ∇x · (qr E(x̃)x /E(x̃)r )


x̃−ỹ
R !
Z
rD
 Z
rD kx̃−ỹkN +D
p(y)dy
= ∂r p(y)dy + ∇x · p(y)dy R r
kx̃ − ỹkN +D kx̃ − ỹkN +D kx̃−ỹkN +D
p(y)dy
DrD−1
Z    
x̃ − ỹ
Z
(N + D)r D−1
= − p(y)dy + ∇x · r p(y)dy
kx̃ − ỹkN +D kx̃ − ỹkN +D−2 kx̃ − ỹkN +D
DrD−1
Z    
x̃ − ỹ
Z
(N + D)r D−1
= − p(y)dy + ∇x · r p(y)dy
kx̃ − ỹkN +D kx̃ − ỹkN +D−2 kx̃ − ỹkN +D
DrD−1
Z  
(N + D)r
= N +D
− p(y)dy
kx̃ − ỹk kx̃ − ỹkN +D−2
N Z
D−1
X kx̃ − ỹkN +D − kx̃ − ỹkN +D−2 (xi − yi )2 (N + D)
+r p(y)dy
i=1
kx̃ − ỹk2(N +D)
DrD−1 (N + D)rD+1
Z  
= − p(y)dy
kx̃ − ỹkN +D kx̃ − ỹkN +D−2
N kx̃ − ỹkN +D − kx̃ − ỹkN +D−2 kx − yk2 (N + D)
Z
+ rD−1 p(y)dy
kx̃ − ỹk2(N +D)
kx̃−ỹkN +D D − (N +D)r2 kx̃ − ỹkN +D−2 + N kx̃−ỹkN +D − kx̃−ỹkN +D−2 kx−yk2 (N +D)
Z
= rD−1 p(y)dy
kx̃−ỹk2(N +D)
(N + D)(kx̃ − ỹkN +D − kx̃ − ỹkN +D−2 kx − yk2 ) − (N + D)r2 kx̃ − ỹkN +D−2
Z
= rD−1 p(y)dy
kx̃ − ỹk2(N +D)
(N + D)r2 kx̃ − ỹkN +D−2 − (N + D)r2 kx̃ − ỹkN +D−2
Z
= rD−1 p(y)dy
kx̃ − ỹk2(N +D)
=0

It means that qr satisfies the continuity equation for any r ∈ R≥0 . Together, we conclude that qr = pr . Lastly, note that the
terminal distribution is
Z D Z D
rmax rmax
lim prmax (x) ∝ lim N +D
p(y)dy = lim N +D p(y)dy
rmax →∞ rmax →∞ kx̃ − ỹk rmax →∞ (kx − yk2 + rmax
2 ) 2
!
D Z D D
rmax rmax rmax
= lim N +D + lim N +D − N +D p(y)dy
rmax →∞ (kxk2 + r 2 ) 2 rmax →∞ (kx − yk2 + rmax
2 ) 2 (kxk2 + rmax
2 ) 2
max
D
rmax
= lim N +D (p has a compact support)
rmax →∞ (kxk2 + rmax
2 ) 2
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

A.2. Proof of Theorem 3.2


N +D
Proposition A.1. With perturbation kernel pr (x|y) ∝ 1/(kx − yk22 + r2 ) 2 , for ∀x ∈ RN , r > 0, the minimizer
fθ∗ (x̃) in the PFGM++ objective (Eq. (5)) matches the direction of electric field E(x̃) in Eq. (3). Specifically, fθ∗ (x̃) =
(SN +D−1 (1)/pr (x))E(x̃).

Proof. The minimizer at x̃ in Eq. (5) is


R
pr (x|y)(x̃ − ỹ)p(y)dy
Z
fθ∗ (x̃) = pr (y|x)(x̃ − ỹ)dỹ = (9)
pr (x)

The choice of perturbation kernel is

1 1
pr (x|y) ∝ = N +D
kx̃ − ỹkN +D 2
(kx − yk2 + r2 ) 2

By substituting the perturbation kernel in Eq. (9), we have:


x̃−ỹ
R
N +D p(y)dy
(kx−yk22 +r 2 ) 2
fθ∗ (x̃) =
pr (x)
x̃−ỹ
R
kx̃−ỹk2 N +D
p(y)dy
=
pr (x)
= (SN +D−1 (1)/pr (x))E(x̃)

A.3. Proof of Theorem 4.1



Theorem 4.1. Assume the data distribution p ∈ C 1 . Consider taking the limit D → ∞ while holding σ = r/ D fixed.
Then, for all x,

D
lim E(x̃)x − σ∇x log pσ=r/√D (x) =0
D→∞
√ E(x̃)r 2
r=σ D

where E(x̃ = (x, r))x is given in Eq. (3). Further, given the same initial point, the trajectory of the PFGM++
ODE (dx/dr=E(x̃)x /E(x̃)r ) matches the diffusion ODE (Karras et al., 2022) (dx/dt= − σ̇(t)σ(t)∇x log pσ(t) (x))
in the same limit.

Proof. The x component in the Poisson field can be re-expressed as

x−y
Z
1
E(x̃)x = p(y)dy
SN +D−1 (1) kx̃ − ỹkN +D
Z
∝ pr (x|y)(x − y)p(y)dy

N +D
where the perturbation kernel pr (x|y) ∝ 1/(kx − yk22 + r2 ) 2 . The direction of the score can also be written down in a
similar form:

pσ (x|y) y−x
R
σ 2 p(y)dy
Z
∇x log pσ (x) = ∝ pσ (x|y)(x − y)p(y)dy
pσ (x)

kx−yk2
pσ (x|y) ∝ exp − 2σ2 2 . Since p ∈ C 1 , and obviously pr (x|y) ∈ C 1 , then limD→∞ pr (x|y)(x − y)p(y)dy =
R
where
R
limD→∞ pr (x|y)(x − y)p(y)dy. It suffices to prove that the perturbation kernel pr (x|y) point-wisely converge to the
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Gaussian kernel pσ (x|y), i.e., limD→∞ pr (x|y) = pσ (x|y), to ensure E(x)x ∝ ∇x log pσ (x). Given ∀x, y ∈ RN ,

1
lim pr (x|y) ∝ lim N +D
D→∞ D→∞ (kx − yk22 + r2 ) 2

N +D
= lim (kx − yk22 + r2 )− 2
D→∞
kx − yk22 − N +D
∝ lim (1 + ) 2
D→∞ r2
kx − yk22 − N +D √
= lim (1 + ) 2 (r = σ D)
D→∞ Dσ 2
kx − yk22
 
N +D
= lim exp − ln(1 + )
D→∞ 2 Dσ 2
N + D kx − yk22
 
kx−yk22
= lim exp − ( limD→∞ Dσ 2 = 0)
D→∞ 2 Dσ 2
2
kx − yk2
= exp −
2σ 2
∝ pσ (x|y)

Hence limD→∞ pr (x|y) = pσ (x|y), and we establish that E(x̃)x ∝ ∇x log pσ (x). We can rewrite the drift term in the
PFGM++ ODE as


R
D pr (x|y)(x − y)p(y)dy
lim DE(x̃)x /E(x̃)r = lim R
D→∞
√ D→∞
√ pr (x|y)(−r)p(y)dy
r=σ D r=σ D
√ R
D pr (x|y)(y − x)p(y)dy
= lim
D→∞
√ rpr (x)
r=σ D
√ R
D pσ (x|y)(y − x)p(y)dy
= lim
D→∞
√ rpσ (x)
r=σ D

pσ (x|y) y−x
R
σ 2 p(y)dy
= σ∇x log pσ (x) (∇x log pσ (x) = ) (10)
pσ (x)

which establishes the first part of the theorem. For the second part, by the change-of-variable dσ = dr/ D, the PFGM++
ODE is

dx dx dr
lim = ·
D→∞
√ dσ dr dσ
r=σ D

= lim E(x̃)x · E(x̃)−1
r · D
D→∞

r=σ D
σ∇x log pσ (x) √
= lim √ · D (by Eq. (10))
D→∞
√ D
r=σ D
= σ∇x log pσ (x)

which is equivalent to the diffusion ODE.

A.4. Proof of Proposition 4.2



Proposition A.2. When r = σ D, D → ∞, the minimizer in the PFGM++ objective (Eq. (6)) is equaivalent to the
minimizer in the weighted sum of denoising score matching objective (Eq. (8))
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

Proof. For ∀x ∈ RN , the minimizer in PFGM++ objective (Eq. (6)) at point x̃ = (x, r) is

x−y
R
pr (x|y) r/√ p(y)dy
D

fθ,PFGM++ (x̃) = lim
D→∞
√ pr (x)
r=σ D
x−y
R
pσ (x|y) r/√ p(y)dy
D
= lim (By Theorem 4.1, limD→∞ pr (x|y) = pσ (x|y))
D→∞
√ pσ (x)
r=σ D

pσ (x|y) x−y
R
σ p(y)dy
= (11)
pσ (x)


On the other hand, the minimizer in denoising score matching at point x in noise level σ = r/ N + D is

pσ (x|y) x−y
R
∗ σ p(y)dy
fθ,DSM (x, σ) = (12)
pσ (x)

Combining Eq. (11) and Eq. (12), we have


√ ∗
lim fθ,PFGM++ (x, σ N + D) = fθ,DSM (x, σ)
D→∞

r=σ D

B. Practical Sampling Procedures of Perturbation Kernel and Prior Distribution


N +D
In this section, we discuss how to simple from the perturbation kernel pr (x|y) ∝ 1/(kx − yk22 + r2 ) 2 in practice. We
first decompose pr (·|y) in hyperspherical coordinates to Uψ (ψ)pr (R), where Uψ is the uniform distribution over the angle
component and the distribution of the perturbed radius R = kx − yk2 is

RN −1
pr (R) ∝ N +D (13)
(R2 + r2 ) 2

The sampling procedure of the radius distribution encompasses three steps:

N D
R1 ∼ Beta(α = ,β = )
2 2
R1
R2 =
1 − R1
p
R3 = r2 R2

Next, we prove that p(R3 ) = pr (R3 ). Note that the pdf of the inverse beta distribution is

N
−1 N +D
p(R2 ) ∝ R22 (1 + R2 )− 2
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models
p
By change-of-variable, the pdf of R3 = 2
rmax R2 is

N
−1 N D 2R3
p(R3 ) ∝ R22 (1 + R2 )− 2 − 2 ∗ 2
rmax
N
−1
R3 R22
∝ N +D
(1 + R2 ) 2
(R3 /r)N −1
= N +D
(1 + (R32 /r2 )) 2

R3N −1
∝ N +D
(1 + (R32 /r2 )) 2

R3N −1
∝ N +D ∝ pr (R3 ) (By Eq. (13))
(r2 + R32 ) 2

Note that R1 has mean NN ND


+D and variance O( (N +D)3 ). Hence when D = O(N ), pr (R) would highly concentrate on a
specific value, resolving the heavy-tailed problem. We can sample the uniform angel component by u = w/kwk, w ∼
N (0, IN ×N ). Together, sampling from the perturbation kernel pr (x|y) is equivalent to setting x = y + R3 u. On the other
hand, the prior distribution is

Z
N +D
D
prmax (x) ∝ lim rmax /kx̃ − ỹkN +D p(y)dy = D
lim rmax /(kxk2 + rmax
2
) 2
rmax →∞ rmax →∞

We observe that prmax (x) the same as the perturbation kernel prmax (x|y = 0). Hence we can sample from the prior following
x = R3 u with R3 , u defined above and r = rmax .

C. r = σ D for Phase Alignment
C.1. Analysis
In this section, we examine the phase of the intermediate marginal distribution p_r under different D to derive an alignment method for hyper-parameters. Consider an N-dimensional dataset D in which the average distance to the nearest neighbor is about l. We consider an arbitrary datapoint x₁ ∈ D, denote its nearest neighbor as x₂, assume ‖x₁ − x₂‖₂ = l, and place a uniform prior on D.

To characterize the phases of p_r, ∀r > 0, we study a perturbed point y ∼ p_r(y|x₁). According to Appendix B, the distance ‖x₁ − y‖ is roughly r√(N/(D−1)). Since p_r(y|x₁) is isotropic, with high probability the two vectors y − x₁ and x₂ − x₁ are approximately orthogonal. In particular, the inner product (y − x₁)ᵀ(x₁ − x₂) = O(‖y − x₁‖‖x₁ − x₂‖/√N) = O(rl/√D) w.h.p. It follows that ‖y − x₂‖ = √(l² + r²N/(D−1) + O(rl/√D)). Fig. 6 depicts the relative positions of x₁, x₂ and the perturbed point y.

The ratio of the posteriors of x₂ and x₁, i.e., p_r(x₂|y)/p_r(x₁|y), is an indicator of the different phases of the field (Xu et al., 2023): a point in the nearer field tends to have a smaller ratio. Indeed, the ratio gradually decays from 1 to 0 when moving from the far field to the near field.

Figure 6. Illustration of the phase alignment analysis.

We can calculate the ratio of the coefficients after approximating the distance ‖y − x₂‖:

    p_r(x₂|y)/p_r(x₁|y) = p_r(y|x₂)/p_r(y|x₁)
        = ( (l² + r²N/(D−1) + O(rl/√D) + r²) / (r²N/(D−1) + r²) )^{−(N+D)/2}
        = ( 1 + (l² + O(rl/√D)) / (r²N/(D−1) + r²) )^{−(N+D)/2}
        = exp( −((N+D)/2) · ln( 1 + (l² + O(rl/√D)) / (r²N/(D−1) + r²) ) )
        ≈ exp( −((N+D)/2) · (l² + O(rl/√D)) / (r²N/(D−1) + r²) )
        = exp( −( (l² + O(rl/√D)) / r² ) · (N+D)(D−1) / (2(N+D−1)) )
        ≈ exp( −( (l² + O(rl/√D)) / (2r²) ) · D )        (14)


Hence the relation r ∝ √D should hold to keep the ratio invariant of the parameter D. On the other hand, by Theorem 4.1 we know that p_σ is equivalent to p_{r=σ√D} when D → ∞. To achieve phase alignment on the dataset, one should therefore roughly set r = σ√D.
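As a small numerical check of our own (not from the paper), one can verify that the log-ratio in Eq. (14) is roughly independent of D once r = σ√D; the values of N, l, and σ below are toy settings.

    import numpy as np

    N, l, sigma = 3072, 1.0, 0.5
    for D in [64, 128, 2048, 3072000]:
        r = sigma * np.sqrt(D)
        # log-ratio of the two unnormalized kernels, using ||y - x1||^2 ~ r^2 N/(D-1)
        d1_sq = r ** 2 * N / (D - 1)
        d2_sq = d1_sq + l ** 2
        log_ratio = -(N + D) / 2.0 * (np.log(d2_sq + r ** 2) - np.log(d1_sq + r ** 2))
        print(D, log_ratio)   # roughly -l^2 * D / (2 r^2) = -l^2 / (2 sigma^2) for every D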

C.2. Practical Hyperparameter Transfer from Diffusion Models


C.2.1. Transfer EDM Training and Sampling
We list and compare the EDM training algorithm (Alg 1) and PFGM++ training with transferred hyper-parameters (Alg 2). The major modification is to replace the Gaussian noise nᵢ ∼ N(0, σᵢ²I) with the additive noise Rᵢvᵢ, where vᵢ ∼ U_ψ(ψ) and Rᵢ ∼ p_{rᵢ}(R) with rᵢ = σᵢ√D. We highlight the major modifications in blue.

We also show the sampling algorithms of EDM (Alg 3) and PFGM++ (Alg 4). Note that we only change the prior sampling process while the for-loop is identical for both algorithms: since EDM (Karras et al., 2022) sets σ = t and r = σ√D,

    dx/dr = (x − fθ(x, r))/r = (x − fθ(x, r))/(σ√D),

and hence

    dx/dσ = √D · dx/dr = (x − fθ(x, r))/σ = dx/dt.

Thus we can use the original samplers of EDM without further modification.

Algorithm 1 EDM training
1: Sample a batch of data {yᵢ}ᵢ₌₁ᴮ from p(y)
2: Sample standard deviations {σᵢ}ᵢ₌₁ᴮ from p(σ)
3: Sample noise vectors {nᵢ ∼ N(0, σᵢ²I)}ᵢ₌₁ᴮ
4: Get perturbed data {ŷᵢ = yᵢ + nᵢ}ᵢ₌₁ᴮ
5: Calculate loss ℓ(θ) = Σᵢ₌₁ᴮ λ(σᵢ)‖fθ(ŷᵢ, σᵢ) − yᵢ‖₂²
6: Update the network parameter θ via Adam optimizer

Algorithm 2 PFGM++ training with hyperparameter transferred from EDM
1: Sample a batch of data {yᵢ}ᵢ₌₁ᴮ from p(y)
2: Sample standard deviations {σᵢ}ᵢ₌₁ᴮ from p(σ)
3: Sample r from p_r: {rᵢ = σᵢ√D}ᵢ₌₁ᴮ
4: Sample radii {Rᵢ ∼ p_{rᵢ}(R)}ᵢ₌₁ᴮ
5: Sample uniform angles {vᵢ = uᵢ/‖uᵢ‖₂}ᵢ₌₁ᴮ, with uᵢ ∼ N(0, I)
6: Get perturbed data {ŷᵢ = yᵢ + Rᵢvᵢ}ᵢ₌₁ᴮ
7: Calculate loss ℓ(θ) = Σᵢ₌₁ᴮ λ(σᵢ)‖fθ(ŷᵢ, σᵢ) − yᵢ‖₂²
8: Update the network parameter θ via Adam optimizer

Algorithm 3 EDM sampling (Heun's 2nd order method)
1: x₀ ∼ N(0, σ_max²I)
2: for i = 0, . . . , T − 1 do
3:   dᵢ = (xᵢ − fθ(xᵢ, tᵢ))/tᵢ
4:   xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)dᵢ
5:   if tᵢ₊₁ > 0 then
6:     d′ᵢ = (xᵢ₊₁ − fθ(xᵢ₊₁, tᵢ₊₁))/tᵢ₊₁
7:     xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)(½dᵢ + ½d′ᵢ)
8:   end if
9: end for

Algorithm 4 PFGM++ sampling with hyperparameter transferred from EDM
1: Set r_max = σ_max√D
2: Sample radius R ∼ p_{r_max}(R) and uniform angle v = u/‖u‖₂, with u ∼ N(0, I)
3: Get initial data x₀ = Rv
4: for i = 0, . . . , T − 1 do
5:   dᵢ = (xᵢ − fθ(xᵢ, tᵢ))/tᵢ
6:   xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)dᵢ
7:   if tᵢ₊₁ > 0 then
8:     d′ᵢ = (xᵢ₊₁ − fθ(xᵢ₊₁, tᵢ₊₁))/tᵢ₊₁
9:     xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)(½dᵢ + ½d′ᵢ)
10:  end if
11: end for
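As a rough illustration of Alg 2, one training step can be sketched in NumPy as follows; net stands in for the preconditioned network fθ, the weighting λ(σ) = 1/c_out(σ)² is the EDM choice quoted in Appendix D.2, and the remaining names and settings are toy placeholders rather than the released training code.

    import numpy as np

    def pfgmpp_training_step(net, y_batch, D, rng=np.random.default_rng(0)):
        B, N = y_batch.shape
        # p(sigma): log-normal with ln(sigma) ~ N(-1.2, 1.2^2), as in EDM
        sigma = np.exp(rng.normal(-1.2, 1.2, size=B))
        r = sigma * np.sqrt(D)                               # transferred rule: r = sigma * sqrt(D)
        R1 = rng.beta(N / 2.0, D / 2.0, size=B)
        R = np.sqrt(r ** 2 * R1 / (1.0 - R1))                # radius R ~ p_r(R)
        u = rng.standard_normal((B, N))
        v = u / np.linalg.norm(u, axis=1, keepdims=True)     # uniform angles
        y_hat = y_batch + R[:, None] * v                     # perturbed data
        sigma_data = 0.5
        lam = (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2   # 1 / c_out(sigma)^2
        loss = np.sum(lam * np.sum((net(y_hat, sigma) - y_batch) ** 2, axis=1))
        return loss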

C.2.2. Transfer DDPM (Continuous) Training and Sampling



Here we demonstrate the “zero-shot” transfer of hyperparameters from DDPM to PFGM++, using the r = σ√D formula. We highlight the modifications in blue. In particular, we list the DDPM training/sampling algorithms (Alg 5/Alg 7) and their counterparts in PFGM++ (Alg 6/Alg 8) for comparison. Let β_T and β₁ be the maximum/minimum values of β in DDPM (Ho et al., 2020). Similar to Song et al. (2021b), we denote α_t = e^{−½t²(β̄_max − β̄_min) − tβ̄_min}, with β̄_max = β_T·T and β̄_min = β₁·T. For example, on CIFAR-10, β̄_min = 1e−1 and β̄_max = 20 with T = 1000. We would like to note that the tᵢ's in the sampling algorithms (Alg 7 and Alg 8) monotonically decrease from 1 to 0 as i increases.

Algorithm 5 DDPM training
1: Sample a batch of data {yᵢ}ᵢ₌₁ᴮ from p(y)
2: Sample time {tᵢ = t′ᵢ/T}ᵢ₌₁ᴮ with t′ᵢ ∼ U({1, . . . , T})
3: Get perturbed data {ŷᵢ = √α_{tᵢ} yᵢ + √(1 − α_{tᵢ}) εᵢ}ᵢ₌₁ᴮ, where εᵢ ∼ N(0, I)
4: Calculate loss ℓ(θ) = Σᵢ₌₁ᴮ λ(tᵢ)‖fθ(ŷᵢ, tᵢ) − εᵢ‖₂²
5: Update the network parameter θ via Adam optimizer

Algorithm 6 PFGM++ training with hyperparameter transferred from DDPM
1: Sample a batch of data {yᵢ}ᵢ₌₁ᴮ from p(y)
2: Sample time {tᵢ}ᵢ₌₁ᴮ from U[0, 1]
3: Get σᵢ from tᵢ: {σᵢ = √((1 − α_{tᵢ})/α_{tᵢ})}ᵢ₌₁ᴮ
4: Sample r from p_r: {rᵢ = σᵢ√D}ᵢ₌₁ᴮ
5: Sample radii {Rᵢ ∼ p_{rᵢ}(R)}ᵢ₌₁ᴮ
6: Sample uniform angles {vᵢ = uᵢ/‖uᵢ‖₂}ᵢ₌₁ᴮ, with uᵢ ∼ N(0, I)
7: Get perturbed data {ŷᵢ = √α_{tᵢ}(yᵢ + Rᵢvᵢ)}ᵢ₌₁ᴮ
8: Calculate loss ℓ(θ) = Σᵢ₌₁ᴮ λ(tᵢ)‖fθ(ŷᵢ, tᵢ) − √D·Rᵢvᵢ/rᵢ‖₂²
9: Update the network parameter θ via Adam optimizer

Algorithm 7 DDIM sampling
1: x_T ∼ N(0, I)
2: for i = T, . . . , 1 do
3:   x_{i−1} = √(α_{t_{i−1}}/α_{t_i}) x_i + (√(1 − α_{t_{i−1}}) − √(α_{t_{i−1}}/α_{t_i}) √(1 − α_{t_i})) fθ(x_i, t_i)
4: end for

Algorithm 8 PFGM++ sampling with hyperparameter transferred from DDIM
1: Set σ_max = √((1 − α₁)/α₁), r_max = σ_max√D
2: Sample radius R ∼ p_{r_max}(R) and uniform angle v = u/‖u‖₂, with u ∼ N(0, I)
3: Get initial data x_T = √α₁ Rv
4: for i = T, . . . , 1 do
5:   x_{i−1} = √(α_{t_{i−1}}/α_{t_i}) x_i + (√(1 − α_{t_{i−1}}) − √(α_{t_{i−1}}/α_{t_i}) √(1 − α_{t_i})) fθ(x_i, t_i)
6: end for
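For concreteness, the t → σ → r conversion used in Alg 6 and Alg 8 can be sketched as follows, assuming the CIFAR-10 schedule constants quoted above (β̄_min = 0.1, β̄_max = 20); this is an illustrative snippet rather than the released code.

    import numpy as np

    beta_min_bar, beta_max_bar = 0.1, 20.0      # beta_bar_min = beta_1*T, beta_bar_max = beta_T*T

    def t_to_r(t, D):
        alpha_t = np.exp(-0.5 * t ** 2 * (beta_max_bar - beta_min_bar) - t * beta_min_bar)
        sigma = np.sqrt((1.0 - alpha_t) / alpha_t)
        return sigma * np.sqrt(D)               # transferred rule: r = sigma * sqrt(D)

    print(t_to_r(1.0, D=2048))                  # r_max used for the prior at t = 1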

D. Experimental Details
We provide the experimental setups for Section 5, as well as the training, sampling, and evaluation details for PFGM++. All experiments are run on four NVIDIA A100 GPUs or eight NVIDIA V100 GPUs.

D.1. Experiments for the Analysis in Sec 5


In the experiments of section 4 and section 5.1, we need access to the posterior p_{0|r}(y|x) ∝ p_r(x|y)p(y) to calculate the mean TVD. We sample a large batch {yᵢ}ᵢ₌₁ⁿ with n = 1024 on CIFAR-10 to empirically approximate the posterior:

    p_{0|r}(yᵢ|x) = p_r(x|yᵢ)p(yᵢ)/p_r(x) ≈ p_r(x|yᵢ) / Σⱼ₌₁ⁿ p_r(x|yⱼ) = [1/(‖x − yᵢ‖₂² + r²)^{(N+D)/2}] / [Σⱼ₌₁ⁿ 1/(‖x − yⱼ‖₂² + r²)^{(N+D)/2}]

We sample a large batch of 256 to approximate all the expectations in section 5, such as the average TVDs.
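A minimal sketch of this empirical posterior is given below; computing the weights in log-space for numerical stability is our choice and is not stated in the paper.

    import numpy as np

    def empirical_posterior(x, y_batch, r, D):
        N = x.shape[-1]
        sq_dist = np.sum((x[None, :] - y_batch) ** 2, axis=1)       # ||x - y_i||^2
        log_w = -(N + D) / 2.0 * np.log(sq_dist + r ** 2)           # log p_r(x|y_i), up to a constant
        log_w -= log_w.max()                                        # stabilize before exponentiating
        w = np.exp(log_w)
        return w / w.sum()                                          # p_{0|r}(y_i | x)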

D.2. Training Details


We borrow the architectures, preconditioning techniques, optimizers, exponential moving average (EMA) schedule, and hyper-parameters from the previous state-of-the-art diffusion model EDM (Karras et al., 2022). We apply the alignment method in section 4 to transfer their well-tuned hyper-parameters.
For architecture, we use the improved NCSN++ (Karras et al., 2022) for the CIFAR-10 dataset (batch size 512), and the
improved DDPM++ for the FFHQ dataset (batch size 256). For optimizers, following EDM, we adopt the Adam optimizer
with a learning rate of 10e − 4. We further incorporate the EMA schedule, learning rate warm-up, and data augmentations
in EDM. Please refer to Appendix F in EDM paper (Karras et al., 2022) for details.
The most prominent improvements in EDM are the preconditioning and the new training distribution for σ, i.e., p(σ). Specifically, adding these two techniques to the vanilla diffusion objective (Eq. (8)), the effective training objective can be written as:

    E_{σ∼p(σ)} λ(σ) c_out(σ)² E_{p(y)} E_{p_σ(x|y)} ‖F_θ(c_in(σ)·x, c_noise(σ)) − (1/c_out(σ))(y − c_skip(σ)·x)‖₂²        (15)

with the predicted normalized score function in the vanilla diffusion objective (Eq. (8)) re-parameterized as

    f_θ(x, σ) = (c_skip(σ)x + c_out(σ)F_θ(c_in(σ)x, c_noise(σ)) − x) / σ ≈ σ∇_x log p_σ(x)

where c_in(σ) = 1/√(σ² + σ_data²), c_out(σ) = σ·σ_data/√(σ² + σ_data²), c_skip(σ) = σ_data²/(σ² + σ_data²), c_noise(σ) = ¼ln(σ), with σ_data = 0.5. {c_in(σ), c_out(σ), c_skip(σ), c_noise(σ), σ_data} are all the hyper-parameters in the preconditioning. The training distribution p(σ) is the log-normal distribution with ln(σ) ∼ N(−1.2, 1.2²), and the loss weighting is λ(σ) = 1/c_out(σ)².

Recall that the hyper-parameter alignment rule r = σ√D can transfer hyper-parameters from diffusion models (D→∞) to finite D. Hence we can directly set σ = r/√D in those preconditioning hyper-parameters. In addition, the training distribution p(r) can be derived via the change-of-variable formula, i.e., p(r) = p(σ = r/√D)/√D. The final PFGM++ objective after incorporating these techniques into Eq. (6) is:

    E_{r∼p(r)} λ(r/√D) c_out(r/√D)² E_{p(y)} E_{p_r(x|y)} ‖F_θ(c_in(r/√D)·x, c_noise(r/√D)) − (1/c_out(r/√D))(y − c_skip(r/√D)·x)‖₂²

with the predicted normalized electric field in the vanilla PFGM++ objective (Eq. (6)) re-parameterized as

    f_θ(x̃) = (c_skip(r/√D)x + c_out(r/√D)F_θ(c_in(r/√D)x, c_noise(r/√D)) − x) / (r/√D) ≈ √D · E(x̃)_x / E(x̃)_r
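For reference, the transferred preconditioning coefficients evaluated at σ = r/√D can be sketched as follows; this is a minimal sketch assuming σ_data = 0.5 as above, not the released code.

    import numpy as np

    def precond_coeffs(r, D, sigma_data=0.5):
        sigma = r / np.sqrt(D)                                     # transferred: sigma = r / sqrt(D)
        c_in = 1.0 / np.sqrt(sigma ** 2 + sigma_data ** 2)
        c_out = sigma * sigma_data / np.sqrt(sigma ** 2 + sigma_data ** 2)
        c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
        c_noise = 0.25 * np.log(sigma)
        return c_in, c_out, c_skip, c_noise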

D.3. Sampling Details


For sampling, following EDM (Karras et al., 2022), we also use Heun's 2nd order method (improved Euler method) (Ascher & Petzold, 1998) as the ODE solver for dx/dr = E(x̃)_x/E(x̃)_r = f_θ(x̃)/√D.

We adopt the same parameterization scheme as EDM to determine the evaluation points during N-step ODE sampling:

    r_i = ( r_max^{1/ρ} + (i/(N−1)) · (r_min^{1/ρ} − r_max^{1/ρ}) )^ρ  and  r_N = 0

where ρ controls the relative density of evaluation points in the near field. We set ρ = 7 as in EDM, and r_max = σ_max√D = 80√D, r_min = σ_min√D = 0.002√D (σ_max, σ_min are the hyper-parameters in EDM controlling the starting/terminal evaluation points), following the r = σ√D alignment rule.
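A short sketch of this schedule with the transferred endpoints (illustrative, not the released sampler code):

    import numpy as np

    def r_schedule(num_steps, D, rho=7.0, sigma_max=80.0, sigma_min=0.002):
        r_max = sigma_max * np.sqrt(D)
        r_min = sigma_min * np.sqrt(D)
        i = np.arange(num_steps)
        r = (r_max ** (1 / rho)
             + i / (num_steps - 1) * (r_min ** (1 / rho) - r_max ** (1 / rho))) ** rho
        return np.append(r, 0.0)       # append r_N = 0 as the terminal point

    print(r_schedule(35, D=2048))      # 35-step schedule used in the experiments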

D.4. Evaluation Details


For the evaluation, we compute the Fréchet distance between 50000 generated samples and the pre-computed statistics of
CIFAR-10 and FFHQ. On CIFAR-10, we follow the evaluation protocol in EDM (Karras et al., 2022), which repeats the
generation three times with different seeds for each checkpoint and reports the minimum FID score. However, we observe
that the FID score has a large fluctuation across checkpoints, and the minimum FID score of EDM in our re-run experiment
does not align with the original results reported in (Karras et al., 2022). Fig. 7(a) shows that the FID score could have a
variation of ±0.2 during the training of a total of 200 million images (Karras et al., 2022). To better evaluate the model
performance, Table 2 reports the average FID over the Top-3 checkpoints instead. In Fig. 7(b), we further demonstrate the
moving average of the FID score with a window of 10000K images. It shows that D = 2048 consistently outperforms other
baselines in the same training iterations, in agreement with the results in Table 2.

D.5. Experiments for Robustness


Controlled experiments with α. In the controlled noise setting, we inject noise into the intermediate point x_r in each of the 35 ODE steps via x_r = x_r + α ε_r, where ε_r ∼ N(0, (r/√D)²I). Since p_r has roughly the same phase as p_{σ=r/√D} in diffusion models, we pick a standard deviation of r/√D for ε_r when the intermediate step is r.

Post-training quantization. In the post-training quantization experiments on CIFAR-10, we quantize the weights of the convolutional layers excluding the 32 × 32 layers, as we empirically observe that these input/output layers are more critical for sample quality.

E. Extra Experiments
E.1. Stable Target Field
Xu et al. (2023) propose a Stable Target Field objective for training diffusion models:

    ∇_x log p_t(x) ≈ E_{y₁∼p_{0|t}(·|x)} E_{{y_i}_{i=2}^n ∼ p^{n−1}} [ Σ_{k=1}^n ( p_{t|0}(x|y_k) / Σ_j p_{t|0}(x|y_j) ) ∇_x log p_{t|0}(x|y_k) ]

where they sample a large batch of samples {y_i}_{i=2}^n from the data distribution to approximate the score function at x. They show that the new target can enhance the stability of converged models across different runs/seeds.
Figure 7. FID score (vs. Kimg) in the training course when varying D ∈ {128, 2048, 3072000, ∞ (diffusion)}, (a) w/o and (b) w/ moving average.

PFGM++ can be trained in a similar fashion by replacing the target (x − y)/(r/√D) in the perturbation-based objective (Eq. (6)) with

    (1/(r/√D)) (x − E_{p_{0|r}(y|x)}[y]) ≈ (1/(r/√D)) ( x − E_{y₁∼p_{0|r}(·|x)} E_{{y_i}_{i=2}^n ∼ p^{n−1}} [ Σ_{k=1}^n ( (1/(‖x − y_k‖₂² + r²)^{(N+D)/2}) / (Σ_j 1/(‖x − y_j‖₂² + r²)^{(N+D)/2}) ) y_k ] )

When n = 1, the new target reduces to the original target. Similar to Xu et al. (2023), one can show that both the bias of the new target and its trace-of-covariance shrink to zero as the size of the large batch increases. This new target can alleviate the variation between random seeds. With the new STF-style target, Table 4 shows that when setting D = 3072000 ≫ N = 3072, the model obtains the same FID score as the diffusion model (EDM (Karras et al., 2022)). This aligns with the theoretical result in Sec 4, which states that PFGM++ recovers the diffusion model when D → ∞.

Table 4. FID and NFE on CIFAR-10, using the Stable Target Field (Xu et al., 2023) in the training objective.

    Method                            FID ↓    NFE ↓
    D = 3072000                        1.90       35
    D → ∞ (Karras et al., 2022)        1.90       35
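As an illustration (ours, not the released code), the STF-style target above amounts to reweighting a large batch by the normalized perturbation kernels; it reduces to the original target when the batch contains a single sample.

    import numpy as np

    def stf_target(x, y_big_batch, r, D):
        N = x.shape[-1]
        sq_dist = np.sum((x[None, :] - y_big_batch) ** 2, axis=1)
        log_w = -(N + D) / 2.0 * np.log(sq_dist + r ** 2)     # log p_r(x|y_k), up to a constant
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                          # self-normalized weights
        y_bar = w @ y_big_batch                               # weighted posterior mean of y
        return (x - y_bar) / (r / np.sqrt(D))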

E.2. Extended CIFAR-10 Samples when varying α


To see how the sample quality varies with α, we visualize samples generated by models trained with D ∈ {64, 128, 2048} and D → ∞, picking α ∈ {0, 0.1, 0.2}. Fig. 8 shows that smaller D produces better samples than larger D under noise injection. Diffusion models (D → ∞) generate noisy images that appear to fall outside the data distribution when α = 0.2, in contrast to the clean images produced by D = 64 and 128.

E.3. Extended FFHQ Samples


In Fig. 9, we provide samples generated by the D = 128 case and EDM (the D → ∞ case).

F. Potential Negative Social Impact


Deep generative modeling is a burgeoning field with significant potential for shaping our society. Our work presents a novel family of generative models, PFGM++, which subsumes previous high-performing models and provides greater flexibility. PFGM++ has many potential applications, particularly in areas that require both robustness and high-quality output. However, it is important to note that the usage of these models can have both positive and negative implications, depending on the specific application. For instance, PFGM++ can be used to create realistic image and audio samples, but it can also contribute to the development of deepfake technology and potentially enable social scams. Additionally, the data-collection process for generative models may infringe upon intellectual property rights. To address these concerns, further research is needed to provide robustness guarantees for generative models and to foster collaborations with experts in socio-technical fields.

(a) D=64, α = 0 (FID=1.96) (b) D=64, α = 0.1 (FID=1.97) (c) D=64, α = 0.2 (FID=2.07)

(d) D=128, α = 0 (FID=1.92) (e) D=128, α = 0.1 (FID=1.95) (f) D=128, α = 0.2 (FID=2.19)

(g) D=2048, α = 0 (FID=1.92) (h) D=2048, α = 0.1 (FID=1.95) (i) D=2048, α = 0.2 (FID=2.19)

(j) D → ∞, α = 0 (FID=1.98) (k) D → ∞, α = 0.1 (FID=9.27) (l) D → ∞, α = 0.2 (FID=92.41)

Figure 8. Generated samples on CIFAR-10 with varied hyper-parameter for noise injection (α). Images from top to bottom rows are
produced by models trained with D = 64/128/2048/∞. We use the same random seeds for finite Ds during image generation.

(a) D = 128 (FID=2.43) (b) EDM (D → ∞) (FID=2.53)

Figure 9. Generated images on FFHQ 64 × 64 dataset, by (left) D = 128 and (right) EDM (D → ∞).
