PFGM++: Unlocking the Potential of Physics-Inspired Generative Models
Yilun Xu 1 Ziming Liu 1 Yonglong Tian 1 Shangyuan Tong 1 Max Tegmark 1 Tommi Jaakkola 1
[Figure 1: PFGM (Xu et al., 2022) at D=1 and diffusion models (VE/VP (Song et al., 2021), EDM (Karras et al., 2022)) at D→∞ as endpoints of the PFGM++ family (D ∈ ℝ⁺), with a sweet spot D* in between (Sec 5) balancing robustness and rigidity; arrows mark the extension from PFGM and the equivalence between D→∞ and diffusion models.]
Figure 1. Overview of paper contributions and structure. PFGM++ unifies PFGM and diffusion models, and offers the potential to combine their strengths (robustness and rigidity).
diffusion models. We establish D→∞ equivalence with popular diffusion models (Song et al., 2021b; Karras et al., 2022) both in terms of their training objectives as well as their inferential processes. We demonstrate that the hyperparameter D controls the balance between robustness and rigidity: using a small D widens the distribution of noisy training sample norms in comparison to the norm of the augmented variables. However, small D also leads to a heavy-tailed sampling problem at any fixed augmentation norm, making learning more challenging. Neither D=1 nor D→∞ offers an ideal balance between being insensitive to missteps (robustness) and allowing effective learning (rigidity). Instead, we adjust D in response to different architectures and tasks. To facilitate quickly finding the best D, we provide an alignment method to directly transfer other hyperparameters across different choices of D.

Experimentally, we show that some models with finite D outperform the previous state-of-the-art diffusion models (D→∞), i.e., EDM (Karras et al., 2022), on image generation tasks. In particular, intermediate D=2048/128 achieve the best performance among other choices of D ranging from 64 to ∞, with min FID scores of 1.91/2.43 on the CIFAR-10 and FFHQ 64×64 datasets in unconditional generation, using 35/79 NFE. In class-conditional generation, D=2048 achieves a new state-of-the-art FID of 1.74 on CIFAR-10. We further verify that, in general, decreasing D leads to improved robustness against a variety of sources of errors, i.e., controlled noise injection, large sampling step sizes and post-training quantization.

Our contributions are summarized as follows: (1) We propose PFGM++ as a new family of generative models based on expanding augmented dimensions and show that the symmetries involved enable us to define generative paths simply based on the scalar norm of the augmented variables (Sec 3.1); (2) We propose a perturbation-based objective to dispense with any biased large-batch-derived electric field targets, allowing unbiased training (Sec 3.2); (3) We prove that the score field and the training objective of diffusion models arise in the limit D→∞ (Sec 4); (4) We demonstrate the trade-off between robustness and rigidity by varying D (Sec 5). We also detail the hyperparameter transfer procedures from EDM/DDPM (D → ∞) to finite Ds in Appendix C.2; (5) We empirically show that models with finite D achieve superior performance to diffusion models while exhibiting improved robustness (Sec 6).

2. Background and Related Works

Diffusion Model  Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021b; Karras et al., 2022) are often presented as a pair of two processes. A fixed forward process governs the training of the model, which learns to denoise data at different noise levels. A corresponding backward process involves utilizing the trained model iteratively to denoise the samples, starting from a fully noisy prior distribution. Karras et al. (2022) propose a unifying framework for popular diffusion models (VE/VP (Song et al., 2021b) and EDM (Karras et al., 2022)), and their sampling process can be understood as traveling in time with a probability flow ordinary differential equation (ODE):

dx = −σ̇(t)σ(t)∇_x log p_{σ(t)}(x) dt

where σ(t) is a predefined noise schedule w.r.t. time, and ∇_x log p_{σ(t)}(x) is the score of the noise-injected data distribution at time t. A neural network f_θ(x, σ) is trained to learn the score ∇_x log p_{σ(t)}(x) by minimizing a weighted sum of denoising score-matching objectives (Vincent, 2011):

E_{σ∼p(σ)} λ(σ) E_{y∼p(y)} E_{x∼p_σ(x|y)} ‖f_θ(x, σ) − ∇_x log p_σ(x|y)‖²₂   (1)

where p(σ) defines a training distribution of noise levels, λ(σ) is a weighting function, p(y) is the data distribution, and p_σ(x|y) = N(y, σ²I) defines a Gaussian perturbation kernel which samples a noisy version x of the clean data y. Please refer to Table 1 in Karras et al. (2022) for specific instantiations of different diffusion models.
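As a concrete reference point for Eq. (1), the sketch below sets up a weighted denoising score-matching loss with a Gaussian perturbation kernel. It is a minimal illustration rather than the authors' implementation; the network, the weighting λ(σ) = σ², and the log-normal choice of p(σ) are illustrative stand-ins.

```python
# Minimal sketch of the weighted denoising score-matching objective (Eq. (1)).
# Assumptions: f_theta is any network of (x, sigma); lam and p(sigma) are illustrative choices.
import torch

def dsm_loss(f_theta, y, sigma):
    """y: clean data batch [B, N]; sigma: noise levels [B, 1]."""
    eps = torch.randn_like(y)
    x = y + sigma * eps                       # sample x ~ p_sigma(x | y) = N(y, sigma^2 I)
    target = -(x - y) / sigma**2              # grad_x log p_sigma(x | y) = (y - x) / sigma^2
    lam = sigma**2                            # illustrative weighting lambda(sigma)
    return (lam * (f_theta(x, sigma) - target).pow(2).sum(dim=1, keepdim=True)).mean()

# usage with a toy score model
B, N = 16, 8
net = torch.nn.Sequential(torch.nn.Linear(N + 1, 64), torch.nn.SiLU(), torch.nn.Linear(64, N))
f_theta = lambda x, s: net(torch.cat([x, s.log()], dim=1))
sigma = torch.exp(torch.randn(B, 1) * 1.2 - 1.2)  # illustrative log-normal p(sigma)
loss = dsm_loss(f_theta, torch.randn(B, N), sigma)
loss.backward()
```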
PFGM  Inspired by the theory of electrostatics (Griffiths, 2005), Xu et al. (2022) propose Poisson flow generative models (PFGM), which interpret the N-dimensional data x ∈ ℝ^N as electric charges in an N+1-dimensional space augmented with an extra dimension z: x̃ = (x, z) ∈ ℝ^{N+1}. In particular, the training data is placed on the z=0 hyperplane, and the electric field lines emitted by the charges define a bijection between the data distribution and a uniform distribution on the infinite hemisphere of the augmented space. To perform generative modeling, PFGM learn the following high-dimensional electric field, which is the derivative of the electric potential in a Poisson equation:

E(x̃) = (1/S_N(1)) ∫ (x̃ − ỹ)/‖x̃ − ỹ‖^{N+1} p(y) dy   (2)

where S_N(1) is the surface area of a unit N-sphere (a geometric constant), and p(y) is the data distribution. Samples are then generated by following the electric field lines, which are described by the ODE dx̃ = E(x̃)dt. In practice, the network is trained to estimate a normalized version of the following empirical electric field: Ê(x̃) = c(x̃) Σ_{i=1}^n (x̃ − ỹ_i)/‖x̃ − ỹ_i‖^{N+1}, where c(x̃) = 1/Σ_{i=1}^n 1/‖x̃ − ỹ_i‖^{N+1} and {ỹ_i}_{i=1}^n ∼ p̃(ỹ) is a large batch used to approximate the integral in Eq. (2). The training objective is minimizing the ℓ₂-loss between the neural model prediction f_θ(x̃) and the normalized field E(x̃)/‖E(x̃)‖ at various positions of x̃. These positions are heuristically designed to carefully cover the regions that the sampling trajectories pass through.

Phases of Score Field  Xu et al. (2023) show that the score field in the forward process of diffusion models can be decomposed into three phases. When moving from the near field (Phase 1) to the far field (Phase 3), the perturbed data get influenced by more modes in the data distribution. They show that the posterior p_{0|σ}(y|x) ∝ p_σ(x|y)p(y) serves as a phase indicator, as it gradually evolves from a delta distribution to a uniform distribution when shifting from Phase 1 to Phase 3. The relevant concepts of phases have also been explored in Karras et al. (2022); Choi et al. (2022); Xiao et al. (2022). Similar to the PFGM training objective, Xu et al. (2023) approximate the score field by large batches to reduce the variance of training targets in Phase 2, where multiple data points exert comparable but distinct influences on the scores. These observations inspire us to align the phases of different Ds in Sec 4.

… to the inherent symmetry of the electric field. To improve the training process, we propose an efficient perturbation-based objective for training PFGM++ (Sec 3.2) without relying on the large batch approximation in the original PFGM.

3.1. Electric field in N+D-dimensional space

While PFGM (Xu et al., 2022) consider the electric field in an N+1-dimensional augmented space, we augment the data x with D-dimensional variables z = (z₁, …, z_D), i.e., x̃ = (x, z) and D ∈ ℤ⁺. Similar to the N+1-dimensional electric field (Eq. (2)), the electric field at the augmented data x̃ = (x, z) ∈ ℝ^{N+D} is:

E(x̃) = (1/S_{N+D−1}(1)) ∫ (x̃ − ỹ)/‖x̃ − ỹ‖^{N+D} p(y) dy   (3)

Analogous to the theoretical results presented in PFGM, with the electric field as the drift term, the ODE dx̃ = E(x̃)dt defines a surjection between a uniform distribution on an infinite N+D-dim hemisphere and the data distribution on the N-dim z=0 hyperplane. However, the mapping has SO(D) symmetry on the surface of the D-dim cylinder Σ_{i=1}^D z_i² = r² for any positive r. We provide an illustrative example at the bottom of Fig. 2 (D=2, N=1), where the electric flux emitted from a line segment (red) has rotational symmetry through the ring area (blue) on the z₁² + z₂² = r² cylinder. Hence, instead of modeling the individual behavior of each z_i, it suffices to track the norm of the augmented variables — r(x̃) = ‖z‖₂ — due to symmetry. Specifically, note that dz_i = E(x̃)_{z_i} dt, and the time derivative of r is

dr/dt = Σ_{i=1}^D (z_i/r)(dz_i/dt) = (1/S_{N+D−1}(1)) ∫ (Σ_{i=1}^D z_i²)/(r‖x̃ − ỹ‖^{N+D}) p(y) dy = (1/S_{N+D−1}(1)) ∫ r/‖x̃ − ỹ‖^{N+D} p(y) dy

Henceforth we replace the notation for the augmented data with x̃ = (x, r) for simplicity. After the symmetry reduction, the field to be modeled has a similar form as Eq. (3), except that the last D sub-components {E(x̃)_{z_i}}_{i=1}^D are condensed into a scalar E(x̃)_r = (1/S_{N+D−1}(1)) ∫ r/‖x̃ − ỹ‖^{N+D} p(y) dy. Therefore, we can use the physically meaningful r as the anchor variable in the ODE dx/dr by change-of-variable:

dx/dr = E(x̃)_x / E(x̃)_r   (4)
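To make Eqs. (3) and (4) concrete, here is a small numerical sketch that estimates the symmetry-reduced field components E(x̃)_x and E(x̃)_r from a finite set of data points (the geometric constant cancels in their ratio) and takes one Euler step of dx/dr. It is only an illustration of the formulas, not the training procedure proposed in Sec 3.2; the toy dataset and step size are arbitrary.

```python
# Sketch: empirical E(x)_x and E(x)_r in the N+D-dimensional augmented space,
# using a finite sample {y_i} in place of the integral over p(y). Illustrative only.
import numpy as np

def field_components(x, r, ys, D):
    """x: [N] query point, r: augmentation norm, ys: [n, N] data points, D: augmented dims."""
    N = x.shape[0]
    sq = np.sum((x - ys) ** 2, axis=1) + r ** 2          # ||x_tilde - y_tilde||^2 with z(y) = 0
    logw = -(N + D) / 2.0 * np.log(sq)                   # log of 1 / ||x_tilde - y_tilde||^(N+D)
    w = np.exp(logw - logw.max())                        # rescale for stability; cancels in the ratio
    E_x = (w[:, None] * (x - ys)).mean(axis=0)           # proportional to E(x_tilde)_x
    E_r = (w * r).mean()                                 # proportional to E(x_tilde)_r, same constant
    return E_x, E_r

# one Euler step of the anchored ODE dx/dr = E(x_tilde)_x / E(x_tilde)_r toward smaller r
rng = np.random.default_rng(0)
ys = rng.normal(size=(512, 2))                           # toy 2-D "dataset"
x, r, D = rng.normal(size=2) * 10.0, 40.0, 128
E_x, E_r = field_components(x, r, ys, D)
x = x + (E_x / E_r) * (-1.0)                             # dr = -1
```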
… allows for conditional generations under a one-sample-per-condition setup. Since the perturbation kernel is isotropic, we can decompose p_r(·|y) in hyperspherical coordinates into U_ψ(ψ)p_r(R), where U_ψ is the uniform distribution over the angle component and the distribution of the perturbed radius R = ‖x − y‖₂ is

p_r(R) ∝ R^{N−1} / (R² + r²)^{(N+D)/2}

We defer the practical sampling procedure of the perturbation kernel to Appendix B. The mean of the r-dependent radius distribution p_r(R) is around r√(N/D). Hence we explicitly normalize the target in Eq. (5) by r/√D, to keep the norm of the target around the constant √N, similar to diffusion models (Song et al., 2021b). In addition, we drop the last dimension of the target because it is a constant — (x̃ − ỹ)_r/(r/√D) = √D. Together, the new objective is

E_{r∼p(r)} E_{p(ỹ)} E_{p_r(x̃|ỹ)} ‖f_θ(x̃) − (x − y)/(r/√D)‖²₂   (6)

After training the neural network through objective Eq. (6), we can use the ODE (Eq. (4)) anchored by r to generate samples, i.e., dx/dr = E(x̃)_x/E(x̃)_r = f_θ(x̃)/√D, starting from the prior distribution p_{r_max}.

4. Diffusion Models as D→∞ Special Cases

Diffusion models generate samples by simulating an ODE/SDE involving the score function ∇_x log p_σ(x) at different intermediate distributions p_σ (Song et al., 2021b; Karras et al., 2022), where σ is the standard deviation of the Gaussian kernel. In this section, we show that both sampling and training schemes in diffusion models are equivalent to those in the D→∞ case under the PFGM++ framework. To begin with, we show that the electric field (Eq. (3)) in PFGM++ has the same direction as the score function when D tends to infinity, and their sampling processes are also identical.

Theorem 4.1. Assume the data distribution p ∈ C¹. Consider taking the limit D → ∞ while holding σ = r/√D fixed. Then, for all x,

lim_{D→∞, r=σ√D} ‖ (√D/E(x̃)_r) E(x̃)_x − σ∇_x log p_{σ=r/√D}(x) ‖₂ = 0

where E(x̃ = (x, r))_x is given in Eq. (3). Further, given the same initial point, the trajectory of the PFGM++ ODE (dx/dr = E(x̃)_x/E(x̃)_r) matches the diffusion ODE (Karras et al., 2022) (dx/dt = −σ̇(t)σ(t)∇_x log p_{σ(t)}(x)) in the same limit.

Proof sketch. By re-expressing the x component E(x̃)_x in the electric field and the score ∇_x log p_σ in diffusion models, the proof boils down to showing that lim_{D→∞, r=σ√D} p_r(x|y) ∝ exp(−‖x − y‖²₂/2σ²) for ∀x, y ∈ ℝ^N:

lim_{D→∞, r=σ√D} 1/(‖x − y‖²₂ + r²)^{(N+D)/2} ∝ lim_{D→∞, r=σ√D} e^{−((N+D)/2) ln(1 + ‖x−y‖²₂/r²)} = lim_{D→∞, r=σ√D} e^{−(N+D)‖x−y‖²₂/(2r²)} = e^{−‖x−y‖²₂/(2σ²)}   (7)

The equivalence of trajectories can be proven by the change-of-variable dσ = dr/√D. Their prior distributions are also the same since lim_{D→∞} p_{r_max=σ_max√D}(x) = N(0, σ²_max I).

We defer the formal proof to Appendix A.3. Since ‖x − y‖²₂/r² ≈ N/D when x ∼ p_r(x), y ∼ p(y), Eq. (7) approximately holds under the condition D ≫ N. Remarkably, the theorem states that PFGM++ recover the field and sampling of previous popular diffusion models, such as VE/VP (Song & Ermon, 2020) and EDM (Karras et al., 2022), by choosing the appropriate schedule and scale function in Karras et al. (2022).

In addition to the field and sampling equivalence, we demonstrate that the proposed PFGM++ objective (Eq. (6)) with perturbation kernel p_r(x|y) ∝ 1/(‖x − y‖²₂ + r²)^{(N+D)/2} recovers the weighted sum of the denoising score matching objective (Vincent, 2011) for training continuous diffusion models (Karras et al., 2022; Song et al., 2021b) when D→∞. All previous objectives for training diffusion models can be subsumed in the following form (Karras et al., 2022), under different parameterizations of the neural networks f_θ:

E_{σ∼p(σ)} λ(σ) E_{p(y)} E_{p_σ(x|y)} ‖f_θ(x, σ) − (x − y)/σ‖²₂   (8)

where p_σ(x|y) ∝ exp(−‖x − y‖²₂/2σ²). The objective of the diffusion models resembles the one of PFGM++ (Eq. (6)). Indeed, we show that when D→∞, the minimizer of the proposed PFGM++ objective at x̃ = (x, r) is f*_θ(x, r = σ√D) = σ∇_x log p_σ(x), the same as the minimizer of the diffusion objective at the noise level σ = r/√D.

Proposition 4.2. When r = σ√D, D → ∞, the minimizer in the PFGM++ objective (Eq. (6)) is equivalent to the minimizer in the weighted sum of denoising score matching objective (Eq. (8)).

We defer the proof to Appendix A.4. The proposition states that the training objective of diffusion models is essentially the same as PFGM++'s when D→∞. Combined with Theorem 4.1, PFGM++ thus recover both the training and sampling processes of diffusion models when D→∞.

Transfer hyperparameters to finite Ds  The training hyperparameters of diffusion models (D→∞) have been …
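To make the σ → r = σ√D transfer and the perturbation-based objective in Eq. (6) concrete, the sketch below draws r from an EDM-style σ distribution, perturbs the clean data with p_r(x|y) (using the Beta-based radius sampling described in Appendix B), and regresses onto the target (x − y)/(r/√D). It is a schematic sketch under these assumptions, not the released training code; the network and the choice of p(σ) are hypothetical.

```python
# Sketch of the PFGM++ perturbation-based objective (Eq. (6)) with r = sigma * sqrt(D).
# Assumptions: f_theta is any network of (x, r); the log-normal p(sigma) is an EDM-style choice.
import torch

def pfgmpp_loss(f_theta, y, D):
    B, N = y.shape
    sigma = torch.exp(torch.randn(B, 1) * 1.2 - 1.2)          # illustrative p(sigma)
    r = sigma * D ** 0.5                                       # transfer sigma -> r (Sec 4 / App. C)
    # sample the perturbation radius R with the Beta trick of Appendix B
    beta = torch.distributions.Beta(torch.tensor(N / 2.0), torch.tensor(D / 2.0))
    R1 = beta.sample((B, 1))
    R = r * (R1 / (1.0 - R1)).sqrt()
    u = torch.randn(B, N)
    x = y + R * u / u.norm(dim=1, keepdim=True)                # uniform direction, radius R
    target = (x - y) / (r / D ** 0.5)
    return (f_theta(x, r) - target).pow(2).sum(dim=1).mean()
```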
[Figure 4 plots: "No alignment" vs. "r = σ√D alignment" across D (2⁰ to 2²⁰), with σ ∈ {1, 5, 20, 80}; y-axes include Mean TVD, log₂ Var p_r(R), and the density p_{r=σ√D}(‖x‖₂).]
Figure 4. (a) Average ℓ₂ difference between scaled electric field and score function, versus D. (b) Log-variance of radius distribution versus D. (c) Density of radius distributions p_{r=σ√D}(R) with varying σ and D.

5.1. Behavior of perturbation kernel when varying D
…computational resources and storage are limited. On the other hand, such gains need to be balanced against easier training afforded by larger values of D. The ability to optimize the balance by varying D can therefore be advantageous. We expect that there exists a sweet spot of D in the middle striking the balance, as the model robustness and rigidity go …

6. Experiments

In this section, we assess the performance of different generative models on image generation tasks (Sec 6.1), where models with some median Ds outperform previous state-of-the-art diffusion models (D→∞), consistent with the sweet spot argument in Sec 5. We also demonstrate the improved robustness against three kinds of error as D decreases (Sec 6.2).

6.1. Image generation

We consider the widely used benchmarks CIFAR-10 32×32 (Krizhevsky, 2009) and FFHQ 64×64 (Karras et al., 2018) for image generation. For training, we utilize the improved NCSN++/DDPM++ architectures, preconditioning techniques and hyperparameters from the state-of-the-art diffusion model EDM (Karras et al., 2022). Specifically, we use the alignment method developed in Sec 4 to transfer their tuned critical hyperparameters σ_max, σ_min, p(σ) in the D→∞ case to finite D cases. According to the experimental results in Karras et al. (2022), the log-normal training distribution p(σ) has the most substantial impact on the final performances. For the ODE solver during sampling, we use …

We compare models trained with D→∞ (EDM) and D ∈ {64, 128, 2048, 3072000}. In our experiments, we exclude the case of D=1 (PFGM) because the perturbation kernel is extremely heavy-tailed (Fig. 4(b)), making it difficult to integrate with our perturbation-based objective without the restrictive region heuristics proposed in Xu et al. (2022). We also exclude the small D = 64 for the higher-resolution dataset FFHQ. We include several popular generative models for reference and defer more training and sampling details to Appendix D.

Table 1. CIFAR-10 sample quality (FID) and number of function evaluations (NFE).
                                  Min FID ↓   Top-3 Avg FID ↓   NFE ↓
PFGM++ (unconditional)
D = 64                               1.96           1.98          35
D = 128                              1.92           1.94          35
D = 2048                             1.91           1.93          35
D = 3072000                          1.99           2.02          35
D → ∞ (Karras et al., 2022)         1.98           2.00          35
PFGM++ (class-conditional)

Table 2. FFHQ sample quality (FID) with 79 NFE in the unconditional setting.
                                  Min FID ↓   Top-3 Avg FID ↓
D = 128                              2.43           2.48
D = 2048                             2.46           2.47
D = 3072000                          2.49           2.52
D → ∞ (Karras et al., 2022)         2.53           2.54
Results: In Table 1 and Table 2, we report the sample quality measured by the FID score (Heusel et al., 2017) (lower is better), and inference speed measured by the number of function evaluations. As in EDM, we report the minimum FID score over checkpoints. Since we empirically observe a large variation of FID scores on FFHQ across checkpoints (Appendix D.4), we also use the average FID score over the Top-3 checkpoints as another metric. Our main findings are: (1) Median Ds outperform diffusion models (D→∞) under the PFGM++ framework. We observe that the D=2048/128 cases achieve the best performance among our choices on CIFAR-10 and FFHQ, with a current state-of-the-art min FID score of 1.91/2.43 in the unconditional setting, using the perturbation-based objective. In addition, the D=2048 case obtains better Top-3 average FID scores (1.93/2.47) than EDM (2.00/2.54) on both datasets in the unconditional setting, and achieves a current state-of-the-art FID score of 1.74 in the CIFAR-10 class-conditional setting. (2) There is a sweet spot between (1, ∞). Neither small D nor infinite D obtains the best performance, which confirms that there is a sweet spot in the middle, well-balancing rigidity and robustness. (3) Model with D ≫ N recovers diffusion models. We find that the model with sufficiently large D roughly matches the performance of diffusion models, as predicted by the theory. Further results in Appendix E.1 show that D=3072000 and diffusion models obtain the same FID score when using a more stable training target (Xu et al., 2023) to mitigate the variations between different runs and checkpoints.

6.2. Model robustness versus D

[Figure 5 plots: FID score of D = 64/128/2048 and D → ∞ (diffusion) models, (left) versus the noise-injection hyperparameter α and (right) versus NFE.]
Figure 5. FID score versus (left) α and (right) NFE on CIFAR-10.

In Section 5, we show that the model robustness degrades with an increasing D by analyzing the behavior of perturbation kernels. To further validate the phenomenon, we conduct three sets of experiments with different sources of errors on CIFAR-10. We defer more details to Appendix D.5. Firstly, we perform controlled experiments to compare the robustness of models quantitatively. To simulate the errors, we inject noise into the intermediate point x_r in each of the 35 ODE steps: x_r = x_r + α ε_r where ε_r ∼ N(0, (r/√D) I), and α is a positive number controlling the amount of noise. Fig. 5(a) demonstrates that as α increases, the FID score exhibits a much slower degradation for smaller D. In particular, when D=64, 128, the sample quality degrades gracefully. We further visualize the generated samples in Appendix E.2. It shows that when α=0.2, models with D=64, 128 can still produce clean images while the sampling process of diffusion models (D→∞) breaks down.

In addition to the controlled scenario, we conduct two more realistic experiments: (1) We introduce more estimation error of neural networks by applying post-training quantization (Sung et al., 2015), which can directly compress neural networks without fine-tuning. Table 3 reports the FID score with varying quantization bit-widths for the convolution weight values. We can see that finite Ds have better robustness than the infinite case, and a lower D exhibits a larger performance gain when applying lower bit-width quantization. (2) We increase the discretization error during sampling by using smaller NFEs, i.e., larger sampling steps. As shown in Fig. 5(b), gaps between D=128 and diffusion models gradually widen, indicating greater robustness against the discretization error. The rigidity issue of smaller D also affects the robustness to discretization error, as D=64 is consistently inferior to D=128.

Table 3. FID score versus quantization bit-widths on CIFAR-10.
Quantization bits:       9      8      7      6       5
D = 64                1.96   1.96   2.12   2.94   28.50
D = 128               1.93   1.97   2.15   3.68   34.26
D = 2048              1.91   1.97   2.12   5.67   47.02
D → ∞                1.97   2.04   2.16   5.91   50.09

7. Conclusion and Future Directions

We present a new family of physics-inspired generative models called PFGM++, by extending the dimensionality of the augmented variable in PFGM from 1 to D ∈ ℝ⁺. Remarkably, PFGM++ includes diffusion models as special cases when D→∞. To address issues related to large batch training, we propose a perturbation-based objective. In addition, we show that D effectively controls the robustness and rigidity in the PFGM++ family. Empirical results show that models with finite values of D can perform better than previous state-of-the-art diffusion models, while also exhibiting improved robustness.

There are many potential avenues for future research in the PFGM++ framework. For example, it may be possible to identify the "sweet spot" value of D for different architectures and tasks by analyzing the behavior of errors. Since PFGM++ enables adjusting robustness, another direction is to apply aggressive network compression techniques, i.e., pruning and low-bit training, to smaller D. Furthermore, there may be opportunities to develop stochastic samplers for PFGM++, with the reverse SDE in diffusion models as a special case. Lastly, as diffusion models have been highly optimized for image generation, the PFGM++ framework may show a greater advantage over its special case (diffusion models) in emergent fields, such as biology data.
References

Ascher, U. M. and Petzold, L. R. Computer methods for ordinary differential equations and differential-algebraic equations. 1998.

Banner, R., Nahshan, Y., and Soudry, D. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Neural Information Processing Systems, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. ArXiv, abs/1809.11096, 2019.

Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. WaveGrad: Estimating gradients for waveform generation. ArXiv, abs/2009.00713, 2020.

Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., and Yoon, S. Perception prioritized training of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11472–11481, 2022.

Griffiths, D. J. Introduction to electrodynamics, 2005.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv: Computer Vision and Pattern Recognition, 2015.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.

Jarzynski, C. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56:5018–5035, 1997.

Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., and Chen, M. Point-E: A system for generating 3D point clouds from complex prompts. ArXiv, abs/2212.08751, 2022b.

Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. DreamFusion: Text-to-3D using 2D diffusion. ArXiv, abs/2209.14988, 2022.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. ArXiv, abs/2204.06125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674–10685, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv, abs/2205.11487, 2022.

Shi, C., Luo, S., Xu, M., and Tang, J. Learning gradient fields for molecular conformation generation. In ICML, 2021.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. ArXiv, abs/2010.02502, 2021a.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. ArXiv, abs/2006.09011, 2020.
Appendix
A. Proofs
A.1. Proof of Theorem 3.1
Theorem 3.1. Assume the data distribution p ∈ C¹ and p has compact support. As r_max → ∞, for D ∈ ℝ⁺, the ODE dx/dr = E(x̃)_x/E(x̃)_r defines a bijection between lim_{r_max→∞} p_{r_max}(x) ∝ lim_{r_max→∞} r_max^D/(‖x‖²₂ + r_max²)^{(N+D)/2} when r = r_max and the data distribution p when r = 0.

Proof. Let q_r(x) ∝ ∫ r^D/‖x̃ − ỹ‖^{N+D} p(y)dy. We will show that q_r ∝ ∫ r^D/‖x̃ − ỹ‖^{N+D} p(y)dy is equal to the r-dependent marginal distribution p_r by verifying (1) the starting distribution is correct when r=0; (2) the continuity equation holds, i.e., ∂_r q_r + ∇_x · (q_r E(x̃)_x/E(x̃)_r) = 0. The starting distribution is lim_{r→0} q_r(x) ∝ lim_{r→0} ∫ r^D/‖x̃ − ỹ‖^{N+D} p(y)dy ∝ p(x), which confirms that q_0 = p. The continuity equation can be expressed as …

It means that q_r satisfies the continuity equation for any r ∈ ℝ≥0. Together, we conclude that q_r = p_r. Lastly, note that the terminal distribution is

lim_{r_max→∞} p_{r_max}(x) ∝ lim_{r_max→∞} ∫ r_max^D/‖x̃ − ỹ‖^{N+D} p(y)dy = lim_{r_max→∞} ∫ r_max^D/(‖x − y‖² + r_max²)^{(N+D)/2} p(y)dy
= lim_{r_max→∞} r_max^D/(‖x‖² + r_max²)^{(N+D)/2} + lim_{r_max→∞} ∫ ( r_max^D/(‖x − y‖² + r_max²)^{(N+D)/2} − r_max^D/(‖x‖² + r_max²)^{(N+D)/2} ) p(y)dy
= lim_{r_max→∞} r_max^D/(‖x‖² + r_max²)^{(N+D)/2}    (p has a compact support)
p_r(x|y) ∝ 1/‖x̃ − ỹ‖^{N+D} = 1/(‖x − y‖²₂ + r²)^{(N+D)/2}

where E(x̃ = (x, r))_x is given in Eq. (3). Further, given the same initial point, the trajectory of the PFGM++ ODE (dx/dr = E(x̃)_x/E(x̃)_r) matches the diffusion ODE (Karras et al., 2022) (dx/dt = −σ̇(t)σ(t)∇_x log p_{σ(t)}(x)) in the same limit.

E(x̃)_x = (1/S_{N+D−1}(1)) ∫ (x − y)/‖x̃ − ỹ‖^{N+D} p(y)dy ∝ ∫ p_r(x|y)(x − y)p(y)dy

where the perturbation kernel p_r(x|y) ∝ 1/(‖x − y‖²₂ + r²)^{(N+D)/2}. The direction of the score can also be written down in a similar form:

∇_x log p_σ(x) = (∫ p_σ(x|y)(y − x)/σ² p(y)dy) / p_σ(x) ∝ ∫ p_σ(x|y)(x − y)p(y)dy

where p_σ(x|y) ∝ exp(−‖x − y‖²₂/2σ²). Since p ∈ C¹, and obviously p_r(x|y) ∈ C¹, then lim_{D→∞} ∫ p_r(x|y)(x − y)p(y)dy = ∫ lim_{D→∞} p_r(x|y)(x − y)p(y)dy. It suffices to prove that the perturbation kernel p_r(x|y) point-wisely converges to the
Gaussian kernel p_σ(x|y), i.e., lim_{D→∞} p_r(x|y) = p_σ(x|y), to ensure E(x)_x ∝ ∇_x log p_σ(x). Given ∀x, y ∈ ℝ^N,

lim_{D→∞} p_r(x|y) ∝ lim_{D→∞} 1/(‖x − y‖²₂ + r²)^{(N+D)/2}
= lim_{D→∞} (‖x − y‖²₂ + r²)^{−(N+D)/2}
∝ lim_{D→∞} (1 + ‖x − y‖²₂/r²)^{−(N+D)/2}
= lim_{D→∞} (1 + ‖x − y‖²₂/(Dσ²))^{−(N+D)/2}    (r = σ√D)
= lim_{D→∞} exp( −((N + D)/2) ln(1 + ‖x − y‖²₂/(Dσ²)) )
= lim_{D→∞} exp( −((N + D)/2) · ‖x − y‖²₂/(Dσ²) )    (lim_{D→∞} ‖x − y‖²₂/(Dσ²) = 0)
= exp( −‖x − y‖²₂/(2σ²) )
∝ p_σ(x|y)

Hence lim_{D→∞} p_r(x|y) = p_σ(x|y), and we establish that E(x̃)_x ∝ ∇_x log p_σ(x). We can rewrite the drift term in the PFGM++ ODE as

lim_{D→∞, r=σ√D} √D E(x̃)_x/E(x̃)_r = lim_{D→∞, r=σ√D} (√D ∫ p_r(x|y)(x − y)p(y)dy) / (∫ p_r(x|y)(−r)p(y)dy)
= lim_{D→∞, r=σ√D} (√D ∫ p_r(x|y)(y − x)p(y)dy) / (r p_r(x))
= lim_{D→∞, r=σ√D} (√D ∫ p_σ(x|y)(y − x)p(y)dy) / (r p_σ(x))
= σ∇_x log p_σ(x)    (∇_x log p_σ(x) = (∫ p_σ(x|y)(y − x)/σ² p(y)dy)/p_σ(x))   (10)

which establishes the first part of the theorem. For the second part, by the change-of-variable dσ = dr/√D, the PFGM++ ODE is

lim_{D→∞, r=σ√D} dx/dσ = dx/dr · dr/dσ
= lim_{D→∞, r=σ√D} E(x̃)_x · E(x̃)_r^{−1} · √D
= lim_{D→∞, r=σ√D} (σ∇_x log p_σ(x)/√D) · √D    (by Eq. (10))
= σ∇_x log p_σ(x)
Proof. For ∀x ∈ ℝ^N, the minimizer in the PFGM++ objective (Eq. (6)) at point x̃ = (x, r) is

f*_{θ,PFGM++}(x̃) = lim_{D→∞, r=σ√D} (∫ p_r(x|y) (x − y)/(r/√D) p(y)dy) / p_r(x)
= lim_{D→∞, r=σ√D} (∫ p_σ(x|y) (x − y)/(r/√D) p(y)dy) / p_σ(x)    (By Theorem 4.1, lim_{D→∞} p_r(x|y) = p_σ(x|y))
= (∫ p_σ(x|y) (x − y)/σ p(y)dy) / p_σ(x)   (11)

On the other hand, the minimizer in denoising score matching at point x at noise level σ = r/√(N + D) is

f*_{θ,DSM}(x, σ) = (∫ p_σ(x|y) (x − y)/σ p(y)dy) / p_σ(x)   (12)

lim_{D→∞, r=σ√D} f*_{θ,PFGM++}(x, σ√(N + D)) = f*_{θ,DSM}(x, σ)
p_r(R) ∝ R^{N−1} / (R² + r²)^{(N+D)/2}   (13)

R₁ ∼ Beta(α = N/2, β = D/2)
R₂ = R₁/(1 − R₁)
R₃ = √(r² R₂)
Next, we prove that p(R₃) = p_r(R₃). Note that the pdf of the inverse beta distribution is

p(R₂) ∝ R₂^{N/2−1} (1 + R₂)^{−(N+D)/2}
By change-of-variable, the pdf of R₃ = √(r² R₂) is

p(R₃) ∝ R₂^{N/2−1} (1 + R₂)^{−N/2−D/2} · 2R₃/r²
∝ R₃ R₂^{N/2−1} / (1 + R₂)^{(N+D)/2}
= (R₃/r)^{N−1} / (1 + R₃²/r²)^{(N+D)/2}
∝ R₃^{N−1} / (1 + R₃²/r²)^{(N+D)/2}
∝ R₃^{N−1} / (r² + R₃²)^{(N+D)/2} ∝ p_r(R₃)    (By Eq. (13))

p_{r_max}(x) ∝ lim_{r_max→∞} ∫ r_max^D/‖x̃ − ỹ‖^{N+D} p(y)dy = lim_{r_max→∞} r_max^D/(‖x‖² + r_max²)^{(N+D)/2}

We observe that p_{r_max}(x) is the same as the perturbation kernel p_{r_max}(x|y = 0). Hence we can sample from the prior following x = R₃u with R₃, u defined above and r = r_max.
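The Beta-based construction above translates directly into code. Below is a small numpy sketch (an illustration of Appendix B, not the released implementation) that draws the perturbation radius R₃ for a given r, and samples from the prior p_{r_max} by reusing the same procedure with y = 0 and r = r_max; the example values of N, D and σ_max are arbitrary.

```python
# Sketch: sampling the perturbation radius R3 ~ p_r(R) and the prior p_{r_max}(x).
import numpy as np

def sample_radius(r, N, D, size, rng):
    R1 = rng.beta(N / 2.0, D / 2.0, size=size)      # R1 ~ Beta(N/2, D/2)
    R2 = R1 / (1.0 - R1)                            # inverse-Beta variable
    return np.sqrt(r ** 2 * R2)                     # R3 = sqrt(r^2 * R2)

def sample_prior(r_max, N, D, size, rng):
    R3 = sample_radius(r_max, N, D, size, rng)
    u = rng.normal(size=(size, N))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # uniform direction on the sphere
    return R3[:, None] * u                          # x = R3 * u, i.e. p_{r_max}(x | y = 0)

rng = np.random.default_rng(0)
x0 = sample_prior(r_max=80.0 * np.sqrt(128), N=4, D=128, size=5, rng=rng)
```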
C. r = σ√D for Phase Alignment

C.1. Analysis

In this section, we examine the phase of the intermediate marginal distribution p_r under different Ds to derive an alignment method for hyper-parameters. Consider an N-dimensional dataset D in which the average distance to the nearest neighbor is about l. We consider an arbitrary datapoint x₁ ∈ D and denote its nearest neighbor as x₂. We assume ‖x₁ − x₂‖₂ = l, and a uniform prior on D.

To characterize the phases of p_r, ∀r > 0, we study the perturbation point y ∼ p_r(y|x₁). According to Appendix B, the distance ‖x₁ − y‖ is roughly r√(N/(D−1)). Since p_r(y|x₁) is isotropic, with high probability, the two vectors y − x₁, x₂ − x₁ are approximately orthogonal. In particular, the vector product (y − x₁)ᵀ(x₁ − x₂) = O((1/√N)‖y − x₁‖‖x₁ − x₂‖) = O(rl/√D) w.h.p. It reveals that ‖y − x₂‖ = √(l² + r²N/(D−1) + O(rl/√D)). Fig. 6 depicts the relative positions of x₁, x₂ and the perturbed point y.

The ratio of the posterior of x₂ and x₁ — p_r(x₂|y)/p_r(x₁|y) — is an indicator of different phases of the field (Xu et al., 2023): a point in the nearer field tends to have a smaller ratio. Indeed, the ratio would gradually decay from 1 to 0 when moving from the
far to the near field. We can calculate the ratio of the coefficients after approximating the distance ‖y − x₂‖:

p_r(x₂|y)/p_r(x₁|y) = p_r(y|x₂)/p_r(y|x₁) = ( (l² + r²N/(D−1) + O(rl/√D) + r²) / (r²N/(D−1) + r²) )^{−(N+D)/2}
= ( 1 + (l² + O(rl/√D)) / (r²N/(D−1) + r²) )^{−(N+D)/2}
= exp( −((N+D)/2) · ln( 1 + (l² + O(rl/√D)) / (r²N/(D−1) + r²) ) )
≈ exp( −((N+D)/2) · (l² + O(rl/√D)) / (r²N/(D−1) + r²) )
= exp( −((l² + O(rl/√D))/r²) · (N+D)(D−1)/(2(N+D−1)) )
≈ exp( −((l² + O(rl/√D))/(2r²)) · D )   (14)

Hence the relation r ∝ √D should hold to keep the ratio invariant of the parameter D. On the other hand, by Theorem 4.1 we know that p_σ is equivalent to p_{r=σ√D} when D → ∞. To achieve phase alignment on the dataset, one should roughly set r = σ√D.
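As a quick numerical sanity check of Eq. (14) (not part of the original analysis), the snippet below estimates the posterior ratio p_r(x₂|y)/p_r(x₁|y) for a pair of points at distance l while scaling r = σ√D, illustrating that the ratio stays roughly constant across D once the alignment is applied; the toy values of N, l and σ are arbitrary assumptions.

```python
# Sanity check: the phase indicator p_r(x2|y)/p_r(x1|y) is roughly invariant in D when r = sigma*sqrt(D).
import numpy as np

def posterior_ratio(l, r, N, D, rng, n_trials=2000):
    x1 = np.zeros(N)
    x2 = np.concatenate(([l], np.zeros(N - 1)))        # nearest neighbor at distance l
    R1 = rng.beta(N / 2.0, D / 2.0, size=n_trials)
    R = np.sqrt(r ** 2 * R1 / (1.0 - R1))               # perturbation radius (Appendix B)
    u = rng.normal(size=(n_trials, N))
    y = x1 + R[:, None] * u / np.linalg.norm(u, axis=1, keepdims=True)
    logp = lambda x: -(N + D) / 2.0 * np.log(np.sum((y - x) ** 2, axis=1) + r ** 2)
    return np.exp(logp(x2) - logp(x1)).mean()

rng = np.random.default_rng(0)
for D in [64, 512, 4096]:
    print(D, posterior_ratio(l=1.0, r=0.5 * np.sqrt(D), N=32, D=D, rng=rng))
```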
Algorithm 3 EDM sampling (Heun's 2nd order method)
1: x₀ ∼ N(0, σ²_max I)
2: for i = 0, . . . , T − 1 do
3:   dᵢ = (xᵢ − f_θ(xᵢ, tᵢ))/tᵢ
4:   xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)dᵢ
5:   if tᵢ₊₁ > 0 then
6:     d′ᵢ = (xᵢ₊₁ − f_θ(xᵢ₊₁, tᵢ₊₁))/tᵢ₊₁
7:     xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)(½dᵢ + ½d′ᵢ)
8:   end if
9: end for

Algorithm 4 PFGM++ sampling with hyperparameters transferred from EDM
1: Set r_max = σ_max√D
2: Sample radius R ∼ p_{r_max}(R) and uniform angle v = u/‖u‖₂, with u ∼ N(0, I)
3: Get initial data x₀ = Rv
4: for i = 0, . . . , T − 1 do
5:   dᵢ = (xᵢ − f_θ(xᵢ, tᵢ))/tᵢ
6:   xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)dᵢ
7:   if tᵢ₊₁ > 0 then
8:     d′ᵢ = (xᵢ₊₁ − f_θ(xᵢ₊₁, tᵢ₊₁))/tᵢ₊₁
9:     xᵢ₊₁ = xᵢ + (tᵢ₊₁ − tᵢ)(½dᵢ + ½d′ᵢ)
10:  end if
11: end for
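Below is a compact Python sketch of Algorithm 4: prior sampling via the r_max radius distribution followed by Heun's 2nd-order steps. It assumes a denoiser-style network f_theta(x, t) as in the pseudo-code above and an EDM-style time discretization; both are placeholders rather than the released configuration.

```python
# Sketch of Algorithm 4: PFGM++ sampling with hyperparameters transferred from EDM.
import numpy as np

def pfgmpp_sample(f_theta, N, D, sigma_max=80.0, sigma_min=0.002, T=35, rho=7.0, rng=None):
    rng = rng or np.random.default_rng()
    # steps 1-3: sample x0 = R * v from the prior p_{r_max}, with r_max = sigma_max * sqrt(D)
    r_max = sigma_max * np.sqrt(D)
    R1 = rng.beta(N / 2.0, D / 2.0)
    R = np.sqrt(r_max ** 2 * R1 / (1.0 - R1))
    u = rng.normal(size=N)
    x = R * u / np.linalg.norm(u)
    # EDM-style decreasing time (sigma) discretization, with t_T = 0
    i = np.arange(T)
    t = (sigma_max ** (1 / rho) + i / (T - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    t = np.append(t, 0.0)
    # steps 4-11: Heun's 2nd order method
    for i in range(T):
        d = (x - f_theta(x, t[i])) / t[i]
        x_next = x + (t[i + 1] - t[i]) * d
        if t[i + 1] > 0:
            d2 = (x_next - f_theta(x_next, t[i + 1])) / t[i + 1]
            x_next = x + (t[i + 1] - t[i]) * (0.5 * d + 0.5 * d2)
        x = x_next
    return x

# usage with a trivial stand-in denoiser (always predicts zeros)
sample = pfgmpp_sample(lambda x, t: x * 0.0, N=8, D=128)
```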
D. Experimental Details
We show the experimental setups in section 5, as well as the training, sampling, and evaluation details for PFGM++. All the
experiments are run on four NVIDIA A100 GPUs or eight NVIDIA V100 GPUs.
We sample a large batch of 256 to approximate all the expectations in section 5, such as the average TVDs.
… with the predicted normalized score function in the vanilla diffusion objective (Eq. (8)) re-parameterized as …
… with the predicted normalized electric field in the vanilla PFGM++ objective (Eq. (6)) re-parameterized as

f_θ(x̃) = ( c_skip(r/√D)x + c_out(r/√D)F_θ(c_in(r/√D)x, c_noise(r/√D)) − x ) / (r/√D) ≈ √D E(x̃)_x/E(x̃)_r
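To show how this re-parameterization might look in code, here is a hedged sketch of a wrapper turning a raw network F_theta into the field predictor f_theta above. The preconditioning functions c_skip, c_out, c_in, c_noise are passed in rather than hard-coded, since their exact forms should be taken from Table 1 of Karras et al. (2022) with σ replaced by r/√D.

```python
# Sketch: EDM-style preconditioning wrapper evaluated at the effective noise level sigma = r / sqrt(D).
# c_skip, c_out, c_in, c_noise are user-supplied callables (e.g., the EDM choices).
def make_f_theta(F_theta, D, c_skip, c_out, c_in, c_noise):
    sqrtD = D ** 0.5
    def f_theta(x, r):
        sigma = r / sqrtD
        denoised = c_skip(sigma) * x + c_out(sigma) * F_theta(c_in(sigma) * x, c_noise(sigma))
        return (denoised - x) / (r / sqrtD)   # approximates sqrt(D) * E(x)_x / E(x)_r
    return f_theta
```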
Post-training quantization In the post-training quantization experiments on CIFAR-10, we quantize the weights of
convolutional layers excluding the 32 × 32 layers, as we empirically observe that these input/output layers are more critical
for sample quality.
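As an illustration of the kind of post-training quantization used here, the sketch below rounds convolutional weights to a uniform k-bit grid while allowing certain layers to be skipped; the layer-selection rule and the simple min-max quantizer are assumptions for the example, not the exact procedure of the experiments.

```python
# Sketch: simple k-bit uniform post-training quantization of Conv2d weights in a PyTorch model.
import torch

def quantize_conv_weights(model, bits, skip=lambda name, m: False):
    levels = 2 ** bits - 1
    with torch.no_grad():
        for name, m in model.named_modules():
            if isinstance(m, torch.nn.Conv2d) and not skip(name, m):
                w = m.weight
                lo, hi = w.min(), w.max()
                scale = (hi - lo) / levels
                m.weight.copy_(torch.round((w - lo) / scale) * scale + lo)

# example: quantize to 6 bits, skipping layers flagged by a (hypothetical) name pattern
# quantize_conv_weights(net, bits=6, skip=lambda name, m: "32x32" in name)
```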
E. Extra Experiments
E.1. Stable Target Field
Xu et al. (2023) propose a Stable Target Field objective for training the diffusion models:

∇_x log p_σ(x) ≈ E_{y₁∼p_{0|t}(·|x)} E_{{y_i}_{i=2}^n ∼ p^{n−1}} [ Σ_{k=1}^n ( p_{t|0}(x|y_k) / Σ_j p_{t|0}(x|y_j) ) ∇_x log p_{t|0}(x|y_k) ]

where they sample a large batch of samples {y_i}_{i=2}^n from the data distribution to approximate the score function at x. They
show that the new target can enhance the stability of converged models in different runs/seeds. PFGM++ can be trained in a
[Figure 7 plots: FID score versus training Kimg (150,000–200,000) for D = 128 / 2048 / 3072000 / ∞ (diffusion).]
Figure 7. FID score in the training course when varying D, (a) w/o and (b) w/ moving average.
similar fashion by replacing the target (x − y)/(r/√D) in the perturbation-based objective (Eq. (6)) with

(1/(r/√D)) ( x − E_{p_{0|r}(y|x)}[y] ) ≈ (1/(r/√D)) ( x − E_{y₁∼p_{0|r}(·|x)} E_{{y_i}_{i=2}^n ∼ p^{n−1}} [ Σ_{k=1}^n ( (‖x − y_k‖²₂ + r²)^{−(N+D)/2} / Σ_j (‖x − y_j‖²₂ + r²)^{−(N+D)/2} ) y_k ] )
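A hedged sketch of this STF-style target is shown below: the posterior weights over a reference batch are computed in log-space for numerical stability and used to form the weighted mean of the y_k. The batch handling and the choice of n are illustrative assumptions, not the paper's exact training setup.

```python
# Sketch: STF-style PFGM++ target for a single perturbed point x at augmentation norm r.
import torch

def stf_target(x, y_batch, r, D):
    """x: [N] perturbed point; y_batch: [n, N] with y_batch[0] the sample x was perturbed from."""
    N = x.shape[0]
    logw = -(N + D) / 2.0 * torch.log(((x - y_batch) ** 2).sum(dim=1) + r ** 2)
    w = torch.softmax(logw, dim=0)                 # posterior weights p_{0|r}(y_k | x)
    y_hat = (w[:, None] * y_batch).sum(dim=0)      # weighted mean of the reference batch
    return (x - y_hat) / (r / D ** 0.5)            # replaces (x - y) / (r / sqrt(D)) in Eq. (6)
```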
When n = 1, the new target reduces to the original target. Similar to (Xu et al., 2023), one can show that the bias of
the new target together with its trace-of-covariance shrinks to zero as we increase the size of the large batch. This new
target can alleviate the variations between random seeds. With the new STF-style target, Table 4 shows that when setting
D = 3072000 ≫ N = 3072, the model obtains the same FID score as the diffusion models (EDM (Karras et al., 2022)). It aligns with the theoretical results in Sec 4, which state that PFGM++ recover the diffusion model when D → ∞.

Table 4. FID and NFE on CIFAR-10, using the Stable Target Field (Xu et al., 2023) in the training objective.
                                  FID ↓   NFE ↓
D = 3072000                        1.90     35
D → ∞ (Karras et al., 2022)       1.90     35
(a) D=64, α = 0 (FID=1.96) (b) D=64, α = 0.1 (FID=1.97) (c) D=64, α = 0.2 (FID=2.07)
(d) D=128, α = 0 (FID=1.92) (e) D=128, α = 0.1 (FID=1.95) (f) D=128, α = 0.2 (FID=2.19)
(g) D=2048, α = 0 (FID=1.92) (h) D=2048, α = 0.1 (FID=1.95) (i) D=2048, α = 0.2 (FID=2.19)
Figure 8. Generated samples on CIFAR-10 with varied hyper-parameter for noise injection (α). Images from top to bottom rows are
produced by models trained with D = 64/128/2048/∞. We use the same random seeds for finite Ds during image generation.
Figure 9. Generated images on FFHQ 64 × 64 dataset, by (left) D = 128 and (right) EDM (D → ∞).