Progressive Deblurring of Diffusion Models for Coarse-to-Fine Image Synthesis
Figure 1: Reverse generative processes of two different diffusion models. (a) Previous diffusion
models generate images by gradually strengthening signals. (b) The proposed method synthesizes
images through progressive deblurring in a coarse-to-fine manner.
Abstract
Recently, diffusion models have shown remarkable results in image synthesis by
gradually removing noise and amplifying signals. Although the simple generative
process surprisingly works well, is this the best way to generate image data?
For instance, despite the fact that human perception is more sensitive to the low
frequencies of an image, diffusion models themselves do not consider any relative
importance of each frequency component. To incorporate this inductive bias, we generalize
diffusion models by enabling diffusion in a rotated coordinate system with different
velocities for each component, and propose blur diffusion as a special case, which
generates images in a coarse-to-fine manner by progressive deblurring followed by denoising.
1 Introduction
After the initial development by Sohl-Dickstein et al. [2015], diffusion models have been rapidly
improved [Ho et al., 2020, Dhariwal and Nichol, 2021, Song and Ermon, 2019, Song et al., 2020]
to the point that they outperform GANs in both fidelity and diversity in image
synthesis. Since these models offer better mode coverage, they are widely used in various tasks
such as image generation [Dhariwal and Nichol, 2021], super resolution [Li et al., 2022, Saharia
et al., 2021], text-conditional image generation [Nichol et al., 2021, Ramesh et al., 2022], video
generation [Ho et al., 2022b], etc.
Since a forward process of diffusion models attenuates signals by adding noise progressively, a
reverse process generates data by gradually removing noise and amplifying signals. Although this
formulation affords great simplicity (e.g., no need to deal with the covariance matrix) and surprisingly
works well, it may not be the best way to generate image data. For instance, despite the fact that
human perception is more sensitive to the low-frequencies of an image, diffusion models themselves
do not consider any relative importance of each frequency component.
To incorporate the inductive bias for image data, several methods have been suggested to focus on
coarse patterns of an image to improve the perceptual quality of generated samples. For instance,
diffusion models are usually trained on a re-weighted variational lower bound [Ho et al., 2020], which
emphasizes global consistency and the coarse-level pattern of images while placing less focus on
imperceptible details [Kingma et al., 2021]. The performance of diffusion models can be greatly
improved by adopting a coarse-to-fine strategy, where a low-resolution image is generated first and
then upsampled by separate diffusion upsamplers [Ho et al., 2022a, Dhariwal and Nichol, 2021,
Nichol and Dhariwal, 2021, Nichol et al., 2021, Ramesh et al., 2022]. By explicitly partitioning the
generative process into the stage of generating coarse structure and the stages of adding details, these
models are capable of producing convincing images, especially at high-resolution.
However, dividing the process into a predetermined number of stages is somewhat arbitrary and requires
learning a separate upsampler for each stage. In this paper, we propose a novel generative process
that synthesizes images in a coarse-to-fine manner. Our model does not require any upsamplers
or separate stages. Instead, we generalize the standard diffusion models by enabling diffusion in
a rotated coordinate system with different velocities for each component of the vector. We further
propose a blur diffusion as a special case of it, where each frequency component of an image is
diffused at different speeds. In particular, our blur diffusion consists of a forward process that blurs
an image while adding noise gradually, and a corresponding reverse process that deblurs an image
while removing noise progressively. Experiments show that the proposed model outperforms the
previous method in FID on LSUN bedroom and church datasets (64×64).
2 Blur diffusion
Coarse-to-fine generation in image synthesis is a successful strategy for both GANs [Karras et al.,
2017, 2019] and diffusion models [Ho et al., 2022a, Dhariwal and Nichol, 2021, Nichol and Dhariwal,
2021, Nichol et al., 2021, Ramesh et al., 2022]. The most intuitive way to enable the strategy without
separate stages is to define a gradual blurring forward process and reverse it. This can be seen as
diffusion in a rotated coordinate system with different velocities for each component of the vector.
We first introduce a generalized diffusion process (Sec. 2.1) and propose the blur diffusion as a special
case of it (Sec. 2.2).
Before introducing our contributions, we refer the reader to Appendix A for background on
diffusion models. For each training data $x_0 \sim q_0(x)$, a forward process of the variance
preserving diffusion models [Song et al., 2020] is defined by the following Markov chain:
$$x_i = \sqrt{1-\beta_i}\, x_{i-1} + \sqrt{\beta_i}\, z_i, \qquad i = 1, \ldots, N, \tag{1}$$
where $z_i \sim \mathcal{N}(0, I)$ and $\{\beta_i\}_{i=1}^{N}$ is a pre-defined noise schedule. As one can see,
the standard diffusion process is defined directly in the image space, assuming independence
between pixels. Our aim is to generalize this process to a rotated coordinate system. To this
end, we define an orthogonal matrix $U$ and, subsequently, a vector rotated by the matrix as
$\bar{x} := U^T x$. With a slight abuse of notation, we define the fractional power of a positive
semi-definite matrix $P$ as taking the power of each eigenvalue, i.e.,
$P^p = (U \Lambda U^T)^p = U \Lambda' U^T$, where $[\Lambda']_{ii} = [\Lambda]_{ii}^{p}$.
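For concreteness, this fractional power can be computed by an explicit eigendecomposition; a minimal NumPy sketch (assuming a symmetric positive semi-definite input):

```python
import numpy as np

# Minimal sketch: fractional power of a symmetric PSD matrix via its
# eigendecomposition, matching P^p = U diag([Lambda]_ii^p) U^T above.
def matrix_power(P: np.ndarray, p: float) -> np.ndarray:
    eigvals, U = np.linalg.eigh(P)         # eigh: for symmetric matrices
    eigvals = np.clip(eigvals, 0.0, None)  # guard against round-off negatives
    return U @ np.diag(eigvals ** p) @ U.T

# Example: the matrix square root satisfies P^(1/2) @ P^(1/2) == P.
P = np.array([[2.0, 1.0], [1.0, 2.0]])
assert np.allclose(matrix_power(P, 0.5) @ matrix_power(P, 0.5), P)
```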
Then, we define a generalized forward diffusion process with the following Markov chain:
$$q(\bar{x}_i \mid \bar{x}_{i-1}) = \mathcal{N}\big(\bar{x}_i;\, (I - B_i)^{\frac{1}{2}} \bar{x}_{i-1},\, B_i\big), \tag{2}$$
where $B_i$ is a diagonal matrix that defines the noise schedule of the process. Note that (2) is a
generalized version of the standard diffusion, as the standard diffusion is retrieved when we set
$U = I$ and $B_i = \beta_i I$. In other words, we introduce more flexibility into the design space of
diffusion models by enabling 1) diffusion in a rotated coordinate system, where dependency between
pixels can be imposed, and 2) diffusion with different velocities for each component of the vector.
Due to the properties of diagonal matrices, we arrive at an analytically tractable conditional
distribution
$$q(\bar{x}_i \mid \bar{x}_0) = \mathcal{N}\big(\bar{x}_i;\, \bar{A}_i^{\frac{1}{2}} \bar{x}_0,\, I - \bar{A}_i\big), \tag{3}$$
where we have defined $A_i := I - B_i$ and $\bar{A}_i := \prod_{j=1}^{i} A_j$, analogous to [Ho et al., 2020].
Eq. (3) allows one to directly calculate $x_i$ from $x_0$:
$$x_i = U \bar{A}_i^{\frac{1}{2}} U^T x_0 + U (I - \bar{A}_i)^{\frac{1}{2}} U^T \epsilon, \tag{4}$$
where $\epsilon \sim \mathcal{N}(0, I)$. The tractability of Eq. (4) in turn means that we can efficiently train
these models with denoising score matching [Vincent, 2011], as in prior studies.
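A minimal sketch of Eq. (4) in code (NumPy, with the image flattened to a vector; $U$ and the diagonal of $\bar{A}_i$ are assumed precomputed):

```python
import numpy as np

# Sketch of Eq. (4): draw x_i from x_0 in one shot. x0 is a flattened image
# vector, U an orthogonal matrix, A_bar_diag the diagonal of A-bar_i.
def sample_xi(x0, U, A_bar_diag, rng):
    eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I), the training target
    x_bar = np.sqrt(A_bar_diag) * (U.T @ x0) \
          + np.sqrt(1.0 - A_bar_diag) * (U.T @ eps)
    return U @ x_bar, eps                 # rotate back to image coordinates
```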
While the choice of the rotation matrix $U$ and the noise schedule $B_i$ is flexible, here we propose
an especially effective choice that can be characterized as a blurring diffusion for the forward
process. For simplicity and ease of computation, we utilize Gaussian blur with symmetric,
separable kernels of a pre-defined variance $\sigma^2$. Since Gaussian blur is a linear operation, it can
be approximated as a matrix multiplication with a circular symmetric matrix $W$. With a
monotonically increasing function $f(i)$ that determines the blur schedule and $W_i = W^{f(i)}$, we
define a blurring diffusion process as follows:
$$q(x_i \mid x_{i-1}) = \mathcal{N}\big(x_i;\, \sqrt{1-\beta_i}\, W_i x_{i-1},\, C_i\big), \tag{5}$$
where we set $C_i = I - (1-\beta_i) W_i^2$ to ensure the process preserves unit variance. Eq. (5) can
also be written as follows:
$$x_i = x_{i-1} - H(x_{i-1}, i-1) + C_i^{\frac{1}{2}} z_i, \tag{6}$$
where $H(x_i, i) = x_i - \sqrt{1-\beta_{i+1}}\, W_{i+1} x_i$ is an unnormalized Gaussian high-pass filter.
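For intuition, a minimal sketch of one forward step of Eqs. (5)-(6), simplified to 1D with circular boundary conditions so that $W$ diagonalizes in the Fourier basis (an assumption made here for brevity; the actual implementation uses the eigendecomposition of Kawar et al. [2022], introduced below):

```python
import numpy as np

# One forward blurring step, Eq. (5): x_i = sqrt(1-b_i) W_i x_{i-1} + C_i^{1/2} z_i.
# w_eig: eigenvalues of the blur W, i.e. the DFT of its symmetric kernel
# (real; assumed non-negative); f_i is the blur schedule value f(i).
def forward_blur_step(x_prev, w_eig, beta_i, f_i, rng):
    d = w_eig ** f_i                                 # eigenvalues of W_i = W^f(i)
    mean = np.sqrt(1 - beta_i) * np.fft.ifft(d * np.fft.fft(x_prev)).real
    c_diag = 1 - (1 - beta_i) * d ** 2               # eigenvalues of C_i
    z = rng.standard_normal(x_prev.shape)
    noise = np.fft.ifft(np.sqrt(c_diag) * np.fft.fft(z)).real   # C_i^{1/2} z
    return mean + noise
```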
Unlike the standard forward diffusion process, where the signal is attenuated holistically (see
Eq. (14)), our forward process destroys high frequencies much faster. To match the definition of
the generalized diffusion, we factor the symmetric matrix $W$ by eigendecomposition as
$W = \tilde{U} D \tilde{U}^T$ and, subsequently, $W_i = \tilde{U} D^{f(i)} \tilde{U}^T$. We employ the memory-efficient
eigendecomposition of Kawar et al. [2022] (see Appendix D of DDRM). This leads us to the
following proposition.
Proposition 1. Let $B_i = I - (1-\beta_i) D^{2f(i)}$ and $U = \tilde{U}$. Then, (2) is equivalent to (5).
Proof is deferred to Appendix E. Due to Proposition 1, we can efficiently train the model using the
denoising score matching objective:
$$L = \mathbb{E}_i \Big\{ \lambda(i)\, \mathbb{E}_{q_0(x)} \mathbb{E}_{q(x_i \mid x_0)} \big[ \| s_\theta(x_i, i) - \underbrace{(-U (I - \bar{A}_i)^{-\frac{1}{2}} U^T \epsilon)}_{= \nabla_{x_i} \log q_i(x_i \mid x_0)} \|_2^2 \big] \Big\}. \tag{7}$$
Parameterizing the score model as
$$s_\theta(x_i, i) = -U (I - \bar{A}_i)^{-\frac{1}{2}} U^T \epsilon_\theta(x_i, i), \tag{8}$$
the objective is equivalent to
$$\mathbb{E}_i \Big\{ \lambda(i)\, \mathbb{E}_{q_0(x)} \mathbb{E}_{q(x_i \mid x_0)} \big[ \| U (I - \bar{A}_i)^{-\frac{1}{2}} U^T (\epsilon_\theta(x_i, i) - \epsilon) \|_2^2 \big] \Big\}. \tag{9}$$
In practice, we found it beneficial for sample quality to use the following variant of Eq. (9):
$$L = \mathbb{E}_i \Big\{ \lambda(i)\, \mathbb{E}_{q_0(x)} \mathbb{E}_{q(x_i \mid x_0)} \big[ \| \epsilon_\theta(x_i, i) - \epsilon \|_2^2 \big] \Big\}, \tag{10}$$
which resembles a re-weighted VLB [Ho et al., 2020].
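A minimal PyTorch-style sketch of the objective in Eq. (10) (the network `model`, the rotation $U$, and the diagonal schedule of $\bar{A}_i$ are assumed given; shapes are flattened for clarity):

```python
import torch

# Training loss of Eq. (10): predict the noise eps from the blurred-and-noised
# x_i built via Eq. (4). lambda(i) = 1, as in the experiments.
def loss_fn(model, x0, i, U, A_bar_diag):
    eps = torch.randn_like(x0)
    x_bar = A_bar_diag[i].sqrt() * (U.T @ x0) \
          + (1 - A_bar_diag[i]).sqrt() * (U.T @ eps)
    x_i = U @ x_bar                            # Eq. (4), x0 as a flat vector
    return ((model(x_i, i) - eps) ** 2).mean()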
Reverse deblurring process  After training, we can sample images using a reverse diffusion
sampler:
$$x_{i-1} = x_i + H(x_i, i) + U B_{i+1} U^T s_\theta(x_i, i) + U B_{i+1}^{\frac{1}{2}} U^T z_i, \tag{11}$$
or, equivalently,
$$x_{i-1} = x_i + H(x_i, i) - U B_{i+1} (I - \bar{A}_i)^{-\frac{1}{2}} U^T \epsilon_\theta(x_i, i) + U B_{i+1}^{\frac{1}{2}} U^T z_i, \tag{12}$$
which is analogous to unsharp masking [Szeliski, 2010] followed by the denoising term to remove
amplified noise. Through the process, our model generates an image in a coarse-to-fine manner by
progressive deblurring followed by denoising (see Figure 1).
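A sketch of one step of Eq. (12) in code (the schedule tensors and the `rot`/`rot_t` helpers applying $U$ and $U^T$ are assumed given; boundary handling at the final step is glossed over):

```python
import torch

# One step of the reverse deblurring sampler, Eq. (12): unsharp-masking term
# H, a denoising correction from eps_theta, then fresh noise.
def reverse_step(x_i, i, model, rot, rot_t, beta, B_diag, A_bar_diag, w_eig, f):
    d = w_eig ** f[i + 1]                                # eigenvalues of W_{i+1}
    H = x_i - (1 - beta[i + 1]).sqrt() * rot(d * rot_t(x_i))   # H(x_i, i)
    eps = model(x_i, i)                                  # eps_theta(x_i, i)
    denoise = rot(B_diag[i + 1] * (1 - A_bar_diag[i]).rsqrt() * rot_t(eps))
    z = torch.randn_like(x_i) if i > 1 else torch.zeros_like(x_i)  # no noise at the end
    return x_i + H - denoise + rot(B_diag[i + 1].sqrt() * rot_t(z))
```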
3 Experiments

  f(N)            f_type    bedroom    church
  0 (w/o blur)    N/A       9.24       6.04
  0.6             log       73.23      -
  0.14            quartic   7.86       5.89

Table 1: FID-10K results on LSUN bedroom and church-outdoor datasets (64×64). We fix f(0) to 0.
We note that standard diffusion models are a special case of our model with f(i) = 0. Table 1
demonstrates that our model outperforms the standard diffusion model when f_type is quartic with
f(N) = 0.14. We provide the functional form of each f_type in the Appendix. We compute FID using
only 10K samples, which is acceptable since we measure relative differences within the same
framework. See Appendix B for additional experiments.
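The exact functional forms are given in the appendix; purely as hypothetical illustrations of monotonically increasing schedules with f(0) = 0, one might consider:

```python
import numpy as np

# Hypothetical blur schedules for illustration only; the paper's exact
# f_type definitions are in its appendix and may differ from these.
def f_quartic(i, N, f_N=0.14):
    return f_N * (i / N) ** 4                 # slow early growth, fast near i = N

def f_log(i, N, f_N=0.6):
    return f_N * np.log1p(i) / np.log1p(N)    # fast early growth, then flattens
```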
4 Conclusion
In this paper, we generalized previous diffusion models and provided an effective way to impose
inductive bias on diffusion models. We further proposed blur diffusion as a special case.
Blur diffusion generates images in a coarse-to-fine manner by progressive deblurring followed by
denoising. Experiments showed that our model can synthesize more perceptually compelling samples
than previous methods. We look forward to scaling up and applying the model to various applications.
Acknowledgement
This work was supported by the National Research Foundation of Korea under Grant NRF-
2020R1A2B5B03001980, and by the KAIST Key Research Institute (Interdisciplinary Research
Group) Project.
References
B. D. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications,
12(3):313–326, 1982.
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in Neural
Information Processing Systems, 34, 2021.
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 33:6840–6851, 2020.
J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for
high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022a.
J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models.
arXiv preprint arXiv:2204.03458, 2022b.
B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola. Subspace diffusion generative models. arXiv
preprint arXiv:2205.01490, 2022.
T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality,
stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial
networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 4401–4410, 2019.
B. Kawar, M. Elad, S. Ermon, and J. Song. Denoising diffusion restoration models. arXiv preprint
arXiv:2201.11793, 2022.
D. P. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. arXiv preprint
arXiv:2107.00630, 2021.
H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen. Srdiff: Single image
super-resolution with diffusion probabilistic models. Neurocomputing, 2022.
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen.
Glide: Towards photorealistic image generation and editing with text-guided diffusion models.
arXiv preprint arXiv:2112.10741, 2021.
A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International
Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image
generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
S. Rissanen, M. Heinonen, and A. Solin. Generative modelling with inverse heat dissipation. arXiv
e-prints, pages arXiv–2206, 2022.
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis
with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 10684–10695, 2022.
C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi. Image super-resolution via
iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages
2256–2265. PMLR, 2015.
Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances
in Neural Information Processing Systems, 32, 2019.
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative
modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
R. Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media,
2010.
A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. Advances in
Neural Information Processing Systems, 34:11287–11302, 2021.
P. Vincent. A connection between score matching and denoising autoencoders. Neural computation,
23(7):1661–1674, 2011.
Figure 2: Results on LSUN-bedroom 64 × 64. f_type: log (left), f_type: quartic (right).
APPENDIX
A Background
In this section, we briefly overview the variance preserving diffusion models [Song et al., 2020]. For
each training data x0 ∼ q0 (x), a forward process is defined from the following Markov chain:
$$x_i = \sqrt{1-\beta_i}\, x_{i-1} + \sqrt{\beta_i}\, z_i, \qquad i = 1, \ldots, N, \tag{13}$$
where $z_i \sim \mathcal{N}(0, I)$ and $\{\beta_i\}_{i=1}^{N}$ is a pre-defined noise schedule. Another way to see
Eq. (13) is:
$$x_i = x_{i-1} - \big(1 - \sqrt{1-\beta_i}\big) x_{i-1} + \sqrt{\beta_i}\, z_i, \tag{14}$$
which means that each step of the forward process consists of attenuating signals holistically and
adding Gaussian noise. It is noteworthy that $q(x_i \mid x_0)$ is also a Gaussian distribution and can be
written in closed form, allowing efficient training. Specifically, using the notation $\alpha_i = 1 - \beta_i$
and $\bar{\alpha}_i = \prod_{j=1}^{i} \alpha_j$, we have
$$q(x_i \mid x_0) = \mathcal{N}\big(x_i;\, \sqrt{\bar{\alpha}_i}\, x_0,\, (1 - \bar{\alpha}_i) I\big). \tag{15}$$
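A minimal sketch of Eq. (15) in code (assuming the linear beta schedule of Ho et al. [2020]):

```python
import numpy as np

# Closed-form marginal of the VP forward process, Eq. (15).
N = 1000
betas = np.linspace(1e-4, 0.02, N)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative products alpha-bar_i

def q_sample(x0, i, rng):
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[i]) * x0 + np.sqrt(1.0 - alpha_bar[i]) * z
```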
To generate clean images, we need to invert the noising process using sampling methods, which
requires estimating the time-conditional score function $\nabla_x \log q_i(x)$, where
$q_i(x) = \int q(x_i \mid x_0)\, q_0(x_0)\, dx_0$. One way to estimate the score function is to minimize the
denoising score matching objective [Vincent, 2011]:
$$\theta^* = \arg\min_\theta \mathbb{E}_i \Big\{ \lambda(i)\, \mathbb{E}_{q_0(x)} \mathbb{E}_{q(x_i \mid x_0)} \big[ \| s_\theta(x_i, i) - \nabla_{x_i} \log q_i(x_i \mid x_0) \|_2^2 \big] \Big\}, \tag{16}$$
where $\lambda(i)$ is a non-negative weighting function. When $s_\theta(x_i, i)$ is a reasonable predictor of the
score function, sampling can be done using, for instance, the reverse diffusion sampler of Song
et al. [2020]:
$$x_{i-1} = x_i - \big(\sqrt{1-\beta_{i+1}} - 1\big) x_i + \beta_{i+1}\, s_\theta(x_i, i) + \sqrt{\beta_{i+1}}\, z_i. \tag{17}$$
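One step of Eq. (17) in code (a minimal sketch; `score` stands in for the learned $s_\theta$):

```python
import numpy as np

# Reverse diffusion sampler step, Eq. (17).
def reverse_diffusion_step(x_i, i, score, betas, rng):
    b = betas[i + 1]
    mean = x_i - (np.sqrt(1.0 - b) - 1.0) * x_i + b * score(x_i, i)
    z = rng.standard_normal(x_i.shape) if i > 0 else np.zeros_like(x_i)
    return mean + np.sqrt(b) * z
```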
Another popular sampling method is ancestral sampling [Ho et al., 2020], which is a different
discretization of the same reverse-time stochastic differential equation (SDE) [Song et al., 2020].
Since the diffusion sampler (17) can be derived in a conceptually simple manner for an arbitrary SDE,
we extensively use it in this paper.
B Experiments
Experiment details  We set N = 1000 and sample using N steps for all experiments. All models
are trained on a single V100 GPU with a batch size of 16. We train models for 450K and 600K
iterations on the LSUN bedroom and church datasets, respectively. We set the learning rate to 5e-5,
the EMA decay factor to 0.9999, σ to 0.4, and λ(i) to 1. For pre-processing, we resize images to
64 × 64 without cropping. We do not use any dropout in our experiments. We provide the detailed
model configuration and computational requirements in the Appendix.

Figure 3: Comparison of generated images with different generation strategies. Left: fine-to-coarse;
right: coarse-to-fine.
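For reference, the reported hyperparameters collected in one place (the dictionary is merely a summary of the values above):

```python
# Hyperparameters reported in this section.
config = dict(
    diffusion_steps=1000,   # N; sampling also uses N steps
    batch_size=16,          # single V100 per run
    learning_rate=5e-5,
    ema_decay=0.9999,
    blur_sigma=0.4,         # std of the Gaussian blur kernel
    loss_weight=1.0,        # lambda(i)
    image_size=64,          # resized without cropping
    dropout=0.0,
)
```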
C Discussion
Perceptual quality  While diffusion models can be trained in a perceptual quality-oriented way
using hand-crafted weighting functions, our method provides a more explicit way to focus on the
coarse pattern of images: training a score estimator on blurred images. Moreover, our model
emphasizes the coarse patterns during the sampling process, as the score function points in the
steepest direction toward the high-density region of the low frequencies, especially when i is close
to N. Since we did not sweep over blur schedules other than what we reported, finding the optimal
noise and blur schedule for our model would be a valuable direction for future work.

Figure 4: Different functional forms of blur schedule f(i) we experimented with.
Different basis  Although we mainly discuss blur diffusion as a special case, our generalized
diffusion is broadly applicable to an arbitrary coordinate system, depending on the choice of the
orthonormal basis U. For instance, one can perform diffusion in a frequency domain, such as that
of the Fourier transform or the discrete cosine transform. Moreover, our approach is not restricted
to image data and can be used for other data modalities, as we provide a general method for
imposing inductive bias on diffusion models.
D Related works
Several approaches have been proposed to find a better space for diffusion models. Recently, Vahdat
et al. [2021] and Rombach et al. [2022] proposed to train the diffusion model in the learned latent
space. Unlike our method, these approaches require the training of an autoencoder and have no
control over the learned space, which is necessary to impose the inductive bias for a certain data
modality of interest. Jing et al. [2022] utilize orthogonal projection to destroy the components
orthogonal to the data manifold faster. Unlike our work, they focus on reducing sampling costs,
and their method requires predetermined time steps at which the projection is performed.
Rissanen et al. [2022] concurrently proposed a deblurring generative process by reversing the heat
equation. Indeed, their method is a special case of our generalized diffusion in which the columns
of U form a cosine basis. Although solving the heat equation does not involve noise, they
empirically found that a small amount of noise (with a variance of 0.1) in both the forward and
the generative process is crucial for sample quality, and the noise strength for both processes is
chosen by trial and error. In contrast, our approach naturally involves noise, as we interpret the
proposed diffusion process from the SDE perspective. Therefore, once the noise schedule of the
forward process is determined, the reverse-time noise strength is rigorously derived from the
forward process, whereas Rissanen et al. [2022] swept over the reverse-time noise strength δ.
Finally, their method has not yet demonstrated improved performance, while ours improves upon
standard diffusion models as a generalization of them.
E Proof
In this section, we provide a proof for Proposition 1. With Eq. (2), $\bar{x}_i$ is represented as follows:
$$\bar{x}_i = \sqrt{1-\beta_i}\, D^{f(i)} \bar{x}_{i-1} + \big(I - (1-\beta_i) D^{2f(i)}\big)^{\frac{1}{2}} z_i. \tag{18}$$
Using the definition of $\bar{x}_i$, we have
$$x_i = \sqrt{1-\beta_i}\, \tilde{U} D^{f(i)} \tilde{U}^T x_{i-1} + \tilde{U} \big(I - (1-\beta_i) D^{2f(i)}\big)^{\frac{1}{2}} \tilde{U}^T \bar{z}_i \tag{19}$$
$$= \sqrt{1-\beta_i}\, \tilde{U} D^{f(i)} \tilde{U}^T x_{i-1} + \big(\tilde{U} (I - (1-\beta_i) D^{2f(i)}) \tilde{U}^T\big)^{\frac{1}{2}} \bar{z}_i \tag{20}$$
$$= \sqrt{1-\beta_i}\, W_i x_{i-1} + C_i^{\frac{1}{2}} \bar{z}_i, \tag{21}$$
where $\bar{z}_i \sim \mathcal{N}(0, I)$. Note that $C_i = I - (1-\beta_i) \tilde{U} D^{2f(i)} \tilde{U}^T = \tilde{U} (I - (1-\beta_i) D^{2f(i)}) \tilde{U}^T$.
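Proposition 1 can also be verified numerically; a minimal 1D sketch assuming circular boundary conditions, so that W is a symmetric circulant matrix:

```python
import numpy as np

# Numerical sanity check of Proposition 1 in 1D.
n, sigma, beta, f_i = 16, 0.4, 0.01, 0.5
k = np.arange(n); k[k > n // 2] -= n                  # wrapped frequencies
kernel = np.exp(-k.astype(float) ** 2 / (2 * sigma ** 2))
kernel /= kernel.sum()                                # symmetric Gaussian kernel
W = np.stack([np.roll(kernel, s) for s in range(n)], axis=1)  # circulant blur

D, U = np.linalg.eigh(W)                              # W = U diag(D) U^T
D = np.clip(D, 0.0, None)                             # guard round-off negatives
W_i = U @ np.diag(D ** f_i) @ U.T                     # W_i = W^{f(i)}
B_i = 1.0 - (1.0 - beta) * D ** (2 * f_i)             # diagonal of B_i (Prop. 1)

C_i = np.eye(n) - (1.0 - beta) * W_i @ W_i            # covariance of Eq. (5)
assert np.allclose(C_i, U @ np.diag(B_i) @ U.T)       # equals rotated Eq. (2) cov.
```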
F Reverse diffusion sampler

For sampling, we further discretize Eq. (25) as follows:
$$x_{i-1} = x_i - f_i(x_i, i) + G_i G_i^T s_\theta(x_i, i) + G_i z_i, \tag{26}$$
and this is called the reverse diffusion sampler. For blur diffusion, we set $f_i(x_i, i) = -H(x_i, i)$ and
$G_i = C_{i+1}^{\frac{1}{2}} = \tilde{U} B_{i+1}^{\frac{1}{2}} \tilde{U}^T$ from Eq. (6), and thus we have:
$$x_{i-1} = x_i + H(x_i, i) + \tilde{U} B_{i+1} \tilde{U}^T s_\theta(x_i, i) + \tilde{U} B_{i+1}^{\frac{1}{2}} \tilde{U}^T z_i. \tag{27}$$
G Architecture configuration
We utilize the UNet architecture of Dhariwal and Nichol [2021]. A detailed architecture configuration
is as follows: