
Q-Diffusion: Quantizing Diffusion Models

Xiuyu Li (UC Berkeley), Yijiang Liu (Nanjing University), Long Lian (UC Berkeley), Huanrui Yang (UC Berkeley), Zhen Dong (UC Berkeley), Daniel Kang (UIUC), Shanghang Zhang (Peking University), Kurt Keutzer (UC Berkeley)

arXiv:2302.04304v3 [cs.CV] 8 Jun 2023

Preprint. Under review.

Abstract

Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computational intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored to the unique multi-timestep pipeline and model architecture of diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify the key difficulties of diffusion model quantization as the changing output distributions of the noise estimation network over multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with timestep-aware calibration and split shortcut quantization in this work. Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (a small FID change of at most 2.34, compared to >100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we run Stable Diffusion in 4-bit weights with high generation quality for the first time.

1. Introduction

Diffusion models have shown great success in generating images with both high diversity and high fidelity [42, 12, 43, 41, 7, 32, 35, 33]. Recent work [15, 14] has demonstrated performance superior to state-of-the-art GAN models, which suffer from unstable training. As a class of flexible generative models, diffusion models demonstrate their power in various applications such as image super-resolution [36, 17], inpainting [43], shape generation [3], graph generation [30], image-to-image translation [39], and molecular conformation generation [46].

However, the generation process of diffusion models can be slow due to the need for iterative noise estimation over 50 to 1,000 time steps [12, 41] using complex neural networks. While previous state-of-the-art approaches (e.g., GANs) are able to generate multiple images in under a second, it normally takes several seconds for a diffusion model to sample a single image. Consequently, speeding up the image generation process becomes an important step toward broadening the applications of diffusion models. Previous work has attacked this problem by finding shorter, more effective sampling trajectories [41, 29, 38, 21, 1, 23], which reduce the number of steps in the denoising process. However, they have largely ignored another important factor: the noise estimation model used in each iteration is itself compute- and memory-intensive. This factor is orthogonal to the repetitive sampling; it not only slows down the inference speed of diffusion models but also poses crucial challenges in terms of high memory footprint.

This work explores the compression of the noise estimation model used in the diffusion model to accelerate the denoising of all time steps. Specifically, we propose exploring post-training quantization (PTQ) on the diffusion model. PTQ has already been well studied in other learning domains like classification and object detection [4, 2, 19, 9, 22], and has been considered a go-to compression method given its minimal requirement for training data and its straightforward deployment on real hardware devices. However, the iterative computation process of the diffusion model and the model architecture of the noise estimation network bring unique challenges to the PTQ of diffusion models. PTQ4DM [40] presents an inaugural application of PTQ to compress diffusion models down to 8-bit, but it primarily focuses on smaller datasets and lower resolutions.

[Figure 1 schematic: (a) Calibration dataset creation — traditional PTQ samples from training data, synthesizes data from network statistics, or samples randomly, whereas Q-Diffusion performs time step-aware uniform sampling along the denoising trajectory p_θq(x_{t−1} | x_t); (b) Model inference — quantization errors accumulate at each time step, and naive linear quantization incurs severe quality loss.]
Figure 1: Conventional PTQ scenarios and Q-Diffusion differ in (a) calibration dataset creation and (b) the model inference workflow. Traditional PTQ approaches sample data randomly [9], synthesize data from statistics in model layers [4], or draw from the training set to create the calibration dataset [27, 19], which is either inconsistent with the real inputs seen at inference time or not data-free. In contrast, Q-Diffusion constructs calibration datasets with inputs that accurately reflect the data seen in production, in a data-free manner. Traditional PTQ inference only needs to go through the quantized model θq one time, while Q-Diffusion needs to address the quantization errors accumulated over the multi-time-step inference.

Our work, evolving concurrently with [40], offers a comprehensive analysis of the novel challenges of performing PTQ on diffusion models. Specifically, as visualized in Figure 1(a), we discover that the output distribution of the noise estimation network at each time step can be largely different, and naively applying previous PTQ calibration methods with an arbitrary time step leads to poor performance. Furthermore, as illustrated in Figure 1(b), the iterative inference of the noise estimation network leads to an accumulation of quantization error, which poses higher demands on designing novel quantization schemes and calibration objectives for the noise estimation network.

To address these challenges, we propose Q-Diffusion, a PTQ solution that compresses the cumbersome noise estimation network in diffusion models in a data-free manner while maintaining performance comparable to the full-precision counterparts. We propose a time step-aware calibration data sampling mechanism that draws from the pretrained diffusion model and represents the activation distribution of all time steps. We further tailor the design of the calibration objective and the weight and activation quantizers to the commonly used noise estimation model architecture to reduce quantization error. We perform thorough ablation studies to verify our design choices, and demonstrate good generation results with diffusion models quantized to only 4 bits.

In summary, our contributions are:

1. We propose Q-Diffusion, a data-free PTQ solution for the noise estimation network in diffusion models.

2. We identify the novel challenges of performing PTQ on diffusion models, namely the activation distribution diversity and the quantization error accumulation across time steps, via a thorough analysis.

3. We propose time step-aware calibration data sampling to improve calibration quality, and propose a specialized quantizer for the noise estimation network.

4. Extensive results show Q-Diffusion enables W4A8 PTQ for both pixel-space and latent-space unconditional diffusion models with an FID increment of only 0.39-2.34 over full-precision models. It can also produce qualitatively comparable images when plugged into Stable Diffusion [33] for text-guided synthesis.

2. Related work

Diffusion Models. Diffusion models generate images through a Markov chain, as illustrated in Figure 2. A forward diffusion process adds Gaussian noise to data x_0 ~ q(x) for T times, resulting in noisy samples x_1, ..., x_T:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)   (1)

where β_t ∈ (0, 1) is the variance schedule that controls the strength of the Gaussian noise in each step. When T → ∞, x_T approaches an isotropic Gaussian distribution.

The reverse process removes noise starting from the Gaussian noise input x_T ~ N(0, I) to gradually generate high-fidelity images. However, since the real reverse conditional distribution q(x_{t−1} | x_t) is unavailable, diffusion models sample from a learned conditional distribution:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ̃_{θ,t}(x_t), β̃_t I).   (2)
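As an aside, the forward process in Eq. (1) is straightforward to simulate. The following is a minimal, self-contained sketch of one forward diffusion step; the linear β schedule and all names are illustrative assumptions, not the paper's released code.

```python
import torch

# Illustration of Eq. (1): one forward diffusion step
# q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # a common linear variance schedule (assumption)

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) by adding Gaussian noise."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

x = torch.randn(4, 3, 32, 32)  # stand-in for data x_0
for t in range(T):
    x = forward_step(x, t)     # x_T approaches an isotropic Gaussian as T grows
```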
[Figure 2 schematic: (a) forward diffusion process x_0 → ... → x_T under q(x_t | x_{t−1}); (b) reverse diffusion process x_T → ... → x_0 under p_θ(x_{t−1} | x_t).] Figure 2: The forward diffusion process (a) repeatedly adds Gaussian noise. The reverse diffusion process (b) uses a trained network to denoise, starting from a standard Gaussian noise image, in order to generate an image.

With the reparameterization trick in [12], the mean µ̃_{θ,t}(x_t) and variance β̃_t can be derived as follows:

µ̃_{θ,t}(x_t) = (1 / √α_t) · (x_t − ((1 − α_t) / √(1 − ᾱ_t)) · ε_{θ,t})   (3)

β̃_t = ((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) · β_t   (4)

where α_t = 1 − β_t and ᾱ_t = ∏_{i=1}^{t} α_i. We refer readers to [25] for a more detailed introduction.
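A minimal sketch of one reverse (denoising) step built directly from Eqs. (3)-(4) is given below; `eps_model` stands for the noise estimation network ε_θ(x_t, t) and is an assumed callable, not the released Q-Diffusion code.

```python
import torch

# Sketch of one reverse step following Eqs. (3)-(4).
def reverse_step(eps_model, x_t, t, betas, alphas_cumprod):
    alpha_t = 1.0 - betas[t]
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t)                                        # predicted noise
    mean = (x_t - (1.0 - alpha_t) / torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(alpha_t)  # Eq. (3)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]            # Eq. (4)

    if t == 0:
        return mean
    return mean + torch.sqrt(var) * torch.randn_like(x_t)          # sample x_{t-1}
```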
In practice, the noise at each time step t is computed from x_t by a noise estimation model with the same weights for all time steps. The UNet [34] dominates the design of the noise estimation model in diffusion models [41, 33, 32, 35], with some recent exploration of Transformers [31]. This work designs the PTQ method for the acceleration of the noise estimation model, especially the common UNet.

Accelerated diffusion process. Related methods include simulating the diffusion process in fewer steps by generalizing it to a non-Markovian process [41], adjusting the variance schedule [29], and using high-order solvers to approximate diffusion generation [21, 1, 23, 24]. Others have employed the technique of caching and reusing feature maps [18]. Efforts to distill the diffusion model into fewer time steps have also been undertaken [38, 26], which have achieved notable success but involve an extremely expensive retraining process. Our work focuses on accelerating the noise estimation model inference in each step with a training-free PTQ process.

Post-training Quantization. Post-training quantization (PTQ) compresses deep neural networks by rounding elements w to a discrete set of values, where the quantization and de-quantization can be formulated as:

ŵ = s · clip(round(w / s), c_min, c_max),   (5)

where s denotes the quantization scale parameter, and c_min and c_max are the lower and upper bounds for the clipping function clip(·). These parameters can be calibrated with the weight and activation distributions estimated in the PTQ process. The operator round(·) represents rounding, which can be either rounding-to-nearest [45, 4] or adaptive rounding [19].
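For concreteness, a minimal sketch of the uniform quantize/de-quantize operator in Eq. (5) is shown below. The naive min-max rule for choosing s is an illustrative assumption; in practice these parameters are calibrated as described above.

```python
import torch

# Minimal sketch of Eq. (5): round-to-nearest linear quantization of a tensor.
def quantize_dequantize(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    c_min, c_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1   # signed integer grid
    s = w.abs().max() / c_max                                    # naive min-max scale (assumption)
    w_int = torch.clamp(torch.round(w / s), c_min, c_max)        # round and clip
    return s * w_int                                             # de-quantized values

w = torch.randn(64, 64)
w_hat = quantize_dequantize(w, n_bits=4)
print((w - w_hat).abs().mean())  # quantization error introduced by 4-bit rounding
```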
Previous PTQ research in classification and detection tasks focused on the calibration objective and the acquisition of calibration data. For example, EasyQuant [45] determines appropriate c_min and c_max based on training data, and BRECQ [19] introduces Fisher information into the objective. ZeroQ [4] employs a distillation technique to generate proxy input images for PTQ, and SQuant [9] uses random samples with objectives based on sensitivity determined through the Hessian spectrum. For diffusion model quantization, a training dataset is not needed, as the calibration data can be constructed by sampling the full-precision model with random inputs. However, the multi-time-step inference of the noise estimation model brings new challenges in modeling the activation distribution. In parallel to our work, PTQ4DM [40] introduces the method of Normally Distributed Time-step Calibration, generating calibration data across all time steps with a specific distribution. Nevertheless, their explorations remain confined to lower resolutions, 8-bit precision, floating-point attention activation-to-activation matmuls, and a limited ablation study of other calibration schemes. This results in worse applicability of their method to lower precisions (see Appendix E). Our work delves into the implications of calibration dataset creation in a holistic manner, establishing an efficient calibration objective for diffusion models. We fully quantize act-to-act matmuls, validated by experiments involving both pixel-space and latent-space diffusion models on large-scale datasets up to resolutions of 512 × 512.

3. Method

We present our method for post-training quantization of diffusion models in this section. Different from conventionally studied deep learning models and tasks, such as CNNs and ViTs for classification and detection, diffusion models are trained and evaluated in a distinctive multi-step manner with a unique UNet architecture. This presents notable challenges to the PTQ process. We analyze the challenges brought by the multi-step inference process and the UNet architecture in Sections 3.1 and 3.2 respectively, and describe the full Q-Diffusion PTQ pipeline in Section 3.3.

3.1. Challenges under the Multi-step Denoising

We identify two major challenges in quantizing models that employ a multi-step inference process. Namely, we investigate the accumulation of quantization error across time steps and the difficulty of sampling a small calibration dataset to reduce the quantization error at each time step.

Challenge 1: Quantization errors accumulate across time steps. Performing quantization on a neural network introduces noise on the weights and activations of the well-trained model, leading to quantization errors in each layer's output. Previous research has identified that quantization
errors are likely to accumulate across layers [6], making deeper neural networks harder to quantize. In the case of diffusion models, at any time step t, the input of the denoising model (denoted as x_t) is derived from x_{t+1}, the output of the model at the previous time step t + 1 (as depicted by Equation 2). This process effectively multiplies the number of layers involved in the computation by the number of denoising steps for the input x_t at time step t, leading to an accumulation of quantization errors towards later steps in the denoising process.

We run the denoising process of DDIM [41] on CIFAR-10 [16] with a sampling batch size of 64, and compare the MSE differences between the full-precision model and the model quantized to INT8, INT5, and INT4 at each time step. As shown in Figure 3, there is a dramatic increase in the quantization errors when the model is quantized to 4-bit, and the errors accumulate quickly through iterative denoising. This makes it difficult to preserve performance after quantizing the model down to low precision, and requires reducing the quantization errors at all time steps as much as possible.

[Figure 3 plot: MSE with x_t (FP32) vs. time step t for x_t (INT8), x_t (INT5), x_t (INT4), and x_t (INT4 Q-Diffusion).] Figure 3: MSE between FP32 outputs and weight-quantized outputs of different precisions with Linear Quantization and our approach across time steps. The data is obtained by passing a batch of 64 samples through a model trained on CIFAR-10 [16] with 100 DDIM sampling steps.
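A hedged sketch of the Figure 3 style measurement is given below: it tracks the MSE between the full-precision trajectory and the trajectory produced with a weight-quantized model. `fp_step` and `q_step` are assumed callables that perform one reverse step (e.g., wrapping the `reverse_step` sketch above with the FP32 and quantized networks); they are not part of the released implementation.

```python
import torch

# Track how quantization error accumulates over the iterative denoising process.
@torch.no_grad()
def mse_per_timestep(fp_step, q_step, x_T, num_steps):
    x_fp, x_q, errors = x_T.clone(), x_T.clone(), []
    for t in reversed(range(num_steps)):
        x_fp = fp_step(x_fp, t)                               # full-precision trajectory
        x_q = q_step(x_q, t)                                  # quantized trajectory
        errors.append(torch.mean((x_fp - x_q) ** 2).item())   # error grows toward later steps
    return errors
```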
Challenge 2: Activation distributions vary across time steps. To reduce the quantization errors at each time step, previous PTQ research [27, 19] calibrates the clipping range and scaling factors of the quantized model with a small set of calibration data. The calibration data should be sampled to resemble the true input distribution so that the activation distribution of the model can be estimated correctly for proper calibration. Given that the diffusion model uses the same noise estimation network to take inputs from all time steps, determining the data sampling policy across different time steps becomes an outstanding challenge. Here we start by analyzing the output activation distribution of the UNet model across different time steps. We conduct the same CIFAR-10 experiment using DDIM with 100 denoising steps, and draw the activation ranges of 1000 random samples among all time steps. As Figure 5 shows, the activation distributions change gradually, with neighboring time steps being similar and distant ones being distinctive. This is also echoed by the visualized x_t in Figure 2.

The fact that the output activation distribution varies across time steps brings further challenges to quantization. Calibrating the noise estimation model using only a few time steps that do not reflect the full range of activations seen among all time steps during the denoising process can cause overfitting to the activation distribution described by those specific time steps, while not generalizing to other time steps, which hurts the overall performance. For instance, we try to calibrate the quantized DDIM on the CIFAR-10 dataset with data sampled from different parts of the denoising process. As shown in Figure 4, if we simply take 5120 samples from time steps that fall into a certain stage of the denoising process, significant performance drops are induced under 4-bit weight quantization. Note that the case with samples taken from the middle 50 time steps causes smaller drops than the cases with samples taken from either the first or the last n time steps, and as n increases the drops are also alleviated. These results illustrate the gradual "denoising" process depicted in Figure 5: the activation distribution changes gradually throughout the time steps, with the middle part capturing the full range to some degree, while the distant endpoints differ the most. To recover the performance of the quantized diffusion models, we need to select calibration data in a way that comprehensively takes into account the output distributions of different time steps.

[Figure 4 bar chart: FID↓ (log scale) vs. calibration data drawn from part of the denoising process (First 25, First 50, Mid 50, Last 50, Last 25); dashed line: Q-Diffusion covering the full process.] Figure 4: Effects of time steps in calibration dataset creation on 4-bit weight quantization results with DDIM on CIFAR-10. First n, Mid n, Last n denote that 5120 samples are selected uniformly from the first, middle, and last n time steps.

[Figure 5 plot: activation range of x_t vs. time step, from step 100 down to step 5.] Figure 5: Activation ranges of x_t across all 100 time steps of the FP32 DDIM model on CIFAR-10.

3.2. Challenges on Noise Estimation Model Quantization

Most diffusion models (Imagen [35], Stable Diffusion [33], VDMs [13]) adopt UNets as denoising backbones that downsample and upsample latent features. Although recent studies show that transformer architectures are also
capable of serving as the noise estimation backbone [31], convolutional UNets are still the de facto choice of architecture today. UNets utilize shortcut layers to merge concatenated deep and shallow features and transmit them to subsequent layers. Through our analysis presented in Figure 6, we observe that input activations in shortcut layers exhibit abnormal value ranges in comparison to other layers. Notably, the input activations in DDIM's shortcut layers can be up to 200 times larger than those of other neighboring layers.

To analyze the reason for this, we visualize the weight and activation tensors of a DDIM shortcut layer. As demonstrated in the dashed box in Figure 6, the ranges of activations from the deep feature channels (X_1) and shallow feature channels (X_2) being concatenated together vary significantly, which also results in a bimodal weight distribution in the corresponding channels (see also Figure 7). Naively quantizing the entire weight and activation distribution with the same quantizer will inevitably lead to large quantization errors.

3.3. Post-Training Quantization of Diffusion Models

We propose two techniques, time step-aware calibration data sampling and shortcut-splitting quantization, to tackle the challenges identified in the previous sections respectively.

3.3.1 Time step-aware calibration

Since the output distributions of consecutive time steps are often very similar, we propose to randomly sample intermediate inputs uniformly at a fixed interval across all time steps to generate a small calibration set. This effectively balances the size of the calibration set and its ability to represent the distribution across all time steps. Empirically, we have found that the sampled calibration data can recover most of the INT4 quantized models' performance after calibration, making it an effective sampling scheme for collecting calibration data for quantization error correction.

To calibrate the quantized model, we divide the model into several reconstruction blocks [19], and iteratively reconstruct outputs and tune the clipping ranges and scaling factors of the weight quantizers in each block with adaptive rounding [27] to minimize the mean squared error between the quantized and full-precision outputs. We define a core component that contains residual connections in the diffusion model UNet as a block, such as a Residual Bottleneck Block or a Transformer Block. Other parts of the model that do not satisfy this condition are calibrated in a per-layer manner. This technique has been shown to improve performance compared to fully layer-by-layer calibration, since it addresses inter-layer dependencies and generalizes better [19]. For activation quantization, since activations are constantly changing during inference, adaptive rounding is infeasible. Therefore, we only adjust the step sizes of the activation quantizers following [8]. The overall calibration workflow is described in Alg. 1.

Algorithm 1 Q-Diffusion Calibration
Require: Pretrained full-precision diffusion model W_θ and the quantized diffusion model Ŵ_θ
Require: Empty calibration dataset D
Require: Number of denoising sampling steps T
Require: Calibration sampling interval c, amount of calibration data per sampling step n
for t = 1, ..., T do
    if t % c = 0 then
        Sample n intermediate inputs x_t^(1), ..., x_t^(n) randomly at step t from W_θ and add them to D
    end if
end for
for all i = 1, ..., N blocks do
    Update the weight quantizers of the i-th block in Ŵ_θ with D and W_θ
end for
if do activation quantization then
    for all i = 1, ..., N blocks do
        Update the activation quantizer step sizes of the i-th block with Ŵ_θ, W_θ, D
    end for
end if
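A minimal sketch of the data collection loop in Algorithm 1 is given below; `fp_model_step` denotes one reverse step with the full-precision model W_θ and is an assumed callable, and the batch handling is simplified relative to the released implementation.

```python
import torch

# Sketch of time step-aware calibration data collection (Algorithm 1).
@torch.no_grad()
def collect_calibration_set(fp_model_step, T, c, n, shape=(3, 32, 32)):
    calib = []                                   # calibration dataset D
    x = torch.randn(n, *shape)                   # start from Gaussian noise x_T
    for t in reversed(range(1, T + 1)):
        if t % c == 0:
            calib.append((x.clone(), t))         # keep n intermediate inputs x_t at step t
        x = fp_model_step(x, t)                  # continue denoising with W_theta
    return calib
```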
3.3.2 Shortcut-splitting quantization
To address the abnormal activation and weight distributions in shortcut layers, we propose a "split" quantization technique that performs quantization prior to concatenation, requiring negligible additional memory or computational resources. This strategy can be employed for both activation and weight quantization in shortcut layers, and is expressed mathematically as follows:

Q_X(X) = Q_{X1}(X_1) ⊕ Q_{X2}(X_2)   (6)

Q_W(W) = Q_{W1}(W_1) ⊕ Q_{W2}(W_2)   (7)

Q_X(X) Q_W(W) = Q_{X1}(X_1) Q_{W1}(W_1) + Q_{X2}(X_2) Q_{W2}(W_2)   (8)

where X ∈ R^{w×h×c_in} and W ∈ R^{c_in×c_out} are the input activation and layer weight, which can be naturally split into X_1 ∈ R^{w×h×c_1}, X_2 ∈ R^{w×h×c_2}, W_1 ∈ R^{c_1×c_out}, and W_2 ∈ R^{c_2×c_out}, respectively. c_1 and c_2 are determined by the concatenation operation. Q(·) denotes the quantization operator and ⊕ denotes the concatenation operator.
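A minimal sketch of the split quantization in Eqs. (6)-(8) for a 1×1 shortcut convolution is shown below; the per-tensor min-max quantizer and all function names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def fake_quant(x, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    s = x.abs().max() / qmax                       # per-tensor scale (assumption)
    return s * torch.clamp(torch.round(x / s), -qmax - 1, qmax)

def split_quant_shortcut(x1, x2, w, n_bits=4):
    # x1: deep-feature channels, x2: shallow-feature channels, shapes (N, C_i, H, W)
    # w: shortcut conv weight of shape (C_out, C1 + C2, 1, 1)
    c1 = x1.shape[1]
    qx = torch.cat([fake_quant(x1, n_bits), fake_quant(x2, n_bits)], dim=1)                 # Eq. (6)
    qw = torch.cat([fake_quant(w[:, :c1], n_bits), fake_quant(w[:, c1:], n_bits)], dim=1)   # Eq. (7)
    return F.conv2d(qx, qw)                        # realises Q(X) Q(W) in Eq. (8)
```

The point of the split is that each half of the concatenated tensor gets its own scale, so the large range mismatch between X_1 and X_2 no longer forces a single coarse quantization grid.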
[Figure 6 plot: activation ranges of DDIM's FP32 outputs across layers, averaged over all time steps, with three shortcuts showing abnormal input activation ranges; dashed-box insets show per-channel input values of the concatenated features.] Figure 6: Activation ranges of DDIM's FP32 outputs across layers, averaged over all time steps. We point out three shortcuts with the largest input activation ranges compared to other neighboring layers. Figures in the dashed box illustrate concatenation along channels. ⊕ denotes the concatenation operation.

[Figure 7 schematic: a UNet encoder/decoder with pooling, upsampling, and shortcut concatenation of the shallow (X_2) and deep (X_1) features.] Figure 7: (Left) The typical UNet architecture with shortcut layers that concatenate features from the deep and shallow layers. (Right) The ranges of activations from the deep (X_1) and shallow (X_2) feature channels vary significantly, which also results in a bimodal weight distribution in the corresponding channels.

4. Experiments

4.1. Experiments Setup

In this section, we evaluate the proposed Q-Diffusion framework on the pixel-space diffusion model DDPM [12] and the latent-space diffusion model Latent Diffusion [33] for unconditional image generation. We also visualize the images generated by Q-Diffusion on Stable Diffusion. To the best of our knowledge, there is currently no published work on diffusion model quantization. Therefore, we report the basic channel-wise Linear Quantization (i.e., Equation 5) as a baseline. We also re-implement the state-of-the-art data-free PTQ method SQuant [9] and include its results for comparison. Furthermore, we apply our approach to text-guided image synthesis with Stable Diffusion [33]. Experiments show that our approach can achieve generation quality competitive with the full-precision scenario on all tasks, even under INT4 quantization for weights.

4.2. Unconditional Generation

We conducted evaluations on the 32 × 32 CIFAR-10 [16], 256 × 256 LSUN Bedrooms, and 256 × 256 LSUN Church-Outdoor [47] datasets. We use the pretrained DDIM sampler [41] with 100 denoising time steps for the CIFAR-10 experiments and Latent Diffusion (LDM) [33] for the higher-resolution LSUN experiments. We evaluated the performance in terms of Fréchet Inception Distance (FID) [11] and
additionally evaluated the Inception Score (IS) [37] for the CIFAR-10 results, since IS is not an accurate reference for datasets that differ significantly from ImageNet's domain and categories. The results are reported in Tables 1-3 and Figure 8, where Bops is calculated for one denoising step without considering the decoder compute cost for latent diffusion.

[Figure 8 panels: Bedroom Q-Diffusion (W4A8), Bedroom Linear Quantization (W4A8), Church Q-Diffusion (W4A8), Church Linear Quantization (W4A8).] Figure 8: 256 × 256 unconditional image generation results using Q-Diffusion and Linear Quantization under W4A8 precision.

Table 1: Quantization results for unconditional image generation with DDIM on CIFAR-10 (32 × 32).

Method          Bits (W/A)   Size (Mb)   GBops   FID↓     IS↑
Full Precision  32/32        143.2       6597    4.22     9.12
Linear Quant    8/32         35.8        2294    4.71     8.93
SQuant          8/32         35.8        2294    4.61     8.99
Q-Diffusion     8/32         35.8        2294    4.27     9.15
Linear Quant    4/32         17.9        1147    141.47   4.20
SQuant          4/32         17.9        1147    160.40   2.91
Q-Diffusion     4/32         17.9        1147    5.09     8.78
Linear Quant    8/8          35.8        798     118.26   5.23
SQuant          8/8          35.8        798     464.69   1.17
Q-Diffusion     8/8          35.8        798     3.75     9.48
Linear Quant    4/8          17.9        399     188.11   2.45
SQuant          4/8          17.9        399     456.21   1.16
Q-Diffusion     4/8          17.9        399     4.93     9.12

Table 2: Quantization results for unconditional image generation with LDM-4 on LSUN-Bedrooms (256 × 256). The downsampling factor for the latent space is 4.

Method          Bits (W/A)   Size (Mb)   TBops   FID↓
Full Precision  32/32        1096.2      107.17  2.98
Linear Quant    8/32         274.1       37.28   3.02
SQuant          8/32         274.1       37.28   2.94
Q-Diffusion     8/32         274.1       37.28   2.97
Linear Quant    4/32         137.0       18.64   82.69
SQuant          4/32         137.0       18.64   149.97
Q-Diffusion     4/32         137.0       18.64   4.86
Linear Quant    8/8          274.1       12.97   6.69
SQuant          8/8          274.1       12.97   4.92
Q-Diffusion     8/8          274.1       12.97   4.40
Linear Quant    4/8          137.0       6.48    24.86
SQuant          4/8          137.0       6.48    95.92
Q-Diffusion     4/8          137.0       6.48    5.32

Table 3: Quantization results for unconditional image generation with LDM-8 on LSUN-Churches (256 × 256). The downsampling factor for the latent space is 8.

Method          Bits (W/A)   Size (Mb)   TBops   FID↓
Full Precision  32/32        1179.9      22.17   4.06
Linear Quant    8/32         295.0       10.73   3.84
SQuant          8/32         295.0       10.73   4.01
Q-Diffusion     8/32         295.0       10.73   4.03
Linear Quant    4/32         147.5       5.36    32.54
SQuant          4/32         147.5       5.36    33.77
Q-Diffusion     4/32         147.5       5.36    4.45
Linear Quant    8/8          295.0       2.68    14.62
SQuant          8/8          295.0       2.68    54.15
Q-Diffusion     8/8          295.0       2.68    3.65
Linear Quant    4/8          147.5       1.34    14.92
SQuant          4/8          147.5       1.34    24.50
Q-Diffusion     4/8          147.5       1.34    4.12

The experiments show that Q-Diffusion significantly preserves the image generation quality and outperforms Linear Quantization by a large margin for all resolutions and types of diffusion models tested when the number of bits is low. Although 8-bit weight quantization has almost no performance loss compared to FP32 for both Linear Quantization and our approach, the generation quality with Linear Quantization drops drastically under 4-bit weight quantization. In contrast, Q-Diffusion still preserves most of the perceptual quality, with at most a 2.34 increase in FID and imperceptible distortions in the produced samples.

4.3. Text-guided Image Generation

We evaluate Q-Diffusion on Stable Diffusion pretrained on subsets of 512 × 512 LAION-5B for text-guided image generation. Following [33], we sample text prompts from
the MS-COCO [20] dataset to generate a calibration dataset with text conditions using Algorithm 1. In this work, we fix the guidance strength to the default 7.5 in Stable Diffusion as the trade-off between sample quality and diversity. Qualitative results are shown in Figure 9. Compared to Linear Quantization, our Q-Diffusion provides higher-quality images with more realistic details and a better demonstration of the semantic information. A similar performance gain is also observed in the other random samples showcased in Appendix G, and quantitatively reported in Appendix C. The output of the W4A8 Q-Diffusion model largely resembles the output of the full-precision model. Interestingly, we find some diversity in the lower-level semantics between the Q-Diffusion model and the FP models, like the heading of the horse or the shape of the hat. We leave it to future work to understand how quantization contributes to this diversity.

[Figure 9 panels: Full Precision, Q-Diffusion (W4A8), Linear Quantization (W4A8).] Figure 9: Stable Diffusion 512 × 512 text-guided image synthesis results using Q-Diffusion and Linear Quantization under W4A8 precision with the prompt "A puppy wearing a hat."

4.4. Ablation Study

Effects of Sampling Strategies. To analyze the effect of different sampling strategies for calibration in detail, we implemented multiple variants of our method using different sampling strategies. We then evaluated the quality of the models quantized by each variant. We experimented with varying numbers of time steps used for sampling and samples used for calibration. In addition to calibration sets from uniform timestep intervals, we also employed sampling at the first 50 and last 50 steps. As shown in Figure 10, uniform sampling that spans all time steps results in superior performance compared to sampling from only part of the time steps. Furthermore, adjusting the sampling hyperparameters, including using more calibration samples, does not significantly improve the performance. Therefore, we simply choose to sample uniformly every 20 steps for a total of 5,120 samples for calibration, resulting in a high-quality quantized model with low computational costs during quantization.

[Figure 10: generated-sample grids for calibration sets drawn from the first 50 steps, the last 50 steps, and uniformly every 20 / 100 / 10 / 50 steps.] Figure 10: Uniform sampling strategies which cover all time steps are better than strategies that cover only a part of the time steps, as in Fig. 4. Furthermore, adjusting the sampling techniques within uniform sampling, such as tuning the sampling interval and the number of samples, has a marginal effect on the performance of the quantized model.

Effects of Split. Previous linear quantization approaches suffer from severe performance degradation, as shown in Figure 11, where 4-bit weight quantization yields a high FID of 141.47 for DDIM CIFAR-10 generation. Employing additional 8-bit activation quantization further degrades the performance (FID: 188.11). By splitting shortcuts in quantization, we significantly improve the generation performance, achieving an FID of 4.93 under W4A8 quantization.

5. Conclusion

This work studies the use of quantization to accelerate diffusion models. We propose Q-Diffusion, a novel post-training quantization scheme that conducts calibration with multiple time steps in the denoising process and achieves significant improvements in the performance of the quantized model. Q-Diffusion models under 4-bit quantization achieve comparable results to the full-precision models.

Acknowledgement

We thank Berkeley Deep Drive, Intel Corporation, Panasonic, and NVIDIA for supporting this research. We also thank Sehoon Kim, Muyang Li, and Minkai Xu for their valuable feedback.
[Figure 11 bar chart: FID↓ for Linear Quantization, + Calibration, and + Split shortcuts under W4A32 (Linear Quant) and W4A8 (Q-Diffusion), with a dashed full-precision reference line.] Figure 11: Splitting the shortcut convolution is crucial for both weight and activation quantization. Comparisons on CIFAR-10 show that Q-Diffusion can achieve image generation quality comparable to the model with full precision (dashed line) with shortcut splitting.

[Figure 12 panels: W6A32 Linear, W6A32 Split, W4A32 Q-Diffusion, W32A32.] Figure 12: Examples of text-to-image generation with a quantized Stable Diffusion model. Naive linear quantization degrades the appearance of teeth, which gets fixed by shortcut splitting. Q-Diffusion further improves the semantic consistency of eyes through calibration.

References

[1] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. ArXiv, abs/2201.06503, 2022.
[2] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+: Improving low-bit quantization through learnable offsets and better initialization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2978–2985, 2020.
[3] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge J. Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. In European Conference on Computer Vision, 2020.
[4] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13166–13175, 2020.
[5] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
[6] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[8] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. In International Conference on Learning Representations, 2020.
[9] Cong Guo, Yuxian Qiu, Jingwen Leng, Xiaotian Gao, Chen Zhang, Yunxin Liu, Fan Yang, Yuhao Zhu, and Minyi Guo. Squant: On-the-fly data-free quantization via diagonal hessian approximation. ArXiv, abs/2202.07471, 2022.
[10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. EMNLP, 2021.
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[13] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[14] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.
[15] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[16] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
[17] Haoying Li, Yifan Yang, Meng Chang, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2021.
[18] Muyang Li, Ji Lin, Chenlin Meng, Stefano Ermon, Song Han, and Jun-Yan Zhu. Efficient spatially sparse inference for conditional gans and diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[19] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations, 2021.
[20] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
[21] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[22] Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, and Shanghang Zhang. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. arXiv preprint arXiv:2211.16056, 2022.
[23] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. ArXiv, abs/2206.00927, 2022.
[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. ArXiv, abs/2211.01095, 2022.
[25] Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.
[26] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2022.
[27] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. ArXiv, abs/2004.10568, 2020.
[28] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W. Battaglia. Generating images with sparse representations. International Conference on Machine Learning, 2021.
[29] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[30] Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, 2020.
[31] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
[32] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[33] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[35] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[36] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, 2021.
[37] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016.
[38] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. ArXiv, abs/2202.00512, 2022.
[39] Hiroshi Sasaki, Chris G. Willcocks, and T. Breckon. Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models. ArXiv, abs/2104.05358, 2021.
[40] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. CVPR, 2023.
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[42] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[43] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[44] Xudong Wang, Long Lian, and Stella X Yu. Unsupervised selective labeling for more effective semi-supervised learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXX, pages 427–445. Springer, 2022.
[45] Di Wu, Qingming Tang, Yongle Zhao, Ming Zhang, Ying Fu, and Debing Zhang. Easyquant: Post-training quantization via scale optimization. ArXiv, abs/2006.16669, 2020.
[46] Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: a geometric diffusion model for molecular conformation generation. ArXiv, abs/2203.02923, 2022.
[47] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
A. Extended Experimental Settings
A.1. Implementation Details
We describe the implementation and compute details of the experiments in this section. We adapt the official implementation
for DDIM [41]¹ and Latent Diffusion [33]². For Stable Diffusion, we use the CompVis codebase³ and its v1.4 checkpoint.
We use the torch-fidelity library⁴ to evaluate FID and IS scores as done in [33]. We use 100 denoising time steps for DDIM
CIFAR-10. We select 200 and 500 denoising time steps for LSUN-Bedrooms and LSUN-Churches respectively, which are
the configurations that achieve the best results provided by [33]. For text-guided image generation with Stable Diffusion, we
choose the default PNDM sampler with 50 time steps.
For quantization experiments, we quantize all weights and activations involved in matrix multiplications, but leave activation
functions (e.g. SoftMax, SiLU) and normalization layers (e.g. GroupNorm) running with full precision. Additionally, for
Linear Quantization and SQuant experiments, we dynamically update the activation quantizers throughout the image generation
process to establish the strongest possible baselines, which explains why their results are sometimes better than in the weight-only
quantization cases. For text-guided image generation with Stable Diffusion, we find that the attention matrices in cross-attention
are difficult to quantize after the SoftMax and may have a considerable influence on the generation quality, so we utilize
INT16 mixed precision for attention scores in the W8A8 and W4A8 cases, while the q, k, v matrices are still quantized down to
8-bit. No special modifications or mixed precision are done for other experiments.

A.2. Text-guided Image Generation Calibration Dataset Generation Details


For text-guided image generation with Stable Diffusion, we need to also include text conditioning in the calibration dataset.
We randomly sample text prompts from the MS-COCO dataset, and for each prompt we add a pair of data with both a
conditional feature c_t and an unconditional feature uc_t derived from the prompt. This updated calibration dataset creation
process is described by Algorithm 2. Note that, for brevity, Algorithm 1 of the main paper does not show that the corresponding
time embedding t_t for each time step t is also added with each sample.

Algorithm 2 Q-Diffusion Calibration for Text-guided Image Generation
Require: Pretrained full-precision diffusion model W_θ and the quantized diffusion model Ŵ_θ
Require: Empty calibration dataset D
Require: Number of denoising sampling steps T
Require: Calibration sampling interval c, amount of calibration data per sampling step n
for t = 1, ..., T do
    if t % c = 0 then
        Sample 2n intermediate inputs (x_t^(1), c_t^(1), t_t^(1)), (x_t^(1), uc_t^(1), t_t^(1)), ..., (x_t^(n), c_t^(n), t_t^(n)), (x_t^(n), uc_t^(n), t_t^(n)) randomly at step t from W_θ and add them to D
    end if
end for
for all i = 1, ..., N blocks do
    Update the weight quantizers of the i-th block in Ŵ_θ with D and W_θ
end for
if do activation quantization then
    for all i = 1, ..., N blocks do
        Update the activation quantizer step sizes of the i-th block with Ŵ_θ, W_θ, D
    end for
end if
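A hedged sketch of the collection loop in Algorithm 2 is given below; `encode_prompt` and `guided_step` are assumed helpers (a text encoder wrapper and one classifier-free-guided reverse step, respectively), not functions from the released codebase.

```python
import torch

# Sketch of text-conditioned calibration collection: for each prompt, both the
# conditional feature c_t and the unconditional feature uc_t are stored with the
# intermediate latent x_t and its time step.
@torch.no_grad()
def collect_text_calibration(prompts, encode_prompt, guided_step, T, c, latent_shape):
    calib = []
    cond = [encode_prompt(p) for p in prompts]       # conditional text features
    uncond = encode_prompt("")                       # unconditional (empty prompt) feature
    x = torch.randn(len(prompts), *latent_shape)
    for t in reversed(range(1, T + 1)):
        if t % c == 0:
            for i, ct in enumerate(cond):
                calib.append((x[i].clone(), ct, t))        # conditional sample
                calib.append((x[i].clone(), uncond, t))    # unconditional counterpart
        x = guided_step(x, t, cond, uncond)          # one classifier-free-guided step
    return calib
```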

A.3. Hyperparameters
Here we provide the hyperparameters used for our Q-Diffusion calibration in Table 4.
1 https://github.com/ermongroup/ddim
2 https://github.com/CompVis/latent-diffusion
3 https://github.com/CompVis/stable-diffusion
4 https://github.com/toshas/torch-fidelity

Experiment T c n N
DDIM CIFAR-10 100 5 256 5120
LDM-4 LSUN-Bedroom 200 10 256 5120
LDM-8 LSUN-Church 500 25 256 5120
Stable Diffusion (weights only) 50 2 256 (128) 6400
Stable Diffusion (weights & activations) 50 1 256 (128) 12800

Table 4: Hyperparameters for all experiments, including the number of denoising time steps T , intervals for sampling
calibration data c, amount of calibration data per sampling step n, and the size of calibration dataset N . Note that for Stable
Diffusion with classifier-free guidance, every text prompt (128 in total for each sampling step) will add a pair of two samples
to the calibration dataset.

For all unconditional generation experiments, we keep the total calibration dataset size as 5120 and the amount of calibration
data per sampling step as 256. Q-Diffusion is able to obtain high-quality images with insignificant fidelity loss by uniformly
sampling from 20 time steps without any hyperparameters tuning. For text-guided image generation with Stable Diffusion, the
introduction of text conditioning makes activation quantization harder, thus we sample a larger calibration dataset using all
time steps.

B. Layer-wise Activations Distribution in DDIM and LDM


We analyze the ranges of activation values across all time steps in DDIM on CIFAR-10, LDM on LSUN-Bedroom and
LSUN-Church, and Stable Diffusion on the text-to-image task. Figure 13 shows that all Conv layers with residual connections
in DDIM exhibit noticeably wider activation ranges. Specifically, the first Conv layer can reach up to 1200 and others with
residual connections have ranges larger than 100, whereas the majority of the layers without residual connections have ranges
less than 50. Similar results can be observed for Stable Diffusion on the text-to-image generation task with COCO captions,
as well as for LSUN-Bedroom in latent diffusion. On the other hand, all layers in LDM on LSUN-Church share relatively
uniform activation distributions, with ranges < 15.
Furthermore, Figure 14 illustrates that the distribution of activation values of multiple layers in DDIM on CIFAR-10 varies
significantly across different time steps.
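A hedged sketch of how such per-layer activation ranges could be collected is given below: forward hooks record the running min/max input of every Conv/Linear layer while the full-precision model denoises. All names are illustrative; this is not the analysis code used for the figures.

```python
import torch

def register_range_hooks(model):
    """Attach hooks that track the input range of every Conv2d/Linear layer."""
    ranges = {}

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            lo, hi = x.min().item(), x.max().item()
            old = ranges.get(name, (float("inf"), float("-inf")))
            ranges[name] = (min(old[0], lo), max(old[1], hi))
        return hook

    for name, m in model.named_modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
            m.register_forward_hook(make_hook(name))
    return ranges  # populated after running the denoising loop on the model
```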

C. Quantitative Evaluation on Text-guided Image Generation


To quantitatively evaluate the impact of quantization on generation performance, we follow the practice in [33], the Stable
Diffusion v1-5 model card⁵, and the Diffusers library⁶ to generate 10k images using prompts from the
MS-COCO [20] 2017-val dataset. Subsequently, we compute the FID [11] and CLIP score [10] against the 2017-val dataset.
The ViT-B/16 is used as the backbone when computing the CLIP score. Results are illustrated in Figure 15.
Our Q-Diffusion has minimal quality degradation in generated images measured by these two metrics under all settings,
while the direct Linear Quantization incurs significant quality drops, especially when the activations are also quantized. Note
that FID and CLIP score on COCO may not be good metrics that align well with human preferences; we do observe that slight
artifacts appear more often on images generated with models that have both weights and activations quantized by Q-Diffusion,
while these are not reflected by the FID results.
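For reference, a minimal sketch of the reference-free CLIPScore [10] with a ViT-B/16 backbone is shown below; the Hugging Face CLIP classes are used here as an assumed stand-in for the exact evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def clip_score(images, prompts):
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    # CLIPScore = 2.5 * max(cosine similarity, 0), averaged over image-prompt pairs
    return (2.5 * torch.clamp((img * txt).sum(-1), min=0)).mean()
```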

D. Study of Combining with Fast Samplers


Another line of work to speed up diffusion models is to find shorter and more effective sampling trajectories in order
to reduce the number of steps in the denoising process. These approaches tackle a factor orthogonal to the one Q-Diffusion
addresses, indicating that there is great potential to design a method that takes advantage of both directions. Here we investigate
if Q-Diffusion can be combined with DPM-Solver [23, 24], a fast high-order solver for diffusion ODEs that can greatly
bring down the number of steps required for generation. For unconditional generation, we use a 3rd-order DPM-Solver++ as
suggested by the authors, and sample for 50 time steps, which is the number of steps required to get a converged sample. For text-guided image generation with Stable Diffusion, we use a 2nd-order DPM-Solver++ with 20 time steps. We directly apply this DPM-Solver++ sampler to our INT4 quantized model. Results are shown in Table 6 and Figure 16.

Q-Diffusion only has a minor performance drop when only the weights are quantized. The generation quality degrades under W4A8 precision, but all Q-Diffusion results still outperform Linear Quant and SQuant with 100, 200, and 500 steps for CIFAR-10, LSUN-Bedrooms, and LSUN-Churches respectively. The reason is likely that the distribution of activations becomes inconsistent with how Q-Diffusion was calibrated when the sampling trajectory changes. We leave the design of a systematic pipeline that can effectively combine these two directions of diffusion model acceleration as future work.

5 https://huggingface.co/runwayml/stable-diffusion-v1-5
6 https://huggingface.co/docs/diffusers/main/en/conceptual/evaluation

[Figure 13 plots: activation range vs. layer number, averaged over all time steps, for (top to bottom) DDIM (CIFAR-10), Stable Diffusion (T2I), LDM (LSUN-Bedroom), and LDM (LSUN-Church).] Figure 13: Activation ranges of FP32 outputs across layers, averaged over all time steps. The figures, from top to bottom, are respectively DDIM, Stable Diffusion, LDM-Bedroom, and LDM-Church.

E. Comparing with PTQ4DM [40]


We evaluated Q-Diffusion on the settings employed in [40], which computed Inception Score (IS), Frechet Inception
Distance (FID), and sFID [28] over only 10k generated samples. Although [40] did not specify details in the paper, their official
implementation computed activation-to-activation matrix multiplications in the attention (q ∗ k and attn ∗ v) in FP16/32 (see footnote 7),
while we conducted them in full-integer. These matmuls account for a substantial portion of FLOPs (e.g. 9.8% of the model in
SD) and can induce considerable memory overheads [5], which subsequently increase the inference cost. In contrast, our
work theoretically reduces the memory and compute in this part by 2×/4×.
The evaluation results are demonstrated in Table 5, where the numbers inside the parentheses for PTQ4DM are its results reproduced with integer attention matmuls. Q-Diffusion consistently outperforms PTQ4DM [40]: even with attention matmuls computed in INT8, Q-Diffusion achieves better results than PTQ4DM with them computed in FP16/32. Note that directly applying [40] to quantize attention matmuls to 8-bit would further degrade generation quality, as shown by the numbers inside the parentheses.

7 https://github.com/42Shawn/PTQ4DM (05/31/2023)

[Figure 14 plots: per-time-step activation distributions of the layers down.3.block.0.conv1, up.0.block.0.shortcut, up.1.upsample.conv, and up.2.upsample.conv.] Figure 14: The distribution of activation values of multiple layers in DDIM on CIFAR-10 varies significantly across different time steps.

F. Limitations of this work


This work focuses on providing a PTQ solution for the noise estimation network of the diffusion models on the unconditional
image generation task. Meanwhile, we notice the recent advancement of text-guided image generation [33] and other multi-
modality conditional generation tasks. As we have demonstrated the possibility of directly applying Q-Diffusion to the noise
estimation network of Stable Diffusion, we believe it is important to provide a systematic analysis of the quantization’s impact
on the text encoder and the cross-attention mechanism for the classifier-free guidance conditioning, to enable a fully quantized
conditional generation framework. For unconditional generation, this work discovers the need to sample calibration data
across all time steps, and to apply specialized split quantizers to the concatenation layers in the noise estimation model.

[Figure 15 bar charts: (a) FID↓ and (b) CLIP Score↑ for Linear Quantization vs. Q-Diffusion under W8A32, W4A32, W8A8, and W4A8; (a) Q-Diffusion has negligible increases in FID, (b) Q-Diffusion maintains a similar CLIP score.]

Figure 15: Stable Diffusion (cfg scale = 7.5) 512 × 512 text-guided image synthesis FID and CLIP score results quantized using Q-Diffusion and Linear Quantization under different precisions. The dotted-line values are obtained under full precision.

Table 5: Q-Diffusion and PTQ4DM [40] results. The numbers inside the PTQ4DM parentheses refer to [40] results with INT8 attention act-to-act matmuls.

Task                Method                IS↑           FID↓            sFID↓
CIFAR-10            FP                    9.18          10.05           19.71
DDIM 100 steps      PTQ4DM (W8A8)         9.31 (9.02)   14.18 (19.59)   22.59 (20.89)
                    Q-Diffusion (W8A8)    9.47          7.82            17.96
                    Q-Diffusion (W4A8)    9.19          8.85            19.64
CIFAR-10            FP                    9.19          8.83            18.31
DDIM 250 steps      PTQ4DM (W8A8)         9.70 (9.30)   11.66 (16.54)   19.71 (20.08)
                    Q-Diffusion (W8A8)    9.60          8.00            18.13
                    Q-Diffusion (W4A8)    9.18          8.54            18.58

Table 6: Q-Diffusion results when directly applying 3rd-order DPM-Solver++ with 50 denoising time steps.

Task Bits (W/A) FID↓


DDIM CIFAR-10 32/32 3.57
DDIM CIFAR-10 4/32 5.38
DDIM CIFAR-10 4/8 10.27
LDM-4 LSUN-Bedrooms 32/32 4.27
LDM-4 LSUN-Bedrooms 4/32 4.88
LDM-4 LSUN-Bedrooms 4/8 10.77
LDM-8 LSUN-Churches 32/32 5.40
LDM-8 LSUN-Churches 4/32 5.74
LDM-8 LSUN-Churches 4/8 8.19

The combination of these techniques demonstrates good performance for quantized diffusion models. Meanwhile, there exist other
interesting design choices, like non-uniform sampling across different time steps, and additional quantizer design for attention
softmax output, etc., that can be explored. We leave further investigation of these points as future work.

F.1. Non-uniform Sampling Methods That Did Not Work


As a preliminary exploration of non-uniform calibration data sampling across time steps, we explore the following three
sampling methods. However, none of them achieves better performance than the uniform sampling proposed in this paper under the
same amount of calibration data (5,120), as shown in Table 7.

[Figure 16 rows: Q-Diffusion (W4A32); Q-Diffusion DPM-Solver++ (W4A32).]
Figure 16: Text-guided image generation results on 512 × 512 resolution from our INT4 weights-quantized Stable Diffusion
with default PNDM 50 time steps and DPM-Solver++ 20 time steps.

Std Since we observe the diverse activation range across time steps in Fig. 5, we would like to sample more data from the
time step with a larger variance in its distribution, so as to better represent the overall output distribution across all time steps.
To this end, we propose to sample calibration data from each time step in proportion to the pixel-wise standard deviation (Std)
of each time step. Specifically, we randomly sample 256 xt among all time steps and compute the Std of all pixel values in xt
at each time step, which we denote as s_t. Then, for calibration data, we sample (s_t / Σ_t' s_t') · N examples out of the total N calibration
data from time step t.
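A minimal sketch of this proportional allocation is shown below; `trajectory[t]` is assumed to hold a batch of intermediate inputs x_t collected beforehand, and the names are illustrative.

```python
import torch

# "Std" variant: allocate the N calibration samples across time steps in
# proportion to the per-time-step pixel-wise standard deviation s_t of x_t.
def std_proportional_allocation(trajectory, N):
    stds = {t: x_t.std().item() for t, x_t in trajectory.items()}   # s_t per time step
    total = sum(stds.values())
    return {t: round(N * s / total) for t, s in stds.items()}       # (s_t / sum_t' s_t') * N
```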

Norm Std Similar to Std, we also consider modeling the variance of each time step's distribution with the standard deviation
of ||x_t||_2, instead of the Std of all pixel values. Compared to the pixel-wise Std, we expect Norm Std to better capture the
diversity across different samples rather than the pixel-wise diversity within each sample.

Unsupervised Selective Labeling (USL) We also try to use Unsupervised Selective Labeling [44] to select both represen-
tative and diverse samples as the calibration samples. The intuition is that samples that are both representative and diverse
could provide a wide range of activations that we will encounter at inference time, focusing on which could bring us good
performance on generation most of the time. We select 5120 samples in total for CIFAR-10 by combining the samples for all
time steps. We adopt the training-free version of Unsupervised Selective Labeling for sample selection, with the pooled latent
space feature from the noise estimation UNet as the selection feature.

Table 7: Quantization results for unconditional image generation with DDIM on CIFAR-10 (32 × 32). We compare different
calibration data sampling schemes under W4A32 quantization.

Method Std Norm Std USL Uniform (ours)


FID↓ 5.66 5.58 5.54 5.09

G. Additional Random Samples


In this section, we provide more random samples from our weight-only quantized and fully quantized diffusion models
obtained using Q-Diffusion and Linear Quantization under 4-bit quantization. Results are shown in the figures below.

Q-Diffusion (W4A32) Q-Diffusion (W4A8) Linear Quant (W4A32) Linear Quant (W4A8)

Figure 17: Random samples from our INT4 quantized 256 × 256 LSUN-Bedroom models with a fixed random seed.

Q-Diffusion (W4A32) Q-Diffusion (W4A8) Linear Quant (W4A32) Linear Quant (W4A8)

Figure 18: Random samples from our INT4 quantized 256 × 256 LSUN-Church models with a fixed random seed.

Full Precision Q-Diffusion (W4A32) Q-Diffusion (W4A8) Linear Quant (W4A32)
Prompt: “A puppy wearing a hat.”

Full Precision Q-Diffusion (W4A32) Q-Diffusion (W4A8) Linear Quant (W4A32)


Prompt: “A photograph of an astronaut riding a horse.”

Figure 19: Text-guided image generation on 512 × 512 LAION-5B from our INT4 quantized Stable Diffusion model with a
fixed random seed.
