[Figure 1: (a) Calibration dataset creation; (b) Model inference.]
Figure 1: Conventional PTQ scenarios and Q-Diffusion differ in (a) calibration dataset creation and (b) model inference workflow. Traditional PTQ approaches sample data randomly [9], synthesize data from statistics stored in model layers [4], or draw from the training set [27, 19] to create the calibration dataset, which is either inconsistent with the real inputs seen at inference time or not data-free. In contrast, Q-Diffusion constructs calibration datasets, in a data-free manner, from inputs that accurately reflect the data seen in production. Traditional PTQ inference requires only a single pass through the quantized model θq, while Q-Diffusion needs to address the quantization errors that accumulate over the multi-time-step inference.
Our work, evolving concurrently with [40], offers a comprehensive analysis of the novel challenges of performing PTQ on diffusion models. Specifically, as visualized in Figure 1(a), we discover that the output distribution of the noise estimation network at each time step can be largely different, and naively applying previous PTQ calibration methods with an arbitrary time step leads to poor performance. Furthermore, as illustrated in Figure 1(b), the iterative inference of the noise estimation network leads to an accumulation of quantization error, which poses higher demands on designing novel quantization schemes and calibration objectives for the noise estimation network.

To address these challenges, we propose Q-Diffusion, a PTQ solution to compress the cumbersome noise estimation network in diffusion models in a data-free manner, while maintaining comparable performance to the full precision counterparts. We propose a time step-aware calibration data sampling mechanism from the pretrained diffusion model, which represents the activation distribution of all time steps. We further tailor the design of the calibration objective and the weight and activation quantizer to the commonly used noise estimation model architecture to reduce quantization error. We perform thorough ablation studies to verify our design choices, and demonstrate good generation results with diffusion models quantized to only 4 bits.

In summary, our contributions are:

1. We propose Q-Diffusion, a data-free PTQ solution for the noise estimation network in diffusion models.

2. We identify the novel challenge of performing PTQ on diffusion models as the activation distribution diversity and the quantization error accumulation across time steps via a thorough analysis.

3. We propose time step-aware calibration data sampling to improve calibration quality, and propose a specialized quantizer for the noise estimation network.

4. Extensive results show Q-Diffusion enables W4A8 PTQ for both pixel-space and latent-space unconditional diffusion models with an FID increment of only 0.39-2.34 over full precision models. It can also produce qualitatively comparable images when plugged into Stable Diffusion [33] for text-guided synthesis.

2. Related work

Diffusion Models. Diffusion models generate images through a Markov chain, as illustrated in Figure 2. A forward diffusion process adds Gaussian noise to data x0 ∼ q(x) for T times, resulting in noisy samples x1, ..., xT:

q(xt | xt−1) = N(xt; √(1 − βt) xt−1, βt I)    (1)

where βt ∈ (0, 1) is the variance schedule that controls the strength of the Gaussian noise in each step. When T → ∞, xT approaches an isotropic Gaussian distribution.

The reverse process removes noise from a sample from the Gaussian noise input xT ∼ N(0, I) to gradually generate high-fidelity images. However, since the real reverse conditional distribution q(xt−1 | xt) is unavailable, diffusion models sample from a learned conditional distribution:

pθ(xt−1 | xt) = N(xt−1; µ̃θ,t(xt), β̃t I).    (2)
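To make the forward process in Eq. (1) concrete, the following minimal PyTorch sketch (illustrative only, not from the released code) iteratively noises a batch of images with an assumed linear β schedule:

import torch

def forward_diffusion(x0, T=1000, beta_start=1e-4, beta_end=0.02):
    """Iteratively apply q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    betas = torch.linspace(beta_start, beta_end, T)  # linear variance schedule (illustrative assumption)
    xt = x0
    trajectory = [x0]
    for t in range(T):
        noise = torch.randn_like(xt)
        xt = torch.sqrt(1.0 - betas[t]) * xt + torch.sqrt(betas[t]) * noise
        trajectory.append(xt)
    return trajectory  # x_1, ..., x_T approach an isotropic Gaussian for large T

# Example: noise a dummy batch of 32x32 RGB images
x0 = torch.randn(4, 3, 32, 32)
xs = forward_diffusion(x0, T=100)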
Figure 2: The forward diffusion process (a) repeatedly adds Gaussian noise. The reverse diffusion process (b) uses a trained network to denoise from a standard Gaussian noise image in order to generate an image.

With the reparameterization trick in [12], the mean µ̃θ,t(xt) and variance β̃t could be derived as follows:

µ̃θ,t(xt) = (1/√αt) (xt − ((1 − αt)/√(1 − ᾱt)) ϵθ,t)    (3)

β̃t = ((1 − ᾱt−1)/(1 − ᾱt)) βt    (4)

where αt = 1 − βt and ᾱt = ∏_{i=1}^{t} αi. We refer readers to [25] for a more detailed introduction.
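For illustration, a minimal PyTorch sketch of one reverse step following Eqs. (3)–(4) is given below; the noise estimate ϵθ,t is assumed to be produced elsewhere by the noise estimation network, and the interface is a hypothetical example rather than the paper's implementation:

import torch

def ddpm_reverse_step(xt, eps, t, betas):
    """One reverse step x_t -> x_{t-1} using Eqs. (3)-(4).

    xt:    current noisy sample
    eps:   epsilon_theta(x_t, t) predicted by the noise estimation network
    t:     integer time step (1-indexed, t >= 1)
    betas: 1D tensor holding beta_1, ..., beta_T
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)            # \bar{alpha}_t
    alpha_t = alphas[t - 1]
    abar_t = alpha_bars[t - 1]
    abar_prev = alpha_bars[t - 2] if t > 1 else torch.tensor(1.0)

    # Eq. (3): posterior mean reconstructed from the noise estimate
    mean = (xt - (1.0 - alpha_t) / torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(alpha_t)
    # Eq. (4): posterior variance
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t - 1]

    noise = torch.randn_like(xt) if t > 1 else torch.zeros_like(xt)
    return mean + torch.sqrt(var) * noise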
In practice, the noise at each time step t is computed from xt by a noise estimation model with the same weights for all time steps. The UNet [34] dominates the design of the noise estimation model in diffusion models [41, 33, 32, 35], with some recent exploration of Transformers [31]. This work designs the PTQ method for the acceleration of the noise estimation model, especially for the common UNet.

Accelerated diffusion process. Related methods include simulating the diffusion process in fewer steps by generalizing it to a non-Markovian process [41], adjusting the variance schedule [29], and the use of high-order solvers to approximate diffusion generation [21, 1, 23, 24]. Others have employed the technique of caching and reusing feature maps [18]. Efforts to distill the diffusion model into fewer time steps have also been undertaken [38, 26], which have achieved notable success but involve an extremely expensive retraining process. Our work focuses on accelerating the noise estimation model inference in each step, with a training-free PTQ process.

Post-training Quantization. Post-training quantization (PTQ) compresses deep neural networks by rounding elements w to a discrete set of values, where the quantization and de-quantization can be formulated as:

ŵ = s · clip(round(w / s), cmin, cmax),    (5)

where s denotes the quantization scale parameter, and cmin and cmax are the lower and upper bounds for the clipping function clip(·). These parameters can be calibrated with the weight and activation distributions estimated in the PTQ process. The operator round(·) represents rounding, which can be either rounding-to-nearest [45, 4] or adaptive rounding [19]. Previous PTQ research in classification and detection tasks focused on the calibration objective and the acquisition of calibration data. For example, EasyQuant [45] determines appropriate cmin and cmax based on training data, and BRECQ [19] introduces Fisher information into the objective. ZeroQ [4] employs a distillation technique to generate proxy input images for PTQ, and SQuant [9] uses random samples with objectives based on sensitivity determined through the Hessian spectrum. For diffusion model quantization, a training dataset is not needed as the calibration data can be constructed by sampling the full-precision model with random inputs. However, the multi-time-step inference of the noise estimation model brings new challenges in modeling the activation distribution. In parallel to our work, PTQ4DM [40] introduces the method of Normally Distributed Time-step Calibration, generating calibration data across all time steps with a specific distribution. Nevertheless, their explorations remain confined to lower resolutions, 8-bit precision, floating-point attention activation-to-activation matmuls, and a limited ablation study on other calibration schemes. This results in worse applicability of their method to lower precisions (see Appendix E). Our work delves into the implications of calibration dataset creation in a holistic manner, establishing an efficient calibration objective for diffusion models. We fully quantize act-to-act matmuls, validated by experiments involving both pixel-space and latent-space diffusion models on large-scale datasets at resolutions up to 512 × 512.
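As a reference point for Eq. (5), the following sketch implements a per-tensor, round-to-nearest quantize–dequantize operator; the symmetric integer grid and the max-based scale are illustrative assumptions, with adaptive rounding [27] and per-channel scales as drop-in refinements:

import torch

def quant_dequant(w, n_bits=8):
    """Eq. (5): w_hat = s * clip(round(w / s), c_min, c_max).

    A symmetric, per-tensor quantizer with round-to-nearest is assumed here.
    """
    c_max = 2 ** (n_bits - 1) - 1
    c_min = -(2 ** (n_bits - 1))
    s = w.abs().max() / c_max                      # scale calibrated from the tensor range
    w_int = torch.clamp(torch.round(w / s), c_min, c_max)
    return s * w_int                               # de-quantized (fake-quantized) tensor

w = torch.randn(64, 64)
print((w - quant_dequant(w, n_bits=4)).pow(2).mean())  # quantization MSE grows as bits shrink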
3. Method

We present our method for post-training quantization of diffusion models in this section. Different from conventionally studied deep learning models and tasks, such as CNNs and ViTs for classification and detection, diffusion models are trained and evaluated in a distinctive multi-step manner with a unique UNet architecture. This presents notable challenges to the PTQ process. We analyze the challenges brought by the multi-step inference process and the UNet architecture in Sections 3.1 and 3.2 respectively, and describe the full Q-Diffusion PTQ pipeline in Section 3.3.

3.1. Challenges under the Multi-step Denoising

We identify two major challenges in quantizing models that employ a multi-step inference process. Namely, we investigate the accumulation of quantization error across time steps and the difficulty of sampling a small calibration dataset to reduce the quantization error at each time step.

Challenge 1: Quantization errors accumulate across time steps. Performing quantization on a neural network introduces noise on the weights and activations of the well-trained model, leading to quantization errors in each layer's output. Previous research has identified that quantization
errors are likely to accumulate across layers [6], making deeper neural networks harder to quantize. In the case of diffusion models, at any time step t, the input of the denoising model (denoted as xt) is derived from xt+1, the output of the model at the previous time step t + 1 (as depicted by Equation 2). This process effectively multiplies the number of layers involved in the computation by the number of denoising steps for the input xt at time step t, leading to an accumulation of quantization errors towards later steps in the denoising process.

We run the denoising process of DDIM [41] on CIFAR-10 [16] with a sampling batch size of 64, and compare the MSE between the full-precision model and the model quantized to INT8, INT5, and INT4 at each time step. As shown in Figure 3, there is a dramatic increase in the quantization errors when the model is quantized to 4-bit, and the errors accumulate quickly through iterative denoising. This brings difficulty in preserving the performance after quantizing the model down to low precision, which requires the reduction of quantization errors at all time steps as much as possible.
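A sketch of this measurement is shown below, assuming fp_sampler and q_sampler are callables that run the full-precision and quantized denoising trajectories from a shared xT and yield the intermediate samples; these names are placeholders, not the paper's interface:

import torch

@torch.no_grad()
def per_step_mse(fp_sampler, q_sampler, shape=(64, 3, 32, 32), seed=0):
    """Run the FP32 and quantized denoising trajectories from the same x_T
    and record the MSE of the intermediate samples at every time step."""
    torch.manual_seed(seed)
    xT = torch.randn(shape)
    errors = []
    for x_fp, x_q in zip(fp_sampler(xT), q_sampler(xT)):
        errors.append(torch.mean((x_fp - x_q) ** 2).item())
    return errors  # one MSE value per denoising step; errors accumulate toward x_0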
Challenge 2: Activation distributions vary across time steps. The variation of the output activation distributions across time steps further brings challenges to quantization. Calibrating the noise estimation model using only a few time steps that do not reflect the full range of activations seen among all time steps during the denoising process can cause overfitting to the activation distribution described by those specific time steps, while not generalizing to other time steps, which hurts the overall performance. For instance, here we try to calibrate the quantized DDIM on the CIFAR-10 dataset with data sampled from different parts of the denoising process. As shown in Figure 4, if we simply take 5120 samples from time steps that fall into a certain stage of the denoising process, significant performance drops are induced under 4-bit weight quantization. Note that the case with samples taken from the middle 50 time steps causes smaller drops than the cases with samples taken from either the first or the last n time steps, and as n increases, the drops are also alleviated. These results illustrate the gradual "denoising" process as depicted in Figure 5: the activation distribution changes gradually throughout the time steps, with the middle part capturing the full range to some degree, while the distant endpoints differ the most. To recover the performance of the quantized diffusion models, we need to select calibration data in a way that comprehensively takes into account the output distributions of the different time steps.

Figure 4: Effects of time steps in calibration dataset creation on 4-bit weight quantization results with DDIM on CIFAR-10. First n, Mid n, Last n denote that 5120 samples are selected uniformly from the first, middle, and last n time steps.

Figure 5: Activation ranges of xt across all 100 time steps of the FP32 DDIM model on CIFAR-10.

3.2. Challenges on Noise Estimation Model Quantization

Most diffusion models (Imagen [35], Stable Diffusion [33], VDMs [13]) adopt UNets as denoising backbones that downsample and upsample latent features. Although recent studies show that transformer architectures are also
capable of serving as the noise estimation backbone [31], convolutional UNets are still the de facto choice of architecture today. UNets utilize shortcut layers to merge concatenated deep and shallow features and transmit them to subsequent layers. Through our analysis presented in Figure 6, we observe that input activations in shortcut layers exhibit abnormal value ranges in comparison to other layers. Notably, the input activations in DDIM's shortcut layers can be up to 200 times larger than those of neighboring layers. To analyze the reason for this, we visualize the weight and activation tensors of a DDIM shortcut layer. As demonstrated in the dashed box in Figure 6, the ranges of activations from the deep feature channels (X1) and shallow feature channels (X2) being concatenated together vary significantly, which also results in a bimodal weight distribution in the corresponding channels (see also Figure 7). Naively quantizing the entire weight and activation distribution with the same quantizer will inevitably lead to large quantization errors.

Figure 6: Activation ranges of DDIM's FP32 outputs across layers, averaged over all time steps. We point out three shortcuts with the largest input activation ranges compared to other neighboring layers. Figures in the dashed box illustrate concatenation along channels. ⊕ denotes the concatenation operation.

Figure 7: (Left) The typical UNet architecture with shortcut layers that concatenate features from the deep and shallow layers. (Right) The ranges of activations from the deep (X1) and shallow (X2) feature channels vary significantly, which also results in a bimodal weight distribution in the corresponding channels.

3.3. Post-Training Quantization of Diffusion Models

We propose two techniques, time step-aware calibration data sampling and shortcut-splitting quantization, to tackle the two challenges identified in the previous sections respectively.

3.3.1 Time step-aware calibration

Since the output distributions of consecutive time steps are often very similar, we propose to randomly sample intermediate inputs uniformly at a fixed interval across all time steps to generate a small calibration set. This effectively balances the size of the calibration set and its ability to represent the distribution across all time steps. Empirically, we have found that the sampled calibration data can recover most of the INT4 quantized models' performance after the calibration, making it an effective sampling scheme for calibration data collection for quantization error correction.

To calibrate the quantized model, we divide the model into several reconstruction blocks [19], and iteratively reconstruct outputs and tune the clipping range and scaling factors of the weight quantizers in each block with adaptive rounding [27] to minimize the mean squared error between the quantized and full precision outputs. We define a core component that contains residual connections in the diffusion model UNet as a block, such as a Residual Bottleneck Block or a Transformer Block. Other parts of the model that do not satisfy this condition are calibrated in a per-layer manner. This technique has been shown to improve performance compared to fully layer-by-layer calibration since it addresses inter-layer dependencies and generalizes better [19]. For activation quantization, since activations are constantly changing during inference, adaptive rounding is infeasible. Therefore, we only adjust the step sizes of the activation quantizers according to [8]. The overall calibration workflow is described in Alg. 2.

Algorithm 1 Q-Diffusion Calibration
Require: Pretrained full-precision diffusion model Wθ and the quantized diffusion model Ŵθ
Require: Empty calibration dataset D
Require: Number of denoising sampling steps T
Require: Calibration sampling interval c, amount of calibration data per sampling step n
  for t = 1, ..., T do
    if t mod c = 0 then
      Sample n intermediate inputs xt^(1), ..., xt^(n) randomly at time step t from Wθ and add them to D
    end if
  end for
  for all i = 1, ..., N blocks do
    Update the weight quantizers of the i-th block in Ŵθ with D and Wθ
  end for
  if activation quantization is enabled then
    for all i = 1, ..., N blocks do
      Update the activation quantizer step sizes of the i-th block with Ŵθ, Wθ, and D
    end for
  end if
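A minimal sketch of the sampling loop in Algorithm 1 is given below; the sampler callable standing in for one full-precision denoising step is an assumption about the surrounding diffusion codebase, not the released interface:

import torch

@torch.no_grad()
def collect_calibration_set(fp_model, sampler, T=100, c=5, n=256, shape=(3, 32, 32)):
    """Time step-aware calibration data sampling (the loop of Algorithm 1).

    sampler(fp_model, x, t) is assumed to return x_{t-1} given x_t.
    Every c steps, n intermediate inputs x_t are stored together with t.
    """
    calib_xs, calib_ts = [], []
    x = torch.randn(n, *shape)                 # x_T ~ N(0, I) for n parallel chains
    for t in range(T, 0, -1):                  # denoise from t = T down to 1
        if t % c == 0:
            calib_xs.append(x.clone())
            calib_ts.append(torch.full((n,), t))
        x = sampler(fp_model, x, t)            # one full-precision denoising step
    return torch.cat(calib_xs), torch.cat(calib_ts)   # D with (T / c) * n samples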
3.3.2 Shortcut-splitting quantization

To address the abnormal activation and weight distributions in shortcut layers, we propose a "split" quantization technique
that performs quantization prior to concatenation, requiring negligible additional memory or computational resources. This strategy can be employed for both activation and weight quantization in shortcut layers, and is expressed mathematically as follows:

QX(X) = QX1(X1) ⊕ QX2(X2)    (6)

QW(W) = QW1(W1) ⊕ QW2(W2)    (7)

QX(X) QW(W) = QX1(X1) QW1(W1) + QX2(X2) QW2(W2)    (8)

where X ∈ R^(w×h×cin) and W ∈ R^(cin×cout) are the input activation and layer weight, which can be naturally split into X1 ∈ R^(w×h×c1), X2 ∈ R^(w×h×c2), W1 ∈ R^(c1×cout), and W2 ∈ R^(c2×cout), respectively. c1 and c2 are determined by the concatenation operation. Q(·) denotes the quantization operator and ⊕ denotes the concatenation operator.
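The following sketch illustrates Eqs. (6)–(8) for a concatenation shortcut followed by a convolution; the per-tensor quantizer _qdq, the bit-widths, and the default stride/padding are illustrative assumptions rather than the paper's exact quantizer design:

import torch
import torch.nn.functional as F

def _qdq(x, n_bits):
    # per-tensor symmetric round-to-nearest quantize-dequantize (Eq. 5)
    c = 2 ** (n_bits - 1) - 1
    s = x.abs().max() / c
    return s * torch.clamp(torch.round(x / s), -c - 1, c)

def split_quantized_shortcut(x1, x2, weight, n_bits_act=8, n_bits_w=4):
    """Eqs. (6)-(8): quantize the deep (X1) and shallow (X2) halves of a concat
    shortcut, and the matching weight slices, with separate quantizers before
    concatenation."""
    c1 = x1.shape[1]                                    # split point set by the concatenation
    w1, w2 = weight[:, :c1], weight[:, c1:]             # weight layout: (c_out, c_in, kH, kW)

    x1_q = _qdq(x1, n_bits_act)                         # Q_{X1}(X1)
    x2_q = _qdq(x2, n_bits_act)                         # Q_{X2}(X2)
    w1_q = _qdq(w1, n_bits_w)                           # Q_{W1}(W1)
    w2_q = _qdq(w2, n_bits_w)                           # Q_{W2}(W2)

    # Eq. (8): the product on the concatenated input equals the sum of the two partial products
    return F.conv2d(x1_q, w1_q) + F.conv2d(x2_q, w2_q)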
4. Experiments

4.1. Experiments Setup

In this section, we evaluate the proposed Q-Diffusion framework on the pixel-space diffusion model DDPM [12] and the latent-space diffusion model Latent Diffusion [33] for unconditional image generation. We also visualize the images generated by Q-Diffusion on Stable Diffusion. To the best of our knowledge, there is currently no published work on diffusion model quantization. Therefore, we report the basic channel-wise Linear Quantization (i.e., Equation 5) as a baseline. We also re-implement the state-of-the-art data-free PTQ method SQuant [9] and include its results for comparison. Furthermore, we apply our approach to text-guided image synthesis with Stable Diffusion [33]. Experiments show that our approach can achieve generation quality competitive with the full-precision counterpart on all tasks, even under INT4 weight quantization.

4.2. Unconditional Generation

We conducted evaluations using the 32 × 32 CIFAR-10 [16], 256 × 256 LSUN-Bedrooms, and 256 × 256 LSUN Church-Outdoor [47] datasets. We use the pretrained DDIM sampler [41] with 100 denoising time steps for the CIFAR-10 experiments and Latent Diffusion (LDM) [33] for the higher-resolution LSUN experiments. We evaluated the performance in terms of Frechet Inception Distance (FID) [11] and additionally evaluated the Inception Score (IS) [37] for CIFAR-10 results, since IS is not an accurate reference for datasets that differ significantly from ImageNet's domain and categories.
Figure 8: 256 × 256 unconditional image generation results on LSUN-Bedrooms and LSUN-Churches using Q-Diffusion and Linear Quantization under W4A8 precision.

Table 1: Quantization results for unconditional image generation with DDIM on CIFAR-10 (32 × 32).

Method | Bits (W/A) | Size (Mb) | GBops | FID↓ | IS↑
Full Precision | 32/32 | 143.2 | 6597 | 4.22 | 9.12
Linear Quant | 8/32 | 35.8 | 2294 | 4.71 | 8.93
SQuant | 8/32 | 35.8 | 2294 | 4.61 | 8.99
Q-Diffusion | 8/32 | 35.8 | 2294 | 4.27 | 9.15
Linear Quant | 4/32 | 17.9 | 1147 | 141.47 | 4.20
SQuant | 4/32 | 17.9 | 1147 | 160.40 | 2.91
Q-Diffusion | 4/32 | 17.9 | 1147 | 5.09 | 8.78
Linear Quant | 8/8 | 35.8 | 798 | 118.26 | 5.23
SQuant | 8/8 | 35.8 | 798 | 464.69 | 1.17
Q-Diffusion | 8/8 | 35.8 | 798 | 3.75 | 9.48
Linear Quant | 4/8 | 17.9 | 399 | 188.11 | 2.45
SQuant | 4/8 | 17.9 | 399 | 456.21 | 1.16
Q-Diffusion | 4/8 | 17.9 | 399 | 4.93 | 9.12

Table 2: Quantization results for unconditional image generation with LDM-4 on LSUN-Bedrooms (256 × 256). The downsampling factor for the latent space is 4.

Method | Bits (W/A) | Size (Mb) | TBops | FID↓
Full Precision | 32/32 | 1096.2 | 107.17 | 2.98
Linear Quant | 8/32 | 274.1 | 37.28 | 3.02
SQuant | 8/32 | 274.1 | 37.28 | 2.94
Q-Diffusion | 8/32 | 274.1 | 37.28 | 2.97
Linear Quant | 4/32 | 137.0 | 18.64 | 82.69
SQuant | 4/32 | 137.0 | 18.64 | 149.97
Q-Diffusion | 4/32 | 137.0 | 18.64 | 4.86
Linear Quant | 8/8 | 274.1 | 12.97 | 6.69
SQuant | 8/8 | 274.1 | 12.97 | 4.92
Q-Diffusion | 8/8 | 274.1 | 12.97 | 4.40
Linear Quant | 4/8 | 137.0 | 6.48 | 24.86
SQuant | 4/8 | 137.0 | 6.48 | 95.92
Q-Diffusion | 4/8 | 137.0 | 6.48 | 5.32

Table 3: Quantization results for unconditional image generation with LDM-8 on LSUN-Churches (256 × 256). The downsampling factor for the latent space is 8.

Method | Bits (W/A) | Size (Mb) | TBops | FID↓
Full Precision | 32/32 | 1179.9 | 22.17 | 4.06
Linear Quant | 8/32 | 295.0 | 10.73 | 3.84
SQuant | 8/32 | 295.0 | 10.73 | 4.01
Q-Diffusion | 8/32 | 295.0 | 10.73 | 4.03
Linear Quant | 4/32 | 147.5 | 5.36 | 32.54
SQuant | 4/32 | 147.5 | 5.36 | 33.77
Q-Diffusion | 4/32 | 147.5 | 5.36 | 4.45
Linear Quant | 8/8 | 295.0 | 2.68 | 14.62
SQuant | 8/8 | 295.0 | 2.68 | 54.15
Q-Diffusion | 8/8 | 295.0 | 2.68 | 3.65
Linear Quant | 4/8 | 147.5 | 1.34 | 14.92
SQuant | 4/8 | 147.5 | 1.34 | 24.50
Q-Diffusion | 4/8 | 147.5 | 1.34 | 4.12

The results are reported in Tables 1–3 and Figure 8, where Bops is calculated for one denoising step without considering the decoder compute cost for latent diffusion. The experiments show that Q-Diffusion significantly preserves the image generation quality and outperforms Linear Quantization by a large margin for all resolutions and types of diffusion models tested when the number of bits is low. Although 8-bit weight quantization has almost no performance loss compared to FP32 for both Linear Quantization and our approach, the generation quality with Linear Quantization drops drastically under 4-bit weight quantization. In contrast, Q-Diffusion still preserves most of the perceptual quality, with at most a 2.34 increase in FID and imperceptible distortions in the produced samples.

4.3. Text-guided Image Generation

We evaluate Q-Diffusion on Stable Diffusion pretrained on subsets of 512 × 512 LAION-5B for text-guided image generation. Following [33], we sample text prompts from the MS-COCO [20] dataset to generate a calibration dataset with text conditions using Algorithm 1.
Figure 9: Stable Diffusion 512 × 512 text-guided image synthesis results using Q-Diffusion and Linear Quantization under W4A8 precision with the prompt "A puppy wearing a hat".

In this work, we fix the guidance strength to the default 7.5 in Stable Diffusion as the trade-off between sample quality and diversity. Qualitative results are shown in Figure 9. Compared to Linear Quantization, our Q-Diffusion produces higher-quality images with more realistic details and a better rendering of the semantic information. Similar performance gains are also observed in other random samples showcased in Appendix G, and reported quantitatively in Appendix C. The output of the W4A8 Q-Diffusion model largely resembles the output of the full precision model. Interestingly, we find some diversity in the lower-level semantics between the Q-Diffusion model and the FP model, like the heading of the horse or the shape of the hat. We leave it to future work to understand how quantization contributes to this diversity.

4.4. Ablation Study

Effects of Sampling Strategies. To analyze the effect of different sampling strategies for calibration in detail, we implemented multiple variants of our method using different sampling strategies and then evaluated the quality of the models quantized by each variant. We experimented with varying numbers of time steps used for sampling and samples used for calibration. In addition to calibration sets from uniform time-step intervals, we also employed sampling at the first 50 and last 50 steps. As shown in Figure 10, uniform sampling that spans all time steps results in superior performance compared to sampling from only partial time steps. Furthermore, adjusting the sampling hyperparameters, including using more calibration samples, does not significantly improve the performance. Therefore, we simply choose to sample uniformly every 20 steps for a total of 5,120 samples for calibration, resulting in a high-quality quantized model with low computational costs during quantization.

Figure 10: Uniform sampling strategies that cover all time steps are better than strategies that cover only a part of the time steps, as in Fig. 4. Furthermore, adjusting the sampling techniques within uniform sampling, such as tuning the sampling interval and the number of samples, has a marginal effect on the performance of the quantized model.

Effects of Split. Previous linear quantization approaches suffer from severe performance degradation, as shown in Figure 11, where 4-bit weight quantization achieves a high FID of 141.47 in DDIM CIFAR-10 generation. Employing additional 8-bit activation quantization further degrades the performance (FID: 188.11). By splitting shortcuts in quantization, we significantly improve the generation performance, achieving an FID of 4.93 under W4A8 quantization.

5. Conclusion

This work studies the use of quantization to accelerate diffusion models. We propose Q-Diffusion, a novel post-training quantization scheme that conducts calibration with multiple time steps in the denoising process and achieves significant improvements in the performance of the quantized model. Q-Diffusion models under 4-bit quantization achieve comparable results to the full precision models.

Acknowledgement

We thank Berkeley Deep Drive, Intel Corporation, Panasonic, and NVIDIA for supporting this research. We also thank Sehoon Kim, Muyang Li, and Minkai Xu for their valuable feedback.
[Figure 11: FID↓ ablation comparing Linear Quantization, + Calibration, and + Split shortcuts.]
References

[4] […] Conference on Computer Vision and Pattern Recognition (CVPR), pages 13166–13175, 2020.
[5] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 2022.
[19] […] Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations, 2021.
[20] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[21] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[22] Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, and Shanghang Zhang. NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers. arXiv preprint arXiv:2211.16056, 2022.
[23] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv, abs/2206.00927, 2022.
[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv, abs/2211.01095, 2022.
[25] Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.
[26] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2022.
[27] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. arXiv, abs/2004.10568, 2020.
[28] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W. Battaglia. Generating images with sparse representations. International Conference on Machine Learning, 2021.
[29] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[30] Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, 2020.
[31] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
[32] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[33] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.
[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[35] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[36] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, 2021.
[37] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
[38] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv, abs/2202.00512, 2022.
[39] Hiroshi Sasaki, Chris G. Willcocks, and T. Breckon. UNIT-DDPM: Unpaired image translation with denoising diffusion probabilistic models. arXiv, abs/2104.05358, 2021.
[40] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. CVPR, 2023.
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[42] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[43] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[44] Xudong Wang, Long Lian, and Stella X Yu. Unsupervised selective labeling for more effective semi-supervised learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXX, pages 427–445. Springer, 2022.
[45] Di Wu, Qingming Tang, Yongle Zhao, Ming Zhang, Ying Fu, and Debing Zhang. EasyQuant: Post-training quantization via scale optimization. arXiv, abs/2006.16669, 2020.
[46] Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. GeoDiff: A geometric diffusion model for molecular conformation generation. arXiv, abs/2203.02923, 2022.
[47] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
A. Extended Experimental Settings
A.1. Implementation Details
We describe the implementation and compute details of the experiments in this section. We adapt the official implementations of DDIM [41]^1 and Latent Diffusion [33]^2. For Stable Diffusion, we use the CompVis codebase^3 and its v1.4 checkpoint. We use the torch-fidelity library^4 to evaluate FID and IS scores, as done in [33]. We use 100 denoising time steps for DDIM on CIFAR-10. We select 200 and 500 denoising time steps for LSUN-Bedrooms and LSUN-Churches respectively, which are the configurations that achieve the best results provided by [33]. For text-guided image generation with Stable Diffusion, we choose the default PNDM sampler with 50 time steps.
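For reference, FID and IS can be computed with torch-fidelity roughly as follows; the sample directory is a placeholder, and the exact options used for the paper's numbers are not specified here:

import torch_fidelity

# Hypothetical paths: a folder of generated samples and a reference source;
# 'cifar10-train' is one of torch-fidelity's registered datasets.
metrics = torch_fidelity.calculate_metrics(
    input1='samples/ddim_cifar10_w4a8',   # generated images (placeholder path)
    input2='cifar10-train',               # reference statistics source
    isc=True,                             # Inception Score
    fid=True,                             # Frechet Inception Distance
)
print(metrics)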
For quantization experiments, we quantize all weights and activations involved in matrix multiplications, but leave activation functions (e.g., SoftMax, SiLU) and normalization layers (e.g., GroupNorm) running at full precision. Additionally, for the Linear Quantization and SQuant experiments, we dynamically update the activation quantizers throughout the image generation process to establish the strongest possible baselines, which explains why their results are sometimes better than in the weight-only quantization cases. For text-guided image generation with Stable Diffusion, we find that the attention matrices in cross-attentions are difficult to quantize after the SoftMax and can considerably influence the generation quality, so we utilize INT16 mixed precision for the attention scores under the W8A8 and W4A8 cases, while the q, k, v matrices are still quantized down to 8-bit. No special modifications or mixed precision are applied to the other experiments.
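A simplified sketch of this layer-selection policy is shown below; FakeQuantConv2d is a hypothetical stand-in for the actual weight/activation quantizers, and only Conv2d is wrapped for brevity (Linear layers would be handled analogously):

import torch
import torch.nn as nn

class FakeQuantConv2d(nn.Module):
    """Wrap a Conv2d with weight fake-quantization; a stand-in for the real quantizers."""
    def __init__(self, conv, n_bits=4):
        super().__init__()
        self.conv, self.n_bits = conv, n_bits

    def forward(self, x):
        w = self.conv.weight
        c = 2 ** (self.n_bits - 1) - 1
        s = w.abs().amax() / c
        w_q = s * torch.clamp(torch.round(w / s), -c - 1, c)
        return nn.functional.conv2d(x, w_q, self.conv.bias, self.conv.stride,
                                    self.conv.padding, self.conv.dilation, self.conv.groups)

def quantize_matmul_layers(model, n_bits=4):
    """Replace Conv2d modules in place; SiLU, SoftMax, and GroupNorm modules
    are left untouched, i.e., they keep running at full precision."""
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(model, name, FakeQuantConv2d(child, n_bits))
        elif isinstance(child, (nn.SiLU, nn.Softmax, nn.GroupNorm)):
            continue
        else:
            quantize_matmul_layers(child, n_bits)
    return model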
A.3. Hyperparameters
Here we provide the hyperparameters used for our Q-Diffusion calibration in Table 4.
1 https://github.com/ermongroup/ddim
2 https://github.com/CompVis/latent-diffusion
3 https://github.com/CompVis/stable-diffusion
4 https://github.com/toshas/torch-fidelity
Table 4: Hyperparameters for all experiments, including the number of denoising time steps T, the interval for sampling calibration data c, the amount of calibration data per sampling step n, and the size of the calibration dataset N. Note that for Stable Diffusion with classifier-free guidance, every text prompt (128 in total for each sampling step) will add a pair of two samples to the calibration dataset.

Experiment | T | c | n | N
DDIM CIFAR-10 | 100 | 5 | 256 | 5120
LDM-4 LSUN-Bedroom | 200 | 10 | 256 | 5120
LDM-8 LSUN-Church | 500 | 25 | 256 | 5120
Stable Diffusion (weights only) | 50 | 2 | 256 (128) | 6400
Stable Diffusion (weights & activations) | 50 | 1 | 256 (128) | 12800
For all unconditional generation experiments, we keep the total calibration dataset size at 5120 and the amount of calibration data per sampling step at 256. Q-Diffusion is able to obtain high-quality images with insignificant fidelity loss by uniformly sampling from 20 time steps without any hyperparameter tuning. For text-guided image generation with Stable Diffusion, the introduction of text conditioning makes activation quantization harder, so we sample a larger calibration dataset using all time steps.
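The calibration set size follows directly from these settings as N = n · (T / c); a quick check against Table 4 (assuming integer division) is shown below:

# Sanity check of Table 4: the calibration set size is N = n * (T // c).
configs = {
    "DDIM CIFAR-10":             dict(T=100, c=5,  n=256),
    "LDM-4 LSUN-Bedroom":        dict(T=200, c=10, n=256),
    "LDM-8 LSUN-Church":         dict(T=500, c=25, n=256),
    "Stable Diffusion (W only)": dict(T=50,  c=2,  n=256),
    "Stable Diffusion (W & A)":  dict(T=50,  c=1,  n=256),
}
for name, cfg in configs.items():
    print(name, cfg["n"] * (cfg["T"] // cfg["c"]))
# -> 5120, 5120, 5120, 6400, 12800, matching the N column of Table 4.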
Figure 13: Activation ranges of FP32 outputs across layers averaging among all time steps. The figures, from top to bottom,
are respectively DDIM, Stable Diffusion, LDM-Bedroom, and LDM-Church.
For text-guided image generation with Stable Diffusion, we use the 2nd-order DPM-Solver++ with 20 time steps. We directly apply this DPM-Solver++ sampler to our INT4 quantized model. Results are shown in Table 6 and Figure 16.
Q-Diffusion only has a minor performance drop when only weights are quantized. The generation quality degrades under
W4A8 precision, but all Q-Diffusion results still outperform Linear Quant and SQuant with 100, 200, and 500 steps for
CIFAR-10, LSUN-Bedrooms, and LSUN-Churches respectively. The reason is likely due to the distribution of activations
becoming inconsistent with how Q-Diffusion is calibrated when the sampling trajectories change. We leave the design for a
systematic pipeline that can effectively combine these two directions in diffusion model acceleration as future work.
[Figure 14 layers shown: down.3.block.0.conv1, up.0.block.0.shortcut, up.1.upsample.conv, up.2.upsample.conv]
Figure 14: The distribution of activation values of multiple layers in DDIM on CIFAR-10 varies significantly across different
time steps.
reproduced with integer attention matmuls. Q-Diffusion consistently outperforms PTQ4DM [40]: even with its attention matmuls computed in INT8, Q-Diffusion achieves better results than PTQ4DM with them computed in FP16/32. Note that directly applying [40] to quantize attention matmuls in 8-bit would degrade the generation quality even further, as shown by the numbers inside the parentheses.
(a) Q-Diffusion has negligible increases in FID (b) Q-Diffusion maintains similar CLIP score
Figure 15: Stable Diffusion (cfg scale = 7.5) 512 × 512 text-guided image synthesis FID and CLIP score results quantized using Q-Diffusion and Linear Quantization under different precisions. The dotted lines indicate the values obtained under full precision.
Table 5: Q-Diffusion and PTQ4DM [40] results. The numbers inside the PTQ4DM parentheses refer to [40] results with INT8
attention act-to-act matmuls.
Table 6: Q-Diffusion results when directly applying 3rd-order DPM-Solver++ with 50 denoising time steps.
The combination of these techniques demonstrates good performance for quantized diffusion models. Meanwhile, there exist other interesting design choices, such as non-uniform sampling across different time steps and additional quantizer designs for the attention softmax output, that can be explored. We leave further investigation of these points as future work.
Figure 16: Text-guided image generation results on 512 × 512 resolution from our INT4 weights-quantized Stable Diffusion
with default PNDM 50 time steps and DPM-Solver++ 20 time steps.
Std. Since we observe diverse activation ranges across time steps in Fig. 5, we would like to sample more data from the time steps with a larger variance in their distribution, so as to better represent the overall output distribution across all time steps. To this end, we propose to sample calibration data from each time step in proportion to the pixel-wise standard deviation (Std) of each time step. Specifically, we randomly sample 256 xt among all time steps and compute the Std of all pixel values in xt at each time step, which we denote as st. Then, for the calibration data, we sample (st / Σt′ st′) · N examples out of the total N calibration data from time step t.
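A sketch of this proportional allocation is given below; xs_per_step is a hypothetical container holding the 256 sampled xt per time step, not an interface from the paper:

import torch

def std_proportional_allocation(xs_per_step, total_n=5120):
    """Allocate the calibration budget across time steps in proportion to the
    pixel-wise Std s_t of the sampled x_t (a sketch of the 'Std' scheme).

    xs_per_step: {t: tensor of shape (256, C, H, W)} of intermediates sampled
    from the full-precision model.
    """
    stds = {t: x.float().std().item() for t, x in xs_per_step.items()}   # s_t
    total = sum(stds.values())
    # n_t = s_t / sum(s_t') * N, rounded to integers
    return {t: round(s / total * total_n) for t, s in stds.items()}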
Norm Std. Similar to Std, we also consider modeling the variance of each time step's distribution with the standard deviation of ||xt||2 instead of the Std of all pixel values. Compared to the pixel-wise Std, we expect Norm Std to better capture the diversity across different samples rather than the pixel-wise diversity within each sample.
Unsupervised Selective Labeling (USL). We also try using Unsupervised Selective Labeling [44] to select samples that are both representative and diverse as the calibration samples. The intuition is that samples that are both representative and diverse could provide a wide range of the activations that we will encounter at inference time, and focusing on them could yield good generation performance most of the time. We select 5120 samples in total for CIFAR-10 by combining the samples from all time steps. We adopt the training-free version of Unsupervised Selective Labeling for sample selection, with the pooled latent-space feature from the noise estimation UNet as the selection feature.
Table 7: Quantization results for unconditional image generation with DDIM on CIFAR-10 (32 × 32). We compare different
calibration data sampling schemes under W4A32 quantization.
Figure 17: Random samples from our INT4 quantized 256 × 256 LSUN-Bedroom models with a fixed random seed.
Figure 18: Random samples from our INT4 quantized 256 × 256 LSUN-Church models with a fixed random seed.
Prompt: “A puppy wearing a hat.”
Figure 19: Text-guided image generation on 512 × 512 LAION-5B from our INT4 quantized Stable Diffusion model with a
fixed random seed.