LLM-QBench: A Benchmark Towards the Best Practice for Post-Training Quantization of Large Language Models
Ruihao Gong∗1,2 Yang Yong∗2 Shiqiao Gu∗2 Yushi Huang∗1,2 Yunchen Zhang2
Xianglong Liu†1 Dacheng Tao3
1 Beihang University 2 SenseTime Research 3 Nanyang Technological University
{gongruihao, yongyang, gushiqiao, huangyushi}@sensetime.com [email protected]
[email protected]
Abstract
Recent advancements in large language models (LLMs) are propelling us
toward artificial general intelligence, thanks to their remarkable emergent
abilities and reasoning capabilities. However, the substantial computational
and memory requirements of LLMs limit their widespread adoption. Quan-
tization, a key compression technique, offers a viable solution to mitigate
these demands by compressing and accelerating LLMs, albeit with poten-
tial risks to model accuracy. Numerous studies have aimed to minimize
the accuracy loss associated with quantization. However, the quantization
configurations in these studies vary and may not be optimized for hard-
ware compatibility. In this paper, we focus on identifying the most effective
practices for quantizing LLMs, with the goal of balancing performance with
computational efficiency. For a fair analysis, we develop a quantization
toolkit, LLMC, and design four crucial principles covering inference
efficiency, quantized accuracy, calibration cost, and modularization. By
benchmarking various models and datasets with over 500 experiments, we
derive three takeaways concerning calibration data, quantization algorithms,
and quantization schemes. Finally, we construct a best-practice pipeline for
LLM PTQ. All the benchmark results and the toolkit can be
found at https://github.com/ModelTC/llmc.
1 Introduction
Recently, large language models (LLMs) such as GPT-4 (OpenAI et al., 2024) have demon-
strated unprecedented generative capabilities in the field of natural language processing
(NLP) and achieved widespread application across various industries. However, their
substantial computational and storage costs have impeded their further popularization
among users. For instance, BLOOM (Le Scao et al., 2022), an open-access multilingual LLM
with 176 billion parameters, requires a minimum of 350 GB merely to store the model
weights in full-precision (FP16) format. At a minimum, it requires 5×80GB A100 or 9×40GB
A800 NVIDIA GPUs to perform inference with this model. Therefore, reducing their serving
cost is paramount to further broadening the application of LLMs.
To address this challenge, model quantization (Nagel et al., 2021) can be an effective
solution. It maps weights and/or activations to a lower-bit data format to reduce the
memory footprint and accelerate model inference. Existing quantization approaches can
be categorized into two types: quantization-aware training (QAT) (Bhalgat et al., 2020;
Gong et al., 2019; Esser et al., 2020; Egiazarian et al., 2024; van Baalen et al., 2024) and
post-training quantization (PTQ) (Wei et al., 2023a; Jhunjhunwala et al., 2021; Li et al., 2021).
Although QAT achieves high performance, its need for finetuning or retraining with
substantial training data and cost renders it unattainable for the majority of users.
Correspondingly, PTQ compresses models without retraining, making
∗ Equal contribution.
† Corresponding authors.
it a preferred method for LLMs due to its minimal resource requirements. Therefore,
considering the quantization cost, we do not cover QAT methods (Du et al., 2024;
Liu et al., 2024; 2023) in this paper. On the other hand, quantization can also be classified
into non-uniform (Kim et al., 2024; Egiazarian et al., 2024) and uniform quantization. We
only benchmark the latter, since non-uniform quantization requires complex specialized
kernels that tend to slow down inference. Besides, we also note some approaches
(Chee et al., 2024; Tseng et al., 2024) that incur additional, non-negligible computational
overhead during inference. Despite their decent accuracy, we exclude them as well due to
their unfriendliness towards inference.
Current uniform PTQ methods are typically evaluated on distinct datasets, under different
quantization configurations, and often with simulated quantization. This state of affairs
leaves users unable to accurately assess which configurations should be selected for efficient
and accurate quantization of LLMs. To provide a comprehensive menu of quantization
options with which users can obtain hardware-friendly, high-performance quantized LLMs,
we build a fair benchmark that, under our design principles, considers two aspects: the
factors influencing LLM quantization and inference efficiency. The former covers three
dimensions, i.e., calibration data, algorithm, and target bits. We evaluate across various
kinds of tasks and distill our best practice, encapsulated in an end-to-end pipeline that
realizes both efficient and accurate LLM quantization. This best practice has been integrated
into our quantization toolkit, LLMC. Notably, LLMC, a user-friendly, plug-and-play
quantization tool, incorporates dozens of outstanding PTQ algorithms, provides the freedom
to select quantization strategies, and also supports deploying quantized LLMs on different
inference backends (TensorRT-LLM (Nvidia, 2023), PPL-LLM (OpenPPL, 2023), LightLLM
(ModelTC, 2023)) and hardware (NVIDIA GPUs, Qualcomm mobile chips, TPU). In a word,
our main contributions can be summarized as follows:
2 Benchmark Overview
In this section, we first present our benchmark's design principles (subsection 2.1), outlining
its primary objective. We then detail LLM quantization in subsection 2.2: after introducing
the preliminaries of quantization, we overview the factors explored in the benchmark, i.e.,
the factors influencing LLM quantization and inference efficiency. Finally, we exhibit the
plug-and-play quantization toolkit used within our benchmark.
Our benchmark focuses on four essential aspects for effective and practical LLM quantiza-
tion: inference performance, calibration cost, quantized accuracy, and modularization.
Inference Performance: In our LLM quantization benchmark, we prioritize the importance
of selecting a quantization approach that enhances inference performance. This means our
chosen setting should either increase throughput or decrease memory requirements, thereby
optimizing the efficiency of the model during the inference phase.
Calibration Cost: The process of post-training quantization for LLMs is also referred to as
calibration. The resources and time invested in calibrating an LLM are crucial factors that
should be kept as low as possible.
[Figure 1: Quantized inference of an LLM. The figure marks the modules to quantize (Q/K/V, LayerNorm, RotEmb, KV cache concat) with LN-to-FC and FC-to-FC transformations, and contrasts weight-only quantization (INT weight w̄ dequantized to ŵ, FP16 GEMM on the FP16 activation x) with weight-activation quantization (quantized activation x̄, INT8 GEMM with INT32 accumulation).]
• Calibration data: Calibration data is used to estimate the range of tensors and thus
determine the quantization parameters, which is crucial for maintaining model
performance after quantization. The impact of different corpora as calibration data
therefore warrants investigation.
• Algorithm: Naive low-bit quantization often causes an accuracy drop for LLMs;
therefore, efficient remedies that help maintain model performance matter a great
deal. Current effective and efficient algorithms can be summarized into three types
(a minimal code sketch of the first two is given after this list): 1) Transformation
(Xiao et al., 2023; Lin et al., 2023; Shao et al., 2023; Wei et al., 2023b): leveraging the
magnitude relationship between weights and activations before quantization is
widely used to balance quantization errors:

W X = (W s)(s^{-1} X),   (2)

where s denotes the balance factor. 2) Clipping (Lin et al., 2023; Shao et al., 2023; Wei
et al., 2022; Du et al., 2024): clipping outliers with minimal impact in the weights
before quantization helps range estimation and the representation of the remaining
values during calibration:

W = clip(W, α, β),  l ≤ α < β ≤ u,   (3)

where α and β denote the clipping lower and upper bounds, respectively. 3)
Reconstruction (Frantar et al., 2022; Lee et al., 2023; Dettmers et al., 2023): this kind of
approach employs the Hessian matrix to evaluate the quantization perturbation and
updates the remaining intact elements, which can be concisely represented as

W ← W − E H^{-1},   (4)

where E denotes the perturbation and H^{-1} is the inverse Hessian matrix. This
process is conducted incrementally during quantization.
• Target bits: The bit-width adopted for weights, activations, and the KV cache impacts
the final accuracy. Usually, the hardware-friendly bit-widths are 2-bit, 4-bit, and 8-bit.
In this benchmark, we also investigate 3-bit and 6-bit to compare the potential of
quantization algorithms, but for practical deployment, 2/4/8-bit is mainly used.
Quantized inference of LLMs. As shown in Figure 1, quantization mainly targets the
Linear layers with weights, i.e., the Q, K, V, and O layers in the self-attention modules and
the Up, Gate, and Down layers in the FFN modules. Figure 1(b) presents three types of
quantization: weight-activation quantization, weight-only quantization, and KV-cache
quantization. They bring different benefits for reducing the prefill and decode latency, as
the simulated-quantization sketch below illustrates for the first two.
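To show how these schemes differ at the level of a single Linear layer, the following sketch uses simulated ("fake") quantization; this is a simplifying assumption, whereas real deployments store true INT4/INT8 values and run integer GEMM kernels.

```python
import torch

def fake_quant(t, n_bits, dim=None):
    """Simulated symmetric uniform quantization: quantize, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1
    if dim is None:
        scale = t.abs().max() / qmax                        # per-tensor scale
    else:
        scale = t.abs().amax(dim=dim, keepdim=True) / qmax  # per-channel / per-token scale
    scale = scale.clamp(min=1e-8)
    return (t / scale).round().clamp(-qmax - 1, qmax) * scale

w = torch.randn(1024, 1024)      # Linear weight, (d_out, d_in)
x = torch.randn(8, 1024)         # activations, (tokens, d_in)

# Weight-only (e.g. w4a16): only the stored weights are low-bit; the GEMM stays in FP16.
y_w4a16 = x @ fake_quant(w, 4, dim=1).T

# Weight-activation (e.g. w8a8): both operands are 8-bit, enabling an INT8 GEMM.
y_w8a8 = fake_quant(x, 8, dim=1) @ fake_quant(w, 8, dim=1).T
```

KV-cache quantization applies the same kind of low-bit quantization to the cached keys and values rather than to the GEMM operands.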
3 LLM-QBench
Under the principles in subsection 2.1 and powered by our quantization toolkit LLMC, this
section explores the best practice for quantizing large language models from the aspects of
calibration data, quantization algorithm, and target bits.
We first illustrate our experiment settings; more details can be found in subsection A.1.
Models. To demonstrate the generalizability of our benchmark, we assess performance on
the LLaMA-2 (Touvron et al., 2023) family, spanning model sizes from 7B to 70B, for general
language tasks. To broaden the scope of our evaluation, we also benchmark ChatGLM (Zeng
et al., 2023) for long-context abilities, CodeLLaMA (Roziere et al., 2023) for coding tasks,
and WizardMath (Luo et al., 2023) for mathematical problems.
Datasets. We categorize the datasets into upstream and downstream datasets. For the
upstream datasets, we employ WikiText2 (Foundation) and C4 (Raffel et al., 2019) with the
perplexity metric for evaluation, since perplexity stably reflects an LLM's performance
(Dettmers & Zettlemoyer, 2023). For the downstream tasks, we select examination tasks
including MMLU (Hendrycks et al., 2021) and ARC-e (Clark et al., 2018), the knowledge task
BoolQ (Clark et al., 2019), the understanding task LAMBADA (Paperno et al., 2016),
reasoning tasks including PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), and
GSM8K (Cobbe et al., 2021), coding tasks HumanEval (Chen et al., 2021) and MBPP (Austin
et al., 2021), and the long-context evaluation LongBench (Bai et al., 2023).
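As a reference for how the upstream perplexity metric is typically computed, a hedged sketch follows; it assumes the Hugging Face transformers and datasets libraries and a placeholder model id, and is not the exact evaluation harness used for this benchmark.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"    # placeholder: any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seq_len, nlls = 2048, []
for i in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, i:i + seq_len].to(model.device)
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss    # mean token NLL; labels are shifted internally
    nlls.append(loss.float() * seq_len)
print("perplexity:", torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len)).item())
```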
Hardware. Benefiting from the versatility of our tool, we can efficiently and conveniently
quantize LLMs to support multiple inference backends and hardware platforms. In this
paper, we mainly measure the inference efficiency of low-bit kernels on NVIDIA server and
edge GPUs with NVIDIA's TensorRT-LLM (Nvidia, 2023) framework.
Takeaway 1.
• For LLMs aimed at general tasks, calibration should use diverse data covering the various tasks the model will face.
• For a special-purpose LLM, it is better to calibrate the model with data from the same domain.
We also adopt this strategy for weight-activation quantization. Clipping should be
fully utilized in the best-practice pipeline. Moreover, when initialized from the asymmetric
clipping results, the accuracy can be boosted by further learning, and this good initialization
contributes to fast convergence.
Reconstruction. GPTQ (Frantar et al., 2022) reconstruction involves a non-equivalent
transformation of weights along the channel dimension, hindering simultaneous optimization
of weights and clip values. Clipping weights before reconstruction yields suboptimal results
because the weights are subsequently changed; if reconstruction precedes the clip-value
search, the initial quantization parameters no longer match the updated weights. Moreover,
when paired with an equivalent transformation, reconstruction yields minimal benefits,
possibly because it alters gradients and disrupts the assumptions about the Hessian
information. Furthermore, it requires an extended calibration period. Therefore,
reconstruction may not be considered a best practice.
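For concreteness, a heavily simplified, illustrative version of such Hessian-based reconstruction (Eq. 4) is sketched below, in the spirit of GPTQ but without its blocking, Cholesky factorization, or exact dampening; it is not the reference implementation.

```python
import torch

def reconstruct_quantize(w, x, n_bits=4, damp=0.01):
    """w: (d_out, d_in) linear weight; x: (tokens, d_in) calibration activations."""
    d_in = w.shape[1]
    H = x.T @ x / x.shape[0]                          # proxy for the layer-wise Hessian
    H += damp * H.diag().mean() * torch.eye(d_in)     # dampening for numerical stability
    Hinv = torch.linalg.inv(H)

    qmax = 2 ** (n_bits - 1) - 1
    scale = (w.abs().amax(dim=1) / qmax).clamp(min=1e-8)   # per-output-channel scale
    w = w.clone()
    q = torch.zeros_like(w)
    for i in range(d_in):                             # quantize one input column at a time
        col = w[:, i]
        q[:, i] = (col / scale).round().clamp(-qmax - 1, qmax) * scale
        err = (col - q[:, i]) / Hinv[i, i]            # perturbation E of this column
        w[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)   # W <- W - E H^-1
    return q

w_q = reconstruct_quantize(torch.randn(256, 256), torch.randn(1024, 256))
```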
Transformation. The transformation technique uses a linear operation to reduce the outlier
problem in LLMs or to preserve the important weights. For both weight-only and
weight-activation quantization, such an equivalent transformation brings an accuracy
improvement, especially for the activations. From the table, we can infer that manually
setting the scaling factor is rigid and may not help in all scenarios. On the contrary, a
suitable search for the transformation scale s is effective; different search strategies exist
and both improve accuracy considerably. A learning process can be further adopted on top
of a pre-searched range. With the support of a fast pre-search, the subsequent learning
converges in fewer epochs.
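One possible form of such a search, sketched under our own assumptions (a simple grid over the balance exponent and an output-MSE objective; the actual TS-v1/TS-v2 search spaces in LLMC may differ), is shown below.

```python
import torch

def fake_quant(t, n_bits):
    qmax = 2 ** (n_bits - 1) - 1
    scale = (t.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    return (t / scale).round().clamp(-qmax - 1, qmax) * scale

def search_scale(w, x, w_bits=4, a_bits=16, grid=20):
    """Grid-search s = act_range^a / w_range^(1-a) that minimizes the layer output MSE."""
    act_range = x.abs().amax(dim=0)
    w_range = w.abs().amax(dim=0).clamp(min=1e-8)
    ref = x @ w.T                                      # full-precision reference output
    best_s, best_err = None, float("inf")
    for step in range(1, grid):
        a = step / grid
        s = (act_range ** a / w_range ** (1 - a)).clamp(min=1e-5)
        x_s = x / s if a_bits >= 16 else fake_quant(x / s, a_bits)
        y = x_s @ fake_quant(w * s, w_bits).T
        err = (y - ref).pow(2).mean().item()
        if err < best_err:
            best_s, best_err = s, err
    return best_s

s = search_scale(torch.randn(512, 512), torch.randn(64, 512) * (torch.rand(512) * 5))
```

The searched scale (or exponent) can then serve as the starting point of the learning-based variant mentioned above.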
Calibration cost for each strategy. In the analysis of calibration costs detailed in Table 4, we
observe that within the suite of transformation techniques, the search-based (v1) strategy
requires roughly 10 minutes, making it twice as fast as the (v2) strategy. While rule-based
transformations are quicker, they often fall short of achieving acceptable accuracy levels.
On the other hand, learning-based transformation methods incur a considerable increase in
time to attain satisfactory accuracy levels. However, initializing the learning process with
pre-searched values can halve the number of epochs required and yield higher accuracy.
Regarding clipping methods, employing direct min-max value clipping is time-efficient but
typically results in significant accuracy loss. The search-based clipping method, whether
using asymmetric or symmetric ranges, proves efficient, requiring only about 20 minutes.
Yet, when applying a learning-based approach to clipping, the calibration time can extend
to nearly 7 hours. Therefore, a combined approach of the search-based transformation v1
and search-based asymmetric clipping emerges as the most effective in balancing accuracy
and efficiency. Furthermore, initiating with pre-searched values and conducting additional
learning for a few epochs may offer further accuracy improvements.
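The recommended combination can be sketched as follows: clipping bounds are first obtained by an asymmetric percentile search and then refined with a few gradient steps through a straight-through estimator. The objective, learning rate, and step count are illustrative assumptions, not the settings used in our experiments.

```python
import torch

def quant_affine(w, lo, hi, n_bits=3):
    """Asymmetric uniform quantization inside learnable bounds [lo, hi]."""
    levels = 2 ** n_bits - 1
    scale = (hi - lo).clamp(min=1e-8) / levels
    u = (w.clamp(lo, hi) - lo) / scale
    u = u + (u.round() - u).detach()                  # straight-through estimator for round()
    return u * scale + lo

w = torch.randn(1024, 1024)
lo = torch.nn.Parameter(torch.quantile(w, 0.005, dim=1, keepdim=True))   # searched asym. init
hi = torch.nn.Parameter(torch.quantile(w, 0.995, dim=1, keepdim=True))
opt = torch.optim.Adam([lo, hi], lr=1e-3)
for _ in range(100):                                   # a few steps suffice with a good init
    loss = (quant_affine(w, lo, hi) - w).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```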
PPL ↓ | Accuracy (%) ↑
Method WikiText2 C4 Avg. MMLU* ARC-e* BoolQ* HellaSwag* PIQA* Avg.
Full Prec. 3.32 5.71 4.52 70.91 88.44 83.33 80.00 83.50 81.24
TR 7.56 10.79 9.18 51.44 38.19 59.00 69.20 76.50 58.87
TS-v1 6.69 9.41 8.05 40.21 45.73 73.33 67.60 77.50 60.87
TS-v2 7.25 10.42 8.83 49.63 48.74 62.67 70.00 78.50 61.91
CM 10.32 15.16 12.74 34.45 34.67 54.00 62.40 73.50 51.80
CS-sym 7.2e4 6.5e4 6.9e4 27.79 26.63 41.67 25.60 51.50 34.64
CS-asym 5.67 8.26 6.97 53.44 69.85 78.67 72.80 78.00 70.55
CL 6.13 8.62 7.38 49.59 47.24 75.67 72.80 79.50 64.96
RH 6.68 9.40 8.04 54.65 42.21 70.33 67.20 77.00 62.28
TS-v1+RH 6.69 9.45 8.07 50.00 42.71 73.67 65.60 73.50 61.10
TS-v1+CS-sym 7.1e4 6.5e4 6.8e4 27.79 26.63 41.67 25.60 51.00 34.54
TS-v1+CS-asym 5.24 7.73 6.49 59.52 77.89 82.33 74.80 82.00 75.31
TS-v1+CL w/ CS-asym init. 5.24 7.70 6.47 59.40 77.25 82.39 75.46 78.51 74.60
Table 4: Calibration cost on LLAMA-2-70B of different strategies in Table 2. Ones init. means
we use a vector of ones as the start point of s before learning.
Takeaway 2.
• Search-based clipping and transformation are optimal solutions for balancing the calibration cost and
accuracy. The searched values benefit initializing the learning-based solutions.
• Incorrect clipping easily leads to an accuracy crash. Asymmetric clipping is simple yet effective for
improving the accuracy.
• Transformation searching influences both the calibration efficiency and quantized accuracy. The v1
strategy in Table 2 enjoys a good tradeoff between them.
Accuracy (%) ↑ of ChatGLM3-6B-32k under different KV cache precisions:
KV Cache Prec. NarrativeQA QASPER MultiFieldQA-en MultiFieldQA-zh Avg.
Full Prec. 25.93 43.35 51.57 62.36 45.80
int8 25.74 43.57 51.81 62.48 45.90
int4 26.13 43.43 51.63 61.04 45.56
int2 1.89 4.68 3.13 1.08 2.70
Figure 2: Inference speed of 7B, 13B and 70B LLAMA-2 models on NVIDIA A100 GPU.
(Input sequence length: 32K, Output sequence length: 512)
As shown in Table 22, Table 23, and Table 24, for weight-only quantization, employing
Hessian disturbance as the bit-allocation strategy outperforms the others. High-bit
quantization benefits from lower mixture rates, while low-bit quantization requires more
full-precision weights in small LLMs for better performance. For weight-activation
quantization, dynamic bit allocation, despite slower inference and higher computational
overhead, gains more accuracy than the static strategy, even though the latter uses twice
the mixture rate. Details are presented in subsection A.6.
Inference Speed. To assess the practical benefits of different quantization approaches,
we conducted evaluations using NVIDIA's cloud (SXM 80GB A100) and edge (Drive Orin)
GPUs, alongside the official inference library, TensorRT-LLM. Part of our results, as depicted
in Figure 2, highlight the throughput improvements achieved through TensorRT-LLM-
supported quantization schemes for models with 32,000 input tokens and 512 output tokens.
The findings indicate that quantization with 8-bit weights and activations enhances the
prefill stage’s speed by 20%-30% and the decode stage by 40%-60%. In contrast, 4-bit
weight-only quantization reduces the prefill speed by 10% but increases the decode speed
by 40%-60%. It's important to note that these acceleration rates tend to diminish for larger
models. Besides, 8-bit KV cache quantization has minimal impact on prefill times and
slightly reduces decoding throughput for very large models, such as the 70B model.
Results for more models and hardware can be found in subsection A.5.
[Figure 3: The best-practice PTQ pipeline. Given the tasks × models, the calibration data, and the quantization configuration (fixed or mixed bit-width; quantization of weights, activations, and the KV cache), the pipeline searches the transformation scale s, searches asymmetric clipping bounds α and β, and optionally runs a learning stage initialized from the searched values.]
(Table 7 columns: CodeLlama-7b HumanEval Pass@1 (%) ↑; WizardMath-7b GSM8K-100 Acc. (%) ↑; rows vary #Bits and Method.)
Table 7: Main results of code and math analyses. † indicates calibration with the correspond-
ing data. For the CodeLlama-7b model, we use a sample of 10 instances from the MBPP and
HumanEval datasets, respectively. Similarly, for the WizardMath-7b model, we sample 10
instances from the MATH and GSM8K datasets, respectively.
Takeaway 3.
• For fixed precision, considering the principles of inference speed and quantized accuracy, 4-bit weight-only
quantization, w4a8/w8a8 weight-activation quantization, and 4-bit KV cache quantization with group
size 8 are promising settings. Larger models can tolerate lower bits for weights.
• Weight-only quantization benefits decoding speed but harms prefill speed. Weight-activation quantization
benefits both the prefill and decode speed, while KV cache quantization brings only a small speedup for
small models but helps reduce memory consumption for long contexts.
• For mixed precision (which requires specialized kernels), Hessian-based metrics excel at determining
the precision for weight quantization, while dynamic magnitude-based strategies, despite non-negligible
overhead, are better at enhancing accuracy for weight-activation quantization.
Based on the takeaways distilled from the above exploration, we summarize the best-practice
PTQ pipeline for LLMs. As depicted in Figure 3, we first collect the best calibration data
according to the task and model under the guidance of Takeaway 1. Then the bit-width
and quantization scheme are determined considering Takeaway 3. Finally, the calibration
process is conducted using the algorithm pipeline based on Takeaway 2. The results in
Table 6 and Table 7 on the general-purpose model LLaMA-2-70B, the code model
CodeLLaMA-7b, and the math model WizardMath-7b prove its effectiveness, especially for
maintaining high accuracy. More experimental results on other models and
datasets to validate our best practice for decent performance and efficient inference can be
found in subsection A.3.
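For a compact view, the pipeline can be summarized as the following Python-style configuration; every field name below is illustrative and does not reflect the actual LLMC configuration schema.

```python
best_practice = {
    "calibration_data": "task-matched corpus (Takeaway 1): mixed-task data for general LLMs, "
                        "in-domain data for special-purpose LLMs",
    "scheme": {                                   # Takeaway 3
        "weight_only": "4-bit weights (per-group), FP16 activations",
        "weight_activation": "w8a8 or w4a8",
        "kv_cache": "4-bit, group size 8",
    },
    "algorithm": [                                # Takeaway 2
        "search-based equivalent transformation (TS-v1-style)",
        "search-based asymmetric clipping (CS-asym)",
        "optional short learning phase initialized from the searched values",
    ],
}
```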
4 Conclusion
References
Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren,
Torsten Hoefler, and Dan Alistarh. Quik: Towards end-to-end 4-bit inference on generative
large language models, 2023.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with
large language models. arXiv preprint arXiv:2108.07732, 2021.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao
Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Long-
bench: A bilingual, multitask benchmark for long context understanding. arXiv preprint
arXiv:2308.14508, 2023.
Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+:
Improving low-bit quantization through learnable offsets and better initialization, 2020.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning
about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on
Artificial Intelligence, 2020.
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantiza-
tion of large language models with guarantees, 2024.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray,
Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin,
Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo-
hammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings,
Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji,
Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh
Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage,
Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code.
2021.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions,
2019.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv:1803.05457v1, 2018.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz
Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher
Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168, 2021.
Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling
laws, 2023.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix
multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar,
Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A
sparse-quantized representation for near-lossless llm weight compression. arXiv preprint
arXiv:2306.03078, 2023.
Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu.
Bitdistiller: Unleashing the potential of sub-4-bit llms via self-distillation, 2024.
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and
Dan Alistarh. Extreme compression of large language models via additive quantization,
2024.
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and
Dharmendra S. Modha. Learned step size quantization, 2020.
Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate
post-training quantization for generative pre-trained transformers. arXiv preprint
arXiv:2210.17323, 2022.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster,
Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor
Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint
arXiv:2101.00027, 2020.
Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei
Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit
neural networks. In The IEEE International Conference on Computer Vision (ICCV), October
2019.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of
the International Conference on Learning Representations (ICLR), 2021.
Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon,
and Dongsoo Lee. Rethinking channel dimensions to isolate outliers for low-bit weight
quantization of large language models, 2023.
Divyansh Jhunjhunwala, Advait Gadhikar, Gauri Joshi, and Yonina C. Eldar. Adaptive
quantization of model updates for communication-efficient federated learning, 2021.
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen,
Michael W. Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization,
2024.
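BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, et al. Bloom: A 176b-
parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,
2022.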
Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. Owq: Lessons
learned from activation outliers for weight quantization in large language models. arXiv
preprint arXiv:2306.02272, 2023.
Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang,
and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction,
2021.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq:
Activation-aware weight quantization for llm compression and acceleration. arXiv preprint
arXiv:2306.00978, 2023.
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. Qllm:
Accurate and efficient low-bitwidth quantization for large language models, 2024.
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad,
Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free
quantization aware training for large language models, 2023.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo
Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering
mathematical reasoning for large language models via reinforced evol-instruct. arXiv
preprint arXiv:2308.09583, 2023.
ModelTC. Lightllm. https://github.com/ModelTC/lightllm, 2023.
Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen,
and Tijmen Blankevoort. A white paper on neural network quantization, 2021.
Nvidia. Tensorrt-llm. https://github.com/NVIDIA/TensorRT-LLM, 2023.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren-
cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat,
Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao,
Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brak-
man, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie
Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke
Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen,
Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings,
Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien
Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty
Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman,
Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun
Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott
Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han,
Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade
Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost
Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun
Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali
Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick,
Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt
Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis,
Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike,
Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz
Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam
Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne,
Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil,
David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela
Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg
Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan,
Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex
Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy
Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila
Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny,
Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth
Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Fran-
cis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario
Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr,
John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah
Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina
Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski
Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thomp-
son, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry
Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss,
Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei,
CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff,
Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sher-
win Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan,
Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao
Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
OpenPPL. Ppl-llm. https://github.com/openppl-public/ppl.nn.llm, 2023.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella
Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The
LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pp. 1525–1534, Berlin, Germany, August 2016. Association for Computational
Linguistics. URL http://www.aclweb.org/anthology/P16-1144.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a
unified text-to-text transformer. arXiv e-prints, 2019.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation
models for code. arXiv preprint arXiv:2308.12950, 2023.
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng
Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated
quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W
Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of
bert. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8815–8821, 2020.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas
Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude
Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman
Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas,
Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning
Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew
Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva,
Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor,
Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang,
Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat
models, 2023.
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#:
Even better llm quantization with hadamard incoherence and lattice codebooks, 2024.
Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric
Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimension-
ality for llm quantization, 2024.
Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang,
Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit
transformer language models, 2022.
A Appendix
channels. Notably, we ignore the batch-size dimension for the activation X ∈ R^{n×d}, where n denotes
the number of tokens and d the hidden size.
• For per-token quantization, at higher bits (w8a8) the benefits of asymmetric quantization
are marginal. However, at lower bits (w4a4) the gains from asymmetric weight
quantization become more evident, as shown for LLaMA-2-13B. The lack of a clear
benefit for activations might be due to dynamic quantization, which already lets
symmetric quantization adapt to the range of each activation.
• Per-tensor quantization is more sensitive to the symmetric/asymmetric choice than
per-token quantization. We can roughly conclude that per-tensor quantization does
not adapt well to the range of different tokens and their various dimensions within
large language models, especially at lower bits (w4a4). A sketch contrasting the two
granularities follows this list.
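The granularity difference can be illustrated with the following sketch (symmetric quantization only; shapes, bit-width, and the injected outlier are illustrative).

```python
import torch

def quant_per_tensor(x, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x.abs().max() / qmax).clamp(min=1e-8)           # one scale for the whole tensor
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def quant_per_token(x, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)   # one scale per token
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

x = torch.randn(16, 4096)
x[3] *= 50                                                   # one token with large outliers
for name, q in [("per-tensor", quant_per_tensor(x)), ("per-token", quant_per_token(x))]:
    print(name, "MSE:", (q - x).pow(2).mean().item())
# per-token scales adapt to each row, so the outlier token does not destroy
# the resolution of the remaining tokens
```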
Full Prec. 5.47 7.26 32.89 46.55 58.38 74.86 73.91 77.69 66.28
LLaMA-2-7B w4a16g128 5.73 7.59 47.10 45.06 56.79 77.40 72.48 77.26 65.79
w8a8 5.55 33.63 7.35 46.41 57.32 68.96 72.78 77.04 64.50
Full Prec. 4.88 6.73 48.82 55.82 75.13 82.39 77.32 79.92 74.12
LLaMA-2-13B w4a16g128 4.98 6.87 52.34 54.43 71.25 81.71 76.37 78.89 72.53
w8a8 4.92 50.38 6.76 49.77 71.96 74.13 76.96 79.33 70.43
Full Prec. 3.32 5.71 20.78 69.52 89.59 87.58 82.12 81.77 82.12
LLaMA-2-70B w4a16g128 3.46 5.83 20.67 68.36 90.30 86.82 81.50 80.85 81.57
w8a8 3.39 22.17 5.76 61.76 90.12 81.65 81.95 81.39 79.37
Beyond the aforementioned details, we elaborate on the specific sizes of the downstream data
subsets, marked with "*", used in our ablation studies. We randomly extract 600 questions
from MMLU (Hendrycks et al., 2021), 200 from ARC-e (Clark et al., 2018), 300 from BoolQ
(Clark et al., 2019), 250 from HellaSwag (Zellers et al., 2019), and 200 from PIQA (Bisk et al.,
2020), which still reflect real model performance.
PPL ↓ | Accuracy (%) ↑
#Bits Method WikiText2 C4 Avg. MMLU* ARC-e* BoolQ* HellaSwag* PIQA* Avg.
Full Prec. - 5.47 7.26 6.37 46.82 62.31 76.33 72.40 79.00 67.37
TR 6.69 9.11 7.90 41.03 47.24 73.00 68.00 78.50 61.55
TS-v1 6.46 8.59 7.53 38.20 53.77 69.33 66.80 76.00 60.82
TS-v2 6.60 8.88 7.74 39.84 45.23 65.67 68.80 77.00 59.31
TL 6.42 8.62 7.52 39.13 49.25 71.00 67.60 77.00 60.80
CM 6.66 8.98 7.82 39.42 44.22 68.00 68.00 79.00 59.73
CS-sym 6.28 8.38 7.33 44.22 41.21 76.67 68.00 77.00 61.42
w3a16g128 CS-asym 6.22 8.32 7.27 41.87 43.22 73.67 69.60 77.00 61.07
CL 6.21 8.32 7.27 44.86 52.26 71.33 69.20 77.00 62.93
RH 6.33 8.53 7.43 41.48 50.75 72.00 69.60 77.50 62.27
TS-v1+RH 6.31 8.51 7.41 38.69 51.26 69.67 69.60 78.00 61.44
TS-v1+CS-sym 6.22 8.28 7.25 22.83 24.62 58.33 22.40 55.50 36.74
TS-v1+CS-asym 6.18 8.24 7.21 28.86 24.62 65.67 59.60 68.00 49.33
TS+CL w/ CS-asym init. 5.94 8.10 7.02 42.26 47.74 74.00 70.40 79.50 62.78
TR 178.47 119.11 148.79 23.70 27.64 46.67 40.00 58.00 39.20
TS-v1 34.68 34.47 34.58 24.11 29.15 58.33 44.40 56.00 42.40
TS-v2 95.22 82.29 88.76 23.04 31.66 43.00 42.40 61.50 40.32
TL 39.94 47.66 43.80 25.28 29.65 50.33 44.00 58.50 41.55
CM 421.33 559.34 490.34 NaN 30.15 42.67 41.20 60.50 NaN
CS-sym 2.18e5 1.65e5 1.91e5 22.83 24.12 58.33 25.60 55.50 37.28
w2a16g64 CS-asym 13.61 18.97 16.29 28.18 22.61 59.67 56.40 68.00 46.97
CL 11.55 13.96 12.76 27.81 28.64 60.67 55.60 69.50 48.44
RH 15.85 19.90 17.88 28.70 28.14 53.33 51.60 68.00 45.96
TS-v1+RH 15.75 19.77 17.76 26.72 31.66 53.00 50.00 60.50 44.38
TS-v1+CS-sym 2.09e5 1.59e5 1.84e5 22.83 24.62 58.33 22.40 55.50 36.74
TS-v1+CS-asym 11.69 14.83 13.26 28.86 24.62 65.67 59.60 68.00 49.33
TS-v1+CL w/ CS-asym init. 8.66 12.30 10.48 31.79 31.16 64.67 57.60 69.50 50.94
PPL ↓ | Accuracy (%) ↑
#Bits Method WikiText2 C4 Avg. MMLU* ARC-e* BoolQ* HellaSwag* PIQA* Avg.
Full Prec. - 4.88 6.73 5.81 56.69 74.37 82.33 75.60 81.00 74.00
TR 5.56 7.63 6.60 49.84 63.82 78.67 72.00 79.00 68.67
TS-v1 5.43 7.42 6.42 56.11 65.83 77.00 74.00 79.50 70.49
TS-v2 5.47 7.51 6.49 50.63 65.33 76.33 73.20 76.50 68.40
TL 5.44 7.48 6.46 52.36 65.33 73.67 73.60 78.50 68.69
CM 5.52 7.58 6.55 50.48 64.32 78.00 74.00 77.50 68.86
CS-sym 5.36 7.37 6.37 53.29 70.35 78.67 72.00 78.00 70.46
w3a16g128 CS-asym 5.35 7.34 6.34 54.40 70.35 80.33 74.00 80.50 71.92
CL 5.42 7.40 6.41 52.48 59.80 75.00 74.80 77.50 67.92
RH 5.55 7.67 6.61 53.34 69.85 77.67 73.20 79.00 70.61
TS-v1+RH 5.41 7.45 6.43 53.37 61.31 76.33 74.00 78.00 68.60
TS-v1+CS-sym 5.32 7.30 6.31 55.56 67.84 79.33 75.20 80.00 71.59
TS-v1+CS-asym 5.30 7.28 6.29 55.33 72.36 81.00 74.40 80.50 72.72
TS-v1+CL w/ CS-asym init. 5.23 7.28 6.26 53.95 70.85 79.00 74.00 80.50 71.66
TR 16.39 19.39 17.89 25.83 34.17 57.67 54.80 61.00 46.69
TS-v1 12.30 15.45 13.88 26.63 34.67 55.00 58.80 70.00 49.02
TS-V2 14.36 17.05 15.71 25.50 32.66 43.33 53.20 64.50 43.84
TL 12.39 15.76 14.08 26.24 33.67 52.33 55.20 69.00 47.29
CM 26.22 30.46 28.43 23.89 31.66 51.00 36.40 53.00 39.19
CS-sym 1.25e5 9.73e4 1.11e5 28.10 17.09 58.33 26.00 47.00 35.30
w2a16g64 CS-asym 8.96 12.52 10.74 32.65 33.67 63.00 60.40 72.00 52.34
CL 8.40 11.02 9.71 32.45 34.67 64.00 64.00 71.50 53.32
RH 9.51 12.61 11.06 31.77 31.16 66.00 60.00 73.00 52.39
TS-v1+RH 9.81 12.99 11.40 35.01 31.66 60.00 56.80 70.50 50.79
TS-v1+CS-sym 1.22e5 1.22e5 1.22e5 28.10 17.09 58.33 26.00 47.00 35.30
TS-v1+CS-asym 7.88 10.84 9.36 39.57 43.72 71.00 64.40 74.50 58.64
TS-v1+CL w/ CS-asym init. 6.97 10.01 8.49 41.53 42.21 67.67 68.80 76.50 59.34
Table 10: Ablation results of LLaMA-2-13B weight-only quantization. * means the subset of
the corresponding dataset.
PPL ↓ | Accuracy (%) ↑
Method WikiText2 C4 Avg. MMLU* ARC-e* BoolQ* HellaSwag* PIQA* Avg.
Full Prec. 3.32 5.71 4.52 70.91 88.44 83.33 80.00 83.50 81.24
TR 3.95 6.24 5.10 65.40 86.93 82.00 80.40 82.50 79.45
TS-v1 3.85 6.12 4.99 68.14 88.44 83.00 79.20 83.00 80.36
TS-v2 3.93 6.21 5.07 66.62 87.94 82.33 79.20 83.50 79.92
CM 3.98 6.27 5.13 64.90 88.94 79.33 78.00 83.00 78.83
CS-sym 3.85 6.13 4.99 66.34 89.95 82.00 79.20 83.00 80.10
CS-asym 3.84 6.14 4.99 66.03 91.46 83.00 79.20 82.50 80.44
CL 3.81 6.09 4.95 68.90 86.93 81.00 80.00 84.00 80.17
RH 3.93 6.17 5.05 68.01 87.94 84.00 79.20 82.00 80.23
TS-v1+RH 3.95 6.18 5.07 65.66 86.93 83.67 79.20 81.50 79.39
TS-v1+CS-sym 3.75 6.05 4.90 68.38 85.93 83.67 82.40 84.00 80.88
TS-v1+CS-asym 3.74 6.04 4.89 68.21 89.45 84.00 81.60 82.50 81.15
TS-v1+CL w/ CS-asym init. 3.74 6.04 4.89 68.86 88.44 85.00 80.00 81.00 80.66
• We still find that TS-v1 suffers fewer accuracy drops than the other learning-free
transformation methods. In the w4a4 experiments in Table 12, we try different starting
points for TL and observe that initializing from TS-v1 helps reach satisfactory model precision.
• We also provide two best practices aligned with weight-only quantization, i.e., 1)
TS-v1+CS-asym and 2) TL w/ TS-v1 init. + CL w/ CS-asym init.
Table 12: Ablation results of LLaMA-2-7B weight-activation quantization. Ones init. means
we set all elements of s to one before learning.
In a word, we have found four best practices, two for weight-only quantization, and the
other two for weight-activation quantization. Furthermore, we suggest that users use the
learning-free best practices for relatively higher-bit quantization and learning-based best
practices for others.
In this section, we present additional experimental results to further validate the effec-
tiveness of our proposed best practices for model quantization. Specifically, we focus on
weight-only and weight-activation quantization for the LLaMA-2 model across various sizes
(7B, 13B, and 70B). The following tables summarize the main results of these experiments,
demonstrating the effectiveness of our best practices in mitigating the impact of quantization
on model performance across different configurations (Table 15, Table 16, Table 17, Table 18,
Table 19, Table 20).
Weight-only. Our weight-only quantization experiments, as shown in Table 15, Table 16,
and Table 17, provide compelling evidence that our best practices significantly preserve
model performance, even under low bit settings. Notably, our methodologies achieve SOTA
performance. For instance, in the w3a16 setting, the 7B model (Table 15) maintains an
average accuracy decrement of only 4.13% compared to the full-precision model. Similarly,
the 13B model (Table 16) exhibits an average accuracy reduction of 2.6%. Intriguingly, the
70B model (Table 17) demonstrates the most striking resilience with a mere average accuracy
decline of 0.07%, suggesting that our best practices are particularly effective at scale.
These results indicate the robustness of our quantization strategies and underscore the
potential for their application in larger, more complex models. By enabling more efficient
deployment without substantial loss in performance, our best practices for weight-only
quantization facilitate wider accessibility and applicability of large-scale language models.
This part shows the accuracy of KV cache quantization for code generation tasks. From
Table 21, we can see that int8 and int4 KV cache quantization bring almost no accuracy
degradation on both the HumanEval and MBPP datasets. This conclusion is consistent with
the long-context case in the main text, further proving that a 4-bit KV cache can be adopted
without harming performance. However, a 2-bit KV cache crashes generation and thus
should not be adopted.
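A minimal sketch of group-wise asymmetric KV cache quantization, matching the 4-bit, group-size-8 setting recommended in Takeaway 3, is given below; the cache layout and group size are illustrative assumptions.

```python
import torch

def quant_kv(cache, n_bits=4, group=8):
    """cache: (..., head_dim); quantize each contiguous group of `group` channels."""
    shape = cache.shape
    g = cache.reshape(-1, group)
    lo = g.min(dim=1, keepdim=True).values
    hi = g.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = ((g - lo) / scale).round()           # stored as n-bit ints plus (scale, lo) per group
    return (q * scale + lo).reshape(shape)   # dequantized view used at attention time

k = torch.randn(1, 32, 4096, 128)            # (batch, heads, seq, head_dim)
print("int4 g8 max error:", (quant_kv(k) - k).abs().max().item())
```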
Figure 6 and Figure 7 supplementarily illustrate the speedup brought by various quantization
schemes at 1K and 4K input context lengths. We find that the conclusion is the same as for
the 32K input context length in the main text. The w8a8 setting significantly improves
the prefill speed and weight-only quantization helps the decoding speed. The int8 KV
cache quantization does not affect the speed much but helps a lot in reducing memory
consumption for long context lengths. Figure 8 shows the speedup on the Drive Orin edge
GPU. It can be seen that weight-only quantization also helps the prefill under this setting,
which differs from the cloud GPUs.
In this section, we present detailed analyses of mixed-precision quantization, which can be
broadly classified into magnitude-based (Ashkboos et al., 2023; Dettmers et al., 2022; Kim
et al., 2024) and Hessian-based (Dettmers et al., 2023; Lee et al., 2023) mixed precision for
LLM quantization. In weight-only quantization, previous methods mainly use the latter
type, since Hessian information helps capture weight sensitivity to quantization. In
weight-activation quantization, recent studies only utilize the former type, because it is
efficient for allocating bits across a model; moreover, the Taylor expansion underlying the
Hessian information may not approximate the impact of quantization well when both
weights and activations are quantized with relatively large quantization error, in contrast
to weight-only quantization.
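The two families can be sketched as follows; the outlier threshold and the diagonal-Hessian proxy are our own illustrative choices, not the exact criteria of the cited methods.

```python
import torch

def dynamic_outlier_matmul(x, w, threshold=6.0, n_bits=8):
    """Magnitude-based dynamic mixed precision: keep outlier activation columns in FP,
    quantize the rest; the split is decided at runtime, hence 'dynamic'."""
    qmax = 2 ** (n_bits - 1) - 1
    outlier = x.abs().amax(dim=0) > threshold                    # per input channel
    scale = (x[:, ~outlier].abs().max() / qmax).clamp(min=1e-8)
    x_q = (x[:, ~outlier] / scale).round().clamp(-qmax - 1, qmax) * scale
    return x_q @ w[:, ~outlier].T + x[:, outlier] @ w[:, outlier].T

def hessian_sensitivity(w, x):
    """Hessian-based scoring for weight-only mixed precision: columns with the largest
    diag(H) * w^2 are the most sensitive and keep the higher bit-width."""
    h_diag = (x * x).mean(dim=0)                                 # proxy for the Hessian diagonal
    return (w.pow(2) * h_diag).sum(dim=0)                        # one score per input channel

x = torch.randn(32, 512)
x[:, :8] *= 20                                                   # systematic outlier channels
w = torch.randn(1024, 512)
y = dynamic_outlier_matmul(x, w)
keep_high_bits = hessian_sensitivity(w, x).topk(k=51).indices    # e.g. ~10% mixture rate
```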
Figure 6: Inference speed of 7B, 13B, and 70B LLAMA-2 models on NVIDIA A100 GPU.
(Input sequence length: 1024, Output sequence length: 512)
Figure 7: Inference speed of 7B, 13B, and 70B LLAMA-2 models on NVIDIA A100 GPU.
(Input sequence length: 4096, Output sequence length: 512)
[Bar chart comparing Full Prec., w4a16, and w8a8 throughput for LLaMA-2-7B.]
Figure 8: Throughput comparison of quantization on the edge GPU (Drive Orin). (Token/s)
we conducted the mixture-rate experiment for the Hessian-disturbance, column-wise
method in Table 23. We found that in 3-bit quantization, the performance of high
(20%) and low (10%) mixture rates is very similar. However, in 2-bit quantization,
a high (20%) mixture rate brings a significant accuracy improvement for an LLM
with a relatively small model size (7B). More experiments combining fixed-precision
algorithms with the mixed-precision strategy remain to be evaluated in the future.
• Weight-activation. In Table 24, we measure two kinds of magnitude-based mixed-precision
strategies: Dynamic (Dettmers et al., 2022) and Static (Ashkboos et al., 2023). Although
the former brings non-negligible inference overhead, it exhibits higher accuracy than
Static. Considering our design principles, we try allocating more columns to full
precision in the Static strategy and keeping the Down layers of the LLM in 8-bit, since
they are more sensitive to quantization (Heo et al., 2023; Ashkboos et al., 2023).
However, all of these variants are still inferior to the Dynamic method; therefore, it is
necessary to explore new algorithms for inference-efficient static weight-activation
mixed-precision quantization.
PPL ↓ PPL ↓
Model #Bits Metric Granularity Model #Bits Mixture Rate (%)
WikiText2 C4 Avg. WikiText2 C4 Avg.
Full Prec. - - 5.47 7.26 6.37 Full Prec. 0 5.47 7.26 6.37
Hessian Disturb. element 8.54 11.54 10.04 w2a16g64 5 13.00 16.52 14.76
Hessian Diag. column 5.25 7.20 6.22 Full Prec. 0 4.88 6.73 5.81
w3a16g128
Hessian Disturb. column 5.19 7.13 6.16 0 5.52 7.58 6.55
LLaMA-2-13B Hessian Disturb. element 5.16 7.06 6.11 1 5.38 7.38 6.38
Diag. uses the magnitude of diagonal ele- w2a16g64 5 5.73 8.25 6.99