
Preprint

LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models

Ruihao Gong∗1,2 Yang Yong∗2 Shiqiao Gu∗2 Yushi Huang∗1,2 Yunchen Zhang2
Xianglong Liu†1 Dacheng Tao3
1 Beihang University 2 SenseTime Research 3 Nanyang Technological University
{gongruihao, yongyang, gushiqiao, huangyushi}@sensetime.com [email protected]
[email protected]
∗ Equal contribution.
† Corresponding authors.
arXiv:2405.06001v1 [cs.LG] 9 May 2024

Abstract
Recent advancements in large language models (LLMs) are propelling us
toward artificial general intelligence, thanks to their remarkable emergent
abilities and reasoning capabilities. However, the substantial computational
and memory requirements of LLMs limit their widespread adoption. Quan-
tization, a key compression technique, offers a viable solution to mitigate
these demands by compressing and accelerating LLMs, albeit with poten-
tial risks to model accuracy. Numerous studies have aimed to minimize
the accuracy loss associated with quantization. However, the quantization
configurations in these studies vary and may not be optimized for hard-
ware compatibility. In this paper, we focus on identifying the most effective
practices for quantizing LLMs, with the goal of balancing performance with
computational efficiency. For a fair analysis, we develop a quantization
toolkit, LLMC, and design four crucial principles covering inference
efficiency, quantized accuracy, calibration cost, and modularization. By
benchmarking various models and datasets with over 500 experiments, we
derive three takeaways concerning calibration data, quantization algorithms,
and quantization schemes. Finally, a best-practice LLM PTQ pipeline is
constructed. All benchmark results and the toolkit can be found at
https://github.com/ModelTC/llmc.

1 Introduction
Recently, large language models (LLMs) such as GPT-4 (OpenAI et al., 2024) have demon-
strated unprecedented generative capabilities in the field of natural language process-
ing (NLP), achieving widespread application across various industries. However, their
substantial computational and storage costs have impeded their further popularization
among users. For instance, BLOOM (Touvron et al., 2023), an open-access multilingual LLM
with 176 billion parameters, requires a minimum of 350 GB of space merely to store model
weights in full-precision (FP16) format. At a minimum, it requires 5×80GB A100 or 9×40GB
A800 NVIDIA GPUs to perform inference with this model. Therefore, reducing their serving
cost is paramount to further enhance the application of LLMs.
To address the aforementioned challenge, model quantization (Nagel et al., 2021) can be an effective
resolution strategy. It maps weights and/or activations to a lower-bit data format to reduce
the memory footprint and accelerate model inference. Existing quantization approaches can
be categorized into two types: quantization-aware training (QAT) (Bhalgat et al., 2020;
Gong et al., 2019; Esser et al., 2020; Egiazarian et al., 2024; van Baalen et al., 2024) and
post-training quantization (PTQ) (Wei et al., 2023a; Jhunjhunwala et al., 2021; Li et al., 2021).
Although QAT attains prominently high performance, its need for finetuning
or retraining with substantial training data and training cost renders it unattainable for the
majority of users. Correspondingly, PTQ compresses models without retraining, making
it a preferred method for LLMs due to its minimal resource requirements. Therefore,
considering the quantization cost, we do not cover QAT methods (Du et al., 2024;
Liu et al., 2024; 2023) in our paper. On the other hand, quantization can also be classified
into non-uniform (Kim et al., 2024; Egiazarian et al., 2024) and uniform quantization. We
only benchmark the latter, since non-uniform quantization needs complex specialized
kernels that tend to slow down inference. Besides these, we also notice
some approaches (Chee et al., 2024; Tseng et al., 2024) that add non-negligible
computational overhead during inference. Despite their decent accuracy, we exclude
them from our study due to their unfriendliness towards efficient inference.
Current uniform PTQ methods are typically evaluated on distinct datasets, under different quantization
configurations, and with simulated quantization. This state of affairs leaves users
unable to accurately assess which configurations should be selected for efficient
and accurate quantization of LLMs. To provide a comprehensive menu of quantization options
for users to obtain hardware-friendly quantized LLMs with high performance, we build a
fair benchmark that, under our design principles, considers two aspects: factors influencing LLM quantization and
inference efficiency. The former encompasses three
dimensions, i.e., calibration data, algorithm, and target bits. Consequently, we evaluate
across various kinds of tasks and identify our best practice, encapsulated within an end-to-end
pipeline that realizes both efficient and accurate LLM quantization. This best practice
has been integrated into our quantization toolkit, LLMC. Notably, LLMC, a user-friendly,
plug-and-play quantization tool, incorporates dozens of outstanding PTQ algorithms, pro-
vides the freedom to select quantization strategies, and also supports deploying quantized
LLMs on different inference backends (TensorRT-LLM (Nvidia, 2023), PPL-LLM (OpenPPL,
2023), LightLLM (ModelTC, 2023)) and hardware (NVIDIA GPUs, Qualcomm mobile chips,
TPU). In summary, our main contributions are as follows:

1. We release a quantization toolkit, LLMC, supporting dozens of algorithms, models,
and hardware platforms. LLMC enables users to perform lossless quantization on 100-billion-
parameter LLMs within a matter of hours, utilizing just a single GPU. It notably
facilitates the research and production of quantized LLMs.
2. We modularly and fairly benchmark the quantization techniques considering cal-
ibration cost, inference efficiency, and quantized accuracy. Nearly 600 experiments on
diverse models and datasets provide three insightful takeaways on calibration
data, algorithm pipeline, and quantization configuration selection.
3. Based on the takeaways, a best-practice LLM PTQ pipeline is designed, achieving
the best balance of accuracy and efficiency under various scenarios.

2 Benchmark Overview

In this section, we first present our benchmark's design principles (subsection 2.1), outlining
its primary objective. We then detail LLM quantization in subsection 2.2: after introducing
the preliminaries of quantization, we overview our exploration in the benchmark, i.e.,
factors influencing LLM quantization and inference efficiency. Finally,
we exhibit the plug-and-play quantization toolkit within our benchmark.

2.1 Design Principles

Our benchmark focuses on four essential aspects for effective and practical LLM quantiza-
tion: inference performance, calibration cost, quantized accuracy, and modularization.
Inference Performance: In our LLM quantization benchmark, we prioritize the importance
of selecting a quantization approach that enhances inference performance. This means our
chosen setting should either increase throughput or decrease memory requirements, thereby
optimizing the efficiency of the model during the inference phase.
Calibration Cost: The process of post-training quantization for LLMs is also referred to as
calibration. The resources and time invested in calibration for LLMs are crucial factors that


[Figure 1 diagram: a quantized decoder layer (LayerNorm, Q/K/V, RotEmb, KV cache, MatMul, Softmax, O, Up/Gate/SiLU/Down), annotated with three quantization paths: weight-only quantization (FP16 GEMM after dequantizing the low-bit weight), weight-activation quantization (INT8 GEMM with quant/dequant of x and w), and KV-cache quantization (quantizing k and v before caching, dequantizing after retrieval).]
Figure 1: Inference of quantized LLMs. X, K, and V are quantized in a per-token manner with the upper and lower bounds dynamically calculated during inference. The range of W is statically calibrated before deployment. For the weight-only setting, W adopts per-group quantization, and for weight-activation quantization, W uses per-channel quantization.
affect the practicality of LLM quantization. This benchmark aims to find the best pipeline to
produce accurate LLMs in minimal GPUs and time.
Quantized Accuracy: In every method used to create quantized models, it’s crucial to
minimize any reduction in accuracy to a tolerable degree. With this fundamental principle
in mind, we are dedicated to exploring strategies that reliably preserve the performance of
the model within acceptable limits.
Modularization: Recent advancements have introduced a myriad of algorithms aimed
at enhancing the performance of quantized LLMs. This benchmark seeks to dissect these
algorithms to their most fundamental elements, analyzing the efficacy of each component
in isolation.
Guided by the aforementioned four principles, our goal is to investigate and outline optimal
practices for developing quantized LLMs tailored to various scenarios and configurations.

2.2 LLM Quantization

Preliminary of Quantization. For an element x in a vector to be quantized, the process of quantization can be defined as:

quantize: x̄ = clip(x, l, u) / ∆,   ∆ = (u − l) / (2^b − 1),        (1)
dequantize: x̂ = x̄ · ∆,

where u and l are the upper and lower bounds of the vector, b is the bit-width of the quantized vector, and x̄ is the quantized b-bit element. If we force u = −l, the process is called symmetric quantization; otherwise, it is called asymmetric quantization.
In this paper, we mainly consider asymmetric quantization. Besides that, in weight-only quantization we employ per-group quantization, that is, the weights in a group share the same ∆. In weight-activation quantization, we apply per-channel and per-token quantization for weights and activations, respectively1. Details can be found in subsection A.1.
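To make Eq. (1) and these granularity choices concrete, below is a minimal PyTorch sketch (our own illustration, not the LLMC implementation; the explicit rounding step, the zero-point handling, and the tensor shapes are assumptions):

```python
import torch

def fake_quant(x, lower, upper, n_bits):
    """Asymmetric fake quantization in the spirit of Eq. (1): map to a b-bit grid and back."""
    delta = (upper - lower).clamp(min=1e-8) / (2 ** n_bits - 1)   # step size
    zero_point = torch.round(-lower / delta)                      # places the float 0 on the grid
    x_bar = torch.clamp(torch.round(x / delta) + zero_point, 0, 2 ** n_bits - 1)
    return (x_bar - zero_point) * delta                           # dequantized value x_hat

def quant_weight_per_group(w, n_bits=4, group_size=128):
    """Per-group weight quantization: every `group_size` input channels share one delta."""
    out_ch, in_ch = w.shape
    wg = w.reshape(out_ch, in_ch // group_size, group_size)
    lo, hi = wg.amin(dim=-1, keepdim=True), wg.amax(dim=-1, keepdim=True)
    return fake_quant(wg, lo, hi, n_bits).reshape(out_ch, in_ch)

def quant_act_per_token(x, n_bits=8):
    """Dynamic per-token activation quantization: ranges are computed at inference time."""
    lo, hi = x.amin(dim=-1, keepdim=True), x.amax(dim=-1, keepdim=True)
    return fake_quant(x, lo, hi, n_bits)
```

Setting group_size equal to the full input dimension recovers per-channel weight quantization.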
Factors Influencing LLM Quantization. We categorize factors influencing LLM quantiza-
tion into three dimensions: calibration data, algorithms, and target bits.
1 In this paper, the notation “wxay” is employed to represent a bit-width of “x” for weights and “y” for activations. “gz” means that in group-wise quantization the group size is “z”.


• Calibration data: Calibration data helps estimate the range of tensors and then determine the quantization parameters, which is crucial for maintaining model performance post-quantization. Accordingly, the impact of different corpora as calibration data warrants further investigation.
• Algorithm: Naive low-bit quantization usually causes an accuracy drop for LLMs; therefore, efficient remedies that help maintain model performance matter a great deal. Current effective and efficient algorithms can be summarized into three types (a code sketch illustrating them follows this list): 1) Transformation (Xiao et al., 2023; Lin et al., 2023; Shao et al., 2023; Wei et al., 2023b): scaling the magnitude between weights and activations before quantization is widely used to balance quantization errors:

W X = (W s)(s⁻¹ X),    (2)

where s denotes the balance factor. 2) Clipping (Lin et al., 2023; Shao et al., 2023; Wei et al., 2022; Du et al., 2024): clipping some outliers with minimal impact in the weights before quantization helps with range estimation and the representation of the remaining values during calibration:

W = clip(W, α, β),  l ≤ α < β ≤ u,    (3)

where α and β denote the clipping lower and upper bounds, respectively. 3) Reconstruction (Frantar et al., 2022; Lee et al., 2023; Dettmers et al., 2023): this kind of approach employs the Hessian matrix to evaluate the quantization perturbation and updates the remaining intact elements, which can be concisely represented as follows:

W ← W − EH⁻¹,    (4)

where E denotes the perturbation and H⁻¹ is the inverse Hessian matrix. This process is conducted incrementally during the quantization process.
• Target bits: The bit-widths adopted for the weights, activations, and KV cache impact the final accuracy. Usually, the hardware-friendly bit-widths are 2-bit, 4-bit, and 8-bit. In this benchmark, we also investigate 3-bit and 6-bit to compare the potential of quantization algorithms, but for practical deployment, 2/4/8-bit is mainly used.
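To illustrate the three technique families in Eq. (2)-(4) (the code sketch referenced in the Algorithm bullet above), the following minimal PyTorch helpers are our own simplification: the Hessian step shows only a single per-column compensation in the style of Eq. (4), not a complete GPTQ implementation.

```python
import torch

def equivalent_transform(w, x, s):
    """Eq. (2): scale weights up and activations down by the per-input-channel factor s,
    leaving the product of the linear layer mathematically unchanged."""
    return w * s, x / s

def clip_weights(w, alpha, beta):
    """Eq. (3): clip outlier weights into [alpha, beta] before calibrating the range."""
    return torch.clamp(w, alpha, beta)

def hessian_compensate_column(w, w_q_col, j, h_inv):
    """Eq. (4), one incremental step: after quantizing column j of W (giving w_q_col),
    spread the quantization error over the remaining columns via the inverse Hessian."""
    err = (w[:, j] - w_q_col) / h_inv[j, j]
    w[:, j:] -= err.unsqueeze(1) * h_inv[j, j:].unsqueeze(0)
    return w
```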

Quantized inference of LLM. As shown in Figure 1, quantization mainly targets the linear layers with weights, i.e., the Q, K, V, and O layers in self-attention modules and the Up, Gate, and Down layers in FFN modules. Figure 1 presents three types of quantization: weight-activation quantization, weight-only quantization, and KV-cache quantization. They bring different benefits for reducing the prefill and decode latency.

2.3 Quantization Toolkit

To achieve the modular comparison of the different quantization dimensions aforementioned, and to consolidate best practices into an end-to-end pipeline, we have designed and developed a quantization toolkit named LLMC. This toolkit is capable of accommodating multiple quantization configurations using a variety of algorithmic techniques. The models produced by LLMC are designed for seamless deployment across a diverse range of hardware platforms. Presently, LLMC supports over ten algorithms, is compatible with over eight models, can be flexibly extended to support any transformer-based LLM, and facilitates deployment on three types of inference engines, including LightLLM (ModelTC, 2023), TensorRT-LLM (Nvidia, 2023), and PPL-LLM (OpenPPL, 2023).

3 LLM-QBench

Under the principles in subsection 2.1 and powered by our quantization toolkit LLMC, in this section we explore the best practice for quantizing large language models from the aspects of calibration data, quantization algorithm, and target bits.


Calib. Data | GPTQ (WikiText2 / C4) | AWQ (WikiText2 / C4) | SmoothQuant (WikiText2 / C4) | OS+ (WikiText2 / C4) | OmniQuant (WikiText2 / C4)
WikiText2 | 11.93 / - | 2.19e5 / - | 140.74 / - | 84.39 / - | 9.86 / -
C4 | - / 18.15 | - / 1.68e5 | - / 109.41 | - / 82.29 | - / 13.73
Pile (val) | 15.85 / 19.90 | 2.18e5 / 1.65e5 | 178.47 / 119.11 | 95.22 / 85.59 | 11.55 / 13.96

Table 1: Impact of calibration data on performance across algorithms. We evaluate the performance of various algorithms using different calibration datasets (WikiText2, C4, and Pile) and report the PPL (↓) on the WikiText2 and C4 validation sets.
3.1 Experimental Settings

We first describe our experimental settings; more details can be found in subsection A.1.
Models. To demonstrate the generalizability of our benchmark, we assess performance on the LLaMA-2 (Touvron et al., 2023) family, spanning model sizes from 7B to 70B, for general language tasks. To broaden the scope of our evaluation, we also benchmark ChatGLM (Zeng et al., 2023) for long-context abilities, CodeLLAMA (Roziere et al., 2023) for coding tasks, and WizardMath (Luo et al., 2023) for mathematical problems.
Datasets. We categorize the datasets into upstream datasets and downstream datasets.
For the upstream datasets, we employ the WikiText2 (Foundation) and C4 (Raffel et al., 2019) datasets with the perplexity metric for evaluation, since perplexity stably reflects an LLM's performance (Dettmers & Zettlemoyer, 2023). For the downstream tasks, we select
examination tasks including MMLU (Hendrycks et al., 2021) and ARC-e (Clark et al., 2018),
knowledge task BoolQ (Clark et al., 2019), understanding task Lambada (Paperno et al.,
2016), reasoning tasks including PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019) and
GSM8K (Cobbe et al., 2021), coding tasks HumanEval (Chen et al., 2021) and MBPP (Austin
et al., 2021), and the long context evaluation LongBench (Bai et al., 2023).
Hardware. Benefiting from the versatility of our tool, we can efficiently and conveniently quantize LLMs to support multiple inference backends and hardware platforms. In this paper, we mainly measure the inference efficiency of low-bit kernels on NVIDIA server and edge GPUs with NVIDIA's TensorRT-LLM (Nvidia, 2023) framework.

3.2 Impact of Calibration Data

Initially, we examine the influence of calibration data on the accuracy of quantization, as illustrated by Table 1. It is evident that the calibration data affects all algorithms. To attain optimal accuracy, it is crucial to gather domain-specific data for domain-specific models and diverse data for general models.

Takeaway 1.
• For LLMs aimed at solving general tasks, calibration should use diverse data covering the various tasks they will face.
• For a specific-purpose LLM, it is better to calibrate the model with data from the same domain.

3.3 Quantization Algorithm

Following the principle of modularization, we deconstruct the techniques behind existing algorithms. Through a comprehensive and unbiased experimental comparison, we aim to derive insights critical for developing an optimally combined quantization pipeline.
As outlined in Table 2, we summarize the different transformation, clipping, and reconstruction strategies, define their behavior, and analyze their calibration cost accordingly. We evaluate these techniques on LLaMA-2 models of 7B, 13B, and 70B sizes, under both weight-only and weight-activation quantization scenarios. Here, the 2-bit weight-only experiment on the 70B LLaMA-2 is chosen as a representative in the main text. More results are presented in subsection A.2.
Clipping. From Table 3, we find that searching for the clipping values asymmetrically is the most effective strategy for optimizing accuracy. This indicates that selecting an appropriate weight range can significantly reduce weight quantization error. Therefore,


Technique Category | Strategy | Eq. Trans. | Calib. Cost | Algorithm | Alias
Transformation | Rule-based: s = max(|X|^γ)/max(|W|^{1−γ}), γ = 0.5, 0.75 | ✓ | Low | SmoothQuant | TR
Transformation | Search-based v1: s = max(|X|^γ)/max(|W|^{1−γ}), γ ∈ [0, 1] | ✓ | Medium | AWQ | TS-v1
Transformation | Search-based v2: s = max(1.0, max(X)) | ✓ | Medium | OS+ | TS-v2
Transformation | Learning-based: s = arg min_s L | ✓ | High | OmniQuant | TL
Clipping | Min-max: α = min(W), β = max(W) | ✓ | Low | SmoothQuant, OS+, GPTQ | CM
Clipping | Search-based, symmetric: α = β ∈ (0, max(|W|)) | ✗ | Medium | AWQ | CS-sym
Clipping | Search-based, asymmetric: α, β ∈ (0, max(|W|)) | ✗ | Medium | - | CS-asym
Clipping | Learning-based: α, β = arg min_{α,β} L | ✗ | High | OmniQuant | CL
Reconstruction | Hessian-based: W ← W − EH⁻¹ | ✗ | Medium | GPTQ | RH

Table 2: Detailed comparison of the decomposed quantization techniques and strategies. Eq. Trans. indicates whether the algorithm is an equivalent transformation. Calib. Cost represents the level of GPU resources and time required for calibration. γ is the scaling factor, and L is the loss function.

we also adopt this strategy for weight-activation quantization. Clipping should be fully utilized in the best-practice pipeline. What's more, when initialized from the asymmetric clipping result, accuracy can be boosted by further learning; this good initialization contributes to fast convergence.
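A minimal sketch of the asymmetric clipping search (CS-asym) described above, assuming a simple grid over independent shrink ratios of the per-group minimum and maximum and a mean-squared-error objective; the grid size and the objective are illustrative assumptions, not the exact LLMC search:

```python
import torch

def search_asym_clip(w_group, n_bits=2, n_grid=20):
    """Grid-search asymmetric clipping bounds (alpha, beta) for one weight group by
    minimizing the MSE between the group and its quantize-dequantize reconstruction."""
    w_min, w_max = w_group.min(), w_group.max()
    best_loss, best = float("inf"), (w_min, w_max)
    for i in range(1, n_grid + 1):
        for j in range(1, n_grid + 1):
            alpha = w_min * i / n_grid                     # shrink the lower bound ...
            beta = w_max * j / n_grid                      # ... independently of the upper bound
            delta = (beta - alpha).clamp(min=1e-8) / (2 ** n_bits - 1)
            q = torch.clamp(torch.round((w_group.clamp(alpha, beta) - alpha) / delta),
                            0, 2 ** n_bits - 1)
            loss = ((q * delta + alpha - w_group) ** 2).mean()
            if loss < best_loss:
                best_loss, best = loss, (alpha, beta)
    return best                                            # searched (alpha, beta)
```

The learning-based clipping (CL) can then be initialized from the searched (α, β) rather than learning from scratch, matching the initialization discussed above.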
Reconstruction. GPTQ (Frantar et al., 2022) reconstruction involves a non-equivalent transformation of weights along the channel dimension, hindering simultaneous optimization of weights and clipping values. Clipping weights before reconstruction yields suboptimal results because the weights subsequently change; if reconstruction precedes the clip-value search, the initial quantization parameters no longer match the updated weights. Moreover, when paired with an equivalent transformation, reconstruction yields minimal benefits. This limitation may stem from the alteration of gradients and the disruption of assumptions regarding the Hessian information. Furthermore, it requires an extended calibration period. Therefore, reconstruction may not be considered a best practice.
Transformation. The transformation technique utilizes a linear operation to reduce the outlier problem in LLMs or to preserve the important weights. For both weight-only and weight-activation quantization, such an equivalent transformation brings an accuracy improvement, especially for the activations. From the table, we can infer that manually setting the scaling factor is rigid and may not help in all scenarios. On the contrary, a suitable search for the transformation scale s is effective: the different search strategies both help considerably in improving accuracy. A learning process can further be adopted on top of a pre-searched range. Fortunately, with the support of a fast pre-search, the calibration can achieve learning with fewer epochs.
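The search-based transformation v1 in Table 2 can be sketched as a one-dimensional grid over γ: for each candidate, build s = max(|X|^γ)/max(|W|^{1−γ}) per input channel and keep the scale that best preserves the layer output after weight quantization. The loss, grid resolution, and layer-level (rather than block-level) objective are illustrative assumptions; the sketch reuses quant_weight_per_group from the earlier quantization sketch.

```python
import torch

def search_transform_scale(w, x_calib, n_bits=4, group_size=128, n_grid=20):
    """TS-v1-style search over gamma in [0, 1] for the equivalent-transformation scale s."""
    x_absmax = x_calib.abs().amax(dim=0)          # per-input-channel activation range
    w_absmax = w.abs().amax(dim=0)                # per-input-channel weight range
    ref_out = x_calib @ w.t()                     # full-precision reference output
    best_loss, best_s = float("inf"), torch.ones_like(w_absmax)
    for k in range(n_grid + 1):
        gamma = k / n_grid
        s = (x_absmax.pow(gamma) / w_absmax.pow(1 - gamma)).clamp(1e-4, 1e4)
        w_q = quant_weight_per_group(w * s, n_bits, group_size)   # fake-quantize W*s
        loss = ((x_calib / s) @ w_q.t() - ref_out).pow(2).mean()  # output error of (s^-1 X)(W s)
        if loss < best_loss:
            best_loss, best_s = loss, s
    return best_s
```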
Calibration cost for each strategy. In the analysis of calibration costs detailed in Table 4, we
observe that within the suite of transformation techniques, the search-based (v1) strategy
requires roughly 10 minutes, making it twice as fast as the (v2) strategy. While rule-based
transformations are quicker, they often fall short of achieving acceptable accuracy levels.
On the other hand, learning-based transformation methods incur a considerable increase in
time to attain satisfactory accuracy levels. However, initializing the learning process with
pre-searched values can halve the number of epochs required and yield higher accuracy.
Regarding clipping methods, employing direct min-max value clipping is time-efficient but
typically results in significant accuracy loss. The search-based clipping method, whether
using asymmetric or symmetric ranges, proves efficient, requiring only about 20 minutes.
Yet, when applying a learning-based approach to clipping, the calibration time can extend
to nearly 7 hours. Therefore, a combined approach of the search-based transformation v1
and search-based asymmetric clipping emerges as the most effective in balancing accuracy
and efficiency. Furthermore, initiating with pre-searched values and conducting additional
learning for a few epochs may offer further accuracy improvements.


Method | PPL ↓: WikiText2, C4, Avg. | Accuracy (%) ↑: MMLU*, ARC-e*, BoolQ*, HellaSwag*, PIQA*, Avg.
Full Prec. 3.32 5.71 4.52 70.91 88.44 83.33 80.00 83.50 81.24
TR 7.56 10.79 9.18 51.44 38.19 59.00 69.20 76.50 58.87
TS-v1 6.69 9.41 8.05 40.21 45.73 73.33 67.60 77.50 60.87
TS-v2 7.25 10.42 8.83 49.63 48.74 62.67 70.00 78.50 61.91
CM 10.32 15.16 12.74 34.45 34.67 54.00 62.40 73.50 51.80
CS-sym 7.2e4 6.5e4 6.9e4 27.79 26.63 41.67 25.60 51.50 34.64
CS-asym 5.67 8.26 6.97 53.44 69.85 78.67 72.80 78.00 70.55
CL 6.13 8.62 7.38 49.59 47.24 75.67 72.80 79.50 64.96
RH 6.68 9.40 8.04 54.65 42.21 70.33 67.20 77.00 62.28
TS-v1+RH 6.69 9.45 8.07 50.00 42.71 73.67 65.60 73.50 61.10
TS-v1+CS-sym 7.1e4 6.5e4 6.8e4 27.79 26.63 41.67 25.60 51.00 34.54
TS-v1+CS-asym 5.24 7.73 6.49 59.52 77.89 82.33 74.80 82.00 75.31
TS-v1+CL w/ CS-asym init. 5.24 7.70 6.47 59.40 77.25 82.39 75.46 78.51 74.60

Table 3: Ablation results of LLaMA-2-70B weight-only (w2a16g64) quantization. * means the subset of the corresponding dataset.
Strategy: TR | TS-v1 | TS-v2 | TL w/ ones init. | TL w/ TR init. | TL w/ TS-v1 init. | CS-asym | CL | TS-v1+CS-asym | TL+CL w/ CS-asym init. | TS-v1+CL w/ CS-asym init. | TL w/ TS-v1 init.+CL w/ CS-asym init. | RH
Time: ∼0.08h | ∼0.2h | ∼0.5h | ∼7.3h | ∼7.3h | ∼4h | ∼0.4h | ∼6.8h | ∼0.6h | ∼8.3h | ∼3.5h | ∼4.4h | ∼0.6h

Table 4: Calibration cost on LLAMA-2-70B of different strategies in Table 2. Ones init. means we use a vector of ones as the starting point of s before learning.

Takeaway 2.
• Search-based clipping and transformation are the optimal solutions for balancing calibration cost and accuracy. The searched values also provide good initialization for the learning-based solutions.
• Incorrect clipping easily leads to an accuracy crash. Asymmetric clipping is simple yet effective for improving accuracy.
• The transformation search influences both the calibration efficiency and the quantized accuracy. The v1 strategy in Table 2 enjoys a good tradeoff between them.

3.4 Target Bits

Fixed-precision. In the experimental results presented in subsection 3.3, we observed that both 2-bit weight-only quantization and w4a4 weight-activation quantization suffer over a 20% degradation in accuracy. This significant reduction in performance limits their practical utility. In contrast, 3-bit weight-only and w6a6 weight-activation quantization were primarily evaluated to assess algorithm capabilities and cannot achieve practical hardware acceleration. Consequently, we recommend the 4-bit weight-only, w4a8, or w8a8 weight-activation quantization approaches, as they strike a balance between maintaining accuracy and enhancing inference speed. Furthermore, quantization of the key-value (KV) cache is proposed as a method to decrease memory usage. In Table 21 and Table 5, we assess the accuracy impact of 2-bit (per-group quantization with a group size of 8), 4-bit (per-group quantization with a group size of 8), and 8-bit (per-tensor) KV cache quantization. The results indicate that 2-bit KV cache quantization leads to a substantial loss in accuracy, while 4-bit KV cache quantization, with its finer granularity, performs comparably to 8-bit KV cache quantization with its coarser granularity. Both the 4-bit and 8-bit configurations closely approximate the performance of FP16 on the code generation and long-context understanding tasks. Hence, for KV cache quantization, a 4-bit per-group approach with a group size of 8 is recommended.
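A minimal sketch of the recommended KV-cache setting (4-bit, asymmetric, per-group with group size 8); grouping along the head dimension and the cached-tensor layout are our assumptions for illustration:

```python
import torch

def quant_kv(kv, n_bits=4, group_size=8):
    """Quantize a cached K or V tensor of shape (batch, heads, seq_len, head_dim)
    asymmetrically, with one (delta, lower) pair per group of `group_size` values."""
    b, h, t, d = kv.shape
    g = kv.reshape(b, h, t, d // group_size, group_size)
    lo = g.amin(dim=-1, keepdim=True)
    hi = g.amax(dim=-1, keepdim=True)
    delta = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = torch.clamp(torch.round((g - lo) / delta), 0, 2 ** n_bits - 1).to(torch.uint8)
    return q, delta, lo                     # store q plus per-group (delta, lo)

def dequant_kv(q, delta, lo, shape):
    """Recover k_hat / v_hat before the attention matmuls."""
    return (q.float() * delta + lo).reshape(shape)
```

Per-tensor 8-bit KV quantization would instead compute a single (delta, lo) pair over the whole tensor.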
Mixed-precision. As presented in our experiments, quantizing LLMs into ultra-low precision without significant accuracy loss is difficult. A viable remedy is to employ mixed-precision quantization. For mixed precision, we only evaluate accuracy for theoretically hardware-friendly strategies, since there are no open-access fast kernels with which to evaluate inference.


Model | KV Cache Prec. | Accuracy (%) ↑: NarrativeQA, QASPER, MultiFieldQA-en, MultiFieldQA-zh, Avg.
ChatGLM3-6B-32k | Full Prec. | 25.93 43.35 51.57 62.36 45.80
ChatGLM3-6B-32k | int8 | 25.74 43.57 51.81 62.48 45.90
ChatGLM3-6B-32k | int4 | 26.13 43.43 51.63 61.04 45.56
ChatGLM3-6B-32k | int2 | 1.89 4.68 3.13 1.08 2.70

Table 5: KV cache quantization results on Single-Document QA from LongBench (Bai et al., 2023).

Figure 2: Inference speed of 7B, 13B and 70B LLAMA-2 models on NVIDIA A100 GPU.
(Input sequence length: 32K, Output sequence length: 512)

As shown in Table 22, Table 23, and Table 24, for weight-only quantization, employing Hessian-based disturbance as the bit-allocation strategy outperforms the others. High-bit quantization benefits from lower mixture rates, while low-bit quantization requires more full-precision weights in small LLMs for better performance. For weight-activation quantization, dynamic bit allocation, despite slower inference and higher computational overhead, gains more accuracy improvement than the static strategy, even though the latter uses twice the mixture rate. Details are presented in subsection A.6.
Inference Speed. To assess the practical benefits of different quantization approaches,
we conducted evaluations using NVIDIA’s cloud (SMX 80G A100) and edge (Drive Orin)
GPUs, alongside the official inference library, TensorRT-LLM. Part of our results, as depicted
in Figure 2, highlight the throughput improvements achieved through TensorRT-LLM-
supported quantization schemes for models with 32,000 input tokens and 512 output tokens.
The findings indicate that quantization with 8-bit weights and activations enhances the
prefill stage’s speed by 20%-30% and the decode stage by 40%-60%. In contrast, 4-bit
weight-only quantization reduces the prefill speed by 10% but increases the decode speed
by 40%-60%. It’s important to note that these acceleration rates tend to diminish for larger
models. Besides, 8-bit KV cache quantization has minimal impact on prefill times and
slightly reduces decoding throughput for very large models, such as the 70B model.
Results for more models and hardware can be found in subsection A.5.


[Figure 3 diagram: three stages labeled Takeaway 1 (tasks × models, calibration data, configuration), Takeaway 3 (bit-width fixed/mixed; quantizing W / A / KV cache), and Takeaway 2 (transformation with searched scale s, asymmetric clipping with searched α, β, and learning initialized from the searched values).]
Figure 3: The best practice of PTQ pipeline for LLMs.


Method | Accuracy (%) ↑: MMLU, ARC-e, BoolQ, HellaSwag, PIQA, Avg.

Full Prec. 70.91 88.44 83.33 80.00 83.50 81.24
Naive 31.72 30.69 53.88 66.83 72.14 51.05
GPTQ 50.50 41.09 76.39 69.04 75.41 62.49
AWQ 24.46 26.46 37.83 24.60 50.87 32.84
OmniQuant 49.27 47.80 78.07 72.79 77.91 65.17
TS-v1+CS-asym 57.91 80.07 83.91 75.98 78.67 75.31
TS-v1+CL w/ CS-asym init. 59.40 77.25 82.39 75.46 78.51 74.60

Table 6: Main results of LLaMA-2-70B (w2a16g64) weight-only quantization.

#Bits | Method | CodeLlama-7b: HumanEval (Pass@1 (%) ↑) | WizardMath-7b: GSM8K-100 (Acc. (%) ↑)
Full Prec. | - | 31.10 | 51.00
w3a16 | Naive | 26.83 | 32.00
w3a16 | TS-v1+CL w/ CS-asym init. | 23.17 | 38.00
w3a16 | TS-v1+CL w/ CS-asym init.† | 28.05 | 46.00

Table 7: Main results of code and math analyses. † indicates calibration with the corresponding data. For the CodeLlama-7b model, we use a sample of 10 instances each from the MBPP and HumanEval datasets. Similarly, for the WizardMath-7b model, we sample 10 instances each from the MATH and GSM8K datasets.

Takeaway 3.
• For fixed precision, considering the principles of inference speed and quantized accuracy, 4-bit weight-only quantization, w4a8/w8a8 weight-activation quantization, and 4-bit KV cache quantization with group size 8 are promising settings. Larger models can tolerate lower bit-widths for weights.
• Weight-only quantization benefits decoding speed but harms prefill speed. Weight-activation quantization benefits both the prefill and decode speed, and KV cache quantization brings only a little speedup for small models but helps reduce memory consumption for long contexts.
• For mixed precision (which requires specialized kernels), Hessian-based metrics excel at determining the precision for weight quantization, while dynamic magnitude-based strategies, despite non-negligible overhead, are better for enhancing accuracy in weight-activation quantization.

3.5 Best Practice of LLM PTQ pipeline

Based on the takeaways distilled from the above exploration, we summarize the best practice of the PTQ pipeline for LLMs. As depicted in Figure 3, we first collect the best calibration data according to the task and model under the guidance of Takeaway 1. Then the bit-width and quantization scheme are determined considering Takeaway 3. Finally, the calibration process is conducted using the algorithm pipeline based on Takeaway 2. The results in Table 6 and Table 7 on the general-purpose model LLaMA-2-70B and the specific-domain code model CodeLLAMA-7b and math model WizardMath-7b prove the effectiveness of this practice, especially for maintaining high accuracy. More experimental results on other models and
datasets to validate our best practice for decent performance and efficient inference can be
found in subsection A.3.
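Read as code, the Figure 3 pipeline amounts to the following sketch, which chains the illustrative helpers from the earlier sketches (search_transform_scale, search_asym_clip, quant_weight_per_group); the function and argument names are ours, not the LLMC API, and the learning-based refinement is only indicated in a comment.

```python
import torch

def best_practice_ptq(weights, calib_acts, wbits=4, group_size=128):
    """Sketch of the best-practice weight-only PTQ pipeline: calibration data reflecting
    Takeaway 1 is assumed to be baked into `calib_acts`; the bit-width/scheme follows
    Takeaway 3; the TS-v1 + CS-asym algorithm pipeline follows Takeaway 2."""
    quantized = {}
    for name, w in weights.items():
        x = calib_acts[name]
        # 1) Equivalent transformation: search the per-channel scale s (TS-v1).
        s = search_transform_scale(w, x, n_bits=wbits, group_size=group_size)
        w_t = w * s
        # 2) Asymmetric clipping: search (alpha, beta) per group (CS-asym).
        out_ch, in_ch = w_t.shape
        groups = w_t.reshape(out_ch * in_ch // group_size, group_size)
        clipped = torch.stack([g.clamp(*search_asym_clip(g, n_bits=wbits)) for g in groups])
        w_c = clipped.reshape(out_ch, in_ch)
        # 3) Optional: refine the searched bounds with a few learning epochs (CL w/ CS-asym init.).
        quantized[name] = quant_weight_per_group(w_c, n_bits=wbits, group_size=group_size)
    return quantized
```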

4 Conclusion

In this study, we have undertaken a comprehensive benchmarking of decomposed quantization techniques for large language models (LLMs), leading to the identification of best
practices that balance calibration costs, accuracy, and efficiency. Furthermore, we intro-
duce LLMC, a toolkit designed to empower the research and development community.
Models optimized through our recommended practices and toolkit are readily deployable
across a variety of hardware platforms, enhancing accessibility and applicability in diverse
computational environments.

References
Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren,
Torsten Hoefler, and Dan Alistarh. Quik: Towards end-to-end 4-bit inference on generative
large language models, 2023.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with
large language models. arXiv preprint arXiv:2108.07732, 2021.

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao
Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Long-
bench: A bilingual, multitask benchmark for long context understanding. arXiv preprint
arXiv:2308.14508, 2023.

Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+:
Improving low-bit quantization through learnable offsets and better initialization, 2020.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning
about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on
Artificial Intelligence, 2020.

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantiza-
tion of large language models with guarantees, 2024.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray,
Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin,
Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo-
hammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings,
Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji,
Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh
Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage,
Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code.
2021.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions,
2019.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv:1803.05457v1, 2018.


Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz
Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher
Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168, 2021.
Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling
laws, 2023.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix
multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar,
Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A
sparse-quantized representation for near-lossless llm weight compression. arXiv preprint
arXiv:2306.03078, 2023.
Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu.
Bitdistiller: Unleashing the potential of sub-4-bit llms via self-distillation, 2024.
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and
Dan Alistarh. Extreme compression of large language models via additive quantization,
2024.
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and
Dharmendra S. Modha. Learned step size quantization, 2020.
Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate
post-training quantization for generative pre-trained transformers. arXiv preprint
arXiv:2210.17323, 2022.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster,
Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor
Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint
arXiv:2101.00027, 2020.
Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei
Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit
neural networks. In The IEEE International Conference on Computer Vision (ICCV), October
2019.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of
the International Conference on Learning Representations (ICLR), 2021.
Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon,
and Dongsoo Lee. Rethinking channel dimensions to isolate outliers for low-bit weight
quantization of large language models, 2023.
Divyansh Jhunjhunwala, Advait Gadhikar, Gauri Joshi, and Yonina C. Eldar. Adaptive
quantization of model updates for communication-efficient federated learning, 2021.
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen,
Michael W. Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization,
2024.
Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. Owq: Lessons
learned from activation outliers for weight quantization in large language models. arXiv
preprint arXiv:2306.02272, 2023.
Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang,
and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction,
2021.


Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq:
Activation-aware weight quantization for llm compression and acceleration. arXiv preprint
arXiv:2306.00978, 2023.
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. Qllm:
Accurate and efficient low-bitwidth quantization for large language models, 2024.
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad,
Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free
quantization aware training for large language models, 2023.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo
Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering
mathematical reasoning for large language models via reinforced evol-instruct. arXiv
preprint arXiv:2308.09583, 2023.
ModelTC. Lightllm. https://github.com/ModelTC/lightllm, 2023.
Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen,
and Tijmen Blankevoort. A white paper on neural network quantization, 2021.
Nvidia. Tensorrt-llm. https://github.com/NVIDIA/TensorRT-LLM, 2023.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren-
cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat,
Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao,
Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brak-
man, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie
Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke
Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen,
Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings,
Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien
Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty
Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman,
Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun
Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott
Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han,
Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade
Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost
Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun
Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali
Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick,
Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt
Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis,
Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike,
Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz
Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam
Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne,
Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil,
David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela
Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg
Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan,
Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex
Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy
Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila
Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny,
Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth
Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Fran-
cis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario
Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr,


John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah
Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina
Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski
Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thomp-
son, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry
Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss,
Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei,
CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff,
Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sher-
win Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan,
Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao
Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
OpenPPL. Ppl-llm. https://github.com/openppl-public/ppl.nn.llm, 2023.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella
Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The
LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pp. 1525–1534, Berlin, Germany, August 2016. Association for Computational
Linguistics. URL http://www.aclweb.org/anthology/P16-1144.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a
unified text-to-text transformer. arXiv e-prints, 2019.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation
models for code. arXiv preprint arXiv:2308.12950, 2023.
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng
Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated
quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W
Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of
bert. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8815–8821, 2020.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas
Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude
Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman
Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas,
Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning
Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew
Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva,
Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor,
Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang,
Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat
models, 2023.
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#:
Even better llm quantization with hadamard incoherence and lattice codebooks, 2024.
Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric
Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimension-
ality for llm quantization, 2024.
Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang,
Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit


transformer language models. Advances in Neural Information Processing Systems, 35:


17402–17414, 2022.
Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. Qdrop: Randomly
dropping quantization for extremely low-bit post-training quantization, 2023a.
Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and
Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by
equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145, 2023b.
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han.
Smoothquant: Accurate and efficient post-training quantization for large language models.
In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yux-
iong He. Zeroquant: Efficient and affordable post-training quantization for large-scale
transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can
a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, 2019.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang,
Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai,
Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130b: An
open bilingual pre-trained model. In The Eleventh International Conference on Learning
Representations (ICLR), 2023. URL https://openreview.net/forum?id=-Aw0rrrPUF.
Luoming Zhang, Wen Fei, Weijia Wu, Yefei He, Zhenyu Lou, and Hong Zhou. Dual grained
quantization: Efficient fine-grained quantization for llm, 2023.


A Appendix

A.1 Quantization Granularity & More Experiment Settings

We first present more quantization preliminaries in subsubsection A.1.1 to help readers understand the subsequent content more clearly. Then we benchmark naive PTQ without algorithms in subsubsection A.1.2 to evaluate quantization granularity, from which we derive our basic quantization settings in subsubsection A.1.3.

A.1.1 More Preliminaries of quantization


Naive PTQ can be split along four dimensions: bit-width, symmetric/asymmetric, group size, and dynamic/static. 1) Bit-width: In this paper, we mainly focus on w4a4, w4a8, and w8a8 weight-activation quantization and w2a16, w4a16 weight-only quantization; other bit-width settings are only used to validate our analyses. Since more extreme low-bit quantization can result in unacceptable accuracy loss, whereas settings like w3a16 and w6a6 cannot pack quantized values contiguously, our focus is on hardware-friendly settings that can simultaneously improve inference efficiency2; 2) Symmetric or asymmetric: For asymmetric quantization, a zero-point value z is usually introduced to represent the floating-point zero. In contrast, symmetric quantization lacks this adjustable z to adapt to various ranges; 3) Group size: Shen et al. (2020) first proposes group-wise quantization, which divides each channel of a weight3 into different groups and employs a different set of scale and zero-point for each group W_{i, j:j+g} with group size g. Per-tensor (W_{:,:}) quantization or per-channel (W_{i,:}) quantization can also be seen as group-wise quantization with a larger group size; 4) Dynamic or static: Due to the variance in activation ranges for LLMs, Yao et al. (2022) first introduces token-wise (X_{i,:}) quantization for activations, which dynamically calculates the min/max range for each token during model inference. We also measure dynamic/static per-tensor activation quantization to make a comprehensive comparison.
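To make the asymmetric zero-point in item 2) and the static/dynamic distinction in item 4) concrete, here is a small sketch (the integer grid and the offline calibration procedure are our assumptions):

```python
import torch

def static_per_tensor_params(calib_acts, n_bits=8):
    """Static quantization: one (delta, zero_point) pair for the whole activation tensor,
    calibrated once offline from calibration activations and reused at inference time."""
    lo, hi = calib_acts.min(), calib_acts.max()
    delta = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
    zero_point = torch.round(-lo / delta)           # asymmetric: z maps the float 0 onto the grid
    return delta, zero_point

def dynamic_per_token_params(x, n_bits=8):
    """Dynamic quantization: one (delta, zero_point) pair per token (row of X in R^{n x d}),
    recomputed from the observed min/max at every inference step."""
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    delta = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
    return delta, torch.round(-lo / delta)
```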

A.1.2 Quantization Granularity Exploration


Weight-only. For weight-only quantization, the experiments are conducted as described in Figure 4. Drawing on the findings from these experiments, we reach the following conclusions:

• For lower-bit quantization, the precision advantage of asymmetric over symmetric quantization becomes significantly more pronounced.
• Higher-bit quantization is not sensitive to common group sizes, e.g., 64 and 128. Moreover, the accuracy drop of channel-wise quantization compared to group-wise quantization is non-negligible across model sizes.
• Larger models exhibit better robustness to quantization, with reduced likelihood of
numerical overflow at lower bits.

Weight-activation. Since weight-activation quantization can use integer matrix multiplication to speed up inference, per-group quantization for weights, which would slow down this multiplication, is usually not adopted in this setting. Therefore, we force per-channel weight quantization in all experiments. As presented in Figure 5, we arrive at the following conclusions:

• Per-token quantization significantly outperforms static/dynamic per-tensor quantization.
2Weight-only quantization accelerates by reducing memory data volume, and “wxax” quantiza-
tion further speeds up with low-bit multiplications. w4a8 leverages 8-bit matrix multiplication for
acceleration, as detailed in (Zhang et al., 2023).
3 We denote the weight W ∈ R^{out×in}. The first/second dimension of W represents output/input channels. Notably, we ignore the batch size dimension for the activation X ∈ R^{n×d}, where n is the token number and d is the hidden size.


Figure 4: Weight-only quantization granularity results for LLaMA-2-7B (Upper Left), LLaMA-2-13B (Upper Right), and LLaMA-2-70B (Lower Center). “asym/sym” means asymmetric/symmetric quantization. “ch” means per-channel quantization. “xg” means per-group quantization with group size “x”.


Figure 5: Weight-activation quantization granularity results for LLaMA-2-7B (Upper Left), LLaMA-2-13B (Upper Right), and LLaMA-2-70B (Lower Center). “sym/asym-sym/asym” means weights employ sym/asym (former) quantization and activations employ sym/asym (latter) quantization. “sts/ts” denotes static/dynamic per-tensor quantization for activations. “tk” means dynamic per-token quantization for activations.

• For per-token quantization, at higher bits (w8a8), the benefits of asymmetric quan-
tization are marginal. However, at lower bits (w4a4), the gains from asymmetric
weight quantization become more evident as presented for LLaMA-2-13B. The lack
of clear benefits for activation might be due to dynamic quantization, which allows
symmetric quantization to adapt to the range of various activations effectively.
• Per-tensor quantization is more sensitive to the symmetric/asymmetric choice than per-token quantization. We can roughly assert that per-tensor quantization does not adapt well to the ranges of different tokens and their various dimensions within large language models, especially at lower bits (w4a4).

A.1.3 More Experimental Settings

As described in subsubsection A.1.2, for weight-only quantization we employ group-wise quantization, since it has higher accuracy and many inference backends have already implemented fast group-wise kernels. Specifically, we utilize a group size of 64 for 2-bit quantization and a group size of 128 for other bit-widths. For weight-activation quantization, we apply channel-wise weight quantization and dynamic per-token activation quantization, which also benefit from open-access fast kernels and high performance. In addition, we use asymmetric quantization for both weights and activations, since it consistently gives higher accuracy without sacrificing much speed. Unless otherwise specified, the calibration data mentioned in our paper uniformly employs a general high-quality dataset, Pile (val) (Gao et al., 2020). More specifically, we use 128 samples with a sequence length of 512 for weight-only quantization, and 128 samples with a sequence length of 2048 for weight-activation quantization. The reason for the different sequence lengths is that we find weight-only quantization to be less sensitive to the amount of data than weight-activation quantization. We will explore this further in the future.
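The calibration sets described above can be assembled as in the following sketch; the token stream is a placeholder for the tokenized calibration corpus (e.g. Pile (val)), and the helper name is hypothetical:

```python
import random
import torch

def build_calib_set(token_ids, n_samples=128, seq_len=512, seed=0):
    """Randomly crop `n_samples` windows of `seq_len` tokens from a long token stream
    (e.g. the concatenated calibration corpus) to serve as calibration samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        start = rng.randint(0, len(token_ids) - seq_len - 1)
        samples.append(torch.tensor(token_ids[start:start + seq_len]))
    return torch.stack(samples)            # shape: (n_samples, seq_len)

# Weight-only calibration uses seq_len=512; weight-activation calibration uses seq_len=2048.
```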


Model | #Bits | PPL ↓: WikiText2, C4, Avg. | Accuracy (%) ↑: MMLU, ARC-e, BoolQ, HellaSwag, PIQA, Avg.

Full Prec. 5.47 7.26 32.89 46.55 58.38 74.86 73.91 77.69 66.28
LLaMA-2-7B w4a16g128 5.73 7.59 47.10 45.06 56.79 77.40 72.48 77.26 65.79
w8a8 5.55 33.63 7.35 46.41 57.32 68.96 72.78 77.04 64.50

Full Prec. 4.88 6.73 48.82 55.82 75.13 82.39 77.32 79.92 74.12
LLaMA-2-13B w4a16g128 4.98 6.87 52.34 54.43 71.25 81.71 76.37 78.89 72.53
w8a8 4.92 50.38 6.76 49.77 71.96 74.13 76.96 79.33 70.43

Full Prec. 3.32 5.71 20.78 69.52 89.59 87.58 82.12 81.77 82.12
LLaMA-2-70B w4a16g128 3.46 5.83 20.67 68.36 90.30 86.82 81.50 80.85 81.57
w8a8 3.39 22.17 5.76 61.76 90.12 81.65 81.95 81.39 79.37

Table 8: Ablation results of LLaMA-2 family naive quantization (w4a16g128, w8a8).

Beyond the aforementioned details, we elaborate on the specific sizes of the downstream data subsets, marked with “*”, used in our ablation studies. We
randomly extract 600 questions from MMLU (Hendrycks et al., 2021), 200 from ARC-e (Clark
et al., 2018), 300 from BoolQ (Clark et al., 2019), 250 from HellaSwag (Zellers et al., 2019)
and 200 from PIQA (Bisk et al., 2020), which can also reflect real model performance.

A.2 More Ablation Study of the Decomposed Algorithm Techniques

In this section, we provide a comprehensive ablation study of the different quantization
algorithms. We first benchmark weight-only quantization on the upstream datasets and on
subsets of the downstream datasets. From these results, we find that the trend of PPL across
algorithms and models closely tracks that of accuracy on the downstream subsets. Therefore,
for the weight-activation quantization exploration we only measure PPL, reusing conclusions
from the preceding weight-only study. We also find that relatively high-bit settings (w4a16g128,
w8a8) are not suitable for the ablation study, since their PPL and accuracy are nearly identical
to those of the full-precision models, as shown in Table 8. In fact, such high-bit quantization may
not require any specific algorithm for practical deployment and inference.
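The PPL reported in these ablations follows the standard chunked evaluation protocol. A sketch
of such an evaluation is shown below, assuming a Hugging Face causal LM; the model name in
the usage comment is illustrative and not part of our pipeline.

```python
import torch

@torch.no_grad()
def eval_ppl(model, tokenizer, texts, seq_len=2048, device="cuda"):
    # Standard chunked PPL evaluation: concatenate the corpus, split it into
    # fixed-length chunks, and exponentiate the token-averaged NLL.
    ids = tokenizer("\n\n".join(texts), return_tensors="pt").input_ids.to(device)
    nlls, n_tokens = [], 0
    for i in range(0, ids.shape[1] - seq_len, seq_len):
        chunk = ids[:, i:i + seq_len]
        loss = model(chunk, labels=chunk).loss   # mean NLL over the chunk
        nlls.append(loss.float() * chunk.shape[1])
        n_tokens += chunk.shape[1]
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

# Usage sketch (model name illustrative; substitute the quantized model under test):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   from datasets import load_dataset
#   tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
#   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda().eval()
#   texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
#   print(eval_ppl(model, tok, texts))
```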
Weight-only. As the results in Table 9, Table 10, and Table 11 show, we draw the following
conclusions:

• Search-based transformation, especially TS-v1, outperforms the other transformations in
  weight-only quantization. Learning-based transformation initialized from rule-based
  transformation is not a good choice, since such a starting point does not lead to good
  performance.
• Asymmetric clipping prevents accuracy drops better than symmetric clipping, especially at
  lower bit-widths.
• We obtain two best practices for weight-only quantization: 1) TS-v1+CS-asym, an efficient
  learning-free strategy that outperforms learning-based methods; and 2) TS-v1+CL w/ CS-asym
  init., which takes more time (it learns clipping bounds from the CS-asym initialization) but
  may deliver higher performance. However, the first best practice offers higher accuracy than
  the second in some 3-bit cases, e.g., w3a16g128 LLaMA-2-70B in Table 11 and w3a16g128
  LLaMA-2-13B in Table 10. A sketch of the clipping-search component of these recipes is
  given after this list.
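As referenced above, the sketch below illustrates the clipping-search idea (CS-asym) on grouped
weights: a grid search over shrunken clipping ranges that minimizes reconstruction error. The
transformation search (TS-v1) that would precede it and the learned-clipping variant (CL) that
would refine the result are omitted; the grid size, shrink range, and shapes are illustrative and
not the exact LLMC settings.

```python
import torch

def fake_quant_asym(x, n_bits, xmin, xmax):
    # Asymmetric uniform fake quantization of x over the given clipping range.
    qmax = 2 ** n_bits - 1
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero = torch.round(-xmin / scale)
    q = torch.clamp(torch.round(x / scale) + zero, 0, qmax)
    return (q - zero) * scale

def clip_search_asym(w_groups, n_bits=3, grid=20, max_shrink=0.5):
    # CS-asym-style grid search: shrink the (min, max) clipping range of every
    # weight group and keep the candidate with the lowest reconstruction MSE.
    wmin = w_groups.amin(dim=-1, keepdim=True)
    wmax = w_groups.amax(dim=-1, keepdim=True)
    best_err = torch.full_like(wmin, float("inf"))
    best_min, best_max = wmin.clone(), wmax.clone()
    for i in range(grid):
        ratio = 1.0 - max_shrink * i / grid
        cmin, cmax = wmin * ratio, wmax * ratio
        w_q = fake_quant_asym(torch.clamp(w_groups, cmin, cmax), n_bits, cmin, cmax)
        err = (w_q - w_groups).pow(2).mean(dim=-1, keepdim=True)
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_min = torch.where(better, cmin, best_min)
        best_max = torch.where(better, cmax, best_max)
    return best_min, best_max

# w3a16g128-style grouping: [out_features, n_groups, group_size]
w = torch.randn(4096, 4096).reshape(4096, -1, 128)
cmin, cmax = clip_search_asym(w, n_bits=3)
w_q = fake_quant_asym(torch.clamp(w, cmin, cmax), 3, cmin, cmax)
```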

Weight-activation. Following the conclusion of the weight-only study, we directly apply CS-asym.
When quantizing activations, learning-based transformation helps considerably (Shao et al., 2023).
Therefore, we explore different starting points for learning; our conclusions, drawn from Table 12,
Table 13, and Table 14, are as follows:


#Bits | Method | PPL ↓: WikiText2, C4, Avg. | Accuracy (%) ↑: MMLU*, ARC-e*, BoolQ*, HellaSwag*, PIQA*, Avg.

Full Prec. - 5.47 7.26 6.37 46.82 62.31 76.33 72.40 79.00 67.37
TR 6.69 9.11 7.90 41.03 47.24 73.00 68.00 78.50 61.55
TS-v1 6.46 8.59 7.53 38.20 53.77 69.33 66.80 76.00 60.82
TS-v2 6.60 8.88 7.74 39.84 45.23 65.67 68.80 77.00 59.31
TL 6.42 8.62 7.52 39.13 49.25 71.00 67.60 77.00 60.80
CM 6.66 8.98 7.82 39.42 44.22 68.00 68.00 79.00 59.73
CS-sym 6.28 8.38 7.33 44.22 41.21 76.67 68.00 77.00 61.42
w3a16g128 CS-asym 6.22 8.32 7.27 41.87 43.22 73.67 69.60 77.00 61.07
CL 6.21 8.32 7.27 44.86 52.26 71.33 69.20 77.00 62.93
RH 6.33 8.53 7.43 41.48 50.75 72.00 69.60 77.50 62.27
TS-v1+RH 6.31 8.51 7.41 38.69 51.26 69.67 69.60 78.00 61.44
TS-v1+CS-sym 6.22 8.28 7.25 22.83 24.62 58.33 22.40 55.50 36.74
TS-v1+CS-asym 6.18 8.24 7.21 28.86 24.62 65.67 59.60 68.00 49.33
TS-v1+CL w/ CS-asym init. 5.94 8.10 7.02 42.26 47.74 74.00 70.40 79.50 62.78
TR 178.47 119.11 148.79 23.70 27.64 46.67 40.00 58.00 39.20
TS-v1 34.68 34.47 34.58 24.11 29.15 58.33 44.40 56.00 42.40
TS-v2 95.22 82.29 88.76 23.04 31.66 43.00 42.40 61.50 40.32
TL 39.94 47.66 43.80 25.28 29.65 50.33 44.00 58.50 41.55
CM 421.33 559.34 490.34 NaN 30.15 42.67 41.20 60.50 NaN
CS-sym 2.18e5 1.65e5 1.91e5 22.83 24.12 58.33 25.60 55.50 37.28
w2a16g64 CS-asym 13.61 18.97 16.29 28.18 22.61 59.67 56.40 68.00 46.97
CL 11.55 13.96 12.76 27.81 28.64 60.67 55.60 69.50 48.44
RH 15.85 19.90 17.88 28.70 28.14 53.33 51.60 68.00 45.96
TS-v1+RH 15.75 19.77 17.76 26.72 31.66 53.00 50.00 60.50 44.38
TS-v1+CS-sym 2.09e5 1.59e5 1.84e5 22.83 24.62 58.33 22.40 55.50 36.74
TS-v1+CS-asym 11.69 14.83 13.26 28.86 24.62 65.67 59.60 68.00 49.33
TS-v1+CL w/ CS-asym init. 8.66 12.30 10.48 31.79 31.16 64.67 57.60 69.50 50.94

Table 9: Ablation results of LLaMA-2-7B weight-only quantization. * means the subset of the corresponding dataset.


#Bits | Method | PPL ↓: WikiText2, C4, Avg. | Accuracy (%) ↑: MMLU*, ARC-e*, BoolQ*, HellaSwag*, PIQA*, Avg.

Full Prec. - 4.88 6.73 5.81 56.69 74.37 82.33 75.60 81.00 74.00
TR 5.56 7.63 6.60 49.84 63.82 78.67 72.00 79.00 68.67
TS-v1 5.43 7.42 6.42 56.11 65.83 77.00 74.00 79.50 70.49
TS-v2 5.47 7.51 6.49 50.63 65.33 76.33 73.20 76.50 68.40
TL 5.44 7.48 6.46 52.36 65.33 73.67 73.60 78.50 68.69
CM 5.52 7.58 6.55 50.48 64.32 78.00 74.00 77.50 68.86
CS-sym 5.36 7.37 6.37 53.29 70.35 78.67 72.00 78.00 70.46
w3a16g128 CS-asym 5.35 7.34 6.34 54.40 70.35 80.33 74.00 80.50 71.92
CL 5.42 7.40 6.41 52.48 59.80 75.00 74.80 77.50 67.92
RH 5.55 7.67 6.61 53.34 69.85 77.67 73.20 79.00 70.61
TS-v1+RH 5.41 7.45 6.43 53.37 61.31 76.33 74.00 78.00 68.60
TS-v1+CS-sym 5.32 7.30 6.31 55.56 67.84 79.33 75.20 80.00 71.59
TS-v1+CS-asym 5.30 7.28 6.29 55.33 72.36 81.00 74.40 80.50 72.72
TS-v1+CL w/ CS-asym init. 5.23 7.28 6.26 53.95 70.85 79.00 74.00 80.50 71.66
TR 16.39 19.39 17.89 25.83 34.17 57.67 54.80 61.00 46.69
TS-v1 12.30 15.45 13.88 26.63 34.67 55.00 58.80 70.00 49.02
TS-v2 14.36 17.05 15.71 25.50 32.66 43.33 53.20 64.50 43.84
TL 12.39 15.76 14.08 26.24 33.67 52.33 55.20 69.00 47.29
CM 26.22 30.46 28.43 23.89 31.66 51.00 36.40 53.00 39.19
CS-sym 1.25e5 9.73e4 1.11e5 28.10 17.09 58.33 26.00 47.00 35.30
w2a16g64 CS-asym 8.96 12.52 10.74 32.65 33.67 63.00 60.40 72.00 52.34
CL 8.40 11.02 9.71 32.45 34.67 64.00 64.00 71.50 53.32
RH 9.51 12.61 11.06 31.77 31.16 66.00 60.00 73.00 52.39
TS-v1+RH 9.81 12.99 11.40 35.01 31.66 60.00 56.80 70.50 50.79
TS-v1+CS-sym 1.22e5 1.22e5 1.22e5 28.10 17.09 58.33 26.00 47.00 35.30
TS-v1+CS-asym 7.88 10.84 9.36 39.57 43.72 71.00 64.40 74.50 58.64
TS-v1+CL w/ CS-asym init. 6.97 10.01 8.49 41.53 42.21 67.67 68.80 76.50 59.34

Table 10: Ablation results of LLaMA-2-13B weight-only quantization. * means the subset of
the corresponding dataset.

Method | PPL ↓: WikiText2, C4, Avg. | Accuracy (%) ↑: MMLU*, ARC-e*, BoolQ*, HellaSwag*, PIQA*, Avg.

Full Prec. 3.32 5.71 4.52 70.91 88.44 83.33 80.00 83.50 81.24
TR 3.95 6.24 5.10 65.40 86.93 82.00 80.40 82.50 79.45
TS-v1 3.85 6.12 4.99 68.14 88.44 83.00 79.20 83.00 80.36
TS-v2 3.93 6.21 5.07 66.62 87.94 82.33 79.20 83.50 79.92
CM 3.98 6.27 5.13 64.90 88.94 79.33 78.00 83.00 78.83
CS-sym 3.85 6.13 4.99 66.34 89.95 82.00 79.20 83.00 80.10
CS-asym 3.84 6.14 4.99 66.03 91.46 83.00 79.20 82.50 80.44
CL 3.81 6.09 4.95 68.90 86.93 81.00 80.00 84.00 80.17
RH 3.93 6.17 5.05 68.01 87.94 84.00 79.20 82.00 80.23
TS-v1+RH 3.95 6.18 5.07 65.66 86.93 83.67 79.20 81.50 79.39
TS-v1+CS-sym 3.75 6.05 4.90 68.38 85.93 83.67 82.40 84.00 80.88
TS-v1+CS-asym 3.74 6.04 4.89 68.21 89.45 84.00 81.60 82.50 81.15
TS-v1+CL w/ CS-asym init. 3.74 6.04 4.89 68.86 88.44 85.00 80.00 81.00 80.66

Table 11: Ablation results of LLaMA-2-70B weight-only (w3a16g128) quantization. * means the subset of the corresponding dataset.


• We again find that TS-v1 suffers smaller accuracy drops than the other learning-free transfor-
  mation methods. In the w4a4 experiments in Table 12, we try different starting points for TL
  and observe that initializing from TS-v1 helps attain satisfactory model precision.
• We also identify two best practices, aligned with those for weight-only quantization:
  1) TS-v1+CS-asym and 2) TL w/ TS-v1 init.+CL w/ CS-asym init.

PPL ↓
#Bits Method
WikiText2 C4 Avg.

Full Prec. - 5.47 7.26 6.37


CM 6.08 8.07 7.07
TR 6.29 8.08 7.18
TS-v1 5.86 7.71 6.79
w6a6 TS-v2 5.84 7.76 6.80
TL+CL 5.97 8.21 7.09
TS-v1+CS-asym 5.78 7.72 6.75
TL w/ TS-v1 init. +CL w/ CS-asym init. 5.77 7.68 6.72
CM 6.19 8.29 7.24
TR 6.41 8.56 7.49
TS-v1 6.22 8.18 7.16
w4a8 TS-v2 6.29 8.35 7.32
TL+CL 5.97 7.93 6.95
TS-v1+CS-asym 5.89 7.78 6.83
TL w/ TS-v1 init. +CL w/ CS-asym init. 5.85 7.74 6.80
CM 409.53 433.34 421.44
TR 51.21 61.52 56.37
TS-v1 15.43 20.01 17.72
TS-v2 49.41 104.95 77.18
TL w/ TR init. 17.11 21.09 19.10
w4a4
TL w/ ones init. 37.13 45.94 41.54
TL w/ TS-v1 init. 15.41 19.80 17.61
TL+CL 13.79 19.12 16.46
TS-v1+CS-asym 15.66 21.67 18.67
TL w/ TS-v1 init.+CL w/ CS-asym init. 13.32 18.65 15.99

Table 12: Ablation results of LLaMA-2-7B weight-activation quantization. Ones init. means we
set all elements of s to one before learning.

In summary, we have found four best practices: two for weight-only quantization and two for
weight-activation quantization. We suggest that users adopt the learning-free best practices for
relatively high-bit quantization and the learning-based best practices otherwise, as sketched below.
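As a rough guide, the choice among the four recipes can be expressed as a small dispatch rule.
The bit-width thresholds below are illustrative interpretations of "relatively higher-bit" and are
not exact boundaries taken from our experiments.

```python
def pick_best_practice(w_bits, a_bits=16):
    # Illustrative thresholds only: we treat >=3-bit weight-only and >=w4a8
    # weight-activation settings as "relatively higher-bit".
    if a_bits >= 16:  # weight-only quantization
        if w_bits >= 3:
            return "TS-v1 + CS-asym"                      # learning-free recipe
        return "TS-v1 + CL w/ CS-asym init."              # learning-based recipe
    if w_bits >= 4 and a_bits >= 8:  # weight-activation quantization
        return "TS-v1 + CS-asym"                          # learning-free recipe
    return "TL w/ TS-v1 init. + CL w/ CS-asym init."      # learning-based recipe

print(pick_best_practice(4))      # w4a16 -> learning-free
print(pick_best_practice(4, 4))   # w4a4  -> learning-based
```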

A.3 More Experiments Validating the Effectiveness of Best Practice

In this section, we present additional experimental results to further validate the effec-
tiveness of our proposed best practices for model quantization. Specifically, we focus on
weight-only and weight-activation quantization for the LLaMA-2 model across various sizes
(7B, 13B, and 70B). The following tables summarize the main results of these experiments,
demonstrating the effectiveness of our best practices in mitigating the impact of quantization
on model performance across different configurations (Table 15, Table 16, Table 17, Table 18,
Table 19, Table 20).

Weight-only. Our weight-only quantization experiments, shown in Table 15, Table 16, and
Table 17, provide compelling evidence that our best practices largely preserve model performance,
even under low-bit settings, and achieve SOTA results. For instance, in the w3a16g128 setting,
the 7B model (Table 15) maintains an

PPL ↓
#Bits Method
WikiText2 C4 Avg.

Full Prec. - 4.88 6.73 5.81


CM 5.32 7.22 6.27
TR 5.16 7.07 6.12
TS-v1 5.12 7.02 6.07
w6a6 TS-v2 5.13 7.03 6.08
TL+CL 5.11 7.02 6.07
TS-v1+CS-asym 5.11 7.00 6.06
TL w/ TS-v1+CL w/ CS-asym init. 5.07 6.97 6.02
CM 5.24 7.19 6.22
TR 6.41 8.56 7.49
TS-v1 5.22 7.14 6.18
w4a8 TS-v2 5.24 7.17 6.21
TL+CL 5.78 7.88 6.83
TS-v1+CS-asym 5.09 6.99 6.04
TL w/ TS-v1+CL w/ CS-asym init. 5.06 6.97 6.02
CM 598.97 687.75 643.36
TR 22.86 30.50 26.68
TS-v1 16.59 18.96 17.78
w4a4 TS-v2 36.70 55.86 46.28
TL+CL 12.27 18.38 15.32
TS-v1+CS-asym 15.29 20.96 18.13
TL w/ TS-v1 init. +CL w/ CS-asym init. 10.24 13.75 12.01

Table 13: Ablation results of LLaMA-2-13B weight-activation quantization.

PPL ↓
#Bits Method
WikiText2 C4 Avg.

Full Prec. - 3.32 5.71 4.52

CM 4.67 7.62 6.15


TR 3.66 6.06 4.86
TS-v1 3.63 6.04 4.84
w6a6 TS-v2 3.64 6.04 4.84
TL+CL 3.73 6.13 4.93
TS-v1+CS-asym 3.65 6.05 4.85
TL w/ TS-v1 init. +CL w/ CS-asym init. 3.63 6.02 4.82
CM 3.76 6.07 4.91
TR 3.81 6.14 4.98
TS-v1 3.68 5.99 4.84
w4a8 TS-v2 3.72 6.04 4.88
TL+CL 7.73 11.30 9.52
TS-v1+CS-asym 3.53 5.89 4.71
TL w/ TS-v1 init. +CL w/ CS-asym init. 3.57 5.92 4.75
CM NaN NaN NaN
TR 22.37 37.10 29.74
TS-v1 15.10 21.48 18.29
w4a4 TS-v2 58.32 72.73 65.53
TL+CL 308.03 241.52 274.78
TL w/ TS-v1 init. +CL w/ CS-asym init. 14.22 20.27 17.24

Table 14: Ablation results of LLaMA-2-70B weight-activation quantization.


average accuracy decrement of only 4.13% compared to the full-precision model. Similarly,
the 13B model (Table 16) exhibits an average accuracy reduction of 2.6%. Intriguingly, the
70B model (Table 17) demonstrates the most striking resilience with a mere average accuracy
decline of 0.07%, suggesting that our best practices are particularly effective at scale.
These results indicate the robustness of our quantization strategies and underscore the
potential for their application in larger, more complex models. By enabling more efficient
deployment without substantial loss in performance, our best practices for weight-only
quantization facilitate wider accessibility and applicability of large-scale language models.

Weight-activation. Advancing from weight-only to weight-activation quantization, our experiments
(Table 18, Table 19, and Table 20) provide a more nuanced picture of quantization effects; here
both the weights and activations of the LLaMA-2 models are quantized.
A comparative analysis shows that weight-activation quantization, while generally inducing a
larger performance drop than weight-only quantization, can still maintain commendable model
accuracy when our best practices are employed. For instance, in the w6a6 setting, the accuracy
degradation of the 7B model under weight-activation quantization (Table 18) is contained to an
average of 7.07% below full precision. For the 13B model (Table 19), the average performance
drop is 10.87%, and the 70B model (Table 20) shows a decline of 10.98%.
Accuracy (%) ↑
#Bits Method
MMLU ARC-e BoolQ HellaSwag PIQA Avg.

Full Prec. - 46.51 58.20 74.98 73.95 77.75 66.28

Naive 38.42 44.09 68.32 70.49 75.73 59.41


GPTQ 43.43 47.44 72.42 71.01 76.22 62.10
AWQ 39.94 49.56 72.78 70.86 76.61 61.95
w3a16g128
OmniQuant 42.22 48.15 71.41 70.22 76.06 61.61
TS-v1+CS-asym 42.33 47.09 71.44 70.93 76.17 61.59
TS-v1+CL w/ CS-asym init. 41.83 47.62 73.55 70.6 77.15 62.15

Naive NaN 26.98 45.93 38.44 58.54 NaN


GPTQ 25.85 28.22 55.38 51.45 67.30 45.64
AWQ 25.38 24.87 62.17 24.83 51.20 37.69
w2a16g64
OmniQuant 27.04 29.10 63.88 53.97 69.48 48.69
TS-v1+CS-asym 27.40 25.40 63.27 57.40 70.40 48.77
TS-v1+CL w/ CS-asym init. 31.02 31.04 68.17 57.72 71.06 51.80

Table 15: Main results of LLaMA-2-7B weight-only quantization.

A.4 KV Cache Quantization

This part shows the accuracy of KV cache quantization for code generation tasks. From Table 21,
we find that int8 and int4 KV cache quantization bring almost no accuracy degradation on both
the Human-Eval and MBPP datasets. This conclusion is consistent with the long-context results
in the main text, further demonstrating that a 4-bit KV cache can be adopted without harming
performance. However, a 2-bit KV cache causes generation to collapse and thus should not be
adopted.
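A minimal sketch of the kind of fake KV cache quantization used in such experiments is given
below; the per-token, per-head granularity and the tensor shapes are illustrative assumptions
rather than the exact LLMC configuration.

```python
import torch

def fake_quant_kv(kv, n_bits=4):
    # Asymmetric fake quantization of a cached K or V tensor of shape
    # [batch, n_heads, seq_len, head_dim], with one (scale, zero-point)
    # pair per token and per head (granularity chosen for illustration).
    qmax = 2 ** n_bits - 1
    kv_min = kv.amin(dim=-1, keepdim=True)
    kv_max = kv.amax(dim=-1, keepdim=True)
    scale = (kv_max - kv_min).clamp(min=1e-8) / qmax
    zero = torch.round(-kv_min / scale)
    q = torch.clamp(torch.round(kv / scale) + zero, 0, qmax)
    return (q - zero) * scale

k = torch.randn(1, 32, 4096, 128)       # cached keys of one layer
k_int8 = fake_quant_kv(k, n_bits=8)     # near-lossless in our experiments
k_int4 = fake_quant_kv(k, n_bits=4)     # still near-lossless
k_int2 = fake_quant_kv(k, n_bits=2)     # collapses generation; avoid
```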

A.5 Inference Speed

Figure 6 and Figure 7 further illustrate the speedup brought by various quantization schemes at
1K and 4K input context lengths. The conclusion is the same as for the 32K input context length
in the main text: the w8a8 setting significantly improves the prefill speed, while weight-only
quantization helps the decoding speed. The int8 KV cache quantization does not affect the speed
much but helps a lot in reducing memory


Accuracy (%) ↑
#Bits Method
MMLU ARC-e BoolQ HellaSwag PIQA Avg.

Full Prec. - 55.49 75.13 82.42 77.31 79.92 74.05

Naive 50.82 65.26 77.40 74.53 78.18 69.24


GPTQ 53.09 70.19 79.20 74.71 79.33 71.30
AWQ 52.13 68.43 80.64 74.99 78.29 70.90
w3a16g128
OmniQuant 50.16 63.32 78.38 74.62 78.62 69.02
TS-v1+CS-asym 51.41 72.84 80.34 75.09 78.62 71.66
TS-v1+CL w/ CS-asym init. 52.09 70.37 79.94 74.68 79.11 71.24

Naive 25.76 27.87 56.30 33.32 56.09 39.87


GPTQ 32.57 32.28 64.92 59.88 70.62 52.05
AWQ 27.04 20.81 62.17 24.09 52.12 37.25
w2a16g64
OmniQuant 30.00 31.57 70.95 62.81 71.82 53.43
TS-v1+CS-asym 36.88 47.09 67.83 65.16 74.37 58.27
TS-v1+CL w/ CS-asym init. 40.58 44.8 71.10 65.29 74.65 59.28

Table 16: Main results of LLaMA-2-13B weight-only quantization.

Accuracy (%) ↑
Method
MMLU ARC-e BoolQ HellaSwag PIQA Avg.

Full Prec. 70.91 88.44 83.33 80.00 83.50 81.24

Naive 65.27 87.83 83.94 79.33 80.58 79.39


GPTQ 67.52 88.18 85.11 80.28 80.63 80.34
AWQ 67.54 87.65 86.57 81.11 81.88 80.95
OmniQuant 67.24 88.36 84.74 80.88 81.77 80.60
TS-v1+CS-asym 67.07 89.95 86.30 80.95 81.07 81.07
TS-v1+CL w/ CS-asym init. 67.78 89.42 87.09 80.91 81.18 81.28

Table 17: Main results of LLaMA-2-70B (w3a16g128) weight-only quantization.

consumption at long context lengths. Figure 8 shows the speedup on the Drive Orin edge GPU. It
can be seen that weight-only quantization also helps the prefill in this setting, which differs from
cloud GPUs.

A.6 Mix-precision Quantization

In this section, we present detailed analyses of mixed-precision quantization, which can be
broadly classified into magnitude-based (Ashkboos et al., 2023; Dettmers et al., 2022; Kim
et al., 2024) and Hessian-based (Dettmers et al., 2023; Lee et al., 2023) mixed precision for LLM
quantization. In weight-only quantization, previous methods mainly consider the latter type,
since Hessian information helps capture the sensitivity of weights to quantization. In
weight-activation quantization, recent studies only utilize the former type, because it allocates
bits across a model efficiently. Moreover, the Taylor expansion underlying the Hessian
approximation may no longer be suitable for estimating the impact of quantization when both
weights and activations are quantized, since this introduces considerably larger quantization
error than weight-only quantization.

• Weight-only. As shown in Table 22, we find that using Hessian Disturb. (Dettmers et al., 2023)
  as the bit-allocation metric outperforms Hessian Diag. (Lee et al., 2023) in all configurations.
  Notably, to keep the comparison fair, we employ naive quantization as the baseline, the same
  as in the following weight-activation experiments, and we keep the bit-allocation granularity
  identical across metrics. We choose column-wise mixed precision, since it enables speedup
  during hardware inference. Based on this conclusion,


Accuracy(%) ↑
#Bits Method
MMLU ARC-e BoolQ HellaSwag PIQA Avg.

Full Prec. - 46.55 58.38 74.86 73.91 77.69 66.28


Naive 42.30 36.16 58.9 71.00 75.84 56.84
SmoothQuant 45.66 41.09 58.93 71.96 77.04 58.94
OS+ 44.11 42.50 60.24 71.80 76.39 59.01
w6a6
OmniQuant 44.08 38.45 61.28 72.77 77.31 58.78
TS-v1+CS-asym 45.60 41.27 60.49 72.24 77.20 59.36
TL w/ TS-v1+CL w/ CS-asym init. 45.26 42.5 59.76 71.84 76.71 59.21
Naive 40.00 46.74 62.32 71.23 75.79 59.22
SmoothQuant 41.02 40.56 61.90 70.58 76.17 58.05
OS+ 40.29 46.38 61.38 70.46 75.35 58.77
w4a8
OmniQuant 44.24 41.09 68.84 71.22 76.39 60.36
TS-v1+CS-asym 45.36 50.09 68.26 71.91 76.71 62.47
TL w/ TS-v1+CL w/ CS-asym init. 44.60 44.44 69.11 71.51 76.55 61.24
Naive 24.16 24.34 50.7 30.31 54.03 36.71
SmoothQuant 26.33 25.57 50.52 48.68 58.98 42.02
OS+ 26.04 25.93 51.16 49.28 62.35 42.95
w4a4
OmniQuant 25.97 28.04 53.85 54.65 63.06 45.11
TS-v1+CS-asym 26.84 21.87 50.89 54.43 64.04 43.61
TL w/ TS-v1+CL w/ CS-asym init. 28.76 23.63 52.54 58.10 67.03 46.01

Table 18: Main results of LLaMA-2-7B weight-activation quantization.

Accuracy(%) ↑
#Bits Method
MMLU ARC-e BoolQ HellaSwag PIQA Avg.

Full Prec. - 55.82 75.13 82.39 77.32 79.92 74.12


Naive 34.20 56.26 58.87 75.67 77.91 60.58
SmoothQuant 36.37 57.50 61.77 75.74 78.07 61.89
OS+ 36.13 61.73 61.77 75.88 78.24 62.75
w6a6
OmniQuant 38.25 58.38 62.51 76.24 78.89 62.85
TS-v1+CS-asym 37.40 60.32 61.47 75.93 78.29 62.68
TL w/ TS-v1+CL w/ CS-asym init. 37.37 60.67 62.57 76.47 79.16 63.25
Naive 45.21 65.08 73.58 75.42 79.00 67.66
SmoothQuant 41.02 61.38 71.38 73.33 77.37 64.90
OS+ 45.14 66.31 73.06 75.14 79.11 67.75
w4a8
OmniQuant 35.56 42.68 64.50 73.52 76.39 58.53
TS-v1+CS-asym 46.23 67.02 71.65 76.20 78.62 67.94
TL w/ TS-v1+CL w/ CS-asym init. 47.70 67.9 73.98 75.95 79.27 68.96
Naive 25.51 27.69 47.98 25.95 50.44 35.51
SmoothQuant 24.51 26.46 51.25 42.16 54.13 39.70
OS+ 24.55 25.93 51.07 42.17 57.62 40.27
w4a4
OmniQuant 25.51 27.69 51.71 59.59 65.34 45.97
TS-v1+CS-asym 24.89 25.40 51.04 54.66 59.09 43.02
TL w/ TS-v1+CL w/ CS-asym init. 25.74 26.81 52.63 60.98 65.72 46.38

Table 19: Main results of LLaMA-2-13B weight-activation quantization.


Accuracy(%) ↑
#Bits Method
MMLU ARC-e BoolQ HellaSwag PIQA Avg.

Full Prec. - 69.52 89.59 87.58 82.12 81.77 82.12


Naive 34.88 50.97 59.08 78.97 75.73 59.93
SmoothQuant 43.70 77.07 68.38 81.19 81.12 70.29
OS+ 45.50 75.13 67.86 80.86 80.25 69.92
w6a6
OmniQuant 40.65 64.37 63.76 79.95 77.97 65.34
TS-v1+CS-asym 45.14 79.19 67.52 80.82 80.63 70.66
TL w/ TS-v1+CL w/ CS-asym init. 47.05 80.25 66.88 80.93 80.58 71.14
Naive 45.21 65.08 73.58 75.42 79.00 67.66
SmoothQuant 58.78 86.95 78.38 79.51 80.52 76.83
OS+ 60.93 86.24 78.99 80.16 80.63 77.39
w4a8
OmniQuant 25.78 35.45 59.08 59.43 74.86 50.92
TS-v1+CS-asym 61.53 89.42 82.29 81.48 81.72 79.29
TL w/ TS-v1+CL w/ CS-asym init. 61.10 88.54 79.60 80.42 81.39 78.21
Naive NaN NaN NaN NaN NaN NaN
SmoothQuant 24.27 23.1 50.89 57.00 56.69 42.39
OS+ 25.28 24.87 49.97 45.98 51.47 39.51
w4a4
OmniQuant 25.12 25.04 49.36 29.34 53.10 36.39
TS-v1+CS-asym 25.44 25.04 50.4 47.87 56.75 41.10
TL w/ TS-v1+CL w/ CS-asym init. 25.26 23.63 51.10 51.34 57.83 41.83

Table 20: Main results of LLaMA-2-70B weight-activation quantization.

Pass@1 (%) ↑
Model KV Cache Prec.
Human-Eval MBPP Avg.

Full Prec. 12.80 22.00 17.40


int8 13.41 20.00 16.71
LLaMA-2-7B
int4 13.41 21.00 17.21
int2 0.00 0.00 0.00
w4a8kv4 12.20 18.40 15.30

Full Prec. 18.29 24.00 21.15


int8 17.68 23.00 20.34
LLaMA-2-13B
int4 17.68 23.00 20.34
int2 0.00 0.00 0.00
w4a8kv4 15.85 23.40 19.63

Full Prec. 29.27 42.00 35.64


int8 29.88 38.00 33.94
LLaMA-2-70B
int4 30.49 39.00 34.75
int2 0.00 0.00 0.00
w4a8kv4 29.27 38.20 33.74

Table 21: KV cache quantization results on LLAMA-2 series models.


Figure 6: Inference speed of 7B, 13B, and 70B LLAMA-2 models on NVIDIA A100 GPU.
(Input sequence length: 1024, Output sequence length: 512)

Figure 7: Inference speed of 7B, 13B, and 70B LLAMA-2 models on NVIDIA A100 GPU.
(Input sequence length: 4096, Output sequence length: 512)


[Figure 8: grouped bar chart of Prefill (ms), Decode (ms), and Throughput (Tokens/s) for LLaMA-2-7B on the Orin platform under Full Prec., w4a16, and w8a8.]
Figure 8: Throughput comparison of quantization on the edge GPU (Drive Orin). (Token/s)

  we conduct the mixture-rate experiments for the Hessian Disturb. column-wise method in
  Table 23. We find that for 3-bit quantization, performance at high (20%) and low (10%) mixture
  rates is very similar. However, for 2-bit quantization, a high (20%) mixture rate yields a
  significant accuracy improvement for a relatively small LLM (7B). More experiments combining
  fixed-precision algorithms with the mixed-precision strategy remain to be evaluated in the
  future.
• Weight-activation. In Table 24, we measure two kinds of magnitude-based mixed-precision
  strategies: Dynamic (Dettmers et al., 2022) and Static (Ashkboos et al., 2023). Although the
  former brings non-negligible inference overhead, it exhibits higher accuracy than Static.
  Considering our design principles, we try to allocate more columns in full precision in the
  Static strategy and keep the Down layers of the LLM in 8-bit, since they are more sensitive to
  quantization (Heo et al., 2023; Ashkboos et al., 2023). However, all of these variants remain
  inferior to the Dynamic method; therefore, new algorithms for inference-efficient static
  weight-activation mixed-precision quantization still need to be explored. A sketch of both
  bit-allocation strategies is given after this list.
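As referenced above, the sketch below illustrates column-wise mixed-precision bit allocation. The
two sensitivity scores are simplified stand-ins for the Hessian-based and magnitude-based metrics
of the cited works, and the calibration statistics are placeholders, not values from our experiments.

```python
import torch

def mixed_precision_columns(w, hessian_diag=None, act_range=None, mix_rate=0.2):
    # Rank the input columns of w ([out_features, in_features]) by a sensitivity
    # score and keep the top `mix_rate` fraction in high precision; the remaining
    # columns are quantized to the low target bit-width.
    if hessian_diag is not None:
        # Simplified Hessian-based score for weight-only mixed precision:
        # squared weights weighted by the (approximate) Hessian diagonal.
        score = (w.pow(2) * hessian_diag.unsqueeze(0)).sum(dim=0)
    else:
        # Simplified magnitude-based score for weight-activation mixed precision:
        # columns with the largest activation range (outlier columns) stay in
        # high precision.
        score = act_range
    k = int(mix_rate * w.shape[1])
    keep_high = torch.zeros(w.shape[1], dtype=torch.bool)
    keep_high[torch.topk(score, k).indices] = True
    return keep_high  # True -> keep this column in high precision / FP16

w = torch.randn(4096, 4096)
h_diag = torch.rand(4096)      # placeholder for a calibrated Hessian diagonal
mask = mixed_precision_columns(w, hessian_diag=h_diag, mix_rate=0.2)
```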


Model | #Bits | Metric | Granularity | WikiText2 | C4 | Avg. (PPL ↓)
LLaMA-2-7B | Full Prec. | - | - | 5.47 | 7.26 | 6.37
LLaMA-2-7B | w3a16g128 | - | - | 6.66 | 8.98 | 7.82
LLaMA-2-7B | w3a16g128 | Hessian Diag. | column | 5.99 | 7.97 | 6.98
LLaMA-2-7B | w3a16g128 | Hessian Disturb. | column | 5.92 | 7.88 | 6.90
LLaMA-2-7B | w3a16g128 | Hessian Disturb. | element | 5.92 | 7.85 | 6.89
LLaMA-2-7B | w2a16g64 | - | - | 421.33 | 559.34 | 490.34
LLaMA-2-7B | w2a16g64 | Hessian Diag. | column | 10.30 | 13.46 | 11.88
LLaMA-2-7B | w2a16g64 | Hessian Disturb. | column | 8.91 | 11.79 | 10.35
LLaMA-2-7B | w2a16g64 | Hessian Disturb. | element | 8.54 | 11.54 | 10.04
LLaMA-2-13B | Full Prec. | - | - | 4.88 | 6.73 | 5.81
LLaMA-2-13B | w3a16g128 | - | - | 5.52 | 7.58 | 6.55
LLaMA-2-13B | w3a16g128 | Hessian Diag. | column | 5.25 | 7.20 | 6.22
LLaMA-2-13B | w3a16g128 | Hessian Disturb. | column | 5.19 | 7.13 | 6.16
LLaMA-2-13B | w3a16g128 | Hessian Disturb. | element | 5.16 | 7.06 | 6.11
LLaMA-2-13B | w2a16g64 | - | - | 26.22 | 30.46 | 28.43
LLaMA-2-13B | w2a16g64 | Hessian Diag. | column | 7.71 | 10.51 | 9.11
LLaMA-2-13B | w2a16g64 | Hessian Disturb. | column | 7.21 | 9.90 | 8.55
LLaMA-2-13B | w2a16g64 | Hessian Disturb. | element | 6.65 | 9.17 | 7.91
LLaMA-2-70B | Full Prec. | - | - | 3.32 | 5.71 | 4.52
LLaMA-2-70B | w3a16g128 | - | - | 3.98 | 6.27 | 5.13
LLaMA-2-70B | w3a16g128 | Hessian Diag. | column | 3.68 | 5.97 | 4.83
LLaMA-2-70B | w3a16g128 | Hessian Disturb. | column | 3.63 | 5.94 | 4.79
LLaMA-2-70B | w3a16g128 | Hessian Disturb. | element | 3.63 | 5.94 | 4.79
LLaMA-2-70B | w2a16g64 | - | - | 10.32 | 15.16 | 12.74
LLaMA-2-70B | w2a16g64 | Hessian Diag. | column | 5.33 | 7.76 | 6.54
LLaMA-2-70B | w2a16g64 | Hessian Disturb. | column | 5.05 | 7.35 | 6.20
LLaMA-2-70B | w2a16g64 | Hessian Disturb. | element | 4.70 | 6.97 | 5.84

Table 22: Weight-only mixed-precision quantization results (20% mixture rate). Hessian Diag.
uses the magnitude of the diagonal elements of the Hessian matrix to determine the bit allocation
of the corresponding column. Hessian Disturb. utilizes the Hessian matrix, approximating the
disturbance influence coming from the quantized weights, to determine their quantization bit.

Model | #Bits | Mixture Rate (%) | WikiText2 | C4 | Avg. (PPL ↓)
LLaMA-2-7B | Full Prec. | 0 | 5.47 | 7.26 | 6.37
LLaMA-2-7B | w3a16g128 | 0 | 6.66 | 8.98 | 7.82
LLaMA-2-7B | w3a16g128 | 1 | 6.35 | 8.54 | 7.45
LLaMA-2-7B | w3a16g128 | 5 | 6.13 | 8.16 | 7.15
LLaMA-2-7B | w3a16g128 | 10 | 6.06 | 8.05 | 7.06
LLaMA-2-7B | w3a16g128 | 20 | 5.92 | 7.88 | 6.90
LLaMA-2-7B | w2a16g64 | 0 | 421.33 | 559.34 | 490.34
LLaMA-2-7B | w2a16g64 | 1 | 50.44 | NaN | NaN
LLaMA-2-7B | w2a16g64 | 5 | 13.00 | 16.52 | 14.76
LLaMA-2-7B | w2a16g64 | 10 | 10.94 | 14.15 | 12.55
LLaMA-2-7B | w2a16g64 | 20 | 8.91 | 11.79 | 10.35
LLaMA-2-13B | Full Prec. | 0 | 4.88 | 6.73 | 5.81
LLaMA-2-13B | w3a16g128 | 0 | 5.52 | 7.58 | 6.55
LLaMA-2-13B | w3a16g128 | 1 | 5.38 | 7.38 | 6.38
LLaMA-2-13B | w3a16g128 | 5 | 5.30 | 7.27 | 6.29
LLaMA-2-13B | w3a16g128 | 10 | 5.26 | 7.22 | 6.24
LLaMA-2-13B | w3a16g128 | 20 | 5.19 | 7.13 | 6.16
LLaMA-2-13B | w2a16g64 | 0 | 26.22 | 30.46 | 28.43
LLaMA-2-13B | w2a16g64 | 1 | 13.21 | 16.12 | 14.67
LLaMA-2-13B | w2a16g64 | 5 | 8.76 | 11.77 | 10.27
LLaMA-2-13B | w2a16g64 | 10 | 8.09 | 10.95 | 9.52
LLaMA-2-13B | w2a16g64 | 20 | 7.21 | 9.90 | 8.55
LLaMA-2-70B | Full Prec. | 0 | 3.32 | 5.71 | 4.52
LLaMA-2-70B | w3a16g128 | 0 | 3.98 | 6.27 | 5.13
LLaMA-2-70B | w3a16g128 | 1 | 3.79 | 6.07 | 4.93
LLaMA-2-70B | w3a16g128 | 5 | 3.73 | 6.02 | 4.88
LLaMA-2-70B | w3a16g128 | 10 | 3.70 | 5.99 | 4.85
LLaMA-2-70B | w3a16g128 | 20 | 3.63 | 5.94 | 4.79
LLaMA-2-70B | w2a16g64 | 0 | 10.32 | 15.16 | 12.74
LLaMA-2-70B | w2a16g64 | 1 | 6.76 | 9.56 | 8.16
LLaMA-2-70B | w2a16g64 | 5 | 5.73 | 8.25 | 6.99
LLaMA-2-70B | w2a16g64 | 10 | 5.46 | 7.91 | 6.69
LLaMA-2-70B | w2a16g64 | 20 | 5.05 | 7.35 | 6.20

Table 23: Mixture rate results for weight-only mixed-precision quantization. We employ the
Hessian Disturb. metric and column-wise granularity.


PPL ↓
Model #Bits Method
WikiText2 C4 Avg.

Full Prec. - 5.47 7.26 6.37


- 409.53 433.34 421.44
LLaMA-2-7B Dynamic-256 7.36 10.21 8.79
w4a4
Static-256 8.56 11.57 10.13
Static-256+down proj (int8) 8.14 10.92 9.53
Static-512+down proj (int8) 7.52 10.20 8.86

Full Prec. - 4.88 6.73 5.81


- 598.97 687.75 643.36
LLaMA-2-13B Dynamic-256 6.54 9.20 7.87
w4a4
Static-256 7.58 10.30 8.94
Static-256+down proj (int8) 7.32 10.03 8.68
Static-512+down proj (int8) 6.87 9.45 8.16

Full Prec. - 3.32 5.71 4.52


- NaN NaN NaN
LLaMA-2-70B Dynamic-256 5.33 8.26 6.80
w4a4
Static-256 6.83 9.83 8.33
Static-256+down proj (int8) 6.41 9.20 7.81
Static-512+down proj (int8) 5.85 8.49 7.17

Table 24: Weight-activation mixed-precision quantization results. Dynamic/Static-“x” means
allocating bits during inference (Dynamic) or during calibration (Static) and keeping “x” columns
in full precision. down proj (int8) means we keep the weights of the Down layer in FFN modules
in 8-bit integer format.

